RDF and Databases

Some RDF research led me to a nice paper (PDF) from IBM discussing RDF with relational databases. This combination could replace the half-baked application data mechanisms that crop up regularly in my consulting work: think nested directories of Windows INI files, and brittle binary files that break on minor design iterations. The pain, the pain.

Someone should describe RDF in 500 words or less as a generalization of INI. That note would spread understanding of RDF, which is simple but often described so abstractly that it seems complicated. It's better to start from the known and move to the unknown.

Here is a short attempt, just to spark interest. Experts may call me all wet. The Windows INI format uses "key-equals-value," with keys grouped into sections. Think of "key-equals-value" as a special case of RDF's "subject-predicate-object." RDF generalizes to any verb, not just "equals," along with superior grouping. While INI nests just one level down (via sections), RDF URIs handle arbitrary nesting (via slashes), and URIs also permit remote data. That is not to say RDF data must be tree-structured. Most RDF papers focus too much on XML. XML is merely one expression syntax; there are several others, and a relational database will store RDF data in its own way, completely independent of XML.
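
To make the analogy concrete, here is a minimal sketch, with invented table and URI names, of how INI-style facts could sit in a plain three-column table (the same shape a triple store uses):

-- A hypothetical INI fragment:
--   [display]
--   width=800
--   height=600
-- stored as subject-predicate-object rows in one generic table.
CREATE TABLE triple (
  subject   VARCHAR(255) NOT NULL,  -- URI; slashes give arbitrary nesting
  predicate VARCHAR(255) NOT NULL,  -- the verb; INI only ever has "equals"
  object    VARCHAR(255) NOT NULL   -- a literal value or another subject's URI
);

INSERT INTO triple VALUES ('app://config/display', 'width',  '800');
INSERT INTO triple VALUES ('app://config/display', 'height', '600');
-- Nothing limits the verb to "equals":
INSERT INTO triple VALUES ('app://config/display', 'replaces', 'app://config/display/v1');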

There are several projects in this domain. My favorite so far is OpenRDF Sesame. It supports querying at the semantic level. It seems more mature than the others, having grown out of previous efforts, and works with PostgreSQL and MySQL as well as Oracle. An abstraction layer called SAIL makes Sesame database-agnostic. Sesame even sports a stand-alone B-tree system, or in-memory operation, if you don't want an external database. I like PostgreSQL much better than MySQL for its loose BSD license and technical merits. Apropos of that, another bit of news is that PostgreSQL now runs natively on Windows. (The PostgreSQL client has always worked natively as a DLL.) The PostgreSQL speed issues mentioned in the Sesame papers have since improved. As for Sesame, the only drawback is Java. But since Sesame interfaces over TCP through Java servlets, that's a don't-care.

On a related note, I looked into Python-based Chandler. The story there is that it's a custom job because, says Andi Vajda, "When I started working on this project in May, the repository was late, very late, and the project was stalled because of that. I felt I could get something usable for the project to resume much faster if I started a data model implementation from scratch and persisted it using Sleepycat's Berkeley dbxml and Berkeley db. Today, the Chandler repository is not really so much an object database as an item XML database combined with large collections of references directly stored in Berkeley DB." Hmm... project behind, so build from scratch? I'm not clear why Chandler didn't go with RDF, but it sounds like project management problems. It seems as though RDF would support everything Chandler wants to do without the constrictions of XML. Note that Sesame has Python bindings.


"Nice paper"?

I don't think that that paper is all that nice. The way I see it, the "problems" they're addressing are symptoms of fundamental errors in database design. For example, quoting from the paper:
Large number of columns: The current database systems do not permit a large number of columns in a table. This limit is 1012 columns in DB2 (also in Oracle), whereas we had nearly 5000 attributes across different categories.

Sparsity: Even if DB2 were to allow the desired number of columns, we would have had nulls in most of the fields. In addition to creating storage overhead, nulls increase the size of the index and they sort high in the DB2 B+ tree index.

If you have 5000 columns in a table, and most of these are irrelevant for any given record, you've simply allowed a completely unacceptable data model to come into being, either through negligence or incompetence. There are much better ways to do this.

One way is to split bundles of related columns into separate tables, give each table an OID column to be joined on, and keep metadata about which OIDs have information in which tables. Essentially, this way you give yourself a multiple typing system, where OIDs correspond to objects, tables correspond to types, and the metadata represents which types a given object has. To add a new type and its corresponding properties, you add a table.
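
A minimal sketch of that layout, with names invented for the paper's electronics domain:

-- Metadata: which OIDs have rows in which attribute tables ("types").
CREATE TABLE object_type (
  oid       INTEGER     NOT NULL,
  type_name VARCHAR(64) NOT NULL,
  PRIMARY KEY (oid, type_name)
);

-- One bundle of related columns per table; the OID is the join key.
CREATE TABLE resistor (
  oid       INTEGER PRIMARY KEY,
  ohms      NUMERIC,
  tolerance NUMERIC
);

CREATE TABLE capacitor (
  oid       INTEGER PRIMARY KEY,
  farads    NUMERIC,
  voltage   NUMERIC
);

-- To learn everything about object 42, first ask the metadata which tables to visit.
SELECT type_name FROM object_type WHERE oid = 42;

Adding a new type and its properties is then a CREATE TABLE plus rows in object_type, rather than yet more columns on one giant table.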

Even if I had inherited a database like that, I would either (a) not grow it at all (and essentially start from scratch), or (b) bite the bullet and figure out what the ontology of the problem domain is (on the logic that the worst time to do this unenviable task is "later").

Fast reader?

Your remarks chase mine by two hours and indicate a lack of attentive reading. The 5000 column database is what the authors decry. Alternatives such as you propose are considered, but the focus is the "vertical" table solution. The conclusion offers a wish list for better DB capability.

The paper is an interesting study that relates to RDF triplet storage in relational databases.

Note SeRQL in Sesame.

I took that into account.

The authors offer a 3-column physical encoding for what logically remains a 5,000 column table. I submit that accepting a 5,000 column table of any sort, physical or logical, is an unacceptable solution that betrays a failure to analyze the ontology of the modeling domain. And this is only compounded if the table is very sparse.

I did read the article. They dwell on the sort of alternative I propose for all of one paragraph; they allude to the physical properties of the representation I have in mind (separate tables for common attributes), not the logical ones (use of a simple type system to organize the ontology of the problem domain). There is no serious discussion of the maintainability of the data model they propose beyond pointing out that it's trivial to add new logical columns, whether it's 5 or... yet another 5,000. And I still think that "adding a column" is far too low-level an operation; if you have a data model with 5,000 properties, you really need to be thinking at the level of what entities you have, how they are typed, what relations they stand in to each other, how you modify a type, how you add a new type, how you split up or combine existing types, and so on; not about what columns you have.

With all the emphasis on performance comparisons and the corresponding lack of attention to maintainability concerns such as above, I can't help but see this paper in the same light as I'd see a paper advocating the great performance gains that we'd reap if we all just went back to using GOTO, instead of all this "structured programming" stuff. Sure, perhaps our programs would be twice as fast... but they'd also be a mess.

Acid cynicism

What, you want a book? The sparse paragraphs give context, not coverage. This is a research paper with a defined scope. It studies one approach out of many. Lighten up! Don't just carp there, post a better research link on the home page!

Electronics is a huge industry but still, your numerical shock is misplaced. The figure of 5,000 arises from part attributes spread across 2,000 parts categories. On average that is merely 2.5 unique attributes per category. The problem domain sets these numbers, not the DB design. The DB design has to cope with them, though. Oh, I forgot to mention that they change frequently!

Some might consider it a nice trick to get a logical 5,000-column view (sparseness and all) which makes SQL happy, while keeping a rational, flexible storage format (RDF style). The authors clearly do not hail this design as the terminus of all future DB development and evolution. It's simply a design point tested with current tools.

There are dedicated RDF databases ("triple stores"). I'm interested to hear what people have to say about them, and also their kin like Sesame, which layer on traditional DBs.

Binary relations

One of the problems addressed by the paper is how to manage "sparsely populated" data.

Suppose I know that John has brown hair, eats sushi and drives a Lexus; that Jim eats candy bars, suffers from migraines, and collects miniature porcelain figurines; that Jack has blond hair, suffers from night sweats and has an annual salary of £30,000. These are all the facts I possess about John, Jim and Jack.

One way I can represent these facts is with a set of binary relations, as follows:

Person   Eats
-------------------
John     Sushi
Jim      Candy Bars


Person   Hair Colour
--------------------
John     Brown
Jack     Blond


Person   Ailment
---------------------
Jim      Migraines
Jack     Night sweats


Person   Collects
--------------------------------------
Jim      Miniature Porcelain Figurines

...and so on. In order to gather together all the facts I have about any person, I must query all of these relations. If I obtain any new kind of fact about any person - if it comes to light that Jack wears green sneakers, or that Jim is a qualified scuba diving instructor - then a new relation has to be created for storing facts of that type.
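
For instance, here is a hedged sketch of the "gather all facts" query, with table and column names invented for the relations above:

SELECT 'Eats' AS fact_kind, eats AS fact_value
FROM   person_eats WHERE person = 'Jim'
UNION ALL
SELECT 'Hair Colour', hair_colour
FROM   person_hair_colour WHERE person = 'Jim'
UNION ALL
SELECT 'Ailment', ailment
FROM   person_ailment WHERE person = 'Jim'
UNION ALL
SELECT 'Collects', collects
FROM   person_collects WHERE person = 'Jim';
-- ...and this query has to grow a new UNION branch every time a new
-- kind of fact (a new relation) is introduced.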

For institutional/political reasons, creating new relations is usually the job of a DBA, which means that if developers or end users are constantly coming up with new kinds of facts that they wish to record, either the DBA will be terribly busy assessing the risk of modifying the database schema every five minutes, or the developers and end users will be terribly frustrated and disappointed because of the DBA's principled intransigence (motto: "You should have thought of that before").

RDF triples decompose binary relations a further step, into statements that can all be stored together in one big relvar:

Subject         Predicate   Object
------------------------------------------
Jim             Footwear    Jim's Sneakers
Jim's Sneakers  Colour      Green

The primary key in this relation is a composite of all three fields.
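
In SQL this might look like the following sketch (table name and column sizes are mine):

CREATE TABLE statement (
  subject   VARCHAR(255) NOT NULL,
  predicate VARCHAR(255) NOT NULL,
  object    VARCHAR(255) NOT NULL,
  PRIMARY KEY (subject, predicate, object)  -- the composite key described above
);

INSERT INTO statement VALUES ('Jim', 'Footwear', 'Jim''s Sneakers');
INSERT INTO statement VALUES ('Jim''s Sneakers', 'Colour', 'Green');
-- A brand-new *kind* of fact needs no DDL at all:
INSERT INTO statement VALUES ('Jim', 'Qualification', 'Scuba Diving Instructor');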

What we gain by doing this is the ability to add not only new facts, but new kinds of facts, into the database without needing to go through the DBA. Our facts can be related in complex graphs (this is the RDF "data model" in a nutshell: RDF statements make up a directed labelled graph) without our needing to petition the DBA to set up foreign key constraints.

The stinger is, what we lose by doing this is almost all of the data integrity and data management facilities provided by the relational model. The DBA isn't involved because the useful work that DBAs can do with relational database schemas can't be done with this model - and that's actually not a good thing.

The question that needs to be asked is: if maintaining data integrity is not up to the RDBMS, as administered by the DBA, then what, administered by whom, will be responsible for it? Will it be done in an ad hoc manner by applications programmers? Isn't that what the relational model was intended to get away from?

Problem is with current RDBMSs.

I've ended up doing this sort of thing for DB apps I've written. My personal thought on the matter is that table creation should not be thought of as a "design-time" activity, but rather as something the application can do when new types of information need to be inserted into the database. Of course, I also think databases should expose their schema through a relational variable/table, and in that sense an insert into that table would in effect create new tables, and allow queries without complete knowledge of the schema. Of course, I've been up half the night, so maybe this will make more sense after some sleep :).
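
Half of that already exists in the SQL standard's INFORMATION_SCHEMA views (supported to varying degrees by the major products); a rough sketch, with the 'statement' table name purely illustrative:

-- Reading the schema relationally, no special API needed:
SELECT table_name, column_name, data_type
FROM   information_schema.columns
WHERE  table_name = 'statement';

-- The missing half is the reverse direction: today an application that wants
-- a new relation still has to issue explicit DDL rather than insert a row
-- into the catalogue.
CREATE TABLE scuba_certification (
  person VARCHAR(255) PRIMARY KEY,
  level  VARCHAR(64)
);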

Politics

Well, I think that's right; but a lot of the trouble people have with "rigid database schemas" and all the rest is actually not so much that the database is "rigid" - even in SQL, you can CREATE TABLE and ALTER TABLE to your heart's content; on SQL Server I can even query and update the database schema tables if I have sufficient permissions - but that the institutional forces around who controls the database schema are rigid. In particular, the centralisation of authority and responsibility on the person (or office) of the DBA creates a bottleneck for change requests.

It may be that what's needed are some slightly less all-or-nothing constraints around who can do what, so that the sorts of "run-time" changes that can be made safely and nonchalantly are distinguished from the sorts of changes that need extremely careful oversight. Otherwise what you get is work-arounds that undermine data integrity, or that propose to liberate the data by making it intrinsically unmanageable.

Closed-world cases

OK. I thought of application programmers using RDF+DB tools to define "loose" file formats for their apps, which can rev easily, minimize maintenance, and maximize backward compatibility. In those cases the engineering teams are the DBAs.

This type of table already exists. It's very common

(unless I miss your meaning completely)

where a new relation comes into existence and needs to be tracked.

This type of table sometimes goes by the term 'data dictionary'; all the major DBMSs have one to manage their own operation.

And I've seen many applications have their own data dictionaries. It used to be more common in Clipper/DB (please don't snicker ... Clipper had function-based indexing in the late 80s; Oracle only added it 3 years ago).

Developers sometimes put data dictionaries in their designs with the rationale of "added flexibility," but many times this is code-talk for a way to make changes without permissions (I've done it myself) ....

Re: Chandler

I think the reason they decided to write it from scratch is that they had to write it from scratch one way or another; the technology they might have built it on just wasn't well enough developed. So by starting mostly from scratch (not entirely, they did use dbxml after all) they could address just the problems they needed to solve -- problems which were a bottleneck for a much larger project -- without being bogged down by technologies which are intended to solve a much wider set of problems (but still didn't solve all the problems they were interested in), and which lack implementations that match the technology's ambition.

What it really comes down to is the principle that stability is proportionate to reusability, and the repository is a central piece of infrastructure. They couldn't rely on anyone else's schedule.

Re: Re: Chandler

The Chandler project is frank about itself and its missteps. I wish it all the best. Addressing your point, however, the Haystack personal information manager, based on RDF, shows otherwise. Interestingly for LtU, it sports a programming language called ADENINE, which is said to resemble, in some respects, a version of Python with native support for RDF data types built in.

Interesting

My first reaction was similar to em's: the authors were ignoring the modelling tools they had in preference to inventing their own. But the phrase "higher-order views" got my attention, and I recalled once trying to do something similar that was unrelated to RDF.

Now to me, having multiple relations with a high degree of similarity is a sign of data duplication that should be normalized/factored/simplified. The binary relations in Dominic's example are simple but sufficient as an illustration, though more complicated examples exist. If n predicates can be reduced to 1 with the addition of an extra variable, should they not be? (One implication of this is that all the predicates are type compatible; it would be interesting to extend this to deal with type parameters.)

In terms of inserting and deleting tuples in such a transformed relvar, this isn't really a problem even with basic SQL. Just replace

insert into SOME_PREDICATE values ( A, B... )

with

insert into GROUP_OF_PREDICATES values ( SOME_PREDICATE, A, B... )

and away you go. I think most integrity constraints would be fine as well, just add an extra clause to deal with the predicate type. Unfortunately some queries that would be valid on a relvar in the original form could no longer be done on the transformed relvar, e.g. aggregates. Perhaps this is more of an artifact of conflating querying and presentation into one language/statement?
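
As a possible workaround for those lost queries, one could define a view per predicate over the merged relvar, so the original form (aggregates included) is still addressable. A sketch using the same invented names, and assuming GROUP_OF_PREDICATES stores the predicate name in a column called predicate and the original columns as a and b:

CREATE VIEW SOME_PREDICATE AS
  SELECT a, b                       -- whichever columns SOME_PREDICATE had originally
  FROM   GROUP_OF_PREDICATES
  WHERE  predicate = 'SOME_PREDICATE';

-- Aggregates that were valid against the original relvar keep working:
SELECT b, COUNT(*) FROM SOME_PREDICATE GROUP BY b;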

Most SQL DBs are already doing "higher-order views" anyway, but don't expose them through any language or API. Consider the system dictionary, which usually contains a table that lists the column metadata as tuples (e.g., syscolumns in MSSQL).

One of the problems I've always had with Date's D recommendations is that while a conformant language would be vastly superior to SQL, that really isn't saying much. Problems like this bring to mind first-order/second-order logic, and I think applying some of the knowledge gained over the years in functional and logic programming to the relational data model would be beneficial.

BTW I also think DB schemas should be rigid, but then I'm a fan of static type checking as well. :)

Much Missing in Proposed Table Structure

Concerning the Nice Paper:

  1. Their new structure (oid, key, value) fails to capture information associated with a column (e.g., type, domain, column constraints, check constraints, etc.) other than its name.

    That problem could be corrected by

    • asserting that all values are of type "string" (after all, this Is a WWW application, right?) or
    • adding new fields, e.g., adding "type" to the current (oid,key,value) yielding (oid,key,type,value)

    but in the latter case most current RDBMSs don't allow a column (in this case "value") to be of more than one type. To fix that problem, you could abandon type altogether and store values as binary large objects (BLOBs). And so on and so on: the changes required ripple on recursively here. (A sketch of one such typed extension appears after this list.)

  2. Is the universe of discourse captured in a single table? If so, then the original table (oid, A1, A2, A3) is merely a concept matrix, with each row representing an object, each column representing an attribute, and cells holding either attribute values or "X"s (to indicate that an attribute applies to an oid).
  3. If the database implementation of choice can support only n columns per table then it would be far simpler to create multiple tables, e.g.,
    table1(oid, A1, A2, A3, ....An-1)
    table2(oid, B1, B2, B3, ....Bn-1)
    ...
    than to abandon the relational model, as the authors are actually doing.

  4. In the end the referenced article is not about RDF with relational databases but is merely an explication of a technique for storing sparse string data in a relational database.
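
Regarding item 1, here is a hedged sketch of what a typed extension of the vertical table might look like (the names and the one-column-per-storage-type choice are mine, not the authors'):

CREATE TABLE vert_typed (
  oid      INTEGER     NOT NULL,
  attr     VARCHAR(64) NOT NULL,
  val_type CHAR(1)     NOT NULL,  -- 's' = string, 'n' = number, 'd' = date
  val_str  VARCHAR(255),
  val_num  NUMERIC,
  val_date DATE,
  PRIMARY KEY (oid, attr),
  CHECK ( (val_type = 's' AND val_str IS NOT NULL AND val_num IS NULL AND val_date IS NULL)
       OR (val_type = 'n' AND val_num IS NOT NULL AND val_str IS NULL AND val_date IS NULL)
       OR (val_type = 'd' AND val_date IS NOT NULL AND val_str IS NULL AND val_num IS NULL) )
);
-- Even so, per-attribute domains, column constraints, and defaults still have
-- nowhere to live, which is the point of the critique above.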

Moving On

In some sense this critique merely repeats statements from the paper. The paper isn't about RDF per se; it just says that "potential applications of the work reported in this paper include stores for XML [7], RDF [1], KBMS [15], LDAP [19] and data mining." It discusses binary relations and N-table approaches on its own terms; those interested can read for themselves. It also admits, "On the negative side, the vertical representation loses data typing since all values are stored as VARCHARs in the Val field, although it is easy to design extensions to support data typing if desired." From a strictly RDF standpoint, URIs might be considered as void pointers: type the referent as you please, or put a tag in it directly. The authors think that vertical-under-the-hood-with-horizontal-access worked fairly well for massive and ever-changing electronics data. They note where the system groans. Critiques are nice, but who can offer better solutions against this data? To say that the authors "abandon the relational model" is weird, for that is exactly what their query transformation preserves. (Note that they index every column in the vertical table.)
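
For the curious, here is a sketch of what such a rewrite can look like (invented names, and not necessarily the authors' exact transformation): each attribute referenced by the horizontal query becomes a self-join on the vertical table.

-- Horizontal query the application writes:
--   SELECT oid FROM parts WHERE voltage = '5' AND package = 'DIP';
-- Equivalent query against the vertical (oid, attr, val) table:
SELECT v1.oid
FROM   vertical v1
JOIN   vertical v2 ON v2.oid = v1.oid
WHERE  v1.attr = 'voltage' AND v1.val = '5'
AND    v2.attr = 'package' AND v2.val = 'DIP';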

Problems which stretch existing software show both its weaknesses and its potential growth vectors. That's why I find papers like this interesting. "Because of its manageability and flexibility, the vertical representation is increasingly finding its way in many commercial systems. It behooves the database community to investigate and study how best to support the vertical representation to bring the new emerging applications to its fold. The work we present is a step in this direction that complements earlier work."

Enough about the paper. I would be glad to read better research if someone cares to offer it, or some proof that good answers are impossible. What I really wanted to discuss was RDF storage in databases, whether old-school RDBMSs, new-school triple stores, or something not yet invented. What do folks think of Sesame, for example?

Please

The paper is not about RDF (point conceded) and not about relational databases because:
  1. The presence of NULLs in the first example raises the alert,
  2. the final result does not handle types, constraints, domains, etc. and is non-relational,
  3. there is no later attempt to reconcile abandonment of type, constraints, etc. by the authors,
  4. the authors demonstrate no understanding of the underpinnings of the relational model and provide no strong reason why we should abandon the only mathematically proven and successful database model in existence after some 25 years,
  5. I myself have seen structures like those the Nice Paper presents in tens of software packages, some as much as 30 or more years old; it was a very common technique in file-based databases from VSAM to DBASE. (IOW it isn't new; no one's getting a patent on this technique that will stick unless it's IBM.) Such structures were largely supplanted by RDBMSs.

So in the end, the Nice Paper is about neither RDF nor RDBMSs; instead it's about a programming trick the authors want to show us. That is fine and interesting, but one should label it as such. It seems that the real problem is that so few understand the relational model today that it is almost impossible to separate the wheat from the chaff (e.g., when asked to explain RDBMSs, I've seen spreadsheets used as examples, so no understanding of the constraints inherent in the relational model is shown).

Consequently AFAICT the original post is a series of links to interesting tools and topics but no single consistent theme connects them.

Your solution then

Be civil. The post is entitled "RDF and Databases," and the links all relate, if not all to the same degree. Sesame is 100% relevant: it's an RDF layer atop a relational database. Yet all people want to do is attack the paper by IBM Fellow Dr. Rakesh Agrawal, Yirong Xu, and Amit Somani from IBM's QUEST group. There was a lot of hyperbole thrown around ("no understanding of the underpinnings of the relational model" is unreal), so I invited the paper's authors to comment for themselves.

Does anyone care to comment about Sesame? Is it as bad as the paper is claimed to be?

Starting from scratch to use less time

In production development, I usually find that it takes less time to rewrite reasonably simple functionality from scratch than to use an existing implementation. For research, I usually find that it takes less time to use a library than to write from scratch.

There are several reasons, mainly to do with the higher requirements for production libraries, the cost of doing a full evaluation of code for production, the risk of having to seriously fix up and maintain said code, and the fact that more generic interfaces are often clunkier for the specific case you're covering.

Of course, there are major pieces of functionality that almost always are more suitable for bringing in as libraries or frameworks (databases, GUIs, printing frameworks, etc). But for smaller pieces of functionality, I've often been bitten by trying to re-use somebody else's piece of code. Thus, in a time squeeze for a predictable area, I'll often choose rewrite (which has a fairly low risk[1] profile) instead of re-using (which has a higher risk).

With more time, I'll instead look for libraries. Libraries have a few advantages:

  • They're usually more generic, so increased requirements for the component over time will be cheaper to accommodate.
  • For reasonably common libraries, there are more developers that know them. There is also better documentation, more examples, etc. available.
  • Maintenance for changing "world requirements" (due to the generic software update cycle) will be offloaded onto the library, instead of you having to do this maintenance yourself. This also means there will often be other people testing the library with new versions of other packages, so you won't always get "first adopter" pain for the code.
Eivind.

 

[1] Risk == time variation (in this context)

Yes, but...

The paper definitively has to do with RDF; just see Storage Engine for Semantic Web, which draws "upon results and lessons" from the disputed "nice paper" :-), but consider that this approach has some tradeoffs not documented in the paper.

But what?

Ricardo Stuven wrote:
The paper definitively has to do with RDF; just see Storage Engine for Semantic Web, which draws "upon results and lessons" from the disputed "nice paper" :-), but consider that this approach has some tradeoffs not documented in the paper.

The first item you note, Storage Engine for Semantic Web, does not provide attributions; it is, however, merely a PowerPoint presentation of that so-called nice paper. Like the "nice paper," it does not reference RDF per se and, being a restatement of that paper, it also reveals the author(s?)' poor understanding of relational databases.

The second paper you note, tradeoffs, mentions RDF only once, stating:

...e-commerce applications, such as on-line shops, exchanges, marketplaces. ... Other potential applications of the work reported in this paper include stores for XML [7], RDF [1], ...

tradeoffs is not about RDF; it is instead about large databases. In particular, tradeoffs describes experiments comparing the performance of the "horizontal" and "vertical" database structures described in the "nice paper" atop two unnamed vendors' database platforms. The conclusion of tradeoffs is that the primary contention [that the vertical representation is superior to the horizontal in all cases] of the "nice paper" is incorrect. In other words, tradeoffs contradicts the original "nice paper"! To tradeoffs's great credit, a good explanation of the context and references to previous work are provided.

The experiments in tradeoffs are insufficient to provide a complete characterisation of the two compared structures (horizontal & vertical); to do so, many more experiments would be necessary. But I question such experiments for the following reasons:

  • If one is searching for an efficient data structure for a particular problem, why limit oneself to implementing ad-hoc data structures atop a relational database management system (RDBMS)? Would it not be more reasonable to include non-relational data structures that might be more appropriate to the problem?
  • An RDBMS is a complex software system replete with bugs and features, and particular vendors' bugs and features vary. Experimenting with such systems provides useful information only if the intent is to purchase and implement the described system in the described manner. Otherwise such experiments are not useful research, since the correction of a single bug or the addition of a single optimisation to one vendor's product could change an experiment's outcome radically. The author(s?) rediscovered this themselves when they noted how altering various database parameters changed their results.

In summary, comparing the performance of various data structures and algorithms atop multiple current-day RDBMS implementations is akin to building castles upon sand. I lament the waste of effort, especially when it would be better spent finding algorithms and structures other than RDBMS to solve the problem in question [how to store large amounts of inhomogeneous data in an efficient manner].

"Yes, but..." explained

Yes: "The paper definitively has to do with RDF", because RDF is a possible application field.

But: "this approach has some tradeoffs". So, I agree with you on "The conclusion of tradeoffs is that the primary contention [that the vertical representation is superior to the horizontal in all cases] of the 'nice paper' is incorrect".