RDF Elevator Pitch

Eureka: the perfect RDF introduction, with thanks to A.M. Kuchling (amk). Nothing beats crayon-colored diagrams. It is short, sweet, and hits the main points precisely, including the 'political' issues at the end: Much W3C advocacy makes the Semantic Web sound too futuristic... The RDF Core spec is hard to read and really boring... Introductory tutorials are few... Simple things can be done without much effort, and can still be useful.

On one island are the semantic web folks. On another island are the semantic filesystem folks. A summit seems in order. I don't hear much about the two working together, but then I live on yet another island. RDF+ReiserFS looks like a match made in heaven. For example, Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases... Do you want a million files in a directory, and want to create them fast? No problem.

From the article,

Reiser has "substantial plans" for adding new kinds of semantics to ReiserFS to help it challenge Microsoft's efforts. "We're planning on competing with the Longhorn filesystem," he says.

The new ReiserFS will eschew the relational algebra approach and work with semistructured data. "The person entering data can employ [the] structure inherent in the data rather than forcing a structure," Reiser said, adding, "Flexibility in querying and creating data is our target. [This] will stand in contrast to Microsoft's SQL-based approach, which does not have that flexibility."


Semistructured?

I'm not sure what semistructured data is supposed to mean. RDF triples are structured data, as far as I can see. Graphs are structures, aren't they?

[The] structure inherent in the data sounds a bit metaphysical - how is this Philosophick Mercury to be extracted?

Dancing trees sounds intriguing, though.

'semistructured' refers to th

'semistructured' refers to the fact that RDF data models do not necessarily adhere to a strict schema (unlike, say, the ER model, which is structured).

A relative term

(subject predicate object) is pretty strict! But it's a mini-structure from which larger structures (that happen to be graphs) can be assembled. Reiser's names are also mini-structures: it's not that they're unstructured in any way (which is why I wonder about the semi-), but that the structure of a lot of them put together isn't totally predetermined by the structure of one of them by itself. In a DB schema, the structure of all of the pieces put together is already decided before you put any of the pieces in.
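
To make the contrast concrete, here is a minimal sketch (all names invented) of how rigid mini-structures accumulate into a larger structure that nobody declared in advance:

    # Each triple is strictly (subject, predicate, object).
    triples = set()

    def add(s, p, o):
        triples.add((s, p, o))

    # No schema was declared up front; structure accrues as facts arrive.
    add("report.pdf", "author", "alice")
    add("report.pdf", "topic", "filesystems")
    add("alice", "worksFor", "namesys")         # a new "column" appears for free
    add("holiday.jpg", "takenAt", "reykjavik")  # an entirely different shape

    # Everything known about one subject, without any predefined table layout:
    print({(p, o) for (s, p, o) in triples if s == "report.pdf"})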

Semantic FS

The goal of the current semantic technologies is to enable not merely the creation of semantic information, but also the automated processing of that information.

RDF is not just a notation: it's also a data model (strictly speaking, RDF/XML or N3 or what have you are notations; triples and graphs are the data model). Given that data model, it's possible to do some automated processing of semantic information by algorithmic means: graph traversal, for instance. The data model allows us to say things like, in order to make valid inferences based on these statements, perform these operations on the graph made up by the triples representing the statements.
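
To make "perform these operations on the graph" slightly more concrete, here is a toy sketch in Python (not any particular RDF library; the predicate and resource names are invented and unprefixed): a small forward-chaining traversal that infers type statements from subClassOf statements.

    # Toy RDF-ish graph: a set of (subject, predicate, object) triples.
    triples = {
        ("cat", "subClassOf", "mammal"),
        ("mammal", "subClassOf", "animal"),
        ("felix", "type", "cat"),
    }

    def infer_types(graph):
        """Add (x, type, C) whenever x's type is a (transitive) subclass of C."""
        inferred = set(graph)
        changed = True
        while changed:
            changed = False
            for (x, p1, a) in list(inferred):
                if p1 != "type":
                    continue
                for (b, p2, c) in list(inferred):
                    if p2 == "subClassOf" and b == a and (x, "type", c) not in inferred:
                        inferred.add((x, "type", c))
                        changed = True
        return inferred

    # felix comes out as a mammal and an animal by pure graph traversal.
    print(sorted(t for t in infer_types(triples) if t[0] == "felix"))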

The semantics of the semantic web, as currently understood and practised, represent a highly constrained subset of the semantics of what we might call ordinary knowledge construction. There are things that I can know and say about the contents of my file system that are fairly difficult to put into RDF triples (I don't know whether there are any things I can know and say about the contents of my file system that it would be impossible to put into RDF triples).

To be more precise, the translation of the kinds of stuff human beings think they know, and the kinds of meanings they like to bandy about, into machine-processable semantic information generally entails a degree of (re-)formalization. We have not only to discover [the] structure inherent in the data, but also to derive a representation of that structure that will fit into our data model; and this is true even if the data model is claimed to support semistructured data.

The difficulty is then of the following kind: the process of formalizing semantic information so that it can be processed by an automaton is not itself automatable (or at least not by the same process that the machine will use to process the formalized semantic information; there might be some higher-order process, but the same problem would then apply at the higher level). The person entering data still has a job to do (apart from just typing the stuff in), and it is not necessarily an easier job than the job of the old-fashioned suit-wearing person who performs domain modelling and creates relational database schemas.

The (marketing) promise of the Longhorn FS has been that ordinary users will be able to transfer the things they know about the contents of their file systems into the machine, so that the machine will be able to do a variety of smart things with that information. The creation of better and easier-to-use tools for the (re-)formalization of human knowledge is I think a Good Thing; but there is an unfortunate tendency for such tools to be marketed as if they did the job themselves (or magically altered reality so that the job no longer needed to be done).

I would like to have, and could see myself benefiting from the use of, semantic technology in my file system. Even a few user-definable metadata tags that could be addressed by a straightforward query language would be useful. However, Google's desktop search (which chucks semantics out of the window and does pure syntax-crunching text processing) is currently more useful to me than any existing semantic technology, and I think this is because it places less of an onus on me as the end-user to translate myself into automatonese. Google's search engine just gets on and does what machines are good at doing. Semantic technologies want to be your friend.

Dominic: I believe the term

Dominic: I believe the term "semistructured data" is used in the sense Reiser uses it in his future vision whitepaper (a great read, BTW). In that document it's used as opposed to the traditional tree and relational database models. Reiser's idea is more of a "soup" against which very specific, or very general, queries can be run. In this sense it does apply well to RDF.

Great paper

Reiser's naming scheme mingles structural information of various kinds with content, so that a name can describe a more or less complex data structure into which its parts are then slotted. It looks pretty neat. I don't think it looks much like RDF, though. You could probably model its primitives using RDF primitives, but they're still different sets of primitives.

Summit Agenda

The islands are not 100% identical. That wasn't the point. It was that they share similar dreams and should probably talk. Both want data pools built on low-level concept primitives, semantic queries, distributed data pools, and enhanced end-user experience. Neither wants end users to become DBAs, as far as I know. Those tasks are for application programs. The paper mentioned touches on many of Dominic's points, like the Google desktop (text keywords), subsets of human knowledge, and the issue of distributed stores.

While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel.

That statement sounds very RDF-semantic-webish to me. RDF is terribly simple and terribly powerful, but RDF market-speak is wrapped up in abstract goo. That is one reason I made the post. RDF needs more "RDF for Dummies" to gain traction out there.

Bridges aside, ReiserFS might be an ideal persistence medium for RDF.

Even RDF isn't fixed in stone right now. There is talk about a fourth element to solve the provenance problem. Talking to ReiserFS folks might be very fruitful, that's all. Certainly these islands should know of each other's existence. That seems not the case right now, but I could be very happily wrong.

That statement sounds very RD

That statement sounds very RDF-semantic-webish to me. RDF is terribly simple and terribly powerful, but RDF market-speak is wrapped up in abstract goo.

I'm with you 100%. Too many RDF evangelists and triple-store vendors tout the "flexibility" of RDF like a silver bullet, and play dumb when you try to get them to speak to the hidden tradeoff with respect to manageability. Anyway, it's an interesting problem and really does fill a need, but we live in a world where relational data architects outnumber "ontologists" 10000 to 1, and I've yet to be convinced that it doesn't matter.

Even RDF isn't fixed in stone right now. There is talk about a fourth element to solve the provenance problem.

That's very interesting news to me, as it addresses a very specific issue with RDF that I happen to be up against at the moment. Do you happen to have a pointer to any discussion underway about this? Has a proposal been formalized yet? I'd like to follow up with this...

The fourth element

The fact that RDF as it stands includes an approach (for some reason not terribly popular) to the reification of statements, so that they can in turn be the object of statements such as S asserts that P, perhaps points to the need for a more general mechanism for indicating provenance. I don't know, though; I would be interested to see the arguments on both sides. There is more than one sort of provenance, or more than one sort of possible relationship between S and P: S has verified that P, for instance, or even S strongly intuits that P...
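
For reference, "S asserts that P" under the standard reification vocabulary comes out roughly like this, sketched here as plain tuples rather than any particular RDF syntax (the statement, the actors, and the asserts/stronglyDoubts predicates are invented for illustration):

    # The statement P: ("moon", "madeOf", "cheese"), reified as the node _:stmt1,
    # which is described by four triples and can then itself be the object of
    # further statements.
    triples = {
        ("_:stmt1", "rdf:type",      "rdf:Statement"),
        ("_:stmt1", "rdf:subject",   "moon"),
        ("_:stmt1", "rdf:predicate", "madeOf"),
        ("_:stmt1", "rdf:object",    "cheese"),

        # Different flavours of provenance attaching to the same reified node:
        ("wallace", "asserts",        "_:stmt1"),
        ("gromit",  "stronglyDoubts", "_:stmt1"),
    }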

Yeah, I'm definitely aware of

Yeah, I'm definitely aware of reification and all the performance and manageability issues it entails (no pun intended). And certainly in the most general case a reification standard is absolutely essential.

But in many cases there's a fixed set of statement metadata that we want to be able to store and query in a very efficient way. We want RDF extended to something like:

(subject, property, value, date added, source, version, status)

where status is something like "approved", "pending", "rejected", "obsolete"...
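
As a purely hypothetical sketch (not a feature of any existing triple store), the extended record amounts to something like this, with the metadata directly queryable and no reification layer in the way:

    from collections import namedtuple
    from datetime import date

    # Hypothetical extended statement: the triple plus a fixed set of metadata fields.
    Statement = namedtuple(
        "Statement",
        ["subject", "property", "value", "date_added", "source", "version", "status"]
    )

    store = [
        Statement("doc42", "author", "alice", date(2005, 1, 10), "import-job-7", 1, "approved"),
        Statement("doc42", "topic",  "rdf",   date(2005, 1, 11), "web-form",     1, "pending"),
    ]

    # Filtering on statement metadata is an ordinary query over ordinary fields.
    print([s for s in store if s.status == "approved"])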

The current answer we're hearing from triple-store vendors is "use reification," but frankly it's a pretty poor answer when you're looking for a production-ready system. If a triplestore is really unable to accommodate something as simple as "date added" without relying on reification in its general form (and inference to boot, since I want to enforce that a statement can only have a single "date added"), I'm afraid it'll never perform.

I'm also afraid that ontology design would become a total nightmare, since I'd have to distinguish those statements that require this audit trail from those that don't. You certainly don't want to require that every "date added" statement have its own "date added" record (infinite regress, anyone?).

On the other hand, if I heard a truly compelling case that an implementation has completely solved the performance problems (time and space complexity) of reification, I could probably be convinced to model it that way. (Assuming I can find a qualified ontology design expert, but that's going back to a different fork in this thread...)

(I suppose maybe this is getting off-topic?)

RDF joined to non-RDF

It sounds as if you want something like:

(statement_id subject property object)

that could then be joined to whatever metadata-about-statements you liked in some other table(s), e.g.:

(statement_id date_added source version status)

You could even implement multiple records per statement, such as a revision history:

(statement_id revision_number revision_date revised_by comments)

The point being that the things you want to know about your triples will tend to be fairly fixed and regular, and RDF + reification is maybe not an ideal way of representing those things; at the same time, I doubt whether you could get every possible user of an RDF store to agree on the same schema for metadata-about-triples. A statement ID field could be used to provide a link between the RDF and relational worlds.
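
A minimal sketch of that arrangement, assuming an ordinary SQL store such as SQLite (the table and column names are only illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE triples (
            statement_id INTEGER PRIMARY KEY,
            subject TEXT, property TEXT, object TEXT
        );
        CREATE TABLE statement_meta (
            statement_id INTEGER REFERENCES triples(statement_id),
            date_added TEXT, source TEXT, version INTEGER, status TEXT
        );
    """)

    conn.execute("INSERT INTO triples VALUES (1, 'doc42', 'author', 'alice')")
    conn.execute("INSERT INTO statement_meta VALUES (1, '2005-01-10', 'import-job-7', 1, 'approved')")

    # The RDF side stays pure triples; the metadata lives in ordinary relational
    # tables and is reached through the shared statement_id.
    rows = conn.execute("""
        SELECT t.subject, t.property, t.object, m.status
        FROM triples t JOIN statement_meta m ON t.statement_id = m.statement_id
        WHERE m.status = 'approved'
    """).fetchall()
    print(rows)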

I also think many people who

I also think many people who have problems with reification as a solution fail to take into account that it is a conceptual mechanism, not necessarily an implementation mechanism.

What I mean is that most triple stores have support for context. This is typically implemented using a quad structure for RDF statements, where the fourth place encodes a grouping identifier. This grouping identifier is, in many systems, a provenance identifier (the source of the data), but it could equally well be implemented agnostically, that is, with no built-in semantics. Such a grouping mechanism can then be used to support date stamps, provenance, versions, etc. To the outside, the store could present this information as reified RDF, but internally it could store it a lot more efficiently than just adding seven additional statements for each reification (which obviously does not scale too well).
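
A toy version of that separation (all identifiers invented): quads internally, statements about each context kept alongside, and reification produced only as an export view.

    # Internally: quads, where the fourth position is an opaque context id.
    quads = [
        ("doc42", "author", "alice", "ctx:batch-2005-01-10"),
        ("doc42", "topic",  "rdf",   "ctx:batch-2005-01-10"),
    ]

    # Statements *about* a context are ordinary records keyed by the context id.
    context_info = {
        "ctx:batch-2005-01-10": {"source": "import-job-7", "date_added": "2005-01-10"},
    }

    def as_reified(quads):
        """Export view: expand each quad into standard reification triples."""
        out = []
        for i, (s, p, o, ctx) in enumerate(quads):
            node = "_:stmt%d" % i
            out += [
                (node, "rdf:type",      "rdf:Statement"),
                (node, "rdf:subject",   s),
                (node, "rdf:predicate", p),
                (node, "rdf:object",    o),
                (node, "partOf",        ctx),  # invented predicate linking statement to its context
            ]
        return out

    print(as_reified(quads))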

thanks as always...

Thanks to everyone in this thread. I've got some digesting to do... My problem may be as simple as not talking to the right vendors... I'm still a bit reluctant to do this the "pure RDF" way, and Dominic really hit the nail on the head with his proposed RDF->relational join schema. That, to my mind, is the simplest, most elegant solution, and is basically what I'm looking for. We'd also like the storage layer to be aware of the metadata for the purposes of querying, etc., but it may be OK to build that as a separate layer, given the ability to do very efficient joins.

Anyway, thanks for all the food for thought.

No but I have a thought...

...that it might relate closely to what we call today file permissions and ownership. That linkage argues even more for discussions with ReiserFS people.

Come to that, there might be linkages to the Mozart Oz worldview and its various security issues (message passing stored procs).

Wish I could help more, but maybe experts will speak. I invited Reiser himself. His work is truly fantastic and LtU folks should know about it, regardless.

DARPA Funds Both Islands

Still more strangeness about the isolation of these islands is that DARPA funds both!

N-ary Relations

Just to respond to Matt and Dominic's points: RDF *is* a relational world; it's just that everything is expressed as binary relations. Imagine every predicate (property) as a separate table in an RDBMS, with a column for subject and a column for object. There are various ways of representing n-ary (multi-column) relationships in triples, with or without reification; there's even a best-practices doc on it.
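
The mental model, sketched with made-up predicates: one two-column "table" per predicate, and a join across predicates is just a join across those tables.

    from collections import defaultdict

    # One "table" per predicate, each row being (subject, object).
    tables = defaultdict(list)

    def add_triple(s, p, o):
        tables[p].append((s, o))

    add_triple("doc42", "author", "alice")
    add_triple("doc43", "author", "bob")
    add_triple("alice", "worksFor", "namesys")

    # The 'author' predicate behaves like a two-column relation;
    # joining it to 'worksFor' is an ordinary relational join.
    authors  = dict(tables["author"])      # document -> person
    employer = dict(tables["worksFor"])    # person -> organisation
    print({doc: employer.get(person) for doc, person in authors.items()})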

But generally I don't think it's likely to be a big problem in practice. Most datastore implementations include something that makes tracking provenance easy (the named-graph approach seems to me the most straightforward). You might need reification when passing data between systems, but when storing and processing the data there's nothing to stop you going outside the RDF model locally.

Binary Relations

Binary Relations were discussed on LtU, and Dominic brought up their relationship with RDF...

If the semantic FS folks would like to talk, I'm happy to do so

Is this summit a physical world event?

My phone is +1 510 482-2483.

Hans Reiser
Architect
ReiserFS
Namesys.com

It was just a notion

...that the two groups should communicate, since their goals are similar. The RDF people do have conventions, surely. (Try Google, I'm not the one to ask.) I think you would make a fantastic invited speaker at such an event. Thanks for your wonderful work, by the way!

Does RDF resemble

...ZigZag? There seem to be parallels. (I'm not claiming an exact match, just similar intent, as with ReiserFS.) Manu Simoni comments that

Technically, ZigZag is a database and visualization/user-interface system for a subset of general graphs - the restriction is that a node may have only one incoming and one outgoing edge with a given edge label. So structures are organized as lists/strings of nodes, which makes it easier to visualize than general graphs, that can have any number of edges with a given label incoming/outgoing on a node.

The ZigZag-for-personal-computing vision, as I understand it, is to represent all information using interconnected graph structures, and to offer different visualizers and mini-applications that know how to display or manipulate different structures (as opposed to today's unconnected files and black-box applications). So where today's OSes offer folders and files as structure, a ZigZag system offers a much more fine-grained structure.
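
A rough sketch of that restriction (purely illustrative, not based on any actual ZigZag implementation): per edge label, each node gets at most one successor and one predecessor, so each dimension decomposes into simple chains that are easy to list and display.

    # links[label][node] -> the single outgoing neighbour of node along that label.
    links = {}

    def connect(label, a, b):
        dim = links.setdefault(label, {})
        if a in dim or b in dim.values():
            raise ValueError("at most one outgoing/incoming edge per node per label")
        dim[a] = b

    connect("next-chapter", "ch1", "ch2")
    connect("next-chapter", "ch2", "ch3")
    # connect("next-chapter", "ch1", "ch3")  # would raise: ch1 already has a successor

    # Following one label from a node yields a simple list, which is easy to visualize.
    def chain(label, start):
        out, dim = [start], links.get(label, {})
        while out[-1] in dim:
            out.append(dim[out[-1]])
        return out

    print(chain("next-chapter", "ch1"))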

talk about peninsulas...

I wonder exactly which island it is that I'm living on... I have just finished my MSc thesis on an AOP framework that aims at representing an application and its different perspectives with RDF/OWL. The so-called weaver should hopefully become smarter (the way a web agent would in the Semantic Web world).