SPARQL Query Language for RDF

The 2nd draft of the SPARQL Query Language for the Resource Description Framework (RDF) was published a few days ago by the W3C's RDF Data Access Working Group (DAWG). It's very SQL-like in syntax, but the fact that the data being queried is a graph brings in a raft of issues, as do things like RDF's support for (XML Schema) typed literals alongside named classes/instances. Early implementations include Rasqal (demo - Rasqal's C source, with lots of language bindings) and ARQ for Java. There's a public mailing list for comments.

Once the query language is hammered out then RDF triplestores should make a viable alternative to SQL/RDBMSs, especially for use on the Web and with other 'unstructured' data.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Looks like Prolog

I know I've said this before...

How does SPARQL manage transitive closure, especially if there are cycles in the graph? Say we have a social network, mapped with something like FOAF, and I want to know everybody who's connected - at any degree of separation - to a particular person. So I want all the nodes that are reachable via any arc all of whose edges are labelled with a particular predicate (or one of a set of predicates: parent, sibling, child, colleague). And I don't want to go round in circles infinitely if an arc happens to take me back to where I started ("my family's so inbred, I'm even related to myself"...).

Can SPARQL express such a query?

Anothe cycle-related question

Suppose we have a SPARQL query that looks for vertices connected by an arc of fixed depth, e.g.

SELECT DISTINCT ?foafoafoaf
WHERE
     ( ?f foaf:knows ?foaf    )
     ( ?foaf foaf:knows ?foafoaf )
     ( ?foafoaf foaf:knows ?foafoafoaf )

If A knows B, B knows C, C knows A, and ?f=A, will A be included in the result set? In other words, are already-visited vertices excluded/excludable from the search?

Finally, a couple of comments:

  1. Gosh, no RDF/XML!
  2. A graph/RDF-based substitute for SQL would presumably have to support INSERT, UPDATE and DELETE equivalents - to be a data manipulation language as well as a data query language. Otherwise you'd be doing queries in SPARQL, then writing application code to perform appropriate updates via direct operations on the triplestore.
Are there any plans afoot to extend SPARQL with data manipulation features?

IANAW3CL, but...

If A knows B, B knows C, C knows A, and ?f=A, will A be included in the result set? In other words, are already-visited vertices excluded/excludable from the search?

My understanding is that in general, A will be in the result set. Any subgraph that matches the pattern will be returned, and node sharing is allowed. You can constrain sharing like this (example from section 11.2.1 of the working draft):

SELECT ?name1 ?name2
WHERE ( ?x foaf:name ?name1 )
( ?x foaf:mbox ?mbox1 )
( ?y foaf:name ?name2 )
( ?y foaf:mbox ?mbox2 )
AND ?mbox1 == ?mbox2 && ?name1 NE ?name2

I don't know whether statements themselves are allowed to be shared, though. For instance, if we have only a single statement, "A knows A", then I'm not sure whether A will match your query or not...

Are there any plans afoot to extend SPARQL with data manipulation features?

I don't know, but I agree that it's essential if RDF is to gain any real traction as a data storage model. Current applications seem very focused on read-only access to RDF data, which seems very short-sighted. Ongoing on-line evolution and management of RDF repositories is still very undersupported. I think the W3c-RDF-DAWG (or whatever they're called) is working on this.

Current applications seem ver

Current applications seem very focused on read-only access to RDF data, which seems very short-sighted.

Perhaps a reason for this is that up until now, virtually every triple store implementation had its own query language and API. Since the query language provides no independence from the store that implements it, using that store's API (instead of the QL) for inserting and deleting triples is not such a hot issue: you're bound to that particular store anyway.

Aside from that: many triplestore developers do have extensions to allow INSERT/DELETE queries on the agenda. But the situation I just sketched means that it is always just not quite urgent enough, compared to other developments.

I expect that standardization of a QL (that _does_ support INSERT/DELETE) will help a lot in that respect. Whether SPARQL will include this soon, I don't know. But I'm confident that eventually it will. IMHO, SPARQL is a very modest attempt to come to a minimal QL proposal that fits a certain common denominator of expressiveness. As time progresses and more mature tools come to support it, that expressiveness may well be extended (insert hand waving).

I think not

I'm not totally up-to-date with the latest draft, but as far as I know SPARQL cannot express transitive closure in a single query. The application needs to recursively query the triplestore and manage cycles. I don't think this is a terribly compelling argument, but "SQL can't do it either..."

Frankly, I feel like RDF query languages are too young and poorly researched to be standardized at this point. But that won't stop the W3C, I guess.

No Kidding

RQL handles transitive closures and provides a consistent language for dealing both with data (RDF) and metadata (RDF Schema).

As far as I'm concerned, the W3C's RDF query language efforts are an excellent proof by demonstration that SQL has effectively polluted the minds of otherwise extremely intelligent people when it comes to what we should expect from query systems, especially non-relational ones, although even in the relational case there are superior alternatives to SQL! I keep praying that someday the industry will overcome its SQL fetish, but I don't expect to see it in my lifetime.

A real shame though is that R

A real shame though is that RQL is not completely compatible with the RDF specs. Specifically, it assumes a certain rigidity in the schema specification that is not endorsed by the RDF semantics (such as the requirement that every property must have exactly one domain and range). RQL tries to fit RDF in an OO perspective, which, for several reasons, does not quite fit.

As for transitive closures, RQL does not support this for arbitrary properties AFAIK, but only for the subclass and subproperty hierarchy of RDFS. So the foaf:knows example
would not be expressable in a single RQL query either.

For a relatively up-to-date overview of several RDF QLs and their properties, see this site. Keep in mind that a lot of these languages are still under development and that features reported as missing may be implemented in future versions.

Graphs and Objects

Jeen Broekstra: A real shame though is that RQL is not completely compatible with the RDF specs. Specifically, it assumes a certain rigidity in the schema specification that is not endorsed by the RDF semantics (such as the requirement that every property must have exactly one domain and range). RQL tries to fit RDF in an OO perspective, which, for several reasons, does not quite fit.

The property point is a good one, and it's interesting to note that SeRQL doesn't share this restriction.

I'm not sure that I agree that RQL tries to fit RDF in an OO perspective, at least in any meaningful way other than what can be inferred from the RDF model and semantics. Rather RQL tries to follow a formal graph model—the RQL papers do an excellent job of documenting it—and it happens that this graph model has a rather obvious isomorphism to a traditional class-based object model, particularly since RDFS itself borrow both semantics and terminology such as "class," "subclass," "inheritance..." from the OO world. It's true that the graph model is strictly richer than that, but it's also true (IMHO) that a lot of real-world use of RDF is likely to reflect users' familiarity with the DAG subset of the graph model that maps nicely to OO concepts.

Jeen: As for transitive closures, RQL does not support this for arbitrary properties AFAIK, but only for the subclass and subproperty hierarchy of RDFS. So the foaf:knows example would not be expressable in a single RQL query either.

It seems to me that RQL could indeed express this, but I don't currently have either RQL or SeRQL installed, so I can't try it right away. In any case, the RDFS subclass and subproperty transitive closures will probably cover 80% of real-world cases, and this remains a distinct benefit of RQL over what the W3C is promoting.

Graph queries as regexes

Just a thought - I'm wondering if this is either reinvention of the wheel or a bad idea from the get-go: why not express graph queries on directed labelled graphs as regular expressions (and execute them with a DFA)?

Suppose we have a graph where every vertex label can be represented by a single uppercase character (of an alphabet of arbitary length), and every edge label can be represented by a single lowercase-character. In the case of RDF graphs, each vertex label must be unique, whilst many edges can have the same label. Every possible arc can then be expressed as a string of alternate uppercase and lowercase letters, e.g. AaBaCbD (Fred knows Jim who knows Dave who is Charlie's golfing partner).

Now, given a graph of vertices (A, B, C, D...) and edges (AaB, BaC, CbD...), we could frame a query thus: Aa(_[a|c]_)*b$1 - that is, an expression that will be satisfied by any arc (string) that begins with A, followed by a, then (any character, then either a or c, then any character) any number of times, then b, then any character (to be captured in variable $1).

If we translate the single-character label aliases into namespace-qualified URLs, then this same query might be written something like:

person:Fred ->
foaf:knows ->
(_ -> [foaf:knows || foaf:spirit-brother] -> _)* ->
foaf:golfs-with ->
$1

To take a more SPARQL-y example:

SELECT ?name ?mbox
WHERE  {
         ( ?x foaf:name ?name )
         ( ?x foaf:mbox ?mbox )
       }

could be written as

_ -> [foaf:name -> $name && foaf:mbox -> $mbox]

(&& works like ||, but combines the results from distinct paths into the same result set.)

Is this a non-starter, or might it be a goer? Or have I just reinvented part of RQL?

Path expressions

I know nothing about RQL, but back before XML came along I saw your idea referred to as path expressions. Since RQL has something it calls path expressions, I guess you just reinvented part of it.