Q: Modularizing SQL?

(Sorry, this may be a bit off-topic. But I think it's related to PL design.)

Something has been bugging me for quite some time.

SQL is powerful. It's declarative. Complex data-access algorithms can be expressed very concisely. But its syntax is monolithic. A SQL statement cuts across multiple tables (or domain objects, as in OOA&D). When I examine the SQL statements at work, many of them share similar patterns: certain inner joins or WHERE clauses are repeated in many places to achieve the same filtering rules. I can't see an easy way to abstract these patterns out, to reuse snippets of the statements (like mixins or traits?). I can't see an easy way to decompose complex SQL statements, or to compose them out of smaller building blocks in manageable ways.

From the OOP point of view, data access should be hidden in an O/R layer and everything should be done via methods on the domain objects. But that's not feasible in many applications. Many of the business processes implemented by the software my company develops are batch-oriented, and they already take hours to run. If I insist on having everything go through the O/R layer and give up the power of hand-optimized SQL, the software is simply not usable.

What other alternatives are there? And, is there any research on modularizing SQL or on a modularizable data-access language which is as powerful as SQL? (I couldn't find much from Google.)

You're already further than most

You're already further along in your analysis than most people get. LtU hasn't been too interested in this particular question historically, although it seems on-topic enough to me. But it is of primary interest in another forum, the usenet group "comp.databases.theory". I should warn you that that forum is of a less refined sensibility than this one, although the technical content is also quite good. (Think of LtU as a British men's club where members sit in high-backed chairs discussing Babbage's latest work and ruffians are turned away at the door, and comp.databases.theory as a wild west saloon in which you will get a chair broken over your head if you don't salute the portrait of President Codd behind the bar when you walk in.)

My own take: the problem lies on both sides of the O/R fence. The "R" language (SQL) is "monolithic", as you say. The "O" languages, C++, Java, whatever, lack the right kind of type system to handle the kind of problems SQL is good at. In particular, nominal typing is ill-suited to handle arbitrary joins. This is just a summary; there are of course many more issues.

Marshall, could you explain...

Could you explain why the issue of nominal typing gets into the picture?

Nominal Typing vs. Joins

Suppose you have two tables with ten columns each. A query that joins the two tables can select any subset of the twenty columns. Even if we suppose that most joins include a foreign key, that still gives us the power set of the set of nineteen columns. Each one of the roughly half a million different combinations is a different relation type. Clearly giving them all distinct names is prohibitive. This is not just an issue for ad-hoc queries; every query anywhere in client code that picks a distinct set of columns will have its own type.
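The arithmetic behind "roughly half a million", as a quick check:

```python
# Two tables of ten columns each, joined on one shared foreign-key column,
# leave nineteen distinct columns. Every subset of those columns that a
# query can project is, structurally, a different relation type.
total_columns = 10 + 10 - 1          # one column shared via the foreign key
projection_types = 2 ** total_columns
print(projection_types)              # 524288, i.e. roughly half a million
```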

One of the strengths of the relational model is the ability to pull together exactly the data (and data type) you need in the moment; nothing extra need come along for the ride, just to satisfy the type system.

Gotcha. Thx.

Gotcha. Thx.

No abstraction

In order to abstract out common patterns in SQL, there must be a way to abstract queries over tables, joins, etc. Functional languages support this "common pattern factoring" via the abstraction of "higher order functions", but SQL does not support "higher order queries", so textual repetition is inevitable.

Re: No abstraction

In order to abstract out common patterns in SQL, there must be a way to abstract queries over tables, joins, etc.

Ok, I'll bite. Why doesn't a "view" count as an abstraction? You can use it in other queries without repeating the text of the query.

Not what he meant

No, what's missing is the ability to write a parameterized query. Something like:

query(T,R)
{
select sum(foo.R) from T,foo where T.id=foo.id
}

query(employees,salary)
query(select * from ships where hullType='wood', displacement)
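One way to read that pseudocode: in a host language, `query` becomes a function that splices its arguments into SQL text. A minimal Python sketch (a hypothetical helper, not a real library; naive string splicing with no quoting or injection safety):

```python
def query(t: str, r: str) -> str:
    """Generate SQL for the parameterized query above.

    t may be a bare table name or a subquery; r is the column to sum.
    Subqueries must be parenthesized to be legal in a FROM clause.
    """
    source = f"({t}) as t" if t.lstrip().lower().startswith("select") else f"{t} as t"
    return f"select sum(foo.{r}) from {source}, foo where t.id = foo.id"

print(query("employees", "salary"))
print(query("select * from ships where hullType='wood'", "displacement"))
```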

Does this really solve anything?

I question whether this approach helps with problems found in practice. At least in your example, any appearance that it does so seems to me to be entirely due to the syntactic abstraction cleaning up after the terrible mess that is SQL syntax.

At least in your example,

At least in your example, any appearance that it does so seems to me to be entirely due to the syntactic abstraction cleaning up after the terrible mess that is SQL syntax.

Syntactic abstraction has a semantics of its own (rewriting). The example could perhaps be implemented by macro expansion, or perhaps tables and field labels are first-class values in the underlying language, which is considerably more powerful.

I think such abstraction in the database could be quite useful in practice if done properly. There is currently no market for pre-packaged abstract SQL queries, but perhaps that's only because SQL provides few facilities for abstraction and reuse. If you could package schemas and queries for "simple" integration with an existing database, perhaps a market would arise. Imagine "mixing in" a pre-packaged inventory management database+abstract queries with a pre-packaged sales front-end.

In fact, we're seeing the trend from the opposite direction: integrating databases and search into languages which already support abstraction (see: Oz, LINQ, etc.). So why not integrate abstraction and reuse into the database to make it a proper language?

You may be right

In this example, yes. I'm not making the case for it; I was just trying to explain what naasking meant. There may be more power lurking in the idea somewhere.

For example, suppose queries and rows were truly first-class, and those T and R variables could be stored in a table, and you could do this:

select max(query(T,R)) from foo where foo.x>7

That would bring SQL into the realm of functional languages. Of course, the performance impact could be pretty large.

Here there be (second order) dragons

... suppose ... those T and R variables could be stored in a table ...

There is some advantage to the relational algebra being grounded in first order logic, and such a feature would leave that grounding behind.

There is some advantage to

There is some advantage to the relational algebra being grounded in first order logic, and such a feature would leave that grounding behind.

Hmm, I don't see why, given that higher-order programs can be represented as first-order programs via defunctionalization.

naasking, I skimmed through

naasking, I skimmed through the first couple of sections of the defunctionalization paper. The authors say all functions in the whole program can be enumerated. Is that true? Can't functions be constructed dynamically? I'm wondering whether defunctionalization will still work if the function passed as a parameter to the higher-order auxiliary function is constructed dynamically.

Depends

That depends what you mean by "constructed dynamically." If you can enumerate all the static places in the program where functions are dynamically created (i.e., if you can find all instances of "lambda"), then you can use defunctionalization, even if there are many different dynamic paths to each of those places, and even if they're called an unbounded number of times. To be concrete, you can use defunctionalization on code like this:

(define (make-adder x) (lambda (y) (+ x y)))
(define l1 (map (make-adder 5) '(1 2 3 4 5)))
(define l2 (map (lambda (x) (+ x 5)) '(1 2 3 4 5)))

Defunctionalization is normally a whole-program transformation, so you would need to find all the lambdas. However, I think there may be some tricks to make defunctionalization a bit more modular.
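For illustration, here is roughly what defunctionalizing the two lambdas above could yield, sketched in Python rather than Scheme: each lambda site becomes a first-order record of its free variables, and a single dispatch function takes the place of function application.

```python
from dataclasses import dataclass

# Each source lambda becomes a tagged, first-order record.
@dataclass
class Adder:        # stands for (lambda (y) (+ x y)) inside make-adder
    x: int

@dataclass
class AddFive:      # stands for (lambda (x) (+ x 5)) at the second map site
    pass

def apply_fn(f, arg):
    """The single 'apply' function: dispatches on the record's tag."""
    if isinstance(f, Adder):
        return f.x + arg
    if isinstance(f, AddFive):
        return arg + 5
    raise TypeError("unknown function tag")

def make_adder(x):
    return Adder(x)   # returns plain data, not a closure

l1 = [apply_fn(make_adder(5), y) for y in [1, 2, 3, 4, 5]]
l2 = [apply_fn(AddFive(), x) for x in [1, 2, 3, 4, 5]]
print(l1)  # [6, 7, 8, 9, 10]
print(l2)  # [6, 7, 8, 9, 10]
```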

But if, on the other hand, when you say "constructed dynamically," you mean something like dynamic loading of code, or on-the-fly generation of bytecode (or machine code, etc.), or perhaps the use of eval-like metaprogramming, then you're probably right that defunctionalization won't work.

Yeah, I mean eval-like

Yeah, I mean eval-like metaprogramming, code generation on the fly. Thanks Matt.

But if, on the other hand,

But if, on the other hand, when you say "constructed dynamically," you mean something like dynamic loading of code, or on-the-fly generation of bytecode (or machine code, etc.), or perhaps the use of eval-like metaprogramming, then you're probably right that defunctionalization won't work.

I think it could work there too, but runtime defunctionalization already has a name: polymorphic inline cache.

As for erasing higher-order structures to a first-order language for SQL, I believe it is possible since all stored structures are first-order, and the entire program is known to the system when running a query (triggers, stored procedures, and the query itself). Would this defunctionalization be more complicated than DB query planners already are?

That's called a stored

That's called a stored procedure.

Stored procedure is slightly different

Usually you cannot use a stored procedure as if it were a view.

The closest equivalent is a UDF (user-defined function) that returns a table; unfortunately, RDBMSs differ in how they handle stored procedures and UDFs.

naasking, you hit the nail on the head!

We need higher-order SQL!

I googled for the term but found nothing. I'm surprised there's no active research in this area...

Hmm... Does it make sense to create a SQL-like DSL embedded in something like Haskell which supports higher-order queries, and have it evaluate to an ordinary SQL string which can then be passed to an ordinary RDBMS?

Links

Ah, sounds a bit like Links...

Relational algebra

Like others have said, higher-order functions would help. However, I think that more mathematical syntax usually associated with relational algebra also makes things cleaner. It covers roughly the same ground as SQL, but is composed of simpler, more orthogonal operators.

Wikipedia seems to have a relatively long article on it: http://en.wikipedia.org/wiki/Relational_algebra

The primitive operations (a "relation" is basically a table):

  • project[columns](relation) -> returns the relation with only the given columns
  • select[predicate](relation) -> returns the relation with only the rows that match the given predicate
  • product(relation, relation) -> returns the cartesian product
  • rename[rename map](relation) -> returns a new relation with the columns renamed

For example, take this simple SQL join:

SELECT FirstName, Building
FROM Employees e, Departments d
WHERE d.Name = e.Department

A roughly equivalent relational equation:

project[Employees.FirstName, Departments.Building](
  select[Departments.Name == Employees.Department](
    product(Employees, Departments)))

(I say "roughly" because the final column names are "Employees.FirstName" and "Departments.Building" instead of "FirstName" and "Building". This can be fixed with the "rename" operation.)
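The four primitives are simple enough to sketch directly; here is a toy Python rendering (relations as lists of dicts, purely illustrative, with no optimization) that evaluates the example above:

```python
# Toy relational-algebra primitives over relations represented as lists of
# dicts (column name -> value). A real engine would use proper sets of
# tuples and a query planner; this only shows the operator semantics.

def project(columns, relation):
    return [{c: row[c] for c in columns} for row in relation]

def select(predicate, relation):   # shadows the builtin name; fine for a sketch
    return [row for row in relation if predicate(row)]

def product(r1, r2):
    return [{**a, **b} for a in r1 for b in r2]

def rename(mapping, relation):
    return [{mapping.get(c, c): v for c, v in row.items()} for row in relation]

# Sample data, with columns prefixed by table name to avoid clashes:
Employees = [{"Employees.FirstName": "Ann", "Employees.Department": "R&D"}]
Departments = [{"Departments.Name": "R&D", "Departments.Building": "B1"}]

result = project(
    ["Employees.FirstName", "Departments.Building"],
    select(lambda r: r["Departments.Name"] == r["Employees.Department"],
           product(Employees, Departments)))
print(result)  # [{'Employees.FirstName': 'Ann', 'Departments.Building': 'B1'}]
```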

I thought of that too.

Is there any commercial RDBMS supporting Relational Algebra?

Relational algebra vs. relational calculus

My impression of SQL always was that it aimed to be modeled after relational calculus, not relational algebra.
The difference being, of course, declarative vs. procedural - while calculus terms just state what properties must be satisfied by the result, the algebra terms prescribe a specific way to obtain the result (and as such, are more similar to query plans used by RDBMS internally, than to SQL queries, which form the external interface).

The same criticism applies to the OP: SQL is intended to express not algorithms so much as requirements, from which more or less reasonable algorithms can be inferred more or less mechanically, obviously using internal knowledge about the physical organization of the data in the process.

So what is probably missing is a way to abstract parts of the declarative requirement (predicates in some logic, I believe).

From this also follows that views are not really the solution (for this problem) as they belong to another conceptual level (closer to algebra).

Calculus vs. Algebra

... calculus terms just state what properties must be satisfied by the result, the algebra terms prescribe a specific way to obtain the result ...

I have heard variations on this statement many times, and I cannot work it out in my head. I've read various papers comparing and contrasting the algebra and the calculus, and yet they continue to look almost exactly the same to me. I cannot see a single whit of proceduralness in the algebra. Can you help me understand what you mean? Perhaps an example?

From this also follows that views are not really the solution (for this problem) as they belong to another conceptual level (closer to algebra).

I cannot understand this statement either. A view is just a named query that can be treated as a base relation. Whether we are closer to the algebra or not seems to me an attribute of the query language, having nothing to do with whether or not the system supports named queries.

Calculus vs. Algebra

I cannot see a single whit of proceduralness in the algebra.

That's because there is none, both relational algebra and calculus being just first-order logic dialects with exactly the same expressive power.

It is just a different mental picture or interpretation of RA operators as "performing" some actions, as opposed to being seen as functions mapping one domain to another. This mental picture is useful in database optimizer design and implementation, where the original query is represented as a set of RA operators that are actually implemented "in metal", e.g. "hash join".

Such a representation is also extremely useful for the optimization process itself, where the original RA operators are replaced with equivalents and permuted in various ways (simplistically speaking) to create an optimal "execution plan". That is eminently impossible, or only partially possible and very hard, with network query languages, which are not closed over any kind of algebra and thereby exclude algebraic transformations of the RA kind!

Relational algebra is a functional language.

Relational algebra is a functional language, comprising a number of higher-order operations over relations. Thus, relational algebra + lambda abstraction would be a natural theoretical basis for an alternative.

But from a more practical perspective, if you're stuck with SQL, the best thing to do is to implement an intermediate language that your app would use to generate SQL. For this language, you'd implement a SQL query string generator back end, and a set of front-end operators that allow you to compose the intermediate language queries into more complex ones.

For example, you might represent an intermediate language query as a record with three fields: one for the SELECT part of the SQL query, one for the FROM part, and one for the WHERE. For each of these, you'd have as values abstract representations of, respectively, scalar SQL expressions (for the SELECT, but also used in the WHERE), simple and complex relation references (table/view names, INNER JOIN, LEFT OUTER JOIN, etc., for the FROM), and complex boolean predicate expressions (for the WHERE, but also used in some parts of the scalar expressions language).

After you have that, then you can implement a merge operation on multiple query objects, that does the following:

  1. For the result's SELECT clause, it takes the set union of the inputs' SELECT clauses;
  2. For the result's FROM clause, it takes the union of the inputs' FROM clauses;
  3. For the result's WHERE clause, it combines the inputs' WHERE clauses with AND.

Then the common bits that recur over many queries are abstracted by writing functions to generate them on their own, and feeding the results of these functions into the query-merge operator.

I believe SchemeQL works very much like this. My experience with this approach is from using an in-house language at my work that was inspired by SchemeQL, but I can't tell you how similar or different they are. I can tell you one thing: it really works.
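A minimal sketch of that merge operation in Python (a hypothetical `Query` record, not SchemeQL's actual API):

```python
from dataclasses import dataclass, field

# The three-field query record described above: SELECT parts are unioned,
# FROM parts are unioned, and WHERE parts are combined with AND.

@dataclass
class Query:
    select: set = field(default_factory=set)
    from_: set = field(default_factory=set)   # 'from' is a Python keyword
    where: list = field(default_factory=list)

def merge(*queries):
    out = Query()
    for q in queries:
        out.select |= q.select
        out.from_ |= q.from_
        out.where.extend(q.where)
    return out

def to_sql(q):
    sql = f"SELECT {', '.join(sorted(q.select))} FROM {', '.join(sorted(q.from_))}"
    if q.where:
        sql += " WHERE " + " AND ".join(q.where)
    return sql

# Reusable building blocks, abstracted as plain values:
active_employees = Query(from_={"employees e"}, where=["e.active = 1"])
names = Query(select={"e.first_name", "e.last_name"})

print(to_sql(merge(active_employees, names)))
# SELECT e.first_name, e.last_name FROM employees e WHERE e.active = 1
```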

is it functional?

Isn't relational algebra, by definition, a relational language? In other words, based on mathematical relations (from set theory) rather than mathematical functions.

There's functional and then there's *functional*

It's a bit of a terminological mashup. It's certainly fair to call the relational algebra a relational language! But sometimes when we say "functional language" we mean oriented around mathematical functions, and sometimes we mean not-imperative. The RA is functional in the second sense, but not the first.

Of course, any useful data-management language is going to have imperative operators as well...

"Computation performed by

"Computation performed by functions" rather than "not-imperative", perhaps? You could call it a first-order functional language with relations as data, for example.

"Computation performed by

"Computation performed by functions" rather than "not-imperative", perhaps?

As reasonable a proposal as that is, wouldn't this mean that C, Pascal, and FORTRAN are functional languages? That doesn't seem to be what people usually mean by the term.

Wrong value of "function" :-)

Wrong value of "function" :-)

But sometimes when we say

But sometimes when we say "functional language" we mean oriented around mathematical functions, and sometimes we mean not-imperative. The RA is functional in the second sense, but not the first.

It's functional in the first sense too. The relational operators are functions that range over relation types.

Yes, relational algebra is functional.

Isn't relational algebra, by definition, a relational language? In other words, based on mathematical relations (from set theory) rather than mathematical functions.

"Relational" and "functional" are labels, not descriptions. The fact that we call relational algebra relational is just a labeling convention.

The relational algebra is, really, relation types plus a set of higher-order functions over those types. Every operation in the relational algebra can be trivially understood as pure function application.

huh?!

"mathematical relations (from set theory) rather than mathematical functions"

aren't functions relations mapping items from set Domain to items from set Codomain? That is, functions _are_ sets!

martian love

(Well, that's the mnemonic I use.)

-t

Native relational algebras

There has been a surge recently of experimentation with relational algebras and calculi in native languages rather than intermediate languages. The 800-pound example is LINQ, but it was predated by years by GLORP for Smalltalk. Erlang has Mnesia. And recently, a similar system for Ruby was announced: Ambition. These libraries all eschew intermediate languages (I'm reminded of Hibernate's HQL) in favor of native-language syntax.

And yes, it really works. My own ORMs for Python (Geniusql and Dejavu) use Python lambdas in precisely the manner you describe; there's even a Query object which combines the 3 expressions (relation, attributes, and restriction) into a single object. When processing sets of objects, those objects can be passed directly to the lambdas; when hitting RDBMS back-ends, those same lambdas are decompiled into equivalent SQL.

One interesting technical note from mine: the various relations in each expression are aligned based on the order of arguments to each lambda, so that the restriction expression "lambda x, y: x.Size > y.Size + 10" maps x and y to the same positional args a and b in the attribute expression "lambda a, b: [a.Name, a.Size, b.Size]".
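To illustrate the alignment (a toy in-memory evaluation; the real system decompiles the lambdas into SQL rather than calling them row by row):

```python
# Position 0 in every lambda refers to the first relation, position 1 to the
# second, and so on; the parameter *names* are irrelevant, only positions align.
restriction = lambda x, y: x["Size"] > y["Size"] + 10
attributes  = lambda a, b: [a["Name"], a["Size"], b["Size"]]

rel1 = [{"Name": "big.log", "Size": 100}, {"Name": "tiny.txt", "Size": 5}]
rel2 = [{"Name": "ref", "Size": 50}]

# Evaluate the join: apply the restriction, then the attribute projection.
rows = [attributes(x, y) for x in rel1 for y in rel2 if restriction(x, y)]
print(rows)  # [['big.log', 100, 50]]
```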

I'll also point out the recent work on CouchDB, which lands much more on the algebraic end of the spectrum; there, you supply callbacks to filter objects in a stream-like model, with no generated SQL at all.

Robert, you lost me. Things

Robert, you lost me. With things like LINQ, the query syntax/semantics is SQL-like, not relational-algebra-like...

Right. Most of those are

Right. Most of those are relational calculi (SQL-like); however, there's a distinct feeling one gets when writing such a beast that, although the API provides a calculus, the implementation behind that API is often just tearing expressions out of the given statements and applying them in a (relatively fixed) algebraic fashion. CouchDB exposes most of that to the developer.

My main point was simply that several tools are now avoiding the introduction of isolated intermediate DSLs and are using the ASTs or parse trees of expressions in the native language directly.

My main point was simply

My main point was simply that several tools are now avoiding the introduction of isolated intermediate DSLs and are using the ASTs or parse trees of expressions in the native language directly.

My worry with that, after having a brief look at a couple of those tools, is that there is a boundary between the parts of the host language that are supported in the query language and those which are not, and that the boundary is often far from obvious. You do in effect have a DSL, and it's a subset of the host language.

And frankly, I use this intermediate DSL approach in Lisp, so the incentive to use the syntax of the host language is not so great. The intermediate language is written in s-expressions, and is made to look as much like Lisp as possible, but there is a clean cut between it and the host language.

My main point was simply

My main point was simply that several tools are now avoiding the introduction of isolated intermediate DSLs and are using the ASTs or parse trees of expressions in the native language directly.

Not sure I follow, Robert. Isn't LINQ a DSL? I can see it's translated into a C# AST, though.

Isn't LINQ a DSL?

I'm OK with saying it is, since it introduced new keywords and syntactic structures. But the line between DSL and API is pretty fuzzy; most LINQ queries I've seen look like C# with some extra keywords, not like a completely different language. I've seen config files that were farther removed from the host language. ;)

Many of the others don't do that, often because they choose to be implemented in, and therefore bound by, the host language. If you restrict yourself to expressions (no statements), the differences between languages are amazingly small, vanishingly small once you've parsed them.

My only DSL-y wish for Dejavu/Geniusql would be for Python to grow a prettier, more full-featured first-class Expression object, that exposed its own AST and had a standard serialization scheme and such. But the one I provide is good enough, being a thin wrapper around lambdas.

Robert, you lost me. Things

Robert, you lost me. With things like LINQ, the query syntax/semantics is SQL-like, not relational-algebra-like...


I am not sure I follow. Does LINQ provide the abstraction mechanisms you asked for, or not?

No, I don't believe LINQ,

No, I don't believe LINQ, which is SQL-like, provides the solution I'm looking for.

My previous comment was just asking Robert to clarify whether things like LINQ translate the query into relational algebra internally. (Anyway, my knowledge of relational algebras and relational calculi is getting more robust...)

I agree with many of the other comments in this thread: we either need some kind of higher-order SQL, need to use relational algebra instead of SQL, or need to implement some kind of DSL in the programming language which will eventually churn out SQL statements.

meh!

I've seen such horrendous MS SQL code at work, full of copy-pasted joins all over huge queries, that I'd be glad if it even had C-like preprocessor macros... I guess it's hopeless...

a hack

A quick and dirty solution: in your host language, use text replacement on SQL strings to hack up modularity. It's not as scary as it sounds, as I explained in this comment long ago. You won't get recursion, but still.
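A minimal Python sketch of the hack (hypothetical fragment names; the usual SQL-injection caveats apply if any fragment ever comes from user input):

```python
# Keep recurring SQL fragments as named strings and splice them in with
# plain text substitution. No recursion, but each repeated join or filter
# lives in exactly one place.

FRAGMENTS = {
    "ACTIVE_FILTER": "e.active = 1 AND e.deleted_at IS NULL",
    "DEPT_JOIN": "JOIN departments d ON d.id = e.department_id",
}

def expand(sql: str) -> str:
    for name, body in FRAGMENTS.items():
        sql = sql.replace("$" + name, body)
    return sql

q = expand("SELECT e.name, d.name FROM employees e $DEPT_JOIN WHERE $ACTIVE_FILTER")
print(q)
```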

That's an interesting

That's an interesting extension of simple text substitution, purely for readability's sake.

sqlalchemy

Creating reusable SQL clauses which can be mixed and transformed is SQLAlchemy's core paradigm, and we've been doing it for over two years.

A: Q

The answer is hiding in the first letter of your question.

http://kx.com/q/d/q.htm

I don't think this language gets much traction outside the financial sector, but it serves the role you're interested in.

A somewhat relevant discussion

easy: use XQuery

SQL translates trivially and intuitively into XQuery if you assume a reasonable mapping of relational tables into a DOM model.

XQuery has functions, tail calls, and modules. It does not have higher order functions but it does have higher order operators for most common cases where you would want fully general higher order functions.

For full advantage you'll wind up preferring a back-end database that handles DOM directly -- because then you'll have the power of SQL plus at least rudimentary modularity plus a vastly richer, "aspect oriented" type system.

But... SQL per se is just legacy, these days.

-t

Higher-order XQuery

In my experience with XQuery, you'll quickly feel the lack of polymorphic functions, and you'll want higher-order functions in fairly short order. I've had to resort to meta-programming hacks (e.g., generating XQuery) to work around these problems.

[Edit: not that it's no good. I generally agree with your comments. I just think your assessment of XQuery is a tiny bit optimistic.]

missing higher orders in XQuery?

It does feel like strapping on bondage to go from a higher order language to XQuery -- lots of idioms that just flow in lambda languages suddenly turn into little puzzles where it feels like the solution is mostly in spite of the underlying language.

The payoff, though, is that XQuery programs are extremely analyzable. You can optimize the heck out of them and current practice in today's implementations is just light years behind what I think we're going to see in the future.

That constraints-in-the-name-of-performance trade off is exactly right for a piece of system software like the query language, in my opinion.

And, therefore, if you're talking about meta-programming as the way to build higher-order programming on top of an XQuery engine -- I think you are on exactly the right track, using XQuery just as it should be used. There are, however, a lot of ways to meta-program with XQuery and some, I think, will prove better than others. So, you may be on exactly the right track abstractly but (since I don't know anything about it) I have no opinion about the specific approaches you're taking.

-t

edit: I think it might not be widely known so I'll mention that...: There is an official XML data type and semantics for the parse trees of XQuery programs. In other words, just as in lisp, "code can be data". So, there is no end at all to the trouble you can get into with XQuery meta-programming.

Does it have ... ?

Does it have

  • aggregate operators
  • joins
  • update operators
  • views
  • declarative integrity constraints

?

does xquery have...

aggregate operators?

yes.

joins

yes.

update operators

No. There is a standards group within the W3C (as I recall) proposing some. I'm not sure you really want or need them, though. Interpreting updates is a pretty distinct problem from executing queries. Moreover, XQuery is basically a pure language: it would sully it to add side effects. What we probably want instead of updates-in-XQuery is just a separate language for updates, but one that can share a "dynamic context" (e.g., input parameters, type definitions) with XQuery programs. That's how I do it, anyway.

views?

Nothing but, essentially. There is one built-in view that directly corresponds to the logical model of the underlying database but every time you define a new XQuery function you are essentially defining a new view -- an alternative logical model that may or may not correspond to the underlying database.

declarative integrity constraints

In spades, in multiple ways, depending on what you need. You can type-check values using XSchema (an interesting type system in its own right). You can also, of course, write integrity checks algorithmically in a pure, easy-to-analyze (since it lacks higher order functions) language -- "declarative" in some non-trivial ways.

Remember, Codd just called for a language, not for SQL specifically. Date et al.'s "Third Manifesto" hinted at a nicer language with "hierarchical types" and such. XQuery basically is a relational language, just like SQL, only much better.

-t

XQuery

Is not the data model XQuery is designed for a graph? If so, why resurrect the network/graph model, long ago discredited in both theory and practice (see Codd's writings on the matter)? XQuery may perhaps be of some utility when walking XML woods, but can anyone seriously suggest it as an alternative to the relational algebra in the data-management arena?

declarative integrity constraints

In spades, in multiple ways, depending on what you need. You can type-check values using XSchema (an interesting type system in its own right). You can also, of course, write integrity checks algorithmically

So, I gather, the answer is "no"? Clearly, typechecking alone does not qualify as a general declarative constraint specification tool. That alone is quite a bit of a problem, since the database's "meaning" is the sum total of its constraints!

XQuery basically is a relational language, just like SQL, only much better.

XQuery is hardly a relational language, let alone "better", because it is not closed over the domain of relations like the relational algebra operators are.

Is not the data model XQuery

Is not the data model XQuery is designed for a graph? If so, why resurrect the long ago discredited (both in theory and practice) network/graph model (see Codd's writings on the matter)?

I've heard this said many times, re: discredited network/graph model. The lambda calculus itself is a "graph model", and it's arguably the most widely employed computational/representation model in existence both in theory and practice. In what way has it been discredited?

Data management is its own field

In what way has it been discredited?

It has been discredited for use with data management.

The lambda calculus is indeed a fine and wondrous thing, but not something that you're going to use for a database. No more so would you write a device driver with relational algebra. Right tool for the job and all.

Data management is an area that has very specific requirements. It is easy to assume, coming from outside the field, that one's existing tools and techniques are as good as anything else, but it's not the case. A deep understanding of the last 30 years of work in data management is necessary, unless one wants to spend the next 30 years reinventing them.

The lambda calculus is

The lambda calculus is indeed a fine and wondrous thing, but not something that you're going to use for a database. No more so would you write a device driver with relational algebra. Right tool for the job and all.

Agreed, but the relational algebra is not a general computational model (I don't think), and the lambda calculus is; you can represent the former in the latter, but not vice versa. So embedding the relational algebra in Haskell (see: HaskellDb) achieves all the advantages of the relational algebra within a graph model.

This does not "discredit" graph model, so much as discredit a particular implementation of data management within a graph model. Perhaps I'm misusing "graph model" as applies to data management.

re Xquery queries

Isn't the data model that XQuery is designed for a graph?

I thought so too, for the longest time, and then I squinted differently and realized that, no, it isn't. Not in the way you mean.

The type system (I guess you could call it XQuery/XPath Types) gives you extensional algebraic types over some simple, carefully decentralized, carefully globalized atomic types. You can define intensional type systems as computable properties of those extensionally defined values. In particular, you can project things like "object-oriented" views of these datums in an open-ended variety of ways.

That stands in contrast to standard SQL databases, whose primitive types are limited to simple types like strings and numbers, dates and times. SQL doesn't have much in the way of aggregate types other than tables and views.

Well, once you have such a rich type system for the values that can be a single entry in the database, you no longer need "columns" as a separate concept. A complex row (a list of columns) is just a product type, so you can implement it as a "single-column" row.

So, no columns -- just rows -- but then there's the next thing: ordering. Why impose any default ordering on a row? So, with XQuery, tables are just unordered finite bags of values, with those values drawn from the XQuery/XPath algebraic type system. Classic SQL-style queries and 2D tables over atomic values are kind of trivially embedded in this.

Implementation-wise, it's fascinating: it's not very different at all from an ordinary SQL database. Have a look, for example, at Oracle dbxml (aka Berkeley dbxml) (an open source program, widely available). The literature behind it (e.g. about Berkeley DB, the underlying store) is pretty uniformly interesting. [edit: "interesting" and "highly accessible" -- good writing, for the most part.]

-t

XQuery etc

That stands in contrast to standard SQL databases, whose primitive types are limited to simple types like strings and numbers, dates and times. SQL doesn't have much in the way of aggregate types other than tables and views.

Well, the statement is simply not true. All modern SQL databases, such as DB2 and Oracle, have user-defined types. Oracle, for example, has had, for at least 10 years, an extensive toolset for implementing both user-defined types and an object-oriented data model on top of relational tables. It is interesting to note that while UDTs turned out to be quite useful, the OO stuff as applied to data management was not quite a success story, which is not at all surprising given that the OO toolset essentially amounted to a revival of the network model!

So, no columns -- just rows -- but then there's the next thing: ordering. Why impose any default ordering on a row?

It's a minor mistake on your part, but the relational model does not impose any ordering on tuple attributes! In SQL databases, any kind of column ordering is an implementation artifact, and no sensible database programmer ever relies on it.

2D tables over atomic values

You are confusing a specific two-dimensional representation with the model itself. It is a common misconception that the relational model is somehow two-dimensional, as Date explained more than ten years ago. Would you call a predicate describing the set of points on a sphere's surface two-dimensional, Sphere(x, y, z, R), defined by x^2 + y^2 + z^2 = R^2?
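The relation-as-predicate view can be made concrete in a few lines. A minimal Python sketch (names illustrative): the sphere relation is just a truth-valued function over four attributes, and nothing about it is inherently "two-dimensional" — the tabular layout is only one way of displaying its extension.

```python
# A relation is just a predicate: Sphere(x, y, z, R) holds iff
# x^2 + y^2 + z^2 = R^2. The 2-D table is merely one rendering
# of the (here infinite) set of tuples satisfying it.

def sphere(x, y, z, r):
    return x * x + y * y + z * z == r * r

print(sphere(3, 4, 0, 5))   # True: (3, 4, 0) lies on the radius-5 sphere
print(sphere(1, 1, 1, 5))   # False
```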

I am familiar with Oracle xmldb. Some of its tools, mainly those related to XML document transformation, can be quite useful, but as a data model for anything serious beyond toy projects it is simply not suitable, being just another creature in the menagerie of the network species.

It is both amusing and discouraging to observe how the network data model keeps being reintroduced again and again in various guises, from CODASYL through OODB to the XML "database", despite the model's spectacular failures over the last thirty years in almost all data management applications, except perhaps some niches such as CAD.

All this stuff is rather well known, and I can only direct your attention to Codd's original work, as well as to Date's later work in the data management area, in order to avoid a boring and fruitless slashdot-style discussion.

XQuery vs. ORDB and network models

That stands in contrast to standard SQL databases, whose primitive types are limited to simple types like strings and numbers, dates and times. SQL doesn't have much in the way of aggregate types other than tables and views.

Well, the statement is simply not true. All modern SQL databases, such as DB2 and Oracle, have user-defined types. Oracle, for example, has had, for at least 10 years, an extensive toolset for implementing both user-defined types and an object-oriented data model on top of relational tables. It is interesting to note that while UDTs turned out to be quite useful, the OO stuff as applied to data management was not quite a success story, which is not at all surprising given that the OO toolset essentially amounted to a revival of the network model!

I recognize the liturgy when I hear it preached but I respectfully reserve some skepticism here.

What's really going on in something like UDT, or ORDB, or ORM? As I see it, you've got a so-called "host language" with its own native type system, and people are trying to flee as far as they can from SQL by turning the SQL back end into a kind of virtual memory system. There is no such thing as a good virtual memory system for a free-form object system in typical "host languages." When people who really know what they're doing use ORDB or ORM (or OODB) they can get very good results, but without a lot of expertise the outcome is just very poor performance at the cost of serious code complexity.

The DOM model strikes an interesting balance -- superior to anything in the SQL or OODB models, I think. It is rich enough to afford an open-ended range of "object-oriented" interpretations (this is how people use it, in fact, combining multiple incommensurate object-oriented views of the same data in a single application). At the same time, at least as used with XQuery, it is purely functional, does not include any native concept of "references", and features pay-as-you-go access costs that model a wide variety of storage and representation options well. "Host" programs have to model DOM types rather than the database modeling host types -- but DOM types are so rich that that's not much of a burden.

Would you call a predicate describing the set of points on a sphere's surface two-dimensional, Sphere(x, y, z, R), defined by x^2 + y^2 + z^2 = R^2?

Yes, I would, because the predicate describes a table of simple atomic values, albeit a constant table that can be lazily computed for many useful kinds of query.

How is XML/XQuery different? Well, the "primitive value" types are not all atomic and are, in fact, constructively recursive. Unlike in SQL, an XQuery "value" can contain an arbitrarily large amount of information. That's the "third dimension" (so to speak).

When values are that rich, and extensionally rather than intensionally typed, you no longer need typed tables at all. You can instead model that structure (and many more convenient variations) in the values themselves.

I am familiar with Oracle xmldb. Some of its tools mainly related to xml documents transformation can be quite useful, but as a data model for anything serious beyond toy projects, it is simply not suitable being just another creature in the menagerie of the network species.

I find it more useful than that description suggests I should. It isn't a toy, though it's underdocumented in a few ways and tricky to use properly. And I would agree with you that there are some sharp limits to how it will scale compared even to, say, MySQL -- but I don't see any huge technical problems standing in the way of closing that gap. Just my impression of it so far, anyway.

-t

Datalog

If you're looking for an alternative to SQL, then Datalog should probably be mentioned at some point.

Datalog

It is nice, but without aggregate functions (or any functions, for that matter) it's hardly a practical replacement for SQL.

HApps has a mini relational algebra

See http://happs.org/

In particular, the indexed sets http://happs.org/src/HAppS/DBMS/IxSet.hs are a little query language which anticipates that most applications won't need a full-blown database.

.QL: an object-oriented replacement for SQL

There exists an object-oriented variant of Datalog, called .QL, which includes aggregates. It has a sophisticated type system for catching empty queries. It is especially designed to make it easy to create libraries of reusable queries.

An implementation of .QL comes as part of SemmleCode, a free Eclipse plugin for advanced code search. While all the examples on the Semmle website are focused on code search, there is nothing in the .QL language that ties it to that application. Have a look at http://semmle.com . Also on that website, you'll find examples of substantial query libraries.

A more sober introduction to the language is this paper (to appear at the International Conference on Source Code Analysis and Manipulation):

http://semmle.com/download-files/scam07.pdf

A thorough discussion of all the finer points, even including slides, exercises and so on, is in this course given at the GTTSE summer school:

http://semmle.com/documentation/further-reading/dotql-course/

At the end of the course notes you'll find detailed references to earlier work on modularising query languages.

[Meta]

It is good form to declare any personal interest you have in a product when recommending it even though it is quite clear from your user page.

Professor Oege de Moor,

Professor Oege de Moor, welcome to LtU!

SQL vs. .QL (Re: Keynote Address: .QL for Source Code Analysis)

While the approach looks interesting and certainly much more appealing than XQuery-like attempts, I disagree on several points:

1.

The way such metric computations would be expressed in SQL is painful, involving complex constructs such as group-by that are only mastered by experts.

"group by" is a trivially simple and intuitive construct. I am not aware of any beginner-level database programmer (granted, there might be some!) who would not be able to use the operator with ease and confidence.
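For what it's worth, a minimal group-by really is short. A hedged sqlite3 sketch (the schema and data are invented for illustration) computing total lines per package, roughly the metric the paper discusses:

```python
import sqlite3

# Illustrative schema: one row per compilation unit, with its
# package name and line count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cu (package TEXT, lines INTEGER)")
conn.executemany("INSERT INTO cu VALUES (?, ?)",
                 [("a", 100), ("a", 50), ("b", 10)])

# The GROUP BY construct under discussion: total lines per package.
rows = conn.execute(
    "SELECT package, SUM(lines) FROM cu GROUP BY package ORDER BY package"
).fetchall()
print(rows)   # [('a', 150), ('b', 10)]
```

Whether this counts as "expert-only" is exactly the disagreement; the construct itself is one clause.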

2.

In SQL, writing such queries can be rather painful, due to the need to use features like group-by, or complex nested SQL statements.

.....

from Package p
where p.hasName("abc.aspectj.ast")
select sum(CompilationUnit cu | cu.getPackage() = p | cu.getNumberOfLines())

The above is perfectly expressible in simple SQL as:

select
  sum((select getNumberOfLines() from CompilationUnit cu where cu.getPackage() = p.id))
from Package p
where p.name = 'abc.aspectj.ast'

which is almost the same as the .QL expression above, modulo some syntactic peculiarities.

3.
Predicate modularization.
In SQL, this can easily be done with user-defined functions, which are available in all commercial and practically all non-commercial SQL databases.
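To illustrate the point, a shared filtering rule can be named once as a UDF and reused across queries. A hedged sqlite3 sketch (the function here is registered from the host language, and the schema and names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "open", 5.0), (2, "shipped", 20.0), (3, "open", 30.0)])

# The shared predicate, defined once instead of repeated in every
# WHERE clause that needs it.
def is_billable(status, total):
    return 1 if status == "open" and total >= 10.0 else 0

conn.create_function("is_billable", 2, is_billable)

rows = conn.execute(
    "SELECT id FROM orders WHERE is_billable(status, total)"
).fetchall()
print(rows)   # [(3,)]
```

Commercial engines do the same with in-database functions (PL/SQL, SQL PL, etc.); the sqlite3 host-language registration is just the easiest way to show the mechanism runnably.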

4.
Extensible object type system.
I've mentioned elsewhere in the thread that the major databases like Oracle and DB2 have a well-developed object type system with inheritance and whatnot. Theoretical aspects of typing in the relational model are covered in the writings of Date and Darwen (see, for example, the Third Manifesto). Also, see the SQL3 (SQL-99) standard.

response

Thanks for your encouraging comments. About your points:

1. It's difficult to measure how hard "group-by" is, but in my class on databases at Oxford, I've seen people struggle - they may be atypical.

2. You're lifting just one example from the paper. In any event, the "equivalent" SQL is not the same. For one thing, the sum of an empty set in SQL is NULL, not 0. So when there are empty packages, your SQL query returns a different result from the .QL query.
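The NULL-vs-0 difference is easy to check. A small sqlite3 sketch (table invented for illustration): SUM over an empty set comes back NULL (Python's None), with COALESCE as the usual workaround:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cu (package TEXT, lines INTEGER)")
# No rows inserted: this is the "empty package" case.

(total,) = conn.execute("SELECT SUM(lines) FROM cu").fetchone()
print(total)   # None -- SQL's SUM over an empty set is NULL

(fixed,) = conn.execute("SELECT COALESCE(SUM(lines), 0) FROM cu").fetchone()
print(fixed)   # 0
```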

3/4: That kind of object-orientation is just not comparable with that in .QL, and it has not led to the creation of large frameworks of queries in the same way that OO has transformed other areas of programming.

Revive SQL abstraction discussion - SQL is still pathetic

I wanted to revive this thread. Recently I've been dealing with large SQL queries, and it's clear that SQL needs a basic black-box, first-order, non-recursive abstraction mechanism to allow large queries to be written without duplicating code all over the place, and to allow them to be decomposed into smaller, testable, intelligible pieces.

Just views won't do it. What's needed are parameterized views (an earlier comment in this thread caught on to this, but somehow it got lost -- that happens so infrequently at LtU... :-)

The various extensions to SQL like stored procedures won't do it. These are devices for programming imperatively around the basic query capability, but they don't provide any way to control the complexity of writing a very large query.

By large, I mean I've seen 40-page-long SQL queries. I've heard stories of single SQL queries that were 400 kilobytes of source code. These are all monoliths. Monstrosities, really.

I hate to say SQL needs "lambda abstraction", because that brings up the whole Y-combinator thing: recursion, semantics, higher-order functions. None of that is needed; just first-order, black-box abstraction over the queryable entities.

I guess the problem with SQL not having this is too basic for most LtU folks. But it's a huge real industrial issue.

The solutions to this generally involve "dynamic SQL", which is basically macros that edit the SQL statements. This is awful, undebuggable stuff.

To fix SQL: add a "let" construct for introducing a local name, and add a parameterized view capability. Some might call this a "table function": it returns something table-like, but in the dialects I've seen, such functions can't accept table-like things as parameters, only scalars.

E.g.,

DEFINE FUNCTION F(T1, T2, T3) AS
  LET Z = SELECT * FROM T1 LEFT JOIN H(T2, T3)
  IN
  SELECT * FROM G(T3, Z) JOIN Z

That would do it. Today people can't call functions as I did here for G and H, can't pass tables as parameters, and have to textually repeat the Z subquery in every place it is used. To me that means it is simply NOT software development. Yet the size and complexity of queries demands a software-development approach.
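For the "let" half of this wish list, SQL:1999's WITH clause (common table expressions) already gives a way to bind a subquery to a name once and reuse it. A hedged sqlite3 sketch (tables invented for illustration), where the subquery Z is named once and used twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30)])

# WITH binds the subquery to the name z; both references below share
# that one definition instead of pasting its text twice.
rows = conn.execute("""
    WITH z AS (SELECT k, v FROM t WHERE v >= 20)
    SELECT a.k, b.v
    FROM z AS a JOIN z AS b ON a.k = b.k
    ORDER BY a.k
""").fetchall()
print(rows)   # [(2, 20), (3, 30)]
```

CTEs cover the local-naming part but not the parameterization part: z cannot take arguments, and table-valued parameters remain dialect-specific, so the core complaint stands.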

I would mention that all this has absolutely zero semantic impact. I.e., it can all be expanded away (remember I said no recursion), which is why I figure this is a very practical issue only, and perhaps beneath the dignity of LtU discussion.

All this just proves my theory that programming language and OO people don't drink in the same bars as database people, nor attend the same conferences. This problem would have been long ago fixed if they did.

Anyone care to dig into MySQL and put this in?

Can anyone verify whether XQuery has the same issue?