archives

What space does SQL cover, or, why is max so weird in SQL?

I needed to select the max row per group in SQL. It seems like this should be easy, but in fact it is hard. Some circumstantial evidence of its difficulty can be found here on the Xaprb blog. The author is smart and has come up with many solutions, none of which are pretty. I take this to mean that it is a hard problem.

Okay, lest you think this is a request for SQL help on Ltu, which would be inappropriate, here's the meat of the matter. Why is this hard? Should I have been able to see from the structure of SQL that this max operation was going to be hard?

What is the structure of SQL? Is there little structure, e.g. is it an ad hoc set of features that cover the most common needs of a certain class of applications? Or is there structure, e.g. is it a parsimonious and harmonious set of features designed to cover a set of computations that have a concise mathematical definition? I'm guessing the answer is, as usual, somewhere in between. Perhaps I need to bite the bullet and try to understand Codd's seminal work to see what the latter, purer thing would be?

Implicit in my question is also the question of what does it mean to be "hard" or "easy," which I guess sometimes gets discussed (often unproductively) under the banner of expressiveness. What are some objective definitions out there that can make the discussion productive?

Some of the "hard" here may be not only syntactical but semantic, in that there is some concern that at least some of the solutions are O(n^2). In other words, it may be important to distinguish the pretty/ugly (syntactic) axis from the hard/easy (semantic/performance) aspect.

It seems to me that in SQL, one of my favorite questions becomes particularly important: what optimizations are optional and what optimizations are required, i.e. can be relied upon as part of the semantics of the language (and therefore could be argued to not be rightly called optimizations at all). My favorite example from general purpose languages is the tail call optimization. IMO if it is optional, there's almost no point in implementing it, since if it cannot be relied upon, then programmers can only use it where stack depth is known to be small, in which case the optimization doesn't help that much. So, getting back to the max problem at hand, some solutions may only be "syntactically hard" if you can prove to yourself that the optimizer makes them "semantically easy."