What space does SQL cover, or, why is max so weird in SQL?

I needed to select the max row per group in SQL. It seems like this should be easy, but in fact it is hard. Some circumstantial evidence of its difficulty can be found on the Xaprb blog. The author is smart and has come up with many solutions, none of which are pretty. I take this to mean that it is a hard problem.
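
For concreteness, here are the two classic shapes (just a sketch, assuming the fruit(type, variety, price) table from the Xaprb example, with price declared NOT NULL):

    -- Self-join: keep each row that no other row of the same type beats.
    SELECT f.type, f.variety, f.price
    FROM fruit f
    LEFT JOIN fruit g
      ON g.type = f.type AND g.price > f.price
    WHERE g.price IS NULL;

    -- Correlated subquery: recompute the group maximum for every row.
    SELECT f.type, f.variety, f.price
    FROM fruit f
    WHERE f.price = (SELECT max(price) FROM fruit WHERE type = f.type);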

Okay, lest you think this is a request for SQL help on LtU, which would be inappropriate, here's the meat of the matter. Why is this hard? Should I have been able to see from the structure of SQL that this max operation was going to be hard?

What is the structure of SQL? Is there little structure, e.g. is it an ad hoc set of features that cover the most common needs of a certain class of applications? Or is there structure, e.g. is it a parsimonious and harmonious set of features designed to cover a set of computations that have a concise mathematical definition? I'm guessing the answer is, as usual, somewhere in between. Perhaps I need to bite the bullet and try to understand Codd's seminal work to see what the latter, purer thing would be?

Implicit in my question is also the question of what does it mean to be "hard" or "easy," which I guess sometimes gets discussed (often unproductively) under the banner of expressiveness. What are some objective definitions out there that can make the discussion productive?

Some of the "hard" here may be not only syntactical but semantic, in that there is some concern that at least some of the solutions are O(n^2). In other words, it may be important to distinguish the pretty/ugly (syntactic) axis from the hard/easy (semantic/performance) aspect.

It seems to me that in SQL, one of my favorite questions becomes particularly important: what optimizations are optional and what optimizations are required, i.e. can be relied upon as part of the semantics of the language (and therefore could be argued to not be rightly called optimizations at all). My favorite example from general purpose languages is the tail call optimization. IMO if it is optional, there's almost no point in implementing it, since if it cannot be relied upon, then programmers can only use it where stack depth is known to be small, in which case the optimization doesn't help that much. So, getting back to the max problem at hand, some solutions may only be "syntactically hard" if you can prove to yourself that the optimizer makes them "semantically easy."


The set of operations that

The set of operations that can be performed by SQL is roughly the same as the set of operations that can be performed with relational algebra, which has a concise mathematical definition. That said, the set of operations supported by relational algebra is by no means beautiful -- to my knowledge it doesn't have a more beautiful characterization than "whatever operations can be performed by composing relational algebra operations". It is just as ad-hoc as the set of operations supported by SQL. And just like with SQL, even the operations that it does support often require hoop jumping and are expressed in inefficient ways. Contrast this with Turing-complete languages, where the set of operations has a nice characterization as the set of computable functions.

It's a good language design challenge to design a better language for dealing with relations and aggregation of relations.

Predicate Calculus

Relational algebra is equivalent to a cut-down version of first-order predicate calculus. Datalog makes this connection clearer, as does description logic (OWL 2 has a variant that is suitable for conversion to relational algebra, IIRC). Pure Datalog is equivalent to relational algebra minus set difference plus recursive queries. Datalog extended with default negation adds the set difference operation. I like the combination of Datalog with strong and default negation. E.g. Jason supports both (strong negation as ~P and default negation-as-failure as not P). The combination allows expressing default rules such as "birds can usually fly":

flies(X) :- bird(X) and not ~flies(X).

which states that X flies if X is a bird and we don't know that X doesn't fly.

Of course, this doesn't make the OP's question any easier to state in Datalog, as it lacks aggregate functions and grouping entirely! (These can be added in an ad-hoc manner, but I'm less sure of the theory.)

My tl;dr answer

I wrote a reply on my blog, since it was rather long and mostly based on my engineering wisdom.

great reply

Read your blog post. That was a terrific reply, very insightful.

I love this sort of

I love this sort of discussion. I think part of getting the hang of SQL is having a good mental model for its quite limited semantic range. One comment, though: it seems to me you use the term "semantic" in a rather restricted sense. The semantics you refer to are the execution semantics of various constructs, but constructs you deemed semantically equivalent in terms of performance are in fact different as regards their declarative semantics. This distinction, though somewhat old-fashioned, is a useful one.

Logical Monotonicity

I think the reason is that aggregation operations are not logically monotonic. You can see this in the article you link to which says, "this topic is related to numbering rows". As the Bloom paper says, non-monotonic operations require counting, and counting involves waiting, because such operations have stricter ordering and synchronization requirements. Monotonic operations are much more flexible, can be executed out of order, in parallel, etc.

SQL as a query language was probably (accidentally) designed to encourage monotonic queries, and make it harder to express non-monotonic queries because those tend to be harder to execute efficiently.
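
A small illustration of the difference, assuming the fruit table from the original example:

    -- Monotonic: inserting a new row can only ever add result rows.
    SELECT variety FROM fruit WHERE price > 5;

    -- Non-monotonic: inserting a new row can invalidate an answer already
    -- produced, so the engine must in principle see all rows first.
    SELECT type, max(price) FROM fruit GROUP BY type;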

Non-monotonicity

I'd not considered before that aggregate operations are non-monotonic. It makes sense, though, as the sum of a relation can be changed by adding new facts. However, couldn't we make aggregates monotonic by adding a temporal argument to all relations and aggregates, such that SUM(expr, t) is the sum of a particular relational expression at a given time?
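
Concretely, I'm imagining something like this sketch, where asserted_at is a hypothetical timestamp column stamped onto every fact:

    -- The sum over the facts visible at a fixed time t can never be changed
    -- by later insertions, because new facts carry later timestamps.
    SELECT sum(price)
    FROM fruit
    WHERE asserted_at <= TIMESTAMP '2011-01-01 00:00:00';

That would seem to satisfy the logical definition of monotonicity (i.e. if KB entails P, then KB union {Q} still entails P), but perhaps the authors of the Bloom paper mean something different?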

They do mean something slightly different

Performance will only be a factor if you want generic aggregate operations. If you know in advance what the parameters to your aggregate operations will be, then you can simply trade time for space. This is how OLAP/FASMI/Cubes work. They limit the structure of the database and then build a static point-in-time representation of the world and a bunch of calculations that can be derived from that world.

In Sandro's description above, he is assuming the aggregate can't be memoized or incrementally evaluated; the synchronization overhead he describes arises if we have to traverse the whole data structure to recompute the count every single time we are asked, which is a naive strategy.

And there is really a somewhat under-explored continuum of design options. For example, the Log-Structured Merge Tree is an old data structure that allows blazing fast index maintenance. It works well on solid-state disks, but it predates them: it was motivated mainly by the physical seek latency that traditional disks with read-write heads had to contend with.

The Bloom paper is cool because it is one level meta. It lets you program your consistency requirements, effectively automating what IT professionals do manually today: superimposing "tasks" that let them say "At 2AM, refresh the Cube from this database". The idea behind Bloom is that you define it once, logically, and then go drink beer. (Although it's still not automatic programming/beer drinking, so somebody will have to come up with that.)

So the OP's question can be looked at the way I initially did, from an internal engine perspective. It can also be looked at as: OK, I have these requirements; what does my internal engine need to automate for me to reach that performance bar? And then there is a separate question of designing a pleasant syntax for that answer.

However, couldn't we make

However, couldn't we make aggregates monotonic by adding a temporal argument to all relations and aggregates such that SUM(expr, t) is the sum of a particular relational expression at a given time?

I think that's correct, because introducing a global clock to totally order operations solves this problem. However, the global clock is now a point of contention, and further, SUM(expr, t_i) must now wait for any insert/update operations for t_j where j < i to complete before it can proceed.

So like the Bloom paper says, non-monotonic operations require counting (like a global clock), and counting involves waiting. These strict ordering requirements severely restrict optimizations. Of course, a global clock is effectively impossible in distributed systems, which is where Bloom is targeted, so it's even more problematic there.

Nitpick

In that case, the property they desire is something more than logical monotonicity, as the temporal versions are monotonic, but still require this counting/synchronisation.

Aggregation adds structure

Sorry — there was a mistake in my original post: I don't need the empty relation, but the unit relation, which contains some unit value, because my projection cannot create tuples.

I think that aggregation and grouping add further structure to relational algebra. To demonstrate this, I'll try to naively develop relational algebra starting from plain relations over simple values (integers, booleans, and a unit value). Suddenly I'll need lists. (These are just rough ideas...)

Let's look at the signatures of the primitive relational operators, where ‘R’ denotes the type “Relation”, ignoring a schema (I'll use a lot of Haskell-like notation):

    × ∷ R → R → R
    ∪ ∷ R → R → R
    ∩ ∷ R → R → R
    ∖ ∷ R → R → R

All of them are well known. All of them take relations and return relations, so we safely stay within the domain of relations over simple values.

We could work with these if we had more relations, but we only have the unit relation

    I ∷ R

which contains a single tuple with a single value distinct from all other values.

So let's add something to create a new relation. For that we need to name attributes and values. That's new, and I'd like to choose a somewhat strange approach to do so:

Let ‘L’ denote the type of expressions in a value-level language that has variables with assignment, numbers, booleans, and the usual operations on them (arithmetic, comparison, ...).

Now, we add projection that applies L-expressions on relations:

    π ∷ L → R → R

For this to make sense, the variables in the L-typed argument correspond to the attributes of the relations: exactly the variables assigned to appear as attributes in the result, and the attributes of the input relation are available as variables in the L-expression. (Ugly: reusing the same name. Bear with me.)

Examples:

                        x y
    π{x:=1, y:=2} I  = -----                — creating tuples
                        1 2
      
                                   x
    (π{x:=1} I) ∪ (π{x:=2} I)  =  ---       — creating columns
                                   1
                                   2
                                   
                                                            x z y
    π{ x:=1, z:=2, y:= 3 } I ∪ π{ x:=4, z:=5, y:= 6 } I  = -------
                                                            1 2 3
                                                            4 5 6
                         — construction of a bigger table
                                            
                   x z y       x y
    π{x:=x, y:=z} -------  =  -----         — projection as usual
                   1 2 3       1 2
                   4 5 6       4 5

I'm pretty sure that this is enough to construct all relations over the given domain of values.

We can even use this projection to map an operation in the L-language (for example multiplication) over all tuples in a relation:

                       x       x y
    π{ x:=x, y:=2*x } ---  =  -----
                       1       1 2
                       2       2 4

                 x         y 
    π{ y:=x>1 } ---  =  -------
                 1       False 
                 2       True
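
In SQL terms, this mapping use of π is just computed columns in the SELECT list. A sketch, where t names the input table, and the boolean example assumes a dialect like PostgreSQL in which booleans are first-class values:

    SELECT x, 2*x AS y FROM t;   -- multiplication mapped over all tuples
    SELECT x > 1 AS y FROM t;    -- comparison mapped over all tuples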

What about selection?

    σ ∷ L → R → R

like in

                x y       x y
    σ{ x > 4 } -----  =  -----
                1 5       6 9
                6 9

There is something new here! The L-expression is abused to return a value, which decides whether a tuple survives the selection or not. IMHO this is quite different from the uses of L-expressions above: the value returned by the L-expression (formerly a tuple with named attributes) does not end up in the relation, but is rather used to steer its construction! Is this already new structure?

Joining, then (without outer joins), is easy: it can be rewritten as a cross product followed by a selection, as in the following example:

      ⋈{ «predicate» } r1 r2
    =
      σ{ «predicate» } (r1 × r2)
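
The same rewrite is visible in SQL, where an inner join can be defined as a filtered cross product (a sketch; r1, r2 and the column k are placeholders):

    SELECT * FROM r1 JOIN r2 ON r1.k = r2.k;
    -- is equivalent to
    SELECT * FROM r1 CROSS JOIN r2 WHERE r1.k = r2.k;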

Back to work: selection might be spooky, but aggregation and grouping are really scary!

Look at aggregation (I'll use ρ):

    ρ ∷ L → R → R

Again, we need to change the semantics of the L-expression, as we already did for σ: we add a list structure and functions working on lists to the domain of L. The input relation makes its attributes available as lists, and the output relation contains all variables assigned to in the L-expression, which must be singleton values (not lists).

Example:

                                 x y       a b
    ρ{ a := sum x, b := max y } -----  =  -----
                                 1 2       4 5
                                 3 5

So I was just urged to add lists to the value-level language of L-expressions. But that's not too bad; after all, we wanted to use functions like ‘sum’ that only make sense on lists.
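
For comparison, SQL expresses exactly this whole-relation form of ρ as an aggregate query without GROUP BY (a sketch; t names the two-column table above):

    SELECT sum(x) AS a, max(y) AS b FROM t;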

But that's not enough: Look at grouping (I'll use γ):

    γ ∷ L → R → ?

What should it return? The only thing I can imagine is a list of relations

    γ ∷ L → R → [R]        — Haskell syntax to denote list-type

which definitely adds more structure to our relational algebra language.

            x y         x y     x y
    γ{ x } -----  =  [ ----- , ----- ]
            1 4         1 4     3 5
            3 5         1 6
            1 6

We would even need tools in the proposed relational algebra that operate on lists, e.g., a function

    map ∷ (R → R) → [R] → [R]

that applies its first argument to all relations in a list, and a

    union ∷ [R] → R

that unions all its arguments.

Nested relations would also be an option; then we'd be able to use π again, but we would still need something like ‘union’ to do the unnesting.

By the way, it is not sufficient to identify tuples that form a group and then to add an identifying attribute. Assume such a γ' as follows, which identifies groups in a new attribute g:

    γ' ∷ L → R → R
    
             x y       x y g
    γ'{ x } -----  =  -------
             1 4       1 4 1
             3 5       3 5 2
             1 6       1 6 1

We did not win anything! This example is quite contrived, but it points out one thing: this kind of grouping is not really useful.
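
As an aside, modern SQL can compute exactly this g with a window function, assuming a dialect that supports DENSE_RANK, and it is just as unhelpful for the aggregation step:

    SELECT x, y, DENSE_RANK() OVER (ORDER BY x) AS g FROM t;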

So I was just urged to add lists to the relational algebra itself.

With γ, ρ, and functions ‘map’ and ‘union’ we can aggregate groups:

                                                a b
      ( union ∘ map ρ{ x := max a } ∘ γ{ b } ) -----
                                                1 4
                                                2 4
                                                1 5
    =
                                         a b     a b 
      ( union ∘ map ρ{ x := max a } ) [ ----- , ----- ]
                                         1 4     1 5
                                         2 4
    =
                               a b                     a b 
      union [ ρ{ x := max a } ----- , ρ{ x := max a } ----- ]
                               1 4                     1 5
                               2 4
    =
               x      x 
      union [ --- ,  --- ]
               2      1
    =
       x        — Yeah, well, I've lost the grouping attribute, but since it's
      ---         constant in each group, it could be saved by giving it as
       1          maximum in the aggregations: ρ{x:=max a, b:=max b}
       2

A workaround to having lists, map, union, etc. in the relational algebra is to extend all relational operators with a grouping facility, where appropriate.
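
That is essentially the route SQL takes: GROUP BY fuses γ, a per-group ρ, and the final union into a single clause, and it keeps the grouping attribute along the way (a sketch over the example table above):

    SELECT b, max(a) AS x FROM t GROUP BY b;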

I am not satisfied.

argmax()

So, the problem IMO is that the designers of SQL forgot to implement argmax() and argmin(). (Wikipedia) Such an operation should have little overhead beyond max() and min().

Such a query as you desire could then be written as:

    SELECT *
    FROM (SELECT key_a, argmax(the_value, key_b) AS key_b
          FROM the_table
          GROUP BY key_a) a
    NATURAL JOIN the_table;

And you need only a simple peephole optimization to recognize the pattern of the NATURAL JOIN with the entire table to transform this into an O(n) operation.

argmax() and keys

Yes, let's not turn this into an SQL blog. It's been well said that SQL is neither structured nor a language -- certainly it's a poor model for a declarative language. But for these query examples, it does bear a passing resemblance to Codd's (much more declarative/better structured) relational calculus. And that, too, would need sub-queries.

To answer the question as put, "to select the max row per group in SQL" is easy: SELECT type, max(price) from fruit group by type; But what you asked for is not what you want (I suspect). You want *the* (one) variety with that max price. The complexity lies far more in the problem than in any failings in SQL (although they are legion).

In Xaprb's example, each row has a different price, so a price uniquely identifies a type/variety, and it's meaningful to talk about *the* lowest-priced variety, or *the* two lowest-priced, etc.

But this is accidental -- it's perfectly possible for there to be two (or more) varieties with the same price for a given type. I suspect in that case Xaprb's queries would produce confusing results. (The self-join or correlated sub-query would return both/all at that price.)

I don't think that the problem as raised by the OP/example from Xaprb is due to the lack of argmax -- it stems more from not specifying the data model exactly. (I'm guessing a little, because we're not told the schema.)

What does argmax() produce when there's a tie for the max?

If it's deliberate that each type/variety has a different price, that should be declared as a key in the schema. (And will help optimise.)

Not quite

SELECT type, max(price) from fruit group by type;

This does not "select the max row per group in SQL" as you claim, if there are more columns than just type and price. One could of course self-join, but this would be slow since "price" is likely not a key. Using argmax() to return the key would allow a fast self-join; or one could simply use argmax() to return each of the other columns.
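
Under that proposal, returning the other columns directly would look something like this (a sketch of the hypothetical argmax(), not existing SQL; other_col is an illustrative column name):

    SELECT key_a,
           argmax(the_value, key_b)     AS key_b,
           argmax(the_value, other_col) AS other_col
    FROM the_table
    GROUP BY key_a;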

What does argmax() produce when there's a tie for the max?

argmax() is not a function but a relation (or equivalently, a function returning a set), as the Wikipedia entry I linked states. It would return more than one row, and this would be useful and expected:

You want *the* (one) variety with that max price. The complexity lies far more in the problem than in any failings in SQL (although they are legion).

That part of the problem is so trivial to solve (e.g. using ORDER BY and DISTINCT ON; you really just need to dictate some other sort criterion) that I doubt it is what the OP was trying to solve.
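
For instance, in PostgreSQL (a sketch over the fruit table; the trailing ORDER BY key is what dictates the tie-breaking):

    SELECT DISTINCT ON (type) type, variety, price
    FROM fruit
    ORDER BY type, price DESC, variety;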

argmax() -- primitive or derived

Thank you Chris. So argmax returns a relation (not a tuple), which might include more than one row for a given max. So the application that consumes the result must handle that possibility. There we might bump into the 'impedance mismatch' between SQL (and the relational model's set-based operations) and the programming language (typically tuple-at-a-time/list comprehension).

The question for language design (again, to avoid turning this into an SQL blog):
- must argmax be a primitive in the query language?
- or should the query language include features to declare/define argmax?
- or should the query language leave it to the programming language to declare/define argmax?

I don't like argmax being a primitive: there would be hosts of aggregate-based functions/operations we'd have to include on the same basis, not to even go near what we might consider with a decent transitive closure mechanism.

Needless to say, SQL is not a candidate for declaring/defining such stuff -- that's why it barely counts as a language.

If argmax is not to be a primitive, what's the underlying semantic/type-theoretic model we need in the language to define it? We'll need several layers of polymorphism:
- the f() that yields the ordinal must work on any suitable row
- (what does 'suitable' mean?)
- argmax must take the f and any suitable relation, and return a subset

Haskell has had a crack recently at some aggregate-like operations on lists. Is that a starting point?