## Data Representation Synthesis

I was just lucky enough to see Peter Hawkins present a particularly compelling synthesis language: data structure synthesis:

We consider the problem of specifying combinations of data structures with complex sharing in a manner that is both declarative and results in provably correct code. In our approach, abstract data types are specified using relational algebra and functional dependencies. We describe a language of decompositions that permit the user to specify different concrete representations for relations, and show that operations on concrete representations soundly implement their relational specification. It is easy to incorporate data representations synthesized by our compiler into existing systems, leading to code that is simpler, correct by construction, and comparable in performance to the code it replaces.

As somebody experimenting with data structure representations, bringing it down to the language level is great! It'll be appearing at PLDI next month.

## Comment viewing options

### i think i like it

this is totally the kind of stuff i want in my programming toolkit. sure hope it goes places and e.g. gets a java-compatible implementation :-). the #s from their experiments look good and compelling, and the autotuner is a nice pragmatic thing to boot.

### Boost Multi-index

Check out the Boost Multi-index Container Library.

### Database relation?

I would like to see this work related more concretely to relational databases. Perhaps an example of something that would be much harder to do with existing bridges between programming languages and SQL?

### I was pretty curious about

I was pretty curious about that as well -- there is a lot of representation tuning going on in them that I suspect PL researchers are glossing over!

To me, the interesting thing here is really two-fold: a language to refine representation selection (in contrast, for non-fancy databases, we just specify indices) and a framework for plugging in custom representation refinements. I'm not convinced of the former for the 'typical' case -- more automation is better, as is typical in the area, so no need to expose anything (e.g., databases are often black boxes). Synthesis seems like a principled approach to expert-level guidance of memory layout, rather than the current approach of ifdefs and perl scripts, so a language is good there. For writing new tuning algorithms, the potential of packaging them is particularly exciting: as far as I can tell, these algorithms never leave the libraries they were originally applied to unless they're manually reimplemented.

### Can you clarify?

Sorry, although Leo obviously understood you, I didn't.

For example, for "existing bridges between programming languages and SQL", are you referring to something like ODBC or Microsoft SQL Server's Tabular Data Stream protocol even?

Or are you referring to DDL statements? For DDL, it tends not to capture the exact intention behind the construct. For example, pushing all incoming transactions with post dates for today to the TODAY filegroup to help manage so-called "big data" - this trick simplifies a lot of write pressure. It is generally commingled with an external script that swaps the filegroup nightly. But this has nothing to do with "bridging" programming languages and SQL, unless you have some understanding of "bridging" that I don't.

Or are you referring to programming language DSLs that can auto-tune DBMSes and queries?

Another note here is that DBMSes somewhat limit the extent to which DBAs can truly customize, since each customization needs to be accounted for in the query optimizer and the right heuristics used to determine when to generate a plan that takes advantage of the customization. Related, at the programming language level an ORM generally has very naive query generation capabilities - that's why LINQ evangelist/author Fabian Marquerie started Re-Motion to allow users to rewrite LINQ queries (if you look at Hibernate, it is a complete mess and can't handle individual vendors extension to the SQL standard). Obviously, a term rewriting DSL would make that stuff a little easier.

I can answer all these in detail - but which ones do you want to know about?

### I had something very simple in mind

What I had in mind was just something like Embedded SQL or SQLJ. I would appreciate a concrete example of what would be much harder to do using Embedded SQL or SQLJ.

### Confused

I had to consult my Java Programming with Oracle SQLJ book for a refresher. This book covers SQLJ 9i for Oracle 9i. 9i was the first version of SQLJ that allowed dynamic references to tables and columns.

I don't see how these are related.

SQLJ doesn't allow control over the representation of data. While the translation engine that maps to the Oracle SQLJ runtime does provide an automatic representation of data, the power belongs to the runtime, not the programmer. In fact, SQLJ iterators require the program to specify what constants are part of an expression at the iterator level using a with clause after the (optional) implements clause. In other words, SQLJ is not smart enough to realize that:

 select 1 from dual union all select 1 from dual 

...can map to a single static final Java integer variable, so if the programmer wants to conserve iterator memory, they have to manually encode these details in a with(my_constant=1) clause and access it via iter.my_constant.

There is a mapping layer that allows you to map Java types to Oracle user-defined types, but there is a key conceptual difference between what is being talked about in this paper (as I understood it) and SQLJ. In SQLJ, the programmer picks the representation. In this paper, the system guides the programmer to pick the representation by having the programmer specify how the data is being used. If you need some concrete examples of how this would be useful in a SQL environment, then I can think of some. Predominantly, my needs come from managing multiple clients with identical database schemas but different read/write skews and data distributions and database sizes. Things that work on one scale don't work well on another, or aren't necessary on another (for example, certain full-text search queries are difficult to write in most environments today and so you pretty much resort to some sort of caching). Some queries might also not perform well with certain vendor implementations, usually depending on the details of concurrency control e.g., select count(*) from BAR_Transaction might be an awful idea on some DBMSes due to the implementation details of MVCC. The main advantage to the scheme in this paper is that since the operations and the structure they operate on go together, you could specialize the exact storage, and also probably specialize how in-memory transfers work (e.g., hash join, stream optimizations fall out automatically from monotonically increasing key indexed constraints etc).

Beyond that, the only fun thing about SQLJ is that you can write stored procedures with it thanks to Oracle JServer. This is really a question of placement and hosting of code, not data representation.

Also, since you are forced to specify the underlying Java reprsentation of types, the Oracle SQLJ runtime does not appear to do any sort of representation optimization on types like that done for dynamic language runtimes. This disallows conservative memory overhead for applications that reflectively construct new types on the fly in response to arbitrary user input. Think of a concept-based query language as a good example of arbitrary user input.

The fact that the SQLJ translator program checks the program against the schema also is a benefit greatly exaggerated. In practice, this sort of benefit doesn't always align with the developer-DBA workflow as well as it should.

Obviously unrelated, I can also say from experience that I've found the advantages of SQLJ negligible with an abstract data access layer, query DSL, and the SQL CLR. C# 3's LINQ DSL is much better than SQLJ in this regard, but it still lacks good abstraction of many problems in dealing with heterogeneous execution. I've hinted at some of these in my reply above, but if you want to talk about those, we should either pursue it in another thread or via e-mail.

### Seems overly complicated

The formalism in this paper seems overly complicated for the problem it tackles. There doesn't seem to be much meat to it - you get to pick from a handful of implementations for tables (doubly linked list, hash table, vector, ...). You could add more, but seemingly only if it's another implementation of "table", and I'm not sure how many more interesting implementations of that there are.

I'm not sure why they let you introduce arbitrary nodes in this decomposition graph - why would you ever want to have two nodes that represent the same set of column values? It looks to me like the defining language could be as simple as (to encode their running example):

from {} hash {ns}
from {ns} hash {pid}
from {state} vector {ns,pid}
from {ns,pid} unique {state, cpu} -- functional dependency

There are other things I don't understand, like why you can only perform a join operation on the two halves of a join decomposition node. It seems semantically easy enough to join arbitrary queries on any nodes, but for some reason the syntax only allows a particular type of join.