Automating Ad hoc Data Representation Transformations

Automating Ad hoc Data Representation Transformations by Vlad Ureche, Aggelos Biboudis, Yannis Smaragdakis, and Martin Odersky:

To maximize run-time performance, programmers often specialize their code by hand, replacing library collections and containers by custom objects in which data is restructured for efficient access. However, changing the data representation is a tedious and error-prone process that makes it hard to test, maintain and evolve the source code.

We present an automated and composable mechanism that allows programmers to safely change the data representation in delimited scopes containing anything from expressions to entire class definitions. To achieve this, programmers define a transformation and our mechanism automatically and transparently applies it during compilation, eliminating the need to manually change the source code.

Our technique leverages the type system in order to offer correctness guarantees on the transformation and its interaction with object-oriented language features, such as dynamic dispatch, inheritance and generics.

We have embedded this technique in a Scala compiler plugin and used it in four very different transformations, ranging from improving the data layout and encoding, to retrofitting specialization and value class status, and all the way to collection deforestation. On our benchmarks, the technique obtained speedups between 1.8x and 24.5x.
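To give a flavour of what such a transformation scope might look like, here is a minimal Scala sketch. The names adrt and TransformationDescription are modeled on the paper's examples but are stubbed out here so the snippet compiles as ordinary Scala; in the real plugin the delimited scope triggers a compile-time rewrite, not a runtime call.

    object AdrtSketch extends App {
      // Stand-ins so this compiles without the plugin: the real `adrt`
      // scope is recognised by the compiler, not executed at runtime.
      trait TransformationDescription
      def adrt[T](desc: TransformationDescription)(body: => T): T = body

      // The transformation: view a high-level (Int, Int) pair as a packed Long.
      object IntPairAsLong extends TransformationDescription {
        def toRepr(p: (Int, Int)): Long =                     // encode
          (p._1.toLong << 32) | (p._2.toLong & 0xFFFFFFFFL)
        def toHigh(l: Long): (Int, Int) =                     // decode
          ((l >>> 32).toInt, l.toInt)
      }

      // Inside the scope the source still speaks in terms of (Int, Int);
      // the plugin would rewrite values, signatures and call sites to Long.
      adrt(IntPairAsLong) {
        def normSq(p: (Int, Int)): Long = {
          val x = p._1.toLong; val y = p._2.toLong
          x * x + y * y
        }
        println(normSq((3, 4))) // 25
      }
    }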

The paper is a realization of an idea that has been briefly discussed here on LtU a few times: a program is written using high-level representations, and the user has the option to provide a lowering to a more efficient representation after the fact.

This contrasts with the typical approach of providing efficient primitives, like primitive unboxed values, and leaving it to the programmer to compose them efficiently up front.
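For contrast, here is a purely illustrative example of that up-front style: once the programmer commits to a packed primitive encoding, every use site has to know about it.

    object ManualSpecialization extends App {
      // High-level version: clear, but each element is a boxed Tuple2.
      val pairs: Array[(Int, Int)] = Array((1, 2), (3, 4), (5, 6))
      val sumHigh = pairs.map { case (a, b) => a + b }.sum

      // Hand-specialised version: pairs packed into Longs up front,
      // so every use site must repeat the encoding and decoding.
      val packed: Array[Long] = Array(1L << 32 | 2, 3L << 32 | 4, 5L << 32 | 6)
      var sumLow = 0L
      var i = 0
      while (i < packed.length) {
        sumLow += (packed(i) >>> 32) + (packed(i) & 0xFFFFFFFFL)
        i += 1
      }

      println((sumHigh, sumLow)) // (21,21): same answer, very different code
    }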


The reason it isn't fully realized

Complex transformations to lower-level representations often require mathematics that is specific to the lower-level representation and has no counterpart at the high level. In fact, correctly preserving the logical semantics during the transformation can call for two separate bodies of reasoning: one to guarantee preservation properties such as termination, and another to guarantee properties such as performance and efficiency.

That is why the best improvements in representation science in the last ten years have come from hiding algorithms completely. Look at efforts like Stanford's DeepDive and its joint inference, which obliterated the TAC-KBP benchmark challenge on a corpus of 500M web sites, or LogicBlox (also connected to Stanford) and its benchmarking of new join algorithms that could upend standard thinking about RDBMSes.

This is leading to a shift in thinking, and to the realization that even most experts no longer understand the high-level representations of their programs, because they are over-specialized in particular algorithms and problem-solving processes; as a result, their optimizations, even on large-scale problems, are merely micro-optimizations.

Ergo, the secret sauce going forward will be AI that assists human experts by teaching them to consider factors they would not normally think about. Once we have AI that understands software architecture, we will see really big leaps in "lower-level transformations".

Benchmarking and Generic Programming

I would draw a conclusion much closer to home: it is important to benchmark on real hardware with realistic data sets. Simple big-O performance factors do not cut it any more, given the order-of-magnitude cost of poor cache performance. Cache-oblivious algorithms give us a model for this sort of thing, but they still omit many details that can affect performance, so get empirical evidence via benchmarking.
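A deliberately crude illustration (not a proper benchmark; JMH or a similar harness would be the right tool, and the numbers will vary by machine): two O(n*n) traversals of the same matrix that differ only in access pattern, and therefore in cache behaviour.

    object CacheEffectDemo extends App {
      // Two O(n*n) sums over the same matrix; only the access pattern differs.
      val n = 4096
      val m = Array.ofDim[Int](n, n)

      def time(label: String)(body: => Long): Unit = {
        val t0 = System.nanoTime()
        val r = body
        println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms (result $r)")
      }

      // Row-major: walks each row's memory sequentially, cache-friendly.
      time("row-major") {
        var s = 0L; var i = 0
        while (i < n) { var j = 0; while (j < n) { s += m(i)(j); j += 1 }; i += 1 }
        s
      }

      // Column-major: jumps to a different row array on every access,
      // so it suffers many more cache misses despite identical work.
      time("column-major") {
        var s = 0L; var j = 0
        while (j < n) { var i = 0; while (i < n) { s += m(i)(j); i += 1 }; j += 1 }
        s
      }
    }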

As for changing the data representation, I would advocate generic programming techniques that do not depend on the data format or container type, and that allow libraries of specialised algorithms to be selected quickly. You can then swap the algorithm without changing the code.
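One illustrative way to do that in Scala (a sketch of the general idea, not taken from the paper): write the algorithm against a typeclass rather than a concrete container, so that swapping the representation also swaps in a specialised implementation without touching the calling code.

    object GenericSum extends App {
      // Minimal typeclass: algorithms are written against Summable,
      // not against any particular container or data layout.
      trait Summable[C] { def sum(c: C): Int }

      implicit val listSummable: Summable[List[Int]] =
        new Summable[List[Int]] { def sum(c: List[Int]): Int = c.sum }

      implicit val arraySummable: Summable[Array[Int]] =
        new Summable[Array[Int]] {
          def sum(c: Array[Int]): Int = {
            var s = 0; var i = 0                        // specialised loop,
            while (i < c.length) { s += c(i); i += 1 }  // no boxing
            s
          }
        }

      // Written once; changing the representation needs no change here.
      def total[C](c: C)(implicit ev: Summable[C]): Int = ev.sum(c)

      println(total(List(1, 2, 3)))     // 6, via the List instance
      println(total(Array(1, 2, 3, 4))) // 10, via the Array instance
    }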

I suspect that the need for tools like this in Scala speaks to the language's limitations for generic programming. Or, to be kinder to Scala: it permits generic programming but does not make it central to the language, so people can structure their code in ways that make data structures and algorithms hard to replace.

You're missing the point

The biggest propagation delay to getting meaningful results is training time.

In intergalactic-scale systems, the speed of light is your bottleneck, so distant learning will win out.