Practicality of Exclusively Compiler-Driven Unboxing

It is sometimes claimed in the FP community that explicit representation controls (unboxing) are unnecessary, because a good optimizer can rearrange data structures into unboxed forms where appropriate (I disregard hardware interaction for this discussion). Litmus questions here (a sketch of the kind of control I mean follows the list):

  1. While there are definite successes here, are they real or are they a case of unrealistic benchmarks? Knowing something of the PL.8 research compiler history on this issue, I am hard pressed to believe that program-wide data restructuring will ever compete comprehensively with human prescription. Both would seem to have their place.

  2. Assuming that exclusively compiler-based data structure optimization is realistic, is it feasible in the absence of whole-program analysis (issue: dynamic libraries are not optional in the real world)?

  3. Again assuming that exclusively compiler-based data structure rearrangement is realistic, is there a reasonable expectation that we can characterize and specify this type of optimization well enough that independently crafted, conforming compilers will exhibit more or less consistent behavior (may exceed, but will do at least transforms X, Y, and Z)?

  4. If the comparative metric is performance compared to C rather than performance compared to non-repr-optimized Haskell/ML/O'Caml, is there any comprehensively credible evidence that compiler-driven repr rearrangement is both good enough and predictable enough (in the eyes of the developer) to render prescriptive specification unnecessary? Is there robust evidence that it isn't sufficient?

    I'll settle for promising partial evidence, provided the approach admits an eventual generalization. I'm aware of TIL, and of Xavier Leroy's paper assessing effectiveness of type-driven unboxing, and of the earlier work on PL.8. The conclusions to date suggest to me that leaving this optimization exclusively to the compiler is not likely to be good enough.
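
To be concrete about what I mean by "explicit representation controls", GHC Haskell's strictness and unpacking annotations are one illustration among many (a sketch of mine, not a claim about any particular implementation):

    -- Without annotations, each Double field is a pointer to a separately
    -- heap-allocated box:
    data Point  = Point Double Double

    -- With prescriptive annotations, the doubles are stored inline, unboxed
    -- (GHC honours UNPACK when compiling with optimisation):
    data PointU = PointU {-# UNPACK #-} !Double {-# UNPACK #-} !Double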

Thoughts and references?

I am hard pressed to believe

I am hard pressed to believe that program-wide data restructuring will ever compete comprehensively with human prescription.

Human prescription rarely succeeds at choosing data structures anywhere near optimal for all uses of data, especially when dealing with whole-program analysis and dynamic libraries. Based on my experience, I'd not place much faith in human prescription.

is it feasible in the absence of whole-program analysis (issue: dynamic libraries are not optional in the real world)?

Feasible? Yes, if you're willing to pay for specialized code.

Across libraries, you might not know the representation for ALL cases, but that doesn't prevent you from knowing which representations to expect for MOST cases. That is, if someone sends you a boxed 'set of integers', there is nothing preventing the compiler from including code that peeks at it, says 'oh! that's represented as a sorted array of unboxed integers!', and sends it down a specialized code path (or, via a JIT, produces a specialized code path on the fly).

Of course, you won't hit 100% with static compilation, but 'optimization' is usually not aimed at 100% anyway. Given a common compiler, one might hit 90% with common representations and send the other 10% down code paths that use the unspecialized boxed object interface. One might even do better than 99% by peeking, at compile time, at existing annotations in the dynamic libraries one plans to integrate. At that point, one could even be rid of the extra code by sacrificing linkability in the 1% of cases where the value or object representations don't match what was assumed.
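
To make the 'peek and dispatch' idea concrete, here is a minimal sketch in a Haskell-ish setting (the type and names are hypothetical, not a concrete proposal):

    import qualified Data.Set as Set
    import qualified Data.Vector.Unboxed as VU

    -- A hypothetical boxed 'set of integers' whose concrete representation
    -- can be inspected at run time.
    data IntSet
      = SortedArray !(VU.Vector Int)   -- unboxed, sorted array representation
      | Boxed (Set.Set Int)            -- generic fallback representation

    member :: Int -> IntSet -> Bool
    member x (Boxed s)       = Set.member x s      -- unspecialized code path
    member x (SortedArray v) = go 0 (VU.length v)  -- specialized code path
      where
        -- binary search over the unboxed, sorted array
        go lo hi
          | lo >= hi  = False
          | otherwise =
              let mid = (lo + hi) `div` 2
              in case compare (v VU.! mid) x of
                   LT -> go (mid + 1) hi
                   GT -> go lo mid
                   EQ -> True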

So, I'd call it feasible, but with a potential space or linkability tradeoff that you won't necessarily need to make with a fully human-prescribed representation. (Not that human programmers working with prescribed representations don't involve themselves with 'converting' data structures (by hand!) when calling between different libraries...)

is there a reasonable expectation that we can characterize and specify this type of optimization well enough that independently crafted, conforming compilers will exhibit more or less consistent behavior (may exceed, but will do at least transforms X, Y, and Z)?

I expect you cannot do so usefully.

You're seeking some equivalent to 'all compliant compilers will at least do tail-call optimization', but for data structures, such that you can promise a given form of data structure (e.g. a 'set of integers') will at least be represented as, for example, a sorted array (but may exceed this by being a sorted array of unboxed integers).

The ability to 'exceed' the transformations will, for the most part, make entirely moot the advantages one might have gained from having a common representation in the first place. And if one cannot exceed the transformation AND must guarantee it, then the 'compiler driven' restructuring effectively becomes wholly human-prescribed anyway (compiler has no choices to make).

is there any comprehensively credible evidence that compiler-driven repr rearrangement is both good enough and predictable enough (in the eyes of the developer) to render prescriptive specification unnecessary? Is there robust evidence that it isn't sufficient?

I'm afraid you'll discover a remarkable shortage of robust evidence for much of anything that is 'quantifiable' (what is 'good enough'?) as opposed to being some sort of formally provable proposition.

In any case, there is a middle ground: non-prescriptive human annotations of 'suggested' optimizations, possibly combined with prescriptive annotations on linear typing and realtime or hard memory constraints. Also, dynamic libraries may include annotations on their internal data representations, available to compilers of new code that plans to optimally utilize those libraries. This ground also needs to be covered.
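
For what it's worth, weak forms of such non-binding suggestions exist already; GHC's specialisation pragmas are a rough illustration (my example, not a claim that this is the whole of the middle ground):

    import qualified Data.Map.Strict as M

    -- A suggestion the compiler may honour, exceed, or (without optimisation)
    -- effectively ignore: emit a copy of the generic code specialised to Int keys.
    {-# SPECIALIZE lookupKey :: Int -> M.Map Int String -> Maybe String #-}

    -- Another suggestion: keep the unfolding in the interface file so that
    -- other modules can specialise or inline against it.
    {-# INLINABLE lookupKey #-}

    lookupKey :: Ord k => k -> M.Map k v -> Maybe v
    lookupKey = M.lookup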

Some clarification

Human prescription rarely succeeds at choosing data structures anywhere near optimal for all uses of data...

Perhaps, but that isn't really germane. First off, there may not be an optimal data structure; this is one of the costs of code reuse. But more to the point, the human doesn't have to choose the optimal structure; the human merely has to out-do the compiler.

...if someone sends you a boxed 'set of integers', there is nothing preventing the compiler including the code to peek at it and say 'oh! that's represented as a sorted array of unboxed integers!' and send it down a specialized code path.

This misses the point. When data structures are exclusively selected by the compiler, there is no one sending a particular representation. In any case, your proposed solution relies on a choice of bads: explosive code growth or runtime code generation, both of which have unacceptable time, space, or run-time complexity costs for many programs. For many other programs they are just fine, but I'm reluctant to wed myself to solutions that are conceptually inappropriate for a substantial number of important applications.

I expect you cannot [characterize and specify this type of optimization...] usefully.

Then we seem to be on the horns of a dilemma. For many programs the programmer requires at least an order-of-magnitude understanding of storage and compute costs. If a minimal set of storage representation optimizations cannot be guaranteed, then (for such programs) either it must be possible for the programmer to specify representation prescriptively, or we must concede that such programs are not within the domain of FP languages, in which case any argument for FP universality goes out the window.

To be clear: I'm not arguing against FP here. I'd welcome a conclusion that FP is viably universal. In order to sustain a universal claim, however, I think we must acknowledge that being able to characterize the space and time behavior and the runtime complexity of programs is important. Mere semantic equivalence coupled with reliance on nonexistent optimization strategies doesn't seem like a good place to stand.

I'm afraid you'll discover a remarkable shortage of robust evidence for much of anything that is 'quantifiable' (what is 'good enough'?)...

Good enough is "within 1.5% of a C program performing the same calculation". We could argue about whether the magic number is 1.5% or 3% or 5%, but the current state of the art is 10x to 20x for many things, so it's pretty clear that things are not (yet) "good enough."

Your middle ground proposition is interesting. Has this been explored?

the human merely has to

the human merely has to out-do the compiler

If that argument were worth much, humans would still be using Assembly as a first choice. Perhaps: a human needs to out-do the compiler consistently, ideally without a major loss of productivity. And that is a LOT harder to do, even for an expert.

Humans will suffer productivity AND efficiency losses when hand-converting data representations. Humans will suffer consistency losses when they don't anticipate the use of shared data in the larger scope of the program.

In any case, your proposed solution relies on a choice of bads: explosive code growth or runtime code generation, both of which have unacceptable time, space, or run-time complexity costs for many programs.

"Explosive code growth" is a dramatic overstatement, but perhaps results from a misunderstanding. I'm not suggesting that a library be compiled for every likely representation.

When compiling a dynamic library, the compiler selects a 'locally optimized' representation. This locally optimized rep may be targeted when compiling code that links to that library. Often, there will be some data-wide processing between receiving data and passing it to a library anyway, so most of the time this 'targeting' can be integrated into that processing. And all this can still allow the alternative use of a non-optimized 'abstract base class' approach to values. That is, perhaps, a doubling of code... at most. Likely it is much less than that.

The cost of run-time code generation depends on the scope of code generation and the relative complexity of the project, I agree.

For many programs the programmer requires at least an order-of-magnitude understanding of storage and compute costs. If a minimal set of storage representation optimizations cannot be guaranteed [...]

I'm certain that an order-of-magnitude understanding of storage and compute costs can be achieved without guaranteeing representation optimizations. It is common for programmers to know their language implementations to at least within an order of magnitude.

However, even supposing that the language did not achieve the right performance profile without human help, your conclusion (that "it must be possible for the programmer to specify representation prescriptively") does not follow. There are solutions that don't involve prescription.

Good enough is "within 1.5% of a C program performing the same calculation".

Interesting choice. For BitC I suppose it is a sensible choice, but I also have this idea that your "C calculation" approach will often just push the problem (ensuring the data is in the right representation for that calculation) to all users thereof.

I'm more interested in solutions that will beat or match what humans could typically do when there are five or six diverse computations to be performed concurrently on a single data set (e.g. delta isolation, addition, removal, searches, joins, versioning). Because, in my experience, data is often accessed concurrently and people want to be notified of the updates (the deltas).

Your middle ground proposition is interesting. Has this been explored?

I've not seen much on it recently, but five or six years back I read one article on the subject of 'suggestion' driven heuristic searches to achieve specified goals. The search problem for optimization is almost independent of the strategy to solve it, but human intelligence can help, as can 'expert' AI programs.

I've been considering this approach as a means to achieve extensible, pluggable optimizers (and typecheckers). Humans and every postprocessor in the pipeline could make notes and suggestions that could target other post-parse processes later in the pipeline (including optimizers).

We're talking past each other again.

Humans will suffer productivity AND efficiency losses when hand-converting data representations.

I have not advocated hand-conversion, and I don't think that's really the issue here.

When compiling a dynamic library, the compiler selects a 'locally optimized' representation.

How? Let me offer a concrete problem. We have a set of data structures corresponding to records in a database. If we argue for compiler-automated repr layout, then the compiler's goal is to work out, without programmer intervention, which of the various boxed bits expressible in FP languages can be unboxed, and also which lists should be turned into internally stateful AVL trees (a qualitatively different transformation). Where reused code is concerned (that is: library code or modules that are used in multiple programs), it must do this conservatively, because it has no insight into how the client of the module/library plans to access the data structures.

This particular example is clearly pretty easy for humans to solve, and the compiler clearly doesn't have enough information to solve it.
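
To make the comparison concrete, here is roughly the prescription a human might write down for this example (a Haskell sketch with hypothetical field names; the collection uses a pure balanced search tree in place of the AVL tree I mentioned above):

    import qualified Data.Map.Strict as M
    import Data.Text (Text)

    -- The human prescribes which fields live inline and unboxed...
    data Record = Record
      { recId      :: {-# UNPACK #-} !Int
      , recBalance :: {-# UNPACK #-} !Double
      , recName    :: !Text               -- necessarily remains a pointer
      }

    -- ...and that the record collection is a balanced tree keyed by recId,
    -- rather than whatever a conservative, library-compiling optimizer might
    -- be forced to settle for.
    type Table = M.Map Int Record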

I'm more interested in solutions that will beat or match what humans could typically do when there are five or six diverse computations to be performed concurrently on a single data set...

I agree completely, but I think this is a somewhat different objective than "compiler does all the work without user intervention". It seems very useful to have a "data structure wizard" that can offer suggestions on how to improve a (set of) layout choice(s). In fact, it seems hugely useful to have such a thing. But no such analysis can anticipate a future seventh computation; the human often can. I think compiler assist is great, but at the end of the day I think that there are a great many cases where the optimizer cannot divine good layout policies. Am I wrong?

Concerning "good enough", I'm definitely not going to argue that 1.5% of C performance is a sensible goal for all programs. It isn't. The question I'm pursuing here is whether FP will ever be appropriate in those applications that do have that requirement.

It might be fair to say that I'm testing my own dog food here. Many PL researchers would argue that anything non-semantic is unimportant. If they are right, we ought to be able to toss BitC out the window because it isn't needed. So perhaps my question might be restated as: if we re-coded the SSL library (which is very I/O and repr intensive) in PerfectHaskell(tm), could we get decent performance out of it, remembering that a library compile can't rely definitively on whole-program information?

Sigh. I'm increasingly of the opinion that I asked the wrong question here, or perhaps I asked in an unclear way. Thanks for helping me fumble through an attempt to sort it out; I appreciate your patience.

I have not advocated

I have not advocated hand-conversion, and I don't think that's really the issue here.

Maybe not directly, but I feel that hand-conversion is directly implied by human prescription of data representation. It seems a necessary complexity when receiving data from a source you don't control and passing it to an algorithm you do control.

Do you know how to obviate hand-conversions while allowing human prescription of data representations?

Let me offer a concrete problem. We have a set of data structures corresponding to records in a database. If we argue for compiler-automated repr layout, then the compiler's goal is to work out, without programmer intervention, which of the various boxed bits that are expressible in FP languages can be unboxed.

Something to consider: you only need to 'box' data that is potentially delayed (lazy/dataflow) or potentially of variant representation. It is not a problem for there to be a mix of record representations in the database so long as each record is 'boxed' and has a pointer to the necessary access and manipulation interface.

Given a set of algorithms, those algorithms can be developed to operate via the 'boxed' representations (through the access and manipulation interface), but may also be specialized if you happen to 'know' the representation. For data that comes from an external library, you may still shift it to the 'specialized' path (either a simple conditional runtime test or a little bit of compiler/linker magic).

Thus, individual algorithms that operate upon the records and such may have specializations to operate on unboxed forms, at least if the compiler detects potential benefit from doing so. The reason such benefit is possible: because the compiler itself is often responsible for whole dataflow paths from algorithm to algorithm.

Your framing of the problem in terms of a database, without any indication of how that database is queried or updated, puts a significant limitation on what the compiler could do... but it doesn't seem, to me, a particularly common limitation.

This particular example is clearly pretty easy for humans to solve, and the compiler clearly doesn't have enough information to solve it.

I'm rather skeptical of that claim. Quite often I'd like to store dataflow information, promises, and incomplete lazy computations in a database... especially one that is 'in program'. If a library is developed entirely with input data from remote sources, then the ability to annotate that this data cannot be lazy or dataflow in nature would be important to achieving full optimization.

there are a great many cases where the optimizer cannot divine good layout policies. Am I wrong?

If you're defining 'good' as 'as good as a human would have done' then optimizer-driven approaches can be better than 'good' because they offer greater productivity. When that seventh computation is added the optimizer-driven approach can adapt automatically to that new computation in a way that a human-specified representation would not.

Similarly, when a profiler is run, the optimizer driven approach can adapt automatically based on that profile. I expect profiler driven approaches can help a great deal when the compiler lacks the 'global information' that humans presumably possess.

if we re-coded the SSL library (which is very I/O and repr intensive) in PerfectHaskell(tm), could we get decent performance out of it

That's probably a better example than the database, since now you have the library that clearly performs processing of the internal data. I believe we could get good performance out of such a library. I also believe that the ability to make suggestions would help get that good performance without sacrificing an unreasonable degree of compilation speed.

Do you know how to obviate

Do you know how to obviate hand-conversions while allowing human prescription of data representations?

No, but I don't know how to do that for compiler-controlled representations either.

you only need to 'box' data that is potentially delayed...

I'm not quite sure who "you" is here. The position of many FP advocates is that I should have no control over this at all, so I think that "you" here must be the compiler. We also appear to be referring to different databases. I assume here that the database is something like Oracle, where I have absolutely no control over boxing or unboxing decisions. I can conform to their (prescriptive) layout and avoid some expensive copies, or I can choose not to.

...If you're defining 'good' as 'as good as a human would have done'...

Perhaps I am making a mistake, but I don't believe that I am defining things this way. I think rather that I am assuming that representation decisions made during the separate compilation of a library module must be made without reference to any particular client application. The library developer may have some sense of the intended behavior of the client applications, but the compiler doesn't, because it must operate without visibility into the client code.

...When that seventh computation is added the optimizer-driven approach can adapt automatically...

I don't see how. In the presence of separate compilation, it seems to me that representation is effectively part of interface, and if the client application changes without updates to the library, I don't see where the optimizer has any opportunity to rearrange things.

It also seems to me that we may be suffering a meta-level disconnect. For the applications that I am concerned about, run-time code generation and/or optimization is simply not a design option; it creates prohibitive overhead and introduces prohibitive risk. Also, for the applications I am concerned about, separate compilation is an absolute necessity. Given these two constraints, I don't see how to implement some of the things that you seem to want to rely on.

don't know how to do that

don't know how to do that for compiler-controlled representations either

It isn't an issue for compiler-controlled representations. You may still need to convert between data types, but with hand-prescribed representations you'll also be converting between physical representations of the same data type (e.g., busying yourself converting between vectors and arrays). In my experience, the latter problem is a much more common one.

I'm not quite sure who "you" is here

Sorry it was unclear. The "you" was the compiler... or the developer of strategies for the compiler. Either way.

I assume here that the database is something like Oracle, where I have absolutely no control over boxing or unboxing decisions.

I noticed. As I said, it still isn't a good example problem since you haven't specified anything about the input or output from that database.

I am assuming that representation decisions made during the separate compilation of a library module must be made without reference to any particular client application.

You also seem to be assuming the library doesn't actually 'do' anything... that every bit of code touching that database lives in the client application.

That doesn't forbid representation decisions being made with reference to a profile developed in response to client applications.

...When that seventh computation is added the optimizer-driven approach can adapt automatically...

I don't see how. In the presence of separate compilation...

The computations I named are cross-cutting. Every single one of them would need support from the library component that is managing the data. You seem to imagine a world where the data is held entirely separate from the operations you perform upon it. I don't imagine such a world is very likely or common. Your SSL library is a better example: it makes clearer the degree of processing and computation occurring within the library.

In any case, even in the world you envision, one could re-profile the clients after adding a seventh computation and then decide which internal data representation will best serve the libraries manipulating the data.

for the applications I am concerned about, separate compilation is an absolute necessity

Perhaps a bit off topic, but separate compilation is a technique to solve a problem rather than a feature in and of itself. As such, it is almost certainly not 'absolutely necessary'.

As I understand it, there are four problems that separate compilation aims to solve:

  1. speeding up builds for dependent programs;
  2. obfuscating source and thereby protecting IP;
  3. producing code-as-data in a format suited to efficient dynamic construction of a larger program via runtime link-loading;
  4. accepting or producing code-as-data in a common format for integration with components largely written in another language.

Which of these problems apply to BitC? Can you not think of alternative solutions to them? It is always worth questioning your assumptions on 'how' problems must be solved, thinking up alternative and non-traditional solutions to the problems you see in front of you. Even should you come to the same conclusion about the 'best' way to solve the problem, you won't be caught in the trap of arguing that a particular technique is 'absolutely necessary' - at least not without giving that assumption a good fight.

To me, separate compilation has a lot of costs I'm not willing to pay, especially when it comes to open functions and other inverted dependencies (e.g. provisionals, where the module 'requires' and the client 'provides', and constructs, where clients add 'parts' to a declared collection) and whole-program optimizations. So I've spent a great deal of time thinking about this problem in particular. If you wish, I'll discuss some alternative solutions that I'm aiming to test, but another topic might be better for it.

Program specialization

I am not sure I interpret your question correctly, so I might be answering something else here, and at best it is only a partial answer. I would like to generalize your question a little: instead of thinking about specializing data representations with respect to their use (which David Barbour already suggested), how about specializing entire programs? This would eliminate many higher-order functions, which in turn might make things easier for subsequent optimization passes.

So what is the state of the art here? GHC performs constructor specialisation today. Neil Mitchell and Colin Runciman investigated supercompilation in the context of Haskell (Supero) and found that it can yield some pretty dramatic performance improvements for small benchmarks.
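
To give a flavour of the former, here is a hand-written rendition of the classic constructor-specialisation example (my sketch, not actual GHC output):

    -- A loop that allocates and then immediately scrutinises a 'Just' box on
    -- every iteration:
    countDown :: Maybe Int -> Int
    countDown m = go m 0
      where
        go Nothing  acc = acc
        go (Just x) acc
          | x <= 0    = acc
          | otherwise = go (Just (x - 1)) (acc + x)

    -- Constructor specialisation notices the call pattern 'go (Just e) acc'
    -- and generates a variant that passes the payload directly, removing the
    -- per-iteration allocation and case analysis:
    --
    --   goJust x acc
    --     | x <= 0    = acc
    --     | otherwise = goJust (x - 1) (acc + x)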

I don't think that supercompilation is feasible for real-world programs right now; it's more an indication that there is still much room for improvement, and this is a transformation that isn't even using available type information as guidance.

This is a very interesting

This is a very interesting thing to talk about, but I'm trying to get an answer to a specific question here concerning data structure optimizations at the moment.

PyPy

Maybe some of the ideas in the PyPy project are worth a look. Afaik they do extensive unboxing and are now working on code specialization. Even without JIT, the performance of their Python interpreter is already close to the speed of CPython.

Premises

While I think this is a valid question in general, I have to disagree with several of your implicit and explicit premises.

1. Nobody claims that explicit control over low-level representation is "unnecessary" in general. The real question rather is whether it is necessary for the majority of applications and application domains.

2. In principle, hand-optimization will always outperform automated optimization. The real question is to what extent manual optimization is feasible and what the trade-off between micro-efficiency and productivity/maintainability/modularity is. Otherwise, we'd still be programming in assembler.

3. AFAICS, unboxing can only ever improve constant factors; it cannot decrease actual algorithmic complexity (in terms of big-O for time or space). However, if you are careless it can actually increase algorithmic complexity, because it might induce excessive copying between isomorphic representations (a small example follows this list).

4. Automatic specialisation does not require whole-program compilation, nor does it imply extreme code explosion. JIT compilation can replace the former, and specialisation based on internal representation types instead of language-level types addresses the latter. .NET seems to apply these approaches quite successfully, for example. For specialised tasks, even CPU vendors like Intel are moving towards jitting these days.

5. Humans aren't necessarily better at choosing good representations, neither practically nor in principle. In particular, all their decisions have to be made statically with potentially limited knowledge of the program (e.g. when writing libraries, or doing open programming with dynamic linking and loading of plugins, applets etc). JIT compilation OTOH can take actual global knowledge into account.
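
As a small illustration of point 3 (my own example, not anyone's production code), copying between isomorphic representations inside a loop can quietly change the asymptotics:

    import Data.List (insert)
    import qualified Data.Vector.Unboxed as VU

    -- Careless unboxing: keep the set as a sorted unboxed vector, but round-trip
    -- through a list on every insertion. Each insertion copies O(n) elements, so
    -- building the set costs O(n^2)...
    insertSlow :: Int -> VU.Vector Int -> VU.Vector Int
    insertSlow x = VU.fromList . insert x . VU.toList

    buildSlow :: [Int] -> VU.Vector Int
    buildSlow = foldr insertSlow VU.empty

    -- ...whereas inserting into a boxed balanced tree and converting to the
    -- unboxed vector once at the end is O(n log n).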

Capabilities parallelism?

To go off at something of a tangent, it seems to me that there is a bit of a structural parallel between your argument that we need to make security semantics explicit in PLs (e.g. The Structure of Authority: Why Security Is Not a Separable Concern), and your argument that PLs need facilities to guide compilers to the right sort of data representation.

Am I right? If so, is there a broader ideological backdrop? The thesis is not strange, least of all here at LtU, but it seems to be swimming against the dominant "programmer tools" current.