High-level languages with optimal code generation

One of the issues that arises when implementing programs that need high performance is that a large percentage of the time is spent writing and debugging optimizations.

These range from reorganizing data and control flow to, in some cases, dropping down to inline assembly.
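
For example, one of the most common data reorganizations is switching from an array of structures to a structure of arrays, so that a hot loop only touches the fields it actually uses. A minimal sketch in C; the particle type and field names are just illustrative:

    #include <stddef.h>

    /* Array-of-structures layout: each particle's fields are adjacent in
       memory, so a loop that only reads x and vx still pulls y, z, vy, vz
       through the cache. */
    struct particle { float x, y, z, vx, vy, vz; };

    void step_aos(struct particle *p, size_t n, float dt) {
        for (size_t i = 0; i < n; i++)
            p[i].x += p[i].vx * dt;
    }

    /* Structure-of-arrays layout: the same update now streams through two
       dense arrays, which is friendlier to caches and auto-vectorizers. */
    struct particles { float *x, *vx; /* remaining fields elided */ };

    void step_soa(struct particles *p, size_t n, float dt) {
        for (size_t i = 0; i < n; i++)
            p->x[i] += p->vx[i] * dt;
    }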

This work seems to be error prone and quite difficult, requiring a lot of specialized knowledge to achieve good performance.

A good example of the difficulty of this process can be seen in the series Handmade Hero: from roughly episode 590 to episode 618, much of the work in the episodes is optimization.

How can we make it easier for compilers to generate faster code automatically?

I found out about superoptimization, but it seems to be very limited in capability. Specifically, it is very slow, cannot handle control flow, and cannot do anything about data structures.
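
To make the idea concrete: a superoptimizer exhaustively searches for the shortest instruction sequence equivalent to a small, straight-line piece of code. A classic kind of result is a branch-free replacement for a simple branching function, along the lines of this hedged C sketch (assuming two's-complement integers):

    /* The obvious version, with a branch. */
    int min_branch(int x, int y) {
        return (x < y) ? x : y;
    }

    /* The kind of branch-free sequence a superoptimizer can discover:
       (x < y) is 0 or 1, so -(x < y) is either all-zero or all-one bits,
       and the mask selects between y and x.  Assumes two's complement. */
    int min_branchless(int x, int y) {
        return y ^ ((x ^ y) & -(x < y));
    }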

Regarding that last limitation, most languages force us to choose the data structures up front, so a compiler cannot do much about them. A different strategy is used in Compilation of Bottom-Up Evaluation for a Pure Logic Programming Language, which proposes a Turing-complete, Datalog-inspired language. With this approach, the data structures are selected by the compiler.

What other research looks at this problem, and what other ideas do you have?

Is relational algebra + mapreduce a good way to approach this problem?

How can we have program descriptions that compile into optimal or close to optimal code? Or at least how can we decouple optimizations from the problem description?

P.S. This is a bit of a sequel to the thread Why is there no widely accepted progress for 50 years?. The answers I got in that thread were very thought-provoking and helped clarify much of my thinking. Thank you very much to all who participated in the discussion!

No universal optimization

I think the progress of the last 50 years is precisely that one can no longer know, when writing code, the conditions under which it will be executed. CPU instruction availability and cost, memory access cost, usage patterns, and everything else can differ between different runs of the application. There are even funny things like the CPU lowering its frequency when executing specific instructions.

Universal optimization is usually worse than condition-specific optimization. The more knowledge about the actual environment and usage patterns that can be used, the better the potential for optimization.

One could help the compiler do a better job, but from experience such help usually has very little effect, expires quickly (conditions change or compiler optimizations change), and is usually applied in the wrong place (because it is done by guesswork rather than based on profiling).

So things are moving in the direction of JIT and pre-compilation using gathered statistics, or at least checking the specific environment. This allows optimization to be pushed further at the expense of load and startup times.

But all of this helps only the CPU-bound pieces of the code; it has a weaker effect on memory-access-bound code, and compiler optimizations have almost no effect on IO-bound code. At best one could save some electricity cost for data centers.

I think we could even see ML-based optimizations next, rather than the rule-based ones we have right now, with the ML module learning from program executions on the local machine.

1. What is optimal? It seems

1. What is optimal? It seems like "optimal" is input-specific. Some inputs for a geometry program might call for unrolling loops along the x-axis, and some along the y-axis, and some might see performance degradation from any unrolling at all due to the larger cache footprint (see the sketch after this list).

2. I think the OCaml compiler has demonstrated that you can get a pretty competitive implementation with a judicious choice of language features, minimal optimizations, and a good register allocator.

3. Better optimizations will probably require providing the compiler more information. Stuff like superoptimization, fusion, partial evaluation, etc. This information may sometimes take the form of manual staging annotations.

4. I believe there are some efforts to implement Lamping's optimal lambda calculus evaluation strategy, or other optimal reduction models (interaction nets IIRC?), but the links escape me now.
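
Regarding point 1, a small hedged illustration in C of why "optimal" is input-specific: which axis the inner loop runs along (the loop a compiler would unroll or vectorize) changes the memory access pattern, and the better choice depends on the array's shape and layout. The dimensions here are hypothetical:

    #define NX 1024
    #define NY 1024

    /* Inner loop along x: for a row-major C array this walks memory
       contiguously, which is usually the cache-friendly order and the
       one worth unrolling or vectorizing. */
    double sum_inner_x(double a[NY][NX]) {
        double s = 0.0;
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++)
                s += a[y][x];
        return s;
    }

    /* Inner loop along y: the same computation with strided access.
       For some shapes (e.g. very small NX) the difference is negligible;
       for large arrays the access pattern can dominate the runtime. */
    double sum_inner_y(double a[NY][NX]) {
        double s = 0.0;
        for (int x = 0; x < NX; x++)
            for (int y = 0; y < NY; y++)
                s += a[y][x];
        return s;
    }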

1. Yes, it seems the answer

1. Yes, it seems the answer might be some sort of JIT, or compilation using gathered stats, as mentioned in another reply.

2. Is there a good resource that goes into more detail about the design of these specific parts of the OCaml compiler? Especially the "judicious choice of language features".

I suggest reading Leroy's

I suggest reading Leroy's design of the ZINC abstract machine, or get some highlights from these slides if you're already familiar with it, and see how OCaml evolved it.

optimal reduction of lambda calculus

The developer of Moonad/Formality, Maia Victor, went to great lengths to explore optimal reduction of the lambda calculus via translation to Lafont's interaction nets.

optimal cache-oblivious algorithms

Seems related: Cache-oblivious algorithms are designed to effectively use a memory hierarchy with increasing latency (e.g. L0, L1, L2, RAM, disk) without knowing exact details of cache and cache-line sizes.

Optimal cache-oblivious algorithms are asymptotically, provably optimal. Of course, they might not be optimal for specific hardware (where one byte more or less might make a world of difference). But they are optimal within the limits of being cache-oblivious.

Related: Cache and I/O Efficient Functional Algorithms by Guy Blelloch and Robert Harper.
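
As a concrete sketch of the cache-oblivious style, here is the classic recursive matrix transpose in C: it keeps splitting the larger dimension of the current block, so the working set shrinks until it fits at each level of the hierarchy, without the code ever naming a cache size. The base-case cutoff is an arbitrary constant, not something prescribed by the technique:

    /* Transpose an n x m block of src starting at (row, col) into dst.
       src is rows x cols, dst is cols x rows, both row-major. */
    void transpose_rec(const double *src, double *dst,
                       int rows, int cols,
                       int row, int col, int n, int m) {
        if (n <= 16 && m <= 16) {           /* arbitrary base-case cutoff */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < m; j++)
                    dst[(col + j) * rows + (row + i)] =
                        src[(row + i) * cols + (col + j)];
        } else if (n >= m) {                /* split the taller dimension */
            transpose_rec(src, dst, rows, cols, row, col, n / 2, m);
            transpose_rec(src, dst, rows, cols, row + n / 2, col, n - n / 2, m);
        } else {                            /* split the wider dimension */
            transpose_rec(src, dst, rows, cols, row, col, n, m / 2);
            transpose_rec(src, dst, rows, cols, row, col + m / 2, n, m - m / 2);
        }
    }

    void transpose(const double *src, double *dst, int rows, int cols) {
        transpose_rec(src, dst, rows, cols, 0, 0, rows, cols);
    }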

Then there's GC, footprint, and thermal optimization

At least for the moment, the solutions for a large class of application safety issues require either GC (most) or static pre-allocation (the EROS/KeyKOS/Coyotos kernel family and some others). GC doesn't intrinsically make things slower, but even the best "real time" approaches to date do not yet deliver time-definite resource availability. The best GC (at least so far as I know) still incurs a very large overhead in required physical memory to achieve performance parity. When multiple applications are run on a single system, the requirement to provision for simultaneous peak footprint can become a major concern in both cost and heat. Even in a single application system, the implications of peak-footprint provisioning have unfortunate thermal and cost implications.

There are other (non-GC) classes of application for which footprint optimization or thermal optimization are more important than single-thread performance optimization. Search engines, as one example, are embarrassingly parallel, and their efficiency is best expressed in watts per query. The key to scaling them cost effectively is more about reducing the per-instance thermal cost and memory footprint than it is about improving their individual CPU performance. Though of course both can matter at scale.

I think what I'm saying is that modern optimization is almost always a matter of trading one cost for a different one, or sometimes several different ones. Where this is true, the criteria for optimality - or even just success - can't be measured on a single axis.

Are there any languages that

Are there any languages that allow one to select one of these axes? Without having to manually redesign the software.

Are there any languages that

"Are there any languages that allow one to select one of these axes? Without having to manually redesign the software."

That's one of our goals with Ecstasy.

Not just "parity"

It is true that GC does require more space in order to achieve efficiency in terms of execution speed, but that performance benefit is fairly significant over manual memory management (i.e. it is not "achieving parity", but rather achieving measurably better throughput). The rule of thumb that I use is a 2x memory footprint for a GC-based implementation, but that is just a rule of thumb. And this is still an area of computing that is evolving fairly significantly.

Always a tradeoff

Ultimately, it comes down to the tradeoff between optimization and compilation time.

For your engineers to be productive you don't want your compiler to take longer than about ten seconds on most programs. Programmers are in a compile-test-debug loop and you don't want to stick any unnecessary extra "idle" (human not working) time into that loop. Conversely your code when released ought to be as optimized as possible and your users should see code that runs fast, in small memory, etc. These are mutually exclusive in most languages.

Pascal, Ada, and Modula-3 are exceptional in that the language syntax and organization are designed specifically to be easy to compile. A straightforward Pascal compiler is possibly the easiest full compiler in the world to write, and even a simple one produces decent-ish machine code. C was originally made parallel to machine code as well, but it didn't have any of the constraints on program structure that allow decent-ish code from a simple-minded compiler.

For C, C++, and many good languages, good compilers support several different optimization levels to meet different needs. Typically "-O0" for debug, "-O1" for development, "-O2" for testing, and "-O3" for release, in increasing order of effort spent on optimization.

Debug code runs like a tortoise, with optimization specifically inhibited. But it's structured parallel to the source code with no reordering of operations, so you can follow along in the debugger and know what's going on.

Development code runs like a cow, with only the simplest and easiest optimizations done, but it compiles within seconds, so as to avoid sticking frustrating and costly delays into your engineers' compile-test-debug loop.

Testing code, for distribution to a few alpha testers, runs like a horse, with considerable optimization effort put in and a compilation process that takes several times as long.

And release code runs like a cheetah, with all the stops pulled out in analysis, optimization, and restructuring. But the compilation process is a major undertaking. If the program is large it may run on dozens of machines and take hours.
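
As a small, hedged illustration of what those levels mean in practice, consider the loop below. At -O0 a compiler will typically emit it exactly as written, one iteration at a time, which is what you want under a debugger. At higher levels, mainstream compilers such as GCC and Clang are often able to recognize the reduction and replace it with something equivalent to the closed form, and in larger functions they will also inline, vectorize, and reorder code until it no longer maps line-by-line onto the source. The exact behavior depends on the compiler and version:

    /* Sum of 0 .. n-1, written as a naive loop. */
    long sum_to(long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += i;
        return s;
    }

    /* What an optimizing compiler may effectively turn the loop into. */
    long sum_to_closed_form(long n) {
        return n > 0 ? n * (n - 1) / 2 : 0;
    }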

And static analysis of code looking for bugs does even more analysis than that. It can run for hours and hours and hours, even when the -O3 code compiles in only about ten minutes. Even then, there'll be tests any given analyzer doesn't make. That level of analysis could probably support a -O4 compilation mode that virtually nobody would ever use.

"That level of analysis

"That level of analysis could probably support a -04 compilation mode that virtually nobody would ever use."

In certain fields, I think it makes sense to have a final build that could even take a week or more, if it meant having the program be >20% faster. Simulations that run for months. Video games. The thing is, I don't know any language or compiler that gives you that option. Why is that?

It might seem like a week is a long time, but often much more time than that is spent writing those optimizations by hand for similar gains.

Mostly it's more work than people want to do.

"Why is that?"

Mostly they don't do it because compilers are a major undertaking to support and maintain and it's hard enough in most cases to keep up with moving-target language specs and produce correct code.

There are infrastructure limitations. In compiler suites where many different compilers all compile to a common intermediate representation and then use the same code generation module, like the GCC compiler suite, the code generation module, by being suitable for languages that have good reasons to have, e.g., more complex stack frames or procedure calls, sometimes fails to take advantage of regularities or structural simplicity allowed by a particular language spec.

Another kind of infrastructure limitation happens when a lot of different languages are all supposed to run on the same virtual machine, and the virtual machine is optimized to serve languages simpler than the languages someone is trying to compile for it. This sort of thing happened with the JVM when people were trying to cram Scheme onto it. Scheme has a call tree instead of a call stack because of call/cc, and the specification requires tail call elimination because that is the language's primary looping construct. (There are explicit loops now, but the reference implementation still defines them by syntactic substitution into recursive calls.)

Trying to explain this to the JVM resulted in a lot of frustration and ugly hacks, exposing VM code in the middle, storing it in variables, and deliberately overwriting the stack. The whole thing violated safety guarantees the JVM was originally designed to enforce, and was generally a painful and inefficient exercise.

The JVM is better now and, with some of its safety guarantees turned off, supports a broader class of languages. Current Scheme-on-JVM efforts are nowhere near as ugly and inefficient. But hyperoptimizing the JVM to get the best possible results for such languages is hardly a priority for its developers, and the compiler implementers in this case are focused on producing JVM code; the abstraction barrier between virtual-machine compilation and VM-code execution doesn't allow straight-through optimization noticeably better than what they're getting now.

Next there is portability. People who write compilers would like to think that these compilers will work for a broad class of machines, and so they avoid optimizations that presume the existence of particular hardware such as an NVidia 1080 or better GPU, or AMD- or Intel-specific cache structures, or aliased addressing via an MMU, or .... The result is that even when these features are available, the compiler still produces code that does not take advantage of them. The compiler writers write to a "target" that they can define in a restricted language for describing hardware, and then want to believe that making their compiler work on new hardware amounts to writing a correct hardware definition file for it. It's never really that simple, but it can come close.

The point here is that the hardware definition is restricted by the set of features that the compiler can understand, and "describing enough features accurately enough to make the compiler work for new hardware" is never the same task as "describing the hardware so completely that every possible feature of it can be taken advantage of in optimizing the program." Further, they don't want it to be. This isn't the last machine they'll ever target, and if they abandon the constraints of their hardware description format in order to take advantage of every feature of it, they'll be doing a lot of work that doesn't apply to any future compilation target. So if some machine has a unique feature, they *could* extend their hardware description language to describe it, then extend their compiler to recognize and optimize for it. But they'd rather spend that time targeting three more different kinds of hardware.

After that you have peculiarities of the operating system and how well its underlying structures match the paradigms of the language. For example, Scheme specifies call/cc in a way that can be read as a description of cooperative multitasking. Implementing a Scheme compiler on an operating system that natively supports cooperative multitasking (like MacOS 7) involved essentially no impedance mismatch, because continuations simply became processes. When implementing a Scheme compiler on an operating system that natively supports pre-emptive multitasking (like OSX, UNIX, Linux, etc.), you have to pass up a whole lot of machinery that is available and would be a heck of a lot more efficient, to the point that you feel like you have one hand tied behind your back. And in some cases (looking at you, OSX) you have to do backflips to avoid operating system optimizations intended to take old cooperative-multitasking code and run it with pre-emptive multitasking semantics: good for a fair number of programs, very bad for standards conformance in Scheme. And your backflips, inevitably, come with a performance cost well in excess of simply executing the code with cooperative-multitasking semantics.

And I could go on for days. Big projects like GCC, that have accumulated over a thousand years of human attention and craftsmanship, tend to be projects that try to support five hundred different languages on a hundred different kinds of hardware, and hyperoptimization on any single case is seen as a waste of resources. Small projects tend to be the work of a single person, and are limited to the hours and days that make up a single person's life - rarely more than 20 years of cumulative attention and craftsmanship, even assuming someone works full-time on it for a sixty-year productive lifespan. If they devote their efforts to hyperoptimization they can get a long way - with exactly one language on exactly one hardware target. It's likely not to be the combination you're interested in.