Automatic Generation of Peephole Superoptimizers

Automatic Generation of Peephole Superoptimizers, Sorav Bansal and Alex Aiken, ASPLOS 2006.

Peephole optimizers are typically constructed using human-written pattern matching rules, an approach that requires expertise and time, as well as being less than systematic at exploiting all opportunities for optimization. We explore fully automatic construction of peephole optimizers using brute force superoptimization. While the optimizations discovered by our automatic system may be less general than human-written counterparts, our approach has the potential to automatically learn a database of thousands to millions of optimizations, in contrast to the hundreds found in current peephole optimizers. We show experimentally that our optimizer is able to exploit performance opportunities not found by existing compilers; in particular, we show speedups from 1.7 to a factor of 10 on some compute intensive kernels over a conventional optimizing compiler.

It's always fun to see a method that ought to be intractable rendered tractable through a combination of cleverness and strategically applied brute force.

I found their performance measurements suggestive but perplexing. Their results on their kernels are astoundingly good, which they attribute to making use of the x86's SIMD instructions. But there is very little change on the SPEC benchmarks, and they suggest that their peephole optimizer catches many things that other compilers get via dataflow optimization. As a result, I don't feel I have a clear picture of what their optimizer does and doesn't do, or where it does and doesn't overlap with existing optimizations. But they definitely have enough to convince me that this is worth further study, so I really hope Bansal and Aiken publish more about it -- a tech report or journal paper with more systematic measurements and in-depth analysis would be really illuminating.

They probably rely on back-tracking code rewriting.

I haven't read the article yet (no time currently; will do when I get home later today). However, I can hypothesize.

They probably rely on back-tracking code re-writing, which many Forth compilers use to produce good binary code in the absence of more global information. The Forth phrase 4 + @ is a classic example. The 4 causes the back-end to emit code to push a literal on the data stack. Then, when compiling +, the compiler sees that 4 was just pushed, backs the compile pointer up, and replaces the code with instructions that directly add four to the top of stack (usually held in a register). @, the "fetch" word, sees that 4 was just added to the TOS, and its code is ultimately replaced by an indexed memory fetch instruction.
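Here is a minimal sketch of that back-tracking rewrite in Python; the opcode names and the two rewrite rules are my own illustrative assumptions about the Forth technique described above, not anything from the paper:

    # Back-tracking code rewriting for a tiny stack machine, modeled on how a
    # Forth compiler might compile "4 + @". Opcode names are made up.
    def compile_word(code, word):
        last = code[-1] if code else None
        if word == "+" and last and last[0] == "PUSH_LIT":
            code[-1] = ("ADD_IMM", last[1])      # push n ; +  ->  add n to TOS
        elif word == "@" and last and last[0] == "ADD_IMM":
            code[-1] = ("FETCH_IDX", last[1])    # add n ; @  ->  indexed fetch
        elif word.isdigit():
            code.append(("PUSH_LIT", int(word)))
        else:
            code.append((word, None))
        return code

    code = []
    for w in "4 + @".split():
        compile_word(code, w)
    print(code)   # [('FETCH_IDX', 4)] -- one instruction instead of three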

Writing such "incremental" optimizers is tedious, so this kind of automatic optimizer generation could be very useful.

Nope, just a fixed-size window

Nope. This is just a fixed-size peephole window.

The one problem I can see with this is that, because it matches physical instructions rather than templates, it wouldn't work if you tried to apply peephole optimisation before register assignment, as VPO does. Still, you could almost certainly use the idea of superoptimisation to generate peephole templates, at least in principle; a rough sketch follows.
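As an illustration only (my own guess at how such rules might look, not the paper's representation), here is the difference between a concrete, physical-register rule and a template rule that could also fire before register assignment:

    # A learned concrete rule matches literal instruction text, so it only
    # fires on eax:
    concrete_rule = (["mov eax, 0"], ["xor eax, eax"])

    # A template rule abstracts over the register name, so it also fires on
    # virtual registers before allocation (e.g. "mov v7, 0").
    def template_rule(insn):
        op, rest = insn.split(" ", 1)
        reg, imm = [p.strip() for p in rest.split(",")]
        if op == "mov" and imm == "0":
            return "xor {0}, {0}".format(reg)
        return insn

    print(template_rule("mov v7, 0"))   # xor v7, v7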

SPEC

But there is very little change in the SPEC benchmarks

Maybe I'm just being dense, but isn't that mostly an indication of how much effort has gone into tuning existing compilers to generate the best SPEC results? Compilers are not allowed to special case the SPEC benchmark code specifically, but nonetheless considerable work goes into optimization techniques which improve the results for a vendor's architecture. At least a few years ago, there was considerable money riding on having the best SPEC numbers.

For example, optimization #6 in Table 4 involves constant folding. Compilers perform a certain amount of constant folding even without specific peephole sequences, and I'd assume that ensuring a compiler recognizes the foldable sequences in the SPEC code is high on the TODO list for its developers.
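As an illustration only (this is not the actual sequence from Table 4), the kind of foldable sequence meant here is two constant loads feeding an add, which collapses to a single constant:

    # Tiny constant folder over (op, dst, a, b) tuples; purely illustrative.
    def fold_constants(insns):
        consts, out = {}, []
        for op, dst, a, b in insns:
            if op == "const":
                consts[dst] = a
                out.append((op, dst, a, b))
            elif op == "add" and a in consts and b in consts:
                consts[dst] = consts[a] + consts[b]
                out.append(("const", dst, consts[dst], None))
            else:
                consts.pop(dst, None)
                out.append((op, dst, a, b))
        return out

    prog = [("const", "r1", 2, None), ("const", "r2", 3, None),
            ("add", "r3", "r1", "r2")]
    print(fold_constants(prog))   # the add becomes ('const', 'r3', 5, None)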

Diminishing Returns

I've only skimmed this paper so far, as I'll need to be in the office tomorrow to concentrate on it properly. Hence my question is quite general in nature: it seems quite unlikely that the generated system of rewrite rules is confluent, and peephole optimisation suffers from the same phase-ordering problems inherent in most optimisation passes. If this approach can massively increase the size of the rule database, would it not become more difficult to decide on the order in which to apply the rules to any given program?

Guess

Sometimes you get lucky. If the guess was wrong (as shown by performance analysis of the optimized code segment), try again, filtering out similar matching rules to prune your 'guess list'. Then try again; optimizers are not supposed to be fast themselves (performance-wise), they are just supposed to make code run faster...

Who cares if I run my optimizer when I leave at night to get the results when I get to work the next day? How about on every binary on my system?

This would require feeding performance results back into the database to help with the tree/'guess list' pruning. The database need not (and should not) be constant, but rather a bit 'self-aware' about whether what it is doing actually helps with what it's supposed to do.

Blindly optimizing is no better than being the root of all evil (as Knuth might say).

Depends on how lucky you feel

The problem is that the complexity of the search can be exponential in the number of rules. Assume that you have n rules. If the rewrite system is confluent, then you can try them in any order, so you only need n attempts. In this case, increasing the number of rules from tens of handwritten rules to millions of generated rules would result in a night of number crunching to optimise the program.

If the system they've generated is not confluent, then you have to worry about phase ordering. Now the optimisation process will be O(2^n), and that night of number crunching has just turned into thousands of years...
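To make the phase-ordering worry concrete, here is a toy example (the two rules are invented for illustration, not taken from the paper's database) where the order in which the rules fire changes the final code:

    # Two assumed rewrite rules over x86-like instruction lists.
    RULES = {
        "combine": (["add eax, 1", "add eax, 1"], ["add eax, 2"]),
        "shorten": (["add eax, 1"], ["inc eax"]),
    }

    def apply_rule(code, rule):
        pat, rep = rule
        for i in range(len(code) - len(pat) + 1):
            if code[i:i + len(pat)] == pat:
                return code[:i] + rep + code[i + len(pat):]
        return code

    prog = ["add eax, 1", "add eax, 1"]
    a = apply_rule(apply_rule(prog, RULES["combine"]), RULES["shorten"])
    b = apply_rule(apply_rule(prog, RULES["shorten"]), RULES["combine"])
    print(a)   # ['add eax, 2']
    print(b)   # ['inc eax', 'add eax, 1'] -- a different (worse) result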

Cost of optimization is very important.

The cost of an optimization matters, whether it be a true optimization or merely an amelioration. It's a simple economic formula:

total_expected_savings = s * n - c
   where 
       s = average_savings_per_run
       n = expected_number_of_runs 
       c = cost_of_optimization

Now, this formula is fraught with difficulty in the case of recursive code, but limiting the discussion to peephole optimization drastically simplifies things.
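As a hedged, worked example of that formula (all of these numbers are made up, none come from the paper):

    # Illustrative break-even calculation; every figure is an assumption.
    s = 0.002            # average_savings_per_run, seconds
    n = 50_000_000       # expected_number_of_runs over the binary's lifetime
    c = 8 * 3600         # cost_of_optimization: one overnight run, in seconds

    total_expected_savings = s * n - c
    print(total_expected_savings)   # 71200.0 seconds saved overall
    print(c / s)                    # 14400000.0 runs needed just to break even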

Optimizations that pay for themselves in just a few runs are the most attractive, and indeed, some of these have even been implemented in hardware on high-end CPUs. Examples of this approach include Digital's FX!32, Hewlett-Packard's Dynamo for PA-RISC, Intel's Pentium Pro, and Transmeta's Crusoe, some more successful than others.

(The Pentium Pro has even outlived its younger Pentium 4 cousin, thanks to being brought back with the Core 2.)

Also, when dealing with higher-level languages, don't underestimate the advantage of a fast-running optimizer! You can test the performance of your code quickly, then tweak it to run faster, and rinse and repeat. The quicker the compiler's turnaround, the more iterations of hand-optimization a programmer will do, ultimately creating a sort of hybrid human-computer optimizer.

And no, running code through an interpreter (or a less-optimizing compiler) isn't as good. It's easy to start hand-optimizing code for the interpreter instead of the compiler.

The best programmers have a good, approximate understanding of their language's implementation. This knowledge is indispensable for writing code that is short, readable, and fast. They will learn how to write cleaner code without sacrificing speed by putting the compiler to work for them.

As a side-effect of reaching this level of sophistication, these programmers will probably waste less time with hand-optimizations that contribute more code ugliness than speed, and instead focus their efforts on the changes that really pay off.