STABILIZER: Statistically Sound Performance Evaluation

My colleague Mike Rainey described this paper as one of the nicest he's read in a while.

STABILIZER: Statistically Sound Performance Evaluation
Charlie Curtsinger, Emery D. Berger

Researchers and software developers require effective performance evaluation. Researchers must evaluate optimizations or measure overhead. Software developers use automatic performance regression tests to discover when changes improve or degrade performance. The standard methodology is to compare execution times before and after applying changes.

Unfortunately, modern architectural features make this approach unsound. Statistically sound evaluation requires multiple samples to test whether one can or cannot (with high confidence) reject the null hypothesis that results are the same before and after. However, caches and branch predictors make performance dependent on machine-specific parameters and the exact layout of code, stack frames, and heap objects. A single binary constitutes just one sample from the space of program layouts, regardless of the number of runs. Since compiler optimizations and code changes also alter layout, it is currently impossible to distinguish the impact of an optimization from that of its layout effects.

This paper presents STABILIZER, a system that enables the use of the powerful statistical techniques required for sound performance evaluation on modern architectures. STABILIZER forces executions to sample the space of memory configurations by repeatedly re-randomizing layouts of code, stack, and heap objects at runtime. STABILIZER thus makes it possible to control for layout effects. Re-randomization also ensures that layout effects follow a Gaussian distribution, enabling the use of statistical tests like ANOVA. We demonstrate STABILIZER's efficiency (< 7% median overhead) and its effectiveness by evaluating the impact of LLVM’s optimizations on the SPEC CPU2006 benchmark suite. We find that, while -O2 has a significant impact relative to -O1, the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise.
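The ANOVA mentioned in the abstract can be made concrete with a hand-rolled one-way F statistic. This is a minimal sketch using only the standard library; the timing numbers below are made up for illustration and are not taken from the paper.

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: ratio of between-group
    to within-group mean squared deviation."""
    all_vals = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_vals)
    k = len(groups)
    n = len(all_vals)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical mean runtimes (seconds) over re-randomized layouts,
# one group per optimization level (synthetic numbers, not the paper's).
o1 = [10.2, 10.4, 10.1, 10.3, 10.5]
o2 = [9.1, 9.3, 9.0, 9.2, 9.4]
o3 = [9.2, 9.1, 9.3, 9.0, 9.2]

f = one_way_anova_f([o1, o2, o3])
# A large F relative to the F-distribution's critical value indicates
# that at least one group mean differs; F near 0 means the groups are
# indistinguishable from noise.
```

Comparing F against the F-distribution's critical value (for the chosen significance level and degrees of freedom) is what lets the authors say -O3 vs. -O2 is indistinguishable from noise while -O2 vs. -O1 is not.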

One take-away of the paper is the following technique for validation: they verify, empirically, that their randomization technique results in a Gaussian distribution of execution times. This does not guarantee that they found all the sources of measurement noise, but it guarantees that the sources of noise they handled are properly randomized, and that their effect can be reasoned about rigorously using the usual tools of statisticians. Having a Gaussian distribution gives you much more than just "hey, taking the average over these runs makes you resilient to {weird hardware effect blah}": it lets you compute p-values and, in general, use statistics.
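To give a feel for what "verify empirically that the distribution is Gaussian" can look like: the authors use a formal normality test, but a rough moment-based check makes a simpler stand-in. For a Gaussian, sample skewness and excess kurtosis should both be near zero. Everything below is synthetic, standard-library-only illustration.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: third standardized moment (0 for a Gaussian)."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0

random.seed(1)
# Stand-in for runtimes measured over many re-randomized layouts.
samples = [random.gauss(10.0, 0.5) for _ in range(5000)]

skew = skewness(samples)
kurt = excess_kurtosis(samples)
# Both should be close to zero if the layout noise is Gaussian;
# large values would mean the Gaussian assumption (and hence the
# p-values built on it) is suspect.
```

A formal test (e.g. Shapiro-Wilk) gives a p-value instead of eyeballing moments, but the idea is the same: check the model assumption before leaning on it.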


exciting to me as a "dev in the street"

Considering the longer-term possibility of such things being built into our development systems, so they can interactively yell at us when we start typing code that is pessimal. (Of course, it also just lends ammo to the ML-AIs.)

principle of diversification


There exists an optimization O_n that maximizes expected return while minimizing variance.

Further, a system deploying different optimization strategies for different code usage patterns (supercomputers, metaprogramming) cannot eliminate all variance.

Further, there is a riskless optimization that can be combined with O_n to dominate all other combination strategies.

So, how do we build such multi-compartment models?

So, what are you "stabilizing" at a very large scale?


Having a Gaussian distribution gives you much more than just "hey, taking the average over these runs makes you resilient to {weird hardware effect blah}": it lets you compute p-values and, in general, use statistics.

Sorry to nitpick, but... What enables using statistics is:

a) Valid model assumptions
b) The statistical technique as applied to the model assumptions (some techniques are "robust" even in the presence of not quite true model assumptions, and will still yield meaningful results, despite sloppiness).

The neat thing about Stabilizer, which isn't discussed in the paper and may be unknown to the authors themselves, is that it raises the question of offline vs. online compilation and the challenge of targeting, especially vectorization targeting: there is a wide gap in supported architectures for vectorization, and vectorization is one of the most common techniques modern CPUs offer for enabling further performance gains.

So, while Stabilizer gives you valid offline tests, it doesn't really account for other problems in benchmarking, and it also doesn't account for poor benchmark design. But Stabilizer could probably be re-purposed to do so, using J-tests instead of t-tests.
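For reference, the t-tests mentioned here amount to something like the following Welch's t-test sketch, standard library only. The timing samples are synthetic, and the normal approximation to the p-value is only reasonable for large samples (a proper test would use the t-distribution's CDF).

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples
    with possibly unequal variances."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / se

def normal_p_value(t):
    """Two-sided p-value via the normal approximation
    (adequate only for large sample sizes)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

random.seed(0)
# Hypothetical "before" and "after" timings drawn from the same
# distribution, i.e. the change has no real effect.
before = [random.gauss(10.0, 0.5) for _ in range(100)]
after = [random.gauss(10.0, 0.5) for _ in range(100)]

t = welch_t(before, after)
p = normal_p_value(t)
# A p-value above the chosen threshold (say 0.05) means we cannot
# reject the null hypothesis that the two versions perform the same.
```

The point of the thread is that this machinery is only trustworthy if each run really is an independent draw from the layout distribution, which is what Stabilizer's re-randomization buys you.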

To understand why "having a Gaussian distribution" is not enough, we can look at an example from Efron's bootstrap sampling technique. There is a bizarre noise effect in bootstrap sampling where, if the number of times you generate a "sample" Ni from an underlying sample distribution N is too high, you simply begin sampling noise from the sample data set itself. This result has led some to question the efficacy of techniques that rely on re-randomization in general, because it will result in a faulty Gaussian distribution.

It's been a while since I have done statistics work, but in order to approximate the law of large numbers using a sample distribution, I believe the sample population size must be N = 32, and there is a limit on the number of times you can re-sample, given by a combinatoric function I don't have handy at the moment.
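To make the bootstrap concrete, here is a minimal percentile-bootstrap sketch of Efron's method, with a sample size of 32 echoing the figure in the comment above. All numbers are synthetic, and the resample count is an arbitrary choice for illustration.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval: resample the data
    with replacement, compute the statistic each time, and take
    quantiles of the resulting distribution."""
    reps = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]
        reps.append(stat(resample))
    reps.sort()
    lo = reps[int((alpha / 2) * n_resamples)]
    hi = reps[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

random.seed(2)
# 32 synthetic timing measurements.
timings = [random.gauss(10.0, 0.5) for _ in range(32)]
lo, hi = bootstrap_ci(timings)
# The interval brackets the sample mean. Past a point, adding more
# resamples stops adding information: you are only re-sampling the
# noise already baked into these 32 values.
```

This is the effect the comment describes: the bootstrap distribution can only ever reflect the original sample, so cranking up the resample count eventually just sharpens a picture of that sample's own noise.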