My colleague Mike Rainey described this paper as one of the nicest he's read in a while.

STABILIZER: Statistically Sound Performance Evaluation

Charlie Curtsinger, Emery D. Berger

2013

Researchers and software developers require effective performance
evaluation. Researchers must evaluate optimizations or measure
overhead. Software developers use automatic performance regression
tests to discover when changes improve or degrade performance.
The standard methodology is to compare execution times before and
after applying changes.

Unfortunately, modern architectural features make this approach
unsound. Statistically sound evaluation requires multiple samples
to test whether one can or cannot (with high confidence) reject the
null hypothesis that results are the same before and after. However,
caches and branch predictors make performance dependent on
machine-specific parameters and the exact layout of code, stack
frames, and heap objects. A single binary constitutes just one sample
from the space of program layouts, regardless of the number of runs.
Since compiler optimizations and code changes also alter layout, it
is currently impossible to distinguish the impact of an optimization
from that of its layout effects.

This paper presents STABILIZER, a system that enables the use of
the powerful statistical techniques required for sound performance
evaluation on modern architectures. STABILIZER forces executions
to sample the space of memory configurations by repeatedly re-randomizing
layouts of code, stack, and heap objects at runtime.
STABILIZER thus makes it possible to control for layout effects.
Re-randomization also ensures that layout effects follow a Gaussian
distribution, enabling the use of statistical tests like ANOVA. We
demonstrate STABILIZER's efficiency (< 7% median overhead) and
its effectiveness by evaluating the impact of LLVM’s optimizations
on the SPEC CPU2006 benchmark suite. We find that, while -O2
has a significant impact relative to -O1, the performance impact of
-O3 over -O2 optimizations is indistinguishable from random noise.

One take-away of the paper is the following validation technique: they verify, empirically, that their randomization technique yields a Gaussian distribution of execution times. This does not guarantee that they found all the sources of measurement noise, but it does guarantee that the sources of noise they handled are properly randomized, and that their effects can be reasoned about rigorously using the usual tools of statisticians. Having a Gaussian distribution gives you much more than just "hey, taking the average over these runs makes you resilient to {weird hardware effect blah}": it lets you compute p-values and, in general, use statistics.
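To make that concrete, here is a minimal sketch of the kind of parametric test a Gaussian distribution justifies. The timing samples below are synthetic stand-ins (not data from the paper), meant to mimic the -O2 vs. -O3 comparison: two sets of runs of the same benchmark under layout re-randomization, with no true difference in means.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical execution times (seconds) for one benchmark compiled at
# -O2 and at -O3, each run many times under layout re-randomization.
# Re-randomization makes the layout contribution to runtime the sum of
# many independent random effects, so by the central limit theorem the
# samples are approximately Gaussian and parametric tests apply.
times_o2 = [random.gauss(10.0, 0.3) for _ in range(30)]
times_o3 = [random.gauss(10.0, 0.3) for _ in range(30)]  # no true speedup

def welch_t(a, b):
    """Welch's two-sample t statistic and its degrees of freedom."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t(times_o2, times_o3)
# With ~58 degrees of freedom, |t| < 2.0 means we cannot reject the
# null hypothesis of equal mean execution times at the 5% level.
print(f"t = {t:.2f}, df = {df:.1f}")
```

The paper itself uses ANOVA across many benchmarks rather than a single pairwise t-test, but the principle is the same: once the noise is Gaussian, the null hypothesis "same performance before and after" can be tested with a quantified confidence level instead of eyeballing averages.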
