My colleague Mike Rainey described this paper as one of the nicest he's read in a while.
STABILIZER: Statistically Sound Performance Evaluation
Charlie Curtsinger, Emery D. Berger
2013
Researchers and software developers require effective performance
evaluation. Researchers must evaluate optimizations or measure
overhead. Software developers use automatic performance regression
tests to discover when changes improve or degrade performance.
The standard methodology is to compare execution times before and
after applying changes.
Unfortunately, modern architectural features make this approach
unsound. Statistically sound evaluation requires multiple samples
to test whether one can or cannot (with high confidence) reject the
null hypothesis that results are the same before and after. However,
caches and branch predictors make performance dependent on
machine-specific parameters and the exact layout of code, stack
frames, and heap objects. A single binary constitutes just one sample
from the space of program layouts, regardless of the number of runs.
Since compiler optimizations and code changes also alter layout, it
is currently impossible to distinguish the impact of an optimization
from that of its layout effects.
This paper presents STABILIZER, a system that enables the use of
the powerful statistical techniques required for sound performance
evaluation on modern architectures. STABILIZER forces executions
to sample the space of memory configurations by repeatedly re-randomizing
layouts of code, stack, and heap objects at runtime.
STABILIZER thus makes it possible to control for layout effects.
Re-randomization also ensures that layout effects follow a Gaussian
distribution, enabling the use of statistical tests like ANOVA. We
demonstrate STABILIZER's efficiency (< 7% median overhead) and
its effectiveness by evaluating the impact of LLVM’s optimizations
on the SPEC CPU2006 benchmark suite. We find that, while -O2
has a significant impact relative to -O1, the performance impact of
-O3 over -O2 optimizations is indistinguishable from random noise.
One take-away of the paper is the following technique for validation: they verify, empirically, that their randomization technique results in a Gaussian distribution of execution times. This does not guarantee that they found all the sources of measurement noise, but it does guarantee that the sources of noise they handled are properly randomized, and that their effects can be reasoned about rigorously using the usual tools of statisticians. Having a Gaussian distribution gives you much more than just "hey, taking the average over these runs makes you resilient to {weird hardware effect blah}": it lets you compute p-values and, in general, do proper statistics.
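For concreteness, here is a minimal sketch (not from the paper; the sample data and the use of SciPy are my own illustration) of what that buys you: given per-run execution times collected under repeated layout re-randomization, you can first check normality, then compare two build configurations with a t-test or ANOVA and get an honest p-value.

# Hypothetical sketch: execution times (seconds) for one benchmark under two
# optimization levels, one timing per randomized-layout run. Data is made up.
import numpy as np
from scipy import stats

times_o2 = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
times_o3 = np.array([12.0, 12.2, 11.9, 12.1, 12.3, 11.8, 12.0, 12.2])

# 1. Verify that randomized layouts yield (approximately) normal timings.
#    A large p-value means we cannot reject normality.
for label, samples in (("-O2", times_o2), ("-O3", times_o3)):
    _, p_normal = stats.shapiro(samples)
    print(f"{label}: Shapiro-Wilk p = {p_normal:.3f}")

# 2. With normality established, a two-sample t-test gives a p-value for the
#    null hypothesis "the optimization made no difference".
_, p_value = stats.ttest_ind(times_o2, times_o3, equal_var=False)
print(f"-O2 vs -O3: Welch t-test p = {p_value:.3f}")

# 3. One-way ANOVA does the same and generalizes to more than two groups.
_, p_anova = stats.f_oneway(times_o2, times_o3)
print(f"ANOVA p = {p_anova:.3f}")

With more than two optimization levels the ANOVA generalizes directly, which is essentially the kind of comparison the paper makes across LLVM's -O1, -O2, and -O3 on SPEC CPU2006.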