Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore

The following paper has been posted on Hacker News and seems very relevant to the LtU audience, including its more practically-minded part.

Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore
Erven Rohou, Bharath Narasimha Swamy, André Seznec

Interpreters have been used in many contexts. They provide portability and ease of development at the expense of performance. The literature of the past decade covers analysis of why interpreters are slow, and many software techniques to improve them. A large proportion of these works focuses on the dispatch loop, and in particular on the implementation of the switch statement: typically an indirect branch instruction. Folklore attributes a significant penalty to this branch, due to its high misprediction rate. We revisit this assumption, considering state-of-the-art branch predictors and the three most recent Intel processor generations on current interpreters. Using both hardware counters on Haswell, the latest Intel processor generation, and simulation of the ITTAGE, we show that the accuracy of indirect branch prediction is no longer critical for interpreters. We further compare the characteristics of these interpreters and analyze why the indirect branch is less important than before.

The paper measures the accuracy of branch predictors on the indirect branch found in "naive" instruction decoding loops of bytecode interpreters. Experiments are performed with CPython, Javascript (SpiderMonkey) and CLI (.NET) on the interpreter side, and recent Intel processors as well as the authors' own state-of-the-art predictor (simulated) on the predictor side. The conclusion is that, assuming reasonably sized predictor state, current prediction schemes fare very well on the dreaded indirect branch of while/switch interpreters. Reading between the lines, this makes well-known dispatch optimization techniques (e.g., jump threading) somewhat obsolete, at least on top-of-the-line Intel processors.
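
For readers who have not written one, here is a minimal sketch of the two dispatch styles in question: the "naive" while/switch loop and its jump-threaded (computed goto) variant. The toy opcodes and program are made up for illustration, and the computed-goto form relies on a GCC/Clang extension.

    #include <stdio.h>

    /* Toy bytecode; the opcodes and program are made up for illustration. */
    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    /* "Naive" while/switch dispatch: one shared indirect branch (the switch). */
    static void run_switch(const unsigned char *code) {
        int stack[64], sp = 0, pc = 0;
        for (;;) {
            switch (code[pc++]) {
            case OP_PUSH1: stack[sp++] = 1;                  break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);    break;
            case OP_HALT:  return;
            }
        }
    }

    /* Jump threading via computed goto (GCC/Clang extension): each opcode body
       ends in its own indirect branch, giving the predictor per-opcode context. */
    static void run_threaded(const unsigned char *code) {
        static void *labels[] = { &&push1, &&add, &&print, &&halt };
        int stack[64], sp = 0, pc = 0;
    #define DISPATCH() goto *labels[code[pc++]]
        DISPATCH();
    push1: stack[sp++] = 1;                  DISPATCH();
    add:   sp--; stack[sp - 1] += stack[sp]; DISPATCH();
    print: printf("%d\n", stack[sp - 1]);    DISPATCH();
    halt:  return;
    #undef DISPATCH
    }

    int main(void) {
        const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        run_switch(prog);    /* prints 2 */
        run_threaded(prog);  /* prints 2 */
        return 0;
    }

The folklore in question is that the single shared indirect branch in run_switch mispredicts badly, which is what motivated replicating the dispatch at the end of every opcode body; the paper's measurements suggest recent predictors handle the shared branch just fine.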

I, for one, would like to see more work in this style, with hardware people bringing their expertise and experimental rigor to low-level PL implementation issues.

Also, I suspect that the PL implementation community would be interested in additional results, such as other interpreters and less aggressive microarchitectures (ARM processors, anyone?).

That's really nice work,

That's really nice work, especially for VMs with complex ops. For lower-level VMs, though, I'm still worried about the additional latency to invoke powerful predictors. IIRC, Intel uses a multi-level predictor, and resorting to the higher level (because the simpler predictors are historically too inaccurate) adds to the latency. With complex operations, that's easily hidden by out-of-order execution; simpler ones (e.g., typical Forth words) may not expose enough ILP, and there may still be non-negligible overhead even if branch predictors are very accurate.

indeterminate hardware model

It's nice that hardware changes improve the performance of straightforward coding approaches. Less exotic seems better as an implementation style. But there still seems to be a lack of a predictable model of hardware behavior. The topic of models came up in conversation mid last week.

A coworker asked me why I expected to stop using C++ (I had complained about additions of new high-level magic semantics). I explained that I wanted to compile to C, for clarity in the imperative model of what is supposed to happen. He's one of those guys who worries about exactly what happens with different processors, even when executing the same assembler. So he noted that you really don't know exactly what happens in C either; all you get is an approximate model. I said a reasonably simple model was all I wanted, and that it should be okay to reason in terms of the model.

So I like hardware improvements yielding reasonable performance when you implement the obvious way to write something, when working at a level as high as C.

Microprocessors are optimized for C

Microprocessors, and function call conventions, are optimized for handling C/C++. Even exception handling is now the default in most ABIs; it no longer matters whether you compile C or C++.

As I gathered from a presentation, smart microprocessor design buys you about a factor of 50 (this was from the Pentium design guy). So I assume that is the worst case when you don't use C/C++.

complexity and opacity are undesirable in a compilation target

Yes, so it's weird when the obvious thing in C is slower than something complex. (C++ with inlined simple methods is sometimes faster than nearly identical C, in critical tight loop code with a bunch of subtle edge cases, apparently because the compiler can infer more about variable scopes. Otherwise I can't explain some empirical tests.)

It's feasible to debug C in terms of how bit patterns land where and when they do, because the model says you are entitled to see the bit patterns and know when locations change. But C++ reserves the right to say bit patterns are none of your business, and there's no reason you need to grasp flow control of internals. So it's hard to clean up messes created by others, who create more of them if features allow vague gestures without actual understanding. (Anyway, RAII via destructors in C++ doesn't fit well in an async stackless fiber model either, so refusal to be predictable about what happens is just aggravation.) Giving up C++ partly means no longer cleaning up after devs who don't know what they are doing, and buying immunity from further long term language changes.

Async Stackless

We discussed the async stackless fiber model before, but I wonder what the performance implications are. Stack allocation is significantly faster than heap allocation. If C++ offers other performance advantages over C that don't fit well with this model, is it fair to say that microprocessors are not being designed to perform well with this model?

Is an async model with a single stack a viable alternative, where you have a single big event loop outside of the program (in the language runtime) and you always return to the event loop with the stack empty?
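
For concreteness, here is a minimal sketch of the model being asked about, assuming run-to-completion handlers; the names and the fixed-size queue are illustrative only.

    #include <stdio.h>

    /* Handlers never block or yield, so the C stack is always empty when
       control returns to the runtime's big event loop. */
    typedef void (*handler_fn)(int payload);

    struct event { handler_fn fn; int payload; };

    static struct event queue[16];
    static int head, tail;

    static void post(handler_fn fn, int payload) {
        queue[tail++ % 16] = (struct event){ fn, payload };
    }

    static void on_tick(int n) {
        printf("tick %d\n", n);
        if (n < 3)
            post(on_tick, n + 1);   /* the "continuation" is just another event */
    }

    int main(void) {
        post(on_tick, 0);
        while (head != tail) {              /* the event loop in the runtime */
            struct event e = queue[head++ % 16];
            e.fn(e.payload);                /* runs to completion on one stack */
        }
        return 0;
    }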

assume a mixed strategy can be tuned for local code profile

Keep in mind I like mixed strategies, with an inclusive as opposed to exclusive arrangement, so I don't mind there being threads running C++ code somewhere, as long as I can move them safely out of memory-stomping range when I'm debugging. If there are two opposing ways to do something, zig and zag, I want to do both when locally one way pays off, as long as I can pick. When one loses, do the other.

Stackless is an odd term, slightly a misnomer, meaning a language's normal stack is not used. Instead some other stack is used, so activation frames sometimes (or always) get allocated and managed by user code instead of host language runtime. Heap-based cactus stacks are an old well-worn case in point. I want some fibers to put activation frames in heap-based cactus stacks, some of the time, while locally non-blocking subtrees can use the normal runtime stack (for example).

Your main question is about performance implications. In short, it depends on stuff like the mix of activity in terms of granularity, the need to context switch anyway, and the density of blocking or parking calls. If a subset of your computation graph is a horrific collection of tiny async activities that get bursty small i/o events before needing to wait on more i/o, these are going to force high-frequency context switches anyway, at least in terms of data locality. It may be best to let large numbers of parked async tasks sit quiescent in fibers with tiny heap-based stacks, and if you can manage them like processes (i.e., being able to kill them), all the better. So a pool of resources can be dedicated to hosting stuff like that. Note that heap-based does not mean "call malloc()", because all the standard practices of local pooling of re-used frames apply, and FIFO alloc order will keep locality in cache-line-aligned buffers.
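
As a rough illustration of the frame pooling mentioned above (reused, cache-line-aligned frames from a local free list rather than a malloc() per activation), here is a minimal sketch; the sizes and names are made up, and a real runtime would carve many frames out of each slab and keep one pool per scheduler.

    #include <stdio.h>
    #include <stdlib.h>

    enum { FRAME_BYTES = 128 };   /* a multiple of the 64-byte cache line */

    struct frame {
        struct frame *next;                                /* free-list link */
        unsigned char data[FRAME_BYTES - sizeof(void *)];  /* activation record */
    };

    static struct frame *free_list;

    static struct frame *frame_alloc(void) {
        struct frame *f = free_list;
        if (f) {                                  /* fast path: reuse a warm frame */
            free_list = f->next;
            return f;
        }
        return aligned_alloc(64, sizeof *f);      /* slow path: fresh aligned frame */
    }

    static void frame_free(struct frame *f) {
        f->next = free_list;                      /* return for prompt reuse */
        free_list = f;
    }

    int main(void) {
        struct frame *a = frame_alloc();
        frame_free(a);
        struct frame *b = frame_alloc();          /* the same storage comes back */
        printf("%s\n", a == b ? "reused" : "fresh");
        return 0;
    }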

In contrast, if you have something ready to burn a lot of cycles over lots of data, and it's best to keep going continuously while you are CPU bound, then a more conventional sort of thread architecture like that used by C++ is likely going to run much faster. In that context, fine-grained context switching and heap allocation are just going to slow it down.

If you can't declare or control which sort of situation is dominant, and the language and runtime get to pick for you, maybe your code will get assigned the badly fitting variant. But I'm not big on language tech with no knobs because devs are not supposed to care.

I consider your second paragraph to be putting me on a position that I'm going to ignore. (Note I don't want to hijack the topic, so we can start a new discussion if this needs further exchanges.)

RAII and Async Stackless Fibres

I want to understand why RAII might have problems with async stackless fibres, as they are both language features I am interested in, and interested in using in the same language. I have started a new topic as requested.

The branch prediction being

The branch prediction being added to processors these days is clearly aimed at Javascript and other interpreted languages. C is just a baseline.

Not a rhetorical question:

Once you have code generators, is there any advantage to interpreters? Say you have an always-available JIT with some amount of staging; is there any point in not using the machine target?

Maybe heavily compressed code? I wonder what I could learn from this: Binary lambda calculus [Wikipedia]

Controlled Generic Elaboration

You can do runtime code profiling, but the counters cost you performance, so it's not always a win. I prefer static code profiling, which only seems to work well with GCC at the moment. I have some built-in profiling code paths that do things like run the inner event loop a million times with random events with a realistic distribution; I run this with profiling on, then feed the dumped data back into GCC to let it optimise branch layout, etc.

If you have ever done optimisation, you may have noticed that branch layout has a big effect on performance. Doing this profiling step means you don't have to care about branch layout any more, which means you can organise code for best readability and let the profiling take care of the runtime layout.
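
For anyone who wants to try this, the workflow described above is standard GCC profile-guided optimisation; a rough sketch follows. The -fprofile-generate / -fprofile-use flags are real GCC options, while the event codes, the handler, and the synthetic 80/15/5 distribution are made up for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Build and run roughly like this:
     *   gcc -O2 -fprofile-generate vm.c -o vm_prof    (instrumented build)
     *   ./vm_prof                                     (writes *.gcda profile data)
     *   gcc -O2 -fprofile-use vm.c -o vm              (branch layout guided by it)
     */

    enum { EV_READ, EV_WRITE, EV_TIMER };

    static long counts[3];

    static void handle(int ev) {
        switch (ev) {               /* the branch layout here is what PGO tunes */
        case EV_READ:  counts[EV_READ]++;  break;   /* hot path  */
        case EV_WRITE: counts[EV_WRITE]++; break;   /* warm path */
        case EV_TIMER: counts[EV_TIMER]++; break;   /* cold path */
        }
    }

    int main(void) {
        srand(1);
        for (long i = 0; i < 1000000; i++) {       /* synthetic profiling driver */
            int r = rand() % 100;                  /* skewed "realistic" mix */
            handle(r < 80 ? EV_READ : r < 95 ? EV_WRITE : EV_TIMER);
        }
        printf("%ld %ld %ld\n", counts[0], counts[1], counts[2]);
        return 0;
    }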

I think there is little reason to use JIT compiling, except in one circumstance: a block of code that has several infrequently changing parameters and several frequently changing ones. Even then you could build several versions statically for the different use cases, so the infrequently changing parameters would need to have a large, unpredictable state space for a JIT to really pay off. One example is the number of cores in a machine (CPU or GPU), as this could be anything from 1 to a few thousand on modern desktops.

So in my own work I am favouring a kind of advanced generic elaboration, rather than a full JIT. The byte-code to machine-code translation still happens at runtime, but the programmer has control over when it happens. The code is not profiled at runtime, but rather you ask for an instance of X optimised for certain parameters.
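
A minimal sketch of what "asking for an instance of X optimised for certain parameters" could look like, using the core-count example from above; the macro-based elaboration and all the names are illustrative, not the mechanism actually used in the work described.

    #include <stdio.h>

    /* Each expansion elaborates the generic routine for a fixed NCORES, so the
       compiler sees a compile-time constant and can unroll/vectorise; a real
       version would hand each partial sum to a worker. */
    #define DEFINE_SUM_KERNEL(NCORES)                                       \
        static double sum_##NCORES##_cores(const double *x, long n) {       \
            double acc[NCORES] = { 0 };                                     \
            for (long i = 0; i < n; i++)                                    \
                acc[i % NCORES] += x[i];                                    \
            double total = 0;                                               \
            for (int c = 0; c < NCORES; c++)                                \
                total += acc[c];                                            \
            return total;                                                   \
        }

    DEFINE_SUM_KERNEL(2)   /* instance elaborated for 2 cores */
    DEFINE_SUM_KERNEL(8)   /* instance elaborated for 8 cores */

    int main(void) {
        double xs[] = { 1, 2, 3, 4 };
        int cores = 8;   /* the programmer asks for a variant, not a profiler */
        double (*sum)(const double *, long) = cores >= 8 ? sum_8_cores : sum_2_cores;
        printf("%g\n", sum(xs, 4));
        return 0;
    }

In the runtime version described above, the specialised instance would be generated when the programmer requests it rather than ahead of time, but the shape of the interface is the same.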

Programmer controlled specialisation

So in my own work I am favouring a kind of advanced generic elaboration, rather than a full JIT.

That's exactly what I'm encouraging people to do for my own performance sensitive workload, even with an AOT compiler. It's too easy to specialise too much, or on the wrong parameter; the result is wasted I$ (the L1I is particularly important; only the first level of cache has auxiliary info to accelerate/memoise decoding) for little to no performance gain.

A common pattern here (and I hear game devs do similar things) is that computation kernels emerge in performance sensitive code, and we'll work very hard to make them fast, while the rest of the code will mostly string kernels together. The driver code ends up being a (specialised) interpreter, and I definitely wouldn't want a JIT compiler to inline all that or convert loops over a compact bytecode to a sequence of machine code calls, *even if it's always the same bytecode*.
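
A minimal sketch of that pattern, with made-up kernels and a made-up plan format, just to show why JIT-inlining the driver buys little when nearly all the time is inside the kernels:

    #include <stddef.h>
    #include <stdio.h>

    enum step_op { STEP_SCALE, STEP_OFFSET, STEP_DONE };

    struct step { enum step_op op; double arg; };

    /* The kernels: this is where the hand-tuning effort goes. */
    static void k_scale(double *x, size_t n, double a)  { for (size_t i = 0; i < n; i++) x[i] *= a; }
    static void k_offset(double *x, size_t n, double b) { for (size_t i = 0; i < n; i++) x[i] += b; }

    /* The driver: effectively a tiny specialised interpreter over the plan. */
    static void run_plan(const struct step *plan, double *x, size_t n) {
        for (; plan->op != STEP_DONE; plan++) {
            switch (plan->op) {
            case STEP_SCALE:  k_scale(x, n, plan->arg);  break;
            case STEP_OFFSET: k_offset(x, n, plan->arg); break;
            case STEP_DONE:   break;
            }
        }
    }

    int main(void) {
        double data[] = { 1, 2, 3 };
        const struct step plan[] = {
            { STEP_SCALE, 2 }, { STEP_OFFSET, 1 }, { STEP_DONE, 0 }
        };
        run_plan(plan, data, 3);
        printf("%g %g %g\n", data[0], data[1], data[2]);   /* 3 5 7 */
        return 0;
    }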

That makes me say the question of always JITting versus using an interpreter depends on the VM. If the VM dispatches to large operations that are already specialised for the workload, JITting is probably a bad idea, especially if it's a naïve JITC. Otherwise, it's probably useful. However, when naïve JITting is useful, I have to wonder how much comes from machine-level improvements like regalloc, versus eliminating genericity (e.g., dynamic dispatch and boxing) overhead, and ensuring basic blocks fully use cache lines (assuming a calling convention that uses a lot of registers).

The only advantage of

The only advantage of interpreters is ease of implementation. A compiler that only does local optimizations and uses a cheap register allocation and instruction selection algorithm can compile very fast.

I don't think this is quite

I don't think this is quite true, or V8 could have stuck with a cheap compiler rather than go back to an interpreter for cold code.

And I find dynamic compilation to be much easier to work with than writing an interpreter, especially with high level toolkits like the DLR for .NET.

Tradeoffs

The only advantage of interpreters is ease of implementation. A compiler that only does local optimizations and uses a cheap register allocation and instruction selection algorithm can compile very fast.

I find that answer, or at least its formulation, slightly misleading.

It is true that, given an unlimited amount of effort, PL implementation technologies generally converge to some machine-code generation scheme that will outperform pure interpretation*. Yet we rarely have the luxury of investing 100+ man-years in a project, and it would be more interesting to compare the performance and features of interpreters and compilers developed using a fixed, finite amount of work.

A compiler that only does local optimizations and uses a cheap register allocation and instruction selection algorithm can compile very fast.

Right, but you could have invested the same amount of effort it took to write your simple-but-non-trivial compiler in an interpreter, and obtained an optimized interpreter that would be more portable and much easier to embed. Whether this is preferable to higher performance of course depends on what your high-level goals are.

Also, as Sean remarks, even high-profile implementations of so-called "dynamic" languages tend to keep an interpreter around, because of its distinct performance characteristics compared to the more heavyweight JITs.

*: A possible counter-example has recently been discussed on LtU.

Infinite man-years (or Mike Pall) interpreters

Note that, at the other end of the spectrum, unreasonably elaborate interpreters also exist. They give up on many of the advantages associated with interpreters (ease of development and portability) in exchange for excellent speed.

Rumors are that v8 is replacing its baseline compiler with an interpreter named Ignition, which could become one of those examples. In the meantime, look at the LuaJIT2 interpreter, written in assembly by Mike Pall, as described:

Performance

Given the same source language, is there any evidence that a JIT compiled version can be faster than a static compiler, or are we safe to say performance is not a reason to use a JIT over an AOT compiler?

There would appear to be an advantage for tight loops, but then the JIT cost might be greater than the gain in performance in the loop. (For example, 1 second to JIT while saving 100ms per execution would require the code to be executed more than 10 times to win.) Latency may also be more of a problem: the user may not notice that it takes 100ms longer every time they click "do-it", but they do notice the 1-second pause the first time they click it.

Compiling dynamic languages

Given the same source language, is there any evidence that a JIT compiled version can be faster than a static compiler, or are we safe to say performance is not a reason to use a JIT over an AOT compiler?

I guess that the lack of high-performance static compilers for Javascript answers your question. It is folklore that the semantics of dynamic languages are such that efficient implementations have to rely on runtime information.

Are prototypes the problem?

I am not sure it is the dynamic nature of JavaScript that is the problem. Compiling Python to native seems to perform better than interpreters. To me it seems the prototype object model is the problem for JS. It is interesting that the prototype object model, originally chosen because it offered a performance advantage to interpreters, seems to cause problems for compilers, as you cannot optimise for a static class hierarchy.

Meh

Python did not get commercial pressure to vastly increase its performance, at least not to the level Javascript did. Even for Python, PyPy realizes vast performance improvements over the existing interpreter. Finally, Python is arguably less dynamic than Javascript: in general object fields are fixed at create time, for example.

That said, many techniques originally designed for JITs could possibly be applied in AOT mode. For example, I don't think polymorphic inline caches are fundamentally a JIT technology.

PICs might not be

PICs might not be fundamentally a JIT technology, but they and JITs complement each other greatly. The point of a PIC isn't just to speed up a virtual call that turns out to be static at run time, but also to let the JIT optimize the code around it using the knowledge of which method will be called. For example, a virtual call in a loop might turn out to be loop-invariant once a particular method has been inlined.
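
To make the mechanism concrete, here is a monomorphic inline cache sketched in plain C; a real PIC caches several receiver classes per call site and, as noted above, its main payoff comes from letting the JIT inline through it, which a static sketch like this cannot show. All names are illustrative.

    #include <stdio.h>

    struct klass { double (*area)(const void *self); };
    struct object { const struct klass *k; };   /* shared header, as in typical C VMs */

    struct circle { const struct klass *k; double r; };
    static double circle_area(const void *self) {
        const struct circle *c = self;
        return 3.14159265 * c->r * c->r;
    }
    static const struct klass circle_klass = { circle_area };

    /* One cache per call site: remember the last receiver class and its target. */
    struct call_site {
        const struct klass *cached_k;
        double (*cached_fn)(const void *);
    };

    static double call_area(struct call_site *site, const void *obj) {
        const struct klass *k = ((const struct object *)obj)->k;
        if (k == site->cached_k)            /* fast path: class check + direct call */
            return site->cached_fn(obj);
        site->cached_k = k;                 /* miss: do the lookup, fill the cache */
        site->cached_fn = k->area;
        return k->area(obj);
    }

    int main(void) {
        struct circle c = { &circle_klass, 2.0 };
        static struct call_site site;       /* per-call-site state, initially empty */
        printf("%g\n", call_area(&site, &c));   /* first call fills the cache */
        printf("%g\n", call_area(&site, &c));   /* second call takes the fast path */
        return 0;
    }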

Object fields

Finally, Python is arguably less dynamic than Javascript: in general object fields are fixed at create time, for example.

In Python's new-style classes, x.y is just sugar for x.__dict__['y'], so the namespace of each object is dynamic at run time. Old-style classes are still used as an explicit optimization technique because they fix the fields. In general, most objects in a Python runtime boil down to a dict with dynamic contents eventually.

Look at 'cc

Have a look at Dawson Engler's work on 'C (tick-C). There are fairly compelling reasons to think that more continuous compilation models have real benefit.

Latency is solved

Haven't all these criticisms already been addressed for one or two decades, or is there more to it? The latency problem was real, but HotSpot solved that part. Instead of compiling code the first time you execute it, code is only compiled when it has been executed enough (and, I'd hope, in a separate thread); before that, either its bytecode is interpreted (HotSpot) or it is compiled with fewer optimizations (as in Google's V8). As usual with HotSpot, this might well have been done before in Self/Smalltalk/Strongtalk.

Compiler size

I don't think so. A compiler doing any kind of decent optimisation will still use resources, use lots of memory, etc. I think I am less concerned about two-stage compilers (with an intermediate byte code).

LLVM is a good example, as I think it started as a bytecode VM and JIT compiler, but has now almost exclusively become an AOT compiler. Why is this?

Language choice seems a threat to validity

This sounds like good news. But I wonder (1) whether the improvements in the latest processors already justify ignoring the folklore, and (2) whether the chosen languages (especially Python) are overly favorable to the thesis.

(1) It sounds like we'll stop needing jump threading on x86 once Nehalem is out of use. Since Sandy Bridge is only from 2011, I'm not sure we're at that point yet. But time will take care of this.
(2) More importantly, CPython is not a very efficient interpreter, so jump threading has less impact there than elsewhere: more time is spent on the individual instructions. IIRC, more than one attempt at adding indirect threading to CPython (https://bugs.python.org/issue1408710) failed to give an interesting performance improvement (until Python 2.7, https://bugs.python.org/issue4753).
But it looks like, even after considering this, the general conclusions are still valid — that's good.