## Language features for tracing JIT?

Are there any special programming language features ("superpowers") that are made possible by specifically exploiting a tracing JIT compiler?

It seems to me like tracing JITs are often developed for existing languages (JavaScript, Python, Lua, etc) and tend to focus on efficiently compiling the existing code bases that are written with their traditional programming idioms. I am interested in the opposite approach: what programming idioms and language extensions can help programmers to make the most of their tracing JIT compiler?

I can offer one candidate. Tracing JIT compilers can choose to specialize machine code based on the exact values of parameters that are only available at runtime, and the language can facilitate "hinting" whether a certain variable should be specialized on its value (e.g. generate machine code for the specific case that i = 42) or on its type (e.g. generate machine code for the general case that i is an integer). Often an operation can be compiled much more efficiently when a parameter is constant - for example the size argument to a memcpy or memcmp - and so a potential "superpower" is to apply these optimizations automatically, without having to statically generate code that anticipates each possible value.

Is that a legitimate superpower? What are some others? What tricks can tracing JIT users exploit to leave C/C++ hackers eating their dust? :-)


### lancet

See https://github.com/TiarkRompf/lancet and various papers by Rompf.

### re: lancet

Thanks for that link, Chris! Lancet seems like a bullseye on the same goal as mine. Do you know of any papers in particular? (The link on the README is broken.)

### Converting data dependencies into control dependencies FTW

I would like to propose that the simple act of specializing machine code on a value can improve performance, even if this requires more instructions, and even if the compiler does not specifically take advantage of the value for further optimizations like constant folding. The reason is CPU architecture. Specifically, the specialization can convert an expensive data dependency into a cheap control dependency.

Consider a program that wants to load a word from memory and do some computation with it:

mov reg, [addr]
...


Simple: Load reg with the word and then reference that from the main code (...).

Now consider a version that is specialized for the case where the value in memory is 42:

mov reg, [addr]
cmp reg, 42
jne ABORT
mov reg, 42
...


Which code is more efficient? I think the full answer is "it depends", but the second version has the nice property that the main computation (...) can execute without waiting for the memory load to complete. This can be significant because the memory load potentially has high latency, e.g. tens or even hundreds of cycles.

On an out-of-order processor, like mainstream x86, the CPU would actually split the second version into two parallel instruction streams:

mov reg, [addr]     |     mov reg, 42
cmp reg, 42         |     ...
jne ABORT           |


On the left we execute the guard, including the memory load, and on the right we execute the main computation (...) independently. The memory load won't contribute to the overall latency of the program unless it takes longer than the entire ... block (or unless the extra instructions create too much contention for execution resources i.e. a throughput problem.)

Does this make sense? Is conversion of data dependencies into control dependencies a potential "superpower" of tracing JITs?

This idea originated while micro-optimizing some LuaJIT code and speculating about why some code with a lot of extra guard instructions ran so fast: thread. I think that in the wild it is quite common for values to be predictable at runtime but not known at compile time, e.g. anything you would put into a configuration file.

### Multi-path execution

I don't think that Intel is doing multi-way speculation on branches. It's been a while (4 generations?) since I looked at low-level execution on Intel, though - if it's changed in their more recent generations a link would be appreciated.

Not all superscalar processors execute multiple paths ("multiple streams") when they hit a branch. It is more common to use branch prediction to select one path and speculatively execute instructions from that path. If you are running micro-benchmarks in a tight loop then the branch predictor's tables will not overflow, and in this case you will observe perfect branch prediction. The effect is the same during benchmarks: the ... code will execute with zero delay, as none of the preceding instructions imposes a dependency. During non-benchmark execution the code may behave differently depending on the size of the working set and contention in the branch prediction table, *in addition* to any variance in the near-constant value.

I think that what you are describing is well known - the aim of tracing is to place all exceptional behaviour outside of the trace to produce a straight-line code sequence. These branch free sequences will exhibit minimum latency. The value not holding its "typical" value is an exceptional case. It is possible to improve further: if the block of code that uses the value does not depend on any other input (i.e. it is a static computation, but with regard to the program lifetime rather than compilation) then the result can be memoized so that any repetition is free. If there are external inputs to the block then strength reduction may improve performance, e.g. partial evaluation / incremental computation.

Edit: Forgot to mention - there is some information about the Haswell performance model in http://www.agner.org/optimize/microarchitecture.pdf

### I didn't state all of my

I didn't state all of my assumptions. I am assuming that the CPU is predicting guards as being satisfied and that these predictions are coming true. So only one path of speculative execution is needed, but I am not accounting for mispredictions in my simple model.

### I found it a refreshing post

Hmpf. I found it a refreshing post and I learned something from it.

Keep it up!

### That is probably true - the

That is probably true - the code in your example and in the thread that you linked to both have a common path where the branch is not taken. The compiler has laid out the code so that the exceptional case is taking the branch. This matches the default estimate in the Haswell architecture.

In general, if the most common path meets all the guards, and each guard branches off the path when it does not match, then the predictor will estimate the most common path. In this case the instructions will be speculated without any performance penalty.

### Inline value caches don't need tracing

It seems to me the next step with that trick is to go for a (polymorphic) inline value cache. It's silly to duplicate the tail of the trace all for a codegen hack that is trivially convergent:

mov r1, [addr]
mov r2, $expected
cmp r1, r2
jne next            ; (predictable) jcc instead of cmovcc because
mov r1, r2          ; the latter isn't predicted.
next:


The runtime can hotpatch the $expected value, even with multithreaded execution, and the conditional jump target can sometimes be a statistic-gathering subroutine to update the expected value.

I don't see that as a tracing JIT super-power; if anything, it's a nice way to steal some of tracing's good fit with contemporary microarchitecture and use it in any compiler (even ahead of time!). It's only when the code following the load is specialised to the value that tracing really shines, but that comes with the usual code duplication downside.

### see: GraalVM - New JIT Compiler and Polyglot Runtime for the JVM

The Graal project implements most of your ideas. It's from Oracle, but the papers are a treasure trove.

http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index.html

GraalVM - New JIT Compiler and Polyglot Runtime for the JVM
Graal is a new just-in-time compiler for the JVM focused on peak performance and multi-language support. Graal offers performance advantages not only to Java code, but also to scripting languages such as JavaScript, Ruby, and R. Additionally, it enables the execution of native code on the JVM via an LLVM-based front end (project Sulong). Languages are executed by Graal via the Truffle framework, which comes with seamless interoperability and polyglot debugging and profiling functionality.

Usage
GraalVM requires the JAVA_HOME environment variable to point to a JVMCI enabled Java Development Kit or Java Runtime. Oracle Labs JDK is a JVMCI enabled version of JDK 8 which can be used to run GraalVM. Alternatively, a JVMCI enabled JDK or JRE can be placed in the folder ./jdk or ./jre to override the JAVA_HOME environment variable. Please note that JDK 9 builds are not yet compatible. The bin directory contains:

- `java` - Runs the JVM with Graal as the default compiler.
- `graalvm` - Runs the JVM in polyglot mode with JavaScript, Ruby, and R as available languages.
- `js` - Runs a JavaScript console with Graal.js.
- `node` - Drop-in replacement for Node.js with Graal.js as the executing JavaScript engine.
- `ruby` - Drop-in replacement for MRI Ruby with TruffleRuby as the executing Ruby engine.
- `R` - Drop-in replacement for GNU R with FastR as the executing R engine.
- `aot-image` - Builds an ahead-of-time compiled executable or a shared library from Java programs, and the JavaScript and Ruby languages.

Benefits

- Performance - Graal incorporates our research on compiler technology, and offers better peak performance on some workloads than a traditional JVM.
- Polyglot - Java, JavaScript, Ruby, and R are all available at competitive performance within the same execution environment.
- Interoperability - Languages executed in Graal can call each other without overhead, and libraries from other languages can be used.
- Embeddability - Embed dynamic languages and native code with sandboxing capabilities.
- Tooling - Graal benefits from JVM-based tooling, and all languages share common tooling such as debugging and profiling.