## Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development

Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development
Kunshan Wang, Yi Lin, Stephen Blackburn, Michael Norrish, Antony Hosking
2015

Many of today's programming languages are broken. Poor performance, lack of features and hard-to-reason-about semantics can cost dearly in software maintenance and inefficient execution. The problem is only getting worse with programming languages proliferating and hardware becoming more complicated. An important reason for this brokenness is that much of language design is implementation-driven. The difficulties in implementation and insufficient understanding of concepts bake bad designs into the language itself. Concurrency, architectural details and garbage collection are three fundamental concerns that contribute much to the complexities of implementing managed languages. We propose the micro virtual machine, a thin abstraction designed specifically to relieve implementers of managed languages of the most fundamental implementation challenges that currently impede good design. The micro virtual machine targets abstractions over memory (garbage collection), architecture (compiler backend), and concurrency. We motivate the micro virtual machine and give an account of the design and initial experience of a concrete instance, which we call Mu, built over a two year period. Our goal is to remove an important barrier to performant and semantically sound managed language design and implementation.
Inside you will find the specification of an LLVM-inspired virtual instruction set with a memory model (enables proper GC support) including a specification of concurrent weak-memory operations (reusing C(++)11, a debatable choice), relatively rich control-flow primitive (complete stack capture enabling coroutines or JIT-style de-optimization), and live code update.

## Comment viewing options

### The C++ weak memory operations are a good thing

If you are going to lock out low level memory access from the system a-priori then you're making the system useless to high performance needs, a-priori.

### Weak memory models tend to be broken

They are known issues with the way the C(++)11 memory-model is defined, and I think precisely how to fix it is an open research topic.

Common compiler optimisations are invalid in the C11 memory model and what we can do about it
Viktor Vafeiadis, Thibaut Balabonski, Soham Chakraborty, Robin Morisset, Francesco Zappa Nardelli
2015

We show that the weak memory models introduced by the 2011 C and C++ standards does not permit many of common source-to-source program transformations (such as expression linearisation and "roach motel" reordering) that modern compilers perform and that are deemed correct. As such it cannot be used to define the semantics of intermediate languages of compilers, as, for instance, LLVM aimed to. We consider a number of possible local fixes, some strenghtening and some weakening the model. We evaluate the proposed fixes by determining which program transformations are valid with respect to each of the patched models. We provide formal Coq proofs of their correctness or counterexamples as appropriate.

The Problem of Programming Language Concurrency Semantics
Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell
2015

Despite decades of research, we do not have a satisfactory concurrency semantics for any general-purpose programming language that aims to support concurrent systems code. The Java Memory Model has been shown to be unsound with respect to standard compiler opti- misations, while the C/C++11 model is too weak, admitting undesirable thin-air executions.

Our goal in this paper is to articulate this major open problem as clearly as is currently possible, showing how it arises from the combination of multiprocessor relaxed-memory behaviour and the desire to accommodate current compiler optimisations. We make several novel contributions that each shed some light on the problem, constraining the possible solutions and identifying new difficulties.

First we give a positive result, proving in HOL4 that the exist- ing axiomatic model for C/C++11 guarantees sequentially consistent semantics for simple race-free programs that do not use low-level atomics (DRF-SC, one of the core design goals). We then describe the thin-air problem and show that it cannot be solved, without restricting current compiler optimisations, using any per-candidate-execution condition in the style of the C/C++11 model. Thin-air executions were thought to be confined to programs using relaxed atomics, but we further show that they recur when one attempts to integrate the concurrency model with more of C, mixing atomic and nonatomic accesses, and that also breaks the DRF-SC result. We then describe a semantics based on an explicit operational construction of out-of-order execution, giving the desired behaviour for thin-air examples but exposing further difficulties with accommodating existing compiler optimisations. Finally, we show that there are major difficulties integrating concurrency semantics with the C/C++ notion of undefined behaviour.

We hope thereby to stimulate and enable research on this key issue.

### I've spent a lot of effort

Understanding, if not the abstract C++11 memory model that your paper finds fault with, but the possible mappings of that to real architectures and how those work. It was painful, partially because the mappings to different architectures have such horribly different effects on program speed that in the end you'd never pick the same algorithm for x86 and ARM even though they'd both run the same C++11 code correctly (because you'd never be writing at the very hard to analyse shared memory model level if you didn't need performance).

But I'm not sure that the paper found bugs in C++11 that apply to these real processors in practice. That is to say, even if the ideal standard doesn't work, what you get after the mapping to real processors maybe does work. For their first example they said exactly that - the bug would never show up on x86 or ARM or Power. Perhaps they're all like that.

In any case, even if a number of optimizations had to be disabled around atomic accesses, that isn't a problem either.

:/ an odd fact is that Microsoft's implementations of these have been so bad as to be called pernicious. Except for their very latest compiler (is it out of beta yet?) they had all atomic variables routed through interlocked instructions even in the case of relaxed memory order (let alone the ones that should only have been interlocked on write) making some programs a couple orders of magnitude slower, and some basic calls in the library, a swap on a small non-int or something didn't work, oh and they synchronized variable initialization, though the standard said not (once again wasting time).

Existing processors either have memory models that never get as weak as relaxed, or they have synchronization instructions that do much more than necessary.

If the actual c++11 model is broken I wonder if the hardware version of it that ARM was implementing (last I heard) is broken. I would guess not. Processor engineers are used to digging down to a level of detail about read and write queues and collisions that the C++11 model avoided, and making sure that every needed case can be covered, in some way or other.

### You cannot conflate hardware and software memory models

You speak of hardware and software memory models as if they were the same thing. The elephant in the room is the compiler, which relies on the memory model of its language to enable or disable optimization. Correct behavior of the hardware in presence of the right fence instructions is irrelevant if your compiler erroneously reorders instructions in a way that changes the semantics of the program. This is why the memory model in the language specification should have good properties: it is the only implementation-independent contract to which compiler authors are bound (... in theory).

When you write "I'm not sure that the paper found bugs in C++11 that apply to these real processors in practice", you seem to assume that compilers will perform a direct translation of the program to the assemblers of this processor, which is (1) probably not true in practice today (2) bound to become false in the future as the pressure to optimize code remains.

There has been a lot of work on modeling and testing the memory model exposed by the hardware (even those have bugs, some of which have been found, and I would not blindly trust ARM processors). But this only gives a memory model to assembly programs for this particular processor. Defining a memory model to a full programming language is even harder, and the C(++)11 one is known to be unsatisfactory; you cannot just say "well defects are unproblematic as long as the hardware does the right thing".

Herding cats: Modelling, Simulation, Testing, and Data-mining for Weak Memory
Jade Alglave, Luc Maranget, Michael Tautschnig
2014

We propose an axiomatic generic framework for modelling weak memory. We show how to instantiate this framework for SC, TSO, C++ restricted to release-acquire atomics, and Power. For Power, we compare our model to a preceding operational model in which we found a flaw. To do so, we define an operational model that we show equivalent to our axiomatic model.

We also propose a model for ARM. Our testing on this architecture revealed a behaviour later acknowledged as a bug by ARM, and more recently 31 additional anomalies.

We offer a new simulation tool, called herd, which allows the user to specify the model of his choice in a concise way. Given a specification of a model, the tool becomes a simulator for that model. The tool relies on an axiomatic description; this choice allows us to outperform all previous simulation tools. Additionally, we confirm that verification time is vastly improved, in the case of bounded model checking.

Finally, we put our models in perspective, in the light of empirical data obtained by analysing the C and C++ code of a Debian Linux distribution. We present our new an alysis tool, called mole, which explores a piece of code to find the weak memory idioms that it uses.

the paper. I just thought I'd get chatty enough to mention that I remember one researching finding an odd timing and getting verification from an Intel engineer that there exists some odd conditions where their processors detect some sort of memory problem (cache inconsistency?) go into a panic mode, throw away the results of an instruction and start over. I thought that was funny, but comforting. At least they go through the trouble of putting supreme effort into an edge case.

### Hardware vendors

The authors of the Herding Cats paper that gasche quoted above provide a summary of their experimental results on the paper's website. The tables clearly show that not all processor vendors are as rigorous as Intel.

### I wonder if they'd be open to adding a stackless mode

I have interest in languages that require full continuations.

### wrong metaphor...

It seems to me that this is not "draining the swamp."

This is more like building a houseboat. Separating the foundations entirely from the surly bonds of earth and trying to create stability enough for a house to stand without any connection to the ground save perhaps a mooring.

And it may be that the resulting structure requires less maintenance, leaks less, etc. It may even be less prone to flooding (The equivalent of which it redefines as "sinking") than a house built on a conventional foundation in swampy ground.

But the swamp is still there. And the alligators (and other things with lots of sharp pointy teeth) still live in it.

### I meant to comment on this,

I meant to comment on this, so thanks for reminding me of the paper gasche! I think the basics of the Mu language look decent. I always have two goto examples for low-level languages no safe system can safely handle: tagged pointers, and variable length structs like those used in hash-array mapped tries.

Mu includes its own tagged pointers, so there isn't much flexibility there. For instance, you can't safely express Haskell's sophisticated tag encoding optimizations, because this requires knowledge of the referrant's alignment to know those bits are unused, and Mu doesn't seem to handle alignment.

I think "hybrid" properly handles the variable length struct scenario, so that's good.

I'm undecided whether I like the idea of separating thread and stack into two abstractions, rather than simply exposing a symmetric coroutine abstraction. I don't have a sufficient grasp of what's available through the separation that wouldn't be available through actual coroutines. On-stack replacement seems like the only awkward fit.

Ommitting unions seems sensible for simplicity, but I can't help but feel like there ought to be a simple, safe encoding of these too. As long as pointer bits don't overlap with value bits, unions seem inherently safe. Mu still supports casting, but I think casting ops can be eliminated with proper support for unions. For instance, with knowledge of alignment, you can have parts of pointer bits safely overlap with value bits.

### Universal VM?

We were discussing, elsewhere, whether a universal VM is possible. Let's try to recap and resume here. I'll start posting subthreads.

### Control primitives, jumps, tail calls

Here we were discussing whether an universal VM can provide enough control primitives. Some said it's impossible, while I claimed that (IIRC) tail calls (and continuation-passing style) can do the job. (In fact, direct support for tail calls and continuations is an alternative that can be faster).

### Types in VMs?

People also debated whether such VMs need types. Somebody mentioned memory safety, but I'm missing a compelling case for requiring it in a low-level VM. I think one point for safety is that a (verified) memory-safe VM could remove the need for process separation: can't you run all your programs in the same JVM if memory safety prevents interferences anyway, reducing context switches?

### Somebody mentioned memory

Somebody mentioned memory safety, but I'm missing a compelling case for requiring it in a low-level VM.

1. To reveal bugs in your compiler.
2. To reveal bugs in your runtime, ie. GC, codegen, runtime codegen, etc.
3. To protect a language's abstractions from foreign code also running on the VM (process isolation like you said) -- web browsers are the most widely used examples.