Redesigning the Mozart virtual machine

Dear all,

The virtual machine of the Mozart system is in dire need of a refresh! It was originally designed in the early 1990s (twenty years ago!) and is now showing its age. We are working on a new virtual machine that is ready for the future. But some of the issues are still not clear, in particular how to represent the data. Yves Jaradin has prepared a detailed discussion of the issues.

We would like to ask advice from the experts on LtU. What is the best way to write a virtual machine today? Is it possible to be simple and still efficient? Please let us know what you think.

- Peter Van Roy and the Mozart development team

docs as you go

I may read up on what you're looking for, since I find VM design a most entertaining sort of topic in computing.

As an aside, I recommend docs as you go. When I work on something complex over many months, I'm always sorry I didn't write a plain-language summary of every single thing to gloss intentions -- and, indirectly, some of the assumptions implied. It would help prevent only-the-guy-who-wrote-it-understands syndrome (if he still remembers).

Is "best" fastest execution, or other attributes too? I'll look for the answer in issue discussion. I bet one desired quality might be missing in issues: ability to tell what a system actually does when running (as opposed to what it's supposed to do).

I am a lighthouse, your call.

The discussion is on the representation of data. I don't have the feeling that twiddling bits to pack as many different basic types as possible into a word is still a relevant design. It may have been a good design when memory was scarce, but I think the added cost in complexity isn't worth it anymore. Maybe it would be a good approach to copy a/the Java memory model?

Memory isn't scarce anymore,

Memory isn't scarce anymore, but memory bandwidth is more scarce than ever and memory latency is higher than ever (relatively speaking).

Depends on the device

If this is a VM for a PC, then whatever. If it wants to target the newer computing platforms, then you definitely have to care...a lot.

It sounds like they are redoing their VM as a matter of maintenance, but to me redoing the VM is a chance to rethink everything.

Bandwidth

I agree with that, but a redesign comes at an implementation cost. For instance, moving the type bits into the pointer might be a good idea, until you realize it can cost you a year just to get all the invariants right. In essence, since some information is moved into the pointer, you'd probably be better off understanding the pointer as a record which essentially holds all the information of an object. I have no idea how that would work out in practice; I can imagine a number of optimizations, but I don't think it is worth the implementation effort. (How often do you access the RTTI in a typical program? Not a lot, I would gather. Though in OO it might speed up method calls a lot, obviously -- forgot that.)

Bandwidth is a very valid argument, which is why I think copying the Java model would probably work out best for Mozart, given the amount of effort that has gone into it, and I think they are comparable languages. I.e., it all depends on instruction and data throughput and the memory access patterns of typical programs, and I have little reason to believe that data would be very different for these two languages (although I don't know Oz).

Lastly, there is the added fact that Java is supposed to work on a variety of machines, which is what they also want for Mozart. While I know little about how Java works out on a 64-bit CPU, I haven't seen reports that it fails there. (I personally think that moving current-day CPUs to a pure 64-bit model was probably a mistake, but I have little evidence for that opinion.)

Support for yet-undefined data and unification

Something that may be very different from Java: Oz relies strongly on partially-undefined¹ data and unification, hence the emphasis in the document on "indirection nodes". I have looked for anything relevant in the Prolog literature, but I haven't found recent work on logic programming virtual machines.

¹: the partially-undefined aspect is maybe less important in Oz than in Prolog: if I remember correctly, the Oz-on-Scala port only allows unification of data with a totally-undefined variable, and this was said not to be very problematic in practice.

Toto, we're home.

I don't know Oz that well, but Prolog unification and what I thought Oz has -- delayed evaluation, or demand-driven evaluation -- are two completely different things. Can one unify two lists of partially known values in Oz with a simple equation? I haven't seen examples of that, but I agree you end up with totally different memory access patterns with Prolog unification.

partially known values

See for example the demand-driven production example in the Oz tutorial.

In this example, the consumer and the producer hold a shared undefined variable. When the consumer wants some input, it unifies this undefined variable with a cons cell of undefined components (undefined head and tail). It then uses the head of this list (which generally blocks until it is initialized by the producer), then recurses on the undefined tail. In parallel, the producer matches the shared variable against a cons cell pattern (which blocks while the variable is undefined) and, when the match succeeds, produces an item, initializes the head of the list with it, and then recurses on the tail. After one such step, both consumer and producer again hold a shared undefined variable.
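
For readers more comfortable with mainstream languages, here is a rough C++ analogue of that protocol (my sketch, not Oz semantics: Oz gets both roles from one shared dataflow variable plus blocking unification, whereas this analogue needs an explicit demand signal per step):

```cpp
#include <future>
#include <iostream>
#include <thread>
#include <vector>

// Each step pairs a "demand" promise (set by the consumer) with a "value"
// promise (set by the producer once it sees demand).
struct Step {
    std::promise<void> demand;   // consumer: "I need the next item"
    std::promise<int>  value;    // producer: the item itself
};

int main() {
    const int n = 5;
    std::vector<Step> steps(n);

    std::thread producer([&] {
        for (int i = 0; i < n; ++i) {
            steps[i].demand.get_future().wait();  // block until demanded
            steps[i].value.set_value(i * i);      // then, and only then, produce
        }
    });

    for (int i = 0; i < n; ++i) {                 // the consumer drives the pace
        steps[i].demand.set_value();              // announce demand
        std::cout << steps[i].value.get_future().get() << '\n';
    }
    producer.join();
}
```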

Truth emerges more readily from error than confusion

Well, I have a hard time understanding the example, but Prolog unification is -- or should be -- different, since through unification one can arrive at multiple answers to an underspecified problem. That 'feels' sufficiently different from what, going by your description, Oz has to offer.

EDIT: Anyway. A better explanation would be that Prolog is best seen as a symbolic equation solver which will supply one with all possible answers, often against a database of facts. Prolog is therefore sufficiently different from Oz, also in the memory access patterns of typical programs, that there is probably nothing to be learned from reading Prolog literature to optimize an Oz runtime.

No, I don't think it's

No, I don't think it's Prolog unification that leads to the non-determinism of having several solutions. If I remember correctly, it comes from the fact that several clause heads may match a given goal.

Besides, Oz also has constraint resolution mechanisms that can find multiple solutions, and backtracking search. But those are more specialized aspects of the language and I don't think they impact the representation discussion that much.

Indeed the example is not easy to understand if you're not familiar with Oz dataflow variables and partially defined structures. It's interesting, though, and my attempt at an explanation probably doesn't do it justice. I learned about that aspect of Oz in the glorious book Concepts, Techniques, and Models of Computer Programming, and I would recommend anyone to have a look at it if it's available in the local academic library.

CTM

I'd recommend every LtU'er buy a copy, read it once, and put it right next to SICP, TaPL, and other classics.

Largely correct

In general you are correct that overlapping heads produce multiple solutions. It sounds as if this was what Marco was reaching towards with under-specified equations; all Prolog programs can be seen as equations with associated Herbrand models. There is one specific case in which unification itself affects the solution set: an occurs check can rule out sets of solutions (such as X=f(X)), but most Prolog implementations omit it for efficiency (it makes unification too expensive). The description of infinite data structures in Oz sounds as if this behaviour is being used deliberately; refining the set of possible solutions provides communication between the processes.

Oz has both unification and demand-driven evaluation

Oz has full unification. Like in Prolog, it's possible to unify two large data structures (like two lists of partially known values) in one equation. Oz also has demand-driven evaluation. This is based on one simple primitive, WaitNeeded, which waits until its argument is needed by some operation.

However, in our experience, full unification is only needed for constraint programming. The Ozma language, an extension of Scala with Oz concurrency, only allows shallow unification (variable-variable and variable-value, not value-value) and this is enough for it to express all the important concurrent programming idioms of Oz.

The Java memory model is

The Java memory model is under-featured and inefficient due to the "everything is a reference" mentality, especially on modern machines, where every indirection might mean a lengthy cache or TLB miss, never mind the added GC pressure. That it works well now is a testament to the capabilities of the JVM writers and the many man-years put into it.

The CLR did better here by implementing value types, true generics, custom memory layouts and "unsafe" access (no bit-tagged pointers, though), although most of the applications requiring these capabilities are still written in C++.

Tagged pointers (these days, NaN-tagged doubles) are a well-known optimization technique for dynamically typed VMs, and are still used in very modern VMs such as V8, TraceMonkey and LuaJIT 2.0. They can be made reasonably "type safe" by writing the VM in C++ (vs. C).

If you are going to do runtime type checking in your VM (vs. vtable dispatch or specialization), tagged pointers are likely always a win.
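
For the curious, a minimal sketch of the NaN-tagging idea (my own illustration, assuming 64-bit values and pointers that fit in 48 bits, as on current x86-64; real VMs such as LuaJIT refine this scheme considerably):

```cpp
#include <cstdint>
#include <cstring>

using Value = uint64_t;

// Any bit pattern whose quiet-NaN bits are not all set is a plain double;
// patterns inside the NaN space carry a payload (here: a 48-bit pointer,
// marked by the sign bit). Genuine NaN results must be canonicalized
// before boxing so they don't collide with payloads.
static const uint64_t QNAN    = 0x7ff8000000000000ull;
static const uint64_t TAG_PTR = 0x8000000000000000ull;

inline Value box_double(double d)   { Value v; std::memcpy(&v, &d, 8); return v; }
inline bool  is_double(Value v)     { return (v & QNAN) != QNAN; }
inline double unbox_double(Value v) { double d; std::memcpy(&d, &v, 8); return d; }

inline Value box_ptr(void* p) {
    return TAG_PTR | QNAN | (uint64_t)(uintptr_t)p;
}
inline bool  is_ptr(Value v) { return (v & (TAG_PTR | QNAN)) == (TAG_PTR | QNAN); }
inline void* unbox_ptr(Value v) {
    return (void*)(uintptr_t)(v & ~(TAG_PTR | QNAN));
}
```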

Untagged pointers, tagged pointers, or JVM?

It seems that we need to make a decision between three possibilities: untagged pointers, tagged pointers, and targeting the JVM.
Here are the main issues:
* Untagged pointers have a memory hit, but permit architecture-level optimizations like prefetching, which are not possible with tagged pointers since the processor does not understand them. So the performance cost of the extra memory may not be so great.
* Untagged pointers may be best for exploring new language ideas. No reason to do premature optimization like tagged pointers.
* The JVM is widely used, supported, and compiled with JIT. We would not have to worry about those issues if we target it.
* The JVM is badly matched to the language ideas we want to explore (e.g., no support for dataflow variables, threads not lightweight enough). This will give us a performance hit, until (maybe?) the JVM/JIT advances enough to support these ideas.

v8 does not nan-box

V8 does not use NaN-tagged values. It has "small integers" (31-bit signed values) and a couple of other immediate values, all of which represent failure conditions (as it does not do nonlocal exits). Instead of representation tricks, V8 relies on type feedback to unbox values.

It's an attractive approach for its simplicity, but your baseline, non-optimized code will be slower than with the tagged-value strategy.
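
A minimal sketch of the smi scheme, as commonly described for the 32-bit layout (my illustration, not V8's actual source):

```cpp
#include <cstdint>

// Low bit 0: a 31-bit "small integer", stored shifted left by one.
// Low bit 1: a heap pointer (objects are at least 4-byte aligned,
// so a real pointer's low bit is free to serve as the tag).
using Word = intptr_t;

inline bool is_smi(Word w)         { return (w & 1) == 0; }
inline Word smi_from_int(int i)    { return (Word)i * 2; }  // caller ensures i fits in 31 bits
inline int  smi_to_int(Word w)     { return (int)(w / 2); } // exact: smis are always even
inline Word ref_from_ptr(void* p)  { return (Word)(uintptr_t)p | 1; }
inline void* ptr_from_ref(Word w)  { return (void*)((uintptr_t)w & ~(uintptr_t)1); }
```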

Azul JVM

I recall reading or watching a video of Cliff Click describing how the JVM tuned for Azul hardware uses some bits in 64-bit object pointers as a unique index for the class. This lets other things get optimized without paying for an indirection, e.g. the type check for polymorphic inline caching.
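
Something like the following, presumably (a hypothetical layout of my own; I don't know Azul's actual bit assignments):

```cpp
#include <cstdint>

// A class index kept in the otherwise-unused high 16 bits of a 64-bit
// reference; the type check for an inline cache then needs no memory load.
inline uint64_t make_ref(void* addr, uint16_t klass_id) {
    return ((uint64_t)klass_id << 48) | (uint64_t)(uintptr_t)addr;
}
inline uint16_t klass_of(uint64_t ref) { return (uint16_t)(ref >> 48); }
inline void* addr_of(uint64_t ref) {
    return (void*)(uintptr_t)(ref & ((1ull << 48) - 1));
}
```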

Some experts to talk to

I would first talk to Ian Piumarta (currently at Viewpoints Institute). He has been writing VMs for more than 20 years.

The other thing I have seen (from dynamic-language people) is that they seem to get a huge amount of mileage out of JIT compilation. The current feeling seems to be that making JITting easy (by having enough information around to do it efficiently and effectively) will gain much more speed than whatever other optimizations you do to your VM. Or at least that is what I got from discussions with some of the PyPy people. They are doing really good work -- I would talk with them as well.

If you don't want to contact these people 'out of the blue', I can make introductions -- but they are all nice, and I am sure they would be happy to talk to a fellow language implementor!

Out of Budget

I believe writing a somewhat decent JIT is not feasible given that they have only one man-year for the whole project.

VMKit

Might it be time to revisit LLVM and VMKit?

Excellent idea

Wish I had thought of it.

You misunderstood

The point was not to write a JIT, but to make writing a JIT easy (later). You can easily design a VM such that JITing is astonishingly hard; I am not sure anyone has yet seriously thought about designing a VM with a view to (later) having a JIT which will be easy and effective. Planning for 'effectiveness' requires having information available at the level of the VM which, in current VMs, needs to be inferred, sometimes at a great cost (and so it is not, since that cost is too high for JIT).

The million dollar question

The million dollar question is why go there at all? If you build on top of Javascript or the CLR (with the DLR if you want to be dynamic), then you outsource the JIT and GC to someone else (e.g., Google V8). You can always push your metadata down in an existing VM.

or a happy in-between?

That approach has yielded

That approach has yielded promising results with PyPy. That project's inner virtual machine is programmed in a language called RPython. You could write an Oz interpreter in RPython and reuse PyPy's infrastructure to create a VM from it.

It has some similarities to creating a compiler by partially evaluating an interpreter.

I was looking for this one

Thank you! I recently wanted to cite this paper -- in comparison to the current ongoing work of implementing languages on top of Javascript -- but was unable to find the reference.

Perhaps you can take

Perhaps you can take inspiration from Haskell; it has similar issues with representing values because all values are lazy. GHC primarily uses two methods to reduce the cost: (1) eliminate as much laziness as possible statically with strictness analysis; (2) use differently tagged pointers for thunks and evaluated values. Number (2) is already on your list. You could probably do something similar to strictness analysis for Oz dataflow variables and remove the extra cost where a variable is statically known to point directly to a piece of evaluated data.
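
A minimal sketch of the pointer-tagging idea in (2), assuming 8-byte-aligned closures so the three low pointer bits are free (my illustration; GHC's actual scheme carries more information in the tag):

```cpp
#include <cstdint>

// Tag 0 means "possibly an unevaluated thunk: enter it"; any nonzero tag
// records that the referent is already evaluated, so the caller can skip
// both the memory load and the indirect call.
struct Closure {
    Closure* (*enter)(Closure*);   // evaluates a thunk, returns the value
};

inline std::uintptr_t tag_of(Closure* p) { return (std::uintptr_t)p & 7; }
inline Closure* untag(Closure* p) {
    return (Closure*)((std::uintptr_t)p & ~(std::uintptr_t)7);
}

inline Closure* demand(Closure* p) {
    if (tag_of(p) != 0) return untag(p);   // tag proves: already evaluated
    return p->enter(p);                    // unknown: enter the closure
}
```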

Variable representation in dodo

I am planning on a mixed approach for dodo:

- Types which are known statically do not need a dynamic type pointer, so they do not get one.
- Dynamic types are represented as a type pointer and a value pointer, except for boxed static types, whose value can go directly in the object after the type pointer (their dynamic type is expected to be constant).
- Abstract state can be part of a dynamic type (i.e., different abstract states are represented by different types).
- For static types, abstract state can be represented by a tag of the appropriate number of bytes (generally one).
- If the abstract state depends on the value of the object as specified by rules, then it does not need any concrete representation (it can always be retrieved by applying the rules).
- Variables waiting to receive a value use abstract state to express this fact.
- Generally, try to be clever and reduce the size of variables -- e.g. a set which can have no more than 32 known elements can be represented by a 32-bit number (see the sketch after this list). However, I would not consider tweaking the low bits of a pointer, except maybe for a yes/no (1-bit) flag which I would need in every variable.
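
A toy version of that last point, assuming elements are drawn from a fixed universe of 32 known values indexed 0..31:

```cpp
#include <cstdint>

// A set over a universe of at most 32 known elements, packed into a
// single 32-bit word: one bit per element.
struct SmallSet {
    uint32_t bits = 0;
    void insert(int e)         { bits |= 1u << e; }
    void erase(int e)          { bits &= ~(1u << e); }
    bool contains(int e) const { return (bits >> e) & 1u; }
};
```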

Going to the JVM?

Some people suggest that we should target the JVM. But how can we implement dataflow variables in a JVM? The Flow Java implementation modifies the JVM for dataflow variables. Is it a good idea to target a modified JVM? We would have to track all the JVM updates, for one thing.

The Ozma implementation is an extension of Scala, but it targets the Mozart VM, not the JVM!

Should we take the performance hit of targeting the JVM, and hope that future JVMs will support something like dataflow variables?

You are going to take a

You are going to take a performance hit anyway with your own VM! You are assuming that the dataflow variable optimization in your custom VM will mitigate the disadvantages of not using an existing, heavily optimized VM that is maintained by someone else, but this isn't obvious by any means. You might find that doing what you can on an existing VM still winds up being faster than your custom VM. You might even find that a non-invasive, pure-library approach is faster than modifying the VM. Don't assume anything.

Or just copy the design

Well, I suggested that you simply copy most of the design decisions of a/the JVM at minimum, or target it directly. I thought Oz isn't too different from Java, and that this would therefore probably be the best thing to do.

Prefetching looks like a good argument not to have tagged pointers. (Though in a virtual machine that advantage may be lost...)

Implementing dataflow variables on the JVM directly? Proxy objects, or just a flag in each object saying whether its fields are defined (in case the type is always known), or -- probably best -- just overwriting the entry in a table which holds the (variable) environment?

It may be a deal breaker, that's true. Then again, you might do something costly on the JVM, but end up implementing it in the same costly manner in your own VM.

Hunches

I've never thought much about the implementation of dataflow variables, but I wonder if the new invokedynamic, call-site bootstrap and "method handle" stuff could be used to do this efficiently on the JVM? I know that this machinery can be used, for example, to memoize a result directly in the call site in a way that the JIT can directly understand. It's quite a powerful and general framework, and it would not surprise me to find that dataflow variables could be implemented with good performance.

Based on your other comments in this thread, I suspect that heavyweight threads would actually be the more difficult issue to surmount.

The JVM is a very impressive system, to be sure, and has lots of advantages. But I think it would also appear a bit baroque to people with no interest in the Java language, and it does have a lot of baggage. I suppose it may depend on the personal inclination of those involved: if you want something minimal, self-contained and focused, and you want to understand it all the way down, the JVM is almost certainly not for you. If you care most about portability and industrial strength and you'd rather invest time in the compiler than in the VM, the JVM may be a good fit.

Targeting the JVM would bring lots of missing libraries to Oz

I cannot comment on the implementational difficulties of building Oz on top of the JVM.

However, I would like to stress the obvious: based on the JVM, Oz would suddenly support a wealth of libraries. It will be next to impossible to gain such a rich environment of libraries with a custom VM. For example, Oz would finally have decent graphics libraries, etc. From a practical programmer's point of view this can make all the difference. In other words, an Oz based on the JVM can potentially attract a larger user community, which will hopefully also attract patches etc. JVM-based languages like Clojure or Scala would be unlikely to have become that popular had they been based on their own VMs.

In E dataflow variables

In E, dataflow variables (Promises) are implemented over the JVM without any problem. In my Scala library AsyncScala, I'm implementing Promises as plain Scala classes, also without any problem. So there is no direct need to modify the JVM unless you want to squeeze the last bit of performance out of it.
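
To make the "plain classes" point concrete, here is a minimal blocking sketch of a dataflow variable in C++ (my illustration, not E's or AsyncScala's design; a real Promise library would register callbacks rather than block):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <utility>

// A single-assignment dataflow variable: readers block until a writer
// binds the value exactly once. No VM support needed, just a lock,
// a condition variable, and an "is it bound yet?" flag.
template <typename T>
class DataflowVar {
    std::mutex m;
    std::condition_variable cv;
    std::optional<T> value;
public:
    void bind(T v) {                        // single assignment
        std::lock_guard<std::mutex> g(m);
        if (value) throw std::logic_error("variable already bound");
        value = std::move(v);
        cv.notify_all();                    // wake every blocked reader
    }
    const T& wait() {                       // block until bound, then read
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [this] { return value.has_value(); });
        return *value;
    }
};
```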

random notes

Whatever makes C run slow(er) may also make your new VM run slow too. So let's pretend all C optimization issues affect your VM design.

Efficiency ends up being about resource pressure for several sorts of resource including registers, cache lines, code pipelines, disk iops, and anything slow you might accidentally depend upon. (Classic accidental dependency is calling an innocent looking library function that internally uses a pthread mutex on global library state, with consequent convoy behavior in clients.) The resource you run out of first limits bandwidth in code using it.

For efficiency, aim to improve locality while decreasing size and branches. You can add complexity to get more locality with less size and fewer branches. Unfortunately, goals can conflict. For example, using alignment bits to encode immediate values (locally, instead of inside boxed values elsewhere) likely introduces a branch where you test which representation is used. So smaller size can increase branches. It's hard to say which one then dominates without measuring.

On-chip branch prediction tables are also a limited resource. If a branch tends to come out the same way, branch prediction wins. Branches with unpredictable outcomes lose, say when they depend on input data. If you structure test expressions so one outcome is likely, pipelines stall less on branches. Short-circuit evaluation in C can cause unpredictable branching, which is bad; if test expressions are cheap, it's a good idea to combine them arithmetically in one test (especially if the combined value is predictable).
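
For example (my illustration; compilers will sometimes perform this transformation themselves when the operands are provably cheap and side-effect-free):

```cpp
// Short-circuit '&&' can compile to two data-dependent branches; combining
// two cheap tests arithmetically yields a single, more predictable test.
bool in_range_short_circuit(int x) { return x >= 0 && x < 100; }
bool in_range_combined(int x)      { return ((x >= 0) & (x < 100)) != 0; }
```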

Generally you want data used together (or just frequently) in the same cache lines, in order to make the cache-line working set smaller and faster. But don't put things accessed concurrently by different CPUs in the same cache line, because that causes slow cache-line consistency effects. When content or code arrives with temporal locality and is later accessed again with temporal locality, store it with spatial locality, or at least a tendency toward it.
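
A common way to keep concurrently written data out of a shared cache line, assuming 64-byte lines:

```cpp
// Per-thread counters padded to one cache line each, so writes by
// different CPUs never contend for the same line ("false sharing").
struct alignas(64) PerThreadCounter {
    unsigned long count = 0;
};
PerThreadCounter counters[8];   // one per thread, each on its own line
```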

Alignment can affect performance in more than one way; more alignment is better. For example, the alignment of branch target addresses can change performance. A couple of years ago at work we realized that just rebuilding our system -- if it generates different code for any reason -- can result in a 10% change in performance, up or down, apparently just from randomly changing the alignment of branch target addresses in tight loops. It depends on the chip maker and chip generation.

On the topic of chip generation: if you compile for Intel, the hardware translates instruction streams into another form with a different count of resources, like unnamed registers in a register file used for assorted purposes. It can be hard to figure out what resource pressure is caused by your code, and speculative execution can consume more resources than is apparent. On 64-bit systems, favor instructions which completely update registers, or else you can get an effect called a false dependency, where speculative code stalls on the need to sync with resources visible explicitly in the instruction set architecture. Making a temporary local variable 32 bits instead of 16 bits, for example, can change how much of a 64-bit register is updated when it is modified. So use 32-bit locals.

If variables cannot have negative values, prefer unsigned arithmetic, because your own compiler (or someone else's) might not be able to figure out that sign extension can be skipped. If you iterate over an array with a signed index, you can easily get an extra sign extension on 64-bit systems, consuming an extra architectural register for scratch values and increasing pressure of multiple sorts.
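
For instance (my illustration; the exact codegen effect depends on compiler and flags):

```cpp
#include <cstddef>
#include <cstdint>

// Unsigned size_t index: no per-iteration sign extension on 64-bit targets.
// The 32-bit accumulator also writes a full register half on x86-64,
// sidestepping the partial-register false dependencies of 8/16-bit locals.
uint32_t sum(const uint32_t* a, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```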

Code unlikely to run should be placed in different functions from code likely to run. Mixing special cases with normal cases can increase the cost of optimization to the point where a compiler gets worse results. Localizing the normal cases simplifies what must be optimized, so the compiler does a better job.
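
A sketch of that splitting, using GCC/Clang-specific hints (my illustration):

```cpp
#include <climits>
#include <cstdio>

// The rare saturation case lives in its own non-inlined function, so the
// hot path stays small and simple for the optimizer. noinline and
// __builtin_expect are GCC/Clang-specific.
__attribute__((noinline)) static long saturate_slow(long x) {
    std::fprintf(stderr, "saturating %ld at LONG_MAX\n", x);
    return LONG_MAX;
}

static inline long increment_saturating(long x) {
    if (__builtin_expect(x == LONG_MAX, 0))   // unlikely: cold path
        return saturate_slow(x);
    return x + 1;                             // likely: hot path
}
```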

As a last note on complexity: its cost is more than linear. Say you have four mechanisms -- A, B, C, and D -- each with two variations, where one variation optimizes a special case. If all N of them interact, and each must know all the variations, the cost in developer attention seems to go up with the square of N, if you bother to figure out possible conflicts and correct modes of use. It's bad (but this is just my personal experience). Simplify under schedule pressure.

What happened to SEAM

What happened to the SEAM VM, which if I understood correctly, was intended to be a competitor to or replacement for the Mozart VM? Any ideas to be borrowed there, or were its advantages illusory, Alice-specific, and/or Oz-inapplicable?