emulation for Erlang style processes

In the LtU thread The Problem With Threads, Peter Van Roy wrote: "The real problem is not threads as such; it is threads plus shared mutable state. To solve this problem, it's not necessary to throw away threads. It is sufficient to disallow mutable state shared between threads."

The last few weeks I've been thinking about avoiding shared mutable state in new language dialects, while retaining convenient thread styles. Recently I had a fun idea which -- as an unplanned side effect -- seems to solve shared mutable state as a problem well enough for my purposes. Suddenly I told myself, hey, I'm going to get Erlang style processes out of this. Then when I googled a bit for references, I hit the LtU thread I cited above, which I'd read recently.

Note I'm unsure how to present the whole idea yet, which I'm still groveling over in a zillion pieces, like a car that's been totally disassembled. (It's hard for me to point at this pile and say, "This would be a nice car, if assembled." To you it might seem merely a pile of junk.) Note I'm not trying to build this up; I'm apologizing in advance for how lateral this likely sounds.

The disturbing part is I seem to have veered off the language path into OS territory without meaning to do so. I'm not sure. Maybe that's always what happens when you plan to embed VM systems in host operating systems. The VM world starts to sound an awful lot like an operating system by the time you start getting processes (with no shared mutable state between threads). Try to ignore this transgression into OS territory as a separate issue.

The reason I'm describing this here is because I think a lot of folks would like to get Erlang's safe and efficient use of many thousands of threads into other languages. So this is a tactic you might consider when brainstorming on the topic.

I was thinking about instruction sets for virtual machines, because recently I was going to post a story to the history department about the Smalltalk blue book, with a minor note of interest in the Smalltalk-80 VM instruction set. Then I got derailed thinking about instruction sets, and now I'm posting about Erlang processes instead.

The missing part in the middle, which is still spinning off distracting derivative ideas for me, was a notion of using the Motorola 68K family's 16-bit instruction set as a VM instruction set, instead of something like the 8-bit Smalltalk-80 byte codes. One of several reasons for this idea is a rule of thumb: "don't invent things." If you consider a virtual assembler for a VM, don't invent an assembler (or not until you succumb to exotic extensions). Instead, get a free ride on lots of existing books and tools for some existing, well-known, easy-to-understand instruction set. I happen to like m68k.

Of course, if you don't run on an actual m68k chip, this means emulation of some kind. It's the emulation that leads to a wonderful game that makes the Erlang process style easy. Emulating a chip's machine code adds an indirection with some extra layer cost, but it suddenly makes many other complex things easier and perhaps cheaper to do. My original motive was something like this: debug all the compilation issues under a simulation, and worry about real machine code later.

Then I noticed that sandboxing was going to be trivial. An emulator can manually memory-map every address seen by code running on the VM, so it's impossible for VM code to read or write memory unless the emulator wants to let it happen. VM code simply can't touch pages in the emulator's host process except those the emulator chooses to let appear in the virtual address space.
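
Here's roughly what I mean, as a C sketch with made-up names: every emulated load and store goes through a per-process page table owned by the emulator, so a page the emulator never mapped simply doesn't exist as far as VM code is concerned.

    #include <stdint.h>

    #define PAGE_BITS 12
    #define PAGE_SIZE (1u << PAGE_BITS)
    #define NPAGES    (1u << (32 - PAGE_BITS))   /* 32-bit virtual space */

    typedef struct VmSpace {
        uint8_t *page[NPAGES];      /* NULL => unmapped for this VM process */
    } VmSpace;

    /* All emulated reads funnel through here; a miss becomes an emulated
       bus error, never a stray peek at the host process. */
    static int vm_read8(VmSpace *vs, uint32_t addr, uint8_t *out) {
        uint8_t *pg = vs->page[addr >> PAGE_BITS];
        if (!pg) return -1;
        *out = pg[addr & (PAGE_SIZE - 1)];
        return 0;
    }

    static int vm_write8(VmSpace *vs, uint32_t addr, uint8_t val) {
        uint8_t *pg = vs->page[addr >> PAGE_BITS];
        if (!pg) return -1;
        pg[addr & (PAGE_SIZE - 1)] = val;
        return 0;
    }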

Then it's easy to think about sharing read-only pages among separate VM processes, simply by letting the VM's emulator map pages into multiple address spaces, with suitable refcounting. And while you're at it, you can make inter-process messaging cheap too, because it's not necessary to actually copy any bytes around in memory to deliver a message within the same host process; you can simply map part of the host address space into the virtual address space of a VM process.
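
And here's a sketch of the page-sharing bookkeeping (again, names are mine): a shared page carries a refcount, and "delivering" a message just maps the same host bytes into the receiver's virtual space, with no memcpy anywhere.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct SharedPage {
        uint8_t *bytes;          /* host memory backing one page */
        int      refcount;       /* how many VM address spaces map it */
        int      readonly;       /* shared read-only between VM processes */
    } SharedPage;

    typedef struct PageSlot {    /* one entry in a VM process page table */
        SharedPage *sp;          /* NULL => unmapped */
    } PageSlot;

    /* Deliver a message by mapping its page into the receiver at 'slot':
       just bookkeeping plus one more reference, no byte copying at all. */
    static void vm_map_shared(PageSlot *slot, SharedPage *sp) {
        sp->refcount++;
        slot->sp = sp;
    }

    static void vm_unmap(PageSlot *slot) {
        if (slot->sp && --slot->sp->refcount == 0) {
            /* last mapping gone; the page can be freed or recycled */
        }
        slot->sp = NULL;
    }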

Also, when synchronization primitives for concurrency are handled in the VM engine, multitasking starts to look rather cooperative (read: cheap) in each VM process. And high-level operations performed by a VM engine can be atomic from the perspective of each VM process.

As an example, suppose you mimic m68k machine code and use A-line traps, so instructions beginning with hex $A are not defined by the emulated chip and instead cause (emulated) trap exceptions into system-defined high-level routines. This is how the original Macintosh architecture used 68000 traps to implement the Mac toolbox in ROM very cheaply, since each A-line instruction was only 16 bits in size. Effectively you get to dispatch to an arbitrary late-binding method using only one 16-bit instruction.
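
In an emulator, spotting an A-line instruction is just a mask on the top nibble of the opcode word. A minimal sketch of the decode step, assuming a fetch16() and an ordinary-opcode emulator already exist (both hypothetical here):

    #include <stdint.h>

    typedef struct Vm Vm;                    /* emulator state, elided here */
    typedef void (*TrapFn)(Vm *vm);

    extern TrapFn   trap_table[1u << 12];    /* 12 low bits => 4096 traps */
    extern uint16_t fetch16(Vm *vm);         /* next 16-bit opcode word */
    extern void     emulate_68k(Vm *vm, uint16_t op);

    void vm_step(Vm *vm) {
        uint16_t op = fetch16(vm);
        if ((op & 0xF000) == 0xA000)         /* A-line: dispatch natively */
            trap_table[op & 0x0FFF](vm);
        else
            emulate_68k(vm, op);             /* ordinary emulated opcode */
    }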

On a real 68K chip, A-line traps are not cheap, because they cause exceptions with a lot of register maintenance cost as part of the dispatch to other 68K code somewhere else.

But when a 68K engine is emulated, traps are even faster than a normal dispatch in VM code, because the dispatch can be done directly in native code in the VM engine, without emulation. Complex system operations can run natively in the VM engine, atomically from the view of a VM process. If you wanted to implement a Scheme system with, say, around 250 primitive methods, each primitive could be a separate 16-bit A-line trap instruction running natively in the VM engine. Since the leading $A consumes only 4 bits of an instruction, the other 12 bits are good for 4096 different traps, without needing to start a new VM engine. So you could implement a very large high-level library as atomic VM engine operations this way, without needing any emulation for code in the library.
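
Continuing that sketch, installing a few hypothetical Scheme primitives as A-line traps is nothing more than filling slots in the 4096-entry table; each handler is plain native C inside the VM engine, and each looks atomic to the emulated process:

    enum { TRAP_CONS = 0x001, TRAP_CAR = 0x002, TRAP_CDR = 0x003 };

    static void prim_cons(Vm *vm) { /* stub: pop two operands, push a fresh pair */ }
    static void prim_car (Vm *vm) { /* stub: push the head of the pair operand   */ }
    static void prim_cdr (Vm *vm) { /* stub: push the tail of the pair operand   */ }

    TrapFn trap_table[1u << 12];             /* shared with vm_step() above */

    void install_primitives(void) {
        trap_table[TRAP_CONS] = prim_cons;   /* opcode $A001 */
        trap_table[TRAP_CAR]  = prim_car;    /* opcode $A002 */
        trap_table[TRAP_CDR]  = prim_cdr;    /* opcode $A003 */
    }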

That trap example has little to do with Erlang processes per se. But it helps illustrate the idea that a VM engine for a hosted language is basically the operating system, from the point of view of the hosted language. The VM engine can use and manage space invisible to a VM process, and perform magic like high-level atomic operations without a VM process knowing how or why.

None of this requires any particular language to work. It works with any high level language you want to implement on top of emulated machine code. So it seems to be a runtime idea only, and not a programming language specific idea, unless you count the emulated assembler as the language. Weirdly, you can even use different high level languages in different VM processes, without them being able to interfere with one another.

Here I'll stop suddenly, because I've mentioned all my main points, I haven't finished thinking about all the details, and so I lack any conclusion as yet.

Some bits look familiar

When implementing the Chalk VM, we also put scheduling inside the VM and communications inside the scheduler. We also had essentially one read-only address space, which let us perform communications almost for free.

root vs leaf VMs

Hey, thanks for posting. I started thinking the silent treatment meant I'd goofed in some way -- such as sounding like a total nutball. Maybe that's what I get for putting a premise like 'if you emulate 68K...' into an idea. It sounds retro and therefore dead, when actually any emulated assembler might do, even Knuth's MIX.

David Teller: When implementing the Chalk VM, we also put scheduling inside the VM and communications inside the scheduler.

Cool. I guess I'd suggest using more than one VM, with the sort I described above -- let's call it a leaf VM -- reserved for Erlang-style processes that can't have shared mutable state. At least one other root VM can run the pseudo-Erlang leaf VMs and do whatever it pleases, including the messaging and scheduling.

David Teller: We also had essentially one read-only address space, which let us perform communications almost for free.

I'd use one address space for each leaf VM, with copy-on-write as needed to let a leaf VM write on whatever it feels like, with no expectation that writes in its address space will reach another process unless a system call asks the root VM to intervene (by doing something like mapping memory into another process as a message).
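
Here's a loose sketch of what copy-on-write means at the emulator level (field names are mine): before any store into a page that's still shared, the emulator quietly gives the leaf VM its own private copy, so the write never escapes that leaf's address space.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096u

    typedef struct Page {
        uint8_t *bytes;
        int      refcount;    /* >1 means still shared with another leaf VM */
    } Page;

    /* Called by the emulator before any store through 'pg'. Returns the
       page that is now privately writable by this leaf VM.
       (Allocation error handling elided for brevity.) */
    static Page *cow_before_write(Page *pg) {
        if (pg->refcount <= 1)
            return pg;                        /* already private: write away */
        Page *priv = malloc(sizeof *priv);
        priv->bytes = malloc(PAGE_SIZE);
        memcpy(priv->bytes, pg->bytes, PAGE_SIZE);
        priv->refcount = 1;
        pg->refcount--;                       /* drop this leaf's share */
        return priv;
    }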

Nothing really stops you from having lots of different leaf VMs; it's really one of the implicit ideas. A leaf VM running a plain vanilla 68K emulator would give you a sandbox for testing compilers for assorted languages, but would also let you screw up memory inside that leaf VM's address space, since the machine code has free rein to touch the address space.

Other sorts of leaf VMs might be memory safe, where code running in a leaf VM can't screw up memory. A complex app might have VMs of both sorts, with safe and unsafe code running in different leaf address spaces, in cooperation through some division of labor. Scary bits of code can live in a bottle.

Another obvious use of leaf VMs running unsafe code is sandboxing untrusted code, such as client requests to a server (written in the root VM) that carry executable code which must run to complete the request -- page-building code in a web server, say. As long as a root VM runs the leaf VM, client requests cannot hope to succeed at denial-of-service attacks, since the root VM can meter cycles and shut down (or simply slow down) client code using too many resources.
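
Metering is the easy part, since the root VM already owns the leaf's instruction loop. A sketch, assuming a vm_step() decode loop like the one I sketched in the original post: hand each leaf a cycle budget and stop scheduling it when the budget runs out.

    typedef struct Vm Vm;              /* same emulator state as before */
    extern void vm_step(Vm *vm);       /* single-instruction decode loop */

    typedef struct LeafVm {
        Vm   *vm;                      /* emulator state for one leaf VM */
        long  cycles_left;             /* budget granted by the root VM */
    } LeafVm;

    /* Run at most 'slice' emulated instructions, then return to the root.
       An exhausted budget means the root may suspend, slow down, or kill
       this leaf, and the leaf can't do anything about it. */
    int run_slice(LeafVm *leaf, long slice) {
        while (slice-- > 0 && leaf->cycles_left > 0) {
            vm_step(leaf->vm);
            leaf->cycles_left--;
        }
        return leaf->cycles_left > 0;
    }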

Leaf VM dispatch tables for system calls can be pre-edited (by the root VM) to decide exactly how much functionality to expose in a leaf VM. Potentially dangerous methods that might allow security violations can simply be made uncallable in leaf VMs that must run code that can't be trusted at all.
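
For instance (sticking with the made-up trap table from earlier), the root VM can stamp out the dangerous entries before handing the table to an untrusted leaf, so from inside the leaf those calls simply don't exist:

    typedef struct Vm Vm;
    typedef void (*TrapFn)(Vm *vm);

    /* Refusal stub: from inside the leaf this just looks like an emulated
       privilege fault; the leaf can't even tell the call was ever there. */
    static void trap_denied(Vm *vm) {
        (void)vm;    /* a fuller version would raise an emulated exception */
    }

    /* Root-side editing of a leaf's dispatch table before the leaf runs. */
    void restrict_leaf(TrapFn table[], const int banned[], int nbanned) {
        for (int i = 0; i < nbanned; i++)
            table[banned[i]] = trap_denied;
    }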

Anyway, if you take the sort of approach I suggested, you can get arbitrarily complex system architectures in simple user-space code, without the exotic system headaches otherwise addressed by the real OS hosting the root VM. By running a simulated OS in a root VM, with emulated code in leaf VMs, you get a moderate slowdown that might pay for itself (which is exactly what I expect) once you take into account that you no longer need to do some really hairy (read: slow) things to solve complex problems when programming in the large for traditionally difficult server work.

Chalk tutorial

Without going into as much detail, the Chalk VM is based on the Kell calculus, which means that you have a hierarchy of cells, each cell containing an arbitrary number of processes. Two processes in the same cell can communicate without any restriction, while distant processes need to open channels to communicate and are otherwise fully isolated. Communication itself is atomic and requires no copy.

No Unix-style memory protection is involved, as the bytecode actually doesn't contain instructions which could break the isolation by accident.

So, once again, I see related ideas, just not coming from the same world :)

Oh, and you can also put cells on hold, duplicate them (including the processes inside) or send processes along channels. But that's for advanced uses :)

Maybe Related?

Hey Rys,

I don't know exactly how good the conceptual relationship is, but if you haven't seen it already, you might find Jean-Claude Wippler's work on vlerq interesting. There's a VM called "Thrive" under the hood, and he's got bindings for Tcl, Python, Ruby... interesting stuff, to me, anyway.

vlerq memory management

Paul Snively: I don't know exactly how good the conceptual relationship is, but if you haven't seen it already, you might find Jean-Claude Wippler's work on vlerq interesting.

Thanks, that's interesting. Yesterday I looked at many vlerq pages to see the relationship, but it was hard to find much in common besides using a VM. The vlerq architecture page is so generic and non-specific that it describes lots of other things besides vlerq, including what I suggested. But JCW mentions using Java as the host language, and I wouldn't try using anything but C or C++ myself.

I didn't see any sign in vlerq of memory management plans suitable for threads as disjoint logical address spaces (i.e., processes), policed by a VM enforcing no memory interference. That's what I would consider conceptually similar, if present.

That's not quite new ...

... but maybe not often talked or written about. Basically, every compiler compiles to some intermediate code, which effectively forms a VM. Even multiple (layers of) VMs are possible. As you noticed, any existing CPU instruction set is basically a VM in itself -- probably every CPU out there uses some internal formats. (The RISC paradigm just tries to bring them closer to each other.)

So every compiler for a high-level language has a sort of internal VM with some language primitives "built in". Whether you interpret these at runtime or translate them to the corresponding target-architecture instruction set then only makes a performance (and portability) difference.

The problem is: in your example, synchronization is only cheaper if software threads run on the same processor. If you need cross-CPU communication and synchronization (which is becoming increasingly important), it won't get any cheaper. However, you push the logic of when to use which mechanism down into the VM, and pull it up and away from the hardware or OS. So you may be able to make better decisions, resulting in better performance in the average case.

So, basically, it's indeed a useful abstraction, making a language implementation very portable (if you choose the right primitives).

more pros for emulation

I worry I might encourage a vein of discussion I find boring (debating pros and cons), but I want to say thanks for the comments.

Thomas Schilling: That's not quite new ...

Newness was neither the goal nor the message. I suggested a simple way to get Erlang-style processes into any language you like, provided a suitable memory scheme is chosen early in the design. (I meant to encourage this tactic in future language implementations by showing other folks how easy it is.)

Thomas Schilling: Basically, every compiler compiles to some intermediate code...

We agree I described a very generic view of perfectly ordinary compiler ideas.

Thomas Schilling: ...or translate them to the corresponding target-architecture instruction set then only makes a performance (and portability) difference.

But we disagree on whether emulation only makes a performance or portability difference. (Actually, disagreeing with that was the main motive for my post, so I'm slightly puzzled.)

Conventional wisdom says emulation and interpretation are minor evils (performance degradation) sometimes accepted in lieu of finishing a (more highly desired) translation to native machine code for execution by hardware. So in fact, using a JIT compiler to finish the last step is normally what's desired.

I suggested emulation gives you control to get the desired memory effects in easily debugged and managed user-space C code, effects you'd otherwise need complex native assembler and supervisor status to achieve. I'd rather debug my all-C/C++ kernel with an ordinary debugger like gdb.

When counting pros and cons, in addition to performance (con) and portability (pro), I'd also include these on the pro side of the balance: control, coding costs, safety, simplicity, accounting, analysis, and a few other things you can get from indirection.

I also suggested performance was not always a con for certain classes of application. I've seen more than one C++ project coded first with direct, straight-and-narrow tactics, only to discover later that adding N more required bits of finesse caused enough interference to subvert the performance plans. I don't want to describe this in detail, but it strongly informs my thinking.

The kind of app I worry about is one allocating and productively using all available physical memory, with lots of fragmentation and allocation turnover, that must remain up, efficient, and stable. In short, I want my runtime to help me prevent worst case entropic effects under stressed memory usage.

Thomas Schilling: In your example, synchronization is only cheaper if software threads run on the same processor.

Avoiding shared mutable memory aims for less synchronization in the first place. Rather than making sync primitives cheaper, the idea is to use sync primitives much less often. (In particular, to avoid using them defensively with shared state under pre-emptive multithreading.)

strange ltu traffic lately

Just post something if you want to discuss this any more; otherwise I won't bother bringing up implementation threads again. I get the impression (from very ambiguous signs, as likely my imagination as not) that this sort of discussion isn't very welcome in some quarters.

Can't speak for others...

I enjoy the discussion. But don't mistake not jumping into the discussion for a lack of interest -- I just don't have a whole lot that I can add.

(Only thing I was gonna say was that Erlang does have quite a lot of state going on - it's just that it is wrapped in messages to tail recursive functions. But I didn't think that observation was relevant to what you are aiming at).

likewise

(empty)

"Some quarters" can ignore

"Some quarters" can ignore you if they like (but as Chris Rathman said, don't consider lack of response ignoring). If someone feels that what you posting is inappropriate for LtU, they will likely say so. However, if you want more free-form and quicker/more spontaneous responses perhaps LtU isn't the right medium.

using immutable data only

This was actually extremely helpful, thanks very much. I didn't really get the model Erlang used before, and now I can add that to my toolbox of tricks for code later.

Chris Rathman: (Only thing I was gonna say was that Erlang does have quite a lot of state going on - it's just that it is wrapped in messages to tail recursive functions. But I didn't think that observation was relevant to what you are aiming at).

I just boned up a bit more on Erlang so I could get that, because it sounded interesting. You're right, that's not what I was aiming at. Now I see why what I've been saying sounds weird to folks who know Erlang. :-) Erlang is doing something much simpler than what I described, based on the universal use of immutable data -- which means that since you can't cause thread conflicts, threads might as well be in different processes. But there's no need for separate address spaces, since you can't write code that detects whether you are in the same address space anyway.

It would be a total waste to use the technique I described for an Erlang implementation, so it would only be useful in other languages that wanted a lighter weight process model. The Erlang model is lighter weight yet, and that model could be used in conjunction with the one I mentioned, using separate logical address spaces for threads that are allowed to modify state.

You'd only be able to mix the Erlang thread model into runtimes of other, non-functional languages that had safe memory models allowing you to enforce the immutable status of data used by the Erlang-style threads. But it's feasible to have more than one class of threads: one like Erlang's and one more like a traditional Unix process in terms of having a separate address space.

I suppose if processes of either flavor only communicate by messages, they probably can't tell which flavor of runtime another process is using. Hmmm, so there are two different ways you could spawn new processes, depending on whether you wanted a private mutable address space or not. (A process that doesn't want a mutable address space can't tell whether it has a different address space.)

(Now I'm finally writing what I thought I was going to say when I first saw your post.) Most systems have a lot of state going on, even when they have a functional model with immutable data, because the bindings of variables to data and the generation of new data both involve space use at runtime. Some functional systems quietly update bindings and aren't quite as assignment free as some folks suppose.

It would be interesting to write a HyperCard system in Erlang. The HyperCard model was always a little disturbing in terms of being event driven, without a clear ordering of what would occur before other code had run. Um, wait -- HyperCard mutates state and Erlang doesn't. So that would require a hybrid model of some kind, using Erlang style for the event handling but some other model for state management.

Transputer

You might be interested in taking a look at the instruction set for the transputer. It was specifically designed to support highly concurrent systems that used message passing instead of shared state. The language of choice for programming transputers was occam, but compilers from languages like C and Pascal into transputer code were also produced at one point. The Transterpreter is a recently implemented VM that runs programs compiled to transputer code.

misc VM instruction sets

Thanks; the transputer links suggested more ideas about instruction formats, especially along the lines of uniformity and readability (say, in hex) over a concern for minimum size.

I have a good idea where to go with these various ideas now, and will probably spend some of my spare time cheerfully writing code based on similar if not identical notions. Compared to several weeks ago, I lean more toward a VM-centered world than a C and C++ kernel-centered one for writing other languages. Except a VM ought to have a variety of powerful high-level abstractions running in fast native code that's invoked by pcode in whatever format. I don't want to just clone what I did 12 years ago.

Last weekend I was very enthusiastic about my new lines of thought and really wanted to get both positive and negative feedback to help ground the ideas in reality. Sorry if posting here used this site more as a forum than a blog, but there aren't a lot of places to get quality feedback on language ideas that isn't heavily invested with agendas (like: 1. conform, 2. you can't do that, 3. leave it to the experts, 4. use brand X instead, etc.) and uninterested in brainstorming or idea exploration.

A common theme appearing in some recent LtU forum-like threads is an observation that language design often involves choices informed by reasoning about entailed consequences and dependencies. I often wish threads on LtU focused more on reasoning and entailment, because these are fun -- and I suspect more productive -- while various specific details are just, well, details. (Like syntax. :-)

The rest of this post attempts to summarize what I got out of this discussion, in the sense of what I actually expect to do that's related to assembler, emulation, and to some degree Erlang. Parts I'd enjoy discussing more -- assuming a focus on reasoning and entailment -- are shown as bulleted items, should someone have an urge to explore. I'll try to characterize each in terms of choices.

  • vm - multiple; assembler syntax; pcode binary formats
  • gc - safe memory management vs manual allocation
  • fp - shared state: thread-safe/immutable vs oop-friendly mutable
  • os - shared address spaces vs disjoint process address spaces

The idea is to end up with a stackless VM with paged address mapping, memory-safe GC and refcounting, full continuations, some Smalltalk- and Lisp-style semantics, plus immutable memory with copy-on-write support. For FP, the VM can make it easy to convert memory one way from mutable to immutable, so OOP-oriented code can feed into well-behaved CSP interactions by freezing objects fed into process networks.
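
The one-way freeze is easy to state in code. A minimal sketch (field names are mine, layout elided): freezing sets a bit that the engine's store barrier honors forever after, which is what lets OOP-built objects be handed safely into a CSP-style process network.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct Obj {
        int     frozen;        /* once set, never cleared */
        size_t  size;
        uint8_t bytes[];       /* payload; a real layout would be richer */
    } Obj;

    static void obj_freeze(Obj *o) { o->frozen = 1; }

    /* Engine-side store barrier: mutation is allowed only before freezing,
       so a frozen object can be shared between processes without copying. */
    static int obj_store(Obj *o, size_t off, uint8_t v) {
        if (o->frozen || off >= o->size)
            return -1;         /* refused: immutable or out of range */
        o->bytes[off] = v;
        return 0;
    }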

Codename note: names help keep you from getting lost, so it's useful to have codenames. 68K on phone keys can be spelled NUK, so nuk (say "nuke") makes a one-syllable synonym for Motorola's four-syllable 68K. I sometimes use the letter y as shorthand for Lisp, from the first letter of the world tree Yggdrasil, to emphasize tree-based rather than list-based processing. (If your tool is a two-pronged divining rod, every problem looks like a search for water.)

First I'll write two very similar assembler parsers for two different VMs, both roughly based on 68k: one traditionally memory-unsafe (mnuk), just like m68k, and another gc-oriented and memory-safe (ynuk), which might bear little resemblance to m68k. Both will build (probably Lisp-style) parse trees in memory by allocating from a gc-oriented (roughly ML-style tagged) format that will grow into a C/C++ gc memory kernel when collection is coded a bit later. Both can compile to more than one pcode binary format.
For example, mnuk might compile to traditional m68k binary or to, say, some hex-friendly x68k you might decode by eye (perhaps painfully).
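
For the parse trees, I have in mind something like the tagged cell below -- a rough sketch with placeholder names -- which both assemblers could share and which can later be handed over to the gc kernel:

    #include <stdint.h>

    typedef enum { TAG_INT, TAG_SYM, TAG_PAIR, TAG_NIL } Tag;

    /* One gc-managed cell, roughly ML-style tagged: the collector only
       needs the tag to know whether the payload holds references. */
    typedef struct Cell Cell;
    struct Cell {
        Tag tag;
        union {
            int64_t     ival;                     /* TAG_INT */
            const char *sym;                      /* TAG_SYM, interned name */
            struct { Cell *head, *tail; } pair;   /* TAG_PAIR */
        } u;
    };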

Multiple VMs allow more options to be compared with one another. Direct comparison of nearly equivalent choices is a good way to evaluate and debug. (Debugging is faster given one correct model, and correct simple models should be done before fast complex models.)

You would only trust memory-unsafe mnuk in root VM space if compiled statically. But mnuk is probably better in a leaf VM for simulating a traditional CPU when writing language tools you intend to translate later to native code for real CPUs. The ynuk VM would be useful in both root and leaf VM spaces, with the difference mainly involving process address space mapping using different emulators.

Since I haven't taken this particular approach before, I can't predict before iteration how much kernel code would work better in C or in ynuk, or in Lisp or Smalltalk that compiles to ynuk. Figuring that out will be part of the fun. Adding an Erlang style process model as an independent mechanism seems feasible without requiring conflicts with the other parts of the kernel.

Thanks for your time; I hope this didn't sound too much like gibberish.

funky runtime address space games

(For those few of you who find this interesting, tell the other folks it's not as crazy as it sounds. It's really just some math I prefer to think about in plain English. Just tell me to stop if you don't want to hear any more. But I probably won't write more about this, since it will only sound more complex as I get into the details.)

I've been working out some weird properties of memory safe ynuk (a kinda sorta Lisp/Smalltalk VM), which turns out to be much more interesting than mnuk (which is merely an m68k simulation).

For y space objects to be "safe" they must live someplace where code can't forge anything or twiddle any of the bits, so ynuk needs a minimum of two disjoint address spaces. So let's use yin to mean y space objects, and ang to mean the 32-bit flat address space visible to mnuk. (Yes, ang is "yang" with the y missing; this actually helps a bit by avoiding y completely, though it might not seem so at first.) Yin and ang don't overlap, but they know about each other.

Each ynuk process has both a yin and ang address space; ang is byte addressable (as per m68k), but yin is an object address space and is not byte addressable. (However, many objects inside yin have byte addressable contents; the ang address space is probably one of the objects in yin space.) So ynuk can only see what happens in yin indirectly by copying data into ang, or by seeing the results of instruction execution depending on state inside yin. Yin objects don't have addresses, and their addresses cannot be transferred into ang space; ang objects cannot store yin references per se. But yin objects have pathnames and ang structures can store the names and use ynuk opcodes to load yin objects by name.
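
To make the yin/ang split concrete, here's a sketch of how a ynuk opcode might resolve a pathname into a y register (everything here is hypothetical, names included): ang code only ever holds the name as bytes, and the reference that comes back is parked in a register it can't inspect as bits.

    #include <stdint.h>

    typedef struct YinObject YinObject;     /* opaque to emulated code */

    typedef struct YnukVm {
        uint32_t   d[8], a[8];              /* ordinary m68k-style registers */
        YinObject *y[16];                   /* y0..yf: yin references only */
    } YnukVm;

    /* Engine-side name resolution; the pathname bytes come out of ang space. */
    extern YinObject *yin_lookup(const char *pathname);

    /* A hypothetical "load yin object by name into yN" opcode handler. */
    void op_yload(YnukVm *vm, int yreg, const char *name_in_ang) {
        vm->y[yreg & 0xF] = yin_lookup(name_in_ang);   /* NULL => yin fault */
    }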

If you're familiar with client/server architectures, yin acts a lot like a server from the point of view of ynuk opcodes. But the server is just the ynuk VM, and isn't really remote (which incorrectly implies latency) as much as it is inscrutable (which correctly implies safe). In fact, things in yin space probably happen faster than operations in ang space, since ang is an emulation and yin is the real world (you think this is air you're breathing now?).

Roughly speaking, ynuk is a superset of mnuk. The opcodes for mnuk can only see and use bytes found in the 32-bit ang address space. However, ynuk has an additional 16 registers (y0, y1, y2, ..., ye, yf) which can only refer to yin space state. The yf register is the yin stack pointer to mirror the use of a7 as the m68k style ang space stack pointer. Other yin registers might have special assignments, but low numbered yin registers are merely references to objects in yin space.

You can't move a yin register into the m68k d0..d7 or a0..a7 registers because a yin register is neither address nor integer. But you can move bytes from yin space into ang space and vice versa, using suitable opcodes and yin object references in yin registers.

In practice, it won't be necessary to actually use the mnuk subset of ynuk, because everything can be done in the inscrutable yin address space. But if you wanted to do so, you could sample content in yin by looking at copies of data brought into ang space. This is how you'd define what happens in the yin space, because it can all be described in terms of semantics the ang space understands. The ang space considers itself a client of an abstract yin server, mediated by the ynuk VM.

Code probably lives in yin space when executing, even though ang space can generate new code and have it loaded (assuming the code loading opcode is visible in that particular process). So ang space might be entirely quiescent during execution if none of the mnuk opcodes appear in code at runtime.

The size of the yin address space isn't revealed, and the ordering of objects in yin space isn't either. Basically, yin space isn't flat. Yin objects that have byte addressable contents will support efficient operations at any position using slices of any size, as if the yin address space can be cheaply edited, which will actually be the case. (So writing an editor for very large text objects in yin space will be quite easy.)
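
The interface I have in mind for those byte-addressable yin objects looks roughly like this -- a sketch of the API only, with names of my own; a real engine might back it with a rope or a piece table:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct YinBytes YinBytes;   /* opaque; not byte-addressable from ang */

    size_t yin_bytes_size   (const YinBytes *b);
    int    yin_bytes_insert (YinBytes *b, size_t off, const uint8_t *src, size_t len);
    int    yin_bytes_delete (YinBytes *b, size_t off, size_t len);
    int    yin_bytes_copyout(const YinBytes *b, size_t off, uint8_t *dst, size_t len);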

Let's see, I think that was most of what I'd thought before I started writing this.