## erlang green threads and the CLR

Last few years i have been playing with creating my own programming languages, I've got a pretty good idea of what i want it to do and what i want it to look like. I've finally decided to go with the CLR virtual machine as the back end (simply because I'm working on a RAD language and not anything low level. If i need raw speed i will use asm, c, or forth) My problem is that all the common language runtime work i have done has been writing il code for basically toy languages and scripting tools, so while i know what i am getting into as to the ins and outs of compiler design what i don't know is if i could create erlang like green threads in any reasonable manor on the clr.
Anyone have a basic run down of how green threads are implemented in erlang and if i could do something similar on .net?

keep in mind the language will be pure functional sort of like haskell. I have a strong interest in making it erlang like with the concurrency because erlang simply got it right (though the syntax sucks).

Even if all you have is a link i'm cool with that, i'm not afraid to do a lot of research on my own =P

## Comment viewing options

### Optimal asynchronous interface?

You can do green threads on the CLR if you transform your programs into continuation-passing form. But then interoperation with the class libraries becomes more difficult, as that code is not in CPS form, and so "steals" the stack/thread whenever called. Some classes provide asynchronous interfaces, such as streams and sockets, so you can hide the blocking API in your language for these objects and expose only the asynchronous part, but such interfaces are rare.

I'm actually curious what the current research indicates is the optimal concurrent program structure, assuming we're designing a VM from scratch for commodity hardware.

1. We can transform all programs into CPS form, and this provides optimal scalability in terms of number of threads, but then all calls suffer from indirection costs, which are quite high on modern CPUs.

2. We can use a user-level stack-switching approach, which solves the indirection problems, but this limits our concurrency on 32-bit machines to about ~65K green threads (assuming 4kB stacks). That seems enough for most applications, but I'm wary of making such predictions. :-)

3. We can use an E-style pure messaging approach to execution, which solves most concurrency related problems, but violates the intuitions most of have about the costs of call-returns, and has other drawbacks of this sort.

Obviously, some combination of #1 and #2 is optimal, but would seem to require code specialization, as a particular trace may block (requiring a continuation) or not block (ordinary call-return) in the same function depending on the control flow up to that point.

I've read "A Language-Based Approach to Unifying Threads and Events", but at its core it seems just a form of #1 (uses the CPS monad). Any further insight here?

### limits

a green thread limit of 65 thousand would be more then acceptable for my purposes.

### 1. Indirection really isn't

1. Indirection really isn't that expensive anymore. Where CPS still loses is on the predictability of returns. If calls and returns (at the asm level) are always paired perfectly, it is easy to predict where returns will go. In CPS, there aren't actual returns anymore and they don't go to the right place anyway.

2. You don't need CPS to capture continuations without copying the stack. You only need some way to manipulate activation records. The program could certainly hooks directly in the stack management mechanisms. However, the relevant information could also be computed separately 'in userland', either eagerly or by-need (e.g. by using exceptions or dynamic-extent thunks). Obviously, if the program can't control stack management, some sort of copying to and from the stack is needed. However, all the captured data lives on the regular heap, and can be moved to and from the stack incrementally, as needed. In a realistic application, most computations should still be completely performed in a single time-slice.

### 1. Indirection really isn't

1. Indirection really isn't that expensive anymore.

Do you have some evidence for this? All the papers I've read show significant penalties for branch mispredictions. So the question is whether indirections can be predicted.

While branch predictors are a little more sophisticated nowadays, they still can't accurately predict if a different branch is taken everytime from the same location as far as I know (predictors are largely history-based). Couple that with the very deep pipelines and you have a significant pipeline stall penalty. CPS form incurs exactly this penalty, so it still seems quite expensive.

You don't need CPS to capture continuations without copying the stack. You only need some way to manipulate activation records.

Is there a more detailed description of this somewhere? I've been meaning to read this paper more carefully, as it seems to provide a slightly more efficient continuation mechanism. Is this what you're referring to?

### While branch predictors are

While branch predictors are a little more sophisticated nowadays, they still can't accurately predict if a different branch is taken everytime from the same location as far as I know (predictors are largely history-based).

So, we agree, it's not indirection itself that's expensive, but rather unpredictability. The question then becomes, in what code do you expect to use unpredictable indirect branches*?

In CPS code, the only (additional, compared to direct style) such branches are returns to tons of different callers. The equivalent direct style code would have returns instead. For returns to be correctly predicted, the recursion can't be too deep, or the buffer overflows. Thus, we can expect CPS to show a sensible penalty for functions that don't themselves recurse much or perform much work, and are called from a multitude of contexts (that don't exhibit temporal locality). The last criterion is particularly stringent, while the first two suggest that these functions should often be candidates for inlining or duplication.

* In my experiments, Core 2 seems to be able to predict constant stride indirect branches and short cyclic patterns in addition to pure 'monomorphic' ones. There is still an overhead for correctly predicted indirect branches compared to direct ones, but it's nothing catastrophic.

Is there a more detailed description of this somewhere?

I was more thinking of pay-as-you-go approaches in the spirit of Pettyjohn, et al, which reifies activation records lazily and still uses normal call/return pairs. Since the reified records are just heap allocated, the problem of choosing the right stack size is avoided. In some sense the stack is still being copied, but once it's flushed to the heap, it's simple to copy records back to the stack incrementally. Whether the tradeoff of slightly more efficient 'normal' execution for more expensive capture & restoration is debatable, but I don't believe that typical concurrent code usually captures very deep continuations or wastes a lot of time switching tasks.

### In my experiments, Core 2

In my experiments, Core 2 seems to be able to predict constant stride indirect branches and short cyclic patterns in addition to pure 'monomorphic' ones. There is still an overhead for correctly predicted indirect branches compared to direct ones, but it's nothing catastrophic.

Of course, Core 2 is known to be quite a good architecture, and a relatively new one, and thus is a minority of the existing PC base. What about UltraSparc or PPC?

I was more thinking of pay-as-you-go approaches in the spirit of Pettyjohn, et al, which reifies activation records lazily and still uses normal call/return pairs.

### I was more thinking of

I was more thinking of pay-as-you-go approaches in the spirit of Pettyjohn, et al, which reifies activation records lazily and still uses normal call/return pairs.

I've read the paper now, but I don't think it's workable for the types of languages we're discussing here, where there's heavy use of continuations. Essentially, this design requires each procedure to explicitly saves its live variables in a continuation by walking the stack when the continuation is captured. While this is definitely more efficient in the case where continuations are rare, I think it would kill performance in a program or language which uses continuations heavily, like Erlang.

Stack switching would be much more efficient, particularly with the one-shot continuation optimization in the paper I linked to. Stack overflow is itself handled as a call/cc, so sizing the stack isn't really a problem, and the technique seems reasonably efficient. Better than CPS in any case. There might be some space waste, but that's always the case with stacks.

### While this is definitely

While this is definitely more efficient in the case where continuations are rare, I think it would kill performance in a program or language which uses continuations heavily, like Erlang.

I think there's a tendency to overestimate the importance of fast task switching in real programs. While it's important to avoid being as expensive as kernel threads to support, say, Erlang's programming model, it seems silly to spend too much effort trying to optimise an operation that should hopefully only represent a small fraction of a program's execution. I'm much more worried about space usage than time, for example. Take an hypothetic network application that serve 100k requests/second. If each request has to pass through 20 process, on a single 2GHz machine that's an average of ~1000 cycles (500 ns)/time slice. How much of that can be spent switching processes? According to this page from 1998 (I don't have the time to run my own), it used to take around 1 us on 200 MHz PII-class machines. It can't be that hard to get it down to 10% of that, nowadays, especially without the hardware context switch overhead.

Also, note that the approach I linked to only has to perform work proportional to the number of activation records on the stack, and only when they have changed (much of the overhead is also linked to their assuming an uncooperative runtime). Restoring a process need not put all the activation records back on the stack. Moreover, the number of activation records must somehow be related to the amount of work performed; if processes spend so much time switching to each other, they can't perform that much work between switches.

### I'm much more worried about

I'm much more worried about space usage than time, for example.

Heap allocating activation frames also has hidden costs in GC time and poor locality.

Let's turn your question around instead: what are the space losses of stack-switching? Has this ever been quantified? We can have stacks start small (~4kB) and grow on demand to minimize internal fragmentation. The paper I linked to describes a good technique to exploit this approach.

[Edit: of course, I would be dishonest if I neglected to mention the stack overflow checking that must be done on every procedure call when using stack switching. We can't rely on MMU tricks with so many little stacks. Of course, the branch prediction is nearly 100% on this test, so I don't expect the overhead is all that significant.]

Restoring a process need not put all the activation records back on the stack.

True, you could place an underflow handler as the return address which would restore the next block of activation frames. These tricks incur considerabe complexity though.

### Computer architecture

For me there's something surreal in talking about memory locality, branch prediction, etc in the context of concurrent programming. I can see that it's of massive importance to the microbenchmarking industry but it feels miles away from the concerns we've had in the concurrent systems I've worked on. Those always seem to either be I/O bound or to spend their time in some 'kernel' that's often written in C.

For example I recently spent some years working on Erlang systems delivering realtime telecom network services to some tens of millions of SIM cards per server. These were absolutely performance-critical systems and running at maximum capacity, but if you ran 'top' on one you'd see CPU usage around 3%.

I'm just saying. :-)

### Then again, highly

Then again, highly concurrent applications used to be relegated to niche applications in which microperformance perhaps didn't matter so much. The larger the userbase, the better the implementation must be in every way.

### That's true. I'm so much in

That's true. I'm so much in the habit of exploiting knowledge of what won't matter for my application that it's hard to put myself in the implementor's shoes. Their users will be doing very weird and unpredictable things and swearing like hell at any feature that's implemented badly.

Really I just think there's so much more to Erlang than fast context-switches and small process footprints :-)

### Their users will be doing

Their users will be doing very weird and unpredictable things and swearing like hell at any feature that's implemented badly.

Indeed. I was thinking of scientific computing specifically, where parallelization certainly helps, but the microperformance of arithmetic, and vector ops can matter a great deal.

Really I just think there's so much more to Erlang than fast context-switches and small process footprints :-)

Absolutely. OTP is something I've been meaning to read up on. :-)

### applies to my concurrent programming

Luke Gorrie: For me there's something surreal in talking about memory locality, branch prediction, etc in the context of concurrent programming.

Those affect what I do lately: WAN optimization via dedup, which involves trading more cpu for less i/o. But yeah, all that code's written in C and C++. (I wish none of it was C but I don't get to choose.)

We're either i/o or cpu bound depending on choices I make. For the luxury of being i/o bound I must care about locality and branch prediction, or else I can max out cpu on memory bandwidth. When we have cycles to spare, we think about spending them on something else putting us closer to perfect balance again.

When I work in a high level language, I'd just as soon keep sensitive parts in C and C++ just like they are now. So it wouldn't hurt if part of a system in a higher level language didn't worry about such things.

### It depends on what you do...

... and as you point out, these aren't everybody's concerns. :-) However, wouldn't it be nice not to have to drop down to C for performance reasons to write the kernel, and instead write everything in Erlang?

Unfortunately, Erlang's implementation sucks when it comes to CPU-bound tasks. It's on par with Guile, which even worse than MzScheme. (No offense to the PLT crowd intended; the PLT platform has other strengths. But my experience is that Petite Chez, an interpreter written in Scheme, often runs code faster than MzScheme's compiler.)

I've been meaning to implement Erlang's actor model and OTP's native network messaging protocol in either Chez or Ikarus. These implementations put a lot of effort into making call/cc fast, something that's typically neglected in other Scheme implementations. I think it would be very interesting to see how this approach to concurrency compares, performance and stability-wise, to OTP's implementation.

However, OTP's networking message protocol is poorly documented. If you are interested, I'd love to try to decipher this protocol with your help.

### How To Implement the Distributed Erlang Protocol

Good timing! I wrote down what I know about it while having my coffee on this quiet saturday morning. :-)

== The protocol ==

The distributed Erlang protocol connects Erlang nodes together in a
network. If you say "{myproc,mynode@myhost} ! MyMessage" then the
runtime system is going to connect to mynode@myhost using the
distributed Erlang protocol and deliver your message.

The protocol starts with a handshake and then enters a two-way request
processing loop. The handshake is for authentication (same cookie?)
and feature negotiation (e.g. do we both support passing funs?).
The main request loop allows each node to asynchronously invoke one of
a small set of operations on the other node:

SEND(pid, message)                \ Send a message
REG_SEND(from_pid, name, message) \ Send to a registered name
.. plus more for process monitoring, etc

The encoding of these requests is based on the standard Erlang
external term format. That's the same format that the term_to_binary/1
and binary_to_term/1 BIFs use.

The OTP source distribution contains the details of all this:

lib/kernel/internal_doc/distribution_handshake.txt describes the state
machine for establishing a connection between two nodes.

erts/emulator/internal_doc/erl_ext_dist.txt describes the Erlang
external term format and the main request protocol.

Erlang/OTP itself implements much of the distribution protocol in Erlang.
Take a look in lib/kernel/src/net_kernel.erl for example.

== EPMD ==

You can run a bunch of Erlang nodes on a machine each with their own
name (foo@myhost, bar@myhost, etc). The Erlang Port Mapping Daemon is
a tiny program that maps name->port so that other machines can find
the node they're looking for.

Each time a named node starts it will contact epmd (or start it as a
separate unix process if necessary) and say "hey! I'm node x and I'm
listening on port 12345". Then when another node wants to connect to
x@myhost it first connects to epmd on myhost (with a well-known port)
and asks what port that node is listening on, then connects on that
port with the distributed Erlang protocol.

== Examples ==

These days there are plenty of examples of Erlang distribution
implemented in other languages.

One example is Distel, in Emacs Lisp. There's a paper about it here:
http://lambda-the-ultimate.org/classic/message5450.html
and the code is here:

The most interesting files are:
derl.el is the distributed erlang protocol state machine
epmd.el is the epmd client
erlext.el is the Erlang external term format encoder/decoder


Good luck!

### ETOS

ETOS is an Erlang implementation using Gambit Scheme as an intermediate language. You might find that interesting!

### Re: ETOS

I'm aware of ETOS, and indeed it is worth mentioning. However, I'm more interested in (for the time being?) the run-time aspects of Erlang.

Thanks for the pointers with regard to the distributed Erlang protocol!

### Distel for other languages

I was aware of Distel before I asked, indeed, that's why I asked. However, I'm not particularly aware of other similar projects... do you have a pointer or two?

I'm just talking about libraries implementing the distributed erlang protocol.

erl_interface is a C library and it's part of Erlang/OTP. It's used by some standard seldom-used commands like erl_call to RPC into Erlang nodes.

Jive is the same thing written in Java. It was in OTP around R7 (which is probably still online somewhere) but removed for some reason. It's nicely written and good as a reference for implementing the external term format.

I remember seeing a couple of Python implementations around the time I was hacking Distel. I'd guess there're a bunch more floating around.

### ... what about the manycore

... what about the manycore challenge? Pretending it doesn't exist doesn't mean hardware trends will go away.

### 64K processes is quite a few

64K processes is quite a few but 4KB maximum stack size doesn't sound very appealing. Then you will have to be careful all the time not to recurse too much and crash if you do. In Erlang you don't have this restriction because the stack is dynamically allocated -- you can truly recurse a million levels deep and nothing bad will happen.

I'd suggest the OP search the erlang-questions mailing list for a message by someone like Ulf Wiger or BjÃ¶rn Gustavsson explaining how Erlang represents processes.

I like Squeak's approach to the stack. Everything is an object, including activation records, 'nuff said, next question!

### 64K processes is quite a few

64K processes is quite a few but 4KB maximum stack size doesn't sound very appealing.

Indeed, so that's the maximum level of user-level concurrency achievable using the minimum size defaults. The stack can always be grown in case of overflow, so I don't think it's that big of a deal.

### I'm not to worried about the

I'm not to worried about the deep recursion, there are some tricks one can implement to fix this. I think the clr supports tail recursion correction (i seem to remember something about this) so i should be fine there which would be the 90% case.
I was really worried about interop with the rest of the net framework. after all if everything is an agent what do i do when the thread for a process crashes or hangs or just stalls within a long running net framework call or such, that would really mess up my grean thread system, i can do some fancy jumping and stack saves on my side but not on other peoples code side.

The only solution i can seem to figure is a special system level agent acts like a manager agent for net framework calls. it will spawn agents for specific tasks in a green mode when it knows they will be valid for async work (such as certain network and file calls) and will spawn special agents. otherwise it will spawn true os threads when we are running long running ops or unknown responses.

This means there will be a distinction between agents and functions that I'm not happy with but thats all right, we actually do this anyways when we write the code, I'm just going to have to make it explicit (i hate that).

Does this seem like a reasonable method for the CLR to everyone? user level stacks, a global message queue system, support for linking of green threads,a global 'shout' like message for all in the node. I'm thinking i will definitely need to use the DLR if i want to make nice and easy pattern matching assignments ala Erlang style. The other nice thing about this is if someone tries to call an agent without first setting up the green thread core language construct from another .net system there will be an exception thrown from the attempt to assign a message to a data space thats undefined, i catch this, create the green thread data space, then try again. tada the agent system core is up and running and things work.

This seems like a valid solution but I'm not sure. will have to play with it some to see how things work. It meens a lot of time spent creating stupid agents just for Net framework interop (though a lot of it will be testing which works in a generic like framework).
Any suggestions? hick ups? blatant mistakes?

### Check out Scala actors

I was really worried about interop with the rest of the net framework. after all if everything is an agent what do i do when the thread for a process crashes or hangs or just stalls within a long running net framework call or such, that would really mess up my grean thread system, i can do some fancy jumping and stack saves on my side but not on other peoples code side.

The only solution i can seem to figure is a special system level agent acts like a manager agent for net framework calls. it will spawn agents for specific tasks in a green mode when it knows they will be valid for async work (such as certain network and file calls) and will spawn special agents. otherwise it will spawn true os threads when we are running long running ops or unknown responses.

You should definitely take a look at the implementation of Scala's actors library. It does something very similar to what you describe to achieve scalability despite the presence of non-blocking APIs. This paper may be the best place to start. If I recall correctly, the implementation is based on work stealing and uses Doug Lea's FJ library. Perhaps something similar is available for .NET?

### work stealing

yes the new 3.5 framework parallel fx library uses work stealing for there system but still boils down to basically a fancy os thread system. The green thread system would basically be doing the same thing but from a different direction. I'm not sure about all of the system yet but i know that i wanted erlang agent systems for sure.

I'm still thinking about how i want the rest of the language to work. I truly like the idea of a pure functional language because of the ability to do things like quick immediate assessment.

I use a more organic method of development then oop supports, i tend fiddle test fiddle test fiddle test etc and a functional language with immediate mode support with a strong ide support would be perfect for this. i could just see a tool that you rclick a function and start the tool. the code is assessed using reflection and a ui pops up allowing you to enter values for the input and see the results of the output.
make it so you have some kind of guard system and you could even automate the process ranges to insure correctness. that would be a neat toy. it's a lot harder to do that with internal state (though possible). oh well i'm still thinking about it. if i go with a pure functional system i could even develop a more graphical representation of the flow of each agent and then present two views of the language. one an agent map and the other a flow assessment.
Thats more of a what kind of tools do i want type of thing then a what do i want the language to be type of situation. I've all ready decided on using an xml based representation of the code then using a parser in and out for the view of the language. this seems like a good idea simply because it will allow for different 'language views' based of the style you like. want a c-esq type language? go with a lot of curly braces. want a python like language? go for it. as long as you support in and out of the xml format then the compiler will support the in and out of the xml to the cil.

You might read how Haskell handles foreign calls from lightweight threads. I think the Erlang implementations don't work too hard here, because they would prefer to keep native code entirely out of their address space in the interests of reliability.

Extending the Haskell Foreign Function Interface with Concurrency.
Basically, all your green threads are run by a few worker threads, and you use separate threads to call native code - blocking that green thread on the result and continuing to run the rest with the workers. There's more if things like thread-local storage require you to care what native threads some calls are made from.

Haskell on a Shared-Memory Multiprocessor shows how they extend the runtime to handle several worker threads running in parallel on an SMP systems. It's mostly about safely implementing the under-the-hood updating in pure lazy code, so maybe it's not so interesting. Even if you want to use pure lazy code in each process you might prefer separate heaps and no shared values like Erlang, so you can kill the right process if you run out of memory, and can quickly clean up after dead processes.