## Continuation-Passing C: Compiling threads to events through continuations

Gabriel Kerneis and Juliusz Chroboczek, "Continuation-Passing C: Compiling threads to events through continuations", arXiv: 1011.4558.

In this paper, we introduce Continuation Passing C (CPC), a programming language for concurrent systems in which native and cooperative threads are unified and presented to the programmer as a single abstraction. The CPC compiler uses a compilation technique, based on the CPS transform, that yields efficient code and an extremely lightweight representation for contexts. We provide a complete proof of the correctness of our compilation scheme. We show in particular that lambda-lifting, a common compilation technique for functional languages, is also correct in an imperative language like C, under some conditions enforced by the CPC compiler. The current CPC compiler is mature enough to write substantial programs such as Hekate, a highly concurrent BitTorrent seeder. Our benchmark results show that CPC is as efficient, while significantly cheaper, as the most efficient thread libraries available.

## Comment viewing options

### WTF?

Lambda lifting is the identity transformation on C programs, because all free variables in C are at global scope. Not having looked at the paper, I wonder what they can possibly be talking about.

### Explained?

It looks like their compilation strategy introduces inner functions, which they must then lift. So the question at issue is whether that can be done in the (very restricted) cases they care about, and the answer is affirmative.

### Looks fun: related (but

Looks fun: related (but uncited) reading giving a good motivation for this project is the Carpaccio series of work. Despite all of their support, they some reason stopped just short of CPS'ing for the programming interface, requiring something like explicit passing of state if I recall correctly.

An interesting question is whether, for these performance-sensitive programs, continuations are the right abstraction. If you take a security perspective, that means a function you call might take away your thread of control! In the explicitly cooperative model (e.g., AJAX script), by convention** of manually passing around a continuation, that won't happen. Given that we write C code for tricky performance reasons, it's not actually clear that the traditional continuation primitives, without addressing this, are a good idea!

The paper uses a 'cps' annotation for convertible functions and only allows an annotated funtion to be called from other ones; I've mostly observed this choice to be made for safe or guided compilation, but it is also a potential stab at the problem.

**Control passing can be baked in without requiring convention: e.g., in some parallel or otherwise resource-oriented programming models, execution units are reified in some way that is explicitly sharable.

### A few answers

The "cps" annotation is here mainly for efficiency purposes (you do not want to cps-convert every function) and modular compilation (you do not want to rely on whole-program analysis to find cooperative functions because sometimes you simply do not have the whole program). It is also provides good hints about atomicity (at least as long as you stay in cooperative mode).

About performance issues, note that CPC cannot leak continuations because they are not exposed to the programmer. The runtime uses them for scheduling but only linear operators are provided (no call/cc, shift/reset, etc.).

As a matter of fact we cited von Behren et al., except it is spelled Capriccio (see reference [5] in the article). I would have loved to play with it but it does not compile on a modern Linux/gcc/libc/whatever.

### Oops, guess I had fish on my

Oops, guess I had fish on my mind when I wrote that :)

My concern here is not of leakage (first-class reifications) of continuations but of a control inversion. As another example, in browsers, you have the main thread on which no blocking operations should be called (think responsive UI thread). However, in a browser, there are many scenarios when you'd want to CPS asynchronous calls (such as in dealing with physical resources). Doing these in acceptable locations is a really big concern where, even without continuation support (they use event loops), the various browsers get hit hard (e.g., accidental file access when you thought something was cached): throw in cheap continuations, and I can envision the problem escalating.

Splitting the world into cps/non-cps seems coarse. Imagine a cache-optimized kernel that you don't want to get kicked out: if it gets called by a cps'd function, too bad, it might get kicked out. That's why I suspected it was entirely for compilation purposes.

However, if proposing continuations for large hairy systems codes, I suspect there's a way to make the abstraction palatable, and that'd be great! (Indeed, I have to wonder: if the benefit is avoiding a global transformation, isn't that something an IDE can handle with a 'refactor' button? I think there is a great opportunity here beyond the traditional motivations for CPSing.)

### Detached'' threads

Of course, when you use cooperative threads (compiled to an event-loop), you need to be non-blocking and this can sometimes be tricky.

This is why CPC provides the notion of detached'' thread: a CPC thread can temporarily be executed in a separate threadpool. You lose the ability to cooperate with other threads but you can make blocking calls without freezing your event-loop.

See for instance p. 10 of the article the "non-blocking disk reads" example.

(AFAIK, detached threads have been invented by Boussinot. See FairThreads, section 3.5.)

### That doesn't really solve the

That doesn't really solve the problem; I think we're talking past each other. Obviously, if there is blocking call, it can't be made on the non-blocking thread, and thus any such call should be put on a different thread.

The question is how to avoid making blocking calls in a non-blocking thread. Consider the constraint that all calls that may block are labeled CPC and can only be made from a CPC context. That solves the problem theoretically: just don't label the non-blocking function as a CPC one. However, the problem quickly becomes one of taint creep.

### a note on why this is useful

(This post mainly aims to give you a reason why someone would want to do something like this.)

I've been thinking about something similar: compiling to C with cps, implemented in C, using manual cactus stacks, a trampoline, cooperative lightweight green threads, and optional integration with a dynamic language implemented on top (something like interpreted Lisp and/or Smalltalk).

If you write something like networking code, and want to process many thousands of flows in the same user address space, you avoid native threads and implement some ad hoc event scheme or async callback system to represent thousands of independently multiplexed activities efficiently. And when you roll it all by hand, it gets really complicated.

For example, it's conceptually easy to implement pipes once you have an okay async infrastructure. But if you rolled everything by hand, code for pipes can look very complex, in the sense it takes a long time to reason about just a few things that seem simple in concept. Basically, you end up with operating system responsibilities when normally you can ignore this in a synchronous language.

It's very hard to test async components intended to plug into other complex async environments. To write a good (read: meaningful) standalone unit test of an async library with non-trivial semantics, you might need to write a simulation of the expected execution environment, which costs much more than your library. In fact, a good test for sophisticated async modules might always cost more than writing the module, in general.

So there's a strong motivation to go with some systematic scheme, just to control costs. Ideally, instead of hand-rolling gnarly async C, you might instead write a description in a higher level language which compiles to the same gnarly async C you would have written by hand, but without nearly as many dumb errors. And then you could write a high level description of the simulated external environment to drive the module under tests. You'd compile that to C as well.

The idea of compiling to C everywhere is largely driven by a need to permit dev peers to review code at the same level they would have written by hand without any system. It's to maximize challenges to code which criticize whether it works as expected. You want aggressive desk checking by folks who only want to read C. And you want aggressive unit testing that actually causes every block to be reached via condition injection (which is more general than error injection).

If you do it right, you can emit async C modules with no dependencies (depending only on themselves). But you can implement a larger execution environment, that might have a dynamic language, used for unit testing at first, then more after you figure out, by experimenting, that it's a nice setup.

Among other things, the larger environment can present a web server view of module runtime and stats. (The same way you write tests because you doubt production code, you also doubt tests themselves, so you need to inspect what happens under test in detail.)

### Split stacks

I wonder how GCC's recently added split stacks change the picture. Green threads appear to be a duplication of work that the OS should be doing.

### contrarian approval of duplication

Manuel Simoni: I wonder how GCC's recently added split stacks change the picture.

Only one veto is needed to stop a tool from being used. For example, I know one host environment that would veto for three reasons independently: 1) you can't depend on that version of gcc; 2) you can't use that feature even if gcc supports it; and 3) you can't use that feature unless you can control when you get scheduled, since you can't run unless we say it's okay. However, the split stack trick is interesting; too bad it's not universal to existing C code.

Green threads appear to be a duplication of work that the OS should be doing.

You say that like it's a bad thing. :-) Yes, sometimes you simply must duplicate what someone else does, if only because you need finer granularity control than exposed by existing api.

For example, if you expose a new async api with multiplexed activity and an entry point that says "run some more, without blocking, but not longer than this" (my peers call this a device api) then you must write a scheduler, even if the OS has a scheduler, so you can honor contracts.

### Why not Cheney on the MTA?

That would be cheaper than trampolining, because the return cost is only paid once for every stackful of CPSed calls rather than for every call. Of course, longjmp() is less efficient than return, but if the effective stack is big enough (Chicken uses a 64K stack by default, even though the C stack is much larger), that difference is readily overcome.

### Lazyness

We did not try Cheney on the MTA, mostly because:
- we want to keep things straightforward to understand and debug them easily (and longjmp is definitely not straightforward),
- we want to compile to something looking like event-driven programming,
- trampolining is not too inefficient so we can live with it,
- life is too short too implement every possible optimization ;-)

But you are right that this would be an interesting trick to consider (although I have learned the hard way that some "optimizations" sometimes do not perform as well as expected and shall therefore always be benchmarked).

### Speaking of longjmp

The paper does not address it at all. Is the code transformation still safe in the presence of longjmp? Is a cps function allowed to use setjmp?

### Probably not

The CPC manual mentions that setjmp/longjmp must not be used. The issue is that the continuation would not be restored when coming back to the longjmp, and some boxed variables might be leaked.

More generally, CPC does not need a garbage collector because continuations are used linearly ("one-shot continuations"). Adding support for setjmp/longjmp would probably add a significant amount of complexity to this scheme.