Preemptive concurrency via compiler-inserted checks

Consider a language that provides lightweight concurrency primitives, like JoCaml or Erlang, and which uses co-operative threads like GNU Pth for this concurrency. A co-operative thread must explicitly yield to permit other threads to execute.

Has there been any research into some sort of inference for compiler-inserted yields? For instance, each co-operative thread has a "time slice" (or a "work slice"), and we would want the thread to yield after the slice has expired. In the limit, we can force each function call to decrement the "remaining slice" until it reaches zero, at which point we yield (I believe Erlang does this, or something similar). This would seem to be quite a heavy penalty though.
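The per-call decrement described above can be sketched with Python generators standing in for green threads; every name and constant here is illustrative, not taken from any real runtime:

```python
# Sketch of Erlang-style "reduction counting": each green thread is a
# generator that yields control at every simulated function call, and the
# scheduler preempts it once its slice of reductions is used up.
from collections import deque

SLICE = 3  # reductions granted per scheduling turn (illustrative)

def worker(name, calls, log):
    for i in range(calls):
        log.append((name, i))   # pretend this is one function call
        yield                   # compiler-inserted check point

def run(threads):
    log = []
    queue = deque(worker(n, c, log) for n, c in threads)
    while queue:
        t = queue.popleft()
        for _ in range(SLICE):      # decrement the remaining slice
            try:
                next(t)
            except StopIteration:
                break
        else:
            queue.append(t)         # slice expired: preempt and requeue
    return log

log = run([("a", 4), ("b", 4)])
```

With a slice of 3 reductions, the two threads interleave in bursts of three calls, which is exactly the behaviour (and the per-call cost) the question is about.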

I was wondering if there was a "smarter" way, that didn't cost a repeated test and branch on every function call. For instance, the compiler could analyze the control flow, and insert checks every X function calls, where X > 1.

So is there any other way to build preemptive threading from co-operative threads, other than via timers, or the above technique of a runtime check per call?


Re: Erlang

I'm mostly clueless, so this might be a mostly useless comment, but: I am under the impression that Erlang basically does that, by counting the number of "reductions" it is doing in a given process? So I guess that (a) is not a different way, as you were apparently seeking, but (b) apparently works well enough in practice?

(a) is not a different way,

(a) is not a different way, as you were apparently seeking, but (b) apparently works well enough in practice?

It works well enough for Erlang, sure. I was just seeking some suggestions to reduce the number of checks needed, to make this metric more intelligent somehow. For instance, execute half of a work/time slice without any checks at all, then start checking on every function call once that threshold is reached. Or if there was some program analysis that could pinpoint an exact preemption point.
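One way to make the common case cheaper, sketched below under purely illustrative names, is to keep only a decrement-and-compare on the fast path and consult the real clock once every N calls:

```python
# Sketch of amortizing an expensive timeliness check: the per-call cost is
# a decrement and a compare; the actual clock is read only once every
# CHECK_EVERY calls. All names and constants are illustrative.
import time

CHECK_EVERY = 1000

class Slice:
    def __init__(self, budget_s):
        self.deadline = time.monotonic() + budget_s
        self.countdown = CHECK_EVERY
        self.clock_reads = 0

    def should_yield(self):
        self.countdown -= 1          # fast path: test and branch only
        if self.countdown > 0:
            return False
        self.countdown = CHECK_EVERY
        self.clock_reads += 1        # slow path: actually read the clock
        return time.monotonic() >= self.deadline

s = Slice(budget_s=60.0)
yields = sum(s.should_yield() for _ in range(10_000))
```

The trade-off is timeliness: preemption can be late by up to CHECK_EVERY calls, which is the same tension discussed below for GC safe points.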

Isn't that basically the

Isn't that basically the same problem as inserting safe-points to interrupt a program's execution (e.g. in a stop-the-world GC)? You have the same tradeoff between check overhead & timeliness. If you don't care too much about timeliness but only want to make sure execution is bounded, then you can, e.g., only insert checks before potentially (mutually) recursive calls and backward branches. The tricks to reduce the overhead of each check might not all be applicable; I don't think MMU hacks are what you're after here :)
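The placement rule above (checks before potentially recursive calls and at backward branches) can be illustrated in miniature; `poll()` is a hypothetical stand-in for whatever the runtime's check does:

```python
# Where a compiler might place safe-point checks if it only needs bounded
# execution: at loop back edges and on entry to potentially recursive
# functions. Straight-line code needs no checks at all.
polls = 0

def poll():
    global polls
    polls += 1          # a real runtime would test a flag and maybe yield

def sum_to(n):
    total = 0
    for i in range(n):  # straight-line body needs no check...
        total += i
        poll()          # ...but the loop back edge gets one
    return total

def depth(node):
    poll()              # entry check covers the recursive call below
    return 1 + max((depth(c) for c in node), default=0)

assert sum_to(5) == 10
assert depth([[], [[]]]) == 3
```

Any non-terminating execution must pass through a back edge or a recursive call infinitely often, so these two placements alone suffice to bound the time between checks.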

Isn't that basically the

Isn't that basically the same problem as inserting safe-points to interrupt a program's execution (e.g. in a stop-the-world GC)?

Similar, yes. And similar to the checks green threads must perform to ensure they don't overflow their small stacks. That's another question I might pose at some point actually, but I'm looking into regions for this reason. :-)

If you don't care too much about timeliness but only want to make sure execution is bounded [...]

The idea isn't perfect timeliness; the goal is to prevent a DoS on the CPU (i.e., infinite loops) and let other green threads make progress. Ideally, each thread would make a language "syscall" that blocks, relinquishing the CPU naturally, but I want to guarantee progress of other green threads even in the face of malicious or CPU-bound code.

For instance, this could be done with timers and signals. I've read a great deal of problems caused by signals, so I'm trying to avoid them, but it often seems like the only way to achieve what I need.

An alternative is a watchdog thread which periodically checks the status of the worker threads. If a worker thread has been executing the same green thread for too long, the watchdog would need some way to interrupt it. Is there a portable mechanism for doing this that doesn't resort to constantly checking a semaphore? Again, signals come to mind.
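The watchdog idea can be sketched with ordinary OS threads and a flag instead of a signal; all names here are illustrative. Note it still needs the worker to reach a check: a flag cannot interrupt code that never polls.

```python
# Sketch of a watchdog: a separate thread periodically sets a "preempt
# requested" flag, and worker threads test that flag at their safe points
# instead of fielding an asynchronous signal.
import threading
import time

preempt = threading.Event()

def watchdog(interval_s, stop):
    while not stop.is_set():
        time.sleep(interval_s)
        preempt.set()               # ask the running green thread to yield

def worker(iterations, log):
    for i in range(iterations):
        if preempt.is_set():        # safe-point check
            log.append(("yield", i))
            preempt.clear()         # pretend we switched green threads
        # ... one unit of work ...

stop = threading.Event()
threading.Thread(target=watchdog, args=(0.01, stop), daemon=True).start()
log = []
end = time.monotonic() + 0.1
while time.monotonic() < end:
    worker(1000, log)
stop.set()
```

This is portable, but it merely relocates the per-check cost from a clock read to a flag test; it does not eliminate the need for checks.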

I don't think MMU hacks are what you're after here

Definitely not. :-)

[Edit: I don't expect the branch to be all that expensive on Core 2, since branch prediction would be nearly 100% until the work/time slice reaches its end, but if there's a smarter way to do it I'd love to hear it!]

cost of test and branch

The smart thing is to couple it with an operation that is expensive anyway, and possibly has a non-predictable branch (the latter may or may not be a good idea).

Larceny Tasks

The Larceny Scheme compiler makes a primitive preemption facility in the form of timer interrupts available to the programmer. These interrupts are currently based on a count-down software timer which is decremented and checked on every procedure call and backward branch. When the timer reaches 0, a user-installable handler procedure is called.

This, along with Scheme's call/cc, provides sufficient machinery to implement a green/fair threading system with pseudo-preemptive traits.

The handler procedure is parameterizable, so a simple

(timer-interrupt-handler
  (lambda ()
    (task-switch #t #f)))

is all you need.

See Larceny note:
http://www.ccs.neu.edu/home/lth/larceny/notes/note1-timer.html

For an implementation based on original work done by the Larceny developers, see
http://github.com/GreyLensman/rl3/tree/e90a1de202f4081ae45ac9959fbc6d8aa6ae10e2/rl3/concurrency/tasks.sls
for details.

Interesting. Have they tried

Interesting. Have they tried to quantify the overhead of the count-down software "timer"?

Forth?

I'm looking at similar things now too. I noticed that in the OpenFirmware (Forth-based firmware of Sun/Powerbook/OLPC) the address of the main interpreter/VM loop is kept in a register, and on x86 each Forth word ends with:
jmp edi
to dispatch the next instruction. This seems very amenable to preemption: a timer interrupt could poke the address of a 'yield' routine into edi and it'd be automatically run on the next invocation.
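That dispatch indirection can be mimicked in Python, with a mutable dispatch cell standing in for the edi register; everything here is an illustrative toy, not OpenFirmware code:

```python
# Sketch of the OpenFirmware trick: the inner interpreter always dispatches
# through one indirection (the `dispatch` cell, standing in for edi), so a
# timer need only overwrite that cell to get a yield run at the very next
# word boundary.
trace = []

def next_word(vm):            # normal dispatch: run the next word
    if vm["pc"] < len(vm["words"]):
        trace.append(vm["words"][vm["pc"]])
        vm["pc"] += 1
        return True
    return False

def yield_then_resume(vm):    # what the timer pokes in: yield once,
    trace.append("YIELD")     # then restore normal dispatch
    vm["dispatch"] = next_word
    return True

vm = {"pc": 0, "words": ["dup", "swap", "drop", "exit"],
      "dispatch": next_word}

steps = 0
while vm["dispatch"](vm):
    steps += 1
    if steps == 2:            # simulate the timer interrupt firing here
        vm["dispatch"] = yield_then_resume
```

The appeal is that the fast path pays nothing extra at all: the check "happens" only when the timer actually fires and rewrites the dispatch cell.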

MMU/breakpoint tricks should be lots of fun too :-)

Break on GC

One less expensive approach than checking on every function call would be to check on every garbage collection. If you were using a generational garbage collector with a small initial generation and did not do tail-call optimization, the recursion depth available before preemption would be bounded. To ensure tail calls do not take up extra space, use the garbage collector and continuation-passing style.
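A minimal sketch of this idea, with an illustrative nursery size and names: every allocation decrements a nursery budget, and exhausting the nursery triggers a simulated minor GC that doubles as the preemption check.

```python
# Sketch of checking for preemption only at allocation points. With CPS
# and no tail-call optimization, every call allocates, so recursion depth
# between checks is bounded by the nursery size.
NURSERY = 8  # illustrative nursery size, in cells

class Heap:
    def __init__(self):
        self.free = NURSERY
        self.preemptions = 0

    def alloc(self, value):
        if self.free == 0:
            self.free = NURSERY       # "minor GC" of the nursery...
            self.preemptions += 1     # ...doubles as the yield check
        self.free -= 1
        return [value]                # a one-cell "heap object"

heap = Heap()
cells = [heap.alloc(i) for i in range(20)]
```

The weak spot, as noted below, is code that runs hot without allocating, which never reaches a check.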

Be aware, however, that it is possible to slow a machine very easily without recursion. If denormal numbers are enabled, a malicious thread could simply do a clever division; depending on the CPU, this could lead to epic slowdowns. Funny memory access patterns and space leaks can cause issues too. And if you are using mark-sweep, then memory fragmentation is another attack vector.

Excellent points!

Excellent points!

I think slowdowns via certain operations are handled, since execution time is booked to the agent, so malicious code is only really wasting its own slice. I think I'm clear on the policy I need to handle DoS attacks; I'm just unclear on what efficient mechanisms are available for implementing those policies.

You do raise a good point about memory fragmentation attacks though. I'm still investigating regions, which should prevent that particular attack, but if I choose to go with a concurrent collector, then I will have to consider this issue. I was going to account for memory, just as I plan to account for CPU time, but fragmentation is an interesting attack vector that hadn't occurred to me. Thanks!

Using signals

Thanks for all of the suggestions. I was trying to avoid signals, but then I found SML/NJ's handling of signals as continuation-producing functions. This seems sufficient for providing a green threading framework entirely within a language. Anyone have any experience or comments on this?

Does this mean

...that, assuming a code fragment taking T(n) to complete gets m yields inserted, and another taking T(2n) to complete gets 2m yields, then if the compiler analyses a program fragment as non-terminating, it would have to insert infinitely many yields?

That is, to do this you'll have to completely unify code and data plus solve the halting problem.

I think you're better off just using preemptive multitasking, also because it scales better to real parallelism: most of your yields will be placed utterly wrong if you have more than one CPU.

I think you're better off

I think you're better off just using preemptive multitasking, also because it scales better to real parallelism

Preemptive threading is exactly what I'm trying to achieve. The question I'm asking is: how? The compiler-inserted yield points/checks was one possibility I put forth, as was Erlang's countdown timer. I was soliciting suggestions for any other ideas.

GHC

Just as another data point. GHC implements pre-emptive threading by (potentially) yielding on heap checks (which, conceptually at least, occur on every allocation). It is possible to DoS a GHC computation by having a tight loop that doesn't allocate. It would be easy enough to fix this by inserting explicit yields in recursions if there aren't already heap checks. This hasn't been done yet as it hasn't been much of a problem.

Related work

Marc Feeley, "Polling efficiently on stock hardware", FPCA 1993. Here: http://www.iro.umontreal.ca/~feeley/papers/polling.ps.gz

Excellent link! This is

Excellent link! This is exactly what I was looking for, Re: quantifying the costs of polling. I figured someone must have studied it at some point. Thanks!