Was there a language with an explicit call stack?

When writing a runtime system or an interpreter in C (though TBH it’s true about all programming languages that I know of) with e. g. moving garbage collection or support for continuations, you find that, surprisingly, the call stack is too abstracted away by the semantics, and basically have to roll a separate one yourself (without the benefits of hardware support) or resort to things like getcontext(3) or GCC’s split stacks that technically are an extension of the semantics that need their own definition. LLVM tries to support precise GC, but falls short of real-world uses. Another facet of this problem is the unpredictability of stack overflows with alloca(3), C99 variable-length arrays or just plain old recursion. (I have to count my recursive calls so that my recursive-descent parser doesn’t crash on malicious input? Really?)

This motivates the question: was there a language that officially permitted or even required managing one’s own call stack, inspecting stack frames and the like? Note that this doesn’t mean not having some sort of subroutine linkage, but could mean banning recursion unless the user explicitly allocates space for it, so that the implicit stack is statically guaranteed to be bounded. (FORTRAN didn’t have recursion, of course, but I doubt it was for this reason.) Things I’d want to (be able to) implement in such a language are, again, a moving GC, closures, coroutines, and continuations. I’d expect the language to be reasonably ALGOL- or ML-like so that FORTH and its kin are excluded.

On a second thought it looks like the requirements of moving GC are orthogonal to those of continuations and well-defined stack overflow behaviour, but closures seem to be somewhere in between, so I’m leaving it as one question with possibly multiple valid answers.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Precisely why Fortran didn't have recursion

Without recursion, you don't need a stack, all data is statically allocated. Stacks weren't even understood at the time: they began as an implementation technique for Algol 60 recursive procedures.

Out of ignorance not understanding

But now that we built in recursion, we’ve lost static allocation, and I don’t see how to get it back. (‘f x at ρ’?) More importantly, I’m not aware of any attempts to do that. An additional complication is that (on recent Intel architectures, for example) you almost have to use the stack the hardware provides you for linkage, or branch mispredictions will kill your preformance; so reifying is not a good option.


cf. work on Statically Allocated Parallel Functional Language by Mycroft and Sharp.

you almost have to use the

you almost have to use the stack the hardware provides you for linkage, or branch mispredictions will kill your performance

I haven't tested this in a long time, but I'm pretty sure Intel's shadow return stack is independent of the push/pop acceleration. That means you can reify however you want, as long as you know which (indirect) calls should be executed as push/ret. That does restrict the use of the stack pointer, but that register isn't a full-blown GPR anyway, so you were probably going to use it for something like a nursery.

ABIs are more likely to be hard constraints here: win32 SEH likes to unwind the stack, you have to be careful with posix signal handlers, ...

Back to the original point, I'm not sure how often bounded-by-RAM recursion is the best tool. It's convenient, but I often end up switching to an explicit stack anyway; there must be a nice algorithmic skeleton hiding somewhere in the ton of worklist processing code everyone writes.

I meant the return addresses

Push/pop acceleration is indeed independent, but it is also much smaller fish than the shadow return stack. So yes, I meant that you have to use the hardware subroutine linkage: CALL/RET (BLX/BX LR, whatever; that the ARM mechanism doesn’t use the stack proper is interesting, though). Theoretically this should penalize Cheney on the MTA (see below) rather drastically, but I guess the environment is not that fast anyway.

My point is that you can

My point is that you can pair CALL/RET without linking on the stack. You just have to push the return address before executing RET.

Worklist processing?

You mean the kind of control-flow queues that E and Javascript (in the form of setTimeout(0, …)) have?

yes, amplification requested for worklist processing

I thought "worklist processing code everyone writes" was very interesting too. I assume it means a simulation of lightweight processes for systems that don't have lightweight processes, in order to perform incremental non-blocking tasks that must either wait on async i/o, or avoid starving the main thread control flow. I would like to hear more about this. :-)

We have such a thing where I work too, but it is both byzantine in structure and frightening in scope, like Godzilla. It's one of those "what if everything in the world is worklist processing?" deals, making you wish both language and runtime was organized around it, because inversion of control is the norm. It strongly motivates a lightweight process model.

Shame the comments are a tree not a DAG

As long as you know which (indirect) calls should be executed as push/ret

Do you, in CPS code?

You do in the common case if

You do in the common case if you skip the naïve CPS conversion and track arguments that are only passed continuations. It's the same set of techniques you'd use to CPS before analysis and unCPS before code generation.

Well, yes, obviously....

There was Forth.

Stack management doesn't get much more explicit than that.

It is, in fact, the only language with even LESS syntax than Lisp.

Other point free languages

postscript maybe

Not the control stack, surprisingly

Well, if you restrict yourself to ANS and don’t do things like

: EXECUTE ( ... xt -- ... )  PUSH ; ( in a DTC system )

the return stack of Forth is still pretty much absent from the semantics. That is, knowing how to work with the control stack is still normally considered implementation-specific knowledge. (Also, the pervasive data stack makes closures impossible, barring LOCALS.) But then not exploiting implementation details is very much un-Forth-like, of course.

Values on the control stack are...

Values on the control stack are... what? Concretely they're addresses, but concrete addresses are way too chaotically low-level powerful (such stuff as exploits are made on). The obvious abstract form would seem to be continuations, but continuations imply something like a spaghetti stack — parent pointers, with concretely the parent information embedded in the control code that the return address points to. It's not obvious that continuations are the right abstraction for this. Perhaps we want an entirely other abstraction, something that hides less, or even hides just plain different, information than continuations do.

Linearly typed one-shot continuations outside of Set

The toy CPS-style language λCPS from Compiling with Continuations, Continued has second-class continuations that can’t be used as values, and on the first glance they seem to capture the expressive power of a conventional call stack neatly. I’d have to reread the paper once again to see how environment lifetimes play out, though, and this is a crucial point for what I’m after. Figuring out a typing (and ‘kinding’) discipline for this thing would also be a nice exercise.

abstract return addresses

On consideration, the trouble with continuations in this context is that a continuation contains way more information than a return address; really, a continuation contains as much information as a loaded control stack, so treating continuations as encapsulated objects, of whatever class, is no better than having an implicit control stack. The return addresses on a loaded control stack are so-called "delimited continuations"; effectively they're functions, except that functions are ordinarily written in static form while "delimited continuations" are captured at runtime by playing back part of a static function and then putting the rest in a doggie bag. Perhaps the reason "delimited continuations" have attracted interest is because they correspond almost exactly to an abstraction we've been using intensively for about half a century. (I've been meaning for ages to try my hand at a blog post on "delimited continuations", but never felt ready; maybe it's finally coming time.)


Taking the thought full-circle, if you break a static function into its atomic parts, each of which is a minimal "delimited continuation", and then make the control stack explicitly accessible, you have... Forth.

Problem with environments

But now, when I try to define a function call as a non-primitive operation (capture a delimited continuation, push it to stack, establish a new prompt, branch to function), I don’t see how to handle (not necessarily first-class) environments, including argument passing. Forth solves this by not having them and using the parameter stack instead; as a result the “delimited continuation” represented by the returm address doesn’t close over anything, and any sort of higher-order functional programming in Forth is painful (String handling with generalization of SKIP and SCAN is the only attempt that I know of). Do you think we can have it both ways?

Brilliant observation

I thought I understood your description of Forth, and then I understood it... Spot on. Interesting that you can also store literal (i.e. constant) “delimited continuations” in this model: this is what BRANCH and 0BRANCH do.

Sorry, bad listening on my part

It appears I was distracted by my original intentions and so pleased by the realization that second-class continuations are equivalent to conventional stacks expressiveness-wise that I didn’t notice you actually were asking a different question; sorry. Indeed continuations are stacks not return addresses. (Although it’s not the literal return addresses you’re talking about, but stack frames/activation records, right?)

It seems that common motivation for delimited continuations involves stack slices—i.e. collections of linked stack frames—though, not single stack frames.

There is an alternative in C

When writing a runtime system or an interpreter in C (though TBH it’s true about all programming languages that I know of) with e. g. moving garbage collection or support for continuations, you find that, surprisingly, the call stack is too abstracted away by the semantics, and basically have to roll a separate one yourself (without the benefits of hardware support) or resort to things like getcontext(3) or GCC’s split stacks that technically are an extension of the semantics that need their own definition.

True for 99% of implementations, but there is an alternative that can be fully within the standard if you use C++ and exceptions, Cheney on the MTA: cps, never returning, doing your allocation with alloca, and finally, doing your gc by moving things off the real stack then throwing an exception. You don't lose information and returns when you throw because in this cps style you NEVER return.

Chicken Scheme does this in pure C, using setjmp longjmp and portable if not standard stack checks... The result is supposedly the fastest scheme in C around.

The original papers were called "cons should not cons" suggesting, I think, that cons cells should be local variables not allocated in the heap.

CONS should not Evaluate its Arguments

CONS Should not CONS its Arguments, or, a Lazy Alloc is a Smart Alloc" is just a reference to this paper on lazy evaluation. Doesn't really make sense if you're not aware of the older work. https://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR44

Different papers

You linked to CONS should not evaluate its arguments, which is indeed a paper on lazy evaluation. I believe that the two parts of CONS should not CONS its arguments were meant: A lazy alloc is a smart alloc (where lazy refers to generational GC, not to lazy evaluation), and Cheney on the M.T.A. So much for puns on other authors’ paper titles.

I've been interested in this too

on the basis of the principle of having a language where NOTHING is opaque.

I've seen the return address used as a hash value in the gui library for the Unity 3D game engine.

In that case you don't give GUI items like buttons identifiers, you just have ONE call into standard button code per button and that standard library code gives the button an identity based on the return address.

This is a powerful way of simplifying the code, and very unusual.

Now this is seriously evil

Not only does it depend on absence of inlining of client code, it can also be broken by the compiler duplicating the code (if it wishes to move it into both branches of the following conditional, for example). Also the user has to take care not to create things in a loop (but that is also true of the more common __COUNTER__-based hacks).

On the other hand, it exploits an interesting form of generativeness inside the compiler; there might be a good point trying to get out here.

Good point

In this case they have control over the compiler which is a version of mono (an open sourced clone of .net)

Side Effects

Not an existing language, but one of the things I am looking at is "pure" code, defined as code that cannot go wrong at runtime (as opposed to Haskell pure code which of course can run out of memory or stack).

It seems to me if you perform no allocations (IE the code works in a fixed sized preallocated memory block) we can statically guarantee it wont 'go wrong', although it may diverge, that may or may not be seen as an error.

I have an effect system, and the idea is to provide effect constraints. For example we might want an interrupt service routine that cannot go wrong, nor diverge, and perhaps with a constrained longest code path (execution time is a side effect). This would guarantee that if the initialisation succeeded, reserving memory etc, that the interrupt callback could not fail.

I also plan to allow runtime stack reflection (maybe with separate stacks for pointers and values), so that GC can be implemented as a library. The GC code itself may be constrained to not use any stack. This will also allow static reflection of the program abstract syntax tree to facilitate coalescing allocations if the GC library needs to do that. It would also allow proving the GC cannot 'go wrong', nor diverge, and provide a way to limit the time it runs for.

In stackless code, all functions are implemented by inlining and monomorphisation, and all parameters are passed by reference. Imperative loops should be used in place of recursion.


Amusing Roof Space?

Odd name, but looks interesting, thanks for the pointer :-)

sorta ("was there a language that...")

This motivates the question: was there a language that officially permitted or even required managing one’s own call stack, inspecting stack frames and the like?


1. real Scheme.

A large part of the point of the compilation strategy in the Rabbit thesis is translate to an intermediate language whose operational model requires no stack. The invention of the implementation technique known as a "spaghetti stack" (RMS et al.) helps to illustrate the thinking: that mostly-stack-ish allocation could be compartmentalized as an implementation detail for a core language (a subset of scheme) that did not in any way require a stack.

The intermediate form of the Rabbit thesis is suitable for naive interpretation using a general purpose allocator for environments, at the cost of speed. This technique is moribund in practice and the art of finding slow interpreters nevertheless useful is largely dead but there it was.

The operational model of the rabbit intermediate language invites a very easy and obvious extension to include the concepts of first-class environments and run-time reflective environments.

Of course, this is all also recapitulated to some degree wherever people worked on CPS-translation as a compilation technique.

2. Actors

It is at this time a paper language, to the best of my knowledge, but Hewitt's Actors in their present form seem to imply a similar transformation. That is to say, the "stack" itself can be usefully regarded as an assembly of "stackless" actors that tend to act in stack-like ways.

No, I don't know of any examples where someone had the freedom to develop those lines of thought thoroughly, but there they are and they still seem to me like an interesting direction.

3. Might be worth looking into Smalltalk, too.

I'm just not as familiar with it.

wheels within wheels

Even if you make the stack explicit, I think something remains implicit, in the runtime that operates on the explicit stack. Given a Venn diagram with a circle labeled "explicit", it's hard to make the implicit world outside that circle the empty set. Where hardware meets software requires something host the explicit model with some glue.

When writing a runtime system or an interpreter in C [...], the call stack is too abstracted away by the semantics, and basically have to roll a separate one yourself [...]

You can take a body of code, say in C, and compile it to cps form where the stack is explicit, say with call frames in cactus stacks (originally called spaghetti stacks as noted by Lord). But then some host runtime has to execute this explicit model, say using implicit trampoline style. You would have perfect reflection on the explicit model, but not the implicit runtime (unless it self-describes).

To operate on the explicit model (from within the code compiled to the cactus stack model) requires you use an api you know will end up present after compilation.

In effect, you can compile C to a form that looks like a simulation of the C runtime, which gets run on something which still has a normal C stack, but one that doesn't wind in the cps runtime hosted within. Objectively you would think this would be slow. Except your entire app might be zillions of little things waiting for packets to show up, a network switch on steroids, where this sort of result at runtime was going to happen anyway. In which case the best you could hope for is more efficient context switching and explicit representation for management purposes.

counting recursive calls

In the early 90's I used to ask every job candidate how they could arrange to see stack depth well enough to prevent stack overflow on (for example) unbounded recursion. It was my standard question, to compare relative understanding of runtime behavior. The hard part was setting up the question, so there was a reason to code a solution. ("Assume the problem is inherently recursive, but you prefer graceful exit with an error to a crash.")

(I have to count my recursive calls so that my recursive-descent parser doesn't crash on malicious input? Really?)

There was really only two good answers (three when two variants are considered not identical). First is count calls, after estimating stack consumed per call. Second is compare address of a local variable to a reference address captured earlier, near top of stack: inside main() or inside a thread's entry point. On Macs with a default stack size of 8K, it was necessary to do this when implementing a language interpreter. (Because: there was no memory protection, so wildly overwriting memory could kill your file system, if file i/o buffers managed to get flushed after they got clobbered. Thus being careful was obligatory.)

Today if I write a recursive descent parser, the AST gets a stackoverflow node when nesting gets too deep, after some unreasonable amount of nesting. I like to always generate an AST, so any error causes explicit error nodes describing exactly where the problem happened. ("Right here, you nested too deeply.")

The idea I suggested earlier was that you can make an implicit stack explicit, by rewriting the code so that (while keeping most things intact) each call stages the calling conventions explicitly. If the implicit C stack is on (say) a Z-axis, the rewrite moves parameters and locals to a Y-axis managed as heap allocated call frames. The trampoline api involved becomes explicit, which is of course nasty. But you can compile the rewritten code with your choice of C compiler, that will run on a Y-axis scheduler that uses the normal C runtime Z-axis itself, but it keeps unwinding on returns to the trampoline. The Y-axis call stack gets knit as a result. If you have threads for a thread pool, or else non-blocking i/o, you can farm out all blocking operations to be done by a thread or another process via non-blocking sockets. This allows you to implement non-blocking fibers in a single-threaded host runtime.

hybrid stack

What are the consequences of doing that to your stack? What pros and cons do you get, and did you already have the cons involved? (This is for anyone with interests similar to those I had ten and twenty years ago.)

The y-axis stack on the heap is slower to administrate than a native C stack, and standard toolchain debugging aids like gdb will have trouble. So degraded speed and debugging are concerns as cons. Let's do speed first. In a call tree, there are more leaves than inner nodes. You can keep leaves on the z-axis; in fact, you can keep anything on the z-axis (normal C stack) that need not yield in the scheduler running fibers on the y-axis. Any function that calls no code reaching a blocking call needs no rewrite; all that code will run just as fast as before, and you can set normal breakpoints inside. A function running on the y-axis will make z-axis calls that run to completion before a fiber can yield. (Any compute bound function that might run too long can be broken into pieces that yield by simply pretending some operation might block, even when it can't.)

Without fibers, you normally write async code with callbacks and heap allocated contexts, without any semblence of reified stack structure (so there's no good way to say "break when this continuation resumes"). So you already have cons of heap allocation and looking up callbacks to fire when events they depend upon occur. But that stuff occurs in the central part of the call tree skeleton, while leaf code still runs fast. This is still true with a rewrite to y-axis CPS transformation, so it is only slower if the process got heavier weight.

Now let's talk about debugging async callbacks. How much fun is it to step through that in a debugger? You can set a breakpoint in code called by the async continuation, but it breaks there for everyone, and not just the current operation you want to analyze -- because there's no stack per se for the denatured async task which runs whenever the worklist processing gets to it. If your async task was a fiber, with it's own y-axis stack, you can have tools debugging just that one fiber instead of all code hitting a function. In this case a CPS rewrite puts structure back into the code lost by async callbacks.

The pros you get in exchange for cons (which you might have already been paying for) involve clarity, organization, and control, when you can treat a fiber (or group of fibers) like a lightweight process. You might be able to do symbolic debugging on the version of code before the CPS transform to y-axis calling conventions.

For a small and simple project, a CPS rewrite could increase dev time, because it gets hairier. But a project you hope to do in a year, that actually takes you three years, might be sped up by axe-sharpening first that normalizes on a level of complexity less than the worst it can get. Imagine a pseudo Richter scale that measures a code project on a log scale in terms of long it takes to do. Each 0.5 increase is a factor of two, and each 1.0 increase is a factor of four. On this scale, 8 is six months, 9 is two years, 10 is eight years. (You will be fired if you project a magnitude 10 project, or if your project estimated at 9 is actually a 10.) The idea of CPS rewrite is to aim at normalizing cost at 7 or 8, where smaller projects will get slower and bigger projects will get faster. So it's dumb to do on small class projects, but a good idea on anything above an 8.


What do you think of Google's Go language? It appears to implement fibres much as you would want, with goroutines, and anecdotally scales to millions of in-flight routines which are stackless.

It seems to be a combination of the best parts of Java (interfaces, static typing) and JavaScript (event based) in a garbage collected language deliberately kept simple.

doesn't suck in several ways, not much to complain about

Well, Go stacks are not explicit, in the sense of fiddling with them as you please via first class api. Of language inventions in the last 15 years, Go is least offensive to me. I would find a job working in Go far more pleasant than most alternatives. But hell is other people, not languages, and you get people mostly unchanged everywhere you go, universally going on about their social status and pet stupid theories sanctioned by convention. :-) At least Go has enforced code style that stops folks from dwelling on some trivial things, muting some of the drama.

It is nice to have as many fibers as you want. Go doesn't have killable fibers though, so they aren't like lightweight processes the way Erlang processes aim for crash-as-cleanup semantics. But lots of it sounds charming, and free tools from smart, responsible suppliers is hard to beat.

My remarks were suited more for someone working in C, or another brittle language, to suggest you can get the results you want if you try. You can embed a fiber system inside a single-threaded host runtime, as long as an async interface is acceptable, and there's a way to get non-blocking i/o to daemons. But while you can get whatever you want to happen in code, getting other folks to be reasonable is rather harder.

My opinion isn't interesting. I posted a couple more times, because I left out details a person like me would have really liked to hear. But my opinion of languages is not something I would have ever cared to hear about, mainly because predictably boring.

Stackless Concurrency

From your past posts I get them impression you work with highly concurrent systems where holding a stack per thread is a problem. It is also my experience that treading does not work well in variable latency / high latency environments (like networking). I am therefore interested if you think 'Go' gets concurrency and parallelism right. My own observations are what I have seen of it so far looks good. It does not have the kind of support for generics I would like, but you can't have everything.

I saw an interesting article on C/Go interoperability: http://blog.labix.org/2010/12/10/integrating-go-with-c-the-zookeeper-binding-experience, so I was wondering what it would be like to integrate async 'C' code with manual fibres and Go.

okay, just this once

When I paid attention, the Go story was working for me. Nothing caught my eye as a mistake, so my view was positive. Every specific design has trade-offs, though, so something might work less well in Go, just haven't thought about it much (beyond how it would fit in my work problems). I gather folks have a tendency to share state between goroutines in a way that suggests lack of guidance, or something. But I suppose it is loose ends rather than gotchas. Leaving users to their own devices isn't as bad as surprising them with a trap that looks safe. I prefer tools that let you make mistakes, rather than force you into a carefully vetted groove. So being able to shoot yourself in the foot is okay to me.

Come August I'll have been at the same networking company (of no great prestige) for ten years. I just do what I'm told. :-) Like elsewhere, a lot of the time tasks that need doing are a terrible grind: "making license plates" as Watershouse would say (in Stephenson's Cryptonomicon). More often than not, I fix things that are messed up. Occasionally I design something, when that's what is missing. Usually my focus (design, features, or bug fixing) is scaling and optimization. My first few years focused on deduplication for WAN optimization, but recent years have been mostly IPsec: code bases descending from racoon, but arranged in a particularly complex way, so just keeping your bearings is hard. Until several years ago, I worked mostly in C++; but I was asked very nicely to work entirely in C, and I didn't quit, and it hasn't killed me yet. C was the first language I learned really well in the early 80's, but then I learned C++ in 89 and began working full time in that at Apple in 90. (Pink early 90's, OpenDoc mid 90's, Netscape late 90's.) One of the IPsec majordomo processes I tweak now is in C++ (with a particularly nasty Boost flavor).

I get assigned things I would have found horrifically complex in my 30's and 40's. Sometimes I feel just barely smart enough to debug what I see. There's a bunch of processes and threads, and a lot of connections and tunnels, and code tends toward byzantine organization. (Meaning, sometimes I am the only person who can figure something out. People pretend to read my email explanations; all they care about is whether I make it right.) The ultimate goal is to support as many clients as humanly possible with a particular hardware setup. But it's a target rich environment in terms of low hanging fruit. I only just finished replacing some lists with hashmaps a week ago, in an IKEv1 daemon with C coding artifacts from the 90's, right after I shrank the memory it used by 96% when thousands of tunnels are configured. My role is to be a peon with a high IQ, who will do things that otherwise won't get done.

I spent the first few years trying to think of what kind of software architecture would make all the problems easy to resolve, from design to debugging. Somewhere in the middle, I thought of an answer I find satisfying. To apply the answer would require writing tools that transform existing bodies of C code into new versions that do the same things but with much better control and organization. All through my career I have been far more interested in programming languages and development environments, so I think of solutions in those terms, and then sometimes mimic the essential tactics in local changes to code. For example, a lot of the deduplication system was a hand-rolled simulation of green threads, but organized so it made other engineers happy because finite state machines had their guts hanging out so it looked like "just a bunch of FSMs".

The things I know about this particular network switch plus load balancer (etc) have no translation to other environments, except I have a good idea what kinds of things everyone must be living with, and what they would rather use instead, instead of sharpening prehistoric flint tools in C.

Most of your posts are very

Most of your posts are very condensed. Knowing your opinions about irrelevant things increased the amount of shared context we have, which then makes it easier to decondense your posts.

more context seems due then

I hope the post above helps some, by providing a bit more context about my normal work problems, though I don't care to dwell on work. I can never tell whether everything I say is totally obvious already.

I know C and C++ fairly well, and Scheme and Smalltalk well enough to implement them. Statically typed versions of Lisp or Smalltalk would be more fun to use professionally, but I see little demand for that. I have been offered Java positions before (last time at VMware), which surprised me since I professed to know little about it. In explanation, I was told the most important thing was that I understood what needed to happen, algorithmically, when it came to where bytes needed to go and when, despite a perfect storm of stuff happening at the same time. (My reaction is to shrug and say: if you say so.)

I like dynamic languages, but appreciate static typing, and have a taste for as much simplicity as you can get. In particular, I want definitions to sound simple, and be understood by junior engineers. Edge cases in tools that require exotic explanations are a bad sign, especially if they require a priesthood to explain, doubly so when folks still don't understand afterward. I appreciate that simplicity requires skill and inspiration. Anyone can make a mess.


Perhaps I can relate the Felix model.

Felix uses *both* the machine stack and a spaghetti stack. It suffers consequences and reaps benefits.

Procedures use the spaghetti stack (heap allocated stack frames linked by pointers). Functions use the machine stack for return addresses. They do this because Felix uses the C/C++ ABI to allow lifting C/C++ code and types without (or without much) glue code.

Felix procedural code uses mutable continuations (technically resumptions) which support fibres. Procedural code is stackless (I mean, machine stackless).

The garbage collector is precise on the heap, and conservative on the machine stack. The type system does NOT ensure the programmer doesn't mess up stacks (unfortunately perhaps). Multi-threading is also supported but requires a cooperative world stop (another compromise).

Spaghetti stacks use close to optimal address space and support the fastest possible context switching, however they suffer from the overheads of requiring heap allocation and de-allocation. This is less than might be expected due to aggressive inlining and other optimisations.

Machine stacks provide faster allocation and deallocation and C/C++ compatibility, but suffer from lack of extensibility: one has to allocate the maximum possible address space required at any time for each thread's stack. On 32 bit machines the limitation is untenable, on 64 bit machines its a bit easier to tolerate (32 bits of space per stack still leaves space for 32 bits worth of stacks ...). However the load on the VM paging system is likely to be extreme ***

Felix could run millions of fibres on a 1980's IBM PC with a context switching rate around 100K per second. OS stack swapping can only be done portably using pthreads.

Stack copying (which I believe Go uses) is a reasonable method, but it is not portable, and it isn't clear C++ with exception handling will tolerate it. It has the advantage only a single large stack (per worker thread) needs to be allocated, as only the used part of the stack needs be saved for a suspended gothread. However copying is linear in the amount of stack used. Spaghetti stacks don't require copying.

So in Felix, functions behave like .. er .. functions. But procedural code has many more options including non-local gotos, explicit branch-and-link for control exchange, and a higher level form of coroutines (fibres) using channels for communication (which effect control exchange too).

The fibre model subsumes both functional and procedural programming: procedures are in fact delimited continuations in the sense of the symmetric lambda calculus (SLC). However the Felix implementation doesn't play so well with concurrency, although a separate implementation of pre-emptive threads with channels subsuming Go's model exists (it use OS context switching so is much slower than Go at this).

*** The VM paging model in theory supports lots of address space. In practice this is NOT the case. The reason is that in many VM including Intel CPU's the second level indirection consists of 4K pages which themselves live in virtual memory. What's more the VM mapping is per-process so the page tables have to be swapped on a process level context switch. If your memory use is compact, the system is fast enough. However if you have millions of threads each with a large stack, the address space mapping will require a physical DISK operation for a random access. There simply isn't enough RAM on most machines to hold the page tables .. let alone the actual memory being mapped.

Therefore, in my view, linear stacks are a dead duck. Whether fibre or threads are used, linear stacks are not viable on any architecture. Choose between spaghetti or stack copying. It's really a pity modern processors are a slave the to archaic C programming model.

nice writeup

Nice description, and I like your conclusion that linear stacks don't scale. Note copying is unfriendly to code ported to fibers that uses pointers to stack allocated objects, unless the location of every such pointer is fixed (updated to the new location) at copy time.

A plan to know where they are doesn't sound feasible to me, since it means you either limit what the code can say, or else figure out what the code means rather than move where it runs. It's simple to leave objects in place while still alive.

User space context switching can be very fast, though sensitive to cache line locality if one does not aim to favor related jobs in a "process group" pipeline. (I use quotes to imply user space simulation of process group, meaning the same thing.) A dev wants to write code as if cooperating fibers don't know about each other, but would like the scheduler to favor locality as if the dev wrote a monolithic version. Closely related things can switch almost cooperatively, until timeslice (or reduction count) exhaustion requires pre-emption for fairness.

Caching and Fairness

I have no idea how to effectively manage cache locality with fibres, it seems an interesting problem.

However, fairness isn't an issue: fibres in Felix at least are deliberately as unfair as possible, because the primary purpose is to provide maximum throughput, not realtime performance. Contexts switches are therefore avoided unless necessary: necessary means when a fibre cannot proceed because it is "waiting" on I/O which another fibre must provide.

In Felix there are two choice points for which fibre to run. The first is in the main scheduler, when the running fibre blocks, the scheduler may pick any of the waiting active (non-blocked) fibres to run. Felix uses a stack, which is grossly unfair, but which may actually improve locality (since the same "process group" of fibres tend to exchange control).

The other choice point is rarer: when I/O is performed on synchronous channels a read and a write request are matched up and two fibres are made active (the reader goes first to simplify data access to the write data via pointers). However an schannel can stack up multiple read requests or multiple write requests, and the one chosen at the moment is the latest (because its a stack, because stack ops are faster).

Otherwise one must understand fibres look like threads (in Felix they're called f-threads) but they're not really even remotely related underneath: schannel I/O is a high level way to effect an exchange of control between peers: there's no threading and no concurrency: logical threads of control interleave by exchange of a single physical thread of control. (Felix also has a low level primitive for control exchange called branch-and-link which was the an instruction on the PDP-11).

This leads to interesting properties: in conjunction with garbage collection f-threads cannot deadlock: deadlocked fibres become unreachable and are thus reaped by the GC and therefore don't exist and so can't be deadlocked. Similarly fibres which lock up waiting for input which is never going to come because the only owners of the other end of the schannel are gone, are unreachable and deemed to have suicided. This is, in fact, the *usual* way for fibres to terminate.

I admit to not fully understanding the design consequences. I would love a type system that provides various assurances. At present I understand various cases but have no general theoretical understanding. For example a pure pipeline is guaranteed to operate correctly. Reading two inputs in either order is sure to work provided the input providers are independent.

Felix is a good place to play because, unlike say Go, it provides real functional programming with type classes and parametric polymorphism, row polymorphism, procedural/imperative programming, and pre-emptive threading with shared memory .. as well as coroutines. Along with C/C++ ABI compatibility, seamless integration with C/C++, and a grammar defined in the library, not the compiler. Oh, and it can outperform C!

granularity games and transient unfairness

Networking code generally wants to be unfair temporarily, for higher throughput. So if you can keep going, you should, unless now you are starving other tasks for too long. It's a matter of granularity. Very fine-grained fairness is bad. Coarse-grained fairness is enough to keep everyone alive.

fibres in Felix at least are deliberately as unfair as possible, because the primary purpose is to provide maximum throughput

That's good. It's merely a bad idea to eagerly context switch, as soon as possible. When you hit a possible barrier, trying to yield is not optimal.

When a lot of system i/o is involved, which stutters (and implies latency between actual readiness), you can only run so long before the pipe runs dry for now. Trying to hog the cpu won't work. Yield is forced due to nothing ready. But you would like to keep touching the same caches lines as long as it needs doing again, so they stay in cache and are cheapest to re-touch. You want a producer generating data to unpark a waiting consumer and have the consumer run then, when the data is hot in the cache.

If you can pass data back and forth between fibers a long time without ever hitting a blocking system call, you can starve someone ready to run now, if no one is trying to see all ready fibers get some cycles, in some soft realtime sense. So compute-bound processing can have a problem with real time latency.

In my environment there's a watchdog process that checks whether heartbeats occur often enough. A process that misses a heartbeat is killed, and it's less than a second (I won't say what it is). So you can't do something like memory map many gigs of disk all in one shot and get away with it. So enforcement has coarse granularity, but it's quite severe when invoked.

Memory mapping aside.

The behavoiur of mmap you describe seems odd, because my understanding is that a mmap call just registers the address space and sets up the page fault handlers, so it should be very quick (and not dependent on the size if the file mmaped). Only when you try and access the memory are pages mapped in, and it is possible to mmap files much larger than available RAM.

second hand tale

I wasn't the guy who had to rate-limit memory mapping; one of my peers, more senior than me and better at details, had to make it incremental to avoid heartbeat induced whacking. He told me kernel memory was used to describe the mapping, and that and other things caused latency too. Best to ask someone who researched it in some depth. They should report part of cost is proportional to map footprint. (Yes, many times larger than available RAM. I would not use it for a storage system, but that time no one asked me.)

Big pages

memory mapping can be non-instant.

mmap's contract with anonymous mappings(those not backed by a file) has it initializing the entire mapping to zero. In the normal case when it's file mappings it has to initialize the contents with the file contents from some offset.

As an optimization one could make anonymous mappings (but probably not file mappings) a "virtual" initialization where a structure keeps track of what parts of the map have been written to, and returns the initialization without bothering to access the memory whenever an unwritten part is read.

But the most straightforward way to do it, and the safest in terms of not getting excessively hairy when used in threaded environments/accessed from assembly code/whatever, is to just initialize the space. That gets the prize for "least surprising behavior."

With mmap you can overcommit

With mmap you can "overcommit" a lot of space, in which case it is not a good idea to just initialize the space (i.e. allocate a backing phys page per virt page). There are various strategies possible each of which can help in diff. conditions. You can use the same phys "zero" page to back all the virt pages and mark them copy-on-write and do a phys page allocation on a write fault. Or you can allocate on first read fault.

Zeroed pages

Linux has a pool of pre-zeroed memory pages. Also Linux sparse initialises so all zero pages are not mapped. Is Linux just doing mmap better than other OSs, or do they also have efficient mmap implementations?

In my tests on Linux mmaping a file beat steaming it's contents in all cases both for small and large files.