notes on a C-ish memory manager design

I am (re)designing a C runtime memory manager for hosting fibers, DSLs, and lightweight daemons. I started teaching my early-20s sons to program (in C++ and C), and the idea is to give them a library allowing progressive elaboration of tools written in the Unix philosophy: simple, composable processes. The idea is to use interfaces that allow embedding, so pipelines can run in threads at first, then later as fibers within a thread, without code rewrites beyond automatic translation to CPS (continuation passing style). The "app" would be a daemon hosting a user-space operating system, with services like an http agent on a port for a browser-based UI. The idea is to give my sons an undergraduate CS education in terms of what happens here, and in the stuff they write to execute inside.

My only aim in this post is to briefly describe its memory management and its relation to programming language implementation, since a couple of different (Smalltalk-ish) Lisp interpreters will run inside as DSL examples, one simple and one fancy. The paragraph above was just context, since otherwise you might wonder "why in the world would I want to do that?" about each detail. This is for your entertainment only, since I would find a helpful comment surprising. My current version of this sort of thing is much nicer than any I did before. I was trying to solve the problem of making everything refcounted without something terribly awkward happening. (Embedded sub-processes can garbage collect as they like without refcounting, though; my fancy interpreter would use stop-and-copy.)

Each runtime instance uses interfaces defined in the library only. The world outside is abstracted as an environment, so all calls to the outside world go through an env instance. This helps later automatic code rewriting, so blocking calls can be handled as async calls to a worker thread pool to avoid blocking a fiber. (Such calls park instead of blocking.)

The idea is to have multiple independent runtimes in one host operating system process, so there is no global memory manager unless glue code for the environment does that (whatever mechanism gets memory from native interfaces). There is at least one instance of environment E in the OS process, which has function pointers for every external interface supported. For memory management purposes, it must have a memalign() function pointer, to allocate large blocks of aligned memory. (This can be simulated with malloc() by allocating very big blocks, then chopping off misaligned start and end parts, using those for other purposes.)
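
A minimal C sketch of what that could look like; rt_env and its member names are my own placeholders, not a settled API:

    #include <stddef.h>

    /* Every call out of a runtime goes through function pointers like these. */
    typedef struct rt_env rt_env;
    struct rt_env {
        /* allocate 'size' bytes aligned to 'align' (both powers of two) */
        void *(*memalign)(rt_env *env, size_t align, size_t size);
        /* give aligned blocks back to the host glue code */
        void  (*release)(rt_env *env, void *block, size_t size);
        /* ... file, socket, clock, and worker-pool hooks would follow ... */
        void  *host_state;  /* opaque slot for whatever the glue code needs */
    };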

Each runtime instance has at least one vat, acting as the heap, which allocates only aligned memory blocks. For example, each thread would have its own dedicated vat, which is single-threaded, so no races exist between threads. (All dataflow between threads is copied from one vat to another, mediated by thread-safe queues in the runtime.) Each vat allocates only large uniform blocks from the env, perhaps several at once for efficiency, where every block is 1MiB in size, or 2^20 bytes, and aligned to a 1MiB address. (The standard block size could be another power of two instead of 1MiB, but I will speak as if it could never change.) Since there will be fragmentation, this design only makes sense when allocating a lot of total memory, on the order of gigabytes. So you would not want this runtime on a system with limited resources.

Each vat describes its address space as a book, a set of 1MiB pages, where each 2^20-aligned block of 2^20 bytes is called a bp for book page -- or just page for short when book is understood. The main purpose of each bp is usually sub-allocation of smaller aligned memory blocks to satisfy vat allocation requests, but a garbage-collected vat would use its book any way it liked (an address space with page granularity). A vat used for C-style allocation would return blocks that are powers of two in size, each aligned to its own power-of-two size, as small as 16 bytes and as large as, say, 256KiB. You can cheaply test any address for membership in a vat by seeing if the aligned bp address is in the book hashmap of bp addresses. Each allocated aligned block is called a rod, short for refcounted, because each one is reference counted by metainfo at the start of its book page.
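
A sketch of that membership test, assuming the book is a hash set of page addresses (vat and book_contains are invented names):

    #include <stdbool.h>
    #include <stdint.h>

    #define BP_SIZE ((uintptr_t)1 << 20)    /* 1MiB book page */
    #define BP_MASK (~(BP_SIZE - 1))

    struct vat;                                            /* defined elsewhere */
    bool book_contains(const struct vat *v, uintptr_t bp); /* O(1) hash lookup */

    /* Round the address down to its 1MiB boundary; if that page address is
       in the vat's book, the pointer belongs to this vat. */
    static bool vat_owns(const struct vat *v, const void *p) {
        return book_contains(v, (uintptr_t)p & BP_MASK);
    }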

Each bp begins with a standard bp header, describing page format and detail common to the page, followed by as many rh rod headers as needed to describe each rod sub-block in the rest of the page. (A garbage-collected vat would have only the bp header for each bp.) Each rod header is 16 bytes in size and 16-byte aligned, with a 32-bit refcount, a 32-bit checksum, and several descriptive fields, including a 16-bit generation number and a pair of 16-bit ids characterizing format and optional allocation context (naming a backtrace for memory usage analysis). Starting from the address of a rod -- some block inside a bp -- getting its refcount touches two cache lines in the general case, one for the bp header and one for the rh. Except for very small rods, this is only one cache line worse than co-locating the refcount inside the block itself. And that extra touch only happens when the refcount is actually needed. (In my model, refcount is NOT linear in aliases; it is only linear in official handles, which other aliases can borrow when their scope is clearly inside the handle scope.)
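
One way those 16 bytes might be laid out; the flags field is my own padding guess, the rest follows the description above:

    #include <stdint.h>

    typedef struct rh {
        uint32_t refcount;    /* linear in handles, not in raw aliases */
        uint32_t checksum;    /* audits frozen (immutable) rods, see below */
        uint16_t generation;  /* must match a handle's copy when a ref is released */
        uint16_t plan_id;     /* format / schema descriptor ID */
        uint16_t trace_id;    /* optional allocation-context (backtrace) ID */
        uint16_t flags;       /* e.g. size class, frozen bit (assumed) */
    } rh;

    _Static_assert(sizeof(rh) == 16, "rod header must stay 16 bytes");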

The purpose of a checksum in the rh for each rod is dynamic support for immutable values. When a rod is frozen to immutability, so it cannot be changed again before deallocation, the checksum must still be unchanged when the last reference goes away. You might only audit this in debug mode, but you can ask a rod if it is immutable, and catch accidental mutation.
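
A sketch of that audit, building on the rh layout above (ROD_FROZEN and rod_crc are invented names):

    #include <assert.h>

    enum { ROD_FROZEN = 1u << 0 };                    /* invented flag bit */
    uint32_t rod_crc(const rh *h, const void *rod);   /* payload checksum, defined elsewhere */

    /* Freezing records the checksum; the final release re-checks it, so
       accidental mutation of an "immutable" rod gets caught in debug builds. */
    void rod_freeze(rh *h, const void *rod) {
        h->checksum = rod_crc(h, rod);
        h->flags   |= ROD_FROZEN;
    }

    void rod_final_release(rh *h, const void *rod) {
        assert(!(h->flags & ROD_FROZEN) || h->checksum == rod_crc(h, rod));
        /* ... hand the block back to the vat's free structures ... */
    }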

Now here's the interesting part: each bp header has a pointer to the parent vat, and to the env. So the cost of those pointers is amortized across all the rods in the same page. In effect, every object allocated from a vat is a subclass of rod, and every rod has a pointer to the allocating vat and to the runtime env interface at zero space cost. It becomes unnecessary to pass vat and env as separate function arguments, since they are always available from any object in a vat. You can also add a handle reference to any sub-object inside a rod, because you can find the associated rh for any address in the page that is part of the rod space. So you can hold references to sub-fields which keep the parent object alive, without any problem, as long as you avoid accidental fragmentation cost (keeping something larger alive only to use a small sub-field).
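
A sketch of how that lookup might work, reusing BP_MASK and rt_env from the earlier sketches (bp_head and its fields are illustrative):

    typedef struct bp_head {
        struct vat *vat;        /* owning vat (heap) */
        rt_env     *env;        /* runtime environment interface */
        /* ... page format, rod size class, count of rod headers, ... */
    } bp_head;

    /* Any address inside a page finds its header by masking, so the vat and
       env ride along with every rod -- and every sub-field -- at no per-rod cost. */
    static inline bp_head *bp_of(const void *p) {
        return (bp_head *)((uintptr_t)p & BP_MASK);
    }

    static inline struct vat *vat_of(const void *p) { return bp_of(p)->vat; }
    static inline rt_env     *env_of(const void *p) { return bp_of(p)->env; }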

What about details related to programming languages?

My overall objective is to write tools converting content into different forms, including source code rewrite transformations. (I tell my sons that most of programming is turning inputs into different outputs.) But that is not much affected by memory management per se, except that the i/o model of embedded languages can stream refcounted immutable rods between agents in a process pipeline. That helps get operating system style features into programming language semantics, where a language can also have a process model.

I expect to write a garbage collected Lisp using a vat as the heap, which allocates new bp instances for stop-and-copy before releasing old instances. The uniform use of standard book pages makes re-use efficient without fragmentation across multiple vats.

For bootstrapping, I probably want a simpler and dumber Lisp first, just to perform operations on S-expressions used to express whatever high level policies are used by the system in configuration, or in manifests describing large scale transformations. This can be done with a primitive AST interpreter over a refcounted tree of nodes allocated by C-oriented code. Graph cycles would be pathological unless manually handled or automatically detected. But it's not worse than other C manual memory schemes.

One field in the rh metainfo for each rod is the ID of a schema descriptor: the plan of any struct inside the rod. Among other things, it would exactly describe the location of embedded handles in the rod, so reference releasing can be automated. The idea is to be able to release all resources when a lightweight process is killed, without requiring as much manual intervention by a user. A secondary purpose is to make releasing large graphs incremental and uniform, managed by a background rod releaser, which incrementally walks a graph using the rod metainfo as a guide.
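
A sketch of what a plan and the releaser's inner step might look like (all names invented; handle is the two-word smart pointer described further down):

    #include <stdint.h>

    struct handle;                            /* two-word smart pointer, see below */
    void handle_release(struct handle *h);    /* drops one reference */

    typedef struct plan {
        uint16_t        id;                   /* matches the plan ID in the rh */
        uint16_t        handle_count;
        const uint16_t *handle_offsets;       /* byte offsets of embedded handles */
    } plan;

    /* Release every reference a rod holds, with no per-type code; a background
       releaser can call this incrementally while walking a large graph. */
    void rod_release_children(void *rod, const plan *p) {
        for (uint16_t i = 0; i < p->handle_count; ++i)
            handle_release((struct handle *)((char *)rod + p->handle_offsets[i]));
    }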

A second purpose in putting a plan ID in the rh for each rod is generic debug printing support: I want to be able to print everything. But manually writing a print method for everything is a pain. I would rather write a standard metainfo description of everything, especially since this can be emitted from a source code analysis tool. Then a single generic printer util can process this standard format guide when walking graphs. (Note: to stop huge graphs from being printed, a field in a plan can be marked as best cited briefly, rather than printed in full, and this also helps prevent following cycles due to back pointers.)

Note I am not describing handles, which are like smart pointers, as they little affect memory management and PL topics. But each rod refcount comes from an alias in a handle, which associates metainfo with the rod pointer, including a copy of the plan ID and generation number, which must match when the ref is released. (That detects failures to keep refcounted objects alive as expected.) Refcounting is not manual; it occurs as a side effect of aliasing via handles. When auditing is enabled, each handle can hold the ID of a backtrace that added a reference, and this finds refcount leaks. On a 64-bit system, the size of a handle is twice the size of a pointer: the metainfo is pointer-sized in total. This requires the library to provide standard collection classes that work with handles too, when ownership of objects is expressed as collection membership. Each handle includes a bitflag field which, among other things, allows a reference to be denoted readonly, even if the rod is mutable, for copy-on-write data structures like btrees that share structure. (This is useful for transient collections expressing "the original collection plus any patches that might occur" without altering the original collection.)
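
For illustration, one way to pack such a two-word handle on a 64-bit system (the field names are mine):

    #include <stdint.h>

    enum { HANDLE_READONLY = 1u << 0 };   /* invented flag bit */

    typedef struct handle {
        void    *rod;          /* address of the refcounted block (or a sub-field) */
        uint16_t plan_id;      /* copy of the rod's plan ID, checked on release */
        uint16_t generation;   /* copy of the rod's generation, checked on release */
        uint16_t flags;        /* e.g. HANDLE_READONLY for copy-on-write sharing */
        uint16_t trace_id;     /* backtrace ID when reference auditing is enabled */
    } handle;                  /* 16 bytes: exactly twice the size of a pointer */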


Random associations

I'm having trouble visualizing what you're building. Is the code available somewhere? In the meantime, I'll talk about some projects I've worked on in the past:

1. I have this old interpreter that is refcounted and built atop a Lisp core.

2. I have this new project for teaching programming that requires manual memory reclamation, but uses refcounts to keep memory management safe and avoid use-after-free errors.

visualizing vs building

The memory manager is unclear? Or target usage is unclear? I assume the latter; my first paragraph was short, because that initial context has little PL relevance. It's sort of off-topic to dwell on what I'm building, unless tied to something about programming languages. I can say more, but it's going to sound the way an elephant feels to a group of blind men (a rope, a snake, a tree trunk, etc). From the POV of my sons it will be a pedagogical tool that also lets them make something actually useful later. I'm only doing it for them, since it really isn't all that interesting.

(I use the word "interesting" a lot. I get bored often, and status has no appeal to me at all -- in fact, I dislike status oriented behavior. So my motives are most often boredom deterrence. I learned to program 40 years ago, and the obsessive phase was a 20 year span ending 15 years ago. But I still design things when killing time, like waiting in doctor offices, trying to fall asleep, or finding nothing good on the tube. But the intricate memory palace to do it well gets harder as I age, since I no longer have near photographic memory.)

Draw a Venn diagram with three circles labeled "PL", "interesting", and "project." They overlap only in part. Most of my project is neither interesting nor PL. The uninteresting parts are alternately obvious, elbow-grease, and hoop-jumping BS. But it's worth spending a year or two making for my sons to use, and they won't be done with their current education paths before then. When there is well-debugged code ready to share, and docs, I will distribute under an MIT or BSD license. And I probably won't talk to people much about it, because that would be boring, and I would choose to be an asshole before boredom.

So what am I building? Several things with a middleware flavor. I like infrastructure. I consider most of what the software industry is doing to be drivel. I have no interest in social media or web apps (read: amplifiers for dumbness). I also care little about user interfaces. Professionally I do backend infrastructure, with a majority of time going toward fixing other people's messes (not fun at all, but everyone has a price). Lately this is multiple daemon stuff involving IPsec in a network appliance. New features and optimization are more fun. I can describe two models: one my sons will see, and another I might use later in my work life. Here's what my sons will see:

They write code for apps that looks like command line tools, which go into a single executable (for a daemon), where tools run in a virtual file system (mostly but not completely in memory). A session saves as a tarball that keeps persistent parts of the memory-based file system. Actual file system paths can be mounted in the virtual file system, for i/o to the operating system. Remote file systems can also be mounted, from other daemons or servers that can simulate file systems. The virtual file system also contains all the source code for the daemon, so it can be browsed and/or transformed for future versions of the daemon (which is a main objective from my POV). The daemon's file system also contains any assets needed for services, like a lightweight process sub-daemon serving http for a browser UI inspecting the daemon. Whatever they want to see appear in a browser view, they learn how to generate as needed in the daemon. Probably they will want to write a game, who knows.

And what do I get out of it? Most of my work involves efficient asynchronous message handling, for as many concurrent sessions as possible, with awkward use cases of reconfiguration on the fly under load, without interrupting connections. A zillion things that should be lightweight processes are not, in code written by (crazy? junior?) people with obviously less than a lifetime of coding experience. It really needs to be rewritten. But that will only happen if there is a shining gem of a tool that makes it easy, because it is an embeddable lightweight process engine written in C, that has no opinion about how the host environment works. As a bonus, it would be nice if a browser-based inspector could see everything inside the embedded engine that makes the daemon work. Because debugging is hell, you know.

A goal of giving my sons a good tool is reasonable, while that of transforming my work life some year is likely more a pipe dream. The former must come first, if only to publish with no strings attached, before someone gets the idea everything I touch is valuable intellectual property. I could get bored too. It's happened before.

I love it!

Rys - your ideas are shockingly similar (at least some of the concepts) to a system I'm building. Fibers and continuations, check. Asynchronous message handling, check. PL and runtime, check. Abstracted environment, check. Overlay file system abstraction, check. The list goes on.

The threaded programming model is fundamentally broken, not because it's bad in itself, but because programmers hate to create millions of threads (one for each discrete piece of work that needs to be done). The fact that only 15-20 years ago, a high-end OS (AIX, Solaris, Linux all included) had a very hard time managing a process with only 1000 threads didn't help.

But when you give up the threaded programming model, I suggest that you also need to give up the threaded memory model. Memory needs to be allocated where it is owned and used, and not visible outside of that domain (with very few exceptions -- I except only immutable data and async proxies).

Anyway, I hope you don't hide too often behind the asshole label, because there's nothing boring about what you're describing, and it's worthy of some interesting conversation.

Peace,

Cameron.

memory allocated where owned and used

Thanks for positive feedback. You might be interested in some of my past LtU posts on fibers, etc, which old-timers here may be tired of seeing. Keeping those ideas completely PL oriented to stay on topic is hard. I would only hide behind an asshole label when I have too many conversations to hold at once.

(I like people singly, and in small groups. But groups large enough to qualify as mobs are both creepy and obtuse, with some unfortunate stunted group intelligence effect. I take a "your tribe means nothing to me" stance, making me an asshole to tribes. Perhaps experiences when internet trolling was new still skew my outlook. Learning to see how you look like a jerk to others isn't fun.)

A sub-topic of threads and threaded programming models is several different things to me, only sometimes related to one another. But I find it fun to talk about, and a PL angle is there too, though perhaps I don't emphasize that angle enough. The PL angle seems most clear when you ask what concurrency looks like to code written in a language, as a model a programmer is using to organize what happens. That's an abstract model layer.

There is also a concrete implementation layer, where a runtime may do things outside an abstract model. The number of concrete threads can be independent of what an abstract model shows. (One extreme case occurs when you are not allowed to spawn native threads by a host environment, so all threads must be simulated with fibers. This case matters for my work, but not so much for a standalone setup I make for my sons.)

I gather the broken threaded model you mention is one where lots of threads try to cooperate in use of lots of shared mutable state. Yes, that case should be minimized because it's hairy and bug prone. Slathering on synchronization mechanisms is apt to make a mess, rather than yielding order. You can reserve it for situations where you cannot think of any other way of getting things done. Usually shared immutable data is the way to go. Generally you want mutable state managed by a single writer, who talks to other actors via messages. Then you can figure out whether the single writer's state machine can misbehave.

Memory needs to be allocated where it is owned and used, and not visible outside of that domain (with very few exceptions -- I except only immutable data and async proxies).

Several reasons to pursue memory locality exist. One is process isolation, to limit interference and complexity. Another is performance, by reducing CPU sync costs when multiple CPUs see memory that one CPU changes. To get maximum physical parallelism opportunity out of logical concurrent organization, you want independent processes to really have nothing to do with one another, until something forces a join in control flow.

So I want different native threads to use different vats (that is: heaps), so they do not affect one another. It also makes memory management faster when there are no locks for alloc and free. And reference counting need not be atomic with attendant CPU sync costs. However, because threads don't share free lists, there is more unused dead space in memory when low water marks are reached in multiple threads.

I used to spend most of my time thinking about what would happen in a fiber runtime, where you can get anything you want to occur. Since focusing on a tool for my sons, I've thought more about a common interface between several contexts, so the same code might run a "program" as 1) a native OS process, 2) a worker using one or more threads from a thread pool, or 3) a subset of fibers in a lightweight process engine hosted by one thread. The main point of (2) is to be a halfway bridge between (1) and (3), as a test case more than a desirable place to stop. In principle you can build a pipeline composed of processes in all three of those contexts, and it would work. (Yes, credit-based flow control would be needed to stop buffers from blowing up in a spot where a consumer cannot keep up with a producer, stalling production until the consumer backlog gets small enough.)

Something unpleasant happens in common API design for multiple sources of events. Say you want to take a simple filter program that consumes input and writes output, blocking (or parking) on either end as needed. Then suppose you want to add a simulation of signals for out-of-band communication, and more generally, a way to react to async events not coordinated with input and output streams. To do that with a single flow of control requires a funky state machine and event polling, so the natural stream filter organization gets goobered up, and no longer looks clean.

Alternatively, you can look at a program as a set of flows where one is the main thread, and others handle out-of-band signaling, which somehow inform the main thread about important things, presumably via shared mutable state which the main thread polls. That would be really inefficient if you devoted auxiliary threads to do nothing but wait to notice something the main thread needs to be told about. (Unless you are using fibers in (3), in which case each auxiliary can be nothing more than a stack frame with a continuation.) This is okay for (1) which is already grossly inefficient. But for (2) you might want to have multiple main threads subscribe to common auxiliary threads that wait for async events on behalf of multiple programs.

That's the kind of crap I am seeing in my mind's eye. Then I imagine how difficult it will be to explain that via good code organization alone. Only good docs seem to address activity not co-located in a single flow of control.

You might wonder how I get there from your remark about allocation where owned and used. System stuff isn't linear to me, it's more like an elephant. Why is there both a trunk and legs? Why the ears? When you finish describing one component, it dovetails into something that seems completely unrelated, except the system needed it.

A sub-topic of threads and

A sub-topic of threads and threaded programming models is several different things to me, only sometimes related to one another. But I find it fun to talk about, and a PL angle is there too, though perhaps I don't emphasize that angle enough. The PL angle seems most clear when you ask what concurrency looks like to code written in a language, as a model a programmer is using to organize what happens. That's an abstract model layer.

If you've worked with cooperative multi-tasking (or more recently, "continuation based") models, as old as Windows 2.x / 3.x and MacOS, or as new as Node.js, you know that threading and tasking come together only at a shot-gun wedding.

The biggest problem is the unwinding necessary to "get back to" work that had to be halted because the next step wasn't ready. So the continuable fibers are trapped somewhere down in the call stack, because what a thread in a cooperative multi-tasking environment is, is a nesting fiber "visitor pattern".

That's where a pure fiber model like you were describing would be dramatically superior. I'd start here for a relatively portable approach to book-end the fiber-creation and arbitrary-continuation-points: https://en.wikipedia.org/wiki/Setcontext

There is also a concrete implementation layer, where a runtime may do things outside an abstract model. The number of concrete threads can be independent of what an abstract model shows. (One extreme case occurs when you are not allowed to spawn native threads by a host environment, so all threads must be simulated with fibers.

Yup. This is exactly the model that I'm focused on, and for some of the same reasons. (Specifically, I need a runtime environment that can host a few tens of thousands of business applications on a single server. Some may get to use a significant number of hardware threads, while others need to constrain themselves to one or two hardware threads at most. The more I/O intensive, the more likely they are to be parked, waiting for a notification that will trigger a continuation.)

We've designed a pretty elegant OO model that allows developers to pick natural points of asynchrony, and coincidentally localizes memory management with the functionality likely to be accessing that memory (for all of the single threaded reasons you cited, plus cache locality).

I find the topic fascinating. I think Go (PL) got some of the basic understanding of the problem right, but missed the chance to produce a model that mapped well to how programmers decompose problems.

degrees of freedom in stack representation allows a re-org

I have worked on cooperative multitasking, but even more with POSIX threads on Linux. Where they meet is indeed awkward. A cooperative fiber scheme runs inside a single thread. To interact with other threads well requires a little hoop jumping; usually I use a thread-safe queue for async requests, to service fiber blocking operations via thread pool. That way fibers merely park, without blocking the host thread, and resume after async reply makes them runnable again.

(A long aside is needed to discuss whether context switch should be eager when you lock a synchronization mechanism, like a mutex. For efficiency, you often want a mutex not to provoke a switch, just to get safety. But some folks tune locks to nearly always context switch, in pursuit of fairness in the abstract. I had a longish argument with an Apple engineer about this during an interview 15 years ago. His position was: of course you have to always context switch as eagerly as possible. If that happens every time a fiber wants to issue an async blocking request, it will waste time in the thread that hosts fibers. So instead, you queue outbound fiber requests, so they can be transferred in batch mode to a thread-safe queue all at once, with N requests moved under one lock. When you are ready to notice the outside world, and perhaps accept new inputs, you batch-transfer outbound traffic too. It's just cost amortization.)
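
A rough sketch of that batch transfer, with every type and function invented for illustration:

    #include <pthread.h>
    #include <stddef.h>

    struct req { struct req *next; /* ... request payload ... */ };
    struct req_list { struct req *head, *tail; };     /* plain, not thread-safe */

    struct safe_queue {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        struct req_list items;
    };

    /* Fibers append to a local list freely (single thread); one lock then
       splices the whole batch onto the shared queue: N requests, one lock. */
    void flush_outbound(struct req_list *pending, struct safe_queue *out) {
        if (pending->head == NULL)
            return;
        pthread_mutex_lock(&out->lock);
        if (out->items.tail) out->items.tail->next = pending->head;
        else                 out->items.head       = pending->head;
        out->items.tail = pending->tail;
        pthread_mutex_unlock(&out->lock);
        pthread_cond_signal(&out->ready);              /* wake a worker thread */
        pending->head = pending->tail = NULL;
    }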

Representation of a fiber's stack is a key part of the design. Often people want to do stack copying, which isn't efficient. The setcontext technique you linked appears not to suck, so that would be good to try. I like heap-based cactus stacks (aka spaghetti stacks), where a fiber's stack is a linked list of aligned frames. But it requires restructuring the C stack, rewriting the original C code into continuation passing style (CPS), with a trampoline to dispatch calls on the cactus stack. The code rewrite is not trivial, though the idea is simple, and trampolines are slower if implemented in C, because they double the number of function calls. So fibers must run slower than straight C by a small factor, depending on density of calls and heap allocation latency. There is also an amortization trick on leaf calls which cannot park, where you keep the original C code and ordinary C stack to go faster in leaves of the execution tree.

(My sons won't understand that, unless I creep up on it, starting from a point that is not a pure fiber runtime, even if the pure fiber model is dramatically superior. If I start with programs running as threads, I can implement the C parsing and CPS rewrite transform that way, operating on the overlay file system, then bootstrap the fiber version of the same code. I can arrange that anything that runs in the thread-centric system has an automatic translation to fiber-centric version. Then they can be directly compared with empirical testing.)

My CPS transform for fibers first requires I identify all calls that can block. So these always park a fiber while the call is done in a worker thread. All those calls have the can-park property. Then I figure out which functions can reach code that calls a can-park method; every function that can reach a can-park call also gets the can-park property by inference. Then every can-park function must be rewritten, as well as every call to a can-park function, to use the trampoline that dispatches using the heap-based cactus stack. The net effect of trampoline usage is that the C stack does not wind, but the cactus stack winds instead, so when you park there is no C stack to undo. When you reach a blocking call, there is no C stack; everything up to that point is on the heap-based cactus stack. Of course you also have to implement a pre-processor to find function calls everywhere. (It helps to control the file system here, to deal with include paths.)

Suppose a can-park function originally has three formal parameters a, b, and c, with three more local variables d, e, and f, plus return value r. The rewrite defines a struct with members a, b, c, d, e, f, and r, as well as other state needed to be a subtype of activation frame, allocated as a rod from the vat. When you "call" that function, what you really do is construct the frame and return to the trampoline so it can dispatch. When you actually reach the CPS version of the original code, it has a single argument of type frame, with all the state for calling conventions and locals.
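
A sketch of the rewritten shape, with invented names (frame, fiber, step) standing in for whatever the real runtime uses:

    struct fiber;                                       /* forward declaration */

    typedef struct frame frame;
    struct frame {
        frame *parent;                                  /* caller's frame on the cactus stack */
        void (*step)(struct fiber *fb, frame *fr);      /* continuation to resume */
        /* refcount, plan ID, etc. live in the rod header, not here */
    };

    /* Rewrite of: int func(int a, int b, int c) { int d, e, f; ...; return r; } */
    typedef struct func_frame {
        frame head;
        int a, b, c;                                    /* formal parameters */
        int d, e, f;                                    /* locals */
        int r;                                          /* return value slot */
    } func_frame;

    struct fiber {
        frame *top;                                     /* current frame, NULL when done */
        int    parked;                                  /* set while waiting on an async reply */
    };

    /* "Calling" func allocates a func_frame (as a rod), links it under 'top', and
       returns to this trampoline; the C stack never winds, the cactus stack does. */
    void trampoline(struct fiber *fb) {
        while (fb->top && !fb->parked)
            fb->top->step(fb, fb->top);                 /* a step may push, pop, or park */
    }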

I sort of like things I hear about Golang, though I could not use it as an engine embedded inside a single thread. I work in a runtime that is already using threads to simulate processes, and there cannot be more threads without breakage. To embed a daemon inside that, running lightweight processes inside, requires a fiber engine using queues for all input and output. When a fiber within wants to make a blocking call, it must be farmed out to the enclosing environment. Go isn't really suited for Russian matryoshka doll nesting architecture inside other C systems, but it wasn't designed for that.

Fibres

If you want to use fibres use Felix. It was designed to run a million fibres in a C like environment (actually it was designed to support 1 fibre per phone call in a major city).

The secret is, you have to use heap allocated frames linked by pointers. Linear stacks aren't an option. Setcontext, even if it was actually supported (which it isn't on most systems), is useless. Wikipedia says: "Citing portability concerns[citation needed], POSIX.1-2004 obsoleted these functions, and in POSIX.1-2008 they were removed, and POSIX Threads recommended."

Go made a total mess. Gothreads are real threads, and that's wrong for a start. Fibres aren't threads; they do not execute concurrently. Secondly, Go allows you to close channels and test if they're closed, which is wrong. Finally, the real killer: Go doesn't dispose of unreachable fibres, despite having a concurrent garbage collector.

What Felix does is *control invert* fibres, this is the only way to do it if you have a C like environment. The result is an object with a resume method and heap allocated frames linked by pointers, called a spaghetti stack. Heap allocation is slower than a linear stack, but the stacks are unbounded yet do not require reserving address space.

In a language which is not based on the C model of the world, things can be done better. The machine stack is an evil device which has subverted the design of quality software. Modern FPL's can allocate heap just as fast as a stack (because they use the same instruction: bump a pointer with a bound check).

control inversion and splitting concurrency into finer parts

I have liked a lot of your Felix descriptions; your pro-fiber material is always very agreeable. I mainly want to expand on your remarks, not because I disagree, but because it's interesting. If I say something you don't like, feel free to say so. I'll save remarks about concurrency for last.

Control inversion may not be clear to some readers. To get fibers to run, you want a C runtime to ask them to run, so C code initiates things. A C runtime needs an embedded fiber runtime (say "FR") which the C runtime constructs, and then stocks with resources (here is how to get memory, here's your budget, here's your API to negotiate with C, here are the primitive routines you can call, here is the population of main entry points for fibers, etc). Then the FR only gets cycles as the C runtime allows. If a C program has a single thread, and never spawns another, the FR only gets time when some FR method is called. The api might be "execute for no longer than about this much time", where time might be logical reductions or some precise or fuzzy measure of wall clock time. Whether i/o across the boundary looks like push or pull depends on choice of organization. You can start a service running as a fiber that then "runs things" when the C program calls the run-a-little-more method. You can start a shell and feed it command lines. Non-blocking socket i/o is especially easy to manage. Other sorts of streams across the FR membrane can be done with in- and outbound queues. But it's fairly easy to run a service in the FR and connect to a port it listens to, and do plain-vanilla connections. The C runtime must poll for i/o readiness, however efficiently you can manage, since the FR should not be in charge.
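
To make the shape of that concrete, here is a hypothetical embedding surface for the FR (reusing rt_env from my earlier sketch; none of these names are settled):

    #include <stddef.h>

    typedef struct fr fr;                    /* opaque fiber runtime */
    typedef void (*fr_entry)(void *arg);     /* a fiber's "main" */

    fr    *fr_create(rt_env *env /*, memory budget, primitive routines, ... */);
    int    fr_spawn(fr *r, fr_entry entry, void *arg);
    /* Run until roughly 'budget' is used (logical reductions or a fuzzy time
       slice), or until every fiber is parked; returns nonzero while work remains. */
    int    fr_run_some(fr *r, unsigned budget);
    void   fr_post(fr *r, const void *msg, size_t len);   /* inbound queue */
    size_t fr_poll(fr *r, void *buf, size_t cap);         /* outbound queue */
    void   fr_destroy(fr *r);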

Anyway, control inversion amounts to: run a subordinate thing and put it in charge, but mediate effects it has on the whole environment you control. As long as it does what you told it to do, you are still in control, but the organization has been folded differently. The demands made on a C runtime are only those you told the fiber runtime to make, as situations arise. It's indirection applied to control flow.

Suppose we want to coin new words with meanings similar to concurrent, but addressing smaller individual issues. We can use prefix "co" as meaning together, reserving "con" and "com" for use only when it sounds better. The verb form of current is "to run" when current means approximately "running" in this context. So corun means "run together". Simultaneous or not? That's the problem with concurrent.

If two fibers are alive at the same time, we can say they colive. If a scheduler manages them together, they coschedule. If they interact, they cowork. If an execution trace interleaves two fibers, they cointerleave. These are aspects of concurrency where simultaneity is irrelevant. We might say coextant entities like fibers colive, coschedule, and cointerleave; whether they cowork depends on interaction. Some folks use concurrent as meaning: coextant, coalive, coscheduled, cointerleaved, etc. This is particularly true when used in opposition to parallel which includes corun as "potentially simultaneous".

If you use concurrent as identical in meaning to parallel, this forces everyone else to be confused, or to fall back on itemizing every other aspect of coextant. Fibers operate and schedule concurrently, but they do not execute concurrently if that implies simultaneously. But it's ambiguous unless spelled out painfully like this (which no one does). Do you agree fibers schedule concurrently?

Concurrent

In Felix, I say when a read and a write of two routines connected by a channel synchronise, one or the other will go first but it is not determined which.

Indeterminacy is the weakest form of concurrency in some sense, because it demands the ordering is total. Either the reader or the writer goes first. Some other coroutine may also run before either.

What we're doing here looks a LOT like threading, but it isn't, because there's no pre-emption permitted.

In Felix, the run time system ALSO supports asynchronous events: asynchronous socket I/O and timers. It does this by STEALING a fibre out of the scheduler and then sneaking it back in when the transfer is complete or the timeout has elapsed. The machinery to do this, whilst integrated with the coroutine scheduling, is technically separate and built "on top" of the implementation.

This asynchronous event processing is not concurrency either. It just means the fibre scheduling is driven by external events: there is still only ONE thread. Control is no longer exchanged entirely synchronously under program control, but there is still no concurrency.

Felix ALSO has pre-emptive threads and channels. Instead of spawn_fthread (a fibre) you just say spawn_pthread (posix thread). Pthreads communicate by pchannels (p=preemptive) whilst fibre communication uses schannels (s=synchronous).

If you try to get fibres in different pthreads talking with schannels all hell breaks loose! The type system does not prevent this. It would be nice to have better typing!

The Felix web server (http://felix-lang.org) uses exactly 2 pthreads: one pthread is a loop using epoll (because its running on a linux box) to manage the socket I/O and one runs the web service. In principle it can run millions of connections at once because there is one FIBRE per connection.

In general there is never a good reason to run more than O(N) threads, where N is the number of cores. This means Chrome and, even worse, Firefox, are utter, unmitigated crap. FF is running 300 threads on my machine. This is because C doesn't have a control exchange instruction, only a subroutine call. So programmers use pthreads to do control inversion instead of coroutines.

Of course dual to my approach of starting with purely sequential programs and lifting to concurrent execution by compiler analysis as an optimisation, you can do the opposite: start with many threads and try to partition the threads into groups of coroutines.

I don't think anyone really has any idea how to do *either* of these things, and IMHO that's because mathematicians have been so focused on spatial (functional) programming to the detriment of temporal (imperative) programming that there is no approximation (type) for control. We need a system of control types.

green versus non-green concurrency

I like many details in your post. We use different definitions for concurrency, though. Mine resembles the Wikipedia page on Concurrency_(computer_science), applying to decomposition of programs and algorithms (etc) into partially ordered components (which can be context switched either cooperatively or preemptively). Your definition seems to require preemption or parallelism or both. In my view, green threads are concurrent, even if only one thread exists. Here green thread basically means the same thing as fiber: a cooperative user space mechanism that context switches without preemption.

For clarity, when I use adjective green, I mean the sort of behavior you see with fibers: cooperative partially-ordered user context switching. Then we can discard concurrent as ambiguous. When I use adjective red, I mean what you see with native threads: pre-emptive partially-ordered native context switching. Color choice implies degree of possible conflict, with quite a bit more hazard in a red environment. Other useful colors include brown, where neither red nor green switching happens in a single thread, and plaid, where everything happens with little restriction.

You can write "(b (g))" to depict a green yard embedded in a brown yard. Here yard means a domain, environment, module, or runtime system following a set of rules and conventions; this is clearly a vague spatial metaphor. Many industrial environments have yards, like train yards and bus yards, which can have sub-yards within, like switchyards. Re-using this metaphor in code does not seem confusing. You can write "(p (r (g)(g)))" to mean two green yards embedded in one red yard, within a containing plaid yard. If you wanted to end up with a (b (g)) model, it may be easier to start with a (p (r)) model debugged using threads, before migrating first to (p (r (g)) and then (p (g)). Note (b (r)) is illegal since brown forbids threads within. Comparing those architectures, and designing them to be similar, is difficult without using a term like yard to name the sort of context involved. In particular, you want API for red and green yards to be very similar, to ease porting of code from one to the other.

Each yard is much more than just a set of threads or fibers. You also have state describing and managing resources, like an OS kernel -- stuff lasting longer than the lifetimes of executable entities. You might want a green file system (fs yard) to be a ported red file system, which of course only works safely inside a green yard when not thread-safe as a result. You can manage socket connections between yards though, to mount one yard's file system in another yard's FS, and this will be safe if an async protocol prevents shared state mutation. To hide information, you don't want a yard to know how things work outside it: outside the membrane is just Env, the environment. So a green yard does not know if its host is red, plaid, or brown.

Felix ALSO has pre-emptive threads and channels.

Having both sounds good, for situations when using threads is not a problem. I am inclined to use a (p (r (g))) model for a "front yard" command line app that talks to other "back yard" apps like daemons and embedded modules inside other products. For example, after launching a (p (r (g))) daemon to run in the background, you can talk to it with a browser, or with other instances of (p (r (g))) tools to access daemon state using virtually the same code as that in the daemon. You could also talk to a (b (g)) system that cannot spawn threads itself.

Note my model doesn't have a channel concept, unless this is implemented by a user. It's all just things that look like i/o ports -- files and sockets, etc -- with sync provided by either red or green flavor mutex and cond vars.

If you try to get fibres in different pthreads talking with schannels all hell breaks loose!

Same thing with mutex and cond in red and green yards, since cross yard sync doesn't work. Only env-yard sync works, using sync of the yard's color. One green yard can talk to another via async ports only, but not using green sync primitives. If you spawn a bunch of things in one yard, you know they can coordinate directly with sync primitives. But if someone connects to you, they can be in another yard in the same process, or in another process entirely.

In general there is never a good reason to run more than O(N) threads, where N is the number of cores.

This is where a brown yard comes from in my world. There are N cores, so there are N brown apps in a process, one per thread. If you need to embed, it can only be a green yard. The case for red yards is basically big, sloppy prototyping, with code very nearly like what happens with plain vanilla single threaded programming. Using threads is the first step up from just using multiple OS processes, so it's maybe good for folks learning concurrency. I also like the prospect of moving a fiber to another process where it can be debugged in isolation. This might be simpler if getting the control flow farther and farther away is easy to do.

Browsers tend to use threads instead of fibers, as far as I know, because of pragmatism and peer pressure applied by people just smart enough to follow all rules correctly. Being "right" is mastering the same boring rules everyone else knows. Also, there's all these big piles of libraries around, doing everything the old way. Going against their grain is "wrong" because this provably makes you too dumb to follow the rules, according to people who never make up their own rules. (Variance is nearly always assumed mistake due to incompetence rather than creation due to cleverness.) The penalty is large and the reward small for standing out via exploration.

We need a system of control types.

I like enough typing to catch accidental errors with high probability, but not so much that my mind must shift into a different abstract space to process things. I don't expect types to feature much in what I teach my sons about programming.

N:M concurrency

I think I am following along. What I wanted to ask is how the N:M model fits into your naming. For clarity, I mean something like what Erlang implements, where you have things that look like green threads, but they are run on an underlying thread pool rather than a single thread, so blocked green threads can be resumed by any CPU thread (work stealing?).

It sounds a bit like r(b), so the host environment is native threads (red), and the application environment is something like brown. However, your description above makes brown sound less restrictive, as you imply both red and green switching can occur. Do we need something between brown and green, to represent user space scheduling of "green" threads onto multiple red threads with work stealing?

options in N:M models

(Another way to imply embedding is with a path, so p/r/g means (p (r (g))), except one path is not good at showing multiple children at some point in a path. Path components to the right are assumed under the scope of those to the left, which act like directories.)

Brown is most restrictive: nothing that looks overtly like a thread or fiber. (Think brownout: underpowered.) It would only perform well under high volume heterogeneous i/o if it had non-blocking i/o and some organization to process flows with a flow scheduler. A menagerie of ad hoc state machines would work. It might also simulate entities that look like fibers if you squint and ignore some things, such as not having stacks, and only representing continuations with manual callbacks. Basically a soup of manual multiplexing.

Plaid is least restrictive: all colors mixed together in varying intensity. This is probably a top level OS process that doesn't deal with much else beyond embedding yards and servicing env requests. So you are free to do whatever you want.

Red implies there are threads where the number can vary as needed, probably with some thread pool organization, and that coordination uses locks and cond vars (either widely through code, or narrowly in thread-safe queues).

Green implies fibers can be seen, but threads cannot be seen, and no native mutexes or cond vars are used, unless such are secreted within glue methods defined by the Env provided to a green yard. If the env is red, a green yard is probably hosted within a single thread; the same goes for a plaid env, but organization can be really ad hoc. If the env is brown, it probably uses a lot of queues to process inbound and outbound green yard traffic. A green yard would look something like a container running lightweight processes, if the model aimed to run arbitrary programs.

An N:M model is a higher level construct you would have to compose from lower level mechanisms. My model doesn't aim to package a product with a specific organization. It's more a pile of tools in source code you cherry pick to suit a plan you invent from whole cloth. This could be painful, like starting with a sheet of blank paper when you start to write. (In essence, constraints are often freeing, because the number of things you can do is fewer, letting you focus on just available options. Being able to do anything is overwhelming, especially without old projects to clone, to provide a starting template. So you start by playing with sample code.)

However, if you wanted an N:M model, you might plan something that looks like load balancing, with some number of native threads hosting green yards. I suppose it depends a lot on your app's semantic model. A lot of ways to organize it exist. You could spawn things manually from scripts, from servers listening to ports, or from supervisors with the job of deciding how load ought to be arranged. Probably someone would prototype one of these, and then evangelize it as the right solution everyone should learn as a new rule. (Because people are just like that.)

I had trouble parsing your second paragraph, so I wrote more generally about details above. If you implemented something that involved work stealing, you would need a red yard and a green system that can migrate (but I don't plan on migrating fibers myself). Migration would be semantically more complex than my green model. It would only make sense to migrate a fiber if it was CPU bound and preventing other fibers from getting enough cycles. I would just start abusive CPU bound fibers in a different green yard than fibers that are i/o bound, thus mostly sitting around parked until i/o happens.

Brown?

Maybe if i describe the scheduler better, it might make mapping to your terminology easier. Let's make the numbers concrete to avoid abstraction:

Let's say there is an 8 core CPU, so we would create a thread pool of 8 operating system threads. Let's say there are 100 asynchronous processes that need scheduling, 50 blocked and 50 ready to run. The ready-to-run list might be a priority queue. When an unblocking event arrives, the task is moved from the blocked list to the ready-to-run queue. When one of the 8 running threads gets to a blocking operation, the current task on that thread is parked on the blocked list, and that thread picks the highest priority ready-to-run task from the ready-to-run queue and resumes it.
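
In rough C, the worker loop I have in mind would look something like this (every name is made up for illustration):

    struct task;                                        /* heap-allocated frames inside */
    struct sched;                                       /* ready queue plus blocked list */

    struct task *ready_pop_highest(struct sched *s);    /* waits if nothing is ready */
    void         ready_push(struct sched *s, struct task *t);
    void         blocked_add(struct sched *s, struct task *t);
    void         blocked_remove(struct sched *s, struct task *t);
    int          resume_until_block(struct task *t);    /* 0 = finished, 1 = blocked */
    void         task_free(struct task *t);

    /* Each of the 8 pool threads runs this loop. */
    void worker(struct sched *s) {
        for (;;) {
            struct task *t = ready_pop_highest(s);
            if (resume_until_block(t))
                blocked_add(s, t);                      /* parked until its event arrives */
            else
                task_free(t);
        }
    }

    /* An unblocking event moves the task back; any of the 8 threads may resume it. */
    void on_event(struct sched *s, struct task *t) {
        blocked_remove(s, t);
        ready_push(s, t);
    }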

This means that a task may have the underlying thread change on any blocking operation, so we either have to swap stacks, or store the task stack in the heap, allocating frames as we go (my preference is for heap-allocating frames for now, if the language is garbage collected). To avoid locks there would be no shared resources; instead all inter-task communication would be over channels, although there would be efficient zero-copy sending of resources between tasks by transfer of ownership.

I don't want to evangelise this as the "right" model; it's just a model that is interesting to me for my own language work. I was really wondering how to describe this using your model. Are my tasks brown threads? Since they seem the most restrictive, could they be implemented over red, green or plaid threads provided by the platform? How would you write this, maybe *(b) as we have brown threads running on any thread system provided by the platform?

cross yard semantics

     "Why are you shaking your head?" Dex asked.
     "Because," Wil explained, "he missed the part in the first post about a vat not being thread safe, in general."
     "Go on," Dex said. "What significance would that have?"
     "Okay, first let me recap." Wil sighed impatiently, then resumed. "Suppose you design fibers aiming to never make blocking calls, since this also blocks every other fiber in the same green yard. You also want other mechanisms in a green yard itself to make the fewest blocking calls possible as well, for the same reason. But the env can force a green yard to block, by providing API for some things that use native mutex locks, or simply make blocking system calls. If I use a vat as a green yard's heap, and only fibers inside that one green yard can allocate and free, then it does not need to be thread-safe. It will run faster. And it will fragment memory more, as I already said. Memory allocated and used by one green yard cannot be touched in another green yard, since they might be in different native threads, and they would corrupt each other without any way to synchronize. Now, suppose a fiber allocates all stack activation frames in a vat, which is not thread-safe. What would happen if you tried to run that fiber in a different native thread?"
     "You would be completely screwed?" Dex guessed.
     "Yes, completely screwed," Wil nodded. "Say a fiber Fa was running in green yard Ya, in thread Ta, using vat Va. Then you try to run this fiber in thread Tb instead. Ignore the fact green yard Yb inside Tb has no way to run a fiber from another yard; pretend it is possible. The continuation in the fiber resumes, reaches end of function, and returns, which pops the stack and deallocates the frame. The frame's vat is Va, and thread Tb tries to free the frame in Va, which has no mutex because it is not thread-safe. Meanwhile, thread Ta is still running other fibers and using vat Va concurrently. What happens when threads Tb and Ta both try to update shared mutable state without sync?"
     "It corrupts vat Va," Dex deduced.
     "Yes, it corrupts the vat," Wil confirmed, "unless you get lucky. In general, everything allocated by a fiber came from the vat in its green yard. So you can't do anything with that fiber in a different native thread, except talk to it with async messages. Exceptions involve resources allocated by the env, not by a green yard's vat. If you kill a fiber, the env must free any resource it held, which means an env must know all resources held by a yard outside the yard."
     "Well, it sounds like not making vats thread-safe is a mistake," Dex said with a smile. "You should let programmers do whatever they want, and if this corrupts the heap it's your fault. It's like you don't know how to do threading correctly."
     "Don't be an ass," Wil dismissed. "Note you do want a thread-safe vat if a red yard uses the same general vat design. If you allocate resources in a red yard used by multiple threads, with lifetime spanning thread lifetimes in funny ways, then the heap interface must be thread-safe. You could use nearly the same code as a green vat, but you need mutex around parts subject to use in critical sections. Irritatingly, this includes refcounts when more than one thread can hold a reference to the same object."
     "You can use atomic operations on the refcounts," Dex suggested. "Better than locking a mutex. I guess you knew that."
     "Anyway," Wil explained, "red vats are thread-safe, but not green vats if you want best performance. A thread-safe green vat would behave properly, but force a green yard to run slower and block more often, when the goal was to block as little as possible."
     "What happens when a green vat runs out of memory?" Dex wondered. "Does it ask the env for more? And does that block?"
     "In general, yes, but it depends on edge conditions," Wil noted. "You might configure a green yard to kill fibers on OOM conditions once a certain threshold of memory use is hit. Or a green vat can get more large aligned memory from the env, which might block. If nowhere near the maximum memory budget yet for a green yard, an env might preallocate space ahead of time so this happens less often."
     "Say, if a fiber can run only inside its green yard, where is the run queue?" Dex asked.
     "In the green yard, which is its own little closed world," Wil replied. "There is no run list spanning green yards; each is the kernel for things it runs. In red yards using native threads, everything is managed by the operating system as you would expect, except for simulations of services like file systems."
     "Does it make sense for a thread to be brown?" Dex wondered.
     "No, threads are red, fibers are green, and yards have color depending on the extent those are, or can be, used," Wil said. "A brown yard is something like a program starting from main() that never spawns a thread, and doesn't believe other threads beyond the main thread exist, and refuses to allow other threads to be spawned on principle. Further, a brown yard does not roll its own fiber system. If you embed a green yard inside a brown yard, the fact fibers exist within may be invisible to the brown yard, which merely sees an asynchronous interface."

green vs red

Irrespective of prejudices of computer terminology, *concurrent* means "at the same time". Literally.

Fibres specifically never run "at the same time" and so are not in any way shape or form concurrent.

Asynchronous has a different meaning. In Felix, fibres exchange control with continuation passing mediated by channels. The critical thing is that this is plain old ordinary sequential programming. The core difference to procedural and functional code is that instead of using *subroutines* we use *coroutines*.

Subroutines are slaves that return control. If they're functions they also return a value. So they have an entry and exit event, an entry time and place and an exit time and place.

Coroutines are temporally persistent. They start at the big bang and die only at the big crunch. However they do NOT run concurrently with other coroutines: they suspend and resume. Suspension is explicit and only the currently executing coroutine can suspend itself. In Felix this is done by reading or writing on a channel. In turn, this causes the resumption of some other coroutine.

So this form of exchange on a channel is *synchronous*, exactly like a subroutine call; it's just a peer-to-peer control exchange rather than a master/slave one.

The reason for using channels is that it allows coroutines to be dynamically connected in much the same way as higher order functions work in functional programming languages.

Asynchronous operations involving external events added to this system do NOT add concurrency in any way shape or form. On the contrary, there are two effects: first, the order of scheduling becomes more complex and dependent on external events, much like a promise; second, the system may idle whilst waiting for an external event.

In Felix, the response to external events is NEVER reactive. Instead coroutines request the event and simply wait until it arrives. You can request a timeout, and then you suspend until the time is up, then resume. You can request data from a socket, and you suspend until the data is read. In Felix this is done with ORDINARY SUBROUTINE CALLS. It's *exactly* the same as reading data from a synchronous socket in C, except it doesn't block fibres other than the caller.

So again the suspension is a context switch entirely controlled by the single running coroutine. There is never more than one running fibre (per pthread).

It's sequential programming. There is NO concurrency. There is indeterminism, but that exists in almost every sequential language. For example, in C and most languages, the order of evaluating arguments is indeterminate. So this is nothing new. And it has nothing to do with concurrency.

The point is that fibration by coroutines is intended to be the imperative dual of functional programming. It deals with event sequencing instead of spatial layout, as a consequence of duality.

Concurrency .. things happening simultaneously .. is a whole different ball game.

A good example to look at is Unix signals. They're asynchronous. They're NOT CONCURRENT. A signal handler suspends the underlying thread of control pre-emptively, but it does not run concurrently.

If you have lots of threads, many will not run concurrently. However, an abstraction layer prevents you from knowing which threads are concurrent, so you have to treat them as if they were. A perfectly good example of the impact of this is the idea of a critical section. Critical sections are rubbish! The idea is completely bogus. It was just an interrupt mask which prevents pre-emption. But that only works if you have multi-threading on a single processor. You have to use real locks if you have a multi-processor.

To make the distinction clear, if you look at the Felix webserver, there is ONE thread doing the web service, but there is one fibre PER CONNECTION. However the socket I/O is synchronous with respect to the fibres, but asynchronous with respect to the webserver user pthread.

How can this be? There is a second pthread running *concurrently* which monitors all asynchronous events using epoll and responds to these events by performing asynchronous socket reads and writes. When the requested transfer is completed, the fibre waiting on completion is made active again (active means it is made available for scheduling). So there is definitely concurrency happening here. The two pthreads could run on two distinct cores.

re: concurrent

Irrespective of prejudices of computer terminology, *concurrent* means "at the same time". Literally.
Fibres specifically never run "at the same time" and so are not in any way shape or form concurrent.

Concurrent doesn't mean "run" at the same time. Actually, most natural language definitions refer to "existing or happening" at the same time. And fibres certainly exist at the same time. They can even be said to happen at the same time, given their active life cycles overlap and you cannot explain the behavior of a fibre without examining other fibres that exist "at the same time" in the system. Merriam-Webster also has "acting in conjunction", and your fibres certainly do that much.

In any case, nobody will hear your message if they feel the need to correct your terminology, or your discussions devolve to arguments about terminology. So you should consider accepting the jargon of computing rather than pointing at natural language definitions to insist your private language is 'right'.

re: indeterminacy

Indeterminacy is the weakest form of concurrency

There are concurrency models that are deterministic (such as multi-threading with linear single-assignment variables), and there are non-deterministic computations that are not concurrent (such as an NDTM). The qualities of indeterminacy and concurrency are distinct and should not be conflated.

It's similarly useful to separate the concepts of concurrency and parallelism in computing. Concurrency (logically happening at the same time) does tend to offer opportunities for parallelism (physically computing at the same time), limited by synchronization. But many non-concurrent models also offer great opportunities for parallelism (such as linear algebra, parallel beta reduction, pipeline processing). In games, simulations, high-performance computing, we can frequently achieve orders of magnitude more parallelism from the non-concurrent models.

mathematicians have been so focused on spatial (functional) programming to the detriment of temporal (imperative) programming that there is no approximation (type) for control

Mathematicians have studied and developed many temporal models (e.g. synchronous reactive programming, functional reactive programming, cellular automaton, temporal logics, and many more). There has been corresponding study of control types (cf. session types, arrows in computing, Hoare logic).

I don't think anyone really has any idea how to do *either* of these things

I think the problem is the opposite. A lot of people have 'ideas', but it's difficult to judge their quality. Learning how to do things "better" is a risk - it takes time, money, might not be "better" enough in the right ways to pay for itself, and it won't help with maintaining a million lines of legacy code anyway. Good ideas are lost in a sea of dubious ones. Industry isn't easily going to listen without a clear message from giants they trust (like Oracle or Google or Microsoft).

Your ideas with fibres most remind me of flow-based programming.

The main difference from FBP is that you've limited yourself to size-zero buffers. In my opinion, that's a dubious feature. Size-zero buffers are formally equivalent to any fixed-size buffer because an identity process (e.g. a fibre that loops read and write without modifying the input) acts as a size-one buffer and we can chain any number of identity processes. If you can dynamically construct new fibres, it's even equivalent to allowing arbitrary-sized buffers.

Consider that: your idea with fibres sounds like a dubious variation of a forty-year-old FBP model which, despite being pushed and refined for forty years, has not achieved wide adoption or even recognition by the industry. If even experts like you can't be bothered to study the many dozens of concurrency models and competing 'ideas' (and many more minor variations), how far can you expect your message to be heard when you add one more?

Fibres.

I fully agree with your comment on first class support for fibres, which can then be a building block for lightweight threads. Raw fibre support is so useful in and of itself, particularly for nested state machines.

I recently stumbled upon this paper: Resumable expressions (pdf), which goes a step lower, towards stackless coroutines.

parking

Suppose a can-park function originally has three formal parameters a, b, and c, with three more local variables d, e, and f, plus return value r. The rewrite defines a struct with members a, b, c, d, e, f, and r, as well as other state needed to be a subtype of activation frame, allocated as a rod from the vat. When you "call" that function, what you really do is construct the frame and return to the trampoline so it can dispatch. When you actually reach the CPS version of the original code, it has a single argument of type frame, with all the state for calling conventions and locals.
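
     To make that concrete, here is a rough C sketch under those assumptions. Every name in it (frame, step, vat_alloc_rod, f_cps, and so on) is invented here for illustration; the real rewrite and runtime interfaces will differ.

#include <stddef.h>

typedef struct vat vat;                        /* opaque heap, per the design above */
void *vat_alloc_rod(vat *v, size_t n);         /* hypothetical rod allocation */

typedef struct frame frame;
typedef enum { STEP_DONE, STEP_PARKED } step;

/* One activation frame for the rewritten function, allocated as a rod. */
struct frame {
    step  (*run)(frame *self);   /* continuation entry point for the trampoline */
    frame  *caller;              /* whose frame to resume when r is ready */
    int     pc;                  /* which resume point inside the body we are at */
    int     a, b, c;             /* former parameters */
    int     d, e, f;             /* former locals */
    int     r;                   /* return value slot */
};

/* The CPS body: a switch over resume points instead of a C stack. */
static step f_cps(frame *self) {
    switch (self->pc) {
    case 0:
        self->d = self->a + self->b;   /* ...original straight-line code... */
        self->pc = 1;
        return STEP_PARKED;            /* a can-park call happened here */
    case 1:
        self->r = self->d * self->c;   /* resumed after the wait */
        return STEP_DONE;
    }
    return STEP_DONE;
}

/* "Calling" the function really means building its frame; the trampoline
   later dispatches fr->run(fr) until it reports done or parked. */
frame *f_call(vat *v, frame *caller, int a, int b, int c) {
    frame *fr = vat_alloc_rod(v, sizeof *fr);
    fr->run = f_cps; fr->caller = caller; fr->pc = 0;
    fr->a = a; fr->b = b; fr->c = c;
    return fr;
}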

The point is, you have to have a programming model that makes it obvious where the spots are that a fiber has to wait, and what the conditions are that will make the fiber be runnable again.

I use the "future" concept. (AKA promise)

So when one calls a method that might be async, its response is assumed to be a future. That means that the dispatch could be sync or async (assume that the decision is statistical). If async, then the fiber registers its continuation predicate (what future it's waiting on), saves off a continuation point, and restores the context from its entry (i.e. jumps back to the scheduler).

In reality it's a lot more complicated due to critical sections, timeouts, and escalating settings for re-entrancy. But the basic approach remains.

waits and critical sections

     Implicit and explicit waits can be done differently. When you write in a style that looks synchronous, waiting is implicit when it happens. It can be a side effect of a blocking call, requiring no explicit code annotation. But for debugging it would be nice to show where implicit things occur, in some view of the code. When you plan to statically analyze all the code, and rewrite lots of it, generating indexes and alternative views might as well happen too. (Show me everywhere this field is read or written, but not everywhere other fields with the same name in other structs are read or written.)

     For explicit waits, I prefer condition variables, where each has an associated mutex of course. Flavors include wait for signal, broadcast, or timeout. With native threads, you have to use native condition variables and mutexes. But with fibers, both cond vars and mutexes are cooperative mechanisms in the user space runtime. With no pre-emptive context switch in fibers, these are really easy to write, because a context switch can only happen on return to the trampoline. If you don't do that, the code is uninterruptible. Thus every critical section that makes no calls other than primitive ones (that cannot park) is already protected without a lock. This makes cond vars and mutexes simple data structures, except the semantics can get complex when you kill lightweight processes. (Killing a process holding a mutex seems to require killing every other fiber waiting on the mutex, because there is no way to honor critical section semantics. So calling numerous can-park methods in a critical section has more risk.)
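
     As an illustration of how simple those structures can be, here is a C sketch of a cooperative mutex and cond var. All names (fiber, sched_park, sched_make_runnable, fmutex, fcond) are invented here, and in the CPS rewrite "park" would really mean saving a resume point and returning to the trampoline rather than making a nested call.

typedef struct fiber fiber;
struct fiber { fiber *qnext; /* plus continuation frames and other state */ };

/* Hypothetical scheduler hooks. */
void sched_park(fiber *self);           /* stop running until made runnable */
void sched_make_runnable(fiber *f);     /* put back on the green yard's run queue */

typedef struct waitq { fiber *head, *tail; } waitq;

static void waitq_push(waitq *q, fiber *f) {
    f->qnext = NULL;
    if (q->tail) q->tail->qnext = f; else q->head = f;
    q->tail = f;
}
static fiber *waitq_pop(waitq *q) {
    fiber *f = q->head;
    if (f) { q->head = f->qnext; if (!q->head) q->tail = NULL; }
    return f;
}

/* No native lock needed anywhere below: these bodies run between trampoline
   dispatches, so nothing else in the same green vat can interleave with them. */
typedef struct fmutex { fiber *owner; waitq waiters; } fmutex;
typedef struct fcond  { waitq waiters; } fcond;

void fmutex_lock(fmutex *m, fiber *self) {
    if (m->owner == NULL) { m->owner = self; return; }
    waitq_push(&m->waiters, self);
    sched_park(self);                   /* resumed already holding the mutex (see unlock) */
}
void fmutex_unlock(fmutex *m) {
    fiber *next = waitq_pop(&m->waiters);
    m->owner = next;                    /* direct hand-off to the next waiter */
    if (next) sched_make_runnable(next);
}
void fcond_wait(fcond *c, fmutex *m, fiber *self) {
    waitq_push(&c->waiters, self);
    fmutex_unlock(m);
    sched_park(self);                   /* woken by signal or broadcast */
    fmutex_lock(m, self);               /* reacquire before returning */
}
void fcond_signal(fcond *c) {
    fiber *f = waitq_pop(&c->waiters);
    if (f) sched_make_runnable(f);
}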

     So program logic would use cond vars. But the fiber runtime can do things more directly, often in ways that look like a simulated cond var. For example, when parked until a blocking call gets an async return from a thread pool, the runtime knows a fiber is waiting on an implicit condition of "async reply happened". So it just makes the fiber runnable after the reply occurs. If you want a blocking call to support timeouts, that can be built right into the API, perhaps with an EAGAIN result and a code indicating the reason was a timeout. (However, a thread-centric version of the code would have to use native calls, perhaps without a timeout as part of the API, so you would have to either avoid this, or rely on worker thread pools in that case too, even though it was fine for the original thread to block itself.)
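
     One hypothetical shape for such a call, from the fiber's point of view, with every name below invented for illustration:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { WHY_NONE, WHY_TIMEOUT, WHY_CLOSED } why;

typedef struct io_result {
    long n;          /* bytes read, or -1 on error */
    int  err;        /* 0 or an errno value such as EAGAIN */
    why  reason;     /* for EAGAIN: timed out vs. peer closed, etc. */
} io_result;

/* Can-park read: parks the calling fiber, a worker thread does the native
   read, and the runtime makes the fiber runnable when the reply or the
   timeout arrives. The call site still reads like synchronous code. */
io_result fio_read(int fd, void *buf, size_t len, uint32_t timeout_ms);

void example(int fd, char *buf) {
    io_result r = fio_read(fd, buf, 512, 1000 /* ms */);
    if (r.err == EAGAIN && r.reason == WHY_TIMEOUT) {
        /* retry, give up, or report upward */
    }
}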

     I remember liking futures in the early 90's, say 1991. But I guess I never thought of anything complex I could do with them, except starting multiple async operations before a control flow join. (I am avoiding remarks about the project involved then, which isn't on topic.) I did notice the idea became popular in JS again.

So when one calls a method that might be async, its response is assumed to be a future.

Ah, when program logic calls an API that is explicitly async (but which might have immediate sync result anyway), I find this hard to organize unless the program is also explicitly more than one fiber in a related group. (I would use different terms if talking about a group of threads, if not using a fiber runtime.) I have a word for the group which no one likes, so let's say process and only stick "lightweight" in front when needed to disambiguate from native OS process. I call the fibers in one lightweight process "jobs", where each job simulates a thread in a native process. Each process has identity, but jobs only weakly so, local to a process. So an async reply comes back to a process, not to a particular job. At least one job sits waiting on the arrival of async messages to a process, and simulated signals too. Another main job simulates a main thread, which organizes the process's idea of getting things done and managing data structures. The main job only waits on a cond var when it has nothing else to do but await async replies. The auxiliary job waiting on messages will signal cond vars as needed when replies arrive. In this context, it isn't necessary to register continuations, because each job already has a continuation, which it runs whenever scheduled. That way coders can mostly ignore the idea of continuations.

When you would think about control flow forking because of async operations, you can spawn a job or a process, both of which allocate very little space when stacks are heap based frame lists. Or you can use the organization of main and auxiliary jobs to join the control flow following async replies. (It's a kind of scatter/gather idea, but in the control plane instead of data plane.)
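
     A rough shape of that control-plane scatter/gather, reusing the hypothetical fmutex and fcond from the sketch above; spawn_job, current_fiber, and do_one_request are likewise invented names:

/* Hypothetical runtime hooks, in addition to fmutex/fcond above. */
fiber *current_fiber(void);
void   spawn_job(void (*fn)(void *), void *arg);
void   do_one_request(void *arg);    /* issues one async call, then calls reply_arrived */

typedef struct gather {
    fmutex m;
    fcond  done;
    int    outstanding;              /* async replies still expected */
} gather;

/* Called from the job handling each async reply. */
void reply_arrived(gather *g) {
    fmutex_lock(&g->m, current_fiber());
    if (--g->outstanding == 0) fcond_signal(&g->done);
    fmutex_unlock(&g->m);
}

/* Main job: fork control flow by spawning jobs, then join on the cond var. */
void scatter_and_wait(gather *g, int n) {
    g->outstanding = n;
    for (int i = 0; i < n; i++)
        spawn_job(do_one_request, g);    /* each job ends by calling reply_arrived */
    fmutex_lock(&g->m, current_fiber());
    while (g->outstanding != 0)
        fcond_wait(&g->done, &g->m, current_fiber());
    fmutex_unlock(&g->m);
}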

In reality it's a lot more complicated due to critical sections, timeouts, and escalating settings for re-entrancy.

     Yes, the practical wrinkles in detail never really go away, if the details matter. When you coordinate concurrent things, like fibers, and some state is mutable, then you have critical sections, mutexes, and things like condition variables. Coders still need to understand them, even if scheduling is cooperative. Even if only one fiber can run at a time in a host thread, the order of execution is still interleaved, despite being cooperative. So mutexes are needed anyplace can-park operations may occur in the middle of critical sections. However, fiber peers inside one process can safely reach right into one another's state and update each other, as long as no can-park action is involved. So data management is easier with fibers than with threads. But, if you want the same code to work under both threads and fibers, you need to write all the code as if the less safe thread semantics will be used.
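
     A tiny illustration of that rule, again with invented names, where log_append is assumed to be a can-park call:

void log_append(const char *fmt, ...);   /* hypothetical can-park logging call */

typedef struct stats { fmutex m; int count; } stats;

/* Safe without a lock: no can-park call between the read and the write,
   so no other fiber in this green vat can interleave inside the update. */
void bump_stat(stats *s) {
    s->count += 1;
}

/* Needs the cooperative mutex: log_append can park, so another fiber could
   run between reading s->count and writing it back. */
void bump_stat_logged(stats *s, fiber *self) {
    fmutex_lock(&s->m, self);
    int n = s->count;
    log_append("count was %d", n);
    s->count = n + 1;
    fmutex_unlock(&s->m);
}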

It is not necessary to allow re-entrancy. Frames in a continuation belong to a fiber, and no other fiber will execute continuations in those same frames, unless you subverted the runtime to force that to happen. But why would you? If more than one fiber can see and mutate the same piece of state, that's a critical section, and needs a lock if any of the operations on data can park. The problem one has in Scheme of "what happens if I call this continuation more than once?" doesn't need to come up, unless you are implementing Scheme using the fibers.

Reentrancy

Fibres are not concurrent however they CAN be recursive which requires consideration. In Felix, recursion is supported by cloning closures to ensure the data is virgin. The continuation objects are intrinsically non-reentrant.

The biggest problem I have is that you can actually invoke a dead continuation. It's possible because they're first class (in Felix ALL procedures are continuations).

models and hazards

I'll write a bit more soon about matters not directly related to comments here. (I noticed some of these topics appear interesting to folks, based on discussion elsewhere.)

Fibres are not concurrent however they CAN be recursive which requires consideration.

Informal term definitions vary a little over time. Lately concurrent usually implies logically overlapping at some granularity, whether simultaneous or not. So interleaved context switching that runs at most one fiber at a time is still concurrency, logically. In contrast, parallel usually means physically simultaneous, or potentially so, like when different cores run threads on hardware.

So parallel is a subset of concurrent, where physically simultaneous execution can occur. I don't know a good synonym for concurrent, for use in redefining it in different words, to express logically overlapping. So if concurrent didn't mean that, the idea would get hard to talk about. We could cook up a new word, but people hate that. (Perhaps "con" plus some synonym for "current", or another word about time locality.)

Two fibers in the same thread cannot run at once, but they can context switch back and forth. If a context switch can occur in the middle of a critical section, the issue of concurrency manifests; synchronization is needed for mutual exclusion, even though they cannot run simultaneously, to ensure critical section execution does not overlap across fibers.

Recursion in fibers works as well as in C, as long as calls get new activation frames for nested execution. Code is re-entrant as long as it is not self-modifying. A continuation involves both code and data, and is only trivially re-entrant if the data part is not self-modifying too. With mutable state, re-entrant self-modifying data is hairy and requires coordination. (It seems best not to do it, since separate frames would be less confusing.)

The biggest problem I have is that you can actually invoke a dead continuation. It's possible because they're first class (in Felix ALL procedures are continuations).

If I implement Scheme or Smalltalk with fibers, I get that problem too. In Scheme you are permitted to do it, but in Smalltalk it is usually defined as invalid to call a block that has gone out of scope even though they are first class objects.

There's a difference between what you can do, and what's defined to be supported with rational meaning. (I know you know this, I'm just being rhetorical for others.) In C you can write on any memory you want, starting from a pointer, but this is only valid -- part of the model -- in some cases. Other uses are undefined, ending typically in crashes. When a rewrite of C code adds continuations for CPS, you see objects that look first class, but are not according to the model. You could get away with using them like first class Scheme continuations, if you could avoid the hazards; but junior coders could not navigate this easily. So the model will say "don't do that." With source code in hand, though, someone may be tempted to fork with a new model.

fibres

Fibres are NOT logically concurrent. It's sequential programming with continuation passing. The implication is that no locking is required. There's strictly ONE thread of control which is EXPLICITLY exchanged.

What fibres have is their own stack. In principle, you can consider fibres exchange control by stack swapping.

Although I take your point about definitions changing, the notion of a fibre is very strict: a thread can be made of fibres, and this implies one thread of control.

Contrarily there is a question: can some fibres in a fibrated system be *lifted* to execute physically simultaneously? In other words, under what conditions are the alternatives A then B or B then A equivalent to doing A and B in parallel? Without changing the strictly sequential semantics!

This is where some systems go wrong. They're trying to run large numbers of threads with fast context switching. This usually means either a lightweight context switch in response to a timer, or cooperative yields at implementation-defined points. But these things are real threads. They're not fibres.

Here is a fibre in Felix:

chip sink_to_list[T] (p: &list[T])
  connector io
    pin inp : %<T
{
  while true do
    var x = read (io.inp);
    p <- Cons (x,*p);
  done
}

This reads a stream and creates a list. It is perfectly safe. No locks. No synchronisation issues. It is a crucial component, called a drop. Here's a lift:

chip source_from_list[T] (a:list[T])
  connector io
    pin out: %>T
{
  for y in a perform write (io.out,y);
}

A lift takes you out of the functional (spatial) world into the cofunctional (temporal) world. It translates a list (inductive) to a stream (coinductive). A drop does the dual, taking you from the stream back to a list. The scheduler running the fibres MUST be nested inside a function so that the inactivity of the fibres becomes terminal (so you can get back to the function).

These two components allow model switching for the simplest kind, the sequence. The critical thing is that fibres complement functional code WITHIN the sequential programming model.

If we mess up our definition of fibres to think they're lightweight threads we mess up the semantics. Go for example is semantic nonsense. Its channels can be closed at one end and open-ness tested at the other (Nonsense!). And it fails to garbage collect starved Gothreads.

Real concurrency imposes stricter constraints on code than coroutines do. Coroutines necessarily provide superior throughput. Concurrency allows for superior real time operation at the expense of throughput, due to synchronisation cost.

As I said, the real problem is to write sequential code, and have it automatically take advantage of multiprocessing capabilities by analysing the fibres. Generally pure fibres (unlike drops, for example) can be run concurrently. The analysis is similar in some sense to Haskell doing strictness analysis to find which lazy applications can be lifted to more efficient eager ones.