## symbols and loosely coupled concurrent apps part II

This discussion continues *managing closed worlds of symbols via alpha-renaming in loosely coupled concurrent apps*, which touched a number of different topics: most recently a Lisp sub-thread at the end, in addition to a bunch of stuff I was posting about lightweight user-space process architecture involving threads and fibers. When that discussion gets less useful due to sheer size, I'll continue here when I have anything novel to say, which probably won't be much. You're welcome to continue here anything started there, too.

(A number of things I said in the other discussion might be expected to bother folks who want to see fewer development systems around, with less competition, fewer languages, and narrower choices due to existing "winners" who will guide us to a managed promised-land where we can be tenants for a modest fee. Is that enough hyperbole? I can't be bothered to get worked up about it, but I see many things as subject to change, especially along lines of choice in portable code for app organization not dictated by trending vendors. I expect to mostly ignore politics and advocacy.)

The first thing I want to continue is discussion of performance among options for implementing fibers (single-threaded user-space control flows that multiplex one OS thread) because I like heap-based spaghetti stacks, while others recommend options I think are poor choices, but to each his own.


### fiber context switch cache benchmark

This is a reply to the opening question in Keean Schupke's Variables post yesterday. I don't expect to discuss this further, unless I see something surprising. Correcting misapprehension is not one of my compulsions, and neither is teaching how to do things that seem obvious to me, beyond friendly summary in sketch form. Infinite patience is more of a workplace habit.

> Could you explain how you handle variables across async calls that return to the event loop, so that I can benchmark this against the setjmp method?

Variables are in stack frames, same as always when locals live in stack frames. The only difference is whether frames are in a contiguous stack array or in a discontiguous stack linked list. When a thread suspends by blocking, variables stay where they are on the stack. When a fiber suspends by parking, variables also stay where they are on the stack. You are proposing to copy parts of a stack array, and want to benchmark this against zero-copy stack lists, because you have a hypothesis that copying might be faster.

That answer appears earlier in the other discussion, but there's a lot to read so I'll boil it down. As prerequisite, one should read about stacks, call discipline, and activation frames in plain vanilla old-school implementations of programming languages that use frames, so backtraces are found easily by following frame pointer (fp) list linkages. Old treatises on programming 68K in assembler give fine frame discipline explanations that are easy to read.

Suppose you look at some C code whose stack is contiguous in the address space, with frames going up to main() and beyond, laid end-to-end in contiguous memory, typically linked in a list via frame pointers (which is where a debugger gets a backtrace when you ask). If you rewrite such code to use CPS (continuation passing style), one way to do this instead is via heap-based cactus stacks (aka spaghetti stacks) where each frame is separately allocated, so the stack is a linked list of frames instead of a contiguous array. Some folks make the mistake of assuming this means all calls change this way, but calls near the bottom of a call tree can remain as they were before, using a conventional contiguous C stack that unwinds to the list version upon return to trampoline. (This is not return to an event loop; find out what is meant by trampoline -- it's not complex compared to anything abstract, but I do find writing style of functional language folks using trampolines almost impossible to follow when implementation is not described imperatively in terms of which memory locations change.)
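To make the frame-list picture concrete, here is a minimal sketch in C. The `yf_frame` name and layout are invented for illustration, not from any real library: each activation frame is separately allocated and links to its caller, so the "stack" is a linked list and a backtrace is just a list walk, the same way a debugger follows fp linkages in a contiguous stack.

```c
#include <stddef.h>

/* Hypothetical heap-based cactus stack frame. Locals live in the
 * frame itself, on the heap, rather than on the contiguous C stack. */
typedef struct yf_frame {
    struct yf_frame *caller;  /* fp linkage: next frame down the list */
    int local;                /* a fiber-local variable parked in place */
} yf_frame;

/* Count frames the way a debugger counts them for a backtrace:
 * follow the caller linkage until it runs out. */
static int yf_backtrace_depth(const yf_frame *top) {
    int n = 0;
    while (top) {
        n++;
        top = top->caller;
    }
    return n;
}
```

Parking a fiber is then just remembering its topmost frame pointer, and switching fibers is changing which topmost pointer is current; nothing is copied, and the parked variables stay where they are.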

One way to implement fibers is via setjmp/longjmp after copying part of the stack to and from a location on the heap. Another way is by switching which frame list is now the "top of stack" without doing anything at all to the contiguous C stack. Schupke proposes to benchmark which is faster, because I claimed the bottleneck was memory bandwidth, where touching more cache lines is slower. A good benchmark can probably be constructed, if realistic stack populations are represented, with a range of optimistic and pessimistic sizes likely bracketing what actually happens. You would need to implement the heap frame approach efficiently, using frames pooled in standard sizes that are always cache-line aligned (and preferably powers of two in size, aligned to addresses that are a multiple of that size). There's not much good in having a frame smaller than, say, a 64-byte cache line. Cache lines can of course be larger on some hardware.
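A hedged sketch of the size-class idea follows; the constant and the `yf_frame_class` name are assumptions for illustration, not a real allocator. Each frame request rounds up to a power of two with a 64-byte floor, so pooled frames can stay cache-line aligned.

```c
#include <stddef.h>

enum { YF_LINE = 64 };  /* assumed cache line size in bytes */

/* Round a requested frame size up to the next power of two, never
 * smaller than one cache line, so pooled frames can be aligned to
 * addresses that are a multiple of their own size. */
static size_t yf_frame_class(size_t need) {
    size_t sz = YF_LINE;
    while (sz < need)
        sz <<= 1;  /* size classes: 64, 128, 256, 512, ... */
    return sz;
}
```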

A valid benchmark must also cause a realistic cache-line pressure via memory load corresponding to an actual application where you want to use far more fibers than you can afford to have threads. Your benchmark should model live traffic from active connections in a server with many thousands of connections, where only some are currently active, but each is moving a lot of data through memory and therefore completely flushing the cache almost continuously. Assume it's gigabit plus transfer rates.

Say a stack is typically about sixteen to thirty frames deep in some application, like a network stack that is perhaps implementing IPsec or deduplication. You can just go sample a population of stack frames in some application, or make some assumptions about distribution in frame count and size. In server side network code, it's common to see a lot of temporary buffers on the stack for staging messages, logging control flow, analyzing IPv6 addresses, and consulting a lot of policy description. (When debugging I log a backtrace everywhere something significant happens; very deep stacks are common.) Small and shallow stacks in this context are rare to the point of non-existence.

Depth (in frame count) in a fiber stack would depend on where fibers are introduced: toward the top of a network stack or at the top of some nested feature like IPsec and dedup. Frame count closer to the leaf end of the call stack is less than half the total, but that end has larger frames, because temp bufs show up more as the mix moves from policy to mechanism and binary formats get staged for concrete processing. Total number of fibers depends on how many things occur concurrently per connection activity: not very many for IPsec, but a lot for dedup with high granularity fragments, like every 2K on average. At one end you need a handful of fibers per connection; at the other you need dozens of fibers per connection.

For a benchmark with any relation to reality in server loads causing high memory pressure, where cache lines turn over continuously, you need at least thousands of fibers whose size average somewhere between 2K and 16K, where 4K to 8K is plausible. This is exclusive of actual i/o buffer space, which is bigger, and isn't relevant to a benchmark, beyond causing cache line pressure much greater than caused by fibers.

Awakening a fiber makes its stack active. Normally this will be all or nearly all cache line misses if a lot of i/o happened while a fiber was parked. For stack frame lists, this means the topmost frame has at least one cache line miss. For contiguous stack frame arrays, this means every frame you copy before calling longjmp is a cache line miss. I would expect the setjmp method to come in at more than an order of magnitude worse. About 50 times as many cache line misses would not be surprising, but seat-of-pants guesses are often not very good. Maybe you can copy large swathes of memory using instructions with no effect on cache lines at the destination, but when awakening a fiber we still expect the heap-based source stack to be out-of-cache when read.

Edit: Regarding trampoline efficiency, it's okay with me that async calls dispatch up to several times slower when they might park a fiber. (Extra cache line misses are worse than extra function calls.) Highest priority is efficiency at or near leaves of a call tree, which are far more numerous than inner nodes. Tight loops and local graphs of calls with bounded non-parking behavior don't get rewritten to dispatch via trampoline; instead they stay the same and operate as efficiently as an optimizing C compiler makes them, using the normal C stack that unwinds upon return to trampoline. Local areas of in-the-small code are normal C; only wide areas of in-the-large code need CPS rewrite around points where context switches can occur. It's a hybrid tactic of old and new stack style. Not every sort of trampoline requires special privilege, or costs more than any other kind of branch; the only extra cost in my model is branching more times and making more implicit calls.
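A minimal trampoline loop might look like the following; `yf_act` and `count_step` are hypothetical names, and this is a sketch of the technique rather than anyone's real runtime. Each activation returns the next one to run (or NULL when done), the C stack unwinds fully between steps, and the only extra cost per dispatch is an indirect branch; no privilege change is involved.

```c
#include <stddef.h>

/* An activation: a continuation function plus heap-resident state. */
typedef struct yf_act yf_act;
struct yf_act {
    yf_act *(*fn)(yf_act *self);  /* returns next activation, or NULL */
    int acc;                      /* a "local" living on the heap */
};

/* The trampoline is an ordinary loop: call the current activation,
 * let the C stack unwind, then dispatch whatever it handed back. */
static void yf_trampoline(yf_act *a) {
    while (a)
        a = a->fn(a);
}

/* Toy step: count down, handing control back to the trampoline on
 * every iteration (each return is a point where a fiber could park). */
static yf_act *count_step(yf_act *self) {
    if (self->acc <= 0)
        return NULL;  /* done: fall out of the trampoline */
    self->acc--;
    return self;      /* keep going next dispatch */
}
```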

### Trampolining

I once wrote my own operating system from scratch in 80x86 assembler, so I am familiar with most of that, except the cactus stack. Really this never occurred to me as an efficient option, because trampolining via exceptions requires a CPU mode change (to a higher privilege ring) and that causes a TLB flush, which is seriously bad for performance. Exceptions can take 1000s of clock cycles compared to 10s for a regular function call. However I now realise that an OS trampoline for exceptions is not the only way to do it; it only occurred to me first because I did a lot of programming with that particular kind of thing.

To do what you are doing requires efficient trampolines; how do you do that? Are you loading the stack register directly with inline assembly? That isn't really an option for me, as I am aiming for portability. setcontext/getcontext offer the ability to set the stack pointer, but they are obsolete, whereas setjmp/longjmp are still in the standard. I think I could use setcontext/getcontext to make a trampoline, as Linux still supports them, hopefully implemented in user space, though probably not going to perform as well as inline assembly that just loads the stack register.

It sounds like the same thing I was thinking about with pointer switching, rather than copying, and not using a heap pointer for 'variables' so my previous comments are probably not relevant. You can probably attribute this to lack of time to think about it properly (and it being a while since I have thought about these particular things).

### re: lwp and fiber abstractions

The main user class I think about is a developer, usually one who will see C source, before and after rewrite for fibers. Call this user Jim: a C dev who works with multithreaded code daily. Another class of dev may compile another language to C without caring about C any more than assembler; say this is Ann (my sister) who writes in Scheme, a sidekick of Ivy (who focuses on C++). Last is end user Tom (one of my three brothers) who doesn't code much, unless it's writing tiny HyperCard scripts. Tom understands no C whatsoever; Ann can scrape by in C; Jim lives and breathes complex concurrent server side C.

My main audience is Jim, until I write a Scheme or prototype a HyperCard; only then would I write docs for Ann and Tom. Jim will be able to see everything reified after rewrite. I only use the term fiber here at LtU because it's a standard term. In my docs I would only use fiber to explain a job, in more abstract terms.

Replying to Keean Schupke's "Thorn is fine":

> It was more the part about abstraction, that is if as a user I can see the difference between a Thorn fibre (job) and a generic fibre, there is something wrong.

I can see that for Tom driving the GUI of an end user app. Ann needs to know both words. Jim has to debug job infrastructure when trying to prove it isn't jobs that are broken when some other bug occurs.

> Therefore I only expect to see the word 'job' or Thorn fibre when discussing how Thorn implements fibres internally. I prefer Thorn fibre to job, but I understand that's just my personal preference. When discussing the user view of Thorn I think we should be talking about fibres and LWPs,

The term job has a long tradition in computing. I first heard it when submitting Fortran punched card decks in the 70's, before I learned Pascal in college, where we still used punched cards. (No, I learned Dartmouth Basic first in 1975 on a paper terminal. C I learned by '83.) To schedule a run you needed to add job control cards to a deck. Systems understood a job control language. The concept of a job is essentially one run of a program. This also applies to how job is used to describe job control in unix shells. And when you think about it, an executing fiber is one run of a control flow, and that's what a Thorn job is. The idea that code might run as a control flow in the future is fiber: a blueprint for execution. But when actually running, it's a job. Sounds clear to me anyway. Having more than one way to say it also affords definitions that are not quite so painfully circular.

> because it doesn't matter to me (a Thorn user) how they are implemented, and my intuitions about fibres from other fibre systems should apply without modification.

That's pretty reasonable. The ways in which you might need to know how they are implemented would involve situations where you directly parked or unparked a job yourself, because of privileged information you had, not represented some other way by existing interfaces.

> In other words we are really discussing generic fibres and LWPs and that the platform is Thorn is irrelevant unless discussing the internals of Thorn.

Sounds good to me. Acronym LWP has two problems. First is a three-syllable letter within. Second is ambiguity if you pronounce it loop, which is said often when discussing code, so that doesn't work. It's a good name in print though.

> but sometimes an outside perspective can be useful, or shine light on things in a different and unexpected way.

Yes, this is one reason I like talking to people: I want to be surprised. I'm especially interested in new spin on old ideas too, if outright surprise can't be had.

> I agree on your thoughts about languages in general, although I am not a fan of Lisp. I like the 'idea' of Lisp, and writing a Lisp interpreter is fun, but I don't actually like programming in Lisp. I prefer my functional languages pure, and I find object mutation and looping constructs easier to read. I quite like programming in Ada, apart from its generics (which are module-like rather than C++'s type-class like templates) and its treatment of pointers. I find myself agreeing with a lot of things you write, and I think our philosophy of programming is similar in a lot of ways.

With any language, I have the most trouble with the abstraction system of the person showing me sample code. In my favorite languages, people can write the weirdest looking things, just because of the way they think. (Often it comes from cargo-cult copy-and-paste, when a dev doesn't know exactly what it means.) If you haven't implemented a Lisp or used one much, I'd expect weirdness of Lisp to often be from the person who wrote the code. For example, if a Haskell programmer wrote Lisp in favorite idioms, I'd have no idea what it does. Because Lisp is low in syntax, you're likely to get a naked view of the abstraction model.

> I would like to discuss this too. I think Erlang handles this in a very elegant way.

A lot of my current approach to fiber design started out from aiming to make C more like Erlang in some way. I'm not sure exactly when I started that. Maybe ten years ago, plus or minus a little. I haven't used Erlang though. I'm more interested in the story about how things work, rather than playing with them, unless I need to get something done.

> Shared memory (and locks) are problematic, and parallel (concurrent?) processes communicating by message passing seems a better model to me, however I am concerned that it may be less efficient than the thread model. However there is a large semantic overlap. If we consider the 'lock' on an area of shared memory to be a token that only one thread can possess at a time, you could see that this could be an access handle, that you pass from thread to thread in a message.

People famously make a lot of mistakes with shared-mem concurrency, I suspect partly because not enough diagnostics are included. I was once unable to get folks to add the identity of who held a semaphore, so we could diagnose deadlocks clearly. Usually the advice is: stop making mistakes! I think we should pay a tax in both space and time to reveal concurrency bugs often, even if sometimes there are false negatives (when you fail to report something that is wrong). Just catching a good percentage of them in a run will weed them out.

I'm pretty interested in ensuring resources are released, especially when it comes to locks. Holding locks anonymously is not okay. I generally don't like shared multiple writers; I prefer one writer, who takes and executes requests to make changes on behalf of other parties, when a shared state model is really necessary. A central server is often effective when you can trust just one writer. Blame is distributed and diluted when multiple writers could have done something wrong.

[I think it only becomes really complex when database semantics are present, and you can't lose data. I actually don't want to talk about that. :-) It's just as involved as programming language discussions. Certain things must be done carefully with layers of backup, and can't be done with glib one-layer locking schemes. But we need to provide tools following clear and precise rules for folks wielding them to do hard things. I think mainly people want a guarantee locks cannot be leaked, because they can handle redundancy in guaranteeing copies for transactions themselves.]

> This is the kind of concurrency/parallelism model I am thinking of. Of course this only models a simple mutex; there are other possibilities to consider, like a shared read lock and an exclusive write lock. This seems to generalise the single lock nicely, but has the problem of how to restrict the shared read handle you may have already given to other processes when you want to write to the write handle. Software transactional memory might be a good option, but I am not sure; I would probably have to try it out. If I had to give an answer to the question directly it would be performance, or rather lack of proof that the more structured approaches perform as well as the threaded/shared-memory approach.

For any given layer that makes conflict under concurrency possible, there's a suitable way to synchronize for access control. Inter-process, inter-thread, and inter-fiber concurrency all need mechanisms that accurately follow rules advertised by an interface. It's simpler in user space, so fibers are easy. If you're going to roll your own process and thread mechanisms, you have harder ones to define and implement. I'd prefer using an existing api though. For example, I only plan to code to a posix thread interface, though one abstracted through a vtable of function pointers in an env.

### what's the most interesting part of vfs design for a PL?

What do you want to hear about a vfs (virtual file system) as it pertains to hosting single-app-daemon-process collections of green processes and green threads?

### efficient low level i/o support in languages?

Do posts about system effects on PL design and runtime have a chilling effect on all other LtU discussion? (Why would a few messages about one subject undermine desire to talk about other things?) Is getting your hands dirty a sign of lower class breeding? Or is it that only math is cool, and anything that smells like operations research is vulgar? This comes from curiosity rather than intent to challenge (a desire to draw out rather than suppress). My best guess is that realistic design plans sound complex, even with a focus on simplicity, and this depresses a person who hopes thinking about exactly one thing will be sufficient. Does undesirable realism intrude more on setting goals, or on safe abstract world views? Of the two I prefer to upset any assumption something can be perfect, which is silly, but easy to believe by avoiding practical evidence.

Right now I'm going over standard tools for portable async i/o, like libevent and libev, and everything has problems, caused by model assumptions about memory management and control flow interface like callback organization. Of the two libevent is worse; libev was designed specifically to address libevent's imposed model effects, but it still presents an event loop design and abstracts memory use no further than allowing plugins to replace malloc, realloc, and free. And standard system types like descriptors are not abstracted (no one will ever need more than an integer to identify a resource). At the very least, I want nearly every native type upgraded to a larger sized struct that adds a namespace and a generation number as a weak capability to catch lifetime and cross-domain mistakes with IDs. A 32-bit file descriptor becomes a 64-bit triple (fd, ns, gen) where gen is assigned at fd creation time, and that layer knows the gen associated with each fd, so it can be stringently enforced (no access without the correct gen). There also needs to be a way to clean up and release a resource belonging to a green process that goes away, so things like file descriptors cannot leak in long running apps where processes get killed.
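The widened-descriptor idea can be sketched as bit packing; the field widths and the `yd_` names below are assumptions for illustration, not a committed layout. A stale or cross-domain handle fails a cheap generation check instead of silently reaching the wrong resource.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical widened descriptor: fd in the high 32 bits, an 8-bit
 * namespace, and a 24-bit generation as a weak capability. */
typedef uint64_t yd_handle;  /* layout: fd:32 | ns:8 | gen:24 */

static yd_handle yd_make(uint32_t fd, uint8_t ns, uint32_t gen) {
    return ((uint64_t)fd << 32) | ((uint64_t)ns << 24) | (gen & 0xFFFFFFu);
}
static uint32_t yd_fd(yd_handle h)  { return (uint32_t)(h >> 32); }
static uint8_t  yd_ns(yd_handle h)  { return (uint8_t)(h >> 24); }
static uint32_t yd_gen(yd_handle h) { return (uint32_t)(h & 0xFFFFFFu); }

/* The layer serving an fd knows its current gen, so every access
 * can be checked: no access without the correct gen. */
static bool yd_check(yd_handle h, uint32_t current_gen) {
    return yd_gen(h) == current_gen;
}
```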

Apparently on Windows one must use iocp (i/o completion ports) to scale anywhere near as well as posix alternatives, but they have the odd character of reporting only accomplished tasks (your read or write was done) instead of saying a non-blocking operation can now safely occur. That is, iocp usage is async but blocking; while async is good, non-blocking is better. This likely means entry points for i/o in an abstract environment interface must be able to handle both "i/o can occur" and "i/o is done" (before and after i/o) in a common way, which might be an abstract valve interface for streams or datagrams. The after-io style of iocp can be tested using a before-io OS api by using a shim, but the reverse is not true, as folks have noted. That might be why libevent feels free to impose a specific buffering interface, which is implied by the after-io style associated with iocp. That would force an extra copy for all i/o when the intended destination is an immutable refcounted data structure for zero-copy streaming effects. Edit: no, said that wrong. Async means non-blocking, so the problem is that the iocp api is early-resource-pinning, with fixed sized capacity. So buffer space pins early, and you can't dynamically change the size when ready without risk of blocking.

A programming language can show an OS-neutral interface for efficient async (and maybe non-blocking) i/o, provided a sufficiently hands-off general api is defined. Maybe most PL i/o designs assume an existing OS interface is wrapped to hide sharp edges, often aiming for a high level under the assumption that low level performance won't be needed.

Topic change: queueing need not imply excessive buffering. There's an idea called "credit-based flow control" whose purpose is to effectively limit queue buffering sizes, by contracting ahead of time instead of assuming everything will be fine. Note how often control is overloaded to mean different things. Word order matters too. A control flow like a thread or fiber is a line of control that flows, while flow control is data flow that needs control. In both cases the first noun is primary and the second noun after modifies. Controlling flow with credit is basically a pessimistic strategy for bounded buffers (assume it won't work until reserved space is signalled), compared to blocking when buffers become full, which is a more optimistic strategy (assume it will work and block-or-park when a limit is hit). In both cases, green or OS condition variables can be used; for fibers, both strategies involve parking until a green condition variable signals, but one approach uses more buffer space than another. Complexity comes from using a window with credit systems, to avoid stalling a producer just to get permission to go; latency to clear queues can be smoothed by window size and credit extended.
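Credit-based flow control reduces to a small invariant, sketched below with invented `yq_` names and no threading (a real fiber version would park on a green condition variable where the comment says so): the consumer extends credit up front, the producer spends one credit per message and must stop at zero, and returning credit as the queue drains reopens the window.

```c
#include <stdbool.h>

/* Toy credit window: the pessimistic strategy, where sending is
 * assumed not to work until reserved space has been signalled. */
typedef struct {
    int credit;  /* messages the producer may still send */
} yq_window;

static bool yq_try_send(yq_window *w) {
    if (w->credit <= 0)
        return false;  /* would park on a green condvar here */
    w->credit--;       /* spend one credit per message */
    return true;
}

/* Consumer side: signal freed queue space by granting credit. */
static void yq_grant(yq_window *w, int n) {
    w->credit += n;
}
```

A larger initial window smooths latency, since the producer need not stall waiting for permission on every message.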

### New async IO wrapper?

Meant as a reply to the above post: Would a good starting point be a new cross platform async IO wrapper library? It satisfies the Unix idea of doing one single thing well, and is likely to get to the point of being useful quicker. You can then build your further work on that once the API is stabilised.

### library support is time consuming, need to focus carefully

My larger project needs an aio abstraction as a sub-part. Experiments run in a first-draft daemon need it now, in portable form so it's not throw-away. Should be quickly useful for testing pieces as written. Odds of a stable first api are pretty good, but it would depend on other parts of base definitions. A standalone library for that alone might be extractable, but focus on a larger library precludes time for support. People seem to like libevent and libev; to assume anyone wants another alternative is presumptuous. :-)

### Existing Libraries Usable?

Can you use the existing libraries? I inferred that your problems with them meant that you couldn't use them, but it sounds like you can use them, although they are not ideal.

libuv seems to be a cross-platform version of libev, that can work on Windows, and is used by nodejs, and seems to be used by the Rust language.

### what's your angle? why do you care what approach I pursue?

Almost unusable, cost exceeding benefit, when work is throw-away and takes several times longer, guessing from first examination. Both interfere with prototype structure and have more hoops to jump through than manual hookup to abstracted system calls.

I may keep replies no longer than your own remarks, so write more to get longer answers when I like the tone and content.

### Implementation of Async IO

Because I am interested in your experiences implementing this, and it may be useful to me and other readers. Is there some reason why we should not discuss how something like this might be implemented?

So you're not going to use an existing library? libuv seems to use the same API as libev, but with better cross platform support (using IO Completion Ports on Windows). What other choice is there that does async properly on both Windows and Linux?

### aio with an api suiting specific non-blocking green process lib

Most discussion is fun when both sides are willing to contribute near equal amounts of work in writing. I'll let you know how it goes. I must want a different api for some reason, which doesn't map cleanly. My first note implies I plan to use iocp on Windows too.

### libuv seems to be a

> libuv seems to be a cross-platform version of libev, that can work on Windows, and is used by nodejs, and seems to be used by the Rust language.

The Rust runtime based on libuv didn't scale, so they threw it out of the main runtime into a separate 'libgreen' and just went with kernel threads. They discussed this on the reddit thread about the Rust 1.0 alpha release.

Rust's future seems to be C#-like await/async function annotations that perform a CPS transform. This is more in line with their goal of a zero-overhead FFI with C. libgreen requires expensive stack switching for FFI.

### Green Thread Performance

I am interested that you found the performance of green threads to be a problem. In theory cooperative scheduling in user space should be lower overhead than preemptive switching in kernel space. Were you using Cactus-stacks to change contexts or copying the stack?

### It's not me, this was the

It's not me, this was the conclusion of the Rust devs based on data. The problem wasn't green threads per se, it was the interaction of some competing requirements. They wanted a zero-overhead FFI for calling out to C (meaning raw pointers), and they didn't want to require GC so there was no way to relocate pointers on the stack or to support cactus stacks. Thus the FFI required a separate C stack from the Rust stack, inducing user space context switches for FFI calls (as I understand it, which isn't uncommon).

Furthermore, they said something about libuv not being multithreaded, so perhaps it doesn't implement work stealing, which would be needed to properly scale above a certain point. I can't find the thread I'm referring to after a quick search, so you'll probably have to look up the discussions on the Rust lists.

### FWIW,

When I'm making prototype or throw-away code, I usually regard Windows as irrelevant. It's insufficiently documented, development tools are not installed by default, and most of its I/O stuff is, as you've been noticing, obtuse. It's not worth my time when I'm just trying to get something to work. And it seems like most new stuff requires asynchronous I/O, which is exactly what it's worst at.

That said, I/O and system interaction in general seems to be way harder than it needs to be in virtually all computer systems, and doing it in a semantics-neutral building-block compatible way is especially hard in third-party libraries for those systems. It's hard to even check for anything's existence, for example, unless compiled in the presence of an assumption that everything required to support its existence is already available. In too many cases you have rather harsh options. You can attempt to use it and crash if it's not available, or not attempt to use it and never learn whether it's available. Or, you just can't build the library at all unless everything that would be necessary for it to become available is already in place.

When I'm at the bottom level of a project -- putting things in place that will become the building blocks for other things -- I want as close to neutral semantics as I can get, and most libraries don't provide it. As Rys says, I don't want anything that interferes with prototype structure. Most libraries of these things require jumping through additional hoops to support conveniences that the prototype structure will not be taking advantage of. That would be okay except that they almost universally follow the use-it-or-break problem I outlined above. They either provide no way to opt out of those conveniences which would release me from the requirement to still have them available, or provide no way to opt out of them that doesn't require me to still jump through those hoops and have available all the infrastructure that they'd need.

It's sort of like driving a Jaguar - the car can be a marvel of engineering, but if the air conditioner breaks, you'll wind up vapor locked at the side of the road, because that damn air conditioner, whether you want to air condition the interior or not, is necessary to the function of its engine cooling system!

### yes, and control inversion helps with semantics-neutral api

(I agree and this is how it works out in my situation, for folks like Schupke.)

I want a daemon prototype with no planned throw-away code, as sample code and as a harness for hosting experiments or driving other things via connections. I won't bother with Windows at first, starting with Linux and MacOS, using primarily posix interfaces plus as few other things as possible. I can simulate Windows iocp with an after-io shim on top of the before-io interfaces on Mac and Linux, exposing the same interface one would use on Windows.
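The after-io shim over a before-io interface can be sketched simply; every name below (`yd_shim` and friends) is invented for illustration, and a real shim would also handle partial reads and errors. When the readiness layer reports "read can occur", the shim performs the read itself and then reports an iocp-style completion ("your read was done").

```c
#include <stddef.h>

/* Injected i/o slots: a non-blocking read and a completion callback. */
typedef long (*yd_read_fn)(int fd, void *buf, long cap);
typedef void (*yd_done_fn)(int fd, long nread, void *ctx);

typedef struct {
    yd_read_fn do_read;  /* before-io style: perform the read now */
    yd_done_fn on_done;  /* after-io style: report completion */
    void *ctx;           /* opaque state for the callback */
} yd_shim;

/* Called by the before-io layer when fd becomes readable: do the
 * i/o, then present it as a completion, iocp style. */
static long yd_shim_readable(yd_shim *s, int fd, void *buf, long cap) {
    long n = s->do_read(fd, buf, cap);
    if (n >= 0)
        s->on_done(fd, n, s->ctx);
    return n;
}

/* Demonstration stubs standing in for real system calls. */
static long demo_read(int fd, void *buf, long cap) {
    (void)fd; (void)buf;
    return cap < 5 ? cap : 5;  /* pretend 5 bytes arrived */
}
static long demo_total;  /* bytes reported complete so far */
static void demo_done(int fd, long nread, void *ctx) {
    (void)fd; (void)ctx;
    demo_total += nread;
}
```

The reverse direction is the hard one, as noted above: a completion report cannot be turned back into a readiness report without pinning buffer space early.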

My approach involves at least two libraries: one for fibers and green processes, and another using it that builds up the whole daemon OS process prototype. Pretend momentarily that libyf is a fiber library and libyd is a daemon library, which depends on libyf. Both of them see the outside world using only abstract interfaces: C-based objects using vtables filled with function pointers that abstract everything, so no dependencies exist on system headers or anything outside those libraries. The smaller libyf won't know how an OS process is organized, and the larger libyd won't know how i/o works, for example, only that there's an interface the host app can satisfy.

A host app where main() is actually located puts pointers to functions (satisfying the interface and contracts) into vtables used when constructing the daemon runtime. This is where dependency on specific system calls is injected. Slots remain null (zero) when no way to satisfy api exists in an app; each platform plugs in the way things work for that OS. If anything critical is missing, the daemon can't work, and runtime failure occurs if you execute anyway.
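As a hedged sketch of that shape -- every name here (yenv_vtable, posix_read, g_env, and so on) is invented for illustration and is not the actual libyf/libyd api:

```c
#include <stddef.h>
#include <unistd.h>

/* Abstract i/o interface the daemon calls through; the daemon library
   itself would see only this struct, never any system header. */
typedef struct yenv_vtable {
    long (*read_fn)(void *ctx, int id, void *buf, size_t len);
    long (*write_fn)(void *ctx, int id, const void *buf, size_t len);
} yenv_vtable;

/* Host-side implementations, injected only where main() lives;
   this is the one place that depends on specific system calls. */
static long posix_read(void *ctx, int id, void *buf, size_t len) {
    (void)ctx;
    return (long)read(id, buf, len);
}
static long posix_write(void *ctx, int id, const void *buf, size_t len) {
    (void)ctx;
    return (long)write(id, buf, len);
}

/* main() fills the slots; any slot left null means that capability
   is absent on this platform and use of it fails at runtime. */
static yenv_vtable g_env = { posix_read, posix_write };
```

A daemon constructed with `g_env` would check critical slots for null before running, matching the runtime-failure rule above.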

This sort of organization is sometimes called "don't call us, we'll call you", otherwise known as the Hollywood principle (HoP), as well as several closely related names like Inversion of control (IoC) and the Dependency inversion principle (DIP). The OS is not in charge, in the sense of driving things, other than satisfying requests for i/o when they occur. (All OS signals propagate via self-pipe pattern.) The daemon itself is the driver, which makes only calls using api it defined, which the host app satisfied by installing function pointers conforming to expected behavior.

Usually I work bottom-up when coding, but with at least one top-down phase to meet in the middle. This i/o architecture segment of design is the top-down phase, but with control inverted so the daemon is in charge as far as specification goes. Anything that interferes with this prototype goal is throw-away code because it cannot be kept long term, with dependencies minimized to suit the daemon.

The identity of OS resources cannot be under OS control, as far as the daemon shows to code that it hosts inside, because the daemon will simulate things that don't exist in the OS and use the same ID space, but with namespace extensions to prevent ID collisions. If one daemon peers with another, they can expose descriptors to each other, so the same OS descriptor appears more than once, but additional disambiguation says which address space actually serves a specific descriptor. It's basically illegal to make system calls in the daemon, and if you try, it won't work because IDs like descriptors will be obfuscated or swizzled. (Yes, in the same address space you can circumvent this, but if you're going to subvert the model there's little reason to use it.)

This begs for whole-program transformation of code hosted in the daemon, done either by hand or by tool, so only daemon-defined api is used anywhere inside. Exceptions can be whitelisted at compile time, which is a good idea for anything simple, bounded, non-blocking, and fast that works fine as a leaf call. A primary avenue of interacting with the world outside is via sockets, or anything else the daemon can change into dataflow resembling a socket.

### Usual Approach

So you are defining an API for IO and will implement a backend for each platform. Leaving Hollywood aside, this is what I suggested several posts ago when I proposed you might implement your own library, though I probably didn't explain it very well. I meant a library as a method of code organisation, not something you necessarily have to publish separately and maintain, although if it were me I would put it in a separate source code repository with its own unit tests from the start. After that initial suggestion I took your response about not needing another library to imply that you were going to use an existing async IO library, which is why I mentioned libuv. As you can see, my initial response was the same as yours: build a minimal and natural abstraction.

I suppose if you want to load platform IO drivers at runtime a vtable is necessary, but I wouldn't bother. Instead I would have a static API; after all, you need to recompile for different CPUs on different platforms anyway. Unnecessary indirection via a vtable is bad for performance, and does not change the actual contract definition.

### the usual suspects (lib-kaiser-soze)

Sorry for misreading your intent as leaning too far toward a new standalone aio polling library. An inverse relation holds between velocity and inter-person sync (rephrasing Fred Brooks' idea of communication tax). I hope what I come up with seems clear. Any code style people hate can be fixed by a code-rewriter pass.

Often the opposite of what I say works too, if you revise contracts to match; but saying each one drives people crazy. The host env abstraction can be static, yes, and that's going to be likely for some methods folks want to dispatch via macro instead of vtable indirection. A header for the env interface can be hacked to statically dispatch directly to OS api instead of using the env vtable. The fully general way might be better for a shared library.

That way of saying it (all vtable, all the time, for the world outside) helps clarify that a daemon merely has references to things defined by the host env, which is logically outside and therefore replaceable. I know devs who seem to believe you can't replace standard entry points (yes you can), because they never see alternatives to naked direct calls. The hard direction is adding vtables; you can always macro them back to direct calls. It's also a common belief that direct calls (without a vtable) are measurably faster, even when a system call with much higher overhead is about to occur. Twenty years ago devs argued about whether C++ must be slower than C just because of virtual function call overhead; it's a lot less relevant now that cache line misses are a bigger deal than indirection.

I focus less on libraries that cannot change, and more on source code that can be completely transformed by a dev building an app, by running it through a tool that rewrites to a form better suiting what is desired at runtime. This POV is unusual, and lets you do odd things like publish multiple versions of a library (and the rewriter too, plus rules, telling folks to give it a go themselves). Test suites you can still run after rewrite seem more important than meticulous standardization.

### ID management in green thread i/o libraries

(My intended audience is someone like me, perhaps much younger, who finds this useful.)

A while ago I figured out some rules affecting aio library API, as induced by resource reclamation requirements for a lightweight process (lwp, also known as a lane). A process should release all resources held when it dies, and so should an lwp. So memory, locks, and IDs (descriptors) must all be freed if held on exit. After considering union file systems next, I felt no need to go back and rethink, so the plan seems stable.

A fiber holds only green locks, because only runtime interfacing with threads has any OS lock issues, unrelated to fiber lifetime. So a fiber unlocks each green lock held on exit. Further, a fiber can park on at most one green lock, so it can be a member of at most one wait list. So if present on a wait list, a fiber removes itself on exit. Thus no stale pointers can remain in green lock runtime when a green process is killed, and no locks can be held by a fiber after it goes, so killing cannot induce deadlock.
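The exit rule above can be sketched roughly like this. The names (fiber, gl_lock) and the fixed-size held array are invented simplifications; a real runtime would use intrusive lists and wake a waiter on unlock:

```c
#include <stddef.h>

typedef struct fiber fiber;
typedef struct gl_lock gl_lock;

struct gl_lock {
    fiber *owner;       /* current holder, or NULL when the lock is free */
    fiber *wait_head;   /* singly linked list of parked fibers */
};

struct fiber {
    gl_lock *held[4];   /* green locks currently held (toy fixed array) */
    int      nheld;
    gl_lock *parked_on; /* at most one wait-list membership at a time */
    fiber   *wait_next;
};

/* On exit: release every held green lock and unlink from any wait list,
   so no stale pointers into this fiber survive in the lock runtime. */
static void fiber_exit_cleanup(fiber *f) {
    for (int i = 0; i < f->nheld; ++i)
        if (f->held[i]->owner == f)
            f->held[i]->owner = NULL;   /* unlock; real code wakes a waiter */
    f->nheld = 0;
    if (f->parked_on) {                 /* remove self from the wait list */
        fiber **pp = &f->parked_on->wait_head;
        while (*pp && *pp != f) pp = &(*pp)->wait_next;
        if (*pp) *pp = f->wait_next;
        f->parked_on = NULL;
    }
}
```

Because the fiber does both halves itself on exit, killing it cannot leave a lock held or a dangling wait-list entry, which is the no-deadlock claim above.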

The memory story is short: incremental refcount release in a cleanup green process. The location of all handles is known in reflection metainfo for perfect collection. No destructors run, just return of storage to allocating vats. Release is one-way dataflow, so errors that would normally kill an lwp, if any, are just counted in statistics.
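A minimal sketch of that one-way release flow, with invented names (handle, cleanup_stats, release_handles) and plain free() standing in for returning storage to an allocating vat:

```c
#include <stdlib.h>

/* A refcounted handle; real metainfo would locate these via reflection. */
typedef struct handle { int refcount; void *storage; } handle;

typedef struct cleanup_stats { long released; long errors; } cleanup_stats;

/* One-way release in a cleanup pass: decrement each refcount, return
   storage at zero, and count anomalies in stats instead of raising. */
static void release_handles(handle **hs, int n, cleanup_stats *st) {
    for (int i = 0; i < n; ++i) {
        handle *h = hs[i];
        if (!h || h->refcount <= 0) { st->errors++; continue; }
        if (--h->refcount == 0) {
            free(h->storage);      /* stand-in for giving back to a vat */
            h->storage = NULL;
        }
        st->released++;
    }
}
```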

IDs allocated by an env (environment) are also tracked by the env when release must occur, and the env knows every ID held by a green process, so when it dies the runtime releases them all, or tasks a cleanup entity with this. In effect, the env implementation is required to maintain an index of all IDs held, indexed by lane, so they can be freed en masse. Each index entry has enough metainfo about a resource to release it, even if the resource is owned by some remote entity outside the current OS process. Among other things, every resource has an associated generation number, so they can only be released when gen is correct (so a fiber cannot double free anything).

Every ID is owned by some particular lwp; if two different lwp instances need to share an external resource by ID, one of them dups the original ID so it has its own private ID that needs release. Presumably the common resource is refcounted by index entries, so the last one out turns off the lights. While you can use another lwp's ID for a resource directly without dup, it will become invalid when the original owner dies, thus killing you on first use afterward when you depend on it being valid.

As a consequence of this scheme, no ID need be large in the abstract API presented by an env, because indirection is used any time an ID (say for something remote) would need to be larger. Actually, indirection is always used, so any ID needing a lot of bits can live in the env's ID table, while a fiber sees only a local ID for the table entry. In general, no more than 32 extra bits are necessary per abstract ID, enough for 16 bits of gen and another 16 bits of namespace.

Any descriptor normally described by an int in OS system call interfaces can be replaced with the following opaque 64-bit struct, also passed as a scalar value. We omit unique namespace prefixes here for brevity and clarity:

struct id64 { /* opaque abstraction of int resource descriptor */
u32   id_key;  /* key identifying resource within the namespace */
u16   id_ns;   /* namespace scope for the interpretation of key */
u16   id_gen;  /* (nonzero) low bits of key's generation number */
};

For example, when this represents a local OS socket descriptor, the value of id_ns might say how to extract the descriptor from id_key, or it might only describe a table entry where the socket descriptor actually lives. Field id_gen acts like a weak capability, to protect against memory corruption, or accidentally targeting a resource owned by some other green process.
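A toy version of the generation check might look like this, reusing the id64 layout above; the env-side table shape (g_tab, env_release, the NSLOT bound) is invented for illustration:

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint16_t u16;

struct id64 { u32 id_key; u16 id_ns; u16 id_gen; };  /* layout as above */

/* Toy env table: one slot per key, with gen bumped on every release
   so outstanding copies of an old ID stop matching. */
#define NSLOT 8
static struct { u16 gen; int in_use; int os_fd; } g_tab[NSLOT];

/* Release succeeds only when gen matches, so a stale ID or a double
   free from a fiber is refused rather than hitting a reused slot. */
static int env_release(struct id64 id) {
    if (id.id_key >= NSLOT) return -1;
    if (!g_tab[id.id_key].in_use || g_tab[id.id_key].gen != id.id_gen)
        return -1;                  /* wrong generation: refuse */
    g_tab[id.id_key].in_use = 0;
    g_tab[id.id_key].gen++;         /* invalidate all outstanding copies */
    return 0;
}
```

The same slot could hold the refcount mentioned above for dup'd IDs, so the last matching release is the one that actually frees the underlying resource.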

Given such an ID, there's no way to tell one sort of stream apart from another, unless you make a call that tells you how this particular resource is managed, if for some reason you care. Any resource needing a lot of bits to represent correctly (in the local process address space) can hide that via table indirection, so this 64-bit ID is the interface everywhere. So there's no reason an aio library need use a bigger ID than this, even when long chains of plumbing are involved. You can ask for io notification, but you needn't see how it's done or have any reason to care, unless you implement an inspector for debugging.

Suppose you spawn green processes to service an actor protocol, then exchange messages using an ID with this 64-bit format to name an actor. Some range of values in id_ns might name local threads hosting lwp instances, where id_key names the target green process. Another range of values in id_ns might mean an actor is remote, perhaps an lwp running in another daemon, whose full addressing path is spelled out in a table entry described by the ID.
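One hedged way to sketch that range split -- the NS_LOCAL_MAX boundary and helper names are made up for illustration, not a real protocol:

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint16_t u16;

struct id64 { u32 id_key; u16 id_ns; u16 id_gen; };  /* layout as above */

/* Invented partition of the namespace field: low values name local
   threads hosting lwp instances; anything higher means the actor is
   remote, with its full addressing path in an env table entry. */
enum { NS_LOCAL_MAX = 0x00FF };

static int actor_is_local(struct id64 id) {
    return id.id_ns <= NS_LOCAL_MAX;
}

/* A send would route on this; the sketch just returns a route tag. */
static const char *route_for(struct id64 id) {
    return actor_is_local(id) ? "local-thread" : "remote-daemon";
}
```

For the local range, id_key can name the target green process directly; for the remote range, id_key indexes the table entry spelling out the path to the other daemon.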

In effect, the purpose of uniform ID use for descriptors is loose coupling, to prevent dependency on physical representation, and to put cooperative firewalls between mutable lwp fiber storage space and environment managed resources. Not very many bits are needed for scaling purposes, because indirection creates flexibility to use IDs as big as necessary internally.

### HN on kernel vs user space i/o scheduling efficiency

The HN topic "84% of a single-threaded 1KB write in Redis is spent in the kernel" has a number of interesting comments about user-space scheduling as a performance optimization.