notes on a C-ish memory manager design

I am (re)designing a C runtime memory manager for hosting fibers, DSLs, and lightweight daemons. I started teaching my early-20s sons to program (in C++ and C), and the idea is to give them a library allowing progressive elaboration of tools written in Unix philosophy: i.e. simple and composable processes. The idea is to use interfaces that allow embedding, so pipelines can run in threads at first, then later fibers within a thread, without code rewrites beyond automatic translation for CPS (continuation passing style) to run in fibers. The "app" would be a daemon hosting a user space operating system, with services like an http agent on a port for a browser-based UI. The idea is to give my sons an undergraduate CS education in terms of what happens here, and in stuff they write to execute inside.

My only aim in this post is to briefly describe its memory management, and the relation to programming language implementation, since a couple different (Smalltalk-ish) Lisp interpreters will run inside, as DSL examples, one simple and one fancy. The paragraph above was just context, since otherwise you might wonder "why in the world would I want to do that?" about each detail. This is for your entertainment only, since I would find a helpful comment surprising. My current version of this sort of thing is much nicer than any I did before. I was trying to solve the problem of making everything refcounted, without something terribly awkward happening. (Embedded sub-processes can garbage collect as they like without refcounting, though; my fancy interpreter would use stop-and-copy.)

Each runtime instance uses interfaces defined in the library only. The world outside is abstracted as an environment, so all calls to the outside world go through an env instance. This helps later automatic code rewriting, so blocking calls can be handled as async calls to a worker thread pool to avoid blocking a fiber. (Such calls park instead of blocking.)

The idea is to have multiple independent runtimes in one host operating system process, so there is no global memory manager unless glue code for the environment does that (whatever mechanism gets memory from native interfaces). There is at least one instance of environment E in the OS process, which has function pointers for every external interface supported. For memory management purposes, it must have a memalign() function pointer, to allocate large blocks of aligned memory. (This can be simulated with malloc() by allocating very big blocks, then chopping off misaligned start and end parts, using those for other purposes.)
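As a minimal sketch of the env idea above, assuming hypothetical names (env, env_malloc_memalign, env_malloc_free, raw_out) that are not part of the actual design: the env is a table of function pointers, and aligned allocation is simulated with malloc() by over-allocating and rounding the start address up. Unlike the scheme described, this sketch simply wastes the misaligned head and tail instead of reusing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BP_SIZE ((size_t)1 << 20)  /* 1MiB block size used in the text */

/* Hypothetical env: one function pointer per external interface. */
typedef struct env env;
struct env {
    void *(*memalign)(env *e, size_t align, size_t size, void **raw_out);
    void  (*memfree)(env *e, void *raw);
};

/* Simulate aligned allocation with malloc(): over-allocate by `align`
 * bytes, then round the start address up to the next multiple of `align`.
 * The raw malloc pointer is handed back so it can be freed later. */
static void *env_malloc_memalign(env *e, size_t align, size_t size,
                                 void **raw_out)
{
    (void)e;
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    void *raw = malloc(size + align);
    if (raw == NULL)
        return NULL;
    uintptr_t start = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
    *raw_out = raw;
    return (void *)start;
}

static void env_malloc_free(env *e, void *raw)
{
    (void)e;
    free(raw);
}

/* One env instance for the OS process, backed by malloc(). */
static env default_env = { env_malloc_memalign, env_malloc_free };
```

A vat would call default_env.memalign(&default_env, BP_SIZE, BP_SIZE, &raw) to get one page, and keep raw around only for the eventual free.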

Each runtime instance has at least one vat, acting as the heap, which allocates only aligned memory blocks. For example, each thread would have its own dedicated vat, which is single-threaded, so no races exist between threads. (All dataflow between threads is copied from one vat to another, mediated by thread-safe queues in the runtime.) Each vat allocates only large uniform blocks from the env, perhaps several at once for efficiency, where every block is 1MiB in size, or 2^20 bytes, and aligned to a 1MiB address. (The standard block size could be another power of two instead of 1MiB, but I will speak as if it could never change.) Since there will be fragmentation, this design only makes sense when allocating a lot of total memory, on the order of gigabytes. So you would not want this runtime on a system with limited resources.

Each vat describes its address space as a book, a set of 1MiB pages, where each 2^20-aligned block of 2^20 bytes is called a bp for book page -- or just page for short when book is understood. The main purpose of each bp is usually sub-allocation of smaller aligned memory blocks to satisfy vat allocation requests, but a garbage-collected vat would use its book any way it liked (an address space with page granularity). A vat used for C style allocation would return blocks that are powers-of-two in size, each aligned to the same power-of-two bytes, as small as 16 bytes and as large as, say, 256K. You can cheaply test any address for membership in a vat by seeing if the aligned bp address is in the book hashmap of bp addresses. Each allocated aligned block is called a rod, short for refcounted, because each one is reference counted by metainfo at the start of each book page.
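The membership test and the power-of-two sizing can be sketched as follows. This is only an illustration: the fixed-capacity open-addressing table, the hash constant, and the names book, book_add, book_has_addr, and rod_size_class are all assumptions, standing in for whatever hashmap the book really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BP_SIZE    ((uintptr_t)1 << 20)
#define BOOK_SLOTS 1024u  /* hypothetical fixed capacity, a power of two */

/* A book as a small open-addressing set of bp addresses (0 = empty). */
typedef struct book {
    uintptr_t slot[BOOK_SLOTS];
    size_t    count;
} book;

/* Mask off the low 20 bits to get the enclosing book page address. */
static uintptr_t bp_of(const void *p)
{
    return (uintptr_t)p & ~(BP_SIZE - 1);
}

static size_t book_hash(uintptr_t bp)
{
    uint64_t h = (uint64_t)(bp >> 20) * 0x9E3779B97F4A7C15ull;
    return (size_t)(h >> 40) & (BOOK_SLOTS - 1);
}

static void book_add(book *b, uintptr_t bp)
{
    assert(bp != 0 && (bp & (BP_SIZE - 1)) == 0);
    assert(b->count + 1 < BOOK_SLOTS); /* keep a slot empty so probes end */
    size_t i = book_hash(bp);
    while (b->slot[i] != 0 && b->slot[i] != bp)
        i = (i + 1) & (BOOK_SLOTS - 1);
    if (b->slot[i] == 0) {
        b->slot[i] = bp;
        b->count++;
    }
}

/* Is this address inside any page owned by the book (and so the vat)? */
static int book_has_addr(const book *b, const void *p)
{
    uintptr_t bp = bp_of(p);
    if (bp == 0)
        return 0;
    for (size_t i = book_hash(bp); b->slot[i] != 0;
         i = (i + 1) & (BOOK_SLOTS - 1)) {
        if (b->slot[i] == bp)
            return 1;
    }
    return 0;
}

/* Round a request up to a power of two between 16 bytes and 256K.
 * Returns 0 when the request is too large to be a rod. */
static size_t rod_size_class(size_t n)
{
    if (n > ((size_t)256 << 10))
        return 0;
    size_t c = 16;
    while (c < n)
        c <<= 1;
    return c;
}
```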

Each bp begins with a standard bp header, describing page format and detail common to the page, followed by as many rh rod headers as needed to describe each rod sub-block in the rest of the page. (A garbage collected vat would have only the bp header for each bp.) Each rod head is 16 bytes in size and 16-byte aligned, with a 32-bit refcount, a 32-bit checksum, and several descriptive fields, including a 16-bit generation number and a pair of 16-bit ids characterizing format and optional allocation context (naming a backtrace for memory usage analysis). Starting from the address of a rod -- some block inside a bp -- getting its refcount touches two cache lines in the general case, one for the bp head and one for the rh. Except for very small rods, this is only one cache line worse than co-locating the refcount inside the block itself. And that would only happen when the refcount was actually needed. (In my model, refcount is NOT linear in aliases; it is only linear in official handles which other aliases can borrow when their scope is clearly inside the handle scope.)

The purpose of a checksum in the rh for each rod is dynamic support for immutable values. When a rod is frozen to immutability, so it cannot be changed again before deallocation, the checksum must still be unchanged when the last reference goes away. You might only audit this in debug mode, but you can ask a rod if it is immutable, and enforce absence of mutation by accident.
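A sketch of freezing and auditing, with the caveat that nothing above says which checksum is used: 32-bit FNV-1a is swapped in here only because it is short. The ROD_FROZEN flag bit and the names rod_freeze, rod_is_frozen, and rod_audit are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROD_FROZEN 0x1u  /* hypothetical flag bit in the rod head */

typedef struct rod_head {
    uint32_t refcount;
    uint32_t checksum;
    uint16_t generation;
    uint16_t plan_id;
    uint16_t context_id;
    uint16_t flags;
} rod_head;

/* 32-bit FNV-1a over the rod's bytes (a stand-in checksum). */
static uint32_t rod_sum(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Freeze: record the checksum and mark the rod immutable. */
static void rod_freeze(rod_head *rh, const void *data, size_t len)
{
    assert(rh != NULL);
    rh->checksum = rod_sum(data, len);
    rh->flags = (uint16_t)(rh->flags | ROD_FROZEN);
}

static int rod_is_frozen(const rod_head *rh)
{
    return (rh->flags & ROD_FROZEN) != 0;
}

/* Audit, e.g. when the last reference goes away in debug mode:
 * returns 1 if the rod is mutable or still matches its checksum. */
static int rod_audit(const rod_head *rh, const void *data, size_t len)
{
    return !rod_is_frozen(rh) || rh->checksum == rod_sum(data, len);
}
```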

Now here's the interesting part: each bp header has a pointer to the parent vat, and to the env. So the cost of those pointers is amortized across all the rods in the same page. In effect, every object allocated from a vat is a subclass of rod, and every rod has a pointer to the allocating vat and to the runtime env interface at zero space cost. It becomes unnecessary to pass vat and env as separate function arguments, since they are always available from any object in a vat. You can also add a handle reference to any sub-object inside a rod, because you can find the associated rh for any address in the page that is part of the rod space. So you can hold references to sub-fields which keep the parent object alive, without any problem, as long as you avoid accidental fragmentation cost (keeping something larger alive only to use a small sub-field).
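The pointer recovery amounts to one mask and one load. A minimal sketch, assuming the bp header starts with the two pointers (its other fields are omitted) and using hypothetical helper names vat_of and env_of:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BP_SIZE ((uintptr_t)1 << 20)

/* Placeholder contents; the real vat and env carry far more. */
struct env { const char *name; };
struct vat { struct env *env; const char *name; };

typedef struct bp_head {
    struct vat *vat;  /* parent vat that owns this page */
    struct env *env;  /* runtime env interface */
    /* page format fields omitted in this sketch */
} bp_head;

/* Any address inside a page maps to the page header by masking. */
static bp_head *bp_of(const void *p)
{
    assert(p != NULL);
    return (bp_head *)((uintptr_t)p & ~(BP_SIZE - 1));
}

/* So any object, or any field inside one, can name its vat and env. */
static struct vat *vat_of(const void *p) { return bp_of(p)->vat; }
static struct env *env_of(const void *p) { return bp_of(p)->env; }
```

A function that must allocate a sibling of some object can then take only the object and call vat_of(obj), rather than threading vat and env through every signature.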

What about details related to programming languages?

My overall objective is to write tools converting content into different forms, including source code rewrite transformations. (I tell my sons that most of programming is turning inputs into different outputs.) But that is little affected by memory management per se. Except the i/o model of embedded languages can stream refcounted immutable rods between agents in a process pipeline. That helps get operating system style features into programming language semantics, where a language can also have a process model.

I expect to write a garbage collected Lisp using a vat as the heap, which allocates new bp instances for stop-and-copy before releasing old instances. The uniform use of standard book pages makes re-use efficient without fragmentation across multiple vats.

For bootstrapping, I probably want a simpler and dumber Lisp first, just to perform operations on S-expressions used to express whatever high level policies are used by the system in configuration, or in manifests describing large scale transformations. This can be done with a primitive AST interpreter over a refcounted tree of nodes allocated by C-oriented code. Graph cycles would be pathological unless manually handled or automatically detected. But it's not worse than other C manual memory schemes.

One field in each rh with metainfo describing each rod is the ID of a schema descriptor: the plan of any struct inside the rod. Among other things, it would exactly describe the location of embedded handles in the rod, so reference releasing can be automated. The idea is to be able to release all resources when a lightweight process is killed, without requiring as much manual intervention by a user. A secondary purpose is to make releasing large graphs incremental and uniform, managed by a background rod releaser, which incrementally walks a graph using the rod metainfo as a guide.

A second purpose in putting a plan ID in the rh for each rod is generic debug printing support: I want to be able to print everything. But manually writing a print method for everything is a pain. I would rather write a standard metainfo description of everything, especially since this can be emitted from a source code analysis tool. Then a single generic printer util can process this standard format guide when walking graphs. (Note: to stop huge graphs from being printed, a field in a plan can be marked as best cited briefly, rather than printed in full, and this also helps prevent following cycles due to back pointers.)
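Both uses of a plan (automated release and generic printing) can be sketched with one field table. Everything named here is hypothetical: plan_field, plan, the node example, and the two walkers. Embedded handles are simplified to bare pointers, the release callback only clears slots where a real one would drop a refcount, and the printer is shallow where a real one would recurse into fields not marked brief by looking up the target's plan ID in its rh.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { F_I32, F_HANDLE } field_kind;

typedef struct plan_field {
    const char *name;
    field_kind  kind;
    size_t      offset; /* byte offset of the field inside the rod */
    int         brief;  /* cite briefly instead of printing in full */
} plan_field;

typedef struct plan {
    const char       *name;
    uint16_t          id;
    size_t            nfields;
    const plan_field *fields;
} plan;

/* Example struct and the plan describing it. */
typedef struct node { int32_t value; void *next; void *parent; } node;

static const plan_field node_fields[] = {
    { "value",  F_I32,    offsetof(node, value),  0 },
    { "next",   F_HANDLE, offsetof(node, next),   0 },
    { "parent", F_HANDLE, offsetof(node, parent), 1 }, /* back pointer */
};
static const plan node_plan = { "node", 1, 3, node_fields };

typedef void (*handle_fn)(void **slot, void *ctx);

/* Visit every embedded handle slot in a rod, guided by its plan. */
static void plan_each_handle(const plan *p, void *rod, handle_fn fn, void *ctx)
{
    assert(p != NULL && rod != NULL);
    for (size_t i = 0; i < p->nfields; i++)
        if (p->fields[i].kind == F_HANDLE)
            fn((void **)((char *)rod + p->fields[i].offset), ctx);
}

/* Stand-in release: clear the slot and count it in *(int *)ctx. */
static void release_slot(void **slot, void *ctx)
{
    if (*slot != NULL) { *slot = NULL; ++*(int *)ctx; }
}

#define APPEND(...) do { \
    int w_ = snprintf(out + n, cap - n, __VA_ARGS__); \
    if (w_ < 0 || (size_t)w_ >= cap - n) return -1; \
    n += (size_t)w_; } while (0)

/* Shallow generic printer; returns length written, or -1 if truncated. */
static int plan_print(const plan *p, const void *rod, char *out, size_t cap)
{
    size_t n = 0;
    APPEND("%s{", p->name);
    for (size_t i = 0; i < p->nfields; i++) {
        const plan_field *f = &p->fields[i];
        const char *at = (const char *)rod + f->offset;
        if (f->kind == F_I32) {
            int32_t v;
            memcpy(&v, at, sizeof v);
            APPEND(" %s=%d", f->name, (int)v);
        } else {
            void *h;
            memcpy(&h, at, sizeof h);
            APPEND(" %s=%s", f->name,
                   h == NULL ? "nil" : f->brief ? "<cited>" : "<ref>");
        }
    }
    APPEND(" }");
    return (int)n;
}
```

Since the table is plain data, a source code analysis tool could emit node_fields directly from the struct declaration.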

Note I am not describing handles, which are like smart pointers, as they little affect memory management and PL topics. But each rod refcount comes from an alias in a handle, which associates metainfo with the rod pointer, including a copy of the plan ID and generation number, which must match when the ref is released. (That detects failures to keep refcounted objects alive as expected.) Refcounting is not manual. It occurs as a side effect of aliasing via handles. When auditing is enabled, each handle can hold the ID of a backtrace that added a reference, and this finds refcount leaks. On a 64-bit system, the size of a handle is twice the size of a pointer: the metainfo is pointer-sized in total. This requires that a library provide standard collection classes that work with handles too, when ownership of objects is expressed as collection membership. Each handle includes a bitflag field which, among other things, allows a reference to be denoted readonly, even if the rod is mutable, for copy-on-write data structures like btrees that share structure. (This is useful for transient collections expressing "the original collection plus any patches that might occur" without altering the original collection.)
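For concreteness, a handle might be laid out like this. It is a sketch under assumptions: four 16-bit fields are one way to fill the pointer-sized metainfo, the H_READONLY bit and the function names are hypothetical, and the rod head is passed in directly instead of being found from the rod address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define H_READONLY 0x1u  /* hypothetical flag: ref may not mutate the rod */

typedef struct rod_head {
    uint32_t refcount;
    uint32_t checksum;
    uint16_t generation;
    uint16_t plan_id;
    uint16_t context_id;
    uint16_t flags;
} rod_head;

typedef struct handle {
    void    *rod;        /* the alias itself */
    uint16_t plan_id;    /* copy of the rod's plan ID */
    uint16_t generation; /* copy of the rod's generation number */
    uint16_t flags;      /* e.g. H_READONLY */
    uint16_t trace_id;   /* backtrace that added the ref, when auditing */
} handle;
/* 8 bytes of metainfo: twice pointer size on a 64-bit system. */
_Static_assert(sizeof(handle) == sizeof(void *) + 8, "unexpected padding");

/* Taking a reference is a side effect of pointing a handle at a rod. */
static void handle_init(handle *h, void *rod, rod_head *rh, uint16_t flags)
{
    assert(h != NULL && rod != NULL && rh != NULL);
    rh->refcount++;
    h->rod = rod;
    h->plan_id = rh->plan_id;
    h->generation = rh->generation;
    h->flags = flags;
    h->trace_id = 0;
}

/* Release checks that the rod is still the one the handle referenced.
 * Returns 0 on success, -1 when plan ID, generation, or count mismatch. */
static int handle_release(handle *h, rod_head *rh)
{
    if (h->rod == NULL)
        return 0;
    if (rh->refcount == 0 || rh->plan_id != h->plan_id ||
        rh->generation != h->generation)
        return -1;
    rh->refcount--;
    h->rod = NULL;
    return 0;
}
```

The mismatch case assumes the allocator bumps the generation when a rod is recycled, so a stale handle is caught at release time.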