LtU Forum

notes on a C-ish memory manager design

I am (re)designing a C runtime memory manager for hosting fibers, DSLs, and lightweight daemons. I started teaching my early-20s sons to program (in C++ and C), and the idea is to give them a library allowing progressive elaboration of tools written in the Unix philosophy: i.e. simple and composable processes. The idea is to use interfaces that allow embedding, so pipelines can run in threads at first, then later fibers within a thread, without code rewrites beyond automatic translation for CPS (continuation passing style) to run in fibers. The "app" would be a daemon hosting a user space operating system, with services like an http agent on a port for a browser-based UI. The idea is to give my sons an undergraduate CS education in terms of what happens here, and in stuff they write to execute inside.

My only aim in this post is to briefly describe its memory management, and the relation to programming language implementation, since a couple different (Smalltalk-ish) Lisp interpreters will run inside, as DSL examples, one simple and one fancy. The paragraph above was just context, since otherwise you might wonder "why in the world would I want to do that?" about each detail. This is for your entertainment only, since I would find a helpful comment surprising. My current version of this sort of thing is much nicer than any I did before. I was trying to solve the problem of making everything refcounted, without something terribly awkward happening. (Embedded sub-processes can garbage collect as they like without refcounting, though; my fancy interpreter would use stop-and-copy.)

Each runtime instance uses interfaces defined in the library only. The world outside is abstracted as an environment, so all calls to the outside world go through an env instance. This helps later automatic code rewriting, so blocking calls can be handled as async calls to a worker thread pool to avoid blocking a fiber. (Such calls park instead of blocking.)

The idea is to have multiple independent runtimes in one host operating system process, so there is no global memory manager unless glue code for the environment does that (whatever mechanism gets memory from native interfaces). There is at least one instance of environment E in the OS process, which has function pointers for every external interface supported. For memory management purposes, it must have a memalign() function pointer, to allocate large blocks of aligned memory. (This can be simulated with malloc() by allocating very big blocks, then chopping off misaligned start and end parts, using those for other purposes.)
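The malloc() trick in the parenthesis is mostly address arithmetic, so here is a small sketch of it in Python using plain integers as addresses. The function name `carve_aligned` and its return shape are made up for illustration; a real env would do the same rounding on pointers in C.

```python
BLOCK = 1 << 20  # 1 MiB, the block size named in the post

def carve_aligned(raw_start, raw_size, block=BLOCK):
    """Split a raw region into (head, aligned_blocks, tail).

    head and tail are (start, length) leftovers that can serve other purposes;
    aligned_blocks lists block-aligned start addresses, each `block` bytes long.
    """
    raw_end = raw_start + raw_size
    first = (raw_start + block - 1) & ~(block - 1)  # round start up to alignment
    last = raw_end & ~(block - 1)                   # round end down to alignment
    if first >= last:
        # Region too small or too badly placed to hold one aligned block.
        return (raw_start, raw_size), [], (raw_end, 0)
    blocks = list(range(first, last, block))
    return (raw_start, first - raw_start), blocks, (last, raw_end - last)
```

Asking for four blocks' worth of unaligned memory yields only three aligned blocks plus two leftover fragments, which is why allocating "very big blocks" at once keeps the waste proportionally small.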

Each runtime instance has at least one vat, acting as the heap, which allocates only aligned memory blocks. For example, each thread would have its own dedicated vat, which is single-threaded, so no races exist between threads. (All dataflow between threads is copied from one vat to another, mediated by thread-safe queues in the runtime.) Each vat allocates only large uniform blocks from the env, perhaps several at once for efficiency, where every block is 1MiB in size, or 2^20 bytes, and aligned to a 1MiB address. (The standard block size could be another power of two instead of 1MiB, but I will speak as if it could never change.) Since there will be fragmentation, this design only makes sense when allocating a lot of total memory, on the order of gigabytes. So you would not want this runtime on a system with limited resources.

Each vat describes its address space as a book, a set of 1MiB pages, where each 2^20-aligned block of 2^20 bytes is called a bp for book page -- or just page for short when book is understood. The main purpose of each bp is usually sub-allocation of smaller aligned memory blocks to satisfy vat allocation requests, but a garbage-collected vat would use its book any way it liked (an address space with page granularity). A vat used for C style allocation would return blocks that are powers-of-two in size, each aligned to the same power-of-two bytes, as small as 16 bytes and as large as, say, 256K. You can cheaply test any address for membership in a vat by seeing if the aligned bp address is in the book hashmap of bp addresses. Each allocated aligned block is called a rod, short for refcounted, because each one is reference counted by metainfo at the start of each book page.
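The membership test and the power-of-two size classes can be sketched as below, again in Python with integer addresses. The `Vat` class, `page_of`, and `rod_size` are hypothetical names, and the 256K ceiling is the "say, 256K" figure from the text rather than a fixed rule.

```python
PAGE_BITS = 20
PAGE_SIZE = 1 << PAGE_BITS          # 1 MiB book page
MIN_ROD, MAX_ROD = 16, 256 * 1024   # smallest and largest rod sizes in this sketch

def page_of(addr):
    """Address of the aligned book page containing addr."""
    return addr & ~(PAGE_SIZE - 1)

class Vat:
    def __init__(self):
        self.book = set()           # stands in for the book hashmap of bp addresses

    def add_page(self, bp_addr):
        assert bp_addr % PAGE_SIZE == 0, "book pages must be 2^20-aligned"
        self.book.add(bp_addr)

    def owns(self, addr):
        """Cheap membership test: mask down to the page, then one hash lookup."""
        return page_of(addr) in self.book

def rod_size(nbytes):
    """Round a request up to the power-of-two rod size that holds it."""
    if nbytes > MAX_ROD:
        raise ValueError("request too large for a rod in this sketch")
    size = MIN_ROD
    while size < nbytes:
        size <<= 1
    return size
```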

Each bp begins with a standard bp header, describing page format and detail common to the page, followed by as many rh rod headers as needed to describe each rod sub-block in the rest of the page. (A garbage collected vat would have only the bp header for each bp.) Each rod head is 16 bytes in size and 16-byte aligned, with a 32-bit refcount, a 32-bit checksum, and several descriptive fields, including a 16-bit generation number and a pair of 16-bit ids characterizing format and optional allocation context (naming a backtrace for memory usage analysis). Starting from the address of a rod -- some block inside a bp -- getting its refcount touches two cache lines in the general case, one for the bp head and one for the rh. Except for very small rods, this is only one cache line worse than co-locating the refcount inside the block itself. And that would only happen when the refcount was actually needed. (In my model, refcount is NOT linear in aliases; it is only linear in official handles which other aliases can borrow when their scope is clearly inside the handle scope.)
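To check that the named fields fit in 16 bytes, here is one possible packing using Python's struct module. The field order is an assumption, and `flags` is a hypothetical stand-in for whatever the remaining descriptive bits turn out to be; the post fixes only the widths of the named fields.

```python
import struct

# Assumed layout, little-endian with no padding:
#   u32 refcount, u32 checksum, u16 generation,
#   u16 format id, u16 allocation-context id, u16 flags (hypothetical)
RH_FORMAT = "<IIHHHH"
RH_SIZE = struct.calcsize(RH_FORMAT)

def pack_rh(refcount, checksum, generation, format_id, context_id, flags=0):
    """Serialize one rod header into its 16-byte form."""
    return struct.pack(RH_FORMAT, refcount, checksum, generation,
                       format_id, context_id, flags)

def unpack_rh(raw):
    """Inverse of pack_rh: returns the six fields as a tuple."""
    return struct.unpack(RH_FORMAT, raw)
```

The named fields take 14 bytes, so two bytes are left over for the other descriptive fields in this layout.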

The purpose of a checksum in the rh for each rod is dynamic support for immutable values. When a rod is frozen to immutability, so it cannot be changed again before deallocation, the checksum must still be unchanged when the last reference goes away. You might only audit this in debug mode, but you can ask a rod if it is immutable, and enforce absence of mutation by accident.
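A minimal sketch of the freeze-then-audit idea follows. CRC-32 from zlib is used only because it is a convenient 32-bit checksum; the post does not say which checksum is intended, and the `Rod` class here is a toy stand-in rather than the real header layout.

```python
import zlib

class Rod:
    def __init__(self, data):
        self.data = bytearray(data)
        self.frozen = False
        self.checksum = 0

    def freeze(self):
        """Record a checksum of the contents and mark the rod immutable."""
        self.checksum = zlib.crc32(self.data)
        self.frozen = True

    def audit(self):
        """True when a frozen rod still matches the checksum taken at freeze time.

        Unfrozen rods always pass, since they are allowed to change.
        """
        return (not self.frozen) or zlib.crc32(self.data) == self.checksum
```

An audit at last release (or at any debug checkpoint) then reports a frozen rod that was scribbled on by accident.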

Now here's the interesting part: each bp header has a pointer to the parent vat, and to the env. So the cost of those pointers is amortized across all the rods in the same page. In effect, every object allocated from a vat is a subclass of rod, and every rod has a pointer to the allocating vat and to the runtime env interface at zero space cost. It becomes unnecessary to pass vat and env as separate function arguments, since they are always available from any object in a vat. You can also add a handle reference to any sub-object inside a rod, because you can find the associated rh for any address in the page that is part of the rod space. So you can hold references to sub-fields which keep the parent object alive, without any problem, as long as you avoid accidental fragmentation cost (keeping something larger alive only to use a small sub-field).
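Here is a sketch of going from an arbitrary interior address back to its rod header, vat, and env. It assumes each page holds rods of a single size and that the bp header is 64 bytes; both are assumptions made for the example, as are the names `BookPage` and `find`.

```python
PAGE_SIZE = 1 << 20
BP_HEADER = 64   # assumed size of the bp header
RH_SIZE = 16     # rod header size from the post

class BookPage:
    def __init__(self, base, rod_size, vat, env):
        assert base % PAGE_SIZE == 0
        self.base, self.rod_size, self.vat, self.env = base, rod_size, vat, env
        # Find how many rods fit after the bp header and the rh table,
        # with the first rod aligned to the rod size.
        n = (PAGE_SIZE - BP_HEADER) // (rod_size + RH_SIZE)
        while True:
            first = -(-(BP_HEADER + n * RH_SIZE) // rod_size) * rod_size
            if first + n * rod_size <= PAGE_SIZE:
                break
            n -= 1
        self.count, self.first_rod = n, first

    def rod_index(self, addr):
        """Index of the rod containing addr, which may point at a sub-field."""
        off = addr - self.base - self.first_rod
        if not 0 <= off < self.count * self.rod_size:
            raise ValueError("address is not inside rod space")
        return off // self.rod_size

    def rh_address(self, index):
        """Address of the rod header describing rod number `index`."""
        return self.base + BP_HEADER + index * RH_SIZE

def find(pages, addr):
    """Map any address to (page, rod index) via the page-aligned base."""
    page = pages[addr & ~(PAGE_SIZE - 1)]
    return page, page.rod_index(addr)
```

Since the page object carries `vat` and `env`, any address inside a rod reaches both without extra per-object pointers, which is the amortization described above.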

What about details related to programming languages?

My overall objective is to write tools converting content into different forms, including source code rewrite transformations. (I tell my sons that most of programming is turning inputs into different outputs.) But that is little affected by memory management per se. Except the i/o model of embedded languages can stream refcounted immutable rods between agents in a process pipeline. That helps get operating system style features into programming language semantics, where a language can also have a process model.

I expect to write a garbage collected Lisp using a vat as the heap, which allocates new bp instances for stop-and-copy before releasing old instances. The uniform use of standard book pages makes re-use efficient without fragmentation across multiple vats.

For bootstrapping, I probably want a simpler and dumber Lisp first, just to perform operations on S-expressions used to express whatever high level policies are used by the system in configuration, or in manifests describing large scale transformations. This can be done with a primitive AST interpreter over a refcounted tree of nodes allocated by C-oriented code. Graph cycles would be pathological unless manually handled or automatically detected. But it's not worse than other C manual memory schemes.

One field in each rh with metainfo describing each rod is the ID of a schema descriptor: the plan of any struct inside the rod. Among other things, it would exactly describe the location of embedded handles in the rod, so reference releasing can be automated. The idea is to be able to release all resources when a lightweight process is killed, without requiring as much manual intervention by a user. A secondary purpose is to make releasing large graphs incremental and uniform, managed by a background rod releaser, which incrementally walks a graph using the rod metainfo as a guide.
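The plan-guided, incremental release can be sketched with a work queue and a per-step budget. Everything here is illustrative: `PLANS` maps a hypothetical plan name to the fields that hold handles, and `release_step` plays the role of the background rod releaser.

```python
from collections import deque

# Hypothetical plans: for each plan ID, the names of fields holding handles.
PLANS = {"pair": ("left", "right"), "leaf": ()}

class Rod:
    def __init__(self, plan, **fields):
        self.plan, self.fields = plan, fields
        self.refcount, self.freed = 0, False

def acquire(rod):
    rod.refcount += 1
    return rod

def release(rod, pending):
    """Drop one reference; queue the rod for teardown when none remain."""
    rod.refcount -= 1
    if rod.refcount == 0:
        pending.append(rod)

def release_step(pending, budget):
    """Tear down at most `budget` queued rods; return how many were freed."""
    done = 0
    while pending and done < budget:
        rod = pending.popleft()
        for name in PLANS[rod.plan]:       # the plan says where the handles are
            child = rod.fields.get(name)
            if child is not None:
                release(child, pending)
        rod.freed = True
        done += 1
    return done
```

Because each step frees a bounded number of rods, dropping the root of a large graph costs the caller one queue append, and the rest is spread over later steps.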

A second purpose in putting a plan ID in the rh for each rod is generic debug printing support: I want to be able to print everything. But manually writing a print method for everything is a pain. I would rather write a standard metainfo description of everything, especially since this can be emitted from a source code analysis tool. Then a single generic printer util can process this standard format guide when walking graphs. (Note: to stop huge graphs from being printed, a field in a plan can be marked as best cited briefly, rather than printed in full, and this also helps prevent following cycles due to back pointers.)
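A generic printer driven by plans might look roughly like this. Objects are modeled as dicts carrying a `plan` name and an `id`, and each plan entry marks a field as printed in full or cited briefly; that encoding is invented for the sketch.

```python
# Hypothetical plan format: (field name, "full" or "brief") pairs.
PLANS = {
    "node": [("name", "full"), ("child", "full"), ("parent", "brief")],
}

def show(obj, plans=PLANS):
    """Print any object by walking its plan instead of a hand-written method."""
    if not isinstance(obj, dict):
        return repr(obj)
    parts = []
    for field, mode in plans[obj["plan"]]:
        value = obj.get(field)
        if mode == "brief" and isinstance(value, dict):
            # Cite the referenced object without descending into it.
            parts.append(f"{field}=<{value['plan']}#{value['id']}>")
        else:
            parts.append(f"{field}={show(value, plans)}")
    return f"({obj['plan']}#{obj['id']} " + " ".join(parts) + ")"
```

Marking the back pointer `parent` as brief is what keeps the walk from looping between a node and its child.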

Note I am not describing handles, which are like smart pointers, as they little affect memory management and PL topics. But each rod refcount comes from an alias in a handle, which associates metainfo with the rod pointer, including a copy of the plan ID and generation number, which must match when the ref is released. (That detects failures to keep refcounted objects alive as expected.) Refcounting is not manual. It occurs as a side effect of aliasing via handles. When auditing is enabled, each handle can hold the ID of a backtrace that added a reference, and this finds refcount leaks. On a 64-bit system, the size of a handle is twice the size of a pointer: the metainfo is pointer-sized in total. This requires that a library provide standard collection classes that work with handles too, when ownership of objects is expressed as collection membership. Each handle includes a bitflag field which, among other things, allows a reference to be denoted readonly, even if the rod is mutable, for copy-on-write data structures like btrees that share structure. (This is useful for transient collections expressing "the original collection plus any patches that might occur" without altering the original collection.)
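As a rough model of the handle checks described above: a handle copies the plan ID and generation when it takes its reference, refuses writes when flagged readonly, and compares its copies against the rod on release. The class shapes and the `READONLY` bit value are assumptions for the sketch.

```python
READONLY = 0x1  # hypothetical flag bit

class Rod:
    def __init__(self, plan_id, value=None):
        self.plan_id, self.generation = plan_id, 1
        self.refcount, self.value = 0, value

class Handle:
    def __init__(self, rod, flags=0):
        rod.refcount += 1                  # refcounting is a side effect of aliasing
        self.rod, self.flags = rod, flags
        self.plan_id, self.generation = rod.plan_id, rod.generation

    def write(self, value):
        if self.flags & READONLY:
            raise PermissionError("readonly handle")
        self.rod.value = value

    def release(self):
        # A mismatch means the rod was recycled while this handle still existed.
        if (self.plan_id, self.generation) != (self.rod.plan_id, self.rod.generation):
            raise RuntimeError("handle does not match rod")
        self.rod.refcount -= 1
        self.rod = None
```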

Controlling Reductions

Haskell has an option for user defined reductions. I have no idea how it works.

Felix also has two ways to do this.

The first way is complete but hard to use: the parser accepts user defined patterns in the form of grammar productions, and then the user writes Scheme code to manipulate the parse tree to produce any combination of a fixed set of AST terms they like.

The second way is simpler but typed, and looks roughly like this:

reduce strings 
  | (a:string,b:string) : a + b=> scat ([b,a]) ;
reduce scat2 
  | (x:list[string], y:string) : scat ([y, scat x]) => scat (Snoc (x,y)) ;

It's polymorphic, though that's not seen in this example. The vertical bar indicates you can provide several alternatives, which are tried in sequence. The algorithm applies each reduction, and each is applied top down. The above example doesn't work! The intent was to replace string concatenation using a left associative binary operator with a faster function scat, operating on a list. Clearly I needed the algo to work bottom up.
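To see why traversal order matters, here is a toy model in Python (not Felix, and not how the Felix compiler is implemented). It uses two simplified rules in the spirit of the ones above: `add(a, b) => scat [a, b]` and `scat [scat xs, y] => scat (xs ++ [y])`. Terms are tuples and variables are plain strings.

```python
def rewrite_node(t):
    """Apply the two rules at the root of t until neither matches."""
    while isinstance(t, tuple):
        if t[0] == "add":                          # a + b  =>  scat [a, b]
            t = ("scat", [t[1], t[2]])
        elif (t[0] == "scat" and len(t[1]) == 2
              and isinstance(t[1][0], tuple) and t[1][0][0] == "scat"):
            t = ("scat", t[1][0][1] + [t[1][1]])   # scat [scat xs, y] => scat (xs ++ [y])
        else:
            break
    return t

def children_map(f, t):
    """Rebuild t with f applied to each immediate sub-term."""
    if not isinstance(t, tuple):
        return t
    if t[0] == "add":
        return ("add", f(t[1]), f(t[2]))
    return ("scat", [f(x) for x in t[1]])

def top_down(t):
    # Rewrite the node first, then descend: the parent never sees rewritten children.
    return children_map(top_down, rewrite_node(t))

def bottom_up(t):
    # Rewrite children first, so the flattening rule can fire at the parent.
    return rewrite_node(children_map(bottom_up, t))
```

For `(a + b) + c`, the top-down pass stops at `scat [scat [a, b], c]`, while the bottom-up pass reaches the flat `scat [a, b, c]`.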

The contrast between these two methods is that the first method has a complete general purpose programming language available to perform the reductions, but the pattern matching uses a fixed algorithm (the GLR+ parser). The second method recognises terms and performs the reductions using a fixed algorithm, so it is much weaker.

What I want is something less complex than having to hand write the whole reduction machinery in a general purpose language (I could do that, using dynamically compiled and loaded OCaml modules, but it's too hard for end users).

So I'm looking for a *compromise* which is reasonably simple, but also reasonably capable. I added alternatives for this reason and am thinking to add an adjective which specifies if a reduction should be applied top down or bottom up. A more complex option is to allow for nested reductions (so that if one reduction succeeds it triggers another one).

Any ideas?

CFL parsing, and another way to look at the CNF...

I'll implement it and try to break it when I get a chance, but it seems the below should exhibit O(|G| |U|^2 n^2 log n) worst case behavior (in time), where |G| is the total size of the grammar, |U| is the number of distinct nonterminals in the subset of unary productions U -> w in G, and n the input's length.

Or, what am I missing?

We consider only unambiguous CFGs in Chomsky Normal Form (CNF).

We devise a sort of "type system" over rewritings of the input as we process it from right to left, from end to beginning.

We'll look at the algorithm's work environment as being made of a left context, say Lc, a current symbol, say w (talking about a terminal) and a right context, say Rc.

The left context (Lc) is always a (length-decreasing*) prefix string w_0...w_i...w of terminals of the actual input (equal to the entire input s = w_0...w_{n-1} initially, and empty at the end, when no more production type rewritings can be applied).

(* albeit, non-strictly)

The right context (Rc) is a mutating list of computed S-expressions storing what has been consumed and "typed" by reading the input from right to left, from end to beginning.

In this scheme, it is constructed so as to be, also, the recorded value of the rightmost-derivative of the parse tree, at the "point" Lc \ w, before Lc gets empty and reduction to the grammar start symbol (if possible) happens, eventually finalizing Rc as the single S-expr (S w_0...w_{n-1}) (or failing to do so).

Initially, Rc is empty, and Lc equals the entire input.

In the CNF grammar, we treat the two distinct types of productions thusly:

Productions of the form U -> w (aka rule R1)

Rule R1: U -> w  will be said to "type" w between Lc and Rc as... w: U ("the utterance of w is of type U between any Lc and Rc")

that is, whenever Lc ends with w


Rewriting via R1

Lc Rc ~> (Lc \ w) (U w) Rc
 new Lc> \------/ \------/ <new Rc

(takes a list of S-exprs as Rc and prepends the new S-expr (U w) to it, making a new Rc longer by one S-expr, while Lc gets shorter by one terminal, w)

Productions of the form U -> P R (aka rule R2)

Rule R2: U -> P R  will be said to "type" P followed by R in Rc as... P: R -> U ("the utterance of P applied to Rc typed as R promotes Rc to U")

that is, whenever head(Rc) = (P x) and head(tail(Rc)) = (R y)


Rewriting via R2

Lc Rc ~> Lc (U head(Rc) head(tail(Rc))) tail(tail(Rc))
            \----------------------------------------/ <new Rc

(takes a list of S-exprs as Rc and turns it into a new Rc shorter by one S-expr, Lc left unchanged)

Notice the one to one correspondence between mutually ambiguous grammar productions and mutually unsound production types, btw:

E.g., having, in the grammar, the mutually ambiguous

A -> BC
D -> BC

would yield, in this "type system", attempts at applying the mutually unsound (if only signature-wise)

B: C -> A
B: C -> D

Which would obviously be intrinsically problematic from either / both perspectives, and no matter what the input may be, as soon as said input will contain utterances of (a substring already reduced to) B, followed by (a substring already reduced to) C.
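Such clashes can be found by inspecting the grammar alone, before any input is read. A small sketch, with the hypothetical helper `conflicting_rules` taking binary productions as `(lhs, (P, R))` pairs:

```python
from collections import defaultdict

def conflicting_rules(productions):
    """Group binary productions by right-hand side and return every
    right-hand side that is claimed by more than one left-hand side."""
    by_rhs = defaultdict(set)
    for lhs, rhs in productions:
        by_rhs[rhs].add(lhs)
    return {rhs: sorted(lhss) for rhs, lhss in by_rhs.items() if len(lhss) > 1}
```

For the pair above, `conflicting_rules([("A", ("B", "C")), ("D", ("B", "C"))])` reports that `("B", "C")` is claimed by both `A` and `D`.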


Now for a concrete example (full parse / derivation); consider the grammar in CNF (informal meaning given in parentheses):

(nonterminal-only RHS)

S  -> NP  VP  ("a noun phrase followed by a verb phrase reduces as a sentence")
VP -> Vt  NP  ("a transitive verb followed by a noun phrase reduces as a verb phrase")
NP -> Det N   ("a determinative followed by a noun reduces as a noun phrase")
N  -> Adj N   ("an adjective followed by a noun reduces as a noun")


(terminal-only RHS)

Det-> a       ("'a' reduces as a determinative")
Det-> the     ("'the' reduces as a determinative")
Adj-> young   ("'young' reduces as an adjective")
N  -> boy     ("'boy' reduces as a noun")
N  -> dragon  ("'dragon' reduces as a noun")
Vt -> saw     ("'saw' reduces as a transitive verb")

Which we'll use to "type" thru rewritings of S-exprs in Rc, as:

NP : VP -> S  ("the utterance of 'NP' applied to Rc typed as 'VP' promotes Rc to 'S'")
Vt : NP -> VP ("the utterance of 'Vt' applied to Rc typed as 'NP' promotes Rc to 'VP'")
Det: N  -> NP ("the utterance of 'Det' applied to Rc typed as 'N' promotes Rc to 'NP'")
Adj: N  -> N  ("the utterance of 'Adj' applied to Rc typed as 'N' promotes Rc to 'N'")


a     : Det   ("the utterance of 'a' is of type 'Det' between any Lc and Rc")  
the   : Det   ("the utterance of 'the' is of type 'Det' between any Lc and Rc")
young : Adj   ("the utterance of 'young' is of type 'Adj' between any Lc and Rc")
boy   : N     ("the utterance of 'boy' is of type 'N' between any Lc and Rc")
dragon: N     ("the utterance of 'dragon' is of type 'N' between any Lc and Rc")
saw   : Vt    ("the utterance of 'saw' is of type 'Vt' between any Lc and Rc")

After iteration 0 (init)

   (the          young          boy         saw         a         dragon)
Lc>\--------------------------------------------------------------------/ ( ) <initial Rc

(NB: Rc = empty list)

After iteration 1, via R1

   (the          young          boy         saw         a)        dragon
Lc>\-----------------------------------------------------/       (N dragon)1
                                                                 \--------/ <new Rc
(NB: Rc = list of one root S-expression)

After iteration 2, via R1

   (the          young          boy         saw)        a          dragon
Lc>\-------------------------------------------/       (Det a)1   (N dragon)2
                                                       \-------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 3, via R2

   (the          young          boy         saw)        a         dragon
Lc>\-------------------------------------------/   (NP (Det a)   (N dragon))1
                                                   \-----------------------/ <new Rc
(NB: Rc = list of one root S-expression)

After iteration 4, via R1

   (the          young          boy)        saw          a         dragon
Lc>\-------------------------------/       (Vt saw)1(NP (Det a)   (N dragon))2
                                           \--------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 5, via R2

   (the          young          boy)        saw         a         dragon
    the          young          boy    (VP (Vt saw)(NP (Det a)   (N dragon)))1
                                       \------------------------------------/ <new Rc

(NB: Rc = list of one root S-expression)

After iteration 6, via R1

   (the          young)         boy          saw         a         dragon
    the          young       (N boy)1   (VP (Vt saw)(NP (Det a)   (N dragon)))2
                             \-----------------------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 7, via R1

   (the)         young           boy          saw         a         dragon
    the         (Adj young)1  (N boy)2   (VP (Vt saw)(NP (Det a)   (N dragon)))3
                \-------------------------------------------------------------/ <new Rc

(NB: Rc = list of three root S-expressions)

After iteration 8, via R2

   (the)         young          boy          saw         a         dragon
    the      (N (Adj young)  (N boy))1  (VP (Vt saw)(NP (Det a)   (N dragon)))2
             \---------------------------------------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 9, via R1 (final Lc is empty)

    the           young          boy          saw         a         dragon
   (Det the)1 (N (Adj young)  (N boy))2  (VP (Vt saw)(NP (Det a)   (N dragon)))3
   \--------------------------------------------------------------------------/ <new Rc

(NB: Rc = list of three root S-expressions)

After iteration 10, via R2

    the          young          boy          saw         a         dragon
(NP(Det the) (N (Adj young)  (N boy)))1 (VP (Vt saw)(NP (Det a)   (N dragon)))2
\----------------------------------------------------------------------------/ <new Rc

(NB: Rc = list of two root S-expressions)

After iteration 11, via R2

       the          young          boy         saw         a         dragon
(S (NP(Det the) (N (Adj young)  (N boy))) (VP (Vt saw)(NP (Det a)   (N dragon))))1
\-------------------------------------------------------------------------------/ <new Rc

(NB: Rc = list of one root S-expression)

Success, with the expected parse binary tree:

       the          young          boy         saw         a         dragon
       Det          Adj            N           Vt          Det       N
                            N                                   NP
               NP                                     VP
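The eleven iterations above can be replayed with a short Python sketch. It makes two assumptions that go beyond the description: each terminal has exactly one type (so R1 is a dictionary lookup), and R2 is applied greedily as soon as the two leading S-exprs of Rc match a production. That is enough for this grammar, but it is not a general parser for all unambiguous CNF grammars and says nothing about the complexity bound.

```python
# R1 table: terminal -> its type. R2 table: (P, R) -> U for each U -> P R.
UNARY = {"a": "Det", "the": "Det", "young": "Adj",
         "boy": "N", "dragon": "N", "saw": "Vt"}
BINARY = {("NP", "VP"): "S", ("Vt", "NP"): "VP",
          ("Det", "N"): "NP", ("Adj", "N"): "N"}

def parse(words, start="S"):
    """Return (tree, steps); tree is None when the input is rejected."""
    lc = list(words)   # left context: shrinks from the right
    rc = []            # right context: list of S-exprs, head at index 0
    steps = 0
    while lc:
        w = lc.pop()
        if w not in UNARY:
            return None, steps
        rc.insert(0, (UNARY[w], w))                  # rewriting via R1
        steps += 1
        while len(rc) >= 2 and (rc[0][0], rc[1][0]) in BINARY:
            u = BINARY[(rc[0][0], rc[1][0])]
            rc[:2] = [(u, rc[0], rc[1])]             # rewriting via R2
            steps += 1
    if len(rc) == 1 and rc[0][0] == start:
        return rc[0], steps
    return None, steps
```

Calling `parse("the young boy saw a dragon".split())` takes 11 steps and ends with a single S-expr rooted at `S`, matching the trace.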

Disclaimer edit:

This is just genuine amateur research from a practitioner who's never been in academia (that would be me). I've obviously posted it here on a friendly LtU only because I'm (still) wondering if I'm on to something, or what I am missing. Bear with me (and thank you in advance).

Refining Structured Type System

This is my first more serious paper on Structured Type System.

Every theory needs some syntactic form to express its elements, so the road to a theory about theories leads through syntax definition. Structured Type System therefore provides, first of all, a flexible generalized text parser that builds internal abstract syntax trees (ASTs) from input data. The other aspect of a theory about theories inevitably covers the meaning of input data. This is called semantics, and this is where Structured Type System makes it possible to define deeper connections between syntactic elements of ASTs. For this purpose, Structured Type System uses a kind of function known from the functional programming paradigm. These functions can process any data corpus, whether it involves natural or artificial language translation, which in turn happens to be just enough to run a task of any complexity that analyzes existing data and calculates new data from an input.

In short, we use BNF-ish grammars as types for function parameters and function results. Some nice constructions can be made by combining grammars and functions. One of the most important properties of Structured Type System is its ability to extend grammars from outside the grammar definitions, all based on function result types. It is fairly simple: wherever a certain type of expression is expected, a grammar that results in the same type can be used, and that gives syntax extensibility. Conveniently, we can combine grammar definitions and their inputs in the same source code file.

I was hoping to get some feedback and criticism from this community before attempting to get more publicity for the paper. This is an important milestone to me and I want to thank you all for being such an inspirational community during my research.

Cool stuff from recent conferences

I heard of some good stuff. How about someone who was there post the headline worthy papers?

Céu: Structured Synchronous Reactive Programming (SSRP)

Céu is an Esterel-based synchronous language.

It appeared on LtU in the past in an announcement of the "SPLASH: Future of Programming Workshop" program.

In this new public version, we are trying to surpass the academic fences with a more polished work (docs, build, etc).

In summary:

  • Reactive: code executes in reactions to events
  • Synchronous: reactions run to completion in discrete logical units of time (there's no implicit preemption nor real parallelism)
  • Structured: programs use structured/imperative control mechanisms, such as "await" and "par" (to combine multiple awaiting lines of execution)

Structured programming avoids deep nesting of callbacks, letting programmers code in direct/sequential/imperative style. In addition, when a line of execution is aborted, all allocated resources are safely released.

The synchronous model leads to deterministic execution and simpler reasoning, since it does not demand explicit synchronization from the programmer (e.g., locks and queues). It is also lightweight enough to fit constrained embedded systems.

We promote SSRP as a complement to classical structured/imperative programming, much as FRP now is to functional programming.

Archaeological dig to find the first Lisp example of the Y-combinator

I'm trying to find the first Lisp examples of the Y-combinator. Beyond that I am also trying to find the first time the Y-combinator was demonstrated using the factorial function and the mutually recursive definition of odd/even.

What works should I be looking at? The first Scheme paper references fixed-point combinators at page 16 and also shows the familiar LISP definition of the factorial function. But, it does not express the factorial function using a fixed-point operator.

What would a modern imperative language look like? All the love here is for functional only..

After reading a lot about compilers and languages, I see that most research, if not all, is about functional languages and complex type systems.

Now that I'm toying with building one, I see that this has biased my language toward being functional, yet the truth is that I'm more of an imperative guy.

So, I wonder what is new or forgotten in the world of imperative or non-functional languages, the more "mainstream" ones. Are Go/Rust/Swift all there is?

If I wanted to build a language (more mainstream, imperative, etc.) with the wisdom of today, how would it look? Has it already been made? Maybe Ada or similar?

I will probably switch mine to "const by default, variable optional" and use AGDT and the match clause, but I can't think what else...

Inference of Polymorphic Recursion

In the following (Haskell) example, the type annotation on f is required:

f :: a -> (Int, a)
f x = (g True, x)

g True = 0
g False = fst (f 'a') + fst (f 0)

main = do
    print (fst (f True))

I can understand why in general, but I wonder if we could just decide to generalize arbitrarily in the order that declarations appear, so that in this case the type of f would be inferred, but if you switched the definition order you'd get a type error. When f is generalized, g would be constrained to Bool -> b, where b would be unified after generalization. Is this something that might work (but isn't done because it's arbitrary and makes definition order matter), or are there hard cases I need to consider?


Generic overload resolution

Kitten has ad-hoc static polymorphism in the form of traits. You can declare a trait with a polymorphic type signature, then define instances with specialisations of that signature:

// Semigroup operation
trait + <T> (T, T -> T)

instance + (Int32, Int32 -> Int32) {

instance + (Int64, Int64 -> Int64) {

This is checked with the standard “generic instance” subtyping relation, under which Int32, Int32 -> Int32 is a generic instance of <T> (T, T -> T). But the current compiler assumes that specialisations are fully saturated: if it infers that a particular call to + has type Int32, Int32 -> Int32, then it emits a direct call to the (mangled) name of the instance. I’d like to remove that assumption and allow instances to be generic, that is, partially specialised:

// List concatenation
instance + <T> (List<T>, List<T> -> List<T>) {

// #1: Map union
instance + <K, V> (Map<K, V>, Map<K, V> -> Map<K, V>) {

// #2: A more efficient implementation when the keys are strings
instance + <V> (Map<Text, V>, Map<Text, V> -> Map<Text, V>) {

But this raises a problem: I want to select the most specific instance that matches a given inferred type. How exactly do you determine that?

That is, for Map<Text, Int32>, #1 and #2 are both valid, but #2 should be preferred because it’s more specific. There are also circumstances in which neither of two types is more specific: if we added an instance #3 for <K> (Map<K, Int32>, Map<K, Int32> -> Map<K, Int32>), then #2 and #3 would be equally good matches, so the programmer would have to resolve the ambiguity with a type signature.
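One common way to make "most specific" precise is to order instances by one-way matching: instance A is at least as specific as instance B when B's signature, with its type variables filled in, can produce A's signature. The Python sketch below models only the shared argument type rather than the whole function signature, writes type variables as strings starting with `?`, and uses made-up names throughout; it is not Kitten's implementation.

```python
def match(pattern, ty, subst=None):
    """One-way matching: bind variables of `pattern` so it equals `ty`.

    Variables appearing in `ty` are treated as opaque constants.
    Returns the substitution, or None when no match exists.
    """
    subst = dict(subst or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:
            return subst if subst[pattern] == ty else None
        subst[pattern] = ty
        return subst
    if isinstance(pattern, str) or isinstance(ty, str):
        return subst if pattern == ty else None
    if len(pattern) != len(ty):
        return None
    for p, t in zip(pattern, ty):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def most_specific(instances, ty):
    """Names of matching instances that are instances of every other match.

    One name means a unique best instance; an empty list means either no
    instance matches or the matches are incomparable (ambiguous).
    """
    found = {n: p for n, p in instances.items() if match(p, ty) is not None}
    return [n for n, p in found.items()
            if all(match(q, p) is not None for q in found.values())]
```

With #1 as `("Map", "?K", "?V")` and #2 as `("Map", "Text", "?V")`, the type `("Map", "Text", "Int32")` selects #2; adding #3 as `("Map", "?K", "Int32")` leaves no single winner, which is the ambiguity described above.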
