WebAssembly

Finally, my prayers are being heard. The question was "get us web-enabled bytecode", and the answer came from above in the form of a W3C group effort: WebAssembly.

Though it encapsulates low-level jump commands into structures controlled by labels (I'd have liked it to stay at the lower level for maximum freedom), it is more or less what I'd expected from a bytecode that could be run inside a browser. It is enough of an assembler-like language to leave a lot of room for implementing different programming language architectures, while retaining the speed of native implementations. Maybe it is overkill for a bytecode language to deal with functions automatically (see the "call" command), but it can't hurt: we don't have to use it, and we can maintain our own call stack if we want to replace the existing one.

It is supposed to have a contiguous memory handler that indexes space from 0 upward. This memory can even grow on the fly if we want it to. This is just what I need: encapsulation of memory management.

If this is what I think it is (a nearly 1:1 mapping from bytecode to machine code upon compilation), I can say only one thing: "thank you, world :)"


The call stack

Providing a call stack for local variables and return addresses keeps them out of the linear memory, thus eliminating a whole class of errors and attacks. You can bypass it, but you're better off working with it.

It has to be safe

You can't easily "replace" the call stack, as there is no way to reify return addresses (*). However, you can implement a shadow stack for your local variables in linear memory if you need to.
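
To make that concrete, here is a rough C-level sketch of the kind of shadow stack a compiler targeting Wasm might emit for locals whose address must exist (the names shadow_stack and ssp are invented for the example; plain Wasm locals themselves are not addressable):

#include <stdint.h>
#include <stdio.h>

/* A shadow stack carved out of linear memory for address-taken locals.
   Ordinary Wasm locals stay outside linear memory and cannot be corrupted. */
static _Alignas(16) uint8_t shadow_stack[1 << 16];
static uint32_t ssp = sizeof shadow_stack;      /* grows downward */

static int32_t square_via_pointer(int32_t x) {
    uint32_t frame = ssp - sizeof(int32_t);     /* reserve one spilled slot */
    ssp = frame;
    int32_t *slot = (int32_t *)&shadow_stack[frame];
    *slot = x;                                  /* the "address-taken" local */
    int32_t result = *slot * *slot;
    ssp = frame + sizeof(int32_t);              /* pop the frame on return */
    return result;
}

int main(void) {
    printf("%d\n", square_via_pointer(7));      /* prints 49 */
    return 0;
}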

In fact, WebAssembly does not currently expose any notion of built-in "stack" at all. Like a number of other design decisions, that's important: because Wasm is running on the Web, it has to be "safe". That's why it's typed, has no undefined behaviour, code is separated from memory (and not addressable), casts over function types are checked, etc. Having an explicit concept of function also enables other benefits, such as effective register allocation, parallel or lazy compilation, useful debug tools, etc.

Control-flow indeed is semi-structured: branches are really only breaks and continues, so that you cannot construct irreducible control flow. That's a feature as well.

WebAssembly's linear memory doesn't provide "memory management" per se. Its main motivation is to enable the compilation of low-level languages that have address arithmetic but do their own memory management. For more high-level languages, we hope to add real managed memory and GC support post version 1.0.

(*) You could presumably emulate explicit return addresses with indices into a giant jump table, by compiling your whole program into a single looping function. But that would probably be pretty slow and defeat much of the WebAssembly ecosystem.
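
To picture that, here is a toy C sketch of the "whole program as one looping function" shape (the state names and the manual return stack are invented for illustration): every program point becomes a case, and a "return address" is just an integer state pushed onto a hand-rolled stack.

#include <stdio.h>

/* Program points are integers; "return addresses" are integers pushed
   onto a manual stack; the switch-in-a-loop is the only control flow. */
enum { MAIN_START, SUB_BODY, MAIN_AFTER_CALL, HALT };

int main(void) {
    int state = MAIN_START;
    int retstack[16];
    int sp = 0;

    while (state != HALT) {
        switch (state) {
        case MAIN_START:
            retstack[sp++] = MAIN_AFTER_CALL;   /* "call" the subroutine */
            state = SUB_BODY;
            break;
        case SUB_BODY:
            puts("inside the subroutine");
            state = retstack[--sp];             /* "return" */
            break;
        case MAIN_AFTER_CALL:
            puts("back in the caller");
            state = HALT;
            break;
        }
    }
    return 0;
}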

(*)

For optimized emulated jumps, you can structure the giant jump table as a balanced binary tree, but that's just not the same thing.

Anyway, the more I understand it, the less I like it. What I really want is complete freedom to design my own call and function stack. I get half of that with wasm (call is locked in). Yes, attacks are possible by exposing the call stack to users, but that should be handled some other way. They plan to expose the whole source code via a "view source" menu option anyway.

I also want a classic jmp command and the ability to change code at a given address on the fly, to implement my own JIT compilers or whatever comes to mind (say, jumping to an arithmetically calculated address).

In my opinion, it is bad design to force programmers into the frame of higher-order technologies. The S-expression design with branching is interesting, but it kills the spirit of inventiveness. S-expressions should be defined one level above the asm, with the ability to define something else if I want a change.

But, I shouldn't be ungrateful, I'll try to manage with what is offered.

The Web is an interesting space

...with various specific constraints. You want a format that is sufficiently high-level to be (1) safe, (2) deterministic, (3) compact, (4) platform-agnostic but efficient on all platforms, (5) fast to compile, (6) allows streaming, parallel and lazy compilation, (7) easy to debug, and a few more. You also want to make it an easy target, in order to be viable for a wide variety of language projects (ideally, it shouldn't require more work than targeting JavaScript).

One reason for directly supporting nested expression operands instead of just, say, 3-address code or SSA, is that it makes for much more compact binaries. That's hugely important on the web.

((*) Constant-time table jumps are directly supported, see the br_table operator.)

Suppose this design

Suppose this design:

1. You define the lowest-level asm language.
2. Then you build a layer above 1. for more compact arithmetic.
3. Then you define another layer above 1. for block break/continue branching.
4. Then you define another layer above 1. for functions with a call/variable stack.
5. Then you combine layers X, Y and Z to produce a new layer with new properties and performance.
...
N. You end up defining a layer of object-oriented constructs, functional programming, or something else, you name it (I know you can :)

When you want to compile layer N to layer 1, you compile it successively to the layers below it, one level at a time. What you get at the end is level 1 assembly.

When passing apps over the web, you can pass level 1 assembly directly (a big executable), but you could also pass higher levels (more compact) that the client machine JIT-compiles down to level 1.

Some layers can be marked safe for specific uses, some are dirty, some have this property, some have that property. When a client downloads an app, she gets warned about issues such as a layer's undesired properties, like lack of safety. This marking of layers could be done by respected organizations like the W3C, so the process of picking only safe apps comes down to the user saying: I trust the W3C; if they say it's safe, I believe it.

With WebAssembly, I'm starting from layer 5 or something. I want to start from level 1 for the most flexibility. The zeroth layer would be classic machine code, which can vary (Intel, AMD, ARM, ...) and lives in the client machine. The zeroth layer opens the question of how to optimize specific apps for specific architectures.

((*) How would I know in advance where I had JIT-compiled my code, if I had the opportunity to write, on the fly, the next instruction I would like to run?)

Edit: If I label every line of code, I can use a big jump table with br_table to jump anywhere. Thanks for the tip, it's good to know :)

Not a solution

That may be a plausible design in other spaces. But here it would only address 1 or 2 of the points I listed, while getting in the way of most of the others. In particular, on the web, you absolutely cannot make a criterion like safety optional -- warning the user is not a solution, since we already know too well how utterly fallible that approach is.

Safety should be easy to

Safety should be easy to implement. Code has access to its own address space; access to outer regions should raise an exception.
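
That is essentially the check a Wasm engine already performs on every linear-memory access; a minimal C sketch of the idea (types and names invented):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Every load/store is checked against the size of the instance's own
   linear memory; anything outside that region traps. */
typedef struct {
    uint8_t *base;
    size_t   size;
} linear_memory;

static uint8_t checked_load(const linear_memory *m, size_t addr) {
    if (addr >= m->size) {                   /* outside the sandbox */
        fputs("trap: out-of-bounds memory access\n", stderr);
        exit(1);
    }
    return m->base[addr];
}

int main(void) {
    static uint8_t page[65536];
    linear_memory mem = { page, sizeof page };
    printf("%u\n", checked_load(&mem, 100)); /* fine: prints 0 */
    checked_load(&mem, 1u << 20);            /* traps and exits */
    return 0;
}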

((*) And I think that you can "br" only to an enclosing label. Bummer.)

Jumps

You can nest blocks to achieve most forms of forward jump:

(block $1
  (block $2
    (block $3
      (br_table $1 $2 $3 (...))
    )
    ;; ..
  )
  ;; ...
)
;; ...

Another method

is available for simulating goto. Suppose you have a number of labels inside your code:

label1:
...
label2:
...
label3:
...


Now you turn those labels into functions with bodies of what's in between:

function label1() {...};
function label2() {...};
function label3() {...};


At the place from which you want to goto, you call the wanted function, followed by a return statement. You can hoist these calls (which the return statements then pass control back through) out of the functions and into a loop to avoid stack overflow.
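
Here is a minimal C sketch of one way to drive that (a trampoline; all names invented): each former label becomes a function that hands back the next label to run, and the outer loop performs the transfers, so the machine stack never grows.

#include <stdio.h>

/* Each former label is a function returning the next label to run.
   The driver loop performs the "goto", so call depth never grows. */
typedef enum { LABEL1, LABEL2, LABEL3, DONE } label;

static label label1(void) { puts("at label1"); return LABEL2; }
static label label2(void) { puts("at label2"); return LABEL3; }
static label label3(void) { puts("at label3"); return DONE; }

int main(void) {
    label next = LABEL1;
    while (next != DONE) {              /* the trampoline */
        switch (next) {
        case LABEL1: next = label1(); break;
        case LABEL2: next = label2(); break;
        case LABEL3: next = label3(); break;
        default:     next = DONE;     break;
        }
    }
    return 0;
}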

Safety

There's no such thing as safety, on the web or otherwise. There are only classes of assurances. Some code can be proven to terminate. That's nice, but it doesn't ensure correctness. Sound typing doesn't ensure correctness either, if the type system is weak (not expressive).

I would argue that subroutines are the worst model for the web because interaction is utterly unlike that, forcing programmers to pervert their models, which removes the significance of any safety assurances. What we actually have on the web is a significant set of *active* interacting processes. Videos play, sound plays, buttons get clicks, people scroll, timers run, actions are cancelled or replaced by others.

What we need is peer communications, not (just) master/slave relations.

Physical systems that do this exist of necessity irrespective of the underlying technology: have a look in an aircraft cockpit.

The bottom line, IMHO, is that what we REALLY need is better theories. Functional models aren't enough. The type systems are too weak to explain the distinction between evaluation strategies, and fail to account for effects. We have instead an ad-hoc collection of rules like: tail-call optimisation, mutable lists are fine if privately owned (linear typing), procedures nested in functions are fine if the effects don't escape. I know how to use such ad-hoc rules by examining each case, but I have no supporting general theoretical framework.

In Felix, pipelines of fibres are guaranteed (in some sense) to work. Self-feedback without buffers is guaranteed to fail. Ad-hoc rules. Where is my general theory? Where is the type system that enforces it?

whither?

(if 2 comments show up, i blame drupal.)

http://lambda-the-ultimate.org/node/4768 :-)

where are we today?

Call stack

Call stacks are evil. They hide a fixed model of continuation passing which proves useless in real-world programming. To get around such poor design, languages provide undelimited exceptions, which destroy any good properties the system may have had.

It would be much better if, in a low level language, the primary control transfer instruction was a neutral branch and link which allows explicitly passing a continuation and saving another one atomically. IMHO of course. Subroutines are useful but they're NOT good enough alone. A general control *exchange* subsumes jumps and subroutine calls. Oh, and in general of course that doesn't allow a linear stack.

Low level

I am not sure how a low-level language can work without a stack, as you would be requiring garbage collection in a low-level language, which seems far worse.

I think you are overstating the uselessness of a call stack in real-world programming. I have written many real-world programs that have never needed a continuation.

You've written programs that never use a subroutine?

I for one have never written a real-world program (i.e., more complicated than "hello world") that used no continuations. The minute you call a subroutine, there exists a continuation that subroutine must return to. Even in old-style stackless non-reentrant BASIC A, each routine knew what address it had to return to when it was done, and those return addresses were continuations.

From context, I think maybe you mean you've written lots of real-world programs where you never use anything more complicated than a stack to hold your continuations, in the same way lots of real-world BASIC A programs were written that didn't use anything more complicated than a per-routine return address?

The idea of exchanging continuations as a control structure is an interesting one, really. Pretty much all control-flow operations amount to passing continuations in one direction or the other; an exchange seems like a decent generalization.

Losing stacks.

If the price of that generalisation is losing stacks, it doesn't seem worth it to me. I suspect performance will be bad on all modern CPUs, which are designed to use a stack (sometimes multiple stacks).

Indeed, I wrote many assembly-language programs ...

... on the PDP-8, which has no (or minimal) hardware support for a stack, that did no recursion. Arguments were stashed in the locations after a call instruction (by convention), and the return address in the location before the first instruction executed (a hardware feature).

Garbage collection worse?

Why is GC worse? I would have put a heap, pointers, and a GC as basic requirements.

The thing is, using this language, you can do ANYTHING by implementing an interpreter, or translating to it using various techniques. The safety issue amounts to this: given requirements, how much SUPPORT does this system provide?

And the answer is: almost none. I have to defeat all the supposed safety measures to do anything interesting, transferring the onus for providing safe constructions to ME. Not a good idea!

What's worse, huge problems are created interacting with foreign code (i.e. other people's code). I know all about this because that is what my system Felix does, targeting C++.

Here is what I would have to do in WebAssembly: first, everything is one huge subroutine which starts with a switch on a variable inside a loop. No stack is used (except transiently). I have just provided goto, but it's a bit slow. I use a heap which is garbage collected. I can implement that with a static array and my own allocator, use integers as pointers, and get forced to use only a word-size data type like i64 to defeat the type system.

I need objects, so they're heap-allocated, and in fact my functions are objects with the program counter stored in the object. I can return control at any time and resume just by using an integer indexing the jump table.
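
In C, that "function as an object with its program counter inside" shape looks roughly like a protothread; a hedged sketch (names invented):

#include <stdio.h>

/* A resumable "function object": its program counter lives in the object,
   so control can be handed back at any yield point and resumed later. */
typedef struct {
    int pc;      /* which case to resume at */
    int i;       /* a local that must survive suspension */
} counter_obj;

/* Returns 1 while there is more work, 0 when finished. */
static int counter_step(counter_obj *self, int *out) {
    switch (self->pc) {
    case 0:
        for (self->i = 0; self->i < 3; self->i++) {
            *out = self->i;
            self->pc = 1;              /* remember where to resume */
            return 1;                  /* yield */
    case 1:;                           /* resume point, inside the loop */
        }
        self->pc = 2;                  /* fall through when the loop ends */
    case 2:
        return 0;                      /* finished */
    }
    return 0;
}

int main(void) {
    counter_obj c = {0, 0};
    int v;
    while (counter_step(&c, &v))
        printf("yielded %d\n", v);     /* prints 0, 1, 2 */
    return 0;
}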

This is *in fact* roughly what Felix does to defeat the C machine stack which stands in the way of developing interesting code. For ABI compliance with foreign code, in Felix only procedures use this mechanism: functions are FORCED to use the machine stack for return addresses, which means my coroutines can't run inside functions. I don't have a type system that can enforce that so the code isn't safe.

That's the point here. By providing a model of 1970s structured programming concepts we're stuck in the 1970s, but it's worse because the standard cheats aren't available. In Felix, the GC is precise on the heap, but it scans the stack conservatively, and I can do that because I can cheat to get hold of the machine stack.

Here's what enforcement does: Clang uses LLVM, which uses a similarly constrained model; however, GCC allows me to get hold of external code addresses with a trick I stole from Fergus Henderson's Mercury system (assembler labels). Because I can get machine addresses, and GCC provides a computed goto, I can do a computed jump instead of a switch. (The jump is safe; it's only done inside functions containing the label, but the address to jump to is obtained from outside the function.)

Clang refuses to allow this, so I have to use a switch. My code is just as safe or not either way, because the Felix compiler just generates macros which are conditioned on whether the assembler labels/computed goto are available or not. So the Clang implementation is just slower.
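
For what it's worth, the single-function version of that dispatch (a simpler cousin of the Mercury trick, since the label addresses never leave the function) can be sketched in C with the GCC/Clang labels-as-values extension, falling back to a switch elsewhere; the opcode names here are invented:

#include <stdio.h>

/* Dispatch via the labels-as-values extension where available,
   otherwise via a plain switch -- roughly the macro trick described. */
static void run(const int *program, int len) {
#if defined(__GNUC__)
    (void)len;
    static void *const dispatch[] = { &&op_halt, &&op_ping, &&op_pong };
    int pc = 0;
    #define NEXT() goto *dispatch[program[pc++]]
    NEXT();
op_ping: puts("ping"); NEXT();
op_pong: puts("pong"); NEXT();
op_halt: return;
    #undef NEXT
#else
    for (int pc = 0; pc < len; pc++) {
        switch (program[pc]) {
        case 0: return;
        case 1: puts("ping"); break;
        case 2: puts("pong"); break;
        }
    }
#endif
}

int main(void) {
    const int program[] = { 1, 2, 1, 0 };   /* ping, pong, ping, halt */
    run(program, 4);
    return 0;
}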

You want me to implement a heap and a GC in WebAssembly, and generate code that *emulates* gotos and allows resumable procedures, yields, continuation passing, etc.? I can do it, but don't expect it to be safe or efficient compared to a natively supplied one, and don't expect it to be any fun to make it work with other people's code.

Javascript, for all its warts, at least provides real closures.

What implements GC

If this is a low-level language, what is going to implement the GC? Are you going to try to implement GC in the CPU hardware? That is something that has been attempted many times (see the Java CPUs) and has failed every time.

If you don't have a stack machine, what would you have instead? People don't tend to do abstract register machines, presumably because you don't know how many registers it should have, and the wrong choice would ruin performance on certain physical CPUs. For example, if you choose 10 registers and the real CPU has 8, prepare for terrible performance.

Ordered access machine

Perhaps what is needed is a new machine model? One a bit like the random access machine, where we address memory directly, but that incorporates a cache-oblivious approach, so we consider location 0 to be a register and location infinity to be the slowest external storage (a tape robot?). There are no sudden transitions, so we expect performance to degrade the further from zero we get (hence ordered rather than random access). This would model the cache hierarchy of modern architectures without reference to a particular architecture (like 16 registers, 16k L1 cache, 256k L2 cache, 4 meg L3 cache, 16 gig DRAM, 32 gig virtual memory). The idea is to optimise performance generically without targeting a specific chip. So we model memory as a flat array, and we want the most used values to have the lowest (or highest) addresses. It is interesting that this is effectively what a stack does (although unlike my model the 'hot' part of the stack moves, but that is not a problem for the caches).

Abstract register machine

An abstract register machine should not have a limit on the number of registers: move the register allocator below the level of abstraction. I'm not sure how common this approach is, but Dan Bernstein used it in Qhasm (cr.yp.to/qhasm.html). It looks like assembly language with an arbitrary number of registers. The allocation is done in a compilation phase between the abstract and concrete machines.

Register spilling

Where would you spill registers without a stack to implement such a model?

The heap?

The answer to your question is quite simple - but I get the impression that is not really what you are asking?

If you are wondering more about the mechanism: allocate fixed spill locations in whatever data-structure on the heap is being used to replace the local storage in the stack frame.

The Location

Given a recursive function, you cannot use a fixed heap location. If you wish to use malloc or a heap allocator to find space for the spills, then the allocator itself must be written in machine code native to the platform, or compiled from a stack based language, otherwise calls in the allocator will have nowhere to spill.

Even when you write in machine code, and have no restrictions you still use a stack to push and pop data. It's really the only thing you can do.

Not quite

You can e.g. compile using CPS, then you don't need a stack at all (except to call out to the FFI). SML of New Jersey does that, for instance. But of course, that requires GC.

Variables

I assume variables are on the heap, hence why it requires GC. I am not arguing that languages cannot work like this, I am arguing that a low-level language cannot work like this (as in a language designed to run directly on the hardware). Maybe you have a different understanding of low-level?

There's nothing magical about running "directly on the hardware"

There are still abstractions underneath and/or alongside your code; even your hardware might be merely a simulation, or a layer of microcode implementing the machine on something else entirely.

Memory allocation is usually one of the abstractions that you "rely on", no matter what "level" you're coding at - how many times, for example, have you had to implement the low level code for "gimme 20 sequential, writeable bytes"? There's nothing to say that (for example) garbage collection can't be one of those things as well.

Indeed, writing to run "on the metal" usually implies a very restrictive hardware model - I have a bunch of microcontroller boards on my desk right now which have a mere 20K of volatile memory each - you can't go allocating great huge stacks on there, so adopting a "big stack" model means I have to be careful not to blow the stack via infinite recursion or the heap through over-allocation, and also to balance the potential maximum function-call nesting in each process against the memory wasted on the stacks for those processes.

Confusion.

Indeed, writing to run "on the metal" usually implies a very restrictive hardware model - I have a bunch of microcontroller boards on my desk right now which have a mere 20K of volatile memory each - you can't go allocating great huge stacks on there, so adopting a "big stack" model means I have to be careful not to blow the stack via infinite recursion or the heap through over-allocation, and also to balance the potential maximum function-call nesting in each process against the memory wasted on the stacks for those processes.

I'm a bit confused: you start out sounding like you disagree with me, and then present an anecdote that supports my position. You would be a lot worse off trying to manage that 20k as a heap, with fragmentation. When I have programmed small microcontrollers, the normal way is to use the stack for function calls and local variables, and to use global variables, i.e. fixed addresses, for longer-lived data, effectively mapping the memory by hand. There is no allocator unless you write one, and you don't want to waste memory space on an allocator if you don't have to.

Memory allocation is usually one of the abstractions that you "rely on", no matter what "level" you're coding at - how many times, for example, have you had to implement the low level code for "gimme 20 sequential, writeable bytes"? There's nothing to say that (for example) garbage collection can't be one of those things as well.

Never? You just subtract 20 bytes from the stack pointer if they are local to the current function, or you position them statically at the end of the program code using the assembler directive to reserve memory with a label. This calculates the address in the same way as a jump label, but the assembler macro reserves blank space at the address instead of assembling code there.

What says garbage collection can't be one of those things is that the GC code is large compared to that 20k, and it does not run effectively if you are using more than half the memory. So a large chunk of that 20k becomes filled with GC code, and of what is left you can only effectively use half without appreciable slowdown.

Locations != Offsets

Locations are not the same as offsets. A register allocator does not calculate addresses to spill into, it allocates positions relative to a base address. When using a stack frame the offsets are relative to the base pointer. When using a structure on the heap they are relative to the starting address.

Using one activation record would not work with recursion, but it is not clear why one would use only one. In a stack-based system, activation of the procedure creates a new record: the same offsets relative to a new base. In a heap-based system the same would be true: once a new record is allocated, the same offsets are used relative to the new base.

The only difference is that the set of locations for records is not required to be a contiguous sequence - presumably to avoid forcing a nesting order on the activations. The original point was that a stack is not mandatory for a low-level language. Indeed the linear nature of the stack forces sequential behaviour, a more parallel machine would place procedure activations on the heap so that their lifetimes are not required to nest.

When writing in machine code you can use a stack, or there are other ways to determine layout - it is not required. An allocator takes a CFG and calculates offsets for each of the spills - they are not required to push/pop onto a stack. They can be direct moves into pre-allocated locations using relative addressing. It is slightly harder to use push/pop than relative addressing when merging control-flow paths.
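
To make the "same offsets, new base" point concrete, here is a small C sketch (names invented, error checking omitted) where each activation gets a fresh heap-allocated record and a spill is just a store at a fixed offset within it; the C call itself still uses the machine stack here, the point is only where the spilled value lives:

#include <stdio.h>
#include <stdlib.h>

/* An activation record allocated on the heap instead of in a stack frame.
   Spill slots are fields at fixed offsets from the record's base, and
   recursion simply allocates a fresh record per activation. */
typedef struct {
    int n;        /* incoming argument */
    int spill0;   /* where a register value gets spilled */
} frame;

static int factorial(int n) {
    frame *f = malloc(sizeof *f);      /* "push" = allocate a record */
    f->n = n;
    int result;
    if (f->n <= 1) {
        result = 1;
    } else {
        f->spill0 = f->n;              /* save across the inner call */
        result = f->spill0 * factorial(f->n - 1);
    }
    free(f);                           /* "pop" = free the record */
    return result;
}

int main(void) {
    printf("%d\n", factorial(5));      /* prints 120 */
    return 0;
}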

Fragmentation

However, if you start storing activation records in a 'heap', it will need compaction as it fragments. This will require pointer indirection and a garbage collector.

This all sounds good in theory, but how many lines of machine code have you written using this method? I have written a 3D engine in 68000 assembly and a small operating system in x86 assembly, so I have a reasonable amount of experience in what works when programming bare metal.

I don't doubt stackless languages can exist, I just don't think they are low-level. I think we might be disagreeing about the definition of low-level?

subdivision entropy

However if you start storing activation records in a 'heap' it will need compaction as it will fragment.

Yes, the heap picks up entropy in the form of fragmentation: a predisposition to keep reusing memory carved into the same-sized pieces. If a trace and profile shows the same-sized pieces will keep working, this fragmentation is not a problem. It's normal for embedded C apps to live with this, for example. But it is susceptible to failure from fragmentation if a steady state cannot be achieved.

This will require pointer indirection, and a garbage collector.

That would undo the entropy. The old pre-unix MacOS (MacOS 9 and earlier, going back to Inside Macintosh) used memory-moving compaction like this, using an API called handles equivalent to pointer-to-a-pointer. A lot of literature on the practical consequences of Mac handles can be looked up and examined, but it would be like swimming in online complaints about (for example, say) C++ templates, just with an ancient sound to it.

But memory was really critically skimpy on early Macs, especially without virtual memory. Today fragmentation seems an okay choice in comparison.

I don't doubt stackless languages can exist, I just don't think they are low-level.

You can still do low level things, but the heap must be fancier, to plan what it does instead of treating it like an infinite resource. If you have cache-line aligned fragment pools, locality becomes slightly less a problem after re-use causes memory scrambling.

Fancy Heaps

You can still do low level things, but the heap must be fancier, to plan what it does instead of treating it like an infinite resource. If you have cache-line aligned fragment pools, locality becomes slightly less a problem after re-use causes memory scrambling.

But do I want to be scanning free block lists just to make a function call? It's going to be much slower than stack calling. In assembly I just push registers, use 'call' or 'jsr' and then subtract a constant from the stack pointer to reserve local space. Not many cycles, and no fragmentation.

Contrast this with: scan the block chain for a fragment large enough for the register stash, continuation address and local storage; save the registers and continuation address; then jump. This seems a lot slower, although it will depend on how long it takes to find a large enough block.

Fragmentation is still a real problem, as long running processes will simply crash, and function calls would be a lot slower.

Scanning free lists slow?

Excuse me but .. how are free lists SLOW?

Popping the head off a list is pretty fast!

Finding a large enough fragment

It's probably easier for you to describe the algorithm you are thinking of, to avoid misunderstanding each other. What heap management algorithm are you talking about?

reluctantly discussing heap strategy for stacks

I'm having a flashback to early 90's conversations I had, making this less fun, like playing a video game again until it's boring. :-) I agree with you. But context of usage has a big effect on total-cost-of-execution. (How many threads? How much heap? How many long running transactions? How much i/o? How much async? How much concurrency? Just asking all the questions is taxing.) But there is long tradition in language performance design to ignore context and say "this is fastest in isolation". True, but.

But do I want to be scanning free block lists just to make a function call? It's going to be much slower than stack calling.

Again I agree stack pointer bumping is very fast, when you ignore things that may occur later, like growing when space is exhausted, or handling control splits due to async spawning of side tasks, or managing thousands (up to a million) of stacks in an address space. If the number of things you are doing is just a few, and code is cpu bound, then the usual stack discipline is pretty good. Even if you had heap based spaghetti stacks, in part of your code, you would still want the traditional style in a few places too. Having more than one stack style doesn't sound pure though.

Just because a malloc() implementation might scan free lists (for good block fits) does not mean a stack allocator must do that too. I would be surprised if a general heap was used for stack frames. A very special purpose allocator is what I'd expect, with an emphasis on size uniformity standardization to promote smooth re-use with less fragmentation. Like, just a few power-of-two small sizes, then some maximum size, where multiple instances are allocated to cover unusually large frame requirements.

In the normal case, you would go to a free list that was exactly the size you wanted, typically non-empty in steady state, then just pop it. If you had a program trace that allocated a lot of blocks of one size, then never used that size again, the space would become unused and wasted. But that just reflects lack of planning and config guidance after a dev analyzes typical loads. (A buddy system might recombine small things to repopulate bigger pools, but worst cases seem to occur a lot.)
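
A rough C sketch of that size-class scheme (sizes and names invented; refilling from the OS and recombination are left out): the common case really is just popping the head of the right list.

#include <stddef.h>
#include <stdlib.h>

/* Segregated free lists for frame allocation: one list per size class,
   so the common case is a single pop or push, not a search for a fit. */
#define NUM_CLASSES 4
static const size_t class_size[NUM_CLASSES] = { 64, 128, 256, 512 };

typedef struct free_block { struct free_block *next; } free_block;
static free_block *free_list[NUM_CLASSES];

static int class_for(size_t n) {
    for (int c = 0; c < NUM_CLASSES; c++)
        if (n <= class_size[c]) return c;
    return -1;                               /* too big for this sketch */
}

static void *frame_alloc(size_t n) {
    int c = class_for(n);
    if (c < 0) return malloc(n);             /* oversized: punt to malloc */
    if (free_list[c]) {                      /* fast path: pop the head */
        free_block *b = free_list[c];
        free_list[c] = b->next;
        return b;
    }
    return malloc(class_size[c]);            /* slow path: grow the pool */
}

static void frame_free(void *p, size_t n) {
    int c = class_for(n);
    if (c < 0) { free(p); return; }
    free_block *b = p;                       /* push back onto its class */
    b->next = free_list[c];
    free_list[c] = b;
}

int main(void) {
    void *a = frame_alloc(100);              /* served from the 128 class */
    frame_free(a, 100);
    void *b = frame_alloc(100);              /* reuses the same block */
    frame_free(b, 100);
    return 0;
}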

I think you know all this, so I don't get the vein of inquiry, unless your objective is to FUD heap-based stacks, which I don't see benefiting you. The traditional stack style is very simple, but it makes some complex apps hard to write. A one-size-fits-all design style merely penalizes anyone with problems outside the norm. Edit: here I'll save you the trouble of raising standard objections. A dialog:

Stu: Power-of-two frames have more waste due to internal fragmentation!
Ned: They can be multiples of cache line size instead.
Stu: Then why didn't he say that?
Wil: It's a form of license, in writing, to not say absolutely everything.
Stu: But how do I prove you're an idiot because you left something out?
Wil: You understand I cannot wish you success in every such endeavor.

Complexity

I already said I like the idea of a stackless language, and it would work well with asynchronous cooperative threading. But I don't think a language like this is low-level. I could be wrong - writing an operating system kernel in such a language would be a great way to prove me wrong.

It also goes somewhat against the design of CPUs, which often include stack-based subroutine calls as part of the instruction set and initialise the kernel stack as part of startup. I don't think people should fight against the architecture of CPUs; I think people should embrace it.

fancy heaps

Take a look at TCMalloc for an example of 'fancy heaps'. Granted, it's slower than bump-pointer allocation. But it's a great deal faster than scanning a free list for a large enough fragment.

An interesting possibility is to use a bump-pointer 'nursery' region within a heap. This gives an important part of the heap stack-like properties. Data that survives a couple of collections within the nursery is promoted to a survivor heap, then eventually to the old-generation heap. Generational heaps do complicate the system, though - e.g. they need write barriers to keep track of pointers from a mature space to the nursery.
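
A bump-pointer nursery really is just a pointer increment plus a bounds check; a minimal C sketch (names invented; a real collector would evacuate survivors and reset the pointer instead of failing):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bump-pointer allocation in a nursery region. */
static _Alignas(16) uint8_t nursery[64 * 1024];
static size_t top = 0;

static void *nursery_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;            /* keep 8-byte alignment */
    if (top + n > sizeof nursery)
        return NULL;                     /* nursery full: time to collect */
    void *p = &nursery[top];
    top += n;
    return p;
}

int main(void) {
    int *a = nursery_alloc(sizeof *a);
    int *b = nursery_alloc(sizeof *b);
    *a = 1; *b = 2;
    printf("%d %d (used %zu bytes)\n", *a, *b, top);
    return 0;
}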

Anyhow, how important the performance of function calls is will depend heavily on the programming language and its compiler. If function calls are expensive, for example, then a compiler will just do heavier work inlining or specializing code. In some extreme cases, like SAFL or FunLoft, entire programs (including all function calls) can be reduced to a single allocation.

Nursery

An interesting possibility is to use a bump-pointer 'nursery' region within a heap

Doesn't that require pointer indirection if you are going to relocate objects? That's a big slowdown.

TCMalloc is thousands of lines of code, compared to a single opcode to reserve stack space for local function variables.

generational collectors, nurseries, and stacks

TCMalloc doesn't execute thousands of lines of code for every allocation. The common case allocation is O(1), involves just a few lines of code, and has pretty good locality and branch prediction properties. I do not claim that a fancy heap is faster than stack allocation. I only claim that it is not nearly so bad as that incredibly naive free-list algorithm you described.

The greater flexibility of putting activation records and state on the heap can lead to better performance elsewhere. For example, enabling lightweight parallelism, continuations, coroutines. A little inlining - allocating an activation record for a largish sequence of functions - can further amortize potential overheads. Using a few basic sizes that cover most activation records (e.g. small, medium, large) would further improve locality and reuse of memory and reduce external fragmentation (in favor of internal fragmentation).

I do not assume that the basic call stack is superior for performance. It probably depends on a lot of contextual things, like the language, the programming patterns involved, how much is easily inlined, etc..

And to answer your initial question: no, generational GC (including use of a nursery) does not require pointer indirection. It does require tracking references from an 'old' generation to a 'new' generation so that pointers can be repaired when things move. But that can be achieved with write barriers (cf glossary). And read-barriers too if you want concurrent GC. But it's much less costly than pointer indirection. (I grant this is a fair bit more sophisticated than what I'd want for a "low level" language. But the idea of using a nursery heap for performance is separate from the idea of stackless languages.)
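
For concreteness, a hedged C sketch of what such a write barrier records (structure and names invented): the store itself plus, for old-to-young pointer stores only, an entry in a remembered set that the next minor collection will scan.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Generational write barrier: record every pointer store from an object in
   the old generation to one in the nursery, so a minor collection can find
   and later repair those references without scanning the whole old space. */
typedef struct obj {
    bool        in_old_gen;
    struct obj *field;
} obj;

#define REMEMBERED_MAX 1024
static obj   *remembered[REMEMBERED_MAX];
static size_t remembered_count;

static void write_field(obj *holder, obj *value) {
    holder->field = value;                        /* the actual store */
    if (holder->in_old_gen && value && !value->in_old_gen) {
        if (remembered_count < REMEMBERED_MAX)    /* old -> young edge */
            remembered[remembered_count++] = holder;
    }
}

int main(void) {
    obj old_one = { true,  NULL };
    obj young_one = { false, NULL };
    write_field(&old_one, &young_one);   /* recorded for the next minor GC */
    write_field(&young_one, &old_one);   /* young -> old: nothing to record */
    printf("remembered %zu old-to-young edge(s)\n", remembered_count);
    return 0;
}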

Recording

But that can be achieved with write barriers (cf glossary). And read-barriers too if you want concurrent GC.

This requires recording the read/write accesses, and presumably modifying the code on a move. It's still a lot of work compared to simply accessing memory, and recording the read/writes will push other stuff out of the cache.

TCMalloc doesn't execute thousands of lines of code for every allocation.

But those thousands of lines have to be stored in memory somewhere and when memory is small (like the 20k above) that is memory you cannot use for storing data.

To summarise what I am trying to say, I think stackless concurrent languages are interesting, but I am waiting for someone to write an operating system kernel in such a language before I will agree it can be used at a low-level.

On the other hand, this is probably the way forward for application and network programming.

re: recording

This requires recording the read/write accesses, and presumably modifying the code on a move.

Generational GC only requires recording a subset of writes. And imprecisely, at that - i.e. you can trade precision for extra GC search. Code is usually neither moved nor modified by these techniques.

when memory is small (like the 20k above)

Quantity has a quality of its own. Techniques designed for single processors with extremely limited memory will work well in those contexts. But they won't necessarily scale well to big systems with multi-processors.

My impression is that you're somehow conflating notions of "low level (of abstraction)" vs. "small scale (of computation)". But it doesn't really matter to me.

Applications

GC is an abstraction of memory management, so I assume we agree that a GC language is by definition higher level than a non-GC language. This should be enough to justify my original point.

Is Rust high-level, though? It has nice abstractions like parametric polymorphism and type classes, but it doesn't have GC. The thing is, those abstractions cost no performance and require no runtime, so Rust can be used to write operating system kernels and code for embedded devices.

Is Go high-level? It lacks the power of Rust's type system and type classes, so it would seem to have less abstractive power, but it has GC and cooperative concurrency.

For me any language with GC seems to be high level, because abstracting away memory management seems a big deal. There are other ways to be high level too.

I guess I am defining low level as the lack of any high level features, which seems sensible to me. I guess some other people are prepared to permit some high level features and still call it low level (just not quite as low level).

The scalability argument seems a straw man, plenty of big systems have been written without GC, even multi-processing systems.

As for small systems I think that came up because low level languages are often used to program small systems due to the runtime overhead of high level languages, but I think Rust shows that is not necessarily the case, as long as those high level features have a low cost. So you can have high level for small scale, as long as it's not GC or cooperative concurrency apparently.

scalability

I agree that GC offers an implicit higher level of abstraction for memory. OTOH, GC isn't essential for stackless languages.

The scalability argument seems a straw man

I agree. Low level doesn't imply works efficiently on tiny embedded systems. That's a downwards scalability concern. Conversely there are high level languages just for embedded systems (static allocation of memory, compile to hardware, etc.). AFAICT, scalability and abstraction level are independent axes by which to classify programs.

Stackless with no GC

GC isn't essential for stackless languages.

How would that work? Do you have any links?

reference counting

?

Stackless w/o GC

One answer should be very obvious and familiar: manual memory management. Manual reference counting if you need it. This isn't especially safe. But it works. There isn't any particular reason to assume stackless programming will be safer than stack-based programming.

If you want safer memory, it is possible to use substructural type systems or a variant like Rust's ownership. Conveniently, continuations fit nicely into linear, affine, relevant, and more specialized substructural attributes. (A conventional call-return stack can be modeled as a single linear continuation.)

Manual Memory Management

I think you are right, you could do it all manually, but that would be quite tedious to program in, although it would probably make an interesting target language for a compiler.

Certainly stackless seems a better solution for high concurrency, but then again with processors that work very fast sequentially you can get a lot done with a single thread and a stack.

same as first reply answer, but with detail

Semi-automatic refcounting, which is basically manual but automated by both tool and interface style. If API says you return like this from an activation frame, that also releases the ref associated with that call. (Other refs might still be around, like in a closure or in a return value holding a temp ref to the frame holding an out param.)

The compiler target approach you mention is more the tool angle, and lends itself to inlining of refcount management. If you do it manually, another tool that helps is a "ref holder" object acting as handle, similar to a C++ smart pointer in flavor. Lifetime of a single ref is the same as the handle: addref on create and release on destroy. You can instrument handles to audit them, so ref leaks are revealed as backtraces with more addref than release on the books.

A refcounted object does not need to count each alias. Any alias whose lifetime is covered by an official handle for that scope need not manage more refs. For example, when running code in an activation frame, you would only be there if a ref was held, so the frame pointer needs no further local refcounting.
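
A bare-bones C sketch of that handle discipline (names invented, error checking omitted): one ref per handle, addref on create, release on destroy, and borrowed aliases inside a scope that already holds a handle take no extra refs.

#include <stdio.h>
#include <stdlib.h>

/* Minimal refcounted object with explicit retain/release. */
typedef struct {
    int refs;
    int value;
} rc_obj;

static rc_obj *rc_new(int value) {
    rc_obj *o = malloc(sizeof *o);
    o->refs = 1;                        /* the creating handle's ref */
    o->value = value;
    return o;
}

static rc_obj *rc_retain(rc_obj *o) { o->refs++; return o; }

static void rc_release(rc_obj *o) {
    if (--o->refs == 0) {
        printf("freeing object %d\n", o->value);
        free(o);
    }
}

static void use_borrowed(rc_obj *o) {
    /* the caller's handle keeps o alive for the whole call: no refcounting */
    printf("borrowed value: %d\n", o->value);
}

int main(void) {
    rc_obj *a = rc_new(42);             /* handle #1 */
    rc_obj *b = rc_retain(a);           /* handle #2, e.g. stored in a closure */
    use_borrowed(a);
    rc_release(a);                      /* handle #1 dies */
    rc_release(b);                      /* handle #2 dies: object is freed */
    return 0;
}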

Refcounting vs mark/sweep.

I thought mark/sweep was generally considered better for GC than refcounting? Is there something about high concurrency that makes refcounting a better option again?

let's keep our volume of words in closer balance

It's unfair to ask a short question needing a long answer, so you owe me more effort. Your sound bite is fit for short attention spans.

Is refcounting always GC? When folks do manual memory management and use a bit of refcounting, have they adopted GC? Stacks are usually reclaimed automatically, even if nothing else is; are you saying all languages have GC because stacks pop on call returns? Can you compare incremental nature of stack popping vs mark/sweep? On a scale of one to ten characterizing developer absence of control over something, how close to 10 is GC? Where do you put refcounting? Why would the number not be identical? There's too many questions to ask, but you didn't even start. For example, better at what?

(I hope this doesn't sound mean-spirited, but questions tend to sound that way, the way a harassing cross-examiner does.)

Blaise Pascal

To paraphrase Blaise Pascal "if I had more time I would have written a shorter response". It actually takes more thought to compose short replies that convey the point well. Mere word count is no measure of effort involved. It's a bit like measuring developer productivity in lines of code.

I never liked his triangle anyway

I'm very familiar; I do condense points. You are free to reject my proposed deal, which naturally assumes equal point-to-word density. Pony up to get concurrency remarks. What's a one word way to say "redefinition implies bad faith"?

Short Answer?

Okay, let's try another approach, ask something with a short answer, as I now realise that was probably a bad question open to very speculative answers.

Has anyone written (or used) a language that is stackless, and is suitable for low level programming like writing an operating system, or have any performance benchmarks comparing ref-counting, mark/sweep, and stack allocation in this context, contrasting low and high concurrency scenarios?

graph POV habit

Amdahl's law affects fast parallel things more if you find a way to make them sometimes depend on slower serial things. Suppose m is a memory dependency relation such that A m B means A's memory management depends on B's memory management, because we need to look at B to decide if A is alive. Draw the graph, then see if evaluating m serially can affect fast parallel A nodes that otherwise would not have waited.

Indeed.

It is true that mark/sweep suffers from a 'stop the world' approach, and graph tracing, although parallelisable, needs to complete before the results can be interpreted; but I often hear that mark/sweep outperforms reference counting, including in criticisms of languages like Swift that use refcounting. This has always bothered me (when I remember it :-) )

Refcounting obviously does not suffer from this, so I would be interested in when the crossover occurs (how many threads or concurrent tasks) for some well known algorithms.

Naive refcounting

Naive refcounting has terrible performance characteristics. It requires touching objects in memory every time you move references to them around, which is bad for both VM and CPU cache. It adds a word to the overhead for each object, which increases memory consumption and cache requirements. It adds a conditional check (is the count zero) to lots of otherwise straightline code, which becomes a branch prediction hazard. The extra memory management code itself becomes a cache hazard if there are virtual destruction methods.

A lot of this can be mitigated by various deferred reference counting techniques that reduce a lot of updates to the reference counts, most of which sort of hybridize reference counting and mark/sweep GC. Reference counts can also be moved into a separate page than the objects in question to improve memory locality and caching behaviors, though doing so can be a bit tricky to implement.

When people call reference counting slow, they are usually talking about the naive form. Given that it's difficult to implement more sophisticated algorithms without compiler or runtime support, I think this is more or less a fair analysis.

Concurrency

Clearly mark/sweep has a concurrency problem, everything must be marked before you can sweep. Where do you think the crossover between mark/sweep and refcounting is, 10 parallel tasks, 100, 1000 ?

Concurrency and GC

Just based on anecdotal observation of ad-hoc concurrency - a bunch of threads with a big, shared heap - I'd estimate the crossover point is somewhere between 10-100 OS/Hardware threads depending on lots of fiddly implementation details. This could correspond to many thousands of lightweight threads (in an N:M model). We're only paying for GC synchronization on the M.

If you use an Erlang model instead - an independently GC'd heap for each 'process' together with some manually managed binary data spaces - I think there is no crossover point. Mark and sweep is excellent for small heaps, after all, and this avoids all synchronization issues. This does mean 'messages' between processes are copied.

So I guess, like many things, It Depends (on software architecture, etc.).

avoid synchronization

There's a lot of "it depends" in forward looking projection of expected system behavior. (Even worse, when debugging, you must construct backward looking projections that are likely counterfactual, to guide search for clues. To avoid confirmation bias, a plan to "look for everything" is too vague. You need to look for specific perverse influences. I just saw Z; what are all the things that can cause Z? It's more hunch-driven than logical.)

It's fairly hard to get a load consisting of tasks that are mostly to completely independent. But it's a common format in session-oriented comm stuff like network connections. Locally you might be able to do it with a daemon talking to a lot of processes. Otherwise getting to a crossover point might be hard, unless you're running a simulation of many independent things. Around the time you manage to service thousands of things that didn't wait on a shared serial resource, you don't want them to start syncing with anything serial.

Shared mutable state implies coordination, implies sync, implies waiting. A shared heap can be a mutable state machine even if state is immutable post-allocation. (For example, writing to a shared log-structured store implies a coordination bottleneck.) If you can make a shared resource really cheap and low latency, it doesn't get in the way much. But lines (queues) can be deadly, whatever form they take, setting a hard limit on parallel speedup, if you would not have waited without that resource.

Some things don't use a lot of bandwidth, but you want to let a lot of them hang around until an event triggers them, without them increasing cost of other things running just because they exist. Agents scale if they seem to have non-interfering domains. If one agent is not slowed down because a lot of others are sitting quiescent, it's working. Sorry if none of this seems helpful, but it's hard to describe when a design has value.

narrow gutter question

what are good architectures to avoid "hunch driven development/debugging"?

(if you have a thought, perhaps blog about it and post a link since this thread is too narrow and i am making it more OT even! :-)

refcounting vs. mark/sweep

Mark/sweep usually has a big throughput advantage, especially for small heaps. All the GC code is in a tight loop, it only touches living objects, and some of its paging behavior is predictable.

But reference counting has other advantages, such as avoiding unpredictable latency. If supported by compiler or runtime, many reference count updates can be eliminated.

Other options are even better for high concurrency, such as fixed flyweight allocations for each subprogram and the messages/queues. If your program already fits this pattern, compiler supported reference counting shouldn't add any runtime overhead.

Sorry, been off-grid for a bit

and most of the arguments I would have made have already been made (to death, perhaps). That said, I can't let this pass:

But those thousands of lines have to be stored in memory somewhere and when memory is small (like the 20k above) that is memory you cannot use for storing data.

When volatile memory is small, it tends to be used only for truly volatile data, with nonvolatile data (including the majority of code) stored in nonvolatile memory. The µCs I'm using, for example, have hundreds of kilobytes of (effectively) nonvolatile, directly memory-mapped, executable flash alongside the 20480 bytes of volatile memory. They're actually quite big, as µCs go, but if you get a few concurrent processes up, the "big stack" model eats up pretty much all the memory, and it's a bugger to tune how much stack each process should have.

Consensus

Rust originally had a garbage collector, but has abandoned it because for systems and embedded programming having a language runtime is considered a bad thing. I have never had a problem with stack size on a micro-controller, as you tend not to use recursion, and use imperative looping instead.

What language are you using to program these microcontrollers that has GC and is stackless?

I'm not. Like most people,

I'm not. Like most people, I'm using a horrible mixture of C and assembler, which kinda almost sometimes gives me correct behaviour, as long as I haven't accidentally quadrupled stack usage by compiling with no optimisation, or some pathological case leads to a stack overflow on one or other process.

Garbage collection of one type or another, automatic or not, might lead to a wastage of memory, but that still isn't going to be as much as is left unused 99% of the time by the big stack model, and while "out of memory" might still be an issue, it would be a hard fail rather than a "stack overflows heap" variable corruption case.

My gut feel is that there are significant gains in terms of stability and provability to be had for on-the-metal programming (especially in the common case where dynamic allocation simply doesn't happen) using stackless programming.

random project that has always warmed my heart

cool stappl via PLT Racket.

Assumptions about concurrency.

The "big-stack" model would seem to result in optimal memory use in the single threaded case, so you must be working with high concurrency systems to have such problems with stacks?

I would disagree with

I would disagree with "optimal memory use". For example...


extern void bar (int x);
extern void baz();

void foo (void) {
  for (int i = 0; i < 1000; i++) {
    // stack usage here is probably optimal
    bar(i);
  }
  // but here we're carrying "dead" space in the frame for i, which is no longer used
  baz();
}

Obviously, that's trivial, and you'll only carry space in the stack frame in this case when compiled with -O0, but you see what I mean (and that's without considering that the call to baz is a tail call...)

As for concurrency, not that much. I'm running a USB HID device stack, several processes handling "interesting" timing-critical serial links, and an equal number of even more interesting decoder stacks inbetween the two.

Trivial Optimisation

I think that's just a compiler problem: there is no reason to carry 'i', and you could put the for loop in an extra block if you want to make it explicit.

extern void bar (int x);
extern void baz();
void foo (void) {
  {
    for (int i = 0; i < 1000; i++) {
      // stack usage here is probably optimal
      bar(i);
    }
  }
  // definitely not carrying space for i.
  baz();
}

However, the compiler is perfectly capable of making this optimisation itself, as you are aware. -O0 means no optimisations, so by definition it cannot do this. With a GC'd language, a compiler would have to do no garbage collection at all to comply with the strict requirements of -O0. Personally I think C compilers should default to at least -O1 these days.

I don't think this small issue justifies the claim that stacks are bad. I still maintain they are optimal for single threads.

A retrospective

What a long journey it has been.

I don't think this small issue justifies the claim that stacks are bad. I still maintain they are optimal for single threads.

What does that even mean? Optimal with respect to what criteria - memory utilization? Performance? It is difficult to see what you are arguing against - nobody has claimed that stacks are an inefficient solution for that one specific set of problems. If procedure activations form a nested tree, then they are optimal in both utilization and performance. But if data lifetimes are not nested, then they fail completely and a secondary store is necessary.

"Single-threaded" is too general a classification. The subset of single-threaded codes that exclusively access data with nested lifetimes are well served by stacks. For anything outsid of this set of codes they are a premature optimisation. The original point that started this thread was not that stacks are inefficient for programs in that target set. The original claim was that stacks are "evil" - they are too specific an optimisation to handle arbitrary control-flow constructs.

Heading back up this long tangent to something that you brought up earlier - writing an OS. This is a classic example of a program that falls outside of nested data lifetimes, and the context switch requires a form of control flow that does not suit a stack: data lifetimes in application programs are not well nested with respect to data lifetimes in the kernel.

Arguing that a stack is efficient does not answer the original criticism that it is not general enough. You asked about low-level bare-metal programming. This seems like a perfect example to explore generality: how much of an OS could you implement using only a single stack? Without extending the system using a more general secondary storage (i.e. implementing a heap on top of a memory allocator) you would be hard pressed to do much at all, let alone manage the execution of simple programs.

[apologies for multiple typos - using a phone in a car, hopefully they do not impede readability]

OS and Stacks

I think there is a distinction between possible and easy. It certainly would be possible: you only have to allocate storage before entering the frame where it is used. For example, the OS loader can allocate the process table on the stack and then call the main loop. Often operating systems have fixed resource limits for things like processes, IO handles, etc. It would probably not be ideal, and I was not arguing that stacks should be used without heap storage, just that stacks are not evil.

In general, with single threads, GC + stack will be more efficient than just GC, because some things can be destroyed without any scan, and it reduces the working set when a scan does happen. Stacks + malloc seem more efficient than just malloc too.

Above it has been pointed out that mark/sweep GC does not work very well with highly parallel systems.

My thoughts at the moment are that for highly concurrent code you want a combination of 'cactus stacks' and ref-counting, where everything with single ownership is managed by the stack (the equivalent of a C++ unique_ptr) and anything shared is ref-counted (the equivalent of a C++ shared_ptr). This seems like it could be a much more efficient solution for both single-threaded and highly concurrent systems than mark/sweep GC for everything with no stack.

Edit: what is needed is a single language with pluggable memory management, so comparative benchmarking can be done.

Low-level

I don't doubt stackless languages can exist, I just don't think they are low-level. I think we might be disagreeing about the definition of low-level?

Certainly. I believe this is because your definition of what is low-level has drifted somewhat during this thread. Initially you claimed that a stackless language could not be low-level as it would require GC. Then there was a claim that an abstract register machine could not exist as it would be tied to a specific architecture.

It seems that we have reached a point where: a stackless language could function as long as it has a heap allocator in the runtime. Whether or not this particular level of abstraction is "low-level" is arbitrary as the term is not well-defined, but certainly it fixes the misconceptions about what would prevent it from being "low-level".

Have I written such a runtime? No, but I don't need to sit down and implement one to check that it could exist. The pieces are quite well known and understood. And yes, I have also written my fair share of 68000 assembly (once upon a time) and bare-metal x86 (back when protected mode was new and exciting and we needed thunks to handle the old driver model when we upgraded applications to a flat address space).

Fragmentation: it is not so clear that this would be a problem. Certainly it is a potential problem - and the main reason that we use a stack for nested activations is that it guarantees a lack of fragmentation. But if the activations are well nested, then the allocations and deallocations will follow an interesting pattern. In particular, if the allocator glues together adjacent regions during a deallocation (i.e. the buddy system, sketched below) then there will be no fragmentation. Objects on the heap that do not share lifetimes with well-nested procedures will obviously break this property.
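A toy sketch of just that coalescing step, assuming a power-of-two buddy layout over a single aligned arena (free lists for allocation, block splitting and the arena itself are omitted; ToyBuddy and kMaxOrder are invented names):

    #include <cstdint>
    #include <unordered_set>

    struct ToyBuddy {
        static constexpr unsigned kMaxOrder = 20;  // toy arena of 1 MiB
        // Free blocks per order; a block of order k spans 1 << k bytes.
        std::unordered_set<std::uintptr_t> free_blocks[kMaxOrder + 1];

        void free_block(std::uintptr_t offset, unsigned order) {
            if (order == kMaxOrder) {              // whole arena: nothing to merge with
                free_blocks[order].insert(offset);
                return;
            }
            // The buddy of a block differs only in the bit for this order.
            std::uintptr_t buddy = offset ^ (std::uintptr_t{1} << order);
            auto& level = free_blocks[order];
            auto it = level.find(buddy);
            if (it == level.end()) {
                level.insert(offset);              // buddy still live: just record ourselves
            } else {
                level.erase(it);                   // buddy free: glue the two regions together
                free_block(offset & ~(std::uintptr_t{1} << order), order + 1);
            }
        }
    };

    int main() {
        ToyBuddy b;
        b.free_block(0, 4);    // free a 16-byte block at offset 0
        b.free_block(16, 4);   // free its buddy: the two merge into one 32-byte block
    }

With well-nested lifetimes, frees arrive in an order where buddies tend to become free together, which is why the coalescing keeps the arena unfragmented.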

More interestingly, if the heap is only used for activation records in a system where the activations are not well nested, then fragmentation can still be avoided if the run-length of each activation is bounded. In particular, the layout of the heap would look like that of a generational garbage collector and could be treated as a ring buffer (a minimal sketch is below). I would implement this but I do not have any interesting test cases to demonstrate how it would work. It would not surprise me if it already existed somewhere.
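Purely as a guess at what that might look like: a toy ring of fixed-size activation records, where the bounded run-length assumption shows up as roughly FIFO retirement (ActivationRing and the record layout are invented for illustration):

    #include <array>
    #include <cstddef>
    #include <optional>

    struct Activation {
        std::size_t caller = 0;               // index of the creating activation
        std::array<std::byte, 56> locals{};   // fixed payload for the toy
    };

    template <std::size_t N>
    class ActivationRing {
        std::array<Activation, N> slots_{};
        std::size_t head_ = 0, tail_ = 0, live_ = 0;
    public:
        // Place the next record; fails only when the ring is full.
        std::optional<std::size_t> push(std::size_t caller) {
            if (live_ == N) return std::nullopt;
            std::size_t idx = head_;
            slots_[idx] = Activation{caller};
            head_ = (head_ + 1) % N;
            ++live_;
            return idx;
        }
        // Retire the oldest record. Bounded run-lengths are what make
        // (approximately) FIFO retirement reasonable, and FIFO retirement
        // is what keeps the ring free of fragmentation.
        void pop() {
            if (live_ > 0) { tail_ = (tail_ + 1) % N; --live_; }
        }
    };

    int main() {
        ActivationRing<8> ring;
        auto a = ring.push(0);   // record for the entry point
        auto b = ring.push(*a);  // a record it spawned
        ring.pop();              // the oldest retires first
        (void)b;
    }

Real records vary in size and retirement is only approximately FIFO, so a usable version would have to skip over still-live records within the bounded window; the ring layout, not the bookkeeping, is the point of the sketch.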

Buddy System

I did think the most likely implementation would be a buddy system with power-of-two pages, but how many pages of each size would you use to seed the system with? It immediately seems to be excessively wasteful of memory. Then we get to 'memory is cheap' arguments. But if I can write code that uses the same memory more efficiently (no matter how large the memory is) I will have an advantage. My drawing application will handle bigger images, or run more complex filters, on the same machine. The argument that memory is plentiful just leads to wastefulness, in my opinion.

Some clarifications, though: I still think a language with a GC is not low-level. I don't think low-level languages should have a runtime, especially if we are talking about a 'C' replacement, and a lot of people agree with me (see the Rust language). I also did not claim that an abstract register machine could not exist; what I did claim is that the number of registers would not match the underlying hardware, and that the mismatch would cause performance problems when you try to spill registers without a stack for storage.

I think another problem would be the need for a custom backend, because I am not sure that LLVM or 'C' would make a good intermediate language for something like this.

Custom Back end

A custom back end is ideal; a viable alternative is a suitably constrained use of existing back ends. Felix does the latter: it generates C++, which can go through LLVM via Clang. The advantages are C/C++ ABI compatibility and, of course, reuse of existing complex tools, including assembler generators that can already do low-level optimisations.

The disadvantage is a lack of "purity" and, for Felix, the lack of a compiler-based safety enforcement system such as a suitable type system: Rust does this, I believe, but it is a poor compromise IMHO.

For WebAssembly, however, there is no need for compliance with an existing ABI, nor any need to reuse existing back ends. Felix is designed to sit on TOP of an existing target language; WebAssembly is designed to sit UNDERNEATH.

If you wanted a REALLY low level language, without GC, you would also NOT have function calls with implicit continuation passing and an implicit stack, not to mention implicit argument passing. To be low level, the requirement would be for everything to be explicit.

In fact, you would also NOT have any static (global) storage either. I have actually worked on an OS where there was no static memory at all and all code was pure, reentrant, and relocatable: namely OS9 (it does have a stack, though).

If you look at web applications today, they're written in Javascript, which has two high-level features of merit: automatic memory management and real closures. Although it does have global variables, which make experiments easy, high-level libraries like jQuery can be written which don't use globals for communication. If you really want a low-level language to build web applications in, it would seem advisable to at least support the hard stuff: automatic memory management. It is already a requirement for all existing JS apps, so why do we want developers to have to write a memory manager every time they implement something in WebAssembly? Such managers make cooperation between products supplied by different vendors impossible. JS, at present, does not have this problem *because* the memory management is built in.

WebAssembly ensures that functions from competing vendors can cooperate. But closures do not, because they require a GC for full generality, and that isn't provided. So, roughly, it's a bad compromise: it's not high-level enough to support automatic memory management and not low-level enough to support peer-to-peer control exchange or continuation passing.

Low level / High level

Yes I agree about a custom backend, but backends are big projects, especially as you need to support multiple architectures.

I disagree about WebAssembly: it's designed as a target for compilers, and it clearly should not impose any assumptions about the memory model. If you look at other bytecodes, like the JVM and CLR, they are very restrictive; for example, Scala had to work within the Java object model, and F# inherits issues from C#. The problem is that we do not know how to write a higher-level bytecode without limiting what you can do with it. What we do know is that the CPU itself is flexible, so I assume WebAssembly is designed to behave more like a real CPU than a high-level bytecode.

The answer is to port to WebAssembly exactly as you would port to a new hardware architecture.

cost of memory

(...) we get to 'memory is cheap' arguments. But if I can write code that uses the same memory more efficiently (no matter how large the memory is) I will have an advantage. My drawing application will handle bigger images (...)

In my experience, I don't usually have a very "deep" stack, not even when recursing down balanced tree data structures in purely functional languages. If I had 100 bytes of overhead per average activation record and a relatively deep stack of 300 activation records, that would give me thirty kilobytes of wasted memory.

In comparison, a single "large image" decompressed for easy manipulations and drawing could be over thirty megabytes. The waste from three hundred activation records would only be a 0.1% tax relative to such an image.
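For concreteness, the arithmetic behind those figures (the per-record overhead and image size are the illustrative numbers from above, not measurements):

    #include <cstdio>

    int main() {
        const double stack_waste_bytes = 100.0 * 300;  // overhead per record x records
        const double image_bytes = 30e6;               // a ~30 MB decompressed image
        std::printf("stack waste: %.0f KB\n", stack_waste_bytes / 1e3);              // ~30 KB
        std::printf("relative tax: %.1f%%\n", 100 * stack_waste_bytes / image_bytes); // ~0.1%
    }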

Focusing on memory overheads without considering context is one of those 'penny wise and pound foolish' optimization concerns. Even if memory is expensive, we need to find the hot spots and optimize there.

Besides, as far as 'advantages' go: while making efficient use of memory is clearly advantageous, so is making efficient use of multi-core systems. Coroutines, continuations and the like are convenient in this role, and are less awkward to implement efficiently if activation records are on a heap rather than a stack.

So there are a lot of different advantages to weigh against each other, even if we're just looking at performance.

No tail calls planned, then?

Control-flow indeed is semi-structured: branches are really only breaks and continues, so that you cannot construct irreducible control flow. That's a feature as well.
Does this mean that TCO is an anti-feature for WebAssembly? If so, how does this interact with ES's guarantees of tail-call optimization?

Planned

No, (explicit) tail calls are definitely planned, but post-1.0. There is no interaction with JS TCE, as far as I can see (also, since the last TC39 meeting there is a non-zero probability that TCE is gonna be ripped out of JavaScript again).

!

[...] since the last TC39 meeting there is a non-zero probability that TCE is gonna be ripped out of JavaScript again.

I suppose I shouldn't be surprised! I remember attending a talk by Waldemar Horwat in ~2002 where he said that Javascript would get TCE soon. :)