Microscript

I'm designing a Turing-complete utility microlanguage with a minimal set of commands. The chosen commands should, on the one side, compile to fast assembler code, and on the other side be descriptive enough that one can directly write sane code with them.

The microlanguage would be part of my bigger project, a metalanguage that compiles other languages to the microlanguage, which at the end (JIT or cached) compiles either to a browser executable (HTML + JavaScript) or to native code. Because it wouldn't really be used for high-level programming, but only as an underlying intermediate language (analogous to some bytecode language), it shouldn't have heavyweight add-ons like objects, structs or reactive constructs; those add-ons should instead compile from a guest language down to the host microlanguage. It should also be easy to port the microlanguage to platforms like OS X, Windows, Linux or Android.

So I'm thinking about a minimal set of commands. I think it should be an imperative language, because that matches how microprocessors work at a low level. I was thinking of supporting only these:

  • variable assignment and arithmetic
  • pointers and arrays like in C
  • if branching
  • for and while loops with labels
  • defining functions with stack support
  • an assembler pass-through like in C, for system programming

Do these commands sound reasonable, or is there something that could be changed for better speed or usability?

Forth is more minimal than

Forth is more minimal than your set of primitives.

But given your set, if you're supporting labels and branches, then you don't need for/while loops, so that's even more minimal.

I don't see why you consider structs to be heavyweight. Encoding struct accesses as offsets from pointers is remarkably unsafe.

Structs

can be lightweight, indeed.

With labels and branches

With labels and goto I can drop functions too.

[Edit]
The problem is how it will impact compiled-to-JavaScript performance. On the other hand, I want to be able, at some point in the future, to provide compilation to various fast machine codes.

asm.js is recently available

asm.js is now available in all the better-known browsers, and that should be enough. It operates on a single array (or something like that) that can be used to hold the stack, the heap, or whatever the low-level implementation of some language might look like. Maybe it's too low level, but why restrict language designers to the frame of one particular stack (or similar) implementation?

PPL

I have been thinking of something I call a PPL: Purely Procedural Language, the dual of a purely functional one :)

At the base level you have an *arbitrary* set of instructions, with the rule that the arguments of the instructions cannot be expressions (there aren't any expressions!); they have to be variables (that is, storage addresses). The instruction arguments have types, however, which have to agree with the variable types.

You will need a computed jump instruction; the argument should be a unit sum. If the sum is void (0), the program halts; if it is unit (1), it's an unconditional branch; and if it is bool (2), it's a conditional branch with TWO jump addresses (this ensures the jump terminates a flow: no fall-through).

A core instruction is Branch-and-Link, which subsumes both the goto above and subroutine call: BLA jumps to a continuation stored in one variable, saving the current continuation in another variable. Variables are just registers, except they're typed and can be created on demand by a declaration.

This is enough for a finite state machine. Although we specify sequential execution of instructions, it is easy to reorder or parallelise instruction execution based on variable dependencies.
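
To make that concrete, here is a rough C sketch of one reading of such a machine, with made-up continuation names (MAIN, SQUARE, ...): the dispatch loop plays the role of the computed jump, and branch-and-link is just saving one continuation variable before loading another.

#include <stdint.h>
#include <stdio.h>

typedef enum { HALT, MAIN, SQUARE, AFTER_SQUARE } cont_t;

int main(void) {
    int32_t x = 3, result = 0;          /* typed variables, i.e. registers */
    cont_t pc = MAIN;                   /* current continuation */
    cont_t link = HALT;                 /* the variable a BLA saves into */

    while (pc != HALT) {                /* computed jump on a sum: 0 = halt */
        switch (pc) {
        case MAIN:
            link = AFTER_SQUARE;        /* BLA: save the current continuation... */
            pc = SQUARE;                /* ...and jump to the stored one */
            break;
        case SQUARE:
            result = x * x;             /* stands in for a mul_int-style instruction */
            pc = link;                  /* "return" is just another computed jump */
            break;
        case AFTER_SQUARE:
            printf("%d\n", result);
            pc = HALT;
            break;
        default:
            pc = HALT;
        }
    }
    return 0;
}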

The hard part is extending the machine so that variables can be dynamically created on the heap. It's important NOT to have a machine stack, stacks screw up things big time.

For anything to work at all you have to have a product constructor (struct) to create aggregate types, so you can have projections. This is not required for the finite state machine but it is essential for the full machine because you want to create a collection of variables on the heap dynamically, store the heap address in a variable of pointer to product type, and be able to locate the components with projections.

So roughly programs look like this:

var x : int;
x = x * x; // sugar for mul_int (&x,&x,&x)

var px = new int; // heap
*px = *px + *px; // sugar for add_int (px,px,px)

If you allocate a struct on the heap, you have to use an instruction to calculate a projection to a component so the type agrees with the instruction: instructions always take pointers as arguments. This allows them to work with constants, global variables, and variables on the heap as well.
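
As a small illustration (plain C, with a hypothetical add_int standing in for one of the calculating instructions), a projection is just computing the address of a component, so the instruction still receives a pointer of the right type:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical calculating instruction: arguments are always pointers. */
static void add_int(int32_t *dst, const int32_t *a, const int32_t *b) {
    *dst = *a + *b;
}

struct pair { int32_t fst; int32_t snd; };   /* a product type */

int main(void) {
    struct pair *p = malloc(sizeof *p);      /* variables created on the heap */
    int32_t *fst = &p->fst;                  /* projection to the first component */
    int32_t *snd = &p->snd;                  /* projection to the second component */
    *fst = 20;
    *snd = 22;
    add_int(fst, fst, snd);                  /* the instruction works through the projected pointers */
    free(p);
    return 0;
}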

So the most important thing for a micro language is to have an architectural model in which you do NOT have to specify any calculating instructions in the architecture. You only really need dereference, computed goto and branch-and-link; then you can throw in any calculating instructions you like. Add some basic ones, monitor usage patterns, and throw in more for better performance.

Similar Ideas

I have been thinking along similar lines, but probably for different reasons. If all parameters are passed by reference (with no return values), this solves the escape problem and removes the need for heap allocation and garbage collection (in a much simpler way than Rust lifetimes). You cannot have an iterator that outlives the collection, for example. Procedure calls do not need to allocate any stack, so a procedure that does not allocate any local variables would be completely allocation free. This can lead to strong exception guarantees: if we limit the code in the procedure to instructions that cannot cause an exception, then the code cannot fail (except for page faults or bus errors - so the code needs to be correct, running in real memory, and the machine free from hardware problems) and so can be used in interrupt handlers etc.

I am not sure I see the advantage of a heap over a stack, as a heap is just a stack with holes? If as I outline above, you don't need a heap, why not just use a stack?

I don't see how you can get

I don't see how you can get away from Rust lifetimes, since some data structures (allocated on the stack) will likely have to contain references to other stack values. That right there requires a Rust-like region analysis.

Your setup would also mean that you cannot have procedures that allocate a structure of unknown size for a caller. You would have to split each such function into a pre-allocation stub and a main body that works on the allocated data. This quickly gets messy even for a single procedure with multiple allocations that depend on each other, not to mention nested calls and integrating with closures that may or may not allocate -- yikes. I went down this path experimenting with a closure library in C a while back, and the dynamic allocation problem gets hairy.

AST Rewriting

I don't see how you can get away from Rust lifetimes, since some data structures (allocated on the stack) will likely have to contain references to other stack values. That right there requires a Rust-like region analysis.

It's a stack-only language, so references have to refer to things before them on the stack; therefore the reference will _always_ be released before the thing to which it refers. No further analysis necessary.

Your setup would also mean that you cannot have procedures that allocate a structure of unknown size to a caller.

If everything is immutable, then boxed values can be allowed, where the box handle is on the stack; then all references are still sound.

This quickly gets messy even for a single procedure with multiple allocations that depend on each other, not to mention nested calls, and integrating with closures that may or may not allocate

Procedures can only allocate local variables, which does not seem problematic. Nested calls seem fine; arguments are all immutable pass-by-reference. It would not support closures.

No array allocation?

So you can't write a procedure that creates an array whose length is some function of that procedure's parameters?

Resize

You can pass in a reference to a zero sized array and resize it.

I suppose I'm confused,

I suppose I'm confused, because this would seem to require heaps and you said earlier "why have a heap when you have a stack?" You also mentioned keeping handles on the stack, which I suppose now means handles to heap values. So you're saying there's no general heap, just a restricted form of heap of some kind?

Stacked lifetimes

Some details are lost when you try to describe things quickly. All lifetimes would be controlled by the stack, so you can have boxed values in a DAG owned by a root on the stack. You can have a non-owning reference to anything rooted higher up the stack, or with the same root. Collections can either be referenceable (in which case elements are not deletable), or non-referenceable, where elements can be deleted but you cannot hold a direct reference; you must use the provided getters and setters.

What sort of program we

What sort of program that we might see in a GC'd language would you not be able to express with these primitives?

Optimal Performance

I don't think there is a program that you could not express, but it will be sub-optimal for code that builds graphs and frees a lot of nodes during processing, or highly parallel tasks that share memory.

It will work well for tasks that do not parallelise well, that pre-allocate a workspace before running an algorithm and freeing the workspace afterwards, or parallel tasks that communicate by message passing.

Stacks

The problem with linear stacks is that the idea doesn't scale to multiple logical threads, including coroutines. Most of the design and issues with my Felix system are related to the forced compatibility with C/C++ stack models.

Stacks support subroutines, but subroutines are a special case of control exchange, i.e. where the subroutine's continuation is that of the caller. Modern systems cannot survive without a more flexible mechanism for peer-to-peer context switching.

Swapping stacks is easy and fast; the problem is that linear stacks necessarily each consume the upper bound of required address space. Paging systems ensure only used memory is allocated, but there is no way to run large numbers of logical threads in a limited address space using linear stacks, and machine pages are often quite large (older systems used 4K but I think Linux now uses 32K?).

With a stack implemented as linked heap blocks, context switching is just as fast as with a linear stack; allocation and deallocation, however, are slower. On the other hand, the amount of address space required is roughly equal to the amount of memory actually used.

In addition, with nested contexts, threads can share ancestor data, so in fact linked frames can use less memory than a stack. So the stack should be reserved for operations between context switches. For example if you have a routine which has more than one possible continuation to select, the stack should be empty at the point the selected continuation is invoked (or at least used only for transient parameter passing).

Re: Stacks

I don't like those stack restrictions either. I specifically needed to implement my own stack as an array because the default function stack was too small. And guess what? The code execution was equally fast. And I even got an opportunity to optimize the code further by reusing and rejecting small blocks of stack segments. It was JavaScript, by the way; I don't know about the performance impact in other languages.

Stackless

I think stackless cooperative concurrency with asynchronous IO and an N:M threading model is an interesting design. As N:M threading requires a runtime to manage it, this design works well with garbage collection. That does not invalidate a stack-based design with no runtime and no GC; they are different points in the design space, optimising for different requirements.

One of the key use-cases the stack based idea is optimal for is tasks that do not parallelise well that allocate a block of memory before running an in-place algorithm, and then free the workspace afterwards.

Is it just me, or it is

Is it just me, or is it really odd that stacks are populated from higher addresses to lower addresses? What is the reason for that; what am I missing?

It's convenient to allocate

It's convenient to allocate the stack at the opposite end of your address space from everything else, leaving plenty of room; and it feels natural to allocate most stuff at low addresses. Those two things between them favour a stack growing from the top of memory down.

Stack space is contiguous,

Stack space is contiguous, heap space is fragmented. If you don't want to start breaking up your stack as soon as either your heap or stack grows beyond its initial size, then the heap grows up from address 0x0 (via sbrk) and the stack grows down from address 0xffffffff. That makes maximal use of the available address space.

Of course, this was before multithreading broke all of these properties anyway. It still has the nice property that taking the address of the top stack element yields a struct for the current stack frame. It doesn't work if the stack grows up and structs and arrays are addressed the same way.

So, you're saying that

So, you're saying that not checking for stack overflow can be a hell of an optimization?

That overflow should be checked automatically by the processor, without slowdown. Some hardware interrupt or something should be triggered in the case of overflow. *sigh*

The page below the top of

The page below the top of the stack is marked read-only, so the program faults when it overflows. The system traps the page fault, the runtime extends the stack space in response, and the program resumes. No slowdown beyond the usual costs of virtual memory (which are actually high).
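
For the curious, here is a rough POSIX sketch of the trick (illustration only, not production code; calling mprotect from a signal handler is tolerated in practice but not strictly guaranteed by the standard):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE 4096
static char *region;                 /* reserved "stack" area */
static size_t usable = PAGE;         /* bytes currently read/write */

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr >= region + usable && addr < region + usable + PAGE) {
        /* fault landed on the guard page: grow by one page and resume */
        mprotect(region + usable, PAGE, PROT_READ | PROT_WRITE);
        usable += PAGE;
        return;                      /* the faulting instruction is restarted */
    }
    _exit(1);                        /* genuine stray access: give up */
}

int main(void) {
    region = mmap(NULL, 16 * PAGE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(region, PAGE, PROT_READ | PROT_WRITE);   /* first usable page */

    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    for (size_t i = 0; i < 4 * PAGE; i++)    /* walk past the guard page */
        region[i] = 1;                       /* faults, grows, and continues */
    printf("grew to %zu bytes\n", usable);
    return 0;
}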

How hard it is to reallocate

How hard is it to reallocate a block of used memory into a bigger block while retaining the same addresses for the existing data in the block (from an assembler point of view)? Is that even possible? Does the MMU help here?

growing allocations

The main difficulty with `realloc` growing memory in place is that the address space immediately following the current allocation must be unallocated (and large enough). Testing for this is typically an unreliable optimization. It depends on non-local properties of a program: the allocation algorithm, prior allocations (which might influence where an allocation sits relative to a page boundary, for example), multi-threading (other threads performing allocations), etc.

Unreliable optimizations are in many ways worse than no optimization, because code that performs efficiently in one context can unpredictably slow down in another context (e.g. after adding allocations or another thread). This sort of fragility is not a desirable feature.

Moving memory that contains internal pointers or that is accessed through external pointers creates its own challenges.

Aside: If our address space was hierarchical, rather than flat, we could potentially grow memory without too much difficulty. But this would require a different hardware architecture. (Perhaps one aligned to more explicitly manage caching and NUMA?)

Easy with relative addressing. Not so much with general.

If you're using addresses relative to the base of the allocation, then realloc() does exactly what you want.
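
A quick C sketch of that, with a made-up offset variable: as long as data is named by its offset from the allocation base, it does not matter whether realloc() moved the block.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *base = malloc(64);
    size_t greeting_off = 0;              /* "pointer" stored as an offset from base */
    strcpy(base + greeting_off, "hello");

    base = realloc(base, 64 * 1024);      /* the block may move in memory... */
    puts(base + greeting_off);            /* ...but the offset still finds the data */

    free(base);
    return 0;
}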

If you're using absolute pointers, you'd have to do things with the MMU that, as far as I know, are non-portable.

Non-portable or not, though, I know some Garbage-Collection systems use this stunt at a low level and they've been implemented on lots of different hardware/OS combinations.

Sparse memory interface

Rewriting all of the addresses in use is difficult, a simpler approach is to use a sparse model of memory. These used to be called "segments" and they provide a 2D interface between the internal memory manager in the language runtime and the external (kernel) memory manager.

The main issue is performance: an internal allocation within the process may take 100-1000 clock cycles if it does not involve growing the process' address space through a syscall. If the process needs a larger address space it needs to make a brk() syscall to inform the system memory manager, which will allocate further pages and wire them into the pagetable. This can take 10,000 - 100,000 cycles.

Growing sections of memory is quite easy if the process uses a sparse model of memory. One simple scheme is binary subdivision of the memory space on each allocation, adding in extra pages where necessary to grow a particular region / object / segment. But the performance penalty of making a sycall per internal allocation would be huge.

The dense / contiguous memory model that we use in operating systems today is basically a giant performance hack to avoid this issue. It creates the need for handling realloc() inside the program logic, but it optimise the common case by orders of magnitude: allocation / freeing of memory without resize.

Stacks of helium balloons.

If you have trouble remembering which way stacks grow, think of them as stacks of helium balloons on the ceiling.

If you have trouble understanding why they made the choices they did for which end of address space is given the name 'stack' and which end is given the name 'heap', well, there's nothing there to understand. Both of the metaphors are things that grow/shrink by adding stuff to the top of them.

So stacks are stacks of helium balloons and heaps are heaps of sandbags. It would make the same amount of nonsense the other way round, but it's easier for me to remember because stacks are also the lightweight structure in that allocating and deallocating from them is relatively streamlined.

As far as I can tell ...

... stacks grow downward because the PDP-11 had post-increment and pre-decrement addressing modes only. Before that, stacks were just as likely to grow up as down.

PDP-10 stacks grow towards higher addresses

It's interesting that the PDP-10 has PUSH,POP,PUSHJ and POPJ and the direction the pointer advances is specified in the definition of the instruction such that the stack grows towards higher addresses.

2.3.1. PUSH - Push on Stack

PUSH C(AC):=C(AC)+<1,,1>; C(CR(AC)):=C(E)

The specified accumulator is incremented by adding 1 to each half

(There's a guide at Columbia with the details.)

I would guess that PDP-11 was more influential on C implementations than PDP-10 for a variety of reasons. But if you were taking advantage of the PDP-10 instruction set the stack grew upwards.

stack and heap

I think it was shown (in the '60s) that regular programs require only finite state, context-free programs require a stack in addition, and general programs require a heap as well. So the question arises whether you can factor the code into three separate parts.

However, we're talking about sequential (single-threaded) programs here. If you're going to do multi-processing you might as well bite the bullet, throw out the static store and the stack, and do everything with a single uniform allocation method, and that has to be dynamic (heap) allocation. In practice you can do micro-optimisations where fixed variables (registers) and small known amounts of stack can be used locally to improve performance, but that only works with cooperative context switching. As soon as you go to pre-emption, the system has to save more and more state.

cooperative is faster

I think applications like Node.js and the Go programming language show that cooperative concurrency with asynchronous IO and an N:M threading model offers a better solution than threads with pre-emptive multi-tasking.

and non-uniformity is faster

Yes, cooperative is faster. You can make cooperative local while still messaging other things globally that pre-empt. The model is usually that local is faster. (For example, local use of registers.) But to leverage this well, a programmer must be able to tell what is local.

If everything looks completely uniform, it's hard to see cost of operations. Did I just call a sub-routine, or did I send RPC to another machine? You want local to be visible, so program architecture can arrange local contexts nested inside larger global contexts. Points of interaction should be amenable to analysis: how many transitions from local to global happen per unit of load here? Transparency is an enemy of planning for optimization.

Fewer entities is better of course. But too few hides local/global boundaries. So you want at least as many entities as flavors of local and global context. It's awkward when things of different flavor interact, but this awkwardness is incentive to stay local and efficient when possible. You can make it easy, but invisible boundaries seem a poor idea.

The upshot is that you want concurrent but single-threaded things nested inside concurrent but multi-threaded things, where the boundary is apparent. When you elevate everything to work in the worse case context, it gets simpler but slower, and doesn't give much traction to plan performance.

Edit: you would like to be able to change local to global without rewriting. When debugging, it's quite helpful to move part of a sub-system to a place where you can audit its traffic closely, even slow it down to human response speeds. So a programmer can write a bunch of interacting parts that "know" each of the others is local; but when debugging, it's nice to isolate individual pieces by making them remote (say in another process) to triage causes of problems. That you did something "evil" (mapped one part to remote) just needs to be visible.

erlang as irony?

The Erlang people seem to love to say that it is stupid easy to move a program to being distributed. No "code" changes, "just" configuration changes. It always seemed to fly in the face of the 9-plus-and-counting Fallacies of Distributed Computing to me. (But maybe the real answer is that Erlang code gets written e.g. with timeouts even when it is first just done on a single machine.)

don't see irony yet

(This should all sound like agreement with everything you said.)

I believe it is easy in Erlang to move where things happen. This is great; it's only bad if you cannot figure out where something will happen, or did happen, to plan and debug. You want interfaces to be location agnostic, so they don't care at the bottom, specifically so someone who does care can pick good arrangements. The mistake one can make is to assume location won't matter, and allow software to pick location at random, thus erasing relevant relationship knowledge a developer might have.

Erlang is location agnostic because async is native, so dispatch and reply can take an arbitrary amount of time; thus pervasive timeout support is helpful.

Configuring where things happen sounds good, since a configuration can capture what a developer plans with regard to locality (for performance or other reasons). It must also be possible to empirically spot check evidence, when debugging, for inferences to be valid about cause and effect relationships.

Visible but refactorable pre-emptive/cooperative boundaries

I would love to know how to actually do that.

Felix has synchronous threads and channels and pre-emptive threads and channels. The syntax of basic use is the same, roughly:

spawn_[f/p]thread procedure (f = sync)
create[s/p]channel
read/write

However, despite the similar syntax, the semantics are utterly different. Fibres can access shared variables without locking; pthreads can't. Fibres suicide on deadlock (it's not a bug); pthreads actually deadlock (always a bug).

For me the real problem is the converse of your requirement: how to get the type system to help *prevent* fibres in the same thread trying to communicate with pchannels, or fibres in different threads trying to communicate with schannels.

Common features

Well, you have to choose the features so they become the same, so you need locks on shared variables, and everything non-blocking.

balance of direct and indirect in bindings

I would love to know how to actually do that.

Via some indirection, so rebinding can occur, but not so much that it confuses the programmer. Short answers are necessarily vague though, right? To make it apply to every language with binding requires being pretty vague. As I give concrete examples, the narrowing may sound like it loses generality; if this happens, pop back out to 'binding' and go down another concrete path with different rules.

To interact with something, you need a reference to it (or to a proxy), and this ref comes from someplace. If you made the item and its ref yourself, locally, you know exactly how it works and what happens when used. But if you were willing to work with a ref that came from somewhere else, you would know less, after some anonymization of the types and interfaces involved. Your local ref could also be used via the more general anonymous type and API, perhaps, if you were careful to plan so that substitution works in a uniform way. That your code works with either local or remote refs is part of your design factoring. As the developer who wired things up, you would know whether you used a local ref or got the ref through an input arg or a service. When it comes from a service, you can control the mapping used by the service via configuration, or in some other way that lets you discover its provenance.

(Whether you can debug anything weird that happens mainly depends on whether you can observe and audit the cause and effect chain of any signals involved, in the sense of bits propagating in a light cone, not the operating system sense of 'signal'. If the system is sufficiently aggressive in saying you don't need to know, you may be screwed if the hidden parts seem to contain the problem area. Swapping things is part of the diagnostic regime to isolate symptom scopes.)

As I start to say things about fibers and threads, this may cause a jarring snap to concrete which loses part of the general idea. Also, I will have to assume some particular hypothetical scheme of interaction, but feel free to remap it fluidly to something you like better, like your Felix design. As I segue into definitions, some ideas about typing implicitly creep into the mix, because I don't want to separate the discussion and make it longer.

Say 'blocks' to mean a thread that yields, and 'parks' to mean a fiber that yields (and by yield I mean actually de-schedule). Next, say a primitive routine blocks when it might block the calling thread. Note I find 'can-block' a somewhat awkward construction, but that's all it means to say a primitive can block. You can mark a primitive with a 'blocks' annotation to mean it can-block, similar to the way you can mark a primitive 'const' to say it does not mutate state. Clearly type analysis can look for const methods that reach mutating calls, and look for non-blocking methods that can reach blocking calls.

If your language has fibers, then every call can-park if dispatch can de-schedule (usually cooperatively by some scheduling oversight-entity). But you might compile your code to another language, target T, where some code clearly calls only local non-parking primitives. After conversion from source S to target T, some calls will not have the can-park attribute, even though nominally all calls in S have the can-park attribute. The non-parking primitives in T can run faster, locally, because they don't have to obey the speed limit. (If S and T are the same language, this is confusing, but that's what happens when both S and T are actually C, for example. Whether can-park is pervasive depends on whether context is before or after translation.)

Okay, so you want to write your fibers to not know whether they are messaging an entity in another thread, process, or network node. If there's shared mutable state involved, you can't, without something pretty complex happening. (You could have another local fiber with rights to access the shared mutable state, acting as the proxy of some remote agent; this would work, but it's complex.) So it has to be message oriented, or streaming (think UDP or TCP, or whatever concrete thing you want to swap in as example.) The ref for this other entity came from someplace, local or remote, but you can use the anonymous version of interface that works remotely. The only question could be: how do you avoid blocking the fiber? Only parking is allowed. To call a can-block primitive may lock up every other fiber running as peers of the calling fiber, which is a no no. This includes the call that actually messages out of the local thread; to interact with the world outside seems to require a can-block operation.

A thread hosting fibers (H) should have a non-blocking queue to a worker thread pool (W1, W2, ...) that services can-block primitives. A fiber F that tries to message another entity that turns out to live in another network node does this: parks if the non-blocking outbound queue has hit a bounded buffer limit. If the queue was not already full, it doesn't park. But if it wants a reply or status return value, it explicitly parks for a reply. Something moves the message to the worker queue's input channel, but without blocking H, so the fibers in H can keep running. After some worker Wi services the can-block operation F was not allowed to do directly, a response goes back to H, where the scheduler makes F runnable, so it can unpark and consume the reply.

Operationally, everything looks like calls in fiber F. But code is annotated with the possible consequences of calls, like those that can-block. A dev can see this is true about the API involved, and it's documented (in a non-crazy world). A dev also knows if a thing messaged was a local fiber (because it was statically spawned locally), then only local context switching will happen. If the contacted entity came out of a service listing, a dev can arrange the config either makes it local or remote, depending on what the dev wants to happen. If you want the whole thing to get fluid and dynamic, maybe a COW virtual file system exposes a Plan9 style path to provide access, and then the dev controls that file system as config.

Yeah, this was about twice as long as I hoped. But sometimes you should finish instead of saying so little it requires ten back-and-forth exchanges to clarify. I hope my explanation triggers a "I knew that already" reaction, so it fits in with familiar stuff.

Refs and control

But the issue here isn't the ref, which in this case is a channel. The problem is that control flow matters because it determines ordering. If thread 1 reads A then B and thread 2 writes B then A, we have a deadlock in that neither can proceed. But if the channels are buffered, there is no deadlock.

Why is this? Because the control flow interaction differs. In Felix with fibres, the deadlock is permitted and a legitimate way to terminate fibres, with threads it isn't. Merely throwing in a buffer allows the fibres to proceed, and resolves the thread deadlock (at least temporarily, since buffers can fill up and we get back to a deadlock). However this system has utterly different semantics.

What I'm saying is that the semantics of the refs matters. Where they come from doesn't matter to the fibre or thread, only the overall program that launched them and supplied the connections.

So now, we have the reversed issue: if we want unordered read/writes we need the main program to throw in the buffers, and it has to know the semantics of the threads to calculate what channel kinds are required.

The *problem* here is that the semantics of the threads/fibres and channels types and connections add up to the program semantics, but there is no good theory describing how to design with such a system. The best we have is Hoare's Communicating Sequential Processes (CSP) and derivatives, which are about concurrent threads. Fibres aren't concurrent. It matters.

The best model I have seen for fibres is actually Andrzej Filinski's masters thesis "Declarative Continuations and Categorical Duality" which introduces a Symmetric Lambda Calculus (SLC) and Symmetric Categorical Language (SCL). It turns out Felix fibres and channels (pipelines in particular) model this system reasonably well (but remove the dependence on evaluation ordering by making it explicit).

Anyhow, the point is you can't just replace synchronous channels with pthread monitors (pre-emptive channels), because the semantics of the channels are distinct. You can only do this if the whole system was designed to "not care" which kind of channel is used and whether the threads are synchronous or pre-emptive. I have no idea how to do that without throwing out the prime advantages of abandoning the weak model of processes communicating with messages. The best attempt to keep the model but avoid actually sending messages is probably implemented in Rust.

control is a tough nut (Ordinary People reference)

By the time I'm done replying, I might have a general point not yet in mind. My motive to respond is because "conversation is fun"; if I have an agenda, it's not conscious. I enjoy exploring the idea of bindings: why bit patterns mean things, depending on context.

I was trying to be vague about what a ref is: an ID, a pointer, an object, an interface, etc. In your case it's a channel. In object oriented languages, a ref is often an object identity, which either actually does a thing involved, or acts as a proxy for activity elsewhere. Sounds like you are using channel address as ref. I use a lot of C interfaces where integers act as descriptors for files or sockets. Erlang uses process ID as a ref for sending messages. I favor short integer sequences as ID format, so they can include (for example) a generation number making an ID invalid when lifetime has passed. I like supporting revocation without risking memory corruption.

(Sidebar: file descriptors are too easy to abuse, because you can close one that now belongs to someone else, because you accidentally closed it twice and it got re-assigned concurrently to a new file or socket between first close and second redundant close. So they should be repackaged, via indirection, so they include a generation number that changes when lifetime does. A generation number is like a weak capability, but with too few bits to guarantee accidental forging will not happen; as long as it usually catches a problem, you can signal violently like a soft SEGV so the bug gets fixed. That would cancel a fiber in release builds, but might abort the process in debug builds.)
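
A tiny sketch of that kind of handle (made-up names): a slot index travels with a generation number, so a stale or doubled-up close is caught instead of hitting whoever reused the slot.

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t slot; uint32_t gen; } handle_t;

#define MAX_SLOTS 256
static struct { uint32_t gen; int fd; bool live; } table[MAX_SLOTS];

static bool handle_valid(handle_t h) {
    return h.slot < MAX_SLOTS && table[h.slot].live && table[h.slot].gen == h.gen;
}

static void handle_close(handle_t h) {
    if (!handle_valid(h))
        return;                    /* stale generation: report or trap instead of closing */
    /* close(table[h.slot].fd) would go here */
    table[h.slot].live = false;
    table[h.slot].gen++;           /* invalidates every outstanding copy of the handle */
}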

Fibers are not physically parallel, but they are semantically concurrent. At least that's the usual definition. Fibers can interleave execution due to context switches, though they run only serially, and that makes them logically concurrent. So they need synchronization primitives when mutable state gets read/written non-atomically in critical sections. That would only happen if a fiber in a critical section had no choice but to hold a lock when making a call that potentially parks, letting another fiber run. If the entire critical section can be done without making a parking call, it's atomic (with respect to fiber scheduling) and doesn't need protection; and yes this is really cheap, thus a benefit of cooperative scheduling in fibers.

I basically like monitors and condition variables; they are really easy to do in fibers, because it's just a data structure less complex than a hashmap. Only fiber cancellation gets hairy. Ties to other parts of the world must be visible for each fiber, so killing one can unhook it from whatever it was using. If waiting on a condition variable, you need to pop it from the waiting list first. But if a fiber is holding a lock — the mutex in a cond var — then killing it invalidates the critical section. You either need to also kill every other fiber waiting on the cond var, or awaken them with an error reporting the bad news. Probably they need to abort themselves after coping.

Partly I explain that to illustrate an idea: that system organization devolves from a combination of early decisions and choices followed by making later parts fit your starting ground rules. You might start with channels and then make later design elements fit that model, because you wanted guarantees about channels. Whereas I didn't start from how things communicate; instead I refine how fibers and synchronization primitives behave, then define communications around data interactions with suitable sync primitive usage. When a fiber is waiting, I want it parked on a condition variable, always.

Some kinds of system organization are hard to talk about, because I haven't heard folks come up with standard definitions. :-) They just 'wing it' during discussion: "you know how when this happens and that happens, and we need to do this?" But not quite enough verbs and nouns are coined about those things. In particular, there is one situation — responding to external signals — where ad hoc proposals are common, without bothering with terms. Control flow branching makes folks use specific terms they know in their language or operating system, without bothering with general terms.

I see most concurrent activities around managing state need to have an agent not only handle its business, but respond to being prodded out-of-band. This is confusing if an agent is one fiber. So I like to think of an agent as N closely related fibers, where N is the smallest number that works for an agent: the number of things it needs to wait on simultaneously plus the number of async signals it must respond to. I want each continuation to be a fiber. To signal an agent, this awakens the fiber that was waiting on the cond var for one or more input channels served by the agent. So all fibers in one agent are absolutely in bed with one another, so killing one might as well kill them all; thus you kill agents and not fibers, and this shuts down all the fibers inside. Basically an agent is a process, and fibers are like lightweight threads. But channels are not central in a model like this. Instead they are like pipes or connections running in the fiber runtime.

(It sounds expensive to some people, but a fiber can be a single activation frame, where they act like callback functions with associated contexts ... except they are first class schedule-able objects with formal relationships to a containing parent agent. So you get structured programming instead of chaos.)

I start from a mental runtime simulation instead of a calculus, so my habits of mind may differ from someone thinking math. Instead of working on timeless propositions that are always true, I'm seeing a rube goldberg machine that follows enough rules to be managed.

replacing signals

On the topic of signals, yesterday Hacker News coincidentally had a discussion of material on signals in Should you be scared of Unix signals? by Julia Evans. It's nice to hear folks talk about replacing old signal designs.

A couple paragraphs before the end of my last post above, I'm talking about replacing signals in green thread systems. (It's reasonable to ask what this has to do with programming languages. Well, should you be able to talk about system semantics in a PL system? At the very least, you need an answer to what happens in concurrent code when system effects occur. Otherwise you have not defined what is still valid after subsets of fibers are killed/cancelled, for example. Not having the PL involved hardly seems feasible.)

If there is a primitive communication mechanism in a Unix process, I want to be able to closely mimic that in a lightweight process agent, where N of these run single-threaded but non-blocking in a host thread inside (say) a daemon process. The idea is to port stand-alone process code so it runs as an agent. If some code is driven by signals, a similar API would make things less awkward to port.

If each agent has a condition variable for arriving signals and messages, a fiber in an agent can wait on signals and messages. (When a signal arrives, that fiber can already be running from an earlier signal, so it must check for more signals that arrive asynchronously before it parks on the cond var again.) The responsible fiber in an agent is well-known by an agent-local ID, so we can check whether a signal will be handled, or whether we should kill the agent if necessary. Basically it's just simple bookkeeping.

It looks like simulation of signals. When there are actual signals, you map them to the proper agent that simulates handling them. (In code, the difference between real and simulation can be a bit arbitrary; usually fast is considered real, so fast simulation is a bit ambiguous.) One of the inter-process communication messages would amount to "find the agent this is addressed to, set a bit in the signal mask, and then report via cond var." You would be able to signal an agent running in a daemon by running a command line that launches a client, builds the proper message, then connects to whatever listener the daemon advertises.

To invent a completely different set of signal definitions would be weird. But to prevent depending on a zillion pages of messy docs about signals, you would want to write a new short specification that (over) simplifies things. Pidgin versions of complex things seem better than depending on huge specs.

Stu: Can I assume you thought about security, despite not mentioning it?
Wil: Yes. How surprisingly charitable of you. Are you okay?
Ned: It's a pretty long discussion, which is why I think it was skipped.

I like two versions

of programming tools: one that "just works" and one more complicated that can be fine-tuned. I'd leave it to users to build anything in between.

low and raw versus high and polished

I like two kinds as well, raw and cooked: extremes of too-many-knobs and safe-useful-defaults, with the latter a high level application of the former primitive interface.

Users cannot build some kinds of things that need to be there at a primitive level, unless they assume the role of systems programmer as runtime developer. And if something must be standard, you don't want users defining conflicting versions. In the case of signals, for example, you want some things uniform, like the ability to kill any agent, because rogues immune to control are forbidden in a shared space.

Anyway, it's possible to make the way you manage systems portable, because the message handling on each operating system maps to roughly similar things. But that is less likely unless you plan it that way.

Spaghetti stack for the win.

Scheme and similar languages have no distinction between dynamically and statically allocated memory. All memory is allocated in call frames, and whatever has a pointer to it doesn't get garbage collected when the called routine returns. IOW, 'dynamically allocated memory' in that type of language is a persistent closure.

Under the hood, a lot of Scheme systems have complicated heuristics for figuring out which things are most efficiently allocated in a stack-ish structure that gets auto-freed as procedures return, and which are most efficiently allocated in a heap-ish structure that gets more general garbage collection applied to it - but from the POV of the abstract language there is no distinction and everything is in a call frame.

Or, like Chicken ...

... the C stack is the first generation of the Scheme heap, and when it gets big enough, a minor GC copies the live data into the rest of the heap. Because Chicken procedures are CPS-converted but are called as C functions, all the control frames are garbage at that point.

Investigating

There is something that doesn't quite thrill me about C pointers. It is the indirect dereferencing of pointers, as in the following example:

*(p + 4)


Firstly, I'm not thrilled about having to deduce what type the pointer is. Secondly, if we want to implement our own struct mechanism, to reach each struct member we have to reference it inline like this:

#include <stdint.h>
#include <stdlib.h>

char *s = malloc(9); // declaration ('struct' is a C keyword, so call it s); malloc returns a pointer to the new memory chunk
*(int32_t *)s       = 50;
*(int32_t *)(s + 4) = *(int32_t *)s;
*(uint8_t *)(s + 8) = 255;


This works fine (people have been using C pointers for decades), but there is something in this example I don't like. I don't know exactly what it is, but I thought of another pointer system anyway.

Pointers have exactly two pieces of information attached to them: a dereference type and a memory address. So I thought of this notation for dereferencing a value at an arbitrary memory address:

type @ address


This looks nice when used on both sides of an assignment. If it's on the right side, we get a dereferenced, typed value; if it's on the left side, we have a type to cast the right side to and the address at which to store it. It works exactly as if it were a regular variable.

So our struct from C example now looks like this:

int32 struct := malloc (9); // declaration, malloc returns the physical address of the new memory chunk
int32 @ struct     := 50;
int32 @ struct + 4 := int32 @ struct;
int8  @ struct + 8 := 255;


By reusing '&' operator from C, we get the final flavour of this new notation in trying to access memory:

int8 a := 2;
int32 b := &a;
int8 @ b := (int8 @ b) + 1; // increments the value of a
// note that 'int8 @ &a' is equivalent to 'a'
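
For comparison, a rough C rendering of that last snippet (using intptr_t where the original stores the address in an int32, just to keep the sketch portable):

#include <stdint.h>

int main(void) {
    int8_t   a = 2;
    intptr_t b = (intptr_t)&a;         /* 'int32 b := &a' in the notation above */
    *(int8_t *)b = *(int8_t *)b + 1;   /* 'int8 @ b := (int8 @ b) + 1': increments a */
    return a;                          /* returns 3 */
}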

I'm still not sure whether I should implement changing a variable's address. It could be implemented with '&var' on the left side, like this:

int8 a;           //declaration
&a = 0x0BADBABE;  // address of a is changed here


but I don't like this. It complicates the compilation of variables, and both the speed and memory performance of the compiled code count. Maybe I don't need this last feature at all.

Beside pointers

Besides the pointers with simple arithmetic described in the previous post, Microscript would implement three more constructs: labels, a jmp ... (if ...) command and a halt command. It would be a stackless language. At the end of the code you would place a label which serves as a pointer to all data and/or runtime-generated code, and which could serve as a stack if you want. To be complete, there would be two more commands for growing/shrinking the address space (so you know when you run out of memory), and I think that would be it.

Here is how a for loop could be made:

grow 100;
int32 @ Data := 0;
Loop:
int32 @ Data := (int32 @ Data) + 1;
jmp Loop if (int32 @ Data) < 10;
halt;
Data:


Behind the Data label (or anywhere else inside the grown address space) you could write any code through pointers at runtime, and use jmp to run it.

Simple, isn't it? Labels, pointers, jmp, grow, shrink and halt.

Dijkstra

Looks suspiciously like a goto, and whilst I agree with Stepanov that goto has appropriate uses (in state machines, for example), a language with only gotos sounds like a step back to pre-structured programming and earlier BASICs. Dijkstra would not be happy. :-)

I actually quite like this as an intermediate language to be generated by a compiler, and think it would make a good backend to replace LLVM if I have understood it correctly (no GC, no stack, static storage requirements for each 'block').

probably RIP

I ended up finishing this definition of Microscript merely as a curiosity, as WebAssembly popped up in the middle of the design process. In the end, I'll probably use WebAssembly as the base language for my parent project (instead of Microscript), as I don't want to lose the web potential.

Calls and returns

So how do you encode calls and return addresses?

Just like plain assembler

It's just like plain assembler, but with manual assignment and without push/pop commands. That means you have to maintain your own stack pointer, maybe at the "Data" address, before other data. I decided not to implement push/pop commands because I think that's not the only way to pass data around the code (don't ask me how else, because I don't know, but I feel people could be creative; I wanted to give programmers complete freedom).

A label is merely an

A label is merely an internal address representation. You can also do "jmp 0xAFFF" (or whatever address) if you know 0xAFFF is the address you want. You put a label before a function, so you can do a smart "jmp".

You can simulate a stack (or whatever) after a label placed behind the code (in our example, Data:). You populate the stack yourself through pointer usage (i.e. at the address of "Data" you write the stack pointer, at "Data + 4" the first return address, at "Data + 8" the first function parameter, at "Data + 12" the second parameter, and so on, even recursively).
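
A minimal sketch of that encoding, written in C only to show the shape (the label names and the layout of the Data area are made up): labels are small integers, jmp is a loop around a switch, and the stack lives in ordinary memory behind the code.

#include <stdint.h>
#include <stdio.h>

enum { HALT, MAIN, AFTER_CALL, FUNC };   /* labels as plain numbers */

int main(void) {
    int32_t mem[64] = {0};   /* the grown address space behind "Data" */
    int32_t sp = 1;          /* mem[0] could hold the stack pointer itself */
    int32_t pc = MAIN, result = 0;

    while (pc != HALT) {
        switch (pc) {
        case MAIN:
            mem[sp++] = AFTER_CALL;   /* push the return label */
            mem[sp++] = 7;            /* push the argument */
            pc = FUNC;                /* jmp Func */
            break;
        case FUNC:
            result = mem[--sp] * 2;   /* pop the argument and do the work */
            pc = mem[--sp];           /* pop the return label: the "return" */
            break;
        case AFTER_CALL:
            printf("%d\n", result);
            pc = HALT;
            break;
        default:
            pc = HALT;
        }
    }
    return 0;
}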

What is the benefit of the data label?

Could you not write the same style of code if you assumed the data area started at address zero?

Good question

The problem is that data and code reside in the same address space (I wanted to enable JIT compiling by writing raw code bytes directly to memory, just like in assembler).

I put the code at address 0, so the data goes elsewhere.

Looks position independent

The style of code looks position independent. If it is translated into PIC on the target, then it provides more freedom in the relation between code and data addresses. The "grow" primitive simply allocates a range of memory, so presumably it could be used either for a data structure or to emit code into. It should be possible to interleave code and data areas during the JIT - that might have a good effect on data locality (I think you mentioned this earlier?). If the address space is large and there is good locality between the code and data addresses, then it makes sense to think of the program as a stream rather than a static declaration.

Address

Something should be in a constant position: either an address from which code executes, or a pointer to the code. Basically, you can put a jump at that position, so it is more or less the same where the code actually is. Though I'm not sure how to specify where to place the code from a source file.

possible implementation of Microscript in Asm.js

It is a bit of a challenge to implement Microscript in asm.js, as JavaScript doesn't support a goto statement. A possible solution would be to have a set of functions, each containing partial duplicates of the original asm.js code running from a given line to the end of the code, for every line from the first to the last. When we need to jump to a certain line, we exit the current function and call the function whose code starts at the wanted line. But that would introduce a memory consumption problem, with the memory that contains the code growing quadratically by the formula:

(NumberOfLines ^ 2) / 2

But we could do a bit of optimization here. We could limit the number of lines in each function to a constant (32 or so). That way, when we execute the code normally, without a jmp command, we sequentially execute the relevant functions one after another, every 32 lines. When we encounter a jmp command, we exit the current function and call the function whose code starts at the wanted line. This reduces the transpiled code size to a linear formula. It would be nearly:

NumberOfLines * (32 ^ 2) - (32 ^ 2) / 2

if we choose the code fragment size to be 32 lines. A pretty steep function, but it's linear. I think this falls within reasonable limits, as current personal computer memory sizes range from about 2 to 4 gigabytes. By ensuring that each function transpiled from Microscript code doesn't have variables that affect stack usage, the cost of this approach is only that every 32 consecutive lines the system has to exit the current function and call the next one. This affects about 1/32 of the code, which should be a more than acceptable slowdown.
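
A sketch of the shape of that output, in C for brevity (the same structure works in asm.js-style JavaScript; the chunk names and sizes are made up): each chunk of straight-line code becomes a function that returns the index of the next chunk, and a small driver loop replaces goto.

#include <stdio.h>

static int counter = 0;

static int chunk0(void) {              /* lines 0..31: initialise */
    counter = 0;
    return 1;                          /* fall through to the next chunk */
}
static int chunk1(void) {              /* lines 32..63: the loop body */
    counter = counter + 1;
    return counter < 10 ? 1 : 2;       /* jmp back, or fall through */
}
static int chunk2(void) {              /* remaining lines: done */
    printf("%d\n", counter);
    return -1;                         /* halt */
}

int main(void) {
    static int (*const chunks[])(void) = { chunk0, chunk1, chunk2 };
    int next = 0;
    while (next >= 0)                  /* the driver loop replaces goto / jmp */
        next = chunks[next]();
    return 0;
}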

Instantly brings to mind the CLR and the JVM

Isn't that the entire purpose of a VM like the CLR or JVM? Lots of good lessons from reading the instruction sets they chose.
Note 1: you do have to decide whether to support 'objects'. They can't be grafted on top.
Note 2: the CLR in particular has a stack with variable-length entries, to support pass by value. Think about that too.

Why?

So, why am I building another bytecode language? The answer is that I want to have full control over the lowest-level language. Actually, I'm building a multilayered transpiler where each layer transpiles to the layer below, while providing the possibility of using lower-layer constructs in higher layers. It is about a new extensible language in which you can extend the syntax with your own rules. The starting language could branch into various forms in the future, and it is up to the programmer which forms she combines in her own development.

The lowest layer (code name Microscript) is just a byproduct of one possible use of the syntax-extensible system I'm building. It would be a proposed base for defining other languages that transpile on the fly to Microscript. However, this syntax-extensible system could be used for transpiling any AST to any other AST, assuming the former can be defined in the latter.

I don't plan built-in support for objects in Microscript. Higher-level language designers should think about building that, or any other paradigm. As for the stack, a linear address space would be provided, so programmers can do with it what they really want. I'd like to keep it as low level as possible, as I don't want to force any particular technology in advance.