Streaming Language Rewrite Processing (SLRP)

Streaming Language Rewrite Processing (SLRP) is a low-level computing model characterized by limited-memory processors rewriting a much larger input stream. Computation under this constraint requires multiple passes, but is highly amenable to pipelining and partitioning, and thus to parallelism. SLRP processors may have side channels to external devices, driven through 'effectful' rewrite rules.

TLDR: SLRP is a very promising alternative to the von Neumann architecture. SLRP offers advantages for security, concurrency, distribution, and scalability. However, SLRP performs poorly for data plumbing and loops because there is no pass-by-reference within the stream. This weakness can be mitigated by effects, shunting a region of the stream to a remote device and keeping a small placeholder: a reference.

The article develops a Turing-complete foundation for an SLRP machine code based on concatenative combinatory logic, and sketches several extensions. The challenge is that the operands of a combinator may be larger than a processor's memory: for the copy operator `c[X] == [X][X]`, the value `[X]` might be larger than the processor can remember, so it cannot simply copy and paste. Solving this involves divide-and-conquer tactics and a notion of open-ended blocks.

mechanics of reduction

I added a 'background' section in response to your comment (which seems to be gone now).

Regarding open-ended blocks and `\`:

I introduce `\` because I want the behavior `o[A][B] == [AB]`, but mechanically our processor cannot necessarily fit all of the first operand into its limited memory. We can only start printing `[A` and then, when we reach the end of the operand, figure out what to do next. If the input continues with `][B...`, great: we just don't write the `][` to output, and we're done. If not, all we can do is leave a mark in the stream so we can continue the composition on a later pass. `\` is exactly that mark.

However, we have not lost the equivalence: `[A\` is just another way of writing `o[A]`. Thus, `\` is still a block delimiter, both formally and mechanically.
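
To make the mechanics concrete, here's a small worked trace (my own, using the equivalence above; `X` stands for input that hasn't yet reduced to a block value):

o[A]X  == [A\X             (operand not yet a block; emit the mark, move on)
[A\[B] == o[A][B] == [AB]  (a later pass completes the composition)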

Streaming Language

I am interested in the topic; however, I got lost at the first line.

a[A][B] == [B]A     (apply)

If it is "apply", shouldn't it be `A[B]`?

The next line has the notation in the order I'd expect:

b[A][B] == [A[B]]   (bind)

Did I misunderstand something? Is the program in reverse Polish notation or not?

Then how to read the more complex rules such as:

c[XY    == oa[a[o]w]c[X]c[Y

re apply

What you're looking for is the 'inline' operator: `i[A][B]` will give you `A[B]`. More generally, `i[A] == A`. We could define inline from the initial four combinators as `i = daa[[]]`. However, in SLRP we end up wanting `i` to be primitive.
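
For anyone who wants to check `i = daa[[]]`, the derivation goes through using just apply and drop (assuming `d[A] ==`, drop, from the initial four):

daa[[]][A] == da[A][]   (a[[]][A] == [A][])
           == d[]A      (a[A][] == []A)
           == A         (d[] == )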

The 'apply' operator from Awelon corresponds to "dip" in Joy (or Factor). But 'dip' doesn't fit my nice `a b c d` mnemonic, and I viscerally dislike the word almost as much as 'moist'. Anyhow, apply as defined is quite useful in practice, and has a nice symmetry with 'bind' (bind a function to first argument vs. apply a function to all but first argument).

Anyhow, apply is correct as defined. And yes, it's reverse Polish notation.

Assembly not machine code?

This looks like an assembly language; surely a machine code would be a bitstream encoding?

machine code, not assembly

SLRP is an architecture (essentially, SLRP is stream rewriting in multiple passes by fixed memory processors). Like the von Neumann architecture, SLRP is not specific to any instruction set.

The machine code in the article is indeed a code we could directly implement in a physical processor. But it's also a didactic proof of concept. The design goal was not performance, but rather to explain the SLRP architecture, to prove Turing completeness, to demonstrate generality of the model, and (for anyone interested) to provide a viable starting point to discuss and design a more performant machine code.

So you could certainly encode the instruction set as a bitstream if you chose to develop SLRP processors.

That said, I do wonder whether simply adding a lightweight stream-compression module to the processor (decompress on input, compress on output) might serve us better than attempting a Huffman encoding or whatever at the operator level. It's not something we could easily do in a language with jumps, but with stream processors the idea of hardware compression could work quite well.
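
As a rough software analogy of that module (a minimal sketch with zlib standing in for whatever lightweight hardware codec we'd actually pick, and an identity `rewrite` where the reduction logic would sit):

```python
import zlib

def rewrite(chunk: bytes) -> bytes:
    # Stand-in for the processor's reduction logic; here it just echoes.
    return chunk

def compressed_pass(stream: bytes, chunk_size: int = 4096) -> bytes:
    """One pass over a compressed stream: decompress on input, rewrite,
    compress on output, never holding more than one chunk in memory."""
    dec, com = zlib.decompressobj(), zlib.compressobj()
    out = bytearray()
    for i in range(0, len(stream), chunk_size):
        out += com.compress(rewrite(dec.decompress(stream[i:i + chunk_size])))
    out += com.compress(rewrite(dec.flush()))
    return bytes(out + com.flush())

# round trip: zlib.decompress(compressed_pass(zlib.compress(b"a[A][B]")))
```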

Silicon Area

The area of silicon you use for compression would reduce the number of cores you can fit on a wafer, so it has a direct cost implication for production. All of this presupposes a CISC architecture, of course. In a RISC architecture the bits in the bitstream (instruction set) tend to correspond directly to the internal control lines of the processor, which saves on instruction-decode logic at the expense of instruction density.

The best option here might be RISC op-codes that get stream-compressed, so the processor decompresses the incoming stream into an internal window of RISC op-codes that directly switch the reduction logic.

re Silicon Area

There are definitely some tradeoffs involved. It's been a few years since I read much about compression. I vaguely remember a blog article on an adaptive bit-stream encoder that used only a very small window and a small amount of state. Its effectiveness wasn't great, but not bad either.

At first I thought it was finite-state entropy, but upon review that doesn't seem to be the case. I'm pretty sure it was a finite-state adaptive arithmetic coder.

Anyhow, I'm pretty sure we can develop a compression algorithm that has a small size on die and reasonable effectiveness for most input.

IIUC, the cost of moving code and data from memory to CPU is quite significant and is a big part of what motivates sophisticated logic like pipelining processor instructions. So if we can reduce costs via compression, we might accept simpler logic elsewhere. Basically what you're saying with compressed-RISC streams.

In any case, this isn't a domain that interests me too much, so I'll leave it to those passionate on the subject. There are entire forums for people who are into compression.

Copy, and Code with Data

The hardest part seems to be the divide and conquer for long arguments, for example where we want to duplicate or find the fixed point of a large function. Am I right in saying that this would take several passes, dependent on the 'window' size of the processor?

It seems to me that in every case you would be better off with a reference/pointer. So although you could theoretically design a processor like this, it would be an academic exercise; for actual use you would be better off designing a streaming processor that can use references (you mention some ways of doing this in the article). Further, in most cases combining code and data is not that useful, as it reduces the available stream bandwidth for data. Consider streaming network TV, or NASA telemetry, as typical uses of streaming: sending the code with the data doesn't make sense.

What is the use case where sending the code with the data is a better option?

For more local use, combining code and data could lead to security issues. Where code is separate from data, we can validate that code and have assurances that it will not do certain things; we are then safe to use data from any origin. Having said that, JavaScript in web pages is probably a good example of widespread combined code and data. There would probably need to be some kind of sandboxing to make this work for general-purpose (desktop) computing.

It is not clear to me how real-time IO would work when the operating system may need several passes through the processor. I guess you would have to update the display once per pass? Maybe my brain just doesn't work right for this kind of design, but it seems like it would be hard to implement even simple UI elements.

re Copying

Am I right in saying that this would take several passes, dependent on the 'window' size of the processor?

Yes.

It seems to me that in every case you would be better off with a reference/pointer

It's certainly more practical. That's why there's a section in the article on 'virtual memory' for SLRP, which replaces a region of the stream by a reference to the region. For practical performance, that's a very useful extension. (I call this 'stowage' in Awelon.)
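
To sketch the idea in conventional code (the `@ref` token, the hashing, and the in-memory dict are my own placeholders for whatever reference syntax and external storage device a real system would use):

```python
import hashlib

store = {}   # stand-in for a remote storage device

def stow(block: str) -> str:
    """Shunt a large region of the stream to storage; keep a small token."""
    ref = hashlib.sha256(block.encode()).hexdigest()[:16]
    store[ref] = block
    return "@" + ref

def load(token: str) -> str:
    """Expand the token back into the stream when its contents are needed."""
    return store[token[1:]]

# stow("[a very large block]") -> "@<hash>"; load(...) restores the region
```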

(I'm splitting the other issue from your title into a separate reply.)

re Code with Data

There are many use-cases for combining code and data: templating, procedural generation of data, representing infinite streams or trees as first-class values, adaptive structure where data can automatically integrate with its context, introducing spreadsheet-like properties to a filesystem, mobile software agents that can move to a remote processor and implement local decisions/filters/compression to reduce latencies and network overheads.

Bandwidth isn't a significant concern. Even within a process stream, binary data with sized headers can have negligible overheads. This depends on chunk size, but working with a few kilobytes at a time is no problem (a typical L1 cache on a processor today is tens of kilobytes).

Further, you seem to be assuming that code and data must be combined for external IO. In SLRP, our processors would typically interact with network or human-interface devices through side channels; see the 'Effectful SLRP' section with the `![ChannelID][Request] => [Response]` operator. In most use cases, the channel ID, request, and response would all be embedded binary data. A streaming process that interacts effectfully with a network is not the same as receiving arbitrary streams of code from the network.
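
To picture the effects mechanism in conventional code, here's a minimal sketch of a processor's request dispatch (Python stands in for hardware; the `wired` device table, channel names, and request format are all hypothetical):

```python
def effect(channel_id, request, wired):
    """`![ChannelID][Request] => [Response]`: answer requests on channels
    this processor is physically wired to; return None to indicate the
    request should be written to output unmodified, for a downstream
    processor to handle."""
    handler = wired.get(channel_id)
    return handler(request) if handler else None

# Hypothetical device table: this processor is wired to a UART only.
wired = {"uart0": lambda req: b"ack:" + req}
assert effect("uart0", b"hello", wired) == b"ack:hello"
assert effect("disk1", b"read", wired) is None   # passes through untouched
```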

Regarding security:

The von Neumann architecture has a lot of difficulty with security mostly because 'addresses' are forgeable. This could easily be fixed using capability-based security; unfortunately, use of addressing is so frequent and fine-grained in von Neumann systems that the relative space-time overheads of securing addresses are too much. Instead, we resort to ad-hoc, coarse-grained security models such as page tables and protection rings.

A lesser issue is that it's very difficult to control what a user can do with an address, to protect system invariants.

However, this awkward security experience with the von Neumann architecture doesn't generalize to other mixed code-and-data designs. SLRP works well with capability-based security. Virtual memory and channel IDs only require coarse-grained references, and there's no pointer arithmetic. Both request-response channels and virtual-memory references are structurally restricted in what we can do with them, so we can protect invariants. Overall, the cost of securing references at the hardware level is quite acceptable in SLRP.

A secondary security benefit for SLRP relates to rewriting semantics: there is no undefined behavior. When the processor sees a divide-by-zero or a type error or a request it cannot handle, it simply writes that code to output and moves on. Consequently, there is no need for external fault handlers and awkward non-local reasoning.

Regarding real-time processing and IO:

In SLRP, we naturally must control the stream size for real-time or embedded systems processing.

An intriguing feature of SLRP is that, beyond pipelining the stream and passive virtual-memory storage, we can also leverage the idea of virtual memory to partition an 'active' stream across multiple processors. This would allow us to compute each partition in its own tight loop, and stitch the streams back together when the result is required.

With clever partitioning schemes, we essentially support multi-threaded code within a stream. If we have more partitions than processors, we can time-share partitions on a processor. In heterogeneous systems (e.g. where different processors control different devices), we can also migrate a partition to the processor that can handle its request. Partitioning a specific request simulates a remote procedure call; partitioning a loop simulates a mobile agent. Thus, SLRP plus virtual memory has convenient, intrinsic support for concurrency and distributed computing, even without explicit network requests.

With that in mind, for real-time IO we don't necessarily need to control the entire stream. It's sufficient to control the size of specific remote or time-critical active partitions (threads) of a large stream. This requires precise, effectful control over partitioning, but that's entirely feasible.
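
Here's a minimal sketch of the partition-and-stitch idea in conventional code (the `#p<i>` placeholder tokens are mine; in SLRP proper they'd be virtual-memory references):

```python
def partition(stream: list, size: int):
    """Split a stream into fixed-size partitions, leaving placeholder
    tokens in the parent so logical containment is still tracked."""
    parent, parts = [], {}
    for i in range(0, len(stream), size):
        ref = "#p%d" % len(parts)
        parts[ref] = stream[i:i + size]
        parent.append(ref)
    return parent, parts

def stitch(parent, parts):
    """Reassemble the full stream once the result is required."""
    return [tok for ref in parent for tok in parts[ref]]

# Each partition can now be rewritten in its own tight loop, on its own
# processor, and prioritized independently of its siblings.
```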

Loops and messages

I was thinking more about how APIs would work where something has privilege but only exposes a limited set of capabilities to the "outside". Consider a file system with ownership and permissions. The "driver" has complete low-level access to the device. This would exist, I assume, as a kind of "tape loop": the code gets passed over repeatedly. For efficiency it would need to be "paused" at a rendezvous with another stream, as we previously discussed. The second stream would then be a message, and the driver stream would check permissions, perform the actions, and reply with another message.

So to make this work, the processor would need to have two input streams and two output streams at the same time. It would also need to be able to context switch. One of the input streams would have to be limited in some way so that it only serves as data to the other stream; otherwise it could take full privileges and break the security. I don't see how this can work with code and data combined. Code and data together creates a symmetry, whereas an API requires asymmetry: one side having privilege, the other not. If there is any way to pass code through the API, we lose control of security.

The other issue here is context switching: how could it context switch without a stateful, memory-address/pointer-based approach? You can't interleave the code easily without exceeding the window size.

re Loops and messages

If I understand correctly, you're essentially asking about how we model 'applications' separately from the 'OS' process. In SLRP, we wouldn't need to separate these things in the conventional sense. Apps can be directly embedded as block-values in the OS. Consider two approaches:

The first option relies on the facet pattern. We inject capabilities as an argument to our app: basically, instead of a `void main()` app model, we have `void main(caps)`. The injected caps can be restricted via the facet pattern. Instead of a raw `[![BlockDeviceID]]`, we can give the app some 'methods' that wrap the block device and restrict access.
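
A sketch of the facet idea in conventional code (the `BlockDevice` and `ReadOnlyFacet` names are mine, purely for illustration):

```python
class BlockDevice:
    """Stand-in for a raw `![BlockDeviceID]` capability."""
    def __init__(self):
        self._sectors = {}
    def read(self, n):
        return self._sectors.get(n, b"")
    def write(self, n, data):
        self._sectors[n] = data

class ReadOnlyFacet:
    """Facet: wraps the raw capability and exposes only what was granted."""
    def __init__(self, dev):
        self._read = dev.read      # keeps no path to dev.write
    def read(self, n):
        return self._read(n)

def main(caps):
    # `void main(caps)`: the app holds only the facets injected here.
    return caps["disk"].read(0)

dev = BlockDevice()
dev.write(0, b"hello")
print(main({"disk": ReadOnlyFacet(dev)}))   # b'hello'; no write authority
```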

The second option is based on monads or algebraic effects. Our app evaluates to some form of request-continuation model. The OS wraps this monadic program with an effects handler that will interpret each request, provide a response, and continue. The important difference is that `![BlockDeviceID]` is never injected into the application code, which gives the OS more control at some cost to performance.
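
And a sketch of this second option, with Python generators standing in for the request-continuation structure (the read/write request vocabulary is hypothetical):

```python
def app():
    """The app yields requests and receives responses; it never holds
    the `![BlockDeviceID]` capability itself."""
    data = yield ("read", 0)
    yield ("write", 1, data)

def run(program, disk):
    """OS-side effects handler: interpret each request, respond, continue."""
    gen, response = program(), None
    try:
        while True:
            op, *args = gen.send(response)
            if op == "read":
                response = disk.get(args[0], b"")
            elif op == "write":
                disk[args[0]] = args[1]
                response = None
            else:
                response = None    # unknown request: refused, app continues
    except StopIteration:
        pass

disk = {0: b"boot"}
run(app, disk)
assert disk[1] == b"boot"
```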

Apps might be loaded from a block device or network. If from a distrusted source, the app code might be sanitized or compiled by the OS before it is loaded. For scalable systems and performance, we could leverage the 'active' virtual-memory partitions I described above. This would allow each partition to compute in its own 'loop', if we have enough processors that time-sharing is not required.

Note: I'm not using any synchronous rendezvous here. No pausing streams. No assumption of multiple streams. I'm assuming only the original SLRP as described in the article, including side-channel effects and virtual memory references.

One of the input streams would have to be limited in some way so that it only serves as data to the other stream; otherwise it could take full privileges and break the security.

No. We're not using protection rings with 'privileges'. Capability-based security and protection rings/privileges are diametrically opposed. It's easy to blend code with different authority in a capability-based system.

The question to ask for capability-based security is whether we can forge capabilities. If we can't, then there's no risk of takeover. Protecting against forgery isn't difficult. One option is to sanitize apps, perhaps forbidding direct use of the `!` operator so the app is pure until it receives caps from the OS. Another option is to secure the resource identifiers against forgery, e.g. add a random 64-bit entropy suffix to channel IDs so that guessing at caps is infeasible.
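
The entropy-suffix option is nearly a one-liner in software terms (a sketch; the `base:hex` format is arbitrary):

```python
import secrets

def mint_channel_id(base: str) -> str:
    # 8 random bytes = 64 bits of entropy; guessing the cap is infeasible
    return base + ":" + secrets.token_hex(8)

# e.g. mint_channel_id("blockdev") -> "blockdev:9f1c2a7e55d0b3c4"
```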

The other issue here is context switching: how could it context switch without a stateful, memory-address/pointer-based approach? You can't interleave the code easily without exceeding the window size.

We don't need context switching, or at least not a conventional notion of context switching. In SLRP, there's no protection ring context, no page table context, no register context. And we don't need to interleave code at operator-level granularity. At most, we might need to time-share a processor for several streams, but that isn't so different from working with one larger stream of form `[Stream1][Stream2][Stream3]...`.

Also, the processor sliding window (via output buffering) is simply an optimization, to focus processor attention on parts of the stream that need more work. In practice, there'd be a quota in case of tight infinite loops (like `z[i][]`) so we only cycle so many times before forcibly moving on. So time shares will always predictably progress.

Security and Priority.

The first option relies on the facet pattern. We inject capabilities as an argument to our app: basically, instead of a `void main()` app model, we have `void main(caps)`. The injected caps can be restricted via the facet pattern. Instead of a raw `[![BlockDeviceID]]`, we can give the app some 'methods' that wrap the block device and restrict access.

Yes, that makes sense; I understand the capability model as opposed to ambient authority. This processor is going to be very complex compared to a traditional one, as it will need to provide capabilities to enumerate devices, etc., in hardware. Effectively the BIOS would need to be hard-coded into the CPU, unless you had a "normal" von Neumann CPU running a micro-code program to do device enumeration and manage capabilities.

This architecture seems to have a "bootstrap" problem. Things like capabilities are traditionally managed by the OS kernel because they are too complex to implement in hardware. We want to minimise the complexity of the hardware because a "bug" in a mask set can cost $100,000+ to put right. The von Neumann architecture can be implemented in tiny chips like the 6502; without it we would never have had computers with which to develop more sophisticated algorithms, so perhaps my "fundamental" comments could be justified after all? To be fair, the 6502 had no memory protection and tended to be used in systems without multi-tasking. Even so, this seems to merge system architecture and OS function with CPU architecture, which would seem problematic.

Also, the processor sliding window (via output buffering) is simply an optimization

I don't see this: you need to see the operator and its arguments to do a reduction. If we have arbitrarily long arguments we need to split them into chunks, which means holding the operator whilst streaming the arguments. With multiple interleaved streams we need to hold an arbitrary number of operators and context switch between them, which probably requires spilling to a stack.

This also only seems to work for equal-proportion timesharing; there is no easy way to have dynamic prioritisation where the priority of a process changes with time. It seems like it would be difficult to have a usable desktop system.

re hardware caps and sliding windows

Hardware caps aren't nearly as complex as you make them out to be. Mostly, it just requires a few extra entropy bits to interact with resources, or tag bits if we have a machine code that can prevent forgery of references. Sure, there'd be a few $100k hiccups when transitioning to the model, but I don't see how it would be a problem long-term.

Bootstrapping also isn't particularly difficult. We could still have a separate BIOS or firmware that injects an initial stream and gets the cycle started. We would provide some capability arguments to the OS like the OS would do for any other application, e.g. a simple binary cap to query and discover other local hardware caps.

SLRP processors only need to know which side channels they're physically wired to, so they can filter the requests they handle from those they ignore and output unmodified. This is useful for heterogeneous systems, where not all processors are attached to the same devices. Processors could determine this by query; it doesn't need to be hard-coded. This implicit 'routing table' is not directly exposed to the SLRP process, not even to the OS, so it isn't a security risk, and it's disentangled from bootstrap issues.

you need to see the operator and its arguments to do a reduction

Reduction requires an input buffer. I can see how you might call that a 'sliding window'.

But to clarify, in our discussions on SLRP so far, I've used 'sliding window' only to refer to a loop within the processor wherein we buffer *output* then push it back to input, allowing multiple smaller passes within a processor for each full pass over the stream. This allows the processor to focus more attention and effort where it's most useful, such as tight loops or arithmetic. The 'window' refers to the volume of focused processor attention within the larger stream. It's a very useful optimization, but not essential for SLRP systems.
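
A sketch of that loop, collapsing the window to a single buffer for brevity (`rewrite` stands for one bounded-memory pass; `quota` bounds attention on tight loops like `z[i][]`):

```python
def sliding_window(buffer, rewrite, quota=8):
    """Loop output back to input up to `quota` times, so hot spots get
    several small passes per full pass over the stream; then forcibly
    move on, so time shares always progress."""
    for _ in range(quota):
        reduced = rewrite(buffer)
        if reduced == buffer:      # no further local progress
            break
        buffer = reduced
    return buffer
```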

With multiple interleaved streams we need to hold an arbitrary number of operators and context switch between them

Why would you interleave multiple streams? Before, you were imagining interleaving was needed for synchronous rendezvous. However, I've been telling you that synchronous rendezvous is not a thing for SLRP. We don't need multiple interleaved streams, at least not at fine granularity. We can interleave streams at a coarse granularity by doing all of [A], then all of [B], then all of [A], then all of [A], then all of [B], and so on. (It doesn't need to be a homogeneous round robin.)

An SLRP processor/core will handle one stream at a time, from end to end, without context switching in the middle. This allows for very simple pipeline processing. For priority, it is sufficient to process some streams more frequently than others, or on dedicated processors, or to pipeline them through more processors. We can also partition a large stream into smaller streams that can be prioritized independently.

This also only seems to work for equal-proportion timesharing; there is no easy way to have dynamic prioritisation where the priority of a process changes with time.

Remember, one interesting thing we can do with a SLRP stream is to partition a large stream and process the partitions separately, even remotely. This is feasible due to local rewrite semantics (relevantly, 'control flow' is not part of SLRP semantics). We only need some means to track the logical containment of partitions and stitch things back together when needed. Virtual memory references work for that.

Thus, although we may have one big OS stream that logically contains our apps, we could physically arrange each app into its own partition. If we have more partitions than processors, we'll need to timeshare some processors. But we could easily process some partitions more frequently than others, or use dedicated processors for high priority foreground apps while all background computations share a processor.

I don't understand why you believe prioritization would be particularly difficult. The physical aspects are simple enough. So that just leaves software management (partitioning, quotas, etc.) which we can make explicit via effects to any degree desired.
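
For what it's worth, the scheduling side really is mundane, e.g. a weighted round robin over partitions (a sketch; the weights are whatever policy the software sets, and may change between cycles):

```python
import itertools

def schedule(weights: dict):
    """Yield partition names in proportion to their (dynamic) weights."""
    for name in itertools.cycle(list(weights)):
        for _ in range(weights[name]):
            yield name

# e.g. weights = {"foreground": 3, "background": 1}
# passes: foreground, foreground, foreground, background, foreground, ...
```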

Things like capabilities are

Things like capabilities are traditionally managed by the OS kernel because they are too complex to implement in hardware.

There have been a number of tagged-memory computers, and there's CHERI and SAFE.