Process Network for Effects, Monad Alternative

Monads are an awkward effects model in context of concurrency. We get incidental complexity in the form of futures or forking threads with shared memory. The running program becomes entangled with the environment, which hinders persistence and mobility and debugging. So I sought alternatives in literature.

Kahn Process Networks (KPNs) seem like a very good alternative. From an external perspective, they share a lot of similarities to monads, except we get more than one input (continuation) and output (effect) port and thus can model concurrent operations without relying on effects or environment support. Internally, KPNs have a lot of similarities to free monads: we can compose KPNs to handle effects internally, translate them, etc.. Use of KPNs as first class values allows for dynamic structure and mobile processes.

The main feature missing from KPNs is the ability to work with asynchronous inputs. But it is not difficult to add time to the model, and thus support asynchronous messaging and merges in a style similar to functional-reactive or flow-based programming (and somewhere between the two in terms of expressiveness). I doubt this is a new idea.

I've written about these ideas in more detail on my blog:

Reactive KPNs with open ports or channels also make a better FRP than most, having a far more direct API for pushing inputs and pulling outputs deep within a network.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Effects as sessions, sessions as effects

This POPL 16 paper might be of interest.

Sessions and pi calculus

That was an interesting paper, though it isn't clear to me how to adapt session types to KPNs.

Pi calculus doesn't have the right properties to serve as an effects model for a purely functional language. In particular, pi calculus is non-deterministic (which means we must understand the evaluation function), and pi calculus introduces new first-class names with vau (which means we must concern ourselves with uniqueness of name representations, and the potential for names to enter or escape scope within context of monotonic processing, and sophisticated GC issues surrounding use of named state).

session types and process networks

After thinking about this for a while, it seems to me that we could treat the entire network as having a session type, i.e. characterizing the network itself as a session where every message is a variant on available ports. Composition of networks would need to filter and perhaps rename some events within the session types.

Whether this is effective depends on how well we can compose session type descriptors based on corresponding compositions of subnets and processes. It's certainly worth exploring.

Coroutine modelling

Simpler coroutine modelling is worth study I believe. In Felix coroutines have arbitrary channel I/O, however there is no buffering so they represent synchronisation points. Buffering in general is an evil hack to work around lack of understanding of synchronisation :-)

One of the key things about coroutines is the observation that any use of more than one thread for a single CPU is a design fault. You should be using coroutines. Threads are wrong in principle and way too expensive in practice.

If you have to handle asynchronous events, two threads is enough. Anything more is a design fault. More specifically, one thread per device and one extra management thread is optimal, and anything more is a design fault. If you have 8 cores on your machine, and you see Firefox running 100 threads, you know the programmers had no idea what they're doing.

On a GPU the optimal number is a bit higher due to slow memory and fast context switching 3-1 oversaturation is optimal.

My point here is that it seems hard to understand concurrency when the strictly serial coroutine model is not understood. Coroutines replace concurrency with indeterminate ordering, but preserve a guarantee that the ordering is total (even though you don't know what it is).

The key point is that subroutines are inadequate: that includes functions and traditional procedures too. Coroutines exchange control by a different model that complements functional programming and eliminates callbacks.

The purely functional world is broken. But it cannot be fixed with "effects". It needs to be fixed by symmetry: incorporating its dual into an integrated model.

coroutines etc

KPN values could be viewed as somewhere between coroutines and generators, I suppose. Internal order of evaluation is nondeterministic, but this is not externally observable.

Regarding your comment on threads: performance isn't the only consideration. Threads are often distinguished by task or responsibility. Lightweight threads (oft under a kilobyte) make this easier, since the set of hardware threads can correspond to CPUs.

I generally disagree with your comment on functions. I've contemplated using KPNs or their reactive variants as foundation instead of functions, seeing as process networks can subsume functions while remaining deterministic. But doing so both complicates reasoning and sacrifices a lot of nice properties that are possible with first class network descriptions as plain old values, and evaluation as a clearly delimited KPN->KPN function.

Functions

My general claim is that functions are only half the story. I would never throw out functions. As you say, functional programming in some cases can make things simpler, easier to reason about etc.

I'm not asking to switch to cofunctions as a foundation, but to *merge* functions with coroutines into a single model. I readily admit I do not know the best way to do this.

I'm currently writing some code of non-trivial complexity which uses a mix of procedural, functional, and coroutine based code, to see how it pans out. Actually a parser library. I'm learning as I go. It's beginning to look very convincing. My design could be implemented entirely with functions, however the coroutine model has some very powerful advantages.

One of the hard things to grok is that there is no termination. Because we have cofunctions, which process coinductively, we're dealing with streams, and streams don't end. So your thinking has to change.

As an example I battled with the fact that parsing a string, one gets a set of parses. The set could be empty (parse fail), it could have one entry (unique parse), or it could have multiple entries (ambiguous). In a functional model this is trivial: you can use a list or some other data structure to hold the all the parses. The data structure is modelling a product type IN SPACE.

But you cannot do that with coroutines, because the product is TEMPORAL. You get the values over a channel in a temporal sequence and there is NO END. Values just stop coming. None come at all in case of a parse failure!

Now guess what? This is just perfect! It is much better for adjunctive composition that the functional model. If you have parser A and parser B, and you want to compose them to get parser A |-> B meaning first parse a prefix of a string with A, then send the output to B and parse a prefix of the suffix, then the coroutine model just walks all over the functional model. In the functional model, you start with a set of input positions, each one generates a set of output positions, and you iterate through the input set merging the output sets, then send them to B. In the coroutine model you just connect the output channel of A to the input channel of B. Done.

If a thread fails, it just doesn't generate any output. In the functional model failures HAVE to generate outputs. Typically one uses an option type.

The functional model is the wrong model, its trying to use an inductive data type to model a coinductive data type. Coinductive types do NOT live in space. They live in time.

I don't think one can easily claim the dual model is harder, without at least developing it to the level of sophistication functional programming enjoys, and even then, I'm still advocating a symmetric system.

The core problem with functions is that they're subroutines in which data and control flow are integrated: and this is also their primary strength. In fact a huge weakness of (conventional) procedures is that the "effects" have no type, only the input is typed. However subroutines just don't cut it as a general tool by themselves, because local state is lost on return unless manually saved, defeating the primary advantage of coupling data and control flow.

Coroutines change the coupling rules, so control can be transferred to the coroutine on both data input and output, instead of being lost on output as with a subroutine, and local state is preserved on both input and output. So actually a function is just a simplified kind of coroutine, which means coroutines in native form are harder to use when a function would do. When you leverage their structure more deeply they're obviously superior to functions, by the same reasoning.

The hard bit is designing a model that uses both, and lets you choose which kind of model to use, AND lets you interface the components AND lets you re-engineer the system.

I can do that. I am doing that. But there are two core problems for me at the moment: there's still a lot of boilerplate, and, most importantly, I don't have any real THEORY to guide development. I know general principles, but i have to reason the hard way about each construction, each time. [The basic model is just a subset of CSP, but CSP is very hard use as a reasoning tool. The sequential subset CSC should be about as hard as functions to reason about .. because basically it is just the dual theory]

Logic languages

This was the basic dream of logic languages, and it seems really nice at small scale in prolog. Write predicates that either generate a stream of answers or fail with silence. Composability is simple connecting streams together (where the sequence is multiplexed into some Vars).

But complexity is a bitch. Taking the product of generating sets and then filtering them afterwards to find solutions does not scale well. Debugging it is painful as well.

CSP / Occam without alternation?

KPNs seem to be CSP/Occam style programming without the alternation construct to select between channels. The other difference is that the channel capacity in the Occam model is locked to one, but this is unimportant as buffering processes can simulate larger capacities.

CSP originally started out without an alternation construct, but it was added later to model problems where inputs must be multiplexed in an unknown order. How are these problems modelled in KPNs?

Alternation

KPNs as originally specified by Kahn don't support alternation.

The reactive/temporal variant I detailed in the second linked blog article can model alternation quite trivially between wait and read - wait on a set of ports then read each and repeat, emitting messages in the order read. (Indeed, the expressiveness issue of merging streams of asynchronous inputs is what motivated reactive KPNs.)

So what is the difference?

So the wait primitive that you've added then looks exactly like Occam on a system with a global clock. Is there some difference in the model, or are they the same?

KPN vs Occam

I see two big differences in the model and one in my proposed use of the model.

KPNs don't block on write, only on read. The reactive variant also blocks on the wait action if necessary to determine the next process time.

Any expression of alternation in reactive KPNs is inherently deterministic. In Occam, alternation is a source of nondeterminism.

Occam has a built-in notion of completion, which is necessary for sequencing as a general composition operatior. A KPN subnet has at best an ad-hoc notion of completion, and isn't generally clear when we can replace one KPN subnet by another. My intuition is that it is feasible to extend KPNs with completion and sequencing, but I haven't worked out the details. Sequencing would support a more dynamic, state-driven network structure.

I use KPNs as plain old values - a network description of general type KPN with a pure KPN->KPN evaluation function. This allows some interesting patterns, like passing KPNs as objects and messages, reflection and extension as effects, backtracking.

Message loss?

One of the nice properties of blocking writes in the context of fixed capacity channels is that messages cannot be lost. This simplifies concurrent code as a layer of ack / retransmit is not necessary in the code to make sure that data-flow and control-flow remain coupled. For small scale, low-level systems this simplifies development. In larger scale systems making the mechanism to guarantee transmission explicitly visible can be more useful.

What about boundaries to the system - alternation over messages on external interfaces to the system would inject nondeterminism whenever the devices are independent / have separate clocking domains.

First-class (executable) code as a value is an interesting departure. What does the type system for KPNs look like?

Fixed capacity, boundaries, types

Yes, a fixed capacity channel has some advantages. The ability to control memory requirements is among them. While plain old KPNs lack this property in general, some people use a variant with bounded capacity and push back. The cost is potential data lock where processes wait in a loop.

The reactive variant doesn't need bounded capacity channels because we can constrain computations in the temporal dimension (how many logical microseconds we compute) instead of the spatial (how many messages are enqueued). If a process produces unbounded messages in bounded logical time, that's an error to correct or (feasibly) prevent via type system or progress/termination analysis.

Bounded buffers are essentially a means to model a temporal constraint within a spatial structure. Explicit separation of spatial and temporal aspects of any concurrent system has potential to simplify this and other such things. That said, using bounded capacity channels anyway doesn't hurt in the common case.

For external interface, we can push messages into a KPN's open input channels, evaluate, then pull from open output channels. The reactive variant can additionally control advance of logical time on open channels. So we could model pushing a list of messages into a KPN all at once or adding some logical delay between messages. Alternation within the KPN is independent of whether messages originate externally.

An important feature of KPNs is they are monotonic, so this pushing and pulling can occur in parallel with evaluation. The client only needs to wait for an output message if no such messages are already pending. So this provides a useful model for incremental background computation.

Typing a KPN isn't something I've spent much time researching yet. Conventional FP type systems don't make it easy, but my intuition is that GADTs or heterogeneous collection type could be adapted, or KPNs could feasibly receive specialized support from the type system in a new language. A simple type system might just assign a message type to each channel. A "dependent" type system (or a session-type? I'm still contemplating Roly's earlier comment) would require we adjust channel types based on prior messages sent.

Determinism can be bad

I think it is mistake to assume determinism is a desirable property, that is, desirable because it makes reasoning easier. On the contrary, I would argue indeterminism makes reasoning easier. (Devil's Advocate for Good Science..}

Consider a routine that has two channels A and B, and the data comes on channel A first, then on channel B. The routine has to read A first, then B. Its puts the A data in variable a, and the B data in variable b. Then we do something with the data. Is the routine correct? You have to know the order the data is coming in. You have to be able to determine the reads are in the right order. When the data is used, you have to verify that the a data is used where you intended to use the first lot of data that came in. The reasoning is quite hard!

Now consider a routine that reads from either A or B, and we don't know which data comes on which channel, nor which data comes first. So you have to read A or B together to get some data, and do it twice. Since we can't distinguish the data by time or by which channel it came in on, we can't get the read order wrong, we can only forget to read exactly twice. And since we can't tell the difference between the two pieces of data in a calculation, we only have to make sure to use both where they're distinguished, we don't care which one is first in the usage expression.

Clearly, the nondeterministic case is *easier* to reason about. Knowing less is a vital part of all programming. Its called *abstraction*. The less you need to know the better. The less you know, the harder it is to mess up. Since nondeterminism reduces knowledge, the more of it you can get away with the better. Its a common mistake made by programmers that reasoning is easier if everything is determinate. One should always be asking: how can I get away with the least possible information?

The flip side of the coin: nondeterminism is *faster*. Why? Because it gives compilers more optimisation choices.

Let me give a horrific example. Python. In Python, the result of everything is determinate. It is almost impossible to optimise. In particular, there is no such thing as an error in Python. Not even a syntax error. Yes, that's right. You can import entirely the wrong file, and you are sure to get a SyntaxError exception. And you're allowed to catch it. And do something. So it isn't an error. So it is quite hard to meaningfully compile a Python program because you cannot even report a SyntaxError and terminate compilation, because there's no such thing. You actually HAVE to generate code that throws a SyntaxError at run time. Determinism gone mad.

You're not comparing

You're not comparing equivalent problems. This isn't contrasting verification of a non-deterministic vs. deterministic protocol, it's contrasting the verification of an ordered protocol vs. an unordered protocol.

A proper comparison is a program where you have to forward messages in a specific order, where they arrive on a single channel that is:

1. deterministic: message A always arrives first, followed by message B (deterministic).
2. indeterministic: messages A and B can arrive in any order.

Clearly (2) has more cases to analyze, and hence is harder to verify to ensure A and B are forwarded in the correct order. I think determinism always makes reasoning easier.

determinism

I suppose it depends what it is that you're ordering. In my system, the order in which data appears on a channel is deterministically ordered for a single pair of reader/writer. If W writes m1, m2, then R will read m1, m2 certainly.

What is not deterministic is whether W continues on after a transmission or R does.

However, channels can have multiple readers and writers, and in this case the indeterminism of the readers and writers propagates to the order of data exchanges *across the set* of fibres. It is still deterministic for a given pair, that is, the message ordering is total and determined for a given pair of reader/writer.

The precise semantics are specified here:

https://github.com/felix-lang/felix/blob/master/coroutines.pdf

As to reasoning, I think you are making a partly fallacious argument, similar to the argument that it is impossible to determine if a program terminates. This is a rubbish argument: the claim is valid only for arbitrary programs, but programmers do not write arbitrary programs.

In the case we're discussing, it is the same. Programmers don't write arbitrary programs. You can reasonably expect if an ordering is nondeterminate that it was intentional and therefore reasoning is actually simplified. Contrarily, where the system forces a programmer to pick an order, the human reader is at a loss to understand the semantics because they cannot tell if the order picked was arbitrary or essential.

Nondeterministic constructions are in fact vital to reasoning, and the fact functional programming is a mess is *because* implementors refuse to accept that beta-reduction is indeterminate. This is part of the Felix semantics. It is why Felix is so fast. The compiler choses the evaluation strategy which seems most efficient and if the programmer requires eager or lazy they have to specify it. It requires a bit of paradigm shift, because the compiler isn't as sophisticated as say GHC at analysis.

We can agree on nearly every

We can agree on nearly every point you just made, but I don't see how any of these points supports your original claims that non-determinism makes programming *easier* to verify, and that your original example somehow demonstrates that determinism makes programming harder. It still looks to me that verification difficulties in your original examples are due to the difference in ordering properties, not any determinism.

As for determinism making programming easier, it seems trivial to argue: the more you know about your context, the more rigourous the assumptions you can make in your function, resulting in a smaller set of cases to handle in that function, thereby punting satisfying those preconditions to callers. This is also a form of abstraction useful for verification, and determinism makes for very strong knowledge in any given context.

About the only way I can make sense of your claims is if you're actually saying that functions that make fewer assumptions about their parameters are easier to use for *clients*. That's certainly true, because clients don't have to ensure strong preconditions are satisfied. But functions that take all string parameters and then must do something sensible no matter the values they're given make the fewest assumptions. So you're now programming in bash. I don't think anyone would seriously argue that bash is a good language for reusable programming or verification.

Joining it up

Perhaps a way to join up these two world-views is a language that makes all the assumptions made by a function explicit in its type-signature, so that they can be checked, hopefully by automated type checking. Further we would like to have assumption-inference :-)

Re: determinism

Determinism isn't just good for reasoning. Determinism is also good for testing, replay, scientific reproducibility, caching, consistent sharing of computed views and procedurally generated data, omniscient debugging, and robust computations via redundant evaluation on multiple hardware devices (in cases where we can't trust the platform due to cosmic rays or nefarious agents).

Determinism doesn't imply strong ordering constraints. Irrelevant ordering information can be eliminated by sorted collections, especially those like a sorted list or trie where even balance is independent of access and update order. Evaluation of KPNs does not depend on the order in which messages on different channels, and so an accelerated evaluation function can leverage arrival order to improve resource utilization while still producing observably deterministic results.

While nondeterminism reduces knowledge, it does so in a manner that requires your code to be correct in exponentially more cases. Including cases you haven't tested but that will arise when you use your code in a new context where scheduling decisions have a different probability distribution. (Which in my experience guaranteed every time you change hardware or clients or optimizer/compiler versions or optimization flags.)

In theory, nondeterminism allows more optimizations. I have not experienced this in practice. What happens instead is a lot of extra, difficult-to-optimize barriers to constrain nondeterminism, and a lot of distrust in the codebase from programmers who keep getting new regressions when code is ported or moved, and blindly hack at code until they go away.

The Python example, I'd attribute more to side effects and lack of effective support for static partial evaluation.

optimisations

I actually don't understand your comment "in theory nondeterminism introduces more optimisations", it reads like nonsense to me. If the system is entirely deterministic there cannot be ANY optimisations. If you want to see how this can be leveraged, consider the Felix semantics for beta-reduction. It is not specified if the evaluation is eager or lazy so the compiler is free to choose.

This is why Felix is fast and Haskell and Ocaml are slow. Felix is free to inline applications which is lazy evaluation. On the other hand if the function is unknown because say, it is a closure, then the argument is passed by value (copied), which is eager evaluation. Felix says, if the two don't produce the same result, its the programmers fault. So here you have blinding speed due entirely to indeterminism.

Furthermore, the system is easy to reason about in the sense that you know a plain function is total/strict because it has to be. You know when you've messed up if you see that a function is not. So the indeterminism *restricts* the kind of functions you can write correctly, which improves reasoning ability. It would be better if my compiler could do this reasoning but it can't.

It's just not true that indeterminism exponentially increases the number of cases you have to reason about. First off, one reasons with abstractions, theorems, and tools that apply to *classes* of cases not individual ones. Secondly, even in the deterministic case you have to consider what the other possible deterministic cases *might* have done if you'd encoded them instead. In particular, if one say swaps the order of two operations, will the program still work? The fact is in the real world requirements change and locked down deterministic programs are fragile. Knowing "it doesn't matter" is the most powerful reasoning tool we have (and it has a name: "abstraction").

Actually in functional programming the idea "referential transparency" is a direct expression of indeterminism. It says an expression has the same value no matter where it is in the code or when it is evaluated (modulo name binding of course).

Determinism

A pure function is deterministic in the sense that it always produces the same observable output for a given input.

That doesn't overly constrain the algorithm or intermediate results. Nor does it constrain the representation of non observable outputs such as the inner structure of first class functions or the address of a value in memory. An optimizer can freely apply equational reasoning to transform a referentially transparent expression to any other that produces the same value, e.g. reordering matrix multiplications, inlining code, extracting constant expressions from a loop. Parallelism can compute intermediate results in any order at runtime. A garbage collector can move things in memory. The observable output remains deterministic.

You seem to be using a definition of determinism that includes all hidden steps and results. But I believe that is misleading. In philosophy, determinism regards observable events, behaviors, outcomes. In computer science, a non-deterministic algorithm can produce observably different results and behaviors. This may be a consequence of race conditions or side effects or a special choice function. Further, Felix is non-deterministic in this relevant sense due to arbitrary ordering of side effects by different coroutines, arbitrary ordering of messages for multiple writers to a channel. Also, the ability to read from /dev/random or observe CPU usage and temperature.

I don't follow your argument about how "the system is easy to reason about". You seem to be saying that indeterminism is easy to reason about insofar as you don't use it. I suspect that is not your intention.

And regarding case classes and abstraction vs. individual cases: the exponential growth applies in either scenario. It's just more coarse grained at the case class level (a few broad cases per module instead of a few specific cases per expression). Further, no matter how good your abstractions, your testing, bug reports, log traces, and a lot of concrete reasoning will still happen in practice at the individual case level. When real world requirements change you'll break code either way, but deterministic code allows reproducible regression tests.

Also, RT is not an expression of indeterminism. That argument seems to be assuming that determinism implies dependency, e.g. on location of expressions within source, or time of evaluation. It doesn't. Determinism describes a system where every outcome has a cause, but does not require that every system variable contributes to that outcome. A trivial example of this is `F(x,y,z) = (x+y)`, which produces "the same value no matter z". As analog to your argument, 'z' might happen to describe "where it is in the code or when it is evaluated".

Python

Yet Python is a good language to program in, programmers are productive, and the code is readable. With things like "numpy" it can be all the above and also be fast.

A solution to effects

I've been thinking about effects for a while and decided to solve them in the following way:
A program has input and output. We can consider it as a function that transforms input to output values. When input parameters change (mouse move, keypress, timer, ...), output also changes. Output also sometimes depends on the previous output. So we can build any program as a pure reactive function

(NewInput, PrevOutput) => NewOutput

that is recalculated in cycles of input changes. The state is preserved in input and output.

That way the code remains pure, but the disadvantage is that we have to implement hardware devices outside the programming system (like a kind of low-level operating system beneath the actual programming interface).

re: solution to effects

That pattern fits KPNs, free monads, and a lot of other pure effects models.

The question then is what should the shape of the input and output be to support: background parallelism, concurrent input and output, fine grained composition, distributed evaluation, transparent persistence, forking and backtracking, etc.. My observation is that KPNs and temporal variants do better than monads at ALL of these metrics.

Regarding hardware devices: while physical sensors, actuators, and HCI devices must be modeled using IO values, computation hardware (memory, non-volatile storage, CPUs, GPGPUs, mesh networks, cloud compute services, etc.) can be leveraged by compilation without violation of pure semantics.

Further, it is feasible to model distributed computations to efficiently work with distributed hardware. That is, we can track in a type system not just the shape of a value, but where it is located (near this sensor or that actuator) and even when it is produced (to model latency). Network failures can be modeled as special inputs indicating certain sensors seem to be inaccessible.

Avoiding the bottleneck of requiring all data be provided or consumed at one physical location allows pure effects models to scale effectively. One reason I pursue KPNs as an effect model is that they seem to be a superior substrate for this purpose, which also is important to implementing my reactive demand programming model.

Freedom

Input could be modeled as a stream, reading one byte (or bit) at the time, producing some intermediate output, then be read again in cycles, bit by bit. Isn't that how actual data transition functions? It seems it is all about sequences of data (streams), and they can be interpreted in numerous ways. Even in the case of multiple core processors, instead of reading one bit, we can read several bits or bytes at the time.

I'd leave the completely open choice to programmers how would they process stream input, by this or that pattern, using this or that method. I'm a patron of implementing as language basic particles the least configuration that is necessary, as simple as it can be. All specific implementations should be achievable by using the language crux, leaving the programmers the most freedom they can get. I believe that specific implementation should be a question not of a language design, but of a language use.

On the other hand, I may miss the point here, and maybe we are talking about numerous uses and shape patterns that are used to process data, specifically KPN in this case. In that case, I'd be thrilled to learn valuable facts from further conversations.

[Edit]
I have to admit, I tested the pattern I'm using only on a few examples in theory, so it may turn out that it is incomplete, while KPN is certainly a more thoroughly investigated solution.

freedom with limits

Modeling input in cyber-physical systems as "a" stream (emphasizing the singular) is ultimately inefficient because it creates a bottleneck within the model itself. If you have a thousand sensors (mouse, keyboard, cameras, GPS, etc.), it's a lot more efficient to model this as a thousand input streams. Similar applies for output to actuators (screens, audio, robotic arms, etc.) - you can't efficiently use one stream for all the actuators, you'll instead want a thousand streams to control a thousand moving parts.

And regarding freedom to the programmer, that must be balanced against the burden on the compiler or interpreter.

For example, Haskell's IO works efficiently largely because the Haskell compiler knows all about IO and can flatten it down to bare metal or close to it. A few more optimizations can help with simple wrappers on IO like the StateT or ReaderT monad. One might think about this in terms of "hardware accelerated software" (which is essentially a compiler's job). If you want to leverage a GPGPU, it's a lot easier to accelerate a model of linear algebra, or a model of a graphics rendering pipeline, or perhaps a DSL that represents a safe, pure subset of OpenCL.

Kahn process networks (and the reactive variant) touch on both of the above points. First, they readily model and compose multiple streams, which is valuable for concurrency. Second, they are readily evaluated with distributed processors and massive parallelism, by interpretation or compilation.

Performance is important. Thus the question in practice is never "what models the problem best" but rather "what models the problem adequately but also fits into this other model that compiles and computes efficiently?" If you can squeeze your problem into linear algebra, you get to take advantage of GPGPUs. If you target a KPN, you can take advantage of distributed processing. If you layer both in a careful way, you might be able to leverage a mesh network of GPGPUs.

Now, all of this doesn't prevent you from pursuing a 'simple' language.

For example, if you have a language without numbers, you might represent natural numbers as `type Nat = Z | S Nat`. Naively, adding a natural number N to another has a cost of O(N). But if your compiler or interpreter knows about natural numbers specifically, it might substitute an efficient representation with efficient adds. It is feasible to leverage the notion of hardware acceleration of software to a much greater degree than conventional languages in order to preserve simple language semantics. This isn't a mainstream technique, but it is a practical one.

Nonetheless, there are practical limits. Not all models can feasibly be accelerated, or not all to the same degree. More critically, there are specific limits on which models are accelerated or in-the-works when the programmer begins development of an application. Acceleration is a non-trivial task, an investment (both economic and psychological) that requires hacking a compiler or optimizer and communicating, teaching, learning the accelerated models. So the issue of "shape of the input and output" isn't necessarily about "shape of the language", but rather of the models we choose to spend our time teaching, learning, accelerating. The path-dependent overheads and opportunity costs involved are significant, much like they would be for language design choices, perhaps more so.

(Addendum:) Also, there are concerns besides performance that should enter our rubric when examining a model, such as ease of serialization (for persistence, debugging, forking or backtracking, process mobility). I have observed that use of 'new' (as in `newIORef`, but generalized to any effect that allocates identifiers) to be poisonous to a model, entangling a running model with its environment and hindering many otherwise nice features. KPNs can usefully support many streams without 'new'.

Accelerators

It would be great if we had a complete (but may be slow) model of a language while having an opportunity to accelerate slow functions by rewriting them in assembler or something fast like that.

Still, I wonder how would functional version of assembler look like, pairing functional processor...

re: accelerators, functional assembler

My Awelon language is essentially a functional macro-assembler. Awelon has four primitive combinators:

[B][A]a == A[B]   (apply)
[B][A]b == [[B]A] (bind)
   [A]c == [A][A] (copy)
   [A]d ==        (drop)

Words can be defined, but are logically inlined and must be acyclic. Natural numbers like `42` and embedded texts like `"hello"` are supported and essentially act as predefined words (42 = [41 S], "hello" = [104 "ello" :]). Beyond that, I can leverage projected views, e.g. such that `[3 4 ratio]` might be rendered as `3/4`. Annotations of form `(par)` or `(nat)` or `(error)` must have identity behavior semantics but can guide evaluation, type checking, optimizations, projected views, etc.. Awelon is a simple language, simpler in many ways than lambda calculus, but complete.

It isn't difficult to implement this language in a slow manner, then gradually accelerate targeted functions. Some useful functions to accelerate:

[A]i == A;          i = [[]] a a d    (inline)
[B][A]w == [A][B];  w = (a2)[] b a    (swap)    
[X][F]z == [X][[F]z]F                 (Z fixpoint combinator)
  z = [[(a3) c i] b (=z) [c] a b w i](a3) c i

In these examples I use some arity annotations like `[C][B][A](a3) == [C][B][A]` to prevent infinite expansion of the `[[F]z]` term, and the `(=z)` annotation reduces `[...def of z...](=z) == [z]` mostly for aesthetic reasons. Acceleration of the fixpoint function is convenient because it covers all loops and is trivially recognized by a compiler. Acceleration of `i` is useful for tail-call optimizations.

A lot of optimization and partial evaluations can be performed on Awelon as a functional assembler before compiling it further to an intermediate two-stack register machine (data and return stacks, with top elements of each stack instead represented logically in registers to eliminate physical data plumbing), then to machine code. Or before interpreting it.

One could of course use raw lambda calculus as a functional assembly. Or SKI combinators. There are many old, good papers on compiling lambda calculus via SKI. My choice of combinators for Awelon was based on an interest in linear types and easy tracking for uniqueness of references, and easy scope control, so I separated the 'copy' and 'drop' responsibilities, and both 'bind' and 'apply' control scope (taking or hiding one value, respectively). I also wanted a concatenative language for easy composition, streaming, and refactoring and the ability to easily abstract patterns involving multiple-return values or arguments. And the stack-like structure provides an effective foundation for static typing that is otherwise missing from a pure lambda calculus (with no other types). I believe Awelon is simpler than and superior to other proposed functional assemblers.

In any case, it's trivial to implement Awelon language if you're happy with it running slow. But Awelon needs accelerators to be practical, and is intended to work with them. Those annotations serve an important role of informing a runtime or compiler where certain accelerators should be applied.

There is at least one other functional language that is using accelerators as a basis for performance: Guy Yarvin's Urbit, wherein they're called jets. I'm not fond of Urbit's invented jargon, and I think Nock is a stupidly awkward functional assembly, but I agree with the author's analysis of objections to jets.

I haven't seen the approach pursued elsewhere. I suggested Paul Chiusano consider the technique for Unison, but he was and is more focused on the HCI aspect than the performance concerns and he took the conventional route of adding ad-hoc types and intrinsic functions to the language as needed.

Awelon implementation of addition and factorial?

May I ask, how would a function that adds two numbers look like in pure Awelon? Or maybe a recursive factorial function?

loop combinators

Working with fixpoints and anonymous recursion is painful, and many times more so without variables. So we have a couple options: either introduce variables in the projected view or define useful loop combinators like `repeat` such that `3 [F] repeat == F F F` then define other things in terms of those combinators.

I personally favor the latter option, so I might define `add = [inc] repeat` and `factorial = [1 0] a [inc c [mul] a] repeat d`. The latter pushes `result count` on the stack then repeatedly increments the count and multiplies into the result. When done, we drop the extraneous `count`. Other loop combinators could be used, of course. Factorial is readily expressed by constructing a stream or list enumerating from 1 to our value then applying a `foreach`, e.g. `factorial = enumFromOneTo 1 w [mul] foreach`. Or even more readily with something like Joy's primrec - `factorial = [1] [mul] primrec`.

We might define `repeat = [[d] w [a w i] b if] b z` contingent on the definitions of `0 = [d i]` and `S = w b a d` (corresponding to a Scott encoding of natural numbers). The behavior `if = [] b b a (cond) i`, and simply applies a sum type like a natural number to select the left or right path. One can either come up with such definitions by hand, or apply a search, or use a translation with local variables - e.g. `repeat = λF.[λRep. [][λPredN. F i PredN Rep i] if]z`. In practice, a few loop combinators go a long way, especially once we have a pool of collections processing combinators. So we don't need to worry much about the challenge of implementing them.

If you insist on a "recursive factorial", you'll need to work with the fixpoint combinator directly to get your anonymous recursion. In that case I'd translate tricky data plumbing at the editable view layer with local variables. `factorial = [λFac.[1][λPredN. PredN inc PredN Fac i mul] if]z`.

If you're interested in more of this sort of information, Brent Kerby's Theory of Concatenative Combinators (2002) is a delightful read.

Aside: Awelon doesn't have primitive support for pattern matching. I've sketched out a few ways to support that at the view layer, via accelerating a simple model of labeled data.

KPN Semantics

So I have a question about KPN semantics. As you know I am studying coroutines which have indeterminate control flow rather than concurrency: this allows some things like safe mutation of shared state, for example.

However there's another difference I want to ask about: as I understand it the base KPN model has infinite output buffering so that writes cannot fail, reads can fail if the buffer is empty. I understand you're playing with bounds on the buffers which then means writes can actually block. Felix coroutines are entirely unbuffered. I would guess bounded buffers are equivalent to NO buffering, so there remains a single distinct semantic difference: with the original KPN model, nonblocking writes would be similar to changing the Felix semantics so reads always went first. This would means the data is always gone when a writer goes back to write again, since it can't proceed until the reader has grabbed the data.

This is not strictly correct, due to both models supporting multiple channels, and allowing multiple readers and writers to use the same channels.

Still, I am considering at least adding a "read-and-proceed" variant of the read operation which reduces the indeterminism considerably and is somehow a bit closer to the KPN model.

I wonder what you think that would do?

[One reason for asking is that one would like a synchronous model of coroutines which can somehow be "upgraded" to operate concurrently, at the cost of losing transparent atomic mutations for example: many of my coroutines are pure and don't use that facility so could be upgraded safely]

KPN buffers

Bounded buffers are equivalent to writes without buffering. We can trivially model a bounded buffer of size K as a chain of K unbuffered identity processes (where each identity process just reads a value on one channel then writes it to another in an infinite loop). Consequently, you could annotate Felix channels or ports with buffer sizes to improve performance (batching, CPU memory locality, etc.) and reduce deadlock conditions (from order of writing multiple ports, e.g. write x,y in P read y,x in Q will deadlock with a size zero buffer on x but proceed with a size one buffer) without impacting your coroutine semantics.

I don't believe non-blocking writes are very similar to prioritizing reads. Deadlock in bounded buffer KPNs is deterministic, like all observable KPN behavior, regardless of scheduling for reads and writes. I don't see how "read and proceed" will save you much.