## Alien worlds, values, and you can't touch this

Conventional programs operate on values from one coherent "world;" e.g., many programs execute on a Von Neumann machine (i.e., the CPU world) where values are stored in registers or in uniform access memory while GPU-based programs operate in a very different world with severe computational and memory restrictions. Other examples of these worlds include distributed and parallel computing environments (futures, map reduce clusters), persistent data stores (relational and graph databases), worlds where time is abstracted (FRP), worlds where space is abstracted (GPU, various databases, array programming languages), and so on.

Typically, our bridge between different worlds is a very coarse API where code on the other side is often written in a different language, but DSLs/libraries like Bling can allow us to manipulate from the CPU world resources and values from other worlds through various wrapping and indirection tricks. So in Bling we can manipulate a timeless FRP-world where signal-like values are composed and routed between UI components, or we can write some C# code that operates on GPU-based values to express vertex layout and pixel colors.

Such techniques are hardly transparent: we must firewall values between different worlds as they either do not mix or how they mix is very tricky and possibly very expensive. An int on the GPU is very different from an int on the CPU even if they both support "+;" given a + b, where a is a GPU values and is a CPU value, the result is a GPU value and we must add extra logic to route b onto the GPU at runtime (as a global shader parameter). We could even bring the GPU int back to the CPU somehow, but only at extreme cost that involves transferring GPU memory to CPU memory; its not something that should be allowed to be done lightly.

Haskell seems to have great support for world bridging, possibly because its own world is so idealized that bridging is easier as well as often necessary. My question, and idea, is would it make sense to add specific support in a programming language for world bridging; i.e., should a language have specific support for alien values that do not come from its own native world? Alien values will often share data types with the host world; e.g., a CPU int vs. an alien int, and many operations on those data types will often have some kind of reasonable interpretation (e.g., gpu_a + 1 yields a gpu value that is formed from an add computation). What sort of abstractions would be necessary to allow for as many safe compositions, preventing unsafe compositions, and making expensive operations very obvious?

## Comment viewing options

### Makes me think of message passing

Isn't a memory transfer like message passing in the actor model?

Maybe the language should encourage "fat" messages, so as not to waste too much on transport costs. And of course the message passing syntax should be explicit.

There should be ways to tweak the transport parameters, for example using the texture loading or not with a GPU.

### Memory and messaging

I'm not a big fan of messaging passing in this case, as it basically gives up on composition at the level of values. You could create nodes that then compose the values via message passing, but we are back to square one!

I'm not sure if this was clear, but there is also a matter of time as well other constraints; "thunking" a future means halting until the future's CPU value is computed, thunking a signal gives me the current value of an FRP expression without the benefit of update (thunk with observe could do that). There is also something about space that goes beyond memory; say I express logic to compute the color of a pixel, but what pixels are actually computed depends on rasterization; they aren't discretely addressable. We might also have values from a secure world where thunking is impossible and we can only manipulate them using certain APIs (to prevent private user info from leaking from the program).

### Batches

I consider Batches to be a technology for bridging computational worlds. The idea is to partition code that refers to and uses results from an "alien API" into two parts: the local code and the alien code. The system then optimizes the communication to transfer data in bulk, and sends the alien actions as a script that is executed in the alien context. You can think of the partitioning as being based on "binding place analysis" much like "binding time analysis" for partial evaluation. The place analysis allows staging by place, rather than staging in time. Oracle has encouraged me to release batches as a proposed future feature of Java, and you can try it out in my fork of OpenJDK "javac".

### I can't believed I missed

I can't believed I missed this paper! Bling does something very similar ("binding place" analysis) but in a value-oriented rather than control-oriented, decent fro the problems I was trying to solve (reactive and GPU programming, consider that the expression that is assigned to the color property becomes "transported").

"batch" seems to be a control-flow oriented world change: the data world under the batch is native while the CPU world is alien, but you can still refer to it (e.g., System.out.Println), where the operations are then stored up and executed in a single transition at the end of the batch. But I'm thinking that the batch calls shouldn't necessarily exists. Instead, "music" would have APIs callable from the native world, where the results (as 2nd class values) and effects of these APIs are not directly observable from that world. Of course, this would require an explicit flush somewhere, and depending on the data topology it might not bring over data as efficiently.

### Typeclasses

Another reason Haskell seems so good at that bridging is typeclasses, which really are a mechanism for describing similarities (classes) among different worlds (types).

### Partition Types

My RDP paradigm makes extensive use of "partition" types for signals. I describe this on github. I was inspired from the FRP.Sodium library. In my case (and in Sodium) partitions represent distinct threads, but they could support multiple processes and heterogeneous memories quite effectively.

Partition types have some interesting impacts. For example, you cannot just create a unit () value out of nowhere... in general, you must be able to explain: in which partition does this unit value exist? how did it get there?

As an aside, you CAN create values "out of nowhere", but only in imaginary partitions (like the Void partition). Such partitions aren't very useful except for type-level computations or a black-hole role similar to /dev/null (they don't perform any computation, and you can't pull values back out of them).

I plan to utilize partition-types extensively in my RDP-based language when I get around to developing it. Partition types, in combination with a much less expressive language than Haskell, will make it relatively easy to distribute code and orchestrate computations across remote partitions, heterogeneous memory, GPUs, FPGAs, etc..

Partition types can further be utilized to control ambient authorities. E.g. for Sirea, I will probably develop partitions for Joysticks and partitions for OpenGL, and so on, representing available resources for computations that occur in a particular partition. In Sirea, I model these by use of typeclasses (i.e. class HasPrinter p means there is a behavior available in partition p for writing to the console). For my RDP language design, I'm taking a lot of inspiration from Cook's object grammars (objects as first-class partition types, controlling even the low-level features like +), though I haven't entirely figured out how to best use them.

So you could say: yes, I believe this makes a lot of sense.

Except for one thing!

You seem to be favoring something like automatic coercion between partitions, i.e. CPU Int + GPU Int -> GPU Int. I think automatic coercion between partitions is a terrible idea. Well, I wouldn't mind creating some sort of "soft" syntax that tells the compiler to auto-wire two partitions together intelligently. But the underlying logic should use only explicit communication, and the developers should be explicit about when they want auto-wiring.

In FRP.Sodium and my Sirea, there is a cross behavior that expresses motion of signals between partitions. In general, we would want to control access to cross behaviors. Within context of Haskell threads I can be pretty liberal about crossing between partitions, but that won't apply to something like FPGAs or GPUs and other physical models. Even for remote processes, I would need to restrict the cross behavior to "serializable" values. We should be able to express some pretty severe restrictions in the communications logic, regarding what kinds of values (aka programs) can be communicated where.

[Addendum] BTW, remember how you keep asking me why I model latency explicitly in RDP? Well, these partition types are the primary motivation for them. Expensive cross operations are "made very obvious" by associating latencies with them, for example. If I could effectively and efficiently model nanosecond-precision latencies in the Haskell type-system, I'd model them there (weak support for dependent typing has me eying Agda).

### I agree that automatic

I agree that automatic coercion between worlds isn't ideal; it should be more like an explicit coercion that gives us awareness over the costs of the conversion (e.g., a shader register for CPU -> GPU coercions). On the other hand, if the cost is effectively free and the conversion is semantically lossless, then an implicit coercion would probably be acceptable.

I guess I'm really looking for a low overhead solution here, I mean, a language that is a smaller delta from what programmers use today that adds 1st class support for 2nd class (alien) values that don't scare away most programmers. I know this is commonly done in Haskell, but the support (a) still seems to be somewhat of a pattern and (b) requires some amount of unconventional thinking associated with programming in Haskell at all.

### Even between lightweight

Even between lightweight threads, I would find it difficult to agree that conversion is "effectively free". (It might not require routing or copying, but there is still a scheduler delay, a potential move to another CPU's cache, possible synchronization.) But I agree that "scaring away most programmers" is something to avoid.

A point to consider is that something like cross becomes part of a communication model, serves a similar role to channels or pipes. It isn't so difficult to support partition types if developers only need to be vaguely aware of them at certain communication boundaries (except, of course, when they are writing the communication abstractions). I think that, even applied heavily, partition types are not likely to disrupt most programmer experiences.

In any case, all you need for alien types are: abstract data types and operator overloading. (Haskell uses typeclasses for the latter.) The primary technique is to abstract the constructors for a value, i.e. so that we aren't modeling this as creating CPU integers then converting them, and we are instead capturing code that creates the integer.

In Haskell, most of the control-flow abstractions (Applicative, Arrows, Monads) each allow embedding of arbitrary Haskell values (pure, arr, return). This has proven a weakness for generalizing these "patterns" in Haskell. We really need to say that only some values can be converted, and even then only at certain times and places. There has been some work at generalizing the control flow models so they do not assume a language that is a superset of Haskell (cf. the Categories package and Adam Megacz's Generalized Arrows). Using these more generalized control abstractions, we can apply type-level transformers to lift from software design pattern into proper abstraction.

I believe one could leverage Cook's Object Grammars to a similar purpose. I am planning to pursue Object Grammars further, as they are much closer in nature to the partitioned model already (unlike typeclasses, which push overloaded behaviors to a global space).

### Even between lightweight

Even between lightweight threads, I would find it difficult to agree that conversion is "effectively free".

I prefer to optimize the programmer's time where feasible. I like the idea of a compiler err-warn that allows the compiler to "figure it out" while prototyping, then allow the programmer to debug and optimize the cost of what they've done rather easily. Ya, thunking this future 'here' is expensive, but if you must...

In any case, all you need for alien types are: abstract data types and operator overloading. (Haskell uses typeclasses for the latter.) The primary technique is to abstract the constructors for a value, i.e. so that we aren't modeling this as creating CPU integers then converting them, and we are instead capturing code that creates the integer.

This is what I did in Bling, but I found it to be completely unscalable (for C#), where some sort of auto-lifting of safe functions would be nicer.

I really need to spend time on grocking the object grammar work; incidentally, this idea first came up when I was talking to a colleague about his recent ecoop paper.

### Progammer's Time

I agree with optimizing the programmer's time. I touched on this when I mentioned auto-wiring earlier, but it would be easy to miss in a wallotext.

You and I have, several times in the past, discussed the possibility of soft constraint-logic systems or similar to "fill the gaps" in a partial specification, allowing programmers to work at something closer to a sketch or pseudocode level (and gradually improve the constraints on a system). That sort of feature remains a significant part of my vision for unifying UX and PX. I imagine such systems working at both development time and at runtime (depending on where the programmer wants to use them).

I feel that programmers should be explicit when they want the auto-wiring; explicit doesn't mean verbose, but it is different from automatic coercion. To the extent developers abstract strategies for a constraint-logic system, they will need to be explicit about which coercions are available. The underlying logic needs to be very formal and precise and under developer control, even if the decision heuristics for strategies and preferences are fuzzy.

Automatic coercions imply a built-in heuristic that is difficult for developers to control. Similarly, you're suggesting built-in "err-warns" (nice noun-verbing there :) where I would wish for developers to build warnings into their own heuristics (e.g. via annotation of low-preference or high-cost strategies, or reflective observers on them).

### toPixel

Efficiency is another problem with being too explicit. Often in a GPU computation, the best partitioning for the computation depends on what the encapsulated functions called do. For example, setting up a global parameter (CPU -> GPU) explicitly means you must decide before you call the functions that mix the CPU value with the GPU values, but its very possible that the run-time encode this transition much more efficiently as a larger sub-expression according to what the function does.

I even play a trick in Bling where the vertex/pixel shader computation are considered together, and we only force some computations down into the pixel shader (where they are necessarily less efficient) by stating that some expression is computed at a pixel granularity (e.ToPixel); the run-time then figures out the largest sub-expression that can be computed solely by the vertex shader (not contaminated by a toPixel value), which is then passed as a parameter to the pixel shader.

The only alternative that I can see recodign all the logic explicitly, so if your triangle hypotenuse computation mixes CPU, vertex, and pixel values, you'll have to recode it and divide the computations accordingly to get the best efficiency.

### Batching

You mention granularity problems, e.g. lots of small CPU->GPU operations instead of one bigger one. That seems more an artifact of control-flow languages. The declarative nature of data-flow makes a lot more techniques available to optimize granularity of communication. For RDP, I batch updates between partitions. The batching supports multiple objectives: snapshot-consistency, efficiency, minimizing physical synchronization. Within Sirea, I also use a bounded-buffer abstraction - i.e. a bounded number of batches in-flight - to ensure fairness and soft real-time processing. I expect you could use similar mechanisms in Bling, with a clever application of buffers, so long as you make any feedback asynchronous.

Some of the optimizations you mention (the toPixel tricks to push more code to the vertex shader in Bling) are quite specific to the problem domain. I doubt you could achieve them readily in a generic way, i.e. without knowing a lot about the targets. If you were to develop a more generic or general-purpose model of Alien types, how well would this work for you? I think not so much.

I have been developing a relatively generic strategy to improve efficiency and robustness in a system without changing the logic: implicit code distribution or replication. Code distribution can drift in either direction. I.e. if you invoke a remote service, some service capabilities might expose part of their code to you (so you can make service-decisions locally and reduce overheads). Conversely, you might expose some capability code to the service (e.g. so the service can access your policies and preferences without a round-trip). This code may be shared further, e.g. if a service invokes another service to support your request. Automated code distribution is constrained orthogonally to the object capability model - discretionary security, based on certifications of trust and equivalence, and discretionary acceptance to host the foreign code rather than invoking remote caps.

Logically, such code distribution never happens; the round-trip still occurs (and is, thus, still reflected in latency). But the physical cost is much lower. And efficiency is much higher. (There is, of course, a more explicit form of code-distribution available - by sharing a representation of code that hides nothing, aka a "value" instead of a capability).

In addition to that, of course, one can meta-program, e.g. leverage a weighted constraint-logic to build a low-cost program. This avoids the over-specification that sometimes leads to inefficient programs.

### When going from a world of

When going from a world of lower frequency (fewer updates) to a world of higher frequency (more updates), it makes sense to partition computations as coarsely as possible. We might be able to generalize that into a principle that as much work should be done in the native world before the value is transported into an alien one. However, there is a cost as transporting more coarse grained values might be more expensive than transporting fewer fine grained values to be composed in the alien world; e.g., consuming global parameter registers for a CPU to GPU transition. The GPU infrastructure might be able to optimize in that case, being able to identify redundant computations that depend on static global parameters (they can't change during shader invocation), and reason about whether resources are available to cache their results in extra registers. But GPU programmers really don't trust/rely on the infrastructure to be that smart!

I'm not exactly sure if we are talking about the same thing. I'm just interested in mixing 1st and 2nd class values from a more traditional model of computation (at least, it should look like POC - plain-old code - as much as possible), while you seem to be tackling a much more comprehensive problem. That's quite all right, but I would really benefit from seeing a more focused discussion of your ideas/work...like in the form of a technical paper :)

### Frequency

Initially, I didn't follow your first sentence. Then I realized you interpreted the fine-grained transfers as being on the temporal dimension. I was thinking about the spatial dimension - i.e. rather than updating one integer three times, update three different integers in one batch. This is useful regardless of frequencies.

That said, I do support the temporal dimension. Sirea partitions will process all the batches received at the start of each round. In some cases, that might mean combining up to six batches from another partition. (Six is the default bounded buffer size for batches in flight in each direction between each pair of partitions.) This helps a slow consumer keep up with a fast producer, allowing frequency disparities of up to factor six. The performance properties are quite different compared to systems that use bounded buffers but only process one input at a time.

This is made possible in RDP due to its temporal semantics. I.e. a batch can tell me about future values (value @ time) without overwriting information about past values.

For a GPU, I don't have as much freedom to represent the temporal semantics. Well, it is possible to model some shader variables as temporal splines (they cannot change during shader invocation, but they can model change - e.g. for motion blur, temporal anti-aliasing, and efficient simple animation over multiple frames - tweening, rotating, swaying), and even to model some temporal attributes with vertex attributes. But the translation effort to the GPU is high enough that I feel it's wiser to consider these almost orthogonally to the communication and update model. Though, I would certainly leverage recorded history or anticipated future when computing animations, shaders, or shader variables.

You seem interested in combining a mess of code from both the CPU and GPU then teasing it apart, disentangling it, with a sufficiently smart compiler. You'd face problems such as "what is the largest sub-expression that can be shifted to the GPU?". I can see the motivation for such things, but I'd prefer it be left to staged programming (even if we make it look like plain old code via syntax or similar), with the more formal and generic partitions model underneath it. (For my approach, I imagine some sort of specializer or JIT would be desirable to optimize stable code that is glued together from lots of small first-class functions.)

### Decoupling algorithms from schedules

In some sense you're discussing partitioning: how to managed values from multiple partitions within a program in a straightforward manner.

A somewhat orthogonal (I guess that's like being a little bit pregnant) view on this partitioning problem, specifically in graphics, is this excellent paper on Halide, a system which separates graphics algorithms from the schedules used to execute those algorithms (where "schedule" is a spacetime schedule, e.g., describing both spatial and temporal partitioning).

The main valuable insight here, as in many other domains, is that scheduling details are critical for performance but irrelevant for correctness, so separating them linguistically simplifies both the algorithm description and the schedule description.

### Cool paper

Looks like a nice paper. Thanks for the link.

### Is an encrypted channel a valid technique?

I have often heard people complain that "check constraints" put too great a load on a database, and often wondered why it was not common for this constraint to be treated as a projection to the "middleware".

I thought, in undergrad, a good solution would be to only allow connections to the database via an encrypted channel in the middle tier, and the database would simply trust that the middle tier had done the sanity checking. This changes the game in interesting ways (possibly negative ways), as one of the defining features of a (robust) database is that all the rules for maintaining data integrity are essentially in terms of itself.

If valid, I would expect such abstraction to be supported. But how could that go into today's languages? I've always assumed this was an abstraction technique I would have to wait 1,000 years to make its way into realistic language-driven systems.

### If valid, I would expect

If valid, I would expect such abstraction to be supported. But how could that go into today's languages?

Inferring code contracts from check constraints in a tool like Linq2Sql's sqlmetal seems feasible.

### Hi Sean...

...take a look at Murphy et al's ML5 language. They were designing a distributed language, in which it is reasonable to send certain values (e.g., booleans, immutable lists of ints, etc.) over the network, but where other values (e.g., files, sockets) were inherently local to a machine, and couldn't be sent over the network. The hard case were functions -- depending on whether or not a function closure captured a local value or not, you might or might not be able to send it over the wire.

To handle this, they introduced a modal type operator â–¡A, which was the type of mobile values of type A. So the type â–¡(A â†’ B) contained just those functions which captured only local values. The nice thing about the system is that you could do all the needed typechecking with just a bit of bookkeeping on variables -- the typing rules are basically those of modal logic.

I've started using a similar idea in FRP (nothing published yet, though), to account for the idea that certain values (e.g., ints) can be used at multiple timesteps, and other values (e.g., streams or GUI widgets) which are represented by mutable data structures are only good at a particular timestep. It's really nice!

### Very nice. I might need to

Very nice. I might need to look into that sort of modal logic for some of my own work.

### Transactions

One thing I'm looking at that relates to this question is making transactions, with constraints, native within a language.

You know the way the formal call parameters (which are also local variables) to a function are all simultaneously initialized using the actual argument values from the call site before the function begins. Apply the same logic at the call site, and you have formal return parameters (which are also local variables at the call site) simultaneously initialized using the function's actual return values.

This can be modeled as a transaction in which all of these initializations happen simultaneously.

With the addition of 'constraint' functions (functions which are automatically evaluated whenever an attempt is made to assign or initialize a constrained variable) to express invariants on the relationships between variables, you can add the notion of transaction failure due to constraint violation.

Constraint functions must be order-independent, and all constraint functions constraining variables assigned/initialized by a given transaction are invoked (in any order) with the proposed (post-transaction) new values. If any of them return false, the assignment/initialization does not take place. If any of them return a constraint-failure error, the assignment/initialization does not take place *and* the attempted transaction passes on the constraint-failure error, which must be handled somehow. Otherwise, the transaction takes place, initializing local variables.

It's a nicely succinct way of making sure your invariants remain true at all times, because there's no point at which you've modified one of the constrained values but haven't done the others yet. It automates constraint checks in a way that permits the compiler to optimize them away whenever it can prove that they will be met. The specific use case I was thinking of was to ease the pain of dealing with RDBMS's, but it also seems applicable to other 'foreign/can't touch this' sorts of data.

Ray