Remora: An Array-Oriented Language with Static Rank Polymorphism

The paper: An Array-Oriented Language with Static Rank Polymorphism, Justin Slepak, Olin Shivers, and Panagiotis Manolios, Northeastern University, 2014.
Abstract. The array-computational model pioneered by Iverson's languages APL and J offers a simple and expressive solution to the "von Neumann bottleneck." It includes a form of rank, or dimensional, polymorphism, which renders much of a program's control structure implicit by lifting base operators to higher-dimensional array structures. We present the first formal semantics for this model, along with the first static type system that captures the full power of the core language.

The formal dynamic semantics of our core language, Remora, illuminates several of the murkier corners of the model. This allows us to resolve some of the model's ad hoc elements in more general, regular ways. Among these, we can generalise the model from SIMD to MIMD computations, by extending the semantics to permit functions to be lifted to higher-dimensional arrays in the same way as their arguments.

Our static semantics, a dependent type system of carefully restricted power, is capable of describing array computations whose dimensions cannot be determined statically. The type-checking problem is decidable and the type system is accompanied by the usual soundness theorems. Our type system's principal contribution is that it serves to extract the implicit control structure that provides so much of the language's expressive power, making this structure explicitly apparent at compile time.


on GitHub I guess

Tree-oriented Languages?

If APL and J are array oriented languages, I wonder what has been explored with tree-oriented, or nested data structure oriented languages. It seems like arrays are special instances of binary trees, so a tree language should be more general than an array language.

Four-letter word

Lisp is tree-oriented. Just saying.

Term-Rewriting

Absolutely. However, the more interesting bit of an array-oriented language is the operations that apply uniformly across arrays. Traditional lisps don't really have that, but term-rewriting languages do! Stratego is probably the most educational exploration of this idea that I know of.

Lisp allows arbitrary graphs while K is tree oriented

In the array language K, a vector element can be anything, including another vector, but K doesn't allow arbitrary graphs (you can represent them in other ways). And yes, its scalar functions distribute over any arbitrary tree, e.g.

  x:(1 2;1;(3 4; 5))
  x+x
(2 4
 2
 (6 8
  10))
  +/x
(5 6
 8)

Or, the other way around

http://www.cse.hut.fi/en/research/SVG/TRAKLA2/tutorials/heap_tutorial/taulukkona.html

Represent a binary heap as an array. Obviously it is an encoding so it makes writing visitors a little ugly, but it does the job.

Both APL2 and J support nested arrays

That is, an element of an array can be an "enclosed" array, which looks like a scalar, but can be revealed with the "disclose" operation.

Index oriented

My language sort of generalizes array-oriented to index-oriented. You can view an array as an indexed collection where the index is a natural number. Similarly, a tree is an indexed collection of values where the indices have parent-child relationships. The idea of transparently lifting operations to arrays is generalized to transparent indexing (parameterization). Functional reactive programming also fits into this paradigm (where time is the hidden parameter).

Coordinate Structures

This sounds a lot like the coordinate structures in Stepanov's "Elements of Programming". They tend to get called iterators because that's what the C++ standard library wanted to call them, but really they are much more general. In addition to the simple Iterator (single pass, like an input stream), the ForwardIterator (for a container that you can re-traverse, for example a singly linked list), and the BidirectionalIterator (e.g. a doubly linked list), which most languages implement, there are others:

IndexedIterator - you can move back and forward in arbitrary steps efficiently.
RandomIterator - every location has a natural number index, and lookup time is uniform.
BifurcatingIterator - iterator for something like a binary tree.

The chapter on coordinate structures presents an analysis of iterators as an abstract algebra. Because this is based on templates in C++, generics in Ada, type classes in Haskell, or traits in Scala or Rust, one object can implement several different iterators, giving many combinations of compound coordinate structures.

This is effectively the study of generalised access patterns for containers.
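
For concreteness, here is a minimal Haskell sketch of that hierarchy as type classes. The class and method names are hypothetical (they are not Stepanov's definitions or any standard library's); the point is only that each level adds an access capability.

{-# LANGUAGE TypeFamilies #-}

class Iterator i where
  type Value i
  get  :: i -> Value i
  next :: i -> Maybe i              -- single pass: you can only advance

class Iterator i => ForwardIterator i          -- re-traversable, e.g. a singly linked list

class ForwardIterator i => BidirectionalIterator i where
  prev :: i -> Maybe i              -- e.g. a doubly linked list

class BidirectionalIterator i => IndexedIterator i where
  advance :: Int -> i -> Maybe i    -- move back and forward in arbitrary steps

class IndexedIterator i => RandomIterator i where
  position :: i -> Int              -- natural-number index, uniform lookup cost

class Iterator i => BifurcatingIterator i where
  left, right :: i -> Maybe i       -- descend into a binary-tree-like structure

One container type can then be an instance of several of these classes at once, which is the "many combinations of compound coordinate structure" point above.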

Transparent Lifting

That sounds like a similar concept. One characteristic feature of array languages is that you can write something like this:

y = x + 1

and have that implicitly interpreted pointwise as:

forall i.
    y[i] = x[i] + 1

You don't get that part with explicit iterator use.

Explicit Lifting

That is true, but you can get equivalent generic concepts to map and fold. For example you might define the generic function:

parallel :: (RandomIterator r, Readable r, RandomIterator s, Writable s) => r -> s -> (ValueType r -> ValueType s) -> IO ()

That once defined generically allows:

parallel x y (\v -> v + 1)

Different iterators imply different access patterns; some have only one obvious access pattern (map for ForwardIterator) and others several (pre-order, in-order, post-order for a BifurcatingIterator). As you can see, when you are using containers with more complex access patterns, or compound containers with several possible interfaces (for example, an Array would provide both ForwardIterator and RandomIterator), you would probably want to specify explicitly which access pattern you wanted to use.
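
As a small illustration of why the access pattern has to be named for a BifurcatingIterator, here is a hedged Haskell sketch (the Tree type is made up for the example) of the three traversal orders mentioned above:

data Tree a = Leaf | Node (Tree a) a (Tree a)

preOrder, inOrder, postOrder :: Tree a -> [a]
preOrder  Leaf         = []
preOrder  (Node l x r) = x : preOrder l ++ preOrder r
inOrder   Leaf         = []
inOrder   (Node l x r) = inOrder l ++ [x] ++ inOrder r
postOrder Leaf         = []
postOrder (Node l x r) = postOrder l ++ postOrder r ++ [x]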

Map considered harmful

I don't really like 'map'. It loses the fact that the two indices (of the before and after structure) are identical. Also, I think you're still more worried about algorithms (as opposed to semantics) than I am.

Map for Iterators

I am not sure what you mean about losing the index identity, as map can be implemented in-place for a mutable array.

I don't see map as an operation defined on lists, but instead as an operation defined on iterators, like this:

map src_iterator dst_iterator lambda = ...

There may be several overloaded definitions of map for the different iterator types, so that the optimal map algorithm can be chosen based on the efficient access patterns for different containers.
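
A toy, self-contained Haskell sketch of that shape, with the source and destination iterators modelled as a "read next" action and a "write" action (both names are made up for the example); the in-place case is simply when both actions touch the same mutable array:

mapIter :: Monad m => m (Maybe a) -> (b -> m ()) -> (a -> b) -> m ()
mapIter readNext write f = loop
  where
    loop = do
      mx <- readNext
      case mx of
        Nothing -> pure ()
        Just x  -> write (f x) >> loop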

Losing the index

The thing I don't like is this functional idiom:

xs = ...
ys = map (+ 1) xs
zs = zipWith (\x y -> sqrt (x**2 + y**2)) xs ys

Compare to the implicit pointwise:

x = ...
y = x + 1
z = sqrt(x**2 + y**2)

Not only is there a bunch of extra plumbing in the earlier version, but we're relying on map producing a structure ys that has the same index as xs so that we can safely zip them together to make zs.

Add overloading

With iterators, you can define a pointwise addition operator for any forward-iterable container, and add can then be overloaded with a single generic definition. That allows programming in the style you illustrate, but lets the operations be defined and extended in the standard library rather than in the language core syntax. This keeps the language simple and small.
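
For instance, a hedged Haskell sketch of that kind of library-level overload, using the standard ZipList wrapper so that arithmetic becomes pointwise (an orphan instance written only for illustration):

import Control.Applicative (ZipList (..), liftA2)

instance Num a => Num (ZipList a) where
  (+)         = liftA2 (+)
  (*)         = liftA2 (*)
  negate      = fmap negate
  abs         = fmap abs
  signum      = fmap signum
  fromInteger = pure . fromInteger   -- a constant, repeated at every index

-- ZipList [1,2,3] + 1   evaluates to   ZipList [2,3,4]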

Named parameters

This feature is a cornerstone of my language, not something I've bolted on to handle this example. I don't think doing what you suggest would get you the full benefit of this style. For one thing, you're still losing the common index semantically. How are you going to complain (statically) if the user accidentally adds two containers that don't have compatible index sets? For another, you have to manually add that plumbing code for addition, etc. And then you have the interplay with other implicit parameters (e.g. time, as I mentioned earlier).

Static Sizes

If you were defining add for a container that is statically sized, then the equality of this property would be part of the matching for the add overload, so adding containers of different static sizes would by default fail to find a matching add operator. You could improve the error-reporting by providing an overload for non-matching dimensions which generates the static error message of your choice. In other words you can make the common index requirement part of the semantics of the add operation, using the type-constraints, if that is what you want.
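
A minimal Haskell sketch of that idea, with a hypothetical Vec type whose length is a type-level number; mismatched static sizes simply fail to type-check, which is the "no matching add" behaviour described above:

{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Nat)

newtype Vec (n :: Nat) a = Vec [a]

add :: Num a => Vec n a -> Vec n a -> Vec n a
add (Vec xs) (Vec ys) = Vec (zipWith (+) xs ys)

-- add (Vec [1,2] :: Vec 2 Int) (Vec [1,2,3] :: Vec 3 Int)   -- rejected at compile time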

This does not, of course, offer any benefits for built-in containers; all it does is allow some code that would be written in the compiler to be moved to the standard library, which has the benefit of keeping the compiler smaller. The real difference is that it makes user-defined containers equal to the built-in ones, so my new mmap-ed vector for big-data processing gets treated just like the ones provided with the language when it comes to addition.

I don't really understand the problem with other implicit parameters like time?

Well, most of the time the

Well, most of the time the size of the container probably isn't statically known, and even if the sizes of two containers match that doesn't necessarily mean that their indexes are conceptually the same.

Regarding the memory mapped example, I would want to do that with representations. I see parameterization as fundamental. That's what the site name -- lambda the ultimate -- is about. My point with other implicit parameters like time is that if you try to support them with the same mechanism of explicitly lifted operations, then you'll end up with a combination explosion (lifted to time, lifted to index, lifted to time+index, lifted to X, etc.).

Combination Explosion

I am finding that, and I wish some theory could reduce the problem. I fear there is no real understanding of the nature of representations, at least unless the domain is heavily restricted, e.g. to purely functional, lazily evaluated languages (Haskell), and even then it's not clear.

I'm hopeful

I currently have no experience with a representation system, so pretty much everything I write about my plans in that direction is speculative. Hopefully, I'll get a prototype sometime soon and can play with it.

Concepts

How do you ensure only containers with conceptually the same index get combined? You mean the size of the index is dynamically determined, but you have various things with that same dynamic value. That sounds easy enough to deal with, as you would just statically represent the index with a unique type, which would be a type parameter of the iterator.
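
A minimal sketch of that, in Haskell: a phantom type parameter stands for the index set, so only containers tagged with the same index type can be combined. The tag names below are made up.

newtype Indexed ix a = Indexed [a]

data ByCustomer    -- index tags: purely compile-time labels
data ByOrder

addPointwise :: Num a => Indexed ix a -> Indexed ix a -> Indexed ix a
addPointwise (Indexed xs) (Indexed ys) = Indexed (zipWith (+) xs ys)

-- addPointwise (xs :: Indexed ByCustomer Double) (ys :: Indexed ByOrder Double)
--   is rejected at compile time: the index tags differ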

Iterators are the representations of abstract data structures. They abstract the access pattern, not how the data structure is implemented. Iterators are the abstract algebra of containers, in the same way groups, rings and monoids are the abstract algebra of natural numbers.

I still don't get your time objection? You pass the iterators by value; the only implicit is the type of the iterator. I don't see how this would interact with time as an implicit parameter; surely you would just have:

test :: ?time -> iterator1 -> iterator2 -> result

I don't see any issues, unless it's that you have to implement for different iterators? That's about efficiency, because containers should only implement iterators that they can provide efficiently, and not every container efficiently maps to indexed addressing (linked lists and trees, for example).

How would you efficiently define addition over both a tree and an array, considering that tree nodes may be deleted, or moved (pruned and grafted)?

Or how would you efficiently define a sort function that works for both linked lists and arrays?

Containers of structs vs. structs of containers

See my reply to John. I'm basically trying to make it convenient to use a style of containers of structs rather than structs of containers.
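
Roughly, in Haskell terms (the Point type is just for illustration):

data Point = Point { px :: Double, py :: Double }

structOfContainers :: ([Double], [Double])   -- xs and ys kept as separate, parallel collections
structOfContainers = ([1, 2, 3], [4, 5, 6])

containerOfStructs :: [Point]                -- one container whose elements carry both fields
containerOfStructs = [Point 1 4, Point 2 5, Point 3 6]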

Similar Ideas

My summary would be that these use similar underlying ideas, but you want to optimise for simplicity of syntax, whereas I want to optimise for flexibility of use, for example by allowing new representations to be added by the programmer, as well as implementing existing representations for new types.

In my model there are only values and containers, so a struct (or record, or tuple) is a heterogeneous container just like an array or a list is a homogeneous container.

Back to representations vs. parameterization of implementation

I still think a big difference in the way we approach things is that you're thinking in terms of algorithms and I'm thinking in terms of semantics. For example, I think it's uncommon for the semantics to care which container is used. Eventually, you pick the container that optimizes the operations you need. If I can specify the behavior I want without implementation knobs/parameters, I think that simplicity has advantages. And I think that a representation system will be more powerful than what you can get with parameterization. For example, you might write code that generically uses "a container", but then realize that the way you did it precludes using an intrusive list as that generic parameter. Setting up your generic function so that it supports all possible implementations is delicate and probably impossible.

But I've probably posted too many words about this here given the state of my representation system plans.

I understand that view but...

I understand that if you want to make it as much like equation-type math as possible, where a final value being equal is all that is required for expressions to be equal, you can embrace a view of semantics that disregards everything except information inputs and outputs.

But I can't. To me it really matters - it is a vital part of the semantics - whether a sort is O(n^2) or O(n log n) or O(n), and whether tail recursion consumes unbounded memory. These performance characteristics are an intrinsic, inseparable part of what the algorithms _mean_, and a view of semantics that does not account for them allows you to call things equivalent which are not.

The value that gets output is NOT the whole meaning of a program. It's just the easiest to reason about in a reductive way. The inputs that a program requires and the values that the program outputs include the compute resources provided and utilized, and until formal reasoning includes that part of the program's meaning it will be incomplete.

Myth of RAM

Most complexity analyses are severely incomplete. For example, RAM is really O(sqrt(N)) because it's connected on a 2D physical surface. We just try to hide this via successive layers of CPU cache. (And virtual memory. And now NUMA.) There's also a log(N) aspect in there, somewhere, as our address space grows.

If we reject the simplifying lie of RAM, then arrays don't necessarily offer greater performance from a big-O perspective. The implicit (and hardware supported via TLB) movement of data between cache and RAM remains very convenient, however, allowing us to support a simple form of 'locality' for iterative access without making it explicit in our code (e.g. via some variant of Huet zippers).

I think it's useful to understand performance characteristics. Computation is mechanical. Motion and transformation of information suffers forms of impedance or friction.

But how much complexity are you willing to accept? What sort of type would you ascribe to an array to show it has O(sqrt(N)) random access but amortized O(1) iterative access for adjacent indices?

Access patterns

An array is an array; a simple type will do, but it provides a random-access iterator. This is an abstraction of the access pattern of a set of algorithms, not of containers. To put it another way: take all the useful sort algorithms we know and group them by the access pattern they use. Now containers must implement only those access patterns they support efficiently. Note that we start with the algorithms, not the containers, so a change in memory structure or performance or CPU architecture does not change the classifications, as they are classifications of algorithms.

What new RAM might change is which algorithm we pick for a given data structure, so an array will provide both sequential and random-access iterators (because its structure lets it implement those access patterns efficiently), but the compiler might choose the sequential-access algorithm if memory paging is known to be an issue. The only thing that might change the classification is the introduction of a new class of algorithms, but that would add a new type of iterator, not affect the existing ones.
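
A hedged Haskell sketch of that classification (the class names are invented for the example): algorithms state the access pattern they require, and a container implements only the classes it can support efficiently.

class Sequential c where
  elements :: c a -> [a]            -- a single forward pass

class Sequential c => RandomAccess c where
  at :: c a -> Int -> a             -- efficient arbitrary indexing

instance Sequential [] where
  elements = id

-- Merge sort needs only the Sequential pattern, so it works for any such container;
-- an in-place quicksort would instead demand RandomAccess.
sortSequential :: (Ord a, Sequential c) => c a -> [a]
sortSequential = mergeAll . map (: []) . elements
  where
    mergeAll []   = []
    mergeAll [xs] = xs
    mergeAll xss  = mergeAll (mergePairs xss)
    mergePairs (a : b : rest) = merge a b : mergePairs rest
    mergePairs xs             = xs
    merge xs [] = xs
    merge [] ys = ys
    merge (x : xs) (y : ys)
      | x <= y    = x : merge xs (y : ys)
      | otherwise = y : merge (x : xs) ys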

Edit: Further to the above, the compiler is not going to optimise bubble sort into quicksort; that is not optimisation but function-equivalence search in the space of possible functions. You cannot gloss over the difference between different algorithms. As such, you don't want your library of algorithms inside the compiler; that does not make sense. You want to implement the algorithms in the language libraries, where you can see the source code, and understand and improve them.

re: access patterns

I agree with most of what you say. It is convenient and useful to characterize data structures and algorithms based on access patterns provided and utilized. This is not so different from characterizing services and applications by interfaces provided and utilized.

But the definition of "random access iterator" is frequently something like "A RandomAccessIterator is a BidirectionalIterator that can be moved to point to any element in constant time" [1] (emphasis mine). By this definition, arrays DON'T truly provide random access iterators. Because the "constant time" is a lie. We simply can't do much better than O(√N) with physical implementations of a computation.

We can say that arrays should provide 'efficient' access to any element. If this is your personal definition for a random access iterator, that's cool with me. There are a lot of data structures that are good at 'efficient' access to any element - e.g. tries are one of my favorites. I also like finger trees and ropes. Log-structured merge trees seem to be popular for scalable key-value databases.

Edit: Also, I wouldn't want a compiler to behave in ways that are difficult to control or comprehend. I wouldn't reject optimizing a bubble sort to a quick sort, but only if there is some way for me (as a programmer) to predict and protect this optimization. I'm not sure what such a programmer interface should look like.

Constant time...

Constant time is not a lie, unless you're confusing non-jargon usage with technical jargon. It's used here in exactly the same technical sense it has in algorithm analysis, meaning O(1).

It means the runtime required will never be *more than* a constant factor higher than 1. The constant factor, in this case, is the time required to calculate the index and access main RAM. Yes, you might hit a cache or an optimization and get it faster. But the operation is bounded by the constant time required to perform an access in main RAM.

In the case of an array, that is exactly true. It *will* perform the access, within the time required to calculate the index and access main RAM once, regardless of array size. That is very different from random access in a list, in which the same "constant-time" fetch has to be made a number of times proportional to the number of elements traversed to get to it.

re: Constant time

Algorithmic analysis frequently starts with "assume the following operations are O(1)". This assumption is simplifying, but is usually invalid. For example, random access to array elements is not O(1) with the size of the array. But, with enough repetition, many people begin to accept the assumption as valid.

O(1) is the spherical cow of computer science.

Addendum: The problem with your claim about maximum costs is that it's more or less equivalent to the argument: "sorting is O(1) so long as we're sorting fewer than 2^32 (or any other constant) elements". That's a true statement, but it also loses the relevant variable factor from your asymptotic algorithmic analysis.

Optimizing compilers, sorts and value vs operational semantics

When I'm using value semantics, I set -O6 and I *DO* in fact want an optimizing compiler to recognize bubble sort, say, "Oh, that's a SORT" and optimize it into quicksort (or merge sort, or whatever, probably depending on runtime profiling). At least insofar as it is able to identify the sort accurately. Value semantics means I don't care how it's done, and generally I want my answers as fast as possible. At LEAST as fast as the operational semantics my code expresses would get them, but if the compiler can find a faster way, go for it.

When I am using operational semantics, as in cryptography, I have a very different set of requirements, and I *DON'T* want the compiler to optimize anything in a way that changes timing and memory access patterns. I set -O0 and I STILL have to go back and check the machine code to make sure it didn't do some common optimizations, because not doing those optimizations is a vital part of the semantics I need but is NOT a supported part of the language spec.

The fact that multiplication can be done faster in some cases (where some words are zero for example, or where the numbers are large enough that Montgomery multiplication would be faster) does not mean that I want the compiler to change the multiplication algorithm I use; constant-time multiplication prevents a side channel attack where the opponent observing timings can discover information about what numbers I multiplied. The fact that I don't access a variable again does not entitle operational semantics to elide the write when I clear it; that write may be absolutely necessary to clear a sensitive value before exiting a critical section.

Cryptography is an example of an entire field in which the operational semantics and access patterns are a vital part of what the program means and proofs of "semantic equality" using value semantics utterly miss the point.

Interesting argument

For example, RAM is really O(sqrt(N)) because it's connected on a 2D physical surface.

That's not really the argument made in that article. The maximum density achievable in a 3D space that does not collapse into a black hole is bounded by n^2 as a result of the underlying equations. I did not see any argument that a maximally dense structure would have to have 2D connectivity.

It's an interesting argument although it does seem a little confused to me. The argument is not really a true upper-bound. It does not take into account any complexity in switching or routing that is necessary to access such a structure so it is not an upper-bound on access - the root(n) could only be one factor in a higher upper bound.

It also seems confused in another way: the difference between a RAM (of fixed size) and the general trends in different sized RAMs. I've never seen a workable proposal for a RAM-type device of arbitrary size - how would the address be encoded for an arbitrary-sized space? Surely the encoding would cause a massive blow-up that would be reflected in the upper bound for the access time.

Big-Oh just means "hide the constants"

Big-Oh notation is commonly abused, but I would assume the author is just being sloppy rather than confused. Also, you can reference an arbitrary address by interpreting an N-bit pattern of all 1s to mean that a 2N-bit pattern follows, etc. Summing the geometric series, you'll only ever waste half of the bits in an address.
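
A toy Haskell sketch of that scheme (the 8-bit starting width is an arbitrary choice): a field of all ones at width w means "a field of width 2w follows".

import Data.Bits (testBit)

encodeAddress :: Integer -> [Bool]            -- assumes a non-negative address
encodeAddress = go 8
  where
    go w n
      | n < 2 ^ w - 1 = toBits w n                       -- fits, and is not the escape pattern
      | otherwise     = replicate w True ++ go (2 * w) n
    toBits w n = [testBit n i | i <- [w - 1, w - 2 .. 0]]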

Neat trick

That's a neat way of encoding it - exponential backoff with an escape sequence. I shall add that to a list of cool tricks that probably come in useful in other domains.

Utf8 already uses a variant,

Utf8 already uses a variant, just one bit per byte instead of one bit per bit.

Right. Appropriate to address scale.

Right. UTF8 already knows they're committed to using 8 bit units, and the maximum value they must encode is 21 bits wide - so using 1 "green" bit per 7 "data" bits has their largest-scale numbers working out even as three bytes.

Another way (more compact, used in some bignum libs for serialization) is a byte width in base 255, followed by a stop sequence (255 in this case rather than zero), followed by the number in base 256. So an 18-byte number would be transmitted as 18, 255, <18 bytes of value>, and a 257-byte number would be transmitted as 1, 2, 255, <257 bytes of the value>.
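
A toy Haskell sketch of that framing (the names are made up): the byte count in base-255 digits, most significant first, then the 255 stop byte, then the payload.

import Data.Word (Word8)

frame :: [Word8] -> [Word8]
frame payload = map fromIntegral (digits (length payload)) ++ [255] ++ payload
  where
    digits :: Int -> [Int]            -- base-255 digits, most significant first
    digits 0 = [0]
    digits n = go n []
      where
        go 0 acc = acc
        go m acc = go (m `div` 255) (m `mod` 255 : acc)

-- frame of an 18-byte payload starts 18, 255, ...; of a 257-byte payload, 1, 2, 255, ...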

A variation of the way the above communicates the WIDTH of the number is the way zero-terminated strings are encoded. They are effectively sequences of base-255 digits (except that the digit zero is encoded as the byte #0xFF), with the first occurrence of the non-digit #0x00 signaling the end of the sequence.

Skynet

Just to state the obvious, terminator-based representations have high hazard potential if the terminator gets corrupted.

If you watch the first movie

Terminator has code go by in his visor.

It's 6502 code.

The same 8 bit processor in an apple II

Explains a lot.

It always seemed curious that x86 became dominant in this timeline. Now we know why.

I mostly agree

I agree that operational characteristics are an important part of the programs we create, but I think it's a good idea to keep the input-output/value semantics separate from the operational semantics when that's possible. I guess it boils down to balancing whether the simplicity of abstracting from implementation details up front pays for the extra work of recovering an efficient enough implementation down the road.

Recovering Efficiency

I don't think you can recover the efficiency. You have to program the algorithms and choose the data structures to get an efficient implementation. A compiler is not going to be able to refactor code, as it's equivalent to a proof search. What a compiler might be able to do is choose between a number of possible algorithms (quick-sort, merge-sort, funnel-sort, etc.), based on certain parameters, but where are these algorithms best implemented, inside the compiler or in a library? My position is you do not want a "library" of possible algorithms inside the compiler; what you want is a way to abstract the algorithm choice so that you can implement algorithms in a library in such a way that you can select the most efficient ones.

Not in the compiler

My hope is to simplify semantics and separate implementation selection into a separate task that is still under the control of the programmer. Like you, I doubt that the compiler can in practice recover an efficient implementation on its own (the general problem is undecidable). An advantage of structuring things this way is that, as with proof search, the compiler can help you find a good low level implementation. On the other hand, if the only thing you specify is a low level semantics, then the compiler won't know what's going on and thus can't help you.

not grokking in fullness

Extra plumbing isn't good, certainly, and it seems this example would be a plausible case in favor of greater abstraction; but I'm not sure I grok the point about the index. map is by definition a device for acting on members while leaving the index undisturbed; if it doesn't do so, that's a bug.

The problem

You're right. That code works and if the lines are consecutive, it might even be pretty clear. But what if you abstract a function

f(xs:[Int], ys:[Int]) = ...

Now it's not clear from the types that xs and ys are expected to have a compatible index. You can pre-zip x and y, like:

f(xys:[(Int, Int)]) = ...

Now it's clear from the type that there is a single index (because there's a single collection). But that loses the ability to name the parts x and y and thus loses the openness that we had. If we add z, then we have to change a bunch of plumbing code.

It's not exactly a problem with map. It's a problem with a certain style of programming that uses map, zip, etc.

Cache Efficiency

Arrays have the advantage of working well with caches. Modern architectures seem to be adding more layers of cache not less, so I don't think this advantage is going to go away any time soon.

Prolog is more tree oriented than Lisp, which I would call list oriented. Just saying. (Yes, you can represent a tree using nested lists, and a lisper might call Prolog tuple oriented.)