## Optimal efficiency

Reading about Constraint Handling Rules, there is a claim that CHR is the first declarative language for which optimum efficiency of implementing all algorithms can be proved, and that this cannot be proved for the pure part of logic (eg Prolog), rewriting (eg Maude), or functional programming (eg Haskell), without adding imperative features.

Should we all be trying to develop languages based on CHR, rather than functional? What other language types to proofs (or disproofs) exist for optimal efficiency?

## Comment viewing options

### Precise statement and rigorous proof

Hello Keean,

It's difficult to answer without further references. Where did you find such a claim? Do you have any link to a mathematical statement of this result, hopefully including a proof?

### The Proof?

This document was cited as where the proof is, but I haven't had time to read it yet: http://people.cs.kuleuven.be/~jon.sneyers/papers/toplas-complexity.pdf

### What is the context of the claim?

In order to examine if the claim is correct it is necessary to look at the context it is made in. The cited paper does not make that claim. The relevant part to check is in section 7:

However, for higher-level declarative languages, the property is far less triv- ial. The time and space complexity of a program depends more crucially on an optimizing compiler. Whether or not the general complexity result holds for some language depends largely on the properties of its compiler: A pathological compiler could conceivably detect the pattern of a RAM machine simulator pro- gram as a special case and produce special, hardwired output with the desired complexity properties.

The property referred to in this paragraph is “Every algorithm can be implemented in language X with the right time and space complexity.”

The paper does claim that Prolog and Haskell do not have this property. It does not claim that functional languages in general cannot have this property, and I would read the quote above as claiming the opposite - that even if the specifically tested languages do not have that property it is still possible that languages of that type may do.

It's really quite a beautiful paper, I didn't realise that CHR was that efficient. The complexity results are backed with a constant factor of 10, relative to a baseline in C (for a single algorithm).

### Not Proved

Indeed, It was my mistake to write "cannot be proved", when it should have been "has not been proved". The only proofs require adding imperative features, so for example ML is not a pure functional language, it is probably possible to prove optimal efficiency by using the imperative features.

So the claim was that CHR is the only declarative language for which the optimal efficiency result has been proved. However we might suspect that it cannot be proved for a pure functional language, a logic language etc.

I think for a logic language if you add linear types you may be able to recover optimal efficiency, as the semantics of CHR can be mapped into a fragmant of linear logic http://www.informatik.uni-ulm.de/pm/fileadmin/pm/home/fruehwirth/Papers/llchr-final0.pdf

### Space complexity

I misread your question as meaning functional languages rather than pure functional languages; adjectives that cover multiple clauses seem a little ambiguous to me. There is a interesting discussion about the complexity of implementing a growable array in Haskell. The time complexity already matches, it is only the space complexity that differs: O(n+s) rather than O(n).

Abandoning purity gives a solution with a DiffArray, where the space complexity should be the same if the old contents are not accessed. Maintaining purity places the responsibility of checking the old value is not used on the compiler, so in-place updates can be used as an optimisation. Would this not be the same issue - adding linear types to the language to model the use of array values?

Should we all be trying to develop languages based on CHR, rather than functional?

No, we should be using languages that let you target the imperative features of the underlying hardware. IMO.

### This is my opinion too

For clarity this is my opinion too, but I am always interested in the alternatives. I was considering this as a meta-language, replacing my current use of Prolog.

Do you mean scary things like C? Or do you mean something magical like Atom in Haskell? I'd like to understand how to we do a GREAT job of targeting the imperative features.

### More control over compilation

We should program by first writing some declarative definition of what we want to compute and then by helping compiler find a good hardware implementation for that specification. The problem with C is that the only specification available is written in a low level language so the compiler can't know whether it's right or wrong and help you. The problem with higher level only languages is that those low level decisions you were writing in C were important and the compiler generally isn't going to be smart enough to figure them out. So, we need both: the high level specification and the compilation hints.

### research on modeling & predicting?

I imagine some sci-fi tool that has profiles of hardware and then lets you somehow simulate your code against it, and then across some distribution of inputs, and then out spits some optimization profile, and that is used for code gen, or to prime the jit.

So I can write my video game in a HLL, and it will still play just fine on my absolutely horrible \$50 android device, as well as anything else.

### Real hardware

I believe that the "proof" that pure functional programs can have worse complexities than side-effecting programs relies on the trick that formal models of imperative machine assume O(1) random access update, while functional programs need at least O(N).

Isn't it the case however that this O(1) assumption is unrealistic in practice for large memories, and that real hardware are actually closer to the O(log N) model, or worse, due to cache hierarchies?

(Another angle of attack of the result, which we are all well aware of, would be to use linear state tracked through types, and claim that this is purely functional programming. Imperative programming has long been assimilated.)

### The O(1) assumption is

The O(1) assumption is unrealistic in practice but functional languages make the same assumption. To traverse data structures in a functional language you need to dereference a pointer. If you cannot access an array in O(1) you certainly cannot dereference an arbitrary pointer in O(1). Functional is still that logarithmic factor slower no matter how you slice it: if you can access memory in O(f(n)) then a functional random access is O(f(n)log(n)). The O(log(n)) vs O(1) is "in practice in theory". If you look at "in practice in practice" you see 10x-100x slowdown using functional data structures. Linear types are excellent since they can potentially eliminate that slowdown.

### Hm

But the pointers you follow when you descend a big tree are not arbitrary. I would be ready to believe it but it is not clear, to me, that there doesn't exist a locality-management strategy that can absorb the tree-access-and-spine-update cost in the general lookup cost. We have seen provably cache-efficient GC algorithms in the work of Blelloch and Harper for example (but I don't remember an optimality or non-optimality result).

### Not hard to calculate

Let's assume a favourable model of computation where we can choose a series of caches of our desired size, and a cache of size S can be accessed in time O(f(S)).

Suppose that we have a binary tree of k levels deep to simulate a random access list of size n=2^k. So we pick k caches, one cache for each level of the tree. Since the i-th level of the tree is of size 2^i, we can access a pointer into that cache in time O(f(2^i)). The total complexity of traversing down the tree is then O(sum f(2^i) for i in 1..k). This is "obviously" the best we can do.

A plain array would take time O(f(2^k)), i.e. the same time as it takes us to access the last level pointer. So we've got an extra O(sum f(2^i) for i in 1..k-1) cost for the other levels. Does that affect the asymptotic complexity? Well it depends on f of course. Intuitively if f grows fast enough then the last access dominates all the preceding ones. If f is constant then we know that a plain array is O(1) but the binary tree takes O(log n) time. If f is logarithmic then we have O(log n) vs O(sum log(2^i) for i in 1..k) = O(log(n)^2). Imperative is a factor log n faster even if memory access is O(log S) in the size of the memory. If f(s) = s^(1/3), because of the speed of light and all then both functional and imperative are the same asymptotic speed at O(n^1/3).

Declare victory? ;-)

### Cache and I/O Efficient Functional Algorithms

I guess gasche refers to this paper:
https://www.cs.cmu.edu/~rwh/papers/iolambda-cacm/cacm.pdf
(More complete version for POPL: https://www.cs.cmu.edu/~blelloch/papers/BH13.pdf)
which gives different results than you, I think — but for specific algorithms, not in general.

I've skimmed the paper, so here's what I got: That paper uses an existing model, where you have one fixed-size cache of size M which can be accessed in zero time, and a bigger and slower memory, from which you can load blocks of size B (cachelines) in unit time. Finally, they assume a generational GC where the nursery fits within the cache. This model is both simpler and closer to actual hardware than the one you consider (though, well, it is also a special case).

This model is known to work well in practice (they say), but I think you can exploit it by performing an unbounded amount of in-cache operations. For instance, you can functionally update a value an unbounded number of times, as long as you throw away most copies to keep fitting into the cache. (I'm not sure that they ignore the GC costs, but I think they should to be coherent, if GC happens within the cache).
To be most unfair, functional programs can be time-optimal if you model time complexity by ignoring an important source of inefficiency of functional programs.
So, without a benchmark for sanity-checking, I'm not yet willing to trust an optimality result that only holds in this model.

I might have missed an answer to this objection, but my skimming didn't find it.

### Optimal efficiency for all algorithms.

The result for CHR was optimal efficiency for all algorithms, not just some carefully chosen subset of all algorithms.

CHR is declarative, so has all the properties that brings, yet can optimally express all algorithms. Doesn't that make it a great language for writing down algorithms, as they will both optimally efficient and purely declarative. So far no other language has proved it can do this.

### I know that gasche was

I know that gasche was referring to this. My analysis assumes an even more favourable model than they do: a series of caches up to the size of the data rather than a single small cache. The block size is irrelevant for functional arrays.

In their model being able to perform in-cache operations for free isn't a bad assumption when you're analysing asymptotic complexity. You can't really use it to cheat in the asymptotic complexity since the cache is assumed to be small relative to the total size of the data.

I do think their model is a bit unrealistic. The block size B (cache line size) on x64 is just 8 pointer sized values. Given the inefficiencies of data representations you're be lucky if you are able to fit 3 values in that. Not exactly a situation in which O(B) is appropriate :) For actual efficiency it would be better to make languages in which you can control the data layout of data structures at the level of cache lines, and not rely on pointer linked data structures that are laid out in allocation/access order. Besides being more efficient that would also be conceptually cleaner.

Take the merge sort example on lists of integers. A linked list node will be represented as a header word (8 bytes), an int (4 bytes, but expanded into 8), and a pointer to the next node (8 bytes). That is 24 bytes of which only 4 are actually useful. Only 3 of those would fit into a cache line with optimal layout. Compare with a flat array of ints: 16 would fit into a cache line. If you are unlucky then you are using some data that is not represented inline into the list such as pairs of integers. Then your list node is header + elem pointer + next pointer and the elem pointer would point to a structure that has a header + two ints.

Add to this that with arrays splitting is free. Traversing linked lists and building up new ones is expensive.

### Just for fun I ran a

Just for fun I ran a functional merge sort against sorting an array. Sorting an array is 229x faster.

### Cheating: you didn't close the loophole, IMHO

The key difference is that cache access is for free in their model, which is a very specific choice for your f. This is one way in which their model is more favorable than yours :-).

On not cheating: cache size is bounded, but the cache only stores live data, and as far as I can tell in-cache GC is for free. I hope to be wrong there, but since you can run a semispace GC within the cache (and you probably should), either GC is for free or some cache operations do have a cost. So, if GC is for free, you can generate unbounded amounts of garbage with zero cost.

Concretely, assuming you can create a fresh copy of a with copyOf, you can write recursiveCopyOf a = recursiveCopyOf (copyOf a) and call it.
This will generate tons of garbage, but it'll be dead on arrival, so it won't hit main memory neither in their model nor with a well-tuned GC. Indeed, a semispace collector for the nursery will just copy the latest copy of a, or the two latest one if interrupting during copyOf. Hence, GC has a small cost but non-nil cost in practice, but I don't see where their model gives a non-zero cost to GC while keeping cache operations free.

More concretely, "traversing linked lists and building up new ones" is expensive in practice, but not in their model under the right assumption (and this seems less contentious). This seems is an example of the same kind of questionable assumption, and they explain specifically that this is important for them (citing matrix multiply).

In a different context, similar assumptions are used to analyze database algorithms, and I'm not sure they work so well there either: under such assumptions, compiling queries rather than interpreting them would be no faster, and we now have experiments against that.

I see one use of their model: Once you've decided to write purely functional code, their model will predict which one has better locality, and this might be actually useful to decide which one is faster. (Unless one functional algorithm uses many more temporaries than the other).

On other points:
- I agree that O(B) is odd, that makes more sense when you use the same models for disks (or over different CPU generations, but even there...).
- Re memory efficiency, I agree that unboxed data is important, though that's an orthogonal concern, that matters also on the JVM. On this topic, I like "The causes of bloat, the limits of health" (http://doi.acm.org/10.1145/1297027.1297046).

### Nice

I didn't reply initially because/but I found your point clear and convincing. Thanks!

However there is still something that I find disturbing in here (and it is not in contradiction to what you say): you are computing the time it takes to access an array vs. a binary tree under several memory layout assumptions. But I think the initial intuition I had when discussing "real models" was rather the fact that large arrays *are* in fact addressed as trees (the "time to access" discussed here under the "f is logarithmic" assumption), and my idea was that pure algorithms could use similar tree-structure directly (instead of layering a binary-tree data structure on top of a memory-as-tree abstraction, which as you point out still pays a logarithmic factor).

However, the problem with discussing this intuition is that the setting is very ill-defined. What exactly is acceptable as a "purely functional tree", and what would be cheating? For example, no matter what the underlying memory hierarchy is, we can use a "functional array" data-structure that uses an array but keeps a backlog of anti-updates. This is a persistent data structure (observationally pure) and with a optimal time complexity (you pay memory for the log) under linear-usage scenarios -- the DiffArray discussed by Andrew Moss above. Does that contradict the "proof" that functional code is slower by a logarithmic factor? What about, say, using the ST monad (at a pure type), or even using a monad and providing an algorithm at a monadic/effectful type?

I would need to go back to the paper demonstrating this O(log n) additional factor (does anyone here has the reference at hand?), but my guess would be that they study a rather restricted setting that does not mean much in term of actual functional code in practice. On the other hand, the point discussed in this thread is that there would be a paradigm that would allow efficient expression of all algorithms yet remain in a declarative subset of the language (which cannot be said of a functional program using the ST monad or other techniques to actually write imperative code).

I haven't had the time to look at the CHR article in depth yet (thanks for the partial reviews of it in the thread -- "It's really quite a beautiful paper" is enticing!), but I am a bit sceptical that it can be interpreted as supporting efficient declarative programming (a subjective and ill-defined word, the root of the problem here), given that the proof technique used seems to be to implement a RAM machine interpreter with no algorithmic overhead. I wouldn't claim that an imperative RAM program paired with a (possibly elegant and declarative) RAM interpreter is a declarative algorithm.

### Memory reuse in CHR implementations

Readers may be interested in the article Memory reuse for CHR, referenced in the above paper and an important part of the memory-efficiency claim. Basically, CHR are implemented by compilation to Prolog systems, but delegating constraint collection to the host Garbage Collection is costly (you cannot guarantee that the imperative-state-update use-case, where a constraint is discarded and a similar-looking (updated) constrant is added immediately after, will have a constant cost on the host GC). The paper solves this problem by implementing a small analysis to reuse the memory previously-discarded constraints, effectively turning the constraint-based representation of in-place update into an in-place update. (This is not an earth-shattering optimisation and is similar to many other allocation-avoiding technique in compiler or interpreters of garbage-collected language, but it solves this problem in the CHR setting.)

### Indeed, what is and what

Indeed, what is and what isn't cheating isn't that interesting. What matters is whether you can express algorithms at a high level and still be efficient. For my little test above I had to write a functional merge sort with tail recursive split and merge (otherwise it blows the stack) and honestly the code turned out more difficult to write and read than imperative merge sort.

Some array algorithms can also be read quite declaratively. Matrix multiply for instance:

for i in [0..p]:
for j in [0..q]:
for k in [0..r]:
C[i,j] += A[i,r]*B[r,j]


This looks imperative but notice that the order in which we run the iterations does not matter. Instead of iterating from 0..q we can also iterate from q..0. In fact:

for (i,j,k) in [0..p]x[0..q]x[0..r]:
C[i,j] += A[i,r]*B[r,j]


We can run this loop over any permutation of [0..p]x[0..q]x[0..r] and it will give the right answer. You can read it declaratively as specifying a set of iterations to run and then running them in any order. Polyhedral optimization in compilers recognizes this and chooses a permutation that's cache efficient. It can also analyse dependencies to determine which set of permutations is valid. For matrix multiply it's all permutations, but if subsequent iterations read from the array that is being written then only some permutations are valid. Computing Levenshtein distance between strings S and T into a matrix D[i,j]:

for (i,j) in [0..n]x[0..m]:
cost = (S[i] == T[j]) ? 0 : 1
D[i,j] = min(D[i-1,j-1]+cost, D[i-1,j]+1, D[i,j-1]+1)


Now not all permutations are valid because iteration (i,j) uses results from iterations (i-1,j-1),(i,j-1),(i-1,j), but there's still a lot of ways to permute the iterations (e.g. iterating rowwise or columnwise or diagonally).

It might be nice to have a construct for specifying such computations. You'd specify the iteration space and what each iteration does so it's declarative. It's the job of the compiler or run-time to pick a valid permutation (with a fallback to a worklist iteration if the compiler can't figure out a schedule statically). Maybe that's related to CHR? I should really read that paper too :)

### Isn't O(log N) for NUMA or clusters?

"In practice in practice", cache sizes don't scale with the input size, so in general your constant factors change when the input is bigger than some threshold, and you costs stop growing when you hit main memory.

Jules points out that O(log N) is "in practice in theory". But I think this result only matters "in real practice" for cluster-like scenarios, where your dataset is so huge that you need a cluster (or NUMA machine) to hold it, with the machine size (or number of used nodes) growing with the input size, so that input twice as big needs memory farther away. To be fair, this probably also matters on GPU and other highly parallel scenarios; while this is not the usual scenario, it is more common.

I might be wrong, but I think in those contexts imperative programmers sometimes do write to fresh local memory to avoid cache contention among processors (I forget the details), which goes a bit vaguely in the direction suggested by gasche. I'm failing to remember enough to talk intelligently about it, though :-(. Anybody recalls what I'm thinking about :-( ?

### Serial?

Doesn't this argument only apply to sequential programs? I suspect that with high levels of concurrency the whole hierarchical caching arrangement of todays processors is totally wrong anyhow. High speed caches close to the CPU are only fast if they don't have to be synchronised with M other such caches. Look at the other extreme, the GPU, with thousands of ALUs, no caching, high speed context switching, and thread oversaturation. The performance in that context comes precisely because there is no caching to slow down context switches.

### GPUs and Parallel vs Serial

GPUs have on chip memory buffers that you have control of. Main memory fetches are still slow, but you rely on a wide bus and sequential fetches for speed.

This model was tried on a CPU (the PS/2) and it proved very difficult to program (more recent PlayStation's have gone back to a cached architecture).

I think it is important to consider difficulty of Programming. A buffered highly parallel architecture may in theory be faster, but its use will be limited to where needed (like the GPU on a modern PC). It makes sense to write serial code on a cached architecture for the rest.

What we will probably see more use of CUDA/OpenCL for algorithms that are highly parallelisable, but overall program flow control on a sequential CPU. After all many programs do not use the full speed of modern CPUs and are written in languages without fully optimised interpreters (like Python).

### Complexity is not efficiency.

The result for CHR show that the complexity is optimal. Efficiency is something else entirely, and the constant factors are important. The same results show that the ratio between execution in CHR and execution in C is 10:1 - how can the CHR implementation be optimal in efficiency?

For writing down algorithms people already use one of the two machines that CHR is translated into. For specification it is not entirely clear what is the advantage of using CHR instead of a direct proof using either of the two existing machine models.

### Declarative

Isn't the fact that CHR is declarative an advantage?

### Depends

Depends on who you ask. I prefer declarative languages, but I know many who do not. The standard argument in their favour is robustness: less code to verify, more precisely defined properties when you do. But if there is any general consensus on ease of use then it would seem to go the other way. People (not all, but probably most) find declarative languages harder to write code in. These two variables are not entirely independent; more robustness can be achieved with more effort. The shortest answer is that I don't think your question has a yes/no answer.