Cost semantics for functional languages

There is an ongoing discussion on LtU (there, and there) on whether RAM and other machine models are inherently a better basis for reasoning about (time and) memory usage than lambda-calculus and functional languages. Guy Blelloch and his colleagues have been doing very important work on this question that seems to have escaped LtU's notice so far.

A portion of the functional programming community has long been of the opinion that we do not need to refer to machines of the Turing tradition to reason about execution of functional programs. Dynamic semantics (which are often perceived as more abstract and elegant) are adequate, self-contained descriptions of computational behavior, which we can elevate to the status of (functional) machine model -- just like "abstract machines" can be seen as just machines.

This opinion has been made scientifically precise by various strands of work, including for example implicit (computational) complexity, resource analysis and cost semantics for functional languages. Guy Blelloch developed a family of cost semantics, which correspond to annotations of the operational semantics of functional languages with new information that captures more intensional aspects of the computation: not only the result, but also running time, memory usage, degree of parallelism and, more recently, interaction with a memory cache. Cost semantics are a self-contained way to think about the efficiency of functional programs; they can of course be put in correspondence with existing machine models, and Blelloch and his colleagues have proved a wealth of two-way correspondences, with the occasional extra logarithmic overhead -- or, from another point of view, provided provably cost-effective implementations of functional languages in imperative languages and conversely.
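
To make the idea concrete, here is a minimal sketch (my own illustration in OCaml, not taken from Blelloch's papers) of what such an annotated semantics can look like: a profiling evaluator for a toy expression language that returns, besides the result, a cost record in which work counts every evaluation step and depth counts only the critical path, assuming the two sides of a hypothetical Par construct may run in parallel.

(* A toy "profiling semantics": evaluation returns a value together with
   its cost. "work" counts every step; "depth" counts the critical path,
   assuming both branches of Par could be evaluated in parallel. *)

type exp =
  | Const of int
  | Add of exp * exp   (* sequential addition *)
  | Par of exp * exp   (* hypothetical parallel pair, summed *)

type cost = { work : int; depth : int }

let rec eval : exp -> int * cost = function
  | Const n -> (n, { work = 1; depth = 1 })
  | Add (e1, e2) ->
      let (v1, c1) = eval e1 in
      let (v2, c2) = eval e2 in
      (* sequential composition: work and depth both add up *)
      (v1 + v2, { work = c1.work + c2.work + 1;
                  depth = c1.depth + c2.depth + 1 })
  | Par (e1, e2) ->
      let (v1, c1) = eval e1 in
      let (v2, c2) = eval e2 in
      (* parallel composition: work adds, depth takes the maximum *)
      (v1 + v2, { work = c1.work + c2.work + 1;
                  depth = max c1.depth c2.depth + 1 })

let () =
  let (v, c) = eval (Par (Add (Const 1, Const 2), Add (Const 3, Const 4))) in
  Printf.printf "result = %d, work = %d, depth = %d\n" v c.work c.depth

Blelloch's semantics do the same kind of bookkeeping for a full call-by-value lambda-calculus (and, in the later papers, for space and cache costs): the costs are defined alongside the values, with no machine in sight.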

This topic has been discussed by Robert Harper in two blog posts, Language and Machines which develops the general argument, and a second post on recent joint work by Guy and him on integrating cache-efficiency into the model. Harper also presents various cost semantics (called "cost dynamics") in his book "Practical Foundations for Programming Languages".

In chronological order, three papers that are representative of the evolution of this work are the following.

Parallelism In Sequential Functional Languages
Guy E. Blelloch and John Greiner, 1995.
This paper is focused on parallelism, but is also one of the earliest works to carefully relate a lambda-calculus cost semantics to several machine models.

This paper formally studies the question of how much parallelism is available in call-by-value functional languages with no parallel extensions (i.e., the functional subsets of ML or Scheme). In particular we are interested in placing bounds on how much parallelism is available for various problems. To do this we introduce a complexity model, the PAL, based on the call-by-value lambda-calculus. The model is defined in terms of a profiling semantics and measures complexity in terms of the total work and the parallel depth of a computation. We describe a simulation of the A-PAL (the PAL extended with arithmetic operations) on various parallel machine models, including the butterfly, hypercube, and PRAM models and prove simulation bounds. In particular the simulations are work-efficient (the processor-time product on the machines is within a constant factor of the work on the A-PAL), and for P processors the slowdown (time on the machines divided by depth on the A-PAL) is proportional to at most O(log P). We also prove bounds for simulating the PRAM on the A-PAL.
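
Read as a rough slogan (my paraphrase of the bounds above, not a statement taken from the paper): a program to which the cost semantics assigns work \(W\) and depth \(D\) can be run on \(P\) processors in time on the order of

\[ T_P = O\!\left(\frac{W}{P} + D \log P\right), \]

so the processor-time product stays within a constant factor of \(W\), and the only slowdown relative to the ideal depth is the \(O(\log P)\) factor mentioned in the abstract.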

Space Profiling for Functional Programs
Daniel Spoonhower, Guy E. Blelloch, Robert Harper, and Phillip B. Gibbons, 2011 (conference version 2008)

This paper clearly defines a notion of ideal memory usage (the set of store locations that are referenced by a value or an ongoing computation) that is highly reminiscent of garbage collection specifications, but without making any reference to an actual garbage collection implementation.

We present a semantic space profiler for parallel functional programs. Building on previous work in sequential profiling, our tools help programmers to relate runtime resource use back to program source code. Unlike many profiling tools, our profiler is based on a cost semantics. This provides a means to reason about performance without requiring a detailed understanding of the compiler or runtime system. It also provides a specification for language implementers. This is critical in that it enables us to separate cleanly the performance of the application from that of the language implementation. Some aspects of the implementation can have significant effects on performance. Our cost semantics enables programmers to understand the impact of different scheduling policies while hiding many of the details of their implementations. We show applications where the choice of scheduling policy has asymptotic effects on space use. We explain these use patterns through a demonstration of our tools. We also validate our methodology by observing similar performance in our implementation of a parallel extension of Standard ML.

Cache and I/O efficient functional algorithms
Guy E. Blelloch, Robert Harper, 2013 (see also the shorter CACM version)

The cost semantics in this last work incorporates more notions from garbage collection, to reason about cache-efficient allocation of values -- in that it relies on work on formalizing garbage collection that has been mentioned on LtU before.

The widely studied I/O and ideal-cache models were developed to account for the large difference in costs to access memory at different levels of the memory hierarchy. Both models are based on a two level memory hierarchy with a fixed size primary memory (cache) of size \(M\), an unbounded secondary memory, and assume unit cost for transferring blocks of size \(B\) between the two. Many algorithms have been analyzed in these models and indeed these models predict the relative performance of algorithms much more accurately than the standard RAM model. The models, however, require specifying algorithms at a very low level requiring the user to carefully lay out their data in arrays in memory and manage their own memory allocation.

In this paper we present a cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language. We show how many algorithms written in standard forms using just lists and trees (no arrays) and requiring no explicit memory layout or memory management are efficient in the model. We then describe an implementation of the language and show provable bounds for mapping the cost in our model to the cost in the ideal-cache model. These bounds imply that purely functional programs based on lists and trees with no special attention to any details of memory layout can be asymptotically as efficient as the carefully designed imperative I/O efficient algorithms. For example we describe an \(O(\frac{n}{B} \log_{M/B} \frac{n}{B})\) cost sorting algorithm, which is optimal in the ideal cache and I/O models.
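
For intuition about where such a bound usually comes from (a standard derivation for multi-way mergesort, not necessarily the algorithm of the paper): sorting by recursively splitting into \(M/B\) pieces and merging them with \(O(n/B)\) cache cost satisfies

\[ Q(n) = \frac{M}{B}\, Q\!\left(\frac{n}{M/B}\right) + O\!\left(\frac{n}{B}\right), \qquad Q(n) = O(n/B) \text{ for } n \le M, \]

which solves to \(Q(n) = O(\frac{n}{B} \log_{M/B} \frac{n}{B})\) block transfers, matching the sorting lower bound in the I/O model. The point of the paper is that ordinary functional code over lists and trees, analyzed through the cost semantics, reaches this bound without arrays or explicit blocking.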

"Describe the destination without observing it"

Interesting post! One nitpick:

I think it is folklore knowledge, or rather folklore opinion, in the functional programming community that we do not need to refer to machines of the Turing tradition to reason about execution of functional programs.

That may be an overstatement of what informed people believe. It wouldn't be necessary to refer to those Turing-tradition machine models if we didn't ever need to translate functional programs for such machines. But given that much of the time, such a translation is needed, it's hardly surprising that we do in fact need to consider the model of the target machine.

The two models are equivalent in certain senses, but in other ways have different characteristics, and attempting to determine the characteristics of a translation from one model to another by only considering the source program and its natural model seems a bit of a doomed enterprise. (It would be amazing if it were possible, but it seems unlikely!)

"Program" in an abstract sense

Of course this is a controversial subject (I was unsure whether it was fitting to put an opinion at the head of the story in this way, but I thought it could be helpful for understanding the choice of articles), so I don't expect everyone to agree.

I just reformulated the part you quote slightly; I didn't intend to imply that *everyone* sees things this way, rather that it is a widespread (and old) position.

I wish I could think so

I've run into surprisingly insightful people who nonetheless insist that any consideration of performance is a breach of abstraction. They casually say that if you care about the performance, you must be working with the little details, and hence not working at a high level of abstraction.

I agree with Anton's perspective here, and I hope it is indeed the dominant one. Unless you find PLs just interesting mathematical artifacts (how many such people are there?), then you have to have a way to predict whether a program is going to fit within its resources *before* you start on a detailed implementation. You need to have a high-level argument, not depending on the details of the problem, that your general strategy is going to work.

One way to make that argument is for the language to always run with reasonably small resource usage, such as with XML DTDs and with non-contrived regular expressions. However, such languages are relatively weak. For general-purpose languages, you either have to have a performance model, or else you better be ready to update the PL's implementation as you run into snags.

I've run into surprisingly

I've run into surprisingly insightful people who...

Perhaps my abstract model of insightful people needs to be translated to a model based on real-world humans!

(Following section added as an edit:)

...who nonetheless insist that any consideration of performance is a breach of abstraction.

That may be true, depending on which abstraction is being discussed. If you're dealing with some computational model like the lambda calculus, then how it "performs" when translated to some other model may indeed be irrelevant.

But presumably, no-one would claim that they can predict performance and resource usage on some target model without actually considering the features of the translation to that model, which was all I was really saying.

Unless you find PLs just interesting mathematical artifacts (how many such people are there?)

Someone studying computation in a pure sense may see things that way, which is fine. All I was really saying is that if you're interested in resources and performance on real world machines, then it's to be expected that you have to refer to a model of those machines in order to get a complete picture.

The flip side of the same argument is that there's nothing special about current real world machine models, other than that they happen to represent how we do physical computing today. If we had quantum computers, or if we still used analog computers, we'd need different models for those.

Real world machines != machine models

The machine models traditionally used in complexity theory (various RAM derivatives) are also quite distant from "real machines". I could almost pretend that we don't know how today's processors are made (the details of the micro-instructions and their semantics are not public, I believe) and that they may very well be physical realizations of a lambda-calculus inside.

That's only a thought experiment; of course we understand that a "big integer-indexed array of memory" is somewhat closer to today's shared memories (but still quite far). But you'll recognize that we still have to validate the *other* idealized models by empirically testing predictions of the model against actual hardware. We can validate cost models for lambda-calculi in just the same way.

If my model predicts that I will use a heap graph of size at most 3000 to compute the value of this program, and I find out I'm using 12 KiB of RAM on the real machine, and this ratio is empirically the same in other computations, then it's reasonable to assume that each node of the heap graph corresponds to 4 bytes of memory, and roll with that whenever I need to reason about actually running on actual machines. And I never needed to formally check a correspondence with other machine models to do this estimation -- although the fact that it was possible could be seen as a witness that this correspondence *could* be done.

Seems a bit vague.

Looking at memory used via heap profiling, and dividing by a number of nodes (presumably read from the source code), seems a somewhat inferior way of reasoning compared to having a precise 1:1 memory layout correspondence (for example):

#include <stdint.h>

typedef struct {
    uint32_t a;
    uint32_t b;
} node;

int main(void) {
    node x[3000];
    (void)x;  /* x exists only to illustrate the layout */
    return 0;
}

Where we know 'x' will take precisely 8*3000 bytes (modulo any target-dependent alignment rules). Not that I am advocating use of 'C' here, just that I think having direct memory layout correspondence is important.

You could certainly do this in a functional language (have layout rules for records, algebraic datatypes and arrays, lists etc) if you didn't have garbage collection - which is something I am interested in.
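
For what it's worth, even with garbage collection you can often make this kind of back-of-the-envelope prediction and then check it, provided you are willing to assume the runtime's standard representation. Here is a hedged OCaml sketch (my example, not something from the thread; it assumes the usual boxed-record layout and Obj.reachable_words, available since OCaml 4.04):

(* Assuming the standard OCaml runtime: a record of two immediate ints is
   one header word plus two fields, i.e. 3 words per node, and an array of
   3000 of them adds a header word plus 3000 pointers. On a 64-bit machine
   that predicts (1 + 3000 + 3000 * 3) * 8 = 96008 bytes. *)

type node = { a : int; b : int }

let () =
  let xs = Array.init 3000 (fun i -> { a = i; b = 2 * i }) in
  let predicted = 1 + 3000 + 3000 * 3 in
  let measured = Obj.reachable_words (Obj.repr xs) in
  Printf.printf "predicted = %d words, measured = %d words\n" predicted measured

This is cruder than the C example above (no 32-bit fields, and a copying collector may move things around), but it supports the same style of reasoning once the layout rules are documented.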

Seems a bit cherry-picked

Pointing at a carefully selected program in a carefully selected language doesn't make a convincing argument that it's easier to reason about memory (nor any other property) for a whole paradigm.

Consider:

  • There are many imperative languages that do not give you precise knowledge of memory details, such as JavaScript or Perl.
  • There are some functional languages that do allow you precise understanding of memory details, e.g. Lisps where you can know the exact structure of a cons cell and the details of symbol interning.
  • Even in languages like C/C++, where you can theoretically know the exact details, this often fails to apply for real world programs which may involve publish-subscribe patterns, arbitrary precision numbers, search, concurrency, sockets and database cursors, memory fragmentation, and so on.

Besides this, your proposed approach to precise reasoning is not very modular. You rely on an ability to peek inside every function, see the details, and remember them. It is not a strategy you can effectively apply when working with libraries that are maintained and updated by other people.

Reasoning with bounds and empirically with profiles is at least practical, and a good skill for a programmer, even if a bit vague.

proof by existence.

The fact that it is possible for one example, in one language, proves that it is possible. Other languages can adopt this approach. The interesting question is what algorithms fit this model.

As for libraries, resources can be encoded as phantom types, and language primitives for returning sizes (like sizeof in C) can be provided and extended by libraries.

GC

Garbage collection is not a fundamental problem if you care about predicting memory usage: it has some overhead, but many implementations let you compute a precise upper bound just as well. They may not be the best for throughput or even for memory use, just as "hard" real-time systems sometimes have longer pauses in practice than "soft" real-time ones, because you pay for the extra precision.

predictability of sweep timing.

As you don't know when the sweeps will happen, it's hard to predict the upper bound.

No

GCs have settings to let you decide the amount of memory overhead you're willing to tolerate to improve throughput (if you start using more than twice the known real memory needs at some point, do a collection before requesting more memory). I think they're heuristics rather than hard settings in most cases, but there certainly are provable-upper-bound implementations (one of which you can find on Blelloch's publication page, linked at the top of this story).
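
As a concrete instance of such a knob (OCaml's Gc module here, purely as an example of the kind of setting meant above, not one of the provable-upper-bound collectors):

(* space_overhead is the percentage of "wasted" memory, relative to live
   data, that the major collector tolerates before doing more work; a
   lower value means a tighter heap at the price of throughput. *)

let () =
  let ctrl = Gc.get () in
  Gc.set { ctrl with Gc.space_overhead = 50 };
  Printf.printf "space_overhead is now %d%%\n" (Gc.get ()).Gc.space_overhead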

Provable upper bounds

Provable upper bounds for GC sounds interesting. What about stopping sweeps during critical sections and forcing sweeps manually? Can you guarantee that a single sweep will free all unreachable memory?

Hm

A reliable way to not have sweeps during critical sections is to not allocate. Some people do that in production code in GC-ed languages. To be finer-grained than that you'd need GC-ed regions, which have been researched and implemented, but I'm not sure they have been used much in practice (I wouldn't be surprised to learn that some old proprietary Lisp/ML implementation did something like that, though).

Finally, guaranteeing that sweeps free all unreachable memory comes for free in non-generational systems (and in others you can usually force something like it by collecting all generations), but it's rarely what you want given the large latency bump it implies -- not unlike a cascade deallocation in a non-incremental reference-counting scheme, except that you traverse all live rather than all dead memory. Note that you don't need that guarantee to have a provable upper bound on memory; the GC just has to promise to sweep "at least enough" as soon as you try to allocate above the limit.

Incremental GC

Incremental GCs can be pretty good at giving relatively reliable bounds for pause times, within certain constraints. In practice, they are most likely much better than any home-brewed manual mechanism would manage to be. In particular, ownership-based schemes (like C++'s) or reference-counting schemes can cause rather surprising "sweeps" at times, for anything but trivial networks of data structures.

hah, nobody does that for C

you have to have a way to predict whether a program is going to fit within its resources *before* you start on a detailed implementation

ok, not precisely 'nobody'... but most of the time in a lot of industry that i've seen, we just write it and then try to cram it down / optimize / hack if need be to make it actually "fit". yes there are plenty of (good, hooray) exceptions to that rule, but i feel they only serve to prove it ;-)

I/O complexity

I'll read your references later, but for now I thought I'd just remark that there are various communities that care about memory. Running out of memory is just one question (not one that concerns me since I do my own mallocs), but given that memory bandwidth is far more important for performance than operation counts, some people have done theoretical investigation of how much memory you need to transfer for a given algorithm.

Theory:
Hong-Kung theorem

Paper that analyses the best way to do matrix-matrix multiplication:
Goto Blas algorithm

(If anyone is going to bring up parallelism: matrix-matrix multiplication is trivially parallel. Functional languages have nothing to say here. The only interesting analysis is for cache reuse.)

Lies, damned lies, and functional programming

These bounds imply that purely functional programs based on lists and trees with no special attention to any details of memory layout can be asymptotically as efficient as the carefully designed imperative I/O efficient algorithms.

Sure.

A quote from Princess Bride comes to mind.

"You keep using that word "optimal". I don't think it means what you think it means."

"For example we describe an O(nBlogM/BnB) cost sorting algorithm, which is optimal in the ideal cache and I/O models." Sorting algorithms are supposed to have O(N) space complexity, so N log N is not optimal in at least one obvious sense.

And you have to be rather divorced from reality to care about asymptotic cache behaviour. That's almost a contradiction in terms. Caches are there to get a constant factor improvement, so getting asymptotically the right cache behaviour is fairly meaningless.

If you can pay for the performance penalty you can use FP

I've seen the "hard to beat lists" argument pop up in discussions of FP in a large number of variously clothed arguments.

Ain't so. Tough message.

Nothing has changed in that respect over the last thirty-or-so years, though modern functional languages are somewhat faster. If you like FP and can get away with the performance penalty, then you can use it.

Sorting algorithms are

Sorting algorithms are supposed to have O(N) space complexity, so N log N is not optimal in at least one obvious sense

The 'cost' here is cache cost, i.e. the cost of a cache miss. It's a time cost, not a space cost.

you have to be rather divorced from reality to care about asymptotic cache behaviour. That's almost a contradiction in terms. Caches are there to get a constant factor improvement

In memory-bound algorithms, cache misses often dominate performance. We know this empirically. In big-O analyses, we typically count whichever operations we expect to dominate. If we refused to count cache misses based on a belief about 'why caches are there', would that not be 'rather divorced from reality'?
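
A quick, admittedly unscientific way to see this (my sketch, in OCaml): sum the same array once sequentially and once with a large stride. Both loops perform the same number of additions and touch every element exactly once, but the strided loop hits a fresh cache line (and often a fresh page) on nearly every access, and that is usually what the timing difference measures.

let n = 1 lsl 23                 (* 8M ints, about 64 MB on 64-bit *)
let a = Array.make n 1

let time name f =
  let t0 = Sys.time () in
  let s = f () in
  Printf.printf "%-10s %6.3f s (sum = %d)\n" name (Sys.time () -. t0) s

let () =
  time "sequential" (fun () ->
    let s = ref 0 in
    for i = 0 to n - 1 do s := !s + a.(i) done;
    !s);
  time "strided" (fun () ->
    let s = ref 0 in
    let stride = 4096 in         (* elements 32 KB apart *)
    for k = 0 to stride - 1 do
      let i = ref k in
      while !i < n do
        s := !s + a.(!i);
        i := !i + stride
      done
    done;
    !s)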

on form

It's very good that we have strong opinions, diverse preferences and aesthetic sense about various research topics. "Fairly meaningless" is however a rather dismissive way to describe someone's work, and I think you could show a bit more respect, even if it's an area you are personally not terribly interested in -- which is fine!

fairly unsurprising?

Ok, maybe I was expressing myself too strongly.

Reasoning about bits and bytes

I like what Robert Harper has to say about logic and type theory, but I disagree with his assertion that imperative programming is "shabby". I much prefer Stepanov's approach to programming. If I could somehow have Stepanov at the value level, and Harper at the type level I would be happy.

The idea that functional programs will not have to run on memory/pointer machines seems unrealistic. The direction of efficient computer hardware seems obvious, and attempts at hardware specialised for functional (or logic) languages have all failed (so far). I find the paper proving that functional programs can be converted to run on imperative architectures in constant time interesting, and I started to think this approach could have merit (and it certainly does for costing functional programs). I thought it was interesting how the language used arrays as the storage primitive instead of lists. It all looked good until I saw their quicksort implementation using '++' concatenation of the left and right sorted sub-arrays and the pivot. How can you discuss efficiency and then write an obviously inefficient quicksort like that without thinking something is wrong?

When I am referring to the ease/difficulty of reasoning about memory usage, I am talking about precise usage: the exact number of bits and bytes used. If I have an in-place sort of 1000 32-bit integers, I can calculate that the maximum memory needed is 4000 bytes, plus the stack frame (as long as we are not using recursion). We even know the precise amount of stack space needed (a pointer for the return address, plus space for each local variable). The work on functional cost seems like big-O notation for space usage. It's not the same thing at all -- although I think a worst-case space requirement is more useful than a worst-case time requirement.

big-oh

"The work on functional cost seems like big-O notation for space usage". Actually he describes it as optimal from an I/O point of view. In that sense he is indeed proving the right order, but he is an order logN over in space usage.

The way I make sense of this is that functional models have space complexity of the order of the number of results (intermediate and final) that are produced. Naively these all have to be written to memory, so you get an I/O bound that is the same. By keeping stuff in cache you can cut down on memory writes slightly, so that's why his bounds have that N/B factor: problem size divided by block size.

That is the theoretical bound as shown by Hong Kung (see my reference elsewhere) and he proves that he gets the same. That's a fairly trivial result, if you ask me.

Precise calculation

When I am referring to the ease/difficulty of reasoning about memory usage, I am talking about precise usage. The exact number of bits and bytes used. [..] The work on functional cost seems like big-O notation for space usage.

On "more precise than asymptotics" computation, you may be interested in the work on Resource-Aware ML, whose combinatorial tools are able to give rather precise multivariable polynomial costs (time or space) rather than just asymptotics. It has been recently extended to analyze imperative programs over arrays as well, see Type-Based Amortized Resource Analysis with Integers and Arrays, by Jan Hoffmann and Zhong Shao, 2014. The (time) bound on the in-place quicksort given is \(12.25 x^2 + 52.72 x + 3 \) (where \(x\) is the length of the region to sort), with the following quantitative discussion:

The experiments show that the bounds for dyadAllM and mmultAll are tight. The bounds for dijkstra and quicksort are only asymptotically tight. The relative looseness of the bound for quicksort (ca. 20%) is in some sense a result of the compositionality of the analysis: The worst-case behavior of quick sort’s partition function materializes if the number of swaps performed in the partitioning is maximal. However, the number of swaps that are performed by the partitioning in a worst-case run of quicksort is relatively low. Nevertheless, the analysis has to assume the worst-case for each call of the partition function.

That said, if you want to study constant-precise performance characteristics of imperative programs, there are many existing techniques that work rather well -- they tend to assume first-order programs and mostly intra-procedural reasoning, while the functional-inspired techniques traditionally handle higher-order programs, or are at least more structured for composability. I'm not trying to say that this is the only way to go.

Interesting.

Lots of interesting stuff there. I think this is over-complicating my argument, though. My point was much simpler. It's very easy to prove the memory usage exactly (not best/worst case or asymptotic, but exact) for a constant-space (update-in-place / destructive-update) imperative algorithm. The reasoning about space usage in this case is trivial. In this case it would seem clear that reasoning about (constant-space) imperative algorithms is much simpler than reasoning about space for functional algorithms, especially if you want to know the exact resources used by the real machine executing the code.

Still, I am beginning to think the imperative/functional divide is a false one anyway. If you provide low-level primitives (say, mutable arrays), which you can have in both functional languages (like ML/Haskell) and imperative ones, and functional languages include notation for sequencing (monads etc.), and imperative ones can define pure functions, then really there is a convergence onto a common feature set. What remains is probably just syntactic sugar.

Strict vs non-strict

Do I understand correctly that Blelloch's work is a strong argument for using strict languages over non-strict ones? I've seen some nice work on cache-oblivious data structures in Haskell, but it's not clear if there's any unified theory behind it, similar to Blelloch's theory of "work and depth".

Do I understand correctly

Do I understand correctly that Blelloch's work is a strong argument for using strict languages over non-strict ones?

No, I don't think so -- but that's certainly a suggestion that Robert Harper made in his second blog post.

My understanding is that Blelloch's work shows that it is possible (and, I would say, surprisingly simple) to reason about performance (time and space) of strict semantics -- and various models of parallel reduction. My understanding is therefore that we understand well how to reason about performance of strict languages, and less well how to reason about performance of non-strict languages. That is an argument to use the former over the latter, but not a strong argument, as long as we don't know for sure whether (1) it is fundamentally harder to reason about performance of non-strict languages or (2) the research on this question is less advanced, but will be equally convenient when it has succeeded.

I think a lot of people's gut feeling is that it is fundamentally harder to reason about non-strict languages -- it is mine, although I'm in fact surprisingly familiar with the work on the duality between call-by-value and call-by-name in sequent-calculus-inspired term systems. I do not think Blelloch's work lets us turn a "gut feeling" into objective knowledge.

PS: Guy Blelloch also worked on cost semantics for "lenient evaluation" (e.g. futures), which is a form of non-strict evaluation. There is in fact a nice combinatorics of evaluation strategies for function application if you allow parallelism, all of which give interesting cost semantics. There is also a fair share of work on amortized resource analysis (see e.g. the Resource-Aware ML line of work). I'm surely forgetting a lot of other things, but the work of Stephen Chang on profiling lazy evaluation is also relevant.

Call-by-need

.. familiar with the work on the duality between call-by-value and call-by-name in sequent-calculus-inspired term systems. I do not think Blelloch's work lets us turn a "gut feeling" into objective knowledge.

Most non-strict languages in the wild are call-by-need. If we restrict ourselves to reasoning about equality for the three different evaluation strategies, we know that the existing theories (Dave Sands's improvement theory, for example) are more complicated. Maybe equality and space/time are completely different beasts, but it wouldn't surprise me if resource usage for call-by-need were hard as well.

That's very nice. In

That's very nice. In practice it's unfortunately unrealistic to rely solely on such an abstract model. While this does address the asymptotic cache behavior very nicely, in practice memory layout does matter a great deal because memory is transferred in cache lines and not in bytes, and because of the memory prefetcher which only works if your memory is laid out in a linear pattern. For performance, we're stuck with reasoning about the compiled code :(

Blelloch does discuss the

Blelloch does discuss the whole layout issue and relationship to compilation, especially in the CACM version of the last paper. The basic idea is that you can reason about the abstract cost semantics so long as you can make a few reasonable assumptions about the implementation.

Because functional languages abstract away from data layout decisions, we use temporal locality (proximity of allocation time) as a proxy for spatial locality (proximity of allocation in memory). To do this, the memory model for ICFP consists of two fast memories: a nursery (for new allocations) and a read cache. The nursery is time ordered to ensure that items are laid out contiguously when written to main (slow) memory. For merging, for example, the output list elements will be placed one after the other in main memory. To formalize this idea we introduce the notion of a data structure being compact. Roughly speaking a data structure of size n is compact if it can be traversed in the model in O(n/B) cache cost. The output of merge is compact by construction. Furthermore, if both inputs lists to merge are compact, we show that merge has cache cost O(n/B). This notion of compactness generalizes to other data structures, such as trees.

Obviously, we could construct a perverse runtime that maximally distributes allocations, such that a hundred node linked list is scattered across a hundred pages. But I think in practice we can make many reasonable assumptions, and our reasoning about the compiled code can be relatively shallow.

memory layout aware data structures?

are there libraries that give us such things, so we can then fit our program into those, to make sure we're getting the best perf? sorta like parallel/concurrent skeletons. sure then there's a higher level of how the app accesses them, it could do things pathologically poorly. so are there frameworks that push certain approaches, all to make things run nicely?

p.s.

Ah, yes, there's some of course, according to the search engines. Cache sensitive, cache oblivious algorithms and, say, trees. But i haven't seen e.g. a cache oblivious library/toolkit/framework as a shipping product, more just research here and there. Wish i could e.g. get a more-cache-oblivious version of jdk or something.