Functional anti-memoization

In Haskell, data structure elements are memoized. That is, the function that creates, say, an element of a list is called at most once. After the first evaluation, the resulting value is saved away in case it is ever needed again. This is great for things like the infinite list of Fibonacci numbers...

fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
...but it can lead to excessive memory use in situations like...
main = print (length list, sum list)
list = [1..10000000]
...where at some point the entire list will need to be in memory. You could have avoided the large memory usage by traversing the list in a linear way, building up the length and sum at the same time, rather than in sequence.
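The single-pass alternative can be sketched with one strict left fold that accumulates the length and the sum together, so each cons cell becomes garbage as soon as it is consumed (names here are my own, for illustration):

```haskell
import Data.List (foldl')

-- Compute length and sum in one traversal; the strict pair accumulator
-- keeps both totals evaluated, so no thunks or list prefix are retained.
lengthAndSum :: [Integer] -> (Integer, Integer)
lengthAndSum = foldl' step (0, 0)
  where step (l, s) x = ((,) $! l + 1) $! s + x

main :: IO ()
main = print (lengthAndSum [1 .. 10000000])
```

This runs in constant space, whereas `(length list, sum list)` on a shared list retains the whole spine between the two traversals.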

Now imagine an evaluation régime where, instead of memoizing each data element (for later use), we call all of the functions which hold a pointer to this element. Afterwards, no more pointers to this element will exist, so we can reclaim its memory. It changes the rule "evaluate a data item at most once" into "evaluate a data item at most once, and be sure to call anything that depends on it immediately, because it is short-lived". Is there a name for this evaluation strategy, and any programming languages which implement it?


That's more like an optimization

That's more like an optimization scheme than an evaluation order. You (or the compiler) would need to prove that each of the expressions which consume the data structure does not depend on a result from another consumer of that data structure.

Also, it's not really possible in general for a compiler to rewrite an algorithm into a more efficient form. That duty rests solely with the programmer, for as you say, it could easily be reformulated to be space-safe.

You might wish to check out research on the algebraic laws of fold and family, especially fusion laws. These are used in some cases by a compiler to eliminate needless intermediate values in a computation. I can't recall who's done work with regards to that, but I'm sure someone else here at LtU does.
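As a small illustration of what those laws buy you, here is the standard fold/map fusion law, `foldr f z (map g xs) == foldr (f . g) z xs`, applied by hand (the function names are made up):

```haskell
-- Two equivalent definitions; the second is the fused form, with no
-- intermediate list of squares ever built.
sumOfSquares, sumOfSquares' :: [Int] -> Int
sumOfSquares  xs = foldr (+) 0 (map (^ 2) xs)   -- builds an intermediate list
sumOfSquares' xs = foldr ((+) . (^ 2)) 0 xs     -- fused: consumes xs directly
```

GHC's foldr/build deforestation performs this kind of rewrite automatically for "good producer"/"good consumer" pairs.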

Attribute Grammars

This particular example is almost exactly the one given in Why Attribute Grammars Matter, for what it's worth.

The particular example is a red herring

The example is similar, though the point is very different. The obsession with "eliminating multiple passes" via tying-the-knot techniques and the credit-card transformation is often a red herring. The article's example certainly is: subtracting the average from every element of a list inherently requires two passes. Tying the knot builds a list of thunks, not floats... approximately speaking, it makes the second pass implicit.

import Data.List as List

--- Listing Two ---

diff     :: [Float] -> [Float]
diff xs  = map (\x -> x - (avg xs)) xs

avg    :: [Float] -> Float
avg xs   = sum xs / genericLength xs

diff2    :: [Float] -> [Float]
diff2 xs =
  let nil avg          =   (0.0, 0.0, [])
      cons x fs avg    =   let (s,l,ds) = fs avg
                           in (s+x, l+1.0, x-avg : ds)
      (sum,length,ds)  =   foldr cons nil xs (sum / length)
  in ds

diff3 :: [Float] -> [Float]
diff3 xs = map (\x -> x - (average xs)) xs

average :: [Float] -> Float
average xs = sum / length
  where
    f (tot,count) x = ((,) $! tot + x) $! count + 1
    (sum, length) = List.foldl' f (0.0, 0) xs

test f xs = let xs' = f xs in xs' == xs' 
main = return ()

Compiled with "ghc -O" and loaded into ghci, "test diff [1 .. 100000]" takes roughly the same amount of time as "test diff2 [1 .. 100000]", which both take longer than "test diff3 [1 .. 100000]". Moreover, diff3 is the only one to return a correct answer when the size of the list is increased to 1,000,000... the others overflow the stack. Not to mention that diff2 breaks GHC's list deforestation... diff is not a good consumer, but it is a good producer, and that map is liable to get fused into the calling function.

So, in this case, fusing sum and length into one pass is worthwhile, tying the knot is not. A useful example of the latter is building an environment for a set of recursive bindings in an interpreter. Here, all you are doing is filling a hole in a symbolic data structure, not building thunks, thus there isn't an implicit second pass.
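The interpreter case can be sketched like this (the `Expr`/`Value` types and `bindRec` are hypothetical, just enough to show the shape):

```haskell
import qualified Data.Map as Map

-- A tiny interpreter fragment: tying the knot to build an environment
-- for recursive bindings. 'env' is defined in terms of itself, so every
-- closure captures the finished environment and mutually recursive
-- bindings can see each other. We only fill a hole in a symbolic
-- structure; no chains of arithmetic thunks are built.
data Expr  = Lit Int | Var String | Lam String Expr
data Value = IntV Int | Closure Env String Expr
type Env   = Map.Map String Value

bindRec :: Env -> [(String, Expr)] -> Env
bindRec outer binds = env
  where
    env = foldr insert outer binds
    insert (name, Lam x body) e = Map.insert name (Closure env x body) e
    insert _                  e = e   -- non-lambda bindings omitted in this sketch
```

Since the lazily referenced `env` is only ever a pointer to the finished map, forcing it never triggers a hidden second traversal.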

Flat types.

A follow-on thought occurs to me: the problem with the example is that it is attempting to do coinduction on a flat type. While this certainly has a sensible semantics, it's not where current implementations really shine.

Flat types are those without any partial values. Common examples are characters, machine integers and floats, ML's unit type, etc. The only values that inhabit these types are computational divergence and fully formed values.

However, say your number type isn't flat. Say you are using or building a library for exact real arithmetic, and you want to implement a particularly efficient way of taking the average of a list of numbers. In this case, I would most certainly investigate algorithms similar in nature to diff2.

Moreover, I'm guessing that it's fairly straightforward to derive diff3, the best performing solution on flat types, directly from diff2, which might perform better on non-flat numbers. I'm not sure that the converse would be true, however.

I would venture a guess that diff2 or something rather similar can be turned into a very compelling example, in an appropriately modified context.


As for Greg's example, there are a number of options for this and similar cases:

1. Leave it as is

2. un-CAF your list, and recompute as necessary, e.g.

fib () = let fs = 1 : 1 : zipWith (+) fs (tail fs) in fs

This would incur negligible extra run-time costs for [1 .. 10000000]

3. un-CAF your list and tuple the folds together. This could be accomplished using GHC's rewrite rules and compiling with "ghc -O -fglasgow-exts" , such as

fuse_foldl (f,g) (a,b) x = (f a x, g b x)

{-# RULES
"pair/foldl" forall f f0 g g0 xs . 
             (,) (foldl f f0 xs) (foldl g g0 xs) 
                = foldl (fuse_foldl (f,g)) (f0, g0) xs
  #-}

For better or worse, GHC defines neither length nor genericLength in terms of foldl or foldr. So you'll have to make a special rule for length, or you'll have to implement your own version of length. You'd also have to make sure that sum gets inlined. On the one hand, sum is a "small function" and thus very likely to get inlined, but I'm not real clear on what GHC does with inlining across modules.
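Rolling your own foldl-based versions is straightforward; these hypothetical definitions give the "pair/foldl" rule something it can match against (GHC's own `sum` and `length` are defined differently):

```haskell
-- foldl-based sum and length, shaped so a pair of them applied to the
-- same list fits the left-hand side of the "pair/foldl" rewrite rule.
sum' :: Num a => [a] -> a
sum' = foldl (+) 0

length' :: Num n => [a] -> n
length' = foldl (\n _ -> n + 1) 0
```

Whether the rule actually fires still depends on inlining, so checking the output of `-ddump-simpl` (or `-ddump-rules`) is advisable.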

Now, these options have adequately covered the programming problems I've run across... though your idea may be useful for other examples.

I'll have a deCAF, please...

For those of us who have not previously seen the phrase "un-CAF your list", this page might help a little.

Nice Technique

There is a somewhat similar technique used in boolean satisfiability for removing variables from a term. Specifically, in a propositional term T in which a variable x occurs, sat(T) = sat(T[x:=0] ∨ T[x:=1]).

So, I guess it's a nice technique which might actually work in practice, but, hmm, for specific instances (like the one above) this reduction tells you that you might start using exponential memory space.
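A toy rendering of that identity (the `Formula` type and names are made up for illustration) also makes the space blow-up visible: each elimination step duplicates the whole term.

```haskell
data Formula = Val Bool | Var String | Not Formula
             | And Formula Formula | Or Formula Formula
  deriving (Eq, Show)

-- Substitute a truth value for a variable throughout a term.
subst :: String -> Bool -> Formula -> Formula
subst x b (Var y) | x == y = Val b
subst x b (Not t)          = Not (subst x b t)
subst x b (And s t)        = And (subst x b s) (subst x b t)
subst x b (Or  s t)        = Or  (subst x b s) (subst x b t)
subst _ _ t                = t

-- Eliminate variable x, preserving satisfiability:
--   sat(T) = sat(T[x:=0] ∨ T[x:=1])
-- Note the result contains two copies of T, hence exponential growth
-- when applied repeatedly.
eliminate :: String -> Formula -> Formula
eliminate x t = Or (subst x False t) (subst x True t)
```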


Thanks for all the tips on curing the example problem. But the ivory-towerist in me is more curious about the generalization. Maybe it is somewhat related to dataflow architectures? That is, as each part of the lazy list gets created, it opens up the possibility for more of the program to be executed in parallel. Thoughts?

Re: Dataflow?

It seems like it might be a variant on traditional dataflow; instead of functions blocking until data is available, functions are moved to the head of the queue when data is available.

I'm guessing that the dataflow and control-flow analyses for such a scheme would often be too complicated to be done at compile time; it seems like this would necessitate run-time elements. I'm not real familiar with the implementation of dataflow languages though; in Oz you create multiple threads in order to express dataflow, and I doubt your variant would be any different.
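For what it's worth, the push-driven half of the idea can be faked at the library level without any new evaluation strategy; here is a toy sketch (all names are made up) where consumers register themselves and each produced element is pushed to every consumer immediately, then becomes garbage:

```haskell
import Data.IORef

-- A source holds the list of registered consumers.
newtype Source a = Source (IORef [a -> IO ()])

newSource :: IO (Source a)
newSource = Source <$> newIORef []

subscribe :: Source a -> (a -> IO ()) -> IO ()
subscribe (Source r) k = modifyIORef r (k :)

-- Push one element to every consumer; nothing retains it afterwards.
push :: Source a -> a -> IO ()
push (Source r) x = readIORef r >>= mapM_ ($ x)

-- length and sum as two push consumers: one pass, no retained list.
main :: IO ()
main = do
  src <- newSource
  len <- newIORef (0 :: Integer)
  tot <- newIORef (0 :: Integer)
  subscribe src (\_ -> modifyIORef' len (+ 1))
  subscribe src (\x -> modifyIORef' tot (+ x))
  mapM_ (push src) [1 .. 1000000]
  result <- (,) <$> readIORef len <*> readIORef tot
  print result
```

This is essentially the "call everything that depends on it immediately" rule from the original post, done by hand in IO rather than by the runtime.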

Evaluation Strategy of a GRS?

The ivory towerist in me tells me it surely is an evaluation strategy for a graph rewrite system, at least that's the easiest way for me to understand it. That's where I'd look in literature.

Apart from the variable elimination technique from satsolving, another thought:

It occurred to me that your idea might be problematic, though. I guess one restatement of your strategy is that, in general, all functions which refer to an argument get rewritten when the refcount of the argument decreases. Functions are also arguments of other functions; what happens when functions get rewritten?

Just checking

The ivory towerist in me tells me it surely is an evaluation strategy for a graph rewrite system, at least that's the easiest way for me to understand it. That's where I'd look in literature.

Thanks for the tip, I'll start diving into that end of the pool.

It occurred to me that your idea might be problematic, though.

Yeah, I haven't thought it through very far. But before I started implementing it as a toy interpreter, I thought I'd check with LtU to see if it was a well-understood problem.

As for Anti-Memoization

The original post seems plausible, but I'm still skeptical.

However, I have a variation to propose. Certainly, many things are better recomputed than stored, and the best choice obviously depends on how long it takes to compute it, how soon you'll want it again, and actual memory characteristics: namely performance of, quantity of, and current demand for storage from other parts of the system.

What if you took a largish structure that you think you might not need for a while, wrapped it in a weak pointer, and added another pointer capable of re-computing the structure if necessary? It would achieve some of the same goals, but it seems much easier to implement.

In fact, I would be surprised if nobody has implemented something similar to what I described. I'm not aware of any particular instances, though.

[edit: hmm... weak pointers aren't quite right here. You need some kind of probabilistic pointer that states how important that computation is to you, so it's neither weak nor strong.]
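The plain weak-pointer version of this is easy to sketch with GHC's System.Mem.Weak (the `Recomputable` type and function names are my own); it also shows why the edit above is right to hedge, since a bare weak pointer may be collected more eagerly than you'd like:

```haskell
import System.Mem.Weak (Weak, mkWeakPtr, deRefWeak)
import Data.IORef

-- A cached value behind a weak pointer, plus an action to rebuild it
-- if the garbage collector has reclaimed it.
data Recomputable a = Recomputable
  { cache   :: IORef (Weak a)
  , rebuild :: IO a
  }

mkRecomputable :: IO a -> IO (Recomputable a)
mkRecomputable build = do
  v <- build
  w <- mkWeakPtr v Nothing
  r <- newIORef w
  return (Recomputable r build)

-- Return the cached value if it survived; otherwise recompute and re-cache.
fetch :: Recomputable a -> IO a
fetch (Recomputable r build) = do
  mv <- deRefWeak =<< readIORef r
  case mv of
    Just v  -> return v
    Nothing -> do
      v <- build
      writeIORef r =<< mkWeakPtr v Nothing
      return v
```

A "probabilistic" pointer would replace the all-or-nothing weak reference with something the GC evicts based on cost and demand, which GHC does not provide out of the box.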

Programmer's Choice

I think your original post and this post are both great ideas.

However, a small alternative view.

Haskell's memoization probably stems from two origins:

1. Graph rewrite semantics just imply sharing, [implicit memoization]
2. Need for speed, or performance. [explicit memoization]

I think 2. is probably a main reason. Probably the compiler would otherwise be significantly slower because it would need to re-evaluate a lot of basic terms each time (for instance, a computed string to match upon).

In the end, however, I think it is not worth it for the programmer. I do believe memoization can be implemented without substantial overhead. However, it makes the runtime substantially more complex, and it will probably always have "weird" (in the programmer's eyes) runtime characteristics.

In the end, I think performance optimization should always be a choice of the programmer.

So, unless you are writing a sufficiently specific-purpose language, say Matlab, I think these kinds of optimizations should be avoided, even if they are very convenient.

[Sharing is not memoization. Adding tables to remember precomputed values is, and that is what I meant in 2.]

Incremental transformation

I've seen this called an incremental transformation before. It's not strictly the answer that you're looking for as I'm not aware of any languages that use this evaluation strategy, but there is a body of work in Program Transformation about how to convert programs like your sample into the type you describe. I suspect that you would have to drastically reduce the power of a language in order to use this evaluation strategy as the problem of converting a program into an incremental version is pretty hard in general.

Here is just one paper that springs to mind. There are many more out there.

Push based scheduling

In manufacturing, this is called push (vs. pull) based scheduling. Here is a description. I don't know if the work on the trade-offs between push and pull based scheduling and analysis of the scheduling algorithms would be helpful, but the parallel between "work in process" and "memoization memory consumption" is clear.