## To CPS or not to CPS

Old question, still difficult. I am in the process of moving from a compiler which compiles the source code to an AST representation in C and then runs a trampolining interpreter on top of it, towards full-blown C compilation.

I am puzzled whether to do the CPS transform or not (major code expansion; seems like a large performance hit? Or am I using the wrong transform?). I could also go ANF, but I am not too sure how to compile the ANF representation efficiently.

And actually, is there any _real_ difference between CPS and ANF? It seems to me that the CPS transform just makes the return pointer explicit in the lambda representation.
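For a concrete comparison, here is the same application 'f x (y+1)' in direct style, ANF, and CPS, written as a Python sketch (all names here are mine, for illustration only):

```python
def f(x, z):            # some binary function, just for illustration
    return x * z

# Direct style: f x (y+1)
def direct(x, y):
    return f(x, y + 1)

# ANF: let z = y + 1 in f x z -- every intermediate value is named,
# and calls only take trivial (already-evaluated) arguments.
def anf_style(x, y):
    z = y + 1
    return f(x, z)

# CPS: every function takes an explicit continuation k and "returns"
# by calling it -- the return pointer is now an ordinary argument.
def plus_cps(a, b, k):
    k(a + b)

def f_cps(x, z, k):
    k(x * z)

def cps_style(x, y, k):
    plus_cps(y, 1, lambda z: f_cps(x, z, k))

result = []
cps_style(2, 3, result.append)   # result == [8], same as direct(2, 3)
```

Both ANF and CPS name the intermediate `z`; CPS additionally threads the caller's return point `k` through every call, which is roughly the difference being asked about.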

Any thoughts?

### Compiling with

Compiling with Continuations. Many answers found therein. :-)

### I scanned it.

But I get a bit annoyed by it. According to Wikipedia, for example, an ANF transformation expands 'f x (y+1)' to 'let z = y + 1 in f x z'.

It's silly; in that manner ANF is nothing more than 'when an argument of an application is not fully reduced, abstract it out' (by building an extra abstraction; remember that 'let z = y+1 in f x z' equals '(\z . f x z) (y+1)'). Of course they run into all kinds of trouble when keeping terms in ANF form since they beta-reduce [edit: misread the article here, should be don't or hardly beta-reduce]. Why not do an ANF transform as the very last step in your compiler? Do an extra eta-expansion on local variables, and presto, Peyton Jones/Wadler hyper combinators emerge.

The CPS transform at least makes control flow explicit, but the CPS transforms I read introduce way too many abstractions (one for every application?). There should be a far easier transform. Or maybe that actually is the transform used in the paper; I guess I'll re-read it.

Sigh. Thanks anyway.

### CPS isn't the final program

CPS isn't the final program form, it's just an intermediate form used for optimization. Most compilers transform the program back to direct style for output (SML/NJ keeps the CPS form though).

### Recent paper on CPS, ANF and friends

You might be interested in a recent paper of mine, from ICFP'07 last year: "Compiling with Continuations, Continued". See

http://research.microsoft.com/~akenn/sml

for the paper and slides from the conference presentation.

In summary, my thesis is that CPS is a good place to do compiler optimization. This is true even if code generation doesn't make use of continuations, as was the case with the compiler that I developed: the compiler pipeline consists of

SML source (direct-style)
-> transform to
CPS (variant with second-class continuations)
-> various optimizations
CPS
-> code generation
.NET IL (direct-style)

The paper describes the relationship between direct-style, ANF, CPS, and monadic languages.
- Andrew.

### This is the paper I linked

This is the paper I linked to in my initial post, so marco has at least skimmed it given his feedback above.

As an aside, is SML# being developed anymore? F# is promising, but I really like the fact that SML# eliminates null reference exceptions.

Yeah, I read it. Along with the slides, some stuff of Wadler (reflections), and some other CPS transforms/ANF related articles.

I roughly implemented some of the transforms (Plotkin's, and the transform from the continuations paper) and didn't like what I got back. Too much machinery for the simple language I am building. Also, 80,000 LOC is not something I am looking forward to implementing.

I am going for a simpler scheme, which will of course not be as efficient as compiling to an intermediate assembly language, or optimizing with various transforms. But I am willing to pay for that at the moment, because I want clean compiler code and I want to generate clean C. (I don't see myself debugging continuation code).

I want a glue language, so I am hoping for, say, JavaScript performance (whatever that means).

### SML.NET

(I missed that it was my paper you linked; thanks for pointing it out.)

I think you mean SML.NET. It's not being "actively" developed, though I have a couple of whole-program optimizations that I'm pursuing. Also I would like to re-release the sources with the new CPS intermediate language, so that others can play.

### Yes, I did mean SML.NET.

Yes, I did mean SML.NET. Thanks for the update! The only activity I had seen "recently" was your paper above, and the "Lightweight Fusion by Fixpoint Promotion" paper. For instance, SML.NET still targets .NET 1.1 without generics.

### SML#

The confusing thing is that there in fact is an SML implementation named SML#. However, it predates .NET by almost a decade, and the # in its name simply refers to its record polymorphism feature.

### Hmm, apparently SML#

Hmm, apparently SML# provides polymorphism but uses natural representations, i.e., no tagging, no runtime type information, and no whole-program analysis. I couldn't find any papers on their site covering this specifically. Any ideas how they achieve it? The resulting trivial integration with C is very appealing.

### Possibly...

Did you see this paper? (disclaimer: I didn't read it)

That paper doesn't discuss the actual compilation, but his more elaborate follow-up does.

But what really caught my eye was a paper on an idea I was mulling over for polymorphic representations: using bitmasks so natural representations can be used! This is the technique used by SML#.

### MLj?

I for one really miss the prospect of having an SML for the JVM: was it your (A. Kennedy) MLj which preceded SML.NET? Sure we have Scala now, but Scala isn't SML (not saying it should be, or that that's a bad thing).

Edit: a quick google shows MLj isn't dead, but not actively developed?

http://research.microsoft.com/~nick/mlj.htm

### Edit: a quick google show

Edit: a quick google shows MLj isn't dead, but not actively developed?

Hmm, is there another form of liveness I'm not aware of? :-)

### bitmasks so natural representations can be used

Note however that it seems to be dog slow on symbolic code. I once measured various SML implementations on bootstrapping HaMLet, and SML# was about 70-80 times slower than MLton or Poly/ML (and 7 times slower than even Moscow ML). Of course, I don't know how much the bitmap technique contributes to this, but it makes me skeptical.

### Pace of development seems

It seems their current version (0.31) only runs on a VM, so it's not natively compiled yet. That page says they expect to have a native compiler by March 2008, but it doesn't seem to have materialized yet. :-)

### As near as I can tell, the

As near as I can tell, the best way to go these days is some form of SSA transform. LLVM is getting pretty remarkable results that way, and it's a fairly straightforward approach.

### SSA, ANF, CPS, and Monadic

SSA, ANF, CPS, and Monadic Normal Form are all fairly similar (and interconvertible). So by and large, which one you choose is mostly a matter of taste.

I am going another direction, which I myself understand a bit better.

When dealing with flow of control, nothing beats reverse polish notation. So, I wrote some examples out for evaluating lambda-terms (or combinator terms to be more exact).

One solution would be to have thunks which remember which of the arguments are evaluated. Let [xx | yy] be a thunk where all yy are fully reduced and all xx are still to be reduced; then we can build a small stack-based evaluator in the following manner. (BTW: not in reverse notation.)

[ c |] -> [| c]                      {CONST}

[ f |] -> definition of f            {FUN}

[ xx y | zz ] -> [xx y | zz ] [ y |] {PUSH}

and

[ xx y | zz ] [| e ] -> [ xx | e zz] {POP/STORE}

An example program:

x = 7
g = \x . x + 1
f = \x y z. x + y + z
f 1 (g 3) x

The program evaluates then like this:

[f 1 (g 3) x |]
[f 1 (g 3) x |] [x |]
[f 1 (g 3) x |] [| 7]
[f 1 (g 3) | 7]
[f 1 (g 3) | 7] [g 3 |]
[f 1 (g 3) | 7] [g 3 |] [3 |]
[f 1 (g 3) | 7] [g 3 |] [| 3]
[f 1 (g 3) | 7] [g | 3]
[f 1 (g 3) | 7] [| 4]
[f 1 | 4 7]
[f 1 | 4 7] [1 |]
[f 1 | 4 7] [| 1]
[f | 1 4 7]
[| 12]
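The four rules and the trace above can be played out by a small Python stack machine. This is my own sketch, with plain Python functions standing in for combinator definitions:

```python
def eval_thunks(expr):
    """Evaluate a nested application with explicit thunk frames.
    An expression is an int (a constant) or a list [f, a1, ..., an],
    where f is a Python function and each ai is an int or a list.
    Each frame is (todo, done): 'todo' still to reduce, 'done' reduced."""
    stack = [(list(expr), [])]
    result = None
    while stack:
        todo, done = stack[-1]
        if len(todo) == 1:                 # only the function left: FUN
            val = todo[0](*done)
            stack.pop()
            if stack:                      # POP/STORE: hand result down
                stack[-1][0].pop()         # drop the argument just reduced
                stack[-1][1].insert(0, val)
            else:
                result = val
        elif isinstance(todo[-1], int):    # CONST: already a value
            done.insert(0, todo.pop())
        else:                              # PUSH: open a frame for it
            stack.append((list(todo[-1]), []))
    return result

# x = 7; g = \x. x + 1; f = \x y z. x + y + z; evaluate f 1 (g 3) x
g = lambda x: x + 1
f = lambda x, y, z: x + y + z
print(eval_thunks([f, 1, [g, 3], 7]))      # 12, as in the trace above
```

Stepping through this with the example reproduces the trace exactly: arguments are reduced right to left, and the done-list ends as [1, 4, 7] before f fires.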

You could also try a different approach where every application expands to a thunk which is waiting for arguments.

[. . . .] [f] [1] [[. .] [g] [3]] [x]
[. . . .] [f] [1] [[. .] [g] [3]] [7]
[. . . 7] [f] [1] [[. .] [g] [3]]
[. . . 7] [f] [1] [. .] [g] [3]
[. . . 7] [f] [1] [. 3] [g]
[. . . 7] [f] [1] [g 3]
[. . . 7] [f] [1] [4]
[. . 4 7] [f] [1]
[. 1 4 7] [f]
[f 1 4 7]
[12]

The latter approach has the advantage that a combinator can be rewritten to one series of thunks, and thus might be faster; but it needs two pointers: one for where to store the result, and one for where to continue the evaluation.

(BTW: I claim both approaches work for curried functions and (non-tail) recursive functions.)

Please SHOOT at this idea, maybe I am totally far off so I would like to have some feedback before I start implementing this!

Extra explanation: The idea of course is not that I am building an evaluator, but compile to combinators which know how to deal with the stack. In the first approach a combinator just needs to know which thunk 'called' it, in the second approach a combinator needs to know which thunk 'called' it and which thunk is 'waiting' for a result.

### Extra example for fac

The above examples don't show how to expand definitions, so here is an example with the 'fac' function. (*) denotes the current thunk being evaluated.

fac = \n -> if n = 1 then 1 else n * (fac (n-1))

Is compiled to

fac = \n -> if n = 1 then [1] else
[. . .] [*] [n] [[. .] [fac] [[. . .] [-] [n] *[1]]]

Reduces like:

[fac 3]
[. . .] [*] [3] [[. .] [fac] [[. . .] [-] [3] *[1]]]
[. . .] [*] [3] [[. .] [fac] [[. . 1] [-] *[3]]]
[. . .] [*] [3] [[. .] [fac] [[. 3 1] *[-]]]
[. . .] [*] [3] [[. .] [fac] [*[- 3 1]]]
[. . .] [*] [3] [[. .] [fac] *[2]]
[. . .] [*] [3] [[. 2] *[fac]]
[. . .] [*] [3] [*[fac 2]]
[. . .] [*] [3] [[. . .] [*] [2] [[. .] [fac] [[. . .] [-] [2] *[1]]]]
... [. . .] [*] [3] [[. . .] [*] [2] [[. .] [fac] *[1]]]
... [. . .] [*] [3] [[. . .] [*] [2] *[1]]
... [. . .] [*] [3] *[2]
... *[6]

Btw. An optimizing compiler wouldn't push constants, so would evaluate (generate code) like this:

fac = \n -> if n = 1 then 1 else
[* n .] [[fac .] *[[- n 1]]]

(Yeah, I know, I am editing heavily).

### Going for it.

Yeah, this is it. I am going this way: expand the lambda-calculus I have with thunk code (a thunk is a series of lambda-terms and unknowns) and compile that.

Any input still appreciated...

Progressing thought. So, actually, I gave it some more thought. My solution doesn't add too much, except that it makes the evaluation order of function application explicit in the same manner that ANF makes local computations explicit. I might as well not do it, and directly generate the correct code.

I am still going to do it, just as an intermediate step to more easily generate code, and, while I am at it, I think I will introduce the ANF form also.

The correct manner to handle post-optimizations is then not to make the compiler more complex by adding reduction rules to the extended calculus, but by being able to translate back to the core lambda calculus, and do the optimizations there.

Ah well, 48 hours of thinking.

### Talking to myself. It's a bloody good idea

For those interested, it turned out to be a bloody good idea. It makes the term very readable, like:

fac = [ 1 -> 1 // no thunk, place the result in the callers stack
| n -> [. . .] [*] [n]
[[fac] [. . .] [[-] [n] [1]]] ] // build thunks

Also, I think thunks which are embedded in other thunks can safely be flattened/linearized. Not sure yet.

Lastly, it exposes problems like: does [. . .] [fac] equal [fac . .]? In a pure language this is true; in a language with side effects it is not. The problem is exposed by the following program:

add = print "hello"; (+ 1)

How often is "hello" printed, 5 times or 1 time? This ultimately depends on whether one generates [. . .] [map] [add] [1...5] (1 time) or [map add .] [1...5] (5 times).
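A Python analogue of the 'add' question (the list append stands in for print "hello"; names are mine):

```python
hellos = []

def make_add():
    hellos.append("hello")        # stands in for: print "hello"
    return lambda x: x + 1

# Reduce 'add' once, then map it -- [. . .] [map] [add] [1...5]:
add = make_add()
assert list(map(add, [1, 2, 3, 4, 5])) == [2, 3, 4, 5, 6]
print(len(hellos))                # 1: the effect ran a single time

# Re-reduce 'add' at every element -- [map add .] [1...5]:
assert [make_add()(n) for n in [1, 2, 3, 4, 5]] == [2, 3, 4, 5, 6]
print(len(hellos))                # 6: five more "hello"s
```

Same values either way; only the effect count differs, which is exactly why the two thunk shapes stop being interchangeable once side effects enter.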

Bit of a foolish thing maybe, but it actually becomes a big problem when optimizing the code.

Oh yeah, and it is easily expanded to deal with exceptions.

### Reflections

To whom it may concern, thought a bit more, might as well write it down.

I thought about the several transforms a bit more. The thunk-based approach where stack frames are 'waiting' for results is the fastest, as far as I can see, but has one drawback I don't like: it destructively updates regions of the stack. A normal garbage collector wouldn't have big problems with that. Unfortunately, I am either stuck with Boehm or my own (_very_ fast) generational garbage collector, which compresses a DAG-like structure in the heap (I allocate in the heap; I don't have a stack).

Boehm is nice but, I think, leads to poor performance when used with a functional language. Using the C stack for stack frames is out of the question too, I think (the compiler I wrote in OCaml runs out of C stack when evaluating large terms, and I needed a lot of hacks to get around that; it seems like the wrong way to go). This problem actually pushed me towards an AST evaluator in C. (For which I wrote the DAG generational garbage collector, because Boehm was slow.)

The CPS transform actually would be good, since it basically chains the rewrites on the graph and therefore keeps the DAG invariant. It loses to the stack-based approach since it will explicitly build thunks for all intermediate steps. (I wonder whether I made a mistake there, since I tried to simplify terms with CbV beta-reductions instead of the normal beta-rules.)

The stack-based solution where we keep a pointer to the evaluated arguments also maintains the DAG invariant (actually, one might as well read it as another explicit CPS variant).

Guess I am stuck with that.

(Still, the fastest would be the thunk based approach with a separate stack, and a garbage collector which compresses a DAG in the heap as referred to by the stack.)

(Excuse my ramblings; I didn't explain it very well. The DAG invariant I need is that the top of the term is always rewritten, and that the term is represented as a directed acyclic graph, i.e. it has a flattened representation where all newer nodes point upward to older nodes. CPS does that, since it always rewrites the (top of the) current term and a continuation to a new term and a continuation.)

### End Result - slow, or damned slow

I implemented this simple scheme. All lambda terms are rewritten to combinators, the combinators are evaluated with a simple stack.

Positive result: it works.

Negative result: it is slow, or damned slow. The following program counts to one million.

namespace system (

type bool = [ true | false ]
type int = {system.int}
)

using system

def int_min: int -> int -> int =
\v0,v1 -> {int_min[v0,v1]}

def int_plus: int -> int -> int =
\v0,v1 -> {int_plus[v0,v1]}

def int_mul: int -> int -> int =
\v0,v1 -> {int_mul[v0,v1]}

def -: int -> int -> int = int_min
def *: int -> int -> int = int_mul
def +: int -> int -> int = int_plus

def count: int -> int =
[ 0 -> 1
| n -> count (n-1) ]

def main: int = count 1000000

Output of the program (no fancy printing yet):

0x804c524:CONST(1:system.int, 2:system.int, 3:size: 1, 4: 1)
0xb7ea3e14:THUNK(0:call: 0x8049960, 1:return: (nil), 2:exception: (nil), 3:nr_reds: 0, 4:nr_args: 1, 5: 0x804c524)

**********************************************************************
DEVIL GARBAGE COLLECTION INFORMATION
0-heap last major     : 0
0-heap base           : 0xb753d020
0-heap top            : 0xb7f3d020
0-heap free ptr       : 0xb7ea3e10

1-heap size           : 10485760
1-heap used           : 627216
1-heap free           : 9858544

2-heap minor trigger  : 131072
2-heap minor barrier  : 9762048
2-heap to space       : 9893120
2-heap major barrier  : 2621440
2-heap grow barrier   : 5242880

3-heap last collected : 131264
3-heap minors         : 9259
3-heap majors         : 0
3-heap collections    : 9259
**********************************************************************

Output of the profiler:

Flat profile:

Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls  us/call  us/call  name
70.89      1.80     1.80     9259   194.48   194.48  heap_collect
18.51      2.27     0.47                             main
3.74      2.37     0.10  3000004     0.03     0.03  alloc_record
2.36      2.43     0.06  1000001     0.06     0.12  __count
1.97      2.48     0.05  1000000     0.05     0.10  __int_min
1.77      2.52     0.05  1000000     0.05     0.05  alloc_const
0.39      2.53     0.01  1000000     0.01     0.04  __sb
0.39      2.54     0.01                             ___apply
0.00      2.54     0.00        1     0.00     0.00  ___exit
0.00      2.54     0.00        1     0.00     0.03  __main
0.00      2.54     0.00        1     0.00     0.00  debug
0.00      2.54     0.00        1     0.00     0.00  heap_debug
0.00      2.54     0.00        1     0.00     0.00  heap_new


Call graph

granularity: each sample hit covers 2 byte(s) for 0.39% of 2.54 seconds

index % time    self  children    called     name

[1]     99.6    0.47    2.06                 main [1]
1.80    0.00    9259/9259        heap_collect [2]
0.06    0.06 1000001/1000001     __count [3]
0.05    0.05 1000000/1000000     __int_min [5]
0.01    0.03 1000000/1000000     __sb [7]
0.00    0.00       3/3000004     alloc_record [4]
0.00    0.00       1/1           __main [9]
0.00    0.00       1/1           heap_new [12]
0.00    0.00       1/1           ___exit [50]
-----------------------------------------------
1.80    0.00    9259/9259        main [1]
[2]     70.9    1.80    0.00    9259         heap_collect [2]
-----------------------------------------------
0.06    0.06 1000001/1000001     main [1]
[3]      4.9    0.06    0.06 1000001         __count [3]
0.06    0.00 2000000/3000004     alloc_record [4]
-----------------------------------------------
0.00    0.00       1/3000004     __main [9]
0.00    0.00       3/3000004     main [1]
0.03    0.00 1000000/3000004     __sb [7]
0.06    0.00 2000000/3000004     __count [3]
[4]      3.7    0.10    0.00 3000004         alloc_record [4]
-----------------------------------------------
0.05    0.05 1000000/1000000     main [1]
[5]      3.7    0.05    0.05 1000000         __int_min [5]
0.05    0.00 1000000/1000000     alloc_const [6]
-----------------------------------------------
0.05    0.00 1000000/1000000     __int_min [5]
[6]      1.8    0.05    0.00 1000000         alloc_const [6]
-----------------------------------------------
0.01    0.03 1000000/1000000     main [1]
[7]      1.6    0.01    0.03 1000000         __sb [7]
0.03    0.00 1000000/3000004     alloc_record [4]
-----------------------------------------------

[8]      0.4    0.01    0.00                 ___apply [8]
-----------------------------------------------
0.00    0.00       1/1           main [1]
[9]      0.0    0.00    0.00       1         __main [9]
0.00    0.00       1/3000004     alloc_record [4]
-----------------------------------------------
1             debug [10]
0.00    0.00       1/1           ___exit [50]
[10]     0.0    0.00    0.00       1+1       debug [10]
1             debug [10]
-----------------------------------------------
0.00    0.00       1/1           ___exit [50]
[11]     0.0    0.00    0.00       1         heap_debug [11]
-----------------------------------------------
0.00    0.00       1/1           main [1]
[12]     0.0    0.00    0.00       1         heap_new [12]
-----------------------------------------------
0.00    0.00       1/1           main [1]
[50]     0.0    0.00    0.00       1         ___exit [50]
0.00    0.00       1/1           debug [10]
0.00    0.00       1/1           heap_debug [11]
-----------------------------------------------


Index by function name

[8] ___apply                [7] __sb                   [11] heap_debug
[50] ___exit                 [6] alloc_const            [12] heap_new
[3] __count                 [4] alloc_record            [1] main
[5] __int_min              [10] debug
[9] __main                  [2] heap_collect

Output of time -v (compiled with -O3 -fno-strict-aliasing), run on a 1 GHz machine.

Command being timed: "a.out"
User time (seconds): 2.73
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.73
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 358
Voluntary context switches: 1
Involuntary context switches: 35
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Ah well, anyone have any ideas (except for CPS, that is ;-)).

[Hmpf, there is a lot of copying going on to keep the DAG, well, in DAG form, so that actually costs. Also, instead of calls, I create thunks; I might actually get rid of a lot of intermediate thunks for non-recursive calls if I replace the combinators with actual variadic routines. Right now they take one argument, a thunk.]

[I am going to expand small local definitions, and stick with this scheme, since I want to get the language right first.]

[uh, used the wrong collector ;-O ]

[looked at it all again, going for Boehm and destructive updates]

### *Cough* - output when evaluating the AST directly

When run with a simple AST-based evaluator (lambda-terms are translated to C, an evaluator interprets the program).

output of the program (nothing, since nothing is printed)

reductions: 12000005

output of the GC

**********************************************************************
DEVIL GARBAGE COLLECTION INFORMATION
0-heap last major     : 0
0-heap base           : 0xb7587020
0-heap top            : 0xb7f87020
0-heap free ptr       : 0xb7f11400

1-heap size           : 10485760
1-heap used           : 482336
1-heap free           : 10003424

2-heap minor trigger  : 131072
2-heap minor barrier  : 9988576
2-heap to space       : 10119648
2-heap major barrier  : 2621440
2-heap grow barrier   : 5242880

3-heap last collected : 131008
3-heap minors         : 3661
3-heap majors         : 0
3-heap collections    : 3661
**********************************************************************

time -v

Command being timed: "a.out"
User time (seconds): 2.01
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.03
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 295
Voluntary context switches: 1
Involuntary context switches: 224
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

[God, the AST-based evaluator doesn't need to set up thunks, since it just breaks down the program. The combinator form does build thunks for every call. The AST evaluator even looks up variables in O(log n) in a context instead of O(1), and it still is faster. Maybe I'll get other results for other programs. Darn...]

[What the hell? This evaluator does 12 million reductions (basic pushes and pops; the C program does 3 million reductions). Both rewrite a DAG, but the C program triggers three times as many minor collections???]

[Confirmed: it's a bug in the generated C]

[Fixed it: major memory leak]

Ackerman:

ocaml  	0.04  	664  	9  	log
vc++ 	0.04 	540 	13 	log
vc 	0.05 	492 	14 	log
mercury 	0.06 	1728 	36 	log
gcc 	0.07 	1504 	14 	log
ghc 	0.07 	1224 	9 	log
mingw32 	0.08 	596 	14 	log
delphi 	0.10 	604 	15 	log
modula2 	0.11 	672 	0 	log
bcc 	0.12 	608 	14 	log
fpascal 	0.13 	556 	18 	log
lcc 	0.13 	548 	14 	log
gnat 	0.14 	792 	0 	log
se 	0.14 	592 	30 	log
bigforth 	0.15 	924 	13 	log
pliant 	0.16 	3228 	14 	log
csharp 	0.18 	3280 	15 	log
modula3 	0.18 	932 	24 	log
smlnj 	0.22 	940 	20 	log
vpascal 	0.28 	600 	18 	log
java 	0.53 	4628 	11 	log
nice 	0.64 	4940 	0 	log
gforth 	0.67 	1504 	15 	log
poplisp 	0.78 	3272 	10 	log
*** c combinator evaluator 0.82 (GRAPH-FAKE/NO_DICK)
erlang 	1.01 	5268 	10 	log
ocamlb 	1.09 	380 	9 	log
oz 	1.26 	652 	19 	log
*** target
cim 	1.32 	2044 	23 	log
*** c combinator evaluator 1.48 (GRAPH/NO_DICK)
parrot 	2.53 	8036 	35 	log
pike 	2.61 	3620 	9 	log
lua5 	2.71 	840 	11 	log
lua 	3.81 	1000 	11 	log
*** c combinator evaluator 3.80 (GRAPH/BOEHM)
mawk 	3.99 	1912 	10 	log
slang 	5.51 	2296 	15 	log
*** c combinator evaluator 6.25 (DAG)
guile 	8.52 	2784 	9 	log
icon 	9.37 	1280 	13 	log
awka 	10.34 	17396 	10 	log
ici 	10.78 	2264 	7 	log
cygperl 	13.85 	37080 	11 	log
*** c ast evaluator 13.97
python 	14.74 	3544 	12 	log
perl 	50.15 	36100 	11 	log
rexx 	78.82 	5896 	14 	log
*** ml ast evaluator 457.9
gawk 	T 	T 	10 	log
php 	T 	T 	10 	log
elastic 	F 	F 	17 	log
jscript 	F 	F 	14 	log
rebol 	F 	F 	12 	log
ruby 	F 	F 	11 	log
tcl 	F 	F 	15 	log
vbscript 	F 	F 	15 	log

[I am now at 3.80 using the GRAPH rewriting scheme; 60% of the time is in the GC, so moving away from Boehm might mean I reach my target. I don't think going faster than 1.0 makes sense with this simple translation scheme and boxed integers. (Also, it's an unfair comparison: different machines.)]

[NO_DICK is a Cheney GC. Still missed the target, but I will probably make it when inlining small functions. GRAPH-FAKE is inlining by hand. Still 50% spent doing GC, well, duh...]

### The 'let' in ANF is

The 'let' in ANF is primitive.

Who is the "they" who is running into all kinds of trouble?

### You are right

The let is primitive; I just don't see the need for it. In 'Compiling with Continuations, Continued' the author gives an example where the let is convenient.

In ANF form '(\x. 0) (f y)' becomes 'let z = f y in (\x. 0) z' which then reduces to 'let z = f y in 0'. (Look we removed an abstraction!)

If you look at it more closely, you might as well state that they did the following reduction: '(\x. 0) (f y)' -> '(\z . (\x . 0) z) (f y)', which reduces (under call-by-value beta-reduction*) to '(\z . 0) (f y)'. I claim nothing, or not too much, is gained. (A compiler which notes that z is never used could just as well optimize it out; or, stated differently, any abstraction which binds a variable that is never used isn't a combinator and may be seen as a let.)

The 'they' are people who use the ANF form and then notice that they need extra primitives for beta-reduction on the let (some terms may be expanded in the let construct, some cannot). I suspect they are re-inventing call-by-value beta-reduction in a lambda calculus extended with let.

* = call-by-value beta-reduction only reduces terms where the argument is fully evaluated (constants or variables).
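That call-by-value restriction is exactly why a let-bound term can't simply be discarded when its variable is dead. A quick Python check (my analogue, with an observable effect standing in for 'f y'):

```python
calls = []

def f(y):
    calls.append(y)       # an observable effect, like evaluating 'f y'
    return 2 * y

# (\x. 0) (f y): x is dead, but under call-by-value the argument
# is still evaluated before the beta step, so the effect survives.
result = (lambda x: 0)(f(21))
print(result, calls)      # 0 [21]
```

So 'let z = f y in 0' keeps the evaluation of 'f y' even after the dead abstraction is reduced away; dropping the binding outright would only be sound in a pure language.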

### I think you are totally

I think you are totally missing the point of these intermediate forms. They don't optimize the code at all. They don't make optimizations possible that weren't possible before. They don't let you do anything you can't do some other way.

All they are is a different, equivalent but more explicit representation of the same code. The purpose in life for ANF/monadic normal form and one of the purposes of CPS is to name intermediate expressions and select a specific evaluation order. When I said 'let' is primitive in ANF, I mean there is no beta-rule for it. It is irreducible. If you beta-reduce a let statement in ANF, you end up with something that isn't ANF. This is why it is called Administrative Normal Form.

The benefit of ANF or CPS or SSA or Monadic Normal Form is that the code becomes more uniform, some implicit choices are made explicit, and finally more transformations become local and syntax directed. You can see this in the example you gave.

The ANF result is trivial to arrive at: convert to ANF, then all beta reductions remaining are safe to reduce w.r.t. CbV.
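That conversion is mechanical. A minimal one-pass ANF normalizer over a hypothetical tuple AST (my own sketch, not from the thread or the papers):

```python
from itertools import count

_ids = count()
def fresh():
    return f"t{next(_ids)}"

# Terms: ('var', x) | ('const', n) | ('lam', x, body) | ('app', f, a)
# ANF adds one form: ('let', x, bound, body)
def is_atom(t):
    return t[0] in ('var', 'const', 'lam')

def to_atom(t, k):
    """Reduce t to an atom, let-binding it first if it is compound."""
    if is_atom(t):
        if t[0] == 'lam':
            return k(('lam', t[1], anf(t[2])))   # recurse under lambdas
        return k(t)
    x = fresh()
    return normalize(t, lambda b: ('let', x, b, k(('var', x))))

def normalize(t, k):
    if is_atom(t):
        return to_atom(t, k)
    _, f, a = t                                   # application
    return to_atom(f, lambda fa:
           to_atom(a, lambda aa: k(('app', fa, aa))))

def anf(t):
    return normalize(t, lambda a: a)

# f x (y+1), curried: ((f x) ((+ y) 1))
term = ('app', ('app', ('var', 'f'), ('var', 'x')),
               ('app', ('app', ('var', '+'), ('var', 'y')), ('const', 1)))
print(anf(term))
```

In the output every application has only atomic parts and every intermediate is let-bound, so any remaining application of a lambda is a CbV-safe redex, which is the point made above.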

In your first proposed way, you need to eta-expand, which isn't always sound, and then you need to beta-reduce those beta-redexes which are reducible under CbV. The real problem, though, is: how does the compiler decide where to eta-expand?

You also suggest that the compiler could just "note" that some variable is never used and then remove it. This takes a non-local, separate analysis. In ANF, this Just Happens when you beta-reduce the result, and you don't need any analysis to decide when to beta-reduce in the ANF representation: anywhere you see an application of a lambda, beta-reduce; a trivially recognizable syntactic property.

One benefit of the explicitness of these normal forms is that it makes code generation more uniform. For ANF, the only places where I need to allocate (a register, a stack slot) are the lets, and I can just process the lets in order since they follow the evaluation order. From the original code, you'd have to have different cases for applications, case analyses, if statements, and any other built-in forms your language supported.

Ultimately, these normal forms are just about convenience. It simplifies compiler implementation by making the syntax reveal more aspects of the semantics so similarities and differences can be more easily recognized and thus more uniformly handled. Any optimization you do in ANF can be turned into an optimization on the original source, just transform the resulting ANF back to the source. However, if you did do this, you'd end up with a bunch of ad-hoc rules with annoying side-conditions.

I've used ANF as an example, but this applies equally well to any of the common intermediate normal forms. As I said elsewhere, which you use is mostly a matter of taste.

### Got it

Yeah, see above post. I noticed it too. It is all about administrative convenience. So I decided to use it ;-).

Oh, and thanks for the reply.

### And also

Extra thoughts.

I got confused because ANF doesn't tell how to handle the stack, it tells how to deal with local computations (except for the extended ANF forms of course).

That's why I am going with both ideas, ANF and THUNK ;-), ANF for local computations, THUNK for dealing with the continuation.

Actually, the CPS transform is something orthogonal to ANF since it makes explicit how to deal with continuations. I just don't like it since it really blows up the lambda terms.

A last thought: small mistake in your post. Normally you always know the evaluation order for a given language; otherwise you wouldn't know how to compile the code. You just have the extra property that in an eager language expressed in lambda calculus, call-by-value beta-reduction is safe.

Actually, I'll call it STACK instead of thunk since it makes explicit how to deal with the stack.