C-Machine

I semi-formalized some part of the translation process for a compiler I am working on. I need people, to do my homework, to shoot as many holes in it as possible, and possibly supply references. I am aware of Thompsons work, but I'll admit I never really got past his elaborate notation.

Below a convenience notation I use. Please read it as a preview. I could have done it shorter, the full notation and evaluation process is elaborated as you read. I should rewrite it such that it introduces the full notation in a few paragraphs. Excuses for that.

A DOT Evaluator
The gist of DOT evaluation, semi-formal. A DOT evaluator is a combinatorial evaluation strategy used as an intermediate notation to make clear how thunks can be set up and reduced.

Note that there is nothing new about this notation. It is a machine-amiable convenience notation used to show, with a few invariants, that it is possible to safely translate combinatorial expressions to C, not as a new notation for combinatorics.

A DOT expression is a syntactic notation for standard combinatorial expressions. A combinatorial DOT expression consists of constants, variables written in lower case, dots '.', combinators in uppercase, and series of squared series out of the former four. A rewrite rule is a squared series starting with a combinator producing a combinator expression, bound variables may occur in rewrites.

Examples:

[I x] = [x]

[K x y] = [x]

[S x y z] = [x z .] [y z]

(What is the expression for lazy evaluation of S?)

Basic Rewrite Rules

A DOT evaluator rewrites a combinator with a fixed strategy. Always, the last expression in a series is evaluated. When it reduces to a constant, its value is placed in the first not-nested dot encountered reading from right-to-left. When it reduces to a squared series, that squared series replaces the original term. Evaluation halts when the last expression does not reduce.

The above rewriter gives a fixed rewrite strategy on DOT expressions. Note that the order of the series rewritten to matters.

Example, using some extra notation:

[F n] = if n = 1 then [1] else [* . .] [n] [F .] [- n 1]

Evaluation of F 3:

[F 3]

= (if 3 = 1 then [1] else [* . .] [3] [F .] [- 3 1])

= [* . .] [3] [F .] [- 3 1]

= [* . .] [3] [F .] [2]

= [* . .] [3] [F 2]

= [* . .] [3] (if 2 = 1 then [1] else [* . .] [2] [F .] [- 2 1])

= [* . .] [3] [* . .] [2] [F .] [- 2 1]

= [* . .] [3] [* . .] [2] [F .] [1]

= [* . .] [3] [* . .] [2] [F 1]

= [* . .] [3] [* . .] [2] (if 1 = 1 then [1] else [* . .] [1] [F .] [- 1 1])

= [* . .] [3] [* . .] [2] [1]

= [* . .] [3] [* . 1] [2]

= [* . .] [3] [* 2 1]

= [* . .] [3] [2]

= [* . 2] [3]

= [* 3 2]

= [6]

Expand and apply rules.

The above rewrite rules are not complete. Reducing a combinator also implies that all extra arguments are inherited by the first squared expression of the series reduced to. Also, a Curried expression is treated as a constant. Dots or variables may not occur in the first position, rather, a special apply @ combinator is used where [@ v] = v, in case the v held a squared expression. In the rules above, the apply is implicit. I.e., the rewrite for I should be read as [I x] = [@ x].

Examples:

[K x y z]

= [ @ x z ]

also,

[K . y z] [S x]

= [K [S x] y z]

= [@ [S x] z]

= [S x z]

Well-formedness and complete reduction
(I need help getting this section right.)

All variables occurring on the right-hand side must be bound on the left-hand side of a rewrite. In the right-hand-side of a rewrite, a squared expression must be bound to an exclusive dot to its right, there may not be more dots than squared expressions. All squared expression either start with a combinator followed by arguments, or hold one constant.

Since a rewrite substitutes all variables, no variables will occur when rewriting a term which doesn't contain variables; a term without variables will therefore rewrite to another term without variables. Since the last term, during the rewriting of a dot expression, cannot contain dots, since all dots in the last term will always be filled, terms rewritten are always strongly reduced. Since all terms rewritten to always start with a combinator, albeit apply, all squared terms always start with a combinator.

The above invariants are sufficient to guarantee that each combinator can be compiled directly to a C routine, and that, by the confluency lemma, terms who have a normal form reduce to that. (I guess.) (I should show Turing equivalence here.)

Shift/Reset
The shift/reset pair saves and restores a stack position. It cannot be expressed directly, therefore some extra notation with curly braces is used. When a shift is encountered, curly braces are introduced, and evaluation continues in that expression, when a value is determined the curly braces are lost, when a reset is reduced, the whole evaluation continues with that expression.

For example (check with compiler internals):

[ Shift G ]

= { [G .] [Reset] }

= { [G [Reset]] }

= {[K . . ] [ F ] [ Reset . ] [3] }

= {[K . . ] [ F ] [ Reset 3 ] }

= [3]

Not really sure where you're

Not really sure where you're going with this; you might get better reactions if you do some editing. Otherwise it's like listening to half of a conversation with yourself :)

By Andy Wingo at Mon, 2010-06-14 17:11 | login or register to post comments

True

I admit I abuse LtU as my personal scratch pad ;)

By marco at Mon, 2010-06-14 18:25 | login or register to post comments

Please No

Really, please don't do that. LtU should not be anyone's personal scratch pad. The policies document is quite clear about this.

By Sam Tobin-Hochstadt at Mon, 2010-06-14 23:44 | login or register to post comments

Neat.

This is an interesting notation. At first I thought you just wanted to eliminate nesting, but are you trying to implement CPS?

You lost me at "Expand and apply rules", though. Why do you write the @ sometimes and not others?

I would prefer a more formal presentation of your notation. For example, a grammar, rules for substitution, etc.

By Frank Atanassow at Mon, 2010-06-14 17:44 | login or register to post comments

Yah.

The whole idea is just to have an intermediate representation which shows that 1) there is a straightforward representation for combinators in a thunk form, 2) thunks can be easily evaluated with a reverse-polish evaluator, and 3) a number of invariants are maintained during rewriting. And lastly, of course, the representation compiles to C, or some other language, in a straightforward manner.

It is all known stuff, the intermediate DOT representation just makes all that known stuff explicit.

Your questions:

Regarding CPS, no that was exactly what I was trying to avoid since I want to be as close as possible to a one-on-one correspondence between compiled code and the source.

An apply rule is introduced since, for example, in [. .] [f] [3], which is equivalent to an expression let x = f in let y = 3 in x y, the first squared series [. .] doesn't start with a combinator, therefore, isn't a thunk. Therefore, in the right-hand side of rewrites, variable or dot applications are made explicit with the @ combinator. (I want all squared expressions to be thunks, i.e., they need to start with a combinator, otherwise it doesn't compile trivially. Stated otherwise, if each thunk starts with a combinator, then all thunks can be evaluated trivially by applying the definition of the combinator; I need that invariant for the translation.)

The expand rule is there to deal with the case where the arity of the rewrite is lower than the number of arguments supplied. Which happens a lot in functional languages when dealing with curried expressions. For example, say f = map inc, and you evaluate f (1,2,3). Then [f] rewrites to [map inc], but [f (1,2,3)] should rewrite to [map inc (1,2,3)].

I am writing the whole enchalada down in TeX atm, will post it here. I don't think there is a lot new or interesting about the notation, I just needed (something like) it to be sure that the translation I thought of works.

[This is also a reply to Andy]

By marco at Mon, 2010-06-14 18:13 | login or register to post comments

the whole enchalada?

The correct phrase is "The Whole Enchilada".

The Enchilada Programming Language is similar to the C-Machine in that it is also based on term rewriting.
However, Enchilada can't be easily compiled because every redex (must) carry a cryptographic hash.

By Robbert van Dalen at Mon, 2010-06-14 19:52 | login or register to post comments

Web site issues

What is the enchilada.jnlp Java Web Start applet that site says can't load? I don't have Java installed/enabled on my system these days since it is a huge annoyance and most desktop vendors that write code in Java (that I depend on) compile it to binary form to maximize interop.

This isn't a complaint, just a hopefully helpful observation that learning about your language feels a bit like a struggle and not hacker friendly.

By Z-Bo at Tue, 2010-06-15 19:46 | login or register to post comments

Browse archives

Active forum topics

New forum topics

Ah, cancel that request.
27 weeks 2 days ago
.
27 weeks 2 days ago
First-class link?
27 weeks 2 days ago
Video Presentation
49 weeks 3 days ago
Also published in ICFP
1 year 1 week ago
About identifiers...
1 year 3 weeks ago
Duplicate comment deleted.
1 year 3 weeks ago
Also true for postfix
1 year 5 weeks ago
That's all about the delimiters.
1 year 10 weeks ago
How about prefix notation?
1 year 10 weeks ago

Lambda the Ultimate

User login

Navigation