Request For Advice and Guidance On Writing a Scheme To C Compiler?

Dear Lambda The Ultimate,

I've been working very hard recently on writing something I thought would be entirely straightforward: a Scheme to C compiler that bootstraps - but nothing I do seems to work. On my first attempt (which was able to compile and run its own parser) there was no GC and no tail calls, so it demanded >8GB of memory and failed; on my next attempt the emitted code blows up so huge that it grinds to a halt. I hope someone can give me some advice on what I should do, or what I need to learn, to complete this. It would also be interesting to discuss and read about other members' experiences.

Starting

I've been interested in programming languages since I started programming a long time ago, and have explored and written parsers, type checkers and interpreters for a fair selection of different kinds of programming languages: Lisp (of course!), Prolog (and miniKanren), a cut-down version of Haskell with Hindley-Milner type checking, as well as some unusual things like linear and reversible programming languages.

A long time ago I read Marc Feeley's "The 90 Minute Scheme to C Compiler" when it came up on this site, and it was fascinating to understand (or so I thought) how a Scheme compiler could be done, in particular using the CPS transform to deal with tail calls and call-with-current-continuation. This got me interested in continuations and I studied Appel's Compiling with Continuations, but I didn't write a compiler back then.

Continuations

Since then I've read Olivier Danvy's work on abstracting and representing control and written metacircular interpreters that emit continuation semantics (pure lambda terms without any control operators). I also came to prefer delimited continuations over call-with-current-continuation and decided that I would like to implement those directly in my own compiler. I noticed that in Marc Feeley's compiler he was able to use the continuation semantics of call-with-current-continuation as is, because it's already in CPS form; the continuation semantics for shift and reset, on the other hand, are not in CPS form, so this technique cannot be used: one would have to CPS transform twice to get the necessary metacontinuations, but this also causes a huge blowup in the number of lambda terms. So I studied other approaches to the direct implementation of shift/reset.
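
(To make the call-with-current-continuation point concrete, here is its standard one-level CPS definition - my own illustration, not code from Feeley's compiler. Because the definition is itself already a CPS term it can be dropped straight into CPS-converted code, which is exactly what shift and reset lack.)

    ; call/cc written directly in CPS: f receives a reified continuation
    ; that discards the continuation of its own call site and resumes k
    ; instead; the whole definition is already in CPS form.
    (define (call/cc-cps f k)
      (f (lambda (v k-ignored) (k v))
         k))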

In the end I felt that building a runtime for my compiler to target (in C) that handled delimited continuations was something I could add later on, and for now I should set that aside (if they were required during bootstrapping I had an easy way to interpret them), so I just used Matt Might's very clear and simple explanation of the hybrid higher-order CPS transform, "How to Compile with Continuations".
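
To give a feel for what the transform produces (a schematic example of my own; the actual generated names will differ), a nested call with top-level continuation halt comes out roughly like this:

    ; original expression, with top-level continuation halt:
    (f (g x))
    ; after the CPS transform (schematically):
    (g x (lambda ($rv) (f $rv halt)))
    ; the inner call runs first and hands its result to a continuation
    ; that performs the outer call, so every call is now a tail call.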

Getting Started

When I got the idea to write a self-hosting Scheme compiler I quickly wrote out a parser able to parse as much of Scheme as I needed, including its own source file. Then I initially came up with something very simple that just performed closure conversion: it was a surprise to me that this was basically enough to execute lambda, using C procedures for code and vectors for their environments - it just lacked garbage collection and tail call elimination. I pushed on with this, hoping the lack of optimisations wouldn't be an issue, and ended up with a system that had one file per compiler pass, some utility functions and a script that chained them all together. This quite successfully compiles all the small, simple Scheme programs I gave it (getting a program with the Y combinator to compile and run for the first time was very exciting!) but it just isn't good enough to bootstrap.
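
To illustrate what that first pass did (a sketch of my own rather than the author's code; make-closure is a hypothetical constructor pairing code with its environment vector), closure conversion rewrites each lambda to take its environment explicitly, so no free variables remain:

    ; before closure conversion: y is free in the inner lambda
    (define (adder y)
      (lambda (x) (+ x y)))

    ; after: the code part takes an explicit environment vector, and the
    ; closure pairs that code with a vector holding the captured value of y
    (define (adder y)
      (make-closure
        (lambda (env x) (+ x (vector-ref env 0)))
        (vector y)))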

So I looked back a bit, read over Guy Steele's RABBIT paper, and decided to start again from scratch. This time I started with a C runtime that has a simple two-space copying GC which uses the stack as its root set - the stack itself is executed by a trampoline (a while loop that pops a continuation off the stack and calls it), and the continuations pop arguments off the stack, compute something with them, then push a continuation onto the stack and return. I wrote some tests for the runtime: for example, recursive and iterative factorial functions compiled to it by hand. The value of this approach is that you get tail call optimisation "for free"; it only requires doing a CPS transform in the compiler.
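
As an illustration of the trampoline idea (a sketch of my own, written in Scheme rather than the actual C runtime, with made-up names and simple thunks instead of argument-popping continuations): a driver loop pops continuations off an explicit stack and calls them, so the host stack never grows and every call is effectively a tail call.

    ; an explicit stack of continuations (represented as thunks here)
    (define stack '())
    (define (push! k) (set! stack (cons k stack)))
    (define (pop!)
      (let ((top (car stack)))
        (set! stack (cdr stack))
        top))

    ; iterative factorial, hand-compiled to trampoline style: each step
    ; pushes the next continuation instead of calling it directly
    (define (fact-step n acc)
      (if (= n 0)
          (push! (lambda () acc))                          ; final step: yield acc
          (push! (lambda () (fact-step (- n 1) (* n acc))))))

    ; the driver loop: pop a continuation, run it, repeat until the stack
    ; is empty, then return the last value produced
    (define (run)
      (push! (lambda () (fact-step 5 1)))
      (let loop ((result #f))
        (if (null? stack)
            result
            (loop ((pop!))))))

    ; (run) => 120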

So the overall architecture of the compiler is now:

  • Parse a Scheme file to get the s-expression tree
  • "desugaring" - I have looked into syntactic closures and I think I would like to build a good macro system out of this in the future, but to start with I decided to just hard-code expansions for all the special forms.
  • mutability analysis - since I use what I later found out are called "flat closures", I have to box any variable that SET! is applied to (see the sketch after this list)
  • continuation passing style transform - this makes every call a tail call
  • lambda lifting - I added this optimisation at the end but it didn't help at all [it looks for in-place lambda applications like ((lambda (var ...) body ...) param ...) and adds their free variables as extra parameters - this means they can be stack allocated instead of heap allocated]
  • closure conversion - this removes all free variables from the code by creating closures (code & environment-vector pairs that hold the closed-over variables)
  • hoisting - I give every lambda a name and move it to the top level, since the code part of each closure will be implemented as a C function
  • c-gen - this translates continuation-passing procedure calls into stack machine operations that my runtime can execute; it also handles taking things off the stack to allocate them on the heap
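
As a concrete example of the mutability-analysis pass (my own sketch; box, unbox and set-box! stand in for whatever one-slot representation the compiler actually emits): a variable that SET! is applied to gets wrapped in a box, so all the flat closures that capture it share the same mutable cell.

    ; before: count is both captured and mutated, so independent copies in
    ; flat closures would go out of sync
    (define (make-counter)
      (let ((count 0))
        (lambda ()
          (set! count (+ count 1))
          count)))

    ; after boxing: every reference goes through the shared box
    (define (make-counter*)
      (let ((count (box 0)))
        (lambda ()
          (set-box! count (+ (unbox count) 1))
          (unbox count))))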

After that, actual C source code is emitted, and some simple programs, like an infinite lazy stream of Fibonacci numbers, do execute (with the correct space usage, i.e. TCO and GC are working) - but when I started to head towards bootstrapping, just adding a recursive definition of the EQUAL? function blew up horribly, producing 40k lines of C.

Looking for more to read

I've looked for more modern things to read that might help me build a self-hosting Scheme compiler, and tried to find short, readable implementations to study (studying real-world compilers is just too hard since they are so complex and enormous):

  • http://canonical.org/~kragen/sw/urscheme/
  • https://github.com/darius/ichbins/
  • https://bitbucket.org/bunny351/bones/src

But all in all I just don't know what I'm missing. I don't want to give up though, because I think there's an important difference between understanding something and actually doing it. In particular, I feel like I can't progress in my study of programming languages until I get past this block (for one thing, after I complete this I want to experiment with building compilers using Futamura projections, the way PyPy does).

You've given a lot of

You've given a lot of information, but it's still pretty abstract, so it's hard to see where you're going wrong. It sounds to me like you're on the right track. My advice is to dig into a case where there's a huge blowup in the amount of generated code. Why does equal? produce so much code? Where is it coming from? What is it all doing? Naive CPS can produce a large number of new functions; is that what's happening here? Does the huge amount of output consist of tons of one-line functions?

Anyway, in any case I really encourage you to keep going and work through it. I agree that building this yourself is the best way to develop and test your understanding. There's no substitute for experience.

tl;dr;

but have you looked at PreScheme from Scheme48?

Check out Chicken Scheme

which is a CPS-ing compiler from Scheme to C based on Henry Baker's "Cheney on the MTA" paper.

Off the shelf GC

Not knowing the specifics, an option might be to use an off-the-shelf GC such as the Boehm-Demers-Weiser (BDW) collector.

random ideas and request for amplification

I'm very sympathetic toward your objective. I assume merely using someone else's Scheme to C compiler doesn't satisfy your need. (For learning purposes, for example, merely using an existing thing teaches you very little, unless you completely tear it down and rebuild almost from scratch.) If you need your own runtime, using someone else's work won't help much either.

I didn't get a sense from your post how well you know C. It sounds like you're writing in Scheme, which seems a hint you prefer Scheme to C for writing the compiler; but working from hints is just guessing. If you're bootstrapping, the first Scheme you use isn't yours, but then later you want to use your Scheme to compile your compiler. You might want to show a diagram with version numbers for each Scheme, to clarify which executable is involved when a problem occurs. (Say s0 is the first Scheme that's not yours, and s1 is the first one output by your compiler, etc.)

I've been working very hard recently on writing something I thought would be entirely straightforward: a Scheme to C compiler that bootstraps - but nothing I do seems to work.

And you're writing mainly in Scheme? A C runtime with continuations is not straightforward to most C programmers, so that would complicate things, unless you know everything about how your runtime on the C side does that part. I see you're using a trampoline, which is basically on the right track, since the plain C stack won't do the job (short of, say, copying it) if you support Scheme continuations.

On my first attempt (which was able to compile and run its own parser) there was no GC and no tail calls, so it demanded >8GB of memory and failed; on my next attempt the emitted code blows up so huge that it grinds to a halt.

Pronoun "it" at the end loses me. Which executable fails? If you put a version number on each version of Scheme and a version number on each source code file (Scheme and C), and a version number on each process that runs, I might be able to tell which one fails.

but when I started to head towards bootstrapping, just adding a recursive definition of the EQUAL? function blew up horribly, producing 40k lines of C.

Eyeballing the emitted C code should tell you something, unless you did not write the part emitting C code. Several hypotheses come to mind. Are there many identical copies of the same thing? Nearly the same thing with different names? Weirdly divergent avenues through vaguely related semantic spaces that don't cooperate with one another?

I want to go off on a long philosophical tangent here, but I'll resist. A short version goes like this: when a piece of code is idempotent, running it once and running it 1000 times has the same result, but the latter is much slower. One reason to examine logs of actual executable behavior is to find and eliminate redundant behavior, which might not otherwise show itself. (It's upsetting to find it in i/o code like btrees.)

Sometimes "doing something" has an identity that doesn't make a valid binding until a conclusion later in negotiation, making it possible to start doing it from scratch again before the first is complete, because it's demanded again before the first is done. It's often necessary to keep track of a task now in-progress to avoid re-doing the same thing again just because it's not done yet. Typically a hashmap of such things is necessary, with a good definition of identity to encourage waiting for an earlier thing to finish.

CPS transform code duplication?

Maybe it's a code duplication issue due to the CPS transform.

... so I just used Matt Might's very clear and simple explanation of the hybrid higher-order CPS transform, "How to Compile with Continuations".

It seems to me that, although that article is really great at explaining the CPS transform, the final transform it presents still has a small code duplication issue. Look at the T-k function in scheme-cps-convert.rkt from How to Compile with Continuations. In the if-form case, the continuation is duplicated in the branches of the if-form that is constructed. The following is a potential improvement for that case, which avoids duplication in the samples I've tried.

    [`(if ,exprc ,exprt ,exprf)
     ; We have to reify the cont to avoid
     ; a possible code blow-up:
     (define $rv (gensym '$rv))
     (define cont `(λ (,$rv) ,(k $rv)))
     (T-c expr cont)]

Good samples to test with:

    (T-c '(begin (if a b c) (if d e f) (if g h i)) 'halt)
    (T-c '(if (if (if a b c) d e) f g) 'halt)

In any case, I'd suggest looking for code duplication issues due to the CPS transform.

Further code reduction

Another source of code expansion in the final CPS transform from that article is the rebinding of continuations that are already bound, which results in the introduction of unnecessary lambda abstractions. Here's a substitute for the if-case in the T-c function that addresses the problem.

    [`(if ,exprc ,exprt ,exprf)
     ; If it's not already bound, we have to bind the cont to avoid
     ; a possible code blow-up:
     (define (if-form v)
       (T-k exprc (λ (aexp)
                    `(if ,aexp 
                         ,(T-c exprt v)
                         ,(T-c exprf v)))))
     (if (symbol? c)
         (if-form c)
         (let (($k (gensym '$k)))
           `((λ (,$k) ,(if-form $k)) ,c)))]

Good samples to test with:

    (T-c '(if a (if b c d) e) 'halt)
    (T-c '(if (if a f g) (if b c d) e) 'halt)

Eric, Thank you so much!

Eric,

Thank you so much! This cut the size of my generated code by a factor of 10 and gave me new hope that this approach might take off! I spent a couple of hours debugging unrelated errors and it's starting to take nice shape... exciting.

I really appreciate everyone's comments too!

You are very welcome!

Your project sounds similar to one of mine, so it's encouraging to read about yours as it's getting off the ground. Best of luck!

Reading recommendation

The book Lisp in Small Pieces by Christian Queinnec has a chapter on compiling Scheme to C.

And another

Another favourite on this topic is Richard Kelsey's Realistic Compilation by Program Transformation. See here.