output language for new statically typed language?

Hi,

I want implement a language similar to pascal with focus on more flexibility and ease of use and compatibility with C constructs.
With this in mind too, I've finished a course on theoretical Informatics at a German "Fachhochschule".
I've also bought Pascal Creator Wirth's cheap book on Compiler creating and started working with it. The book is about creating Oberon-0, an Oberon subset, with self made parsing and tokenizing tools written with Oberon. These tools translate to a RISC language similar to MIPS. This for me is very difficult, for I've little assembler knowledge only in Motorola 68K. This assembler appears much easier than Wirth's RISC lang, f. e. has certain ops for function prologs and epilogs.
On the other hand, C as an output language would not suit my needs probably.
So what do you recommend? Which language is used in Ahos books on compiler writing? How about UCSD Pascal P-Code? Or Forth?
And which books about compilers would also be useful for me? I especially would appreciate a book with not to difficult exercises and a solutions appendix.

Regards, Ludwig

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

The Tiger Book

I think that Andrew Appel's Modern Compiler Implementation book is a great update to the dragon book.

Ahh, I think my experience

Ahh, I think my experience was similar to yours. I recall reading Wirths book, I had previously read Crenshaw's "Lets Build a Compiler" article series. They both skip creating an abstract syntax tree after parsing and go straight into code generation. In fact, I don't recall if or to what detail the Dragon book goes into on AST's either.

So I think the ingredient you're missing is parsing and creating a tree structure, then generating intermediate code from that. Not sure what would be a good resource for information about that. Maybe the Tiger book, I haven't read it though.

To get started, use a parser

To get started, use a parser generator for parsing, and output C or some such not assembly.

c runs on many plattforms

A lot of open source compilers use c as the target language, because you can find a c compiler for almost every machine.
Another option would be Nicolas Cannasee's Neko VM, see http://nekovm.org/

Maybe C--?

My recommendation:
1. Parse to an AST.
2. Resolve/typecheck into another tree representation (let's call this "LT" for "lowered tree").
3. Write a converted from LT to C.

You may not need the LT representation (often, it will be annoyingly similar to your AST). However, for any non-trivial language, I prefer that extra step. Especially if you plan to have multiple back-end targets.

I think C is a good language to target. It's pretty close to that sweet spot where you can still do all your high-level optimizations but still avoid having to deal with the machine-specific tedium.

What, specifically, were your concerns with C? You'll almost definitely have your hands full with the front-end, so targeting C will speed up development and debugging. If, at some later point, you want to learn how to do register allocation and stuff, you can always add another translation target.

Also, while C is probably fine to start with, it does limit your capabilities if you're trying to do non-C-like things (for example, exceptions). You might want to look at C-- or LLVM's IR.

Maybe C--

C as output lang for my compiler would require to express all featares of the language planned in C, e. g. a module system similar to the one of Oberon, but C doesn't have namespaces at all.
But I'll read the C-- pdf.

C doesn't have namespaces at

C doesn't have namespaces at all.

You can use a name-mangling scheme. No low-level target that I know of supports namespaces.

Modules are deep structure

In response to naasking: s/can/must/. The design of a language module system is intimately tied up with questions of modularity and therefore scalability of systems implemented in the language. It is also irrevocably tied up with the definition of a correct and predictable definition order for global variables. So far as I know, none of the current low-level targets deal with initialization order at all, so the issue goes far beyond mere name mangling. I think what I am saying is that the module system design issue can't be ducked by a serious language designer, and that no "one size fits all" solution exists.

As support for this assertion, consider the number of ad hoc module systems that predated Dave McQueen's system in ML, the number of approaches that were deemed inadequate for standardization by the Scheme community, and the fact that both a careful specification and a limited effect type system were required to ensure a dependency-respecting (therefore deterministic) initialization order in BitC that does not rely on the accident of link order.

The reason that language and OS design is hard is that they are both places where a large number of constraints converge leaving surprisingly few degrees of correctness-preserving design freedom. In consequence, the design spaces in both domains have local maxima that are very hard to evolve from in any consistent and compatible way. Subjectively, the experience is that you get to a place where adding "just one more thing" threatens to topple the entire construct, or at best, require a grand re-work to make the new thing fit. It's a very peculiar process, and it is why I have always found that while "bad code drives out good", it is also true that "good architecture preserves itself in the presence of both bad code and junior developers".

Linux today succeeds on the strength of a small number of very talented core developers, but in the early days it was rather the other way around. Which, when you think about it, is a pretty good metric for just how talented Ken, Dennis, Doug, and a few others really are.

Naming

Ken Thompson, Dennis Ritchie, Doug.... ?

McIlroy

Ken Thompson, Dennis Ritchie, Doug.... ?

Shap almost certainly meant Douglas McIlroy of Unix pipeline fame.

Go to LLVM IR

We initially decided to target C from BitC, mainly for portability reasons. That wasn't an unreasonable choice, but in hindsight LLVM IR would have been a much better choice. It provides significant optimization, and it is a very well structured back end.

The big issue you will run into isn't exceptions. It is expression vs. block structure. Things like tail recursion and/or non-local expression exit will force you to introduce an SSA pass if you are going to C. At that point you may as well emit LLVM IR and get the benefits of optimization.

Concerning the front end, I agree with Ehud. Use a parser generator such as bison or ANTLR.

Things like tail recursion

Things like tail recursion and/or non-local expression exit...

Good point. I was assuming that the language is ultra-simple and does not contain features such as these, and/or that the quality of the implementation is of little concern. I should have made that explicit.

When we started...

We naively assumed that we could use GCC's expresion blocks. Unfortunately, goto into/out-of expression blocks is not supported, and we were forced to do an SSA pass.

That wasn't all bad. It opened up the option to do the Henderson-style accurate collection. We aren't doing that yet, but we clearly could.