archives

Practical Bits of Making a Compiler for a New Language

I'm designing (and hopefully implementing) a new programming language. Now, I've taken a bit of language-design advice I once heard from a wise old professor, which is: "Never add an entirely new feature to an entirely new language." Instead, I'm consolidating older, thoroughly-tested ideas whose time has come into a language suited for a domain that has yet to see such features.

The vision? An advanced, almost-type-safe imperative language with functional and object-oriented features for kernel and base-library level systems programming. The goal in creating this? To bring the convenience of high-level language features to low-level programming.

The very foundation of the language is:

1) Lexical scoping.
2) Imperative block structure based on statements.
3) A strong type system with a single exceptional case of code marked "unsafe" (because when writing a kernel you either allow unsafe code somewhere, or develop a type system that could probably win a Turing award).

The features I'm putting in are:

1) Module/unit files. See: Delphi, Python.
2) Lambda functions and full lexical closures. Every function treated as data comes with a closure by default, even when it has no closed-over lexical variables. When a function lacks such variables, the compiler can optimize out their storage, but the type system must treat them as the same! Tail-recursion optimization will be mandatory.
3) Exception handling as seen in Java, C++, Delphi, Python and practically every other modern language. Not having it is downright silly, especially when a clever compiler can allocate an exception object on the stack, thus allowing even the heap allocator to safely throw exceptions.
4) Templates, as almost directly stolen from D. Only types can be passed to templates.
5) Object orientation based on generic functions and specializing multi-methods. However, there won't be pre or post methods, and each class can only inherit from one other class (thus VASTLY simplifying inheritance-related problems). Generic functions and specializing methods are used to provide dynamic method dispatch.
6) The unification of tuples, structures, and classes. A class is merely a structure that can inherit from another class, have methods specialized on it, and have multiple scopes (private, public, protected, et al). A structure is merely a tuple with named members. I see no reason that I can't declare a tuple type in which only some members have names, and leave the language to provide indexes for all, then declare a new tuple type that inherits from the first, and finally specialize a generic function on that latter type.
7) Dynamic arrays, stolen right from Delphi. I will also include static arrays just for the convenience of allocating a known number of items off the stack when heap space often runs scarce at the lowest level.
8) Typed reference values. These function like pointers in that their implementation identifies an address in memory, but to the programmer they can either contain the NULL value or reference a known object whose address was obtained via a take-the-address operator. Objects (as in instances of classes) will be referred to by references only. I'm actually quite on-edge about this one as it will introduce the possibility of dangling references. The level of safety provided by an utterly pointer-less language would be nice, but so would not copying huge amounts of data to pass an object into a function.
9) Blocks of code marked unsafe in which pointers, pointer arithmetic, inline assembly and re-interpretive type casting all become legal. These should only be used for ultra-low-level functions such as writing to device registers, and a perfectly-formed unsafe block will make the compiler emit a warning unless the programmer also marks it assumed safe.
10) Let blocks for variable declaration. Sod just declaring a new variable out of the blue, we should be marking where our variables go into and out of scope.

Features left out as deliberate design decisions are:
1) Type inference. It requires writing an inconveniently extensive type system and disallowing either references or mutable variables.
2) Garbage collection. You guys will hate me for this (because everyone and everything has garbage collection nowadays), but when writing an operating system or a device driver one can't find themselves left at the mercy of a GC, either for performance or safety reasons. For these domains you need a language with such a small runtime-library you can write it in 3 pages of code, or none at all.
3) Looping constructs. In the presence of tail-recursion optimization and templates, they simply aren't needed. Let people write functional code!
4) Concurrency. Once again, you simply don't assume certain language features at the lowest level.
5) Implicit type conversions. The only working implicit conversion will be the assignment of an integer to a floating-point variable.
6) Operator overloading. It makes parsing that much harder.

Now, the things I need your advice about are:
1) Syntax. Should I stick with C-like syntax for familiarity, switch to a Lispy s-expression syntax for metaprogramming capabilities, or use some completely other syntax I haven't thought of?
2) The references. Is not copying data really worth aliasing bugs?
3) Any advice on the implementation of generic-function dispatch, particularly in the case where methods might only become known at link time? I've got a system that will work fairly well already, but it only works when the programmer can only specialize on a generic function in the same module that declares and implements that generic function. Still, assuming that, it can do some work to type-check and ambiguity-check the generic-function call at compile time, resulting in only a small list of the type-safe possible dispatches to be looked up at runtime.
4) Do you think I've completely missed something?

I've got a textbook on compilation techniques and I'm raring to go at this, so please write in any ideas or criticisms you might have. Except about the mandate of garbage collection. I refuse to have my thread scheduler pause for a GC, though I feel quite happy about having one for writing desktop applications.