Best route for new language to be self-hosting

Does any particular language stand out above the crowd to be a host in which to develop general purpose languages?

I am interested because of the development of Perl6 using Haskell as a host by Autrijus (See pugs.) I expect that any of OCaml, Haskell, Scheme or Lisp would be better than using C <grin> to develop a host, but is one of those (or something else entirely) likely to be significantly better than anything else?

I expect that a good host language:

  • Has an active, friendly community.
  • Is super productive (for a skilled developer who has experience in host language).
  • Has good libraries. Especially language related libraries for say parser generators. Possibly part of language.
  • Is fast enough. Self-hosting may come after the initial childhood and growing pains of new language, so nice if the host language can be used for some time.
  • Is relatively stable. (Not hot off the press, not one-man-band, time spent learning it is usable in future, unlikely to be a pure research language)
  • Is mostly learnable in a few months by a bright developer (Hopefully that is me! My work has always been with imperative languages, but I am certain I can be quickly seduced by functional programming...).
  • Is open source (No vendor lock-in, ability to share with other developers).

I hope this is the perfect forum to get an answer to this personally vexing question!


PS: For those that wonder at the need for self-hosting (rather than just keep another language under the hood):

  • I think that to build a community of developers for any new language wants ownership of ‘their language’ that is ‘better than other languages’. You can’t have that without self-hosting (note: C is great as a host, but not great for prototyping!).
  • Reduce dependencies

PPS: I do think the world needs another language! Of course I am preaching to the converted here.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.


I should add that the question relates to developing a mostly imperative language.

This simplifies the question because there is no advantage to choosing a hosting language that is similar to the desired language (I would like to ignore that issue for the discussion). I also would like not to initiate a functional vs. imperative debate - although I am not sure if that is possible! ;-P

Short answer...

No, no language stands out for GPL compiler development; any GPL will do. Actually, my thought given your statement is that most - if not all - GPLs seem to meet your requirements. So my guess is that you are stating the wrong requirements?

This simplifies the question because there is no advantage to choosing a hosting language that is similar to the desired language(I would like to ignore that issue for the discussion).

Yeah well, I wouldn't like to ignore that issue and I strongly suggest you don't too. There are tons of reasons to choose a hosting language similar to the desired language; there are also tons of reasons not to.

I also would like not to initiate a functional vs. imperative debate - although I am not sure if that is possible! ;-P

I don't think anyone is interested in that debate anyway.

I really suggest that you define your language features and some roadmap first.

Pick a language with macros

Choose a language that can be bent into behaving the same way as the language to be implemented.

Embed the new language in the host language: implement the new language's dataypes and operations in the host language using normal techniques, and use macros to implement the control structures. You can now write programs in you new language (using the syntax of the host language).

Now implement the compiler in the embedded language.

Finally, you need to convert the compiler source from the host syntax to the new syntax. Do this by redefining the macros implementing the embedded language to emit code in stead performing the construct in question.

My personal choice would be PLT Scheme since the combination of its modules and macros allow the creation of embedded languages.

Thanks for the advice

As you can see, I am ignorant of most of the issues. I have only had a small involvement with language design in the past.

As an aside I am really looking forward to seeing how much the Haskell language influences Larry and the design of Perl (There is plenty of opportunity to improve Perl... but there is some really ugly stuff planned for Perl6!)

Use whatever you're best at

I think the best route to self-hosted programming is to get off the host language as fast as you can. Your new language is much better for development, isn't it? Assuming your new language is simple enough, it shouldn't take too long to make a simplified subset of your language in a host language. This would be easiest done in whatever language you feel most comfortable in. The sooner you're done with this, the sooner you can throw out the better. You shouldn't worry about speed. If you have to use the host language implementation for a long period of time, you might be focusing your efforts in the wrong place (though I'm not sure of the purpose of this language, so I may be wrong).

If you don't already have a favorite language that has a parsing library (or already have a favorite language and are good at writing your own parsers), then I'm not sure which language is best. This commment is probably completely useless because of that. I personally like Factor, which has a friendly but extremely small community. Right now, it doesn't really have a mature parsing library, and the language isn't very stable, so it's probably a bad choice. If your language is s-expression based, I think Common Lisp would be a great choice, because there's read and it's just a very productive language. If not, OCaml and Haskell both have good parsing libraries and IMHO are ok languages. OCaml in particular is very fast, but as I said, I'm not sure that matters. Haskell might make it awkward to implement an imperative language, and if you like imperative languages, you probably won't like Haskell anyway. But OCaml and Common Lisp can be as imperative as you want

Get a good stable base, first.

Before trying to self-host a language; it is important that the language design itself be reasonably stable. Not perfect, mind you; but stable enough so that you won't be breaking yourself.

The decision to go and self-host (or when to do so) is complicated by several other factors. In the following discussion, the language being developed is known as L, and the bootstrapping language is B:

* The goodness of the tool-set. Even if L is far more expressive than B; chances are that B has a more mature tool-set available for it. Better libraries, robust compilers, IDEs, debuggers (please no "debuggers are for morons" rants), etc. None of these will exist for L until you (or someone) writes them. A good REPL will make a basic development environment, but using B's tools may be more productive.

* That said, if L is to become a useful language, at some point somebody will have to write the aforementioned tools; so doing so isn't at all wasted effort. The key question is not if, but when.

* If you are writing a HLL, chances are that at least part of the runtime infrastructure will need to be written in a systems' language such as C/C++. Java, for example, is not a suitable language for implementing a JVM.

* If you are writing a domain-specific language; you'd be better off not trying to self-host (unless the problem domain being addressed is compilers and such).

* When self-hosting, be wary of the hornets nest of paradoxes and infinite recursions that comes with self-hosting. Read Art of the Meta-Object Protocol, which discusses such issues in depth. Also read Ken Thompson's famous Turing Award lecture Reflections on Trusting Trust.

* Eating your own dog-food is a time-honored engineering principle; and there is a longstanding bias in the CS world that a (general purpose) programming language must be self-hosting in order to be worth anything. Many developers of high-level languages express a desire to free themselves from the "shackles" of low-level-language B, as if having a compiler written in B somehow limits the capability of L. It doesn't. The migration to self-hosting, if done at all, should be done for sound ease-of-development principles. (That said, starting a self-hosting implementation in L might be a good test case for L, so it has those advantages). Quite a few production-grade compilers and such are not written in the target language--GCC, for example, is written (mostly) in C despite supporting translation of numerous other languages. Don't let pride lead you to self-host prematurely.

* The transition to self-hosting should be incremental. A good technique is to replace one section at a time, until the whole implementation is self-hosting, possibly excluding the low-level bits that require C or some other systems language.

* Don't throw away the bootstrap compiler! (Or interpeter--see next point). The next user will need it to bring L up on his system.

* Regardless of self-hosting or not; write an interpreter first. Only when the language definition is stable, and the interpreter fully working, should you attempt to compile to assembly language and/or bytecode. The interpreter will give you a reference implementation to compare your compiler against.

* At all times: Don't bite off more than you can chew.

Wanted: Stable base with many careful owners.

Scott has stated much more insightfully than I the issues associated with looking for a bootstrapping language B. I disagree with the grandparent that I should use what I know.

Note that I am avoiding saying anything specific about the language L, because it would only lead to people commenting on L, when for this discussion I am interested in comments on B. However imagine L is a descendent of popular recent scripting languages with a C flavour syntax, and you would have a good idea of L!

My premise is that there are better languages for B that I have no experience with, which would be more productive to use as a bootstrapping language (including time to learn the language.)

Is my premise correct?

For example, last week I started learning Visual Basic 6 to use for a project and it would be execrable to use as B (or for that matter virtually anything :-). Most other languages I know would be 10x more productive for me than VB. I believe that a more suitable language for B could be 10x more productive again. However my experience is in imperative and scripting languages, and I have no experience in the languages usually recommended for B. Lisp descendents are often cited for B, but I have yet to be convinced they would be resoundingly better than say Haskell or ocaml (I admit I know neither) for my needs.

For translation or slow interpetration, any of Lisp/ML/Haskell

would work fine.

Haskell might be the hardest to learn, if you aren't already familiar with it, because it's lazy evaluation is quite different from what programmers used to eager evaluation semantics might come to expect. I'm tempted to recommend ML over Common Lisp, if for no other reason than the existence of MLKit and similar tools. (What sort of typing system are you employing, BTW)?

The above assumes you are implementing a reference or prototype interpeter (and not too concerned with performance); until your language is up and running you ought not be in the business of cycle counting. Alternatively, both are good at implementing compilers (which is just rather complex symbolic processing). When it comes to higher-performance implementations, I would recommend first that you target an existing VM or abstract machine (JVM, ParrotCode, .NET) rather than re-inventing this particular wheel.

What of C++, Java, C, C#, etc? I wouldn't use plain C; as it has a primitive type system and is rather poor at symbolic manipulation compared to other languages. C++ is better (especially augmented with the numerous high-quality libraries from and others), but it's lack of garbage collection will be a problem for any interpeter. Some consider it's manifest typing a problem (others don't mind). You can get a GC for C++, but it's one thing that other languages don't make you worry about. Java and C# are often both used successfully for this sort of project (and make it easier to target the JVM and or .NET platforms respectively), but both are less "expressive" than ML. O'Caml (an object-oriented ML dialect) can be made rather fast if you like.

A lot also depends on your goals. Are you trying to get a system working as fast as possible, or is your goal more to learn something new? If the latter, you might be better off sticking with a familiar language. If the latter, consider acquiring a new language part of the learning experience.

Parsers not included

Lisp beats Haskell/Ocaml only if L is parenthesized-prefix syntax. In that case, you avoid writing a lexer/parser, instead using the Lisp reader to transform program text into data structures your interpreter can understand. You can often implement the whole language as macros, cutting down significantly on developer time.

If you have a moderately-complicated syntax, you're probably better off with Ocaml. Between ocamllex, ocamlyacc, and caml4p, you've got a really good set of parsing tools, and then algebraic data types and pattern matching often help you destructure the syntax trees for interpretation/compilation. I know Lisp has similar tools, but I don't think they're as mature (does anyone in the Lisp world ever use syntax?) and then your type system isn't as rich once you've got the actual syntax tree. Then again, it's so easy to destructure lists (destructuring-bind anyone?) that perhaps you don't need it.

One interesting bootstrap possibility might be to use Lisp as B, create L as a prefix-syntaxed language, and then write a parser-generator in prefix-L that parses full-L. This way, you decouple the syntax from the compiler, so you automatically have the ability to create multiple syntaxes, and macros are a simple library away. I've thought some about implementing a language this way, but never got around to it.

My favorite toy languge

there is a longstanding bias in the CS world that a (general purpose) programming language must be self-hosting in order to be worth anything

And there is the canonical reaction to this: “Has it been used for anything besides its own compiler?”

Brilliant link!

Fortunately self-hosting is academic to me at present!

On reflection, I now think the topic title should be something like "What are the most productive host languages to use for developing a new imperative language?"

Ceterum censeo Carthaginem delendam esse

I say "underspecified" or "if you abstract away from everything, then everybody is right."

Java, Javascript, Pyhton, Lua, Php, C++, D, Perl, Icon, Ruby, Nice, F#, Dylan, ML, OCaml... Think about this for a while: Lua in Haskell? Dylan in Python? Ocaml in Java? ML in Perl? I can give arguments and counterarguments for any combination. Each of the mentioned languages have/had their own specific requirements, they all have/had their own manner of getting there.

What do you mean with productive? I assume you want the shortest route to your goal.

Almost no requirements, no goal, therefor any well-known language fits.

Java not suitable for self-hosting

Java, for example, is not a suitable language for implementing a JVM.

Really? What about Jikes? I'd also note that the most widely used Java compilers are written in Java.

This discussion makes me realize that my traditional understanding of the term "self-hosting" (is the compiler/interpreter written in its source language) is somewhat naive, considering both the compiler/interpreter and runtimes for a language may be implemented in any mixture of languages. I don't think I know of any strictly self-hosting implementations, considering even C compilers have parts of the runtime library implemented in assembly.

-- Java.Next