Writing an interpreter, targeting a VM or writing from scratch?

I'm going to write myself a small functional language just for fun. I've previously written an interpreter in Haskell for LC but now I want to try going the route of modern (popular, whatever) languages such as Python or Ruby. So exactly how are those languages implemented?
• Do they generate byte code for a virtual machine?
• Does the virtual machine handle garbage collection?
• What virtual machine would be good to target if I need to implement typical things such tail call optimizations and closures?
• Is parsers/lexers for modern languages handwritten or do they use something like Flex/Bison?
• Is it easier to write an interpreter in C or to generate byte code?

Comment viewing options

Source code

I'd encourage you to check the source code, it can answer your questions a lot better than I can.

Anyway, quick answers from what I've read so far in the source code:

• Python compiles to bytecode, then interprets the bytecode. The bytecode interpreter is basically a massively-CISC machine; the methods for all the built-in types are written in C, and things like integers and strings are all boxed Python objects. Parrot uses the same approach, except it's not limited to Python. Ruby interprets the AST directly (I've heard this is on the list of things to change for Ruby 2.0). In both cases, the bytecode interpreter and parser are generally in a single executable, so it doesn't really matter to the user what's going on behind the hood.
• Depends on the language. Python is refcounted with a backup GC to collect the cycles. I think Perl is all refcounting. Ruby I think has real GC, probably mark & sweep. Parrot also uses true GC.
• You're out of luck with most of the popular VMs (Java, CLR, and all the scripting languages; since scripting VMs are usually bundled with the parser, you may not be able to separate it out to target with a new language anyway). Parrot may have support for tailcalls; I'm not certain about this, but its erstwhile chief architect was a big fan of CPS and functional languages. I think NekoVM may be able to handle tailcalls via $goto, and has built-in support for closures. Or you could go all the way down and target LLVM or C--. • Not sure - problem is, many modern languages do not have context-free grammars. Python, for example, either needs state in the lexer to handle indent levels, or it needs to record token positions and run a pass between lexing and parsing to insert indent/dedent tokens (I took the latter approach when writing an indentation-sensitive language in Haskell; it was only a couple dozen extra lines, but this is Haskell we're talking about). Ruby has an interesting rule where newlines are significant, except when they would leave the parser in an inconsistent state. It's hard to code these into an existing parser/lexer generator. Python, at least, does have a separate grammar file, but I think they use a custom parser generator. Parrot and Perl Perl 5 does use refcounting. It's painful. Parrot does support tailcalls, but I believe that you have to mark tail-recursive capable subroutines for the VM to know to optimize them. (The justification is that some languages we want to target Parrot, such as Python, emphatically do not support tail recursion eliminiation, so this is a HLL feature you have to request explicitly.) I've been working off and on on a Scheme variant targeting Parrot, using the Parrot compiler tools (PGE, a grammar engine built around Perl 6 rules and TGE, an attribute grammar system). Though right now you have to write the transformation rules in PIR (Parrot's native assembly language), it's much easier and nicer in almost every way than doing it in C. I recommend at least looking into Parrot for any little language implementor. CLR has tail calls You're out of luck [regarding tail calls] with most of the popular VMs (Java, CLR ...). CLR can do tail calls. See, for example, Technical Overview of the Common Language Runtime by Erik Meijer and John Gough. NekoVM NekoVM has gotos AND tail calls. It's also very easy to target Neko since you don't have to learn how to produce a binary bytecode format. Instead, you generate Neko source code which is a lot more easy. Neko has been thinking from the start for the kind of task the author propose, and is not bound to any particular language, so I think it's one of the best choices out there. Language-independent VMs Designing a good language-independent VM is much harder than it might seem. In particular I've seen no VM which I would like to target my Kogut compiler with. My compiler currently produces C code. Targetting a preexisting VM is practical only if we are willing to adapt the language for the VM, or if the language semantics is already similar to an existing language for which the VM has been designed. The problem of targetting VMs compared to targetting C and assembler is that most VMs are high level relative to the other choices. Especially those which are interpreted rather than JIT-ed. Besides the well-known issues of tail calls and continuations, VMs often include a runtime type system, garbage collection, calling functions with arguments, objects with fields, strings, dictionaries, typesafe arithmetic etc. It is unlikely that these operations match the semantics of an independently developed language. The type system will most likely not match at all. There are tons of problematic areas: • How are functions with a statically unknown arity handled? • Can objects other than functions be applied? • How are object fields identified? Are they accessed when knowing the object type or not, possibly including subtyping? Are fields identified by strings or symbols, or indices computed at compile time? Is the set of fields known at object creation time, or is the semantics of field access resolved dynamically? • What is the nature of characters, elements of strings? Unicode code points? UTF-16 code units? Bytes? Are they distinguished from 1-character strings? Are strings mutable, and can they change their length? • How is hashing and equality used by a hash table specified? • What is the semantics of weak references and finalizers? Is a finalizer specified when its associated object is created, or can it be added at any time to any object? May a finalizer refer to the object being finalized, or to other finalizable objects? Are finalizers run by other threads? • What numeric types are available? What is the effect of overflowing fixed-size integers: returning bignums? an exception? wrapping modulo integer size? What is the semantics of integer division and remainder of negative numbers? What happens when you try to add objects which are not of a builtin numeric type? • Which exceptions are thrown from particular erroneous situations? What happens when an array is accessed outside of bounds? What happens on integer and floating point division by zero? Are exceptions resumable? Is exception handling code in a tail position with respect to try? Of course in each case the intended semantics can be somehow obtained from primitives offered by the VM. But this often requires a given source operation by many VM operations, most of which are not needed in typical cases (e.g. to translate error conditions and error signalling). Or it requires emulating features using completely different features for silly reasons (e.g. a language wants object fields to be identified by symbols having names being sequences of Unicode code points, while a VM identifies them by mutable 8-bit strings compared by contents, or vice versa). This is a significant instruction decoding overhead even if the VM took care to make individual opcodes fast. It is tempting to modify the language to suit the VM, which is a sign of weakness of the language implementation technology. I suppose a faster result is obtained by targetting a VM which is either low-level and compiled into machine code, or designed with the given source language in mind. For example NekoVM looks very far from my language Kogut. .NET and JVM are different still, and so is Parrot. From these choices I suppose .NET would be best; it is sufficiently low-level. But I decided that it would be better to build the runtime myself in C, which is so close to the processor that the logic needed to obtain the semantics I wanted had a low overhead. I don't advocate developing an entire compiler or interpreter in C. Here high level languages are better, especially for a compiler. I'm talking only about the runtime. This is painful too but this pain has a reason. Language-independent VM I agree with most of your remarks. However in the case of NekoVM, here are some important points to consider : • there is no static type system. The idea of Neko is to be a "high-level C", so it provides some runtime functionalities that can be used to implement your language runtime. This is different from .Net for example. • For example, you don't have to stick to Neko objects to implement your objects, you can use Neko arrays or others Neko or your own data structures that will suit your needs. • Neko has a good number of highlevel features : closures, tail calls, dynamic loading... This reduce the amount of work needed to target Neko. • Neko has a good and fast runtime (right now bytecode, but I'm currently working on JIT). • strings are byte-buffers. It's up to your language to decide to rewrap them as Objects and handle different encodings. • Neko provides 31-bit integer arithmetic, but since you generate code, you can either use your own C Primitives instead to perform operations on your own integers. This of course has an overhead of calling a C function for every operation. • for exceptions, it's not so much a problem. You only need to add appropriate code to transform a Neko exception into a your-language-exception when it is catched (and only at this time). Any value can be thrown in Neko. I hope this answers to most of your concerns about this kind of approach. Language-independent VM Here's the big problem with high-level language independent VMs: language designers invariably want to do something new. By definition, this is something you haven't thought of. Want lazyness? Out of luck. Continuations? Nope. First-class stack frames? Ditto. Fast multimethod dispatch? Well, you can hack it up, but most likely it'd be easier to just write a new VM. In my case, I want to do a language with Erlang-style lightweight concurrency and message passing. This has to be built into the VM at a low level, because it affects everything from IO to heap allocation to stack representation to scheduling. I looked at Parrot, Neko, C--, and LLVM, but none of them offer much in concurrency support. The only halfway-suitable VM was, not surprisingly, Erlang itself, but I can't find any documentation on the bytecode format and the source is semi-closed for modifications (and makes what are IMHO poor choices in string represenation, making it unsuitable for my purposes). I'll probably end up starting with LLVM and then writing an instruction-compatible custom VM once I get to the concurrency parts. Neko would be a good choice for existing dynamic languages, where the design space has already been thoroughly explored. But they already have working VMs, and so have no need or desire to switch. I think Parrot ran up against that problem: yes, you could theoretically run Python or Ruby on it, but why would you want to? Parrot also had the benefit of leveraging CPAN (well, once it's working with Perl 5 & 6), but the cross-language semantic issues are tricky, as I understand it. Not always so true As the recent Scala paper indicates. Sometimes there are things we know to be true that in fact aren't entirely so, once somebody smart (i.e.: not me) thinks about it enough. Still has limitations The Scala paper acknowledges some limitations: To conclude, managing information about statements following a call to receive would require changes either to the compiler or the VM. Following our rationale for a library-based approach, we want to avoid those changes. If you want statements to follow a receive, you have to manually-CPS them yourself, possibly duplicating code between different branches of the receive. Moreover, any function that calls a function that uses receive also cannot contains statements after the receive, and so on, recursively. Scala's type system will enforce that for you, but that doesn't stop it from being annoying. For me, that's a dealbreaker. I could just as easily use Java and the Proactor pattern. I use threads so that I can write my control flow in direct-style; if I wanted to CPS things I can do so in other languages. Now, it's certainly possible to get around this: just write a VM with first-class continuations. But then you start running into other problems, like first-class continuations not playing nicely with a C# or Java FFI (which is why Scala doesn't support them). Beware the Turing Tarpit. Once you've generalized your VM to fit every conceivable language, its functionality is probably so general that it becomes faster for a language designer to write his own VM than to target yours. Good point And echoes the same point made in other guises on LTU. Strings in Erlang and BEAM I do not understand why Erlang's choice for representing strings would make BEAM (Erlang's VM) unsuitable for your language since you could choose to represent strings in something other than lists in your language, for example binaries (which basically amounts to bytearrays). I do understand however that you would be reluctant to compile your language to BEAM code since the format is poorly documented and the VM is quite complicated. Core Erlang I would use Core Erlang as the target if I planned to compile a functional language to the Erlang virtual machine. Core Erlang is an intermediate representation used by the standard Erlang compiler -- it can compile from Erlang to Core Erlang (erlc +dcore) or from Core Erlang to BEAM (erlc foo.core). Core Erlang is also a very easy vehicle to port Erlang to another host environment. I once did a totally half-baked compiler from the sequential bits of (Core) Erlang to Common Lisp and it was really trivial to be complete enough to execute e.g. the lists module. EDIT: I didn't really read the parent post enough to make a relevant comment here, but Core Erlang is cool and worth knowing about anyway. :-) Core Erlang This is a very good way to compile a language onto the Erlang VM. There are a number of limitations, however, which can be difficult to accept depending on the language you want to implement. For example destructive operations on data are not supported by the Erlang VM. Also Core Erlang is not well documented and was designed to be used more as an internal pass in the compiler rather than as a langugae in its own right. Although there is a reader which can parse Core. Ruby does use yacc currently Ruby has an interesting rule where newlines are significant, except when they would leave the parser in an inconsistent state. It's hard to code these into an existing parser/lexer generator. Python, at least, does have a separate grammar file, but I think they use a custom parser generator. See parse.y in the source distribution. Of course things like "sometimes significiant newlines" are implemented in the side effects of the yacc grammar. That's not really true. They That's not really true. They are implemented in the hand rolled lexer. Yes, parsing actions are implemented in the lexer. ... This stuff was harder to do in the grammar, so that's the reason. Maybe Lua What virtual machine would be good to target if I need to implement typical things such tail call optimizations and closures? Lua may be a good choice, if its other features are sufficient for your needs. See The Implementation of Lua 5.0 for an overview of its VM. Is it easier to write an interpreter in C or to generate byte code? Targeting an existing VM can save a lot of work, of course. But if you mean generating bytecode for a VM that you implement yourself, then writing an interpreter directly is less work. Direct interpreters tend to be slow, though. Lua is definitely worth studying The source code of Lua is unusually straightforward in my opinion, compared to, say, Python or Ruby. It is worth looking at just to see how things work. Lua does tail recursion, and has proper closures. It uses a tracing garbage collector (if that is the right term), rather than refcounting, if I recall. Lua's parser is hand written. However, I would recommend using a parser generator such as Bison in the initial stages. I often use Bison just as a validator for a grammar, as it quickly shows up problem areas that would cause more work in a hand-written parser. Lua / Metalua If you're going to target Lua, you'll probably want to use metalua (http://metalua.luaforge.net). However, it obviously depends on your objective, i.e. your definition of 'fun'. So, if you want to play with the high-level language design issues, and you consider stuff like VM, closures, tail calls etc. as not funny, use metalua and produce an AST. If you want to directly produce bytecode, you'll need to heavily refactor the lower level compiler; then I'd be very interested in sharing your code :) One wrinkle to consider I have a small functional language I'm toying with right now for which I'm targetting the Java VM, and I ran into one interesting wrinkle: be sure if you target a VM that it supports the kinds of operations that you intend. Since my little language is statically typed, I ended up changing how things are to work rather dramatically (I think it was an improvement, all in all, but YMMV). If your language will be dynamically typed, maybe it isn't a big deal though. A Wealth of Information I feel that there is a lot of stuff -- low level C like languages like C-- and VMs like Parrot -- that I'm not aware of. It would be great if there's was a summary of the different options for little language implementers like myself. Is CPS the "trick" used to interpret/compile functional languages nowadays or are there other options? In praticular I'm interested in different options for a language with eager/strict semantics. A-Normal Form Is CPS the "trick" used to interpret/compile functional languages nowadays or are there other options? One other option is A-normal form (ANF), which stands for administrative normal form (A-normal form is impossible to google for). See the thread Retrospective: The Essence of Compiling with Continuations. The retrospective paper is a one-page summary paper which puts ANF in context w.r.t. CPS. The original paper, also linked from that thread, provides the detailed spec for ANF. The thread also contains a link to a comp.lang.scheme thread with some discussion of the two. Jens Axel Soegaard posted a short Scheme program on the Scheme Cookbook which demonstrates Transforming CoreScheme into A Normal Form. Functional backend First thing to do is ask yourself, how much, and which type of fun do I want to have? Writing the whole thing in C (like popular php, perl, python and ruby) can be frustrating. For example, if you wanted to be able to call functions written in python, you might consider a Johan-Lang to Python translator, and call Python as your assembler. :) Then you wouldn't have to deal with bytecode generation, and you'd have a huge library at your nascent language's fingertips. * Do they generate byte code for a virtual machine? Most do (the faster ones) The thing is, for you to target that format, you may have to wind up learning a lot more than just a few opcodes from that language's VM. * Does the virtual machine handle garbage collection? If you write your own interpreter, but backend into their runtime/VM, you will still have to worry about memory management (unless you write it in a GC'd language), because you have to cause values/variables, etc to come into existance (via your parser/AST builder) before they are translated into your target in-memory VM format. Now, if you write a compiler to their bytecode, you won't have to worry about this at program runtime (though you will still have to manage memory during compilation, and you won't have your nice REPL) * What virtual machine would be good to target if I need to implement typical things such tail call optimizations and closures? If you want TCO and closures, I'd seriously consider targeting ocaml's VM, or even just embedding your language via camlp4 LtU link | Camlp4 Tutorial * Is parsers/lexers for modern languages handwritten or do they use something like Flex/Bison? Python uses a handwritten grammar. (See$Python-dir/Parser)
There is really no reason to do this to yourself nowadays. For pragmatic reasons, I'd say: don't write it by hand unless it is LL(1) (or maybe LL(k), if you are ambitious), and you want to do some contextual error-handling that would be harder to understand or implement in an auto-generated parser. LL parsers are the only ones I'd consider writing by hand.
If you are starting from scratch, I'd use ANTLR(Java) or lemon(C) or Elkhound(C++) or maybe yacc. (With lex)

* Is it easier to write an interpreter in C or to generate byte code?
Well, most interpreters that don't directly execute the AST generate what amounts to bytecode internally anyway (most LISPs, Python, Ocaml's REPL). Some of them write a representation of that out to disk (.pyc files, ocaml bytecode files) So, if you write an interpreter, you'll be doning both (unless you build an AST and just execute that directly with a recursive evaluator - this may be the way to go for a small interpreter).

Definitely read lots of sources before you start. There are some great ideas in those interpreters (and in some cases, bucketfulls of pain).

edit: You said small, functional language written in C, and I left out possibly your best resources: 1001 Scheme implementations!

also...

Let me add The Gold Parser in that list.

Personally I prefer it because it is language-agnostic.

But not OS-agnostic and not open source

From the FAQ:

1. Is there going to be a UNIX or Linux version?

The GOLD Builder will be compiled for both the UNIX and Linux platforms, but this will be some time from now. [...]

2. Are you going to release the Builder source code?

Eventually the source code will be released, but this will be some time from now. [...]

Each of these reasons is enough for me to ignore the library.

GLR Parsing

For transforming concrete syntax to abstract syntax, I'd concentrate on GLR parsing, which basically means Elkhound, D Parser, or modern versions of Bison. GLR parsers can handle any context-free grammar, not just LL(k), LR(k) or LALR(k) where k most often = 1.

I'd also attempt to look for an incremental parser so I could embed my language in an IDE easily. In fact, I've been thinking a lot lately about why functional languages focus on principal types rather than principal typings, since principal typings lend themselves to incremental type inference, and it seems to me that there could be a nice pairing of incremental AST generation with incremental type inference in the context of an IDE.

Harmonia

The Harmonia project at Berkeley has done a lot with incremental GLR parsers. The publications there give a good overview of how to do incremental GLR parsing.

Where can I go to learn more about principal typings? I was also thinking that type inference would be absolutely wonderful in an interactive IDE, but the classic Hindley Milner algorithm seems like it'd fail pretty easily if, say, you haven't gotten around to defining a function or are missing the last argument of a call.

[Edit: Found this. Any other introductions available? Google came up with a bunch of uses for principal typings, which aren't terribly helpful when one doesn't know what they are.]

Formal framework for virtual machine

As it is for fun, then you should try keep the virtual machine clean, that is in a simple enough formal framework. Although not necessary, going the byte code way makes it much easier to make sure that this is the case. Going the byte code way then also makes it easier for you to do lots of interesting analysis on the properties of your code.

I've decided to go with an all C interpreter

After reading about a few different VMs and intermediate languages recommended here I've decided to try to build something in C first. I believe that it'll be interesting and instructive for me. I've found a number of different papers and articles on interpreter implementation but all are written using more high level languages (such as Haskell and OCaml). What I need is some practical hints on implementing a strict lambda caluculs in C. More specifically I need some pointers on AST representations and value representation (i.e. not fully applied functions/lambdas/closures).

When I wrote an interpreter in Haskell I used something like this (simplified):

data Exp = Lam Id Exp
| App Exp Exp
| Var Id
| Lit Int


And for values:

data Value = VFun (Value -> Eval Value)
| VInt Int


I.e. embedding functions directly in Haskell which won't be possible using C (or will it?). So how to do all that trampolining, CPS andclosure stuff in C? Any good resources out there?

P.S. I will probably use the Boehm GC as it looks mature enough and I don't want to write a GC as well.

LC in 90 minutes

So how to do all that trampolining, CPS andclosure stuff in C?

The 90 Minute Scheme to C compiler might be helpful, although it describes a compiler, not an interpreter. Scheme is close enough to lambda calculus that the issues you mention are essentially the same at the implementation level.

One interesting approach

One interesting approach might be to use the techniques from From Interpreter to Compiler and Virtual Machine and A Functional Correspondence between Evaluators and Abstract Machines . It should be straightforward how to represent an AST in C, it is just a labelled tree. Defunctionalization (not really defined but illustrated, ask google if you want a more formal approach) is one of apparently three* methods for implementing higher-order languages in first-order ones. It should be pretty clear how the final abstract or virtual machines can be implemented in C especially without the worry of GC added (which these papers don't cover).

* Closures and combinators are the other two, though closure conversion are very much related to defunctionalization.

Boehm GC

P.S. I will probably use the Boehm GC as it looks mature enough and I don't want to write a GC as well.

I have used the Boehm GC to keep things tidy in the AST generation and processing part of a compiler written in C. Saves alot of bother.

I am not too sure whether the Boehm GC is well suited for memory management in functional languages, which typically allocate many small, short-lived objects, e.g. cons cells. This is not a typical pattern in C as such. The reason for my misgivings is that a Bigloo Scheme programme I was working on spent over 90% of its time in GC, and was slower than the same programme written in Python or Lua, and far slower than Ocaml. Bigloo Scheme compiles to C and uses the Boehm GC. Other programmes I tried out in Bigloo Scheme, which were less list intensive, ran pretty much as fast as C. Take this with a pinch of salt. As I am not an experienced Scheme programmer, I may have written the Scheme programme incorrectly in some way.

It might be worthwhile to look at TinyScheme

If you can wrap your head around the code, that particular interpreter is very tight, and it has stood the test of time fairly well. Don't assume that it has support for GC right -- last time I looked, there were bugs, but if you contact me offline I can describe the issues for you.

Boehm GC is not a good place to start if you ultimately want precise GC. It leaks like a proverbial sieve. Also, if you are writing a n interpreter you don't need something like Boehm, because you are in a position to easily keep track of all the references. My suggestion on this would be to try hard for a precise GC. The results are worth it.