AST intermediate representations

Hi,

Is anyone aware of any systems which use AST intermediate representations instead of byte code? A few years ago, I read papers by Michael Franz concerning use of compressed ASTs in the Oberon system (e.g. A Tree Based Alternative to Java Byte Codes - pdf). Work on AST intermediate representations seems to have continued particularly in relation to mobile code (e.g. Towards Language-Agnostic Mobile Code - pdf).

Typically the AST intermediate representation of a program is compiled at the destination on the fly (with varying degrees of optimisation depending on time constraints). As a first step, I'm interested in interpreting ASTs to see what the performance of AST interpretation is like (possibly with a subset of Oz). So far I've only found one attempt at anything like this.

Just wondering if LtU readers had come across any other work on AST representations and interpretation.

Regards,
Chris

How decorated is the AST?

If the AST being used is simply the output of the parse phase, there's little advantage, especially since saving an AST to persistent storage means serializing it, and the serialized form then has to be re-parsed when loaded.

Interpreting ASTs directly is known to be a rather slow method. It's useful for debugging and such, but not where you want to go if you want a high-performance runtime. A good discussion of this is in the book Modern Compiler Design.

What might be more useful is an annotated tree of some sort.
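
To make that concrete, here's what a bare-bones direct AST interpreter looks like, sketched in OCaml; the toy language, constructor names, and environment representation are all invented for illustration. Every step dispatches on a node and chases child pointers:

```ocaml
(* A toy expression language and a direct AST interpreter: every step
   dispatches on a tree node and recurses into subtrees, following one
   pointer per child and looking names up in an environment. *)
type expr =
  | Num of int
  | Var of string
  | Add of expr * expr
  | If  of expr * expr * expr            (* nonzero means true *)
  | Let of string * expr * expr

let rec eval env = function
  | Num n -> n
  | Var x -> List.assoc x env            (* linear lookup on every access *)
  | Add (a, b) -> eval env a + eval env b
  | If (c, t, e) -> if eval env c <> 0 then eval env t else eval env e
  | Let (x, rhs, body) -> eval ((x, eval env rhs) :: env) body

let () =
  (* let x = 2 in if x then x + 40 else 0 *)
  let prog = Let ("x", Num 2, If (Var "x", Add (Var "x", Num 40), Num 0)) in
  Printf.printf "%d\n" (eval [] prog)    (* prints 42 *)
```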

Fairly common

Quite a few real languages use some kind of AST interpretation. Last time I checked (a couple of years ago), Ruby was one such language. Another example (afaik) is MzScheme, although in that case, the AST has already been compiled down, by the macro system, from Scheme source to a small core lambda language, so the final AST is not a direct representation of the source code.

It tends to be easier to implement AST-based interpreters, partly because the AST's semantics are the same as the source language's, whereas true (linear) bytecodes have their own, independent semantics. Reasoning about and dealing with one semantics is easier than two. Some downsides of AST interpretation are that it tends to be slower, and the tree representation uses more memory.

IIUC, the fact that linear bytecodes are organized contiguously in memory provides much of their performance advantage. This improves locality of reference, which has a variety of benefits. For example, it allows for an efficient interpretation loop, which doesn't have to follow as many pointers as the tree version.
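
For contrast, here's a sketch of that linear style: a flat, contiguous array of instructions walked by a program counter. The opcodes are invented for the example, and a real VM would pack them into an actual byte array for even better locality; the point is only the sequential dispatch loop.

```ocaml
(* Sketch of a linear bytecode interpreter: a flat, contiguous program
   walked by a program counter, rather than a tree walked by recursion. *)
type instr = Push of int | Add | Mul | Halt

let run (code : instr array) =
  let stack = ref [] in
  let pc = ref 0 in
  let running = ref true in
  while !running do
    (match code.(!pc) with
     | Push n -> stack := n :: !stack
     | Add    -> (match !stack with
                  | a :: b :: rest -> stack := (a + b) :: rest
                  | _ -> failwith "stack underflow")
     | Mul    -> (match !stack with
                  | a :: b :: rest -> stack := (a * b) :: rest
                  | _ -> failwith "stack underflow")
     | Halt   -> running := false);
    incr pc
  done;
  List.hd !stack

let () =
  (* (2 + 3) * 4 *)
  Printf.printf "%d\n" (run [| Push 2; Push 3; Add; Push 4; Mul; Halt |])
```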

If the goal is to implement a really fast interpreter, it seems as though linear bytecode is the way to go. However, you can get perfectly good performance from an AST interpreter, particularly if you use an approach similar to the typical Scheme one. Also, if you're doing JIT compilation to native code, then there seems to be less need for a linear bytecode, except perhaps to support platforms on which JIT isn't available.

And Perl

Perl 5 interprets the AST directly, I believe.

IIUC, the fact that linear by

IIUC, the fact that linear bytecodes are organized contiguously in memory provides much of their performance advantage. This improves locality of reference, which has a variety of benefits. For example, it allows for an efficient interpretation loop, which doesn't have to follow as many pointers as the tree version.

I hadn't considered the effect of locality of reference on performance. It makes sense that this would be the case; it's exactly the same consideration that applies to hardware machines.

The AST approach to intermediate representation has interested me since reading about SlimBinaries. It looked like compressed ASTs offered a number of benefits over byte codes, but there were some concerns about the cost of the final compilation (is it worth it in practice, etc.). I wanted to know whether interpreting the AST would alleviate this, but couldn't find much information on the question. Thanks for the info regarding current languages which use a similar approach.

If the goal is to implement a really fast interpreter, it seems as though linear bytecode is the way to go. However, you can get perfectly good performance from an AST interpreter, particularly if you use an approach similar to the typical Scheme one. Also, if you're doing JIT compilation to native code, then there seems to be less need for a linear bytecode, except perhaps to support platforms on which JIT isn't available.

The goal is to create a runtime and compiler for a language, testing AST representations along the way, to satisfy my curiosity and learn more about implementing languages (especially for distributed systems). I'm familiar with the ideas, but have never put them into practice. Interpreting the AST was one option that could be done early, but I was unsure of the benefit, hence the "first step" in the original post. Compiling on the fly is another option.

These are all just ideas at the minute, and I'm researching before I do anything; no point in repeating a well-trodden path if it's known to lead to a dead end. I wasn't sure where to go to find out what's already been done on this, so I apologise if LtU is the wrong place. It just seemed that, given the number of people here, someone might have seen something.

AST representations

I think the most popular representations for ASTs are markup (XML and Lisp's s-expressions) and postfix notation (like Forth uses).

Both can then be compiled or JITted as much as you like... Generally, you don't do pure interpretation of the AST, but try to pre-compute as much as you can, so that loops and the like run as fast as possible.

You might want to read the SICP chapter on interpretation. IIRC, at the end it makes the step from simple interpretation to an interpreter that pre-constructs executable objects (closures; in Java you could use objects, I think) and later executes them without much overhead from variable lookup and the like.

Re: AST representations

I think the most popular representations for ASTs are markup (XML and Lisp's s-expressions) and postfix notation (like Forth uses).

Those notations would usually be considered concrete syntax, not the abstract syntax which the AST represents. Once you read an s-exp or XML tree into memory, you end up with something closer to abstract syntax, but it's still not quite what's being discussed here. [Edit: these notations are sometimes used as an external representation of abstract syntax, but in their textual form they're still a concrete syntax.]

You might want to read the SICP chapter on interpretation. IIRC, at the end it makes the step from simple interpretation to an interpreter that pre-constructs executable objects (closures; in Java you could use objects, I think) and later executes them without much overhead from variable lookup and the like.

That sounds more like what's being discussed here. Typically, you'd have a compiler phase that converts the source code into a tree of objects representing the program's abstract syntax (or a transformed version thereof, in the case of a language with macros). Those objects would be optimized for fast interpretation, with things like variable names converted to symbol references, and literal values converted to instances of the appropriate internal value type. An AST represented like this can be efficiently interpreted directly, without first converting it to some other form such as a linear bytecode, or native code.
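
Here's a tiny sketch of that technique, in the style of SICP's "separating syntactic analysis from execution" (discussed further down this thread): an analysis pass walks the tree once, resolves variable names to positions, and returns a closure; running the program is then just calling closures. The toy language and all names are invented.

```ocaml
(* Analysis walks the AST exactly once and returns a closure; at run
   time we only call closures, never re-examine the tree. Variables are
   resolved to array indices during analysis. *)
type expr =
  | Num of int
  | Var of string
  | Add of expr * expr
  | Let of string * expr * expr

(* names is the compile-time list of bound names, innermost first;
   the run-time environment is an int array indexed the same way *)
let rec analyse (names : string list) : expr -> int array -> int = function
  | Num n -> fun _ -> n
  | Var x ->
    let rec index i = function
      | [] -> failwith ("unbound variable: " ^ x)
      | y :: ys -> if x = y then i else index (i + 1) ys in
    let i = index 0 names in            (* name lookup happens once, here *)
    fun env -> env.(i)
  | Add (a, b) ->
    let fa = analyse names a and fb = analyse names b in
    fun env -> fa env + fb env
  | Let (x, rhs, body) ->
    let frhs = analyse names rhs in
    let fbody = analyse (x :: names) body in
    fun env -> fbody (Array.append [| frhs env |] env)

let () =
  let prog = Let ("x", Num 40, Add (Var "x", Num 2)) in
  let compiled = analyse [] prog in     (* analysis done exactly once *)
  Printf.printf "%d\n" (compiled [||])  (* prints 42 *)
```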

XML is concrete

Those notations would usually be considered concrete syntax, not the abstract syntax which the AST represents. Once you read an s-exp or XML tree into memory, you end up with something closer to abstract syntax, but it's still not quite what's being discussed here.

Agreed, and taking the opportunity, here's a concrete example of an AST being bad for interpretation:

Having written an interpreter of UML state machines (no hate mail, please :-) ), I discovered that even an AST is not always very handy for interpretation. The concrete syntax was XMI, whose in-memory representation would be a DOM tree or a stream of SAX events; the abstract syntax was the UML metamodel defined in MOF, whose in-memory representation was a Java object graph in terms of JMI (javadoc) and the Novosoft UML API (javadoc). And JMI was very unwieldy to interpret, mostly because of complicated navigation. I cannot say whether this is a problem with this specific AST (which it well might be), but I ended up defining custom Java classes to represent this subset of UML, and translating from JMI to them during load.

By the way, this example really has many "syntaxes", and I am having a hard time classifying them as either concrete or abstract: XML, XMI, the JMI API, the Novosoft API, and the custom API.

Designing ASTs for interpretation

the abstract syntax was the UML metamodel defined in MOF, whose in-memory representation was a Java object graph in terms of JMI (javadoc). And JMI was very unwieldy to interpret, mostly because of complicated navigation. I cannot say whether this is a problem with this specific AST (which it well might be), but I ended up defining custom Java classes to represent this subset of UML, and translating from JMI to them during load.

For greatest simplicity and best performance in interpreting an AST directly, you'd really want the AST's internal representation to be customized to the requirements of the interpreter. Otherwise, you'd end up effectively performing an extra internal translation phase, potentially at every execution step, from the AST representation to whatever the interpreter needs to execute. That would negate some of the benefits of compiling to an executable AST in the first place.
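
As a sketch of that load-time translation idea, with a generic node type standing in for something DOM- or JMI-like (all tags, types, and names here are hypothetical):

```ocaml
(* Translate a generic, navigation-heavy tree into custom types shaped
   for the interpreter, once, at load time. *)
type generic = Node of string * (string * string) list * generic list
(*                     tag      attributes               children    *)

(* interpreter-friendly form: one constructor per language construct *)
type stmt =
  | Assign of string * int
  | Seq of stmt list

let rec load = function
  | Node ("assign", attrs, []) ->
    Assign (List.assoc "var" attrs, int_of_string (List.assoc "value" attrs))
  | Node ("seq", _, children) -> Seq (List.map load children)
  | Node (tag, _, _) -> failwith ("unexpected node: " ^ tag)

(* execution pattern-matches on precise constructors; no string
   comparisons or generic navigation remain at run time *)
let rec exec env = function
  | Assign (x, v) -> (x, v) :: env
  | Seq ss -> List.fold_left exec env ss

let () =
  let doc =
    Node ("seq", [],
          [ Node ("assign", [ ("var", "x"); ("value", "1") ], []);
            Node ("assign", [ ("var", "y"); ("value", "2") ], []) ]) in
  List.iter (fun (x, v) -> Printf.printf "%s=%d\n" x v) (exec [] (load doc))
```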

BTW, that application sounds quite interesting!

Race condition :-)

I shamelessly edited my comment while you were replying, sorry about that.

For greatest simplicity and best performance in interpreting an AST directly, you'd really want the AST's internal representation to be customized to the requirements of the interpreter.

Which we did. It was just easier to translate JMI to our AST than XMI to AST, so we had to introduce an additional stage, basically CS -> AST1 -> AST2. Program transformations at work :-)

Another lesson: in another incarnation we generated Java code from the state machine model, which didn't improve performance much, but allowed us to examine the code, and was a bit friendlier to interoperate with from hand-written code (because of the more specific types in the generated code). That was the moment when I understood how inexpressive Java's type system is...

BTW, that application sounds quite interesting!

Do you mean executing UML models? :-)

SICP, CTM, ...

You might want to read the SICP chapter on interpretation. IIRC, at the end it makes the step from simple interpretation to an interpreter that pre-constructs executable objects (closures; in Java you could use objects, I think) and later executes them without much overhead from variable lookup and the like.

I was planning to get SICP at some point; might as well get it now. Btw, what about a books section on LtU to complement the existing material linked to by LtU?

It could hold references to CTM, TPL, SICP et al.

SICP 4.1.7

Separating Syntactic Analysis from Execution is the section you are looking for. I actually used it in class last Thursday...

We will put up a book page when we set up the LtU-Wiki...

Getting Started thread

Btw, what about a books section on LtU to complement the existing material linked to by LtU? It could hold references to CTM, TPL, SICP et al.

This Getting Started thread provides quite a few links of that kind.

Neko AST

I'm currently working on a dynamically typed intermediate programming language called 'Neko'. I have both an AST interpreter written in OCaml and a bytecode-driven VM written in C. There is around an 8x speed improvement when running the bytecode. Of course, the interpreter is very trivial and stores variable bindings in hashtables, while the VM is stack-based with one register.

(More info at http://ncannasse.free.fr.)

Beware of early optimization

Most of the comments so far seem to warn about the negative impact of direct AST interpretation on performance. But beware of early optimization. The AST form is much easier to tweak during development, exactly because it's abstract. Before you can serialize the tree, you have to settle many design decisions, like:

  • Do you evaluate a function (i.e., a tree node) before or after its arguments (i.e., the node's subtrees)?
  • In what order do you evaluate sibling subtrees?
  • How do you encode loops and conditionals, which aren't linear, in the serialized form?

In fact, the performance of linear encodings is better precisely because you make those decisions early, so you can optimize your bytecode sequences with them in mind. However, you can't make them correctly until the semantics of your language is settled. For example, you can decide to generate code for argument evaluation before the function call only if the language has call-by-value semantics. If you keep your interpreter abstract, changes in semantics will be much easier to accommodate.
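
As a toy illustration of how cheaply the abstract form absorbs such a change: in the sketch below (all names invented), switching between call-by-value and call-by-name is a single flag in the application case, whereas a bytecode compiler would have baked the argument-evaluation strategy into every call site.

```ocaml
(* Toy lambda-calculus interpreter where the evaluation strategy is one
   decision in one place. *)
type expr =
  | Num of int
  | Var of string
  | Lam of string * expr
  | App of expr * expr
  | Add of expr * expr

type value = Int of int | Clos of string * expr * env
and env = (string * value Lazy.t) list

let by_value = true                      (* flip for call-by-name *)

let rec eval env = function
  | Num n -> Int n
  | Var x -> Lazy.force (List.assoc x env)
  | Lam (x, body) -> Clos (x, body, env)
  | Add (a, b) ->
    (match eval env a, eval env b with
     | Int x, Int y -> Int (x + y)
     | _ -> failwith "type error")
  | App (f, arg) ->
    (match eval env f with
     | Clos (x, body, cenv) ->
       let thunk =
         if by_value
         then (let v = eval env arg in lazy v)   (* evaluate now *)
         else lazy (eval env arg)                (* evaluate on first use *)
       in
       eval ((x, thunk) :: cenv) body
     | _ -> failwith "not a function")

let () =
  (* (fun x -> x + x) 21 gives 42 under either strategy *)
  match eval [] (App (Lam ("x", Add (Var "x", Var "x")), Num 21)) with
  | Int n -> Printf.printf "%d\n" n
  | _ -> ()
```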

Since you say your goal is to implement a subset of Oz (and not another variant of Java), my guess is that you're going to play with the semantics as you go, and you're probably better off interpreting the AST until your language settles down.

Metaprogramming with ASTs

Another thing you should ask yourself is whether your language might eventually support metaprogramming in any flavor stronger than Java-style reflection. If so, having support for interpretation or JIT compilation from ASTs will make that task much easier. I'm not sure there even is any other way to support run-time construction and execution of code. Lisp's S-expressions are ASTs and nothing more than that.
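
To illustrate with a throwaway sketch (invented toy language): if the runtime interprets ASTs, then run-time code construction is just building a tree and handing it to the interpreter, with no separate code-generation path.

```ocaml
(* A program that builds a program: metaprogramming against an
   AST-interpreting runtime. *)
type expr = Num of int | Add of expr * expr | Mul of expr * expr

let rec eval = function
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b

(* construct, at run time, the AST for 1*1 + 2*2 + ... + n*n *)
let sum_of_squares n =
  let rec build i =
    if i > n then Num 0
    else Add (Mul (Num i, Num i), build (i + 1)) in
  build 1

let () =
  Printf.printf "%d\n" (eval (sum_of_squares 10))   (* prints 385 *)
```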