Implementing fast interpreters

The WebKit community announced SquirrelFish last month, showing significant performance improvements.

The post has some links LtU readers may find of interest (such as The Structure and Performance of Efficient Interpreters, by M. Anton Ertl and David Gregg, November 2003).

I am thinking of building a Scheme interpreter using a (TR) SECD machine, and I would be most grateful if any LtU readers would be so kind as to provide some additional links about improving interpreter performance.


Don't have time to dig up links

Don't have time to dig up links, but here's my list; Google will help you find them:

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters

Context Threading, by Florien Fest

The Case for Virtual Register Machines

YETI: A Gradually Extensible Trace Interpreter

Effective Inline-Threaded Interpretation of Java Bytecode Using Preparation Sequences

A Portable Research Framework for the Execution of Java Bytecode

A No-Frills Introduction to Lua 5.1 VM Instructions

The Implementation of Lua 5.0

Thanks

for the list of links. I'll dig them up and post them here afterwards.

The last two

The last two, about Lua 5, are very good. The instruction design in Lua 5 is very interesting. Just to mention, it was these two articles that convinced me to move my language implementation from AST-based to virtual-machine-based.

tail-recursive SECD

If you want an abstract machine for Scheme, you might be better off with a CEKS machine instead of trying to shoehorn tail recursion onto the SECD machine.

That said, I recall John Ramsdell has a paper on a tail-recursive SECD machine, which might be what you're talking about.

Thanks for the link

The fact is that I like the SECD machine :-) Thanks anyway.

And yes, I was talking about Ramsdell's paper; LispMe Scheme is implemented using it.

Shouldn't this be under "discussions"

and not "site discussions"? This is content, not meta.

My fault

I've moved it to LtU forums, my apologies.

Queinnec's book LiSP (Lisp

Queinnec's book LiSP (Lisp in Small Pieces) goes from the trivial interpreter to more advanced techniques and finishes by presenting bytecode and "Scheme to C" compilers. I don't think it is online, though.

I own it!

Thanks Pablo.

I acquired it recently on Amazon. The price (used, $15) does not match its value! It's a great read.

Peter Henderson's Functional Programming: Application and Implementation is a good starting point as well.

Virtual Machine Showdown

http://www.usenix.org/events/vee05/full_papers/p153-yunhe.pdf

Bizarre statements...

So, starting at the “x” part of the expression “f((x) + 1)”, a variable node “x” would return x to a parentheses node “(x)”, which would return x to a plus node “(x) + 1”. Then, the plus node would return (x) + 1 to an argument list node “((x) + 1)”, which would copy that value into an argument list object, which, in turn, it would pass to the function node for f. Sheesh!

Why would there be parentheses nodes in the syntax tree? There isn't any need for them.

In a typical compiler, conversion to bytecode is just a means to an end, not an end in itself. The purpose of the conversion is to “lower” an abstract tree of grammatical constructs to a concrete vector of execution primitives, the latter form being more amenable to well-known optimization techniques.

Bytecodes are more amenable to optimization??? Now, I suppose if the dragon book is the only compiler book you've ever read, maybe you'd think that peephole optimization is the only game in town. However, the compilers I'm most familiar with do most of their heavy lifting well before that.

Re: Bizarre statements

1. One of the payoffs from bytecode (aka token threaded code) is cache and memory locality. ASTs tend to be spread out over the address space, making for more cache misses. The same can be said for other forms of threading. Ertl, Gregg and others make a pretty strong case for it.

2. The other key to efficient interpreters, besides cache and memory locality, is making it easy for the processor to predict branch targets. I think that's also a consequence of doing threading right.

I would agree that bytecode

I would agree that bytecode interpreters have more opportunities for performance than syntax tree walking interpreters, and that they tend to be faster.

However, what I took the paragraph to mean was that the best time to perform optimizations is after the compiler converts a program to bytecodes, which isn't really true. Most compilers do get a lot of leverage out of such optimizations; however, they also tend to do quite a bit of work on various intermediate representations that are still pretty far removed from a linearized bytecode or machine-code format.

Once you've made it all the way to a mostly linearized representation, you've lost quite a bit of information about the program that makes it *much* more difficult to perform certain interesting optimizations, such as say, closure optimization, or trying to convert indirect calls into direct calls.

Locality

1. One of the payoffs from bytecode (aka token threaded code) is cache and memory locality. ASTs tend to be spread out over the address space, making for more cache misses.

Isn't this just a simple implementation issue? For example, just use a custom pool allocator for the AST.

A different way you might suffer from more cache misses is if your AST is less compact than your bytecode representation.

As ASTs frequently contain pointers...

they are frequently less compact than bytecode, whose default-sequential layout eliminates child-node pointers in most cases.

Oberon/Juice AST compression

The Oberon/Juice AST compression technique comes to mind.

It beats Java bytecode, BTW!

Juice is not an interpreter

Yes, but conflating Juice with an AST-walking interpreter is a mistake of monumental proportions. The front end is little more than a parser that spits out a cleverly encoded abstract syntax tree, which in turn is compiled to machine code. The CPU is just a bytecode interpreter. (Well, actually, these days CPUs are often compilers to yet another, invisible, bytecode format, but that's not overly relevant here.)

I never figured out exactly what the Juice front end did, other than parse, validate, and compress the abstract syntax tree, and remove variable names. My impression was that there wasn't much in the way of program transformation or optimization done up front, and that the back-end essentially had the original program at its disposal.

All this is quite useless for implementing JavaScript, as JS is transmitted as its "original" source. (Ignoring the fact that the JavaScript might be compiled from Ruby or whatnot.) Now, if you can get the world to change, you could use this technique to speed up download times, but unless you are Microsoft, good luck with that.

TinyScheme is pretty nice

It takes some head scratching, but the TinyScheme implementation uses a modified SECD-style machine, and it is surprisingly tight.