Have tracing JIT compilers won?

I've been watching the interpreter and JIT compiler competitions a bit in the JavaScript and Lua worlds. I haven't collected much organized data, but the impression I'm getting is that tracing JITs are turning up as the winners: sometimes even beating programs statically compiled with GCC. Is there a growing body of evidence that tracing JIT compilers are the horse to bet on? Will a more static-style JIT like Google's V8 be able to keep up?

Thanks,
Peter

[I promoted this item to the front page, since the discussion is highly interesting & informative. -- Ehud]

Betting on JIT

There is no clear 'winner', especially given the variety of issues other than performance that one should consider (runtime code distribution (modules, extensions); upgrade of code at runtime; security and safety analyses; memory overhead; realtime considerations; etc.). I doubt there will ever be a clear winner.

But I would consider runtime compilation to be a good bet for many languages. Its performance is competitive. If combined with persistence and automated code-distribution features, allowing reuse of code over long periods of time, one might even benefit from supporting computation-intensive optimizations in the JIT compiler.

It is unclear to me why you'd bet on just one thing, though. I would want to compile based on the circumstances in which the code is acquired and used.

MSR's SPUR: A Tracing JIT for CIL

You should find this interesting, certainly on topic: http://channel9.msdn.com/shows/Going+Deep/E2E-Tracing-JIT-and-SPUR/ Best, C

mozilla

Mozilla's best guess is apparently that a combination of the two techniques ultimately wins:

http://hacks.mozilla.org/2010/03/a-quick-note-on-javascript-engine-components/

which, if I understand it correctly, announces their plan for a very coarse-grained JIT for Javascript (directly generating native code instead of bytecode), combined with a tracing JIT that looks after hot-spots in the resulting code. Which kinda makes sense: the overhead to native code shouldn't be much greater than to bytecode in the initial translation - yet you should get some good payback. And then tracing magic lazily optimizes some paths based on execution feedback.

Native Code Tracing?

In Andreas Gal's dissertation on tracing, he describes tracing inside a bytecode-interpreting VM. That seems reasonable, as the VM is software and so can record/trace the bytecode instructions it executes.

How can native code execution be similarly monitored to record execution traces?

Profiling

One compiles native code with a few extra instructions (inside functions, conditionals, loops) to track usage and trigger a more intensive optimization of code regions that see a lot of use. That is, you don't trace native CPU instructions directly, but rather trace code regions via the profiling instructions. You optimize from the original source, or from an intermediate representation. This also works for bytecode; one can compile source to profiled bytecode that traces itself and triggers more intensive optimization... this avoids mucking with the VM's basic interpretation loop.
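
A minimal sketch of what such self-profiling compiled code could look like, in C; the names (profile_tick, recompile_hot_region) and the threshold are invented for illustration, not taken from any particular system:

#include <stdint.h>

#define HOT_THRESHOLD 1000   /* arbitrary hotness cut-off for this sketch */

typedef struct {
    uint32_t hits;        /* how many times this region has been entered */
    int      optimized;   /* has the heavier optimizer already run?      */
    void    *source_ir;   /* original source or IR to reoptimize from    */
} region_profile;

/* Hypothetical hook into the optimizer; stubbed out so the sketch compiles. */
static void recompile_hot_region(void *source_ir) { (void)source_ir; }

/* Conceptually emitted at each function entry, loop header, or branch. */
static inline void profile_tick(region_profile *r)
{
    if (!r->optimized && ++r->hits >= HOT_THRESHOLD) {
        r->optimized = 1;
        /* Reoptimize from the source/IR, not from the native instructions. */
        recompile_hot_region(r->source_ir);
    }
}

The same shape works for the bytecode case: the compiler emits the tick at region entries, so the VM's dispatch loop never has to change.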

I remember reading about this somewhere... can't recall where, though.

Profiling

The idea of inserting profiling instructions into the native code makes sense. I imagine that to get the exact traces that Gal's tracing VM achieves it would be necessary to insert a profiling instruction at the beginning of each "basic block" of code as he calls them.

If you do happen to remember where you read about self-tracing code please mention it here. I'll be looking for similar information.

Tracing x86-64

One of my pet projects for the last year and a half has been working on a tracing jit for x86-64 assembly.

If you are willing to assume that the code in question follows the x86-64 ABI, and require explicit opt-in for the JIT points, you can get pretty far without requiring annotations at all. However, you lose some of the nice properties of tracing, in that your loops are no longer strictly superblocks with side exits, but may contain epicycles caused by looping through third-party code that doesn't opt in to the JITting strategy.

The other tricky part is how to get this to interact well in a multi-threaded environment. Since you're "gambling pounds to win pennies" by jitting such simple opcodes, you need to have very low overhead for getting in and out of a trace, so I've needed to use some incredibly loose consistency guarantees in order to be able to collect dead jitted code fragments.

Depends on domain

Maybe for general purpose systems tracing JITs are fast enough. Until someone shows impressive numerical results using a tracing JIT there will still be at least one domain where offline computation is the norm. What I think you see here is that the amount of computation you can usefully throw at optimising "integer" code hits a point of diminishing returns quite quickly, so that tracing JIT is effective here. This isn't the case with numeric programs AFAIK where you can really grind to optimise memory access patterns and concurrency.

That said, I think the idea of adaptive optimisation is very promising. The old Dynamo project (which became Transmeta IIRC) showed that online optimisation could get equal performance to GCC's -O4 (profile directed) optimisation level. This was working on raw machine code -- so only low level optimisations were feasible. The MILEPOST compiler shows what you can do with more feedback in the offline setting. The next step for JIT IMO is to save information for offline / idle processing. I'm not aware of any work that does this but it should be an easy win.

Dynamo and MILEPOST

The old Dynamo project (which became Transmeta IIRC) showed that online optimisation could get equal performance to GCC's -O4 (profile directed) optimisation level. This was working on raw machine code -- so only low level optimisations were feasible.

Just to be clear, the more general idea here is that field programmable gate arrays can be used to efficiently implement Turing's connectionist networks in hardware.

The MILEPOST compiler shows what you can do with more feedback in the offline setting. The next step for JIT IMO is to save information for offline / idle processing.

The next step is actually moving optimizations into a version control scheme*, and distributing optimizations. Think WAY bigger than you are thinking right now. There were glimpses of this idea in the 1980s, when some researchers were looking at ways to harness parallel computing machines (before they became widespread), and one of these research projects focused on using neural networks to do optimizations. If you abstract out the neural net and focus on conflict-serializability of optimizations, then you've got a killer app that a company like Google should be building. But Google thinks small and pragmatic (e.g., Go, GWT, Gentoo for ChromeOS, etc.), not big and potentially revolutionary.

* Technically, hotspot optimizing compilers already implement a form of revision control (literally, revising a string). They simply do not implement branching versions (morphing a string and creating simultaneously active but different strings). Also, schemes with hardware change control management are typically known as Product Data Management (PDM), whereas software-only change control management schemes are called Software Configuration Management (SCM).

Revision control for optimizations

The next step is actually moving optimizations into a version control scheme*, and distributing optimizations.

You may be interested in this paper: Equality saturation: a new approach to optimization.

As Thomas Lord mentions, this is about modeling compiler optimizations as an actual search/optimization problem, instead of applying optimization passes sequentially in a substantially ad hoc way.

Seen it

Equality saturation should only be a part of the picture. Equality saturation is about integrity.

I also disagree with you that Thomas was suggesting modeling compiler optimizations as an actual search/optimization problem; that doesn't match what I've read of Thomas's ideas in the past, such as A Case for Gestures/Visualizations and Against Concrete Syntax. In this example, Thomas actually discusses the idea (in a round-about way*) of making optimizations a service for the programmer to plug into. Thomas gives the example of choosing between a number of implementations. Usually, though, a compiler does not know which one is preferable. It is up to the programmer to decide. Due to the current medium of expressing changes to code (file edits), we cannot really capture these design trade-offs.

The problem is in change control management, and as I indirectly noted above, it exists for both hardware and software folks. I figured Thomas was zero-ing in on this because of his background in SCM and version control as the creator of GNU Arch (he also started that aforementioned thread after a discussion on LtU about how much subversion pretty much blows).

The programmer should be able to automate solving scenarios like, "I've got 16KB of memory, and I have to stuff these requirements into a program for a real-time system. How do I do it?" Well, if that's the case, it doesn't necessarily make sense to choose quicksort. So what algorithm do you choose? Well, today, you probably don't even choose an algorithm. You implement it yourself. Then you have to go through the tedium of seeing how much memory your compiler allocates for that sorting routine, and how well it performs. Do you expect equality saturation to handle that? I don't. I expect the programmer to know he has a specific problem to deal with that the human brain needs to get immediately involved with to solve it effectively. Current compilers wait way too long to let human brains into the picture. If you know what your requirements are, then you shouldn't let a compiler architecture dictate your development process to you!

*I would argue Thomas's model example for trade-offs is solving the problem at the wrong level of abstraction. But he has the right idea.

"to see and be seen"

Right, Z-Bo. That's kind of neat, for me, because you synthesized various things I've remarked on in a way that leaves me saying: "Yup, that's about how I see it."

I think that projects like JaegerMonkey, extrapolated, go in the direction of the AI-search kind of translation system. And, I can't prove that that's intractable but I can point to reasons why it's implausible. I'm not sure Mr. Eich is right that it's a long game of chess. I think it might be an infinite game of "construct a rational number lower-bound approximation of .7 that is less than .7, within the time limits of your turn, and that is greater than any approximation your opponent constructed in earlier turns. The game ends when a player succeeds under those rules at constructing .99999" Yeah, that competition is converging on the AI-search approach and, yeah, I doubt that that's a winning strategy.

And, I agree that the alternative to AI-search using a toolbox of translation tools is human-directed application of the tools. Search has a place as a tool in the toolbox (e.g., "super-optimizers") but it takes the situational awareness of people to decide which of these imperfectly understood tools to apply to what, and when.

The C example might be solving the problem at the wrong level of abstraction, sure. All I mean to highlight about early C is that here was a language translation tool that under nominal operating conditions gave highly foreseeable, easily reproducible results. It shouldn't be as hard as it is made by existing interop standards to write an efficient FFT that is portable across browsers, for example - and fancifying Javascript implementation techniques along current lines won't fix problems like that well.

Right, Z-Bo. That's kind of

Right, Z-Bo. That's kind of neat, for me, because you synthesized various things I've remarked on in a way that leaves me saying: "Yup, that's about how I see it."

For what it is worth, that is pretty much how David Barbour and I see it, too. See the c2 wiki RethinkingCompilerDesign and CompileTimeResolution. The only difference between the two of us and David is that David is heavily devoted to the wiki concept, because he also wants tight integration with collaborative editing. But that difference isn't important for this discussion.

In the same vein

In the same vein you might be interested in a paper I did a couple of years ago called Program Interpolation. The system that we describe in that paper is a very simple sketch of an idea, but it shows what happens if you make a type of strong equality guarantee between the pieces of the program that you are trying to optimise. The large sets of enumerated programs would then be your candidate optimisations within some VCS (we never looked at it that way) and the paper shows how to build explicit search problems from the optimisation. For people without pay-wall access there is another version here, although I'm afraid I can't remember if there are any mistakes in that draft.

Never seen your paper

Never seen your paper before, thanks for sharing. I'll have a look. I should also dig up the reference to neural network compilers from the '80s so people realize the machine learning ideas in MILEPOST are nothing fundamentally novel, just stuffed into an ancient, non-modular code base (gcc).

The large sets of enumerated programs would then be your candidate optimisations within some VCS (we never looked at it that way)

I think a good question would be: in how few lines of code can you implement Git's core module using something like VPRI's Pepsi/Cola/OMeta?

Compiler tuning

You might be interested in this paper then too: COLE: Compiler Optimisation Level Exploration - CGO 2008. Disclaimer: these guys are my colleagues :-). We did a follow-up paper for this year's CGO titled Automated JIT compiler tuning. You can find the abstract and a preprint version here.

Different types of optimization gains

This isn't the case with numeric programs AFAIK where you can really grind to optimise memory access patterns and concurrency.

When I see results showing that LuaJIT sometimes beats programs compiled with GCC, it makes me think there must be optimizations that can be made in a tracing JIT that cannot be made by static analysis.

different types of ....

It seems to me that:

In a simple static compiler, the compiler would have to make good guesses about which paths to optimize tracing-style (or to generate unrealistically large amounts of code and conditional branches that cache the dynamically most likely prediction).

A profiling static compiler can make better guesses about what paths to optimize. It can essentially answer the question: what would a native-code tracing JIT's state be after this profile?

A real tracing JIT can beat all of that in the case where most runs have dominant paths, but there are many possible paths, and which paths will be dominant varies widely across runs.

Conversely, a tracing JIT can't handle hot loops with many branches, none of them dominant, as well - while static or method-JIT approaches can do a better job. This is apparently the experience of Mozilla and one of their motivations for this new work:

http://blog.mozilla.com/dmandelin/2010/02/26/starting-jagermonkey/

It's also worth noting that in some cases doing no optimization whatsoever produces the most desirable performance. For example, for very simple code, naive interpretation can win for speed. In general, naive interpretation can win for latency (the amount of delay before execution commences), which is sometimes desirable (e.g., in the presence of EVAL).

So it seems like ultimately, at least conceptually, you have a discrete spectrum of composable code transformations available, and a real-time search problem. At one end, you simply parse the code and directly implement a syntax tree (or s-exp if you're into that kind of thing). You have a wealth of possible compiler passes that can optimize trees and also take them to other forms (against which further optimization passes can be applied). You have various interpretation strategies for many of the intermediate forms along the way. You have various ways of generating native code (or FPGA configurations, etc.) from some of these intermediate forms.

If we think of execution as comprising everything from the submission of source code through an actual "run" of the program, and *if* (big if) a system operator can characterize what counts for this execution as optimal performance, then you have a big search problem - deciding which allowable code transformation passes, interpretive strategies, or native code generation to apply when in order to maximize the performance score along the imposed metric. Necessarily, in many cases, you want the system to play that search game during the run of the program, using feedback about which of the potentially "further optimizable" parts of the code are most harming the performance score in a given run.

It would, I think it's safe to say, be very difficult to engineer that search directly in a clean and elegant way. Perhaps someday people will have worked out enough of the math and done enough of the puzzle solving that you can just declare all of these passes in nice clear ways, let the system help identify the stochastic performance impacts of various passes, and let a meta-translator turn all of these nice clean declarations into static front-ends and gnarly code-generating run-time systems. It seems like that, if it ever happens, is still a long way away.

And an arguable weakness of the entire approach is that it depends on (a) there being a good way for the system operator (or context otherwise) to characterize the desired performance goals; (b) a search algorithm that scores reliably high on achieving those performance goals. It might never be possible. We might instead, permanently, have "touchy" systems where you rearrange your code in seemingly innocent, minor ways - and the resulting performance scores vary wildly in baffling ways.

The very opposite of the entire approach is exemplified by languages like C in the early days: don't worry about generating *optimal* code (that's why human-friendly assemblers exist) -- worry about generating pretty good but above all *predictable* code.

With a language like that, at least you can build something robust and have an easier time convincing yourself and others that it is robust in just the ways you say. No surprises.

I think that the C-style approach fell out of favor for two reasons: (1) modern VM systems and processor architectures undermine the very ability of a compiler to generate code of predictable performance; (2) the economic history of the big systems that drive much of the visible PL research lowered the bar, relative to C's early days, of what kind of robustness programmers were supposed to aim for.

Concerning that economic history, here's an example:

http://vimeo.com/8525101

Here we have a sweet, impressive demo showing what Firefox's Javascript is capable of in terms of running an FFT algorithm and a graphics front end. Nicely, nicely done.

But think about what is going on there: We're going through many, many layers of highly complicated Javascript implementation just to get that performance. And it's flaky: combine it with something that stresses the GC or try to run it on another browser (assuming that other browser also exposes audio buffer raw data to Javascript) and the results won't be as impressive. Indeed, in the next round of that experiment, they've moved the FFT into a C++ patch to the browser:

http://vocamus.net/dave/?p=974

Why don't browsers have, perhaps in addition to a kitchen-sink language like Javascript, several other high-level, tiny, safe languages for particular domains - and ways to combine these in a single application? It would be relatively easy to make a safe, high-level, tiny language suitable for such things as computing FFT in a portable way but we didn't. And we didn't because of the history of how Javascript gained its position, and the history about how advanced browser application development became a largely empirical affair (each year, a zillion new apps are implemented - these are then "tested" on lots of browsers - some of them seem to be fairly robust, performance-wise and others don't. Lather, rinse, repeat.)

The C guys had the better idea. And the harder sales problem.

JägerMonkey

Wow! If Mozilla is implementing a method-based JIT compiler after so much energy invested in TraceMonkey and its trace-based JIT compiler, then that is quite an admission that the trace-based approach is not sufficient. It would be interesting if the V8 group adds a trace-based JIT next.

Long Chess Game

V8 may not do tracing soon, but Android (Dalvik) is doing a straight-on trace-trees JIT based on Andreas Gal's thesis.

Trace-JITting has greater headroom than less specialized JITting on numerically-intensive, type-stable loops. Tracing makes variables constant on trace, with trivial dominance computation for all the traditional loop optimizations with none of the full SSA construction overhead. But tracing has its drawbacks, in spite of the greater headroom. We live in a world of trade-offs.
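
As a rough illustration of what "constant on trace" and guard-based specialization mean in practice (the value encoding, guard, and side exit below are invented for this sketch, not taken from TraceMonkey or Dalvik):

#include <stdint.h>

typedef struct { uint8_t tag; int64_t i; double d; } value;
enum { TAG_INT, TAG_DOUBLE };

/* Hypothetical escape hatch back to the generic interpreter; stubbed here. */
static void side_exit_to_interpreter(int exit_id) { (void)exit_id; }

/* Compiled trace body for something like: for (k = 0; k < n; k++) sum += xs[k];
 * recorded while every element happened to be a small integer. */
int64_t trace_sum(const value *xs, int n)
{
    int64_t sum = 0;
    for (int k = 0; k < n; k++) {
        if (xs[k].tag != TAG_INT) {      /* guard: type assumed stable on trace */
            side_exit_to_interpreter(1); /* fall back to generic execution      */
            return sum;
        }
        sum += xs[k].i;                  /* unboxed integer add, no dispatch    */
    }
    return sum;
}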

So beware "Wow!" reactions. We JS implementors are in the market for the long haul. While the best techniques will rise to the top and win out, this process will take time. There won't be a quick victory for only one approach. Many dynamic and even static analyses are conceivable, and more are feasible over time. It's early days still.

/be

spinning your questions around

Why don't browsers have, perhaps in addition to a kitchen-sink language like Javascript, several other high-level, tiny, safe languages for particular domains - and ways to combine these in a single application?

Good question. VPRI's Ian Piumarta argues all you really need are 3 objects and 5 methods - not 'several other high-level, tiny, safe languages for particular domains'.

Google NaCl

If Google's Native Client (NaCl) takes off, it sounds like a practical solution to the problem of unpredictable Javascript performance. It sandboxes native code (x86 currently, ARM coming Real Soon Now) and runs it on the client. With that, you wouldn't need a special C++ FFT patch in the browser; you could write the FFT in C++ or whatever language you prefer, and stick that on your web site the same way you link to Javascript files.

Considering its potential, and the fact that it's open source and backed by Google, I'd bet on NaCl or something like it being available in all major browsers five years from now, except Internet Explorer. (Wouldn't it be great to be able to write client-side web code in whatever language you like? I would love that.)

Not Portable

The initial NaCl implementation relied on x86 features (segmentation) to achieve memory reference sandboxing. Segmentation is effectively obsoleted even in x64. So NaCl is an interesting prototype, and the question is: how do they take that idea and realize it in a way that is suitable for a broader class of architectures.

Caveat: This is based on an old discussion with Brad. The NaCl team may well have solved these issues over the last two years. I haven't tracked.

Looks like they've already tackled that

They made it a bit hard to find, but they have a paper on how they adapted their approach to x86-64 and ARM here: http://nativeclient.googlecode.com/svn/data/site/NaCl_SFI.pdf

Agreed, but

I agree. A tracing JIT has much more information available, but it has much less time to do anything with that information.

By "really grind" I'm think of the kind of optimisations Single Assignment C (http://www.sac-home.org/) does on numeric programs. This is stuff like solving linear programs to work out sets of array indices, and then rearranging access patterns to make use of cache locality. Very few static compilers do this kind of optimisation. I don't know of any tracing compiler that does anything approaching this.

Automatic parallelization.

No amount of trace information can prove to you that your loop has no dependencies.

So, I see no benefit in trace optimizations in the light of multicore execution.

I don't understand

Your objection is vacuously true, which is probably why you see no benefit.

Trace-based execution models are orthogonal to building a dependence-based graph model of the program. In fact, the two could potentially be profitably combined for really expensive executions. What you are in effect doing is branch prediction in software, at the compiler level. Hardware does this too; Intel models pipeline depth and so on based on the software most likely to run on EMT64 and Itanium64. [Edit: Back when Intel had less stable branch prediction logic, game developers would handroll assembly routines to rearrange branch orderings to avoid pipeline stalls. These game developers were in effect "human compilers", just very monolithic with respect to the source text they operated on.]

"Multicore" in and of itself doesn't matter. If all of these Multicores were to become connection machines, then that might be different, as that would completely blow up the notion of a vonNeumann-inspired MIPS-like pipeline entirely. This is why it will eventually become important for hardware and software to be modeled together for maximal efficiency. But the Department of Energy is already proving this with the national weatherstation grid and its predictive modeling supercomputers.

Trace-based execution models

Trace-based execution models are orthogonal to building a dependence-based graph model of the program. In fact, the two could potentially be profitably combined for really expensive executions.

I trust the Fortran and SAC developers. They use profiling information, I agree, but no trace-based JIT compilers. I don't see why complexity estimation for parallelization cannot be done using just program analysis.

Okay, my point of view: trace-based JIT compilers are a solution to a problem that shouldn't have existed in the first place.

To see what I mean, just think what could be done if we had Haskell instead of Javascript - with all of Haskell's current glory: rewrite rules, par/pseq annotations, even nested data parallelism and the supercompilation that is coming soon.

Not Really

Tracing JITs solve the problem of getting decent performance and low latency. That is, the code starts running soon after it is downloaded and at the same time it runs at a decent speed. Haskell (presumably you mean GHC) has no solution to this problem. It is an offline compiler, and shipping around binaries is clearly infeasible on the web.

I missed the "low latency"

I missed the "low latency" here, thank you for that.

I still do not agree that trace-based JIT and multicore will play together nicely.

Justification?

Since Java bytecode and CLR bytecode are binary formats, this is clearly a silly assertion. I assume that you are trying to say something more specific here, perhaps about hardware specific binary formats.

Even that may or may not be true. There is some reason to think that x86 assembler, when subjected to suitable constraints for static safety checking, may be just as portable and safe as Java bytecode. See, for example, MiSFIT. I continue to be surprised that nobody has combined MiSFIT with a suitable dynamic translation technology to recover performance.

Propagation delay

Serguey,

Interesting thoughts, but I think my rebuttal is similar to Noel's. In any communication protocol where the speed of light is your bottleneck, propagation delay will bring the interactive responsiveness of end points to its knees. Roy Fielding, in his Ph.D. thesis, discusses tricks you can use to minimize this delay, such as caching and decoupling the server from the client. Caching allows you to use connectors such as proxies to minimize the execution path of a request. -- That is a powerful idea, and should be applied to the way programmers build distributed systems much more so than it is right now.

Haskell's model, where top-level definitions must be topologically sorted before the interpreter can do any real work, simply adds additional propagation delay. Now, you might say, the solution then is to simply encode Haskell programs into a binary wire format.

Now, members of the Internet Explorer 8 team, such as Pratap Lakshman and Allen Wirfs-Brock (one of the first Smalltalk-80 compiler writers), as well as Chris Wilson, are campaigning for a new, different way to do the Internet. Microsoft in general has been doing really interesting stuff here. Just last week Microsoft released JSZap, and the researchers who wrote it are a who's who of big shots at MS Research. JSZap actually helps with the propagation delay problem by transporting compressed JavaScript ASTs rather than compressed JavaScript. What Brendan Eich and the folks at Mozilla and places like Opera think about this, I don't know. (There's also the Microsoft Gazelle Browser and Xax, but I won't go into deep detail here on those).

There are so many ways the browser is completely wrong architecturally, though, that I don't have space here to go into it. If you understand Smalltalk, then you understand much better what a browser, and its inspectors (e.g., "View Source", "FireBug") and debuggers (e.g., "console.alert, console.warn") should look like - although I feel even Smalltalk's model has some elements of monolithic structure to it.

Why the divergence about Browser architecture and how inspectors and debuggers should work? Because I claim it's part of the problem that it is so hard to increase efficiency. And people today are building operating systems on top of ECMAscript and by layering various mime-types the Browser supports. This is seriously stuff that should only be interesting if you love That '70s Show, because that's where all these bad ideas about architecture come from - the 1970s.

For what it's worth, Alan Kay has probably made the most condemning criticism of the Web Browser to date, in his 1997 OOPSLA speech The Computer Revolution Hasn't Happened Yet. He referred to Netscape and Microsoft as having "a complete non-understanding" of how to build complex systems. The funny thing is, since that talk, only Alan Kay and his VPRI group have done anything to advance the major architectural criticisms he laid down in that talk. For example, Alan's principle that every object should have its own IP address and its own URL. While unfortunately he was referring to IPv4 at the time, the general idea can be seen in the pseudo-tunnel interface in VPRI's STEPS project and Ian Piumarta's TCP/IP architecture written in only 187 lines of code. -- If you don't grasp these points, then you are missing out on arguably the most interesting computer project in the past 30 years.

The sad part about Microsoft's solution, putting the DLR on top of the CLR, is that it is ultimately a hack. -- Mike Pall has hinted as much in this thread, leaving it up to the reader to make their own judgment about architectures like these, but he was referring to JS tracing JITs. Apart from Pall's performance criticisms, it is also not clear to me that the CLR is any longer about multi-language support. With the DLR, I am not sure I understand how it is about language interoperability. The DLR is really confusing.

The DLR is really really

The DLR is really really nice (meaning, the dynamic code generation part via System.Linq.Expressions). I'm using it a lot for my current work and it...just works as expected with decent performance. I wish it supported a way to generate classes and structures more easily, but I'm very happy with what they did. It is too bad that team doesn't exist anymore, they did good work!

What about Roslyn, is that a

What about Roslyn, is that a good replacement?

Roslyn is a static

Roslyn is a static compilation framework. It doesn't allow you to compile lambdas on the fly...it definitely doesn't allow you to embed runtime values from the compiling environment in those lambdas :). The DLR is quite quick also...they fiddled with the CLR I guess to get there.

It is quite raw, nothing as nice as a C#-ish template language that we dream about, but for language implementation it is pretty much the right level of abstraction.

No it's not...

Post your example code and I will show you much cleaner code.

Edit: To elaborate, they rewrote the Expression API namespace in .NET 4, to help with the shortcomings of .NET 3.5's initial Expression API. As the basis for a compiler, it's actually somewhat limiting, unless you are using it directly to write expression trees in CIL.

If you're using C#, there is a much better way to simplify the hoariness of .NET 3.5:


var truthiness = Expr.Create((Boolean r) => true);

This is nearly as powerful as F# bunny earmuffs.

This allows you to automatically bootstrap yourself using the underlying language (C#)'s ability to convert lambdas into expressions.

And there are more tricks for writing nicer C#.

It really depends on what

It really depends on what level of abstraction you are looking for. For implementing languages, the Expression Tree API is quite capable and doesn't bother with pseudo macro abstractions that you probably can't use very much anyways (I don't bother with C# lambdas if I don't need closures, preferring instead to use reflected static method). If you want something higher level, you can write a layer above it (as you've done and I did with Bling).

Type safety is purely at code generation time, which can be a bummer but in practice works fairly well.

Escape analysis

If your dynamic language supports copy elision and other rather simple optimizations, you will want more abstraction than just a simple syntactic sugar for expression trees.

This is the entire reason we come to the ball park as language designers. We don't get paid to make ASTs.

You can do simple CSE well

You can do simple CSE well enough on your own using a hash table and your own symbols. Think of this as a pure backend; you still need an intermediate language and such.

The DLR does one thing and does it well, everything else you might want can be done in other ways.

IE 9 has multithreaded JS

So soon we will see how multicore and trace-based JITs play together.

I believe ie9 has a

I believe IE9 has a multicore JS VM in the sense of Firefox's: JIT path overheads can be taken off the critical path.

If you want to talk about multicore js (not compiler overhead), trace vs method jit seems to be very much at the wrong level. Also, as a reference, I've found your strongly contrarian position to be lacking (stated) technical reasoning, making it difficult to discuss.

Clarifying some details.

If you want to talk about multicore js (not compiler overhead), trace vs method jit seems to be very much at the wrong level.

So your words support my point of view, actually.

I'd like to clarify my position slightly here.

Trace-based JITs use traces for "profiling", not as proofs of code properties. Parallel code should exhibit some strong properties regarding execution and communication between various parts, so to parallelize code we have to have analyses and proofs about code behaviour.

Demand for JS parallelisation will only rise, as I see opportunities to write complex browser games, and there we have a complex world and AI queries over the world. We could even have physics simulations.

Trace-based JITs use traces

Trace-based JITs use traces for "profiling", not as proofs of code properties.

There should be proofs for traces, even if informal (otherwise the optimizations could be wrong). In terms of proofs, instead of 'profiling', I can phrase it as finding relevant special cases of proofs that are incorrect in general but correct under typical contextual invariants (and, in fact, am thinking about how to do that for a security project discussed in this thread).

Parallel code should exhibit some strong properties regarding execution and communication between various parts, so to parallelize code we have to have analyses and proofs about code behaviour.

When I said at the wrong level, I meant at the level of even automatic parallelization of general languages being an unknown. I have had a lot of success in parallelizing various DSLs within browsers but know of no evidence suggesting that parallelizing general JS using a JIT, irrespective of type, is a viable approach. For example, if you check out the 'SEIJIT's work going on in my lab, JIT (with dynamic context checks) + patterns has been successful: skipping the pattern step (so relying upon method or trace analysis) is closer to supercompilation and thus likely limited to basic stuff like automatic vectorization for the foreseeable future (and probably would benefit from OpenMP-like annotations to help prevent parallelization from breaking upon code changes, making the entire issue somewhat moot).

Optimizing multithread dynamically typed languages

That isn't easy.

Consider a variable that can be accessed from more than one thread.

A trace compiler optimizes for its type not changing, or even its value not changing.

Then another thread changes the type and value of that variable.

How does the system detect this without so much overhead that the serialized instructions make it dog slow in the first place?

I do have an answer to that (one I came up with myself), other than the obvious one (the obvious one is to write your programs in such a way that they avoid all accesses to globally visible objects whenever speed matters).

The solution is to replace locks with scheduling. I.e., instead of checking locks all the time, you break up tasks into units smaller than threads and schedule so that tasks that could compete for the same lock won't ever run at the same time.

Perhaps highly contentious code always ends up running on the same processor and only switches at very safe points, while all the other code can get distributed to the other processors.
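
A rough sketch of that idea in C (the task structure and worker_enqueue are invented; a real scheduler would have actual per-worker queues):

#include <stddef.h>

typedef void (*task_fn)(void *arg);

typedef struct {
    task_fn fn;
    void   *arg;
    size_t  contention_key;  /* identifies the shared object this task touches */
} task;

#define NUM_WORKERS 4

/* Stand-in for a real per-worker queue: runs the task inline so the
 * sketch compiles; a real scheduler would enqueue it for that worker. */
static void worker_enqueue(size_t worker, task t)
{
    (void)worker;
    t.fn(t.arg);
}

/* Tasks that name the same shared object always land on the same worker,
 * so their accesses are serialized by construction and need no lock. */
void schedule(task t)
{
    worker_enqueue(t.contention_key % NUM_WORKERS, t);
}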

Glitch is the future (maybe)

I'm working on this problem right now (ATM actually) in APX (YinYang rebranded), except that I have aggressive type inference and code that changes while the program is running (code changes can obviously change type inference results). So it isn't really "tracing", but has the same problems.

Glitch avoids locking and tolerates inconsistencies by supporting transaction-like re-execution. So say the type of something changes, your execution is a bit messed up for a while, but eventually (usually very quickly) converges back to the "correct state." Then it doesn't really matter how many threads you have executing your tasks, and you can do nice optimizations like polymorphic inline caching and unboxing. I haven't scaled to distributed systems yet, but it would work under the same principle of tolerating temporary inconsistency.

Do you have links

to the systems you're working on, or information about them?

Also does this transaction retrying show itself to the visible definition of the language in any way? Is it visible to the programmer, or affect what features the language can support?

If it's hidden, how hidden is it?

You can look at my managed

You can look at my managed time paper.

The retry is transparent to the language, but it requires new semantics (all effects must commute and be undo-able until committed) that aren't very portable to other languages (not many imperative languages out there with declarative semantics!). It basically requires a time collector that is analogous to a garbage collector, and analogously, abstracts away time details that most sequential (including functional*) languages take for granted.

There is something else to consider: why not optimize variables backwards, for how they are used vs. what values they carry? This is the approach that Type-less takes, and it avoids issues like this (since usage is local!). An extra bit could carry whether the value is valid or not, so dynamic type errors could occur in the right place. Edit: Now what happens if multi-threaded usage causes the type of a shared variable to change? Well, you don't actually have to change its type until convenient, if ever, since usage-based typing always specializes rather than generalizes as definition-based typing does...you are never wrong!

The type systems people unfortunately got it backwards 40 years ago, so we are stuck with all these crazy problems today :)

* Even most monads in Haskell are sequential.

Can you explain what you mean

I'm interested, but I don't know what "optimize variables backwards for how they are used" means. Nor am I sure how adding an extra bit to every answer helps. Some worked out examples would help.

Also, pretend that I'm an engineer, or a researcher from the '70s, and avoid the Haskell or pseudo-Haskell notation that everything on this site is written in. I can't make head or tail of it. Use Scheme or Prolog or Ada or something I can understand.

I guess I better learn to read the notation everyone is using.

And yes, I get that changes can be delayed, but sometimes that means the thread making the change has to ACTUALLY delay, because it might read something affected by the thread it's affecting, which implies a lock. Also, sadly, current processors don't allow asymmetrical locks except by making one side VERY VERY expensive.

[Edit] I just noticed your link.

I'll read that in case it answers my questions.

Backward type recovery (and

Backward type recovery (and inference) and Glitch are two separate solutions. One works by re-execution, the other works by recovering types via conservative specialization rather than non-conservative generalization (they can be used together or not). The bits just separate where type errors are detected vs. where they are reported, with the ability to use static type information to detect errors statically and unbox values since the tag isn't needed for a type check later.

It is just an early idea right now, maybe I'll work out some examples later (although I'm perfectly happy with Type-less and don't see the need dive into type recovery for dynamic languages at this point).

Is that the same as type feedback?

I'm not sure what the word "conservative" means in the phrases "conservative specialization" and "nonconservative generalization".

The obvious thing to do is to compile a trace for the types seen coming into it, and to put checks in that types remain stable, or prove that they will...

I still haven't got around to your video/paper but I will

Self had rather minor improvements based on starting with a very inefficient object model (prototypes with layer upon layer of compares) and improving it with LRUs of the most recently seen types and messages... Call sites that seemed to only see one type (or traces that only saw one type) could be made much more efficient as:

load type
compare with expected type
hint processor to prefetch as if no branch
branch if type wasn't expected.

but that only saved on the stall caused by a branch or by an indirect call... Note they gave it self-modifying code so that it could change behavior without creating a new trace (actually I'm not sure if the Self compiler had specialized traces). But an object model based on a flattened table might have one stall but would still be much faster than the Self model in the general case. And no doubt processors are becoming designed around C++, i.e. the stall for an indirect call is probably something that won't exist in real processors forever.

Processors are already keeping statistics on branches, and even on patterns of branches and correlations between branches.

Note this was optimization based on doing no inference at all, just MEASURING and making the case seen faster.

By conservative, we just

By conservative, we just mean whether types can be too specific or not. Starting from bottom and joining up, or starting from top and meeting down: the former can lead to a type that is not general enough given new information; the latter does not have that problem, and you can tolerate missing a meet with additional dynamic checks when the inferred type is not specific enough to handle a new usage.

APX is very inefficient given dependency tracing and overhead to support re-execution, and so it needs inline caching to work well by default. It doesn't use LRU, but rather last-used since tasks are likely to re-execute multiple times (at least over time, no answer for spatially intensive computations yet). As Glitch caches book keeping work anyways, might as well cache everything else, like virtual method lookups. This worked before in the last version of YinYang that was dynamically typed, and the performance is about the same with Type-less in APX. The real performance benefit of type inference is that we easily unbox values like numbers and vectors, which is actually a huge win! I would guess unboxing dominates any gains from method resolution or dynamic type check optimizations, but I have no proof for that. (I unboxed in the last version of YinYang, but had to specialize EVERY execution path to do so...)

And of course, the main benefit of type inference (not recovery, which is a different beast) is computing types to do code completion...and maybe detect some errors earlier. The performance aspect is just icing (well, maybe not, but performance wasn't my goal in doing a type system).

Maybe

Maybe they're winning. But the competition is not over, yet.

You won't win if you just slap a trace compiler onto an existing VM (as exemplified by TraceMonkey).

Creating a good trace compiler requires a lot of careful engineering and research, a fast interpreter, a specially crafted IR, the right set of compiler optimizations and lots of patience/money.

So be warned: it's not the easiest path you could take. But IMHO it's currently the best bet, if you want a dynamic language to be competitive with static languages.

There are two weak points of this approach:

  • The needed heuristics quickly grow in complexity, need a lot of tuning and create some unwanted surprises.
  • Branchy code is troublesome for a trace compiler. I'll eventually add hyperblock scheduling to solve this issue.

Lua's interpreter

Mike,

Thanks again for your response. (It was me you answered on Reddit.)

You have implemented a bytecode interpreter in LuaJIT. Did you consider the idea presented in some of the comments here about generating self-tracing machine code rather than bytecode? It would save the whole interpreter, and since, as you say, the LuaJIT interpreter is written in assembly, generating machine code would not break any current platform independence of the interpreter. (I could imagine a design where the interpreter is completely platform independent but the trace compiler would kick in when its machine-specific back ends were supported. But this doesn't seem to be what you've done.)

Better use an interpreter

It's much easier to record what an interpreter is doing. Just patch its dispatch table and intercept every instruction.
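
A toy version of the dispatch-table trick (the bytecode format, handlers, and recording hooks are all invented for this sketch):

#include <stdio.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT, NUM_OPS };

typedef struct { const uint8_t *pc; int64_t stack[64]; int sp; } vm;
typedef void (*op_handler)(vm *);

static void op_push(vm *v) { v->stack[v->sp++] = *v->pc++; }
static void op_add (vm *v) { v->sp--; v->stack[v->sp - 1] += v->stack[v->sp]; }
static void op_halt(vm *v) { (void)v; }

static op_handler plain_table[NUM_OPS] = { op_push, op_add, op_halt };

/* Recording variants: note the instruction, then defer to the original handler. */
static void rec_push(vm *v) { puts("trace: PUSH"); op_push(v); }
static void rec_add (vm *v) { puts("trace: ADD");  op_add(v);  }
static void rec_halt(vm *v) { puts("trace: HALT"); op_halt(v); }

static op_handler recording_table[NUM_OPS] = { rec_push, rec_add, rec_halt };

/* The interpreter only ever goes through `dispatch`, so flipping this
 * pointer turns recording on or off without touching the main loop. */
static op_handler *dispatch = plain_table;

void start_recording(void) { dispatch = recording_table; }
void stop_recording(void)  { dispatch = plain_table; }

void run(vm *v)
{
    for (;;) {
        uint8_t op = *v->pc++;
        dispatch[op](v);
        if (op == OP_HALT) break;
    }
}

Swapping the dispatch pointer is the whole interception mechanism; the interpreter loop itself never changes.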

The small gains of a simplistic compiler over a carefully hand-optimized interpreter are just not worth the trouble (the LJ1 JIT compiler is not much faster than the LJ2 interpreter, sometimes it's worse). Better focus on improving the trace compiler and stay 'on-trace' as much as possible.

Another subtle point, often forgotten in the reports about Mozilla's plan with JägerMonkey: the method compiler only triggers trace recording, but they still need the interpreter to actually record the traces! So now they gotta keep three engines in sync (interpreter, method compiler, trace compiler). I'll leave it to your judgement whether that's a smart move or pure desperation.

We're too old to be desperate

We survived Netscape, remember? What we at Mozilla are is old (my hair color confirms this :-|), with long-lived APIs and platform components built on top of our JS engine. So our changes involve incremental rewrites and breakage.

To be clear, we intend to retire our old bytecode interpreter at some point. We do not want an interpreter and two JITs if we can use just JaegerMonkey and tracing. The bytecode (rather an improved evolution of it) may remain as a first-stage compiler output. But for the short term we will keep the bytecode interpreter, since it does many things (simple debugging support, ECMA-357 aka "E4X" support) that the JITs do not yet do.

I've talked to Ollie Hunt of Apple about tracing for JSC (WebKit). It may happen, as JS is hard to optimize fully without some kind of aggressive type inference and speculation, more aggressive than polymorphic inline caching.

There is no "one best way" to compile dynamic languages, but for JS, it seems clear that both inline/call(context)-threading and polymorphic inline caching are important. Firefox had a PIC for its bytecode interpreter before we did TraceMonkey.

Tracing is not an alternative to a well-tuned PIC: trace trees branch up to a fairly low bound to cut off potentially exponential growth, and due to aggressive specialization they do not easily recompile to more generic "megamorphic" code.

Since fast generic-code JITs, PICs, and tracing JITs seem to be complementary, we are doing all three -- we already have the latter two developed and in shipping products. But we will lose the old bytecode interpreter (I first wrote version 1 of it in 1995!) in due course.

/be

Tracing without the bytecode interpreter

Brendan, I'm glad you found this thread. :-)

If you retire the bytecode interpreter completely then what is the high level plan to record traces? Will you method JIT first with some extra profiling instructions that map code blocks back to the intermediate representation and then compile optimized code from a combination of profiling info and the intermediate representation?

Looks like it

For notes on their plan, see a quick note on JavaScript engine components.

Maybe I'm missing it but

Maybe I'm missing it but does that post really answer my question about how native code will be "traced" and/or mapped back to a higher level representation using some profiling instructions?

Code is code

It's not that hard. With either a bytecode interpreter or an inline/call-threaded JIT, you have to count hot loops.

With a bytecode interpreter, once a loop header is hot enough, you can swap indirect-threading computed-goto jump tables, or if computed gotos are not supported (MSVC), use hideous macrology to double your switch-in-a-loop's case count to interpose recording call-outs.

With inline/call-threading, you quickly regenerate the code to include recording-hook calls, and switch to the instrumented code.

You write "native code" but JaegerMonkey, like V8 and JSC, does not emit machine code with arbitrary structure. Instead you get blocks of more type-generic code, either inlined and specialized a bit using caches and hottest-case-first self-organizing structures, or big enough to be consolidated into a stub or helper that is called for slow paths and chunky opcodes.

These native code blocks still correspond to higher-level operations in the source language. The recorder can therefore re-lower to trace-optimized machine code.

/be

How to be a millionaire

Old Steve Martin joke...

So after you get the million dollars, you have to do all those things Mike said -- his analysis is spot on.

One thing Mike didn't highlight: get a simpler language. Lua is much simpler than JS. This means you can make a simple interpreter that runs fast enough to be balanced with respect to the trace-JITted code.

IIRC, LuaJIT2 does not nest trace-trees (loops) via call instructions, as the UC Irvine researchers have done, and as we do in TraceMonkey. It simply runs the outer loop in the fast-enough interpreter. In TraceMonkey nowadays, we not only nest trees, we trace tail and direct recursion.

JS is surely a beast compared to Lua, but it's the one language interoperably implemented across browsers. This interoperation condition is hard to achieve (C, Scheme, etc. don't try -- compiler writers guard the standards' underspecification to enable optimization experiments; deployed code is firewalled generally, or only ported slowly), and harder still to justify supplementing with yet more (generally unsafe-code) language implementations.

So people are targeting JS with compilers, more and more from compilers running in-browser. JS is the X86 of the Web.

To quote the Wicked Witch of the West (mid-melt): "What a world!" ;-)

/be

This interoperation

This interoperation condition is hard to achieve (C, Scheme, etc. don't try -- compiler writers guard the standards' underspecification to enable optimization experiments; deployed code is firewalled generally, or only ported slowly), and harder still to justify supplementing with yet more (generally unsafe-code) language implementations.

Sorry, I don't understand this. Even JavaScript has portability issues. It is just the case that it has 100% market share and is tightly coupled to HTML thanks to ridiculously bad ideas (the DOM and CSS) aimed at adding interactive behaviors on top of static documents as progressive enhancements.

C is an idiotic language because it doesn't even do array bounds checking. The web would have failed years ago if C was used to manipulate HTML. We'd all be drinking Brawndo by now if that was the case. So I don't understand how "underspecification" of C matters at all.

As for Scheme, I don't understand how it is underspecified. Quite the contrary, Scheme demands and expects compiler writers to provide Proper Tail Call optimization. By making this trade-off in specificity, Scheme is a much simpler language to specify (<100 pages, an order of magnitude smaller than Common Lisp), but requires more advanced knowledge from implementors of the Scheme spec. Back in 1995, this was a big issue, since most of the world had never heard of "open source software" and Linus Torvalds was just starting Linux, and, my gosh, Slackware was an active Linux distribution! The world has changed, and your expectations of how to solve problems in it today must change if you want to radically innovate. As an example, look at how the current web standards process for HTML rendering has now basically become simply pointing to webkit and saying "That's what HTML looks like!"

JS is the X86 of the Web.

Agreed, but this does not mean we cannot improve trustworthiness on the Web. CSS was arguably 'fixed' (sort of) the same way by refining it into levels.

Edit: Fixed escaping <

under-specification

As for Scheme, I don't understand how it is underspecified.

I suspect Brendan was referring to argument evaluation order.

Scheme

As for Scheme, I don't understand how it is underspecified

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-8.html

5.9 Unspecified behavior

If an expression is said to “return unspecified values”, then the expression must evaluate without raising an exception, but the values returned depend on the implementation; this report explicitly does not say how many or what values should be returned. Programmers should not rely on a specific number of return values or the specific values themselves.

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-10.html

The effect of returning twice to the continuation of the last body expression is unspecified.

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-12.html

When a procedure call is evaluated, the operator and operand expressions are evaluated (in an unspecified order) and the resulting procedure is passed the resulting arguments.

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-14.html

If test yields #f and no alternate is specified, then the result of the expression is unspecified.

The result of the set! expression is unspecified.

etc...

The goal was, like C, to allow implementers to do whatever was most efficient given their target platform and overall approach. Just as with C, a great deal of the art of writing portable Scheme is to avoid unspecified behavior.

I get it

Thanks for your help James!

I just searched the ECMA-262 5th ed spec and the word "unspecified" only comes up once, and it is a note stating that an unspecified behavior must now be specifically handled by ignoring extra tokens in the token stream for an argument list.

I had no idea JavaScript was so detailed.

However, my points were mainly about how, with JavaScript, people still avoid certain idioms because they leak memory on certain platforms or perform poorly on others. That's why I gave the example of PTC. My viewpoint has changed now, thanks to you and Dave Herman.

Edit: I just had a separate thought. RFC2119 uses a different vocabulary. So I searched the ECMA standard for "should not" and found several matches.

Host objects may define additional constraints upon [[Put]] operations. If possible, host objects should not allow [[Put]] operations in situations where this definition of [[CanPut]] returns false.

The implementation of ECMAScript should not try to determine whether the exact time was subject to daylight saving time, but just whether daylight saving time would have been in effect if the current daylight saving time algorithm had been used at the time. This avoids complications such as taking into account the years that the locale observed daylight saving time year round.

An implementation shall not treat other kinds of errors as early errors even if the compiler can prove that a construct cannot execute without error under any circumstances. An implementation may issue an early warning in such a case, but it should not report the error until the relevant construct is actually executed.

Host objects are from the Devil

Good finds. Those underspecified counter-examples are botches, big sore points with web developers. Mainly they concern the IE DOM and its COM objects that do not behave the same as JS objects. Until ES5, many host objects could not be mimicked by self-hosted implementations (some misfeatures require future meta-programming features).

Of course historical DST would be a giant pain and source of bugs. Indeed ECMA-262 has always used "extrapolated Gregorian" to go back into the past, never mind the lost week in the middle ages, or previous calendar standards and shifts.

ES5 has tightened up error reporting norms a bit. This is actually good for interoperation. One of those "host object" quirks, a taint from VBScript, is callable lvalues:

d.item(i) = j;
This is found on the web, but in IE-only branches of JS control flow. It must parse in all browsers without an early error.


/be

Scheme underspecified?

I wouldn't say Scheme is underspecified. Specifying "unspecified behaviour" is a good practice, isn't it? ;-)

You have a denotational semantics for Scheme (including an executable one) and even an operational semantics too!

Not "what Brand X does"

As an example, look at how the current web standards process for HTML rendering has now basically become simply pointing to webkit and saying "That's what HTML looks like!"

Sorry, no. Mozilla is not rolling over; we have more market share than WebKit browsers, and we co-founded the WHAT-WG, where we continue to work on HTML5 as well as in the W3C.

In truth much in HTML5 is based on Hixie's synthesis of what IE, Gecko, WebKit, and Presto (Opera's engine) do. WebKit may indeed "win" more by being (a) open source (this helps Gecko too); (b) younger in its bones, so benefiting from more experience gained by older efforts; but that's fine and fitting.

HTML5 is not a matter of standardizing "what WebKit does", any more than JS future standards will bind themselves to one implementation. This is one reason the ECMA-262 standard is so detailed: the vendors competing and cooperating to create it will not agree to use one source base. They will follow de-facto standards where those standards are good enough, however.

Given this reality, and the web-developer mandate for interoperation, we have no choice but to specify harder. We're looking at a testable spec (definitional interpreter) for the next major edition.

John Mitchell, Sergio Maffeis, and Ankur Taly have been working on an operational semantics for JS, but this is a long road, and it would need mechanizing. Others have done subset semantics, but ECMA-262 needs to be as complete in future editions as it has been in past ones, if not more so. We can always drop into prose, but we lose testability if not specificity.

Apart from parsing, HTML5 has an easier time using prose, since pixel-exact layout is not promised unless you use CSS to achieve it, and other effects need less precision -- but at the limit HTML5 wants to be as precisely specified as JS is. That definitely excludes "what WebKit does", or "what [brand X] does", as a fiat spec.


/be

Observation

Most people like me develop in one browser and then figure out how to make it work on other browsers. [Edit: this is true even if I use a tool like Expression to allow simultaneous previews of multiple renderers, and also have VMs in place. The reason is that I pick a default browser, and right now my favorite is the one with good looks and great speed, because it is simply fast.] Sometimes we have the hacks in our head as we create the design.

My observation is that I anticipate more and more developers will be using WebKit as that default. This will in turn mean WebKit is the de facto standard, because it is what those of us in the real world building HTML - what goes across the wire - are using for testing.

Right now, it seems to me like Firefox's greatest advantage is that it has Firebug.

Just being pragmatic, even if fiat!

Off topic, but web developers target users, not themselves

IE still has dominant market share on "top sites", and web developers do not fail to make content that works in IE (all of them -- IE7 and IE8 at least, and they differ! IE6 is hopeless outside intranets, and web developers are starting to shun it).

Firefox has more market share than all WebKit-based browsers, and while we aim to keep web developers happily using us as "browser plus debugger in which to develop" and "first target to test", there's no guarantee we will prevail. Competition is what we wanted to restore to the market when we launched Mozilla and then Firefox. We succeeded.

But since market share shifts only slowly, developers cannot deploy and then tell their bosses "works in WebKit" to excuse a failure to work well in Firefox -- or in IE.

The game theory on the web is such that being "number 2" can beat being number 1 or, of course, number 3 or higher. Example: when Microsoft rolled out Live Maps (first version, using "Atlas", their answer to AJAX or Ajax), they assumed the IE DOM by browser-sniffing for IE: if not IE, they then assumed Firefox's JS dialect, with its getters and setters.
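
For readers who never saw that dialect, here is a rough sketch of the legacy Mozilla-style accessors such code relied on (the object and property names are made up for illustration):

// Legacy Mozilla-dialect getters/setters, later also supported by WebKit and Opera.
var point = { _x: 0 };
point.__defineGetter__("x", function () { return this._x; });
point.__defineSetter__("x", function (v) { this._x = v; });
point.x = 3;          // runs the setter
var copy = point.x;   // runs the getter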

This caused rapid work at Apple and Opera to support getters and setters. The asymmetry in most web content whereby browser-sniffing tests

if (document.all) {
  // IE only code here
} else {
  // Mozilla, WebKit, Opera, etc. here
}

is pervasive and it won't go away quickly.

I agree that the competitive fight is over second-place status, and it is being fought first in the web developer segment of the market. Nevertheless, web developers cannot in general move users to new browsers by giving them broken pages or apps. This is the risky and rightly hated "viewed best in Netscape" vs. "viewed best in IE" stratagem from the '90s.

Market share in my view shifts based on keeping pages working while adding better features, performance, and intangible goods to do with brand (privacy, trustworthiness).

Let's get back to JITs, shall we?

/be

A little more appealing in

A little more appealing in terms of semantics to the LtU crowd might be Arjun Guha's more standard approach to semantics (including executable PLT Redex models, I think!) for JavaScript.

Clearing up some misconceptions

LuaJIT does not treat traces as trees. It treats them as a graph structure, more like Dynamo. There are no 'interpreter-only' gaps in the control-flow (other than for NYI stuff).

Of course nested loops run fully native. It needs one root trace for the innermost loop and a side trace for each outer loop which connects back to the root trace.

Why don't you try LuaJIT before speculating? It takes only a few seconds to compile. Running a triple-nested loop:

$ cat test.lua
for i=1,100 do
  for j=1,100 do
    for k=1,100 do
    end
  end
end
$ luajit -jv test.lua
[TRACE   1 test.lua:3]
[TRACE   2 (1/3) test.lua:2 -> 1]
[TRACE   3 (2/1) test.lua:1 -> 1]

And the example of two inner loops that Andreas likes to cite:

$ cat test.lua
for i=1,100 do 
  for j=1,100 do end 
  for j=1,100 do end 
end
$ luajit -jv test.lua
[TRACE   1 test.lua:2]
[TRACE   2 test.lua:3]
[TRACE   3 (1/3) test.lua:3 -> 2]
[TRACE   4 (2/3) test.lua:1 -> 1]

Runs fully native of course. Draw the graph, and you'll see. Or run with -jdump and look at the traced bytecode, IR and machine code.

Oh, and yes, LuaJIT traces tail-, up- and down-recursion, too. The generated graphs are ... umm ... a bit odd and not what you'd expect. Certainly not what a method-at-a-time compiler would ever generate. But it runs fast.

IMHO trace trees and especially nested ones just drive up complexity without a real gain. In TraceMonkey you're not compiling a whole tree together (which is where the only pay-off might be), since this would lead to excessive recompilation whenever a trace is attached to the tree. So this approach is a pure loss.

LuaJIT reduces trace-transition costs with cross-trace register coalescing. This is much simpler and helps all edges of the graph. Not just the side traces that happen to loop back to their parent root trace.

Yes sure, JS does have some weird semantics. But no show stoppers. And most JS programs don't use all of the weirdness at once. It does take more of an effort to write a JIT compiler than for Lua, of course. With proper design and engineering, there's no reason a JS JIT compiler couldn't compete with LuaJIT.

Why Lua?

With proper design and engineering, there's no reason a JS JIT compiler couldn't compete with LuaJIT.

Slightly tangential, but why did you choose to write a Lua JIT rather than a JavaScript JIT? Were there any technical reasons?

Lua is king of game engine embedded languages

GDC is happening right now. Walk around the show floor if you attend and see how many engines have Lua embedded. Lua clearly has the first-mover advantage due to its small footprint and simple embedding API, although proprietary languages still appear often enough.

Not to speak for Mike, but from all I can see (I've been to GDCs for most of the last ten years), optimizing Lua performance for ever more scripting of advanced game engines is good business.

/be

Lua is king in the land of Lua-dites

Lua's supremacy actually took some competitive heat from Mono. Miguel de Icaza loves talking about this; he is so proud that Mono was able to compete with Lua for game scripting. The major issue with Lua prior to Mike Pall's work was that nobody had done any serious performance tuning of it, so some developers were looking at embedding a small version of Mono into their games and using the Mono C# compiler for logic.

What this actually demonstrates is that the best VM+language will eventually win.

Best doesn't matter (WiB)

This here demonstration has tons of confounding variables, pardner. It's hard to prove "best", but you are right that Mike's work was and is crucial to keeping Lua competitive.

The most I'd conclude is that "too slow", or just the perception of being "too slow", leads some to try faster alternatives. This had been an objection to JS compared to plugin-based alternatives, but you don't hear it much any more.

Unlike browsers, game engines do not all have to support the same source language, although recycling Lua brainprint and source code may win depending on your legacy training and porting requirements.

Brainprint matters. When game engine hackers roll their own language, it often looks like a Java/Python love child. C# is rising too for this reason, not just due to fast implementations.

That reminds me: Eve Online is still using Stackless Python. "Best" is not the issue, scalability and sunk costs are.

Games are not the web, although convergence will tend to level downward (toward JS) over time. Let's hope "best wins" sooner. It is not a sure thing.

/be

Scheme

Uncharted 2 is using Scheme for its scripting system. The slides [PPT] from GDC '09 go into how they use it and how they implement it. Doesn't look like their VM is too fancy. The implementation portion starts on page 51.

Slides [pdf] from GDC '08 for Uncharted make it sound like they didn't even use a VM.

Trading speculations

Mike, you've done great work -- when last we talked, it seemed trace trees without nesting were enough, but I do remember you were pursuing other structures.

Your point seems to cut both ways, however: if you've kept up with TraceMonkey, you know it does up, down, and tail recursion now. We also stitch together differently-typed trees anchored at common points using fragment traces. With Intel, we are investigating using spare cores for background whole-tree recompilation, which gives us more cycles for optimizing harder.

Trace trees have the virtue of returning control to the same program point (loop header, recursive call), so in the case of loops at least, one can inline method calls without constructing activation records. Less structured traces may have to push and pop activations, which always hurts.
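
A tiny illustration (not taken from any particular engine) of the kind of loop where this pays off:

// Illustrative only: on trace, the call to sq() can be inlined without
// building an activation record, because every iteration returns to the
// loop header with the call stack in the same shape it started in.
function sq(x) { return x * x; }
var sum = 0;
for (var i = 0; i < 1000; i++)
  sum += sq(i);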

On the differences between JS and Lua, you can say it's all a matter of proper design and engineering (what isn't?), but intrinsic complexity differences in degree still cost. You can push the hard cases off the hot paths, certainly, but they take their toll. JS has more and harder hard cases than Lua.

One example: Lua (without explicit metatable usage) has nothing like JS's prototype object chain. See this bug comment for some startling statistics on how deep these chains can get in gmail.
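
A minimal sketch of why deep chains cost on trace (the names below are invented for illustration; real pages layer prototypes via libraries and end up much deeper):

var base = { helper: function () { return 1; } };
var mid  = Object.create(base);   // mid -> base
var obj  = Object.create(mid);    // obj -> mid -> base
obj.x = 42;
obj.helper();   // misses on obj and mid before hitting base; each hop
                // typically becomes another guard in the compiled trace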

/be

Both have advantages and disadvantages.

I have built trace compilers both ways. Both have advantages. Both have problems. Tamarin Tracing used Dynamo-like unstructured traces that could start anywhere. It worked reasonably well, but it made loop optimization a bit tricky.

An important loop-based optimization we do in SpiderMonkey/TraceMonkey is not building activation records for function calls inside a loop, except if a side exit occurs. Since function entries and exits are balanced inside a loop, this works in our favor. One can argue whether this is always a win. In SpiderMonkey/TraceMonkey it was critical to do this when we added the trace compiler, because the interpreter's stack frame representation was very heavyweight and very expensive to construct (30 or so words at least, with various data to fill in). In my original Java trace compiler in my dissertation I didn't emit any function call state, except in side exits. In SpiderMonkey/TraceMonkey we leave breadcrumbs behind on a "frame stack" to reduce the amount of metadata we have per side exit. This trades a bit of on-trace performance for side-exit cost.

Tamarin used completely unstructured traces, so it could cover all the weird control flows you throw at it, but it had to construct and deconstruct stack frames on the fly all the time, which was one of the reasons it was a lot slower than our VM.

Another useful feature of trace trees is their ability to contain tail duplication. Try your example with the following code in the middle:

if (k&1)
  if (k&2)
    if (k&3)
      v += 1;
    else
      v += 2;
  else
    v += 3;
else
  v += 4;

If you nest this deep enough, you experience some rather painful tail duplication. Nested loops propagate such tail duplication outwards exponentially, since you link back loop edges from every potential exit point of the inner loop to the outer loop header. In the case of tree calls, we can share one common loop exit point from the inner loop (as long as types are compatible). So there is some tail duplication in the inner tree, but the outer loops each contain only a single trace.

As so often in life, which approach is better depends on the use case. For JS we found tree calls superior. Lua is hard to compare to JS. Lua is an extremely clean and beautiful language with clear and simple semantics per opcode. JS is pretty much the opposite. Everything is a complex mess. Try reading a property from a JavaScript "array", and you find yourself traversing multiple objects along the prototype chain and checking them for holes and whatnot. A single object property or array element access can require dozens of machine instructions and, even more importantly, dozens of guards. At each guard you might side exit, which is a great source of massive tail duplication.

For our particular implementation we found tree calls useful and it did solve a lot of problems for us. Whether it would be beneficial for Lua, I have absolutely no idea.

Really good compiler

should optimize the above code to a single NOP. That's it.

Can we say the same about python?

Yes sure, JS does have some weird semantics. But no show stoppers. And most JS programs don't use all of the weirdness at once. It does take more of an effort to write a JIT compiler than for Lua, of course. With proper design and engineering, there's no reason a JS JIT compiler couldn't compete with LuaJIT.

Can we say that python, as well as js, doesn't have any show stoppers?
Is it possible to apply the same techniques you used with LuaJit to implement Python?
If so, is it possible to reuse or extend your work in LuaJIT, or would it require starting from scratch?

Sure

Python (the core language) just has different quirks that need to be worked around. But no show stoppers.

What is more challenging is to efficiently handle Python (the programming environment) with all of its classes and methods -- in particular the plethora of container types, since you really want to inline their accessors.

Since I don't believe in multi-language VMs, LuaJIT is pretty Lua-specific on all levels. Your best bet right now is to join the PyPy effort (they already handle the issues I mentioned).

Since I don't believe in

Since I don't believe in multi-language VMs

Why? I have come to a similar conclusion recently because the VM dictates the semantics to a certain degree. Clojure and F# both seem to prove me wrong though, so it would be nice to get the thoughts of somebody with more experience.

[Edit: Fixed typos]

Proving what?

Clojure and F# both seem to prove that it is possible. But that does not prove it is a good thing to do, or the best thing, or anything else.

We actually had a discussion

We actually had a discussion about this recently, where Guy Steele talked about Proper Tail Call in Fortress, and he corrected Tim Bray by pointing out Clojure cheats with some subexpression rewriting. The result is that you don't get the full power of Proper Tail Call.

I am not sure what the current status of the "tail" CIL instruction is for MSIL in .NET 4 (and so I can't really comment on F#), but I've read blog posts saying they have changed the calling convention and associated profiling/tracing API so that it's more (a) robust, in terms of performance and the ability for language designers to take the instruction seriously, and (b) debuggable.

How about the Pypy approach?

Mike, I'm an admirer of your work and your opinions are always very interesting and spot on. I'd like to know your opinion on PyPy's approach to tracing. AFAIK, it's very different from that of LuaJIT or TraceMonkey, since it works "one level down" (tracing the interpreter loop rather than the user program).

It's also a much broader project, not exclusively aimed at speeding up Python, but at becoming a general framework for implementing dynamic languages. So I guess that, perhaps, the approach taken for getting performance involves other trade-offs not present in Lua or JS.

I wonder if the idea of having a slow interpreter automagically sped up by tracing is a good strategy. It seems this didn't work for Mozilla and, obviously, it's not the path you've taken, but considering the special characteristics of this project (interpreter written in a subset of Python, tracing one level down, etc. ... and the latest benchmarks), what do you think?

I would also like to know the opinions of the Mozilla guys as well :-)

TANSTAAFL

I think there are three major problems with the PyPy approach:

1. All those layers add up. Removing the overhead is costly and you rarely get down to zero overhead.

2. Many complex language constructs are based on sequences of simple decisions. The order in which the interpreter performs these is not necessarily the optimal order for compiled code.

E.g. the LuaJIT interpreter needs to do checks for metamethods in a certain order, but the JIT compiler reorders them to minimize the number of guards and to improve hoisting opportunities.

This is a bit like what some C compilers do with an (expr1 && expr2 && expr3) construct. They are able to reorder, combine or parallelize the individual expression evaluations if it's profitable and the semantics allow for that.

3. PyPy is tracing low-level constructs. This drops all high-level semantic information that was present in the original language constructs. Maybe they've found a way to keep these (I haven't checked). Otherwise they're going to lose big time on alias analysis (if/when they add that). Any store could alias any load in that model.

Generating good code for higher-level languages crucially depends on having high-level alias analysis. E.g. have a look at this Lua code:

local t={}; for i=1,100 do t[i] = math.sqrt(i) end

The loop contains two hash table loads (for _G["math"] and math["sqrt"]) and a store to the array part of a table. LuaJIT is able to hoist all hash table lookups out of the loop, because it can prove they cannot alias the stores to the array part. That's because it preserves the high-level information in the IR (HLOAD cannot alias ASTORE).

So I'm not sure they can make PyPy fly, but at least it holds more promise than Unladen Swallow.

These are valid points

Hi Mike.

These are all valid points (maybe except for 3, which is not completely true), and we pay for this. The reason we decided to do it this way in the first place is that Python is an insanely complex language. Not only that, but certain details are not well defined; instead they're based on CPython behaviors. I don't think we would be able to write a JIT for full Python, including builtin modules, within a finite timeframe if not by tracing at the meta level.

Regarding point 3, we do preserve certain high-level constructs, so the JIT, even though it's tracing at a low level, knows a bit about the language semantics.

The approach is in a way a middle ground. It does not provide as many opportunities for micro-optimization as LuaJIT; on the other hand, it yields the promise of a working JIT with not that much effort.

Probably you can extend the approach to have Python-level optimizations on top of meta-tracing, but we're clearly not there yet.

On the other hand, we're not trying to compete with other languages, simply because Python semantics don't allow certain optimizations to happen; instead we simply try to make Python faster.

Cheers,
fijal

HLOAD aliasing ASTORE

Is this because we know the lifetime of t, i.e., no metatables? Or do you mean generally, in which case it would be easy to construct some metatable magic to break that? Or are HLOAD and ASTORE the primitive 'raw' ucode instructions that don't do metatables?

Could you give us a bit more clarification on when that is applicable?

Hyperblock Scheduling

Hyperblock scheduling sounds very similar to what we are trying to achieve with tree calls--introduce some structure into the way we generate code. In an unstructured trace compiler (such as my original KVM trace compiler, Tamarin Tracing or LuaJIT), the generated code often looks very different from the incoming program structure. Some loops are unrolled. Some loops become anchors, others are turned inside out and merely attach as additional traces to an inner loop. This is often good and intended. We optimize a program for the way it behaves, not for the way it was written.

In SpiderMonkey we found it troublesome, though, exactly for the reason you mention (branch code). There might be a hybrid of the two worlds, i.e. unrolling some loops and letting some traces start at arbitrary points, as long as things don't get too branchy, using the program structure as feedback for that.

We operate in a slightly different environment than LuaJIT 2.0. We ship a production browser. A bad tail explosion leads to either poor performance (we stop compiling) or, worse, bad latency (we compile too much at no performance win). Neither is acceptable. It was safer at that point to trade off a bit of peak performance for better average-case behavior. As the JaegerMonkey work progresses in replacing our vintage interpreter from the '90s, we can go back to being more aggressive (and selective) in choosing the traces we want to compile, since we can fall back on JaegerMonkey for problematic cases.

I think this is where Lua really benefits from its hyperfast interpreter. I bet our interpreter is a couple of factors slower than Lua's for a lot of code, in part because of JS's strange language semantics, in part also for engineering reasons. The latter we can fix. The former we will have to live with.

Re: Hyperblock Scheduling

LuaJIT is not an "unstructured" compiler by your definition. It makes use of a highly tuned region selection algorithm named Natural-Loop First (NLF). It quickly picks up hot inner loops in preference to other constructs. It avoids loop unrolling, except where it helps performance (e.g. for low trip counts). It picks up better traces than MRET/NET (Dynamo) and LEI (Hiniker et al.).

LuaJIT is used in production, too. E.g. for games. They don't like poor performance and they absolutely hate frame drops due to unexpected latencies. The current heuristics to control trace growth are quite successful. And these heuristics are a prerequisite for the planned switch to hyperblocks.

About the interpreter: LuaJIT's interpreter (!) beats V8's JIT compiler in 6 out of 8 benchmarks and is not too far off in another one [1]. V8 wins the remaining benchmark because it has a better GC. [Of course the LuaJIT JIT compiler beats V8 by a wide margin, except for the GC-bound benchmark.]

None of the benchmarks make use of JavaScript's weird corners. And all of them can (or could) be reduced to simple constructs. So I wouldn't blame it on the language differences alone (maybe on the implementation).

The take-away point is: it's less effort to write and maintain a well-tuned interpreter than a simple compiler. Since performance is shown not to be better, why bother writing a stage 1 compiler at all?

Unjustified conclusion?

I don't doubt you when you write that LuaJIT's interpreter is faster than Chrome's compiler. They both exist and can be compared. Those results are impressive.

Is the conclusion you've made, that writing a stage 1 compiler is pointless, really justified? It doesn't seem to have measurements backing it. Did you write a stage 1 compiler for LuaJIT with the same care as you wrote the LuaJIT interpreter and compare the two? Perhaps you could write a stage 1 compiler that blows away LuaJIT's interpreter? Furthermore, perhaps this stage 1 compiler alone could beat LuaJIT's tracing compiler? Do you have evidence to the contrary?

+5 funny

Maybe that was too obvious, but ... surprise! I wrote LuaJIT 1.x, too. :-)

LuaJIT 1.0/1.1 are simple method-at-a-time JIT compilers. A bit more complex than one would do for a stage 1 compiler, though. The LuaJIT 1.1 JIT compiler is only slightly better on average (+12%), but sometimes worse than the LuaJIT 2.0 interpreter.

The 1.x design is a dead end. That was the reason for the major redesign. The 2.0 JIT compiler beats 1.1 easily.

All versions are still available from the download page, so you can do your own comparisons.

This thread

I would just like to say that this is an awesome thread, and I'd like to see more like this. It's great to see some of this discussed (a) in one place, and (b) in the context of general design tradeoffs rather than in the context of one specific system (e.g., Firefox). Most of the time, it seems that keeping up with this topic involves following a bunch of mailing lists and blog posts, on which most of the content is irrelevant.

My thanks to all the contributors to this discussion.

This thread

I agree the comments have been fantastic. Certainly more than I'd hoped for. Can anyone lure the V8 guys over this way?

Keeping up with the state-of-the-art

Most of the time, it seems that keeping up with this topic involves following a bunch of mailing lists and blog posts, on which most of the content is irrelevant.

That's why I asked a while ago about a course on JIT compiler design. I found that few compiler courses mention the topic at all, let alone delve into any kind of detail on the anatomy of cutting edge dynamic language implementations.

Definitely

I strongly agreed with you at the time, but I didn't post, because I had nothing to add. ;)

PyPy on tracing

The PyPy team posted benchmarks on their status blog last week and from the comments, it appears they are betting on tracing too.

Unstated here, when

Unstated here when discussing tracing: an elephant in the room is tracing calls into the browser, which account for a good chunk of the time on most sites (though I realize one can make a chicken-and-egg argument about why that is).

On a somewhat related note, there was a fun talk about layering tracing JITs at DLS this year that LtUers might enjoy.

What do you mean?

Hi Leo,

I'm not sure what you mean by "tracing calls into the browser" -- can you elaborate? Naively, I would guess that you mean calls into native code triggered by DOM updates, but that code is already compiled. Do you mean JS callbacks invoked by the native methods, so that the control flow is obscured by the intervening native calls?

Naively, I would guess that

Naively, I would guess that you mean calls into native code triggered by DOM updates, but that code is already compiled.

I mean exactly that: if you control the x86 compiler... ;-) There are many different ways for this to work: there are projects to write browser components in managed languages or (what I'm doing) code-generate them, projects to do hardware or hypervisor-level tracing, and I'd expect instrumented C++ compilers (e.g., an LLVM extension), etc.

The IE8 guys published some funny numbers showing only 4% of the CPU time on average pages (top 250?) was in JS, with JS heavy sites (e.g., gmail) jumping up to 15-20%. My numbers have come out similar for WebKit on a smaller sample. While a lot of this discussion is great, it's important to step back and understand the browser domain and where it fits in.

Don't assume equilibrium

Intensive use of JS, e.g. for rendering to canvas, or for self-hosted DOMs, is the wave of the future. Don't let IE's trailing edge fool you. We want memory safety and control flow integrity, neither forthcoming in large C++ DOM codebases.

Many DOM methods memoize or cache. Self-hosting their "front sides" that try to hit the cache lets the tracing JIT inline the hit-case code. Wins on both safety and speed, compared to C++ implementations.
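
A hypothetical sketch of the shape of such a front side (the cache field and native hook names are invented for illustration, not real Gecko APIs):

// Self-hosted "front side" for a memoizing DOM lookup: the cache-hit path
// stays in JS and can be inlined on trace; only a miss calls into C++.
function getElementByIdFront(doc, id) {
  var hit = doc._idCache[id];             // assumed per-document cache
  if (hit !== undefined)
    return hit;                           // hot path, traced and inlined
  return doc._nativeGetElementById(id);   // cold path, native implementation
}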

/be

Intensive use of JS The

Intensive use of JS

The question is whether it is isolated or uses native components. My chicken-and-egg statement was that perhaps we don't see a CPU intense JS because that'd be prohibitively expensive and we're already maxed out just doing native GUI stuff.

, e.g. for rendering to canvas, or for self-hosted DOMs, is the wave of the future

This is why I brought up managed browsers -- which is what this encodes. I disagree in part; I actually don't believe self-hosted DOMs are the way of the future if you mean by user code (a la Caja). In particular, the Proxy/mirror proposal is too geared to membranes, which are fragile (security-wise): I am on the side of *not* distinguishing between proxy and normal objects in terms of such controls, so any taming of the DOM can be more directly handled (see my papers on ConScript and Object Views: I see the latter principal-sensitive advice primitives implemented using interpreter support as in the former).

I have less developed guesses about the role of canvas (in terms of accessibility and performance) colored from my time @ Adobe and our current investigations into GPUs, but can't make as educated of a comment about it yet.

Don't let IE's trailing edge fool you.

Yep, that's why I said I benchmarked WebKit ;-) (OTOH, all browsers except maybe Dillo are too slow for my domain of mobile devices!)

We want memory safety and control flow integrity, neither forthcoming in large C++ DOM codebases.

Amen on the former. Unfortunately, not sure what you mean by the latter (e.g., stuff like Trevor Jim's intrusion detection systems of CFGs of code?).

DOMs, membranes, CFI, oh my

As noted, self-hosting is not all or nothing: the hot path may be the cache hit case, and the slow path goes into native code. We're going to try this in Mozilla, but after we fry other fish.

Membranes are fragile if you (the web developer) have to remember to make them yourself instead of leaking a capability (is this what you meant by fragile?), but pervasive mediation using compilers and VMs is the better way. We already create wrappers in Firefox under the hood, from a central place, to solve several problems.

This is not ocap religion, just good separation of costs and concerns. The same-origin/same-global JS access paths must be fast, so are unmediated.

"Control Flow Integrity" is Abadi, Budiu, Erlingsson, and Ligatti; the follow-on work has culminated (so far) in Google Native Client. We don't plan to compile Gecko with NaCl, but we are already mitigating attacks on vptrs in C++ objects allocated and potentially prematurely freed from type-classifying arena pools.

Virtual method forgery via JS reallocation of prematurely deleted C++ object bugs is a problem in all major browsers on many OSes.

/be

Membranes are fragile if you

Membranes are fragile if you (the web developer) have to remember to make them yourself instead of leaking a capability (is this what you meant by fragile?), but pervasive mediation using compilers and VMs is the better way. We already create wrappers in Firefox under the hood, from a central place, to solve several problems.

One small leak, whether from an incorrectly implemented membrane, incorrect usage, or an incorrect policy controlling it, can expose everything. In security terms, privilege escalation is excessive under them and the attack surface is wide. I think we're agreeing that language support (e.g., dependent types a la Fine, or what I'm attempting by adding origins to JS) fixes that, though I'm unclear as to the actual technique you're suggesting with compilers and VMs (in my examples, dependent types are an example of mostly static analysis in the compiler; a case of VM support is what we're now doing with origins in JS). I'm for the abstraction of a membrane -- it's close to notions of boundaries being examined for contracts at NEU PLT -- just not the fragile implementation as a wrapper without VM or type support. I don't know of a reasonable solution for C++-ish code, but, again, I'm pursuing a story for JavaScript.

"Control Flow Integrity"

Thanks for the ref! Trevor Jim's papers were around the same time and surprisingly similar; I hadn't seen this one.

Mozilla's wrappers

We agree on membrane fragility. In the large, capability leaks happen and no amount of pedagogy will stop them. The Barth/Song/Weinberger paper on always checking access in the JS VM (using PICs to optimize the cost out of the same-origin path) was on target.

In Mozilla we use cross-origin wrappers, created by the VM when it sees an origin boundary crossing. The VM knows where these boundaries lie. Possibly this is similar to your work -- do you have a ref or any work-in-progress report?

/be

Nope, just the papers I

Nope, just the papers I cited. We're putting together a position paper about how we're thinking about cross-principal interactions (roughly: require explicit sharing of any browser calls, implemented with techniques like ConScript's; hardware resources like CPU time, implemented with OS support; and JavaScript heap objects, implemented with AOP's/Adam's/tracing techniques; and adapt a slightly finer trust model than the one presented in Helen Wang's work, like the ServiceOS). I'll send it your way once it's together.

Even if the techniques or abstractions don't float your boat, a big part is the concise specification of a wide variety of policies applications should be able to specify: it also acts as a series of challenge problems for claims of security-enabling browser primitives and designs, sort of like a design benchmark.

Was actually going to meet up with Dave H. and Dave M. to discuss some of it in the next few weeks. Probably should take the rest of this out of the public discussion :)

javascript heap objects

javascript heap objects implemented with AOP's/Adam's/tracing techniques

Aspect-Oriented Programming? Adam & Eve?

Even if the techniques or abstractions don't float your boat, a big part is the concise specification of a wide variety of policies applications should be able to specify: it also acts as a series of challenge problems for claims of security-enabling browser primitives and designs, sort of like a design benchmark.

I am interested in hearing you talk more about this, especially its interaction with JITing.

Yep, Aspect-Oriented

Yep, Aspect-Oriented Programming. Adam = Joel Weinberger & Adam Barth's paper on checking for the wrong origin being associated with a JavaScript object. For the JITing, extending Adam's observation about the PIC, the principals involved should largely be constant/static even when they interact, so advice checks can be compiled away (especially the hot path of 'permit') into just side effects for more interesting abstractions (as in object views). Will put a link to our position paper up here once we can make it public -- don't think there'll be more (public) running code than the ConScript and Object View stuff until late summer, however.

Why fragile?

A parameterizable membrane could be a small reusable bit of carefully reviewed code. No reason to expect it to be any less reliable than whatever alternate protection mechanism you'd use instead.

A language level mechanism

A language-level mechanism that other features must respect is a lot less fragile than a library-level one. For example, if the language changes, such as with an experimental browser-specific feature (e.g., the advice primitive in the ConScript paper and many others), there is no obligation for the change to respect safety properties like encapsulation that are assumed (but not always checkable) by the library. In contrast, if there's a language feature like a notion of SOP, the change must respect it. Furthermore, a mechanism like a membrane has a lot of parts (esp. in a rich language like JS), while a checking mechanism in a VM with more access may be more direct and has a smaller attack surface: generally, libraries are less reliable than interpreters/compilers, which are less reliable than OSes, etc.**

Again, I'm not arguing against a membrane as an abstraction; it's just that my experiences with JS library vs. interpreter mechanisms put me on the interpreter side of enforcing them (the Caja experience was probably similar: if you didn't like something, you could fix it directly!). Finally, I think the real question isn't which placement of mechanism is more trusted, but what's the highest level at which the mechanism may be reliably placed? [[This may be a bad security perspective, but I think it's a good systems/PL one]].

OTOH, the ES committee members who have thought about this many more years than me seem to be split here despite their experiences ;-)

**I'm assuming no reliable end-to-end verification tools, e.g., pluggable, are being used, which is a different approach that I have no practical experience with, esp. in terms of reliability and gotchas.

Yes, but

Well, yeah, but that stuff is worth optimizing, too. If you mark your native calls as pure or impure, then you can, say, move some pure function calls out of a loop.

downside: asm

Perhaps unrelated to tracing vs non-tracing jits, but a downside of writing a JIT is that you usually can't implement it in your language of interest -- for example, Mike has implemented LuaJIT in assembly and C instead of Lua.

A tracing JIT implemented by the traced language would probably have unacceptable warmup costs. I would rather write a compiler in Scheme than in C, C++, or assembler.

Dog food

A tracing JIT implemented by the traced language [...]

Despite the "dog food" aspect, not all languages are well suited for writing compilers (but still would benefit from JIT compilation).

Case in point: I would not want to write a compiler in a concurrency-oriented language like NESL, Erlang, etc., when I have the choice of using Haskell, or Lisp, or one of the ML dialects.

PyPy

Tell that to the PyPy guys :) They've developed a language called RPython that is a statically typed subset of Python which they can automatically translate to C, JVM bytecode, CLR bytecode, and probably some other stuff. This lets them write the VM, stdlib, GCs, and JIT all in Python.

Restricted scheme 'Pre-Scheme' pre-dates PyPy Restricted Python

(Try saying that 5 times fast.)

Richard Kelsey's Pre-Scheme (1996-ish) is a statically typed Scheme-like language with no garbage collection that compiles to C.

I am somewhat amused that there is a Wikipedia page for Pre-Scheme. The page has no content, but there is a reasonable list of references.

I am ignorant of projects outside of Scheme48 that use Pre-Scheme, yet it looks quite attractive for implementing compilers and VMs.

whither pre-scheme

From History of T by Olin Shivers, emphasis added:

Pre-scheme was quite interesting. Kelsey published a paper on it, as well, I believe. It was Scheme in the sense that you could load it into a Scheme system and run the code. But it was restrictive -- it required you to write in a fashion that allowed complete Hindley-Milner static type inference, and all higher-order procedures were beta-substituted away at compile time, meaning you could *straightforwardly* translate a prescheme program into "natural" C code with C-level effiency. That is, you could view prescheme as a really pleasant alternative to C for low-level code. And you could debug your prescheme programs in the interactive Scheme development environment of your choice, before flipping a switch and translating to C code, because prescheme was just a restricted Scheme. The Scheme 48 byte-code interpreter was written in prescheme. Prescheme sort of died -- beyond the academic paper he wrote, Kelsey never quite had the time to document it and turn it into a standalone tool that other people could use (Ian Horswill's group at Northwestern is an exception to that claim -- they have used prescheme for interesting work). There are ideas there to be had, however.

Jikes RVM

Well, most, if not all of the Jikes RVM JIT is written in Java, so they eat their own dog food. Mostly.

Is "byte code" dead ?

It seems to me that compiling to "byte code" will sooner or later become extinct. It seems much more reasonable to keep the syntax tree, possibly post-processed for stuff like names -> de Bruijn indices and so on. Then, from this syntax tree, much better optimization decisions should be possible. Also, while JavaScript is interesting for obvious reasons, from a purely technical view I think it is more interesting to start by designing a dynamic language from scratch (like I do with Babel-17 :-)) and look at how to interpret / compile this new language. For example, I decided that Babel-17 will have no side effects. My working hypothesis is that we first need to understand how to deal with side-effect-free dynamic languages before looking beyond.

Is "byte code" dead ?

It seems that Mike Pall is saying bytecode is not dead, but V8 and eventually SpiderMonkey won't have bytecode. Bytecode seems to be a way to make a fast, platform-independent (e.g. ANSI C) implementation of a language. When the implementation needs to go faster than that, a line has to be crossed into the platform-dependent world, so why not go all the way if there is enough interest/money to write really fast compilers supporting the various platforms?

I agree. I am just wondering

I agree. I am just wondering if it is worth the effort to produce a bytecoded interpreter at all nowadays. For example, my current situation is that I am developing this language, and I am writing an interpreter for it which accompanies the spec for the language.
This interpreter is coded in Java, so it is sort of platform-independent already. The representation of the program is not bytecode, but a high-level Java data structure. This data structure would be a great starting point for a compiler, too, so it seems that if there is a need for speed, it might be better to work directly on compilers instead of spending time on bytecode.

Depending on how distant the

Depending on how semantically distant the language is from Java, it may be better to target the JVM directly - which is of course bytecode itself. And if there's something you want that is more tedious without an interpreter, such as debugging, introspection, etc., you could interpret the bytecodes yourself.

Futamura Projection

JIT + enough partial-evaluation should let you compile runtime-provided code that is being interpreted by a runtime-provided interpreter. See Futamura Projection.
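
A toy sketch of the first projection in JS, hand-specialized rather than produced by a real partial evaluator (the little AST format is invented for illustration): specializing the interpreter to a fixed program yields a residual function that no longer pays interpretation overhead.

// evaluate() walks the AST on every call; compile() "specializes" that walk
// to one known AST, paying the dispatch once and returning a residual
// function of the environment alone.
function evaluate(e, env) {
  switch (e.op) {
    case "lit": return e.value;
    case "var": return env[e.name];
    case "add": return evaluate(e.left, env) + evaluate(e.right, env);
  }
}
function compile(e) {
  switch (e.op) {
    case "lit": return function (env) { return e.value; };
    case "var": return function (env) { return env[e.name]; };
    case "add":
      var l = compile(e.left), r = compile(e.right);
      return function (env) { return l(env) + r(env); };
  }
}
// compile({op:"add", left:{op:"var", name:"x"}, right:{op:"lit", value:1}})
// returns a closure equivalent to function (env) { return env.x + 1; }.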

The trick is avoiding or controlling communication effects, which tend to hinder partial evaluations. Modularity also hurts partial-evaluation, but that problem is diminished during runtime, which is the context of JIT. Orthogonal persistence makes 'runtime' a permanent condition (albeit with garbage-collection of code); it is a near-optimal environment for JIT and partial evaluations, especially in combination with effect-typing or some other means of separating communication-effects from 'pure' evaluations.

It is always the case that developers write code to take advantage of the optimizations available to them. I would like to see us writing interpreters that are designed to support partial-evaluations across the interpreted code - i.e. designed to support Futamura Projection. If this requires us to avoid communications-effects within the interpreter element, then so be it. The research of Tagless Interpreters provides at least some worthy notes on the subject.

Translating source to 'byte-code' then interpreting it in a 'virtual machine' seems as though it would greatly hinder the sorts of optimizations that could be achieved with a more direct interpretive style designed to support partial-evaluation. I certainly won't be pursuing byte-code for my own work. To the degree that byte-code supports a bit of portability, so would any other intermediate representation.

I suppose it depends what

I suppose it depends what you consider "bytecode". I would consider compressed ASTs bytecode; you still need to construct a program implementing a "virtual machine" to decode and execute the AST opcodes.

As for whether bytecode is a good idea, I think a format like SafeTSA is a pretty good approach (it's basically a "safe" SSA-based bytecode) since it permits all sorts of offline optimizations and ships code in a form that is amenable to JIT compilation.

What is Bytecode?

For imperative languages, I can see how you might consider compressed AST to be a form of bytecode. FORTH is essentially human-readable bytecode for a stack machine.

But the comparison fails for declarative languages - logic programming, constraint programming, typeful programming, aspect-oriented programming, reactive programming, term graph rewriting, etc. - where the semantics are less dependent upon order of expression and thus require a more holistic view. There is an AST, but no stream of opcodes to individually decode, execute, then forget.

No

Have tracing JITs won?

In short, no. They work well for small programs (i.e. five-liner scripts) and inner loops of microbenchmarks, but traces don't scale well to large programs with complex logic. One of the original selling points of trace-based JITs was that they were much simpler than a "real" compiler because of the simple, linear structure of traces. However, the complexity soon explodes again when dealing with trace trees, bailouts, traces calling traces, etc.

There are several reasons why trace compilers won't scale to big programs.

First, complex programs have long, sometimes unpredictable, paths that are difficult to identify with trace selection heuristics. Often simply identifying a method as hot and compiling the entire method while performing deep inlining will produce similar or better code. Dead code in the compiled method (which would not appear in the trace) will be eliminated with standard optimizations. But even better, the less hot, but still warm, code in that method (and its inlinees) that did not appear in whatever trace was selected will also be compiled, resulting in no expensive bailouts. Large programs have a lot of this "warm" code that is statistically unlikely to be sampled with short traces, but still warm enough that counter-triggered compilation will get it.

Secondly, the cost of adjusting a trace after it's been compiled and the program does something unexpected is relatively high--potentially a complete recompilation of a new, larger trace, as opposed to simply a branch in a method-based compilation. I think the notion that it is expensive to compile the extra untraced code in a "per-method" JIT compiler is mistaken.

Third, the idea that more traditional JIT compilers don't do inlining well or make use of dynamic profile information is simply wrong--e.g. both the client and server compilers of HotSpot do deep inlining with both static and speculative devirtualization. In the case of the server compiler, it has accurate basic block execution counts, receiver method profiles, and type profiles--far richer, more statistically valid information than a single trace. It uses profile information for everything from guiding inlining decisions, to register allocation, to code layout. And the server compiler will even insert "uncommon traps" for code that has never been executed (such as nasty exception handling code); deoptimization and potential recompilation will occur if the program does execute that code in the future.

Fourth, the JIT compilers in modern VMs for Java and C# have far more optimizations than any tracing JIT in existence. Important optimizations like loop invariant code motion, peeling, and unrolling are big performance wins and are (as yet) out of the reach of any tracing JIT. Go down the list of advanced compiler optimizations and you won't find many of them in tracing JITs.

This post seems pretty off.

This post seems pretty off. Tracing has strictly more information than method JITing, and, as discussed throughout this thread, method JITing should be done before tracing (instead of using a slow interpreter). Furthermore, tracing has demonstrated itself to be at least competitive on real, big, complex applications -- even when it uses interpreters for a lot of code!

When B. Eich talks about the long-haul, he means exactly the lack of optimizations you describe: if they're competitive without them, adding them in suggests they'll do better, which they're doing.

Finally, the cost of compilation is an interesting point, but as evidenced by work by Moh Haghighat (for TraceMonkey), it can be taken off the critical path (if it gets on the critical path, default to method JITing!).

I have reservations about tracing -- particularly how much of a win it gives (e.g., my group focuses on parallelism, esp. for mobile, for the long haul), and whether the optimizations being examined are appropriate for the domain, but from the perspective of tracing as being strictly more info than method JITing, it seems obvious (if you have the manpower).

It depends.

Ben is absolutely right. Given unlimited time and resources, traditional graph-based compilation is destined to win. Trace compilation restricts its world view to a set of incrementally explored paths through the program. At least two trade-offs are made here. First, trace compilers often start emitting native code right away, as soon as the first trace is encountered, and second, as additional traces are added, it's often not worth it to recompile existing code (for the little win that could be had). Both result in less optimal code than a graph-based compiler could produce.

My technique of choice for JavaScript compilation--if you have unlimited time and resources--would be full-blown flow-sensitive static analysis followed by a profiling phase and progressive, standard control-flow-graph-based recompilation with heavy-handed code optimization: loop hoisting, type and sub-type inference, invariant code motion, all the good things we know from the JVM/HotSpot and friends.

Now, unfortunately, at least in the JS world, for most applications we don't have unlimited time and resources. While Ben points out correctly that Java and C# VMs perform more optimizations than any tracing JIT in existence, they are also miserably slow when it comes to latency and unmanageable wrt memory resources. When was the last time you saw a Java applet in your browser? I honestly can't remember the last one. There is a reason for that. The JVM takes a good 3s to get past the initial beach-balling phase on my MacBook Pro, and after that the applet chugs along like molasses for another few seconds until most relevant code has been compiled at last. It's simply not a user experience most people find appealing, so Java has mostly disappeared from the client-side web.

Most web JS code has to be compiled and executed within milliseconds. That's where trace compilers are hard to beat. They deliver immediate performance right away.

Similarly, the high memory cost of the traditional graph-based optimizations JVMs commonly perform makes those JVMs useless in the small-devices space, and separate, much less aggressively optimizing compilers are used in the JVMs for that space. This is again where a trace compiler comes in very handy, because it offers fairly aggressive speculation and optimization at modest cost (especially code cache size) by disregarding anything but a narrow set of frequently executed program traces when generating code.

This seems to assume traces

This seems to assume traces won't use static information -- do you think that's going to be the way going forward? I assumed not.

cheap static analysis + tracing

Yes! I am very interested in cheap, optimistic static analysis. Trace compilers only see (or want to see) frequently executed code, but often enough cold code is critical to understanding some property that, if proven, could simplify code generation. Consider this code:

for (var i = 0; i < 100; ++i)
 for (var j = 0; j < 100; ++j)
  ;

Counter-intuitively, we need 3 traces to cover this kernel. The inner loop needs 2 traces, because both var declarations are hoisted to the top-level. The code really executes like this:

var i = undefined, j = undefined;
for (i = 0; i < 100; ++i) {
 for (j = 0; j < 100; ++j) {
  ;
 }
}

When we compile this, we have to deal with 2 possible scenarios for the inner loop part: [i=int,j=undefined] for the first iteration (i=0), and [i=int,j=int] for all subsequent ones. The reason for this is that the inner loop gets compiled first. To attach it to the outer loop, the outer loop will want some native code that accepts j=undefined coming in, whereas the loop edge of that trace wants j=int.

Now of course j=undefined is completely useless and in fact all uses of j are dominated by the assignment j=0. A trace compiler can't easily see this because it looks at the inner loop first, and j=0 isn't part of the loop. Only the loop condition and loop body are.

Some simple static def-use information could do wonders here. Based on that we could automatically block-scope vars where no use exists outside the block, or even infer type knowledge instead of having to speculate (and guard) on it. The code above can easily be proven never to overflow, for example, so with that information at hand we don't have to check for integer math overflows.
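
A sketch of what that would license (illustrative only, using Mozilla's "let" extension to show the block scoping; this is not what the engine literally rewrites the source into):

// If def-use analysis proves no use of j escapes the inner loop, j can be
// treated as block-scoped. It is then dead across outer-loop iterations,
// so the outer trace never has to cope with a j=undefined entry state.
for (var i = 0; i < 100; ++i) {
  let j;
  for (j = 0; j < 100; ++j)
    ;
}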

Do it

Ok, so Lua has proper scoping and makes an implicit copy of the control variable, which causes fewer complications (changing 'i' inside the loop has no effect on the next iteration). But you should be able to copy the approach I use in LuaJIT:

Inspect the runtime values of control variables and peek at the surrounding bytecode to gather constant or typed initializers for start/stop/step. It's cheap and effective (k..N loops are most common). You get narrowing and overflow-check elimination for free.

Doing it for real requires static whole-function-tree analysis

It's a great idea to find loop control variable patterns that ensure no overflow from int to double, a constant-on-trace loop bound, etc. But it can't be done just by peeking at surrounding bytecode, not if "peek" means one or two ops away:

function loopy(n) {
  for (var i = 0; i < n; i++) {
    if (sometimes())
      foopy();
    a[i] = i * i;
  }
  /* more code here... */
  function foopy() {
    if (i % 7 == 0)
      i += BIG;
  }
}

Certainly this is pathological, but we have to assume the worst when compiling, and still give correct results (e.g. if adding BIG forces i to overflow from int to double).

In general live ranges must be analyzed across closures, assignment analysis performed, etc.

A quick-and-dirty check of bytecode would have to bound bytes inspected to constant distance limits, and it could pay off for tiny loops and synthetic benchmarks, but real code in JS has nested functions, often funargs invoked in the loop body (inlined almost for free on trace).

Static analysis of the whole tree of nested functions is needed to cover such cases. We have already done some of this (I implemented Chez Scheme-inspired "display closures" for Firefox 3.5), and we will do more for loops to prove no overflow from int to double (tracked at this bug).

Well worth doing, just not as easy as looking at bytecode unless you have def/use linkage or SSA encoded into your bytecode.

Edit: again we see that differing language semantics between Lua and JS *do* matter: JS does not have proper lexical binding of the loop control variable, so mutations in the body do affect the next iteration.
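
A quick illustration of the difference:

// JS: the body sees and mutates the one and only loop variable.
for (var i = 0; i < 10; i++) {
  if (i === 3) i += 5;    // body never runs for i = 4..8; it runs next with i === 9
}
// Lua: 'for i = 0, 9 do i = i + 5 end' does not perturb the iteration at all,
// because the body only sees a per-iteration copy of the control variable.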

/be

Static analysis when idling

Do you think it would help if you did static analysis, or maybe even static compilation with profile information, when idling, and also saved the results on disk for the next time the person visits the web page?

Solar FUD

Umm, but LuaJIT does loop peeling and LICM. Or rather an equivalent that actually works for dynamic languages and provides both code hoisting and unswitching. It's 300 lines of code. Oh, and it performs selective unrolling, too.
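
A rough sketch of what peeling plus hoisting buys for a dynamic language (illustrative JavaScript only; the real transformation happens on the trace IR, and the guards become trace exits rather than source-level checks):

// Before: conceptually, every iteration re-checks the loop-invariant facts
// ('a' is an array, 'k' is a number) and reloads a.length.
function scale(a, k) {
  for (var i = 0; i < a.length; i++)
    a[i] = a[i] * k;
}

// After peeling one iteration, the invariant guards and the length load only
// execute in the peeled copy; the steady-state loop runs without them.
// (Per-element checks on a[i] may still remain; this shows the shape of the
// transformation, not LuaJIT's actual output.)
function scale_peeled(a, k) {
  var n = a.length;                 // hoisted invariant load (LICM)
  if (n > 0) {
    a[0] = a[0] * k;                // peeled first iteration carries the guards
    for (var i = 1; i < n; i++)
      a[i] = a[i] * k;              // invariant checks are gone here
  }
}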

Sure, let's go down the list of advanced compiler optimizations! Muchnick to the rescue ... LuaJIT also does: constant folding, constant propagation, copy propagation, algebraic simplifications, reassociation, common-subexpression elimination, alias analysis, load-forwarding, store-forwarding, dead-store elimination, store sinking, scalar replacement of aggregates, scalar-evolution analysis, narrowing, specialization, loop inversion, dead-code elimination, reverse-linear-scan register allocation with a blended cost-model, register hinting, register renaming, memory operand fusion.

Due to the nature of a trace compiler, it implicitly performs partial and interprocedural variants of all of them. And many traditional optimizations, like straightening or unreachable-code elimination, are unnecessary.

All of that in 120KB for the VM and 80KB for the JIT compiler. And I didn't need 15 years and a billion-dollar budget for that, either.

I'm planning to add value-range propagation, array-bounds-check elimination, escape analysis, allocation sinking, if conversion, hyperblock scheduling and auto-vectorization. Anything I forgot? I'll see what I can do. :-)

unseemly gloating

On one hand, your comment seems to amount to unseemly gloating. On the other hand, according to your figures, Lua code compiled with LuaJIT beats C code compiled with GCC for SOR and a couple of other benchmarks, and seems to be at least close to GCC's performance across the board. As far as I know, the only other compiler for a dynamic language that reached this kind of performance was Craig Chambers's version of the Self compiler, and it was kind of fragile.

So I think you have the right to gloat. Thank you for sharing this work with us.

Excellent thread

I have the feeling that this thread will become an obligatory source of knowledge to any implementor of dynamic languages for years to come. Simply awesome!

Tracing JIT reference?

I don't know what a tracing JIT compiler is, so I looked for a reference. Is this paper the place to start?

Andreas Gal, et al. Trace-based Just-in-Time Type Specialization for Dynamic Languages. PLDI '09, 2009.

No.

Tracing compilation goes back at least five years, but I believe the first big paper on the subject to turn some heads was Gal & Franz's Incremental Dynamic Code Generation with Trace Trees; this paper discusses a specific data structure called trace trees, which compact overlapping traces into a tree. References 4 and 5 in this paper are the original sources of the idea of using tracing to do JIT.

Bottom line: You have the right author (Gal) to read about tracing compilers, but are not going back far enough.

For popular usage of the

For popular usage of the term, I think older ideas like extended basic blocks are a better starting point (I think Dynamo did this?).

The core seems to be 1) dynamic traces (to enable specialization on runtime invariants, possibly requiring context) and 2) allowing traces to cross interprocedural boundaries (as opposed to typical method-based JITing). Gal's work has technical novelty (a trace tree is a nuanced take on this) and seems to be the first robust implementation for commercial languages (to my knowledge), namely Java and JavaScript. Interestingly, it still hasn't demonstrated clear performance benefits over method JITing: I expect it will over the next couple of years (under my interpretation, it strictly subsumes method JITing in terms of information available), but the question still seems open (they still need to use that information).
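
As a toy illustration of point 2 (made-up code, not from any of the cited papers): a method JIT compiles dot and norms as separate units and has to inline across the call explicitly, whereas a tracing JIT that starts recording inside the loop simply records the instructions executed within dot as part of the same trace.

function dot(v, w) {
  return v.x * w.x + v.y * w.y;
}
function norms(vs) {
  var out = [];
  for (var i = 0; i < vs.length; i++)
    out.push(Math.sqrt(dot(vs[i], vs[i])));   // the call boundary disappears on trace
  return out;
}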

So, tracing seems to go further back (15+ years?), but it still hasn't shown its colors (with Andreas et al.'s great work pushing it to be neck and neck with the alternatives).

It started 30 years ago

The basic idea is much older. Search for "trace scheduling". Joseph A. Fisher's PhD thesis is from 1979 (N/A online). His 1981 paper already discusses trace selection, trace joining, straightening, code motion and loop optimizations.

The Multiflow compiler (1984-1990) was a commercially viable trace compiler for C and FORTRAN (but the hardware wasn't). The Dynamo paper (2000) demonstrated the use of the technique for dynamic translation of executables.

Search for "trace cache" to see the same idea implemented in hardware. The most prominent use of it was in Intel's Netburst architecture (Pentium 4). The Core2 architecture employs a loopback buffer which effectively forms a cache of the last executed looping trace.

Maybe Franz & Gal can be credited with re-popularizing the technique and applying it to bytecode (vs. microcode in the original formulation).

Dynamic vs. static translation is orthogonal to the concept of a trace compiler. It is only a matter of when and how the execution profiling is done. It's just that dynamic execution profiling (i.e. JIT compilation) is much more viable on today's machines.

All notable applications of the technique work across basic-block boundaries (straightening). The extension to work across method boundaries (inlining) is quite natural then. The use of tail duplication or a fallback interpreter is widespread, but I'd attribute this more to practical issues than to any intrinsic property of the basic approach.
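
To illustrate that last point (a sketch of the common scheme, not any particular VM's implementation): a compiled trace is straight-line code whose assumptions are protected by guards; when a guard fails, execution either continues in duplicated tail code or bails back to the interpreter with the live values. The traceBody function below is a made-up stand-in for such a compiled trace.

// Sketch of a compiled trace for 'total += xs[i]; i++' with its guards. On a
// failed guard we bail out to the interpreter with the live values instead of
// compiling the cold path (the alternative being tail duplication: also
// compiling the off-trace continuation).
function traceBody(state) {
  var xs = state.xs, i = state.i, total = state.total;
  if (!(i < xs.length)) return { exit: "loop-done", i: i, total: total };    // guard
  var x = xs[i];
  if (typeof x !== "number") return { exit: "deopt", i: i, total: total };   // guard
  return { exit: null, xs: xs, i: i + 1, total: total + x };                 // stay on trace
}
// Interpreter-side driver (also a sketch):
//   var st = { xs: [1, 2, 3], i: 0, total: 0 };
//   do { st = traceBody(st); } while (st.exit === null);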

Maybe the only definitive criterion for a trace compiler is that the work selection algorithm is based on piecewise compilation steered by execution profiles. Of course there's some overlap with FDO, whole-unit-compilation or piecewise optimization in method-at-a-time JIT compilers.

I think the acceptance of trace compilation as a strong performer mostly suffers from the lack of demonstrably good implementations. This is because most implementations either never left the research stage, were retrofitted onto existing VMs and drowned in the resulting complexity and/or just never got out of their infancy wrt. optimizations.

Selling compilers is not a sustainable business model anymore (outside of niches). It's therefore rather unlikely that you'll see anyone outside the open source scene go forward and risk lots of money developing a new compiler using a rather untraditional technique.

But neither of the last two points warrants the assumption that trace compilers are not the way forward, especially for compiling dynamic languages.

Wow, great list,

Wow, great list, thanks!

It's therefore rather unlikely that you'll see anyone outside the open source scene go forward and risk lots of money developing a new compiler using a rather untraditional technique.

Intel, Adobe, and Microsoft are all investing in new compiler techniques for dynamic languages in addition to the mainstream tracemonkey-style stuff in the news/blogs. There has been movement in others (Samsung, Qualcomm, Nokia), but we are seeing some foundational stuff happening in the first three (some of it, interestingly, going back to the earlier work you mentioned, like microcode!). There's a fight over platforms right now (web, mobile, etc.), and I believe the performance of dynamic languages for new and old application domains (web, AR, machine learning, etc.) on new and old hardware (mobile, multicore, cloud, etc.) is a compelling battleground. An interesting time to do compilers..

Lacking novelty

I think there is extensive earlier related work that is also relevant. Besides the ones you mention, there's also the trace-driven work done in a commercial setting by Transmeta.

See "The Transmeta Code Morphing Software: using speculation, recovery, and adaptive retranslation to address real-life challenges". Interestingly, Franz would certainly be aware of this as his student Kistler was involved in the implementation of the 2nd generation of that product and an author of the paper. Reports included experiments by Transmeta applying this to Java bytecode.

In a static setting, the IMPACT compiler team has done extensive work on region formation, including inter-procedural regions. Similar ideas turn up in JVM work such as the 2003 paper on "A Region-Based Compilation Technique for a Java Just-In-Time Compiler" by IBM Japan.

Joseph A. Fisher's PhD thesis is now available online

Joseph A. Fisher's PhD thesis, "The optimization of horizontal microcode within and beyond basic blocks: an application of processor scheduling with resources", can now be found at:

http://archive.org/details/optimizationofho00fish

Unsurprisingly, it overlaps a lot with the 1981 paper: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1675827 (sorry, paywall).

In general or in browsers?

Interestingly, it still hasn't demonstrated clear performance benefits over method JITing

Do you mean this for arbitrary code, or specifically in a setting like a browser (where the Javascript you run into is far from "generic" code)? I've mostly seen benchmarks and comments about the latter -- I'd be interested to see a comparison for the former.

In general: browser

In general: browser benchmarks (in the JS space) seem to be more about future general apps than about what runs today. For example, you'll see a lot of stuff like crypto algorithms. I agree that they should be there, but if you want a benchmark for a modern (as opposed to future) browser, they should be weighted much lower.

The JSMeter paper talks a little about this (and starts to try to fix it). It's important: our own group, for example, found that the regexps were heavily overweighted, which could have wasted effort if we hadn't spotted it. It's actually pretty hard to come up with good microbenchmarks for browsers (and an interesting problem in automation).