Representing Type Information in Dynamically Typed Languages

This is a classic paper, currently up on the scheme repository. It's about implementation strategies for dynamically typed languages, and I wanted to recommend it to LtU members because I found it very helpful.

Dynamic typing is easy to implement, but hard to implement well. Usually you wind up writing some ASM or non-portable C code that goes down to details of the OS that, technically, you should not rely on, like how many of the lowest bits of pointers returned from malloc() are always zero. If you write something portable, the result tends to be both larger and slower than you'd like, unless you learn a lot of tricks for avoiding typechecks in the first place.

So there's a complicated design space with a lot of tradeoffs, and the costs and benefits of each are not necessarily obvious. It would be nice to read a survey of implementation techniques with discussion of the tradeoffs rather than experiment and slowly work out all this stuff on your own, right? Well, here is that survey:

Representing Type Information in Dynamically Typed Languages

It's getting a bit old though; does anybody know where the equivalent paper could be found for modern (multi-core, multi-CPU, pipelining) architectures?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

thanks and a note about C

Thanks for registering that paper here. I *think* I remember seeing it years and years ago but if I did I forgot all about it until you mentioned it here. Incidentally, for a really fun example of "staged tags" (so-called in that paper -- I call them "staggered tags") see SCM.

My nit for you concerns this: "non-portable C code that goes down to details of the OS that, technically, you should not rely on, like how many of the lowest bits of pointers returned from malloc() are always zero." You might be able to argue for that conclusion from the legalese of the C standard but I doubt that such arguments would prevail against evidence like the standard "posix_memalign" function. Harbison and Steele will tell you that you can't count on any particular bit patterns when converting pointers to unsigned long ints and back again but, really, you pretty much can.

C

C is a portable language that seems to beg for non-portable code. It seems likely that whoever said "there's no such thing as portable code, only code that has been ported" was talking about C.

So Ray's point about what you "shouldn't" do could perhaps be rephrased as "shouldn't do without unit testing to make sure the target compiler/architecture behaves as expected."

Harbison and Steele will tell you that you can't count on any particular bit patterns when converting pointers to unsigned long ints and back again but, really, you pretty much can.

My first C experience was on 8086 segmented memory where I learned the hard way that pointers and longs are different. Harbison and Steele have seen stranger architectures than that.

Perhaps the relative homogeneity of modern architectures makes such concerns outdated.

re C

"Perhaps the relative homogeneity of modern
> architectures makes such concerns outdated."

Yes and no

No, because people still do and hopefully long will make strange architectures and provide C on these architectures.

Yes, because the familiar cases of ordinary 32 and 64 bit architectures, along with the standard libraries, aren't going away anytime soon and are generally the targets most likely to be of interest to someone implementing a dynamically typed language.

It's true but...

It's true that you can usually rely on some things not guaranteed by the C standard, when writing C. But it makes it hard to really say for sure what the C code means. I sort of have this thing in my head that says that one day, even if the machines are forgotten and the compilers gone, the language standard alone should suffice to allow people to figure out what the code means, without requiring intimate knowledge of machine details.

In practice I, like most people, make the assumptions I need to make to get the system to perform the actions I want. But I try to always be aware of and document the "gaps" between the semantics of what I wrote and what I was trying to express, and draw attention to the documented assumptions with asserts() that will fail if the assumptions I've made are wrong, placed right next to explanatory comments.

I don't know of *any*

I don't know of *any* high-performance multicore dynamic language. David Ungar had a fascinating talk at DLS 2009 about trying to get simple baton passing on a 64 core tilera right (so not even concurrent execution) -- it's not easy.

Are there specific problems

Are there problems specific to running dynamic languages on multicores, as opposed to running dynamic languages in general?

Impossible question?

The language designer now has to deal with multicore too, don't think dynamism has a lot to do with it.

Mutability

Most dynamic languages (such as Javascript, Lua, Perl, PHP, Python, Ruby) have pervasive unconstrained mutability. This makes it very hard in principle to use the languages in a concurrent setting.

There are added problems in practice too, because many implementations use refcount GC which turns read-only ops into read-write ops, making the mutation problem much worse.

The current consensus answer to this problem is heavyweight concurrency: fork the entire interpreter in order to use more than one core. The alternative is to change the paradigm to pervasive immutability, like Erlang and Clojure.

BaLinda K

BaLinda K is a Lisp champagne gown with a C lipstick (syntax).

It was originally designed in Lisp, then given a C-style grammar.

BaLinda K uses fork-and-join mechanism for concurrency (domain decomposition is a natural way to exploit the effective parallelism in a problem), and tuplespace for task interactions (synchronization).

In this way, mutability is not an issue.

The major difference is that today, concurrency and parallelism is widespread.

Clojure?

Isn't Clojure an example here?

Perhaps the underlying

Perhaps the underlying question really is "what's a high-performance DL?" I don't know, but can think of some PL features we regularly use when building parallel apps (... and, I believe, Clojure doesn't have.) Not all of these are necessary.

-- A general notion of where data is wrt the computation. A thread will move between cores, so, on it's own, it's an insufficient notion. A less virtualized notion (as a labmate did for Lithe in PLDI this year) can help. Other reifications of location exists, as in Swarm and the variety of PGAS languages.

-- A cooperative physical resource model. If I have a core, it's mine (until I don't, but at least I know). Same thing with memory. For computational kernels, I want to avoid GC thrashing me.

-- Access pattern support. Prefetching (at a core+cache basis), barriers, etc.

I believe (and hope) dynamic language features can work with features like these, but I just don't think we're there. Getting robust implementations of algorithms that use such notions are even trickier. Automatic sequential optimization efforts for DLs are very impressive, but automated parallelization / access pattern work seems relatively immature, so a high-performance DL would (IMO) need to solve that problem or expose primitives for programmer guidance. I don't know of any DL (or even HL language, for that matter..) that does either.

My claim is hard to defend. To be constructive, try flipping the question when you suggest a language: what about it makes it a high-performance parallel language?

Clore and Erlang

I'll agree that perhaps the real question is "what's a high-performance DL?", but I think another question is general concurrency vs. data parallelism.

Both Erlang and Clojure abstract away the parallelization using much higher level constructs. And both of these languages as I've seen them used are much more about many entities operating concurrently such as web servers or telecom. More like task parallelism than data parallelism. I have seen much fewer examples of data parallelism which I'm guessing is more what you're talking about.

Nit-picking

"what's a high-performance DL?" is the wrong question because it has an easy answer. It's high-performance according to a number of benchmarks.

The question is: "how to design a high-performance DL?" Which means, working back from the benchmarks to something which scales well under most circumstances.

I have my own ideas on that, not sure if they are good ones.

CBD

What do you think of Code Bag Descriptors and Ivan Sutherland's fleet architecture done at Berkeley?

Q

Used in kdb+.

Huh? Arthur was here a week

Huh? Arthur was here a week or two ago and said he'd be interested in making his system parallel (multicore). The high performance is from tight (sequential) array operations.

When proposing such a feature / language, it'd be helpful to say what it is and why it's high performance. This is the exact type of thing I am trying to avoid and why I flipped the argument -- it's too one-sided of a discussion otherwise. Same to Z-Bo: while I'm fairly well-read, only somebody like Tim Mattson would have a shot at small / short-lived /etc. projects without some more context (or needing to read a full paper etc.). [[And if I haven't heard of it, I'm *really* skeptical of it beyond being an idea for a design]]

You're so terse sometimes

The way your mind works amuses me :)

Are you saying that I need to be more descriptive and that namedropping Code Bag Descriptors isn't helpful?

The essence of Code Bag Descriptors comes down to what Ivan Sutherland has spent the past decade working on, including his research on Proximity Connections at Sun. In short, he has been seeking a way to focus more on communication costs and less on pipeline costs. So his argument is that typical pointer models should be banished and we should instead think in terms of fetching bags of code and executing them. This is fascinating to me, because if I understand it correctly, it is the idealized execution engine for something like Maude, a term rewriting system.

I like that idea

The direction I was heading, is that exactly. The mental model I have -oversimplified,- with multicore is that we have just a number of C64s which have access to some slow global memory. The thing I want to be able to do is serialize program parts (composition of higher-order functions) to memory, and communicate over the shared memory. I.e. make task-driven code possible, but also make it possible for some processes to run a longer time on one core.

Definitions

You wrote:

I don't know of *any* high-performance multicore dynamic language.

Maybe you could elaborate on what exactly you mean by a "multicore [dynamic] language". The q language has a parallel-map function, for example. Is there some particular feature you're interested in?

Does it?

Does it? https://code.kx.com/trac/wiki/QforMortals2/contents (anonymous/anonymous)

And, even if it does, how is parallel-map a high-performance feature? It seems rather low on the totem pole.

As for examples of features, see my above post. My rule of them is again: look at the optimizations people writing high performance parallel code and see if those are automated in a DL or at least allow explicit specification of them.

Really?

Your link is to a book about q. That's an awfully fancy way of saying nothing. :)

Q's parallel-map function is an example of parallelism. No offense, but I think that your definition of a "parallel language" is vague and open-ended (who are these people you mention, and what exactly are the optimizations that you have in mind?).

One optimization that a compiler might make is to use control and data dependence graphs to identify independent execution paths -- q (as far as I know) does not do that. However, another approach to the same problem is to offer a function that performs that same parallel task assuming that the programmer has proven to himself that those execution paths are independent (that's what q's "peach" does, and I'd argue that it's typical of the kind of decisions about proof that are made in untyped languages). Analyses of control flow graphs aren't the only way to achieve parallelism ... a parallel combinator library can do the same work.

I'm pretty familiar with q, and I've written similar interpreters in the past. Like a lot of untyped languages, they have a "universal type" which is a variant type including every possible term type, in their case this includes arrays of primitives. That's the key to their good (sequential or parallel) performance.

It was a reference book

It was a reference book where I searched for the word "parallel" and didn't hit it.

There's a lot that can be done with an array language. A lot of parallel performance optimization is about minimizing communication and synchronization (e.g., avoiding cache misses through memory scheduling), for which an array language is well-positioned. Are there any articles about Q doing this or providing guarantees to allow a developer to do it? As other concrete examples, BLAS and FFT libraries, essentially arrays with nice mathematical structure (esp. similar in dense cases) have heaps of parallel optimizations (and bodies of literature about them).

High-performance Dynamic Languages

I think parallel architectures change the tradeoffs somewhat, for both statically typed and dynamically typed languages. It makes both harder to implement. Some of the performance issues have changed and need to be reevaluated. This isn't a problem that particularly affects dynamically typed languages more than statically typed ones, but it changes the performance considerations of various implementation strategies for both.

Pipeline stalls and branches have a larger impact on performance in recent architectures than they did when the above paper was written, so avoiding the "if-else" trees typical of naive type dispatch has become more important now than it was then, for example. So for high performance dynamic dispatch you'd be more likely today to want to use jump tables (shift the type tag, add it to a pointer and jump, once) incurring a single pipeline stall, than use an if-else ladder.

Immutable vs. mutable data (and its effect on cache coherence) was almost a nonissue when the above paper was written, but now it matters a lot. It has risen to the level that for high performance either the programmer or the language runtime has to keep track of it. This affects data representation by making it more important to keep track of the distinction between mutable and immutable data than it was before. So the type systems under discussion should now be more complex, IMO, by at least one dimension, and type inference or type checking in modern languages generally needs to ensure that functions which need a mutable value don't get an immutable one, and that whatever can be proven immutable or made immutable without mucking up the programmer's semantics gets proven immutable. That in turn enables other optimizations like data sharing, paging immutable values, allocating mutable values together in high-speed shared caches, and so on.

That's just one of the issues that has changed with the importance of cache coherence. Thread-local memory is a huge issue depending on how freely the threads can migrate between cores with shared cache, CPUs with shared memory, and CPUs without shared memory. Succinctly: if a single thread is locked to a single core, then thread-local variables can be allocated in core-local memory (or registers) where there is absolutely no cache-coherence issue. If a single thread is locked to a single CPU, then thread-local variables can take advantage of the CPU or level-1 cache, where cache coherence between cores is the only issue. If threads can migrate between CPUs on the same shared-memory machine, then thread-local memory has to get pushed out to level-2 cache or main memory where coherence between CPUs is an issue. And if threads migrate between shared-memory arenas, then you have to use the network to transmit thread-local variables whenever the thread migrates (Except you only need to transmit immutable variables once). Each of these things has its own effect on performance, and data representation strategies for dynamic languages strongly affect the demand for limited cache space or network bandwidth that these models incur.

Finally you have failure issues. Any time you talk about highly parallel systems these days you have to take into account that any communication over a network can fail, any individual machine can fail, and a well-designed program should keep working and minimize or avoid production of inaccurate results in spite of a few failures. That means transactional memory, checking results for consistency, checking for network and remote errors, dealing with timeouts, etc. What specific effects data representation has on failure issues I don't know, but I bet it isn't a nonissue.

So for high performance

So for high performance dynamic dispatch you'd be more likely today to want to use jump tables (shift the type tag, add it to a pointer and jump, once) incurring a single pipeline stall, than use an if-else ladder.

This is not true. Branch predictors cache jump targets based on site of branch. A jump table has a single branch site for all targets making the branch predictor essentially useless, and incurring a pipeline stall on virtually every run through the code. A linear if-else sequence or a inline if-else binary decision tree tends to be incur fewer stalls due to the higher number of cached branch sites.

See this paper for a comprehensive study of various dispatching techniques, including jump tables.

That's surprising.

At least to me that's surprising. It's a perfect example of the kind of "updated knowledge" I was looking for. Thank you very much for the link.

Now you've got me wondering if my use cases/language implementation is pathological. I'm doing multi-argument dispatch on the basis of a jump table (and catenation of argument typetags) as opposed to branching or making an if/else tree for each possible type of each variable, and just having a hard time imagining that all the possible cases could be handled with, on average, less than one pipeline stall.

Well, I guess it is pathological.

Short summary of their results on various JVMs and architectures: in case of 2 to 4 possibilities, an if-else ladder is found most efficient. In case of 5 to 10 possibilities, a binary if-else tree is most efficient. For more than 10 possibilities, jump tables appear to be most efficient.

This is all based on Java, which dispatches on a single type argument. Since Java code usually has a small number of types possible for a particular method at a particular call site, a small number of possibilities is likely.

By comparison, my hobby lisp with multi-argument dispatch, especially on functions of higher and variable arity, is in fact pathological. More than 50 valid (non-error) possibilities in a type dispatch is quite normal in the code I'm working with.

Just think of if-else

Just think of if-else structures as a polymorphic inline cache. I think Self introduced PICs, so this result has at least been folklore since then. These Java results have been confirmed in fast interpreter research as well, where increasing the number of dispatch sites can result in 2-4x interpreter speedups in the op dispatch loop; see Ertl's work, and "Context Threading" in particular.

As for your multidispatch case, the lesson to take away is to simply increase the total number of dispatch sites in order to leverage the branch predictor. At some point the amount of inline code these PICs require overwhelms the CPU cache and you lose performance, but a single dispatch site typical for vtables/jump tables is simply too few. There's a good balance in there somewhere.

Modern Processors

That paper largely looks at the Pentium III and the UltraSparc III which may not be terribly representative of some more modern chips in their branch prediction behavior.

For example, the Pentium 3 will only allocate one BTB (branch target buffer) entry per branch site even for indirect branches. So an indirect branch will only be predicted correctly if it has the same target that was used the last time.

However, the Pentium M, the Core 2, Core i7 and K10 do not share this limitation and can predict indirect calls with multiple call targets. As a result, I would imagine the results would be a bit different for those processors.

That said, the Atom processor has the same indirect branch limitation the P3 does and I imagine at least some (perhaps most or even all) ARM processors do as well.

So while that paper is a good resource, it's worth doing some testing on the architectures you're interested in optimizing for. Agner Fog's optimization site has some excellent info about the branch prediction behavior of quite a few x86 CPUs that could be useful for tuning method dispatch as well (it's where I got the above info about indirect branch prediction).

However, the Pentium M, the

However, the Pentium M, the Core 2, Core i7 and K10 do not share this limitation and can predict indirect calls with multiple call targets.

Yes, these architectures are known to be have much better branch prediction than their predecessors. Still, as you note below, there are plenty of less advanced chips that these inline caches help.

That said, the Atom processor has the same indirect branch limitation the P3 does and I imagine at least some (perhaps most or even all) ARM processors do as well.

The Atom and the P4, of which there are still plenty around. As a general dispatch optimization, inline caches help across architectures. The costs, if any, are likely to be small on the more advanced architectures you listed above, so why not use it? I agree that measurements are needed however.

In fact, I did such measurements myself on .NET on my Core2, and found that an inline cache of test-cast-call was faster than virtual dispatch. I can't think of any reason other than improved branch prediction to explain this behaviour, but if you have another thought let me know!

Interface Method

The costs, if any, are likely to be small on the more advanced architectures you listed above, so why not use it?

The K10 can only track 3 branch instructions within any 16 byte block which could potentially be a problem for if-else based dispatch. A few nops (or longer encodings of the same instruction) can take care of that at the expense of code size. Code size in general is also a concern.

I imagine that one of the if-else based approaches will still be faster at least in some cases on the more advanced chips. The question becomes when one is preferable over the other.

I can't think of any reason other than improved branch prediction to explain this behaviour, but if you have another thought let me know!

C# is not a language I am intimately familiar with, but if I'm reading your source correctly the virtual dispatch cases are done through an interface type rather than the base class of an inheritance hierarchy. I'm not sure how those kind of calls are implemented in .NET, but they may not be as simple as a vtable call (it's a callvirt instruction either way in the CIL, but the native code generated will almost certainly be different).

Hotspot calls a stub that does a linear search for the correct itable and then does an indirect jump using the pointer at the appropriate slot in the itable it found. There are other methods for handling interface calls that have lower overhead and it's possible .NET uses one of those, but I had trouble finding any documentation on it.