UCPy: Reverse Engineering Python

Interesting paper from PyCon 2003 (so yes, it's old news - but new on Lambda, as far as I can see): UCPy: Reverse-Engineering Python.

The project partly entails replacing the "CISC-style" Python VM with a much smaller "RISC-style" VM. The authors' comments on this decision are worth considering in the light of recent discussions about the design of the Parrot VM.

Region inference

This is yet another reference to this particular kind of memory-management strategy (the paper mentions they used a pluggable memory-management interface in their VM, so they could use different memory-management strategies, including region inference). How long before this reaches the industry, like garbage collection did with Java?
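
For the curious, a pluggable memory-management interface usually boils down to a small table of allocation hooks that the VM calls through. A minimal sketch in C; the names here are made up for illustration, not UCPy's actual API:

#include <stdlib.h>

/* Hypothetical sketch: the VM only ever allocates through this struct, so a
   malloc/free backend, a tracing GC, or a region allocator can be swapped in
   behind it without touching the interpreter. */
typedef struct MemManager {
    void *(*alloc)(struct MemManager *self, size_t size);
    void  (*release)(struct MemManager *self, void *ptr);
    void  (*collect)(struct MemManager *self);   /* no-op for manual/region backends */
} MemManager;

/* Trivial backend built on malloc/free. */
static void *sys_alloc(struct MemManager *self, size_t size) { (void)self; return malloc(size); }
static void  sys_release(struct MemManager *self, void *ptr) { (void)self; free(ptr); }
static void  sys_collect(struct MemManager *self)            { (void)self; }

MemManager system_mm = { sys_alloc, sys_release, sys_collect };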

Region inference?

You mean region inference or "pluggable" memory management? ("How long before this reaches the industry")

If it is the former, there is no evidence that region-based memory management yields consistently better performance over GCs; for some programs a GC will do better, for some others it will be the other way around. Further, a pure region-based system will waste far more space than a non-conservative GC (numbers from the MLKit papers point to 2-3 times more space required when compared to a mixed GC/regions system). And while mixing GC with regions solves the space issue, it is a far more complex memory manager than a simple generational stop-and-copy GC, and more difficult to maintain, for almost no gain in performance.

Now, this is the experience of trying to retrofit an existing language (SML) with region-based memory management. If the language has a type system designed for region inference, it might do better in all these respects. The question is whether such a language would be useful, expressive, or interesting in some theoretical or practical way.

Region Inference

You mean region inference or "pluggable" memory management? ("How long before this reaches the industry")

I meant region inference. There have been lots of papers on region inference* in the past couple of years.

If it is the former, there is no evidence that region-based memory management yields consistently better performance over GCs; for some programs a GC will do better, for some others it will be the other way around.

I don't think this is a big issue. There's the same debate over GC vs. manual memory management; some waste in time and space is inevitable. OTOH, I can see several domains where automatic memory management with constant-time allocation guarantees would improve productivity and eliminate entire classes of errors due to manual allocation.

*From the paper I tracked down these:
Region Inference for an OO Language
Dynamic Region Inference

Real-time?

I suppose you mean real-time apps?

I agree that using regions over manual memory management is clearly a win; the same goes for any automatic memory-management technique. Memory shouldn't be managed manually in most cases, even when performance is an issue.

What is not so clear to me is how region-based memory management is better than GCs (or real-time GCs, if that's the case). I got excited when I saw the work done with the MLKit and thought that regions could solve some of the "problems" people associate with GCs (like mutator pause times); in practice it didn't seem to be so good. If it does not perform better in time, performs worse in space (due to non-stack-like data being put in the global region), and is not simpler to implement than a GC, what's the advantage? It's not a rhetorical question: I still don't know enough about the subject, but as far as I can see, there is none.

Anyway, the subject is very interesting and it's likely that I'll be doing more serious research into it soon. Thanks for the links, by the way.

The reason: Scalability

GCs generally don't perform well when dealing with multiple gigabytes up to terabytes of memory. Either you rapidly reduce the coverage of your memory-management algorithm (even generational GCs don't quite cover this), or you use regions.

Connectivity-Based Garbage Collection?

CBGC

Jesus Christ! Does everybody have a λ favicon? (LMCS for yet another, and of course the usual suspects.)

Reference counting

What about reference counting, then? It doesn't need to scan the whole memory. You do need to trace to detect cycles, but there are techniques for minimizing the amount of memory you have to scan when looking for cycles. And it's simple to implement.

The problem is the price you pay on pointer operations. But when you have terabytes of memory, it might pay off.
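
As a minimal sketch of the mechanism (and of the per-pointer-operation price mentioned above), ignoring cycle detection entirely:

#include <stdlib.h>

/* Minimal, non-cycle-aware reference counting sketch.  Every pointer store
   pays for an increment and a decrement; that is the cost on pointer
   operations discussed above. */
typedef struct Obj {
    long refcount;
    /* ... payload ... */
} Obj;

static Obj *obj_new(void) {
    Obj *o = malloc(sizeof *o);
    o->refcount = 1;
    return o;
}

static void obj_incref(Obj *o) { if (o) o->refcount++; }

static void obj_decref(Obj *o) {
    if (o && --o->refcount == 0)
        free(o);                  /* a real system would also release children here */
}

/* Overwriting a reference touches two counts: */
static void obj_assign(Obj **slot, Obj *value) {
    obj_incref(value);
    obj_decref(*slot);
    *slot = value;
}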

Regions and refcounts

I think regions are, in a way, an optimization of that: you can keep one refcount for a whole region, as long as all objects that reference objects inside the region know that their referents live in that region.

Also you can just forget about everything in that region, all cycles and refcounts between objects, because you free them all at once. If you need to "promote" some objects in terms of lifetime, you copy them somewhere else. This means you tend to do bulk operations (like filtering a whole list).

There was a paper or two by Alexander Aiken and David Gay IIRC on having regions with associated refcounts in a slightly modified C.
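
A rough sketch of that idea in C (one count for the whole region, constant-time bump allocation, one bulk free at the end); the names are made up, and this is far simpler than Gay and Aiken's actual system:

#include <stdlib.h>

/* Sketch of a bump-allocated region with a single reference count for the
   whole region; dropping the last outside reference frees every object in
   it at once, cycles and all. */
typedef struct Region {
    char *base, *next, *limit;
    long  refcount;              /* references coming from outside the region */
} Region;

static Region *region_new(size_t capacity) {
    Region *r = malloc(sizeof *r);
    r->base = r->next = malloc(capacity);
    r->limit = r->base + capacity;
    r->refcount = 1;
    return r;
}

/* Constant-time allocation: just bump a pointer, no per-object bookkeeping. */
static void *region_alloc(Region *r, size_t size) {
    if (r->next + size > r->limit)
        return NULL;             /* a real allocator would chain another block */
    void *p = r->next;
    r->next += size;
    return p;
}

static void region_release(Region *r) {
    if (--r->refcount == 0) {    /* no external pointers left... */
        free(r->base);           /* ...so drop the whole region in one shot */
        free(r);
    }
}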

Regions over GC

Well, regions usually work well if your program is layered, i.e. you know that a certain amount of stuff allocated in a module doesn't ever escape. Just put it in a region and free it afterwards. This is just like generational GC, except that you know when the generation is dead.

This also shows where regions break down: when you can't predict the lifetime of an object. I wonder if you could use refcounts for those objects in all practical cases.

Often, though, you can regionify a program by simulating stop-and-copy GC, that is, by copying the escaping values to another region.

I haven't done too much region programming (I used it for one not-too-complex compiler, but there most objects have just one lifetime anyway). It's too bad most practically inclined programmers out there either use stuff like Java or Python instead of Cyclone, or they go for dynamic or GCed languages like ML and Lisp. I'd really like to see more practical experience reports.

I think regions are faster than GC because you perform only the part of a copying generational GC that has to be done. Copying parts of a data structure isn't too bad, either. Plus, you get excellent cache locality (sequential allocation, and reuse of a just-freed region) and less overhead than malloc (because it's bulk de/allocation).
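
To illustrate the "copy the escaping values to another region" point: a sketch that builds its temporaries in a short-lived scratch region, copies the one escaping result into the caller's region, and then bulk-frees the scratch. It reuses the hypothetical Region helpers sketched in the refcount comment above:

#include <stddef.h>

/* Sketch: big intermediates live in a scratch region; only the escaping
   result is copied out, much like a copying collector promoting survivors.
   Region, region_new, region_alloc and region_release are the hypothetical
   bump-allocator helpers from the earlier sketch. */
double *summarise(Region *caller_region, const double *samples, size_t n) {
    Region *scratch = region_new(1 << 20);

    /* temporaries: allocated in the scratch region only */
    double *squares = region_alloc(scratch, n * sizeof *squares);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        squares[i] = samples[i] * samples[i];
        sum += squares[i];
    }

    /* the one value that escapes is copied into the caller's region */
    double *result = region_alloc(caller_region, sizeof *result);
    *result = sum;

    region_release(scratch);     /* bulk-free every temporary at once */
    return result;
}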

Real-time

I suppose you mean real-time apps?

Yes, but not only those. For example, massively concurrent apps that execute a fixed number of instructions per process before switching to the next one would benefit from constant allocation times.

What is not so clear to me is how region-based memory management is better than GCs (or real-time GCs, if that's the case).

In the case of real-time GCs, AFAICS there's a real win. IIUC regions provide better locality and better space usage than real-time GCs, which usually waste more memory due to their restricted collection times.

Performance of region allocation on NUMA architectures

Wouldn't region allocation allow you to keep the 'data' near the 'code' so that distributed GC on NUMA systems would not suffer from distributed locking? For example, using region allocation to stuff just the right amount of data and code into the sidekick CPUs on the Cell would be beneficial. I'm relatively certain region inferencing would assist with that.

--Shae Erisson - ScannedInAvian.com

Region-based MM

Ah, one of my favorite topics in CS :)

You're right. I think the MLKit is huge because SML isn't really designed for regions. OTOH, there's Cyclone (from AT&T Research and Cornell), which is C enhanced with guaranteed references (non-null pointers), tagged unions (like ML sum types), and explicit regions. It looks really cool; I wonder if people use it for real programs. It would definitely be a better alternative to C++.

Thoughts on VM design

Having thought about this issue for a while, I've pretty much come to the conclusion that a heavily CISC, register-based VM is the way to go from a performance perspective.

With a VM running on a modern processor, you end up with at least one essentially mandatory branch misprediction every time you dispatch an instruction. Considering how long pipelines are in modern processors, this is a very considerable cost when measured in cycles. A register machine with a CISC instruction set can do much to minimize this overhead. Aside from instruction dispatch, it should also be possible to get considerably better cache locality from code targeted at a register-based VM. Lua 5 switched to a register-based VM and got a noticeable speed-up due to lower instruction dispatch overhead. However, the Lua compiler does not even try to perform register allocation, so it fails to gain any significant cache-locality advantage.
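
To make the dispatch cost concrete, here is the shape of such an interpreter loop as a toy register machine in C (a sketch, not Lua's or UCPy's actual loop); the indirect branch implied by the switch is the one that gets mispredicted on nearly every instruction:

#include <stdint.h>
#include <stdio.h>

/* Toy register machine: each 32-bit instruction packs an opcode and three
   register/constant indices.  The single indirect branch implied by the
   switch is the per-instruction dispatch cost discussed above. */
enum { OP_LOADK, OP_ADD, OP_PRINT, OP_HALT };

#define INSTR(op, a, b, c) (((uint32_t)(op) << 24) | ((a) << 16) | ((b) << 8) | (c))

static void run(const uint32_t *code, const int32_t *k) {
    int32_t regs[256] = {0};
    for (const uint32_t *pc = code; ; pc++) {
        uint32_t i = *pc;
        uint8_t a = (i >> 16) & 0xff, b = (i >> 8) & 0xff, c = i & 0xff;
        switch (i >> 24) {                           /* the mispredicted indirect branch */
        case OP_LOADK: regs[a] = k[b];                   break;
        case OP_ADD:   regs[a] = regs[b] + regs[c];      break;
        case OP_PRINT: printf("%d\n", (int)regs[a]);     break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    int32_t constants[] = { 2, 40 };
    uint32_t code[] = {
        INSTR(OP_LOADK, 0, 0, 0),   /* r0 = 2  */
        INSTR(OP_LOADK, 1, 1, 0),   /* r1 = 40 */
        INSTR(OP_ADD,   2, 0, 1),   /* r2 = r0 + r1 */
        INSTR(OP_PRINT, 2, 0, 0),   /* prints 42 */
        INSTR(OP_HALT,  0, 0, 0),
    };
    run(code, constants);
    return 0;
}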

Another advantage of a register-based design is that a naive JIT compiler (e.g. one which simply inlines the opcode functions corresponding to the instruction stream) can produce a considerably greater performance boost than a similarly naive JIT compiler for a stack-based VM. I believe Parrot's Dan Sugalski has talked about this benefit in his blog.

CISC?

[...] a heavily CISC, register-based VM [...] A register machine with a CISC instruction set [...]

For some reason, I subjectively associate register-based architectures with RISC, and not CISC. Could you elaborate?

Clarification

Sorry, I should have been clearer.

I agree that register-richness is generally associated with RISC machines. However, in my post I was using the terms RISC and CISC exclusively in reference to the richness of the instruction set. For instance, a typical characteristic of a RISC instruction set is that the only instructions that read and write memory directly are the load-memory-into-register and write-register-to-memory instructions; every other instruction generally works only with registers as the source and destination operands. Suppose you want to add the words at memory addresses x and y and store the result at memory address z. On a RISC machine this would be written along the lines of

LD r1, x         ; load the word at address x into r1
LD r2, y         ; load the word at address y into r2
ADD r3, r1, r2   ; r3 := r1 + r2
ST z, r3         ; store r3 at address z

whereas on many CISC machines it could have been written simply as ADD z, x, y. So the work-to-instruction-dispatch ratio is four times greater for a CISC instruction set in this case.

I think I see

Ah, so while one probably should not combine RISC with a small number of registers, doing it the other way around (CISC and many registers) is less problematic?

But I thought the original issue was stack vs. registers; is it somehow related to the RISC/CISC dichotomy? I could imagine the same (reduced) instruction set working equally well with either N registers (Ri) or N indexed locations on the stack (SP[i])... Where is the catch?

Semi-catch

I could imagine the same (reduced) instruction set working equally well with either N registers (Ri) or N indexed locations on the stack (SP[i])... Where is the catch?

If you end up having to explicitly index the stack, I would argue that you are not really dealing with a stack-based VM in the sense the term is usually used. For instance, the x86 instruction set has addressing modes that allow this. I agree that you could have that kind of thing coexist with "pure" stack-based instructions, but that loses many of the traditionally advertised benefits of stack machines (e.g. easy code generation).

Surely this is only helpful i

Surely this is only helpful if the VM is purely interpretative, rather than compiling its instructions to native machine instructions?

Agreed

Yes, I think that the dynamic dispatch overhead per instruction is mostly a non-issue for VMs which perform native compilation. Aside from removing the cost of the instruction dispatch in the interpreter loop across basic blocks, the JIT would also perform inlining, drastically reducing instances of dynamic dispatch (especially in a VM like HotSpot which performs class hierarchy analysis and monomorphic inlining).

On a somewhat related note, it seems more and more common for VMs to follow a strategy where most program optimization takes place in the VM itself, and the different compiler(s) targeting that VM remain fairly simple in that regard. I think that's pretty nice, because it mostly knocks down one of the greatest barriers to entry for a new language - creating an acceptably performant implementation.

Yes, I think that the dynamic

Yes, I think that the dynamic dispatch overhead per instruction is mostly a non-issue for VMs which perform native compilation. Aside from removing the cost of the instruction dispatch in the interpreter loop across basic blocks, the JIT would also perform inlining, drastically reducing instances of dynamic dispatch (especially in a VM like HotSpot which performs class hierarchy analysis and monomorphic inlining).

I pretty much agree. I suppose I should have included a caveat in my original post: I don't have the manpower to write a fully optimizing compiler or even something competitive with one of the lower-end JIT compilers. I was writing it from the perspective of someone interested in writing "little languages" who wants to optimize the execution-performance-to-effort ratio.

I agree that registers are ad

I agree that registers are advantageous for interpreters, but I wouldn't call the instruction set CISC, since CISC vs. RISC has many dimensions, one of them being that RISC is easy to decode and CISC is not. You do want a rich set of instructions (the more "big" instructions you have that do a lot of work, the more your interpretation overhead is reduced), but you also want them to be very easy to decode. It might be useful to have the more complex addressing modes you find in CISC, to combine complex operations; but you don't want to spend branch mispredictions doing any further decoding after opcode dispatch.

My preference is one-byte opcodes and one-byte operands (i.e. indices into a 256-entry register file/stack or into a local literal pool).
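
Concretely, with that format decoding is nothing but byte loads. A sketch of two handlers (hypothetical opcode layouts, not any particular VM's):

#include <stdint.h>

/* Sketch of the byte-oriented format described above: one opcode byte plus
   one byte per operand, each operand indexing a 256-entry register file or
   a local literal pool.  Decoding is plain byte loads, with no bit-twiddling
   and no extra branches after the opcode dispatch. */
typedef int32_t Value;

static Value regs[256];
static Value literals[256];

/* layout: [opcode][dst][src1][src2] */
static const uint8_t *exec_add(const uint8_t *pc) {
    regs[pc[1]] = regs[pc[2]] + regs[pc[3]];
    return pc + 4;               /* fixed-size operands: just step past them */
}

/* layout: [opcode][dst][literal-index] */
static const uint8_t *exec_loadk(const uint8_t *pc) {
    regs[pc[1]] = literals[pc[2]];
    return pc + 3;
}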

It might be useful to have th

It might be useful to have the more complex addressing modes you find in CISC, to combine complex operations; but you don't want to spend branch mispredictions doing any further decoding after opcode dispatch.

Of course. There is definitely a sweet spot as far as the richness of the instruction set goes.

As far as supporting different addressing modes goes, this is one case where you definitely don't need to pay an extra branch misprediction. In general, there are various "branchless-ifying" tricks you can apply to minimize the number of those pesky mispredictions in the decoder and dispatcher.
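
One such trick, for the addressing-mode case: select the operand bank with a table lookup instead of a data-dependent branch. A sketch, with an invented operand encoding:

#include <stdint.h>

/* Sketch of one "branchless-ifying" trick: rather than branching on an
   addressing-mode bit (register operand vs. literal operand), index a small
   table of operand banks.  The table load replaces a data-dependent branch
   that the predictor would otherwise have to guess. */
static int32_t regs[128];
static int32_t literals[128];

static int32_t fetch_operand(uint8_t operand_byte) {
    /* invented encoding: high bit selects the bank, low 7 bits the slot */
    static int32_t *const banks[2] = { regs, literals };
    return banks[operand_byte >> 7][operand_byte & 0x7f];
}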

(BTW, it's scary how many game industry people read/post here!)