Garbage Collection Representations

I've written an overview of the different representations used in garbage collection based on my readings and the great explanations Andreas provided in another thread.

I'm certainly going to make good use of this information, and hopefully it will be useful to someone else too. I'd appreciate any corrections!

[Edit: I also describe another pointer tagging scheme that I haven't come across in the literature. If anyone has seem something similar, or you think it's unworkable in practice, I'd love to hear about it!]

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

GHC Pointer tagging

You may want to look at pointer tagging that was recently added to GHC albeit not for the purposes of GC.
http://research.microsoft.com/~simonpj/papers/ptr-tag/index.htm

They describe the

They describe the constructor inlining optimization I discuss and provide some hard numbers as to how applicable it is. It's less applicable given unboxed integers, but still interesting to see that >95% of constructors can be represented with an inlined form of some kind. Thanks!

Garbage collecting code?

What representations are used for code? I can imagine a sort of "function layout" structure which simply lists all functions a given function calls. Are there more compact representations the GC can use?

You could probably parse the

You could probably parse the machine code to find the function calls. That would be pretty compact :).

Indeed, but not very

Indeed, but not very portable. I don't see how it can be done for a backend written in GNU Lightning, for instance.

What if you build a mask

What if you build a bitmask after generating the code that tells you where the pointers are? The algorithm could be: scan over the words in the code and whenever you find a word that looks like a pointer, check the next bit in the bitmask to see if it actually is.

Edit: Actually, I guess you'd need to worry about your code generator creating relative jumps...

Relative Jumps

I guess you'd need to worry about your code generator creating relative jumps...

Indeed. We used GNU Lightning for the native code jitter in Alice ML. Since native code lives on the heap and can be collected, all code needs to be address independent. Unfortunately, the Lightning API provides no guarantees about generating relative jumps, so that we had to hack around its API at several points (which is one of the reasons why we never ported the native jitter to another architecture).

Static calls

Since native code lives on the heap and can be collected, all code needs to be address independent.

By collected, you mean compacted/relocated, right? Also, do I understand your solution to be that you disallowed static jumps between code blocks and made sure all generated jumps within a code block were relative? Or did you mean that you disallowed relative jumps between code blocks so that all static jumps between blocks were easier to fixup during relocation?

Yes

By collected, you mean compacted/relocated, right?

Yes, all of that.

Also, do I understand your solution to be that you disallowed static jumps between code blocks and made sure all generated jumps within a code block were relative? Or did you mean that you disallowed relative jumps between code blocks so that all static jumps between blocks were easier to fixup during relocation?

The former. Jumps to other blocks are always indirect. (In fact, inter block jumps might even involve switching to a different code format. E.g., some functions might be jitted, some not.)

I was planning to use GNU

I was planning to use GNU lightning in a similar manner, ie. relocating functions. Have they addressed the problem of relative jumps? I had asked a similar question awhile ago, but never got an answer.

I don't know

For various reasons, we have long abandoned further development of the native code jitter, and I haven't looked at Lightning since. From what I can see from the project page, the last release was in 2004, so it seems to be rather dead, which is a pity. Despite its problems, it was quite useful, at least on x86 (the other backends used to be too buggy).

From what I can see from the

From what I can see from the project page, the last release was in 2004, so it seems to be rather dead

There's a branch with working x86-64 code, so there's still some work being done. It'd be nice if the Mono project could produce a semi-portable code generator, as they have very broad architecture support. I e-mailed one of the guys working on the Mono JIT at one point, but never received a reply.

Two new polymorphic representations

I discuss two new polymorphic representations in my latest post. The bitmask representation has been briefly discussed here before. I came across the 64-bit representation quite awhile ago, but simply forgot about it until today. Is anyone using the 64-bit representation?