Taking Off the Gloves with Reference Counting Immix

Taking Off the Gloves with Reference Counting Immix, by Rifat Shahriyar, Stephen M. Blackburn, and Kathryn S. McKinley:

Despite some clear advantages and recent advances, reference counting remains a poor cousin to high-performance tracing garbage collectors. The advantages of reference counting include a) immediacy of reclamation, b) incrementality, and c) local scope of its operations. After decades of languishing with hopelessly bad performance, recent work narrowed the gap between reference counting and the fastest tracing collectors to within 10%. Though a major advance, this gap remains a substantial barrier to adoption in performance-conscious application domains. Our work identifies heap organization as the principal source of the remaining performance gap. We present the design, implementation, and analysis of a new collector, RCImmix, that replaces reference counting’s traditional free-list heap organization with the line and block heap structure introduced by the Immix collector. The key innovations of RCImmix are 1) to combine traditional reference counts with per-line live object counts to identify reusable memory and 2) to eliminate fragmentation by integrating copying with reference counting of new objects and with backup tracing cycle collection. In RCImmix, reference counting offers efficient collection and the line and block heap organization delivers excellent mutator locality and efficient allocation. With these advances, RCImmix closes the 10% performance gap, outperforming a highly tuned production generational collector. By removing the performance barrier, this work transforms reference counting into a serious alternative for meeting high performance objectives for garbage collected languages.

A new reference counting GC based on the Immix heap layout, which purports to close the remaining performance gap with tracing collectors. It builds on last year's work, Down for the count? Getting reference counting back in the ring, which describes various optimizations to raw reference counting that make it competitive with basic tracing. There's a remaining 10% performance gap with generational tracing that RCImmix closes by using the Immix heap layout with bump pointer allocation (as opposed to free lists typically used in RC). The improved cache locality of allocation makes RCImmix even faster than the generational tracing Immix collector.

However, the bump pointer allocation reduces the incrementality of reference counting and would impact latency. One glaring omission of this paper is the absence of latency/pause-time measurements, an omission typical of reference counting papers, since ref counting is assumed to be inherently incremental. Since RCImmix trades away some incrementality for throughput by using bump pointer allocation and copy collection, I'm curious how this impacts the pause times.

Reference counting has been discussed a few times here before, and some papers on past ref-counting GC's have been posted in comments, but this seems to be the first top-level post on competitive reference counting GC.

Language interoperability

Another huge practical advantage of reference counting is easy interoperability with C. If you write a C extension to CPython, the extension doesn't need to register liveness roots for tracing. It seems like this collector would require extensions to respect *both* reference count semantics and roots and tracing semantics.

Agreed

Reference counting makes a language behave like a good citizen. With tracing, it expects to be the world, and a host language has to do gymnastics to maintain the illusion.

If you write a C extension

If you write a C extension to CPython, the extension doesn't need to register liveness roots for tracing.

True, but you still need to invoke a ref count increment on every object that passes into C, so you haven't gained anything that I can see. Both are single function calls, one takes an object address, and one takes the address of a slot holding an object reference.

About obj-c ARC

The idea that refcounting is slower than a GC is new to me. As a user of Obj-C (and .NET, Python) I see the opposite. Is there some info about this to look at?

Any book describing

Any book describing automatic memory management should explain this properly, like "The Garbage Collection Handbook: The Art of Automatic Memory Management" or "Garbage Collection: Algorithms for Automatic Dynamic Memory Management".

Basically, when using RC, every time a pointer gets assigned to another one, a pair of increment/decrement operations needs to take place.

This affects cache lines, memory access patterns, and lock contention when the pointers are used across threads.

With GC algorithms, all of this is postponed until a stop-the-world collection occurs, or is avoided by having the GC run in parallel on another core.
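To make that cost concrete, here is a minimal C sketch of a naive RC write barrier running on every pointer assignment; `obj_t`, `assign`, and the instrumentation counter are all hypothetical, not from any real runtime:

```c
#include <stddef.h>

/* Hypothetical refcounted object: the count lives in the object header,
 * so every count update dirties that object's cache line. */
typedef struct obj {
    long rc;
    int payload;
} obj_t;

static long total_count_updates = 0;  /* instrumentation, for illustration only */

/* Naive write barrier: runs on EVERY assignment of a refcounted pointer. */
static void assign(obj_t **slot, obj_t *newval) {
    if (newval) {
        newval->rc++;             /* touches newval's cache line */
        total_count_updates++;
    }
    obj_t *old = *slot;
    *slot = newval;
    if (old) {
        total_count_updates++;    /* touches old's cache line too */
        if (--old->rc == 0) {
            /* reclaim(old) would go here */
        }
    }
}
```

Note that a logically read-only operation (copying a pointer) mutates two unrelated cache lines, which is exactly the overhead the comment describes.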

Hey, I was trying to kill that old exaggeration.

every time a pointer gets assigned to another one

No, not every time, as I described immediately below. I would not bother correcting this overstatement, except brevity appeals to folks who don't read. You gave the ten second sound-bite version on why RC sucks, which is technically incorrect in that it's not necessary to always change counts. When some ref pins a thing in place longer than you live, it can be treated as permanent from your perspective.

It would be nice to have an extra core that has nothing better to do than collect garbage. Yes, this luxury does happen, and it's getting more common as core counts go up. But some apps try to budget all cycles on all the cores when things get busy. I have a guy who's ready to remind me that wasting X percent of cycles is wasting Y gigabits of network appliance capacity. Every one percent you save is cause for high fives. When every core is tasked doing the same thing, there isn't an extra. But consumer hardware probably has trouble tasking all the cores with something useful to do in end-user apps, so parallel GC sounds good there. What if end-user apps were written so the hardware actually got used, though? Probably burn through all the battery, I suppose.

I suppose so.

So then we might as well burn the battery on GC instead? ;) I'm joking, but there are those who present this in all seriousness as an argument in favor of C++. This and "total kWh wasted worldwide re-JIT-compiling the same code over and over."

I find those arguments to be, shall we say, second-order effects at the very most. Not to mention very difficult to quantify. (Shall we really weigh it against total kWh wasted worldwide enduring interminable C++ compiles?) But it at least offers an interesting perspective.

Cirque du Soleil presents: refcounting in Quebec

I'm not especially eco-conscious, but it's good to consider unintended consequences of getting what you want. After wishing code would scream along in a device, we should think about what happens then. In some contexts, getting done sooner and quiescing faster might be better than pegging the CPU.

A few years ago someone showed me a chassis intended to hold a dozen hard disk drives. There was no way for them to breathe, with no provision for air flow or heat dissipation. Just looking at it, you knew drives would cook themselves. Who thought of that? The best you could hope for was that no i/o would need to occur. Someone wished for a dozen spindles; what would happen if they got them?

In recent years I've noticed a tendency for something bad to happen after I unblock a piece of tech frozen in the paralyzing grip of technical debt that was preventing further change. If I put back N degrees of freedom, folks immediately spend N+1 or N+2 degrees of freedom ... making it fail because now it's self-contradictory. At first it seemed like weird ironic coincidence, but now it seems guaranteed by human nature.

Anyway, I'm not against GC per se, and I'd like to use it more. I just get stuck on projects where it's forbidden. (Even more frustrating is when floating point is forbidden, because folks compile with options using floating point registers in scratch calculations, preventing your values from being restored after context switch so you can't do it safely.) Ideally, one need not know whether a component uses GC or not. Then you'd prototype with GC, because faster coding is good use of dev resources, but be able to swap in something else later, time permitting.

Most of all I like making decisions driven by evidence, not mere opinion. Direct comparison with an alternative is hard to beat as an empirical tactic. So I want everything swappable, for purposes of comparison. I hate to see winners without at least one runner-up that lost. Too often the loser is a hypothetical arm chair experiment never tried, assumed to do the worst possible thing. For example, before dismissing RC, trying to make it work first is an intellectually honest approach.

In a dialog, here I'd have a character ask when refcounts need to change, then construct an argument. This isn't empirical at all, but it lets you plan an empirical experiment based on reasoning you uncover. You could publish tests for various options considered, and folks could see for themselves.

The answer: only when ownership changes. A C coder, Jim, is likely to malloc() and free() structs directly. Go look in the racoon library for ipsec: that's what you'll see. Actual allocation is heavyweight, but using something already allocated is cheap. Allocation bounds the scope of owning the space involved. Inside the scope, Jim assumes the pointer is free to use whenever, because it's owned. If this is ever wrong, Jim debugs the crash and goes back to assuming ownership means safe free use. (Jim never thinks "close enough for government work" because Jim is pretty sure he's a wizard.) Jim puts the owning pointer in a specific place: this field in this struct is the one tracking the allocated memory. Every other alias of that pointer is free as long as covered by the owning pointer.

When Jim uses something refcounted, tactics barely change. Adding a ref is like taking partial ownership. Jim only needs one ref to become part owner, so he treats addref() like malloc(): the moment when ownership is secured. Until Jim releases that ref, he treats the pointer the same as if he allocated it himself. It can't go away until his known ref is released. The known ref goes in a specific field in a specific struct, and every other alias of that pointer is covered and safe to use.

If the refcounted object is mutable, Jim follows whatever intricate rules are required to be safe, depending on when and how conflict might be observed and the protocol for locking if necessary under concurrency. But this doesn't change Jim's assumption the object cannot go away while his one ref is held—it's pinned. If immutable, Jim never really cares about any other ref besides his own; Jim can pretend he is the only owner, and that release() is the same as free(). One ref covers the scope of Jim's activity. Inside that scope, when the ref is known to be held, Jim never thinks about each alias requiring another ref.

If Jim sees a crash, he debugs by auditing to see whether his ref is honored. A refcounted object's generation number cannot change as long as Jim holds a ref. Jim can tell whether it changes by copying the generation number, putting it in a field right next to the owning pointer. If supposedly immutable, the object's change count can't change either, and is audited the same way: copy and compare. Nothing requires Jim count each alias. In fact, trying to do so would create a noisy signal obfuscating Jim's view.

Apologies for verbosity. I know you grasp all this already. But maybe minimal refcounting will seem more obvious to other folks.

Yep.

I like Jim. I write a lot of code like Jim's. In fact, it turns out that when you do things Jim's way, refcounts "almost never" need to be anything other than 1 or 0. So you "almost never" need them at all. Go figure.

I have problems with C++, but memory management is really not the biggest. In fact, it's probably not in the top ten. After years of working almost exclusively in GC'd environments, this came as a bit of a surprise to me. Now I feel like all the effort spent avoiding manual memory management may be somewhat misplaced. (Not that I hate GC either, and for some idioms it's really absolutely necessary.)

Does not scale.

The problem is when Jim is on sick leave and Joe needs to make a code change.

Joe admires Jim as a C master, but he just got his C orange belt last week.

The change goes silently into source control, without all the use cases Jim knows about.

Six months later, Jim is on his third night without sleep, trying to understand why one of their most important customers is facing instability issues.

Maybe not

Fundamentally, I think you're right. I've never seen any approach to high-performance computer programming that scales. Some programs are simply not going to be maintainable by junior programmers and/or without a continuity of stewardship and the accumulated lore that goes with it.

That said, I think there are things that can be done to help with this. A handful of conventions are enough to make this approach manageable even in a moderately-sized codebase, and I do think this is an area where C++ can help a lot over C. But I wouldn't really argue the point, I think I basically agree with you.

Non sequitur, but true.

So far that's not about refcounting, so I'll add an obligatory RC part by the end to keep it on topic. Jim can try to make it hard for Joe to foul the works via docs, tests, and some kind of runtime tax watching for surprises during execution.

Most languages have trouble scaling in a similar situation, when later devs make changes. The more things you can get wrong, the easier Joe can make mistakes. C lets you do a lot of things wrong, from manual memory management to weak typing made worse with casts typical in any generics scheme. (C preprocessor macro-based libraries make me shudder.) So I'm guessing your point is that C should not be used? Making folks stop using C is beyond me—lots of inertia there. My story about Jim meant to characterize his sort of thought process and how it relates to refcounts, since Jim's normal practice shows ref-per-alias isn't needed.

I described it so folks can think about automating what Jim does in his mind, so tools can use the same optimization and perhaps also catch things Joe does wrong later, if a static analysis tool can see it. Jim is obligated to review everything Joe does, though. Even better, Jim can leave a test suite in Joe's path that's hard to get around. So if six months pass before trouble, that's partly Jim's fault for not catching it sooner.

However, knowing how Jim thinks, you can cooperate when providing Jim with new C tools. When Wil writes a green thread library Jim is supposed to use, Wil had better support sticking refs representing ownership in specific places, the way Jim knows will work. Better yet, Wil's C rewriter should look for problems easy for Joe to add later, while transforming Jim's C code to continuation passing style (CPS) so it runs as a coroutine despite looking like ordinary C.

You can do handles in C acting like a poor man's smart pointer, to centralize refcounting and auditing (that might or might not be present depending on compile options) so Jim has less code to write in the first place, and Joe has fewer rules to get wrong. Code in C just tends to be ugly, so a graceful way to express refcounting like that might be hard. Wil's C rewriter can likely see refcount problems, though, and Joe's build might break.

Devil's advocate.

I was playing the Devil's advocate a bit with regard to the whole process you described.

Personally I always advocate GC or refcounting, since I first encountered them via Oberon back in the '90s.

I just like to pick on C, being a strongly typed language fan, although its issues with manual memory management carry over to many other languages.

Your descriptions are quite good; sorry if I caused any inconvenience.

No inconvenience; feedback is good.

I like devil's advocacy; it's a lot better than silence and indifference. I just want a handle on someone's intent, whatever it is, because a common intent is a desire to steer conversation towards a specific objective, and the hidden agenda can be tedious. Much better is when someone announces, "I'm going to mess with you now," because that can at least be creative, which makes good feedback. In dialogs I plan to use a character named Crash who does that unabashedly.

Like Matt Hellige, I think like Jim too when I work in C: the model is idiomatic in a lot of C code doing memory management when folks try to be careful. Jim's is about the only model I used about 25 years ago. By 20 years ago I was dabbling in Smalltalk and Lisp, and my thought process got complex after I read Appel on continuations. I'm not good at simulating Jim's aversion to other languages, which is hard for me to understand beyond simple conservative caution. So while, for example, Rust looks like a good match for what Jim wants, I can't guess when that will take off. (And Go, you look hot too; stop fishing for compliments.)

hidden CPC reference?

Better yet, Wil's C rewriter should look for problems easy for Joe to add later, while transforming Jim's C code to continuation passing style (CPS) so it runs as a coroutine despite looking like ordinary C.

Is this a hidden reference to Gabriel Kerneis' Continuation-Passing C? It does exactly what you describe in this sentence (and works and is actively maintained), but I don't know if there is any explicit account of ownership in the exposed threading interface. If you're thinking of some other project doing something similar, I would be interested in the citation.

(The part of CPC that does the translation to CPS of existing C code is quite robust and will work on large-scale C programs. In particular, see not-yet-published work on using it on QEMU as a whole: QEMU/CPC: static analysis and CPS conversion for safe, portable, and efficient coroutines, Gabriel Kerneis, Charlie Shepherd, Stefan Hajnoczi, 2013.)

Here I veer way off topic.

I meant it more as generic reference to anything similar, which would include Kerneis's stuff, and also something I may write about. I posted on that thread about CPC, if you read down a bit, so you can see I've been cooking this a while.

When you create something, you need a plan for what you expect other folks to do with a new thing. In my case, I expect others to steal ideas without credit and re-apply them locally in other tools informed with additional insight. All I care about is moving things along to more interesting tech. This is basically boredom avoidance. :-) Lately I find I can't play video games much any more because I'm bored. So writing about green thread fibers is just something to do that will be interesting temporarily.

In the last week or so I finally decided I should write all the docs as fiction about a guy named Wil creating libraries involved. The main advantage of this format is lack of need for justification. Code only has to match Wil's docs. It doesn't have to be the best idea, just fully described. Dialogs with other characters provide a way to verbosely explore side issues and considerations, in a way not unlike how teams of coders meet and exchange email, which commonly motivates project code. Some docs will be specs Wil writes in the same manner used at work: dry, terse, straightforward email-style treatises, which other characters might tear apart. The goal would be to explain why every bit of C code looks that way. I need to rewrite this paragraph as a conversation in dialog when it gets fictionalized.

Fiction is supposed to be entertaining, but docs generally run boring, and these are going to be docs, so Wil must warn this will be boring despite being fiction. Funny things can be sprinkled inside, in moderate doses. For example, one character might be code detective Minerva Planck, the grand-daughter of Max Planck, who talks in the third person like a film noir detective straight out of Chandleresque parodies. She's narrating the adventures of Min Planck, code detective, with a cameo appearance by Jim Carrey.

Rust

Sounds like Rust is the perfect language for Jim.

Refcounting options.

Common understanding works as follows. If you change refs every time you copy a refcounted pointer, this is expensive because it's proportional to pointer aliasing. It also touches cache lines for no other reason than to change refcounts: logically readonly but mutable in actuality. This is the easy thing to do in automated tools like compilers, so compiler folks typically assume refcounting always scales with aliasing, and is therefore slow.

When refcounting is done manually, one need only change refcounts when necessary, then rely on someone else's ref whenever possible. Specifically, if scope B is strictly nested inside scope A, and A holds a guaranteed ref to something, then scope B doesn't need to do so when aliasing pointers to those counted objects protected by scope A. This can be much faster, touching fewer cache lines, and ought to perform better than GC except in one case, as below.

When you release a ref to something, it might be the root of a large subtree of refcounted objects that must all be reclaimed. Such a subtree can be arbitrarily large. When ref release is synchronous, recursive, and uninterruptable, the latency can take as long as required to visit all that memory and release it all in one go. In short, done naively this can be a huge latency blip. This case frightens folks with real time response needs.

To deal with this in a lightweight-process/green-thread library, with entities managed by refcounting alone, this means ref release must be done incrementally to keep real time guarantees, no matter how many resources become free at once. If you kill a lightweight process and it releases a large amount of memory, this must all be queued in a todo list processed incrementally by a system cleaner. This leads to system-cleaning fibers calling code defined by a killed process, so it might seem strange that behavior can live longer than the original process defining it. To make it well-behaved, such cleanup code can be constrained to release resources only, without allocating new resources. (Since release is generally a one-way message anyway, there's no need to wait around for replies requiring context after callback.)
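A minimal sketch of that incremental release scheme: release is O(1) and just queues the object, and a cleaner processes the queue under a budget. The intrusive `next_free` link and the `cleaner_step` budget are hypothetical design choices, not from any existing library:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical refcounted tree node. */
typedef struct rnode {
    long rc;
    struct rnode *left, *right;
    struct rnode *next_free;   /* intrusive link for the cleaner's todo list */
} rnode_t;

/* The system cleaner's queue: releasing a big subtree just pushes the
 * root here; nothing is visited synchronously. */
static rnode_t *todo = NULL;

static void release(rnode_t *n) {
    if (n && --n->rc == 0) {
        n->next_free = todo;   /* O(1): defer, don't recurse */
        todo = n;
    }
}

/* One incremental step, callable from a cleaner fiber: free a bounded
 * number of nodes per call, pushing children back on the queue. */
static size_t cleaner_step(size_t budget) {
    size_t freed = 0;
    while (todo && freed < budget) {
        rnode_t *n = todo;
        todo = n->next_free;
        release(n->left);      /* children may join the queue */
        release(n->right);
        free(n);
        freed++;
    }
    return freed;
}
```

The latency blip becomes a series of bounded steps, at the cost of floating garbage between steps.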

I think some folks don't realize ref release can be made incremental and low latency that way.

Specifically, if scope B is

Specifically, if scope B is strictly nested inside scope A, and A holds a guaranteed ref to something, then scope B doesn't need to do so when aliasing pointers to those counted objects protected by scope A. This can be much faster, touching fewer cache lines, and ought to perform better than GC except in one case, as below.

But if you're tracking scoped lifetimes like this, you might as well go with full-blown regions which is faster than refcounting and tracing both (just refcount the region). Ditto for your green thread abstraction, ie. have task-local heaps which can be destroyed in constant time. Then you don't need runtime threads to perform cleanup.
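The "refcount the region" idea can be sketched as a toy bump-allocated region in C with a single count for the whole heap; all names here are made up for illustration:

```c
#include <stdlib.h>
#include <stddef.h>

/* A minimal bump-allocated region: individual objects are never freed;
 * the whole region dies at once when its single count drops to zero. */
typedef struct region {
    long rc;            /* one count for the whole region, not per object */
    size_t used, cap;
    char *mem;
} region_t;

static region_t *region_new(size_t cap) {
    region_t *r = malloc(sizeof *r);
    r->rc = 1;
    r->used = 0;
    r->cap = cap;
    r->mem = malloc(cap);
    return r;
}

/* Bump allocation: no per-object header, no per-object count traffic. */
static void *region_alloc(region_t *r, size_t n) {
    if (r->used + n > r->cap) return NULL;
    void *p = r->mem + r->used;
    r->used += n;
    return p;
}

static void region_retain(region_t *r) { r->rc++; }

/* Constant-time teardown (modulo the allocator freeing one block). */
static void region_release(region_t *r) {
    if (--r->rc == 0) { free(r->mem); free(r); }
}
```

Allocation is a bounds check and an add, and destroying a task's whole heap is one decrement, which is the constant-time property the comment appeals to.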

Very interesting.

... have task-local heaps which can be destroyed in constant time.

That's a wonderful suggestion I can use in docs to explain how it subverts expected use. I never would have thought of that. Here's why:

The point of embedding a green-thread engine in a host process is to share state, rather than carefully isolate it, and to share state among fibers and lightweight processes too, often specifically for performance: sending refcounted buffers through pipelines for zero-copy i/o.

Fibers and lightweight process lanes scope behavior primarily, rather than state, though you only want one writer of state at one time. Typically state passed between lightweight process lanes has become immutable by convention for functional programming on data, for clean conflict-free dataflow behavior.

A fiber is for independent scheduling rather than state isolation. Typically its role is to manage one leg of transform/filter/observe for refcounted state passed through in a pipeline before it gets handed off again. In the form I like, a fiber can be as small as one activation frame, plus relationships with the rest of the environment like a parent lane. It has as much of the stack frame list as you wanted the fiber's coroutine to know, which you can arrange at spawn time.

Layers above (and below) in the host process pass state back and forth as inputs and outputs, typically in queues, and arena-style task-local heaps won't work unless references to everything outside are carefully tracked. A green-thread layer is just very involved control flow switching with highly uniform structure and behavior, so it's susceptible to analysis and control.

Promiscuous state sharing between fibers is part of the model. It's a way of letting data capture its own associated processing behavior that gets scheduled independently of other things. Everything becomes an agent when that simplifies what an app needs done. You can easily return arbitrary state from one fiber to another by merely handing off refcounted activation frames: "Take my final stack and go nuts."

The sort of thing you suggested is a good idea for sandboxing, with a strict copy-in/copy-out policy for i/o with things you don't trust. But you could just run untrusted things somewhere else just as easily, under another VM instance in the same process.

You might find reaps of

You might find reaps of interest.

Can't quite see the connection.

I skimmed Berger's berger-oopsla2002.pdf reap allocator paper, but so far I don't get the connection in this context. You might have a specific detail in mind you can share to clarify.

Points I made about refcounts are mostly allocator agnostic, though I see one could choose to make a plan based on some underlying allocator. What I plan for allocation doesn't seem related, but I can summarize it briefly. (I'm unsure why anyone would care about this.)

A library for lightweight processes and green threads should have good well-defined behavior when resources are exhausted, like memory, since otherwise it wouldn't be very useful. So you would give it suitable memory, from your allocator of choice, which is henceforth managed solely by the library (unless you manage to corrupt it by stepping on it, which is the host environment's problem).

When it runs out of memory, it says "no" a lot, and probably kills selected lightweight process lanes according to an algorithm the host can tune. A pool of memory reserved for the library to use exclusively in self management would likely be necessary to ensure deadlock cannot occur during incremental backoffs from memory exhaustion. But while I think it's interesting, there's not much about refcounting to it. Except all space managed by the library would be refcounted, just using a different vat than anything allocated by layers above and below a green-thread library.

(I like vat as a term meaning custom heap with as many refcounting API additions desired to satisfy planned usage scenarios. It's at least clear and not that overloaded in usage. When a refcounted object hits zero refs, it goes back to the vat that allocated it.)

I skimmed Berger's

I skimmed Berger's berger-oopsla2002.pdf reap allocator paper, but so far I don't get the connection in this context. You might have a specific detail in mind you can share to clarify.

Reaps not only support constant-time full-heap destruction, they also support fine-grained release within a heap. I meant that you might find it useful as a general abstraction for implementing and/or explaining task-local heaps.

How resources within a heap are managed is orthogonal.

When it runs out of memory, it says "no" a lot, and probably kills selected lightweight process lanes according to an algorithm the host can tune.

Bad idea, if by "selected process lanes" you mean some sort of heuristic algorithm. Your nice, deterministic programs just became non-deterministic. Better to properly account for memory by charging the tasks for their use.

Good, bad, ... I'm the guy with checkin privileges.

I just mean a lane, aka lightweight process, will get selected. The VM will always kill something if it runs out of memory. A host can tune this choice several ways. A lane marked "do not kill" is exempt unless (don't do this) all lanes are marked that way. Hinting ought to be easy to add.

I probably will not read that Wick and Flatt paper. Lanes will need resource accounting, including memory allocated from the VM's vat. Going over a hard limit should kill a lane. Hitting a soft limit can signal an event a lane can handle. Limits apply to a group of lanes rather than just one, so lanes performing a single pipeline task can hand off buffers, within a group, without change in accounting status. Credit-based flow control should be supported so a host can show distributed intentions to a local VM instance to influence policy.
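A toy sketch of that group-level accounting, with soft and hard limits applying to a group of lanes so intra-group buffer handoffs change nothing; the names, enum, and single-signal policy are all hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group accounting: limits apply to a group of lanes,
 * so handing a buffer between lanes in one group changes nothing. */
typedef struct lane_group {
    size_t used, soft, hard;
    bool soft_signaled;          /* soft limit raises an event once */
} lane_group_t;

typedef enum { CHARGE_OK, CHARGE_SOFT, CHARGE_KILL } charge_t;

/* Charge the group for an allocation of n bytes. */
static charge_t charge(lane_group_t *g, size_t n) {
    if (g->used + n > g->hard)
        return CHARGE_KILL;      /* hard limit: kill the lane */
    g->used += n;
    if (g->used > g->soft && !g->soft_signaled) {
        g->soft_signaled = true; /* signal an event the lane can handle */
        return CHARGE_SOFT;
    }
    return CHARGE_OK;
}

/* Credit the group when memory goes back to the vat. */
static void credit(lane_group_t *g, size_t n) {
    g->used -= n;
    if (g->used <= g->soft)
        g->soft_signaled = false;
}
```

Credit-based flow control would layer on top of this by letting a remote peer adjust `soft`/`hard` to express distributed intentions.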

Regions/arenas aren't very

Regions/arenas aren't very typesafe though (in most languages); some sort of escape analysis and/or linear types would be needed to ensure that objects are not used outside of their lifetime.

Reference counting and GC are easy to make typesafe, in comparison.

Regions/arenas aren't very

Regions/arenas aren't very typesafe though (in most languages); some sort of escape analysis and/or linear types would be needed to ensure that objects are not used outside of their lifetime.

I assumed we were dealing with unsafe languages to begin with, so regions in C aren't really less safe than reference counting in C.

Regions are kind of safer, because at least we have a C dialect that provides safe memory management. I understand what you mean though: getting the lifetime nesting right might be trickier than remembering refcount increment/decrement calls. I don't know of any empirical data backing this claim though.

One interesting aspect of RC

One interesting aspect of RC that I haven't seen in other region systems is the ability to annotate fields of a struct to put constraints on regions, e.g. you can specify that a binary tree node and its parent are always in the same region, and this information is associated with the struct definition rather than any of its uses.

I believe you could do this

I believe you could do this in Cyclone.

Regions only work with a

Regions only work with a single scope, whereas it is possible to apply the optimization in cases where there are multiple scopes.

For a somewhat contrived example, consider a server where each connection references data for the user making that connection. If the connection maintains the reference to the user, all scopes that are contained within the lifetime of that connection can avoid modifying the reference count of the user. However, if there are multiple connections from the same user then no single connection owns the user, so you can't apply traditional full-blown regions.

This optimization is very common with manual reference counting, but I don't know of any system that applies it automatically.

Regions only work with a

Regions only work with a single scope, whereas it is possible to apply the optimization in cases where there are multiple scopes.

I'm not sure what you mean. Regions can have nested lifetime, just like McCusker described above. Perhaps you're confusing regions with arenas?

To be more specific: regions

To be more specific: regions don't really handle the case where an allocation's lifetime has multiple non-nested components, and distinct allocations have unpredictable overlap in their lifetimes.

Consider the user data in my example. Each user's data is only freed when all connections from that user disappear, so it is impossible to predict which connection will free a given user's data upon termination, and which order the data from different users will be freed. In this case, the only way to apply reference counted regions is to make a region consist of a single user's data.

Right, because the data's

Right, because the data's lifetime is session-scoped, not connection scoped, ie. where a session consists of N > 0 connections. So the user data should live in a parent region of the connection region, not a child region.

RCImmix seems clever, but it

RCImmix seems clever, but it doesn't have many of the benefits usually associated with reference counting:

  1. Reference counting is deferred, so objects have asynchronous finalizers rather than synchronous destructors. Destructors are much easier to reason about than finalizers.
  2. It requires a fairly complex runtime, including stack maps that have to be produced by a compiler.
  3. There is a backup tracing collector, so pause times are probably mostly determined by the behavior of the backup tracing collector.

Of course, the first problem is shared by all deferred reference counting schemes, and the last two problems are shared by most reference counting implementations with a backup tracing collector, so none of these problems are unique to RCImmix.

When you are willing to give up the benefits of a simple immediate reference counting scheme, I feel like the problem of GC has been mostly solved by Azul's work on the C4 collector and the followup work on the Collie collector, a variant that uses transactional memory. Unfortunately, this approach requires heavy OS (and possibly hardware) cooperation, so it isn't used outside of Azul's own products, at least as far as I know.