Reference Counting vs Tracing Garbage Collection?

Dear all, this is an old subject which I touched upon trying to find a new compilation method for Hi code.

Common folklore has it that tracing collection is preferable over reference counting for functional language implementations due to the high allocation rate of nodes. So, my language has a Cheney-style garbage collector.

To get rid of some performance problems, and some technical concerns, I am considering switching to reference counting, but I am a bit wary of the performance impact.

Are there any solid studies out there which compare different schemes for garbage collection in an abstract manner?

(More concretely, it seems to me that there should be a tipping point where 'doing a lot of local work' must become cheaper than 'doing a lot of global work', but I don't have a good feeling where that tipping point would be. I.e, at 1MB, 1GB, 1TB of heap data?)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Sisal

I believe the Sisal compiler used reference counting; perhaps the articles about it explain that choice. They did some work on making it faster, anyway, eliminating some incref/decref operations.

Python, TCL8, CDL3

Thanks, but so do some other languages, some better known than others. Uhm, I am looking for real down-to-earth reports on changing between different strategies, or studies comparing various schemes abstractly.

For me, the most important part is, of course, what penalty would I pay? I.e., can I expect a factor 1.3, 2.0, or 10.0?

I don't care about tracing cyclic structures, the language rewrites a directed acyclic graph by design since I was looking at Erlang-style semantics at some point and wanted to keep the option open of collecting a DAG instead of a general graph.

Thanks again for your response, I hope there will follow some more. (Mesmerizing, a few decades ago this topic by itself would have started a flame-war.)

[ Actually, I changed my opinion. I am interested in languages which have reference counting. The more the better, since that would make me feel a bit more confident about the change. ]

Bacon et al.

I believe Bacon's publications on the Recycler are the definitive source. When combined with Petrank's sliding views, which is a principled approach to eliminating unnecessary deferred ref count operations in a concurrent setting, you get an on-the-fly garbage collector which approximates real-time/incremental collectors with its very low pause times (less than 2ms). Throughput isn't bad either, though not quite on par with tracing collection.

The last publication I saw was An efficient on-the-fly cycle collection. Also look into age-oriented collection discussed in that paper, which uses ref counting only for the old region. This bears out your intuition that lots of local work is sometimes better. I've mentioned this paper a few times in other threads on LtU, but I don't think it ever got its own story.

Elimination of rc-ops using static analysis...

Though not really related to the original question, I want to bring the work of Pramod Joisha (ISMM'06, VEE'08) to your attention, who changed the garbage collector in the Bartok-C# compiler to use reference counting and subsequently used static analysis to eliminate reference counting operations.

Particularly in a

Particularly in a multithreading setting, reference counts for local references are expensive, without simplifying the GC that much. Since this is for performance, I'd also consider scanning a small set of roots (e.g. stacks) from time to time... and if you do, then you're on a slippery slope toward some form of hybrid tracing/refcount GC, which Bacon and Petrank have described and benchmarked in their papers.

Aware of that - Asking for Hard Numbers

The point of going to reference counting is to bypass the necessity to scan or keep track of the stack and registers often needed in tracing collectors. (I compile to C and want to use C variables as registers and a I don't want to trace the C stack and machine registers. I also don't want to compile to a VM, just a collection of macros for my abstract assembly should be enough.)

Most of Bacon and Petrank's/Joisha's work doesn't apply since a large portion of it is on collecting cycles which don't occur in the data processed by my language.

If, and whenever, I will support concurrency, I already decided that I will only support coarse-grained concurrency where each process gets its own heap/memory and communicates through message boxes. So, that doesn't apply too. (Which is another reason to go for reference counting, no dead data, so spawning multiple processes isn't too hard on the memory consumption.)

Anyway, what I like is building very small and very simple systems which get the job done. (I drool at lisp systems implemented in 4KB.) I am willing to spend a lot of time thinking about how to get to a minimal system and I will never go in the direction of a baroque compiler. That's for ML and Haskell, I just want a simple tool which is minimal on most accounts and which performance is adequate.

So, I am willing to pay up to a factor 2.0 for a simple reference counting model if that gives me string support. (Switching to string support is what breaks my current tracing collector.) And so, I am looking for references/data on functional languages which implemented reference counting. Where it seems I need data on compilers implemented in the eighties, since most other compilers are just too baroque by now. CDL3 is the only language I found so far. That is good enough for me, so...

It's a pretty naive question about the naive non-optimized GC algorithms where I just would like to see some hard numbers before I go ahead since I know most of the rest.

[ Note that most of this is just out of sheer hobbyism. I just implement what I like, whenever I like. If it would be about solving the collector problem now, I also could have gone with Boehm. Which, for some reason even unknown to me, I just, well, don't. ]

Most of Bacon and

Most of Bacon and Petrank's/Joisha's work doesn't apply since a large portion of it is on collecting cycles which don't occur in the data processed by my language.

That seems pretty odd, but if it's true, look into the literature for compile-time GC. You should be able to reduce any such programs into straight, safe C with malloc/free.

Bacon, Petrank, Joisha

Bacon and Petrank work seems to be on the Java JVM, Joisha on C#. Most of it dealt with refcounting an imperative language, some with pretty weird extensions, both with possible cyclic structures. Most of it is on getting throughput high, which is difficult for refcounting, and some of it is on the merits even of refcounting an OO language, i.e., low memory overhead, immediate reclamation, and no suspension of finalizers.

Way too far overboard for what I am interested in. I just want to know the expected performance hit for going with naive refcounting in a functional language.

For the comment on compile-time analysis, see below.

Without reference counting?

Without reference counting? I think not. Say you have arrays A and B that point to Foo's. If you have an algorithm that sets various indices of A and B too Foo's you are going to need reference counts to count how many point to a particular Foo, even though there are no cycles. I think that what you're saying does apply if the data is a tree rather than a DAG.

Efficient Compile-Time

Efficient Compile-Time Garbage Collection for Arbitrary Data Structures

marco claims his language cannot create cyclic data, which means it cannot be higher order or you could create cyclic data via the Y-combinator, and it cannot contain mutation or you could create a cycle manually. The paper describes exactly what I said: an abstract interpretation producing full compile-time GC for arbitrary data structures in a first-order language without mutation.

Yet my example trivially

Yet my example trivially disproves that claim. The method described doesn't reclaim as much garbage as garbage collection does. Another example:

let ys = filter p xs
// use xs ...
// now xs is dead
// which elements of xs to garbage collect here?

How do you know which elements of xs to garbage collect for arbitrary p? A garbage collector inspects the runtime data graph to determine which elements of xs are still needed, but a compile time method do this.

The method described doesn't

The method described doesn't reclaim as much garbage as garbage collection does.

I never claimed it was as precise as runtime GC, but for a first-order language with no cyclic structures I think it would be precise enough.

You can determine which elements of p are dead with a combined path-sensitive sharing analysis with compile-time reference counting (for example) during whole-program compilation. Your example is not a difficult case since we're avoiding higher-order functions.

The difficult cases are where sharing is dependent on runtime values:

let ys = if some_runtime_condition then xs[0] else xs[1] end

The path-sensitive sharing analysis would then either duplicate the code for each possible outcome (expensive when there are lots of such branches), or record this information and pass it around in implicit runtime parameters similar to a dictionary-passing transformation.

The function prologue where deallocations occur then inspects these implicit parameters, but only for objects where the analysis determined that liveness is dependent on some runtime information.

Huh?

As I read it, 'filter p xs' is applying a runtime predicate p to each value x. ys is a list of all elements that pass that filter. How is this easier than your example?

Firstly, I already precluded

Firstly, I already precluded the use of higher order functions. Secondly, if the predicate is dependent on static data then it's much easier than a predicate based runtime data, like user input.

Not relevant.

We could just as easily substitute 'filter p' with its full definition. It is not necessarily a higher-order function in any case, since it only requires upwards functional argument and thus may be achieved via a templating or macro system. The issue regards the ys and xs.

First, disallowing higher

First, disallowing higher order functions doesn't make the problem easier. You can compile higher order functions away (for example by inlining filter in this case, but in the general case it can be done as well with a more elaborate transformation).

How do you apply your path sensitive analysis for this kind of code:

let f ys = if p then xs[0] else xs[1]
...
f zs
...

This probably requires passing liveness information around at runtime as you say. I predict that a practical implementation will become isomorphic to reference counting.

I do think (and have thought for a long time) that compile time garbage collection / region inference combined with a simple GC can have competitive performance or better. But compile time garbage collection alone doesn't cut it. Compile time garbage collection will generally collect nearly all short lived objects, so you probably don't need a generational GC. This helps because then you don't need write barriers anymore.

This probably requires

This probably requires passing liveness information around at runtime as you say. I predict that a practical implementation will become isomorphic to reference counting.

It occurred to me that a precise enough abstract interpretation can infer guards for every value, and these guards can be injected in the function prologues for conditional deallocation. So you need even less runtime information than I initially supposed.

I do think (and have thought for a long time) that compile time garbage collection / region inference combined with a simple GC can have competitive performance or better.

I agree. If I had more time, I would implement this region system, which replaces the alias analysis I alluded to above with GC of region handles. It's a fair tradeoff between regions and tracing.

And how are you going to

And how are you going to handle the filter example? i.e. what code exactly is it going to compile to. Thinking in the abstract makes everything seem easy.

Well the code generated is

Well the code generated is entirely dependent on the specific program and its allocation properties, so I'm not sure what general answer I could possibly give you except to point out the algorithms involved in discovering these properties.

LOL

Well, that reads a lot like: "Well, I don't know, and I don't give a shit." ;)

Well, I'm not sure what to

Well, I'm not sure what to tell you then. I've given plenty of citations, some of which directly demonstrate the effectiveness of the analyses I've suggested in this domain, and if that doesn't meet your standards of evidence, then I can't help you.

I liked the citations

No worries.

[ And btw. Of course a lot of those citations are not applicable to my compiler. Changing compilation schemes can take months, sometimes years, to implement. Ditto for elaborate GC schemes. If I go reference counting, I'll implement the naive -proven- manner and live with the fact that other methods might be faster. ]

The thing is none of those

The thing is none of those algorithms handle that example. They leave garbage around.

Don't Think So

Interesting paper, but looks too immature. It feels like an approach which was often taken in various Prolog machines.

More interesting:

marco claims his language cannot create cyclic data, which means it cannot be higher order or you could create cyclic data via the Y-combinator, and it cannot contain mutation or you could create a cycle manually.

The lambda calculus doesn't have a notion of cyclic data and supports higher-order functions. First it is best seen as a string rewrite system, but you can abstract from that and see it as a term rewrite system where the terms are (AST) trees. To avoid copying in the AST, a semantics where LC is seen as a DAG rewrite system is the most plausible. A small example, (lambda x . x x) just inserts an application node where function and argument refer to the same AST node.

My language is implemented as a faithful translation of combinators rewriting (the top of) a graph containing data and thunks. Some consideration shows that it is trivially a term rewrite system on a DAG, just like LC.

Just like LC, it also supports higher-order rewriting trivially.

[ Stated differently, the language is -at heart- an eager implementation of a pure LC with interfaces on values and exception support. Its operational semantics is a term rewrite system. Of course it rewrites a DAG. ]

Subtleties of RC

I can't give you hard numbers since I havn't used both methods on the same language, so no comparable data, but some subtle issues became apparent with RC.

1. allocator performance, with a copying collector a bump allocator can be used, very cheap, with RC, memory is fragmented and a more complex allocator is needed (and more complex deallocate)

2. be careful you don't build GC characteristics into the language semantics, Python has done that and so changing the GC would be VERY hard.

Elaborate, Please

Your comments in 1. are a bit too brief for me to understand. I can guesstimate what you mean, but that doesn't come near to understanding it.

What language? Why a copying collector? I am going for reference counting to get rid of a factor 2.0 in memory consumption due to stop and copy, possibly at the expense of a factor 2.0 in running time. That reference counting may fragment the memory I get, do you have numbers for that? It doesn't make sense to drop a factor 2.0 in memory consumption just to gain it in fragmentation.

More Detail

To answer your questions in order:

Languages, the two GC methods (copying and RC) were implemented on different DSLs several years apart and on different hardware so, as I said, there exists no comparable performance figures, all I can provide you is what caused me problems. RC was chosen because the language was a control system which needed minimal pause time and the DSL was designed so as to not create cycles (like your language).

By "copying collector" I was referring to collectors such as your current Cheney one. Since copying collectors provide a single contiguous free memory space, allocation consists of just incrementing (bumping) the freespace pointer, effectively zero cost. RC allocator/deallocators need to locate blocks of the correct size, coalesce adjacent memory blocks etc. We were using a Lea allocator. I've seen quotes of up to 30% of execution time in the allocator/deallocator for C/C++ code using such an allocator for malloc/free, but I can't find the reference just now. So the performance cost of the allocator used with RC must be added to the reference count mutation cost.

There is no way to quote numbers for overhead due to fragmentation since it depends not just on the implementation of your language, but of the code written in the language. Also it is hard to compare since Cheney has 100% overhead of unusable memory, but RC systems have free memory not unusable memory. Fragmentation just means that all available free memory chunks are too small for the current allocation, that can happen with only one free chunk (ie near 100% memory usage) or lots of free chunks (ie lower useage). I would argue that to get fragmented free space equal to used space (ie 100% overhead when an allocation fails) the allocation/deallocation pattern would have to approximate keeping every second piece of memory, but your language (as you have described it) does not sound like it would have such a pattern, but you would be able to analyze that better, or instrument your current system.

In fact instrumenting the current system and getting an idea of the usage patterns is probably a good idea anyway, then you have some basis for decisions.

Thanks

Very clear response, thanks.

Just for completeness. I bootstrapped the language and got some numbers out of that. Most of the time is spend comparing texts, list of characters. *Cough* Lots of them. Due to a naive implementation it holds *cough* 15 million characters *cough* during the bootstrap. Which, to make matters worse, due to a naive memory representation is elaborated to 2 * (32B + 48B) * 15M = 2.4GB of memory, 32 bytes for a character, 48 bytes for a cons cell, a factor 2 for stop-and-copy.

Please be so kind to laugh your head of now, since I am doing that too. Life is never easy. ;)

So, I'll go the route of minimizing the memory model and changing some naive representations and algorithms.

Fortunately, it shouldn't be too hard to get rid of all that character data by moving to a symbol table and supporting native strings. I don't like the factor 2 for stop-and-copy anymore, so I am considering moving to reference counting, also for making the language more robust.

Anyway, I am rambling. Thanks again. 30%, heh? Hmm, I am certain I have higher allocation rates than C, so I think it will be worse. I don't worry too much about fragmentation. Most data is less than 8 cells and I'll keep separate free lists for each size up to a certain maximum. But, hey, the gamble is to lose a factor 2.0 in time but gain that again by switching to native string representations.

Ah well, time will tell.

[ I just noticed my thread is a dupe. Unfortunately, little more information there.]

Could try pools

You could also try pools if your implementation uses lots of one or a few memory chunk sizes. This makes allocation/deallocation become only a unlink/link and there is no coalescing. If the compiler can distinguish using size or type it can even compile the allocate unlink inline (the deallocate has to recurse through internal pointers so it probably should not be inline).

But if the allocation pattern has a large ratio between maximum and average use of the pool then you will have a lot of empty pool space unless you can recycle it. As usual there is a trade off of memory use for speed.

Of course this only works for fixed size things, so probably won't help with your string problem.

First switch model, then think to GC

Now I see why you want so much to optimize memory consumption - still, I think that you should rather reevaluate your needs for GC/refcounting after changing the memory representation. Then, you can get more accurate statistics to inform your choice - maybe you won't care any more, and you'll have to consider other factors.

A language without mutation usually has to generate lots more shortly-lived garbage, therefore a copying collector is more beneficial; with a generational collector with mark-sweep for the old generation, you can have the 2x overhead only for the young generation (much smaller). That's also more flexible - the space/time tradeoff can be made by tuning the heap size/the frequency of major collections, rather than by switching algorithms.

Additionally, if your language has threads or will have them (given the increasing core counts), non-optimized refcounting will kill performance, because of extra writes and cache misses. Normally, if two threads, e.g., traverse a data structure, say more than once, without modifying it, the CPUs can use their caches. With naive refcounting, each read access becomes an (atomic) memory write, which must be flushed to memory before the other processor can write on it again. An access to main memory can easily cost 100 cycles.
Unless you use a Global Interpreter Lock like Python and Ruby, but people are still laughing at them. Therefore you really need to have something non-totally naive (deferred refcounting, as done by Bacon et al., 2001, could suffice - I elaborate elsewhere on this thread).

making the language more robust.
Why is that? Because of mem. consumption, because your language supports finalizers/destructors, or something else? Please think really hard before providing destructors, that is exactly hardcoding refcounting into the language (as Python), which was already discouraged.

Robustness / String Support

The language is an extended lambda calculus which is compiled down to combinators. All combinators do is rewrite the top of a datastructure, a DAG. Each combinator is translated to a C function, which is trampolined with the top node as an argument. [ For the operational semantics, look here, the C machine. It works. ]

During a self-compile, due to naive algorithms, I allocate 15 million characters in texts represented as lists of characters. As stated, that wastes memory like hell, 1.2GB heap space, but since I use Cheney, that is doubled to at least 2.4GB of memory.

Moving to string support would probably mean I could get that number down to, say, 20MB, with Cheney that would mean 40MB of memory consumption overall. A lot better than 2.4GB. Moreover, it seems the compiler spends most of its time comparing those texts, moving to strings would mean all those compares could be given an abstract O(1) cost, instead of O(n) now, measured in term rewrite steps, of course.

The combinators/C functions which rewrite the heap don't keep track of roots. To make sure that during a rewrite the heap doesn't run out of space, the heap is garbage collected in between trampolining whenever the amount of free space is less than a certain minimum. I.e., the system makes sure that there is always enough 'scratch space' to rewrite.

It works, it is very fast. Since I mostly expect to allocate small strings, it might even work with string support for a self-compile. However,...

In the general case, a problem would arise if during a rewrite there isn't enough scratch space, which could happen easily if the combinator would, say, deem it necessary to ask for a 100KB string.

Moving to reference counting would mean not having any heap space at all, but just be able to malloc every new node asked for, and free them when the count drops to zero.

It is a tempting scheme since it is simple and makes the whole run-time much more robust. I.e., the requirement for scratch space is dropped. Another reason that it is more robust is that you need to grow, and shrink, a Cheney style heap with a certain ratio during the evaluation of the program. I'ld rather run out of memory space for all live data than not be able to double the heap whereas the amount of live data would be, say, in circumstances just 30% of the total heap space(s).

(Cheney wastes memory like hell, you might want to grow the heap space when you're at 90% of consumption of a from space, but after doubling the from and the to space, you're only using ~25% of the total allocated space for live data.)

[ Oh, the reason it is very fast: Allocating is just bumping a pointer and there is just one root which is occasionally collected, which is a lot better than inspecting the amount of free space with every allocate request, which is the default in many languages.
Whether my compilation scheme is fast, dunno, think it is comparable to most lisps but probably the excellent CAML light/ZINC machine is faster, though I have a hard time precisely pointing out why that would be the case. ]

[ You now see why I ask for numbers? I'll gladly pay a 1.3 factor in running time to make it more robust. I'll grumpily pay a 2.0 factor. And, if I end up paying a 10.0 factor, moving to reference counting would just be a waste of time. ]

[ W.r.t. the C machine. Of course, when compiling to C functions, the dots in the representation are elaborated to positions in thunks, and the thunks hold continuation pointers to the next thunk to evaluate. The dot notation is just easier on paper. ]

Probably 3x

I am not sure how much implementation complexity is involved in moving to a string representation inside your compiler, but it sounds like it is high time. It's probably going to be easier than switching to reference counting. It's hard to argue with 1200MB vs 20MB--that's 60X. I'd bet you get a more than 5X speedup just from that.

As for GC, reference counting is not going to be the best in the end. I think you will likely pay 3x rather than 1.3x. When well tuned, a generational GC is going to give the best performance overall. And if you don't have mutation in your language, then you don't need a write barrier.

Can you point to any studies

Can you point to any studies which support the 3x proposition?

You still need a write barrier for pointer initialisation. That is still a pointer write operation & needs to increment the count.

Ah well

Yeah, list of characters waste memory, at least on 64 bit systems. Every somewhat naive representation of that will take at least two words for a char (a header and a value), and at least three words for a cons (a header, the head and the tail pointer). So, unless I do some really heavy bit-packing, 15M chars will always take 15M*(2*8B+3*8B) = 600MB, double that for Cheney. (Even if I make chars unique through a table, it is just still way too much.)

Then the fact is that most texts in my compiler are at most 8 (for 'x' or 'x#30') or 16 (for fully qualified 'system.print' or 'map.foldr') bytes long, they mostly fit in the size of a boxed integer on 64 bits.

I will support strings just because of that.

For the reference counting, I don't know. Probably I should just go ahead and implement it in a week and find out what the factor is for my language. Seems the better option than just reading more. (Although I have the uncanny feeling that 3.0 factor may be correct.)

[ Moving to strings is not too bad. All texts are already in a separate datatype. Representing texts as list of chars was just the easiest for the bootstrap and I didn't think a self-compile would consume that much memory. I'll lose a lot of fancy recursive nice functions on texts, though. ]

[ I agree, where are the numbers? This is supposed to be an exact science, no? However, I don't see the need for a write barrier, btw. ]

[ Interlisp, Mesa and a comparison of Lisp implementations seem to be the references with the most data. ]

[ I revisited some of Bacon's work. His collector stays within bounds of a normal mark-and-sweep collector. ]

Barrier Overloaded

The term barrier is overloaded, I am referring to

"A write barrier is a mechanism for executing some memory management code when a write to some object takes place (that object is then "behind the write barrier", or - informally - "write barrier-ed", or - sloppily - "write-protected")."

from the GC FAQ which includes the reference count inc/dec which is still needed on pointer initialisation and reclamation.

I'm not referring to the hardware write barriers that enforce memory operation ordering, which the C compiler will decide to include to enforce sequence points.

cheney does not require contiguous

Lex Trotman: Since copying collectors provide a single contiguous free memory space, allocation consists of just incrementing (bumping) the freespace pointer, effectively at zero cost.

That's only true of "hemisphere" oriented collectors when hemispheres are contiguous. Nothing stops a copying collector from using a discontiguous heap composed of big pages, say 128K plus or minus a factor of four per page. You still get pointer bumping until end of page. Since a copying heap need not be a fixed number of "pages" in size, there's flexibility when it comes to multiple heaps sharing the same free pool of pages, which lowers overhead in unusuable memory.

In other words, you only need as many free pages in the free pool as required to copy the next heap that gets copied.

Also it is hard to compare since Cheney has 100% overhead of unusable memory, but RC systems have free memory not unusable memory.

Cheney is breadth-first graph copy of roots from an old space to a new space, with forward references stored in the old copy as you go. Nothing requires old and new spaces to be contiguous, so Cheney does not require 100% overhead of unusuable memory for reasons noted above.

member-of-heap map

I forgot to note Cheney assumes it is very cheap to ask "is this pointer inside this heap" to see whether an old-space pointer node must be copied. A single contiguous heap makes this very cheap as an address range check. If your heap is discontiguous, you must do something clever to make that question cheap to answer. Here's what I did:

Align all heap "pages" so if a page is 128K, then the starting address is a multiple of 128K. Given a pointer, strip off the low bits; the result is page address (if in fact that address is inside a heap page). Every heap has a hashmap of heap pages contained. So answering that question is a cheap hashmap lookup. Not as cheap as an address range check, but still good.

Yes, in this case where

Yes, in this case where there are no reference cycles and assuming scanning in temporal order then you can reuse from space as to space as soon as it is scanned, since all references will have been adjusted by then so the forwarding pointer can be written over. So you don't even need enough spare to fit the whole live set, just enough to cover the largest object that needs to be copied.

== Reusing GC

== Reusing GC implementations ==
Have you taken a look at MMTk and VMKit? I guess you'd get higher-quality GC for free, and solve most of your problems.

== Fragmentation and GC ==
If you care that much about memory consumption, mark-compact is probably the best algorithm available.

Other defragmenting GC algorithms are relatively modern. The 1992 Wilson survey doesn't mention any, but many modern GC algorithms try to divide the heap in zones, gather fragmentation statistics, and selectively defragment the most fragmented zones; IIRC, Garbage First, Immix, Pauseless and Stopless all do something like this (but I might be mistaken).

Mark-sweep, malloc/free and refcounting share the same basic properties wrt. fragmentation: there's always a standard algorithm for managing free-lists or the like, and objects are not moved. The actual fragmentation then depends (_ a lot_) on the used memory allocator.

Still, for GC to perform better than non-GC, you might need to use more than 2x the memory (at least according to "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management"). But if you accept a 2x time slowdown factor, you're fine.

== Fragmentation statistics.
I guess that fragmentation depends wildly on the application - for non-GC, they show some numbers on page 68 of the second paper I cite, but they come from a technical report they cite that I can't find.
== Standard references ==
Otherwise, you might want to download and at least skim these papers, and look through their references, both from Wilson - they are standard surveys on these topics (I read the first, never managed to get through the second):

Uniprocessor Garbage Collection Techniques, 1992
Dynamic Storage Allocation/ A Survey and Critical Review, 1994

== Point 1 of the previous answer
Point 1 of the previous answer (which is already discussed by 1 the first paper) just means that with a copying collector (which is what ones typically uses for a young generation) allocation is done by bumping a pointer (like on the stack), instead of managing free lists or the like (like in malloc and free).

Overly Complex

MMtk VMkit and most of the algorithms listed are very general and do not take advantage of the special case that marco's language does not create memory reference cycles, so he would be paying a lot for the reuse. And my experience with MMtk is that it is quite hard to attach to an existing runtime (hence VMkit to connect it to LLVM).

In the worst case mark/compact needs two times the amount of memory compacted, but it would be a good technique if only compacting part of the memory per cycle will allow collection to still keep up with the memory use rate. Otherwise coalescence of adjacent free chunks is still needed, adding to the runtime cost.

See the other post

I am aware of most of those since I read on various manners of GC before I implemented Cheney. I need numbers, not the rehashed opinions on merits of various schemes I am well aware of. [ I am reading the Wilson paper now, hope it gives me some more information. Thanks! ]

Moreover, when building a language it is often so that the particular language semantics makes it impossible to use a specific GC kit. Just the choice of bit-packing or structuring data in a specific order will invalidate most general kits.

Compile Time? Unlikely

In general, I gather that determining what objects could be deallocated during compilation is equivalent to the halting problem.

Now, I gather that it is possible to do a lot of compile-time analysis, but I also think that would be pretty close to just a lot of static analysis on the INCREF/DECREF operations.

Doesn't seem worth it. I want a simple tool, not an optimal tool.

[ Oops, posted on the wrong thread. This is an answer to naasking. I enjoyed some of the Bacon et al papers btw. They were interesting reads, thought most of them I just scanned for ten minutes. ]

Deferred ref-counting

Most modern refcounting algorithms perform INCREF/DECREF only for heap references, and take into account stack references only periodically - that already saves most of the cost. You already implemented stack scanning for your GC, that would be useful. Additionally, Bacon et al.'s 2001 PLDI paper [1] on the Recycler describe a particularly efficient variation of deferred refcounting (section 2). Given your language, though, you might maybe be able simplify it further: since you can't clear an object field, all DECREF should be (ultimately) caused by a stack frame being destroyed.

[1] "Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector", PLDI 2001.