Quantifying the Performance of Garbage Collection vs. Explicit Memory Management

I think the real culprit behind the "poor performance" of GC languages is demonstrated in Quantifying the Performance of Garbage Collection vs. Explicit Memory Management, by Matthew Hertz and Emery D. Berger:

Garbage collection yields numerous software engineering benefits, but its quantitative impact on performance remains elusive. One can measure the cost of conservative garbage collection relative to explicit memory management in C/C++ programs by linking in an appropriate collector. This kind of direct comparison is not possible for languages designed for garbage collection (e.g., Java), because programs in these languages naturally do not contain calls to free. Thus, the actual gap between the time and space performance of explicit memory management and precise, copying garbage collection remains unknown.

We take the first steps towards quantifying the performance of precise garbage collection versus explicit memory management. We present a novel experimental methodology that lets us treat unaltered Java programs as if they used explicit memory management. Our system generates exact object reachability information by processing heap traces with the Merlin algorithm [34, 35]. It then re-executes the program, invoking free on objects just before they become unreachable. Since this point is the latest that a programmer could explicitly free objects, our approach conservatively approximates explicit memory management. By executing inside an architecturally-detailed simulator, this “oracular” memory manager eliminates the effects of trace processing while measuring the costs of calling malloc and free.

We compare explicit memory management to both copying and non-copying garbage collectors across a range of benchmarks, and include real (non-simulated) runs that validate our results. These results quantify the time-space tradeoff of garbage collection: with five times as much memory, an Appel-style generational garbage collector with a non-copying mature space matches the performance of explicit memory management. With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.

Overall, if we ignore virtual memory, generational collectors add up to 50% space overhead and 5-10% runtime overhead. That seems very reasonable given the software engineering benefits. However, factoring in virtual memory, with its attendant page faulting, paints a very different picture in section 5.2:

These graphs show that, for reasonable ranges of available memory (but not enough to hold the entire application), both explicit memory managers substantially outperform all of the garbage collectors. For instance, pseudoJBB running with 63MB of available memory and the Lea allocator completes in 25 seconds. With the same amount of available memory and using GenMS, it takes more than ten times longer to complete (255 seconds). We see similar trends across the benchmark suite. The most pronounced case is javac: at 36MB with the Lea allocator, total execution time is 14 seconds, while with GenMS, total execution time is 211 seconds, over a 15-fold increase.

The culprit here is garbage collection activity, which visits far more pages than the application itself [60].

Considering the prevalence of virtual memory, this poses a significant problem for garbage-collected languages, server-side systems in particular. LtU has previously covered work by the same authors on garbage collectors that cooperate with virtual memory managers.


Uninformed GC

Doesn't this paper simply show that garbage collection needs to be better informed?

It's one thing to compare explicit and automatic memory management by yanking the free calls and using GC instead. In this case, though, it looks like they generated the free calls mechanically rather than by hand. If that system is far superior, irrespective of whether the fault lies with virtual memory, the implication is that GC needs better static analysis tools and instrumentation. That's not to say that the interference of virtual memory itself is a trivial result; rather, the seeming result of "virtual memory causes problems" is supplanted by the possibility that "we can't get away with the simple approximation we were using before."

I don't have time to dig through any details until after finals, so this may just be an indication that I need to read the paper.

Some naive assumptions in GC

Yes, some naive assumptions are at fault, namely that memory has a fairly uniform, invariant cost model; unfortunately this is difficult to rectify with GC in particular, but probably not with automatic memory management as a whole. Performance issues relating to virtual memory are due to GC's inherently non-local access patterns during collection (touching pages the mutator isn't currently using). A static analysis that can properly group objects by their lifetime is pretty much ideal; it's essentially region-based memory management!

I'm not sure about the relevance of generating the free calls mechanically; they are using program traces to compare the costs of manual and automatic memory management all else being equal.

GC and virtual memory

With respect to the GC's assumptions about memory behavior, it might be relevant to link to a paper posted here a few months ago: Garbage Collection Without Paging.

Already linked to that in

Already linked to that in the original post. That paper is by the same authors. :-)

Oops

I think I've gotten in the habit of reading the quoted (blue) text and ignoring everything else. Sorry about that!

Premature deduction IMHO

[[In this case, though, it looks like they generated the free calls mechanically rather than by hand. If that system is far superior, irrespective of whether the fault is virtual memory, the implication is that GC needs better static analysis tools and instrumentation.]]

For the purpose of their paper, they measured an execution trace and reproduced it with the *same inputs*, replacing the GC with malloc and free, which gave big improvements.
To be useful, a static analysis tool would have to replace the GC with explicit deallocation whatever the inputs are, which is a different problem.
Note that Sun is working on a similar optimisation (I don't think it is integrated into any of their released JVMs yet): I remember they are working on 'escape analysis' to replace heap allocation with stack allocation when it's possible.
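For illustration only (a C++ analogue, not the JVM optimisation itself): escape analysis looks for objects that never escape the function that creates them, so a heap allocation like the first version below can effectively be replaced by the stack allocation of the second.

```cpp
struct Point { double x, y; };

// Heap-allocating version: the Point never escapes this function, so the
// heap allocation (and its eventual reclamation) is pure overhead.
double dist2_heap(double x, double y) {
    Point* p = new Point{x, y};
    double d = p->x * p->x + p->y * p->y;
    delete p;
    return d;
}

// What escape analysis effectively rewrites it into: a stack allocation,
// with no allocator or collector involved at all.
double dist2_stack(double x, double y) {
    Point p{x, y};
    return p.x * p.x + p.y * p.y;
}
```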

got it

Ah, I see where my mistake came from, that the analysis *isn't* static. Thanks, that clears up my perception substantially.

Leaks and reachability

How does the memory consumption of the paper's simulation of manual memory management relate to the memory consumption of actual manual memory management? The paper argues that actual manual memory management would use less memory, because the simulation waits until the last possible moment, just before the object becomes unreachable.

This is unconvincing because the manual version might leak, depending on the input. Fixing the leak could involve rewriting the program, with a deallocation point late enough to be valid on all input, and an earlier/wider allocation point/scope to make sure that the object remains reachable all the way to final deallocation. Now the memory consumption is higher than in the simulation.

The naive way to do this research is to rewrite a garbage-collected program using manual memory management. The authors note that this would be arduous. When I was writing signal processing code in C I was able to cheat by writing small programs that did not free the memory that they allocated. I stitched them together with shell scripts and let the operating system reclaim the memory as processes exited and were started afresh. The memory consumption of programs with manually managed memory allocation is the product of a trade-off. If you push too hard at minimising memory usage you will end up with bugs. Refusing to discuss this trade-off fudges the question of what the paper is actually about.

The authors have done a fine piece of work. Their 'magic' deallocator does a lot better than standard garbage collection. On its own this raises various interesting questions. Is there still a big advantage compared to generational garbage collectors? Can one write a code analysis tool that statically computes the locations of the 'magic' deallocations? I think the answer is obviously not, but this obvious answer misses the point. The authors have put a worthwhile performance improvement into play. Can one write a code analysis tool that statically captures half the possible improvement on actual code bases?

Unfortunately the authors phrase their paper as relevant to manual programming practice.

It would be possible (though onerous) to attempt to measure it by manually rewriting the benchmark applications to use explicit deallocation, but we would somehow have to factor out the impact of individual programmer style.

Well, no, one step at a time, please! It would be interesting to know whether any programmer can rewrite the benchmark applications to capture the performance gains that their research hints might be obtainable. Once you have established whether it can be done at all, one might go on to automate it or to teach it.

Average vs worst case performance

When porting our visualization software from C++ to OCaml we noticed that, although the average performance degraded by ~20%, the worst case performance improved more than five-fold. That was largely due to OCaml's GC spreading deallocations over time, whereas our C++ code suffered from avalanching destructors. For a soft real-time application like visualization this matters, of course: the C++ code jerked noticeably.

So I would be interested to see worst case performance profiles as well as averages or totals.

We're porting our code to F# now, so it will also be interesting to see whether or not the GC in .NET provides the same benefits that OCaml's does.

our C++ code suffered from

our C++ code suffered from avalanching destructors.

Yes, a well-known problem, particularly with ref-counting. There's a simple solution though: add some incrementality to reduce pause times. One popular way is to queue deallocations and process the queue a fixed number of deallocations at a time; each recursive deallocation adds to the queue to be reclaimed later.
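A minimal C++ sketch of that queued-deallocation idea (the names here are illustrative, not taken from any of the systems discussed): instead of destroying a large structure recursively in one go, nodes are pushed onto a queue and reclaimed a bounded number at a time.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Incremental deallocation sketch: instead of freeing a large structure all
// at once, nodes go onto a queue and are reclaimed a bounded number at a
// time, spreading the cost over many frames or allocation points.

struct Node {
    std::vector<Node*> children;   // owned child nodes
    // ... payload ...
};

class DeferredReclaimer {
public:
    // Called instead of `delete root`: just enqueue the root.
    void schedule(Node* root) { pending_.push_back(root); }

    // Call once per frame (or per allocation): free at most `budget` nodes.
    void step(std::size_t budget) {
        while (budget > 0 && !pending_.empty()) {
            Node* n = pending_.front();
            pending_.pop_front();
            // Children are re-queued rather than freed recursively, so no
            // single step can trigger an avalanche of destructors.
            for (Node* child : n->children) pending_.push_back(child);
            delete n;
            --budget;
        }
    }

    bool idle() const { return pending_.empty(); }

private:
    std::deque<Node*> pending_;
};
```

Calling `step()` with a fixed budget bounds the pause contributed by deallocation; the trade-off is that garbage lingers on the queue a little longer before its memory is returned.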

Generational GC can still come out ahead though, as it does not even have to deallocate children of a structure that it deems garbage from the parent node.

Slipped through the cracks

I asked them at the presentation, and one thing they overlooked is that the machine code for the explicit free() calls is not actually present in the application at the deallocation site. Their instrumented simulator "magically" decides to execute the explicit free call at a particular deallocation site, so the instructions of the free call itself are simulated, but that call's effect on the surrounding code is not captured (e.g. greater register pressure, larger code size, I-cache effects).

The effect is probably small in most cases (I would guess less than 1% execution time), but it still gives a slight advantage to the explicitly managed version.

Overall I think the authors did a really outstanding job controlling for as many variables as possible.

There is always that last one though.....

Danger Will Robinson

This paper has been around a while, and has been used to attack yours truly on my opinion that garbage collection is often given a bad rap. I'm certainly not an expert here, but I take issue with a number of the assumptions made in this paper. I have two primary concerns.

First, the paper assumes that tracking objects is free. They inserted calls to free() exactly where the object became unreachable. The running program paid no price to track the objects. This is not a reasonable assumption. In most non-GC programs I've seen, a huge proportion of the code goes to tracking the reachability of objects. The tracking code uses both CPU cycles and memory resources. Finally, the tracking code is not oracular and often does not release an object exactly when it is no longer referenced by anything outside of the tracking code.

Second, the paper uses garbage collection technologies that are now significantly out of date. Not all GC is created equal. For example, the .NET generational collector has many mitigations in place that address some of the specific concerns raised by the paper. The generational approach has been shown to significantly reduce the amount of time spent collecting garbage in typical programs, yet no modern generational collectors were measured by this paper.

I'm not going to claim that GC is free. There are times where it is simply not (yet?) appropriate. There are times when the program's allocation patterns are better suited to explicit memory management. But there are also times when GC ends up being significantly better (in terms of performance or memory usage) than explicit memory management; for example, in situations where heap fragmentation is a problem, a coalescing GC can improve both performance (due to improved cache hit rates) and memory usage (due to reduced heap fragmentation).

The work done by Hans Boehm, while dated, seems to be more even-handed. He also used garbage collection mechanisms that are a bit outdated (the issue was forced due to the use of a conservative collection strategy), but even so he found that the systems using GC were only a bit slower than the equivalent systems using explicit memory management.

Anyway, just my 2c.

I'm not sure what you found

I'm not sure what you found so controversial about this paper's results. Contrary to what you state, the comparison does include generational mark-sweep and generational copying collectors, which are the typical throughput-centric GCs used in every comparison I've read. I can honestly say that I've read the majority of garbage collection papers, including Hans Boehm's, excluding only parallel and distributed collection (I've read just a few of those). There's no representation of incremental collectors, but that's fairly common. The results in this paper are in line with everything I've read.

As for tracking objects, the developer is doing so himself in his head, which is the inherent problem of manual memory management. No doubt the resulting programs would be leaky in the real world, thus highlighting the primary advantage of GC, but that's not the point of this paper; the point is to quantify the absolute runtime costs of GC vs the manual approach in an ideal world.

In my mind, the primary contribution of this paper is the identification of the degenerate case in GC: large heaps which cause paging. I think that's a very important contribution. The authors' follow-up paper, which I also linked to, addresses this degenerate case and obtains significant improvements as a result.

As for the .NET GC, yes, it has some advantages, mainly because it's a concurrent collector and so does not suffer the pause times of most throughput-oriented collectors. Concurrent collectors have their own costs though, mainly synchronization overheads and difficulties with compaction.

Regarding the "cost of tracking objects"

I took the original comment to mean the runtime cost of other means of dynamically determining object liveness (other than tracing GC), such as ref-counting, auto_ptrs and similar, or ad-hoc and application-specific techniques. There's no runtime penalty for a programmer (or an oracle) who knows exactly when to call free(); but there is a penalty for updating ref counts, shared bits, etc. In some cases, these penalties can be severe: updating a refcount for an object which is swapped out causes a virtual memory miss; even in the absence of virtual memory, it can cause poor cache behavior. Refcounting is highly inappropriate in multithreaded applications (unless each object and its dependencies are confined to a single thread). auto_ptrs and the like are a bit better (being local and not affecting the state of the object), but they can still have an impact. The good thing about such techniques is that their cost is more uniformly distributed.
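To make that per-update cost concrete, here is a minimal intrusive ref-counting sketch in C++ (illustrative only, not any particular library): every handle assignment writes to the pointee's header to adjust the count, which is exactly the memory traffic an oracle that simply knows when to call free() never generates.

```cpp
#include <cstddef>

// Intrusive ref-counting sketch: every handle copy/assignment touches the
// *pointee's* header, so copying pointers to cold objects can fault in pages
// or evict cache lines that a free()-calling oracle would never touch.

struct RcObject {
    std::size_t refcount = 0;
    // ... payload ...
};

class Handle {
public:
    explicit Handle(RcObject* p = nullptr) : ptr_(p) { acquire(); }
    Handle(const Handle& other) : ptr_(other.ptr_) { acquire(); }
    Handle& operator=(const Handle& other) {
        if (this != &other) {
            release();           // writes to the old pointee's header
            ptr_ = other.ptr_;
            acquire();           // writes to the new pointee's header
        }
        return *this;
    }
    ~Handle() { release(); }

private:
    void acquire() { if (ptr_) ++ptr_->refcount; }
    void release() {
        if (ptr_ && --ptr_->refcount == 0) delete ptr_;   // eager, non-deferred free
    }
    RcObject* ptr_;
};
```

If the object behind `other.ptr_` has been swapped out or has fallen out of cache, the increment in `acquire()` is where the virtual memory miss or cache miss described above happens.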

I don't see anyplace here where the original poster was "attacked". (I hope mere technical disagreement isn't perceived as an attack).

What we really need is faster mass storage. This observation is outside the scope of LtU, but the growing performance gap between semiconductor memory and disk storage is getting increasingly... annoying.

Ref-counting solves the virtual memory problem?

I took the original comment to mean the runtime cost of other means of dynamically determining object liveness (other than tracing GC), such as ref-counting, auto_ptrs and similar

Ref-counting is GC. See A Unified Theory of Garbage Collection. All GCs are some mix of ref-counting for incrementality, and tracing for throughput (excluding concurrent collectors).

updating a refcount for an object which is swapped out causes a virtual memory miss

I suspect that ref-counting exploits more locality than tracing collectors because the pointers the mutator is updating are to objects it's currently operating on. On the other hand, tracing touches pages the mutator may not have touched in quite some time, which means page faulting. So perhaps ref-counting somewhat alleviates the garbage collection virtual memory problem pointed out in the paper.

Refcounting is highly inappropriate in multithreaded applications (unless each object and its dependencies are confined to a single thread).

On the contrary, a number of papers have analyzed concurrent collection using ref-counting. I'm currently reading An Efficient On-the-Fly Cycle Collection to see how applicable it might be for my project.

Goal of this Paper

First, I am quite honored to see such a debate going on about this little work. There are many issues to pick with the paper, several of which have been discussed here. Neither Emery nor I expected this paper to end the debate; there are too many variables for that. But we realized that my trace generation work (Merlin) provided a unique opportunity to provide another data point and exploit any insights this provided.

As with all research, we could not investigate everything. We tried to select the most commonly cited algorithms (concurrent and .NET algorithms would be great, but I did not have access to a system that could have run them in the necessary harness; my thesis included BC and a few algorithms we pruned due to a lack of space). We tried to include as much as we could in our accounting and to state our methodology clearly so that people could decide whether these results would actually apply to specific environments. Given this very interesting discussion, I feel like we did a good job. I just hope others do too.

A few specific answers to questions raised so far:

* While we did not introduce leaks into the malloc-free runs, the reachability-based oracles left in place all the leaks from objects remaining reachable but unused. The liveness-based oracles freed objects after their last use (i.e., fixing all such leaks) and led to a fair space reduction, but almost no difference in execution time. I think adding leaks would be interesting, but could be hard to do in a reproducible manner.

* We do as much as we can to account for the costs of freeing data. Because we insert calls to free() only where a similar call already exists, we should see most of the calls' costs (including some of the added register pressure, since many get used otherwise). Since our analysis is not static, locations may not always call free() each time they are executed; we would somehow need to split these to get the full costs (which would modify the run and invalidate our traces).

* We ultimately decided not to modify the program run for the malloc-free runs. This does mean that we only pay for the object-tracking work that the program itself performs for GC purposes (which, in the case of the SPEC benchmarks whose code I could see, is actually fairly reasonable). I think the paper highlights that, unless memory is an issue, GC is a reasonable choice for performance (and a clear win once software engineering is included).

Anyway, I really think this conversation is very interesting and am always glad to see my work discussed here. Thanks!

Implicit per process GC in UNIX

A while back I did some experiments with using the Unix process model, with its implicit freeing of resources at exit, as a garbage collection system for C programs. The programs were meant to be short-lived, and no explicit freeing was done for any allocated memory, since the kernel returns all allocated resources in one fell swoop during the process cleanup phase. This seemed like it would lead to efficiency gains overall, since the kernel was doing a 'large grain' cleanup/deallocation (of memory pages) rather than the program explicitly freeing each object at runtime.

I have since lost the results, but I recall that the system did take longer to clean up processes that had allocated a greater amount of memory, which is in line with the behaviour of a general garbage collector. It also greatly simplified the actual code, mostly due to the style of writing very small "function" programs that could be tied together with shell scripts and pipes (which is how Unix was meant to be used in the first place).

The point of this was that Unix and C together do have garbage collection; you just have to learn to program in a certain way to use this feature. One of the main drawbacks of shell script is the cost of fork/exec'ing each process, but if major parts of the operating system (process resource reclamation) weren't being ignored by user-level programs, I have a feeling this overhead could be balanced out.

Programs written in this way are easier to write (less code and bookkeeping), easier to understand (shorter), and potentially more efficient (letting the OS do the work it's supposed to). This looks to me like a feature of Unix that most C programmers miss, and it's closer to the 'functional' style of programming that most imperative languages miss.
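A minimal sketch of that style, with a hypothetical little filter program (the program and the pipeline below are made up for illustration): it allocates freely, never calls free or delete, and relies on process exit to return every page to the kernel in one sweep.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical short-lived filter in the "allocate, never free" style: read
// all lines, do a little work, print results, then exit and let the kernel
// reclaim everything at once (no destructor calls, no free()).
int main() {
    std::vector<std::string*> lines;            // deliberately never cleaned up
    std::string buf;
    while (std::getline(std::cin, buf))
        lines.push_back(new std::string(buf));  // never deleted

    for (const std::string* line : lines)
        std::cout << line->size() << '\t' << *line << '\n';

    return 0;   // process exit is the deallocation point
}
```

Several such short-lived programs can then be composed with the shell, e.g. something like `grep ERROR app.log | ./line-lengths | sort -n` (hypothetical names), with the kernel doing all the cleanup between stages.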

Re: short-running big-cleanup

Apparently a similar approach was used at Honeywell for avionics-type systems. It does seem like a neat idea to me. And other people are into the "fail-only" style, which seems related and fun.



P.S. A little bird tells me this story, and hopefully won't mind it being repeated here:

"That was the autopilot for the G-V I was working on while at
Honeywell. It's a 'frame-based' RTOS. The way it works is you have a
bunch of frames defined by their longevity. i.e., a 1 minute frame, a 1
second frame, a 100ms frame, and we even had a 4.2ms frame. You write
your code in such a way that it does the same task every frame. For
example, to control the yaw on the aircraft, I would define a thread and
tell the os to run it every 4.2ms frame. The thread would wake up, I'd
read the current stats of the aircraft, consider pilot input, run a
physics model of the aircraft, and decide what outputs to apply to the
control surfaces. The thread would then die and free up all its memory.
On the next 4.2ms event, it would all happen again. For lesser tasks
(checking to see if the pilot changed flight modes for example) you can
run those operations on a slower frame such as the 100ms frame.
The HUGE safey gain is that if it works (no memory leak, no crash) for
just a single 4.2ms frame, then it will work again and again. No memory
leak will ever build up - and if there is a bug, it is only scoped to
the time frame. We even had to consider cosmic rays hitting the memory
chips so being able to constrict a code fault to a single slice in time
is a big benefit."
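A rough C++ sketch of the pattern described above (an illustration under assumed names and rates, not the actual Honeywell system): each frame, the task runs against a fixed-size per-frame arena that is reset wholesale when the frame ends, so no allocation can outlive its frame.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

// Per-frame bump allocator: everything allocated during a frame is released
// in one shot by reset(), so leaks cannot accumulate across frames.
// (Alignment handling is omitted for brevity.)
class FrameArena {
public:
    explicit FrameArena(std::size_t bytes) : buf_(new std::uint8_t[bytes]), cap_(bytes) {}
    ~FrameArena() { delete[] buf_; }

    void* alloc(std::size_t n) {
        if (used_ + n > cap_) return nullptr;   // frame over its memory budget
        void* p = buf_ + used_;
        used_ += n;
        return p;
    }
    void reset() { used_ = 0; }                 // "free everything" at frame end

private:
    std::uint8_t* buf_;
    std::size_t cap_;
    std::size_t used_ = 0;
};

// One frame's worth of (stubbed) control work, allocating only from the arena.
void yaw_control_frame(FrameArena& arena) {
    double* scratch = static_cast<double*>(arena.alloc(64 * sizeof(double)));
    if (!scratch) return;                       // degrade gracefully if over budget
    // ... read sensors, run the physics model, command the control surfaces ...
}

int main() {
    using namespace std::chrono;
    constexpr auto kFramePeriod = microseconds(4200);   // assumed 4.2ms frame
    FrameArena arena(64 * 1024);

    for (int frame = 0; frame < 1000; ++frame) {         // bounded for the sketch
        auto start = steady_clock::now();
        yaw_control_frame(arena);
        arena.reset();                                    // all frame memory gone
        std::this_thread::sleep_until(start + kFramePeriod);
    }
}
```

The key property is the one the quote emphasizes: if a single frame runs within its memory and time budget, every subsequent frame starts from the same clean state, so a fault is confined to the frame in which it occurs.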

"Fail Only" Programming

That sounds like the system I wrote for my thesis, which I eventually found out, years later, was a specialized version of coroutines in C. It was for an animation/game engine, but it had very similar real-time requirements (as most games do, though without the life-and-death stakes of avionics). Using micro-function(al) "tasks" of known execution time made the creation of prioritized program generator sequences incredibly fast and easy (as opposed to the usual nested conditional/switch statements generally used). The kernel allowed for task chaining and per-process control, eliminating the complexity of event callbacks, since each task "knew" exactly what it was supposed to be doing and what was going to run next...

But the point is: we spend so much time trying to handle failure in our programs; maybe it's time we just wrote them "wrong" and learned what the underlying system does to handle that. It could eliminate a mass of duplicate code in userspace for things that should generally be handled by the kernel.