Garbage Collection Without Paging

Garbage Collection without Paging, Matthew Hertz, Yi Feng, Emery D. Berger. PLDI 2005

Garbage collection offers numerous software engineering advantages, but interacts poorly with virtual memory managers. Existing garbage collectors require far more pages than the application’s working set and touch pages without regard to which ones are in memory, especially during full-heap garbage collection. The resulting paging can cause throughput to plummet and pause times to spike up to seconds or even minutes. We present a garbage collector that avoids paging. This bookmarking collector cooperates with the virtual memory manager to guide its eviction decisions. Using summary information (“bookmarks”) recorded from evicted pages, the collector can perform in-memory full-heap collections. In the absence of memory pressure, the bookmarking collector matches the throughput of the best collector we tested while running in smaller heaps. In the face of memory pressure, it improves throughput by up to a factor of five and reduces pause times by up to a factor of 45 over the next best collector. Compared to a collector that consistently provides high throughput (generational mark-sweep), the bookmarking collector reduces pause times by up to 218x and improves throughput by up to 41x. Bookmarking collection thus provides greater utilization of available physical memory than other collectors while matching or exceeding their throughput.

Video of the talk with slides

Video: http://content.digitalwell.washington.edu/msr/external_release_talks_12_05_2005/12631/lecture.htm

& the Garbage Collection page: http://www.cs.kent.ac.uk/people/staff/rej/gc.html

any chance for this

any chance for this "injection" on the virtual memory manager be less invasive, for instance using a kernel module intercepting the right events?

Another question: is there any sample of the source code for this memory management logic in the kernel?

Language angle?

This is impressive and much needed but where is the language angle?

implementation angle

You're right that gc is hardly relevant to language expressivity, which is a focus of many folks into languages; but gc options matter to the language-implementor constituency. As long as I'm posting, I'll add more to comment on the paper (rather than your post).

The paper says: Existing garbage collectors require far more pages than the application’s working set...

I was just telling a coworker yesterday about my approach to gc and avoiding steep performance drops from touching too much memory. (The following basically comes from Mithril several years ago, so this idea isn't new.)

If you break an application into many independent processes (with separate logical address spaces sharing the same physical space) which don't share state (because they only send messages to one another) then you need only touch memory used by one process at a time during gc. This approach doesn't require far more pages than an app's working set (contradicting the paper unless they only meant widely deployed gc's).

Some kinds of app (I'm thinking of caches with gigabytes of data) are hard to make small enough for near instant gc, and the solution here is not to gc those apps. :-) Or to use a foreign function interface to a cache written in C or C++.

modified collectors

If you break an application into many independent processes (with separate logical address spaces sharing the same physical space) which don't share state (because they only send messages to one another) then you need only touch memory used by one process at a time during gc. This approach doesn't require far more pages than an app's working set (contradicting the paper unless they only meant widely deployed gc's).

If you use some standard GC within a process, collecting that process's heap requires touching more pages than that process's working set. Once you've gotten around to all the processes, you've touched more pages than the app has in its working set. Perhaps you neglected to mention something very clever about shuffling pages between different processes; it might be interesting to compact and trim the heap of a process whenever the runtime switches it out.

Some kinds of app (I'm thinking of caches with gigabytes of data) are hard to make small enough for near instant gc, and the solution here is not to gc those apps. :-) Or to use a foreign function interface to a cache written in C or C++.

Or use a foreign interface to malloc and free, writing the explicit memory management in a nice comfy language: just because you want to think about memory management for the cached data doesn't mean you want to do it for the algorithms that maintain the cache too!

Maybe a GC could use something like the scheme in this paper to reduce the overhead of collecting data in memory? Checking summary pages to confirm that most of the data in the oldest generation is still live might be faster than a usual major collection.

all or just some

Brandon Moore: If you use some standard GC within a process, collecting that process's heap requires touching more pages than that process's working set.

Yes, say for a simple copying gc.

Brandon Moore: Once you've gotten around to all the processes, you've touched more pages than the app has in it's working set.

But you needn't get around to all lightweight processes during one gc, and that's my point: they can gc separately. Some processes might rarely if ever need gc — or might create garbage at a much lower rate than other processes — so gc tends to touch memory in processes that are busier.

True, by the time you collect all processes, you've touched more than the app's working set. (Whether this qualifies as "much more" might depend on how much extra unused space you have left over after gc. You needn't have more unused memory than the largest process collected.)

I think you can arrange things so collecting one process doesn't terribly impact the working set of an app in other processes. And by doing gc more often in processes that generate lots of garbage, gc might correlate with memory in the working set. (Obviously this is only reasoning and not actual data I offer here.)

Brandon Moore: just because you want to think about memory management for the cached data doesn't mean you want to do it for the algorithms that maintain the cache too!

Yes, I want to handle bulk data in a language efficient at that, but use another language better suited to writing complex code and decision making. The latter can use gc more freely and not impact the cost of data managed in the former.

By using lots of lightweight processes that gc independently, you can get away with very simple gc schemes (like stop and copy) with minimal cost and complexity (since you needn't track inter-generation pointers) and still have some of the cost benefits of generational gc, since older processes that make garbage slowly act a little like old generations. And collecting one process at a time gives some of the incremental low latency effect of generational gc collecting younger generations.
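To make this concrete, here is a minimal sketch of such a per-process stop-and-copy (Cheney) collector. All the names are hypothetical and this is brainstorming in code under stated assumptions, not a tested implementation:

```cpp
// Sketch: each lightweight process owns a private semispace heap and is
// collected independently with Cheney's stop-and-copy algorithm. No write
// barrier or remembered set is needed because processes share no pointers
// (they only exchange messages). Sizes are assumed suitably aligned.
#include <cstddef>
#include <cstring>
#include <vector>

struct Obj {
    std::size_t size;        // total size in bytes, including this header
    Obj*        forward;     // forwarding pointer, set during collection
    std::size_t num_fields;  // pointer fields immediately follow the header
    Obj** fields() { return reinterpret_cast<Obj**>(this + 1); }
};

class ProcessHeap {
    std::vector<char> from_, to_;
    std::size_t alloc_ = 0;

public:
    explicit ProcessHeap(std::size_t bytes) : from_(bytes), to_(bytes) {}

    // Bump allocation out of from-space; a real implementation would
    // trigger gc() when alloc_ + bytes exceeds from_.size(). The caller
    // fills in num_fields and the fields themselves.
    Obj* alloc(std::size_t bytes) {
        Obj* o = reinterpret_cast<Obj*>(from_.data() + alloc_);
        std::memset(o, 0, bytes);   // clears forward for the next cycle
        o->size = bytes;
        alloc_ += bytes;
        return o;
    }

    // Collect only THIS process's heap. Pages belonging to the app's
    // other processes are never touched, which is the point above.
    void gc(std::vector<Obj*>& roots) {
        std::size_t free = 0, scan = 0;
        for (Obj*& r : roots) r = copy(r, free);
        while (scan < free) {                         // Cheney scan
            Obj* o = reinterpret_cast<Obj*>(to_.data() + scan);
            for (std::size_t i = 0; i < o->num_fields; ++i)
                if (o->fields()[i])
                    o->fields()[i] = copy(o->fields()[i], free);
            scan += o->size;
        }
        from_.swap(to_);   // to-space becomes the new from-space
        alloc_ = free;
    }

private:
    // Move one object to to-space, leaving a forwarding pointer behind.
    Obj* copy(Obj* o, std::size_t& free) {
        if (o->forward) return o->forward;            // already moved
        Obj* dst = reinterpret_cast<Obj*>(to_.data() + free);
        std::memcpy(dst, o, o->size);                 // dst->forward stays null
        free += o->size;
        o->forward = dst;
        return dst;
    }
};
```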

Sorry for my wordiness. I didn't really mean to have a discussion. I just wanted to contradict one phrase in the paper; but your reply shows it's always possible to rearrange definitions to blur issues. And without data and a study it's just my conjecture that putting busier garbage generation into one process is effective for simulating young generation collection. [cf]

By using lots of lightweight

By using lots of lightweight processes that gc independently, you can get away with very simple gc schemes (like stop and copy) with minimal cost and complexity (since you needn't track inter-generation pointers) and still have some of the cost benefits of generational gc, since older processes that make garbage slowly act a little like old generations. And collecting one process at a time gives some of the incremental low latency effect of generational gc collecting younger generations.

Your hypothesis has already been disproven. Erlang used to use your design, and the linked paper demonstrates how and why a single shared GC is quite a bit faster.

[Edit: unless of course you have some design which would mitigate the performance problems enumerated in the paper?]

granular heap

naasking Erlang used to use your design...

I didn't say much about my design, which is more along the lines of a unified heap style, but with the ability to make disjoint heap subsets for different processes when you want. The result is something that lets you hedge which way you want it to go, depending on global optimizations across processes.

I like designs letting you pick and choose effects a la carte as much as possible. Product engineering has stages of iterative refinement that are a little hard to describe in simple and smooth ways. Anyway, I think you can get all the effects described in the paper using my design, but it would make a lot more sense to verify such things empirically later.

Which specific effect enumerated in the paper do you want to see mitigated? If you articulate it clearly I'll try to answer just as clearly. But only as long as you're willing to accept the content as unsubstantiated brainstorming.

Clearly Erlang benefits

Clearly Erlang benefits significantly from a unified heap since message-passing reduces to pointer exchange; this is possible only because Erlang is purely functional. I'm not sure what sort of application you have in mind, but in order to make disjoint heaps useful, there can't be very much data exchange between them, unless you go with the Singularity OS's approach, where individual heaps have configurable GCs, and cross-heap references are allocated from an "exchange" heap which is GC'd independently.

not 1 to 1, or 1 to N, N to M

I started a very poorly focused reply with no hope of brevity to match your casual tone, so I thought a short and vague comment might do better than none: P processes, H heaps, P != H; probably P == C * H is a good idea; some heaps gc, some not; some messages passed by reference, some not. Whatever works well empirically without any particular dogma. (Why is dogma so attractive?) Now I need to catch the end of Meet Joe Black.

P processes, H heaps, P !=

P processes, H heaps, P != H; probably P == C * H is a good idea

What is C?

I agree that being able to more explicitly control resource allocation will result in the highest performance attainable, but at the moment, it seems that implementing such a mixed system cannot hope to approach the ease and safety of programming in either heap system alone.

I agree that your approach would be attractive for a "universal VM" of sorts, like LLVM augmented with distinct heaps (which incidentally, I recently suggested to them). Essentially, it sounds like you are arguing for an operating system with first-class address spaces (like EROS, Coyotos, etc) reified as a virtual machine. I agree with this goal for a VM, but I don't see how to easily expose that complexity in a language.

Whatever works well empirically without any particular dogma. (Why is dogma so attractive?)

Depending on how strict your definition of dogma is, the answer is: simplicity. Programming languages strive to manage complexity, and increasing complexity hardly seems like a win considering today's incredibly large systems. The simplicity of dogmatic solutions, in understandability and maintainability, can outweigh the performance advantages of custom-tailored solutions. Of course, here I'm assuming dogma == purity.

Indeed, if there is some consistent empirical evidence that certain optimizations or abstractions are often useful, they can probably be turned into a relatively simple and coherent theory.

Erlang heap models

Feeley's paper is not an argument against rys's comment. In particular, Feeley only makes a one-sided argument that message passing is much cheaper in a shared heap, and does not analyse the effect of a shared heap on the cost of GC. This does not say that rys is wrong that per-process GC can be simple and effective.

Erlang still uses per-process heaps by default, though it does provide other heap models as options, including shared and hybrid heaps. Have a look at Jesper Wilhelmsson's work at the HiPE project which describes and benchmarks them. He has a much deeper approach to making message passing fast while keeping GC times small.

HiPE publications page

broken link

The paper is Feeley's "A Case for the Unified Heap Approach to Erlang Memory Management".

Ancient

I hit this problem when trying to manipulate multi-gigabyte data sets in OCaml. Normally the OCaml GC does a good job of staying out of my way -- i.e. I just don't notice it at all. But as soon as the total memory used goes into swap, the whole program falls into a thrashing hole. So even loading the data into memory and "not touching it" didn't work, as I quickly discovered.

For this reason I wrote ancient (readme) which is basically a form of manual memory management / file-backed shared memory for OCaml. 30-40 GB datasets in a single process now aren't a problem (on 64 bit archs at least ...)

Rich.

This shows why it is a good

This shows why it is a good idea to put GC functionality in the OS rather than in a language implementation. That's a language angle for you =)

Data representation

How would the OS know how to traverse your data structures? A lot of languages use tagged representations of data, with very nonstandard tagging schemes. Some languages like C have no way of telling you what's a pointer and what isn't, so conservative garbage collection is about the best they can do. How is an operating system supposed to figure out the morass of data representations and do garbage collection for them all?
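For illustration, here is one common tagging convention, sketched in C++. The details are hypothetical and vary widely between runtimes, which is exactly the problem for an OS-level collector:

```cpp
// Sketch of low-bit tagging: the low bit distinguishes a small integer
// from a heap pointer, relying on pointers being at least 2-byte aligned.
// Other runtimes tag pointers instead, use high bits, or NaN-box doubles;
// an OS has no way to know which convention a given process uses.
#include <cstdint>

using Value = uintptr_t;

inline Value    box_int(intptr_t i)  { return (static_cast<uintptr_t>(i) << 1) | 1; }
inline bool     is_int(Value v)      { return (v & 1) != 0; }
inline intptr_t unbox_int(Value v)   { return static_cast<intptr_t>(v) >> 1; }
inline void*    as_pointer(Value v)  { return reinterpret_cast<void*>(v); }

// A collector for this runtime traces a field only when is_int() is
// false; a collector that guessed wrong would chase integers as pointers.
```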

big data sets

Richard Jones: 30-40 GB datasets in a single process now aren't a problem...

At work I optimize network traffic on data sets bigger than that (but I can't specify how much bigger exactly without saying too much; the less said about work the better). I just finished an SMP-safe "page" cache in C++, and product performance is limited by how effectively I can do this when RAM is much smaller than data sets. There are always more steps to take in scheduling more effectively.

If I was going to make a product that used gc, I'd handle paging of data myself (as I do now) and lock memory in RAM containing the app spaces that need gc. Then I wouldn't be at the mercy of OS virtual memory for gc effectiveness.

mmap ?

For this kind of workload, wouldn't simply mmap'ing the data file (assumption: the data is in a file) and letting the VM subsystem manage the caching and memory be more efficient?

Or heck... a native library can go to the extent of presenting the mmap'ed region as an array of data structures.
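As a sketch of what that looks like, assuming a file of fixed-size records (the record layout and filename are made up):

```cpp
// Minimal POSIX sketch of the mmap approach: map a data file and view it
// directly as an array of fixed-size records, letting the kernel's VM
// subsystem handle caching and eviction.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Record {            // hypothetical fixed-size record layout
    uint64_t key;
    double   value;
};

int main() {
    int fd = open("data.bin", O_RDONLY);   // hypothetical data file
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);
    size_t n = st.st_size / sizeof(Record);

    auto* records = static_cast<const Record*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (records == MAP_FAILED) return 1;

    // Touching records[i] faults its page in on demand; the kernel
    // evicts cold pages under memory pressure (typically LRU-like).
    double sum = 0;
    for (size_t i = 0; i < n; ++i) sum += records[i].value;
    std::printf("sum = %f over %zu records\n", sum, n);

    munmap(const_cast<Record*>(records), st.st_size);
    close(fd);
    return 0;
}
```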

mmap standard

I can't answer in substantiating detail without revealing competitive info. In fact, just replying to a comparison with mmap might be too much.

Akhilesh Mritunjai: ...letting the VM subsystem manage the caching and memory be more efficient?

That's what folks try first, and I optimized that version before the new approach. It's a competition: the mmap version will never be retired so it can be used for reference. The winner on benchmarks is the one deployed. I'm sorry I can't say more about it.

Obviously, I'd like to hear

Obviously, I'd like to hear more, but I'm glad you're sharing what you do feel comfortable talking about.

generic optimization

It needs to get much more out of date before I say more; they consider patenting each thing I code for them. (It's been made clear to me that anything I merely consider interesting is patentable. I try to be maximally agreeable, so I assume everything I do at work is off limits no matter how obvious it is to me.)

The best I can do is speak in generic terms about optimization, using ideas true all the time. That neither points at specific costs to be minimized in this case (informing competitors), nor discloses a specific method to do that.

A lot of optimization can be driven by increasing the scope of context for analysis, to save costs across boundaries that would normally be opaque. A one-size-fits-all solution to a problem is rarely the best fit for a given app's traffic. VM subsystems don't let you control enough of the parameters that have a constant-factor performance impact.

When mmap and the host VM don't work

Using the host OS's VM to manage your cache doesn't work when LRU fails as a cache-management algorithm for your use case.

Here's one example. In the AUVSI aerial robotics competition, our team produced an image mosaicking application for all of the aerial photos taken by a small plane. A dataset for a single flight came in at roughly 10 GB, and we were using OpenGL to do the rendering/compositing. The memory hierarchy went: 128 MB of video RAM (only 100 or so useful for textures), 1 GB of system RAM, and 1 TB of hard drive (RAID5 for throughput).

The host OS was 64-bit Linux, so we could fit everything into the application's address space by mmapping all of the source data. Our first pass mmapped all of the image files into the application's address space and manually paged images between application memory and texture memory. The performance was atrocious: as you panned around the screen, there would be long latencies as the system thrashed disk to load the new images all the way from disk to RAM to VRAM while pushing out in the reverse order, and the latency hit came exactly when an image was needed to render the screen.

The problem is that as the user panned around the mosaic, the working set (under LRU) was effectively a trail behind where the screen used to be. What we needed was a predictive cache that preloaded the images in a 2D window around the current view. So we implemented that, and performance was buttery smooth, because by the time images were needed for rendering they were already present in texture memory.
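A minimal sketch of that predictive scheme, assuming the dataset is mmap'ed and tile start addresses are page-aligned (the names and grid dimensions are hypothetical, and madvise is only one way to do the preloading):

```cpp
// Instead of relying on the kernel's LRU, explicitly request the tiles
// in a 2D window around the current viewport with madvise(MADV_WILLNEED),
// so images are already resident when the user pans onto them.
#include <cstddef>
#include <sys/mman.h>

struct Tile {
    void*  addr;   // start of this image in the mmap'ed dataset
    size_t len;    // its length in bytes
};

// grid[y][x] indexes the mosaic's tiles; dimensions are assumptions.
constexpr int GRID_W = 64, GRID_H = 64, RADIUS = 2;
extern Tile grid[GRID_H][GRID_W];   // hypothetical, filled in elsewhere

// Ask the kernel to page in every tile within RADIUS of the viewport
// center, ahead of rendering. MADV_DONTNEED could likewise release
// tiles that fall far behind the viewport.
void prefetch_around(int cx, int cy) {
    for (int y = cy - RADIUS; y <= cy + RADIUS; ++y)
        for (int x = cx - RADIUS; x <= cx + RADIUS; ++x) {
            if (x < 0 || x >= GRID_W || y < 0 || y >= GRID_H) continue;
            madvise(grid[y][x].addr, grid[y][x].len, MADV_WILLNEED);
        }
}
```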

Re: When mmap and the host VM don't work

I'm assuming that you could have / did implement your predictive cache by just touching the memory pages corresponding to images around the current 2D position?

We found that layout was very important for our data. We manually rearranged structures so that fields which were commonly read in were arranged next to each other in memory. (See readme bug #6 for details.) What I'd like to be able to do is choose a good arrangement automatically, but I'm not sure how to do this.
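For illustration, the kind of manual rearrangement meant here might look like the following hot/cold split (the types are hypothetical, not the actual ancient structures):

```cpp
// Splitting rarely-read "cold" fields out of a record keeps the hot
// fields of many records packed onto fewer pages, so a scan over the
// hot fields touches far less memory.
#include <cstdint>
#include <memory>
#include <string>

// Before: hot and cold fields interleaved; a scan of keys drags the
// large cold payload through memory page by page.
struct EntryInterleaved {
    uint64_t    key;            // read on every lookup (hot)
    std::string description;    // rarely read (cold)
    uint64_t    checksum;       // read on every lookup (hot)
    char        blob[4096];     // rarely read (cold)
};

// After: hot fields contiguous, cold fields behind one pointer.
struct ColdPart {
    std::string description;
    char        blob[4096];
};
struct EntrySplit {
    uint64_t key;
    uint64_t checksum;
    std::unique_ptr<ColdPart> cold;   // touched only when actually needed
};
```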

Rich.

Requires kernel co-operation

This is indeed an impressive paper, but I was disappointed at the degree of changes required from operating systems to support this functionality. I wonder if there is a GC scheme that can approximate these results without OS co-operation, either via some sort of heuristic, like a software-based page residency scheme to estimate whether a page is in memory, or via an alternate GC design that exploits more locality and incremental collection.

For instance, mark-copy is the fastest and most memory-conscious GC I've seen yet. It splits the "old generation" into a number of smaller "windows", each of which is collected independently. From the design, it seems that it would exploit some of the same residency effects seen in this paper, and that could account for its high throughput and low pause times.

As an example of a software residency heuristic I was thinking of, imagine a read and write barrier on old-generation pointers which increments an LRU count for each page. On collection, any page below a threshold value is skipped using this paper's bookmarking technique, then all LRU fields are reset to 0 and the cycle restarts. Since it's purely software based, this page residency technique will only approximate the paging behaviour of the underlying OS, and while it increases the cost of pointer access, this paper demonstrates that paging is so expensive that avoiding swap thrashing can make the additional overhead acceptable. (A rough sketch of this heuristic follows below.)

Any other ideas?
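For concreteness, here is a rough sketch of the heuristic just described (the counter layout and threshold are hypothetical, and the scheme only approximates true residency):

```cpp
// Every old-generation pointer access passes through a barrier that bumps
// a per-page access count. At collection time, pages whose count is below
// a threshold are presumed evicted and skipped, bookmark-style, and all
// counts are reset for the next cycle.
#include <cstdint>
#include <unordered_map>

constexpr uintptr_t PAGE_SHIFT    = 12;   // assume 4 KB pages
constexpr uint32_t  HOT_THRESHOLD = 4;    // tuning parameter

static std::unordered_map<uintptr_t, uint32_t> page_access_count;

// Read/write barrier on old-generation pointers: record the access.
// This per-access cost is what the paper's authors avoided by asking
// the OS for real residency information instead.
inline void* barrier(void* p) {
    ++page_access_count[reinterpret_cast<uintptr_t>(p) >> PAGE_SHIFT];
    return p;
}

// At collection time: decide whether to trace into a page or skip it.
inline bool probably_resident(void* p) {
    auto it = page_access_count.find(
        reinterpret_cast<uintptr_t>(p) >> PAGE_SHIFT);
    return it != page_access_count.end() && it->second >= HOT_THRESHOLD;
}

// After each full collection, reset the counters and restart the cycle.
inline void reset_counts() { page_access_count.clear(); }
```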

other approaches

Hi,

Always nice to see one of our papers discussed on LtU :).

We decided not to use a read-barrier approach in BC for the simple reason that read barriers are expensive and would impose a big performance penalty all the time. On the other hand, BC performs as well as a best-of-breed, high-throughput garbage collector, while performing much better under memory pressure. I think this was the right design decision.

Incidentally, this was not as significant a change to the OS as you seem to think. IIRC, we're talking 300 lines, and most of that was plumbing.

You might be interested to know that our research group pursued two parallel cooperative approaches, where the OS & GC work together to avoid paging. The first, discussed here, was the "minor changes to OS, big new GC" approach.

The second was the "extremely minor changes to existing GC algorithms, major overhaul of OS VM" approach (CRAMM: Virtual Memory Support for Garbage Collected Applications, OSDI 2006, Yang, Berger, Kaplan, Moss). That paper shows how to maximize memory utilization (big heap = fewer collections) while minimizing paging (though not quite as much as bookmarking collection). The VM algorithm is also far more principled and nice, but that's a whole other issue :).

Another approach we have developed to gather reference information is sampling the per-process LRU queue, which is cheap, fast, and works well, and also requires minimal OS support. We used this to support transparent memory management - letting background apps use system memory without impacting foreground performance (Transparent Contribution of Memory, Cipar, Corner & Berger, USENIX 2006).

Other work: Chen Ding's (Rochester) and Chandra Krintz's (UC Santa Barbara) groups have both looked at userland-only approaches to dealing with paging. These are both reactive systems that focus on shrinking the heap (while CRAMM shrinks and grows the heap predictively, which may be better). On the other hand, bookmarking collection works even when the heap itself is too big to fit in RAM.

Also, if you liked mark-copy, you should love MC^2 (OOPSLA 2004; both are from our research group, and MC^2 is mark-copy's successor).

best,
-- Emery

We decided not to use a

We decided not to use a read-barrier approach in BC for the simple reason that read barriers are expensive and would impose a big performance penalty all the time. On the other hand, BC performs as well as a best-of-breed, high-throughput garbage collector, while performing much better under memory pressure. I think this was the right design decision.

Oh, no doubt this sort of approach is "the future", as it highlights the utility of providing more information and/or control to the application about its paging behaviour. However, BC doesn't help me right now if I'm writing a VM to run on commodity Windows machines or Windows server environments, unfortunately.

In that vein, I was asking after user-space only approaches that might approximate BC's benefits. I agree that a read barrier is quite expensive, but it was a first suggestion which approximates the OS's approach to LRU paging.

Also, if you liked mark-copy, you should love MC^2 (OOPSLA 2004; both are from our research group, and MC^2 is mark-copy's successor).

Indeed, thanks for the pointer, though a quick glance at the graphs doesn't demonstrate the clear superiority of MC^2 over MS/MSC that MC did. Am I just reading it wrong? How much throughput did MC^2 sacrifice compared to MC for its incrementality?

Hi, did you try to submit

Hi, did you try to submit your changes to the Linux kernel folks? If so, what was their response?

Curious too

I'm curious to know the answer also.

Note that AFAIK, Linux kernel hackers only accept code when there are 'real users' who want to use it: saying 'please add this code to the VM so that we can do our research paper without patching the kernel' won't get you very far.
OTOH, "we have modified the Sun JVM (which is open source now), and if you add our 300 lines to the VM, then the GC is x% faster" would probably be a very good argument for inclusion, but very few researchers have the will/time to go from the first situation to the second.

Inclusion in the kernel

Sun's JVM was not available when we did this work, so that wasn't an option. That said, I'm not sure it would have made a difference. The Linux kernel community is not too sophisticated when it comes to benchmarking (a standard benchmark is "make -j bzImage", i.e., a parallel build of the Linux kernel -- not exactly a normal workload). Worse, they don't use Java and are insensitive to concerns about garbage collection, which they naturally also do not actually understand. We had a couple of rounds of fruitless discussions, which felt a bit like talking to the lead guitarist of Spinal Tap ("but it goes up to ELEVEN!"), and finally gave up.

Worse, they don't use Java

Worse, they don't use Java and are insensitive to concerns about garbage collection

Perhaps framing the discussion of VM-subsystem feedback as an application-level control for harder real-time guarantees would be more fruitful. Is there a link to the discussion with the kernel devs?

Linux kernel discussion

Here's the discussion on the Linux-MM list. We continued with some off-list correspondence, as well, but...

Obviously, we'd love to see this kind of thing in the kernel, so I welcome any suggestions toward that end.

Best,
-- Emery

I wonder if there is a GC

I wonder if there is a GC scheme that can approximate these results without OS co-operation, either via some sort of heuristics, like a software-based page residency scheme to estimate whether a page is in memory

I recently found Using Page Residency to Balance Tradeoffs in Tracing Garbage Collection, which answers this question. :-)

Difference with 'Page level Cooperative GC'?

I wonder what the difference is between this paper and the previous one by the same team, 'page level cooperative GC'; at first glance, they seem quite similar.

I'm always astonished at the many features that are needed to make a really good GC: cooperation with VM, good multicore usage, 'real time' behaviour, memory usage...

Same (kind of)

Our paper "Garbage Collection without Paging" (subject of this thread) is the polished, conference (read: better) version of a previous tech report with the title you mention.