Avoiding worst-case GC with large amounts of data?

Garbage collection has improved greatly over the years, but there are worst cases which are still difficult to avoid without writing your own custom memory management system. For example, let's say you have 128MB of 3D geometry data loaded in Haskell/OCaml/ML-NJ/Erlang/Lisp/YourFavoriteLanguage. This data is stored in the native format of your language, using lists and tuples and so on. At some point, the GC is going to go through that 128MB of data. Generational collection delays this, but at some point it will still happen. Or do any systems handle this situation gracefully?

The best I've seen in this regard is Erlang, because each process has its own heap, and those heaps are collected individually. But if you put 128MB of data in one heap, there will still be a significant pause. Perhaps Erlang's GC is still better than most even then, as single assignment creates a unidirectional heap, which has some nice properties.

And then of course there are the languages with reference counting implementations, which perform better than languages with true GC in this regard.

Thoughts?

Incremental GC

Or do any systems handle this situation gracefully?

This is the whole point of real-time/incremental garbage collectors.

And then of course there are the languages with reference counting implementations, which perform better than languages with true GC in this regard.

Reference counting can be adapted to be incremental, but the basic version is not: dropping the last reference to a large structure triggers a cascade of frees. And if you allow cyclic data structures you'll need "true" GC anyway.

Uniprocessor Garbage Collection Techniques, an aging but good survey.


If you are wondering why most language implementations don't use incremental GC, it is mostly due to increased implementation complexity, decreased throughput, and the fact that hard real-time guarantees are rarely required. However, an incremental GC for Haskell is described here.

If you know all this, then I'm not sure what your question is.

Implementations

If you know all this, then I'm not sure what your question is.

Yes, that's essentially my point and question. You hit a GC wall with large amounts of data, and as far as I can tell this has not been addressed in a concrete language implementation. If so, it's a strong, strong case for using reference counting when at all possible.

GC as a tool

I like this way of looking at the problem. Not "can I design a GC that would have no problem with big heaps", but "how could I manage a big heap with one of these particular GCs." I suspect that the best thing a GC can provide is a predictable performance model that programmers can understand and use.

Erlang's separate heaps are pretty good for this, as you say. There's also that fun Erlang performance hack for avoiding GC when you have a lot of short-lived processes (e.g. HTTP request handlers). First you measure how much heap allocation they typically do and then, if it's reasonably small, you preinitialize their heaps to that size. That way most processes will run and terminate without GC'ing and you can scrap their whole heap without ever traversing it.

Generation-scavenging collectors seem particularly fun too. When GC cost is proportional only to live data and not garbage then there's clearly an opportunity for lots of guilt-free allocation :-). Seems you could do a trick like the Erlang one, for example. Suppose you have a function that typically allocates 2MB of temporary data. If you call it with an empty 2MB "nursery" generation and don't GC until it returns then the GC should cost close to zero -- there's no live data in the nursery, it's all become garbage.
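For concreteness, here's roughly how I'd try that by hand in CMUCL (just a sketch: it assumes the young generation really is empty right after a full collection, and allocate-2mb-of-temporaries is a made-up stand-in for your function):

(defun call-against-empty-nursery (fn)
  "Run FN right after a collection, so its temporaries live and die
in an (initially) empty young generation."
  (gc :full t)                  ; empty the nursery first
  (funcall fn))                 ; FN's garbage should never be promoted

;; e.g. (time (call-against-empty-nursery #'allocate-2mb-of-temporaries))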

The catch with generational collectors seems to be what you mentioned: At some point, the GC is going to go through that 128MB of data. Continuing the GC-as-tool idea, we would want to understand when and why the older generations get collected and see if we can control it.

Some ideas come to mind. What about if we initially heap-allocate our 128MB, force it all the way into the last generation, and then be careful not to allocate enough long-lived data to ever need to collect that generation? Alternatively, what about "tenuring" the last generation so that the GC will leave it alone until further notice and stop at the second-last generation instead?

This is the kind of thing that interests me. In my heart I feel that generation-scavenging collectors are really good and that they support a very nice programming style. Lots of (Lisp) code I see is instead written to fight the GC and I'm not sure that it does so effectively. Do they know something I don't, or do they just want to be portable to other garbage collectors? (I wonder if language specs should state required GC characteristics the same way RnRS does for tail-call optimization.)

Anyway, it's at this point that I always make some measurements that turn out to be totally invalid in some unforeseen way. I will continue this tradition with a follow-up :-)

CMUCL benchmark

Here's a small benchmark running under CMUCL. The workload is to recompile a bunch of files, which takes around 45 seconds and does over 700MB of heap-allocation. The workload is measured in three settings:

  • Fresh Lisp image (as fresh as mine usually are, anyway).
  • With 64MB of live conses allocated in the oldest generation.
  • The same as the previous case but with the last generation tenured.
For me it runs at the same speed in all cases. It would be interesting to try with substantially more than 64MB but this machine is memory-poor and I need to avoid swapping.
(in-package :cl-user)

(defconstant megabytes 64
  "We will heap-allocate this many long-lived megabytes of conses.")

;; Save the results (standard output) to a file.
(dribble "/tmp/results")
(setq *trace-output* *standard-output*)

(setq *gc-verbose* t)
(setq *compile-print* nil)
(setq *compile-verbose* nil)


(defun work ()
  "Do some work that will require garbage collections."
  ;; Recompile regexp system
  (dotimes (i 3)
    (asdf:oos 'asdf:compile-op :cl-ppcre :force t)))

(defun say (format &rest args)
  "Print a message on a single line in a grep'able way."
  (format t "~&/// ~?~%" format args))

(say "Warming up")
(work)
(gc :full t)

(say "RUN: small heap")
(time (work))
(gc :full t)

(say "Generating lots of data on the heap")
(defvar *waste*
  (loop for i from 1 to (* megabytes 1024 1024 1/8) collect i))

(say "Forcing it all into the last generation")
(gc :full t)

(say "RUN: big heap")
(time (work))
(gc :full t)

(say "Tenuring last generation")

;; We have to poke a C variable for this.
(alien:def-alien-variable ("gencgc_oldest_gen_to_gc" gencgc-oldest-gen-to-gc)
    c-call::int)

(decf gencgc-oldest-gen-to-gc)

(say "RUN: big (tenured) heap")
(time (work))
(gc :full t)
Selected output:
$ grep -e '^///' -e 'seconds' -e 'bytes consed' -e 'page faults' /tmp/results
/// Warming up
/// RUN: small heap
;   45.95 seconds of real time
;   22.014652 seconds of user run time
;   0.993849 seconds of system run time
;   [Run times include 4.63 seconds GC run time]
;   0 page faults and
;   718,752,904 bytes consed.
/// Generating lots of data on the heap
/// Forcing it all into the last generation
/// RUN: big heap
;   46.24 seconds of real time
;   22.3676 seconds of user run time
;   1.047841 seconds of system run time
;   [Run times include 4.98 seconds GC run time]
;   0 page faults and
;   718,632,984 bytes consed.
/// Tenuring last generation
/// RUN: big (tenured) heap
;   46.34 seconds of real time
;   22.383596 seconds of user run time
;   1.026844 seconds of system run time
;   [Run times include 5.06 seconds GC run time]
;   0 page faults and
;   718,784,504 bytes consed.
Already I'm puzzled that the real time is twice user+system. Disk access? No time to investigate. :-)

If so, it's a strong, strong

If so, it's a strong, strong case for using reference counting when at all possible.

Please elaborate on this statement ;-). I think I can guess at your assumptions and conclusions, but is there any proof that reference counting (or some hybrid GC algorithm using reference counting) outperforms other, say generational, GC algorithms?

You are in good company.

Guy Steele talked about this at a language design panel. From what I remember, he said they were working on getting Java to work well for hundreds of gigabytes and suggested garbage collection may simply not be the right technique for very large amounts of data. Java uses the train algorithm, which is different from a traditional incremental gc.

Dynamic Language Wizards Talk
Java Runtime
Train Algorithm

The case you mention (3D geometry) might work well with region-based memory management, but that is more experimental.

ML Kit

Special purpose heap?

Is the data mostly static? In that case you could use a GC with two heaps, one that's managed with generational-or-what-have-you GC, and the other one that you manage by hand. I remember comments on LtU mentioning this was sometimes done.
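In CMUCL terms, the hand-managed half could even live outside the Lisp heap altogether, in malloc'd memory reached through the ALIEN facility. A sketch only: I'm assuming the usual alien:make-alien / alien:deref / alien:free-alien interface, and that the geometry doesn't need to point back at Lisp objects.

;; Sketch: static geometry in foreign (malloc'd) memory the collector
;; never traverses; its lifetime is managed explicitly.
(defparameter *geometry*
  (alien:make-alien single-float (* 32 1024 1024)))   ; 32M floats, roughly 128MB

(defun geometry-ref (i)
  (alien:deref *geometry* i))

(defun (setf geometry-ref) (value i)
  (setf (alien:deref *geometry* i) value))

(defun free-geometry ()
  (alien:free-alien *geometry*)
  (setf *geometry* nil))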

Many solutions, but...

When I was thinking about a reply to 'Implementations', I came up with all kinds of ad-hoc ways you could deal with the (implied) problem(s), and some of them have now been mentioned by others. I never finished/posted the reply because I fairly quickly came to realize something. (This led to another fairly interesting thought, and so I did still spend some time thinking of solutions, etc.) That "something" was that all such solutions to the problem were ad hoc; otherwise they were pretty much, by definition, incremental GC. (The interesting idea is what sort of operations it would be reasonable for "standard" GC algorithms to export to help with this.)

The only truly general-purpose solutions I see are:
1) If you want incremental GC, use incremental GC; or
2) the language provides enough tools to "write your own custom memory management system" (and the GC doesn't get in your way too much, or can be overridden)

Alexandre Richer's post is an instance of (2), and is along the lines of what I'd suggest for the scenario your original post seems to imply (a large amount of data that lives across a "stage", e.g. loading world data for a level of a game).

(Incidentally, things like this are why lack of GC (by default) is not something I consider a problem with C++, given its "aims".)

Orthogonal Optimization.

Garbage collection is a model-vs.-implementation issue. It affects performance, and it's part of the implementation, but ultimately the programmer must be "unaware" of the details and must simply let the GC abstract them away.

Of course, I always hated the idea that optimization was either hand-tuning code or giving the compiler a magic "level" number. My question is: why can't we, orthogonally to our algorithms, suggest optimizations to the compiler/interpreter? Why can't my language support an annotation saying, "this function tends to use about 2MB of heap", and, without affecting the semantics or anything else about my code, the compiler will silently use this to do whatever is appropriate?

I'm not a Lisp expert by any stretch, but I thought that was somewhat how some Lisp compilers treat type annotations. Of course, in this case it's not so much a typing issue as a case of me having more information than the compiler. Sorry if I'm wrong about it, but it sounds like a good idea anyway :).

It's hard

Why can't my language support an annotation saying, "this function tends to use about 2MB of heap", and, without affecting the semantics or anything about my code, the compiler will silently use this to do whatever is appropriate.

How would the compiler make use of such information?

It's useful sometimes

If it's a generational GC, the compiler can use that information to set the initial size of the nursery, so the data allocated by the function is never promoted to older generations. Allocate, return, and discard the whole area.
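With CMUCL-style knobs you can already approximate this by hand. A sketch, assuming ext:*bytes-consed-between-gcs* really is the variable that controls how much allocation is allowed between collections (I haven't checked how gencgc honours it):

;; Sketch: give the young generation enough headroom that a function
;; annotated as "allocates about 2MB" normally returns before any GC.
;; 8MB of slack rather than exactly 2MB, to leave room for everything else.
(setf ext:*bytes-consed-between-gcs*
      (max ext:*bytes-consed-between-gcs* (* 8 1024 1024)))

A compiler that trusted the annotation could in principle do the equivalent automatically around the call.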

And if, and if, and if...

It seems that every different type of GC and every different configuration would need its own set of flags and different annotations to get similar effects. And even then, most GC systems are simply not flexible enough to really provide much (though Beltway is a partial exception; there are also a lot of good-looking papers at that link, e.g. Ulterior Reference Counting). More desirable would be behavior that adapts dynamically as the memory profile and needs of a program change throughout its life.

I don't really see annotations being too useful for anything beyond tweaking for most GC implementations.

As an example, let's say you are using an implementation that uses Appel's simple generational GC (a 2-age copying collector). The scenario is "the world data for a level of a 3D game" (a large amount of staged long-lived data plus a reasonable amount of normal garbage); we're optimizing for latency but it isn't hard real-time (i.e. a half-second hiccup once in a blue moon is acceptable, a 3-second pause regularly is not). Loading the data can generate pauses as long as necessary. What kind of annotations would provide good behavior, and how would they be realized? What about other GC configurations?

To contrast, simply using a manual pool allocator for the world data and leaving the rest to the GC would work, assuming the remaining live data is typically quickly collectible.
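For concreteness, the kind of manual pool I mean is nothing fancier than this (a sketch; the sizes and the idea of resetting it wholesale at the end of a level are made up for the example):

;; Sketch: one big unboxed pool for the level's world data, handed out
;; bump-allocator style and discarded wholesale when the level ends.
;; The GC may still copy the array, but it has no pointers to trace inside it.
(defparameter *world-pool*
  (make-array (* 32 1024 1024)                ; 32M single-floats = 128MB
              :element-type 'single-float
              :initial-element 0.0f0))

(defparameter *world-pool-free* 0)

(defun world-alloc (n-floats)
  "Reserve N-FLOATS from the pool and return the start index."
  (let ((start *world-pool-free*))
    (when (> (+ start n-floats) (length *world-pool*))
      (error "World pool exhausted"))
    (incf *world-pool-free* n-floats)
    start))

(defun world-reset ()
  "Drop the whole level's data at once; nothing for the GC to chase."
  (setf *world-pool-free* 0))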

Not just memory...

First off, I don't think annotations should be tied to specific implementation patterns. For instance, saying a function needs about 2MB of RAM is just a fact of existence, and if it's a fact that can be _used_ by the GC algorithm, great; if not, it's nice that it's an annotation that does not change anything about the code itself or its meaning.

Second, I think it would be nice to have this sort of stuff available for more than just memory optimizations. I think a generalized scheme for encoding information (that is perhaps checkable) which doesn't change the semantics of the code would be a desirable thing, in that it might help both compiler writers (having more information available) and programmers ("optimizing" their code without making it uglier). I guess "typing" annotations are sort of like this (but they change the semantics of your code). Of course, yeah, you _can_ manage it yourself (pools and such), but that involves modifying the code (and usually writing more code) to get better performance without really changing the semantics (or algorithm, really).

How?

I have no idea how it could work.

I do have a generational GC with two generations. What can it do with the information that a given function usually allocates no more than 2MB of heap (or any other amount)?

If the programmer thinks that it's useful to GC before the real work of the function, he can call the function MaybeGarbageCollect, which performs a GC if a significant amount of allocation has occurred since the last GC. This increases the chance that the function will run without GC. No need for an annotation. Of course a GC will still happen if the function allocates too much.
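In CMUCL terms such a helper could look roughly like this (a sketch; ext:get-bytes-consed and the 16MB threshold are assumptions made for the example):

(defvar *bytes-consed-at-last-gc* 0)

(defun maybe-garbage-collect (&optional (threshold (* 16 1024 1024)))
  "Collect now if a significant amount of allocation has happened since
the last explicit collection, so the work that follows is less likely
to be interrupted by a GC of its own."
  (when (> (- (ext:get-bytes-consed) *bytes-consed-at-last-gc*) threshold)
    (gc :full t)
    (setf *bytes-consed-at-last-gc* (ext:get-bytes-consed))))

;; e.g. call (maybe-garbage-collect), then do the latency-sensitive work.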

My implementation tends to GC when all threads are idle or waiting for I/O, so there should rarely be a need to call such function manually.

How would a compiler use annotations about space usage?

Space annotations

Perhaps something along the lines of Algorithm + Strategy = Parallelism for memory usage could be useful. One could define memory management strategies, combinators for these strategies and apply them to their algorithms. There's probably a PhD thesis hidden inside it ;)

In an ideal world

.... a garbage collector would be something that is provided by the operating system, not by a programming language.

Efficient GC is, of course, dependent on numerous things: the underlying hardware, the application (and its memory usage patterns), the language semantics, etc. The OS (which is a proxy for the hardware) is the one which is probably easiest to design to.

If one views the CLR or the JVM as an operating system that hosts code written in a HLL, we're sort of there.

In an ideal world...

.... a garbage collector would be something that is provided by the operating system, not by a programming language.

Unless you were implying that the GC is ideal as well, this would seem like the worst-case scenario, not the ideal one. Such an approach is unlikely to be optimal for all applications, and in fact is more likely to be suboptimal for all applications, especially as there is usually some cooperation required between the language and the GC.

I'm thoroughly in the exokernel camp in thinking that the ideal solution is in the opposite direction. The operating system should be providing as much access to the hardware as is securely possible and otherwise get out of the way.

Don't confuse the OS with the kernel

I thought this had been settled years ago. Kernels are, if anything, strict (and rather small) subsets of operating systems. Linux the OS is far more than Linux the kernel (which is why RMS thinks he is justified in calling it GNU/Linux); much of Windows (any version) runs in user space, etc.

The definition of OS I am using here is that bunch of software which abstracts away the underlying machine, which is far more than the kernel. While an argument can be made that kernels are too big and monolithic, I'm all for putting as much functionality as possible (assuming the requirements are general-purpose in nature and well-defined) in a common location (in user space, preferably) which is then available to all application programmers. Under this expansive definition, a language runtime or VM is part of an "OS", and you might have more than one "OS" (or operating environment) running on your machine. (In the mainframe world, one finds numerous different kernels running on the same machine.)

And perhaps GC policies ought to be customizable on a per-application basis. OTOH, the current practice, especially for many research languages, is to ship the GC with the language itself, oftentimes resulting in a GC policy which is poorly tuned to the hardware it is running on.

Of course, it ought to be pointed out that efficient GC requires many services which can only be performed at kernel level (at least on current systems): determining the presence/absence of a page in main memory, implementing read/write barriers, etc. One reason many language developers like targeting the JVM or CLR is that they no longer have to worry about this junk.

Of course, in an ideal world typing systems will get clever enough that compilers can manage the program store on behalf of the user without requiring external garbage collectors; but in the real world that sorta thing is probably undecidable.

Not Undecidable, But Not Space-Safe

Scott Johnson: Of course, in an ideal world typing systems will get clever enough that compilers can manage the program store on behalf of the user without requiring external garbage collectors; but in the real world that sorta thing is probably undecidable.

Region inference is decidable, but can result in space explosions that you don't want in the contexts in which region inference is most useful, such as embedded systems with rather dramatic, fixed space constraints. The ML Kit papers are still the best resource for learning about the issues, and on this particular topic, "Combining Region Inference and Garbage Collection" is particularly relevant.

Don't confuse User-Space with "not the kernel"

The definition of OS I am using here, is that bunch of software which abstracts away the underlying machine.

The definition of OS I'm using is the exokernel one, anything that can't be avoided or changed.* I.e. if I must (directly or indirectly) use the "OS" GC then it's a problem whether or not it is in the "kernel" proper. Of course, I have no compunction against optional "standard" libraries that provide abstractions for common things. However, I think we are more or less in agreement here.

Anyways, it was a good excuse to throw out the exokernel link as there are some really cool approaches to combining security and modularity.

* With this definition user-space/kernel-space is irrelevant, microkernel servers are in user-space but you can't avoid them, so by the above definition they are still part of the OS.

And perhaps GC policies ought to be customizeable on a per-application basis. OTOH, the current practice, especially for many research languages, is to ship the GC with the language itself--often times resulting in a GC policy which is poorly-tuned to the hardware which it is running on.

They really should, but current implementations and techniques fairly closely bind the language implementation with the runtime and specifically the GC. For most implementations, about the only reasonable way to swap out GCs is to have different branches of the compiler and runtime, and even that is a chunk of work. Or, alternatively, the application can manage (some of the) memory itself.

Of course, it ought to be pointed out that efficient GC requires many services which can only be performed at kernel level (at least on current systems): determining the presence/absence of a page in main memory, implementing read/write barriers, etc.

Hence me being in favor of exokernels.

Why it was forgotten?

Hence me being in favor of exokernels.

Is the philosophy of exokernels just a return to the ancient principle of separating mechanism and policy? Was this principle forgotten in the middle ages (around the '80s, I guess), and if so, why? Is its return made possible by a better historical context (hardware/software/market)?

Mechanism and Policy

The principle of separation of mechanism and policy seems to have suffered for several reasons:

1) The belief among many that integrated systems can be made more capable than separated systems, and that the boundary between systems, when well-demarcated, can be a source of tension and difficulty. In many cases, this is true. Fortunately, more advanced means of composing programs, and more advanced "substrates" to build programs on, have alleviated this concern a bit.

2) Many vendors have found that building integrated systems increases vendor lock-in, and reduces the chance that competing products can replace their own. (This principle, unfortunately, has guided much of the design of the most popular consumer operating system). The rise of open-source software, which encourages mixing and matching of components, is posing a major threat to the integrated model.

3) The X Window System. I'm not going to engage in a general round of X-bashing here (this post was composed on a computer running Gnome on X11, after all), but a guiding principle of X development was separation of mechanism and policy. X, for the first fifteen years or so of its existence, was a usability disaster, which many blamed on "mechanism, not policy". A major reason for X's usability flaws, of course, was that policy was delegated to application programmers (many of whom knew little about good UI design), resulting in an inconsistent hodgepodge. Most modern X deployments now do the right thing and assign the policy role to the desktop environment (Gnome or KDE, mainly) rather than application code. X still maintains the separation (X itself has little policy), but the policy is now handled at the right level of the SW stack.

Efficient GC depends on the needs of the language

It depends on the language probably more than on factors which are only known to the OS; for example, on object representation.

I'm not convinced that targeting the CLR or JVM from a language which is not similar to C# or Java gives better efficiency than targeting C and implementing a GC by hand. Maybe their GCs have shorter pauses or integrate better with native threads, but they require the use of a constraining VM, and an object system and representation of values which is suitable only for statically typed languages. What is more important?

In the Meantime...

Scott Johnson: Efficient GC is, of course, dependent on numerous things: the underlying hardware, the application (and its memory usage patterns), the language semantics, etc. The OS (which is a proxy for the hardware) is the one which is probably easiest to design to.

This is a good point. One of the reasons that I find C-- promising as a compiler target is precisely its recognition that it needs an open runtime API so that back-end developers can provide runtime services that are appropriate to their language. This architecture lies somewhere between the extremes of a one-size-fits-all virtual machine and a code-generator-generator that makes no provision at all for GC or other crucial runtime services.

What is the problem, really?

But if you put 128MB of data in one heap, there will still be a significant pause

Something no one else seems to have asked: why is this pause a problem?

Unless you have hard real-time constraints on your application (in which case you are probably using a non-GC or specialized GC solution anyway), what does it matter?

And if this still bugs you, why not start with the simplest solution: rethink the arrangement of your data structures so that 128MB isn't being dereferenced all at the same time.

(If you are using functional-style trees, for example, you probably have quite a bit of shared structure anyway, so you wouldn't have this problem.)

I think this is a whole lot better than requiring a programmer to think "below the hood" with compiler hints, etc. instead of at the level of the language specification.

Interactive programs

While GC pauses in interactive software are not as big a problem as GC pauses in a realtime program, any pause of more than a couple hundred milliseconds is going to disturb the user. Depending on the nature of the application, this may be no more than an annoyance, or it might make the application unsuitable for its intended purpose.

At any rate, having to "rethink the arrangement of your data" seems to me to be a bigger inconvenience to the programmer than having to insert compiler hints here and there... the programmer should be able to specify algorithms and data structures in the most natural form possible.

Caveat programmator

any pause more than a couple hundred milliseconds in length is going to disturb the user

People routinely tolerate much longer waits from applications, e.g. web apps. If this pausing is only occasional, they may not really notice. I think it really depends what kind of app and what kind of usage we are talking about.

having to "rethink the arrangement of your data" seems to me to be a bigger inconvenience to the programmer than having to insert compiler hints here and there... the programmer should be able to specify algorithms and data structures in the most natural form possible.

Hmmm. I have to completely disagree with this assumption.

Sure, when starting an app, use the most natural form that you can think of at the time.

But if it fails to have desirable memory or performance characteristics, it is the responsibility of a good programmer to do what is possible with basic algorithmic/data structure improvements to fix the problem.

Expecting the compiler to make good performance out of bad (or simply not good enough) programming is expecting way too much from a dumb machine, IMHO.

What about TCO?

I mean, tail-call optimization turns excessive recursion (bad memory performance) into good (almost no memory)?
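To make that concrete (a sketch; standard Common Lisp doesn't require tail-call elimination, but compilers like CMUCL's normally perform it):

;; Plain recursion: every call keeps a frame alive until the recursion
;; bottoms out, so space grows with N.
(defun sum-to (n)
  (if (zerop n)
      0
      (+ n (sum-to (1- n)))))

;; Tail-recursive version: with tail-call elimination the same
;; computation runs in constant stack space.
(defun sum-to-acc (n &optional (acc 0))
  (if (zerop n)
      acc
      (sum-to-acc (1- n) (+ n acc))))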

The sort of optimizations that I'm talking about are not the sort that involve wrong algorithms/data structures. If your game (or whatever) needs 128MB of RAM to store the textures, it's 128MB of RAM. Writing extra (maybe buggy) code to manage that memory to stop the GC from doing something stupid is a waste of effort, and does not require a "sufficiently smart compiler". In fact, annotating performance information about code assumes a quite dumb compiler, in my opinion.

My personal feeling that annotations are a good idea comes from this: when you start optimizing code, it frequently becomes bigger, buggier, and harder to maintain. If the same information could be expressed in some kind of annotation attached to the clean, readable, pure expression of the algorithm, it would be easier to read and maintain. If it didn't change the semantics of the code, it would have the added bonus that, if it goes out of date, it would only affect performance, not the correctness of the code. It's disentangling things.

Expecting a programmer to optimize correct code without altering correctness is expecting too much from a dumb programmer (at least when the dumb programmer is me). I dunno, yeah, computers are dumb, but I hear some of them play a mean game of chess on occasion anyway :)

Orthogonal yet important factor

If you create good abstractions (i.e., encapsulate your data as well-designed ADTs), changing representation shouldn't be too hard or make the code buggier. And by making data structures more efficient, I don't mean implementing GC manually. When you have a specific problematic data abstraction, you can almost always fix it directly, without having to use the general-purpose technique that is GC.

To each their proper domain

I mean, tail-call optimization turns excessive recursion (bad memory performance) into good (almost no memory)?

From my point of view, this is backwards. TCO preserves the definition of the algorithm from the clutches of a particular unfortunate implementation.

Very small specialized procedure stacks are a design choice of a platform, not of the language itself (in most cases).

The sort of optimizations that I'm talking about are not the sort that involve wrong algorithms/data structures.

If the system (compiler, OS, etc.) is bad, that is of course the system designer's fault. Complain or switch systems. ;-)

The part of the equation the programmer controls is algorithmic, so algorithmic solutions should be tried first. (A good profiler can help here.)

Obviously, if these solutions become complex beyond the programmer's understanding, they should stop. ;-)