Teaching Garbage-Collection

Teaching garbage collection by implementing GCs can imply heavy curricular dependencies. We've worked at shrinking them so the material can be used in any number of contexts, and this material is being used by several universities that use PLAI. We have a pedagogic paper about our approach, which we've summarized in a blog post (with a link to the full paper).

ppig?

have you posted this to the ppig crowd?

Why?

What does this have to do with PPIG (Psychology of Programming Interest Group)?

re: why?

because i believe teaching is an area that ppig members are interested in. especially if the teaching has pedagogy. :-)

One area of GC that has tripped me up

One of the biggest questions I've had about GC that I rarely see answered is: what defines "running low on memory"? That is, on a multi-GB system your program can run all day without getting low enough on virtual memory to kick off the garbage collector -- and you wouldn't want it to exhaust system memory anyway. But if you give it a small memory pool to play with, that pool won't be big enough for larger programs.

The only logical thing I can come up with is to start with a reasonably sized memory pool, say a few megs. Then, after the GC routine runs, if the free memory isn't more than 20% of the pool, increase the pool size until it is. However, I have yet to see that approach in any of the texts or online resources I've run across.

Generations

Generational systems make this reasonably easy.

e.g.

1. Start with one generation of some reasonable size.

2. When it is exhausted, consider the next older generation.

3. If there isn't one, then allocate one.

4. Copy the survivors of the exhausted generation to the next older generation.

5. If the next older generation is exhausted, allocate a new generation between the two and continue to copy into that.

6. Schedule any exhausted generations for GC.

Simple.
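
For concreteness, a minimal sketch of that promotion scheme. Generation, copy_survivors(), schedule_collection() and the size-doubling policy are all invented for illustration, and "exhausted" just means the generation's space has been filled:

#include <cstddef>
#include <vector>

struct Generation {
    std::size_t size;
    std::size_t used;
    bool exhausted() const { return used >= size; }
};

std::size_t copy_survivors(Generation& from, Generation& to); // returns bytes copied (assumed)
void schedule_collection(Generation& g);                      // reclaim it later (assumed)

void on_exhausted(std::vector<Generation>& gens, std::size_t i) {
    if (i + 1 == gens.size()) {
        gens.push_back(Generation{gens[i].size * 2, 0});      // no older generation: allocate one
    } else if (gens[i + 1].exhausted()) {
        // the next older generation is also full: insert a new one between the two
        gens.insert(gens.begin() + i + 1, Generation{gens[i].size * 2, 0});
    }
    gens[i + 1].used += copy_survivors(gens[i], gens[i + 1]);  // promote the survivors
    schedule_collection(gens[i]);                              // gens[i] can now be collected
}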

Your question is one of resource dimensioning and not resource management. You have the same issues regarding memory size regardless of whether your system uses GC or some manual memory management method. Every system has its limits. Every design includes dimensioning.

Perform the following steps:

1. Define the maximum amount of memory space you need for your application. Dimension the system accordingly. Your question belongs to this step.

2. Define "running low on memory" as "there's not enough room for my datum" and perform a collection at that point. If there now seems to be enough room, allocate; otherwise tell the designer that he/she didn't dimension the system correctly. Go back to step 1 with a better working knowledge of your system's resource needs. This is your feedback loop, and it is the answer to your question.

#include <stdio.h>
#include <stdlib.h>

/*
 * GC_free, GC_tospace, GC_space_size, GC_flip() and GC_dump() are
 * defined elsewhere in the collector.
 */
void * GC_alloc (unsigned int size)
{
  void * addr;

#ifdef GARBAGE_COLLECTION_TEST
  addr = malloc(size);  /* Used as a test to prove GC is faster than malloc from a huge heap on most platforms */
#else
  if(GC_free + size >= GC_tospace + GC_space_size)
    {
      /*
       * Not enough room to allocate size bytes!
       */
      GC_flip();  /* Do the collection */

      if(GC_free + size >= GC_tospace + GC_space_size)
        {
          /*
           * Still not enough room to allocate size bytes!
           */
          fprintf(stderr, "Out of memory! - tried to allocate %u bytes\n", size);
          GC_dump();
          exit(EXIT_FAILURE);
        }
    }

  addr = (void *)GC_free;
  GC_free += size;
#endif

  return addr;
}

So it's not just me

As someone writing their own GC for the first time, I was also stumped by this. I was planning to do something similar to what you suggest: if we're too close to some watermark after collecting, grow the pool.

The other area I've found challenging is dealing with temporaries. I still need to read the actual books on GC, but most of what I've found online says nothing about how to elegantly deal with references to GC objects on the stack.

Stack references

For stack references, it depends on the style of GC. For a ref-counting GC, it doesn't make any difference. For a tracing GC, they need to be regarded as roots, and if you're moving objects around, you'll need a discipline for updating them as well. Either way, the main question becomes how to identify stack references. There are a lot of approaches to this, depending on exactly what you mean by "stack." For example:

  • you might not have a real stack at all; you might heap-allocate activation records instead, especially if your language has full continuations or something similar. Then the problem reduces to a special case of heap tracing.
  • you might use a custom stack data structure divorced from the machine stack. Stack values might be tagged or not, allowing you to easily walk the stack and identify pointers. Lua does this, or did last time I looked at it.
  • you might use the "real" C stack. Then you'll need some way to find pointers. You could use stack maps or a conservative heuristic approach (if it looks like a pointer to my memory, it probably is). There are other options as well.

There's a pretty rich literature on this stuff, but I'm not terribly up on the state of the art.
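
For the conservative option on the real C/C++ stack, here's a rough sketch in the Boehm style. heap_lo/heap_hi and mark_from_possible_pointer() are invented names for the collector's heap bounds and marking entry point; callee-saved registers are spilled with setjmp, and stack growth direction is handled crudely:

#include <csetjmp>
#include <cstdint>

extern std::uintptr_t heap_lo, heap_hi;                       // GC heap bounds (assumed)
extern void mark_from_possible_pointer(std::uintptr_t w);     // marking entry point (assumed)

static void* stack_bottom;   // recorded near the top of main()

void record_stack_bottom() {
    int probe;
    stack_bottom = &probe;
}

void scan_stack_conservatively() {
    std::jmp_buf regs;
    setjmp(regs);                                   // flush callee-saved registers onto the stack
    int probe;
    std::uintptr_t* lo = reinterpret_cast<std::uintptr_t*>(&probe);
    std::uintptr_t* hi = reinterpret_cast<std::uintptr_t*>(stack_bottom);
    if (lo > hi) { std::uintptr_t* t = lo; lo = hi; hi = t; } // whichever way the stack grows
    for (std::uintptr_t* p = lo; p < hi; ++p) {
        std::uintptr_t w = *p;
        if (w >= heap_lo && w < heap_hi)            // "looks like a pointer into my heap"
            mark_from_possible_pointer(w);
    }
}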

Yeah, I should be clearer

Yeah, I should be clearer about what I mean by "stack". My language implementation is a bytecode-interpreting VM. It has a custom stack data structure for the language's stack variables. So a local variable in my language lives in this stack structure and is easy for me to see as a root.

At the same time, my language is implemented in C++ which has its own stack. There are often local variables in C++ that point to garbage-collected objects. By far, the most annoying "local variable" in this context is this. For example, let's say:

  1. In my language, you call a method.
  2. The VM in its main bytecode loop looks up the method.
  3. It ends up being a native method, one that is actually implemented in C++.
  4. So the VM calls that C++ method where this is bound to an instance of some C++ class that was allocated in the GC heap.

If a collection occurs at that point, Bad Things happen. For regular C++ local variables, I can wrap references to GC objects in some kind of Handle object that registers itself as a root with the collector and handles the indirection if the object is moved. I believe that's what V8 does.

It's harder for this. Maybe the best solution here is to not use C++ instance methods for objects that live in the GC heap. I was going for something that led to simple, readable C++, but maybe that's a lost cause.
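
A minimal sketch of such a Handle, assuming a collector that keeps a table of registered slots and rewrites each slot when the object it names moves. Obj and gc_roots() are invented names, and this is not V8's actual API:

#include <vector>

struct Obj;                        // some heap-allocated, movable object type

std::vector<Obj**>& gc_roots();    // the collector walks and updates these slots (assumed)

class Handle {
public:
    explicit Handle(Obj* p) : ptr_(p) { gc_roots().push_back(&ptr_); }
    ~Handle() { gc_roots().pop_back(); }   // assumes strict LIFO (stack) lifetime
    Obj* get() const { return ptr_; }
    Obj* operator->() const { return ptr_; }

private:
    Obj* ptr_;                     // rewritten in place by the collector on moves
    Handle(const Handle&);         // non-copyable: the collector holds our address
    Handle& operator=(const Handle&);
};

Anything reachable only through wrapped C++ locals then stays visible to the collector and gets fixed up when it moves; as noted above, nothing like this can be wrapped around an implicit this.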

Where did the address of "this" come from?

Where did the bytecode interpreter get the address of this? Is it not in some register or stack frame? If so, could you make the pointer volatile so any GC updates to the address are used on the next dereference?

For the locals you know about, the usual trick is to push them all on the/a VM stack (of which all items are considered roots) before a call to a built-in routine that can initiate a collection, then pop them all off again when the routine returns (where they potentially have new values). This is an idea I stole from the old Franz lisp interpreter (back when it was an open source dynamically scoped interpreter - not the commercial product). To make this work well you need to make each built-in do as little as possible and push as much as possible up into bytecode. The more I push up into bytecode space, the more correct my systems become and the fewer GC problems I have.
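
A sketch of that push/pop discipline. Value, vm_push(), vm_pop(), gc_alloc_pair() and pair_set() are invented names; the only assumption is that every slot on the VM stack is a root the collector updates in place:

struct Obj;
typedef Obj* Value;

void  vm_push(Value v);                        // VM stack interface (assumed)
Value vm_pop();
Value gc_alloc_pair();                         // may trigger a collection (assumed)
void  pair_set(Value cell, Value car, Value cdr);

Value builtin_cons(Value car, Value cdr) {
    vm_push(car);                              // protect the C++ locals across a possible GC
    vm_push(cdr);
    Value cell = gc_alloc_pair();
    cdr = vm_pop();                            // reload in reverse order; they may have moved
    car = vm_pop();
    pair_set(cell, car, cdr);
    return cell;
}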

I heard that due to memory

I heard that due to memory latency issues, reading them all back after too short a time is much, much more costly than just writing them down. So it's probably helpful to make sure you don't do that for very short computations, and to get as much static knowledge as you can about which values won't have been mutated and therefore don't need to be reloaded (just stored).

Not sure I understand you

Not sure I understand you. Are you talking about the stack pushes here?

If so, data gets pushed because there's a chance it can be moved by the garbage collector (I won't use the term "mutated" as you do, because the program is sometimes called the mutator in GC literature). There is no static way to tell if a datum cannot become garbage in the languages I implement; in a copying collector (my SCHEME) all data will be copied, and in a compacting collector (my PROLOG *) there is a very good chance it will be copied. There's no way round it for me, and frankly I don't mind paying the cost - automatic memory management isn't that expensive.

* The PROLOG system doesn't need to push C locals onto a stack so maybe this isn't really relevant.

Where did the bytecode

Where did the bytecode interpreter get the address of this? Is it not in some register or stack frame?

Yes, usually it is. My problem here is less about finding the pointer than about handling the fact that the object being pointed to moves during a collection. I'm using a copying collector, so every object will move during a collection and every pointer to it needs to be updated.

If so could you make the pointer volatile so any GC updates to the address are used on the next dereference?

I don't think that will help, but I'm not sure.

stack corruption

I'm using a copying collector, so every object will move during a collection and every pointer to it needs to be updated.

Ah, that's your problem. I was going to ask if you copy-collect. Moving a C++ object should confuse the runtime, because your C++ method callers are likely to use the old pointer to a C++ object you moved as soon as you return to them. Even if you could declare this pointers volatile (I know of no way to do that), there's no way a C++ compiler knows where to get the new this pointer you want it to reload. If you're using the native thread's stack, you might be stuck.

Using your own user-space defined green threads, you could arrange that no C++ continuations are on the stack, so there's no code to which you can return after GC that has a copy of the old pointer. That would only work if you unwound the stack in order to GC, in a stackless style typically driven by a trampoline. This works with copy-collection in C, when there are no C callers to which you return that have a copy of the old pointer to an object that moved.

However, you won't be able to rewrite C++ into a stackless style with green threads, because it would subvert destructors running at end of block scope, and RAII style would not work, wrecking a commonly used best practice mechanism. You'd have to write your own C++ compiler to be safe. Or else you could constrain C++ use to leaves of the call tree only, so there were no C++ callers to which you can return after GC.

Using your own user-space

Using your own user-space defined green threads, you could arrange that no C++ continuations are on the stack, so there's no code to which you can return after GC that has a copy of the old pointer. That would only work if you unwound the stack in order to GC, in a stackless style typically driven by a trampoline.

This is actually exactly what I'm doing now. My Fiber class has the main bytecode interpret loop. In the case of native methods, that loop will end up calling into C++ methods in objects that are on the GC heap. Allocation cannot force an immediate GC.

Instead, when we return to the main bytecode loop we check whether we are getting close to a full heap; if so, we return from the fiber's interpret method (since fibers are also on the GC heap). The trampoline code there checks whether we need a GC, runs the collection, and then resumes the fiber (at its new location in memory).

It actually works pretty well given its idiot simplicity. I just worry that not being able to force an immediate GC could be a problem. If interpreting one bytecode op requires a bunch of allocation, you could run out of heap space.
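
That loop might look roughly like this sketch (Fiber, gc_pressure_high(), gc_collect() and gc_current_fiber() are invented names; it assumes the VM tracks the current fiber as a root so its new address can be re-fetched):

struct Fiber {
    // Runs bytecode; returns early when the heap is getting full,
    // and returns false when the fiber has finished.
    bool interpret();
};

bool   gc_pressure_high();    // assumed collector interface
void   gc_collect();
Fiber* gc_current_fiber();    // fibers live on the GC heap and may move

void run_fiber(Fiber* fiber) {
    for (;;) {
        if (!fiber->interpret())        // no collection can happen inside this call
            return;
        if (gc_pressure_high()) {
            gc_collect();               // interpret() has returned: no C++ frames below hold heap pointers
            fiber = gc_current_fiber(); // except this one, so re-fetch: the fiber may have moved
        }
    }
}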

However, you won't be able to rewrite C++ into a stackless style with green threads, because it would subvert destructors running at end of block scope, and RAII style would not work, wrecking a commonly used best practice mechanism.

Given that the objects in question are allocated on a GC heap, they don't have any destructors anyway so that's no loss. (I will be adding support soon for finalizable garbage-collected objects, which is what got me thinking about this again.)

your stuff sounds interesting

Sounds like you ruled out the cause I had in mind, if no C++ callers resume their continuations after gc. Opportunities to save an old pointer and re-use it after gc are fewer, and might require either dodgy library behavior or deeply hidden quirks in the runtime.

If interpreting one bytecode op requires a bunch of allocation, you could run out of heap space.

You should be able to predict the max allocation during a bytecode op, so you know gc won't be necessary while it runs. If it's dynamic in the sense it depends on explicit or implicit args to the bytecode (however passed), you could preflight the alloc looking at the values, then pause the op and restart it after gc if there's not enough free space.
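
One way to read the preflight idea as code (an invented sketch, not the VM under discussion): check the worst-case allocation of a size-dependent op against the collector's headroom, and if it doesn't fit, leave the instruction pointer alone and yield so the trampoline can collect first.

#include <cstddef>
#include <cstdint>

std::size_t gc_bytes_free();                                 // assumed collector interface
std::size_t append_worst_case_bytes(std::uintptr_t list);    // depends on the operand (assumed)
void        do_append(std::uintptr_t list, std::uintptr_t value);

enum StepResult { STEP_DONE, STEP_YIELD_FOR_GC };

StepResult step_append(std::size_t& ip, std::uintptr_t list, std::uintptr_t value) {
    if (gc_bytes_free() < append_worst_case_bytes(list))
        return STEP_YIELD_FOR_GC;   // ip untouched: the op restarts after the collection
    do_append(list, value);
    ++ip;                           // only advance once the op has actually run
    return STEP_DONE;
}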

A hypothetical long-running bytecode op that does something like copy structures of unbounded size can be expressed iteratively in the trampoline, so gc occurs with an unwound stack before you resume copying where you left off.

I'm used to thinking about giving a trampoline control periodically to bound time latency in max sized chunks, so cooperative context switch can occur at end of time slice. But the same idea also applies to bounding space consumed, and generalizes to limiting other resources besides space and time consumed by one leaf dispatch from the trampoline before yielding.

If your setup cannot express one bytecode op execution as multiple trampoline dispatches, you could imitate expansion of one bytecode op into "microcode" ops of smaller granularity, looking something like the high level bytecode op is a subroutine call that traps into a block of microcode ops in pseudo "ROM" bytecode.

(Just had a flashback to twenty years ago. A tool under MPW running as a coroutine had to periodically call a spin-cursor routine, to animate the cursor, because this was also one of the ways a yield happened in CPU-bound code, so the MPW scheduler could context switch, compact the heap, and/or send signals to the tool as needed. Any tool doing i/o was already implicitly yielding because MPW's shell patched out file i/o entry points so they were mediated.)

...they don't have any destructors anyway so that's no loss.

Okay, but what about code in any libraries you use? I'm hoping the answer to this is, "I'm not using any libraries, not even the standard C++ library." (That's what I do in C++ code embedded in C apps that cannot have any dependencies.) Even if destructors are not a problem, a library you use might do something rude like maintain caches (of some kind) in globals.

If your C++ code uses anyone else's, it's hard to rule out things a library might do under the covers, including things that might make your blood boil if you only saw the source code. :-) I didn't notice whether you use native threads in your process. If a library is billed as non-thread-safe, it's more likely to use globals "because everyone does," and because that's probably the reason it's not thread-safe. If it is billed as thread-safe, it might mean they use a mutex to protect globals, and they still might sink you with cached pointers to old C++ instances. Ideally a library would instead say, "Stores no global state whatsoever." (Unless the C++ runtime does this silently.)

If we look to exhaust reasons why something surprising might happen in your address space, you might need to consider whether the C or C++ runtime spawns another native thread you did not ask for explicitly, under the theory your code would not be able to observe this if you followed expected rules in normal C++ apps.

Thank you!

You should be able to predict the max allocation during a bytecode op, so you know gc won't be necessary while it runs.

For the most part that's easy. The tricky one is the opcode for calling a native (i.e. implemented in C++) method. Those methods are part of the VM so under my control, but can do arbitrary work. (Think list.append() which can require growing the list.)

That's also the exact opcode that gets me into the territory of "this" being a problem. Native methods are generally implemented as C++ instance methods on objects that are on the GC heap.

Now that I think about it, maybe the right solution here is to bounce to the trampoline whenever a native method is called. That will get this off the stack and give me a well-defined window where allocation can occur. Thank you for getting me thinking here!

I'm used to thinking about giving a trampoline control periodically to bound time latency in max sized chunks, so cooperative context switch can occur at end of time slice. But the same idea also applies to bounding space consumed, and generalizes to limiting other resources besides space and time consumed by one leaf dispatch from the trampoline before yielding.

That's an insightful way to look at it.

Okay, but what about code in any libraries you use? I'm hoping the answer to this is, "I'm not using any libraries, not even the standard C++ library."

That's correct. Maybe it's because I learned C++ before the STL was really well established, but I'm used to rolling my own stuff. For my VM I find that a better fit anyway. People always say it's hard to beat the standard library, but I find you can do that easily when you don't need to be as general as it is.

For example, my growable array class can simply use the GC heap directly without trying to generalize over all possible allocators.

So, yeah, no dependencies except libuv. For that, I will be adding finalization support for the specific objects that point to memory used by libuv (files, processes, etc.). That memory won't live on the GC heap either since libuv would not be happy if I moved objects around under it.

I thought arguments to C++

I thought arguments to C++ calls were normally pinned for the duration of the call (e.g. in JNI). A more permanent pin can also be created and managed manually (or via ref counting), but the C++ world was generally immune to GC.

Also, as an aside, portable ref counting across managed and native lands was one of the nicer new features to come out of WinRT.

most welcome

You're closing on a familiar line of plans. :-) I was unable to think of a suitable way to bounce to the trampoline often enough without incurring one unfortunate effect or another: hard-to-read manual CPS, or an automated transpiler, which is a lot of work. Not sure I could automate C++ rewrite myself, but I can manage C okay.

Now that I think about it, maybe the right solution here is to bounce to the trampoline whenever a native method is called. That will get this off the stack and give me a well-defined window where allocation can occur.

A windowing effect for incremental progress of controllable granularity is what I appreciate most. I'll need to one day deploy in environments where a watchdog will kill my host process, if my async engine hogs the CPU too long for time heartbeat increments visible to a process supervisor.

People always say it's hard to beat the standard library, but I find you can do that easily when you don't need to be as general as it does.

Need for generality is a substantial handicap in STL, so it's relatively easy to beat on some dimension like performance if you can bear testing your own code aggressively. I ask folks, "Did you want this to go as fast as possible?" The answer is always yes. Then I observe, "I see you only used these four methods; want to bet $100 I can't write a faster work-alike, for this one usage?" I never get any takers though, since outcome of faster hand-rolled code is expected, especially if you leverage inside-knowledge of memory location relationships.

(For example, if you write a cache with LRU replacement and hashmap lookup, it's essential to use intrinsic links, so whenever you find something in one collection, you already know all you need about updating another collection, because you already have the address without any lookup. STL is sorta extrinsic link happy. Several years ago I found code whose algorithm had O(N^4) expected time for N=100K because someone didn't realize calling size() on a list actually iterated over a list to count members.)
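
For concreteness, a rough sketch of the intrinsic-link idea (names invented; insertion and eviction elided): the LRU links live inside the entry itself, so a hash hit already hands you the node to bump to the front with no second lookup.

#include <string>
#include <unordered_map>

struct Entry {
    std::string key;
    std::string value;
    Entry* prev = nullptr;   // LRU links embedded in the entry itself
    Entry* next = nullptr;
};

class LruCache {
public:
    Entry* get(const std::string& key) {
        auto it = index_.find(key);          // the only lookup on the hit path
        if (it == index_.end()) return nullptr;
        move_to_front(it->second);           // pure pointer surgery, no second lookup
        return it->second;
    }

private:
    void move_to_front(Entry* e) {
        if (e == head_) return;
        e->prev->next = e->next;             // unlink (e has a prev since it isn't head_)
        if (e->next) e->next->prev = e->prev;
        else         tail_ = e->prev;
        e->prev = nullptr;                   // relink at the head
        e->next = head_;
        head_->prev = e;
        head_ = e;
    }

    Entry* head_ = nullptr;
    Entry* tail_ = nullptr;                  // eviction would unlink tail_ and erase its key
    std::unordered_map<std::string, Entry*> index_;
};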

There's an aspect of GC that practically begs for a treatment using either green or native threads, when GC is a task with relatively global scope, and yet might be triggered by highly specific leaf code in a call tree that runs out of memory. A transition from local to global context is awkward. Better to suspend the local task until the global one is done, then resume. In a language like C++ with a monolithic stack, that can be hard to do if there's any subtle dependency involved. (This isn't related to your post; I just felt obligated to make an on-topic remark about GC.)

Handlification

Yes, handlification on the C++ side is the usual technique to use for that. And indeed, 'this' is a major pain point there. If you ask me, it demonstrates that the whole C++ marketing talk about the language enabling you to build your own pointer abstractions and thereby abstract away all memory management is just bullshit -- you can't smart-pointerise 'this', which is a complete show-stopper in some situations.

Not using methods at all is one way to avoid the problem. FWIW, for V8, we have been considering that lately, having struggled with handlification bugs a lot.

But I'm thinking that there might be a dirty trick to avoid that: lowering the handle indirection into the pointers to heap objects themselves. If you use a scheme like V8 does, then pointers to heap-allocated objects are not proper C++ pointers anyway. They are tagged, i.e., always off by one. All actual field accesses go through macros that adjust the offset (and we don't use virtual dispatch). Now, why stop there? If all these pointers are not properly-typed C++ object pointers anyway, they could as well be anything -- including handles. The accessor macros for heap classes could take care of doing both the offset adjustment and the handle indirection (and handle creation when you read out another pointer). You could then get rid of explicit Handle types altogether. Of course, the GC itself would go through a separate lower-level raw pointer interface, but that's fine, since it's operating at a different level of abstraction.
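
A rough sketch of the idea, not V8's actual code; Ref, g_handle_table and the macro names are all invented. A "Ref" is not a C++ pointer but an index into a table the GC rewrites on moves, and the accessor macro folds in both the table indirection and the tag adjustment:

#include <cstdint>

typedef std::uint32_t Ref;                 // "pointer" as seen by runtime code

extern std::uintptr_t g_handle_table[];    // slot i holds the tagged address of object i (assumed)
const std::uintptr_t  kHeapObjectTag = 1;  // heap addresses are kept off by one

#define FIELD_ADDR(ref, offset) \
    (g_handle_table[(ref)] - kHeapObjectTag + (offset))

#define READ_FIELD(type, ref, offset) \
    (*reinterpret_cast<type*>(FIELD_ADDR((ref), (offset))))

#define WRITE_FIELD(type, ref, offset, value) \
    (*reinterpret_cast<type*>(FIELD_ADDR((ref), (offset))) = (value))

// e.g. reading a hypothetical 32-bit length field at byte offset 8:
//   std::int32_t n = READ_FIELD(std::int32_t, some_ref, 8);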

you can't smart-pointerise 'this'

Well, of course, you can, but in my view this layering of warts just underscores the funkiness of the whole approach.

And I agree, if you're already tagging pointers anyway, you might as well run with it...

I hadn't actually seen that

I hadn't actually seen that it's in C++11, so thanks for the link. But yeah, oh dear. Obviously, it's not really smart-pointerising 'this', but simply storing a redundant smart self-reference in every instance. What a waste...
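
A minimal illustration of the facility being discussed (Widget is just an illustrative name): enable_shared_from_this stores a weak self-reference in every instance so a member function can hand out a shared_ptr to *this.

#include <memory>

struct Widget : std::enable_shared_from_this<Widget> {
    std::shared_ptr<Widget> self() { return shared_from_this(); }
};

int main() {
    auto w = std::make_shared<Widget>();          // must already be owned by a shared_ptr,
    std::shared_ptr<Widget> also_w = w->self();   // or shared_from_this is not allowed
}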

Yep

I don't always use enable_shared_from_this, but when I do, I willingly admit that I've painted myself into a corner.

I recommend the blue book on

I recommend the blue book on garbage collection aptly named...."Garbage Collection". This is the only physical book I still have from the 90s. It is dated '96, but it is completely adequate to get started.

Also keep in mind that reference counting is back in vogue these days, given that GC has drawbacks on mobile. Of course, reference counting is not complete with respect to cyclic references.
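
A minimal illustration of that incompleteness with std::shared_ptr:

#include <memory>

struct Node {
    std::shared_ptr<Node> other;
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->other = b;
    b->other = a;
    // a and b go out of scope here, but each Node still references the other,
    // so the cycle is never reclaimed. (A weak reference on one side, or a
    // dedicated cycle collector, is needed to break it.)
}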

Bacon's and Petrank's

Bacon's and Petrank's collector can collect cycles with local tracing.

As for GC, I've recently become a big fan of the CHICKEN concurrent real-time GC. Incredibly simple, concurrent, lock-free, wait-free write barrier, read barrier cost is a pointer indirection at worst, soft real-time (hard real-time if you verify the implementation), and supports compaction. What's not to like?

That's been on my wishlist

That's been on my wishlist for a while. Is there a reason to prefer it over the newer Garbage Collection Handbook?

Same author, so I guess the

Same author, so I guess the latter book would be a good update.

The 90s was a great decade for PL implementation books.

I second your recommendation.

This is the only physical book I still have from the 90s.

Please, tell me you've got a copy of TAOMOP somewhere.

The 90s was a great decade for PL implementation books, especially interpreters.

Searching TAOMOP doesn't get

Searching TAOMOP doesn't get me anywhere. The art of ... ... programming?

meta object protocol

The Art of the Metaobject Protocol (wikipedia) has surprisingly good writing in it, so even folks who aren't Lisp fiends can get value out of it.

As far as I know that is how

As far as I know, that is how GCs do it. After a GC you expand the heap, if necessary, so that a certain fixed percentage of it is free. That way you ensure that each allocation is amortized O(1); the percentage you keep free determines the constant. For example, if you choose to have >=50% free after GC, then on average you trace about one live object for every allocation you do. If you expand the heap so that 75% is free, the cost per allocation drops to about a third of that, because on average you trace about a third of an object per allocation. If you choose to have only 1% free, then on average you will have to trace through about 100 objects for every allocation you do, etc.
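
A small sketch of that resize policy (names invented): after a collection, grow the heap until at least target_free of it is free, so tracing work stays proportional to the amount allocated since the last collection.

#include <algorithm>
#include <cstddef>

std::size_t heap_size_after_gc(std::size_t live_bytes,
                               std::size_t current_size,
                               double target_free) {
    // live bytes may occupy at most (1 - target_free) of the heap
    std::size_t needed = static_cast<std::size_t>(live_bytes / (1.0 - target_free)) + 1;
    return std::max(current_size, needed);
}

// e.g. with target_free = 0.5 and 60 MB live, the heap grows to ~120 MB, so
// roughly one live object is traced per object allocated; with 0.75 it grows
// to ~240 MB and the per-allocation tracing cost drops to about a third.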

This reasoning only holds

This reasoning only holds for a non-generational mark-and-sweep GC where all objects are the same size. Also, the cache effects of GC cause this relation to go non-linear near one end of the spectrum.

Yes. For a generational GC

Yes. For a generational GC it holds per generation though.

Not really.

Most generational GCs use a copying collector for the young gen... but it doesn't matter, you've still got the problem of how to size each generation, not to mention promotion policy.

It does hold. If you use a

It does hold. If you use a fixed size young generation, then what I described above holds for the old generation. Instead of considering all allocations, you simply only consider the allocations that survive the young generation, so it's still amortized O(1) with that strategy. (Similar reasoning holds for a re-sizable young generation, though that's a little useless because then it reverts to being non generational.)

Just because an object

Just because an object survives a young generation collection does not mean it will get promoted to the old gen. You can, for example, only promote a fraction of the survivors to the old gen. As I mentioned, a large part of the performance then depends on your promotion policy, and of course, survival rate of new allocations.

Sure, performance depends on

Sure, performance depends on it, but the fact remains that with that resize policy it's amortized O(1) per allocation.

O(1) = O(1000). Why are

O(1) = O(1000). Why are we discussing GC with O(.) at all?

Why the notion of constant

Why the notion of constant time operations is useful is a discussion for another day ;-)

gc related suggestions

derekp: What defines "running low on memory"?

In part it depends on what action you take as correction (like GC) and how long that takes, and whether you want it done before memory is exhausted, if concurrency is involved. One sub-system where I work is a store used as a cache, with a maximum disk capacity, which must be configured to start pruning at a point where disk won't be exhausted before space is reclaimed. Picking numbers for max capacity can be ad hoc and disappointingly tricky, partly driven by worst case analysis preserving what seems like a lot of head room. I'd expect any answer you pick to have an unsatisfying aspect of guesswork to it.

I liked Barry Watson's reply: "Dimension the system accordingly." Footprint in RAM and disk depends on what code does in a particular problem scenario. Even if you want a one-size-fits-all answer, you should let your users configure expected sizes when they know what they are, for a given app to be run using your tools.

Is more memory to be had? That's one big question. If more can be added, who decides? If a host environment can add more memory under your control, will you be able to use it effectively? Must added new memory satisfy size or alignment constraints? Does the API require you to be able to give it back? Do you even want to give it back if it's possible? If your process stays up indefinitely, will it starve other components if you hog memory?

If your code is embedded in another system, you might have to live inside limits imposed on you, without any prospect of getting more memory. (This is the case familiar to me.) Your gc model should define what happens when no additional memory can be added under the control of your gc space. An app might try to allocate all space you have, without making any of it unreachable so it can be collected; so you'll have to kill something, or notify apps you host so they work it out themselves.

The MacOS Finder in the 80's and early 90's actually let end users open an info view on application executables, then edit values for the minimum and maximum heap size used the next time that app launched from the Finder. (I tried to google this for a link, but gave up after it seemed too ancient a piece of history for current trends in content indexing.) This is interesting because Apple UI folks were very into shielding users from internal detail, and yet this feature was considered necessary so users could manage RAM in their Macs. While this is less relevant now for several reasons, the same UI principle may apply to your gc policy control expressed for user consumption.

If your language (or process model) can express memory budgets for sub-tasks, with actions firing when limits are reached, maybe code can explicitly influence what is meant by low memory, and what response is desired as a handler under low memory conditions.

Some old school app frameworks used to have low-memory notification support, to encourage app level space collection as well as context driven persistence of app data. You might want user code visible hooks so any gc mechanism you pick can influence user space app code.

Low on memory can be

Low on memory can be determined a number of ways. For example, you could have per-thread or per-CPU bump-pointer nurseries, which should fit into CPU cache with room to spare, and so those 'run low on memory' quite easily (but they can also be collected quickly by the CPU in question).

(One can profitably extend the idea that some subset of pages is 'owned' by a CPU to pages in higher generations. One might also entertain pairwise ownership.)
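
A small sketch of such a per-thread nursery (names and policy invented); note how trivial "running low" becomes at this level: the bump pointer reaches the end of the thread's own small buffer, which triggers a cheap minor collection.

#include <cstddef>
#include <cstdint>

struct Nursery {
    std::uint8_t* top;    // bump pointer
    std::uint8_t* end;
};

extern thread_local Nursery tl_nursery;   // one small buffer per thread (assumed)
void minor_collect(Nursery* n);           // evacuates survivors, resets n->top (assumed)

inline void* nursery_alloc(std::size_t bytes) {
    Nursery* n = &tl_nursery;
    bytes = (bytes + 7) & ~static_cast<std::size_t>(7);   // 8-byte alignment
    if (n->top + bytes > n->end) {                        // nursery is "low on memory"
        minor_collect(n);
        if (n->top + bytes > n->end)
            return nullptr;                               // too big: use an older space instead
    }
    void* p = n->top;
    n->top += bytes;
    return p;
}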

One might also study bookmarking collectors, which perform GCs to help avoid paging.

Or continuous concurrent collectors, which aren't really triggered at all (though they might take a break if there isn't enough work).

Anyhow, GC is ultimately a big set of heuristics, and 'low on memory' will be tuned to meet your needs. One can benchmark them against a set of known problems.

I do this in terms of work factors.

Rather than setting the size of a memory arena, I prefer using the heap in the same way as other programs and regulating how much effort is spent on garbage collection instead.

With most (simple) GCs there is a notion of how much work (on average) it takes to show whether a pointer is live. For example, in your generational collector, there is the amount of work involved in traversing one live object's pointers. If you do that work more often than you allocate objects, you will occasionally "catch up" with the number of objects allocated, even if all pointers stay live.

Therefore, as long as you do more than that amount of GC work per allocation, in the long run you're guaranteed to keep up with the generation of garbage. So for high-performance and big memory footprint, you might aim as low as 110% of that amount of work, while for typical cases you might want to do 200% or 300% of that amount of work. Or you might use an adaptive strategy and aim for 50% collections on the average, speeding up a little whenever you have more garbage than live objects at the end of a collection, and slowing down a little (but never to less than one traversal per allocation) whenever you have more live objects than garbage at the end of a collection. Or you might monitor the system's use of memory and bump up the GC effort if the virtual memory commitment ever grows larger than the physical installed memory.
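
A small sketch of that pacing scheme (all names invented): on every allocation the mutator pays for some amount of tracing, where one "unit" is roughly scanning one object's pointers, so collection keeps pace with allocation.

#include <cstddef>

bool  collection_in_progress();    // assumed incremental-collector interface
void  start_collection();
void  trace_one_object();          // one unit of incremental GC work
void* raw_allocate(std::size_t bytes);

static double work_factor = 2.0;   // 200% of the break-even rate of one unit per allocation
static double work_debt   = 0.0;

void* gc_alloc(std::size_t bytes) {
    work_debt += work_factor;
    while (work_debt >= 1.0) {
        if (!collection_in_progress())
            start_collection();
        trace_one_object();
        work_debt -= 1.0;
    }
    return raw_allocate(bytes);
}

An adaptive version would nudge work_factor up or down based on how much garbage each completed collection finds, as described above, but never let it drop below one unit per allocation.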