GPU for GC

"GPUs as an Opportunity for Offloading Garbage Collection" sounds kinda cool to me.

We investigate the challenges for offloading garbage collection to a GPU, by examining the performance trade-offs for the mark phase of a mark & sweep garbage collector. We present a theoretical analysis and an algorithm that demonstrates the feasibility of this approach. We also discuss a number of algorithmic design trade-offs required to leverage the strengths and capabilities of the GPU hardware. Our algorithm has been integrated into the Jikes RVM and we present promising performance results.

Mainly I've been trying to learn from anybody who might know: can moving a GC (especially in things like OCaml or whatever) off to another core have good results, like making pauses less obvious in interactive apps?


yes!

See Kathryn McKinley's work, for example. I think there are a few papers here, like "Virtual Machine Services: An Opportunity for Hardware Customization".

We're seeing a similar situation in distributed/webscale systems, where we want to enable as much logging as possible with minimal overhead.

For the GPU paper in particular, the stumbling block is heavy memory transfer costs. Interestingly, with newer "APU"-ish architectures (ARM Mali, AMD APUs, maybe Nvidia Tegras) where the SIMT cores sit next to the CPU ones, this goes away. That lowers costs enough that we'll be seeing lots of new types of computations: more CPU scheduling of the GPU, more GPU offloading from the CPU, etc.

It's Slower

Unsurprisingly, apart from the highly parallel cases (hundreds or thousands of parallel linked lists), it is slower than the CPU. Some of this is due to the cost of copying the data to the GPU and back, and some is due to the unsuitability of the GPU for this task.

As you have to block access from the worker thread during the mark/sweep, offloading to a processor of equal speed (let alone slower) makes no sense at all. You need something faster to make it worthwhile. The other issue with the GPU is that this does not make efficient use of its resources (large parts of the chip are not utilised).

Hardware architectures that use indirection, rewriting the read and write addresses of memory regions being copied via memory transaction buffers, would seem to offer very low-overhead GC, although stalls are still possible if the hardware cannot copy/compact objects as fast as the processor is allocating new ones.

GCPU?

Has there ever been a specialized processor for garbage collection (maybe part of an MMU)? The only one I remember is built into Intel's iAPX 432.

Lisp machine hardware? It

Lisp machine hardware? It never really helps, since the problem with GC is not really the cycles, but the lack of (or difficulty of) concurrency: being able to do anything else while it's happening.

GC also promotes a lack of regularity in the heap (everything is fragmented and non-uniform), which is exactly the kind of problem that GPUs (and SIMD in general) are not good at. If you want perf, you make your structures regular, and then GC isn't really needed.

Non-blocking copying possible in hardware

What you can do is utilise the cache. The CPU can still read and write to the cache while garbage collection is going on. At the beginning of a memory copy (which would be cache line aligned), the affected cache lines could be locked in the cache. During the copy you can still allow read access to fill the cache from memory, but any writes would be held in the cache until the copy completed. The copy could happen via DMA without going through the cache. After the copy the origin address of the cache line is changed and the lock removed. With modern set-associative 4-12MB caches, you would struggle to create a cache collision and stall the writes. The biggest chunk you would ever have to copy would be the page size (normally 4k) as anything bigger you do by copying the first and last chunk, and remapping the whole pages (changing the virtual address whilst leaving the physical the same).

Another thought is that it might be possible to do something in software using software transactional memory. Basically the garbage collector could run in a separate thread, sitting above the STM layer like the other threads, with the mark/check and copy/sweep phases issued as a series of transactions. See:

https://www.cs.purdue.edu/homes/jv/courses/590v/ppop.pdf
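
Very roughly, and with the caveat that C++ has no standard STM, here is a minimal sketch of a marker running in its own thread; atomic mark bits stand in for the per-object transactions, and Object/Heap are made-up types rather than any real API:

    // Minimal sketch of a marker running in its own thread. Atomic mark bits
    // stand in for the per-object transactions described above.
    #include <atomic>
    #include <thread>
    #include <vector>

    struct Object {
        std::atomic<bool> marked{false};
        std::vector<Object*> refs;       // outgoing references
    };

    struct Heap {
        std::vector<Object*> roots;
    };

    void mark_from_roots(Heap& heap) {
        std::vector<Object*> worklist(heap.roots.begin(), heap.roots.end());
        while (!worklist.empty()) {
            Object* obj = worklist.back();
            worklist.pop_back();
            bool expected = false;
            // "Transaction": atomically claim the mark bit; only the winner
            // scans the object's children.
            if (obj && obj->marked.compare_exchange_strong(expected, true)) {
                for (Object* child : obj->refs)
                    worklist.push_back(child);
            }
        }
    }

    std::thread start_concurrent_mark(Heap& heap) {
        // Mutators keep running while this marks; a real collector also needs
        // a write barrier so stores made during marking are not missed, and
        // the caller must join this thread before sweeping.
        return std::thread([&heap] { mark_from_roots(heap); });
    }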

Of course I agree that using regular structures removes the need for GC, and therefore removing GC from languages will improve programs :-) Modern C++ using RAII and STL containers gets rid of most of the old problems (move semantics allow you to just return big objects efficiently). Ada's hiding of the details is even better.
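
For instance, roughly what I mean (Mesh, Vertex, and load_mesh are just made-up names for illustration):

    // RAII containers manage the memory, and move semantics let a big object
    // be returned without a deep copy.
    #include <string>
    #include <vector>

    struct Vertex { float x, y, z; };

    struct Mesh {
        std::vector<Vertex> vertices;    // freed automatically when Mesh dies
        std::vector<int>    indices;
    };

    Mesh load_mesh(const std::string& path) {
        Mesh m;
        m.vertices.resize(100000);       // pretend we parsed a large file
        m.indices.resize(300000);
        return m;                        // moved (or elided), not deep-copied
    }

    int main() {
        Mesh mesh = load_mesh("level1.obj");   // no new/delete, no GC
        // ... use mesh ...
    }                                          // everything released here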

If you can organize your

If you can organize your memory for efficient GPU access, you probably don't need GC anyways. There are always cases where this isn't possible, for sure.

Structure

I think you misunderstand the statement. I meant that by making it easier to write programs with correct memory management, GC leads people to think about it less, and therefore to write less well-structured programs.

None of the 'folk myths' in the linked document address the quality of the code written. It's not about whether GC is faster; it's that people write slower code (so proving the best case is no slower is irrelevant).

Most cases of memory use do not involve structure sharing, and can be handled by things like passing a vector of objects in by reference. You allocate the vector (with the handle on the stack), do the work, and all the objects it contains get freed when it goes out of scope. Yes, there are network algorithms dealing with cyclic directed graphs that need GC, but these are relatively rare in my experience (and could be dealt with by indexing rather than pointers/references).
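
A small sketch of the indexing idea, with a made-up Node type: even a cyclic graph can live in one vector, with edges stored as indices, so everything is freed when the vector goes out of scope.

    // Edges are indices into the owning vector, so reference cycles are fine
    // while ownership stays a simple stack-scoped container. No GC needed.
    #include <cstddef>
    #include <vector>

    struct Node {
        std::vector<std::size_t> edges;   // indices into the owning vector
        int payload = 0;
    };

    void process_graph() {
        std::vector<Node> graph(3);       // handle lives on the stack
        graph[0].edges = {1};
        graph[1].edges = {2};
        graph[2].edges = {0};             // a reference cycle, but ownership
                                          // stays a simple tree
        // ... do the work ...
    }                                     // all nodes freed here, cycle and all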

I think the real point is

I think the real point is that if you care about perf and memory is a bottleneck, you are going to do some kind of structuring whether you can rely on GC or not. When memory is not a problem, GC is also not a problem.

I like GC and would rather not avoid it, but when doing a physics simulation with a thousand particles, having that array of non-heap points is useful.
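
Something like this is the kind of flat, non-heap storage meant here (Particle and step are just illustrative names):

    // One contiguous array of plain structs: cache- and SIMD-friendly, with
    // nothing for a collector to trace.
    #include <array>

    struct Particle {
        float x, y, z;
        float vx, vy, vz;
    };

    void step(std::array<Particle, 1000>& particles, float dt) {
        for (Particle& p : particles) {   // regular, predictable memory access
            p.x += p.vx * dt;
            p.y += p.vy * dt;
            p.z += p.vz * dt;
        }
    }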

Memory control is harder with GC

Trying to fine-tune memory use using heap profiling is not easy. It's also difficult when objects' addresses can change at any time.

You started all this by saying "GC also promotes a lack of regularity in the heap (everything is fragmented and non-uniform), which is exactly the kind of problem that GPUs (and SIMD in general) are not good at. If you want perf, you make your structures regular, and then GC isn't really needed."... If we consider lack of heap regularity bad, then my comment that removing GC improves programs follows directly from that. In case it was not clear, the smiley face indicates that this was meant as a light-hearted comment, and not a serious attack on garbage collection :-).

eat oatmeal for regularity

everything is fragmented and non-uniform

i don't understand why gc's cannot subdivide the heap just like any implementation of malloc might do? i feel like you are stating something as if it were a law when it is only a habit. previous behaviour should ideally not be a predictor of future research ideas. e.g. GC.

Dividing the heap

In order to know whether an object is 'garbage', GC must search every location that might hold a reference to the object. Dividing the heap is only feasible to the degree we can limit this search.

But it is possible. There are typeful approaches like Cyclone's regions. And there are more dynamic approaches like bookmarking garbage collection. In small constant space per page, we can keep an imprecise reverse reference table (which other pages might be directly referencing this page?), and use this to collect a relatively small subset of pages at a time.
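
To make that concrete, here is a rough sketch of what such a per-page table might look like, with made-up sizes and names; real bookmarking collectors differ in the details:

    // Imprecise reverse reference table: each page gets a small, fixed-size
    // bit filter of pages that *might* reference it, maintained by a write
    // barrier on pointer stores.
    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t kPageBits   = 12;   // assume 4 KiB pages
    constexpr std::size_t kFilterBits = 64;   // constant space per page

    inline std::size_t page_of(const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) >> kPageBits;
    }

    struct ReverseRefTable {
        std::vector<std::bitset<kFilterBits>> filters;   // one filter per page

        explicit ReverseRefTable(std::size_t num_pages) : filters(num_pages) {}

        // Write barrier hook, called on every pointer store: record that the
        // source page may hold a reference into the target page.
        void note_pointer_store(const void* slot, const void* target) {
            filters[page_of(target) % filters.size()]
                .set(page_of(slot) % kFilterBits);
        }

        // Collecting page `dst` only requires scanning pages whose bit is set
        // here (a superset of the true referrers), plus the roots.
        bool may_reference(std::size_t src_page, std::size_t dst_page) const {
            return filters[dst_page % filters.size()]
                .test(src_page % kFilterBits);
        }
    };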

more assumptions

wait ain't it true that ref count based gc's don't have to search diddly squat other than having to have some rule for cycles, so see Bacon's work on that for one i guess.
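
A tiny illustration of that trade-off using C++ shared_ptr/weak_ptr (nothing here is specific to Bacon's algorithm): reference counting frees an object the instant its count hits zero, with no heap search, but a cycle of strong references would never be reclaimed without an extra rule such as trial deletion or weak references.

    // Reference counting with the cycle broken by a weak reference.
    #include <memory>

    struct Node {
        std::shared_ptr<Node> next;   // strong: keeps the target alive
        std::weak_ptr<Node>   prev;   // weak: breaks the would-be cycle
    };

    int main() {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;                  // a -> b (strong)
        b->prev = a;                  // b -> a (weak), so no strong cycle
    }   // counts drop to zero here and both nodes are freed immediately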

Hardware Assisted Garbage Collection

Not exactly part of the processor, but there has been work for hardware assisted garbage collection. Dr. Kelvin Nilsen did work at Iowa State.

http://citeseerx.ist.psu.edu/viewdoc/summary;jsessionid=55FAE3818BD1160A7A9ED201ACD5EFEA?doi=10.1.1.38.7195

Abstract

Hardware-assisted garbage collection combines the potential of high average-case allocation rates and memory bandwidth with fast worst-case allocation, fetch, and store times. This paper describes an architecture that allows memory fetch and store operations to execute, on average, nearly as fast as traditional memory. It describes a feasible implementation of a garbage-collected memory module, but does not provide a thorough discussion of possible design alternatives, nor does it provide rigorous justification for choices between available design alternatives.

Kelvin D. Nilsen and William J. Schmidt, "Preferred Embodiment of a Hardware-Assisted Garbage-Collection System," Department of Computer Science, Iowa State University, 17 November 1992.

Rekursiv and Vega

In addition to those already mentioned, there is the Rekursiv processor (obscure, unsuccessful but fun in its quirky ways). There is also the much more recent Azul Vega architecture which appears to have been a technological (if not business) success: roughly a plain general-purpose RISC with fairly minimal MMU and instruction support for GC. That approach always made sense to me.

Of course there are also research efforts: SOAR (Smalltalk on a RISC), the Scheme-79 chip, and doubtless several more.

GC progress?

The paper on hardware-assisted GC would be more useful if the abstract actually said something about what they did.

Would it help to have more tag bits associated with memory? Would it help to know if a store was pointer data or non-pointer data? A change to a pointer means a concurrent GC must rescan the page; a change to a non-pointer does not. Suppose pointers and non-pointers were changed with different instructions, and only a pointer change set the "dirty for GC purposes" bit.
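
In software terms, the analogue would be a card table where only pointer stores set the dirty flag; a rough sketch, with illustrative card size and names (a real collector would offset by the heap base instead of taking a modulus):

    // Only pointer stores dirty a card, so plain data stores never force a
    // concurrent GC to rescan.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t kCardBits = 9;                  // 512-byte cards

    struct CardTable {
        std::vector<std::uint8_t> dirty;                  // one byte per card

        void mark(const void* addr) {
            dirty[(reinterpret_cast<std::uintptr_t>(addr) >> kCardBits)
                  % dirty.size()] = 1;
        }
    };

    CardTable g_cards{std::vector<std::uint8_t>(1u << 20, 0)};

    // Used for reference fields: the card is marked for later rescanning.
    inline void store_pointer(void** slot, void* value) {
        *slot = value;
        g_cards.mark(slot);
    }

    // Used for plain data: no card is marked, so the GC ignores this store.
    inline void store_scalar(int* slot, int value) {
        *slot = value;
    }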

software balls of mud

I like that you gave this reason in your post, because you seldom make your specific interests clear.

... have good results like making pauses less obvious in interactive apps?

In the context of making UI more responsive and fluid, I think there is a lot of low-hanging fruit waiting to be picked in software optimization, by not having the UI rendering thread depend unnecessarily on lots of other computation that stalls UI updates (or merely degrades frame rate).

Divide and conquer works in both hardware and software; doing both is better than either alone. Using more hardware to get parallel resource processing ought to work better when software divides the work to be done into non-interfering pieces. (While this seems obvious as a premise, it's still just intuition, not grounded directly in empirical evidence.) However, division itself causes self-conflict if part of the programming model assumes monolithic interaction of some kind.

If software on one processor is waiting for GC to finish so it can continue, doing GC on another processor won't help, and could hurt if moving data increases cost. Moving data might cost so much that (I conjecture) there's no way to avoid having a UI rendering thread wait on separate hardware to perform GC.

In the early '90s I saw very responsive UIs on terribly weak hardware compared to what we have now. It stands to reason that, if we limit how much work is done in a UI rendering context, what we have now ought to be even more responsive than vastly inferior hardware from twenty years ago. Obvious pauses in interactive apps must be caused by (accidentally? sloppily? stupidly? indifferently? lazily?) depending on too much in event-handling and rendering code paths, so they wait for no good reason.

The culprit seems to be awkward software organization, not insufficient hardware resources.

Programming language architecture can make this less awkward, but it still requires both the standard libraries and the developer's app and server structure to avoid coupling everything together into one monolithic ball of mud.

yes and no

i think the multicore/processor gc question is: can that keep up with the garbage? is there a gc that doesn't stop the world? or only stops it in small bits? and yet still keeps up? it seems like the research says, "yes" even up to and including papers on gc for real time systems. but i dunno how far away it is from a vm near you/me/anybody in the trenches.

whenever there are more resources, people end up trying to do more things. so if we try to run a c64 ui today then yes we should be able to not hiccup ever at 30fps, but if we try to do something 'modern' then i dunno. we have to dig into it.

'modern' could be crappy code, so that would be your point.

'modern' could be something advanced in a good way, trying to make it easier for the developer cf. glitch or elm or frp or whatever. those might be demanding a lot more. (i dunno.)

'modern' could be something that is on top of modern runtimes, namely things with GC in them, vs. the complete lack of gc on older systems. i have no idea if original lisp machines were fast when they were not out to lunch doing gc.

so, yes, there's "modern" in the pejorative sense of the damned kids on my lawn who don't pay attention to where they are what they are doing because the gc seems to absolve them of it. anything based on java looks like that to me :-)

I can restate that to clarify intent

You posted a "GPU for GC" type of paper, with an "interactive app pauses" reason, but I mainly addressed the reason while lumping a GPU into a general "more hardware" idea. As far as more hardware goes, a GPU does sound plausible. But the essence of my comment was that Amdahl's law predicts little performance improvement from more hardware if software interferes with parallel operation by establishing too many relationships that need sequential service to satisfy API contract assumptions.

Software partitioned into (partly or mostly) independent resource domains can be serviced more in parallel. The point I wanted folks to infer (that I didn't say explicitly) was that UI rendering can occur in some context with less overlap with backend model processing -- which might still run locally and close at hand, just not as deeply entangled. In particular, if the UI does GC independently and the rest of the world need not GC at the same time, UI pauses will be shorter. That's implicit criticism of any design with large one-size-fits-all heaps, where everything runs in the same soup and does GC in lockstep.

When I compared hardware now to that of twenty years ago, the idea was to say we already have "more hardware" without adding a GPU, if we're tracking resources on the back of a napkin, and comparing performance to known good UI in old full-color workstations. (Some sucked, and some were good, and tuning of OS to actual resources did matter a lot. Note c64 was thirty years ago, not twenty, so it's not a good example.) Using software design technique alone, you can use hardware today to model a local UI client and several remote servers, but all running on the same box with corresponding improved performance. The smaller a "client" part, the faster it can be serviced by hardware without considering other components.

Please avoid insulting other parties on my behalf via rewording. No need to call anyone kids, say code is crappy, or knock specific VMs. Mixing everything together can have worse entropy, and the more resources you have, the worse mess you can make without some planning or discipline. Maybe we just need a little more control over fractal module granularity. If I worked for a PC vendor, I might instead say we really need everyone to buy another round of next generation systems.

UI in Web-browser.

This sounds like a justification for writing the UI in JavaScript to run in a web browser, and having the backend as a web service running locally (or wherever). It's interesting because I think a lot of software can be written like this, even games using WebGL. You also get to use 'modern' dynamic languages on both ends; we tend to use Python on the backend, but JavaScript on Node.js is also a good choice (and probably the fastest general-purpose dynamic language, though Python has NumPy and SciPy, which are actually written in C, to help).

exemplary anti-example

except for how all browsers seem to have noticeable pauses with their gcs?

wrong spot

*

browser might not be best performance case

This gives me too many things to say. :-) You seem to imply both browsers and their GC subsystems are paragons of performance and snappy UI fluidity. A stripped-down native UI ought to be crisper, but using a browser is easier at the expense of some performance -- the usual argument that faster for devs is more important than results for users. While I'm okay with that tradeoff, it seems odd in the context of seeking the least UI latency due to GC.

This sounds like a justification for writing UI in JavaScript to run in a web browser, and have the backend as a web service running locally (or where ever).

I had in mind justifying separation of UI and backend as a performance tactic, even when it means marshalling more data between (OS or lightweight) processes. This is contrarian, since old-school development required putting everything as close together as possible for best results. That partitioning is good when you have more cores isn't intuitive as an optimization, because it means doing more in total to go faster, which used to be a contradiction.

Additionally, you get other benefits from separation of concerns. Parts are less complex, can fail independently, and can be swapped interchangeably after writing more than one version. Advantages afforded by decomposition become available that we once sacrificed in the name of performance. Instrumenting for debug observation need no longer completely re-arrange app infrastructure once it's no longer wired tightly together in the smallest snarl we can achieve.

It makes sense for one UI option to be browser frontend, regardless of what a backend does or how many pieces comprise it, or where those pieces are located. Of course, if backend is web service alone, you might have no other choice than a browser UI. Two independent good quality UIs may be too expensive for production, for many folks, arguing for one to be lower-effort test-only quality.

Some backend services may have no UI at all, like headless network components; in this case, any UI is an improvement with respect to visibility for diagnosis. To support UI where none existed before, one might emplace new backend features like HTTP service for remote state inspection.

Mobile App

A mobile app is often a good second front-end for that web-service.

On the performance front, since web-browser compute scales with the number of clients, it makes sense to do as much work there as possible (without compromising the service's security). It can save you money on server scaling. Also, a lot of time and money has been spent on JavaScript performance, so it's pretty good now. You can even run Quake in a web browser with WebGL (and the OpenGL performance is pretty much as fast as native). I think WebCL is coming too.

Add to this that you get a collaborative multi-user application; what's not to like? :-)

the pauses! the pauses are killing me.

any game with noticeable (more than like 5ms) gc pauses of any kind is just depressing. no matter how infrequent.

The princess and the 5ms garbage collector pause

"I could hardly play at all," said the princess. And that's how the queen knew she was a real princess.

many worse problems

Falling through floors, getting stuck in geometry, quests that get stuck in weird states and cannot be completed, auto-saves that crash when reloaded, silly pathing AIs such that your allies can't follow you up a six inch ledge, NPCs that (by accident) stand in front of doorways and block them for several minutes as you're trying to move through (or worse, trap you in a corner)... to name a few problems I've encountered in 2014.

If occasional 5ms collector pauses are the worst you're dealing with, you're in pretty good shape. :) I think we need more attention on state, consistency, and AI.

In any case, the advice I've seen for games is to force a collection every frame to keep it predictable.

misc thoughts

re: Amdahl's law: therein lies one rub. some papers i think i've seen seem to indicate that it isn't so bad and that we can just let the garbage grow and the concurrent/parallel collector(s) will catch up. possibly up to and including real-time constraints. random paper from my alma mater (although i was a lousy undergrad /and/ i dropped out of the master's program, as if you couldn't figure that out from my level of discourse) that gives me hope.

i appreciate your criticism of one-size-fits-all, and i'm sure the erlangers are happy to see such thinking get more credence.

the thing is, i want somehow for this all to happen tomorrow for ios and android. is hacking the gc going to be enough to make up for crappy architecture and design where the ui is not separated off from the services? that's what i hope for, just out of sheer pragmatism.