EigenCFA: Accelerating Flow Analysis with GPUs

Here's a very impressive speedup on a program analysis problem by using a GPU.

EigenCFA: Accelerating Flow Analysis with GPUs - Tarun Prabhu, Shreyas Ramalingam, Matthew Might, Mary Hall

We describe, implement and benchmark EigenCFA, an algorithm for accelerating higher-order control-flow analysis (specifically, 0CFA) with a GPU. Ultimately, our program transformations, reductions and optimizations achieve a factor of 72 speedup over an optimized CPU implementation.

We began our investigation with the view that GPUs accelerate high-arithmetic, data-parallel computations with a poor tolerance for branching. Taking that perspective to its limit, we reduced Shivers’s abstract-interpretive 0CFA to an algorithm synthesized from linear-algebra operations. Central to this reduction were “abstract” Church encodings, and encodings of the syntax tree and abstract domains as vectors and matrices.

A straightforward (dense-matrix) implementation of EigenCFA performed slower than a fast CPU implementation. Ultimately, sparse-matrix data structures and operations turned out to be the critical accelerants. Because control-flow graphs are sparse in practice (up to 96% empty), our control-flow matrices are also sparse, giving the sparse matrix operations an overwhelming space and speed advantage.
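
To give a flavor of the idea, here is a minimal sketch (mine, not the paper's actual encoding, whose transfer functions are built from the abstract Church encodings mentioned above) of a flow-analysis propagation step phrased as a sparse matrix-vector product. The matrix, the program points, and the fixed-point loop are all illustrative assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical "flows-to" matrix F: F[i, j] = 1 means an abstract value at
# program point j may flow to program point i. Control-flow matrices like
# this are sparse in practice, which is what makes csr_matrix pay off.
F = csr_matrix(np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=np.uint8))

# r: bit-vector of facts known so far (one entry per program point).
r = np.array([1, 0, 0, 0], dtype=np.uint8)

# Iterate the transfer step to a fixed point. Each iteration is one sparse
# matrix-vector product followed by a join (saturating OR); the bulk,
# branch-free shape of this loop body is what a GPU kernel can exploit.
while True:
    r_next = np.minimum(F @ r + r, 1).astype(np.uint8)
    if np.array_equal(r_next, r):
        break
    r = r_next

print(r)  # facts known at the fixed point: [1 1 1 0]
```
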


Interesting

I just skimmed the paper and it is fairly interesting. The "fast CPU implementation" mentioned in the abstract is a Racket implementation "following best practices", that is, written the usual way by people who are expert at implementing 0CFA. A particularly interesting point is that they also wrote a sparse-matrix implementation of 0CFA in C for the CPU, following the algorithm they had devised with the GPU in mind. This version is slower than the GPU for large programs, but faster for smaller ones (up to 4000 terms), and surprisingly faster (more than a 10x speedup) than the reference implementation. (In retrospect, I'm stunned that they had the good idea, and the rigor, to run this matrices-on-CPU experiment.)

This means that the encoding as sparse-matrix operations is also a source of speed-up on the CPU. Is that something fundamental about reusing all the good ideas of linear algebra, or rather a relatively unpleasant way to extract a more locality-friendly representation of symbolic term trees?
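
To illustrate the locality side of that question, here is a rough sketch (my own, not from the paper) of what a flows-to relation looks like once flattened into CSR-style arrays: one propagation pass becomes a couple of sequential scans over flat arrays rather than a pointer chase over a term tree. The relation and array names are made up for the example:

```python
import numpy as np

# A small flows-to relation as (src, dst) edges, purely for illustration.
edges = [(0, 2), (1, 0), (2, 1), (3, 2)]
n = 4

# Build a CSR-style layout by hand: for each source, a contiguous slice
# of its destinations lives in `indices`, delimited by `indptr`.
indptr = np.zeros(n + 1, dtype=np.int64)
for src, _ in edges:
    indptr[src + 1] += 1
indptr = np.cumsum(indptr)

indices = np.empty(len(edges), dtype=np.int64)
fill = indptr[:-1].copy()
for src, dst in edges:
    indices[fill[src]] = dst
    fill[src] += 1

# One propagation pass: for each known fact, scan its contiguous slice.
# This is two sequential array walks, not a traversal of linked nodes.
facts = {0}
reached = set(facts)
for src in facts:
    for dst in indices[indptr[src]:indptr[src + 1]]:
        reached.add(int(dst))
print(reached)  # {0, 2}
```
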

Classic conundrum

Linear algebra is built from basic computations, and in practice a lot of it maps to bulk operations on graphs. Like monads, these operations capture basic concepts such as bulk transformations (~= a step of transitive closure). However, just as with monad syntax, I'd need a compelling reason to use something so awkward. That said, there are many such reasons here: optimizations from term rewriting, memory locality, parallelism, distribution, ...
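
For concreteness, a minimal sketch of that analogy: one matrix multiply over a tiny invented edge relation is one bulk step of transitive closure, and iterating it to a fixed point yields the full closure.

```python
import numpy as np

# Tiny edge relation, chosen only for illustration: 0 -> 1 -> 2.
A = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
], dtype=np.uint8)

# One bulk step extends every known path by one edge at once; iterating
# to a fixed point yields the transitive closure.
closure = A.copy()
while True:
    step = np.minimum(closure + closure @ A, 1).astype(np.uint8)
    if np.array_equal(step, closure):
        break
    closure = step

print(closure)
# [[0 1 1]
#  [0 0 1]
#  [0 0 0]]
```
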

I'd rather get a compiler from Datalog variants: a lot of the same benefits, but less awkward. See recent papers from LogicBlox.
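
For comparison, here is the same kind of reachability computation written as Datalog rules, with a hand-rolled semi-naive evaluation underneath. The rules and the tiny edge relation are illustrative only, and this is of course not what a real Datalog compiler generates:

```python
# Reachability as Datalog rules:
#
#   reach(X, Y) :- edge(X, Y).
#   reach(X, Z) :- reach(X, Y), edge(Y, Z).
#
# ...and a hand-rolled semi-naive evaluation of those rules in Python.
edge = {(0, 1), (1, 2)}

reach = set(edge)
delta = set(edge)
while delta:
    # Only join the newly derived facts against edge/2 each round.
    new = {(x, z) for (x, y) in delta for (y2, z) in edge if y == y2} - reach
    reach |= new
    delta = new

print(sorted(reach))  # [(0, 1), (0, 2), (1, 2)]
```
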

comblas

Meant to add this particularly interesting project: http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/ .