Parallel frameworks for graph processing

Given successes in parallel frameworks for linear algebra (BLAS) and log processing (MapReduce), the current trend in large scale data parallelism seems to be more general graph processing. An early big contender discussed here was Google's Pregel.

Two fun ones have been making the rounds. First, and fairly related, is GraphLab:

Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.

There are obvious connections to systems like Cilk/TBB and, more focused on graphs, Pingali's Amorphous Data Parallelism work.

A radically different approach is John Gilbert's Parallel Combinatorial BLAS: A Toolbox for High-Performance Graph Computation (papers, slides):

This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse grained parallelism for thousands of processors, which can be uncovered by using the right primitives.

We describe the Parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extendible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-of- use. The scalability and raw performance of the example applications, using the combinatorial BLAS, are unprecedented on distributed memory clusters.

The approach to performance in both is fairly different. Intriguing to me is the choices in frontend languages: given the constrained domains, I suspect much higher level languages are possible (hint: synthesis).

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Yes, MapReduce is not the only game in town

Awhile back I pointed somebody to Factor Graphs and the Sum-Product Algorithm. It turns out that the product-sum factor graph encoding allows to easily express the best known optimal version of several common problems. Factor Graphs are also becoming a very popular data structure for statistical analysis, and so the product-sum algorithm is somewhat of a "swiss army knife".

There is also a keynote talk at SPLASH this year about Markus Püschel's work on Spiral.

There is also more broad geometric algebra languages and libraries, as well, that allow defining problems using the rich structure of geometric algebra and compiling down to simpler primitives while optimizing away redundant computations: Gaigen 2. Gaigen 2 can compile code for a 3D GA that is equivalent in performance to 3D LA. Gaigen 2 also contains a domain specific language for implementing GA functions.

Gaigen 2 performance.

Gaigen 2 can compile code for a 3D GA that is equivalent in performance to 3D LA.

I cannot find anything about that on the site. The closest thing I found is old ray tracing performance table which does not contain anything like that. Could you provide a link?

Also, 3D GA is not as elegant as 5D GA. It's interesting what they had achieved for 5D GA.

Can you recommend a short

Can you recommend a short introduction to the elegant form of GA?

One thing I have to wonder is why, if it is elegant, it even depends on the number of dimensions at all. Linear algebra is just as elegant for any number of dimensions.

Actually, if you use

Actually, if you use (n+2)-dimensional GA for n-dimensional space, you get pretty elegant formalization.

The most notable one is that translations and rotations are uniformly represented by so called versor.

In n-dimensional GA you have to use two different operations for translation and rotation. So you do in n-dimensional LA (except you use (n+1) affine formulation where matrix multiplication can rotate and translate).

Please, take a look at comparison on Wikipedia. GA naturally incorporate some operations which are special cases in LA.

OOPs!

I linked to Gaigen 1.0, not Gaigen 2. Thanks for the correction. It looks like the Gaigen 2 website is mostly blank. Daniel Fontijne's personal research website contains Gaigen 2.5 and the associated publications. See conclusions section of their GPCE 2006 paper [1], where they state Gaigen 2.0 measures 25% slower to 40% faster than 3D LA.

Now, how relevant are these benchmarks? It depends. 3D LA occurs both memory and computation overheads whenever it needs multiple representation conversions to compute a final result. For ray tracing performance, that IR overhead might not ever be enough to shift the benefits to GA. For benchmarks where GA can directly represent problems without requiring substantial IR overhead, it will win. This gap is a problem space for tools to fill-in, but the big win is probably for teaching computer graphics and producing systems with much smaller code sizes than today.

[1] Gaigen 2: a Geometric Algebra Implementation Generator

Quite good news.

Thank you very much!

I thought it is possible to obtain speed comparable to LA (within factor of 2), but I couldn't imagine that it is possible for GA to be faster than LA.

Awhile back I pointed

Awhile back I pointed somebody to Factor Graphs and the Sum-Product Algorithm. It turns out that the product-sum factor graph encoding allows to easily express the best known optimal version of several common problems.

E.g, a lot of Bayesian models :) Talking to a ML specialist recently, he thought convex optimization has become the way to think about, and thus target, modern machine learning algorithms.

Convexity and Non-convexity

Convex optimisation is certainly important but mostly in classification. Even simple methods unsupervised methods like k-means are non-convex. There is also a movement in classification, the "deep networks" stuff, which uses non-convex optimisation.

This talk is quite interesting:

http://cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.pdf

Video for the talk

Thanks for linking to the PDF. I found a video of the talk, which colors in some of the opinionated statements on the slides: Who is Afraid of Non-Convex Loss Functions?

Signal/Collect

You might also be interested in Signal/Collect, a programming model and framework for parallel graph processing. Signal/Collect supports multiple vertex/edge types and asynchronous algorithm executions, which is advantageous for implementing algorithms of the Sum-Product family.

(disclaimer: I'm one of the developers/authors of the framework/paper)

Three tiers

We no longer live in a homogeneous computing world. Gone is the day when we programmed for one core on one CPU.

Right now we live in a three tier world:

Device Level | Operating System Level | Message Passing Level

(ASM)OpenCL | {(ASM)C/C++, TBB, CILK} | {RDMA, Hadoop/JAVA, MPI}

Markus Püschel's SPIRAL like Matteo Frigo's FFTW before it (Frigo is also part of the CILK team) are closer to where we need to go.

We need a strong database of algorithms and benchmarks, along with automated ways of tuning libraries to new machines.

There is a good talk by Sedgewick along these lines. Also, a good presentation by Stepanov.

We can leverage combinatorics to get some great bounds, but without full information we have to apply the scientific method.