Towards general nested data parallelism

There is quite a performance difference between a flattening nested-data-parallelism system like the seminal NESL language, which transforms nested vector manipulations into flat Cray vector instructions, and multicore languages like Cilk, which achieve coarser-grained parallelism but for more general computations. As Guy Blelloch explored in the mid-90s, work-stealing techniques can also be used for coarse-grained parallelism in constrained vectorizing languages. The reverse, however, is not obvious: can we apply flattening techniques for fine-grained parallelism to general data-parallel languages, and not just to manipulations of the restrictive nested vector datatype?
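
To make the contrast concrete, here is a minimal Haskell sketch of the flattening idea (my own illustration, not NESL's actual implementation; all names are made up): a nested vector becomes one flat data vector plus a segment descriptor, and a nested reduction becomes per-segment work over that flat data.

```haskell
import qualified Data.Vector.Unboxed as U

-- A nested vector stored as one flat data vector plus a segment descriptor.
data Segmented a = Segmented
  { segLengths :: U.Vector Int  -- length of each inner vector
  , segData    :: U.Vector a    -- all elements, concatenated
  }

fromNested :: U.Unbox a => [[a]] -> Segmented a
fromNested xss = Segmented (U.fromList (map length xss))
                           (U.fromList (concat xss))

-- Nested form:    map sum xss
-- Flattened form: per-segment reductions over the single flat data vector.
-- A real flattening backend would fuse these slices into one segmented scan,
-- which is exactly the shape of work a vector unit executes well.
segmentedSums :: (U.Unbox a, Num a) => Segmented a -> U.Vector a
segmentedSums (Segmented lens dat) =
  U.map (\(off, len) -> U.sum (U.slice off len dat)) offsets
  where
    offsets = U.zip (U.prescanl (+) 0 lens) lens
```

For instance, `segmentedSums (fromNested [[1,2],[3,4,5]])` yields the vector `[3,12]`.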

Two papers provide a semantic foundation for bridging this gap. First, Keller shows how to naturally support irregular structures like trees (paper, though her thesis is cleaner): add recursive types and encode them by representing every (entire!) level of unrolling as a segmented vector. Second, Prins showed how to include higher-order values in a first-class way in the NESL-era language Proteus. Finally, Chakravarty and Keller followed up on their u-type work, essentially cleaning up the rest of the expressiveness story, in their succinct paper More Types for Nested Data Parallelism.
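
Here is a rough sketch of how I read the level-by-level encoding (the names and the use of plain lists are mine, not Keller's definitions): a rose tree is unrolled breadth-first, and each entire level is stored as a flat values array plus a child-count segment descriptor, so a level-wise operation is once again a flat data-parallel operation.

```haskell
-- A rose tree, and one level of its unrolling: the values at that depth plus
-- each node's child count (the segment descriptor for the next level down).
data RoseTree a = Node a [RoseTree a]

data Level a = Level
  { levelValues   :: [a]
  , levelChildren :: [Int]
  }

-- Unroll a forest breadth-first: every entire level becomes one Level record.
levels :: [RoseTree a] -> [Level a]
levels []     = []
levels forest =
  Level (map value forest) (map arity forest) : levels (concatMap kids forest)
  where
    value (Node x _)  = x
    arity (Node _ cs) = length cs
    kids  (Node _ cs) = cs
```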

With such a foundation, we reach Nepal, rebooted as Data Parallel Haskell, and Manticore (from the ML camp): flattening for general functional languages. However, it looks like life isn't so easy (spj on youtube): none of these embeddings actually use vector/SIMD instructions, both primarily focus on multicore and work-stealing (which does not require flattening), and they still struggle to beat C (sequential breakpoint / strong-scaling issues). One of my guesses is that, while we now have a fundamental automated transformation for flattening richer structures, we still lack many of the optimizations that HPC coders typically apply to such structures by hand. For example (and this is what sparked my own interest), the breadth-first layout of recursively typed structures is a poor fit for common tree access patterns.
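
As a small illustration of that mismatch, reusing the Level encoding sketched above (my own reconstruction, not the actual DPH layout): following a single root-to-leaf path touches one element in each per-level array and needs a prefix sum over the segment descriptor just to locate the child, whereas a pointer-based tree follows one link per step.

```haskell
-- Walk a single root-to-leaf path in the level-by-level encoding above.
pathValue :: [Level a] -> [Int] -> a   -- the Ints pick a child at each level
pathValue = go 0
  where
    go i (Level vals _    : _)    []       = vals !! i
    go i (Level _    kids : rest) (c : cs) =
      let firstChild = sum (take i kids)   -- prefix sum over child counts
      in  go (firstChild + c) rest cs
    go _ []                       _        = error "path longer than the tree"
```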

I actually once suggested that.

Here you can get a general idea of what I suggested back then.

Every pseq/par pair sparks a little computation which shares code with other little computations. If we group those little computations into vectors, we can use vector operations on the whole group at once.
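
A toy Haskell reconstruction of this grouping idea (not the code from the original suggestion; the names and the fixed-depth expansion scheme are mine): the usual one-spark-per-call version, and a hand-batched version that expands the call tree to a fixed depth and evaluates the whole frontier of same-code calls as one bulk vector map.

```haskell
import Control.Parallel (par, pseq)
import qualified Data.Vector.Unboxed as U

-- The usual style: each par/pseq pair sparks one little computation.
pfib :: Int -> Int
pfib n
  | n < 2     = n
  | otherwise = a `par` (b `pseq` (a + b))
  where
    a = pfib (n - 1)
    b = pfib (n - 2)

-- The grouping idea, done by hand: expand the call tree to a fixed depth (the
-- points where sparks would have been created), collect the frontier of
-- same-code calls into a vector, and evaluate the whole group with one bulk
-- map. Data.Vector here is only a stand-in for "the group"; actually emitting
-- vector instructions for it would be the backend's job, and how deep to
-- expand is the throttling question.
pfibGrouped :: Int -> Int -> Int
pfibGrouped depth n = U.sum (U.map fib frontier)
  where
    frontier = U.fromList (expand depth [n])
    expand 0 ks = ks
    expand d ks = expand (d - 1) (concatMap split ks)
    split k
      | k < 2     = [k]              -- already a leaf; keep it as-is
      | otherwise = [k - 1, k - 2]
    fib k
      | k < 2     = k
      | otherwise = fib (k - 1) + fib (k - 2)
```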

I hadn't figured out (YET) how to throttle those computations, though.

These approaches look fairly similar

These approaches have a similar intuition behind the transformation, but there's an interesting distinction: you exploit control parallelism while DPH exploits data parallelism.

In favor of the data approach, one thing I really liked about Gabriele Keller's formulation is that parallelization is type-directed. E.g., [[ 'a array ]] is treated differently (from a performance perspective) from [[ [[ 'a ]] ]]. The schema guides the partitioning orthogonally to both the sequential semantics and the particular algorithm. Once we get a better handle on how to transform structures and schedules in a way that is cognizant of multiple distinct access patterns in the same execution, this choice may make less sense (though it may also be an opportunity for more type hackery, like type refinement).
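
A small sketch of what "type-directed" can look like in Haskell (my own illustration; the bracket notation above is from Keller's formulation, but the code below is not hers and not DPH's actual machinery): the physical representation of a parallel array is chosen by its element type, so an array of sequential arrays keeps its elements whole while an array of parallel arrays is flattened into a segment descriptor plus concatenated data, and the two get partitioned quite differently.

```haskell
{-# LANGUAGE TypeFamilies #-}
import qualified Data.Vector         as V  -- boxed vectors standing in for a plain 'a array
import qualified Data.Vector.Unboxed as U

-- A parallel array whose physical representation is chosen by its element type.
newtype Par a = Par (PArr a)

type family PArr a where
  -- [[ 'a array ]]: a parallel array of sequential arrays. Elements stay
  -- whole, and partitioning happens over the outer index.
  PArr (V.Vector Double) = V.Vector (V.Vector Double)
  -- [[ [[ 'a ]] ]]: a parallel array of parallel arrays. Flattened into a
  -- segment descriptor plus one concatenated data vector, so partitioning
  -- happens over the flat element data.
  PArr (Par Double)      = (U.Vector Int, U.Vector Double)
  -- A parallel array of scalars: a single flat unboxed vector.
  PArr Double            = U.Vector Double
```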

In contrast, your approach seems to be based on repeated (e.g., recursive) access patterns, which is better when the data is synthesized entirely. DPH requires the inductive structure to be reified; you don't. For example, I think DPH would need to build an explicit fib recursion tree and then solve it (e.g., through a tree traversal), while you can go directly.

A neat formulation (and automation) of your trick might be similar to tail call optimization, where, instead of eliminating the stack, you (transparently!) get data parallelism from it!