Intel Concurrent Collections for Haskell

Intel Concurrent Collections...provides a mechanism for constructing [programs] that will execute in parallel while allowing the application developer to ignore issues of parallelism such as low-level threading constructs or the scheduling and distribution of computations. The model allows the programmer to specify high-level computational steps including inputs and outputs without imposing unnecessary ordering on their execution... Data is either local to a computational step or it is explicitly produced and consumed by them. An application in this programming model supports multiple styles of parallelism (e.g., data, task, pipeline parallel). While the interface between the computational steps and the runtime system remains unchanged, a wide range of runtime systems may target different architectures (e.g., shared memory, distributed) or support different scheduling methodologies (e.g., static or dynamic).

In short, CnC purports to be "a graph-based data-flow-esque deterministic parallel programming model". An open-source Haskell edition of the software was recently released on Hackage.

A series of blog posts describe the implementation: #1, #2, #3, #4, #5 (the last post links to a draft paper).

Personally, I would have preferred a more concise and down-to-earth description of the whole thing. If you have experiences to share, please do.


There's also a draft Haskell Symposium submission

There's a fifth blog post that links to a video and to the draft paper (linked here).

Regarding the clarity of the description, I think part of what's going on is that people in this project come from several different backgrounds (mine is PL; others come from HPC and architecture).

Frankly, the above snippet doesn't speak to me either. I'd like to know whether LTU readers find the introduction in the blog posts, or the one in the Haskell Symposium submission, more helpful than the description on the generic CnC site (excerpted above).

As I explain in the video linked above, I think of CnC as just another task-graph-style parallel programming model, along with "stream processing" and the gang. That is, it comes down to message passing, not shared memory. These message-passing systems (just like Erlang) generically preserve the property that "while the interface... remains unchanged, a wide range of runtime systems may target different architectures".

The wrinkle with CnC is that it mixes in a bit of what looks like shared memory, in the form of immutable/append-only key-value stores (item collections). (You hear about immutable key-value stores often nowadays, don't you? Especially for distributed storage.)
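To make that concrete, here's a tiny example in the style of the haskell-cnc API as described in the blog posts and draft paper (take the details as a sketch, not gospel). Steps are prescribed by tag collections; item collections are the single-assignment key-value stores; put/get operate on items, and putt emits new tags:

    import Intel.Cnc  -- module exposed by the haskell-cnc package

    -- A step reads items, computes, and writes new items (single-assignment).
    myStep items tag = do
      word1 <- get items "left"
      word2 <- get items "right"
      put items "result" (word1 ++ word2 ++ show tag)

    cncGraph = do
      tags  <- newTagCol              -- tag collection: controls step instances
      items <- newItemCol             -- item collection: append-only key-value store
      prescribe tags (myStep items)   -- wire the step to its prescribing tags
      initialize $ do                 -- the environment produces initial data
        put items "left"  "Hello "
        put items "right" "World "
        putt tags (99 :: Int)         -- emitting tag 99 spawns one step instance
      finalize $                      -- the environment consumes the result
        get items "result"

    main = putStrLn (runGraph cncGraph)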

Why?

I understand the how, but why?

You've got a functional language with lazy evaluation. It has the property that the program's expressions can pervasively be reduced in any order, or concurrently, without affecting the deterministic outcome of the computation. And, your goal is to support concurrency. Why do it this way?

CnC is for Parallelism (despite name)

I have the same question about motivation. Haskell already has powerful support for parallelism strategies and annotations, even if it performs relatively poorly on the parallel shootout benchmarks. CnC for C++ seems better motivated.

your goal is to support concurrency

It seems not. Despite 'Concurrent' sitting in the title, the goal of Intel's Concurrent Collections seems to focus entirely upon parallelism. (For that matter, I'm also wondering why 'Collections' is in the title.)

[Edit: I suppose there might be some 'declarative concurrency' involved, of the sort described by PvR, where concurrency is necessary for progression of the computation.]

"performs poorly"

Er, so is CnC somebody's attempt to have Haskell parallelism that doesn't perform poorly?

Doubt it...

The performance of GHC's parallelism seems largely a property of its primitives, and has been improving a great deal across the last few releases (6.8..6.12). A comment from the developer (on the 'tour of schedulers' link) suggests that the development version of GHC makes some great strides in performance.

I suspect GHC will remain among the best places to target for improving Haskell's parallel performance. My understanding (roughly inferred from comments) is that the '-threaded' flag to GHC is causing us to pay through the nose in terms of memory barriers in allocators, GC, possibly on assignments from lazy evaluation, and so on. We might see some big improvements by upgrading GHC's allocator and GC, by avoiding unnecessary barriers or keeping them at a coarser granularity, and perhaps by modifying how laziness is handled during work-stealing (modifying pointers after lazy evaluation is mutation - at least as far as cache-thrashing is concerned!).

That said, if the number of parallelism sparks is too high, one ends up wasting a bunch of CPU to no great effect. Similarly, if parallelism is too fine-grained, then pages of state will be thrashing across processor cache boundaries. Thus proper decisions on where to apply the 'par' annotation are difficult.
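For illustration, here's the usual sketch of controlling spark granularity with 'par' and 'pseq' (the cutoff of 20 is arbitrary, purely to make the point):

    import Control.Parallel (par, pseq)

    -- Naive parallel Fibonacci: 'par' sparks the left branch while 'pseq'
    -- forces the right branch on the current thread. Below the threshold
    -- we fall back to sequential code, since sparking every tiny
    -- subproblem wastes CPU and thrashes the cache.
    parFib :: Int -> Integer
    parFib n
      | n < 20    = seqFib n                 -- too small to be worth a spark
      | otherwise = x `par` (y `pseq` x + y)
      where
        x = parFib (n - 1)
        y = parFib (n - 2)

    seqFib :: Int -> Integer
    seqFib n = if n < 2 then fromIntegral n
               else seqFib (n - 1) + seqFib (n - 2)

    main :: IO ()
    main = print (parFib 30)  -- ghc -threaded -O2; run with +RTS -N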

The CnC for Haskell code uses these primitives (and a touch of unsafePerformIO magic), and comes with something like five or six different implementations (schedulers). It would be nice to see some shootout benchmarks to find out whether the controlled expression of parallelism is the real issue.

Re: dmbarbour, raold & Tim Sweeney

Re: the naming. I'm the guy responsible for the Haskell implementation, but not CnC as a whole. There are some historical issues going on. But, yes, it's about parallelism, not concurrency. The "collections" refer to the "steps", "items", and "tags" that make up the CnC graph.

Tim makes a good point about the ubiquitous "operand" parallelism exposed by purely functional languages, including Haskell. (To my knowledge this has been used as a source of implicit parallelism in projects as far back as Id, and possibly further.) But there are major barriers to pure programming alone yielding a parallel paradise. First, as Guy Steele has been emphasizing, not all pure programs are parallel: we must write our programs to expose balanced trees of operations. Many algorithms are serialized by their data structures, or use lists, which, viewed as trees, are anything but balanced! (The sketch below illustrates the difference.)
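Here's a throwaway sketch of that point: the same sum written as a left fold (a serial dependency chain) and as a balanced divide-and-conquer reduction that actually exposes independent work (the names are mine, just for illustration):

    import Control.Parallel (par, pseq)

    -- foldl makes every step depend on the previous accumulator:
    -- there is nothing here for 'par' to exploit.
    sumSerial :: [Int] -> Int
    sumSerial = foldl (+) 0

    -- The same reduction shaped as a balanced tree exposes two
    -- independent halves at every level. (In practice you'd also want
    -- a granularity cutoff, as discussed below.)
    sumTree :: [Int] -> Int
    sumTree []  = 0
    sumTree [x] = x
    sumTree xs  = l `par` (r `pseq` l + r)
      where (ls, rs) = splitAt (length xs `div` 2) xs
            l = sumTree ls
            r = sumTree rs

    main :: IO ()
    main = print (sumSerial [1..100000], sumTree [1..100000])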

Second is the universal problem of granularity. "par" in Haskell is a means of hinting that a computation is sufficiently hefty to be passed to another processor. Yet, as dmbarbour and raold point out, parallel performance in GHC has not really achieved robustness, scalability, and predictability. It's good, and getting better, but not great.

Graph-based programming models (usually based on synchronous data-flow) have a long history and come with a set of techniques that do provide great parallel performance. How to bring these benefits to Haskell? While you could use "Strategies" and parBuffer for a bit of producer-consumer parallelism, or IO threads and MVars, neither of these preserves the kind of scheduling benefits that I'm talking about (and the latter is impure!).
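For reference, the parBuffer style looks roughly like this with Control.Parallel.Strategies ('expensive' is a stand-in for real per-element work):

    import Control.Parallel.Strategies (parBuffer, rdeepseq, using)

    -- A stand-in for some costly per-element computation.
    expensive :: Int -> Int
    expensive x = sum [1 .. (x `mod` 1000) + 1]

    -- 'parBuffer 50' keeps roughly 50 sparks in flight ahead of the
    -- consumer: a simple producer-consumer pipeline with no explicit
    -- threads, and no control over where work is scheduled.
    main :: IO ()
    main = print (sum (map expensive [1 .. 10000]
                         `using` parBuffer 50 rdeepseq))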

My goal with Haskell CnC was to bring some of those benefits to Haskell, and further, with Simon Marlow's help, to push the envelope in parallel performance on GHC. We're running GHC on machines with up to 8 sockets / 48 cores and getting >20X speedups on some reasonably useful benchmarks (not parfib). That's not perfect, but it's headed in the right direction.

(P.S. Re: dmbarbour: yes, the GC architecture is a major problem for scaling. GHC requires global stop-the-world synchronization even on minor collections, which have to happen frequently to keep the working set in the L2 cache. A GC rearchitecture is in progress.)

Because.

I was mildly sceptical at first, but eventually I realised that CnC has something useful to offer in this space. It affords the programmer more control, without throwing away the determinism that we cherish. The control it provides is in the form of explicit specification of computation steps - granularity - and how the computation steps are wired together. It avoids lazy evaluation, in favor of explicit dataflow.

What's more, CnC is implemented as a library with a variety of different scheduler implementations. Different schedulers perform better in different circumstances - the programmer thus has even more knobs to tune performance than with the Strategies programming model.

I think there's every reason to believe that CnC will be the route to decent parallel scaling in Haskell for those willing to invest a bit more effort. DPH will ultimately be the way to scale data parallel computations, but the implementation isn't quite ready yet. Strategies are great for adding a bit of parallelism to an existing program, and the modularity benefits are worthwhile, but the high level of abstraction makes performance tuning difficult.