A theory of data parallel computing

This work is motivated by the following considerations:

  • Parallelism in scientific computing is often of the data-parallel variety: a program has de facto sequential semantics, but operates on objects that are distributed over many locales. I'm not aware of much theory about this: theoretical computer science usually takes a CSP approach to parallelism, where each process does its own thing, and other processes are a source of messages that it reacts to.
  • On most contemporary processors data movement is very costly, more so than operations. There are few models that explicitly take into account that data resides somewhere, and that you cannot just access data without making sure it's actually close to you.
  • There are many "modes" of parallelism these days: multi-threading, co-processor offloading, distributed-memory message passing. Code written for one mode is impossible to translate to another. So I wanted a mode-independent way of describing algorithms that can be "realized" in terms of existing systems.

Here is my theory story. I have prototype software, and you can take my word for it that (within its limited functionality) it is as efficient as hand-coded mode-specific software.

pdf document, slightly updated

Feedback much appreciated.


Concurrency vs. parallelism

CSP is about concurrency, not parallelism. See my blog post for an elementary explanation of the difference.

I know

But the term "communicating sequential processes" is often applied to, for instance, MPI programs. And in a way that's appropriate: MPI can operate in MPMD (multiple program multiple data) mode, where each process/processor runs a fully independent main, listening for messages from other processes. And that's more or less classic CSP, no?

Anyway, I'm not interested in CSP. I want to address the reality that most scientific computing is data parallel on some level.

lost me pretty quickly

The abstract and intro all sounded good to me, or at least interesting!

But... at #2, the motivating example...

The diagram made no sense to me, the terminology of alpha/beta distributions was weird, etc. All in all it just looked too foreign to me; I guess it is too compact or something. I'd at the very least urge some other diagram: that one seemed really weird and in no way told me at a glance what was going on. Maybe it is fine *after* you understand it, but that seems to defeat the purpose in my mind :-)

I'll be happy to explain

I'm not sure what is unclear about the diagram. Let me know where you get lost.

I start with a function $y_i = f(x_i, x_{i-1}, x_{i+1})$, to be computed for each index $i$ in some vector.

I have two vectors, labeled "in" and "out", and they have a linear index. Should I put a couple of "i" letters in those vectors?

The tridents indicate how index $i$ in the output is a function of indices $i-1$, $i$, $i+1$ in the input. With me so far?

The dotted lines indicate that the input and output vectors are chopped up over processors. The letter "p" indicates that processor "p" gets the part of the input and output vector in between the dotted lines.
(The nature of this ownership is left up in the air.)

So if you look at the leftmost "i" for which processor p wants to compute the output value $y_i$, then it needs for instance $x_{i-1}$ which lives on a different processor.

Distributions are mappings from processor to sets of indices. (Maybe I should emphasize that more.)

The alpha distribution is easy: it says "processor p owns indices something through something". This describes the storage of the input and output vector.

The beta distribution says "processor p owns -- in a different sense of ownership -- a somewhat larger set of indices". This describes all indices that processor p needs as input for the operation; it is something that is more a property of the algorithm than of the storage scheme.
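In code, the two distributions for a block-divided vector might look like this. This is a minimal sketch in plain C; the names (alpha, beta) and the even blocking scheme are my choices for illustration, not the paper's notation:

    /* Sketch: alpha and beta distributions for y_i = f(x_i, x_{i-1}, x_{i+1})
     * on a vector of length n block-divided over nprocs processors.
     * alpha: the indices processor p stores.
     * beta:  the indices processor p needs as input, i.e. alpha extended
     *        with the i-1 and i+1 neighbours, clipped to [0,n). */
    #include <stdio.h>

    int main(void) {
      int n = 16, nprocs = 4, size = n / nprocs;
      for (int p = 0; p < nprocs; p++) {
        int alpha_lo = p * size, alpha_hi = (p + 1) * size - 1;
        int beta_lo = (alpha_lo > 0) ? alpha_lo - 1 : 0;
        int beta_hi = (alpha_hi < n - 1) ? alpha_hi + 1 : n - 1;
        printf("p=%d  alpha=[%2d,%2d]  beta=[%2d,%2d]\n",
               p, alpha_lo, alpha_hi, beta_lo, beta_hi);
      }
      return 0;
    }

The difference beta minus alpha is exactly the data that processor p must receive from its neighbours.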

Clearer this way?

motivating example

I've been a bit more verbose. Check out the new link.

Target audience?

Who is your target audience? Theory people will not appreciate your taking established names, such as "number", and defining them to be something else (a set). Likewise, practitioners in the OpenMP community know what a processor is, and a processor does not resemble a subset of a task graph.

The paper states (under "sequential semantics") that it presents a new way to describe parallel algorithms and that it's based on mathematical concepts. That's nice, but it would be even better if the paper motivated why we need another mathematical framework for describing semantics.

Why are the boxes around Ind and Array in Section 3.1 using = instead of the three-bar version mentioned in the footnote?

Audience

1. Defining numbers as sets is standard mathematical practice, and it is only used as shorthand here.

2. Who says a processor does not resemble a subset of a task graph? If I have a hybrid MPI-OpenMP program, and each MPI node issues "#pragma omp task" a bunch of times, then you have precisely what I describe (see the sketch after this list). Except that I can actually formally describe the hybrid task graph.

3. Motivation: first, with this theory I can do interesting things, such as proving when a functional model does not need a garbage collector, or showing when an algorithm can be transformed to a "communication avoiding" (a bad term, but some famous people keep using it) variant. Second, the ultimate justification of the theory is that it leads to efficient code: with this theory I can describe parallel algorithms independently of the "mode" of parallelism on which they are implemented. Portability, in other words.

4. Audience: first of all, of course, the HPC (high performance computing) community. I think this model can make parallel programming (of data parallel algorithms in the sense I define them) more efficient. It can give the world hybrid programming in a less ad hoc way than is done currently. Ok, so I could just have written the software and never mentioned the theory. But I think the theory is an interesting story too, and it shows why I can make claims about software functionality before I actually write the software. And hey, isn't theory fun? I think my theory is elegant, new, and applicable. Not quite sure where to publish it. How on earth did Valiant ever get the BSP stuff published? That's never been implemented.

5. Boxes: yeah, I'm not always consistent in my notation. Thanks for pointing it out. I'll fix it.
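Regarding point 2, here is a hedged sketch of that hybrid structure: each MPI process runs a sequential main that spawns a bunch of OpenMP tasks. The work function is a stand-in of mine; in reality the task bodies would come from this process's subset of the task graph:

    /* Hybrid MPI+OpenMP sketch. Compile with e.g. mpicc -fopenmp. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    void work(int rank, int t) { printf("rank %d, task %d\n", rank, t); }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      #pragma omp parallel
      #pragma omp single
      for (int t = 0; t < 4; t++) {
        #pragma omp task firstprivate(t)
        work(rank, t);   /* this process's subset of the task graph */
      }
      MPI_Finalize();
      return 0;
    }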

Provide the reader with a map

The paper is currently written with a strong focus on the theory, and despite looking for it I couldn't find where the paper contrasts and compares the work with existing theories. That leaves the reader without a map when reading the paper. This paper is competing with the large pile of dense papers already in our backlog, and those papers have the advantage that they are already sitting on our desk: they were printed because they were considered valuable according to some 10 second/1 minute/30 minute heuristic.

How is your theory different from applying an existing resource usage analysis and checking whether the result is 0 for the resource of interest? How does your theory relate to existing theories? What shortcomings is it fixing? You have a strong selling point if you can prove that your theory always gives both portable and efficient code for some interesting definition of efficient. Both POPL and PLDI are good targets for work in that area.

Existing theories

"considered valuable according to some 10 second/1 minute/30 minute heuristic"


The lack of discussion of related work is a fair point. I searched high and low and wide for anything that remotely look like this and there is not a lot. There are several really new ideas here, which is why another commenter described it as "weird" and "foreign". If you have any idea how I can meet the reader halfway I'd really like to know.

Related work:
1. anything with "sequential semantics". This has been explored in languages such as HPF or UPC, but I'm not aware of any theory, and no such system has delivered serious performance.
2. distributions. Again, there are many programming systems with distributions, but no theory that I know of. And no one has found the right abstraction.
3. Bulk Synchronous Parallelism: there are some similarities, in that my kernels look like BSP supersteps. However, there are many more differences.
4. Dataflow. That was popular in the 1980s and is making a resurgence. My problem with dataflow: I have never seen a theoretical result proved about it that was remotely useful, and no one knows how to write a whole program in dataflow. Since I derive dataflow, I've solved both problems.

Unrelated work:
1. process algebras. I'm not sure what they are trying to model, but I don't see them applied to my type of applications, and their research questions seem completely different from mine.
2. CSP and anything like it with channels and/or independent processes: I don't associate anything persistent with communication, and my processes are not independent.

Does that clarify anything? Did you have any theories that you wanted me to compare against?

The abstract reels me in

The abstract reels me in until the point of "various practical concepts". Knowing at least a few of these practical concepts would give me a hint if the paper is about something that I'm interested in. I skipped the first paragraphs of the introduction because they seemed to tell me things I already knew and agreed with. It's mainly the "In this paper" paragraph that conveys information to me personally, and I still can't figure out what the paper is about after section 1.0. I get that it's about a theory, but I still don't understand what that theory buys me. Section 1.1 talks about "popular practical concepts", but I still have no idea what these practical concepts are.

Highlight the new ideas and describe what problems they aim to solve in the introduction. You obviously have good answers to questions in the forum, but these answers don't really shine through in the paper.

The paper presents a type system. Which type system is the closest relative? How do they differ? Regarding other related work: What about the work on Legion? Picking a random one of the papers: Structure slicing claims to be able to infer task parallelism which seems related to your paper's claims.

Typos:

There's a missing end-bracket in the definition of alpha on page 3.

Oper is defined with both the three-bar sign and a colon.

related work

Ok, I've greatly extended the discussion section at the end to include relations to existing concepts. New link provided above.

Type system: hah. I've made attempts to understand what type theory is and failed. A colleague who does understand that stuff tells me that mine is a fairly simple type system. That aspect is probably not worth pursuing.


Leslie Lamport

How does this relate to Leslie Lamport's 1979 paper "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", which defined the notion of "sequential consistency"?

That is more about

That is more about concurrency, right?

Might Be.

"Multiprocess computer", and " multiprocess program" seem fairly parallel to me. Also the notion of sequential consistency seems common. Naively it sounds like the papers are saying the same thing.

Most of the highly parallel

Most of the highly parallel HPC apps I'm aware of throw sequential consistency out the proverbial window for another kind of statistical (but mostly still deterministic) consistency. Minibatching, for example, intentionally ignores memory coherence but works out fine in the end.

Sequential Semantics

This paper (the subject of the thread) seems to be about a kind of sequential semantics (for example, see the first paragraph of page 2), which is why I mentioned Leslie's work.

Sequential semantics and sequential consistency sound like the same thing to me (or at least similar enough that a description of the differences would be helpful for someone familiar with the existing concept).

Right, but reading the

Right, but reading the linked paper, I don't think they are seriously going after the real performance oriented parallelism crowd. There you must choose your poison, e.g. CUDA or MapReduce, and this almost always involves relaxed ordering semantics.

Performance

Oh, I am definitely going after performance. For instance, by adopting a dataflow model I can guarantee that small messages get aggregated. (The small-message problem kills languages such as UPC: essentially the compiler cannot reason about cross-processor data dependencies.) Also, the section about "communication avoiding" is very much geared towards latency hiding to deal with communication cost.
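To illustrate what aggregation buys you (an illustrative sketch of mine, not IMP code): the halo elements owed to one neighbour get packed into a single buffer and sent as one message, instead of one tiny message per element:

    #include <mpi.h>
    #include <stdlib.h>

    /* Pack the halo elements destined for one neighbour and send them
     * as a single aggregated message, rather than nhalo tiny ones. */
    void send_halo(const double *x, const int *halo_idx, int nhalo, int dest) {
      double *buf = malloc(nhalo * sizeof(double));
      for (int k = 0; k < nhalo; k++)
        buf[k] = x[halo_idx[k]];               /* gather scattered elements */
      MPI_Send(buf, nhalo, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
      free(buf);
    }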

But if I may ask "this almost always involves relaxed ordering semantics": relaxed from what starting point?

sequential semantics

No, not at all the same thing. The term sequential semantics means that the semantics of the program is as if there is exactly one thread of execution. Example: instruction 1 is $y = Ax$ (matrix-vector product), instruction 2 is $\alpha = x^ty$ (inner product). But these single instructions can somehow marshal multiple physical processors.

One model for sequential semantics would be a single-threaded processor with an attached array processor, to which data parallel operations such as matrix-vector multiply are offloaded. Of course this is not my intended execution model, but the IMP language is consistent with it.
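Schematically, the user code looks like this (my rendering in plain C, with serial stand-ins for what would be distributed data-parallel primitives):

    /* One thread of control; each statement is a data-parallel primitive
     * that, in the intended execution model, runs on many processors. */
    #include <stdio.h>
    #define N 4

    void matvec(double y[N], double A[N][N], double x[N]) {
      for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
      }
    }

    double inner(double x[N], double y[N]) {
      double s = 0;
      for (int i = 0; i < N; i++) s += x[i] * y[i];
      return s;
    }

    int main(void) {
      double A[N][N] = {{2,0,0,0},{0,2,0,0},{0,0,2,0},{0,0,0,2}};
      double x[N] = {1,1,1,1}, y[N];
      matvec(y, A, x);               /* instruction 1: y = A x       */
      double alpha = inner(x, y);    /* instruction 2: alpha = x^t y */
      printf("alpha = %g\n", alpha); /* prints 8 */
      return 0;
    }

Note that there are no independent instruction streams anywhere in the user code; the parallelism lives entirely inside the primitives.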

Sounds the same

This is my understanding of sequential-consistency:

"... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."

vs sequential-semantics:

"One model for sequential semantics would be a single-threaded processor with an attached array processor, to which data parallel operations such as matrix-vector multiply are offloaded."

To me these are both the same. The model is that the program behaves as if it were executed sequentially, and provided the result is the same (you don't break the sequential semantics) you can parallelise as much, and in any way, as you like: through threading, multiple processes, multiple machines, vector processors, GPUs, etc.

No.

Did you see my example? Matrix-vector multiplication is an instruction. Not a subroutine call: a language primitive. That instruction is executed in parallel in the sense that its execution takes place on many processors, but on the level of user code there are no instructions that are independent-of-each-other-and-therefore-parallel.

Doesn't matter

I don't think it makes any difference whether something is a subroutine call, a language primitive, or even a single machine code instruction as far as the theory goes.

Sequential semantics != sequential consistency

I think the issue of sequential consistency is not applicable to my theory because I'm using something that looks very much like a functional model: a kernel creates an object, which can then be used as input to multiple other kernels, but cannot be altered.

I think sequential consistency is only a problem if you have mutable things. In that case you have to make sure that parallel updates happen in such a way that you get the same result as a sequential execution would.
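In single-assignment style, the point looks like this (an illustrative sketch of mine, not IMP code):

    /* Each kernel allocates and returns a fresh, write-once object;
     * inputs are const and never altered, so there are no conflicting
     * parallel updates to worry about. */
    #include <stdio.h>
    #include <stdlib.h>

    double *kernel_scale(const double *in, int n, double s) {
      double *out = malloc(n * sizeof(double));  /* fresh object */
      for (int i = 0; i < n; i++) out[i] = s * in[i];
      return out;
    }

    int main(void) {
      double x[3] = {1, 2, 3};
      double *y = kernel_scale(x, 3, 2.0);  /* x used as input ...      */
      double *z = kernel_scale(x, 3, 3.0);  /* ... by multiple kernels, */
      printf("%g %g\n", y[0], z[0]);        /* but never modified: 2 3  */
      free(y); free(z);
      return 0;
    }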

Another reason I don't think I have to worry about sequential consistency is that I don't have a sequential program that is parallelized; I start with a parallel formulation right off the bat.

(See my explanation below on what "sequential semantics" is.)

sequential consistency

I think you must have sequential consistency; I don't think it matters that you start from a parallel program. Note it says "as if the operations of all the processors were executed in some sequential order", not that there is an actual sequential program for everything. In fact, all it says is that there is a sequential program for each processor, which is true for what you are doing. I don't think mutability matters; it may be that sequential consistency is trivially proved for what you are doing, but isn't that kind of the point?

Each processor

"In fact all it says is that there is a sequential program for each processor"

Do you talk about sequential consistency in a single-threaded program? After all, processors have multiple functional units and execute instructions out of order.

"it may be that sequential-consistency is trivially proved for what you are doing, but isn't that kind of the point?"

I can live with that view.

So... any thoughts about the more interesting parts of my work?