Models for distributed parallelism

I've been reading left and right (including on this forum) looking for models of parallel computing that are relevant to what I'm dealing with, and frankly not finding much. I get the impression that most theory of parallel computing is about shared memory, and so a bunch of old ideas for SMPs get recycled now that we have multi-core chips (http://lambda-the-ultimate.org/node/3963). However, in the early 90s I had 256 distributed memory processors, a couple of years later 2048, when I started my current job 10,000 cores or so, and a year from now I'll have half a million cores. And most of the research into parallelism that I've seen is pretty much irrelevant there.

A typical parallel computation has a billion or so data points, does some data parallel operation (a very complicated type of local averaging) on them, computes some global condition, and then repeats this. In terms of "sequential consistency" (trivial: memory is distributed), completion (guaranteed), resources (none: everything is in memory), &c this is all trivial. On the other hand, moving data around is very expensive and the key to how well your algorithm performs.

And I'm not finding any models that explicitly talk about moving data. Do they exist?

My current thinking is that I model an algorithm as a sequence of BSP-like supersteps, where each superstep is a bipartite graph: the input data set, the output data set, and the edges are the (data parallel) instructions that take one to the other. The big point is that I can now talk about how the three sets (input, output, instructions) are distributed, so I have a formal way of talking about communication.
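To make this concrete, here is a minimal sketch (pure Python; the representation and names are mine, not a settled formalism) of one superstep as a bipartite graph with distributions attached, from which the required communication can be read off:

    # Hypothetical sketch: one BSP-like superstep as a bipartite graph.
    # Each instruction consumes some input indices and produces one output index.
    # Distributions map data indices and instructions to processors.

    def superstep_communication(instructions, dist_in, dist_out, dist_instr):
        """instructions: {instr_id: (list_of_input_indices, output_index)}
           dist_in, dist_out: data index -> processor owning that item
           dist_instr: instr_id -> processor executing that instruction
           Returns the set of required transfers (from_proc, to_proc, item)."""
        transfers = set()
        for instr, (inputs, output) in instructions.items():
            p = dist_instr[instr]
            for i in inputs:                  # inputs must be present where the instruction runs
                if dist_in[i] != p:
                    transfers.add((dist_in[i], p, ('in', i)))
            if dist_out[output] != p:         # the result must end up where the output lives
                transfers.add((p, dist_out[output], ('out', output)))
        return transfers

Given the three distributions, the communication of a superstep becomes an explicit, inspectable object rather than a side effect of execution.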

Does this sound remotely familiar to anyone? Are there models of parallel computation that are relevant to scientific computing?


BSP and LogP *were*

BSP and LogP *were* introduced to reason about distributed memory (popularly compared to PRAM, which gave crazy algorithms without it). You might want to look at recent SPAA/SC papers to see more modern practices. There is a lot of theoretical attention to communication, e.g., bandwidth, when designing asymptotically better BLAS etc. algorithms. BSP and LogP are nice because they are simple; the closer you get to reality, the more parameters.

BSP

I've heard of LogP but never looked into it. Will do. Thanks.

In a way, BSP is designed to obscure the architecture, but in a different way from PGAS and such: in BSP anyone can send to anyone one-sidedly, but this is to increase programmer productivity, not to make the code more efficient. The synchronization at the end of a superstep is there to make sure that everyone has what they need. This means that BSP says "I'm not even going to try optimizing communication". The inevitable load imbalance this gives is then addressed with a probabilistic argument: oversubscription plus random placement very likely gives you a good balance.

I'll reserve judgement on whether oversubscription is a good idea, but I'll just remark that it probably increases message volume, and unless you have a very clever communication layer (as in: one that is much more clever than anything that exists now) you'll have lots of unaggregated messages that should have been aggregated. Given that some applications these days spend two-digit percentages of their time in communication, you don't want that.

I'm familiar with the work on better BLAS. It's a different type of bandwidth, and one that I'm not thinking about yet because it's outside of explicit programmer control.

Reading the replies, I think

Reading the replies, I think I may have misunderstood: are you looking for an analytic/complexity model for describing performance of parallel algorithms, or semantic programming primitives suitable for working within one? E.g., LogP model vs. map/reduce language.

Semantic

I'm looking for programming models. BSP & Mapreduce and such.

map reduce

You are absolutely right that most of the discussion of parallelism is about going from one active process per program to a dozen. When we start getting into hundreds, or hundreds of thousands, that's another area entirely.

I'd say the biggest innovation in that sort of massive parallelism (and some would argue it isn't actually an innovation) is Google's implementation of map-reduce. Essentially Google provided an efficient framework such that, if you can:
1) partition the problem into lots of parts
2) do some work on the parts individually
3) combine the modified parts in an associative way

then you have a generic computational framework. The whole modern NoSQL movement is pushing this idea and partitioning off large data-warehouse queries effectively. I think map-reduce provides an effective framework for talking about most problems generically. Some can't be easily partitioned, some are too sparse, and some don't have an associative combinator, so it doesn't solve all issues. But...
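As a rough illustration (a toy sketch in Python, not Google's implementation), those three steps look like:

    from functools import reduce
    from itertools import islice

    def partition(data, n):
        """1) partition the problem into lots of parts (of size n)."""
        it = iter(data)
        while True:
            part = list(islice(it, n))
            if not part:
                return
            yield part

    def map_reduce(data, work, combine, identity, part_size=1000):
        """2) do some work on the parts individually,
           3) combine the results with an associative operator.
           Each partial result could be computed on a different node."""
        partials = (reduce(combine, map(work, part), identity)
                    for part in partition(data, part_size))
        return reduce(combine, partials, identity)

    # e.g. counting records: each record maps to 1, combine with +
    total = map_reduce(["a", "b", "c"], work=lambda rec: 1,
                       combine=lambda x, y: x + y, identity=0)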

Paralation Model (Gary Sabot/Guy Blelloch)

As a side note, Map/Reduce wasn't totally innovative here. The Paralation Model came first. The essential idea is that local computation should be cheap, while remote computation may by default be arbitrarily cheap or expensive. Remote computation could be tuned by organizing data differently and defining custom 'move' operations.

The Paralation Model is a bit clumsy and dated, for various reasons, but from it came the idea of nested data parallelism languages like NESL.

scale

The difference between the dozen processes of current parallelism research and the thousands in scientific computing isn't just one of scaling: it's one of (often loosely coupled) shared-memory thread programming vs tightly coupled distributed-memory message passing. The problems of thread programming don't exist in that world, but the cost of data movement is paramount, while it plays no role in thread programming, or at least is hardly considered.

Mapreduce is cool, but mostly for unstructured problems. For instance, I can do pagerank in mapreduce: it's a sparse matrix-vector multiply. Since the problem is so unstructured in the Google context it doesn't matter that the algorithm basically reads "combine everything that has the same row index". However, in scientific computing the set of things to be combined is structured (and this structure is used many times over!) so the mapreduce approach leaves much easily discovered information unused.

Semantic models of distribution

For semantic treatments of distribution, there was quite a lot of work done on distributed variants of the pi-calculus (e.g. Matthew Hennessy, A Distributed Pi Calculus, CUP 2007) as well as the ambient calculus.

For models of distributed object systems, you could have a look at my FOOL 2000 paper A Distributed Object Calculus. It has an explicit model of serialization, and distinguishes between serializable and non-serializable objects in the type system. I dare say if I were writing the paper today, I'd call the former Data Transfer Objects, but DTOs weren't quite so omnipresent back in 2000.

Distribution is the least of my worries

"For semantic treatments of distribution". Thanks for the reference to your paper. Unfortunately, I get stranded at the first formula. I have no idea how to read it.

But I suspect you and I are talking in very different contexts. The concept of "trust" does not exist in my universe. On the other hand thread migration carries cost, and I don't see you modeling that.

Let me give you an example of distributions in my world and why they are not the whole story. An important operation is the sparse matrix-vector product (mvp) with a large distributed matrix, say 100 million rows over 1000 processors, and most elements are zero. From the context (finite element methods) I know that in the mvp every processor will communicate with only a few others.

1. the communication is described by a graph problem (a sparse matrix is in effect a graph), every processor gets a subgraph, and the best distribution is the one that minimizes the edge cuts from the graph partitioning. So the distribution part of the story is fairly easy.

2. here's where most parallel programming languages fall down: this "each-to-few" communication is cheap and performed many times. On the other hand, discovering who communicates with whom is pretty expensive. So if your language can't form an object that describes the communication, it has no alternative but to do the discovery of the pattern over and over again. Or move the task to middleware, which is just as bad.

For this reason, I claim that a parallel programming language should have communication as a first class object. In languages that I've seen (UPC, Chapel) communication "is just what happens", meaning that it's implicitly defined, or rather, not defined at all. Those languages concern themselves with distribution, not with communication. Which, see the mvp example, is not good enough.
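To illustrate the "first class object" point with a toy sketch (pure Python; the names are mine): the each-to-few pattern is discovered once, stored, and then reused by every subsequent mvp instead of being rediscovered.

    def build_comm_pattern(my_rows, owner, my_rank):
        """Discover, once, which remote vector entries this processor needs.
           my_rows: {row index: list of column indices with nonzeros}
           owner(j): processor that owns vector entry j.
           Returns {remote processor: sorted list of needed entries} -- a
           first-class communication object that every mvp can reuse."""
        needed = {}
        for cols in my_rows.values():
            for j in cols:
                p = owner(j)
                if p != my_rank:
                    needed.setdefault(p, set()).add(j)
        return {p: sorted(js) for p, js in needed.items()}

    # Each mvp then just executes the stored pattern (receive the entries js
    # from every p in the pattern) rather than rediscovering who talks to whom.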

Data Parallelism

The types of computation that you describe have been looked at in many different dataflow models. Scientific code (or throughput, or media code depending on your field) can be represented well in these models because, and here I will have to abuse language a bit:
1. The locality of reference is high
2. The global regularity of access is high
Which is generally referred to as abundant data-parallelism. Scan primitives were invented in this setting and are a good starting point. There has been a ton of work in that area since then. Array SSA is another approach, for a different set of constraints on the data structures and access patterns in the problem. Taking a rough guess at what you are looking for, I would suggest that you read about NESL first, as it has the same focus on the memory management costs in compositions of vector parallel operations.
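For readers who haven't met them: a scan (prefix sum) under an arbitrary associative operator is the primitive referred to above. A minimal sequential sketch in Python (parallel versions scan locally on each processor and then combine the per-processor totals with one small global scan):

    def exclusive_scan(xs, op, identity):
        """out[i] = xs[0] op xs[1] op ... op xs[i-1], with out[0] = identity."""
        out, acc = [], identity
        for x in xs:
            out.append(acc)
            acc = op(acc, x)
        return out

    # exclusive_scan([3, 1, 4, 1], lambda a, b: a + b, 0) == [0, 3, 4, 8]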

What kind of questions do you want a formalism about communication costs to provide answers for?

SSA

Thanks for your reply. Scan primitives say nothing about communication. Likewise, Array SSA is only about the semantics of parallel operations, not about the communication that is needed to support them.

What do I want a formalism to answer? Good question. First of all, I'll be happy to see any formalism that acknowledges that communication exists. Just about the only model I know of is BSP, which answers the question "how do I achieve load balance". What do I want? I want a way of expressing that different distributions give different communication, and a way to talk about that communication, ideally quantitatively, but I'll take qualitative for starters.

Here's a report of mine where I sketch a possible model.

Rephrasing your problem

I read your report, and wish to offer an alternative take on what you are saying.

First, I am familiar with MPI and the general strategy championed by MPI lovers at various HPC centers, such as Henry Neeman's advice. I agree with you that this low level of abstraction works well for tuning code, but does not offer the productivity benefit of a high level language. Also, MPI solutions suffer "code rot" and do not respond well to forklift infrastructure upgrades.

Second, there is IMHO a basic reason why nobody has designed a parallel programming model that talks about communication: because it is viewed as an implementation detail.

Third, there was an effort in the 1980s in term rewrite systems to define "architectural styles" as progressively more deterministic rewrite theories. The book Parallel Program Design is one reference.

Fourth, these problems lie at the intersection of information theory and control theory, where patterns of information dictate optimal control. No languages today express these optimization problems adequately, and there are uncountably many control laws possible, making synthesizing solutions from first principles by searching for the right set of laws, even for simplified problems, NP-Complete. I asked about this a while back on LtU, although I did not state my use case (which is exactly what you are talking about). Please see Seeking papers that unify control theory and programming language design

Control?

I have no idea what you're driving at, and after doing some minimal reading about control I am none the wiser.

Ownership

The model described in your report seems to deal more with the concept of ownership. Each datum in the system exists in a single location, and on demand is moved to another location in order to be accessed. There seems to be a trivial assumption that data can always move from p_i to p_j: there are no criteria that define whether a move is possible or not; data can always flow between any two processors in the system. If I have understood your model correctly then all data moves, it is not copied, as a result of which each datum can only exist in a single location in your model. This seems reminiscent of what can be described using a linear logic.

Your comments below seem to indicate that you would like to relate which types of programs are realisable to whether or not they can be constructed within your model. This suggests that the predictive power you would like to obtain is for questions of the form "Is this pattern of communication possible?". But as I've noted above it is trivially true that data can always move between any two processors. If this is true then I cannot see how you can rule out any patterns of communication with your model.

Obtaining quantitative answers to questions of cost requires a model of the cost to move a datum from p_i to p_j within the system. In real systems this tends to partially depend on i and j, and their distance within the interconnect topology, and partially depends upon the state of the interconnect at that point in time. If the cost varies according to the state of the interconnect then you may need to consider integrating a model of time, and possibly how you will consider queuing within the interconnect.

Have you looked at models of caches, in particular partitioned caches, as a possible comparison with your work? There is an analogy between lines being locked to a particular cache and data being located on a processor / eviction of a line and transmission of a datum / fetching a line and receiving a datum.

Data moves

Thanks for your reply. You are right that my model talks a lot about ownership. However it's somewhat inaccurate to say that data moves or can be moved, although everyone does say that. In parallel programming the idea is that every data item has a sole owner, and if someone needs it who doesn't have direct access, then it can be copied. If the other guy changes that copy, that has no meaning, since the authoritative copy is still elsewhere.

Think of it this way: you're working on an array of length 100, you have 10 processors, so every processor stores 10 elements, let's say a contiguous 10 elements. If you want to do the addition x[i] := x[i]+x[i-1]+x[i+1] then you need to obtain copies of the elements to the left and right of your leftmost and rightmost element. (Distributed memory: you can't just read it, it has to be copied over the network.) So it's not so much that data gets moved in the sense of "now it's here, now it's there" as that it gets copied.
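A minimal sketch of that exchange using mpi4py (assuming it is available; the decomposition and names here are mine):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    N = 100                      # global array of length 100
    n = N // size                # 10 contiguous elements per processor when size == 10
    x = [float(rank * n + i) for i in range(n)]   # my local piece

    left  = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    # Obtain copies of the neighbours' boundary elements; the authoritative
    # values stay with their owners, only copies travel over the network.
    ghost_left  = comm.sendrecv(x[-1], dest=right, source=left)   # left neighbour's last element
    ghost_right = comm.sendrecv(x[0],  dest=left,  source=right)  # right neighbour's first element

    # Now y[i] = x[i] + x[i-1] + x[i+1] can be computed locally, using
    # ghost_left / ghost_right at the ends (None at the domain boundary).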

Here's something that really puzzles me. If you take the parallel algorithm of the previous paragraph and implement it on multiple processors then every processor is an independent process and it has its own data. However, in a wider sense there is just one process: "global compute of pointwise average" applied to just one data set that happens to be distributed. I know of no formal models that express this global view. They all talk about the individual processes as if they are independent. Ok, in my field we call it SPMD: Single Program Multiple Data, which means that each processor is running (a copy of) the same executable, just on different data. But giving it a name is a far cry from a formal model. In my proposed model there is a sense of an overall process: I have one big bipartite graph of instructions, and each processor gets a subgraph of that.

So one thing I want a model to say is "there is one big process, and the individual processes are subsets/projections of it onto a single processor". Are there really no models that express this fact of life? Am I really the first one to do the obvious?

You remark on my implicit assumption that data can move between any p_i and p_j. Correct. In a later version of my report I'm trying to model the case of distributed memory clusters where each node has a GPU. (If you're not familiar with the hardware: cluster nodes are connected by a network so they can talk; a GPU is connected to its host CPU by a bus, so it can only talk to that host.) The problem is then how to express an algorithm that treats the GPUs as connected. My model can do that by introducing explicit edges between CPUs and between CPU & GPU. The physical communication between CPUs that you need is then a quotient graph, obtained by dividing out the CPU/GPU links from the logical graph that describes the algorithm.
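A toy sketch in the flavour of that construction (pure Python; the representation is mine and only meant to illustrate collapsing logical GPU-to-GPU edges onto the physical links): a logical transfer between two GPUs is realised as GPU-to-host, host-to-host over the network, host-to-GPU.

    def physical_path(logical_edge):
        """Nodes are ('gpu', i) or ('cpu', i); GPU i can only talk to CPU i.
           Expands one logical edge into the physical links that realise it."""
        (akind, a), (bkind, b) = logical_edge
        path = []
        src = ('cpu', a) if akind == 'gpu' else (akind, a)
        dst = ('cpu', b) if bkind == 'gpu' else (bkind, b)
        if akind == 'gpu':
            path.append((('gpu', a), src))     # over the local bus
        if src != dst:
            path.append((src, dst))            # over the network
        if bkind == 'gpu':
            path.append((dst, ('gpu', b)))     # over the local bus
        return path

    # physical_path((('gpu', 0), ('gpu', 3)))
    #   -> [(('gpu', 0), ('cpu', 0)), (('cpu', 0), ('cpu', 3)), (('cpu', 3), ('gpu', 3))]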

However, it's not so much that I want the model to say that a certain connection is not possible as that I'm having to hardwire this fact into the model.

My statement "you can't get there from here" probably confused you. I was trying to say that certain acts like accessing a shared queue are simply not possible directly, in distributed memory. And rather than the model saying that it's impossible, I want such acts to be impossible to express in the model.

Caches: yep. Good point. There are definite connections to what I'm trying. The big difference is that I'm aiming for a programming model, and data movement through the cache hierarchy is not programmable.

Switch of terminology

So one thing I want a model to say is "there is one big process, and the individual processes are subsets/projections of it onto a single processor". Are there really no models that express this fact of life? Am I really the first one to do the obvious?

I'm not familiar with any models that are used to reason about data migration in the way that you desire. However there may be some similar work using different terminology in the Formal Verification area. In particular, techniques for stepwise refinement would be an appropriate way of building this type of model. The large process at the top would abstract away the details of location and ownership, but then you could define refinements of this model that introduce the details of location, and use simulation arguments to show that the refinements are sound. Your overall goal of showing that certain things are not expressible would then be achieved by showing that there is no way to refine the abstract system (i.e. the shared queue example) into the concrete system that includes the copies. The more quantitative issue of cost could then be tackled as a direct measurement of the copies in the representation of the concrete system.

Really the key issue is tackling how to rule out specific patterns of data-movement: a formalism that doesn't allow those patterns to be expressed is attempting to exclude that set of possibilities syntactically, which seems quite difficult. By using a refinement process you are working with something closer to a program transformation where you can define that particular abstractions of a distributed algorithm can be partitioned into concrete instances. That is, you are constructively defining the partitioning steps that are valid and then ruling out all of the unreachable configurations of concrete processes.

This description is quite hand-wavy and vague, as it would require some serious thought to work out how to do it exactly. The direct summary would be that I believe what you are trying to do is novel, and there may be some useful techniques in stepwise refinement to help you get there. On a semi-related tangent, this paper on language semantics for data movement might be relevant. In particular, programs in Icarus have strong guarantees that their semantics is equivalent whether executed as one large process or distributed into many pieces.

Simple example

Here's what my model can do. Suppose the operation under consideration is circular shift of a vector: y_i = x_{i-1}. Suppose I have one processor per vector element, and I give y_i to processor i, and I let the instruction "y_i=x_{i-1}" be done by processor i.

Now if processor i also has x_i, then communication will happen. On the other hand, if I distribute x in such a way that processor i has x_{i-1} then no communication needs to happen.

This sounds utterly trivial but it's not: the communication is formally derived. Communication is the set of all you need minus the set of all you already have. (Note!!! Communication is an explicit object: it's a set that is constructed by formal reasoning.)

In other words my model can formulate a lemma: given distributions so-and-so, this algorithm can be done without communication. Given other distributions, there will be communication, and it will be between processors such-and-such with a message volume so-and-so.
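A toy version of that lemma in Python (the representation is mine): communication is literally computed as "needed minus owned", and the right distribution makes the set empty.

    def shift_communication(n, dist_x, dist_y, dist_instr):
        """Circular shift y_i = x_{i-1} (indices mod n).
           dist_x[j], dist_y[i]: processor owning x_j resp. y_i.
           dist_instr[i]: processor executing instruction "y_i = x_{i-1}".
           Communication = what each instruction needs minus what its processor owns."""
        messages = set()
        for i in range(n):
            p, src = dist_instr[i], (i - 1) % n
            if dist_x[src] != p:                 # needs x_{i-1} but doesn't own it
                messages.add((dist_x[src], p, ('x', src)))
            if dist_y[i] != p:                   # result must live where y_i lives
                messages.add((p, dist_y[i], ('y', i)))
        return messages

    n = 4
    ident = list(range(n))                       # processor i owns x_i, y_i, instruction i
    print(shift_communication(n, ident, ident, ident))     # n messages: every x_{i-1} is remote
    shifted = [(j + 1) % n for j in range(n)]    # x_j lives on processor j+1: processor i owns x_{i-1}
    print(shift_communication(n, shifted, ident, ident))   # empty set: no communication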

So just like language theorists reason "if the dependencies between loop iterations are such and such, then the loop can be vectorized", I am reasoning "if distributions of data and work are so and so, then communication will be suchlike".

Take a look at the X10 programming language

X10 is a high-level language for supercomputing based on a "partitioned global address space" model. The idea is that each processor has an affinity for a particular region of memory, and the programming language's type system requires you to make cross-region operations explicit in the language.

Unlike MPI, the communication topology is not explicit (ie, there's no explicit message-passing), but there is enough information in the source program that the compiler can generate code with good communication properties. The net result is that the language can be higher-level than MPI, while retaining many of the performance benefits.

PGAS

As with all PGAS languages: I don't believe it. I've had to referee papers that show good performance with UPC or CAF or whatnot, and they invariably engage in considerable code massaging. If you program such languages naively 1/ they create threads all over the place which leads to considerable OS overhead 2/ latency hiding is very hard (in X10 it leads to thread creation) 3/ message aggregation is left as an exercise to the middleware. Wanna guess how much I trust middleware?

Oh, to echo previous posters in this thread: what does the PGAS model say? What is the theory behind it?

Infrastructure-level solutions for parallel programs

I would think infrastructure level solutions such as Cilk or KAAPI are aware of communication costs and optimise for them.

Cilk

First of all, Cilk is for multi-threading, meaning shared memory, so there is not a lot of communication. Secondly, you're saying that the implementation layer of Cilk is probably aware of communication. Could very well be, but what I'm looking for is a model, any model, that reasons explicitly about communication. Cilk does no such thing and leaves it up to the lower layers: it leaves communication implicit.

Message passing?

Does message passing where you explicitly send messages to communicate fulfill your criteria?

alpha beta

Message passing is indeed a (very low level) model for distributed parallelism. It says that every data transfer has a one time cost (alpha=latency) and a per-byte cost (beta=1/bandwidth). You can even derive topology-independent lower bounds on the cost of collective operations that are (close to) attainable.
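A back-of-the-envelope sketch of that model in Python (the constants are made-up placeholders):

    import math

    alpha = 1.0e-6    # per-message latency, seconds (placeholder value)
    beta  = 1.0e-9    # per-byte cost = 1/bandwidth, seconds/byte (placeholder value)

    def ptp_cost(nbytes):
        """Cost of one point-to-point transfer in the alpha-beta model."""
        return alpha + beta * nbytes

    def tree_broadcast_cost(nbytes, p):
        """Cost of a simple binary-tree broadcast: ceil(log2(p)) rounds of
           point-to-point messages. Its latency term matches the
           topology-independent lower bound of ceil(log2(p)) * alpha;
           pipelined algorithms do better on the beta term for long messages."""
        return math.ceil(math.log2(p)) * ptp_cost(nbytes)

    print(tree_broadcast_cost(8 * 1024, p=1024))   # e.g. 8 KB to 1024 processors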

Unfortunately it is not terribly informative, it doesn't help programming a whole lot (it's like coding assembly), and it's very hard to reason about. Trying to analyze an MPI program is about as bad as trying to derive invariants on a spaghetti-coded Fortran66 program. Probably worse.

I was hoping someone had come up with something better in the almost 20 years that MPI has existed, or the more than 20 years that message passing has existed in a general sense.

So, just like the programming languages community can prove useful stuff about functional languages as opposed to Fortran, I'm looking for something more informative than the MPI model.

Bloom has some of the

Bloom has some of the characteristics you specify, but I doubt it's exactly what you're looking for. On LtU here.

temporal logic

I wish I knew how to describe in a single sentence what the difference is between this sort of system (talking about distributed rather than parallel, temporal logic, events, consistency, transactions) and what I'm looking for.

You're not finding any

You're not finding any models that explicitly talk about moving data. But what would you like these models to state, and how? Should the model be aware of the amount of data being transferred, or should it state in side-comments what the cost of moving the data is?

Say what?

I'm looking for a model that reflects the reality of distributed memory. Not the distributed shared memory of NUMA, distributed memory as in MPI.

In shared memory models, any process can say "x := 5"; in distributed memory either x has to belong to you (separate address spaces) or you first have to do an explicit data transfer (distributed ownership of a single address space). So I want a model that 1/ says "you can't get there from here", and 2/ has explicit mechanisms for moving data. As an example of the first point, enqueueing tasks (to name but one activity that is popular in models that I've seen) is not possible, because a queue cannot exist in the sense that anyone can access it just like that.

What should a model "say"? No idea. What does a functional model "say"?

Maybe a model should say something quantitative. PRAM seems to make quantitative statements. Maybe a model for distributed computing should say that broadcasting a data item costs at least log(p) time, and "Concurrent Read" is impossible. But probably I would be happy enough if the model simply had no way of expressing CR.

I guess it's not so much a matter of saying something, as not being able to say certain other things.

Data is somewhere, actions are somewhere, and if an action in location x involves data in location y, then there should be an explicit thing that says that data has to be moved. Accessing random data from random tasks should not be expressible. The actual cost of the movement I'll grant you, though it would be nice if it were possible to express that task migration is lots and lots and lots more expensive than data migration. Again, maybe task migration should be an explicit thing in the model. So not just saying that task t has a location p, but if later it has location q, then there has to have been an explicit act in the model of moving the task.
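A toy sketch of that requirement (pure Python; names mine): a datum can only be read at its current location, and the only way to make it readable elsewhere is an explicit, recorded act of copying.

    class DistributedDatum:
        """A datum with a single owning location. Reading it anywhere else is
           simply not expressible, except through an explicit transfer."""
        def __init__(self, value, location):
            self._value, self._location = value, location
            self.transfers = []                  # the explicit acts of movement

        def read(self, at):
            if at != self._location:
                raise RuntimeError("you can't get there from here: datum lives at "
                                   f"{self._location}, not {at}")
            return self._value

        def copy_to(self, dest):
            """The explicit communication act, whose cost can now be charged."""
            self.transfers.append((self._location, dest))
            return DistributedDatum(self._value, dest)

    x = DistributedDatum(5, location=0)
    # x.read(at=3)          # error: remote access cannot be expressed
    y = x.copy_to(dest=3)   # explicit movement, recorded in x.transfers
    print(y.read(at=3))     # 5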

As I said in a comment: communication (and task migration) should be a first class object.

And then I want the model to be theoretically powerful. It should be to MPI as Lisp is to Fortran. Unfortunately, as a numerical analyst I'm not used to thinking about the theoretical properties of programming models. Programming is just something I do. You tell me what Lisp has over Fortran, and I'll tell you what I'm looking for over MPI.

Kell Calculus or Concept Oriented or RDP

Both of these models handle distribution explicitly:

  • Kell Calculus [1]
  • Concept oriented programming [2][3]

I am also developing a paradigm (Reactive Demand Programming) that enables developers to control distributed dataflows. I leverage two orthogonal mechanisms:

  1. There is a separation in `time` by use of asynchronous product and sum types in a dataflow. Elements in an asynchronous product may have different latencies (for communication and real-time computation). Explicit synchronization is necessary to combine values from asynchronous products. Asynchronous products effectively represent slices in parallel pipelines that can only be combined with explicit synchronization.
  2. The separation in `space` is protected by annotating the type of a signal with its location. Explicit capabilities to hop between hosts have a high latency and enforce a serializable signal type.

I rejected Kell calculus and similar models (where distribution is part of behavior semantics) because I desire a very nice security property: separation of location and authority. But I can still use phantom types and similar annotations to prevent `accidental` communication between hosts.

I note that support for separation of `parallel pipelines` seems nearly as valuable on a local machine as it does in the distributed scenario.

Conway's Law

Few folks think about the sort of questions you're asking, and good answers are pretty long, starting with a discussion about the relation of language and metalanguage. (I'm trying to think of a short treatment of the latter. But basically, you want an informal set of stories defining the context in which specific entities are later analyzed in any formal way. You need an awful lot of meta-language back-story to get what you want, and a lot of it will sound like brainstorming and operations research.)

So this post is about a smaller thing: Conway's Law. It's basically a tendency to create technical systems with as many gross parts as the number of human groups involved. That is, division of labor in human groups gets replicated in technical architecture. Here are some links:

http://c2.com/cgi/wiki?ConwaysLaw
http://en.wikipedia.org/wiki/Conway%27s_Law
http://www.melconway.com/research/committees.html

What are the effects of Conway's Law on designing distributed system models? You can probably think of several. (A subtle one is an idea of "fish don't perceive water.")

If I have time, I'll write a post later about probability of 1) receiving distributed messages, 2) without transmission error, 3) without errors in original sender data, and 4) without errors in detecting presence of errors.

model making suggestions

(continued) Here's my advice on model making. Enumerate 1) entities that can exist, 2) properties they can possess, 3) effects of those properties, 4) conditions of compatibility and/or conflict, 5) patterns of possible changes, and 6) specific operations of interest.

Much of that amounts to definitions expressed in meta-language (e.g. informal plain language), so first order predicate logic applies very little except to the extent definitions contradict one another. This step is descriptive and amounts to taxonomy, giving shape to things you can say in your model using such definitions.

Inside that you can express a language or system aiming to make target goals true.

In distributed computing systems I often describe everything as graphs, as a nearly always applicable lowest common denominator. Entities in your graph(s) will have an idea about the shape of the graph and changes inside it, including cost to access neighborhoods of locality, where locality is generally fractal. As usual, per Phil Karlton, naming things and cache invalidation are hard parts, since they involve signal propagation in a world where no one has a guaranteed correct global view.

You can try to derive a theory of what happens when no errors occur. Then, assuming a nonzero error rate actually occurs everywhere, you can try to put bounds on what can fail in response to errors.