Memory Models: A Case for Rethinking Parallel Languages and Hardware, CACM, August 2010

Memory Models: A Case for Rethinking Parallel Languages and Hardware by Sarita V. Adve and Hans-J. Boehm

This is a preprint of the published version.

The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of shared variables, is at the heart of this programming model. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in both hardware architecture and programming language specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current popular languages and systems that prevent achieving the vision of structured and safe parallel programming.

This paper discusses the path to the above convergence, the hard lessons learned, and their implications. A cornerstone of this convergence has been the view that the memory model should be a contract between the programmer and the system - if the programmer writes disciplined (data-race-free) programs, the system will provide high programmability (sequential consistency) and performance. We discuss why this view is the best we can do with current popular languages, and why it is inadequate moving forward. We then discuss research directions that eliminate much of the concern about the memory model, but require rethinking popular parallel languages and hardware. In particular, we argue that parallel languages should not only promote high-level disciplined models, but they should also enforce the discipline. Further, for scalable and efficient performance, hardware should be co-designed to take advantage of and support such disciplined models. The inadequacies of the state-of-the-art and the research agenda we outline have deep implications for the practice, research, and teaching of many computer science sub-disciplines, spanning theory, software, and hardware.


Causal consistency is the answer to the memory model conundrum

Causal consistency is the answer to the memory model conundrum. See the section on causal consistency in Actor Model of Computation.

Also, variable races are impossible in a modern concurrent programming language like ActorScript.

Related paper

Data Structures in the Multicore Age is in the current issue. That link hits a pay-walled version - our library subscription covered it but I can't find a more accessible copy.

The paper analyses how far we can scale algorithms if we are limited to a concurrency primitive aimed at shared memory models (CAS in this case). Rather than looking at alternatives it focuses on analysing exactly how far we can go using one particular memory model for one particular data-structure. Both papers complement each other's arguments.

Compare and Swap is not the answer

The compare and swap instruction is not a scalable solution to coordinating the many cores.

Yes

That is the point of the paper. They expand on the point in a little bit more detail.

Transactional memory doesn't help the manycores

Transactional memory (which is recommended in the paper) is not a scalable solution to coordinating the many cores.

"Citation needed"

Do you have any sources for that statement?

Because I don't quite buy your argument about eventual consistency either.

I looked into using EC for collaborative diagram drawing; it doesn't work right out of the box. You still need complex rules, etc. I haven't tried to apply it to memory architectures, though, but I'm sure I wouldn't get far.

EC?

Where did Hewitt make an argument about eventual consistency?

My bad.

I confused Causal Consistency and Eventual Consistency.

Scalability of Transactions

It is true that transactions need to be kept small, otherwise you'll start running into issues of starvation, convoying, priority inversion, or inefficiency (high rework ratio). But developers can do a great deal with just small transactions...
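
To make "small transactions" concrete, here is a minimal sketch, assuming GCC's experimental -fgnu-tm support for transactional memory (the Account type and transfer function are invented for the example):

    // A minimal sketch of a "small transaction", assuming GCC's experimental
    // transactional memory support (compile with: g++ -fgnu-tm transfer.cpp -c).
    // The Account type and transfer() are invented for the example.

    struct Account {
        long balance = 0;
    };

    // The transaction touches exactly two memory locations and does no I/O,
    // so conflicts are short-lived and the rework ratio stays low.
    void transfer(Account& from, Account& to, long amount) {
        __transaction_atomic {
            from.balance -= amount;
            to.balance   += amount;
        }
    }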

I gave transactions a try a while back. The trouble I had with them is that transactional semantics don't compose or integrate in a convenient way with IO effects (sensors, actuators, foreign services), streaming data (e.g. audio and video stream synchronization), and observer patterns.

I don't think it's entirely fair

I don't think it's entirely fair to say that they "recommend" TM. They only mention TM at the very end, and it's not really the point of the paper:

As we go forward, we will also need to take into account the evolution of hardware support for synchronization. Today's primary construct, the CAS operation, works on a single memory location. Future architectures will most likely support synchronization techniques such as transactional memory, allowing threads to instantaneously read and write multiple locations in one indivisible step. Perhaps more important than the introduction of new features like transactional memory is the fact that the relative costs of synchronization and coherence are likely to change dramatically as new generations of multicore chips roll out. We will have to make sure to consider this evolution path carefully as we set our language and software development goals.

Oversimplifying

Actually modern processors have become vastly more efficient at executing compare-and-swap instructions. With more cores on the same die and sharing the same L2 and/or L3 caches, many CAS operations can be done directly in the cache. And of course if the processor already has exclusive ownership of that line in its L1 cache then it's essentially a store operation.

But of course CAS is only as scalable as the application's architecture. If your application uses CAS on a few rarely-contended locations in order to allocate large blocks of non-overlapping work to different workers, then one could argue convincingly that CAS scales well. But if your application's behavior depends on a large number of CAS operations on highly contended locations, it is obviously not going to scale well. Then again, even if you weren't using CAS, an application with a lot of updates to shared cache lines suffers severe penalties from cache coherence anyway.
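
For the first case, a rough C++ sketch of what "a few rarely-contended locations allocating large blocks of non-overlapping work" can look like (the names and chunk size are made up; a CAS loop would behave much like the fetch_add used here):

    // Sketch of the low-contention pattern: a single atomic counter hands out
    // large chunks of non-overlapping work, so the contended cache line is
    // touched only once per chunk.
    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    constexpr std::size_t kTotalItems = 1000000;
    constexpr std::size_t kChunkSize  = 10000;

    std::atomic<std::size_t> next_item{0};

    void worker(std::vector<int>& data) {
        for (;;) {
            // One atomic read-modify-write per chunk of 10,000 items.
            std::size_t begin = next_item.fetch_add(kChunkSize);
            if (begin >= kTotalItems) return;
            std::size_t end = std::min(begin + kChunkSize, kTotalItems);
            for (std::size_t i = begin; i < end; ++i)
                data[i] += 1;  // non-overlapping work, no further synchronization
        }
    }

    int main() {
        std::vector<int> data(kTotalItems, 0);
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t) workers.emplace_back(worker, std::ref(data));
        for (auto& w : workers) w.join();
    }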

Strange Comment

This comment in the intro struck me as strange:

The memory model defines an interface between a program and any hardware or software that may transform that program (e.g., the compiler, the virtual machine, or any dynamic optimizer). It is not possible to meaningfully reason about either a program (written in a high-level, bytecode, assembly, or machine language) or any part of the language implementation (including hardware) without an unambiguous memory model.

I see this as strange because it is completely backwards. The memory model (and associated definition of consistency) can define the transformations of the program that respect correctness. But does this not seem like a convoluted way to achieve a result? In a sequential model we define correctness as a transformation that produces the same I/O mapping (or respects some other definition of observable behaviour). Given a set of pieces of code to be executed in parallel, the set of possible orderings can be mapped onto a set of computed results. A correct transformation preserves this set, nothing more, nothing less.

It's not even remotely a new idea. It has been known for decades that the semantics of message passing models are easier to reason about than shared memory models. Looking at the argument around the code in Figures 4a and 5 we can see serious contortions to try and avoid this basic fact. The argument around Figure 5 seems to justify this: if the possible schedules of the code allow different observable outcomes then the code is inconsistent. If a correctness-preserving transformation produces code that is inconsistent then the problem is with the assumption that the original code worked. Deliberately specifying the model to allow these cases seems horrific. Under most assumptions of scheduling / parallel behaviour X can differ in both reads; trying to warp the assumptions about how the underlying scheduler behaves in order to make something else true is a recipe for disaster.

I can't see a real result in this paper, and after they finish reviewing the current state of the art I feel distinctly underwhelmed, as it seems very standard and well-known:

  • C++ and Java have memory models that are hard to formalise. We knew this.

  • Both C++ and Java use shared memory and it is hard to define a workable definition of consistency. We knew this.

  • Cache hierarchies that involve sharing between different subsets of the cores at different levels make the problems worse. We knew this.

But what we also knew, and what isn't mentioned anywhere in this paper, is that alternatives to shared memory make these challenges easier. Sometimes it doesn't matter how hard you look in the wrong area; there will still be no good solutions.

A pessimistic view would be to abandon shared-memory altogether. However, the intrinsic advantages of a global address space are, at least anecdotally, supported by the widespread use of threads despite the inherent challenges. We believe the fault lies not in the global address space paradigm, but in the use of undisciplined or "wild shared-memory" permitted by current systems.

How do you define observable behavior?

I don't understand your comment at all. How can you tell whether programs have the same behavior without a language definition telling you how programs are allowed to behave?

A memory model is a critical part of the definition for parallel programs. Even a message passing model needs to guarantee that values received are the same as values sent, that there will be no closed causal loops in the message order (as a result of speculation), and so on.

When you talk about possible schedules of parallel threads it sounds like you are only considering interleavings of statements in order. That assumes sequential consistency, which hardware designers discarded years ago in the quest for speed. A number of the figures concern results that are impossible under any sequential order, but which standard processors may nonetheless produce.

The main result from recent years is that if synchronization is explicitly marked, and there is enough of it to keep the non-synchronization parts from racing, then hardware can keep using its current tricks and giving current performance, while still promising sequential consistency at the language level.
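
In C++ terms, that contract looks roughly like the sketch below (an invented example): the synchronization is marked with std::atomic, the ordinary accesses to data therefore don't race, and the implementation must make the whole thing behave sequentially consistently.

    // Sketch of the data-race-free contract in C++: the only unordered shared
    // accesses go through the explicitly marked atomic flag, so the program has
    // no data race and the consumer is guaranteed to see data == 42.
    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                    // ordinary, non-atomic variable
    std::atomic<bool> ready{false};  // synchronization is explicitly marked

    void producer() {
        data = 42;
        ready.store(true);           // default order: memory_order_seq_cst
    }

    void consumer() {
        while (!ready.load()) { }    // spin until the flag is set
        assert(data == 42);          // guaranteed under the DRF-SC contract
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }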

Observable Behaviour

I'm not sure where you thought I claimed that we could write programs at all without a language definition - I can't see that in anything that I wrote. Language semantics provide both a definition of what a program means and a limit on the possible behaviours. Where evaluations can occur in parallel there needs to be a definition of how the alternative values can be reconciled. This can occur in every read and write operation, arbitrating over a shared resource, or it can happen explicitly in determining the possible orderings of communication. The former approach makes it difficult to formalise language semantics (including the consistency of results) in a composable manner. This paper explicitly makes that argument, although the authors refuse to consider the alternative.

A number of the figures concern results that are impossible under any sequential order, but standard processors may none the less produce.

Actually, the case that they point out as not possible under any sequential order is also not producible by any known processor (as the authors point out). It seems unlikely that speculated traces within a core would ever be propagated to other cores, which is the justification for that particular scenario.

The idea that sequential consistency has been abandoned is a strange one. If anything designers are falling back on simpler in-order execution cores. Many simple cores are currently outpacing a few complex cores in most cases. The best example of throwing complex designs at the problem is the latest generations of Intel cores - but they go to quite extreme lengths to guarantee consistency. Do you have specific examples in mind?

"Simple" cores and memory models

The idea that sequential consistency has been abandoned is a strange one. If anything designers are falling back on simpler in-order execution cores.

I don't understand your point. I may have misunderstood your comment or even part of the subject paper, but it seems that you are conflating the absence of several machine-level optimization techniques (e.g. value and branch prediction) with sequential consistency. That is clearly not the case: for example, the memory model of ARM machine code is weaker than Intel's (see page 2).

Even if this were true, next-generation ARM cores available today do feature out-of-order execution and speculative execution.

Many simple cores are currently outpacing a few complex cores in most cases.

I find this assertion highly doubtful and fail to see what it has to do with memory models; the micro-architectural evolution of processors in consumer electronics devices does not support your position.

The best example of throwing complex designs at the problem is the latest generations of Intel cores - but they go to quite extreme lengths to guarantee consistency. Do you have specific examples in mind?

Which "consistency" are you talking about? I do agree that they indeed work very hard to recover a relatively strong memory model in the presence of extremely aggressive micro-architectural tricks, but neither x86(-64) nor IA64 guarantees sequential consistency.

Simple cores

Let's take the issue of out-of-order despatch versus in-order. The former achieves much higher performance on single-threaded code at the expense of a much more complex circuit. Implementing it effectively requires speculation / prediction to push available work earlier in the execution trace, and as a result a reorder buffer is required to maintain sequential consistency for the outside world. This reordering can be made simpler by weakening the definition of consistency to allow some results to be pushed back out in an order that doesn't match the original code.

A simple core that executes in-order trivially matches the definition of sequential consistency: since it produces results exactly in the order that the code specifies, they can be retired in that order without any extra work. When designers move from a single core to many cores a tradeoff arises: use fewer cores that attempt to execute each thread faster, or spend that silicon budget on a larger number of simple cores and rely on a larger number of threads to achieve throughput. Examples of the latter approach can be seen in most GPUs today, which use hundreds of in-order cores rather than the 4-8 complex cores we see in the latest x86 lineup. The comparison is not exact as GPUs also amortise the cost of control flow by arranging these simple cores into vector units.

If you go down the route of lots of simple cores then at some point the issue of how to merge results from different threads into a consistent view still arises - but in a different place. Rather than a guarantee on consistency coming from the order in which individual results were retired back to memory, it becomes the responsibility of an external memory controller (or cache hierarchy) to arbitrate access between the different units. This is the same problem of determining what output a particular program (spread across multiple cores) will produce under the different possible scheduling / parallel overlaps of the individual computations. It may well be that this arbitration / retirement is not normally referred to as sequential consistency, in which case my terminology may have been confusing.

In the case of the x86 chips I was actually talking about the translation between the semantics of the x86 instruction set within each individual core and the real operations being scheduled on the architecture underneath. There is a layer of complex circuitry to ensure that the data-flow architecture underneath produces results that are consistent with the single thread's view of what it is doing. I can't say anything further about that particular point as it relies on some currently unpublished work.

So in summary - my terminology is probably very confusing, but I was referring to the process of merging results from separate parallel processes into a sequence that is consistent with the semantics of the original program. My claim at the start was that viewing the consistency problem as a set of constraints on the transformations that can legally be performed on the program seems to be a more complex view than asking how we can define the semantics of the language on a given piece of hardware so that it produces the same results. I would suggest that this observation follows from the authors' problems in making the former view work.

Rather than a guarantee on

Rather than a guarantee on consistency coming from the order in which individual results were retired back to memory, it becomes the responsibility of an external memory controller (or cache hierarchy) to arbitrate access between the different units. This is the same problem of determining what output a particular program (spread across multiple cores) will produce under the different possible scheduling / parallel overlaps of the individual computations.

There is a case that happens in practice with weak memory models -- so on everyday CPUs now -- where the output of two programs running in-order in parallel is not the output of any interleaving / scheduling of the individual in-order computations. It's the "Dekker's algorithm" case cited in the paper: consider X and Y initially 0, then execute (X = 1; r1 = Y) || (Y = 1; r2 = X). On most multicore hardware available today, you can get r1=0, r2=0 as a result.
This comes from the cache hierarchy: rather than writing directly to the common memory, which would incur costly synchronization, hardware threads instead write to their own write buffer, which is eventually synchronized. When reading, they check the local write buffer first, then main memory. In the example, neither write had been committed to main memory when the reads took place.

I don't understand whether you consider these situations or not. There is no assumption of complex reordering of instructions by the processor. Everything is executed in-order, but the memory hierarchy produces a non-sequentially-consistent result -- and the observable effect is the same as if the read and write were swapped, as the read takes place before the other thread's write is observable; which is why some people describe the effect of weak memory models as instruction reordering.

PS: if you want to see some of these weird results on your own machine, you can try the diy tool developed by Jade Alglave and Luc Maranget. The example above is the very first one in the documentation. I just tested it, and on my machine the outcome above (both 0) happened 42 times out of 10^6 runs.
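
A rough C++ version of the same litmus test is sketched below, with relaxed atomics standing in for the plain machine accesses; with std::memory_order_seq_cst instead, the both-zero outcome is ruled out. This is only a sketch: thread start-up skew usually keeps the two bodies from overlapping, so diy/litmus is far better at exposing the window.

    // Store-buffering litmus test (the Dekker fragment above) in C++, using
    // relaxed atomics so that neither compiler nor hardware adds ordering
    // beyond per-location coherence.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
        int both_zero = 0;
        for (int i = 0; i < 100000; ++i) {
            std::atomic<int> X{0}, Y{0};
            int r1 = -1, r2 = -1;

            std::thread t1([&] {
                X.store(1, std::memory_order_relaxed);
                r1 = Y.load(std::memory_order_relaxed);
            });
            std::thread t2([&] {
                Y.store(1, std::memory_order_relaxed);
                r2 = X.load(std::memory_order_relaxed);
            });
            t1.join();
            t2.join();

            if (r1 == 0 && r2 == 0) ++both_zero;  // forbidden under SC
        }
        std::printf("r1==0 && r2==0 observed %d times\n", both_zero);
    }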

Dekker's Algorithm

Yes, I would say that is covered. I take it this is the same Dekker who invented Dekker splits in floating-point algorithms? Figure 4 in the paper was the case that doesn't show up on current hardware.

Viewing the problem as arbitration happening on the outside, rather than reordering of events on the inside, doesn't change anything. They are two different descriptions of the same phenomenon. Which one is more appropriate depends on what you are trying to prove, and which way makes it easier to do so.

In the case of (X=1; r1=Y) || (Y=1; r2=X) you could treat both of the processes as atomic events that read from some memory (or intermediate buffer) and write to some memory (or intermediate buffer). In the buffered case both pairs of operations could overlap in time and thus operate on 'stale' data, producing the r1=0, r2=0 case as a result. This leads to three possible interleavings: L;R, R;L, or L||R (that is, both processes buffer the contents of memory and then execute at the same time).

The description in the paper covers the three sequentially consistent interleavings (which are L;R, R;L and also L0;R0;L1;R1) and introduces something with the same effect as L||R by describing a store buffer. The above description should make it obvious that the same effect can be produced using a read buffer in front of both pieces of code.
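
To spell that out, here is a tiny sequential C++ model of the "read buffer in front" view (purely illustrative, no real concurrency involved):

    // Sequential model of the L||R case: both sides fill a private read buffer
    // from memory first, then execute against it. The point is only that
    // buffered reads already admit the r1 == 0, r2 == 0 outcome.
    #include <cstdio>

    int main() {
        int X = 0, Y = 0;          // shared memory
        int bufY = Y, bufX = X;    // both read buffers filled up front (both 0)

        // L runs against its buffer:
        X = 1;
        int r1 = bufY;             // stale read of Y

        // R runs against its buffer:
        Y = 1;
        int r2 = bufX;             // stale read of X

        std::printf("r1=%d r2=%d\n", r1, r2);  // prints r1=0 r2=0
    }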

If you break down the memory into explicit (non-shared) buffers and make explicit the filling of these buffers then these kinds of effects are easier to model. This claim is well-known in the CSP community and is the basis for the argument that CSP-style semantics that make these steps explicit are more composable than descriptions of processes using shared-memory semantics. (and of course there is a whole separate body of claims for an asynchronous message passing style as well).

In both cases a model is being defined to reason about what effects different forms of parallel execution can have on a program, and about how to expose that model to the programmer in the language semantics. The paper makes the case that a model of shared memory that exposes all of the warts and weirdness is simply too much for the programmer to understand - but that the solution lies in defining a weaker memory model that only exposes some of the problems. Given that an alternative model exists that precisely shows where these problems crop up, it seems a strange approach to me.

Even a message passing model

Even a message passing model needs to guarantee that values received are the same as values sent

Well, it doesn't _need_ to guarantee this, and practically speaking it can't if you're sending messages over an unreliable network.

It could specify that the chance of undetected random corruption must be below some threshold, though.