Parallel/Distributed

Detecting Data Race and Atomicity Violation via Typestate-Guided Static Analysis

Detecting Data Race and Atomicity Violation via Typestate-Guided Static Analysis. Yue Yang, Anna Gringauze, Dinghao Wu, Henning Rohde. Aug. 2008

The correctness of typestate properties in a multithreaded program often depends on the assumption of certain concurrency invariants. However, standard typestate analysis and concurrency analysis are disjoint in that the former is unable to understand threading effects and the latter does not take typestate properties into consideration. We combine these two previously separate approaches and develop a novel typestate-driven concurrency analysis for detecting race conditions and atomicity violations. Our analysis is based on a reformulation of typestate systems in which state transitions of a shared variable are controlled by the locking state of that variable. By combining typestate checking with lockset analysis, we can selectively transfer the typestate to a transient state to simulate the thread interference effect, thus uncovering a new class of typestate errors directly related to low-level or high-level data races. Such a concurrency bug is more likely to be harmful, compared with those found by existing concurrency checkers, because there exists concrete evidence that it may eventually lead to a typestate error as well. We have implemented a race and atomicity checker for C/C++ programs by extending a NULL pointer dereference analysis. To support large legacy code, our approach does not require a priori annotations; instead, it automatically infers the lock/data guardianship relation and variable correlations. We have applied the toolset to check a future version of the Windows operating system, finding many concurrency errors that cannot be discovered by previous tools.

Typestates extend ordinary types by recording the state of objects, making it possible to catch safety violations that stem from operations being invoked on objects that are in the wrong state.
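
To make the kind of bug in question concrete, here is a small C++ sketch (my example, not the paper's): the pointer's typestate is established while the lock is held but consumed after the lock is released, so a writer can interleave in between. This is exactly a check-then-use atomicity violation that turns into a typestate (null-dereference / use-after-free) error.

    // Sketch only: a typestate error caused by a high-level race.
    #include <functional>
    #include <mutex>
    #include <thread>

    struct Shared {
        std::mutex m;
        int* p = new int(42);
    };

    void reader(Shared& s) {
        s.m.lock();
        bool ok = (s.p != nullptr);  // typestate of s.p: checked non-null
        s.m.unlock();                // lock released: the typestate is now transient
        if (ok) {
            int v = *s.p;            // race: p may have been freed and nulled here
            (void)v;
        }
    }

    void writer(Shared& s) {
        std::lock_guard<std::mutex> g(s.m);
        delete s.p;
        s.p = nullptr;               // legal under the lock, but breaks reader's check
    }

    int main() {
        Shared s;
        std::thread t1(reader, std::ref(s)), t2(writer, std::ref(s));
        t1.join();
        t2.join();
    }

A plain race detector flags the conflicting accesses; the typestate-guided analysis additionally reports that the race can drive s.p into a state where the dereference is invalid.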

C++ Futures

The next C++ standard is expected to include futures in its libraries. A future allows assignment to a value that is to be calculated at a later time: a promise stands in for the final value until it becomes available. The current thread carries on until the value is actually needed, at which point the value is either available or the thread goes into a wait state. Being a library facility, instead of a built-in language facility as in Oz or Alice-ML, futures are not nearly as transparent to use; still, one should be able to get a similar effect. In an article on Broken promises–C++0x futures, Bartosz Milewski criticizes the C++ design for its lack of composability.

What was omitted from the standard, however, was the ability to compose futures. Suppose you start several threads to perform calculations or retrieve data in parallel. You want to communicate with those threads using futures. And here’s the problem: you may block on any individual future but not on all of them. While you are blocked on one, other futures might become ready. Instead of spending your precious time servicing those other futures, you might be blocked on the most time-consuming one. The only option to process futures in order of completion is by polling (calling is_ready on consecutive futures in a loop) and burning the processor.
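
For the record, here is roughly what that polling workaround looks like with C++11-style futures (a sketch; wait_for with a zero timeout stands in for the is_ready call he mentions). The busy loop is exactly the "burning the processor" part:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>
    #include <vector>

    int slow_task(int id, int ms) {
        std::this_thread::sleep_for(std::chrono::milliseconds(ms));
        return id;
    }

    int main() {
        std::vector<std::future<int>> fs;
        fs.push_back(std::async(std::launch::async, slow_task, 1, 300));
        fs.push_back(std::async(std::launch::async, slow_task, 2, 100));
        fs.push_back(std::async(std::launch::async, slow_task, 3, 200));

        size_t done = 0;
        while (done < fs.size()) {  // no wait-for-any: spin over all futures
            for (auto& f : fs) {
                if (f.valid() &&
                    f.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                    std::cout << "task " << f.get() << " finished\n";
                    ++done;
                }
            }
        }
    }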

My personal opinion is that what he is really saying is that concurrency through futures (which in the simple case is a declarative form of concurrency) doesn't easily express message-based concurrency. I suppose one could look at CTM and describe how to do Erlang-style concurrency with nothing more than dataflow variables. But I think the bigger problem is that we assume there is a single approach to solving the concurrency problem. We might as well say that STM is good, but it fails to deliver declarative or message-passing concurrency. I personally think languages, whether through libraries or built-in language facilities, should provide multiple ways of dealing with concurrency and distribution. Although I don't use C++ anymore, I'm glad that they are integrating lightweight concurrency models into the language.

(Surprised we don't have a category dedicated to concurrency, so I'll post this under parallel/distributed.)

Using Promises to Orchestrate Web Interactions

Phil Windley posted a few useful links about this topic following WWW2008.

Bonus question: How is this item connected to the one I posted earlier today?

Verifying Compiler Transformations for Concurrent Programs

Verifying Compiler Transformations for Concurrent Programs. Sebastian Burckhardt, Madanlal Musuvathi, Vasu Singh.

Compilers transform programs, either to optimize performance or to translate language-level constructs into hardware primitives. For concurrent programs, ensuring that a transformation preserves the semantics of the input program can be challenging. In particular, the emitted code must correctly emulate the semantics of the language-level memory model when running on hardware with a relaxed memory model. In this paper, we present a novel proof methodology for proving the soundness of compiler transformations for concurrent programs. Our methodology is based on a new formalization of memory models as dynamic rewrite rules on event streams. We implement our proof methodology in a first-of-its-kind semi-automated tool called Traver to verify or falsify compiler transformations. Using Traver, we prove or refute the soundness of several commonly used compiler transformations for various memory models. In this process, we find subtle bugs in the CLR JIT compiler and in the JSR-133 Java JIT compiler recommendations.

The goal is to reason about the effects that different memory models may have on the validity of transformations. Program execution is modeled as an event stream, and a memory model can alter the event stream by swapping or eliminating events. Each thread of a concurrent program produces its own event stream; the stream produced by the whole program is the (possibly altered) result of merging the component streams. The validity of a transformation can thus be proved relative to a specific memory model (i.e., a set of stream rewrite rules).
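
As a toy illustration of the encoding (my sketch, not Traver's actual formalization), here is a single TSO-flavored rewrite rule that lets a store be swapped past a later load to a different address. Enumerating the traces reachable under such rules is what makes it possible to check whether a transformation is observable under a given memory model:

    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    struct Event { char kind; std::string addr; };  // 'S' = store, 'L' = load

    // One rewrite step:  S x ; L y  ~>  L y ; S x   (when x != y).
    bool rewrite_once(std::vector<Event>& trace) {
        for (size_t i = 0; i + 1 < trace.size(); ++i) {
            if (trace[i].kind == 'S' && trace[i + 1].kind == 'L' &&
                trace[i].addr != trace[i + 1].addr) {
                std::swap(trace[i], trace[i + 1]);
                return true;
            }
        }
        return false;
    }

    int main() {
        std::vector<Event> t = {{'S', "x"}, {'L', "y"}};
        while (rewrite_once(t)) {}                    // apply the rule to a fixpoint
        for (auto& e : t) std::cout << e.kind << e.addr << ' ';  // prints: Ly Sx
    }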

Traver lives here.

Programmable Concurrency in a Pure and Lazy Language

Programmable Concurrency in a Pure and Lazy Language, Peng Li's 2008 PhD dissertation, is a bit more implementation focused than is common on LtU. The paper does touch on a wide range of concurrency mechanisms so it might have value to any language designer considering ways to tackle the concurrency beast.

First, this dissertation presents a Haskell solution based on concurrency monads. Unlike most previous work in the field, this approach provides clean interfaces to both multithreaded programming and event-driven programming in the same application, and it does not require native support of continuations from compilers or runtime systems. Then, this dissertation investigates a generic solution to support lightweight concurrency in Haskell, compares several possible concurrency configurations, and summarizes the lessons learned.

The paper's summary explains what I like most about it:

the project ... solves a systems problem using a language-based approach. Systems programmers, Haskell implementors and programming language designers may each find their own interests in this dissertation.

Even if concurrency isn't your thing, section 6.3 describes the author's findings on the pros and cons of both purity and laziness in a systems programming context.

Functional building blocks as concurrency patterns

While teaching INGI1131, my concurrent programming course, I have become even more impressed by a concurrent paradigm, namely functional programming extended with threads and ports, which I call multi-agent dataflow programming. This paradigm has many good properties:


  • The declarative concurrent subset (no ports) has no race conditions and can be programmed like a functional language. The basic concept is dataflow synchronization of single-assignment variables. A useful data structure is the stream, a list with a dataflow tail, used as a communication channel (see the sketch after this list).
  • Nondeterminism can be added exactly where needed and minimally, by using ports. A port is simply a named stream to which any thread can send.
  • All functional building blocks are concurrency patterns. Map, fold, filter, etc., are all useful for building concurrent programs. Here are two examples: a contract net protocol and an observable port object. Chapter 5 of CTM gives many more examples.
  • Concurrent systems can be configured in any order and even concurrently with actual use of the system. Note that this is true for ports even though they can express nondeterminism.
  • Designing concurrent programs is amazingly easy. For example, any declarative part of the program can be put in its own thread, loosening the coupling between the system's parts, without changing correctness.
  • The paradigm is easy to implement efficiently.
  • The paradigm is easily extended to support fault-tolerant transparent distributed programming: see Raphael Collet's dissertation. Google's MapReduce is a famous example.
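
To make the first two bullets concrete, here is a minimal sketch of dataflow streams over single-assignment variables, transliterated into C++ since Oz code is harder to try out (all names are mine; std::promise/std::shared_future play the role of the dataflow variable, and get() is the dataflow synchronization):

    #include <future>
    #include <iostream>
    #include <memory>
    #include <thread>

    template <typename T> struct Cell;

    // A stream is a dataflow variable that will be bound to a cell (or nil).
    template <typename T>
    using Stream = std::shared_future<std::shared_ptr<Cell<T>>>;

    template <typename T>
    struct Cell {
        T head;
        std::promise<std::shared_ptr<Cell<T>>> bind_tail;  // write end, bound once
        Stream<T> tail;                                    // read end
        explicit Cell(T h) : head(h), tail(bind_tail.get_future().share()) {}
    };

    int main() {
        std::promise<std::shared_ptr<Cell<int>>> first;
        Stream<int> s = first.get_future().share();

        std::thread producer([&first] {
            auto* bind = &first;
            for (int i = 1; i <= 5; ++i) {
                auto cell = std::make_shared<Cell<int>>(i);
                auto* next = &cell->bind_tail;
                bind->set_value(cell);  // single assignment: each variable bound once
                bind = next;
            }
            bind->set_value(nullptr);   // nil terminates the stream
        });

        // The consumer folds over the stream; get() blocks until the tail is bound.
        int sum = 0;
        for (auto cell = s.get(); cell != nullptr; cell = cell->tail.get())
            sum += cell->head;
        std::cout << "sum = " << sum << '\n';  // prints 15

        producer.join();
    }

The consumer computes the same result no matter how the threads are scheduled, which is the declarative-concurrency point; a port corresponds to letting many senders bind successive tails through one named handle, and that is exactly where nondeterminism creeps in.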

This paradigm seems to be exactly what is needed for both small and big parallel systems (both multicore and Internet, tight and loose coupling). I am surprised that it is not used more often. What do you think? Does it deserve a bigger role?

Local Rely-Guarantee Reasoning

Local Rely-Guarantee Reasoning, Xinyu Feng. Accepted for publication at POPL 2009.

Rely-Guarantee reasoning is a well-known method for verification of shared-variable concurrent programs. However, it is difficult for users to define rely/guarantee conditions, which specify threads' behaviors over the whole program state. Recent efforts to combine Separation Logic with Rely-Guarantee reasoning have made it possible to hide thread-local resources, but the shared resources still need to be globally known and specified. This greatly limits the reuse of verified program modules.

In this paper, we propose LRG, a new Rely-Guarantee-based logic that brings local reasoning and information hiding to concurrency verification. Our logic, for the first time, supports a frame rule over rely/guarantee conditions so that specifications of program modules only need to talk about the resources used locally, and the certified modules can be reused in different threads without redoing the proof. Moreover, we introduce a new hiding rule to hide the resources shared by a subset of threads from the rest in the system. The support of information hiding not only improves the modularity of Rely-Guarantee reasoning, but also enables the sharing of dynamically allocated resources, which requires adjustment of rely/guarantee conditions.

In the beginning there was Hoare logic, which taught us how to reason about sequential, imperative programs. Then, Owicki and Gries extended Hoare logic with some additional rules that enabled reasoning about some concurrent imperative programs. This was good, but there were a lot of "obviously correct" concurrent programs that it couldn't handle. So Owicki-Gries logic begat two children.

The elder child was Jones's introduction of the rely-guarantee method. The intuition here is that if you have two subprograms M1 and M2, and M1 will work in an environment with a working M2, and M2 will work in an environment with a working M1, then when you put the two together you have a working M1 and M2. This is a really powerful reasoning method, but unfortunately it's not terribly modular.
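
Schematically, the parallel composition rule that encodes this intuition reads as follows in one common formulation (a sketch of the standard rule, not Feng's exact presentation; judgments have the form R, G ⊢ {p} M {q}, where R is the rely and G the guarantee):

    \frac{R \lor G_2,\; G_1 \vdash \{p_1\}\, M_1\, \{q_1\}
          \qquad
          R \lor G_1,\; G_2 \vdash \{p_2\}\, M_2\, \{q_2\}}
         {R,\; G_1 \lor G_2 \vdash \{p_1 \land p_2\}\, M_1 \parallel M_2\, \{q_1 \land q_2\}}

Each thread must tolerate, as its rely, both the real environment R and the other thread's guarantee. Note that R and G here range over the whole shared state, which is precisely the non-modularity that LRG's frame and hiding rules attack.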

The younger child of Owicki-Gries was concurrent separation logic. The intuition behind it is that if you can divide the heap into disjoint (logical) pieces, and only let one process access each chunk at a time, then you can't have any race conditions. This is a very simple principle, and permits modular, compositional reasoning about concurrent programs -- even pointer programs. But there are programs that can't be proven in this style.

So the obvious thing to want is the ability to combine these two styles of reasoning. Unfortunately, this is hard -- there have been several logics proposed to do this, each of which does a bit better than the last. Feng's is the latest, and the best I've seen so far. (Though concurrency is not really my area.)

An interesting point is that these kinds of reasoning principles, while invented for the concurrent world, are also interesting for reasoning about modular sequential programs. This is because when you create imperative abstractions, it's helpful to be able to give up knowledge about exactly when state changes can happen. So you need the same sorts of techniques to handle this kind of conceptual nondeterminism that you need for the actual nondeterminism of parallel hardware.

Intel Ct: C for Throughput Computing

Intel is working on a C++ extension called Ct, intended to address multicore programming and data parallelism more thoroughly than previous library efforts such as TBB.

One of the main challenges in scaling multi-core for the future is that of migrating programming tools, build environments, and millions of lines of existing code to new parallel programming models or compilers. To help this transition, Intel researchers are developing “Ct,” or C/C++ for Throughput Computing.

Ct is not intended to be a C++ dialect or replacement but rather a tricky integration of a runtime engine with existing C++ compiler environments.

In principle, Ct works with any standard C++ compiler because it is a standards-compliant C++ library (with a lot of runtime behind the scenes). When one initializes the Ct library, one loads a runtime that includes the compiler, threading runtime, memory manager — essentially all the components one needs for threaded and/or vectorized code generation.
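
The essential trick, as I understand it, is deferred execution: overloaded operations on Ct's vector types record work for the runtime rather than executing eagerly, so the "compiler" can fuse and vectorize whole expressions at run time. Here is a deliberately tiny sketch of that idea in plain C++ (TVec, source, and eval are invented names for illustration, not Intel's API):

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <vector>

    struct TVec {
        size_t n;
        std::function<double(size_t)> at;  // deferred element computation

        static TVec source(std::vector<double> v) {
            auto data = std::make_shared<std::vector<double>>(std::move(v));
            return {data->size(), [data](size_t i) { return (*data)[i]; }};
        }
    };

    TVec operator+(TVec a, TVec b) {  // records the op; nothing runs yet
        return {a.n, [a, b](size_t i) { return a.at(i) + b.at(i); }};
    }

    TVec scale(double k, TVec a) {
        return {a.n, [k, a](size_t i) { return k * a.at(i); }};
    }

    // Forcing point: a real runtime would emit fused, threaded, vectorized
    // code for the whole expression here instead of running a scalar loop.
    std::vector<double> eval(const TVec& v) {
        std::vector<double> out(v.n);
        for (size_t i = 0; i < v.n; ++i) out[i] = v.at(i);
        return out;
    }

    int main() {
        TVec x = TVec::source({1.0, 2.0, 3.0, 4.0});
        TVec y = TVec::source({10.0, 20.0, 30.0, 40.0});
        for (double d : eval(scale(2.0, x) + y)) std::cout << d << ' ';
        std::cout << '\n';  // prints: 12 24 36 48
    }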

The Transactional Memory / Garbage Collection Analogy

Courtesy of my shiny new commute, I have been listening to various podcasts, including Software Engineering Radio. A while back, they had an interview with Dan Grossman on his OOPSLA 2007 paper, which I have not seen discussed here.

The Transactional Memory / Garbage Collection Analogy is an essay comparing transactional memory with garbage collection based on the analogy:

Transactional memory (TM) is to shared-memory concurrency
as
garbage collection (GC) is to memory management.

Grossman presents the analogy as a word-for-word transliteration of a discussion of each of the technologies. (Hence the "fun" category.)

(As an aside, Grossman does not address message-passing, but says, "Assuming that shared memory is one model we will continue to use for the foreseeable future, it is worth improving," which is probably a correct assumption.)

One point that he does make is that

[This essay] will lead us to the balanced and obvious-once-you-say-it conclusion that transactions make it easy to define critical sections (which is a huge help in writing and maintaining shared-memory programs) but provide no help in identifying where a critical section should begin or end (which remains an enormous challenge).
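
A standard illustration of that last point (my example, not Grossman's): each operation below is properly synchronized and atomic on its own, yet the composite transfer is still wrong, because another thread can observe the money "in flight" between the two calls. Neither TM nor locks will tell you that the critical section should have spanned both:

    #include <mutex>

    struct Account {
        std::mutex m;
        long balance = 0;
        void deposit(long n)  { std::lock_guard<std::mutex> g(m); balance += n; }
        void withdraw(long n) { std::lock_guard<std::mutex> g(m); balance -= n; }
    };

    // Each call is atomic, but the pair is not: the sum of balances is
    // temporarily wrong between the calls. Choosing the right boundary
    // (one critical section around both) is the part left to the programmer.
    void transfer(Account& from, Account& to, long n) {
        from.withdraw(n);
        to.deposit(n);
    }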

The one serious weakness of the analogy, to me, is that GC does not require (much) programmer input to work, while TM does.

Although some parts of the analogy are strained, there are some interesting correspondences.

Twilight of the GPU

This interview with Tim Sweeney discusses his prediction that graphics rendering will move from special-purpose GPUs back to the CPU:

I expect that in the next generation we'll write 100 percent of our rendering code in a real programming language—not DirectX, not OpenGL, but a language like C++ or CUDA. A real programming language unconstrained by weird API restrictions. Whether that runs on NVIDIA hardware, Intel hardware or ATI hardware is really an independent question. You could potentially run it on any hardware that's capable of running general-purpose code efficiently.

This is driven by the development of cheap multi-core CPUs. Consumers might still buy graphics boards, but they'll be fully programmable multi-core devices:

Intel's forthcoming Larrabee product will be sold as a discrete GPU, but it is essentially a many-core processor, and there's little doubt that forthcoming Larrabee competitors from NVIDIA and ATI will be similarly programmable, even if their individual cores are simpler and more specialized.

How are we going to program these devices? NVIDIA showed the data-parallel paradigm was practical with CUDA (LtU discussion). Now, Tim asks:

...can we take CUDA's restricted feature set—it doesn't support recursion or function pointers—and can we make that into a full C++ compatible language?... From my point of view, the ideal software layer is just to have a vectorizing C++ compiler for every architecture.
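
For those who have not written CUDA-style code, the paradigm Tim has in mind is roughly this: a scalar "kernel" applied uniformly across a large array, with the system deciding how to spread the work over cores and vector lanes. A minimal C++ rendition using plain threads (illustrative only, hardly the vectorizing compiler he asks for):

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Apply `kernel` to every element, one chunk per hardware thread.
    template <typename F>
    void parallel_map(std::vector<float>& data, F kernel) {
        unsigned nt = std::max(1u, std::thread::hardware_concurrency());
        size_t chunk = (data.size() + nt - 1) / nt;
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nt; ++t) {
            size_t lo = t * chunk, hi = std::min(data.size(), lo + chunk);
            pool.emplace_back([&data, kernel, lo, hi] {
                for (size_t i = lo; i < hi; ++i) data[i] = kernel(data[i]);
            });
        }
        for (auto& th : pool) th.join();
    }

    int main() {
        std::vector<float> v(1 << 20, 1.0f);
        parallel_map(v, [](float x) { return 2.0f * x + 1.0f; });  // data-parallel map
    }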

The FP lover in me says that data-parallel + C++ = the horror! However, I can appreciate that it is a practical solution to a pressing problem. What do you think? Is Tim right? Is a data-parallel extension to C++ the right way to go? Or is the next version of Unreal going to be written in FRP?
