Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading

Junfeng Yang, Heming Cui, Jingyue Wu, Yang Tang, and Gang Hu, "Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading", Communications of the ACM, Vol. 57 No. 3, Pages 58-69.

We believe what makes multithreading hard is rather quantitative: multithreaded programs have too many schedules. The number of schedules for each input is already enormous because the parallel threads may interleave in many ways, depending on such factors as hardware timing and operating system scheduling. Aggregated over all inputs, the number is even greater. Finding a few schedules that trigger concurrency errors out of all enormously many schedules (so developers can prevent them) is like finding needles in a haystack. Although Deterministic Multi-Threading reduces schedules for each input, it may map each input to a different schedule, so the total set of schedules for all inputs remains enormous.

We attacked this root cause by asking: are all the enormously many schedules necessary? Our study reveals that many real-world programs can use a small set of schedules to efficiently process a wide range of inputs. Leveraging this insight, we envision a new approach we call stable multithreading (StableMT) that reuses each schedule on a wide range of inputs, mapping all inputs to a dramatically reduced set of schedules. By vastly shrinking the haystack, it makes the needles much easier to find. By mapping many inputs to the same schedule, it stabilizes program behaviors against small input perturbations.

The link above is to a publicly available pre-print of the article that appeared in the most recent CACM. The CACM article is a summary of work by Junfeng Yang's research group. Additional papers related to this research can be found at http://www.cs.columbia.edu/~junfeng/

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Misleading title

Full Abstract:

Our accelerating computational demand and the rise of multicore hardware have made parallel programs, especially shared-memory multithreaded programs, increasingly pervasive and critical. Yet, these programs remain extremely difficult to write, test, analyze, debug, and verify. Conventional wisdom has attributed these difficulties to nondeterminism (i.e., repeated executions of the same program on the same input may show different behaviors), and researchers have recently dedicated much effort to bringing determinism into multithreading.

In this article, we argue that determinism is not as useful as commonly perceived: it is neither sufficient nor necessary for reliability. We present our view on why multithreaded programs are difficult to get right, describe a promising approach we call stable multithreading to dramatically improve reliability, and summarize our last four years’ research on building and applying stable multithreading systems.

The proposal is an alternative to determinism, not an addition to it (as 'not enough' in the title seems to connote).

Anyhow, I'm certainly in favor of stable models for parallelism and concurrency. My recent approaches have involved E-vat like models together with pipeline parallelism, which enable me to reason compositionally about memory usage, resource management, and process control. I wonder if the reified 'schedule' concept, mentioned in the OP, might be a suitable replacement for vats.

My recent approaches have

My recent approaches have involved E-vat like models together with pipeline parallelism, which enable me to reason compositionally about memory usage, resource management, and process control.

I'm not sure promises/futures truly enable compositional reasoning about memory usage or resource management. They have lazy default semantics, which is notoriously difficult to reason about in languages that aren't lazy everywhere (and often even there).

Futures and promises have a

Futures and promises have a lot of problems, I agree. I don't use them. I've periodically contemplated promises, futures, and variations (such as rational types or Oz-like logic variables). But I repeatedly reach the same conclusion: these techniques hinder generic equational reasoning (since any function that potentially outputs a promise must be understood in its entire context) and it becomes difficult to reason compositionally about dataflow, divergence, and resource/memory use.

Pipeline parallelism doesn't depend on promises, and is really a separate concept from them. It does depend on having clear pipelines that can be computed in an earlier stage from the runtime data. But this can be achieved using a streaming data model, or arrows, or reactive dataflow signals.

The utility of 'vats' is having little islands that are internally consistent, and have simple snapshot-consistency relationships with other vats. A lot of malign data races can be avoided entirely, while allowing for open systems to have a little bit of temporary inconsistency.

I think replay is a good way

I think replay is a good way forward. Rather than package a value up as a promise, have the computation return "not yet" and allow the task to continue executing with that value. When the computation has produced the value, re-execute the task with the right value. Of course to make this work, we must be able to undo and redo any stateful operations, but this is what I'm basing Glitch on anyways to support incremental computation as well as live programming. Of course, you are now executing your tasks multiple times, but I don't think task computation is really bottleneck for these applications.

Performance-oriented pipeline parallelism is a whole different barrel of fish where you want a statically known graph and highly encapsulated communication between vertices via graph edges only. Synchronous execution might even be a requirement for something really fast like CUDA.

Can be an addition

As the paper makes clear, determinism and stable multithreading are not mutually exclusive. Indeed, they can be used together, i.e., SMT can be an addition to DMT.

constraining top-down or bottom-up?

Nify thoughts in this work!

Skimming the paper, it sounds like they are trying to tame the wild west. I'd sorta rather start with e.g. skeletons or some such, where things are constrained from the ground-up in a sense. Call me a paranoid chicken.

Coming from different directions

It appears that Junfeng Yang came out of Dawson Engler's group at Stanford. And much like Dawson, seems to be approaching the problem from the "how can I find or fix problems in sloppy code?" direction rather than from the "how can I not write sloppy code in the first place?" direction. This, of course, has the pragmatic benefit that it lets developers keep doing the same things they've always done with the languages they're comfortable with.

Junfeng and Dawson are

Junfeng and Dawson are closer to systems people in that vain "how can I fix problems with current programming models" is more of the systems approach vs. the "how can I avoid problems with new programming models" of PL people. There is a lot of overlap of course, I work in a systems team where I face this divergence in thinking almost every day. I actually discussed these differences with Junfeng when he visited our group a few months ago; I think more bridging work in this area would be useful.

Agreed

That's a good way to look at it.

Defensive code for pervasive indeterminacy and inconsistency

We need defensive code for pervasive indeterminacy and inconsistency
because of massive networking and many-core architectures. Programs
need to assume that things will go wrong and take corresponding
cost-effective measures.

Consistency is a myth and Inconsistency Robustness must become the
norm for programming.

See Inconsistency Robustness 2014.

nine nines?

How would you augment the Erlang/OTP approach?

Unfortunately, Erlang/OTP does not properly support Actors

Unfortunately, Erlang/OTP does not properly support the Actor Model. See Actor Model of Computation.

That wasn't the question

I am aware of your contributions, Dr. Hewitt. But that wasn't the question.

Is there anything practically wrong with the Erlang/OTP approach? To wit (pi), how many 9s has a 'correct' implementation of an Actor model achieved?

Which of the principles of Actors are satisfied by Erlang/OTP?

The Actor Model claims to embody current theory and practice for
(concurrent) computation.

Which of the principles of the Actor Model are satisfied by
Erlang/OTP? And which principles of the Actor Model are violated
by Erlang/OTP?

To what extent are the strengths of Erlang/OTP due to adherence to
principles of Actors. And to what extent are the limitations of
Erlang/OTP due to violating principles of Actors.

I would think you & possibly

I would think you & possibly Dr. Armstrong have the authoritative answers to your questions.

Practically speaking, Erlang/OTP delivers the goods in fault tolerance, in the field. Per my understanding, this is due to OTP's primary principle of expecting and embracing component failures. Erlang's (however un-canonical) foundations in Actor model and process isolation allows for OTP supervisory components to achieve the design goals of OTP.

So the question remains as to (a) what is wrong with this practical approach ala Erlang/OTP? and (b) in what sense do implementations of the orthodox Actor model improve on the empirical results we have from Erlang/OTP approach?

And with all due respect, you seem to exhibit the same tin ear that Godel displayed in response to Wittgenstein's 'so what?' critique of possibility of paradox in self referential logic. Just as your "direct logic" is a practical approach -- my pet name for DL is 'blue collar logic' -- for getting things done and staying clear of the difficulties inherent in formalizing 'meta-physics' and 'intuition', a practical approach to concurrency ala Erlang/OTP has proven itself repeatedly in the field (where it counts for 'Engineers'), never mind its deviation from the Canon of Actor Model.

As an aside, this little interaction of ours has inspired what I call the "Hewitt Test" "contra" "Turing Test". The test setup is just like the Turing Test. There is a room with a teletype. A random person sits behind the teletype and poses a question. If the response to the query (regardless of content) is "The Actor Model", then we know Carl Hewitt is behind the curtain.

/All the best!

But the actor model.

But the actor model.

:)

I've never successfully engaged Hewitt in a meaningful discussion. It's as though he defines concurrency in terms of the actor model, so he can't even contemplate relative strengths and weaknesses of alternatives. If it isn't actor model, then it isn't true concurrency, nothing more to discuss.

How can (concurrent) computation be rigorously defined?

How can (concurrent) computation be rigorously defined?

The first attempt was to define computation rigorously using the
lambda calculus. However, it became apparent that the lambda
calculus is inadequate and various alternatives were developed, e.g.,
Petri nets. In 1972, the Actor Model was invented and shown to
encompass the known computational methods, e.g. Petri nets, lambda
calculus, interrupt handlers, arbiters, semaphores, threads, locks,
monitors, etc. Also, using the Actor Model, it was proved that
nondeterministic Turing machines cannot implement concurrent
computation.

What is your proposal to rigorously define concurrent computation?

Informally, concurrent

Informally, concurrent describes any phenomenon that coincides spatially and temporally. Concurrent computation can be formalized if you also formalize the spatial-temporal model. Actors model does achieve concurrency, of course. So do cellular automata, threads, CSP, event calculi, temporal logic, and a number of other models.

the Actor Model was invented and shown to encompass the known computational methods

Sure. But this is a statement of universality or generality, not concurrency.

using the Actor Model, it was proved that nondeterministic Turing machines cannot implement concurrent computation

Not at all. You have argued that actors model can express some behaviors that a nondeterministic Turing machines cannot express. (This 'proof' relies on a non-constructive assumption of 'fairness' in Actors model; intuitionists may be unwilling to accept this proof.) But this is, again, a claim of generality, that seems irrelevant to concurrency.

You should make an effort to distinguish concurrency from generality. They are different concepts, both informally and formally. If you are unable to comprehend and make clear distinctions between concurrency and other concepts, then there is no value in discussing concurrency with you.

Yes, that post is a nice

Yes, that post is a nice point of evidence that Carl Hewitt conflates 'concurrency' with some form of 'generality' - an ability to express all computations. Which supports a conclusion that there is little value in discussing concurrency with him. Thanks. Bye.

Computation these days means concurrent computation

Because of the efficiency of many-core computer architectures and
the fact that almost all computers are connected to the network,
computation these days means concurrent computation, i.e.,
in practice all computation is becoming concurrent computation.

Erlang/OTP is inadequate with respect to hardware failures

Unfortunately, Erlang/OTP is inadequate with respect to managing
hardware failures. For example, Erlang/OTP does not have the
mechanisms below.

Because of hardware failures, message reception cannot be guaranteed
although best efforts are made. If a response for a request is not
received, an exception is thrown to maintain robustness.

If anActor is sent aRequest, then at least one the following two possibilities will occur (not mutually exclusive):  
  1. a RequestUnacknowledged[ ] exception is thrown even though anActor may receive aRequest
  2. One of the following will occur mutually exclusively
    ⊗ a RequestNotReceived[ ] exception is sent to the requestor of aRequest and anActor does not receive aRequest
    ⊗ anActor receives aRequest, an Unresponsive[ ] exception is sent to the requestor of aRequest, and a response is not sent for aRequest
    ⊗ a response  is sent for a aRequest and an Unresponsive[ ] exception is not sent