Is Transactional Programming Actually Easier?

Is Transactional Programming Actually Easier?, WDDD '09, Christopher J. Rossbach, Owen S. Hofmann, and Emmett Witchel.

Chip multi-processors (CMPs) have become ubiquitous, while tools that ease concurrent programming have not. The promise of increased performance for all applications through ever more parallel hardware requires good tools for concurrent programming, especially for average programmers. Transactional memory (TM) has enjoyed recent interest as a tool that can help programmers program concurrently.

The TM research community claims that programming with transactional memory is easier than alternatives (like locks), but evidence is scant. In this paper, we describe a user-study in which 147 undergraduate students in an operating systems course implemented the same programs using coarse and fine-grain locks, monitors, and transactions. We surveyed the students after the assignment, and examined their code to determine the types and frequency of programming errors for each synchronization technique. Inexperienced programmers found baroque syntax a barrier to entry for transactional programming. On average, subjective evaluation showed that students found transactions harder to use than coarse-grain locks, but slightly easier to use than fine-grained locks. Detailed examination of synchronization errors in the students’ code tells a rather different story. Overwhelmingly, the number and types of programming errors the students made was much lower for transactions than for locks. On a similar programming problem, over 70% of students made errors with fine-grained locking, while less than 10% made errors with transactions.

I've recently discovered the Workshop on Duplicating, Deconstructing, and Debunking (WDDD) and have found a handful of neat papers, and this one seemed especially relevant to LtU.

[Edit: Apparently, there is a PPoPP'10 version of this paper with 237 undergraduate students.]

Also, previously on LtU:

Transactional Memory versus Locks - A Comparative Case Study

Despite the fact that Tommy McGuire's post mentions Dr. Victor Pankratius's talk was given at UT-Austin, and that the authors of this WDDD'09 paper are also from UT-Austin, these are two independent case studies with different programming assignments. The difference in assignments is interesting because it may indicate some statistical noise associated with problem-domain complexity (as perceived by the test subjects), which could account for differences between the two studies.

Everyone always likes to talk about usability in programming languages without actually trying to study it. Some claim it can't even be done, despite the fact that Horning and Gannon worked on the subject 3+ decades ago, assessing how one can use Language Design to Enhance Program Reliability. Their work gives a glimpse of both (a) why it is hard and (b) how you can still try to do usability testing, rather than just asserting the truthiness of a language design decision.

After reading Duffy, I wonder what we are discussing?

I found the examples given by Duffy convincing, but, if I read the stars correctly, it looks to me like proper support for atomic regions is probably only easily implemented by just 'letting them be atomic,' i.e., no transactions at all: just block everything except the one thread that enters the atomic region.

But atomic regions are as old as cheap wine. Nesc has them, and they are not the same as transactional memory (I assume it is just disabling all interrupts in their setting). Where's the (explicit) commit and roll-back in Duffy's examples?


I don't see any TM, only an attempt to give atomic sections TM-like semantics, and that seems to have failed (in C#).

Maybe hardware support for atomic sections, without TM, would be enough?

Nesc atomics

Nesc has them, and they are not the same as transactional memory (I assume it is just disabling all interrupts in their setting).

That's correct. NesC atomic sections are implemented by disabling interrupts (although in theory the language itself doesn't specify the mechanism for managing atomic sections, only the semantics). Of course, nesC is built around cooperative multi-tasking of background tasks, with preemption from relatively short interrupt-handlers (in other words, a pretty standard small embedded system). So it's dealing with slightly different problems than most of these TM systems are aimed at.

Easily implemented...

I assume you're referring to Joe Duffy's retrospective on transactional memory (previously discussed on LtU).

There was a good discussion of the Duffy article on Reddit.

In any case, you're right that the "easily implemented" STM model is one big global lock -- that in fact is a direct implementation of the *semantics* of STM. The point is that these semantics can be implemented in a much more concurrent way, especially when mutation is discouraged/controlled.
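
For concreteness, a minimal Haskell sketch of that one-big-lock reading (the names are invented; this is not how GHC actually implements STM):

```haskell
import Control.Concurrent.MVar

-- Under the one-big-lock reading, a "transaction" is just an IO action run
-- while holding a single shared lock: trivially serializable, trivially
-- correct, and with no concurrency whatsoever.
newtype BigLock = BigLock (MVar ())

newBigLock :: IO BigLock
newBigLock = BigLock <$> newMVar ()

atomicallyNaive :: BigLock -> IO a -> IO a
atomicallyNaive (BigLock m) action = withMVar m (const action)
```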

Software STM in, e.g., Haskell and Clojure works just fine, and indeed both have *different* implementations (GHC has rollbacks, Clojure is optimistic). The correctness of such implementations is again indeed a tricky business, but there is also work towards, e.g., verifying them in Coq.

Ease of implementation

And ease of implementation isn't the most important thing. Scalability is a big concern, too. VJ Marathe and his collaborators have done some great work in this area. Check out SkySTM.

"works just fine"

i thought that wasn't entirely clear for non-small problems.

It depends on what we mean

It depends on what we mean by "works..." (i.e. our criteria for efficiency).

so in other words

it doesn't work just fine. :-) :-)

STM is the worst possible

STM is the worst possible way to parallelize code, except for all the others. -- Brian Hurt, The problem with STM: your languages still suck

If you need the flexibility, composition, atomicity, consistency, and isolation properties, then STM as seen in Haskell and Clojure is among the best tools available. These models of STM also degrade nicely in the face of developer error: common errors will hurt performance rather than introduce deadlock or data races. But there are some performance costs for that flexibility, and you will not easily reason about performance or fairness.
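
The composability point is easy to show in Haskell; here is a minimal sketch along the lines of the usual bank-account example (the names are my own, not from the paper):

```haskell
import Control.Concurrent.STM

-- Two independently written STM actions...
withdraw, deposit :: TVar Int -> Int -> STM ()
withdraw acct n = modifyTVar' acct (subtract n)
deposit  acct n = modifyTVar' acct (+ n)

-- ...compose into a single atomic transfer, with no extra locking protocol
-- and no other thread ever observing the intermediate state.
transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to n = atomically (withdraw from n >> deposit to n)
```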

Whether we say STM "works just fine" might be stated relative to what we say of locks and other synchronization mechanisms. Do locks work just fine? I don't believe so.

Peeve

... the worst possible ..., except for all the others.

I really dislike this rhetorical form. Particularly when the real claim being made is " ... is the best ... that I'm aware of, in my opinion."

I like how it fills out the

I like how it fills out the subject line. :-)

[But I grant your point.]

mostly agreed

my problem is more with the overall attitude i think i get from STM and Actor pushers, that everything is great and nothing can go wrong and just swallow the damned pill.

when in fact we need more tools, more profiling, analysis, understanding, benchmarking, experience, study, yadda^3.

(i at least mostly heartily concur that regular old locking sucks dirty donkey buttocks. tho, when things go wrong it is usually pretty clear how/why cf. Dr. Click's old saw.)

The garbage collection

The garbage collection analogy applies again. It works fine for many cases. There will be times when the extant implementations are insufficient for certain needs, and there will be times when no implementation will be suitable. But that's true of most approaches to any given problem.

so then i have to wonder

what is the analogue of memory-region inference approaches for concurrency?

It depends on what we mean

It depends on what we mean by "fine..." (i.e. our criteria for effectiveness).

The baselines for both of

The baselines for both of these studies tell us almost nothing.

As before... locks seem like an incredibly low bar for typical parallel programs. What about comparisons to OpenMP, Cilk/TBB, UPC, etc. that are actually intended for fine-grained parallelism?

Furthermore, for domains where locks are interesting -- e.g., OS kernels, SQL internals, etc. -- I don't think the mistakes undergraduates make in a class assignment are all that telling. Such a scenario is almost tantamount to examining kernel hacking for novices. Students can still be a great target in other domains; for example, the push to use multicore and clouds in scientific computing is often mirrored in modern course offerings for non-CS students. There, the joke is that parallelization typically leads to slowdown.

Okay, great points

So would you rather see a workshop at the annual Linux Conference where kernel hackers are tested for 8 hours to see how well they can handle problems?

It seems there is a continuum of developers these studies should target. I agree that the danger in following these studies is that they push designs towards the lowest common denominator among all programmers. I don't personally read too much into the API design studies currently being done at CMU for this reason.

I think the thing I found most surprising was how bad the students were at matching up what they liked vs. what they did best with. On the other hand, that could just be the Dunning-Kruger Effect.

Another metric I would have liked to see tested in these studies is bugs per source line of code (bugs/SLOC), since that is one of the most reliable defect metrics we have across languages and APIs. Figure 2 in the paper shows the concrete syntax used, and one might infer that with the "ideal syntax" of the PPoPP'10 paper we would see a correlation not only in how much the undergrads like the syntax, but also fewer bugs due to fewer SLOC.

For me, it seems reasonable to assume that if transactional memory takes off, many of the programmers using it will be at a competency level not much higher than UT-Austin undergrads. I am not sure I agree about OpenMP and Cilk/TBB. Languages like X10, Fortress, and Chapel would be a more interesting comparison for a future study, probably especially X10 since it targets the JVM.

Concurrency vs. Parallelism

I agree, Leo, that such comparisons would be useful. There are plenty of tools out there to help us leverage data-parallelism or develop parallel algorithms. I'd add Intel Concurrent Collections to that list (since they've been mentioned recently), and I'm certain many more could be added. [If we include SIMD and MIMD parallelism, then Coconut, Nikola, CUDA, and GpuGen come to mind.]

But the example 'shooting gallery' problem used in this paper is about concurrency in the domain. Threads for 'shooters' and 'cleaners' exist to represent concurrent users, similar to that seen in application servers or multi-player games or concurrent garbage-collectors. The critical features are fairness, progress, and consistency. While concurrency could be leveraged for task parallelism (to scale to ever greater numbers of users) it is unclear where we would bother with data-parallel algorithms in this given problem.

If we're talking about locks vs. STM, we should be talking about concurrent scenarios like the sample problem here, and thus we should be comparing them instead to other utilities and models for concurrent program development. Thus, rather than OpenMP, Cilk, et al., we should compare instead to tools such as event stream processing, complex event processing, publish-subscribe, reactive or dataflow programming, blackboard or tuple spaces, actors, orc, π-calculus, join-calculus, concurrent constraint programming, timed logic, and so on.

As I note below, STM can be leveraged for some neat orchestration characteristics (though I've not seen this utilized outside Haskell). Locks, OTOH, are typically accompanied by a few extra signaling facilities, such as semaphores or Windows CreateEvent.

At first blush, I'd actually

At first blush, I'd actually think about randomized + SIMD algorithms for this problem.

Randomized + SIMD

Do you also have a randomized + SIMD model of dining philosophers? ;-)

For the problem in the paper, were I to violate its purpose in favor of its spirit, I'd not bother with randomization. I'd develop a simple function to compute a fair, consistent shooting/cleaning schedule based on the current counts of cleaners and red and blue shooters and gallery width, and I'd apply the schedule in a tight loop inside one thread. This sort of advanced schedule formation is quite usable for many real concurrency problems. I do not see where SIMD would apply. Multi-core parallelism might be useful, but only for huge galleries.

But the problem is specified in terms of certain implementation characteristics. Each shooter and cleaner is, by problem definition, a thread. Shooters choose lanes randomly, and thus may choose the same lane, so some synchronization is needed. A silly rule forbids cleaners from working in parallel (one cleaner at a time). Even the choice for lock class in Java is specified for you.
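
For illustration only, here is roughly how the shooter step might look in Haskell's STM rather than the assignment's Java (the lane representation and names are hypothetical):

```haskell
import Control.Concurrent.STM

data Lane = White | Red | Blue deriving (Eq, Show)

-- One shooter attempt: atomically claim two randomly chosen lanes, but only
-- if both are still white. Both writes commit together or not at all.
shoot :: Lane -> TVar Lane -> TVar Lane -> STM Bool
shoot colour lane1 lane2 = do
  a <- readTVar lane1
  b <- readTVar lane2
  if a == White && b == White
    then do writeTVar lane1 colour
            writeTVar lane2 colour
            return True
    else return False
```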

I suspect it would be difficult to ask 147 undergrads to solve a problem that truly requires task-parallel concurrency (to ensure progress or for scalability). Likely, too many would have a difficult time even with a toy problem. Further, you'd be stuck grading a wide variety of solutions...

Do you also have a

Do you also have a randomized + SIMD model of dining philosophers? ;-).... I do not see where SIMD would apply. Multi-core parallelism might be useful, but only for huge galleries.

The premise of the paper is CMPs, which I'd generalize to parallel single-socket hardware. Thus, I'd assume one of the numbers is big -- otherwise, I'm skeptical of getting a parallel speedup. Similarly, I'd assume it's compute bound, not memory bound. Furthermore, the reads/writes are at the bit level for this problem. Clean checks are scans (or even prefix) and I suspect something similar (albeit randomized scatter/gather etc.) for shooters. It's almost begging for a blocked representation! However, I'm taking advantage of the reduced formulation of the problem.

I suspect it would be difficult to ask 147 undergrads to solve a problem that truly requires task-parallel concurrency (to ensure progress or for scalability). Likely, too many have a difficult time even with a toy problem. Further, you'd be stuck grading a wide variety of solutions...

My **unquantified** belief is that task parallelism is the common (and usable) case, with the additional synchronization that uses locks, transactions, etc. dwarfed by simple barriers (explicit barriers, the task decomposition itself, atomics, etc.). Essentially, parallel-for, then nested parallel-for, etc. This is similar to the MapReduce observation. Furthermore, for the more exotic forms beyond the above where we would suggest locking, transactions, etc., I suspect pattern frameworks will often be the way forward: if you care about performance, you're either not a novice or should be using one of these.

However, this is veering from the topic, and, again, an unquantified suspicion.

Performance was not a metric

The premise of the paper is CMPs, which I'd generalize to parallel single-socket hardware. Thus, I'd assume one of the numbers is big -- otherwise, I'm skeptical of getting a parallel speedup. Similarly, I'd assume it's compute bound, not memory bound.

I'm guessing you didn't actually read the problem from the paper. Performance was not a metric. That is, the paper is not attempting to justify the use of multi-core processing. Rather, it assumes the use of multi-threading (for a problem that doesn't actually need the parallelism) and asks "which technique offers developers a better means of managing this concurrency?"

The problem certainly was not compute bound: each shooter thread writes exactly one 'bit' of information to each selected lane in the gallery, and it knows upon construction which bit to write (the red bit or the blue bit). So there is no significant computation involved. The most expensive computation is stepping a pseudo-random number generator to select a lane (or two, simultaneously, to expose deadlock concerns). Then there is the far more expensive synchronization: wait for any cleaners to leave and check that the lanes are still white before (atomically) coloring them red or blue.

Compare-and-swap would have been a more efficient solution than locking or transactions. Avoiding multi-core parallelism entirely would have been yet more efficient (due to cache locality and synchronization issues). As is, the problem is 'bounded' by synchronization, scheduling, and cache-thrashing.
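
As a rough sketch of the compare-and-swap idea (again with a hypothetical lane representation; in GHC, atomicModifyIORef' boils down to a CAS loop):

```haskell
import Data.IORef

data Lane = White | Red | Blue deriving (Eq, Show)

-- Lock-free claim of a single lane: succeed only if the lane is still white.
claimLane :: Lane -> IORef Lane -> IO Bool
claimLane colour ref =
  atomicModifyIORef' ref $ \old ->
    if old == White then (colour, True) else (old, False)
```

Claiming two lanes this way, of course, brings back exactly the ordering questions the locking and TM versions are meant to address.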

What are reasons other than

What are reasons other than performance to run multiple threads?

Fairness and Interactivity

Imagine we're writing a multi-user OS for a computer that doesn't have any hardware support for threads. Concurrency cannot buy us any extra performance and in fact will damage throughput. Nonetheless, we still might write a threading system so that multiple users get the feeling of interactively, and fairly, using the same hardware "at the same time".

Simple Expression

Often, breaking a problem down into task threads makes it easier to express and reason about - i.e., for properties such as progress or anticipated latencies in the face of arbitrary delays (such as IO or divergence) - even if no actual parallelism is involved. Haskell supported threads for years before it supported parallelism. Alternative techniques, such as event-loop concurrency, require careful scheduling with asynchronous IO and often explicit program transforms (to CPS or functor objects). Such program transforms tend to be difficult to use with popular libraries. A Language-Based Approach to Unifying Events and Threads [pdf] contrasts these further.

Of course, most concurrent programming models (reactive or dataflow, actors model, timed logic, concurrent constraint, term rewriting, etc.) don't require 'threading'. Concurrency is implicit. If threads are used under-the-hood, that's pure implementation detail.

You, thus, might consider threads an artifact of imperative shared-memory programming models. If we were using sane programming models, we'd not see threads.

Really?

According to paragraph 1, it's about exploiting concurrency in CMPs:

Chip multi-processors (CMPs) have become ubiquitous, while tools that ease concurrent programming have not. The promise of increased performance for all applications through ever more parallel hardware requires good tools for concurrent programming, especially for average programmers. Transactional memory (TM) has enjoyed recent interest as a tool that can help programmers program concurrently.

Or, perhaps, the authors were not serious when they wrote their motivation. I came in expecting to see TM used to exploit CMPs for a compute-bound problem. If the goal is a dining-philosophers simulation independent of performance**, the comparison should be between TM and a sequential, deterministic (modulo random seed) round-robin implementation, not between TM and locks.

**Dining philosophers is also applicable to performance for concurrent sequential systems, i.e., single core memory-bound ones. My comments haven't been pertaining to this scenario.

Re: Motivation

I think you misread the motivation.

Decomposing the first sentence: CMP is ubiquitous, therefore we don't need to justify the presence of concurrency in a surrogate problem. Tools that ease concurrent programming are not ubiquitous, thus we should spend more money spreading the word about our favorite tool.

The second sentence repeats these points, with slightly different words, in case you couldn't read between the clauses of the first sentence. The third sentence gives the audience a name for their favorite tool.

It is, over all, a very traditional motivation. ;-)

We can decompose concurrency

We can decompose concurrency into that arising from the exploitation of CPU parallelism and that arising for other reasons (modeling, memory parallelism, etc.). If the motivation is concurrency from CMPs, we're in the first camp. Perhaps you're right and we should write this off as trying to catch interest by riding somebody else's wave.

The paper seems to be really about novices using threads and locks to naturally express concurrent computations without caring for performance. Given that one of the results is that the target audience didn't actually learn the condition variable pattern, this is very far into the novice camp. A fourth alternative, completely sequential code (even more so than coarse locks), might be the real baseline.

Orchestration with Transactions

Haskell's STM provides a feature that I've not seen elsewhere: one may, after reading some data, explicitly retry a transaction, which causes the thread to block until at least one of those STM variables has changed. For example, if you 'retry' after checking whether a queue is empty, you will automatically wait until someone adds to that queue. Thus, Haskell's STM also provides rendezvous between threads. Haskell's AdvSTM also allows you to add IO events onRetry, such as signaling another thread or process to fill that queue... or perhaps counting the number of retries.
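
A minimal sketch of that rendezvous behaviour, using a plain TVar holding a list as the 'queue' (real code would more likely use TQueue):

```haskell
import Control.Concurrent.STM

-- Blocking pop: if the queue is empty, 'retry' suspends the transaction and
-- the thread sleeps until another thread writes a TVar this transaction read.
popBlocking :: TVar [a] -> STM a
popBlocking q = do
  xs <- readTVar q
  case xs of
    []       -> retry
    (x:rest) -> writeTVar q rest >> return x

-- Caller side: this simply waits for the next item to arrive.
nextItem :: TVar [a] -> IO a
nextItem q = atomically (popBlocking q)
```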

I would be interested in seeing how such features affect the 'usability' of STM relative to competing concurrency-management tools. If we protect data using locks, we often must explicitly use 'events' and other utilities to orchestrate threads.

Based on my observations I would tend to agree

I've had similar experiences in the Concurrent Programming course I teach.

In my course I go through semaphores, monitors and various flavors of message passing and just give them one lecture on TM. They get to try out all these mechanisms except TM in exercises and assignments. When it comes to the exam I typically have questions on all these constructs including TM and also a construct which they haven't seen before (I've used Conditional Critical Regions).

When grading I've observed some amazing things. There are very weak students who just don't have the faintest idea how to solve the tasks involving semaphores, monitors, and message passing, although they have been trained in this. However, and this is where it gets interesting, some of them manage to get full scores on the tasks with TM and CCRs.

Now, I haven't tried to make a proper study out of this, and you should take my observations with a handful of salt. Still, having seen these results I am fairly convinced that high-level synchronization constructs such as TM really are much simpler for programmers. It's really worth the effort to pursue research in improving the way we do parallel and concurrent programming.

The problem with STM

Is that the languages still suck.

OK, I'll admit that it's hard to write correct code with good performance using transactions in C++/Java/C#. Is this a problem with transactions, or with C++/Java/C#?

Post author names in articles

Sometimes articles lack authors in their paper references. While a small oversight, I find this decidedly annoying. If you're going to go ahead and post an abstract, it makes sense to post the author list as well. In addition to the good karma one gathers through proper attribution, it helps readers like me more quickly associate it with a previous line of work and/or other work by the authors, if their names are familiar.

Don't want to annoy you :)

Attention to detail is important. Thanks. Updated.

Thanks!

Thanks!