Transactional Memory versus Locks - A Comparative Case Study

Dr. Victor Pankratius is scheduled to give a talk this Friday at UT Austin entitled, "Transactional Memory versus Locks - A Comparative Case Study". The presentation is here, a technical report is forthcoming.

The study is a side-by-side comparison of Transactional Memory with traditional multi-thread programming; general features of the study are:

  • 12 students, apparently not experienced with multi-threaded programming, divided into six teams: three using Pthreads, three using Pthreads and Intel's STM C compiler.
  • All teams worked on the same project, a parallel desktop search engine. Competition was for best performance.
  • TM "improve[s] quality of parallel code [as judged by Dr. Pankratius and the STM compiler team at Intel], reduce[s] implementation and debugging effort, and provide[s] acceptable performance".
  • "If subjects are inexperienced, then TM does not help them either. Parallel programming remains difficult."

If anyone knows more, or can go to the presentation, I would like to hear more.

(I am posting his to the forum since I cannot find a detailed description other than the presentation.)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

TM + Explicit Memory Management

A language without a garbage-collector but with Software Transactional Memory introduces some challenges. I'm a tad curious as to whether Intel's STM C Compiler manages to release memory resources obtained within an aborted transaction.

I find it unfortunate that they named as 'winners' solutions with (presumably malign) race-conditions. The goal shouldn't merely be to achieve performance, via parallelization and whatnot. The ability to intuit or verify certain forms of correctness is also critical.

Compiler download and information

The Intel C++ STM Compiler, Prototype Edition is available to play with. Also, take a look at Code Generation and Optimization for Transactional Memory Constructs in an Unmanaged Language; Ali-Reza Adl-Tabatabai is listed as one of the coauthors of the compiler.

Thanks

From the reference to the Prototype Edition, it seems transactional new/delete is supported as of version 3.0, and malloc/free as of 2.0. Good to know they've added it. I can't imagine having programmed without either GC or TM memory management.

I've been saying this for

I've been saying this for awhile: I want to see a comparison of a C++ STM to something like Cilk++ (and preferably not TBB until frontend support is in place). We know pthreads suck and transactions sometimes help: the question is comparisons of pthread alternatives, and the above would be close to apples-to-apples.

Cilk++

It'd be nice to see how Cilk++, and perhaps some other parallelism mechanisms (such as 'future' pipelining, and the plain-old multi-process-plus-a-database approach) hold up in terms of performance and relative grokkability by novice programmers.

But I'd honestly be more interested in the results if they were for a large, modular project rather than a small, monolithic one. Issues of concurrency become more significant when dealing with dozens of modules written by different programmers or integrating modules with strange concurrency quirks (e.g. OpenGL's thread context).

How does Cilk++ handle modularity? or frameworks? To maintain its 'serial semantics' across a black-box boundary, I imagine it would need to guarantee serial calls of any function that crosses a module boundary. If run under the race-detector, it would only determine races for a particular configuration of modules.

I believe their race

I believe their race detector is dynamic, so it's "just" testing, and unclear whether it digs into linked code. I haven't encountered a need to use that, nor their performance predictor, in my parallel browser project (where I'm really worried about lower-level details like l1/l2 misses).

Crossing the boundary to Cilkified and non-Cilkified code is annoying in my experience, but for difficulties relating to sharing templated code. I don't recall guarantees, but expect there are none (beyond blame ;-)). There is documentation about it if you're curious. My general approach is just to Cilkify everything (... though their compiler creates a slowdown, essentially costing you a core on well-parallelized code). After, I rewrite and optimize again in TBB :)

I have found plugging in other libraries with opposing models of parallelism is fine because the design is often silo'd.... until you want to optimize interactions between the two. That seems like a fundamental algorithmic problem relating to computational model costs (unless, perhaps, it's easy to specify many different algorithms, in which case a tuner for switching between them is not so hard, though understanding and optimizing such code seems hard).

Standing back.... Before worrying about uncooperative code, I'd like to figure out the cooperative version. E.g., I wonder how controversial is the claim that most multithreaded code today is trash anyways. Cilk++ invariants hold for modules within the language. I don't know of 1,000,000loc projects written in Cilk (not just little parts) and thus if these invariants are sufficient. That's a study to itself.

TR is up

http://www.rz.uni-karlsruhe.de/~kb95/papers/pankratius-TMStudy.pdf

It's 34 pages, so it can hopefully answer your questions.

I just got back from the talk

Link to the paper wasn't working for me before attending the talk, so hopefully these comments aren't too redundant.

He clarified that code inspection revealed race conditions in all of the solutions (aside from one case where the use of 1,000+ locks was so convoluted they couldn't identify presence or absence of races), not just the winners. He also clarified that the race conditions were indeed malign, although they didn't manifest in the benchmarking and evaluation.

I asked him if the decision to base the study on an "industrial" stm implementation versus a "research toy" implementation with stronger guarantees was worth it in retrospect. He said if he did it again he'd be interested in trying a different implementation, perhaps in a managed language, but that the prevalence of C in industry meant it couldn't be ignored.

My cynical take on it is that it's a proof point that using an on-your-honor STM together with traditional locking can be equivalently bad as compared to using traditional locking alone.