Threads Cannot be Implemented as a Library

Hans-J. Boehm. Threads Cannot Be Implemented As a Library. Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, June 2005, pp. 261-268. Official version. Technical report version.

In many environments, multi-threaded code is written in a language that was originally designed without thread support (e.g. C), to which a library of threading primitives was subsequently added. There appears to be a general understanding that this is not the right approach. We provide specific arguments that a pure library approach, in which the compiler is designed independently of threading issues, cannot guarantee correctness of the resulting code. We first review why the approach almost works, and then examine some of the surprising behavior it may entail. We further illustrate that there are very simple cases in which a pure library-based approach seems incapable of expressing an efficient parallel algorithm. Our discussion takes place in the context of C with Pthreads, since it is commonly used, reasonably well specified, and does not attempt to ensure type-safety, which would entail even stronger constraints. The issues we raise are not specific to that context.

The paper includes several interesting examples, and reminds us yet again of the importance of formally well defined semantics.


Yeek

It's not often a posting here on LtU directly affects my day job (device drivers, OSs, BIOSes, etc.). But this one does. The Posix threading model is at least as well developed as the threading model OSs provide to device drivers and other services, so the problems he talks about show up in all sorts of ugly ways.

One comment I'd like to add: the first couple of generations of the DEC Alpha chip could only read/write 32-bit and 64-bit values. But I'm wondering if problems couldn't also show up with an aggressively cache-based system. Change reading/writing 64-bit values to reading/writing cache lines, and the problem becomes a lot more apparent, especially considering it's much more likely that data protected by different locks will end up on the same cache line than in the same 32-bit or 64-bit word.
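Roughly, the word-granularity hazard looks like this (a sketch with invented names; the locking discipline is correct for each field, yet on hardware that can only write whole words the generated read-modify-write code still races):

    #include <pthread.h>

    struct shared {
        char a;   /* protected by lock_a */
        char b;   /* protected by lock_b, but shares a word (or cache line) with a */
    };

    struct shared s;
    pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    void thread_one(void) {
        pthread_mutex_lock(&lock_a);
        s.a = 1;    /* on early Alpha: load whole word, merge byte, store word */
        pthread_mutex_unlock(&lock_a);
    }

    void thread_two(void) {
        pthread_mutex_lock(&lock_b);
        s.b = 1;    /* same word: the two "independent" updates can lose a write */
        pthread_mutex_unlock(&lock_b);
    }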

Not as significant as it seems at first?

It's not quite a strawman, but apparently the Posix threads specification doesn't account for 'volatile', and yet 'volatile' is a pretty significant part of the C (and C++) specification; e.g., IIRC all the discussions of "sequence points" serve to control the scope of volatile access rearrangement. ("volatile", to a first approximation, enforces what the paper calls "sequential consistency": volatile accesses can't be reordered.) The paper only mentions volatile when doing performance comparisons, and never really discusses it, apparently because it's outside the scope of its concern.

So yes, it may well be that threads cannot be implemented safely as a pure library--there is a need for a useful memory model to go with it. But volatile seems to offer that, even if the pthreads spec doesn't account for it.

I know plenty of programmers who use threads and use volatile in C and C++ and life goes on just fine. That may be "by chance", because Posix threads didn't specify volatile working (or it may not be by chance because they're not using Posix threads), but if it works fine in C and C++, the primary languages the author sees as "at risk", then I'm not convinced this is of much practical (as opposed to theoretical) significance. (Of course, this being LtU, I suppose talking about the behavior of C/C++ code is purely theoretical, not practical, for most readers.) I suspect that you could just rewrite some of the pthreads spec, change absolutely 0 lines of compiler and library code, and be perfectly happy (hence undercutting the pthreads-as-problematic-example this paper builds upon, if not the paper's general claim).
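For concreteness, the idiom such programmers rely on looks something like this (a sketch with invented names): volatile stops the compiler from caching the flag in a register or reordering the accesses at the language level, which is why code like this often works in practice, but it says nothing about hardware reordering or about the atomicity of the ordinary data, which is the gap the paper is actually worried about.

    #include <pthread.h>

    volatile int done = 0;   /* the flag: volatile, so re-read every time */
    int result;              /* ordinary data, not volatile */
    int observed;

    void *producer(void *arg) {
        result = 42;         /* the CPU may still reorder this past 'done = 1' */
        done = 1;            /* volatile write: the compiler keeps it last */
        return NULL;
    }

    void *consumer(void *arg) {
        while (!done)        /* volatile read each iteration: not hoisted */
            ;                /* spin */
        observed = result;   /* on weakly ordered hardware, may still be stale */
        return NULL;
    }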

On the other hand, maybe it is the case that the rules for volatile are not actually quite sufficient to provide thread safety due to obscure cases and race conditions; and if so I'd want to know about it right away. But that's not in this paper.

Do threads mean shared state?

So yes, it may well be that threads cannot be implemented safely as a pure library--there is a need for a useful memory model to go with it.

I agree with the first part of the statement, and conditionally agree with the second: one needs a memory model assuming threads communicate via shared state. Or do they, by definition?

Strangely enough, I was pondering concurrency (not threads) as a language feature vs. a library just yesterday - is that proof of a collective mind? :-)

Volatile

I know plenty of programmers who use threads and use volatile in C and C++ and life goes on just fine. That may be "by chance", because Posix threads didn't specify volatile working...

Check this interesting discussion, and especially this.

Simple thread usage is fine, though.

I've programmed many multithreaded applications in C++ in my professional life, and I never had any problems with the following simple model: any concurrent job is protected by a mutex/critical section/semaphore that is provided by one of the participants in the job, usually one of the objects involved.
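In Pthreads terms, that model is roughly the following (a minimal sketch; the names are invented): the object participating in the job carries its own mutex, and every access to its state goes through it.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;   /* provided by the participant itself */
        long balance;
    } account;

    void account_init(account *a) {
        pthread_mutex_init(&a->lock, NULL);
        a->balance = 0;
    }

    void account_deposit(account *a, long amount) {
        pthread_mutex_lock(&a->lock);
        a->balance += amount;
        pthread_mutex_unlock(&a->lock);
    }

    long account_read(account *a) {
        pthread_mutex_lock(&a->lock);
        long b = a->balance;
        pthread_mutex_unlock(&a->lock);
        return b;
    }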

I guess the problems arise from wanting to automate serialization, though, whether by the compiler or through a library.

What about FP? The functional programming model theoretically does not suffer from this problem, as there is no destructive update.

Complexity, scalability?

I think time complexity and scalability are the real questions that matter, especially when you get to very large numbers of concurrent threads/processes.

You haven't had problems because...

You haven't had problems because you've been relying on language extensions without realising it. (And, presumably, because you're a competent programmer.)

The claim that threading requires language extensions and not just a library is perfectly true, but it isn't a point against existing thread libraries such as Posix threads because they do, in fact, include language extensions that most users aren't aware of. The Posix spec places additional requirements on a C compiler beyond those in the C spec; the relevant one here is that (to oversimplify a bit) Pthread calls contain implicit memory barriers. Any C (or C++ or insert-language-of-choice) compiler intended to be used on a Posix system has to be aware of these additional rules and respect them when optimising code. A hypothetical "pure C" compiler -- one that implemented the ISO C spec and nothing more -- would mis-compile Pthreads code. (Similar comments apply to Windows threads, of course.)
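To make the implicit rule concrete, consider a sketch like this (names invented): because pthread_mutex_lock and pthread_mutex_unlock are opaque calls that may touch any shared datum, a Posix-aware compiler must reload the shared variable after the lock and write it back before the unlock, rather than caching it in a register across the calls. A hypothetical "pure ISO C" compiler that inlined the lock functions, saw that they never touch shared_count, and hoisted the load out of the critical section would be behaving legally by the C spec and still mis-compiling the program.

    #include <pthread.h>

    extern pthread_mutex_t m;
    extern int shared_count;

    int increment(void) {
        pthread_mutex_lock(&m);     /* must act as a compiler + memory barrier */
        int v = shared_count;       /* must be (re)loaded after the lock */
        shared_count = v + 1;       /* must reach memory before the unlock */
        pthread_mutex_unlock(&m);
        return v;
    }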

This is why the Boehm paper received something of a lukewarm response in the C/C++ threading community. The requirement for language extensions is old news. Anyone who was paying attention already knew that.

You haven't had problems because...

You haven't had problems because you've been relying on language extensions without realising it. (And, presumably, because you're a competent programmer.)

...or I don't touch things where I know threads will mess up my application; for example, bitfield access.

Similar comments apply to Windows threads, of course

What extensions are you talking about? I haven't heard of those. I don't believe VC++ 6.0 has any extensions for threads. Could you provide any links?

...or I don't touch things where...

...or I don't touch things where I know threads will mess up my application; for example, bitfield access.

This is not a workable approach in the long term. As compilers get smarter, they can perform optimisations that can do seriously surprising things to your code. If the compiler isn't aware of the special handling that thread-related functions need, disaster will result sooner or later, because the compiler will do something that would be perfectly safe in single threaded code but falls apart when concepts like cache coherency start to become nontrivial issues.

You may also want to dwell on the fact that any object whose size doesn't match the machine word is really a bitfield in disguise.
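The bitfield case makes this vivid (a sketch with invented names): C gives the compiler no way to store one bitfield without reading and rewriting its neighbours, so two threads updating "different" fields under different locks can still lose updates.

    struct packed_flags {
        unsigned a : 1;   /* thread 1 updates this, holding its own lock */
        unsigned b : 1;   /* thread 2 updates this, holding a different lock */
    };

    struct packed_flags flags;

    void set_a(void) { flags.a = 1; }  /* compiles to: load word, OR in bit, store word */
    void set_b(void) { flags.b = 1; }  /* races with set_a even though the fields differ */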

For an analogous situation outside the realm of threading, consider the massive breakage that hit a wide variety of software (including the Linux kernel) a few years ago when GCC started paying detailed attention to the aliasing rules in the C standard.

What extensions are you talking about? I haven't heard of those.I don't believe VC++ 6.0 has any extensions towards threads. Could you provide any links?

They're not explicitly documented as language extensions because the same company makes the compiler and the operating system, so they can just make Visual C++ aware of the special handling that certain OS calls require behind the scenes. Most of these special cases aren't documented. (It's not in Microsoft's interest to make life easy for third party compiler vendors.)

For an example of one that is documented, see the descriptions of the Interlocked functions in the MSDN documentation, which contain notes like this:

"Intel IPF: This function generates a full memory barrier (or fence) and performs the increment operation. This ensures the strict memory access ordering that is necessary, but it can decrease performance. For performance-critical applications, consider using InterlockedIncrementAcquire or InterlockedIncrementRelease."

The memory barrier here is something the compiler has to be aware that it needs to insert, and is equivalent to the one required by Posix under similar circumstances. (It's actually overkill in many cases, which is why the most recent revision of the Windows API has split many of the Interlocked functions into separate Acquire and Release versions.)
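For reference, usage looks like this (Windows C; a sketch, with the Acquire variant mentioned only as the documented IA-64 alternative from the quote above):

    #include <windows.h>

    volatile LONG counter = 0;

    void bump(void) {
        /* Atomic increment; per the quoted docs, a full fence on IA-64. */
        InterlockedIncrement(&counter);
        /* Where only acquire ordering is needed, the docs suggest
         * InterlockedIncrementAcquire(&counter) as the cheaper variant. */
    }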

Aha, thanks for the info

Aha, thanks for the info. (I actually never needed to use the Interlocked family of functions, so I didn't think of those when I thought about possible 'extensions' for threads. But now that I think about it, there are lots of issues around atomically reading/writing a variable.)

But if the O/S provides atomic R/W functions, why can't the rest of the synchronization primitives, like critical sections/mutexes/semaphores, be in a library?

"Microsoft's interest"

(It's not in Microsoft's interest to make life easy for third party compiler vendors.)

That's highly debatable. It's certainly in the interests of their user community... and a stronger user community is a stronger platform. Perhaps the real reason is, a monopoly need not care?

Hmm

It's an interesting question. I'm certain that "monopolies don't care" is part of it, but I bet it's deeper than that.

First, you need to ask yourself: is a wide range of compilers actually in the interest of users? It seems like it should be; but a compiler monoculture does have its advantages--you never have to worry about whether an app is incompatible with your libc. It's not clear to me whether the benefits of standardization outweigh the drawbacks of a monopoly in this case.

More seriously, though, Microsoft doesn't want a stronger platform; they want a stronger monopoly. One way they strengthen their monopoly is to introduce new lock-in features in the platform; having a compiler monopoly makes that easier, because they don't have to get the third-party compiler vendors to adopt their new features. Just push them out to MSDN, along with lots of marketing to convince the Certifiable Developers that they'll never get the babes if they don't have the Secure Processor Activity Zone Wireless Access Relay Enabler (SPAZWARE).

(This is probably OT, except insofar as language designers sometimes have to think about how to get their languages adopted.)

I don't get it.

I don't see that the compiler must do anything special in this case. And indeed, the code generated looks like an ordinary function call (as I expected), and there's nothing magical inside InterlockedIncrement.

Of course, as I am using ye olde Pentium III with a single CPU, I presumably have single-CPU Pentium III DLLs, which don't need to do anything clever, and that would explain why they contain nothing more cunning than an XADD -- and with no LOCK either. But surely the same code must work identically when run on a multi-CPU system, with only the InterlockedIncrement code changing as appropriate for the local system?

I don't understand... why can the function itself not contain the memory barrier?

You may be taking too narrow a view

You may be taking too narrow a view of the phrase "language extension". It doesn't have to mean an explicitly thread oriented extension such as memory barriers or acquire/release.

In the example you describe, where the critical thread synchronisation functions are off in a DLL, you're effectively taking advantage of the fact that the compiler will treat the DLL as a black box. It will simply generate a call to the dynamically loaded function, and conservatively assume that almost no invariants can be trusted to remain true across the call. In the multithreaded version of the DLL, any necessary synchronisation can happen inside it, without the compiler (of the calling code) having to do anything special.

But this is all outside the scope of the programming language. The C and C++ standards don't recognise the concept of dynamic linking, and have nothing to say about its side effects.

If the code was statically linked, it's not hard to imagine a compiler smart enough to look inside the library code, inline it instead of generating a function call if it thought that would help, and then optimise the resulting code in a way that would interfere with thread synchronisation if the compiler didn't take special precautions. I'm sure there are existing compilers that can do that; if not, there soon will be. In principle you could even do the same thing with DLLs, via some kind of just-in-time compilation machinery.

The fact that DLLs are treated as black boxes, effectively disabling certain optimisations that would be possible if the full source code to both caller and callee were available to the compiler, is a language extension.

FYI

If the code was statically linked, it's not hard to imagine a compiler smart enough to look inside the library code, inline it instead of generating a function call if it thought that would help, and then optimise the resulting code in a way that would interfere with thread synchronisation if the compiler didn't take special precautions. I'm sure there are existing compilers that can do that; if not, there soon will be. In principle you could even do the same thing with DLLs, via some kind of just-in-time compilation machinery.

Just to keep everyone up to date, the current state of Java compilation technology does all that. Of course, Java threads require more than a library, having both language-level mechanisms ("synchronized", "volatile"), and a deeply confusing official memory model, so it's still correct to say that libraries are not enough for concurrency. This does cause some anomalies for the unaware (google for "double-checked locking"), but is otherwise probably as good as shared-state-concurrency-with-explicit-monitors can get.
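For the unaware, the anomaly transposed to C with Pthreads looks like this (a sketch; the pattern is usually presented in Java or C++, and all names here are invented), together with the portable fix via pthread_once:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct { int field; } singleton;

    static singleton *instance;       /* read without a lock on the fast path */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    singleton *get_instance_broken(void) {
        if (instance == NULL) {       /* 1st check: unsynchronised, races */
            pthread_mutex_lock(&m);
            if (instance == NULL) {   /* 2nd check: under the lock */
                singleton *p = malloc(sizeof *p);
                p->field = 42;
                instance = p;         /* another thread may see 'instance'
                                         non-NULL before 'field' is written */
            }
            pthread_mutex_unlock(&m);
        }
        return instance;
    }

    /* The portable Pthreads fix: let the library own the fast path. */
    static pthread_once_t once = PTHREAD_ONCE_INIT;

    static void init(void) {
        instance = malloc(sizeof *instance);
        instance->field = 42;
    }

    singleton *get_instance(void) {
        pthread_once(&once, init);
        return instance;
    }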

Java Memory Model

The original Java memory model did have problems and qualified as "deeply confusing". The memory model was replaced in JDK 1.5 with a new, less confusing model.

In the new memory model, double-checked locking works if the checked field is declared volatile. This fix does not work under the old model.