The Trouble With Locks

On the
blog

November 4: Other Concurrency Sessions at PDC
November 3: PDC'09: Tutorial & Panel

October 26: Hoare on Testing
October 23: Deprecating export Considered for ISO C++0x

The Trouble With Locks

This article appeared in C/C++ Users Journal, 23(3), March 2005.
Many people believe they understand how to write correct concurrent programs, but may not yet fully comprehend just how subtle it really is. This article focuses on the mainstream status quo approach to concurrency, lock-based programming. We will see why lock-based programming is difficult even for experts to use correctly, and why it’s fundamentally flawed for building large programs.

In my most recent articles [1, 2], I presented reasons why concurrency (e.g., multithreading) will be the next revolution in the way we develop software—a sea change of the same order as the OO revolution. I also stated “the vast majority of programmers today don’t grok concurrency, just as the vast majority of programmers 15 years ago didn’t yet grok objects.”

In this column I’d like to consider just one question that several people wrote to ask, namely: “Is concurrency really that hard?” In particular, a few readers felt that lock-based programming is well understood; it is, after all, the status quo mainstream solution to concurrency control.

So, “is concurrency really that hard?” My short answer is this:

Lock-based programming, our status quo, is difficult for experts to get right. Worse, it is also fundamentally flawed for building large programs. This article focuses exclusively on lock-based programming just because there’s so much so say in even a 50,000-foot overview that I ran out of room.

Lock-free programming is difficult for gurus to get right. I’ll save this for another time, but if you’re interested you should check out Andrei Alexandrescu’s recent articles for a sampling of the issues in lock-free programming and hazard pointers. [3, 4] (Aside: Yes, I’m implying Andrei is a guru. I hope he doesn’t mind my outing him in public like this. I don’t think it was much of a secret.) (More aside: The hazard pointer work shows in particular why if you’re writing lock-free data structures you really really want garbage collection. You can do it yourself without garbage collection, but it’s like working with knives that are sharp on both edges and don’t have handles. But that’s another article. Specifically, it’s Andrei’s other article.)

Unfortunately, today’s reality is that only thoughtful experts can write explicitly concurrent programs that are correct and efficient. This is because today’s programming models for concurrency are subtle, intricate, and fraught with pitfalls that easily and frequently result in unforeseen races (i.e., program corruption), deadlocks (i.e., program lockup), and performance cliffs (e.g., priority inversion, convoying, and sometimes complete loss of parallelism and/or worse performance than a single-threaded program). And even when a correct and efficient concurrent program is written, it takes great care to maintain—it’s usually brittle and difficult to maintain correctly because current programming models set a very high bar of expertise required to reason reliably about the operation of concurrent programs, so that apparently innocuous changes to a working concurrent program can (and commonly do in practice) render it entirely or intermittently nonworking in unintended and unexpected ways. Because getting it right and keeping it right is so difficult, at many major software companies there is a veritable priesthood of gurus who write and maintain the core concurrent code.

Some people think I’m overstating this, so let me amplify. In this article, I’ll focus on just the narrow question of how to write a lock-based program basically correctly, meaning that it works (avoids data corruption) and doesn’t hang (avoids deadlock and livelock). That’s pretty much the minimum requirement to write a program that runs at all.

Question: Is the following code thread-safe? If it is, why is it safe? If it isn’t always, under what conditions is it thread-safe?

T Add( T& a, T& b ) {
  T result;
  // … read a and b and set result to proper values …
  return result;
}

There are a lot of possibilities here. Let’s consider some of the major ones.

Lock-Based Solutions?

Assume that reading a T object isn’t an atomic operation. Then if a and/or b are accessible from another thread we have a classic race condition: While we are reading the values of a and/or b, some other thread might be changing those objects, resulting in blowing up the program (if you’re lucky, say by causing the object to follow an internal pointer some other thread just deleted) or reading corrupt values.

How would you solve that? Please stop and think about it before reading on. …

Ready? Okay: Now please stop a little longer and think about your solution some more, and consider whether it might have any further holes, before reading on. …

Now that you’ve thought about it deeply, let’s consider some alternatives.

Today’s typical lock-based approach is to acquire locks so that uses of a and b on one thread won’t interleave. Typically this is done by acquiring a lock on an explicit synchronization object (e.g., a mutex) that covers both objects, or to acquire locks on an implicit mutex associated with the objects themselves. To acquire a lock that covers both objects, Add has to know what that lock is, either because a and b and their lock are globals (but then why pass a and b as parameters?) or because the caller acquires the lock outside of Add (which is usually preferable). To acquire a lock on each object individually, we could write:

T SharedAdd( T& a, T& b ) {
  T result;
  lock locka( a );      // lock is a helper whose constructor acquires a lock
  lock lockb( b );      // and whose destructor releases the lock
  // … read a and b and set result to proper values …
  return result;
} // release the locks

I renamed the function because probably most T objects are never shared and we don’t want to incur the cost of the lock on most Add operations.

Aside: In some cases we might even be more efficient and have a reader-writer lock that allows multiple concurrent readers, but to do that we have to guarantee at least that the readers won’t physically change the object including any internal data indirectly reachable by the object; this is a much stronger requirement than just plain old const because const is logical and shallow, not physical and deep.

Is this enough to make the function thread-safe? No. For one thing, there’s no way to guarantee that all code that uses a and b will respect the lock. There has to be an enforced discipline to ensure everyone plays by the locking rules to take a lock (and it has to be the same lock) before touching a or b.

So let’s assume you have a nice baseball bat (or Nerf gun) and can go around your office and force everyone on your team to play by the rules, so that nobody ever forgets to acquire the right locks when touching a or b. Now are we thread-safe? No, for a completely different reason: SharedAdd as written above is a recipe for deadlock. For a simple example, consider the following code that calls SharedAdd:

T global1, global2;
 
// on thread A
SharedAdd( global1, global2 );
 
// on thread B
SharedAdd( global2, global1 );

One fine day, thread A acquires the lock on global1, gets interrupted by (or runs concurrently with) thread B which acquires the lock on global2, then both wait forever for each other’s locks. Not a great place to be.

I’m sure a good number of readers saw this one coming; good for you. The next question is, how would you fix this? One approach would be to recognize that we’re okay as long as the locks are always acquired in the same order, and smart folks like Andrei have published helper types that make it easier to automate that. Let’s roll our own: Let’s have SharedAdd harden itself to this by consistently taking the locks in a given global order, say by the objects’ addresses, and we’ll even be smart and remember that the standard doesn’t allow direct comparisons between pointers to arbitrary objects but does allow them using the std::less library helper:

T SharedAdd( T& a, T& b ) {
  T result;
  lock lock1( less()(&a,&b) ? a : b );
  lock lock2( less()(&a,&b) ? b : a );
  // … read a and b and set result to proper values …
  return result;
} // release the locks

Now the problematic code we showed earlier is fine, because regardless of the order that global1 and global2 are passed in, we will always lock the one having the lowest address first, and so the deadlock cannot happen… right?

True, it cannot happen in that example. But any other code in the world might have acquired a lock on whichever one has the higher address, then called SharedAdd. Ouch. And it could be code we don’t control, such as third-party libraries we only have binaries to. (And it could be code that hasn’t been written yet and won’t exist until a few years from now, such as new features or new overrides of virtual functions. More on this in a moment.)

Maybe a Lock-Based Solution Plus Extreme and Perfect Discipline?

But let’s keep pushing on the lock-based approaches for a moment. Let’s assume that every piece of code in our universe is under our control and all of it always strictly plays fair and never gets the locking order wrong, not even by mistake. Quite an assumption, but let’s assume.

Is that enough to be thread-safe? Not in general, for the simple reason that this approach only works if the only locks you ever acquire are acquired not only in the same order but at the same time, so that some code somewhere knows what all the locks in use at a given time will be and can acquire them at once. Put another way, locking operations can’t safely nest. That’s a problem, because it basically rules out safe locking in layered applications and libraries, which covers most of the world’s significant software.

To make this more concrete, consider: What does SharedAdd have to do to calculate the new values? For example, in the “read a and b and set result to proper values” code, can it call (possibly virtual) member functions of a or b? Can it call helper functions? And does whatever it calls, and all the functions those things call, and so on, ever try to acquire any locks?

If even one part of the code that gets executed inside the a and b locks caused us to call a function somewhere that could try to acquire a lock, we’re open to deadlock again, even if every function everywhere in the code globally respected the discipline to acquire locks in order. Consider: What if during “read a and b and set result to proper values” we end up calling a function that tries to acquire a lock on an object c whose address is between a’s and b’s? That too is a recipe for deadlock, because if we’re doing that while some other thread already holds the lock for c, and then that thread tries to lock a or b (say, by calling SharedAdd(a,b)), we’re toast. Kaput.

This demonstrates the fundamental issue that lock-based synchronization isn’t composable, meaning that you can’t take two concurrency-correct lock-based pieces of code and glue them together into a bigger concurrency-correct piece of code. Building a large system using lock-based programming is like building a house using bricks that are made out of different kinds of volatile materials that are stable in isolation but not in all combinations, so that they often work well and stack nicely but occasionally they cause violent explosions when the wrong two bricks happen to touch.

That’s why calling into any unknown code while holding a lock is a recipe for deadlock.

Let me share one data point that illustrates both that lock-based programming is hard and that very few people are actually doing much of it: Many shipping frameworks, even very popular ones by very savvy companies, do dangerous stuff that they’re just getting away with because developers aren’t writing much lock-based code. Here’s an example pointed out to me by Chris Brumme [5]: Look in just about any framework, even ones that are designed for concurrency including the Java libraries and .NET Framework, and you will find code that calls a virtual function while holding a lock—which is a recipe for deadlock. (Think about it: Deadlock can happen anytime you hold a lock and try to acquire another one, unless you can rule out that another thread might try to acquire the locks in the other order. So if you’re holding a lock, and then call a virtual function that could or should be overridden by code outside your framework, which by definition means you’re calling out into unknown and unknowable black-box code… One shudders.) And the folks who write those libraries are smart people. Locking is the status quo, but it’s not good enough for widespread use.

There are other lock-based solutions we haven’t talked about. One is to use monitors, or objects that lock themselves so as to guarantee that all member functions are synchronized with respect to each other, so that no two threads can be executing member functions on the same object at the same time. Examples in the field include Java’s Vector and the .NET Framework’s SyncHashTable. Fundamentally, just the fact that monitors use locks shows that they’re open to the same issues as all uses of locks, including deadlock if their implementations ever call into unknown code while holding the object’s lock. And such internally locked objects have their own additional pitfalls, notably that they are limited and don’t solve the specific problem we’ve been considering which requires atomic use of both a and b at the same time (and locking each individual member function of a and b isn’t enough to do that).

Conclusions

In this article, we’ve focused on only the one narrow question of how to write a lock-based program correctly, meaning that it works (avoids data corruption) and doesn’t hang (avoids deadlock and livelock). That’s the minimum requirement to write a program that runs at all. Along the way, we’ve noted reasons why many concurrent programs in the field today aren’t actually correct, but are just getting away with lock-based programming’s flaws because concurrency isn’t yet in routine widespread use.

We have virtually ignored the equally difficult question how to write a lock-based program efficiently, one that actually scales decently well when multiple processors are available and avoids all the common pitfalls, from priority inversion to serialization to convoying to aggressive locking, that can easily render it no better, and even worse, than a single-threaded program.

We’ve also ignored lock-free programming, which avoids nearly all the problems of lock-based programming but creates even more; experts readily admit that lock-free programming is much harder still than writing lock-based code.

Personally, I think that the most important sentence in [1] and [2] was the fourth and last conclusion: “4. Finally, programming languages and systems will increasingly be forced to deal well with concurrency.” Before concurrency becomes accessible for routine use by mainstream developers, we will need significantly better still-to-be-designed higher-level language abstractions for concurrency than are available in any language on any platform. Lock-based programming is our status quo and it isn’t enough; it is known to be not composable, hard to use, and full of latent races (where you forgot to lock) and deadlocks (where you locked the wrong way). Comparing the concurrency world to the programming language world, basics like semaphores have been around almost as long as assembler and are at the same level as assembler; locks are like structured programming with goto; and we need an “OO” level of abstraction for concurrency.

The good news is that many people are working on this problem. As the next-generation programming models become available, they will be a key factor in opening the gates to the broad and strong adoption of concurrent programming that will let us make our programs simpler (by separating logical control flows), more responsive (by accomplishing more work during idle time like waiting for disk and memory), and fully able to exploit the continuing exponential gains in processor throughput.

Notes

[1] H. Sutter. “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Dr. Dobb’s Journal, 30(3), March 2005).

[2] H. Sutter. “The Concurrency Revolution” (C/C++ Users Journal, 23(2), February 2005). This is an abbreviated version of [1].

[3] A. Alexandrescu. “Lock-Free Data Structures” (C/C++ Users Journal, 22(10), October 2004).

[4] A. Alexandrescu and M. Michael. “Lock-Free Data Structures with Hazard Pointers” (C/C++ Users Journal, 22(12), December 2004).

[5] http://blogs.msdn.com/cbrumme

The Trouble With Locks

Lock-Based Solutions?

Maybe a Lock-Based Solution Plus Extreme and Perfect Discipline?

Conclusions

Notes

Copyright © 2009 Herb Sutter