GotW #45: Reference Counting

GotW #45

On the
blog

November 4: Other Concurrency Sessions at PDC
November 3: PDC'09: Tutorial & Panel

October 26: Hoare on Testing
October 23: Deprecating export Considered for ISO C++0x

This is the original GotW problem and solution substantially as posted to Usenet. See the book More Exceptional C++ (Addison-Wesley, 2002) for the most current solution to this GotW issue. The solutions in the book have been revised and expanded since their initial appearance in GotW. The book versions also incorporate corrections, new material, and conformance to the final ANSI/ISO C++ standard.

Reference Counting - Part III
Difficulty: 9 / 10

In this final chapter of the miniseries, we consider the effects of thread safety on reference-counted strings. Is reference counting really an optimization? The answer will likely surprise you.

Problem

JG Question

1. Why is Optimized::String not thread-safe? Give examples.

Guru Questions

2. Demonstrate how to make Optimized::String (from GotW #44) thread-safe:

a) assuming that there atomic operations to get, set, and compare integer values; and
b) assuming that there aren't.

3. What are the effects on performance? Discuss.

Solution

Introduction

Standard C++ is silent on the subject of threads. Unlike Java, C++ has no built-in support for threads, and does not attempt to address thread safety issues through the language. So why a GotW issue on threads? Simply because more and more of us are writing multithreaded (MT) programs, and no discussion of reference-counted String implementations is complete if it does not cover thread safety issues.

I won't go into a long discussion about what's involved with writing thread-safe code in detail; see a good book on threads for more details.

1. Why is Optimized::String not thread-safe? Give examples.

It is the responsibility of code that uses an object to ensure that access to the object is serialized as needed. For example, if a certain String object could be modified by two different threads, it's not the poor String object's job to defend itself from abuse; it's the job of the code that uses the String to make sure that two threads never try to modify the same String object at the same time.

The problem with the code in GotW #44 is twofold: First, the reference-counted (copy-on-write, COW) implementation is designed to hide the fact that two different visible String objects could be sharing a common hidden state; hence it's the String class's responsibility to ensure that the calling code never modifies a String whose representation is shared. The String code shown already does that, by performing a deep copy ("un-sharing" the representation) if necessary the moment a caller tries to modify the String. This part is fine in general.

Unfortunately, it now brings us to the second problem: The code in String that "un-shares" the representation isn't thread-safe. Imagine that there are two String objects s1 and s2:

a) s1 and s2 happen to share the same representation under the covers (okay, because String is designd for this);
b) thread 1 tries to modify s1 (okay, because thread 1 knows that no one else is trying to modify s1);
c) thread 2 tries to modify s2 (okay, because thread 2 knows that no one else is trying to modify s2);
d) at the same time (error)

The problem is (d): At the same time, both s1 and s2 will attempt to "un-share" their shared representation, and the code to do that is is not thread-safe. Specifically, consider the very first line of code in String::AboutToModify():

    void String::AboutToModify(
      size_t n,
      bool   bMarkUnshareable /* = false */
    ) {
      if( data_->refs > 1 && data_->refs != Unshareable ) {
        /* ... etc. ... */

This if-condition is not thread-safe. For one thing, evaluating even "data_->refs > 1" may not be atomic; if so, it's possible that if thread 1 tries to evaluate "data_->refs > 1" while thread 2 is updating the value of refs, the value read from data_->refs might be anything -- 1, 2, or even something that's neither the original value nor the new value. The problem is that String isn't following the basic thread-safety requirement that code that uses an object must ensure that access to the object is serialized as needed. Here, String has to ensure that no two threads are going to use the same "refs" value in a dangerous way at the same time. The traditional way to accomplish this is to serialize access to the shared StringBuf (or just its .refs member) using a critical section or semaphore. In this case, at minimum the "comparing an int" operation must be guaranteed to be atomic.

This brings us to the second issue: Even if reading and updating "refs" were atomic, there are two parts to the if condition. The problem is that the thread executing the above code could be interrupted after it evaluates the first part of the condition but before it evaluates the second part. In our example:

Thread 1	Thread 2
>> enter s1's AboutToModify() evaluate "data_->refs > 1" (true, because data_->refs is 2)
****** context switch ******
	>> enter s2's AboutToModify() (runs all the way to completion, including that it decrements data_->refs to the value 1) << exit s2's AboutToModify()
****** context switch ******
evaluate "data_->refs != Unshareable" (true, because data_->refs is now 1) enters AboutToModify's "I'm shared and need to unshare" block, which clones the representation, decrements data_->refs to the value 0, and gives up the last pointer to the StringBuf... poof, we have a memory leak because the StringBuf that had been shared by s1 and s2 can now never be deleted

Having covered that, we're ready to see how to solve these safety problems.

2. Demonstrate how to make Optimized::String (from GotW #44) thread-safe:

a) assuming that there atomic operations to get, set, and compare integer values; and

b) assuming that there aren't.

I'm going to answer b) first because it's more general. What's needed here is a lock-management device like a critical section or a mutex. The code is equivalent in either case, so below I'll use a critical section, which is usually a more efficient synchronization device than a general-purpose mutex. The key to what we're about to do is quite simple: It turns out that if we do things right we only need to lock access to the reference count itself.

Before doing anything else, we need to add a CriticalSection member object into Optimized::StringBuf. Let's call the member cs:

  namespace Optimized {

    struct StringBuf {
        StringBuf();              // start off empty
       ~StringBuf();              // delete the buffer
        StringBuf( const StringBuf& other, size_t n = 0 );
                                  // initialize to copy of other,
                                  //  and ensure len >= n

        void Reserve( size_t n ); // ensure len >= n

        char*    buf;             // allocated buffer
        size_t   len;             // length of buffer
        size_t   used;            // # chars actually used
        unsigned refs;            // reference count
        CriticalSection cs;       // serialize work on this object
    };

The only function that necessarily operates on two StringBuf objects at once is the copy constructor. String only calls StringBuf's copy constructor from two places (from String's own copy constructor, and from AboutToModify()). Note that String only needs to serialize access to the reference count, because by definition no String will do any work on a StringBuf that's shared (if it is shared, it will be read in order to take a copy, but we don't need to worry about anyone else trying to change or Reserve() or otherwise alter/move the buffer).

The default constructor needs no locks:

    String::String() : data_(new StringBuf) { }

The destructor need only lock its interrogation and update of the refs count:

    String::~String() {
      bool bDelete = false;
      data_->cs.Lock(); //---------------------------
      if( data_->refs == Unshareable || --data_->refs < 1 ) {
        bDelete = true;
      }
      data_->cs.Unlock(); //-------------------------
      if( bDelete ) {
        delete data_;
      }
    }

For the String copy constructor, note that we can assume that the other String's data buffer won't be modified or moved during this operation, since it's the responsibility of the caller to serialize access to visible objects. We must still, however, serialize access to the reference count itself, as we did above:

    String::String( const String& other )
    {
      bool bSharedIt = false;
      other.data_->cs.Lock(); //---------------------
      if( other.data_->refs != Unshareable ) {
        bSharedIt = true;
        data_ = other.data_;
        ++data_->refs;
      }
      other.data_->cs.Unlock(); //-------------------

      if( !bSharedIt ) {
        data_ = new StringBuf( *other.data_ );
      }
    }

So making the String copy constructor safe wasn't very hard at all. This brings us to AboutToModify(), which turns out to be very similar, but notice that this sample code actually acquires the lock during the entire deep copy operation... really, the lock is strictly only needed when looking at the refs value, and again when updating the refs value at the end, but I decided to lock the whole operation instead of getting slightly better concurrency by releasing the lock during the deep copy and then reacquiring it just to update refs:

    void String::AboutToModify(
      size_t n,
      bool   bMarkUnshareable /* = false */
    ) {
      data_->cs.Lock(); //---------------------------
      if( data_->refs > 1 && data_->refs != Unshareable ) {
        StringBuf* newdata = new StringBuf( *data_, n );
        --data_->refs;   // now all the real work is
        data_->cs.Unlock(); //-----------------------
        data_ = newdata; //  done, so take ownership
      }
      else {
        data_->cs.Unlock(); //-----------------------
        data_->Reserve( n );
      }
      data_->refs = bMarkUnshareable ? Unshareable : 1;
    }

None of the other functions need to be changed. Append() and operator[]() don't need locks because once AboutToModify() completes we're guaranteed that we're not using a shared representation. Length() doesn't need a lock because by definition we're okay if our StringBuf is not shared (there's no one else to change the used count on us), and we're okay if it is shared (the other thread would take its own copy before doing any work and hence still wouldn't modify our used count on us):

    void String::Append( char c ) {
      AboutToModify( data_->used+1 );
      data_->buf[data_->used++] = c;
    }

    size_t String::Length() const {
      return data_->used;
    }

    char& String::operator[]( size_t n ) {
      AboutToModify( data_->len, true );
      return *(data_->buf+n);
    }

  }

Again, note the interesting thing in all of this: The only locking we needed to do involved the refs count itself.

With that observation and the above general-purpose solution under our belts, let's look back to the a) part of the question:

a) assuming that there atomic operations to get, set, and compare integer values; and

Some operating systems provide these kinds of functions.

NOTE: These functions are usually much more efficient than general-purpose synchronization primitives like critical sections and mutexes. It is, however, a fallacy so say that we can use atomic integer operations "instead of locks" because locking is still required -- the locking is just generally less expensive than other alternatives, but it's not free by a long shot.

Here is a thread-safe implementation of String that assumes we have three functions: an IntAtomicGet, and IntAtomicDecrement and IntAtomicIncrement that safely return the new value. We'll do essentially the same thing we did above, but use only atomic integer operations to serialize access to the refs count:

  namespace Optimized {

    String::String() : data_(new StringBuf) { }

    String::~String() {
      if( IntAtomicGet( data_->refs ) == Unshareable ||
          IntAtomicDecrement( data_->refs ) < 1 ) {
        delete data_;
      }
    }

    String::String( const String& other )
    {
      if( IntAtomicGet( other.data_->refs ) != Unshareable ) {
        data_ = other.data_;
        IntAtomicIncrement( data_->refs );
      }
      else {
        data_ = new StringBuf( *other.data_ );
      }
    }

    void String::AboutToModify(
      size_t n,
      bool   bMarkUnshareable /* = false */
    ) {
      int refs = IntAtomicGet( data_->refs );
      if( refs > 1 && refs != Unshareable ) {
        StringBuf* newdata = new StringBuf( *data_, n );
        if( IntAtomicDecrement( data_->refs ) < 1 ) {
          delete newdata;  // just in case two threads
        }                  //  are trying this at once
        else {             // now all the real work is
          data_ = newdata; //  done, so take ownership
        }
      }
      else {
        data_->Reserve( n );
      }
      data_->refs = bMarkUnshareable ? Unshareable : 1;
    }

    void String::Append( char c ) {
      AboutToModify( data_->used+1 );
      data_->buf[data_->used++] = c;
    }

    size_t String::Length() const {
      return data_->used;
    }

    char& String::operator[]( size_t n ) {
      AboutToModify( data_->len, true );
      return *(data_->buf+n);
    }

  }

3. What are the effects on performance? Discuss.

Without atomic integer operations, copy-on-write typically incurs a significant performance penalty. Even with atomic integer operations, COW can make common String operations take nearly 50% longer -- even in single-threaded programs.

In general, copy-on-write is often a bad idea in multithread-ready code. In short, the reason is that the calling code can no longer know whether two distinct String objects actually share the same representation under the covers, and so String must incur overhead to do enough serialization that calling code can take its normal share of responsibility for thread safety. Depending on the availability of more-efficient options like atomic integer operations, the impact on performance ranges from "moderate" to "profound."

SOME EMPIRICAL RESULTS

In this test environment I tested six main flavours of string implementations:

             Name  Description
  ---------------  ---------------------------------------
            Plain  Non-use-counted string; all others are
                   modeled on this (a refined version of
                   the GotW #43 answer)

       COW_Unsafe  Plain + COW, not thread-safe (a refined
                   version of the GotW #44 answer)

    COW_AtomicInt  Plain + COW + thread-safe (a refined
                   version of this GotW #45 1(a) answer
                   above)

   COW_AtomicInt2  COW_AtomicInt + StringBuf in same
                   buffer as the data (another refined
                   version of this GotW #45 #1(a) above)

      COW_CritSec  Plain + COW + thread-safe (Win32
                   critical sections) (a refined version
                   of this GotW #45 #1(b) answer above)

        COW_Mutex  Plain + COW + thread-safe (Win32
                   mutexes) (COW_CritSec with mutexes
                   instead of critical sections)

I also threw in a seventh flavour to measure the result of optimizing memory allocation instead of optimizing copying:

  Plain_FastAlloc  Plain + an optimized memory allocator

I focused on comparing Plain with COW_AtomicInt. COW_AtomicInt was generally the most efficient thread-safe COW implementation. The results were as follows:

1. For all mutating and possibly-mutating operations, COW_AtomicInt was always worse than Plain. This is natural and expected.
2. COW should shine when there are many unmodified copies, but for an average string length of 50:
a) When 33% of all copies were never modified, and the rest were modified only once each, COW_AtomicInt was still slower than Plain.
b) When 50% of all copies were never modified, and the rest were modified only thrice each, COW_AtomicInt was still slower than Plain.
This result may be more surprising to many -- particularly that COW_AtomicInt is slower in cases where there are more copy operations than mutating operations in the entire system!
Note that, in both cases, traditional thread-unsafe COW did perform better than Plain. This shows that indeed COW can be an optimization for purely single- threaded environments, but it is less often appropriate for thread-safe code.
3. It is a myth that COW's principal advantage lies in avoiding memory allocations. Especially for longer strings, COW's principal advantage is that it avoids copying the characters in the string.
4. Optimized allocation, not COW, was a consistent true speed optimization in all cases (but note that it does trade off space). Here is perhaps the most important conclusion from the Detailed Measurements section:
"* Most of COW's primary advantage for small strings could be gained without COW by using a more efficient allocator. (Of course, you could also do both -- use COW and an efficient allocator.)"

Q: Why measure something as inefficient as COW_CritSec? A: Because at least one popular commercial basic_string implementation used this method as recently as a year ago (and perhaps still does, I haven't seen their code lately), despite the fact that COW_CritSec is nearly always a pessimization. Be sure to check whether your library vendor is doing this, because if the library is built for possible multithreaded use then you will bear the performance cost all the time -- even if your program is single-threaded.

Q: What's COW_AtomicInt2? A: It's the same as COW_AtomicInt except that, instead of allocating a StringBuf and then separately allocating the data buffer, the StringBuf and data are in the same allocated block of memory. Note that all other COW_* variants use a fast allocator for the StringBuf object (so that there's no unfair "double allocation" cost), and the purpose of COW_AtomicInt2 is mainly to demonstrate that I have actually addressed that concern... COW_AtomicInt2 is actually slightly slower than COW_AtomicInt for most operations (because of the extra logic).

I also tested the relative performance of various integer operations (incrementing int, incrementing volatile int, and incrementing int using the Win32 atomic integer operations), to ensure that COW_AtomicInt results were not unfairly skewed by poor implementations or function call overhead.

APPROACH

To assess COW, I performed measurements of three kinds of functions:

- copying (where COW shines, its raison d'etre)
- mutating operations that could trigger reallocation (represented by Append, which gradually lengthens; this is to make sure any conclusions drawn can take into account periodic reallocation overhead due to normal string use)
- possibly-mutating operations that do not change length enough to trigger reallocation, or that do not actually mutate the string at all (represented by operator[])

It turns out that the last two both incur a constant (and similar, within ~20%) cost per operation, and can be roughly considered together. Assuming for simplicity that mutating-and-extending operations like Append (235ms overhead) and possibly-mutating operations like operator[] (280ms overhead) will be about equally frequent, the COW_AtomicInt overhead for mutating and possibly-mutating operations is about 260ms per 1,000,000 operations in this implementation.

Finally, for each of 2(a) and 2(b), I first used the "Raw Measurements" section below to hand-calculate a rough prediction of expected relative performance, then ran the test to check actual performance.

SUMMARY FOR CASE 2(a):

    PREDICTION

      COW_AtomicInt Cost         Plain Cost
      -------------------------  ----------------------
      1M shallow copies          1M deep copies
       and dtors            400   and dtors        1600
      667K mutations        ???                     ???
      667K deep copies     1060
      extra overhead on
       667K deep copies     ???
      extra overhead on
       667K mutations       175
                          -----                   -----
                           1635+                   1600+

    TEST
        (program that makes copies in a tight loop, and
         modifies 33% of them with a single Append and
         another 33% of them with a single op[])

      Running 1000000 iterations with strings of length 50:
        Plain_FastAlloc    642ms  copies: 1000000  allocs: 1000007
                  Plain   1726ms  copies: 1000000  allocs: 1000007
             COW_Unsafe   1629ms  copies: 1000000  allocs:  666682
          COW_AtomicInt   1949ms  copies: 1000000  allocs:  666682
         COW_AtomicInt2   1925ms  copies: 1000000  allocs:  666683
            COW_CritSec  10660ms  copies: 1000000  allocs:  666682
              COW_Mutex  33298ms  copies: 1000000  allocs:  666682

SUMMARY FOR CASE 2(b):

    PREDICTION

      COW_AtomicInt Cost         Plain Cost
      -------------------------  ----------------------
      1M shallow copies          1M deep copies
       and dtors            400   and dtors        1600
      1.5M mutations        ???                     ???
      500K deep copies      800
      extra overhead on
       500K deep copies     ???
      extra overhead on
       1.5M mutations       390
                          -----                   -----
                           1590+                   1600+

    TEST
        (program that makes copies in a tight loop, and
         modifies 25% of them with three Appends and
         another 25% of them with three operator[]s)

      Running 1000000 iterations with strings of length 50:
        Plain_FastAlloc    683ms  copies: 1000000  allocs: 1000007
                  Plain   1735ms  copies: 1000000  allocs: 1000007
             COW_Unsafe   1407ms  copies: 1000000  allocs:  500007
          COW_AtomicInt   1838ms  copies: 1000000  allocs:  500007
         COW_AtomicInt2   1700ms  copies: 1000000  allocs:  500008
            COW_CritSec   8507ms  copies: 1000000  allocs:  500007
              COW_Mutex  31911ms  copies: 1000000  allocs:  500007

RAW MEASUREMENTS

TESTING CONST COPYING + DESTRUCTION: The target case of COW

  Notes:
   - COW_AtomicInt always took over twice as long to create and
      destroy a const copy as did plain thread-unsafe COW.
   - For every copy of a string that was later modified,
      COW_AtomicInt added constant unrecoverable overhead
      (400ms per 1,000,000) not counting the overhead on other
      operations.
   * Most of COW's primary advantage for small strings could be
      gained without COW by using a more efficient allocator.
      (Of course, you could also do both -- use COW and an
      efficient allocator.)
   * COW's primary advantage for large strings lay, not in
      avoiding the allocations, but in avoiding the char copying.

Running 1000000 iterations with strings of length 10:
  Plain_FastAlloc    495ms  copies: 1000000  allocs: 1000003
            Plain   1368ms  copies: 1000000  allocs: 1000003
       COW_Unsafe    160ms  copies: 1000000  allocs:       3
    COW_AtomicInt    393ms  copies: 1000000  allocs:       3
   COW_AtomicInt2    433ms  copies: 1000000  allocs:       4
      COW_CritSec    428ms  copies: 1000000  allocs:       3
        COW_Mutex  14369ms  copies: 1000000  allocs:       3

Running 1000000 iterations with strings of length 50:
  Plain_FastAlloc    558ms  copies: 1000000  allocs: 1000007
            Plain   1598ms  copies: 1000000  allocs: 1000007
       COW_Unsafe    165ms  copies: 1000000  allocs:       7
    COW_AtomicInt    394ms  copies: 1000000  allocs:       7
   COW_AtomicInt2    412ms  copies: 1000000  allocs:       8
      COW_CritSec    433ms  copies: 1000000  allocs:       7
        COW_Mutex  14130ms  copies: 1000000  allocs:       7

Running 1000000 iterations with strings of length 100:
  Plain_FastAlloc    708ms  copies: 1000000  allocs: 1000008
            Plain   1884ms  copies: 1000000  allocs: 1000008
       COW_Unsafe    171ms  copies: 1000000  allocs:       8
    COW_AtomicInt    391ms  copies: 1000000  allocs:       8
   COW_AtomicInt2    412ms  copies: 1000000  allocs:       9
      COW_CritSec    439ms  copies: 1000000  allocs:       8
        COW_Mutex  14129ms  copies: 1000000  allocs:       8

Running 1000000 iterations with strings of length 250:
  Plain_FastAlloc   1164ms  copies: 1000000  allocs: 1000011
            Plain   5721ms  copies: 1000000  allocs: 1000011 [*]
       COW_Unsafe    176ms  copies: 1000000  allocs:      11
    COW_AtomicInt    393ms  copies: 1000000  allocs:      11
   COW_AtomicInt2    419ms  copies: 1000000  allocs:      12
      COW_CritSec    443ms  copies: 1000000  allocs:      11
        COW_Mutex  14118ms  copies: 1000000  allocs:      11

Running 1000000 iterations with strings of length 1000:
  Plain_FastAlloc   2865ms  copies: 1000000  allocs: 1000014
            Plain   4945ms  copies: 1000000  allocs: 1000014
       COW_Unsafe    173ms  copies: 1000000  allocs:      14
    COW_AtomicInt    390ms  copies: 1000000  allocs:      14
   COW_AtomicInt2    415ms  copies: 1000000  allocs:      15
      COW_CritSec    439ms  copies: 1000000  allocs:      14
        COW_Mutex  14059ms  copies: 1000000  allocs:      14

Running 1000000 iterations with strings of length 2500:
  Plain_FastAlloc   6244ms  copies: 1000000  allocs: 1000016
            Plain   8343ms  copies: 1000000  allocs: 1000016
       COW_Unsafe    174ms  copies: 1000000  allocs:      16
    COW_AtomicInt    397ms  copies: 1000000  allocs:      16
   COW_AtomicInt2    413ms  copies: 1000000  allocs:      17
      COW_CritSec    446ms  copies: 1000000  allocs:      16
        COW_Mutex  14070ms  copies: 1000000  allocs:      16



TESTING APPEND: An always-mutating periodically-reallocating operation

  Notes:
   - Plain always outperformed COW.
   - The overhead of COW_AtomicInt compared to Plain did not
      vary greatly with string lengths: It was fairly constant
      at about 235ms per 1,000,000 operations.
   - The overhead of COW_AtomicInt compared to COW_Unsafe did not
      vary greatly with string lengths: It was fairly constant
      at about 110ms per 1,000,000 operations.
   * The overall ever-better performance for longer strings was
      due to the allocation strategy (see GotW #43), not COW vs.
      Plain issues.

Running 1000000 iterations with strings of length 10:
  Plain_FastAlloc    302ms  copies:       0  allocs:  272730
            Plain    565ms  copies:       0  allocs:  272730
       COW_Unsafe    683ms  copies:       0  allocs:  272730
    COW_AtomicInt    804ms  copies:       0  allocs:  272730
   COW_AtomicInt2    844ms  copies:       0  allocs:  363640
      COW_CritSec    825ms  copies:       0  allocs:  272730
        COW_Mutex   8419ms  copies:       0  allocs:  272730

Running 1000000 iterations with strings of length 50:
  Plain_FastAlloc    218ms  copies:       0  allocs:  137262
            Plain    354ms  copies:       0  allocs:  137262
       COW_Unsafe    474ms  copies:       0  allocs:  137262
    COW_AtomicInt    588ms  copies:       0  allocs:  137262
   COW_AtomicInt2    536ms  copies:       0  allocs:  156871
      COW_CritSec    607ms  copies:       0  allocs:  137262
        COW_Mutex   7614ms  copies:       0  allocs:  137262

Running 1000000 iterations with strings of length 100:
  Plain_FastAlloc    182ms  copies:       0  allocs:   79216
            Plain    257ms  copies:       0  allocs:   79216
       COW_Unsafe    382ms  copies:       0  allocs:   79216
    COW_AtomicInt    492ms  copies:       0  allocs:   79216
   COW_AtomicInt2    420ms  copies:       0  allocs:   89118
      COW_CritSec    535ms  copies:       0  allocs:   79216
        COW_Mutex   7810ms  copies:       0  allocs:   79216

Running 1000000 iterations with strings of length 250:
  Plain_FastAlloc    152ms  copies:       0  allocs:   43839
            Plain    210ms  copies:       0  allocs:   43839
       COW_Unsafe    331ms  copies:       0  allocs:   43839
    COW_AtomicInt    438ms  copies:       0  allocs:   43839
   COW_AtomicInt2    366ms  copies:       0  allocs:   47825
      COW_CritSec    485ms  copies:       0  allocs:   43839
        COW_Mutex   7358ms  copies:       0  allocs:   43839

Running 1000000 iterations with strings of length 1000:
  Plain_FastAlloc    123ms  copies:       0  allocs:   14000
            Plain    149ms  copies:       0  allocs:   14000
       COW_Unsafe    275ms  copies:       0  allocs:   14000
    COW_AtomicInt    384ms  copies:       0  allocs:   14000
   COW_AtomicInt2    299ms  copies:       0  allocs:   15000
      COW_CritSec    421ms  copies:       0  allocs:   14000
        COW_Mutex   7265ms  copies:       0  allocs:   14000

Running 1000000 iterations with strings of length 2500:
  Plain_FastAlloc    122ms  copies:       0  allocs:    6416
            Plain    148ms  copies:       0  allocs:    6416
       COW_Unsafe    279ms  copies:       0  allocs:    6416
    COW_AtomicInt    380ms  copies:       0  allocs:    6416
   COW_AtomicInt2    304ms  copies:       0  allocs:    6817
      COW_CritSec    405ms  copies:       0  allocs:    6416
        COW_Mutex   7281ms  copies:       0  allocs:    6416



TESTING OPERATOR[]: A possibly-mutating operation, never does mutate

  Notes:
   - Plain always vastly outperformed COW.
   - Results were independent of string lengths.
   - The overhead of COW_AtomicInt compared to Plain was
      constant at about 280ms per 1,000,000 operations.
   - COW_AtomicInt2 fared better in this test case, but
      COW_AtomicInt did better overall and so I am focusing
      on comparing that with Plain.

[10x iterations] Running 10000000 iterations with strings of length 10:
  Plain_FastAlloc      3ms  copies:       0  allocs:       3 [*]
            Plain      2ms  copies:       0  allocs:       3 [*]
       COW_Unsafe   1698ms  copies:       0  allocs:       3
    COW_AtomicInt   2833ms  copies:       0  allocs:       3
   COW_AtomicInt2   2112ms  copies:       0  allocs:       4
      COW_CritSec   3587ms  copies:       0  allocs:       3
        COW_Mutex  71787ms  copies:       0  allocs:       3

   [*] within measurement error margin, both varied from 0ms to 9ms



TESTING VARIOUS INTEGER INCREMENT/DECREMENT OPERATIONS

  Test Summary:
   - "plain" performs the operations on normal
      nonvolatile ints
   - "volatile" is the only case to use volatile ints
   - "atomic" uses the Win32 InterlockedXxx operations
   - "atomic_ass" uses inline x86 assembler locked
      integer operations

  Notes:
   - ++atomic took only three times as long as either
      ++volatile and unoptimized ++plain
   - ++atomic does not incur function call overhead

[100x iterations] Running 100000000 iterations for integer operations:

          ++plain   2404ms, counter=100000000
          --plain   2399ms, counter=0

       ++volatile   2400ms, counter=100000000
       --volatile   2405ms, counter=0

         ++atomic   7480ms, counter=100000000
         --atomic   9340ms, counter=0

     ++atomic_ass   8881ms, counter=100000000
     --atomic_ass  10964ms, counter=0

Here are a few extra notes on the relative timings of various flavours of x86 assembler implementations of IntAtomicIncrement (these timings were taken under the same conditions as above and can be compared directly):

    Instructions                    Timing
    ---------------------------     --------
    __asm mov       eax, 1
    __asm lock xadd i, eax
    __asm mov       result, eax     ~11000ms

    __asm mov       eax, 1
    __asm lock xadd i, eax          ~10400ms

    __asm lock inc i                 ~8900ms
      (this is the one actually used above)

Note that the non-atomic versions are much better, and map directly onto the "plain" timings:

    __asm inc i                      ~2400ms

Conclusion: So there is indeed overhead introduced by the x86 LOCK instruction, even on a single-CPU machine. This is natural and to be expected, but I point it out because some people said there was no difference.

I am very impressed that Win32's InterlockedIncrement is actually faster at 765ms than my hand-coded assembler at 900ms, even though my hand-coded version actually does less work (only a single instruction!) because it doesn't return the new value. Of course, I'm no x86 assembler expert; the explanation is certainly that the OS's wrapper is using a different opcode than my hand-coded version.

Finally, of course, note that the Win32 atomic int functions clearly are not incurring function-call overhead. Never assume -- measure.

A few important points about this test harness:

1. CAVEAT LECTOR: Take this for what it is: A first cut at a test harness. Comments and corrections are welcome. I'm showing raw performance numbers here; I haven't inspected the compiled code, and I've made no attempt to determine the impact of cache hits/misses and other secondary effects. (Even so, this GotW took much more effort than usual to produce, and I guarantee that the next few issues will feature simpler topics!)
2. TANSTAAFL ("there ain't no such thing as a free lunch" -R.A.Heinlein). Even with atomic integer operations, it is incorrect to say "there's no locking required" because the atomic integer operations clearly do perform serialization, and do incur significant overhead.
3. The test harness itself is SINGLE-threaded. A thread-safe COW implementation incurs overhead even in programs that are not themselves multithreaded. At best, COW could be a true optimization only when the COW code does not need to be made thread-safe at all (even then, see Rob Murray's "C++ Strategies and Tactics" book, pages 70-72, for more empirical tests that show COW is only beneficial in certain situations). If thread safety is required, COW imposes a significant performance penalty on all users, even users running only single-threaded code.

Click here to download the test harness source code.

Reference Counting - Part IIIDifficulty: 9 / 10