GotW #84: Monoliths "Unstrung"

GotW #84

On the
blog

November 4: Other Concurrency Sessions at PDC
November 3: PDC'09: Tutorial & Panel

October 26: Hoare on Testing
October 23: Deprecating export Considered for ISO C++0x

This is the original GotW problem and solution substantially as posted to Usenet. See the book Exceptional C++ Style (Addison-Wesley, 2004) for the most current solution to this GotW issue. The solutions in the book have been revised and expanded since their initial appearance in GotW. The book versions also incorporate corrections, new material, and conformance to the final ANSI/ISO C++ standard (1998) and its Technical Corrigendum (2003).

Monoliths "Unstrung"
Difficulty: 3 / 10

"All for one, and one for all" may work for Musketeers, but it doesn't work nearly as well for class designers. Here's an example that is not altogether exemplary, and it illustrates just how badly you can go wrong when design turns into overdesign. The example is, unfortunately, taken from a standard library near you...

Problem

JG Question

1. What is a "monolithic" class, and why is it bad? Explain.

Guru Questions

2. List all the member functions of std::basic_string.

3. Which ones should be members, and which should not? Why?

Solution

Avoid Unduly Monolithic Designs

1. What is a "monolithic" class, and why is it bad? Explain.

The word "monolithic" is used to describe software that is a single big, indivisible piece, like a monolith. The word "monolith" comes from the words "mono" (one) and "lith" (stone) whose vivid image of a single solid and heavy stone well illustrates the heavyweight and indivisible nature of such code.

A single heavyweight facility that thinks it does everything is often a dead end. After all, often a single big heavyweight facility doesn't really do more -- it often does less, because the more complex it is, the narrower its application and relevance is likely to be.

In particular, a class might fall into the monolith trap by trying to offer its functionality through member functions instead of nonmember functions, even when nonmember nonfriend functions would be possible and at least as good. This has at least two drawbacks:

(Major) It isolates potentially independent functionality inside a class. The operation in question might otherwise be nice to use with other types, but because it's hardwired into a particular class that won't be possible, whereas if it were exposed as a nonmember function template it could be more widely usable.

(Minor) It can discourage extending the class with new functionality. "But wait!" someone might say. "It doesn't matter whether the class's existing interface leans toward member or nonmember functions, because I can equally well extend it with my own new nonmember functions either way!" That's technically true, and misses the semantic point: If the class's built-in functionality is offered mainly (or only) via member functions and therefore present that as the class's natural idiom, then the extended nonmember functions cannot follow the original natural idiom and will always remain visibly second-class johnny-come-latelies. That the class presents its functions as members is a semantic statement to its users, who will be used to the member syntax that extenders of the class cannot use. (Do we really want to go out of our way to make our users commonly have to look up which functions are members and which ones aren't? That already happens often enough even in better designs.)

Where it's practical, break generic components down into pieces.

Guideline: Prefer "one class (or function), one responsibility."
Guideline: Where possible, prefer writing functions as nonmember nonfriends.

The rest of this article will go further toward justifying the latter guideline.

The Basics of Strings

2. Name all the member functions of std::basic_string.

It's, ahem, a fairly long list. Counting constructors, there are no fewer than 103 member functions. Really. If that's not monolithic, it's hard to imagine what would be.

Imagine yourself in an underground cavern, at the edge of a subterranean lake. There is a submerged channel that leads to the next air-filled cavern. You prepare to dive into the black water and swim through the flooded tunnel...

Take a deep breath, and repeat after ISO/IEC 14882:1998(E) [1]:

namespace std {
  template<class charT, class traits = char_traits<charT>,
	   class Allocator = allocator<charT> >
  class basic_string {

    // ... some typedefs ...

    explicit basic_string(const Allocator& a = Allocator());
    basic_string(const basic_string& str, size_type pos = 0,
		 size_type n = npos,
                 const Allocator& a = Allocator());
    basic_string(const charT* s,
		 size_type n,
                 const Allocator& a = Allocator());
    basic_string(const charT* s,
                 const Allocator& a = Allocator());
    basic_string(size_type n, charT c,
                 const Allocator& a = Allocator());
    template<class InputIterator>
      basic_string(InputIterator begin, InputIterator end,
		   const Allocator& a = Allocator());
   ~basic_string();
    basic_string& operator=(const basic_string& str);
    basic_string& operator=(const charT* s);
    basic_string& operator=(charT c);

    iterator       begin();
    const_iterator begin() const;
    iterator       end();
    const_iterator end() const;

    reverse_iterator       rbegin();
    const_reverse_iterator rbegin() const;
    reverse_iterator       rend();
    const_reverse_iterator rend() const;

    size_type size() const;
    size_type length() const;
    size_type max_size() const;
    void resize(size_type n, charT c);
    void resize(size_type n);
    size_type capacity() const;
    void reserve(size_type res_arg = 0);
    void clear();
    bool empty() const;

    const_reference operator[](size_type pos) const;
    reference       operator[](size_type pos);
    const_reference at(size_type n) const;
    reference       at(size_type n);

    basic_string& operator+=(const basic_string& str);
    basic_string& operator+=(const charT* s);
    basic_string& operator+=(charT c);
    basic_string& append(const basic_string& str);
    basic_string& append(const basic_string& str,
                         size_type pos, size_type n);
    basic_string& append(const charT* s, size_type n);
    basic_string& append(const charT* s);
    basic_string& append(size_type n, charT c);
    template<class InputIterator>
      basic_string& append(InputIterator first,
                           InputIterator last);
    void push_back(const charT);

    basic_string& assign(const basic_string&);
    basic_string& assign(const basic_string& str,
                         size_type pos, size_type n);
    basic_string& assign(const charT* s, size_type n);
    basic_string& assign(const charT* s);
    basic_string& assign(size_type n, charT c);
    template<class InputIterator>
      basic_string& assign(InputIterator first,
                           InputIterator last);

    basic_string& insert(size_type pos1,
                         const basic_string& str);
    basic_string& insert(size_type pos1,
                         const basic_string& str,
			 size_type pos2, size_type n);
    basic_string& insert(size_type pos, const charT* s,
                         size_type n);
    basic_string& insert(size_type pos, const charT* s);
    basic_string& insert(size_type pos, size_type n,
                         charT c);
    iterator insert(iterator p, charT c);
    void     insert(iterator p, size_type n, charT c);
    template<class InputIterator>
      void insert(iterator p,
                  InputIterator first,
                  InputIterator last);

You break surface at a small air pocket and gasp. Don't give up -- we're halfway there! Another deep breath, and:

    basic_string& erase(size_type pos = 0,
                        size_type n = npos);
    iterator erase(iterator position);
    iterator erase(iterator first, iterator last);

    basic_string& replace(size_type pos1, size_type n1,
			  const basic_string& str);
    basic_string& replace(size_type pos1, size_type n1,
			  const basic_string& str,
			  size_type pos2, size_type n2);
    basic_string& replace(size_type pos, size_type n1,
                          const charT* s, size_type n2);
    basic_string& replace(size_type pos, size_type n1,
                          const charT* s);
    basic_string& replace(size_type pos, size_type n1,
                          size_type n2, charT c);

    basic_string& replace(iterator i1, iterator i2,
			  const basic_string& str);
    basic_string& replace(iterator i1, iterator i2,
                          const charT* s, size_type n);
    basic_string& replace(iterator i1, iterator i2,
                          const charT* s);
    basic_string& replace(iterator i1, iterator i2,
			  size_type n, charT c);
    template<class InputIterator>
      basic_string& replace(iterator i1, iterator i2,
			    InputIterator j1,
                            InputIterator j2);

    size_type copy(charT* s, size_type n,
                   size_type pos = 0) const;
    void swap(basic_string<charT,traits,Allocator>&);

    const charT* c_str() const;         //  explicit
const charT* data() const;
    allocator_type get_allocator() const;

    size_type find (const basic_string& str,
                    size_type pos = 0) const;
    size_type find (const charT* s,
                    size_type pos, size_type n) const;
    size_type find (const charT* s,
                    size_type pos = 0) const;
    size_type find (charT c, size_type pos = 0) const;
    size_type rfind(const basic_string& str,
                    size_type pos = npos) const;
    size_type rfind(const charT* s,
                    size_type pos, size_type n) const;
    size_type rfind(const charT* s,
                    size_type pos = npos) const;
    size_type rfind(charT c, size_type pos = npos) const;

    size_type find_first_of(const basic_string& str,
			    size_type pos = 0) const;
    size_type find_first_of(const charT* s,
			    size_type pos,
                            size_type n) const;
    size_type find_first_of(const charT* s,
                            size_type pos = 0) const;
    size_type find_first_of(charT c,
                            size_type pos = 0) const;
    size_type find_last_of (const basic_string& str,
			    size_type pos = npos) const;
    size_type find_last_of (const charT* s,
			    size_type pos,
                            size_type n) const;
    size_type find_last_of (const charT* s,
                            size_type pos = npos) const;
    size_type find_last_of (charT c,
                            size_type pos = npos) const;

    size_type find_first_not_of(const basic_string& str,
				size_type pos = 0) const;
    size_type find_first_not_of(const charT* s,
                                size_type pos,
				size_type n) const;
    size_type find_first_not_of(const charT* s,
                                size_type pos = 0) const;
    size_type find_first_not_of(charT c,
                                size_type pos = 0) const;
    size_type find_last_not_of (const basic_string& str,
				size_type pos = npos) const;
    size_type find_last_not_of (const charT* s,
                                size_type pos,
				size_type n) const;
    size_type find_last_not_of (const charT* s,
				size_type pos = npos) const;
    size_type find_last_not_of (charT c,
                                size_type pos = npos) const;

    basic_string substr(size_type pos = 0,
                        size_type n = npos) const;
    int compare(const basic_string& str) const;
    int compare(size_type pos1, size_type n1,
		const basic_string& str) const;
    int compare(size_type pos1, size_type n1,
		const basic_string& str,
		size_type pos2, size_type n2) const;
    int compare(const charT* s) const;
    int compare(size_type pos1, size_type n1,
		const charT* s, size_type n2 = npos) const;
  };
}

Whew -- 103 member functions or member function templates! The rocky tunnel expands and we break through a new lake's surface just in time. Somewhat waterlogged but better persons for the experience, we look around inside the new cavern to see if the exercise has let us discover anything interesting.

Was the recitation worth it? With the drudge work now behind us, let's look back at what we've just traversed and use the information to analyze Question 3:

Membership Has Its Rewards -- and Its Costs

3. Which ones should be members, and which should not? Why?

Recall the advice from #1: Where it's practical, break generic components down into pieces.

Guideline: Prefer "one class (or function), one responsibility."
Guideline: Where possible, prefer writing functions as nonmember nonfriends.

For example, if you write a string class and make searching, pattern-matching, and tokenizing available as member functions, you've hardwired those facilities so that they can't be used with any other kind of sequence. (If this frank preamble is giving you an uncomfortable feeling about basic_string, well and good.) On the other hand, a facility that accomplishes the same goal but is composed of several parts that are also independently usable is often a better design. In this example, it's often best to separate the algorithm from the container, which is what the STL does most of the time.

I [2,3] and Scott Meyers [4] have written before on why some nonmember functions are a legitimate part of a type's interface, and why nonmember nonfriends should be preferred (among other reasons, to improve encapsulation). For example, as Scott wrote in his opening words of the cited article:

I'll start with the punchline: If you're writing a function that can be implemented as either a member or as a non-friend non-member, you should prefer to implement it as a non-member function. That decision increases class encapsulation. When you think encapsulation, you should think non-member functions. [4]

So when we consider the functions that will operate on a basic_string (or any other class type), we want to make them nonmember nonfriends if reasonably possible. Hence, here are some questions to ask about the members of basic_string in particular:

	Always make it a member if it has to be one: Which operations must be members, because the C++ language just says so (e.g., constructors) or because of functional reasons (e.g., they must be virtual)? If they have to be, then oh well, they just have to be; case closed.
	Prefer to make it a member if it needs access to internals: Which operations need access to internal data we would otherwise have to grant via friendship? These should normally be members. (There are some rare exceptions such as operations needing conversions on their left-hand arguments and some like operator<<() whose signatures don't allow the *this reference to be their first parameters; even these can normally be nonfriends implemented in terms of (possibly virtual) members, but sometimes doing that is merely an exercise in contortionism and they're best and naturally expressed as friends.)
	In all other cases, prefer to make it a nonmember nonfriend: Which operations can work equally well as nonmember nonfriends? These can, and therefore normally should, be nonmembers. This should be the default case to strive for.

It's important to ask these and related questions because we want to put as many functions into the last bucket as possible.

A word about efficiency: For each function, we'll consider whether it can be implemented as a nonmember nonfriend function as efficiently as a member function. Is this premature optimization (an evil I often rail against)? Not a bit of it. The primary reason why I'm going to consider efficiency is to demonstrate just how many of basic_string's member functions could be implemented "equally well as nonmember nonfriends." I want to specifically shut down any accusations that making them nonmember nonfriends is a potential efficiency hit, such as preventing an operation from being performed in constant time if the standard so requires. A secondary reason is optimizing a library is not premature when the library's implementer has concrete information from past experience (e.g., past user reports, or his own experience in the problem domain) about what operations are being used by users in time-sensitive ways. So don't get hung up about premature optimization -- we're not falling into that trap here. Rather, we are primarily investigating just how many of basic_string's members could "equally well" be implemented as nonmember nonfriends. Even if you are expecting many to be implementable as nonmembers, the actual results may well surprise you.

Operations That Must Be Members

As a first pass, let's sift out those operations that just have to be members. There are some obvious ones at the beginning of the list.

	constructors (6)
	destructor
	assignment operators (3)
	[] operators (2)

Clearly the above functions must be members. It's impossible to write a constructor, destructor, assignment operator, or [] operator in any other way!

12 functions down, 91 to go...

Operations That Should Be Members

We asked: Which operations need access to internal data we would otherwise have to grant via friendship? Clearly these have a reason to be members, and normally ought to be. This list includes the following, some of which provide indirect access to (e.g., begin()) or change (e.g., reserve()) the internal state of the string:

	begin (2)
	end (2)
	rbegin (2)
	rend (2)
	size
	max_size
	capacity
	reserve
	swap
	c_str
	data
	get_allocator

The above ought to be members not only because they're tightly bound to basic_string, but they also happen to form the public interface that nonmember nonfriend functions will need to use. Sure, you could implement these as nonmember friends, but why?

There are a few more that I'm going to add to this list as fundamental string operations:

insert (three-parameter version)

	erase (1 -- the "iter, iter" version)
	replace (2 -- the "iter, iter, num, char" and templated versions)

We'll return to the question of insert(), erase(), and replace() a little later. For replace() in particular, it's important to be able to choose well and make the most flexible and fundamental version(s) into member(s).

Into the Fray: Possibly-Contentious Operations That Can Be Nonmember Nonfriends

First in this section, allow me to perform a stunning impersonation of a lightning rod by pointing out that all of the following functions have something fundamental in common, to wit: Each one could obviously as easily -- and as efficiently -- be a nonmember nonfriend.

	at (2)
	clear
	empty
	length

Sure they can be nonmember nonfriends, that's obvious, no sweat. Of course, the above functions also happen to have something else pretty fundamental in common: They're mentioned in the standard library's container and sequence requirements, as member functions. Hence the lightning-rod effect...

"Yeah, wait!" I can already hear some standardiste-minded people saying, heading in my direction and resembling the beginnings of a brewing lynch mob. "Not so fast! Don't you know that basic_string is designed to meet the C++ Standard's container and sequence requirements, and those requirements require or suggest that some of those functions be members? So quit misleading the readers! They're members -- live with it!" Indeed, and true, but for the sake of this discussion I'm going to waive those requirements with a dismissive wave of a hand and a quick escape across some back yards past small barking dogs and various clothes drying on the line.

Having left my pursuers far enough behind to resume reasoned discourse, here's the point: For once, the question I'm considering here isn't what the container requirements say. It's which functions can without loss of efficiency be made nonmember nonfriends, and whether there's any additional benefit to be gained from doing so. If those benefits exist for something the container requirements say must be a member, well, why not point out that the container requirements could be improved while we're at it? And so we shall...

Take empty() as a case in point. Can we implement it as a nonmember nonfriend? Sure... the standard itself requires the following behavior of basic_string::empty(), in the C++ Standard, subclause 21.3.3, paragraph 14:

Returns: size() == 0.

Well, now, that's pretty easy to write as a nonmember without loss of efficiency:

template<class charT, class traits, class Allocator>
bool empty( const basic_string<charT, traits, Allocator>& s )
{
return s.size() == 0;
}

Notice that, while we can make size() a member and implement a nonmember empty() in terms of it, we could not do the reverse. In several cases here there's a group of related functions, and perhaps more than one could be nominated as a member and the others implemented in terms of that one as nonmembers. Which function should we nominate to be the member? My advice is to choose the most flexible one that doesn't force a loss of efficiency -- that will be the enabling flexible foundation on which the others can be built. In this case, we choose size() as the member because its result can always be cached (indeed, the standard encourages that it be cached because size() "should" run in constant time), in which case an empty() implemented only in terms of size() is no less efficient than anything we could do with full direct access to the string's internal data structures.

For another case in point, what about at()? The same reasoning applies. For both the const and non-const versions, the standard requires the following:

Throws: out_of_range if pos >= size().

Returns: operator[]( pos ).

That's easy to provide as a nonmember, too. Each is just a two-line function template, albeit a bit syntactically tedious because of all those template parameters and nested type names -- and remember all your typenames! Here's the code:

template<class charT, class traits, class Allocator>
typename basic_string<charT, traits, Allocator>
         ::const_reference
at( const basic_string<charT, traits, Allocator>& s,
    typename basic_string<charT, traits, Allocator>
             ::size_type pos )
{
if( pos >= s.size() ) throw out_of_range( "don't do that" );
return s[pos];
}

template<class charT, class traits, class Allocator>
typename basic_string<charT, traits, Allocator>::reference
at( basic_string<charT, traits, Allocator>& s,
    typename basic_string<charT, traits, Allocator>
             ::size_type pos )
{
if( pos >= s.size() )
    throw out_of_range( "I said, don't do that" );
return s[pos];
}

What about clear()? Easy, that's the same as erase(begin(),end()). No fuss, no muss. Exercise for the reader, and all that.

What about length()? Easy, again -- it's defined to give the same result as size(). What's more, note that the other containers don't have length(), and it's there in the basic_string interface as a sort of "string thing", but by making it a nonmember suddenly we can consisently say "length()" about any container. Not too useful in this case because it's just a synonym for size(), I grant you, but a noteworthy point in the principle it illustrates -- making algorithms nonmembers immediately also makes them more widely useful and usable.

In summary, let's consider the benefits and drawbacks of providing functions like at() and empty() as members vs. nonmembers. First, just because we can write these members as nonmembers (and nonfriends) without loss of efficiency, why should we do so? What are the actual or potential benefits? Here are several:

	Simplicity. Making them nonmembers lets us write and maintain less code. We can write empty() just once and be done with it forever. Why write it many times as basic_string::empty() and vector::empty() and list::empty() and so forth, including writing them over and over again for each new STL-compliant container that comes along in third-party or internal libraries and even in future versions of the C++ standard? Write it once, be done, move on.
	Consistency. It avoids gratuitous incompatibilities between the member algorithms of different containers, and between the member and nonmember versions of algorithms (some existing inconsistencies between members and nonmembers are pointed out in [5]). If customized behavior is needed, then specialization or overloading of the nonmember function templates should be able to accommodate it.
	Encapsulation. It improves encapsulation (as argued by Meyers strongly and at length in [4]).

Having said that, what are the potential drawbacks? Here are two, although in my opinion they are outweighed by the above advantages:

Consistency. You could argue that keeping things like empty() as members follows the principle of least surprise -- similar functions are members, and in other class libraries things like IsEmpty() functions are commonly members, after all. I think this argument is valid, but weakened when we notice that this wouldn't be surprising at all if people were in the habit of following Meyers' advice in the first place, routinely writing functions as nonmembers whenever reasonably possible. So the question really comes down to whether we ought to trade away real benefits in order to follow a tradition, or to change the tradition to get real benefits. (Of course, if we didn't already have size(), then implementing empty() in particular as a nonmember nonfriend would not be possible. The class's public member functions do have to provide the necessary and sufficient functionality already.) I think this argument is weak because writing them as nonmembers yields a greater consistency, as noted above, than the questionable inconsistency being claimed here.

Namespace pollution. Because empty() is such a common name, putting it at namespace scope risks namespace pollution -- after all, will every function named empty() want to have exactly these semantics? This argument is also valid to a point, but weakened in two ways: First, by noting that encouraging consistent semantics for functions is a Good Thing; and second, by noticing that overload resolution has turned out to be a very good tool for disambiguation, and namespace pollution has never been as big a problem as some have made it out to be in the past. Really, by putting all the names in one place and sharing implementations, we're not polluting that one namespace as much as we're sanitizing the functions by gathering them up and helping to avoid the gratuitous and needless incompatibilities that Meyers mentions (see above).

With perhaps these more contentious choices out of the way, then, let's continue with other operations that can be nonmember nonfriends. Some, like those listed above, are mentioned in the container requirements as members; again, here we're considering not what the standard says we must do, but rather what given a blank page we might choose to do in designing these as members vs. nonmember nonfriends.

More Operations That Can Be Nonmember Nonfriends

Further, all of the remaining functions can be implemented as nonmember nonfriends:

	resize (2)
	assign (6)
	+= (3)
	append (6)
	push_back
	insert (7 -- all but the three-parameter version)
	erase (2 -- all but the "iter, iter" version)
	replace (8 -- all but the "iter, iter, num, char" and templated versions)
	copy
	substr
	compare (5)
	find (4)
	rfind (4)
	find_first_of (4)
	first_last_of (4)
	find_first_not_of (4)
	find_last_not_of (4)

Consider first resize():

    void resize(size_type n, charT c);
    void resize(size_type n);

Can each resize() be a nonmember nonfriend? Sure it can, because it can be implemented in terms of basic_string's public interface without loss of efficiency. Indeed, the standard's own functional specifications express both versions in terms of the functions we've already considered above. For example:

template<class charT, class traits, class Allocator>
void resize( basic_string<charT, traits, Allocator>& s,
             typename Allocator::size_type n, charT c)
{
if( n > s.max_size() ) throw length_error( "won't fit" );
if( n <= s.size() )
{
    basic_string<charT, traits, Allocator> temp( s, 0, n );
    s.swap( temp );
}
else
{
    s.append( n - s.size(), c );
}
}

template<class charT, class traits, class Allocator>
void resize( basic_string<charT, traits, Allocator>& s,
             typename Allocator::size_type n )
{
resize( s, n, charT() );
}

Next, assign(): We have six, count 'em, six forms of assign(). Fortunately, this case is simple: most of them are already specified in terms of each other, and all can be implemented in terms of a constructor and operator=() combination.

Next, operator+=(), append(), and push_back(): What about all those pesky flavors of appending operations, namely the three forms of operator+=(), the six-count'em-six forms of append(), and the lone push_back()? Just the similarity ought to alert us to the fact that probably at most one needs to be a member; after all, they're doing about the same thing, even if the details differ slightly, for example appending a character in one case, a string in another case, and an iterator range in still another case. Indeed, as it turns out, all of them can likewise be implemented as nonmember nonfriends without loss of efficiency:

	Clearly operator+=() can be implemented in terms of append(), because that's how it's specified in the C++ standard.
	Equally clearly, five of the six versions of append() can be nonmember nonfriends because they are specified in terms of the three-parameter version of append(), and that in turn can be implemented in terms of insert(), all of which quite closes the append() category.
	Determining the status of push_back() takes only slightly more work, because its operational semantics aren't specified in the basic_string section, but only in the container requirements section, and there we find the specification that a.push_back(x) is just a synonym for a.insert(a.end(),x).

What's the linchpin holding all the members of this group together as valid nonmember nonfriends? It's insert(), hence insert()'s status as a good choice to be the member that does the work and encapsulates the append-related access to the string's internals in one place, instead of spreading the internal access all over the place in a dozen different functions.

Next, insert(): For those of you who might think that six-count'em-six versions of assign() and six-count'em-six versions of append() might have been a little much, those were just the warmup... now we consider the eight-count'em-eight versions of insert(). Above, I've already nominated the three-parameter version of insert() as a member, and now it's time to justify why. First, as noted above, insert() is a more general operation than append(), and having a member insert() allowed all the append() operations to be nonmember nonfriends; if we didn't have at least one insert() as a member, then at least one of the append()s would have had to be a member, and so I chose to nominate the more fundamental and flexible operation. But we have eight-count'em-eight flavors of insert() -- which one (or more) ought to be the member(s)? Five of the eight forms of insert() are already specified in terms of the three-parameter form, and the others can also be implemented efficiently in terms of the three-parameter form, so we only need that one form to be a member. The others can all be nonmember nonfriends.

For those of you who might think that the eight-count'em-eight versions of insert() take the cake, well, that was warmup too. In a moment we'll consider the ten-count'em-ten forms of replace(). Before we attempt those, though, let's take a short break to tackle an easier function first, because it turns out that erase() is instructive in building up to dealing with replace()...

Coffee Break (Sort Of): Erasing erase()

Next erase(): After talking about the total 30-count'em-30 flavors of assign(), append(), insert(), and replace() -- and having dealt with 20 of the 30 already above -- you will be relieved to know that there are only three forms of erase(), and that only two of them belong in this section. After what we just went through for the others, this is like knocking off for a well-deserved coffee break...

The troika of erase() members is a little interesting. At least one of these erase() functions must be a member (or friend), there being no other way to erase efficiently using the other already-mentioned member functions alone. There are actually two "families" of erase() functions:

erase( pos, length )
    basic_string& erase(size_type pos = 0, size_type n = npos);

erase( iter, ... )
    iterator erase(iterator position);
    iterator erase(iterator first, iterator last);

First, notice that the two families' return types are not consistent: the first version returns a reference to the string, whereas the other two return iterators pointing immediately after the erased character or range. Second, notice that the two families' argument types are not consistent: the first version takes an offset and length, whereas the other two take an iterator or iterator range; fortunately, we can convert from iterators to offsets via pos = iter - begin() and from offsets to iterators via iter = begin() + pos.

(Aside: The standard does not require, but an implementer can choose, that basic_string objects store their data in a contiguous charT array buffer in memory. If they do, then the conversion from iterators to positional offsets and vice versa clearly need incur no overhead. (I would argue that even segmented storage schemes could provide for very efficient conversion back and forth between offsets and iterators using only the container's and iterator's public interfaces.)

This allows the first two forms to be expressed in terms of the third (again, remember your typenames and qualifications!):

template<class charT, class traits, class Allocator>
basic_string<charT, traits, Allocator>&
erase( basic_string<charT, traits, Allocator>& s,
       typename Allocator::size_type pos = 0,
       typename Allocator::size_type n =
         basic_string<charT, traits, Allocator>::npos )
{
if( pos > s.size() )
    throw out_of_range( "yes, we have no bananas" );
typename basic_string<charT, traits, Allocator>::iterator
    first = s.begin()+pos,
    last = n == basic_string<charT, traits, Allocator>::npos
           ? s.end() : first + min( n, s.size() - pos );
if( first != last )
s.erase( first, last );
return s;
}

template<class charT, class traits, class Allocator>
typename basic_string<charT, traits, Allocator>::iterator
erase( basic_string<charT, traits, Allocator>& s,
       typename basic_string<charT, traits, Allocator>
                ::iterator position )
{
return s.erase( position, position+1 );
}

OK, coffee break's over...

Back to Work: Replacing replace()

Next, replace(): Truth be told, the ten-count'em-ten replace() members are less interesting than they are tedious and exhausting.

At least one of these replace() functions must be a member (or friend), there being no other way to replace efficiently using the other already-mentioned member functions alone. In particular, note that you can't efficiently implement replace() in terms erase() followed by insert() or vice versa because both ways would require more character shuffling, and possibly buffer reallocation.

Note again that we have two families of replace() functions:

replace( pos, length, ... )
    basic_string& replace(size_type pos1, size_type n1, // #1
                        const basic_string& str);
    basic_string& replace(size_type pos1, size_type n1, // #2
                        const basic_string& str,
                        size_type pos2, size_type n2);
    basic_string& replace(size_type pos, size_type n1, // #3
                          const charT* s, size_type n2);
    basic_string& replace(size_type pos, size_type n1, // #4
                          const charT* s);
    basic_string& replace(size_type pos, size_type n1, // #5
                          size_type n2, charT c);

replace( iter, iter, ... )
    basic_string& replace(iterator i1, iterator i2,     // #6
                        const basic_string& str);
    basic_string& replace(iterator i1, iterator i2,     // #7
                          const charT* s, size_type n);
    basic_string& replace(iterator i1, iterator i2,     // #8
                          const charT* s);
    basic_string& replace(iterator i1, iterator i2,     // #9
                        size_type n, charT c);
    template<class InputIterator>                       // #10
      basic_string& replace(iterator i1, iterator i2,
                            InputIterator j1,
                            InputIterator j2);

This time, the two families' return types are consistent; that's a small pleasure. But, as with erase(), the two families' argument types are not consistent: one family is based on an offset and length, whereas the other is based on an iterator range. As with erase(), because we can convert between iterators and positions, we can easily implement one family in terms of the other.

When considering which must be members, we want to choose the most flexible and fundamental version(s) as members and implement the rest in terms of those. Here are a few pitfalls one might encounter while doing this analysis, and how one might avoid them. Consider first the first family:

One function (#2)? One might notice that the standard specifies all of the first family in terms of the #2 version. Unfortunately, some of the passthroughs would needlessly construct temporary basic_string objects, so we can't get by with #2 alone even for just the first family. The standard specifies the observable behavior, but the operational specification isn't necessarily the best way to actually implement the a given function.

Two functions (#3 and #5)? One might notice that all but #5 in the first family can be implemented efficiently in terms of #3, but then #5 would still need to be special-cased to avoid needlessly creating a temporary string object (or its equivalent).

Consider second the second family:

One function (#6)? One might notice that the standard specifies all of the second family in terms of #6. Unfortunately, again, some of the passthroughs would needlessly construct temporary basic_string objects, so we can't get by with #6 alone even for just the second family.

Three functions (#7, #9, #10)? One might notice that most of the functions in the second family can be implemented efficiently in terms of #7, except for #9 (for the same reason that made #5 the outcast in the first family, namely that there was no existing buffer with the correct contents to point to) and #10 (which cannot assume that iterators are pointers, or even for that matter basic_string::iterators!).

Two functions (#9, #10)! One might then immediately notice that all but #9 in the second family can be implemented efficiently in terms of #10, including #7. In fact, assuming string contiguity and position/iterator convertability as we've already assumed, we could probably even handle all the members of the first family... aha! That's it.

So it appears that the best we can do is two member functions upon which we can then build everything else as nonmember nonfriends. The members are the "iter, iter, num, char" and templated versions. The nonmembers are everything else. (Exercise for the reader: For each of the other eight versions, write sample implementations as efficient nonmember nonfriends.)

Note that #10 well illustrates the power of templates -- this one function can be used to implement all but two of the others without any loss of efficiency, and to implement the remaining ones with what would probably be only a minor loss of efficiency (constructing a temporary basic_string containing n copies of a character).

Time for another quick coffee break...

Coffee Break #2: Copying copy() and substr()

Oh, copy(), schmopy(). Note that copy() is a somewhat unusual beast, and that its interface is inconsistent with the std::copy() algorithm. Note again the signature:

    size_type copy(charT* s, size_type n,
                   size_type pos = 0) const;

The function is const; it does not modify the string. Rather, what the string object has to do is copy part of itself (up to n characters, starting at position pos) and dump it into the target buffer (note, I deliberately did not say "C-style string"), s, which is required to be big enough -- if it's not, oh well, then the program will scribble onward into whatever memory happens to follow the string and get stuck somewhere in the Undefined Behavior swamp. And, better still, basic_string::copy() does not, repeat not, append a null object to the target buffer, which is what makes s not a C-style string (besides, charT doesn't need to be char; this function will copy into a buffer of whatever kind of characters the string itself is made of). It is also what makes copy() a dangerous function.

Guideline: Never use functions that write to range-unchecked buffers (e.g., strcpy(), sprintf(), basic_string::copy()). They are not only crashes waiting to happen, but a clear and present security hazard -- buffer overrun attacks continue to be a perennially popular pastime for hackers and malware writers.

All of the required work could be done pretty much as simply, and a lot more flexibly, just by using plain old std::copy():

string s = "0123456789";

char* buf1 = new char[5];
s.copy( buf1, 0, 5 );  // ok: buf will contain the chars
                       //     '0', '1', '2', '3', '4'
copy( s.begin(), s.begin()+5, buf1 );
                       // ok: buf will contain the chars
                       //     '0', '1', '2', '3', '4'

int* buf2 = new int[5];
s.copy( buf2, 0, 5 );  // error: first parameter is not char*
copy( s.begin(), s.begin()+5, buf2 );
                       // ok: buf2 will contain the values
                       //     corresponding to '0', '1', '2',
                       //     '3', '4' (e.g., ASCII values)

Incidentally, the above code also shows how basic_string::copy() can trivially be written as a nonmember nonfriend, most trivially in terms of the copy() algorithm -- another exercise for the reader, but do remember to handle the n == npos special case correctly.

While we're taking a breather, let's knock off another simple one at the same time: substr(). Recall its signature:

basic_string substr(size_type pos = 0,
size_type n = npos) const;
We'll neglect for the moment to comment on the irksome fact that this function chooses to take its parameters in the order "position, length", whereas the copy() we just considered takes the very same parameters in the order "length, position". Nor will we mention that, besides being aesthetically inconsistent, trying to remember which function takes the parameters in which order makes for an easy trap for users of basic_string to stumble into because both parameters also happen to be of the same type (size_type), worse luck, so that if the users get it wrong their code will continue to happily compile without any errors or warnings and continue to happily run... well, except only every so often hiccupping and generating odd user support calls when odd strings get emitted after odd and wrong substrings get taken. And never would we state graphically that all of this could be viewed as the moral equivalent of neglectfully leaving a land mine neatly labeled "Design By Committee™" lying around the countryside just waiting to blow up without warning beneath the unwary.
Like I said, we won't comment on all that. Not a word. No, with blinders clamped firmly in place we'll ignore the carnage on either side, the cries of confusion, the smell of powder, and we'll march strictly along the main line of thought we're doggedly pursuing -- member vs. nonmember nonfriend functions.

Notice that substr() can be easily implemented as a nonmember nonfriend, because it's syntactic sugar for a string constructor -- after all, the standard itself specifies that it must simply return a fresh basic_string object constructed using the equivalent of basic_string<charT,traits,Allocator>( data()+pos, min(n, size()-pos) ). That is, creating a new string that's a substring of an existing one can be done equally well with or without substr(), using the more general string constructor which already does all this and a lot more besides:

string s = "0123456789";

string s2 = s.substr( 0, 5 ); // s2 contains "01234"
string s3( s.data(), 5 ); // s3 contains "01234"
string s4( s.begin(), s.begin()+5 ); // s4 contains "01234"

All right, break's over. Back to work again... fortunately we have only two families left to consider, compare() and the *find*()s.

Almost There: Comparing compare()s

The penultimate family is compare(). It has five members, all of which can be trivially shown to be implementable efficiently as nonmember nonfriends. How? Because in the standard they're specified in terms of basic_string's size() and data(), which we already decided to make members, and traits::compare(), which does the real work.

Wow, wait a minute. That was almost easy! Let's not question it, but move right along...

The Home Stretch: Finding the find()s

Our relief doesn't last long, alas. Lest the sight of the eight-count'em-eight flavors of insert() and the ten-count'em-ten versions of replace() wasn't enough to bring you verily to the verge of tears, we end on a truly unsurpassable distressing note, namely this: the 24-count'em-24 (yes, really) variations on find()-like algorithms.

There are six families of find functions, each with exactly four members:

find -- forward search for the first occurrence of a string or character (str, s, or c) starting at point pos

rfind -- backward search for the first occurrence of a string or character (str, s, or c) starting at point pos

find_first_of -- forward search for the first occurrence of any of one or more characters (str, s, or c) starting at point pos

find_last_of -- backward search for the first occurrence of any of one or more characters (str, s, or c) starting at point pos

	find_first_not_of -- forward search for the first occurrence of any but one or more characters (str, s, or c) starting at point pos
	find_last_not_of -- backward search for the first occurrence of any but one or more characters (str, s, or c) starting at point pos

Each family has four members:

	"str, pos" where str contains the characters to search for (or not) and pos is the starting position in the string
	"ptr, pos, n" where ptr is a charT* pointing to a buffer of length n containing the characters to search for (or not) and pos is the starting position in the string

"ptr, pos" where ptr is a charT* pointing to a null-terminated buffer containing the characters to search for (or not) and pos is the starting position in the string

"c, pos" where c the characters to search for (or not) and pos is the starting position in the string

All of these can be written efficiently as nonmember nonfriends; the implementations are left as exercises for the reader. Having said that, we're done!

But let's add one final note about string finding. In fact, you might have noticed that, in addition to the extensive bevy of basic_string::*find*() algorithms, the C++ Standard also provides a not-quite-as-extensive-but-still-plentiful bevy of std::*find*() algorithms. In particular:

	std::find() can do the work of basic_string::find()
	std::find() using reverse_iterators, or std::find_end(), can do the work of basic_string::rfind()

std::find_first_of(), or std::find() with an appropriate predicate, can do the work of basic_string::find_first_of()

std::find_first_of(), or std::find() with an appropriate predicate, using reverse_iterators can do the work of basic_string::find_last_of()

std::find() with an appropriate predicate can do the work of basic_string::find_first_not_of()

std::find() with an appropriate predicate and using reverse_iterators can do the work of basic_string::find_last_not_of()

What's more, the nonmember algorithms are more flexible, because they work on more than just strings. Indeed, all of the basic_string::*find*() algorithms could be implemented using the std::find and std::find_end(), tossing in appropriate predicates and/or reverse_iterators as necessary.

So what about just ditching the basic_string::*find*() families altogether and just telling users to use the existing std::find*() algorithms? One caution here is that, even though the basic_string::*find*() work can be emulated, doing it with the default implementations of std::find*() would incur significant loss of performance in some cases, and there's the rub. The three forms each of find() and rfind() that search for substrings (not just individual characters) can be made much more efficient than a brute-force search that tries each position and compares the substrings starting at those positions. There are well-known algorithms that construct finite state machines on the fly to run through a string and find a substring (or prove is absence) in linear time, and it might be desirable to take advantage of such techniques.

To take advantage of such optimizations, could we provide overloads (not specializations, see [6]) of std::find*() that work on basic_string iterators? Yes, but only if basic_string::iterator is a class type, not a plain charT*. The reason is that we wouldn't want to specialize std::find() for all pointer types is because not all character pointers necessarily point into strings; so we would need basic_string::iterator to be a distinct type that we can detect and partially specialize on. Then those specializations could perform the optimizations and work at full efficiency for matching substrings.

Summary

Decomposition and encapsulation are Good Things. In particular, it's often best to separate the algorithm from the container, which is what the STL does most of the time.

It's widely accepted that basic_string has way too many member functions. Of the 103 functions in basic_string, only 32 really need to be members, and 71 could be written as nonmember nonfriends without loss of efficiency. In fact, many of them needlessly duplicate functionality already available as algorithms, or are themselves algorithms that would be useful more widely if only they were decoupled from basic_string instead of buried inside it.

Don't repeat basic_string's mistakes in your design -- decouple your algorithms from your containers, use template specialization or overloading to get special-purpose behavior where warranted (as for substring searching), and above all follow the guidelines:

Guideline: Prefer "one class (or function), one responsibility."
Guideline: Where possible, prefer writing functions as nonmember nonfriends.

References

[1] A thick tome also known as the ISO C++ Standard.

[2] H. Sutter. "What's In a Class? The Interface Principle" (C++ Report, 10(3), March 1998).

[3] H. Sutter. Exceptional C++, Items 31-34 (Addison-Wesley, 2000).

[4] S. Meyers. “How Non-Member Functions Improve Encapsulation” (C/C++ Users Journal, 18(2), February 2000).

[5] S. Meyers. Effective STL (Addison-Wesley, 2000).

[6] H. Sutter. "Why Not Specialize Function Templates?" (C/C++ Users Journal, 19(7), July 2001).

Monoliths "Unstrung"Difficulty: 3 / 10

Problem

JG Question

Guru Questions

Solution

Avoid Unduly Monolithic Designs

The Basics of Strings

Membership Has Its Rewards -- and Its Costs

Operations That Must Be Members

Operations That Should Be Members

Into the Fray: Possibly-Contentious Operations That Can Be Nonmember Nonfriends

More Operations That Can Be Nonmember Nonfriends

Coffee Break (Sort Of): Erasing erase()

Back to Work: Replacing replace()

Coffee Break #2: Copying copy() and substr()

Almost There: Comparing compare()s

The Home Stretch: Finding the find()s

Summary

References

Copyright © 2009 Herb Sutter

Monoliths "Unstrung"
Difficulty: 3 / 10