GotW #29

Home Blog Talks Books & Articles Training & Consulting

On the
blog
RSS feed November 4: Other Concurrency Sessions at PDC
November 3
: PDC'09: Tutorial & Panel
October 26: Hoare on Testing
October 23
: Deprecating export Considered for ISO C++0x

This is the original GotW problem and solution substantially as posted to Usenet. See the book Exceptional C++ (Addison-Wesley, 2000) for the most current solutions to GotW issues #1-30. The solutions in the book have been revised and expanded since their initial appearance in GotW. The book versions also incorporate corrections, new material, and conformance to the final ANSI/ISO C++ standard.

Strings
Difficulty: 7 / 10

So you want a case-insensitive string class? Your mission, should you choose to accept it, is to write one.

Problem

Write a ci_string class which is identical to the standard 'string' class, but is case-insensitive in the same way as the (nonstandard but widely implemented) C function stricmp():

    ci_string s( "AbCdE" );

    // case insensitive
    assert( s == "abcde" );
    assert( s == "ABCDE" );

    // still case-preserving, of course
    assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
    assert( strcmp( s.c_str(), "abcde" ) != 0 );

Solution

Write a ci_string class which is identical to the standard 'string' class, but is case-insensitive in the same way as the (nonstandard but widely implemented) C function stricmp():

The "how can I make a case-insensitive string?" question is so common that it probably deserves its own FAQ -- hence this issue of GotW.

Note 1: The stricmp() case-insensitive string comparison function is not part of the C standard, but it is a common extension on many C compilers.

Note 2: What "case insensitive" actually means depends entirely on your application and language. For example, many languages do not have "cases" at all; for those that do, you still have to decide whether you want accented characters to compare equal to unaccented characters, and so on. This GotW provides guidance on how to implement case-insensitivity for standard strings in whatever sense applies to your particular situation.

Here's what we want to achieve:

    ci_string s( "AbCdE" );

    // case insensitive
    assert( s == "abcde" );
    assert( s == "ABCDE" );

    // still case-preserving, of course
    assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
    assert( strcmp( s.c_str(), "abcde" ) != 0 );

The key here is to understand what a "string" actually is in standard C++. If you look in your trusty string header, you'll see something like this:

  typedef basic_string<char> string;

So string isn't really a class... it's a typedef of a template. In turn, the basic_string<> template is declared as follows, in all its glory:

  template<class charT,
           class traits = char_traits<charT>,
           class Allocator = allocator<charT> >
      class basic_string;

So "string" really means "basic_string<char, char_traits<char>, allocator<char> >". We don't need to worry about the allocator part, but the key here is the char_traits part because char_traits defines how characters interact and compare(!).

basic_string supplies useful comparison functions that let you compare whether a string is equal to another, less than another, and so on. These string comparison functions are built on top of character comparison functions supplied in the char_traits template. In particular, the char_traits template supplies character comparison functions named eq(), ne(), and lt() for equality, inequality, and less-than comparisons, and compare() and find() functions to compare and search sequences of characters.

If we want these to behave differently, all we have to do is provide a different char_traits template! Here's the easiest way:

  struct ci_char_traits : public char_traits<char>
                // just inherit all the other functions
                //  that we don't need to override
  {
    static bool eq( char c1, char c2 )
      { return toupper(c1) == toupper(c2); }

    static bool ne( char c1, char c2 )
      { return toupper(c1) != toupper(c2); }

    static bool lt( char c1, char c2 )
      { return toupper(c1) <  toupper(c2); }

    static int compare( const char* s1,
                        const char* s2,
                        size_t n ) {
      return memicmp( s1, s2, n );
             // if available on your compiler,
             //  otherwise you can roll your own
    }

    static const char*
    find( const char* s, int n, char a ) {
      while( n-- > 0 && toupper(*s) != toupper(a) ) {
          ++s;
      }
      return s;
    }
  };

And finally, the key that brings it all together:

  typedef basic_string<char, ci_char_traits> ci_string;

All we've done is created a typedef named "ci_string" which operates exactly like the standard "string", except that it uses ci_char_traits instead of char_traits<char> to get its character comparison rules. Since we've handily made the ci_char_traits rules case-insensitive, we've made ci_string itself case-insensitive without any further surgery -- that is, we have a case-insensitive string without having touched basic_string at all!

This GotW should give you a flavour for how the basic_string template works and how flexible it is in practice. If you want different comparisons than the ones memicmp() and toupper() give you, just replace the five functions shown above with your own code that performs character comparisons the way that's appropriate in your particular application.

Exercises for the Reader

1. Is it safe to inherit ci_char_traits from char_traits<char> this way? Why or why not?

2. Why does the following code fail to compile?

    ci_string s = "abc";
    cout << s << endl;

HINT 1: See GotW #19.

HINT 2: From 21.3.7.9 [lib.string.io], the declaration of operator<< for basic_string is specified as:

    template<class charT, class traits, class Allocator>
    basic_ostream<charT, traits>&
    operator<<(basic_ostream<charT, traits>& os,
               const basic_string<charT,traits,Allocator>& str);

ANSWER: Notice first that cout is actually a basic_ostream<char, char_traits<char> >. Then we see the problem: operator<< for basic_string is templated and all, but it's only specified for insertion into a basic_ostream with the same 'char type' and 'traits type' as the string. That is, the standard operator<< will let you output a ci_string to a basic_ostream<char, ci_char_traits>, which isn't what cout is even though ci_char_traits inherits from char_traits<char> in the above solution.

There are two ways to resolve this: define insertion and extraction for ci_strings yourself, or tack on ".c_str()":

        cout << s.c_str() << endl;

3. What about using other operators (e.g., +, +=, =) and mixing strings and ci_strings as arguments? For example:

        string    a = "aaa";
        ci_string b = "bbb";
        string    c = a + b;

ANSWER: Again, define your own operators, or tack on ".c_str()":

        string    c = a + b.c_str();

Copyright © 2009 Herb Sutter