Rethinking How to Teach C++:Part 6: Analyzing Strings

In the last article in this series,1 we explained how Accelerated C++ Practical Programming by Example2 used two examples to introduce the basics of manipulating character strings. The first example is a function that takes a string as its argument and returns a vector of strings, one for each word in the input. The second is a collection of functions that use a vector of strings to represent a rectangular picture, with one element for each row. With this representation, we found it straightforward to write functions that put a frame around a picture, or that concatenate two pictures horizontally or vertically.

These examples showed how to manipulate strings and vectors an element at a time. What they did not show—and what is one of the great advantages of using the C++ standard library instead of dealing with individual elements directly—was how to use the standard-library algorithms to simplify programs. This article revisits a programming example from Part 5,1 and shows how library algorithms can make the program substantially easier to understand. After that, we will use similar algorithms to solve a somewhat more complicated problem.

Revising a program to use library algorithms
We already used iterators in a direct loop in the vcat function:


for (vector::const_iterator it =
     bottom.begin();
          it != bottom.end(); ++it)
             ret.push_back(*it);
This loop concatenates all of the elements of bottom (which represents the picture that will constitute the bottom part of a vertical concatenation) onto the end of ret (which is the value the function will eventually return). Although the loop is not terribly complicated, it is verbose. Accordingly, it offers a nice opportunity to show how to use the library to simplify a program:


ret.insert(ret.end(), bottom.begin(), bottom.end());
This call does not really use a library algorithm because insert is a member of vector, rather than a standalone algorithm. Nevertheless, it is clearly simpler than the loop it replaces. Moreover, we can use a general algorithm to accomplish the same task:


copy(bottom.begin(), bottom.end(), back_inserter(ret));
This example is particularly nice because it lets us introduce three of the most useful library facilities simultaneously: insert, copy, and back_inserter. Moreover, by this time we have explained enough about the language in general that we can describe the effect of copy by saying that


     copy(begin, end, out);
is the same as that of


     while (begin != end)
          *out++ = *begin++;
except for the changes to the values of the iterators. This method of description will prepare readers for the introduction of generic functions later.

Notice we said iterators, not pointers. In fact, we have not even mentioned pointers yet in the book,2 and we shall not do so for another 68 pages. However, the properties of iterators are all that we need for this algorithm to work, so we do not need to know about pointers to understand the algorithm.

The example also gives us the opportunity to mention two common mistakes:


     // These two calls will not work!
  copy(bottom.begin(), bottom.end(), ret);
  copy(bottom.begin(), bottom.end(), ret.end());
The first one will fail because ret is a container, not an iterator. The second will fail for a more subtle—and therefore more important—reason.

The reason is that generic algorithms, such as copy, do not operate directly on containers—they operate only on the iterators that are given to them as arguments. Therefore, the second call to copy is capable of performing only operations that bottom.begin() and bottom.end() support. These operations do not include any that might change the number of elements in bottom. Such changes take place only through container operations, not through operations on the container's iterators.

Accordingly, if we wish to use copy to append elements to a container, we must give it an iterator capable of changing the container's size. The back_inserter function provides such an iterator.

A more ambitious revision
In Part 5,1 we developed the following program:


vector split(const string& s)
{
     vector ret;
     typedef string::size_type string_size;
     string_size i = 0;

          // invariant: we have processed characters
          // [original value of i, i)
     while (i != s.size()) {
                        // ignore leading blanks
                        // invariant: characters in range
                        // [original i, current i) are all spaces
           while (i != s.size() && isspace(s[i]))
                 ++i;

                       // find end of next word
           string_size j = i;
                       // invariant: none of the characters in range
                       // [original j, current j) is a space
           while (j != s.size() && !isspace(s[j]))
                 ++j;

                       // if we found some nonwhitespace characters
           if (i != j) {
                                    // copy from s starting at i and
                                    // taking j-i chars
                 ret.push_back(s.substr(i, j - i));
                 i = j;
          }
     }
     return ret;
}
This function takes a string that represents a sequence of words separated by whitespace, and returns a vector, each element of which is a single word.

What is interesting about this function is that it has two loops that are in almost the same form, namely


     while (i != s.size() && isspace(s[i]))
           ++i;
and

     while (j != s.size() && !isspace(s[j]))
           ++j;
These loops are classic examples of a linear search, which the standard library implements as the find_if algorithm. The _if implies that the algorithm expects a predicate—a function that yields a truth value—as one of its arguments, and uses the predicate to determine when to terminate the search.

We would like to give isspace as the predicate argument to find_if. Unfortunately, there is one small snag: we cannot be assured of using isspace directly. The problem is that isspace can be overloaded to handle ordinary and wide characters, and we do not have the mechanism at this point to pass an overloaded function as an argument to a template function, which find_if is.

Fortunately, this snag is easy to circumvent, especially in light of the fact that we also need a function that yields the opposite of isspace. The circumvention is to define two small functions to test for space and not space, respectively:


bool space(char c)
{
     return isspace(c);
}
bool not_space(char c)
{
     return !isspace(c);
}
Once we have defined these functions, we can rewrite split so that it uses library algorithms instead of the two inner loops:


vector split(const string& str)
{
     typedef string::const_iterator iter;
     vector ret;

     iter i = str.begin();
     while (i != str.end()) {

          // ignore leading blanks
          i = find_if(i, str.end(), not_space);

          // find end of next word
          iter j = find_if(i, str.end(), space);

          // copy the characters in [i,j)
          if (i != str.end())
               ret.push_back(string(i, j));
          i = j;
     }
     return ret;
}
Look how much simpler this function has become. In particular, notice how the two loops that were so similar have become two calls to find_if that are also similar.

Finding URLs
By using iterators to mark positions in sequences, find and find_if to locate elements with particular properties, and constructors and push_back to extract subsequences and append them to other sequences, we can write programs to manipulate strings in quite a variety of useful ways. At this point, the most important skill the reader can learn is how to be comfortable in using these facilities. Accordingly, we came up with an example that is a variation on split, but significantly more complicated: Examine a string and extract every substring that represents a URL. A URL is a sequence characters of the form


     protocol-name::/resource-name
where protocol-name contains only letters, and resource-name may include letters, digits, and certain punctuation characters.

The function that finds URLs looks like this:


vector find_urls(const string& s)
{
     vector ret;
     typedef string::const_iterator iter;
     iter b = s.begin(), e = s.end();
     // look through the entire input
     while (b != e) {
          // look for one or more letters followed by ://
          b = url_beg(b, e);
          // if we found it
          if (b != e) {
               // get the rest of the URL
               iter after = url_end(b, e);
               // remember the URL
               ret.push_back(string(b, after));
               // advance b and check for
               // more URLs on this line
               b = after;
          }
     }
     return ret;
}
You can see this function has a similar structure to split. The main difference is the use of url_beg and url_end instead of find_if. The idea is that url_beg will locate the first (or next) position in the sequence at which a URL begins—specifically, one or more letters followed by :// and then by a character that is permitted to appear as part of a protocol-name. Each time url_beg finds the beginning of a URL, we use url_end to find the end of the URL. If we assume that we have a function called not_url_char that will determine whether its argument is not a valid character for a URL protocol-name, it is almost trivial to write url_end:


string::const_iterator
url_end(string::const_iterator b,
     string::const_iterator e)
{
     return find_if(b, e, not_url_char);
}
Interestingly, we can use find to implement not_url_char:


bool not_url_char(char c)
{
     // characters, in addition to alphanumerics,
     // that can appear in a URL

     static const string url_ch = "~;
          /?:@=&$-_.+!*'(),";

     // see whether c can appear in a
     // URL and return the negative
     return !(isalnum(c) ||
             find(url_ch.begin(), url_ch.end(),
                  c) != url_ch.end());
}
In this example, we use url_ch as the sequence in which to find the character c that we are given as our argument. If c is not there, then it's not a URL character.

Finally, we can write url_beg, which is one of the most complicated functions in this part of the book:2


string::const_iterator
url_beg(string::const_iterator b, string::const_iterator e)
{
     static const string sep = "://";
     typedef string::const_iterator iter;

     // i marks where the separator was found
     iter i = b;
     while ((i = search(i, e, sep.begin(),
          sep.end())) != e) {

          // make sure the separator isn't
          // at the beginning or end of the line
          if (i != b && i + sep.size() != e) {
               // beg marks the beginning of
               // the protocol-name
               iter beg = i;
               while (beg != b &&
                    isalpha(beg[-1]))--beg;
          // is there at least one appropriate character
          // before and after the separator?
          if (beg != i &&!not_url_char
               (i[sep.size()]))
                    return beg;
          }
          // the separator we found wasn't part of a URL;
          // advance i past this separator

          i += sep.size();
     }
     return e;
}
This function's form is familiar, making it a good context to introduce several new ideas, particularly the search function and the notion of indexing an iterator in much the same way as one would index a container. Note the use of beg[-1] to denote the character before the one that beg denotes. This strategy of using familiar contexts to introduce new ideas pervades the book.2

Discussion
Many programs spend a lot of energy dealing with sequences—character strings, vectors, or lists. The most common operations on sequences are to search for elements with particular properties and to copy subsequences from one place to another. Accordingly, the copy, find, find_if, and search library algorithms are useful tools for such programs, as is the back_inserter iterator adaptor. Every C++ programmer should become so familiar with these facilities as to make them second nature—a familiarity we try to instill with examples.

In finding examples, we tried to imagine programs people might write for their own purposes—finding all of a particular kind of entity, such as a word or a URL in a sequence such as a file or a string. For example, if we have a program that reads the contents of a Web page in HTML, it is easy to imagine the find_url program being used to locate all the links on that page.

What we do not emphasize, because we think readers will find it out for themselves, is how much easier it is to use the standard library containers and algorithms to write such programs than it is to do so without them.

To convince yourself of how useful the library is here, try writing find_url without using the string or vector classes or the library algorithms. You will discover that you have to deal with memory allocation, with searching algorithms (you have to find the ::/ somehow), and with lots of other details that turn programming from a pleasure into a nuisance.

Our goal, therefore, is to show by example how the standard library can simplify ordinary programming.

REFERENCES

  1. Koenig, Andrew and Barbara Moo. "Analysis of a Classroom Exercise, Part 5: Working with Strings," Journal of Object-Oriented Programming, 13(11): March 2001.
  2. Koenig, Andrew and Barbara Moo. Accelerated C++ Practical Programming by Example, Addison–Wesley, Reading, MA, 2000.