Rethinking How to Teach C++, Part 8: An interesting revision

In Part 7 of this series (Payback time)1, we developed a program to solve exercise 5-1 in Accelerated C++,2 and hinted that we had a better solution. The problem is to produce a permuted index for a file: an index in which every word in every line appears as an index term.

The solution strategy is to take each phrase, such as


      The quick brown fox
and rotate it so that every word appears first once:


      The quick brown fox
      quick brown fox The
      brown fox The quick
      fox The quick brown
remembering all the while where the original phrase begins. Then the program sorts the rotated phrases and prints them out in their original order.

The part of the program that we think we can improve is the data structure that we use to represent a rotated phrase. The original program used this data structure:


      typedef vector<string> Phrase;
      struct Rotated_phrase {
           Phrase words;
           Phrase::size_type bound;
      };
Here, bound is the index of the element that represents the beginning of the original phrase. In other words, we would represent the four phrases that result from our sample input as follows:


     The quick brown fox       4
     quick brown fox The       3
     brown fox The quick       2
     fox The quick brown       1
In each case, the integer represents the number of words at the beginning of the rotated phrase that come from the end of the original phrase. So, for example, the 3 in the second line indicates that quick brown fox is the end of the original phrase, and, in consequence, The is the beginning of the phrase.

The problem with this data structure is that it wastes space. A phrase with n words will be repeated n times, once for each rotation, thereby consuming space proportional to n2. We would like to be able to reduce the space consumption.

Of course, we can't improve the program's execution speed all that much, because the output must contain all of those phrases. Accordingly, each n-word input phrase must contribute O(n2) space to the size of the output, and it takes O(n2) time to write those characters. Nevertheless, these facts do not prevent us from substantially reducing the amount of space the program occupies while it runs.

Moreover, the revised program has a nice pedagogical property: It solves a problem for which we believe most programmers would use pointers, but it does so in a way that does not use pointers directly. Of course, it does use iterators, and a pointer is a kind of iterator—but it is capable of using any kind of iterator, not just a pointer.

STRATEGY
The idea behind our revised program is to avoid having to store more words than we read. In other words, when we read an n-word phrase, we would like to store one copy of those words rather than n copies of them. We will do so by making use of what we like to call the fundamental theorem of software engineering (even though it isn't really a theorem): All problems can be solved by introducing an extra level of indirection. In this case, the indirection is to use two data structures. One data structure will store each line of input. The second data structure will have one element for each rotation, but instead of storing a copy of the entire line for each rotation, it will refer to a line that is stored in the first data structure and it will note where in the line each rotation starts. In other words, instead of taking as much space as a full line of input, each rotation will consist of two values, one that indicates the line we are discussing, and one that indicates where in that line the rotation starts.

With this strategy, the data structure for our sample sentence is illustrated in Figure 1. Each row represents an element of a vector of objects. Each element contains two components. The first component identifies a particular vector<string> object that contains the phrase we are dealing with. The second component indicates the particular element of the vector<string> that begins the rotated phrase. So, for example, the second element of our vector refers to our phrase, The quick brown fox, and to the word quick within that phrase. Accordingly, one part of the rotated phrase is quick brown fox and the other part is The.

Figure 1
Figure 1. Data structure for sample sentence.

If we were writing this program in C, we would probably use pointers as our indicators. However, in C++ we can avoid pointers entirely in favor of iterators. If we store each line of text as an element of a vector, we can use an iterator to refer to that line. Similarly, if the line itself is also a vector, we can use an iterator to refer to the particular element in question.

IMPLEMENTATION
We'll begin with the data structures:


    typedef vector<string> Phrase;

     struct Rotation {
         vector<Phrase>::const_iterator phrase;
         Phrase::const_iterator rhs;
     };
We use a single Phrase object to represent each line of input. Each of those Phrase objects is an element of a single vector<Phrase> object that holds all of the input. We shall call this object lines, because it has one element for each line of input. This strategy will let us use a vector<Phrase>:: const_iterator to identify the phrase that corresponds to a particular input line, and a Phrase::const_iterator to identify the point in the phrase at which the right-hand part of the output line begins. So, for example, we would represent


     fox The quick brown
by noting that this rotation corresponds to an output line that looks like


     The quick brown    fox
We will therefore represent this rotation with a Rotation object that has a phrase iterator that refers to an element that contains


     The quick brown fox
and an rhs iterator that refers to the fox element within that phrase.

With this information, we can write the part of the program that reads the input. In our previous column, we generated the rotations as part of reading the input; this time, we will read all of the input and then generate all of the rotations. Here is the code to do this:


// one element for each line of input
vector<Phrase> lines;
// one element for each rotation
vector<Rotation> rotations;
// the current input line
string line;
// the length of the longest left-hand side
string::size_type max_size = 0;

while (getline(cin, line)) {
     Phrase phrase = split(line);
     Phrase::const_iterator begin = phrase.begin();
     Phrase::const_iterator end = phrase.end();
     if (begin != end)
          max_size = max(max_size,
               phrase_length(begin, --end));
     lines.push_back(phrase);
}
The split and phrase_length functions are the same as in our previous column. The split function takes a string and splits it into individual words, each of which it returns as an element of a vector<string>. The phrase_length function figures out how many characters a given phrase will occupy when written. As with our previous program, we need to know how long the longest phrase will be so that we can pad the output properly.

There is an important reason not to generate rotations until after we have read all of the elements of lines, and that is that lines is a vector. One property of the vector class is that appending an element to a vector is permitted to reallocate the vector, and reallocation invalidates any iterators that refer to elements of the vector. Therefore, if we are going to work with iterators that refer to elements of lines, we must ensure that we do not do anything that might invalidate those iterators. We do so by reading all the elements of lines before creating any iterators that refer to its elements. Once we have read all the elements of lines, we can begin to store data structures that contain iterators that refer to elements of lines without worrying that a further change to lines might invalidate any of those iterators.

Our next job is to fill the vector of rotations:


     // examine each line in the input
     for (vector<Phrase>::const_iterator
          p = lines.begin();
          p != lines.end(); ++p) {
          Rotation e;
          e.phrase = p;
          // generate a rotation for each word in the phrase
          for (e.rhs = p->begin();
               e.rhs != p->end(); ++e.rhs)
               rotations.push_back(e);
     }
You can see that nothing in this loop copies any strings. All that it does is to manipulate iterators and build a vector of Rotation data structures, each of which is really nothing more than a pair of iterators. Because we do not copy any strings, we are storing only a single copy of each word, in contrast to the previous program that stored n copies of each word of an n-word phrase.

The outer for loop causes p to refer to each element of lines in turn. Each of those elements represents a line of input. Given a line, we want to create one rotation for each word in that line, with each of those rotations having an rhs that starts at that word. The inner for loop creates these rotations, with e.rhs referring to every word in the Phrase denoted by p.

Next, we sort the rotations:


stable_sort(rotations.begin(), rotations.end(),
     compare);
This statement would be trivial were it not that we must define an appropriate compare function. This definition is somewhat tricky, so we shall return to it later.

The final part of our program is to print the output:


     for (vector<Rotation>::const_iterator i =
          rotations.begin();
          i != rotations.end(); ++i) {
        // p refers to the input line that corresponds to the
        // current line of output, which we must
                 // write with a gap immediately before the
                 // element denoted by rhs
        vector<Phrase>::const_iterator p = i->phrase;
        // write an appropriate amount of padding
        cout << string(max_size -
                       phrase_length(p->begin(),
                            i->rhs), ' ');

          // write the part before the gap
          write_phrase(cout, p->begin(), i->rhs);

          // write the gap and the rest of the line
          cout << "   ";
          write_phrase(cout, i->rhs, p->end());
          cout << endl;
     }
This code is not all that difficult if you remember that each phrase will be written with a gap just before the point to which i->rhs refers, and with enough padding to arrange for all the gaps to line up. Accordingly, the part of the phrase before the gap will occupy a number of characters equal to phrase_length(p->begin(), i->rhs), so we subtract that number from max_size to get the number of spaces to use as padding. Then we write the first part of the phrase, some spaces, the second part of the phrase, and the end of line.

THE COMPARISON FUNCTION
It is now time to return to what we said is the hard part of the program: the comparison function. The version in our previous column had it easy, because each rotation was already in the correct order to use for comparisons. Now, however, each phrase is stored in the order in which it will be printed, which is not the order to use for comparisons. For example, consider the following output:


           The quick    brown fox
     The quick brown    fox
                 The    quick brown fox
                        The quick brown fox
The same input phrase generates all four of these output lines, so we must somehow treat that phrase in several different ways for comparison purposes.

The easiest way to solve this problem is to use the lexicographical_compare function from the standard library. This function takes two pairs of iterators, each of which delimits a sequence, and compares the elements of those sequences in dictionary order. This function allows us to define compare as follows:


     // first try--not quite good enough
     bool compare(const Rotation& x,
          const Rotation& y)
     {
          return lexicographical_compare
               (x.rhs, x.phrase->end(),
                y.rhs, y.phrase->end()))
     }
This function compares the parts of the phrases starting at rhs, which you will remember is the location of the gap. This strategy is fine as far as it goes, but it has a problem. Suppose we have two different output lines with identical right-hand parts:


       The quick    brown fox
     The quicker    brown fox
The right-hand parts will be considered equal, and the left-hand parts won't be considered at all. Therefore, these two lines will come out in whatever order they appeared in the input.

This behavior is not what we want. Instead, we want the line with quick to precede the line with quicker, on the basis that quick precedes quicker in the alphabet. For that matter, we might have:


       A quick    brown fox
     The quick    brown fox
which we would like to appear in the order shown because A precedes The.

Finally, suppose we had all four combinations of A/The and quick/quicker in the output. In that case, we would like the output to look like this:


         A quick    brown fox
       The quick    brown fox
       A quicker    brown fox
     The quicker    brown fox
with all common instances of the word just before the gap clustered together.

Our conclusion is still that comparison should be based on the right-hand side if the phrases differ there. If, however, the right-hand sides are the same, then we would like to examine the left-hand sides in reverse order by words. Believe it or not, that comparison function is not much harder to write than it is to describe:


bool compare(const Rotation& x, const Rotation& y)
{
     // if x's right-hand side is less than y's, the result is true

     if (lexicographical_compare
               (x.rhs, x.phrase->end(),
                y.rhs, y.phrase->end()))
          return true;

     // if y's right-hand side is less than x's, the result is false
     if (lexicographical_compare
               (y.rhs, y.phrase->end(),
                x.rhs, x.phrase->end()))
          return false;
     // otherwise, we determine the result by comparing
     // the left-hand sides in reverse order
     return lexicographical_compare(
         reverse_iterator<Phrase::const_iterator>
              (x.rhs), x.phrase->rbegin(),
         reverse_iterator<Phrase::const_iterator>
              (y.rhs), y.phrase->rbegin());
}
Our initial comparison is the same as in the previous version of compare. If this comparison succeeds, then we will consider the right-hand part of x to be less than y, so we return true. If it fails, we must now see whether the right-hand sides are equal. To do so, we compare the same operands in the opposite order. If this second comparison succeeds, then we can consider the right-hand part of y to be less than that of x, so we return false.

If the second comparison fails, we know that neither of the right-hand sides of x and y is less than the other one. Therefore, the right-hand sides are equal to each other, which means that we must compare the left-hand sides. We do so by using reverse iterators. To obtain the place to start the comparison, we construct an appropriate reverse iterator from x.rhs (and similarly for y.rhs); to end the comparison, we use the rbegin member, which already returns an appropriate reverse iterator.

If you think this compare function is too complicated, you may find it an interesting exercise to write it without using lexicographical_compare.

DISCUSSION
A line with n words will always consume at least O(n2) time because it will generate O(n2) output. Therefore, our revised program will be only slightly faster at best than the program that appeared in our previous article. In general, we advocate optimizing programs only when their execution time is unacceptably long. So why change the policy in this particular case?

One reason is entirely pragmatic: We actually thought of this program before the one we presented in the last article. We decided not to present it first because we thought it would be easier to understand after seeing a more straightforward solution.

The other reason is one we mentioned earlier: By using iterators in a context in which most programmers would use pointers, it shows that in some contexts, an abstract data structure will solve a problem just as well as a more concrete one. To better appreciate the difference between iterators and pointers in this context, you might try modifying this program to use lists instead of vectors. Then think about what it would have taken to change the data structures in this way if the program had used pointers and pointer operations instead of iterators.

THE MISSING PIECES
For completeness, here are the definitions of the split, write_phrase, and phrase_length functions described in Part 7 of this series:


     // split a string into individual words
     vector<string> split(const string& s)
     {
          vector<string> ret;
          typedef string::size_type string_size;
          string_size i = 0;
          // invariant: we have processed characters
                    //[original value of i,i)
          while (i != s.size()) {
              // ignore leading blanks invariant:
                             // characters in range
                             // [original i, current i) are all spaces
                             while (i != s.size()
                      && isspace(s[i]))
                           ++i;

                 // find end of next word
                 string_size j = i;
                 // invariant: none of the characters in range
                                   // [original j, current j) is a space
                 while (j != s.size()
                      && !isspace(s[j]))
                           ++j;
                 // if we found some nonwhitespace characters
                 if (i != j) {
                      // copy from s starting at
                   // i and taking j - i chars
                      ret.push_back
                           (s.substr(i, j - i));
                           i = j;
                 }
          }
          return ret;
     }
          // write the words in the range [begin, end)
          // with a space between adjacent words
  void write_phrase(ostream& out,
                   Phrase::const_iterator begin,
                   Phrase::const_iterator end)
     {
          if (begin != end) {
               out << *begin;
               while (++begin != end)
                    out << " " << *begin;
          }
     }

  // determine how much space
    // the words in the range [begin, end)
    // would consume if written by write_phrase
     string::size_type phrase_length
         (Phrase::const_iterator begin,
          Phrase::const_iterator end)
     {
          string::size_type sum = 0;
          if (begin != end) {
               sum = begin->size();
               while (++begin != end)
                    sum += begin->size() + 1;
          }
          return sum;
     }
REFERENCES
  1. Koenig, Andrew and Barbara E. Moo. "Rethinking How to Teach C++, Part 7: Payback time," Journal of Object-Oriented Programming, 14(1): May 2001.
  2. Koenig, Andrew and Barbara E. Moo. Accelerated C++: Practical Programming by Example, Addison–Wesley, Reading, MA, 2000.