Supercharged Strings -- ADTmag

Supercharged Strings

By Steve Ball
December 18, 1999

Steve Ball is an independent contractor specializing in C++/Java/UNIX development. He currently lectures in Object-Oriented Design and C++ Programming at Auckland Institute of Technology. You can contact him at [email protected].

STRINGS HAVE BEEN awarded a special status in the Java language. The String class is the only reference type that can be instantiated by a compile-time literal and the only one to have its own operators (string concatenation, + and +=). Despite its privileged rank, there is one thing you can't do with a string that you can do with an instance of almost any other class: you can't change its value.

The immutability of strings has a number of happy consequences for the Java developer. I'll demonstrate how taking advantage of their special properties can improve the efficiency of code that works with string objects. I'll also show what the Java Language Specification has to say about supercharging string operations and how not everything it says is what it seems.

MAXIM 11: USE STRING OBJECTS IN PREFERENCE TO CHARACTER ARRAYS
A string literal is a sequence of zero or more Unicode characters enclosed in double-quotes. The compiler replaces each literal with a reference to an instance of the String class that contains the quoted characters.

Because a string once constructed cannot be modified, multiple identical occurrences of a string literal can share the same String instance without the risk of the object being altered through one reference and affecting the other. The Java virtual machine (JVM) exploits this opportunity, replacing multiple occurrences of the same string literal with references to the same unique String instance.


String s1 = "twine";
String s2 = "twine";
boolean shared = (s1 == s2);    // true

Concatenation between string literals is performed at compile-time, so these two strings will also refer to the same instance:


String s1 = "twine";
String s2 = "tw" + "ine";
boolean shared = (s1 == s2);    // true

This means that there is no runtime overhead incurred if you split a string literal across multiple lines to prevent it from exceeding the line length of your editor:


String message = "Warning: The system will be " +
    "shutting down in one minute. Please exit " +
    "your programs and log off";

These strings will be concatenated to form a single String instance when the class is compiled, incurring a minuscule time penalty then, but none when the program is subsequently executed.

The Java Language Specification 1.0 (JLS) (§3.10.5) states emphatically that the JVM merges (or, to use its phraseology, interns) string literals across different classes and packages. When a class is loaded, its string literals are merged with those already loaded. String literals that match any of the ones that have already been loaded are replaced with references to existing String instances. The remaining string literals require new String instances to be constructed, which may themselves become shared when subsequent classes are loaded. So even if s1 and s2 had been variables of different classes from different packages, they would still refer to the same object.

The JLS further states that "Strings computed by constant expressions (§15.27) are computed at compile time and then treated as if they were literals." So these two strings should both refer to the same unique instance also:


String s1 = "Route 66";
String s2 = "Route " + 66;
boolean shared = (s1 == s2);    // true

That is the theory anyway. In practice, the Java runtime may not go to quite these lengths to ensure that all possible duplication is eliminated, making the number of actual String instantiations as few as possible. Compilers and runtimes vary in their levels of compliance with the JLS string interning rules. Table 1 shows which compilers and runtimes obey which rules.

Table 1. Compiler support for string interning rules.
	JDK 1.0	JDK 1.1	JAVA 2	VISUAL J++ 1.1
String literals from same class merged **	*
String literals from different classes merged **
String literals from different packages merged **
String literals concatenated at compile time ***
String based on constant (non-string) expressions concatenated at compile time ***
The JDK 1.0.2 runtime does not merge identical string literals occurring in the same class under all circumstances. See the Effective Java Web page for further details. * Function of the Java runtime environment (JVM). *** Function of the Java compiler.

Object construction is one of the most expensive atomic operations in Java (McManus, A. & J. Hunt, "The Need for Speed," Java Report, Vol. 3, No. 5, 1998). Any mechanism that reduces the number of object instances constructed is worth taking advantage of. By using String objects instead of character arrays for constant string values you allow the runtime system not only to save space in memory (reducing the load on the virtual memory system) but also time in loading the class.

The elimination of redundant String instantiations is not the only benefit of string immutability though. It also allows the String class's own methods to be implemented much more efficiently than would otherwise be possible.

EFFICIENT STRING METHODS
If a String object can never be modified, it follows that sub-ranges of its character sequence must also be constant. This permits a number of the String class's methods to be implemented very efficiently. The substring method is a good example. The length of time taken to perform the operation does not depend on the length of the extracted sub-string. It executes in constant time.


String s = "A very long string... (>100 chars)";
// extract the 20th through 99th characters
String ss = s.substring(20, 100);

The String class is implemented using private instance variables to hold a char array for the characters of the string, the string's length, and an integer offset indicating the string's starting position in the char array. The substring method may be given a compact implementation that simply constructs a new String object that shares the same char array but with different values for the length and offset.


private String(char[] chars, int length, int offset) {
    this.chars = chars;
    this.length = length;
    this.offset = offset;
}
public String substring(int begin, int end) {
    return new String(this.chars, end - begin,
        this.offset + begin);
}

This is an advantage that Java has over other languages, like C++, whose string objects are mutable. No similarly efficient implementation is possible in C++ for the substr member function of the STL string class, which is forced to fabricate a private copy of the character sub-range to prevent any of these characters from being later modified via the originating string. It operates in linear time (with respect to the length of the sub-string), unlike Java's substring method, which executes in constant time.

The same penalty is imposed (in Java) with character arrays. Extracting a sub-string, as another character array, also has linear time performance. The characters must be copied out using System.arraycopy into a separate array—the length of time taken is a function of the number of characters copied.

This is another good reason to prefer String objects to character arrays: to take advantage of the String class's uniquely efficient methods.

MAXIM 12: ELIMINATE TEMPORARY STRINGS AND SAVE TIME
The examples so far have dealt with constant string values. For storing modifiable strings, Java provides the StringBuffer class. String buffers support numerous methods for inserting, appending, and deleting characters in the string, and can be converted easily to and from String objects.

The compiler uses the StringBuffer class internally to implement the String class' concatenation operator. Wherever an expression occurs using the string concatenation operator, the compiler replaces it with equivalent operations on a temporary StringBuffer object. For example, this line:


String s = "You have " + n + " new messages";

is converted by the compiler to the following statement:


String s = new StringBuffer("You have ").append(n).
    append(" new messages").toString();

A temporary StringBuffer object is first initialized with the leftmost operand. The remaining arguments are then appended to the string buffer using its append method.* The completed StringBuffer object is finally converted back to a String object using the toString method.

This translation of the original expression results in the construction of a temporary StringBuffer object. However, this is a cost that may be avoided if the arguments to the string concatenation operator are all constants. If the compiler supports it, operands based on constant values are concatenated at compile time and the construction of a StringBuffer object at runtime is circumvented. A compiler that supports the concatenation of strings based on constant expressions (see Table 1) would convert this expression, containing only constants:

"You have " + 5 + " new messages"

directly to this string literal:

"You have 5 new messages"

Because it requires the construction of a temporary string concatenation (when the compiler cannot eliminate it) incurs a fixed runtime cost per expression. If an expression containing multiple concatenation operators is split into independent statements, the cost of constructing a temporary is incurred by each statement instead of being incurred once and amortized across all concatenation operators in the expression. For example, the previous example split into three statements:

String s = "You have "; s += 5; s += " new messages";

prevents the compiler from concatenating the constants at compile-time and incurs the construction of two StringBuffer temporaries to boot!

The benefits of attempting to build composite strings using the fewest possible statements are greatest when the components are all constant values, as it gives the compiler the opportunity to replace the entire expression with a single string literal at compile-time.

Even when the string concatenation expression contains values computed at runtime, there is still considerable advantage in attempting to concatenate as many components together in the same expression. In this block of code, which displays the program's arguments, two of the components, i and args[i], are not known at compile time, so the concatenation would have to be deferred until the program executes even if the string had been built using a single statement. Nevertheless, splitting the concatenation into multiple statements still incurs an unnecessary penalty:


for (int i = 0; i < args.length; ++i) {
    String s = "args[";
    s += i;
    s += "] = \"";
    s += args[i];
    s += "\"";
    System.out.println(s);
}

Consider what happens to the body of the loop when the compiler re-expresses the string concatenation in terms of StringBuffer operations:


String s = "args[";
s = new StringBuffer(s).append(i).toString();
s = new StringBuffer(s).append("] = \"").toString();
s = new StringBuffer(s).append(args[i]).toString();
s = new StringBuffer(s).append("\"").toString();

This results in four StringBuffer and four String temporary objects being constructed! This is needlessly wasteful. The simple solution is to build the string with a single expression:

String s = "args[" + i + "] = \"" + args[i] + "\"";

This expression, once reworded by the compiler, is still considerably more compact than the previously translated code:


String s = new StringBuffer("args[").append(i).
    append("] = \"").append(args[i]).append("\"").
     toString();

There will be only two objects (one a temporary) created by this expression—a great improvement on the nine objects previously constructed. Depending on the JVM, the difference in performance between this formulation and the former one varies between a factor of 3 and a factor of 10.**

However, for a variety of reasons, it may not be possible to combine all the parts of a string together in one expression. In that case, imitate the behavior of the compiler by manually creating a StringBuffer object to hold the intermediate string values. Once the string is fully formed in the buffer, it may be converted to a String object with toString.


StringBuffer buf = new StringBuffer("n^19=");
double power = Math.pow(n, 19);
buf.append(power);
buf.append(" n^20=");
buf.append(power *= n);
String s = buf.toString();

This approach constructs only one temporary object—the string buffer—and so represents a significant time saving over multiple uses of the string concatenation += operator.

MAXIM 13: ALWAYS PARENTHESIZE SUB-EXPRESSIONS WHEN CONCATENATING WITH STRINGS
The compiler's somewhat mechanical conversion of string concatenation operators to StringBuffer operations can cause some surprises. For example, what output do you think the programmer intended this statement to produce?

System.out.println("When does 2 + 2 = " + 2 + 2);

Because string concatenation is used to "add" the numeric values, the statement does not produce the conventional answer of "4":

When does 2 + 2 = 22

The cause of this surprising output is that the + operator, representing both string concatenation and arithmetic addition associates left to right. So the expression:

w + x + y + z

is evaluated as

((w + x) + y) + z

If either w or x are strings, the other is converted to a string if required and the two strings are concatenated. The concatenation yields another string to which y and then z are further concatenated (after suitable conversions have been applied).

A similar problem arises when the numeric expression contains operators of lower precedence than the string concatenation operator:

str = "n >>> 4 = " + n >>> 4; // compile error

In this instance, the right shift with zero-extension operator (>>>) has a lower precedence than the string concatenation operator. Evaluation of the first sub-expression (assuming n equal to 100) produces:

"n >>> 4 = 100" >>> 4

The >>> operator may not be validly applied to a String object, so a compile error results. This situation is less troublesome than the first, though, because it cannot go unnoticed.

As a matter of defensive programming, always parenthesize numeric and boolean expressions where they are to be concatenated with strings.


System.out.println("When does 2 + 2 = " + (2 + 2));
System.out.println("File is empty: " + (length == 0));

Even in cases where the arithmetic operators involved in a numeric expression all have precedence higher than the + operator, parenthesize the expression so that lines like this:

System.out.println("2n = " + n * 2);

do not change radically in meaning when replaced with arithmetically equivalent forms:


System.out.println("2n = " + n + n);    // intended?
System.out.println("2n = " + (n + n));  // intent clear

MAXIM 14: AVOID THE STRING.INTERN METHOD
There is one last thing to be said about improving the performance of string operations, and I know that if I don't say it, I will be deluged with email. So here goes: it regards the use of the String.intern method.

The intern method maintains a pool of unique String instances. When intern is invoked against a String object, the pool is searched (using equals) for an instance with an identical value. If one is found, the String object from the pool is returned. Otherwise, this String object is added to the pool and a reference to it is returned.

It follows that for any two strings s and t, s.intern() == t.intern() and s.equals(t) == true are equivalent tests (JLS 1.0, §20.12.47).

The application in Listing 1 reads a Java source file and counts how many of Java's 47 keywords occur within it. The keywords are stored in a string array and the tokens of the source file are compared sequentially against each of the array elements:


for (int t = 0; t < tokens.length; ++t)
    for (int k = 0; k < keywords.length; ++k)
        if (tokens[t].equals(keywords[k])) {
            ++counts[k];
            break;
        }

In The Java Programming Language, 2nd ed. (Addison-Wesley, 1997), Ken Arnold and James Gosling describe a technique based on the intern method that can be used to improve the performance of string matching operations such as this one:

However, any two strings with the same contents return the same String object from intern, which enables you to compare string references to test equality, instead of the slower test of string contents…Dealing with the results of intern makes comparing object references equivalent to comparing string contents, but much faster (§8.3, p. 166).

Applying that technique to the keyword counting application, the token string is initially interned so that the equals test may be replaced with a reference comparison between the interned token string and the current keyword string:


for (int t = 0; t < tokens.length; ++t) {
    tokens[t] = tokens[t].intern();

    for (int k = 0; k < keywords.length; ++k)
        if (tokens[t] == keywords[k]) {
            ++counts[k];
            break;
        }
}

Although the JLS (§20.12.47) states that, "All literal strings and string-valued constant expressions are interned", the program now fails to locate any keywords when running under JDK 1.0.2 (see Table 1) because the string literals in the keywords array have not been interned. A static initializer added after the declaration of the keywords array can be used to intern each of its string elements:


static {
    for (int i = 0; i < keywords.length; ++i)
        keywords[i] = keywords[i].intern();
}

Now the application works just the same as before. But by how much has this improved performance? Table 2 compares the times taken to count the keywords in the application's own source code using both approaches and a variety of runtime environments.

Table 2. Keyword counting application benchmark figures.
	EQUALS	intern	Hashtable
JDK 1.0.2	56.6	26.7	7.5
JDK 1.1.1	25.7	8.3	4.0
JDK 1.1.5	31.5	8.7	3.3
JDK 1.1.6	1.5	2.2	0.7
JDK 1.1.7	1.7	2.1	0.8
JDK1.2 beta 3	27.4	6.9	3.0
JDK1.2 beta 4	1.4	0.9	0.5

The figures record the time taken, in milliseconds, by the string matching method of the keyword counting application (countKeywords) to count the keywords in the application's own source code using the three equivalent string matching algorithms.

The results are quite surprising. The intern method approach fared well against plain string comparison with the older runtimes, but with more recent versions, it actually increased the execution time! Furthermore, when you consider the relative complexity of reference and string comparison,*** why was the intern method, at its best, only improving on the equals method by a factor of two?

The answer is that the intern method is a relatively expensive operation—on some JVMs, as costly as 48 string comparisons! This cost has to be recouped later by the savings made in the reference comparisons. (With the JVMs I tested, the break-even point varied between 9 and 52 comparisons.)

The intern method uses a hashtable internally to manage its unique references. For set membership operations on unbounded domains, hashtables are usually the highest performing data organizations. The performance of this application can be improved by making its use of a hashtable explicit.

First, a Hashtable variable is added to the CountKeywords class and then a static initializer is used to load the keywords and their index values:


static private Hashtable hashtable = new Hashtable();

static {
    for (int i = 0; i < keywords.length; ++i)
        hashtable.put(keywords[i], new Integer(i));
}

Any token may be identified as a keyword (in virtually constant time) by attempting to retrieve it from the hashtable:


for (int t = 0; t < tokens.length; ++t) {
    Integer k = (Integer) hashtable.get(tokens[t]);

    if (k != null)
        ++counts[k.intValue()];
}

Table 2 shows the performance improvement. In all cases, this out-performs both previous approaches! Why wasn't the intern method able to achieve the same high performance? First, once the intern method either located the token string in its hashtable or added it to it, the calling method still had to search its own keywords array with the interned sting. Second, the explicit Hashtable object only ever contains the keyword strings, whereas the intern method's hashtable has every unique token from the source file (whether keyword or not) added to it. It may take a considerable time to add these symbols to the hashtable.

The hashtable may also have difficulty managing so many strings. JDK 1.1.7, for example, crashes with this message if you attempt to intern more than 30,000 unique strings:

*** panic: 16-bit string hashtable overflow abnormal program termination

Another reason you may wish to avoid the intern method is that, unlike the other two approaches, it cannot be adapted to perform case-insensitive searching.

The quickest way to search a collection of 10 or fewer strings is to place them in an array and iterate through them comparing with equals. For more strings than that, use a Hashtable object.

Regardless of the runtime environment, the intern method gives comparable or worse performance than equals when the number of strings is 10 or fewer and significantly worse performance than a hashtable when it is greater.

CONCLUSION
The decision by the Java language creators to implement strings as immutable objects was an inspired one, with numerous positive spin-offs. One such benefit was that it permitted string literals to be incarnated in Java as fully functioning, bona fide objects.

The tight integration of String objects into the language and the optimizations made possible by their immutable nature are benefits uniquely available to the Java programmer. Couple this with the techniques presented in this article and your strings will be truly supercharged.

SOURCE CODE ONLINE
Source code for the benchmarking applications as well as complete implementations of the CountKeywords application using all three approaches is available on the Internet from the Effective Java page.

* Where one operand to the string concatenation operator is a String object and the other is not, string conversion is employed to convert the non-string argument to a string. Reference types are simply converted by invoking their toString methods. Conversion of primitive types necessitates the intermediate step of constructing a temporary instance of one of the wrapper classes-Boolean, Integer, Float, etc.-against which the toString method may be applied.

** Ironically, it is the immutable nature of String objects that allows the StringBuffer class to implement a very efficient toString method, which is a part of the reason why the difference is not much greater.

*** The relative execution times of reference and string comparisons depend on the host JVM. Reference comparison requires a mere three JVM instructions (two aloads and an if_acmpeq) so even in a best-case scenario (comparing two strings that differ on the first character) string comparison will be significantly more expensive on current JDKs, between 4 and 12 times.

URL
Effective Java Web Page
http://effectivejava.com/column/superstrings