STRINGS HAVE BEEN awarded a special status in the Java language. The String class
is the only reference type that can be instantiated by a compile-time literal and the only
one to have its own operators (string concatenation, + and +=). Despite its privileged rank,
there is one thing you can't do with a string that you can do with an instance of almost any
other class: you can't change its value.
The immutability of strings has a number of happy consequences for the Java developer.
I'll demonstrate how taking advantage of their special properties can improve the efficiency
of code that works with string objects. I'll also show what the Java Language Specification
has to say about supercharging string operations and how not everything it says is what it seems.
MAXIM 11: USE STRING OBJECTS IN PREFERENCE TO CHARACTER ARRAYS
A string literal is a sequence of zero or more Unicode characters enclosed in double-quotes.
The compiler replaces each literal with a reference to an instance of the
String class that contains the quoted characters.
Because a string once constructed cannot be modified, multiple identical occurrences of a
string literal can share the same String instance without
the risk of the object being altered through one reference and affecting the other. The Java
virtual machine (JVM) exploits this opportunity, replacing multiple occurrences of the same
string literal with references to the same unique String instance.
String s1 = "twine";
String s2 = "twine";
boolean shared = (s1 == s2); // true
Concatenation between string literals is performed at compile-time, so these two strings will also
refer to the same instance:
String s1 = "twine";
String s2 = "tw" + "ine";
boolean shared = (s1 == s2); // true
This means that there is no runtime overhead incurred if you split a string literal across
multiple lines to prevent it from exceeding the line length of your editor:
String message = "Warning: The system will be " +
"shutting down in one minute. Please exit " +
"your programs and log off";
These strings will be concatenated to form a single String
instance when the class is compiled, incurring a minuscule time penalty then, but none when the
program is subsequently executed.
The Java Language Specification 1.0 (JLS) (§3.10.5) states emphatically that the JVM merges
(or, to use its phraseology, interns) string literals across different classes and packages.
When a class is loaded, its string literals are merged with those already loaded. String literals
that match any of the ones that have already been loaded are replaced with references to existing
String instances. The remaining string literals require new
String instances to be constructed, which may themselves become
shared when subsequent classes are loaded. So even if s1 and
s2 had been variables of different classes from different
packages, they would still refer to the same object.
The JLS further states that "Strings computed by constant expressions (§15.27) are computed at
compile time and then treated as if they were literals." So these two strings should both refer to
the same unique instance also:
String s1 = "Route 66";
String s2 = "Route " + 66;
boolean shared = (s1 == s2); // true
That is the theory anyway. In practice, the Java runtime may not go to quite these lengths to ensure that
all possible duplication is eliminated, making the number of actual
String instantiations as few as possible. Compilers and runtimes
vary in their levels of compliance with the JLS string interning rules. Table 1 shows which compilers
and runtimes obey which rules.
Table 1. Compiler support for string interning rules.
|
|
JDK 1.0 |
JDK 1.1 |
JAVA 2 |
VISUAL J++ 1.1 |
String literals from same class merged ** |
*
|
|
|
|
String literals from different classes merged ** |
|
|
|
|
String literals from different packages merged ** |
|
|
|
|
String literals concatenated at compile time *** |
|
|
|
|
String based on constant (non-string) expressions concatenated at compile time *** |
|
|
|
|
*The JDK 1.0.2 runtime does not merge identical string literals occurring in the same class under
all circumstances. See the Effective Java Web page for further details.
** Function of the Java runtime environment (JVM).
*** Function of the Java compiler.
|
|
Object construction is one of the most expensive atomic operations in Java (McManus, A. &
J. Hunt, "The Need for Speed," Java Report, Vol. 3, No. 5, 1998). Any mechanism that reduces
the number of object instances constructed is worth taking advantage of. By using
String objects instead of character arrays for constant string
values you allow the runtime system not only to save space in memory (reducing the load on the virtual
memory system) but also time in loading the class.
The elimination of redundant String instantiations is not the only
benefit of string immutability though. It also allows the String class's own methods to be implemented
much more efficiently than would otherwise be possible.
EFFICIENT STRING METHODS
If a String object can never be modified, it follows that sub-ranges
of its character sequence must also be constant. This permits a number of the
String class's methods to be implemented very efficiently. The
substring method is a good example. The length of time taken to perform the operation does not depend
on the length of the extracted sub-string. It executes in constant time.
String s = "A very long string... (>100 chars)";
// extract the 20th through 99th characters
String ss = s.substring(20, 100);
The String class is implemented using private instance variables
to hold a char array for the characters of the string, the
string's length, and an integer
offset indicating the string's starting position in the char
array. The substring method may be given a compact implementation that simply constructs a new
String object that shares the same
char array but with different values for the
length and offset.
private String(char[] chars, int length, int offset) {
this.chars = chars;
this.length = length;
this.offset = offset;
}
public String substring(int begin, int end) {
return new String(this.chars, end - begin,
this.offset + begin);
}
This is an advantage that Java has over other languages, like C++, whose string objects are mutable.
No similarly efficient implementation is possible in C++ for the substr
member function of the STL string class, which is forced to fabricate
a private copy of the character sub-range to prevent any of these characters from being later modified via
the originating string. It operates in linear time (with respect to the length of the sub-string),
unlike Java's substring method, which executes in constant time.
The same penalty is imposed (in Java) with character arrays. Extracting a sub-string, as another character
array, also has linear time performance. The characters must be copied out using
System.arraycopy into a separate arraythe length of time
taken is a function of the number of characters copied.
This is another good reason to prefer String objects to character
arrays: to take advantage of the String class's uniquely efficient
methods.
MAXIM 12: ELIMINATE TEMPORARY STRINGS AND SAVE TIME
The examples so far have dealt with constant string values. For storing modifiable strings, Java provides
the StringBuffer class. String buffers support numerous methods for
inserting, appending, and deleting characters in the string, and can be converted easily to and from
String objects.
The compiler uses the StringBuffer class internally to implement
the String class' concatenation operator. Wherever an expression occurs using the string concatenation
operator, the compiler replaces it with equivalent operations on a temporary
StringBuffer object. For example, this line:
String s = "You have " + n + " new messages";
is converted by the compiler to the following statement:
String s = new StringBuffer("You have ").append(n).
append(" new messages").toString();
A temporary StringBuffer object is first initialized with the
leftmost operand. The remaining arguments are then appended to the string buffer using its append
method.* The completed StringBuffer object is finally converted
back to a String object using the toString
method.
This translation of the original expression results in the construction of a temporary
StringBuffer object. However, this is a cost that may be avoided
if the arguments to the string concatenation operator are all constants. If the compiler supports it,
operands based on constant values are concatenated at compile time and the construction of a
StringBuffer object at runtime is circumvented. A compiler that
supports the concatenation of strings based on constant expressions (see Table 1) would convert this
expression, containing only constants:
"You have " + 5 + " new messages"
directly to this string literal:
"You have 5 new messages"
Because it requires the construction of a temporary string concatenation (when the compiler cannot
eliminate it) incurs a fixed runtime cost per expression. If an expression containing multiple
concatenation operators is split into independent statements, the cost of constructing a temporary
is incurred by each statement instead of being incurred once and amortized across all concatenation
operators in the expression. For example, the previous example split into three statements:
String s = "You have ";
s += 5;
s += " new messages";
prevents the compiler from concatenating the constants at compile-time and incurs the construction of
two StringBuffer temporaries to boot!
The benefits of attempting to build composite strings using the fewest possible statements are greatest
when the components are all constant values, as it gives the compiler the opportunity to replace the
entire expression with a single string literal at compile-time.
Even when the string concatenation expression contains values computed at runtime, there is still
considerable advantage in attempting to concatenate as many components together in the same expression.
In this block of code, which displays the program's arguments, two of the components,
i and args[i], are not known
at compile time, so the concatenation would have to be deferred until the program executes even if the
string had been built using a single statement. Nevertheless, splitting the concatenation into multiple
statements still incurs an unnecessary penalty:
for (int i = 0; i < args.length; ++i) {
String s = "args[";
s += i;
s += "] = \"";
s += args[i];
s += "\"";
System.out.println(s);
}
Consider what happens to the body of the loop when the compiler re-expresses the string concatenation
in terms of StringBuffer operations:
String s = "args[";
s = new StringBuffer(s).append(i).toString();
s = new StringBuffer(s).append("] = \"").toString();
s = new StringBuffer(s).append(args[i]).toString();
s = new StringBuffer(s).append("\"").toString();
This results in four StringBuffer and four
String temporary objects being constructed! This is needlessly
wasteful. The simple solution is to build the string with a single expression:
String s = "args[" + i + "] = \"" + args[i] + "\"";
This expression, once reworded by the compiler, is still considerably more compact than the previously
translated code:
String s = new StringBuffer("args[").append(i).
append("] = \"").append(args[i]).append("\"").
toString();
There will be only two objects (one a temporary) created by this expressiona great improvement on the
nine objects previously constructed. Depending on the JVM, the difference in performance between this
formulation and the former one varies between a factor of 3 and a factor of 10.**
However, for a variety of reasons, it may not be possible to combine all the parts of a string together
in one expression. In that case, imitate the behavior of the compiler by manually creating a
StringBuffer object to hold the intermediate string values.
Once the string is fully formed in the buffer, it may be converted to a
String object with toString.
StringBuffer buf = new StringBuffer("n^19=");
double power = Math.pow(n, 19);
buf.append(power);
buf.append(" n^20=");
buf.append(power *= n);
String s = buf.toString();
This approach constructs only one temporary objectthe string bufferand so represents a
significant time saving over multiple uses of the string concatenation
+= operator.
MAXIM 13: ALWAYS PARENTHESIZE SUB-EXPRESSIONS WHEN CONCATENATING WITH STRINGS
The compiler's somewhat mechanical conversion of string concatenation operators to
StringBuffer operations can cause some surprises. For example,
what output do you think the programmer intended this statement to produce?
System.out.println("When does 2 + 2 = " + 2 + 2);
Because string concatenation is used to "add" the numeric values, the statement does not produce the
conventional answer of "4":
When does 2 + 2 = 22
The cause of this surprising output is that the + operator, representing both string concatenation and
arithmetic addition associates left to right. So the expression:
w + x + y + z
is evaluated as
((w + x) + y) + z
If either w or x are
strings, the other is converted to a string if required and the two strings are concatenated. The
concatenation yields another string to which y and then
z are further concatenated (after suitable conversions have
been applied).
A similar problem arises when the numeric expression contains operators of lower precedence than the
string concatenation operator:
str = "n >>> 4 = " + n >>> 4; // compile error
In this instance, the right shift with zero-extension operator (>>>) has a lower precedence than the string concatenation operator. Evaluation of the first sub-expression (assuming n equal to 100) produces:
"n >>> 4 = 100" >>> 4
The >>> operator may not be validly applied to a String
object, so a compile error results. This situation is less troublesome than the first, though, because
it cannot go unnoticed.
As a matter of defensive programming, always parenthesize numeric and boolean expressions where they
are to be concatenated with strings.
System.out.println("When does 2 + 2 = " + (2 + 2));
System.out.println("File is empty: " + (length == 0));
Even in cases where the arithmetic operators involved in a numeric expression all have precedence higher
than the + operator, parenthesize the expression so that lines like this:
System.out.println("2n = " + n * 2);
do not change radically in meaning when replaced with arithmetically equivalent forms:
System.out.println("2n = " + n + n); // intended?
System.out.println("2n = " + (n + n)); // intent clear
MAXIM 14: AVOID THE STRING.INTERN METHOD
There is one last thing to be said about improving the performance of string operations, and I know
that if I don't say it, I will be deluged with email. So here goes: it regards the use of the
String.intern method.
The intern method maintains a pool of unique
String instances. When intern is
invoked against a String object, the pool is searched
(using equals) for an instance with an identical value. If one
is found, the String object from the pool is returned. Otherwise,
this String object is added to the pool and a reference to it is returned.
It follows that for any two strings s and
t, s.intern() == t.intern()
and s.equals(t) == true are equivalent tests (JLS 1.0, §20.12.47).
The application in Listing 1
reads a Java source file and counts how many of Java's 47 keywords occur within it. The keywords are stored
in a string array and the tokens of the source file are compared sequentially against each of the array
elements:
for (int t = 0; t < tokens.length; ++t)
for (int k = 0; k < keywords.length; ++k)
if (tokens[t].equals(keywords[k])) {
++counts[k];
break;
}
In The Java Programming Language, 2nd ed. (Addison-Wesley, 1997), Ken Arnold and James Gosling
describe a technique based on the intern method that can be used to improve the performance of string
matching operations such as this one:
However, any two strings with the same contents return the same
String object from intern, which enables you to compare
string references to test equality, instead of the slower test of
string contents…Dealing with the results of intern makes comparing object references
equivalent to comparing string contents, but much faster (§8.3, p. 166).
Applying that technique to the keyword counting application, the token string is
initially interned so that the equals test may
be replaced with a reference comparison between the interned token string and the current keyword string:
for (int t = 0; t < tokens.length; ++t) {
tokens[t] = tokens[t].intern();
for (int k = 0; k < keywords.length; ++k)
if (tokens[t] == keywords[k]) {
++counts[k];
break;
}
}
Although the JLS (§20.12.47) states that, "All literal strings and string-valued constant
expressions are interned", the program now fails to locate any keywords when running under
JDK 1.0.2 (see Table 1) because the string literals in the keywords array have not been interned.
A static initializer added after the declaration of the keywords
array can be used to intern each of its string elements:
static {
for (int i = 0; i < keywords.length; ++i)
keywords[i] = keywords[i].intern();
}
Now the application works just the same as before. But by how much has this improved performance?
Table 2 compares the times taken to count the keywords in the application's own source code using
both approaches and a variety of runtime environments.
Table 2. Keyword counting application benchmark figures.
|
|
EQUALS |
intern |
Hashtable |
JDK 1.0.2 |
56.6 |
26.7 |
7.5 |
JDK 1.1.1 |
25.7 |
8.3 |
4.0 |
JDK 1.1.5 |
31.5 |
8.7 |
3.3 |
JDK 1.1.6 |
1.5 |
2.2 |
0.7 |
JDK 1.1.7 |
1.7 |
2.1 |
0.8 |
JDK1.2 beta 3 |
27.4 |
6.9 |
3.0 |
JDK1.2 beta 4 |
1.4 |
0.9 |
0.5 |
|
The figures record the time taken, in milliseconds, by the string matching method of the keyword counting
application (countKeywords) to count the keywords in the application's
own source code using the three equivalent string matching algorithms. |
|
The results are quite surprising. The intern method approach fared well against plain string comparison
with the older runtimes, but with more recent versions, it actually increased the execution time!
Furthermore, when you consider the relative complexity of reference and string comparison,***
why was the intern method, at its best, only improving on the equals method by a factor of two?
The answer is that the intern method is a relatively expensive operationon some JVMs, as costly as
48 string comparisons! This cost has to be recouped later by the savings made in the reference
comparisons. (With the JVMs I tested, the break-even point varied between 9 and 52 comparisons.)
The intern method uses a
hashtable internally to manage its unique references. For set
membership operations on unbounded domains, hashtables are
usually the highest performing data organizations. The performance of this application can be improved
by making its use of a hashtable explicit.
First, a Hashtable variable is added to the
CountKeywords class and then a static initializer is used to
load the keywords and their index values:
static private Hashtable hashtable = new Hashtable();
static {
for (int i = 0; i < keywords.length; ++i)
hashtable.put(keywords[i], new Integer(i));
}
Any token may be identified as a keyword (in virtually constant time) by attempting to retrieve
it from the hashtable:
for (int t = 0; t < tokens.length; ++t) {
Integer k = (Integer) hashtable.get(tokens[t]);
if (k != null)
++counts[k.intValue()];
}
Table 2 shows the performance improvement. In all cases, this out-performs both previous approaches!
Why wasn't the intern method able to achieve the same high performance? First, once the intern method
either located the token string in its hashtable or added it to
it, the calling method still had to search its own keywords array with the interned sting. Second,
the explicit Hashtable object only ever contains the keyword
strings, whereas the intern method's
hashtable has every unique token from the source file
(whether keyword or not) added to it. It may take a considerable time to add these symbols to the
hashtable.
The hashtable may also have difficulty managing so many strings.
JDK 1.1.7, for example, crashes with this message if you attempt to intern more than 30,000 unique strings:
*** panic: 16-bit string hashtable overflow
abnormal program termination
Another reason you may wish to avoid the intern method is that, unlike the other two approaches, it
cannot be adapted to perform case-insensitive searching.
The quickest way to search a collection of 10 or fewer strings is to place them in an array and iterate
through them comparing with equals. For more strings than that, use a
Hashtable object.
Regardless of the runtime environment, the intern method gives comparable or worse performance than
equals when the number of strings is 10 or fewer and significantly
worse performance than a hashtable when it is greater.
CONCLUSION
The decision by the Java language creators to implement strings as immutable objects was an inspired one,
with numerous positive spin-offs. One such benefit was that it permitted string literals to be
incarnated in Java as fully functioning, bona fide objects.
The tight integration of String objects into the language and the
optimizations made possible by their immutable nature are benefits uniquely available to the Java programmer.
Couple this with the techniques presented in this article and your strings will be truly supercharged.
SOURCE CODE ONLINE
Source code for the benchmarking applications as well as complete implementations of the
CountKeywords application using all three approaches is
available on the Internet from the Effective Java page.
* Where one operand to the string concatenation operator is a String object and the other is not,
string conversion is employed to convert the non-string argument to a string. Reference types are
simply converted by invoking their toString methods. Conversion of primitive types necessitates the
intermediate step of constructing a temporary instance of one of the wrapper classes-Boolean,
Integer, Float, etc.-against which the toString method may be applied.
** Ironically, it is the immutable nature of String objects that allows the StringBuffer
class to implement a very efficient toString method, which is a part of the reason why the difference
is not much greater.
*** The relative execution times of reference and string comparisons depend on the
host JVM. Reference comparison requires a mere three JVM instructions (two aloads and an if_acmpeq)
so even in a best-case scenario (comparing two strings that differ on the first character) string
comparison will be significantly more expensive on current JDKs, between 4 and 12 times.