Important new features under development in mid-level internationalization services
Dr. Mark Davis is a program director at IBM's Java Technology Center Silicon Valley, CA, co-founder of the Unicode effort, and is president of the Unicode Consortium. He can be contacted at [email protected]. Helena Shih is the technical lead of the IBM Classes for Unicode at IBM's Center of Java Technology. She can be contacted at [email protected].
THE DESIGNERS OF Java made the important design decision that all text would be stored in Unicode.
This solves the problem inherent in most other text-handling schemes, of always having to juggle multiple,
limited character encodings. It puts all languages on an equal footing, and makes the whole process of
designing for worldwide products far easier. But proper support of international text requires far more
than just storing characters in Unicode.
IBM's wholly-owned subsidiary, Taligent, had a great deal of previous experience in Unicode software
internationalization. In 1996, Sun contracted with Taligent to design and develop classes for the proper
handling of multilingual text for JDK 1.1. Our goals were to provide an architecture that supplied the
required functionality, was fully object-oriented (OO), could easily be extended to add additional features
or to support additional countries, and would scale well across both small and large projects written in
Java. There are some aspects of the architecture that we frequently get questions or complaints about, so
we'll explain why we made some of the decisions we did.
In JDK 1.1 we focused on the "server level" support: that is, on the mid-level internationalization services.
Most of the low-level text services were already in JDK 1.0 and only needed some enhancement. The high-level
international services (for input and output) utilized the host platform services in JDK 1.1. In Java 2 the
high-level international services have been greatly improved and no longer depend on the host platform services.
We preview some of the most important new features under development in the mid-level internationalization
services. Many of these features are in Java 2. IBM is making others available through different channels,
including classes available on the IBM alphaWorks Web site. (Note: IBM also has C and C++ versions of its
Unicode internationalization services at its Web site.)
DATE AND TIME SUPPORT
The Calendar class contains an API that allows you to interpret a Date according to a local calendar system,
even non-Gregorian ones. It also contains routines to support GUI requirements, such as rolling, adding,
and subtracting dates and times. The TimeZone class enables the conversion
between universal time (UTC) and local time. It also contains rules for figuring out the daylight savings
time according to the local conventions.
Why is January Zero?
We probably get more complaints about this than any other issue. Here's what happened.
The JDK 1.0 Date API and implementation were very specific to the Gregorian calendar, and were
not terribly Y2K friendly. Although the Gregorian calendar is used in most of the world, many
countries use other calendar systems. For instance, businesses in Europe often use a calendar
that measures by day, week, and year, rather than day, month, and year. There are also a large
number of traditional calendars in widespread use in the Middle East and Asia. To deal with these
problems, we split out the date computations into another class, Calendar, and retained Date purely
as a storage class.
The zero-based month numbers in Date were a vestige of old C-style programming: originally month
names were stored in a zero-based array and the months were numbered accordingly for convenience.
JavaSoft felt that consistency with the old Date APIs was important, so we needed to keep this
convention in Calendar. So calling calendar.set(1998, 3, 15) gives you
April 15th, not March 15th.
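The zero-based convention can be checked with a short sketch. The class and method names here are illustrative only; the point is that the named constants in Calendar avoid the off-by-one trap.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class ZeroBasedMonthSketch {
    // Sets a date using a raw month number and returns the resulting MONTH field.
    static int monthAfterSet(int year, int rawMonth, int day) {
        Calendar cal = new GregorianCalendar();
        cal.clear();
        cal.set(year, rawMonth, day);
        return cal.get(Calendar.MONTH);
    }

    public static void main(String[] args) {
        // Month 3 is April because January is 0.
        System.out.println(monthAfterSet(1998, 3, 15) == Calendar.APRIL);
        // Using the constant states the intent directly.
        System.out.println(monthAfterSet(1998, Calendar.MARCH, 15) == 2);
    }
}
```

Writing Calendar.MARCH rather than 2 keeps the call readable and makes the zero-based numbering a non-issue in practice.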
Time Zone Display Names
Every abstract class in the internationalization framework, except for
TimeZone, has a getDisplayName() function.
This means there is no easy way to get the displayable name for a time zone in JDK 1.1. This has
led to some confusion, because some people thought that TimeZone.getID()
returned a displayable name, when it actually returns an internal programmatic ID, one that should
not be displayed to end users. Moreover, the internal IDs themselves were too short and
confusing: "AST" could stand for either "Atlantic Standard Time"
or "Alaska Standard Time." To remedy this, the new method getDisplayName() has
been added to TimeZone in Java 2, and longer, more descriptive internal IDs
are available.
// In JDK 1.1
TimeZone zone = TimeZone.getTimeZone("EST");
// Four pattern letters select the long form of the zone name.
SimpleDateFormat sdf = new SimpleDateFormat(
"zzzz", Locale.ENGLISH);
sdf.getCalendar().setTimeZone(zone);
String name = sdf.format(new Date());
// name is "Eastern Standard Time"
// In Java 2
TimeZone zone =
TimeZone.getTimeZone(
"America/New_York");
String name = zone.getDisplayName(Locale.ENGLISH);
// name is "Eastern Standard Time"
Better Y2K Support
Should "01/01/00" be year 2000 or year 1900? JDK 1.1 used the 80-20 rule. This amounts to adding
1900 to the two-digit year, and if the result was more than 80 years in the past, add another 100
(for the Gregorian calendar). The Java 2 method SimpleDateFormat.set2DigitYearStart()
provides more specific control. This method sets the exact start of a 100-year range in which 2-digit years
are interpreted.
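// In JDK 1.1: There is no way to specify when the
// 2 digit year starts.
// In Java 2:
GregorianCalendar cal = new GregorianCalendar
(1952, Calendar.SEPTEMBER, 13);
// set2DigitYearStart() is defined on SimpleDateFormat, not DateFormat.
SimpleDateFormat fmt = new SimpleDateFormat("M-d-yy");
fmt.set2DigitYearStart(cal.getTime());
// The window runs from 9-13-1952 up to (not including) 9-13-2052.
fmt.parse("9-12-52"); // returns 9-12-2052 (before the window start, so next century)
fmt.parse("9-14-52"); // returns 9-14-1952 (inside the window)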
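The 100-year window can be verified with a self-contained sketch. The class name, the fixed UTC zone, and the "M-d-yy" pattern are assumptions chosen to make the result reproducible; only set2DigitYearStart() itself comes from the text above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class TwoDigitYearSketch {
    // Parses "M-d-yy" text against a window starting 13 Sep 1952 and
    // returns the four-digit year the parser chose.
    static int resolvedYear(String text) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        Calendar start = new GregorianCalendar(utc);
        start.clear();
        start.set(1952, Calendar.SEPTEMBER, 13);

        SimpleDateFormat fmt = new SimpleDateFormat("M-d-yy", Locale.ROOT);
        fmt.setTimeZone(utc);
        fmt.set2DigitYearStart(start.getTime());
        try {
            Calendar parsed = new GregorianCalendar(utc);
            parsed.setTime(fmt.parse(text));
            return parsed.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolvedYear("9-12-52")); // one day before the window start
        System.out.println(resolvedYear("9-14-52")); // one day after the window start
    }
}
```

A date that would fall before the window start is pushed forward by a century, which is why the two inputs one day either side of September 13 land a hundred years apart.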
Improved Daylight Savings Switchover
In JDK 1.1, SimpleTimeZone allows the start and end dates for Daylight
Savings Time to be specified in only one way, as the Nth or Nth-from-last occurrence of a given weekday
in a given month, e.g., the last Sunday in October. However, some time zones have more complicated rules
for the switchover dates. For example, in Brazil Eastern Time, DST ends on the first Sunday on or after
February 11th, which cannot be expressed with the JDK 1.1 APIs.
This was resolved in Java 2 by adding several new types of DST start and end rules. The following rule
types will handle all known modern and historical time zones and provide more flexibility for the future:
- A fixed date in a given month, e.g., the 1st of April.
- The first occurrence of a given day of the week on or after a certain date in the month, e.g., the
first Sunday on or after February 18th, or equivalently, the first Sunday after the second Thursday.
- The last occurrence of a given day of the week on or before a certain date in the month.
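The "on or after" rule type can be sketched with the Java 2 SimpleTimeZone constructor, where a positive day-of-month combined with a negative day-of-week selects that rule. The zone ID, the UTC-3 offset, and the October start rule below are hypothetical values for illustration; only the "first Sunday on or after February 11th" end rule comes from the text.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.SimpleTimeZone;
import java.util.TimeZone;

class DstRuleSketch {
    // Hypothetical zone: DST starts on the first Sunday on or after October 8
    // and ends on the first Sunday on or after February 11 (southern hemisphere).
    static final SimpleTimeZone ZONE = new SimpleTimeZone(
            -3 * 60 * 60 * 1000, "Hypothetical/Brazil-East",
            Calendar.OCTOBER, 8, -Calendar.SUNDAY, 0,
            Calendar.FEBRUARY, 11, -Calendar.SUNDAY, 0);

    // True if noon UTC on the given date falls in daylight time for ZONE.
    static boolean inDaylightTime(int year, int month, int day) {
        Calendar utc = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        utc.clear();
        utc.set(year, month, day, 12, 0, 0);
        return ZONE.inDaylightTime(utc.getTime());
    }

    public static void main(String[] args) {
        // In 2001, February 11 is itself a Sunday, so DST ends that day.
        System.out.println(inDaylightTime(2001, Calendar.FEBRUARY, 9));
        System.out.println(inDaylightTime(2001, Calendar.FEBRUARY, 13));
        // In 2002, February 11 is a Monday, so DST ends on Sunday the 17th.
        System.out.println(inDaylightTime(2002, Calendar.FEBRUARY, 15));
        System.out.println(inDaylightTime(2002, Calendar.FEBRUARY, 19));
    }
}
```

A day-of-week of 0 would select the fixed-date rule, and making both the day-of-month and the day-of-week negative would select the "on or before" rule.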
Correct Rolling
With the Calendar class in JDK 1.1, there is no way to implement a date widget with arrow buttons
correctly. Suppose the user has selected the MONTH value for January 30, 1997,
and hits the up arrow twice. The first click correctly sets the control to February 28, 1997, but the
second click sets it to March 28, 1997, instead of March 30, 1997.
The correct implementation is to remember the original date and roll the month field the proper number of
steps from that original date for each click of the arrows. Unfortunately, in JDK 1.1, you can only roll
a field a single unit at a time. Java 2 fixes this problem by adding the ability to roll a field multiple
units in a single operation.
// In Java 2:
myCalendar.setTime(aDate);
myCalendar.roll(Calendar.MONTH, numberOfArrowClicks);
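The difference between rolling one step at a time and rolling from the remembered original date can be seen in a small sketch. The class and method names are illustrative; the dates are the ones used in the example above.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class RollSketch {
    static String show(Calendar cal) {
        return (cal.get(Calendar.MONTH) + 1) + "/"
                + cal.get(Calendar.DAY_OF_MONTH) + "/" + cal.get(Calendar.YEAR);
    }

    // Rolls MONTH by `clicks` in one operation, starting from January 30, 1997.
    static String rollFromOriginal(int clicks) {
        Calendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 30);
        cal.roll(Calendar.MONTH, clicks);
        return show(cal);
    }

    // Rolls MONTH one unit per click, so the day pinned in February carries forward.
    static String rollStepwise(int clicks) {
        Calendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 30);
        for (int i = 0; i < clicks; i++) {
            cal.roll(Calendar.MONTH, 1);
        }
        return show(cal);
    }

    public static void main(String[] args) {
        System.out.println(rollFromOriginal(2)); // day of month preserved
        System.out.println(rollStepwise(2));     // day of month lost in February
    }
}
```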
International Calendar Classes
Although the Calendar class is architected to allow for multiple calendars, both JDK 1.1 and 1.2 only
include support for the Gregorian calendar. However, IBM is previewing a large set of international
calendars on the alphaWorks Web site, including Hebrew, Islamic, Buddhist, and Japanese calendars.
LOCALES AND RESOURCES
A locale in the JDK is merely an identifier. This identifier is made up of the ISO language code and
country code, plus optional variants (for information on the ISO codes, see the Unicode Web site).
Because Locale is just a lightweight identifier, there is no need for validity checking when you
construct a locale. Whenever you construct an international object, you have the opportunity to
supply an explicit Locale or you can use whatever the current default locale is on your system:
Collator col = Collator.getInstance(Locale.FRANCE);
if (col.compare(string1, string2) < 0) {
... // based on the French locale's sort sequence
}
Collator col = Collator.getInstance();
if (col.compare(string1, string2) < 0) {
... // based on the default locale's sort sequence
}
The ResourceBundle class provides a way to isolate translatable text or
localizable objects from your core source code. For example, resource bundles can be used for translatable
error messages, or building translatable components. The JDK also uses resource bundles to hold its own
localized data. For example, when you ask for a NumberFormat object, the
necessary formatting information is retrieved from a resource bundle.
Why Can't You Set the Default Locale in Applets?
People frequently ask for the ability to call Locale.setDefault() within
an applet. The problem is that a single JVM can run more than one applet at a time in the same address
space. Locale.setDefault() would change the default locale for the whole
address space, which means that all of the applets would be affected; this is considered a security
violation. To work around this, set the applet's locale instead of using
Locale.setDefault(). When you need an international class, supply the locale
explicitly:
NumberFormat nf = NumberFormat.getInstance(
myApplet.getLocale());
ResourceBundle Fallback Detection
The ResourceBundle implementation currently includes a fallback mechanism:
if the specified resource can't be found in the specified locale, ResourceBundle searches:
- first in the resource bundle for the specified language and country
- then in the resource bundle for the specified language
- then in the resource bundle for the default locale's language and country
- then in the resource bundle for the default locale's language
- finally in the root resource bundle
Sometimes this is not what you want, or at least you may want to be able to detect when a particular piece
of data came from a fallback locale rather than the specified one. For example, suppose you wanted a
specific resource from the French Belgian locale, and there is only a French locale installed: you'll
get the wrong resource. In Java 2 we added a method, getLocale(), to find
out the actual locale that a resource bundle comes from, so that you can determine if a fallback was used.
// In Java 2
Locale frBE_Locale = new Locale("fr", "BE");
ResourceBundle rb =
ResourceBundle.getBundle("MyResources", frBE_Locale);
if (!rb.getLocale().equals(frBE_Locale)) {
// French Belgian resources not available,
// report an error
}
COMPARISON AND BOUNDARIES
In JDK 1.1, Collator allows you to compare strings in a language-sensitive
way. The standard comparison in String will just do a binary comparison.
For strings that will be displayed to the user, this is almost always incorrect! Wherever the ordering
or equality of strings is important to the user, such as when presenting an alphabetized list, then use
a Collator instead. Otherwise a German, for example, will find that you don't equate two strings that
she thinks are equal!
if (string1.compareTo(string2) < 0) {...
// bitwise comparison
Collator col = Collator.getInstance();
if (col.equals(string1, string2)) {...
...
if (col.compare(string1, string2) < 0) {...
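The difference is easy to see by sorting the same list both ways. This is a minimal sketch with illustrative names; it uses the English collator so that the result does not depend on the machine's default locale.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorSketch {
    // Sorts by raw char values, as String.compareTo() does.
    static String sortBinary(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return String.join(",", copy);
    }

    // Sorts with the English collator (Collator is a Comparator).
    static String sortEnglish(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.ENGLISH));
        return String.join(",", copy);
    }

    // Compares at PRIMARY strength, which ignores case differences.
    static boolean primaryEqual(String a, String b) {
        Collator col = Collator.getInstance(Locale.ENGLISH);
        col.setStrength(Collator.PRIMARY);
        return col.equals(a, b);
    }

    public static void main(String[] args) {
        System.out.println(sortBinary("cherry", "apple", "Banana"));
        System.out.println(sortEnglish("cherry", "apple", "Banana"));
        System.out.println(primaryEqual("abc", "ABC"));
    }
}
```

The binary sort puts every uppercase letter ahead of every lowercase one, which is rarely what a user reading the list expects.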
Why Have CharacterIterator?
The CharacterIterator class is used in
BreakIterator and a few other places in the JDK and is used even more in
Java 2. Sometimes we are asked why we didn't use String
or StringBuffer instead.
String and StringBuffer are simple classes
that store their characters contiguously. Insertion or deletion of characters in a
StringBuffer end up shifting all the characters that follow, which works
fine for reasonably small numbers of characters. However, this model doesn't scale well. Consider a
word processor, for example, where shifting many kilobytes of characters just to insert or delete
one character involves far too much extra work. For acceptable performance in these circumstances,
text needs to be stored in data structures that use internally discontiguous chunks of storage.
We needed some way to have a more abstract representation of text that could be used both for String
and for larger-scale text models. Unfortunately, we couldn't change String
and StringBuffer to descend from an abstract class that would provide this
sort of representation. To resolve this problem, we added a very minimal interface,
CharacterIterator. This interface allows both sequential
(forward and backward) and random access to characters from any source, not just from a
String or StringBuffer.
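As a minimal sketch of the three access styles, the following code walks a CharacterIterator forward, backward, and by index. It uses StringCharacterIterator, the JDK's String-backed implementation; the helper names are illustrative, and the same methods would work unchanged over any other text store that implements the interface.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class CharacterIteratorSketch {
    // Sequential forward access: first() ... next() until DONE.
    static String forward(CharacterIterator it) {
        StringBuilder sb = new StringBuilder();
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Sequential backward access: last() ... previous() until DONE.
    static String backward(CharacterIterator it) {
        StringBuilder sb = new StringBuilder();
        for (char c = it.last(); c != CharacterIterator.DONE; c = it.previous()) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Random access: setIndex() positions the iterator and returns that character.
    static char charAt(CharacterIterator it, int index) {
        return it.setIndex(it.getBeginIndex() + index);
    }

    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("abc");
        System.out.println(forward(it));
        System.out.println(backward(it));
        System.out.println(charAt(it, 1));
    }
}
```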
Rule-Based BreakIterator
The BreakIterator class finds character, word, line, and sentence
boundaries, which may vary depending on the locale. The JDK 1.1 BreakIterator
implementation uses a state machine, which makes it very fast. However, it does not allow the behavior to
vary depending on the locale. If the built-in classes don't support behavior the clients want, they must
create a completely new BreakIterator subclass of their own; they can't
leverage the JDK code at all.
Therefore, we undertook an extensive revision of the BreakIterator framework.
The new RuleBasedBreakIterator class essentially works the same way the old
class did, but it builds the category and state tables from a textual description, which is essentially
a string of regular expressions. This description can be loaded from a resource (allowing different
breaking rules for different languages) or supplied by the client (allowing runtime customization).
This class is provided on the alphaWorks Web site.
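For readers unfamiliar with the base API, here is a minimal sketch of word-boundary iteration using the standard java.text.BreakIterator factory (not the alphaWorks RuleBasedBreakIterator). The class name and the letter-or-digit filter are illustrative choices.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Returns the word-like segments of `text`, skipping spaces and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```

Each pair of adjacent boundaries delimits one segment, so punctuation and whitespace come back as segments of their own and have to be filtered by the caller.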
Locale-Sensitive Searching
The CollationElementIterator class is intended for use in locale-sensitive
text searching. However, it is missing several methods in JDK 1.1 that make it impossible to use with fast
string searching algorithms such as Boyer-Moore. The following new methods were added in Java 2 to fix this:
- The getOffset() method tells where a collation element was found.
- The previous() and setOffset() methods enable
backing up and moving around in the text being searched.
- The new setText() method allows reuse of a
CollationElementIterator. When collating or searching a large number of
strings, it is much faster to reuse one CollationElementIterator than to
construct a new one each time.
- The isIgnorable() method tells whether a collation element is ignorable.
- The getMaxExpansion() method returns the maximum length of any expansion
sequence producing a given character. A fast search algorithm needs to know the maximum "shift" distance
in looking for possible match sites. This is complicated by the fact that in natural language, a match
can occur with different numbers of characters. If a search pattern for German text contains "oe", for
example, it can match the single character "ö" in the text being searched. With the maximum expansion
length, a fast search algorithm can compute the correct lower limit on shift distances.
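The methods above can be combined into a simple locale-sensitive search. The sketch below is a naive scan, not Boyer-Moore, and it makes two assumptions: that the English collator returned by Collator.getInstance() is a RuleBasedCollator, and that comparing primary orders alone is an acceptable definition of a match. All class and method names are illustrative.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CollationSearchSketch {
    // Returns the offset in `text` of the first primary-strength match of
    // `pattern`, or -1 if there is none.
    static int indexOf(String text, String pattern) {
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);

        // Collect the non-ignorable primary orders of the pattern.
        List<Integer> patternKeys = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(pattern);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            int primary = CollationElementIterator.primaryOrder(e);
            if (primary != 0) patternKeys.add(primary);
        }
        if (patternKeys.isEmpty()) return -1;

        // Reuse the iterator for the text, recording where each element starts.
        List<Integer> textKeys = new ArrayList<>();
        List<Integer> textOffsets = new ArrayList<>();
        it.setText(text);
        int offset = it.getOffset();
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            int primary = CollationElementIterator.primaryOrder(e);
            if (primary != 0) {
                textKeys.add(primary);
                textOffsets.add(offset);
            }
            offset = it.getOffset();
        }

        // Naive scan over the element sequences.
        int n = patternKeys.size();
        for (int i = 0; i + n <= textKeys.size(); i++) {
            if (textKeys.subList(i, i + n).equals(patternKeys)) {
                return textOffsets.get(i);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("hello world", "WORLD"));
    }
}
```

Because the comparison runs over collation elements rather than characters, case differences fall away without any explicit case folding; a production search would also use previous() and getMaxExpansion() to skip ahead safely.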
Unicode Normalization
Unicode is more than just "wide ASCII." One of the principal operations on Unicode is to normalize
text, ensuring that you have a unique spelling for a given text. Text normalization includes
decomposition and composition forms of characters. Text can be normalized to be a canonical
equivalent to the original unnormalized text or to be a compatibility equivalent to the
original unnormalized text. For more information, please see Unicode technical report #15 on the
Unicode Web site.
One of the Unicode normalization forms is used internally as a part of JDK 1.1, but it is not public.
The Normalizer class incorporates this technology and allows either batch
or incremental normalization of text. This class is provided on the alphaWorks Web site.
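To make the forms concrete, here is a minimal sketch using java.text.Normalizer, the public normalization class found in later JDK releases. It is a stand-in for illustration and is not the alphaWorks Normalizer class described above; the helper names are assumptions.

```java
import java.text.Normalizer;

class NormalizationSketch {
    // Canonical composition (NFC): base letter + combining mark -> one character.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD): one character -> base letter + combining mark.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility composition (NFKC): also folds compatibility characters
    // such as ligatures into their plain equivalents.
    static String compatibilityCompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT versus precomposed U+00E9.
        System.out.println(compose("e\u0301").equals("\u00e9"));
        System.out.println(decompose("\u00e9").length());
        // U+FB01 LATIN SMALL LIGATURE FI.
        System.out.println(compatibilityCompose("\ufb01"));
    }
}
```

Two strings that look identical on screen compare equal only after both have been brought to the same normalization form.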
FORMATTING AND PARSING
JDK 1.1 provides a rich set of functionality for formatting values into strings and parsing strings into
values in a locale-sensitive way. These include numbers, dates, times, and messages.
Number formatting supports spreadsheet-style patterns. For example, a format such
as "#,##0.00#" will produce output like "1,234.567" or "5.00"; the pattern specifies
that you have at least 2 decimal digits, but no more than 3. You can also reset the decimals and
other characteristics of the pattern programmatically. Number formatting also provides powerful
pattern parsing support for proportional font decimal alignment.
Date/Time formatting supports similar features, and is fully integrated
with Calendar. Message formatting allows access to number, date, and time
formatting within the context of a localizable string.
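The pattern behavior described above can be sketched as follows. The class and method names are illustrative, and US symbols are fixed explicitly so that the output does not depend on the default locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.MessageFormat;
import java.util.Locale;

class PatternSketch {
    // Formats `value` with a spreadsheet-style pattern and US separators.
    static String format(String pattern, double value) {
        DecimalFormat fmt = new DecimalFormat(pattern, new DecimalFormatSymbols(Locale.US));
        return fmt.format(value);
    }

    // Uses the same number pattern inside a localizable message string.
    static String message(double value) {
        MessageFormat mf = new MessageFormat("Total: {0,number,#,##0.00#}", Locale.US);
        return mf.format(new Object[] { value });
    }

    public static void main(String[] args) {
        System.out.println(format("#,##0.00#", 1234.567)); // three decimals kept
        System.out.println(format("#,##0.00#", 5));        // padded to two decimals
        System.out.println(message(1234.567));
    }
}
```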
Substitutable Currencies
NumberFormat provides the factory method
getCurrencyInstance(), which creates an object that can convert numbers
to and from strings in the currency format of a given locale. In JDK 1.1, these formats were treated
just like any other number formats. They were constructed from strings that were fetched from
ResourceBundles. In Java 2, the currency symbol can be specified independently
from the rules for decimal places, thousands separator, and so on, and is supplied in the pattern
with the international currency symbol ("¤" = "\u00A4").
// In Java 2
DecimalFormat fmt = (DecimalFormat)
NumberFormat.getCurrencyInstance(Locale.US);
DecimalFormatSymbols us_syms = fmt.getDecimalFormatSymbols();
us_syms.setCurrencySymbol("US$ ");
fmt.setDecimalFormatSymbols(us_syms);
String result = fmt.format(1234.56);
// result is "US$ 1,234.56"
ISO Currency Codes
Additionally, we added an API to retrieve the 3-letter international currency codes defined in ISO
4217. These are necessary in an application that deals with many different currencies because the
regular, one-character currency symbols are often shared by many different currencies. For example,
both the US and Canada use "$" in their default currency format. An application dealing with
both currencies will probably want to use "USD" and "CAD" instead. In Java 2, this is now possible,
using a sequence of two international currency symbols ("¤¤" = "\u00A4\u00A4") in the pattern.
// In Java 2:
DecimalFormat fmt = new DecimalFormat(
"\u00a4\u00a4 #,##0.00;(\u00a4\u00a4 #,##0.00)");
String result = fmt.format(1234.56);
// result is "USD 1,234.56" when the default locale is the US.
Parse Error Information
The abstract method parseObject() in java.text.Format is used to
parse strings and turn them into objects. In JDK 1.1, the program can find out how far the parse got
so that it can continue from that point on. However, it cannot find out how far it got if there was
an error. In Java 2 a new field, errorOffset, now contains that information. If an error occurs
during parsing, the formatters set this value before returning an error or throwing an exception.
In the following example, a text field is parsed for a number. If an error is found, the text beyond
the error is highlighted, a message is displayed, and a beep is played.
// In Java 2:
String contents = textField.getText();
try {
NumberFormat fmt = NumberFormat.
getInstance();
Number value = fmt.parse(contents);
} catch (ParseException foo) {
errorLabel.getToolkit().beep();
errorLabel.setText(
myResourceBundle.getString("invalid number"));
textField.select(foo.getErrorOffset(),
contents.length());
}
Number Format Enhancements
On the alphaWorks Web site we provide a class that correctly supports exponentials in number
formatting and parsing. The new number formatter supports formats such as "1.2345E3", as well as
engineering exponents, where the exponent is always a power of 3. It also supports formatting
and parsing BigInteger or BigDecimal
values without loss of precision, and "nickel-rounding": the ability to round to multiples of a
specified number, such as $0.05. (This is important for some countries whose smallest coin
is 5 units instead of 1. The implementation is not restricted to nickels, however, and can be
used to round to multiples of any given value.)
Here is an example using the class from the alphaWorks Web site.
// NumberFormat is abstract, so the pattern goes to a concrete DecimalFormat.
DecimalFormat fmt = new DecimalFormat("0.0000E00");
String result = fmt.format(123456789);
// result is "1.2346E08"
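The following sketch illustrates exponent patterns and rounding to a multiple. It is an approximation built on the standard java.text.DecimalFormat and java.math.BigDecimal, not the alphaWorks class itself, and the helper names are assumptions.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class ExponentSketch {
    // Formats `value` with an exponent pattern, using US symbols.
    static String scientific(String pattern, double value) {
        DecimalFormat fmt = new DecimalFormat(pattern, new DecimalFormatSymbols(Locale.US));
        return fmt.format(value);
    }

    // "Nickel rounding": rounds `amount` to the nearest multiple of `increment`
    // by counting whole increments and scaling back up.
    static String roundToMultiple(String amount, String increment) {
        BigDecimal step = new BigDecimal(increment);
        BigDecimal steps = new BigDecimal(amount).divide(step, 0, RoundingMode.HALF_UP);
        return steps.multiply(step).toString();
    }

    public static void main(String[] args) {
        System.out.println(scientific("0.0000E00", 123456789));
        // Up to three integer digits forces the exponent to a multiple of 3.
        System.out.println(scientific("##0.#####E0", 12345));
        System.out.println(roundToMultiple("1.23", "0.05"));
    }
}
```

Doing the rounding in BigDecimal rather than double avoids the binary representation error that would otherwise creep into currency amounts.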
Number Formats in Words
The ability to take a numeric value (e.g., 12,345) and translate it into words (e.g., "twelve
thousand three hundred forty-five") is often needed in business applications, for example, to write
out the amount on a check. Number spellout in English is a relatively easy thing to do; good algorithms
for this are well-known and widely used. A number-spellout engine that can be customized for any
language is another thing altogether.
It's not enough to simply take the algorithm for English and read the literal string values from a
resource file. English separates all component parts of a number with spaces; Italian and German do
not. Some languages, such as Spanish and Italian, drop the word for "one" from the phrases "one hundred"
or "one thousand." There are many other examples that show translating a number into words is not a
trivial task.
To solve these issues, we developed a class called RuleBasedNumberFormat.
It's a general, rule-based mechanism for converting numbers to spelled-out strings. This class is
available on the alphaWorks Web site, along with information on the usage and rule syntax.
RuleBasedNumberFormat fmt =
new RuleBasedNumberFormat(rules);
String result = fmt.format(1234);
// result is "one thousand two hundred thirty four"
CONCLUSION
The internationalization services in JDK 1.1 provide a wide range of functionality and are easily
extended to add additional features and to support additional countries. We have had an opportunity
to discuss some of the design decisions taken in developing these classes and some of the enhancements
that are being included in future releases. A more detailed discussion is available at the IBM
international classes Web site, and includes more about the JDK i18n classes and possible future
internationalization improvements that IBM is discussing with Sun. These future possibilities
include the following:
- A character conversion API, so that you can actually find out which character code converters are
supported on your system, and get understandable display names for them, e.g., instead of "Cp1089",
in English displaying "Arabic (ISO 8859-6)" and in French displaying "Arabe (OSI 8859-6)"
- Customized locales, so that you could have fine-grained control over default behavior, e.g.,
NumberFormat.setInstance(new Locale("en", "US",
"Acme Widgets, Inc."),
new DecimalFormat("#,##0.0#"));
- Convenience methods for common cases, e.g.,
String value = Number.formatCurrency(amount);
We are working on many future enhancements, some of which are available right now on IBM's alphaWorks
Web site. We encourage those interested to download versions from there; any comments on the design
and implementation are welcome!
ACKNOWLEDGMENTS
Our thanks to Kathleen Wilson, Rich Gillam, and Laura Werner for their extensive review, suggestions,
and for organization of the document. Many other people at IBM and Sun contributed to the Java
internationalization efforts.
Trademarks
Unicode is a trademark of Unicode, Inc. Copyright © 1998, IBM Corp. All rights reserved.