The Java International API: Beyond JDK 1.1

 

“Important new features under development in mid-level internationalization services”

Dr. Mark Davis is a program director at IBM's Java Technology Center Silicon Valley, CA, co-founder of the Unicode effort, and is president of the Unicode Consortium. He can be contacted at [email protected]. Helena Shih is the technical lead of the IBM Classes for Unicode at IBM's Center of Java Technology. She can be contacted at [email protected].

THE DESIGNERS OF Java made the important design decision that all text would be stored in Unicode. This solves the problem inherent in most other text-handling schemes, of always having to juggle multiple, limited character encodings. It puts all languages on an equal footing, and makes the whole process of designing for worldwide products far easier. But proper support of international text requires far more than just storing characters in Unicode.

IBM's wholly-owned subsidiary, Taligent, had a great deal of previous experience in Unicode software internationalization. In 1996, Sun contracted with Taligent to design and develop classes for the proper handling of multilingual text for JDK 1.1. Our goals were to provide an architecture that supplied the required functionality, was fully object-oriented (OO), could easily be extended to add additional features or to support additional countries, and would scale well across both small and large projects written in Java. There are some aspects of the architecture that we frequently get questions or complaints about, so we'll explain why we made some of the decisions we did.

In JDK 1.1 we focused on the "server level" support: that is, on the mid-level internationalization services. Most of the low-level text services were already in JDK 1.0 and only needed some enhancement. The high-level international services (for input and output) utilized the host platform services in JDK 1.1. In Java 2 the high-level international services have been greatly improved and no longer depend on the host platform services.

We preview some of the most important new features under development in the mid-level internationalization services. Many of these features are in Java 2. IBM is making others available through different channels, including classes available on the IBM alphaWorks Web site. (Note: IBM also has C and C++ versions of its Unicode internationalization services at its Web site.)

DATE AND TIME SUPPORT
The Calendar class contains an API that allows you to interpret a Date according to a local calendar system, even non-Gregorian ones. It also contains routines to support GUI requirements, such as rolling, adding, and subtracting dates and times. The TimeZone class enables the conversion between universal time (UTC) and local time. It also contains rules for figuring out the daylight savings time according to the local conventions.

Why is January Zero?
We probably get more complaints about this than any other issue. Here's what happened. The JDK 1.0 Date API and implementation were very specific to the Gregorian calendar, and were not terribly Y2K friendly. Although the Gregorian calendar is used in most of the world, many countries use other calendar systems. For instance, businesses in Europe often use a calendar that measures by day, week, and year, rather than day, month, and year. There are also a large number of traditional calendars in widespread use in the Middle East and Asia. To deal with these problems, we split out the date computations into another class, Calendar, and retained Date purely as a storage class.

The zero-based month numbers in Date were a vestige of old C-style programming: originally month names were stored in a zero-based array and the months were numbered accordingly for convenience. JavaSoft felt that consistency with the old Date APIs was important, so we needed to keep this convention in Calendar. So calling calendar.set(1998, 3, 5) gives you April 5th, not March 5th.
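
A minimal sketch of the convention follows. The helper fromHumanDate is a name introduced here purely for illustration; it accepts a 1-based month and converts it. Using the Calendar month constants avoids the off-by-one altogether.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class MonthIndexDemo {
    // Hypothetical helper: accepts a 1-based month and converts it
    // to the 0-based value that Calendar expects.
    static Calendar fromHumanDate(int year, int month1Based, int day) {
        Calendar cal = new GregorianCalendar();
        cal.clear();
        cal.set(year, month1Based - 1, day);
        return cal;
    }

    public static void main(String[] args) {
        Calendar cal = new GregorianCalendar();
        cal.clear();
        cal.set(1998, 3, 5); // month 3 is April
        System.out.println(cal.get(Calendar.MONTH) == Calendar.APRIL); // prints true

        Calendar march = fromHumanDate(1998, 3, 5);
        System.out.println(march.get(Calendar.MONTH) == Calendar.MARCH); // prints true
    }
}
```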

Time Zone Display Names
Every abstract class in the internationalization framework, except for TimeZone, has a getDisplayName() function. This means there is no easy way to get the displayable name for a time zone in JDK 1.1. This has led to some confusion, because some people thought that TimeZone.getID() returned a displayable name, when it actually returns an internal programmatic ID, one that should not be displayed to end users. Moreover, the internal IDs themselves were too short and confusing: "AST" could stand for either "Atlantic Standard Time" or "Alaska Standard Time." To remedy this, the new method getDisplayName() has been added to TimeZone in Java 2, and longer, more descriptive internal IDs are available.


// In JDK 1.1
TimeZone zone = TimeZone.getTimeZone("EST");
// Four or more pattern letters request the full zone name
SimpleDateFormat fmt = new SimpleDateFormat(
    "zzzz", Locale.ENGLISH);
fmt.getCalendar().setTimeZone(zone);
String name = fmt.format(new Date());
// name is "Eastern Standard Time"

// In Java 2
TimeZone zone = 
    TimeZone.getTimeZone(
    "America/New_York");
String name = zone.getDisplayName(Locale.ENGLISH);
// name is "Eastern Standard Time"

Better Y2K Support
Should "01/01/00" be year 2000 or year 1900? JDK 1.1 used the 80-20 rule: add 1900 to the two-digit year, and if the result is more than 80 years in the past, add another 100 (for the Gregorian calendar). The Java 2 method SimpleDateFormat.set2DigitYearStart() provides more specific control. This method sets the exact start of the 100-year range in which 2-digit years are interpreted.

// In JDK 1.1: There is no way to specify when the
// 2 digit year starts.

// In Java 2:
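GregorianCalendar cal = new GregorianCalendar
    (1952, Calendar.SEPTEMBER, 13);
// set2DigitYearStart() is defined on SimpleDateFormat
SimpleDateFormat fmt = new SimpleDateFormat("M-d-yy");
fmt.set2DigitYearStart(cal.getTime());
// The 100-year range runs from 9-13-1952 up to 9-13-2052
fmt.parse("9-12-52"); // returns 9-12-2052
fmt.parse("9-14-52"); // returns 9-14-1952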

Improved Daylight Savings Switchover
In JDK 1.1, SimpleTimeZone allows the start and end dates for Daylight Savings Time to be specified in only one way, as the Nth or Nth-from-last occurrence of a given weekday in a given month, e.g., the last Sunday in October. However, some time zones have more complicated rules for the switchover dates. For example, in Brazil Eastern Time, DST ends on the first Sunday on or after February 11th, which cannot be expressed with the JDK 1.1 APIs.

This was resolved in Java 2 by adding several new types of DST start and end rules. The following rule types will handle all known modern and historical time zones and provide more flexibility for the future:

  1. A fixed date in a given month, e.g., the 1st of April.
  2. The first occurrence of a given day of the week on or after a certain date in the month, e.g., the first Sunday on or after February 11th, or equivalently, the first Sunday after the second Thursday.
  3. The first occurrence of a given day of the week on or before a certain date in the month.
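
As a rough sketch of how such rules can be expressed, the following uses the five-argument setStartRule() and setEndRule() methods of SimpleTimeZone, where the final boolean selects "on or after" (true) or "on or before" (false). The zone ID, the raw offset, and the October start rule are assumptions made up for this illustration; only the February 11th end rule comes from the example in the text.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.SimpleTimeZone;
import java.util.TimeZone;

class DstRuleDemo {
    static final int HOUR = 60 * 60 * 1000;

    // A made-up zone three hours behind UTC (illustrative only).
    static SimpleTimeZone hypotheticalZone() {
        SimpleTimeZone tz = new SimpleTimeZone(-3 * HOUR, "Hypothetical/Zone");
        // Assumed start rule: first Sunday on or after October 8th, at midnight.
        tz.setStartRule(Calendar.OCTOBER, 8, Calendar.SUNDAY, 0, true);
        // End rule from the text: first Sunday on or after February 11th.
        tz.setEndRule(Calendar.FEBRUARY, 11, Calendar.SUNDAY, 0, true);
        return tz;
    }

    // Noon UTC on the given day, so results are independent of the host zone.
    static Date noonUtc(int year, int month, int day) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, month, day, 12, 0, 0);
        return cal.getTime();
    }

    public static void main(String[] args) {
        SimpleTimeZone tz = hypotheticalZone();
        // February 11, 1999 was a Thursday, so DST ends on Sunday, February 14.
        System.out.println(tz.inDaylightTime(noonUtc(1999, Calendar.FEBRUARY, 12))); // prints true
        System.out.println(tz.inDaylightTime(noonUtc(1999, Calendar.FEBRUARY, 15))); // prints false
    }
}
```

A fixed-date rule (type 1) would use the three-argument form, e.g. setStartRule(Calendar.APRIL, 1, 0).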

Correct Rolling
There is no way to correctly implement a date widget with arrow buttons using the Calendar class in JDK 1.1. Suppose the user has selected the MONTH value for January 30, 1997, and hits the up arrow twice. The first click correctly sets the control to February 28, 1997, but the second click sets it to March 28, 1997, instead of March 30, 1997.

The correct implementation is to remember the original date and roll the month field the proper number of steps from that original date for each click of the arrows. Unfortunately, in JDK 1.1, you can only roll a field a single unit at a time. Java 2 fixes this problem by adding the ability to roll a field multiple units in a single operation.


// In Java 2:
myCalendar.setTime(aDate);
myCalendar.roll(Calendar.MONTH, numberOfArrowClicks);
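
To make the difference concrete, here is a small sketch that contrasts the two approaches for the January 30, 1997 example. The class and method names are invented for this illustration.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class RollDemo {
    // Rolls MONTH the full number of clicks from the ORIGINAL date.
    static int dayAfterRollFromOriginal(int clicks) {
        Calendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 30);
        cal.roll(Calendar.MONTH, clicks);
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    // Rolls one unit per click; the day pinned to 28 in February is
    // carried forward into later months.
    static int dayAfterRollingStepwise(int clicks) {
        Calendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 30);
        for (int i = 0; i < clicks; i++) {
            cal.roll(Calendar.MONTH, true);
        }
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        System.out.println(dayAfterRollFromOriginal(1)); // prints 28 (February 28)
        System.out.println(dayAfterRollFromOriginal(2)); // prints 30 (March 30)
        System.out.println(dayAfterRollingStepwise(2));  // prints 28 (March 28)
    }
}
```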

International Calendar Classes
Although the Calendar class is architected to allow for multiple calendars, both JDK 1.1 and 1.2 only include support for the Gregorian calendar. However, IBM is previewing a large set of international calendars on the alphaWorks Web site—including Hebrew, Islamic, Buddhist, and Japanese calendars.

LOCALES AND RESOURCES
A locale in the JDK is merely an identifier. This identifier is made up of the ISO language code and country code, plus optional variants (for information on the ISO codes, see the Unicode Web site). Because Locale is just a lightweight identifier, there is no need for validity checking when you construct a locale. Whenever you construct an international object, you have the opportunity to supply an explicit Locale or you can use whatever the current default locale is on your system:


Collator col = Collator.getInstance(Locale.FRANCE);
if (col.compare(string1, string2) < 0) {
    ... // based on the French locale's sort sequence
}

Collator defaultCol = Collator.getInstance();
if (defaultCol.compare(string1, string2) < 0) {
    ... // based on the default locale's sort sequence
}

The ResourceBundle class provides a way to isolate translatable text or localizable objects from your core source code. For example, resource bundles can be used for translatable error messages, or building translatable components. The JDK also uses resource bundles to hold its own localized data. For example, when you ask for a NumberFormat object, the necessary formatting information is retrieved from a resource bundle.

Why Can't You Set the Default Locale in Applets?
People frequently ask for the ability to call Locale.setDefault() within an applet. The problem is that a single JVM can run more than one applet at a time in the same address space. Locale.setDefault() would change the default locale for the whole address space, which means that all of the applets would be affected; this is considered a security violation. To work around this, set the applet's locale instead of using Locale.setDefault(). When you need an international class, supply the locale explicitly:


NumberFormat nf = NumberFormat.getInstance(
   myApplet.getLocale());

ResourceBundle Fallback Detection
The ResourceBundle implementation currently includes a fallback mechanism: if the specified resource can't be found in the specified locale, ResourceBundle searches:
  • first in the resource bundle for the specified language and country
  • then in the resource bundle for the specified language
  • then in the resource bundle for the default locale's language and country
  • then in the resource bundle for the default locale's language
  • finally in the root resource bundle

Sometimes this is not what you want, or at least you may want to be able to detect when a particular piece of data came from a fallback locale rather than the specified one. For example, suppose you wanted a specific resource from the French Belgian locale, and there is only a French locale installed: you'll get the wrong resource. In Java 2 we added a method, getLocale(), to find out the actual locale that a resource bundle comes from, so that you can determine if a fallback was used.

// In Java 2
Locale frBE_Locale = new Locale("fr", "BE");
ResourceBundle rb = 
    ResourceBundle.getBundle("MyResources", frBE_Locale);
if (!rb.getLocale().equals(frBE_Locale)) {
        // French Belgian resources not available, 
        // report an error
}
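
The search order listed earlier can be sketched as a plain function that generates candidate bundle names. This is not the ResourceBundle implementation; the class and method names are invented here, variants are ignored for brevity, and the naming scheme (base name plus "_language_COUNTRY") is the conventional one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BundleLookupOrder {
    // Candidate names in search order: requested locale, then the
    // default locale, then the root bundle. Duplicates are skipped.
    static List<String> candidates(String baseName, Locale wanted, Locale defaultLocale) {
        List<String> names = new ArrayList<>();
        addLocaleNames(names, baseName, wanted);
        addLocaleNames(names, baseName, defaultLocale);
        addOnce(names, baseName); // root bundle
        return names;
    }

    private static void addLocaleNames(List<String> names, String base, Locale loc) {
        String lang = loc.getLanguage();
        String country = loc.getCountry();
        if (lang.isEmpty()) {
            return;
        }
        if (!country.isEmpty()) {
            addOnce(names, base + "_" + lang + "_" + country);
        }
        addOnce(names, base + "_" + lang);
    }

    private static void addOnce(List<String> names, String name) {
        if (!names.contains(name)) {
            names.add(name);
        }
    }

    public static void main(String[] args) {
        System.out.println(candidates("MyResources", Locale.CANADA_FRENCH, Locale.US));
        // prints [MyResources_fr_CA, MyResources_fr, MyResources_en_US, MyResources_en, MyResources]
    }
}
```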

COMPARISON AND BOUNDARIES
In JDK 1.1, Collator allows you to compare strings in a language-sensitive way. The standard comparison in String will just do a binary comparison. For strings that will be displayed to the user, this is almost always incorrect! Wherever the ordering or equality of strings is important to the user, such as when presenting an alphabetized list, then use a Collator instead. Otherwise a German, for example, will find that you don't equate two strings that she thinks are equal!

if (string1.compareTo(string2) < 0) {... 
// binary comparison

Collator col = Collator.getInstance();
if (col.equals(string1, string2)) {...
...
if (col.compare(string1, string2) < 0) {...
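
A short sketch of the difference, using a list with mixed case. The class and method names are invented for the example; the US English locale is chosen only to make the result predictable.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorSortDemo {
    // Sorts by raw char values: every uppercase letter precedes lowercase.
    static String[] binarySort(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts the way a reader of the given locale would expect.
    static String[] localeSort(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    // Equality that ignores case differences (PRIMARY strength).
    static boolean samePrimary(String a, String b, Locale locale) {
        Collator col = Collator.getInstance(locale);
        col.setStrength(Collator.PRIMARY);
        return col.equals(a, b);
    }

    public static void main(String[] args) {
        String[] words = { "cherry", "apple", "Banana" };
        System.out.println(Arrays.toString(binarySort(words)));            // prints [Banana, apple, cherry]
        System.out.println(Arrays.toString(localeSort(words, Locale.US))); // prints [apple, Banana, cherry]
        System.out.println(samePrimary("abc", "ABC", Locale.US));          // prints true
    }
}
```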

Why Have CharacterIterator?
The CharacterIterator class is used in BreakIterator and a few other places in the JDK and is used even more in Java 2. Sometimes we are asked why we didn't use String or StringBuffer instead.

String and StringBuffer are simple classes that store their characters contiguously. Insertion or deletion of characters in a StringBuffer ends up shifting all the characters that follow, which works fine for reasonably small numbers of characters. However, this model doesn't scale well. Consider a word processor, for example, where shifting many kilobytes of characters just to insert or delete one character involves far too much extra work. For acceptable performance in these circumstances, text needs to be stored in data structures that use internally discontiguous chunks of storage.

We needed some way to have a more abstract representation of text that could be used both for String and for larger-scale text models. Unfortunately, we couldn't change String and StringBuffer to descend from an abstract class that would provide this sort of representation. To resolve this problem, we added a very minimal interface, CharacterIterator. This interface allows both sequential (forward and backward) and random access to characters from any source, not just from a String or StringBuffer.
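
The three access styles can be sketched with StringCharacterIterator, the JDK's String-backed implementation. The helper methods below are written against the CharacterIterator interface, so they would work equally with an iterator over some other text store; their names are invented for this example.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class CharIterDemo {
    // Sequential access, front to back.
    static String forward(CharacterIterator it) {
        StringBuilder sb = new StringBuilder();
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Sequential access, back to front.
    static String backward(CharacterIterator it) {
        StringBuilder sb = new StringBuilder();
        for (char c = it.last(); c != CharacterIterator.DONE; c = it.previous()) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Random access: setIndex() moves the iterator and returns the character there.
    static char at(CharacterIterator it, int index) {
        return it.setIndex(index);
    }

    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("abc");
        System.out.println(forward(it));  // prints abc
        System.out.println(backward(it)); // prints cba
        System.out.println(at(it, 1));    // prints b
    }
}
```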

Rule-Based BreakIterator
The BreakIterator class finds character, word, line, and sentence boundaries, which may vary depending on the locale. The JDK 1.1 BreakIterator implementation uses a state machine, which makes it very fast. However, it does not allow the behavior to vary depending on the locale. If the built-in classes don't support behavior the clients want, they must create a completely new BreakIterator subclass of their own—they can't leverage the JDK code at all.
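
For readers who have not used the class, here is a minimal sketch of word-boundary iteration with the standard BreakIterator. The filter that keeps only segments starting with a letter or digit is a simplification chosen for this example, not part of the API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakDemo {
    // Returns the segments between word boundaries that start with a
    // letter or digit, skipping punctuation and whitespace segments.
    static List<String> words(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US)); // prints [Hello, world]
    }
}
```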

Therefore, we undertook an extensive revision of the BreakIterator framework. The new RuleBasedBreakIterator class essentially works the same way the old class did, but it builds the category and state tables from a textual description, which is essentially a string of regular expressions. This description can be loaded from a resource—allowing different breaking rules for different languages—or supplied by the client—allowing runtime customization. This class is provided on the alphaWorks Web site.

Locale-Sensitive Searching
The CollationElementIterator class is intended for use in locale-sensitive text searching. However, it is missing several methods in JDK 1.1 that make it impossible to use with fast string searching algorithms such as Boyer-Moore. The following new methods were added in Java 2 to fix this:

  1. The getOffset() method tells where a collation element was found.
  2. The previous() and setOffset() methods enable backing up and moving around in the text being searched.
  3. The new setText() method allows reuse of a CollationElementIterator. When collating or searching a large number of strings, it is much faster to reuse one CollationElementIterator than to construct a new one each time.
  4. The isIgnorable() method tells whether a collation element is ignorable.
  5. The getMaxExpansion() method returns the maximum length of any expansion sequence producing a given character. A fast search algorithm needs to know the maximum "shift" distance in looking for possible match sites. This is complicated by the fact that in natural language, a match can occur with different numbers of characters. If a search pattern for German text contains "oe", for example, it can match the single character "ö" in the text being searched. With the maximum expansion length, a fast search algorithm can compute the correct lower limit on shift distances.
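
As a rough illustration of what a CollationElementIterator exposes, the sketch below extracts the primary orders of a string and uses them for a naive, case-insensitive "contains" test. It deliberately does not implement Boyer-Moore or use getOffset() and getMaxExpansion(); it assumes that the collator returned for the US English locale is a RuleBasedCollator, and all class and method names are invented for the example.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollationKeysDemo {
    // Assumption: the default US collator is rule-based.
    static RuleBasedCollator usCollator() {
        Collator col = Collator.getInstance(Locale.US);
        if (!(col instanceof RuleBasedCollator)) {
            throw new IllegalStateException("collator is not rule-based");
        }
        return (RuleBasedCollator) col;
    }

    // Primary orders of each collation element, skipping ignorable
    // elements (those whose primary order is zero).
    static List<Integer> primaryOrders(RuleBasedCollator collator, String text) {
        CollationElementIterator it = collator.getCollationElementIterator(text);
        List<Integer> orders = new ArrayList<>();
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                orders.add(primary);
            }
        }
        return orders;
    }

    // Naive scan: does the pattern occur in the text at primary strength?
    static boolean containsPrimary(String text, String pattern) {
        RuleBasedCollator col = usCollator();
        return Collections.indexOfSubList(
                primaryOrders(col, text), primaryOrders(col, pattern)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPrimary("The Quick Brown Fox", "quick")); // prints true
        System.out.println(containsPrimary("The Quick Brown Fox", "quack")); // prints false
    }
}
```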

Unicode Normalization
Unicode is more than just "wide ASCII." One of the principal operations on Unicode is to normalize text, ensuring that you have a unique spelling for a given text. Text normalization includes decomposition and composition forms of characters. Text can be normalized to be a canonical equivalent to the original unnormalized text or to be a compatibility equivalent to the original unnormalized text. For more information, please see Unicode technical report #15 on the Unicode Web site.

One of the Unicode normalization forms is used internally as a part of JDK 1.1, but it is not public. The Normalizer class incorporates this technology and allows either batch or incremental normalization of text. This class is provided on the alphaWorks Web site.
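
To illustrate canonical versus compatibility forms, the following sketch uses java.text.Normalizer, a class that entered the standard library in a later JDK release than those discussed here. It is an assumption that it behaves comparably to the alphaWorks Normalizer class; the wrapper names are invented for the example.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Canonical composition (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility composition (NFKC).
    static String composeCompat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e" + combining acute accent composes to the single character U+00E9.
        System.out.println(compose("e\u0301").equals("\u00e9")); // prints true
        // U+00E9 decomposes back into two characters.
        System.out.println(decompose("\u00e9").length());        // prints 2
        // The "fi" ligature U+FB01 is only a compatibility equivalent of "fi".
        System.out.println(composeCompat("\ufb01"));             // prints fi
        System.out.println(compose("\ufb01").equals("fi"));      // prints false
    }
}
```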

FORMATTING AND PARSING
JDK 1.1 provides a rich set of functionality for formatting values into strings and parsing strings into values in a locale-sensitive way. These include numbers, dates, times, and messages.

Number formatting supports spreadsheet-style patterns. For example, a format such as "#,##0.00#" will produce output like "1,234.567" or "5.00"; the pattern specifies that you have at least 2 decimal digits, but no more than 3. You can also reset the decimals and other characteristics of the pattern programmatically. Number formatting also provides powerful pattern parsing support for proportional font decimal alignment.

Date/Time formatting supports similar features, and is fully integrated with Calendar. Message formatting allows access to number, date, and time formatting within the context of a localizable string.
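
The pattern above, and number formatting inside a message, can be tried with a short sketch. The symbols and locale are fixed to US English here so that the output does not depend on the host system; the message text and the class and method names are invented for the example.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.MessageFormat;
import java.util.Locale;

class PatternFormatDemo {
    // At least 2 and at most 3 fraction digits, with grouping.
    static String formatNumber(double value) {
        DecimalFormat fmt = new DecimalFormat(
                "#,##0.00#", new DecimalFormatSymbols(Locale.US));
        return fmt.format(value);
    }

    // A localizable message with an embedded number format.
    static String fileMessage(int count, String disk) {
        MessageFormat mf = new MessageFormat(
                "{0,number,integer} files on {1}", Locale.US);
        return mf.format(new Object[] { count, disk });
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.567));      // prints 1,234.567
        System.out.println(formatNumber(5));             // prints 5.00
        System.out.println(fileMessage(1234, "disk C")); // prints 1,234 files on disk C
    }
}
```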

Substitutable Currencies
NumberFormat provides the factory method getCurrencyInstance(), which creates an object that can convert numbers to and from strings in the currency format of a given locale. In JDK 1.1, these formats were treated just like any other number formats. They were constructed from strings that were fetched from ResourceBundles. In Java 2, the currency symbol can be specified independently from the rules for decimal places, thousands separator, and so on, and is supplied in the pattern with the international currency symbol ("¤" = "\u00A4").


// In Java 2
DecimalFormat fmt = (DecimalFormat)
    NumberFormat.getCurrencyInstance(Locale.US);
DecimalFormatSymbols us_syms =
    fmt.getDecimalFormatSymbols();
us_syms.setCurrencySymbol("US$ ");
fmt.setDecimalFormatSymbols(us_syms);
String result = fmt.format(1234.56);
// result is "US$ 1,234.56"

ISO Currency Codes
Additionally, we added an API to retrieve the 3-letter international currency codes defined in ISO 4217. These are necessary in an application that deals with many different currencies because the regular, one-character currency symbols are often shared by many different currencies. For example, both the US and Canada use "$" in their default currency format. An application dealing with both currencies will probably want to use "USD" and "CAD" instead. In Java 2, this is now possible, using a sequence of two international currency symbols ("¤¤" = "\u00A4\u00A4") in the pattern.

// In Java 2:
fmt = new DecimalFormat(
   "\u00a4\u00a4 #,##0.00;" +
   "(\u00a4\u00a4 #,##0.00)");
result = fmt.format(1234.56);
// result is "USD 1,234.56" (with a US default locale).

Parse Error Information
The abstract method parseObject() in java.text.Format is used to parse strings and turn them into objects. In JDK 1.1, the program can find out how far the parse got so that it can continue from that point on. However, it cannot find out how far it got if there was an error. In Java 2 a new field, errorOffset, now contains that information. If an error occurs during parsing, the formatters set this value before returning an error or throwing an exception.

In the following example, a text field is parsed for a number. If an error is found, the text beyond the error is highlighted, a message is displayed, and a beep is played.


// In Java 2:
String contents = textField.getText();
try {
      NumberFormat fmt = NumberFormat.
       getInstance();
      Number value = fmt.parse(contents);
} catch (ParseException foo) {
      errorLabel.getToolkit().beep();
      errorLabel.setText(
      myResourceBundle.getString("invalid number"));
      textField.select(foo.getErrorOffset(), 
      contents.length());
}
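
Note that parse(String) succeeds if any leading part of the text is a number, so trailing junk such as "12a4" does not raise an exception. One way to catch both cases, sketched here under the assumption that the whole field must be numeric, is to parse with a ParsePosition and inspect its index and error index. The class and method names are invented for the example.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class StrictParseDemo {
    // Returns -1 if the whole string is a number; otherwise the index
    // of the first character that could not be parsed.
    static int firstBadIndex(String text) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number value = fmt.parse(text, pos);
        if (value == null) {
            return pos.getErrorIndex(); // nothing could be parsed
        }
        if (pos.getIndex() < text.length()) {
            return pos.getIndex();      // parsed a prefix, junk follows
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstBadIndex("1,234.5")); // prints -1
        System.out.println(firstBadIndex("12a4"));    // prints 2
        System.out.println(firstBadIndex("abc"));     // prints 0
    }
}
```

The returned index can be passed to textField.select() in the same way as getErrorOffset() above.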

Number Format Enhancements
On the alphaWorks Web site we provide a class that correctly supports exponentials in number formatting and parsing. The new number formatter supports formats such as "1.2345E3", as well as engineering exponents, where the exponent is always a power of 3. It also supports formatting and parsing BigInteger or BigDecimal values without loss of precision, and "nickel-rounding": the ability to round to multiples of a specified number, such as $0.05. (This is important for some countries whose smallest coin is 5 units instead of 1. The implementation is not restricted to nickels, however, and can be used to round to multiples of any given value.)

Here is an example using the class from the alphaWorks Web site.


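// NumberFormat is abstract, so a concrete pattern-based
// class is needed; DecimalFormat is assumed here.
DecimalFormat fmt = new DecimalFormat("0.0000E00");
String result = fmt.format(123456789);
// result is "1.2346E08"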
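
The arithmetic behind "nickel-rounding" can be sketched with BigDecimal. This is not the alphaWorks formatter, only the rounding step on its own; the method name and the choice of HALF_EVEN rounding are assumptions made for the example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class NickelRoundingDemo {
    // Rounds value to the nearest multiple of increment by counting
    // whole increments and multiplying back.
    static BigDecimal roundToMultiple(BigDecimal value, BigDecimal increment) {
        BigDecimal steps = value.divide(increment, 0, RoundingMode.HALF_EVEN);
        return steps.multiply(increment);
    }

    public static void main(String[] args) {
        BigDecimal nickel = new BigDecimal("0.05");
        System.out.println(roundToMultiple(new BigDecimal("1.23"), nickel)); // prints 1.25
        System.out.println(roundToMultiple(new BigDecimal("1.22"), nickel)); // prints 1.20
    }
}
```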

Number Formats in Words
The ability to take a numeric value (e.g., 12,345) and translate it into words (e.g., "twelve thousand three hundred forty-five") is often needed in business applications, for example, to write out the amount on a check. Number spellout in English is a relatively easy thing to do; good algorithms for this are well-known and widely used. A number-spellout engine that can be customized for any language is another thing altogether.

It's not enough to simply take the algorithm for English and read the literal string values from a resource file. English separates all component parts of a number with spaces; Italian and German do not. Some languages, such as Spanish and Italian, drop the word for "one" from the phrases "one hundred" or "one thousand." There are many other examples that show translating a number into words is not a trivial task.
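
To show how much language-specific knowledge even the "easy" English case contains, here is a minimal hard-coded spellout sketch for 0 through 999,999. It is not RuleBasedNumberFormat; the class name, the range limit, and the hyphenation of compound tens are choices made for this example. The spaces, the explicit "one" before "hundred", and the hyphen are exactly the details that would have to change for other languages.

```java
class EnglishSpelloutDemo {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"
    };
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"
    };

    // Spells out n in English; supports 0..999,999 only.
    static String spell(int n) {
        if (n < 0 || n > 999_999) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        if (n < 20) {
            return ONES[n];
        }
        if (n < 100) {
            return TENS[n / 10] + (n % 10 == 0 ? "" : "-" + ONES[n % 10]);
        }
        if (n < 1000) {
            return ONES[n / 100] + " hundred"
                    + (n % 100 == 0 ? "" : " " + spell(n % 100));
        }
        return spell(n / 1000) + " thousand"
                + (n % 1000 == 0 ? "" : " " + spell(n % 1000));
    }

    public static void main(String[] args) {
        System.out.println(spell(12345)); // prints twelve thousand three hundred forty-five
        System.out.println(spell(100));   // prints one hundred
    }
}
```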

To solve these issues, we developed a class called RuleBasedNumberFormat. It's a general, rule-based mechanism for converting numbers to spelled-out strings. This class is available on the alphaWorks Web site, along with information on the usage and rule syntax.


RuleBasedNumberFormat fmt = 
    new RuleBasedNumberFormat(rules);

String result = fmt.format(1234);
// result is "one thousand two hundred thirty four"

CONCLUSION
The internationalization services in Java 1.1 provide a wide range of functionality and are easily extended to add additional features and to support additional countries. We have had an opportunity to discuss some of the design decisions taken in developing these classes and some of the enhancements that are being included in future releases. A more detailed discussion is available at the IBM international classes Web site, and includes more about the JDK i18n classes and possible future internationalization improvements that IBM is discussing with Sun. These future possibilities include the following:
  • A character conversion API, so that you can actually find out which character code converters are supported on your system, and get understandable display names for them, e.g., instead of "Cp1089", in English displaying Arabic (ISO 8859-6) and in French displaying Arabe (OSI 8859-6)
  • Customized locales, so that you could have fine-grained control over default behavior, e.g.,
    
    NumberFormat.setInstance(new Locale("en", "US", 
    "Acme Widgets, Inc."),
                     new DecimalFormat("#,##0.0#"));
    
    
  • Convenience methods for common cases, e.g.,
    
    String value = Number.formatCurrency(amount);
    
    

We are working on many future enhancements, some of which are available right now on IBM's alphaWorks Web site. We encourage those interested to download versions from there; any comments on the design and implementation are welcome!

ACKNOWLEDGMENTS
Our thanks to Kathleen Wilson, Rich Gillam, and Laura Werner for their extensive review and suggestions, and for organization of the document. Many other people at IBM and Sun contributed to the Java internationalization efforts.

Trademarks
Unicode is a trademark of Unicode, Inc. Copyright © 1998, IBM Corp. All rights reserved.

URLs
alphaWorks
www.alphaWorks.ibm.com/

Unicode internationalization services
www.ibm.com/java/tools/international-classes/

ISO codes
www.unicode.org/unicode/onlinedat/

Unicode technical report #15
www.unicode.org/unicode/reports/tr15/