Perspective on XML: XML circles the globe

Developers are an ambitious bunch. The early rallying cry for Linux began with the words “world domination.” So how do software developers, whether community groups or corporate coders, go about such a conquest? Perhaps one can take a lesson from the myth of the Tower of Babel, a story so widespread across cultures that some consider it either literally true or an apt symbol of a basic pillar of the human psyche.

The ancient lessons drawn from this story concerned the vanity and inherent limitations of human nature. Lessons drawn in the modern era might be quite different. Perhaps we shall never actually colonize space until the current trend of disappearing languages leads to the supremacy of one global super-language (a creepy and terrifying thought). If our diversity of tongue and culture will never allow us to challenge divinity, the more modest goal of world domination is probably impossible without managing the problem of Babel. When developing software for a global audience, managing this problem is called internationalization (“i18n” for short, a clever abbreviation that keeps the first and last letters and counts the 18 in between).

I never thought much about i18n until I worked as a consultant at IBM, which is legendary for its attention to global deployment of technology. I’ve always been one to pick up languages and adapt to cultures quickly, so I was surprised at how extensively i18n considerations reshaped my most basic development habits. i18n is not just a matter of ditching knee-jerk biases (although many in the U.S. could start with that basic step); it requires support from tools and frameworks and special attention in all development methodologies.

As with all important development problems, the key to success or failure lies in the data. i18n is a non-starter without data that supports (almost) all writing systems, text translations and structured fields rich enough to accommodate differing local conventions. That’s one reason why the success of XML is so encouraging. XML was born with i18n in its genes, and achieving the data characteristics I mentioned for i18n is often a matter of good XML design. The basic guidelines are simple enough to state, although they take some attention to get right.

Don’t undermine XML’s character model. One of the most misunderstood aspects of XML is its basis in Unicode. I can understand that because Unicode is a tough subject for people who are used to thinking of strings from a European language point of view (and even for many used to more complex character models). But don’t even think of using XML without understanding its concept of text. And if XML’s text model is too much for you, forget entirely about developing any software for a global community or market because any other i18n mechanism will involve at least as many complexities.

The most common problem I come across with regard to XML and Unicode is when applications extract data from XML text into data structures that can’t handle the complexity of Unicode (e.g., simple strings in C). All seems well in testing because the testers only use ASCII or European language test data. Then the software is deployed and a Chinese user enters input that causes a failure. Even more pernicious is the occasional error propagated from a source that might be considered authoritative by developers. I recently examined a fairly well-known XML tool whose default configuration does not allow characters from non-European languages. As such, this is not even an XML tool at all -- so blatant is its non-compliance -- but it is advertised as an XML tool and unwary users may not appreciate its limitations.
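A minimal Python sketch of the mismatch described above, using only the standard library. The element name and sample text are hypothetical. It shows that the number of characters XML sees differs from the number of bytes a byte-oriented buffer would have to hold, and that forcing the text into an ASCII-only structure fails.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element name and content are illustrative only.
# The two escaped characters are the Chinese greeting "ni hao".
doc = "<greeting>\u4f60\u597d, world</greeting>".encode("utf-8")

# ElementTree hands back a Python str, i.e. a sequence of Unicode code points.
text = ET.fromstring(doc).text

code_points = len(text)                 # characters as the XML text model counts them
utf8_bytes = len(text.encode("utf-8"))  # bytes a byte-oriented buffer must hold

# Forcing the text into an ASCII-only representation fails outright.
try:
    text.encode("ascii")
    ascii_safe = True
except UnicodeEncodeError:
    ascii_safe = False

print(code_points, utf8_bytes, ascii_safe)
```

Test data made only of ASCII would report identical counts and never reach the failure branch, which is why this class of bug tends to survive testing.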

The XML-RPC specification is a similar case. XML-RPC is a fairly popular protocol for exchanging XML data over HTTP, but its specification makes the ridiculous stipulation that all strings sent by way of XML-RPC must be ASCII. Luckily, most XML-RPC implementations ignore this limitation and allow the full range of Unicode characters to be sent, but such scorn of non-English users in an XML-based specification causes a great deal of confusion.
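As a rough illustration, Python's standard xmlrpc.client module appears to be one of the implementations that passes non-ASCII strings through. The sketch below only marshals and unmarshals a call locally, with no server involved. The method name "echo" is hypothetical.

```python
from xmlrpc.client import dumps, loads

# "shalom" in Hebrew, written with escapes so the source file stays ASCII.
greeting = "\u05e9\u05dc\u05d5\u05dd"

# Marshal a call to a hypothetical "echo" method, then unmarshal it again.
payload = dumps((greeting,), methodname="echo")
params, method = loads(payload)

# True when the non-ASCII string survived the round trip unchanged.
round_trip_ok = params[0] == greeting
print(method, round_trip_ok)
```

A round trip like this makes a cheap regression test for any XML-RPC library you depend on.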

Another recommendation, and a much harder one to follow, is to ensure that your software handles translated versions of text. The standard attribute xml:lang identifies the language of an element’s content, so a document can carry multiple instances of the same element, each in a different language. Be sure not to block this usage (e.g., through schema constraints) and be sure your tools respond intelligently so that, say, a Hebrew-speaking user is presented with text that has been translated into Hebrew where available. XPath, the most important little language for XML processing, provides some support for this through its lang() function, which tests the language in effect for a node.
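Here is one way the selection logic might look in Python, assuming a hypothetical product document with one title element per language. The standard ElementTree module does not implement XPath's lang() function, so this sketch compares the xml:lang attribute directly. It matches exact language tags only and ignores inherited xml:lang values and subtags such as "en-US", which a full XPath processor would handle.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: element names and content are illustrative only.
doc = """<product>
  <title xml:lang="en">Welcome</title>
  <title xml:lang="he">\u05d1\u05e8\u05d5\u05da \u05d4\u05d1\u05d0</title>
  <title xml:lang="fr">Bienvenue</title>
</product>"""

def pick_title(root, preferred, fallback="en"):
    """Return the title in the preferred language, else in the fallback language."""
    titles = {t.get(XML_LANG): t.text for t in root.findall("title")}
    return titles.get(preferred, titles.get(fallback))

root = ET.fromstring(doc)
hebrew = pick_title(root, "he")    # a translation exists, so it is used
japanese = pick_title(root, "ja")  # no translation, so English is returned
print(hebrew, japanese)
```

Falling back to a default language rather than failing is a design choice. Some applications may prefer to signal the missing translation instead.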

One last thing I recommend is to design structures and conventions in your XML to accommodate varying cultural norms. Internationally respected specifications are a good source for such structures and conventions.

As an example, when modeling people’s names, accommodate the fact that some cultures prefer to display or sort by given name and others by family name, and that additional names and titles are essential in some cultures. DocBook is a good specification to emulate in this regard. Other examples include dates (use the ISO 8601 standard rather than, say, DD/MM/YYYY), numerals (be aware that different countries use commas and periods in different ways within numerals), currency, addresses and telephone numbers.
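Two of these conventions can be sketched briefly in Python. The PersonName class and its field names are hypothetical and are not taken from DocBook. The point is only that keeping name parts separate lets the display order be chosen later, and that an ISO 8601 date avoids the day/month ambiguity of slash-separated formats.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PersonName:
    """Hypothetical name record that keeps the parts separate."""
    given: str
    family: str
    honorific: str = ""

    def display(self, family_first=False):
        # The ordering is decided at display time, not baked into the data.
        parts = [self.family, self.given] if family_first else [self.given, self.family]
        if self.honorific:
            parts.insert(0, self.honorific)
        return " ".join(parts)

name = PersonName(given="Mei", family="Tanaka")
given_first = name.display()
family_first = name.display(family_first=True)

# "03/04/2005" could mean 3 April or 4 March depending on the reader.
# The ISO 8601 form is unambiguous: year, then month, then day.
stored = date(2005, 4, 3).isoformat()
parsed = date.fromisoformat(stored)
print(given_first, family_first, stored, parsed.month)
```

Sorting raises the same question as display: a record that stores the family name separately can be sorted by it without guessing where a full name splits.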

There is a great deal of work to i18n, and taking proper advantage of the facilities in XML is hardly enough on its own, but it is a first step toward managing the Babel problem. I’ll leave to the shamans the question of whether this will actually lead to world domination of your software products.

About the Author

Uche Ogbuji is a consultant and co-founder at Fourthought Inc. in Boulder, Colo. He may be contacted at [email protected].