
Keep your XML clean

I work as a consultant specializing in the intersection of XML and other application development technologies. In that work I have been disappointed overall with the design quality of XML documents in the enterprise. Usually I find at least a minimum of function, in that documents are well formed; however, I've learned not to take even well-formedness for granted. The real problem is that I find very little suitability of form in the XML formats I see. Some will immediately chorus: "But why is form important if one has function?"

Modern architecture developed upon the principle that in designing buildings ("machines for living" to use the famous phrase by architect Le Corbusier), form follows function. In other words, the intended uses of the space by its eventual inhabitants should determine the design of the outside as well as the inside of the structure. This does not mean that no aesthetic principles of form should apply, but rather that the form should derive naturally from the function. The resulting form is not just a nicety, but a fundamental part of the usefulness and value of the building.

But the prevailing attitude toward XML design I've seen is: "Establish the function and never mind the form." You can string almost any jumble of tags and content together to create a functional format for some database dump or technical article, but if you don't pay attention to form, you will end up with data that is harder to process by machines as well as people, and is thus more expensive to maintain.

There are many aspects to XML design. How do you develop your vocabulary? How do you decide what to leave in and what to omit from the model? How do you structure the content model of the main elements? How do you choose between elements and attributes? Developers often seem to think of XML as a technology not serious enough to require careful design, but I have seen projects pay a steep price for such nonchalance. At a minimum, you should use the same techniques of analysis and design as you would when developing application code. A full discussion of those techniques would be far more than I could fit into this space. Instead, I shall comment on two areas where I have seen some of the worst abuses: readability and consistency.

Readability: In some cases, the creators of XML formats do not expect humans to ever read those formats. Several times I have written or spoken about the ways in which I use Web Services Description Language (WSDL), and invariably someone expresses puzzlement at the idea that anyone would ever actually read a WSDL file directly or edit it in a text or XML editor.

"WSDL is just for the Web services toolkits," I've been told several times.

In reality, even if your preference is not, like mine, to deal with all sorts of XML in a plain old text editor, at some point you'll find you have little choice. Toolkits break down and if you have not made the XML readable, you may regret it while developing or debugging code to process the XML, or while communicating the format to other developers. Never assume that it is not important for XML to be readable.

Always use explicit, unambiguous element and attribute names. Consider separating words with hyphens or underscores rather than using "hump case" (often called camel case): for example, "first-name" rather than "FirstName." Group elements logically so that the structure stands out when the document is "pretty printed," rather than leaving endless runs of sibling elements.
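To make the naming advice concrete, here is a minimal sketch in Python that lists element and attribute names not written in lowercase, hyphen-separated form. The two sample documents, the regular expression, and the function name are my own illustrations rather than part of any standard, and the check assumes documents without namespaces.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: hump-case names in one endless run of siblings.
FLAT = (
    "<Customer><FirstName>Ada</FirstName><LastName>Lovelace</LastName>"
    "<Street>12 Main St</Street><City>London</City></Customer>"
)

# Hypothetical sample: hyphenated names, grouped so structure stands out.
GROUPED = (
    "<customer>"
    "<name><first-name>Ada</first-name><last-name>Lovelace</last-name></name>"
    "<address><street>12 Main St</street><city>London</city></address>"
    "</customer>"
)

# Assumed house convention: lowercase words joined by single hyphens.
HYPHENATED = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def non_hyphenated_names(xml_text):
    """Return the sorted element and attribute names that break the convention."""
    root = ET.fromstring(xml_text)
    offenders = set()
    for elem in root.iter():
        for name in [elem.tag, *elem.attrib]:
            if not HYPHENATED.match(name):
                offenders.add(name)
    return sorted(offenders)


if __name__ == "__main__":
    print(non_hyphenated_names(FLAT))
    print(non_hyphenated_names(GROUPED))
```

A check like this only catches the mechanical part of readability. Whether a name is explicit enough, or a grouping logical enough, still takes a human reader.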

Consistency: When XML is developed by multiple contributors without shared formal standards, or when it is generated by toolkits such as data bindings, where varying conditions at run time are manifested in the XML, consistency often suffers. Similar constructs might use different conventions at different points in the XML format. One element might be called "business-name" and a sibling "biz-tax-id." Each instance of such inconsistency is but a minor blemish, but in my experience, if developers do not pay particular attention to consistency, this sort of blemish proliferates until the data becomes very confusing to follow.
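The "business-name" versus "biz-tax-id" blemish is the kind of thing a small script can catch before it proliferates. The following is only a sketch: the abbreviation table, the sample document, and the function name are assumptions made for illustration, and it presumes hyphen-separated element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of abbreviations and the full words they stand for.
ABBREVIATIONS = {"biz": "business", "addr": "address", "num": "number"}

# Hypothetical sample mixing "business" and "biz" in sibling element names.
SAMPLE = (
    "<company>"
    "<business-name>Acme</business-name>"
    "<biz-tax-id>12-345</biz-tax-id>"
    "</company>"
)


def mixed_vocabulary(xml_text):
    """Return (abbreviation, full word) pairs that both occur in element names."""
    root = ET.fromstring(xml_text)
    tokens = set()
    for elem in root.iter():
        # Split each hyphenated element name into its component words.
        tokens.update(elem.tag.split("-"))
    return sorted(
        (short, full)
        for short, full in ABBREVIATIONS.items()
        if short in tokens and full in tokens
    )


if __name__ == "__main__":
    print(mixed_vocabulary(SAMPLE))
```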

Choosing between elements and attributes is often not trivial, but once you've made your choice, stick to the same convention in similar situations. I often see XML where "ID" is an attribute on one element, and then a child element on another. This makes it harder to write generic and reusable code for processing the XML.
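As a rough illustration of the "ID" problem, this Python sketch reports which elements carry an identifier as an attribute and which carry it as a child element. The sample document, the key name "id", and the function name are hypothetical. A format that shows up in both lists forces every piece of processing code to handle two cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "id" is an attribute on <order> and <item>,
# but a child element on <customer>.
SAMPLE = (
    '<order id="o-17">'
    "<customer><id>c-42</id><name>Acme</name></customer>"
    '<item id="i-3"/>'
    "</order>"
)


def id_conventions(xml_text, key="id"):
    """Map each convention to the element names that use it, in document order."""
    root = ET.fromstring(xml_text)
    usage = {"attribute": [], "child-element": []}
    for elem in root.iter():
        if key in elem.attrib:
            usage["attribute"].append(elem.tag)
        # find() looks only at direct children; compare against None
        # because an element with no children of its own is falsy.
        if elem.find(key) is not None:
            usage["child-element"].append(elem.tag)
    return usage


if __name__ == "__main__":
    print(id_conventions(SAMPLE))
```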

Elegance of form in XML is not a luxury. On the contrary, it helps to save money. Architects have learned how to study the function of a building closely, and then work hard on a form that enhances the function in intangible ways that separate great architecture from poor. Very similar intangibles mark the difference between an XML design that lasts well, can be reused and is inexpensive to process, and one that is difficult to maintain. The next time you develop an XML design, print out a sample document and hold it up to a colleague for a quick read. If he or she develops a headache from trying to figure out what it means, consider this an omen of the pain that maintaining the format will cause in the future.

About the Author

Uche Ogbuji is a consultant and co-founder at Fourthought Inc. in Boulder, Colo. He may be contacted at [email protected].