
Managing XML libraries

Libraries of text have always been the framework for knowledge and learning in human societies. Important developments throughout history have fueled great leaps forward in civilization. Clay tablets of the Sumerians, Babylonians and Assyrians in Ur (in modern-day Iraq), and papyrus scrolls in Thebes in New Kingdom Egypt took the gathering of texts to the greatest extent possible given the fragile material used for books in the period. Library establishment mixed thoroughly with politics in Egyptian Alexandria (under heavy Greek influence) and the great library of parchment was born. But a key contribution came from Rome, and it lay neither in content nor in format.

It wasn't Rome that produced the moveable type system for mass publication, but it was Rome that introduced to text management the meticulous systems of categorization and cataloging. The Romans applied their legendary thirst for order to making a highly organized body of knowledge out of the innumerable texts amassed from the many civilizations that preceded them. This lesson is not lost even as we approach the third millennium after the apex of Roman culture.

It's a stretch to say that XML is a leap forward in managing text on a par with, say, moveable type, but the adoption of a lingua franca for markup is a significant milestone. All it needs to fulfill the promise of making the Web the next leap in the evolution of libraries is an indexing system as effective for XML as the Roman catalogs were for parchment scrolls and codices (precursors to modern-day books). In February, the World Wide Web Consortium (W3C) put its stamp on a system whose ambition is to provide such indexing capability for the Web. As the Web continues to standardize on XML for text formatting, these companion languages, enabling what the W3C calls the Semantic Web, aim for a virtual global library, open for intelligent processing.

Why do we as application developers care about this? No one pays us to build or maintain libraries, and it's not our problem if the Web is currently a bit of a mess. One reason is that we often find ourselves implementing XML solutions because the conventional wisdom is that XML will solve the information interchange problem. We know better. We know the difficulties of agreeing on XML formats and of sticking to these agreements after the fact. We know that as a body of documents grows large, it becomes difficult to find what we need through ad hoc queries. XML allows one to standardize the syntax of text, but provides no means of formalizing the meaning of the text.

"Formalizing the meaning of text" is an extremely ambitious goal, but it is exactly what we're trying to do with the lines upon lines of code that we plug into our XML parser libraries. If we could express in a specialized language some key facets of the meaning in our XML corpus, it would go a long way toward reducing and sanitizing those lines of code. If we had a means of expressing common metadata that frames the XML, such as subject matter, workflow details and business rules, and if this means were extensible and naturally suited to networks, then we could do much to rationalize XML in our business applications.

Resource Description Framework (RDF) provides just such facilities and is the core of the Semantic Web technology recently completed by the W3C. It allows one to define a sort of knowledge map for Web resources. This map is not limited to XML but, in my experience, RDF and XML are excellent companions; together they can revolutionize the management not only of data, but also of the contexts that turn raw data into information.

RDF expresses simple but formalized assertions about Web resources such as XML documents. These assertions can be added, aggregated, queried and manipulated using fairly mechanical means, and you can gain much going this far, but the W3C has ambitions of going further. Enter the Web Ontology Language (OWL), which enhances the basic assertions of RDF with tools for expressing the meanings and relationships of the terms that underlie the meaning in data. Such a formalization of the vocabulary and concepts that are relevant in a particular context is called an ontology. OWL expresses ontologies in such a way that machines can do a respectable amount of the processing without relying on human interpretation.
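To make the idea of mechanically processed assertions a little more concrete, here is a minimal sketch in Python. It models RDF-style statements as plain (subject, predicate, object) tuples; it is not an RDF library, and it ignores most of the real data model. The document URIs, the `ex:` property name and the `match` helper are all hypothetical, chosen purely for illustration.

```python
# Illustrative only: RDF-style assertions modeled as (subject, predicate, object)
# tuples. Every URI and property name below is made up for this sketch.

def match(graph, subject=None, predicate=None, obj=None):
    """Return the statements matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )

# Assertions one group maintains about its XML documents.
editorial = {
    ("http://example.org/docs/po-1001.xml", "dc:subject", "purchasing"),
    ("http://example.org/docs/po-1001.xml", "ex:workflowState", "approved"),
}

# Assertions from another source about an overlapping set of documents.
archive = {
    ("http://example.org/docs/po-1001.xml", "dc:creator", "J. Smith"),
    ("http://example.org/docs/memo-17.xml", "dc:subject", "purchasing"),
}

# Aggregating two sets of assertions is a plain set union.
library = editorial | archive

# Query: which documents are about purchasing?
purchasing_docs = [
    s for (s, _, _) in match(library, predicate="dc:subject", obj="purchasing")
]
```

The point of the sketch is only that statements of this shape can be merged and queried without any knowledge of the XML formats they describe; a real system would use an RDF toolkit rather than bare tuples.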

This, of course, is one of the Holy Grails of computing, and it is important not to get too caught up in the futuristic possibilities of such a framework. For now it's enough that RDF and OWL provide a more maintainable and easily shared means of managing systems of XML documents.

Never mind intelligent automata that melt through petabytes of text effortlessly and complete homework or make business decisions for us. The goal within immediate reach is more along the lines of distributing the sort of context that enables search engine agents to differentiate Java the island from Java the language.
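As a rough illustration of that kind of context, the following Python sketch attaches type assertions and a tiny class hierarchy to two pages that share the keyword "Java", then filters a keyword search by the class of thing the page is about. It is a toy under stated assumptions, not how any search engine or OWL reasoner works: the page URIs, the `ex:` names and the `classes_of` and `search` helpers are hypothetical, and only the `rdf:type` and `rdfs:subClassOf` labels echo real RDF vocabulary terms.

```python
# Illustrative only: type assertions plus a small class hierarchy, stored as
# (subject, predicate, object) tuples. All ex: names and URIs are made up.
TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

statements = {
    ("http://example.org/pages/java-travel.html", "ex:keyword", "Java"),
    ("http://example.org/pages/java-travel.html", "ex:about", "ex:JavaIsland"),
    ("http://example.org/pages/java-tutorial.xml", "ex:keyword", "Java"),
    ("http://example.org/pages/java-tutorial.xml", "ex:about", "ex:JavaLanguage"),
    ("ex:JavaIsland", TYPE, "ex:Island"),
    ("ex:JavaLanguage", TYPE, "ex:ProgrammingLanguage"),
    ("ex:Island", SUBCLASS_OF, "ex:Place"),
    ("ex:ProgrammingLanguage", SUBCLASS_OF, "ex:Technology"),
}

def classes_of(thing):
    """All classes a thing belongs to, following subclass links upward."""
    found = set()
    pending = [o for (s, p, o) in statements if s == thing and p == TYPE]
    while pending:
        cls = pending.pop()
        if cls not in found:
            found.add(cls)
            pending.extend(
                o for (s, p, o) in statements if s == cls and p == SUBCLASS_OF
            )
    return found

def search(keyword, wanted_class):
    """Pages tagged with keyword whose topic belongs to wanted_class."""
    pages = {s for (s, p, o) in statements if p == "ex:keyword" and o == keyword}
    return sorted(
        page
        for page in pages
        for (s, p, topic) in statements
        if s == page and p == "ex:about" and wanted_class in classes_of(topic)
    )
```

With these assertions in place, `search("Java", "ex:Technology")` selects only the tutorial page and `search("Java", "ex:Place")` only the travel page, even though a plain keyword match cannot tell them apart.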

Libraries have always been the center of learning, even if librarians haven't recently had as much prestige as in ancient Rome. Just as the frenzy of the Renaissance centered around well-cataloged collections of text, a renaissance in the Web information age might be powered by rich text through XML, as well as rich contextualization and indexing through Semantic Web technologies.

About the Author

Uche Ogbuji is a consultant and co-founder at Fourthought Inc. in Boulder, Colo. He may be contacted at [email protected].