By Johanna Ambrosio
In some ways, querying XML documents is like going “back to the future” for anyone who remembers the days of hierarchical database management systems (DBMSs).
Before the commercialization of their relational brethren in the 1980s, hierarchical DBMSs were the mainstay of the Global 2000’s major applications. They were lightning fast, and their biggest strength was processing many transactions simultaneously. But they lacked the ability to relate different characteristics to each other. For an auto-parts supplier to find out the names of all customers who live in Chicago and spend more than $3,000 per year on exhaust systems, someone had to do a lot of custom programming work. Thus, the relational DBMS was born.
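For comparison, the auto-parts question above takes only a few lines in a relational system. What follows is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are invented for illustration, and the sample purchases are assumed to cover a single year.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with hypothetical customers and purchases."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE purchases (customer_id INTEGER, category TEXT, amount REAL);
        INSERT INTO customers VALUES
            (1, 'Ace Garage', 'Chicago'),
            (2, 'Budget Auto', 'Chicago'),
            (3, 'Coast Motors', 'Boston');
        INSERT INTO purchases VALUES
            (1, 'exhaust', 2000), (1, 'exhaust', 1500), (1, 'brakes', 900),
            (2, 'exhaust', 1200),
            (3, 'exhaust', 5000);
        """
    )
    return conn


def chicago_exhaust_customers(conn, min_spend=3000):
    """Names of Chicago customers whose exhaust-system spend exceeds min_spend."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM customers AS c
        JOIN purchases AS p ON p.customer_id = c.id
        WHERE c.city = 'Chicago' AND p.category = 'exhaust'
        GROUP BY c.id, c.name
        HAVING SUM(p.amount) > ?
        ORDER BY c.name
        """,
        (min_spend,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(chicago_exhaust_customers(build_sample_db()))
```

The join relates a customer's city to that customer's purchases, which is the kind of cross-referencing that required custom programming in a hierarchical system.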
Despite those shortcomings, some hierarchical DBMSs continue to plug away 30 or more years after their introduction, thanks in part to their speed and stability. IBM is on Version 8 of IMS, which is still a revenue producer after all this time. In Software AG’s case, its Tamino XML product is separate from its hierarchical mainstay, Adabas, but experience in one realm surely helped in the other.
Why XML is different
Just as hierarchical and relational DBMSs are two distinct though related beasts, XML requires a language different from SQL for efficient processing of queries, updates and responses. The emerging XQuery standard is aimed at helping with this issue. In XML, hierarchy and the order in which separate pieces come into the document are critical; there are no neat rows and columns like those that exist in relational tables. At the same time, there is a “wild flexibility” inherent in XML, in the words of Jeff Jones, director of strategy for data management at IBM, given all the varied types of direct relationships that one finds in XML documents.
Put simply, XML is not “structured” in a way that makes sense to native SQL.
Another difference: In the relational world, a field is either present or it is not, in which case the user gets an error message. But in XML Land, “a field may be in five or 10 different documents, or nowhere at all,” said Ronald Schmelzer, a senior analyst with ZapThink LLC, a consultancy in Waltham, Mass.
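The contrast can be seen in a small sketch using only Python's standard library (sqlite3 and xml.etree.ElementTree). The customers table, the sample documents and the field names "fax" and "email" are all hypothetical. Asking the relational table for a column it does not define raises an error, while asking a set of XML documents for an element that some or all of them lack simply returns whatever is there.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML documents: only the first one carries a <fax> element.
DOCS = [
    "<customer><name>Ace Garage</name><fax>555-0100</fax></customer>",
    "<customer><name>Budget Auto</name></customer>",
]


def relational_lookup(column):
    """Select one column from a fixed-schema table; unknown columns are errors."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES ('Ace Garage', 'Chicago')")
    try:
        # The column name is interpolated for illustration only; do not build
        # SQL from untrusted input this way.
        return [row[0] for row in conn.execute(f"SELECT {column} FROM customers")]
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


def xml_lookup(tag):
    """Collect the text of every direct child element named tag, if any exist."""
    values = []
    for text in DOCS:
        root = ET.fromstring(text)
        values.extend(element.text for element in root.findall(tag))
    return values


if __name__ == "__main__":
    print(relational_lookup("fax"))  # an error string: the schema has no such column
    print(xml_lookup("fax"))         # found in one document only
    print(xml_lookup("email"))       # found nowhere, yet not an error
```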
Despite XML’s inherent messiness within some semblance of order, “there’s a growing need to query XML documents as if they were a relational database,” Schmelzer said. An ever-increasing amount of corporate information is residing in XML these days, and there is a need to be able to retrieve it easily, efficiently and in a way that mimics the already-familiar constructs of relational DBMSs.
To help do that, there is a growing cadre of support for a still-developing standard called XQuery. Coming out of the World Wide Web Consortium (W3C), XQuery has already been adopted by such heavy hitters as Microsoft, Oracle and IBM, which are all promising to use XQuery in part to help their relational DBMSs better process XML. (See “What are the vendors up to?”)
The soon-to-be standard has created some unusual partners: Microsoft and IBM, for example, have teamed up to publish some XQuery test suites. Oracle and IBM have submitted an XQuery API for Java, called XQJ, that is wending its way through the Java Community Process, and to which Sun, Sybase and others have lent their support. And SourceForge has several open-source versions of XQuery, including one that is “lite.”
What XQuery brings
The beauty of XQuery is that it “provides an easy and powerful way to query across a set of XML documents,” explained Jason Hunter, an independent consultant in San Jose, Calif. Hunter is also the keeper of the http://x-query.com Web site, which is intended to help programmers understand more about the technology and how to best use it.
“But the really big promise comes from the fact that XQuery can access not just XML, but anything that can look like XML, including relational databases,” he added. “Everyone’s catching on to this as an integration tool.”
ZapThink’s Schmelzer agrees that XQuery provides much flexibility in how one can use XML-based information. “XQuery is applicable to XML documents whether or not they’re in a database, so you can use it to transform XML from one document format to another” — PDF, say, or HTML.
This ability to transform search results into entirely new XML documents makes XQuery a step ahead of its predecessor language, XPath, said Jeroen van Rotterdam, CEO/CTO at X-Hive Corp. The Netherlands-based concern supports XQuery, XPath and other standards within its X-Hive/DB, a native XML database.
“Until XQuery, customers used XPath,” van Rotterdam said. “And while XPath is quite efficient at retrieving data, there were no features for constructing new XML documents.” He sees that as XQuery’s biggest advantage over XPath.
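A rough sketch of that division of labor, using Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath: the path expression only selects existing nodes, and building a new document is left to the host program. The sample data and the element names "chicago" and "client" are invented. The XQuery text is included only as an uncompiled illustration of how selection and construction can sit in one expression; it is not executed here.

```python
import xml.etree.ElementTree as ET

CUSTOMERS_XML = """
<customers>
  <customer><name>Ace Garage</name><city>Chicago</city></customer>
  <customer><name>Coast Motors</name><city>Boston</city></customer>
  <customer><name>Budget Auto</name><city>Chicago</city></customer>
</customers>
"""

# Illustrative only, never run by this script: an XQuery expression that both
# selects the matching customers and constructs the new result document.
XQUERY_EQUIVALENT = """
<chicago>{
  for $c in doc("customers.xml")//customer
  where $c/city = "Chicago"
  return <client>{ $c/name/text() }</client>
}</chicago>
"""


def select_names(root):
    """XPath step: retrieve the <name> elements of customers located in Chicago."""
    return root.findall(".//customer[city='Chicago']/name")


def build_report(names):
    """Host-language step: construct a new XML document from the selected nodes."""
    report = ET.Element("chicago")
    for name in names:
        ET.SubElement(report, "client").text = name.text
    return report


if __name__ == "__main__":
    root = ET.fromstring(CUSTOMERS_XML)
    print(ET.tostring(build_report(select_names(root)), encoding="unicode"))
```

With XPath alone, the second function has to be written in whatever language hosts the query, which is the gap van Rotterdam describes.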
That said, XPath expressions are used throughout XQuery, so anyone who has experience with the older language will find much that is familiar, experts say.
“Programmers who know XPath have a fast path to learning XQuery,” said Mary Fernandez, who serves as chair of the W3C’s XPath 2.0 task force and is on the W3C’s XQuery working group.
Also, anyone familiar with XSLT will find similar feature sets in XQuery, she added, including the ability to construct new XML values and to transform XML values. But “the way that those operations are expressed is substantially different,” she explained. XSLT was originally “expressed as a rendering language, to transform XML to HTML or some other target language.” But XQuery is a functional, modular language and so it “is more closely related to a high-level programming language like C++ and Java,” noted Fernandez.
Fundamentally, XSLT is “not particularly easy to use,” said ZapThink’s Schmelzer. “It’s rules-based and is hard to make work efficiently. So rather than use rules, XQuery lets me just query the XML document and return it in the format that I want.”
Still, the W3C’s Fernandez warns, if “you are entirely new to using XML, then some work is required” to learn XQuery, “just as it is in learning any new programming language.”
SQL knowledge won’t help here
Comparisons to SQL, however, are a whole other matter. XQuery is, by most accounts, much more complex to learn than is SQL — they are just completely different. So do not count on quickly “picking up” XQuery if you are a relational database expert. (If you have experience in hierarchical DBMSs, that will help more.)
The key point to remember here is that XQuery does not originate from any kind of SQL base. (Despite some reports to the contrary, they are not related syntactically.) Yes, they are both declarative languages, but most of the similarity ends there.
That is not to say that there will not be increasing synergy over time. Several members of the W3C’s XQuery working group have their roots in SQL, including IBM’s Don Chamberlin, who was involved with SQL from its beginnings.
Indeed, some of XQuery borrows concepts from SQL. “One of the key language structures in XQuery was modeled after SQL’s join feature, to query multiple documents simultaneously,” said Howard Katz, founder of Fatdog Inc., a Roberts Creek, British Columbia, vendor that uses XQuery as the basis of its text-search product.
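The join idea can be sketched in Python with the standard xml.etree.ElementTree module. This is an analogy written in a host language, not XQuery itself, and the two sample documents and their attribute names are hypothetical. The comment shows roughly how the same join might be phrased as an XQuery expression over two documents.

```python
import xml.etree.ElementTree as ET

# Two separate, hypothetical XML documents related by a customer identifier.
CUSTOMERS = ET.fromstring(
    """<customers>
  <customer id="c1"><name>Ace Garage</name></customer>
  <customer id="c2"><name>Budget Auto</name></customer>
</customers>"""
)

ORDERS = ET.fromstring(
    """<orders>
  <order customer="c1"><part>muffler</part></order>
  <order customer="c2"><part>tailpipe</part></order>
  <order customer="c1"><part>catalytic converter</part></order>
</orders>"""
)

# Roughly the same join as an XQuery expression (illustrative, not executed):
#   for $c in doc("customers.xml")//customer,
#       $o in doc("orders.xml")//order
#   where $o/@customer = $c/@id
#   return <line>{ $c/name/text() }: { $o/part/text() }</line>


def join_orders(customers, orders):
    """Pair each order with its customer's name, matching on the id attribute."""
    names = {c.get("id"): c.findtext("name") for c in customers.iter("customer")}
    return [
        (names[o.get("customer")], o.findtext("part"))
        for o in orders.iter("order")
        if o.get("customer") in names
    ]


if __name__ == "__main__":
    for name, part in join_orders(CUSTOMERS, ORDERS):
        print(f"{name}: {part}")
```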
Still, the key point to remember is that “hierarchical data structures are more complex than relational data structures,” and that adds to the learning curve, according to X-Hive’s van Rotterdam. “A path expression used in XQuery is similar to those used in operating systems, but overall it’s harder than in relational systems,” he said.
“It’s easy to get started with simple queries by copying other people’s patterns,” said consultant Hunter, “but it takes a little practice to understand why things work and to invent really dramatic queries.”
Another issue is that there is much more flexibility with XQuery than with SQL, and that can mean a higher learning curve, noted IBM’s Jones.
Igor Polevoy, a senior software developer at ABN Amro, an international bank with U.S. headquarters in Chicago, can attest to the complexity of XQuery. Polevoy’s advanced technology group has built a repository of UML models in the X-Hive native XML database. Why XML? “Because the models are stored in a language called XMI [XML Metadata Interchange], and so it’s native to XML,” he explained.
The good news, he has found, is that XQuery “lets us get the information in any XML format we want.” But like any new technology, it has some problems, too. To begin with, what takes a “few” lines of code in SQL requires “hundreds or thousands” of lines of XQuery code, Polevoy said.
“There’s a real programming language built into XQuery,” he noted, “and so you write functions and do subroutines.” That alone makes it more complex than SQL.
Another problem is that XQuery has no mechanism for a programmatic inclusion. (SQL lacks this, too, but it does not need this facility, Polevoy said.) So within the XQuery code, one has to “repeat over and over” things that in other languages would function as placeholders for, say, a common file system.
Other functions that XQuery 1.0 lacks are update and full-text search, and there are not many tuning facilities in the official standard definition. (For update features, Polevoy is using XUpdate, another standard, from XMLdb.org.) There is no word yet on whether future versions of XQuery will include any of these functions.
“This is the first version of XQuery, and therefore it is impossible to meet every requirement of every possible user of the language,” said the W3C’s Fernandez. She reminds us that other query languages, especially SQL, have evolved over the course of 20 years or more; therefore, comparisons are not necessarily fair.
“I think the working group has been very effective in identifying those requirements that were critical for Version 1.0,” she said. And even without full-text search or update features, XQuery is “very useful,” as its “rapid adoption by major software vendors” proves.
In the interim, though, the vendors will be putting together their own, non-standard versions of missing features — as well as their own extensions. XQuery expert Hunter calls this incompatibility “the most challenging” feature of XQuery adoption for the vendors. “Right now the technology’s not done, and the spec’s been changing. It will take a while for everyone to be 100% conformant,” he said.
The completed standard is not expected until next year at the earliest, at least six months behind earlier estimates.
“XQuery 1.0 required addressing the requirements of two substantially distinct user communities: document processing and database systems,” the W3C’s Fernandez said. “Reaching that point required understanding the requirements, vocabulary and applications of both communities — and that simply takes time.”
When to use XQuery
Indeed, over time more and more vendors will incorporate XQuery into their DBMSs or content management systems. The extent to which customers will need to learn it at all will depend more on how seriously their shops take their XML data.
“It has a lot to do with the state of the economy,” said ZapThink’s Schmelzer. “Most people aren’t investing time in technologies that aren’t widely applicable, like .NET or Java. Having a CIO or lead architect tell someone to learn XQuery at this point is a risk” at least until XML catches on more widely.
Also, there are other paths to look at for storing and retrieving XML documents, most prominent among them SQL/XML.
Most experts agree that if your application is XML-only, the way to go at this point is with XQuery, XUpdate and perhaps a “pure” XML DBMS. (Most of the relational DBMSs are still in the fairly early stages of XML support or extensions.)
But if you are looking at a blend of SQL and XML information, then there is another option. Most of the relational gang is supporting a standard called SQL/XML, which comes out of the same ANSI/ISO group that is the steward for the basic SQL standard.
Vendor support for this is based on the premise that customers are not going to throw away their SQL infrastructures anytime soon. Despite the inherent inefficiencies of using native SQL to query XML, many customers simply prefer doing that via extensions in their existing DBMSs, rather than spending time and money on native XML products that require new skills.
Sandeepan Banerjee, senior director of XML technologies at Oracle, recommends that customers look at their application mix before deciding which type of XML store to use. He said that SQL/XML is not only better for a mix of data types, but it is also superior for enterprise data, for information where speed is key, where information may change frequently, where there is a fairly high volume of data and where “most” of the data is in relational format. “If you’re looking for data center-like support” for heterogeneous information, he said, “SQL/XML is farther along and less of a risk than cobbling together a bunch of closely related standards.”
XQuery, on the other hand, is “envisioned today as a Web-based query mechanism for accessing static Web pages, PDF files, content management systems and other information that does not change frequently,” if at all, Banerjee said. So it is not just a matter of whether the source of the data is XML or relational; it is a matter of how the information is used and how frequently it changes.
Finally, there is one other trend that one might consider when planning an XML strategy. According to ZapThink’s Schmelzer, the standalone XML market is pretty much going away. “The market’s matured enough to the point where XML storage is going from a specialty to a feature,” he said, much like object-oriented (OO) storage did before it. And just like with OO storage, there might not be cases where an optimized XML approach makes more sense than a general relational DBMS that has been extended to work with XML.
Please see the following related stories:
“What are the vendors up to?” by Johanna Ambrosio
“Users judge BEA’s XQuery play” by Jack Vaughan
Briefing Book: XQuery Update