Look at storage issues before you leap into XML

The following article is adapted from Chapter 5 of XML: A Manager's Guide, Second Edition by Kevin Dick. Used with the permission of the author and Addison-Wesley.

 

XML documents are data that can be either at rest or in transit. Therefore, enterprises that want to successfully deploy XML must figure out how to manage XML in both of these states. For XML at rest, developers must first decide on the type of store to use. For XML in transit, they must first decide on the server infrastructure to deploy.

It is not uncommon for projects using XML to stall while figuring out how to address the storage issue. The confusion stems from the fact that there are three vastly different choices: a database management system (DBMS), a content management system (CMS), or a native XML store. The appropriate choice depends on the characteristics of your XML data.

What if you use XML as a data interchange format? In this case, a source application encodes data from its own native format as XML, and a target application decodes the XML data into its own native format. XML is an intermediate data representation. Both the source and target applications already have persistent storage mechanisms, almost certainly DBMSs of one sort or another. There is really no need to store the XML documents persistently themselves, except perhaps for logging purposes.
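
The round trip described above can be sketched in a few lines. This is a minimal illustration, not a prescribed format: the `order` record, its field names, and the element names are all hypothetical, and a Python dict stands in for each application's native format.

```python
import xml.etree.ElementTree as ET

def encode_order(order: dict) -> str:
    """Source side: turn a native record (a dict here) into an XML string."""
    root = ET.Element("order", id=str(order["id"]))
    ET.SubElement(root, "customer").text = order["customer"]
    ET.SubElement(root, "total").text = f'{order["total"]:.2f}'
    return ET.tostring(root, encoding="unicode")

def decode_order(xml_text: str) -> dict:
    """Target side: parse the XML and rebuild a native record."""
    root = ET.fromstring(xml_text)
    return {
        "id": int(root.get("id")),
        "customer": root.findtext("customer"),
        "total": float(root.findtext("total")),
    }

# The XML exists only between the two calls; neither side stores it.
wire_format = encode_order({"id": 42, "customer": "Acme", "total": 19.5})
print(decode_order(wire_format))
```

Nothing in this sketch persists the XML itself, which matches the point that the document is only an intermediate representation.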

In fact, the entire purpose of the interchange format is to combine data from an external source with the rest of the data in the DBMS. If you want to access or search data from these interchange documents along with data already in the DBMS, you need to convert it from XML to the DBMS's native format. You may take this approach even further by making XML the lingua franca among different data sources. The discussion of data servers that follows addresses this option. But even in this sophisticated case, XML remains an intermediate data format. The data is ultimately translated and stored in an existing DBMS.

What if you use XML as a content format? In this case, authoring tools generate content as XML, and layout tools generate stylesheets for displaying this content. But content production usually requires higher-level features beyond storage, such as collaborative authoring, rendering to different media, and indexing documents. Moreover, you may also have content in other formats that you must manage alongside the XML documents.

In this case, you probably want to use a CMS that addresses persistent storage in conjunction with these other needs. Because most commercial CMSs evolved with the use of SGML, vendors have found it fairly easy to add excellent XML support. So if XML is a content and layout representation, use a CMS. CMS products with XML capabilities include BroadVision Publishing Center, Chrystal's Astoria, Documentum4i, Interwoven TeamSite, OmniMark Technologies' OmniMark, Red Bridge Interactive's DynaBase, SiberLogic's SiberSafe, and Vignette's Content Suite.

What if you use XML as an operational data format? Operational data is data that directly drives an application or process. Usually, DBMSs maintain operational data, but there are two cases where XML is likely to be the format. In the first case, XML is the format for an important work product of some kind. The business document architecture uses XML in precisely this manner. Each document represents a completed work product exchanged between organizations. Certainly, organizations will break down this document and map certain portions to corresponding DBMSs. However, the XML document is the starting point for driving this downstream processing and the ultimate point of reference for auditing. In the second case, XML is the format for instructions used in executing a process, and there is an emerging class of orchestration applications that use an XML format to describe the assembly of software components or the workflow for business processes.

In either of these situations, neither a traditional DBMS nor a CMS is appropriate. You need to use the XML document as a single unit but still index its internal contents. Traditional DBMSs do this poorly because they either have to disassemble the document into their internal formats or create special functions for treating documents as Large Objects (LOBs). Traditional CMSs do this poorly because they are not optimized for subsecond response under high request loads. So if XML is an operational data representation, use a native XML store. Such products include Ipedo XML Database, IXIASOFT's TEXTML Server, NeoCore XMS, Software AG's Tamino, and XYZFind Server. The products mentioned later in the 'Data servers' section of this article can also store native XML data and are particularly useful when your application has a combination of native XML and traditional DBMS data.
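
To make the requirement concrete, here is a toy sketch of "keep the document whole, but index its contents." It is not how any of the products above work internally; the `TinyXmlStore` class, its methods, and the purchase-order documents are hypothetical, and everything lives in memory.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class TinyXmlStore:
    """Toy in-memory store: whole documents plus a path/value index."""

    def __init__(self):
        self.docs = {}                  # doc_id -> original XML text, kept intact
        self.index = defaultdict(set)   # (element path, text) -> set of doc_ids

    def put(self, doc_id, xml_text):
        # Store the document as a single unit, then index its element text.
        # (A real store would also remove stale index entries on update.)
        self.docs[doc_id] = xml_text
        root = ET.fromstring(xml_text)
        self._index(root, root.tag, doc_id)

    def _index(self, elem, path, doc_id):
        text = (elem.text or "").strip()
        if text:
            self.index[(path, text)].add(doc_id)
        for child in elem:
            self._index(child, f"{path}/{child.tag}", doc_id)

    def find(self, path, value):
        """Return the complete documents whose element at `path` equals `value`."""
        return [self.docs[d] for d in sorted(self.index.get((path, value), ()))]

store = TinyXmlStore()
store.put("a", "<po><buyer>Acme</buyer><status>open</status></po>")
store.put("b", "<po><buyer>Globex</buyer><status>open</status></po>")
print(store.find("po/buyer", "Acme"))
```

A query returns the original document text unchanged, so the document remains the point of reference while lookups go through the index.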

Server infrastructure
With an appropriate place to store XML data persistently, the next concern is distributing and manipulating this data. In modern Web application architectures, servers play critical roles in assembling, processing, and distributing information. Adding XML support to your server infrastructure mostly involves making sure that existing servers are XML-enabled, with perhaps the installation of a few XML-specific components. Most important, you must verify that XML capabilities meet the scalability and reliability demands of all server functions.

In general, there are three types of server components: data servers, app servers, and content servers. Data servers access, aggregate, and format data. App servers execute business logic components and mediate distributed business processing. Content servers facilitate the acquisition of content, enhance its accessibility to users, and apply formatting. Different types of servers can work together in a typical Web application environment. This web of servers provides the conduit for propagating XML documents within an enterprise and throughout the Internet.

While each server component can be treated as a distinct node, this arrangement isn't necessary. Server software may combine these components in different ways, and different combinations lead to distinct product segments. Integration servers combine data server functions to aggregate information from multiple sources with application server functions to control the flow of business processes. Portal servers combine data server functions to access information from multiple sources with content server functions to filter this information based on user requirements. Personalization servers combine application server functions to calculate user needs with content server functions to customize their experiences dynamically. By understanding the roles of the three basic types of server functionality, you can evaluate whether such a combination suits your needs.

Data servers
DBMSs inherently constrain the use of data. They have to choose a particular paradigm, such as relational or object. Relational DBMSs with normalized tables optimize the combination of data in different ways. Object DBMSs with associated instances optimize the traversal of information webs. Within a given paradigm, each individual database has a particular structure limiting the types of information it can store and the access patterns it supports. DBMSs do a wonderful job of managing data when a given database must support only a few types of applications and when each application relies on only a few databases. However, when a given database must support a wide variety of application types or a given application must rely on many different databases, satisfying these demands often taxes DBMSs to their limits. In such cases, an XML-enabled data server can improve flexibility and performance.

XML broadens the use of data. The ability to design special purpose data formats quickly encourages the combination of information managed in different databases. So while data servers have existed for some time, XML's emergence as a solution to information exchange problems has elevated their role. Data servers perform three major functions: 1) they unify the data access interface to simplify application development, 2) they aggregate data from different sources to deliver customized packages of information, and 3) they consolidate requests to DBMSs to improve performance. XML requires special support only in the first two functions. Because optimizing performance through consolidation strategies like data caching and connection pooling occurs internally to the data server, the use of XML as the format does not affect this function.

An XML-enabled data server supports XML as the unified data access format. When an application submits a request to the data server, the data server fulfills it with an XML document. Given the rise of XML messaging, the data server should probably support this interaction over SOAP, using an interface specified in WSDL. Merely retrieving ad hoc bits of data as XML documents that the application then has to translate into programming data structures doesn't add much benefit. Programmatic solutions such as ODBC and JDBC already satisfy this need. The more substantial benefit comes from defining synthetic XML documents that form customized packages of data suited to a particular purpose.
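
As a rough illustration of what a request to such a data server might look like, the sketch below builds a SOAP 1.1 envelope. The envelope namespace is the standard SOAP 1.1 one; everything else is assumed for the example, including the `urn:example:dataserver` namespace, the `GetDocument` operation, and the `Key` elements. A real service would define these in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:dataserver"                       # hypothetical service namespace

def build_request(doc_type: str, keys: dict) -> str:
    """Build a SOAP request asking for one document type, selected by keys."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation: one element naming the document type wanted.
    request = ET.SubElement(body, f"{{{APP_NS}}}GetDocument", docType=doc_type)
    for name, value in keys.items():
        key = ET.SubElement(request, f"{{{APP_NS}}}Key", name=name)
        key.text = str(value)
    return ET.tostring(envelope, encoding="unicode")

print(build_request("CustomerSummary", {"customerId": 7}))
```

The sketch stops at constructing the message; sending it over HTTP and parsing the response are left out.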

To deliver a synthetic XML document, the data server must have a mapping between the document type and the structures managed by back-end DBMSs. A developer defines an XML DTD or Schema for the document type and then maps fields in the database schemas to element and attribute types. The developer also defines the keys used to select the correct records for populating a document instance. At runtime, an app submits a request for a synthetic document type and the appropriate keys. The data server then looks up the mapping, constructs queries based on the mapping and the keys, and puts the results into an XML document. This results document is valid with respect to the specified DTD or Schema.
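
The lookup, query, and assembly steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the `CUSTOMER_SUMMARY` mapping format, the table and column names, and the `build_document` function are all hypothetical, and a single in-memory SQLite database stands in for the back-end DBMSs. Validation against a DTD or Schema is omitted.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping: XML element name -> (table, column).
# "key" names the column used to select the right records.
CUSTOMER_SUMMARY = {
    "root": "customerSummary",
    "key": "customer_id",
    "fields": {
        "name": ("customers", "name"),
        "city": ("customers", "city"),
        "balance": ("accounts", "balance"),
    },
}

def build_document(conn, mapping, key_value):
    """Construct one query per mapped field and assemble the results as XML."""
    root = ET.Element(mapping["root"], id=str(key_value))
    for element, (table, column) in mapping["fields"].items():
        # Table and column names come from the trusted mapping; the key is bound.
        row = conn.execute(
            f"SELECT {column} FROM {table} WHERE {mapping['key']} = ?",
            (key_value,),
        ).fetchone()
        if row is not None:
            ET.SubElement(root, element).text = str(row[0])
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT);
    CREATE TABLE accounts (customer_id INTEGER, balance REAL);
    INSERT INTO customers VALUES (7, 'Acme', 'Boston');
    INSERT INTO accounts VALUES (7, 125.5);
""")
print(build_document(conn, CUSTOMER_SUMMARY, 7))
```

The application sees only the document type and the key. Which tables supply which elements is confined to the mapping, so the back-end schemas can change without touching application code.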

A DBMS vendor may include some data server capabilities with its DBMS product. For instance, Oracle9i includes XML mapping capabilities. In cases where the need for a data server stems from a small set of homogeneous databases attempting to serve many different apps, this solution is sufficient. But when the need for a data server stems from a set of apps attempting to aggregate data across heterogeneous databases, you probably need a separate data server product.

Such products include eXcelon's eXtensible Information Server and Versant enJin, both of which are based on object persistence engines. Data servers require many of the capabilities of back-end databases to provide high availability and transactional integrity. They use their own persistence engine as a staging area between applications and back-end DBMSs. Therefore, most of the native XML store products discussed previously can also operate as XML data servers by adding features for synchronizing with back-end databases. In fact, many vendors of these products are finding that this approach drives a substantial percentage of their sales. Conversely, data server products like eXcelon and enJin can operate as native XML stores, so distinctions between the two markets are blurring. When evaluating either type of product's suitability as a data server, focus on the facilities for mapping back-end data to XML documents and the efficiency of performance optimization strategies like caching and pooling.

Application servers
Application servers operate in the middle tier, applying business logic to data, then handing off the results for presentation. In this capacity, they have three primary reasons for working with XML documents:
* They may need to accept data as XML documents from data servers.
* They may need to provide business results as XML documents to content servers.
* They may have to exchange XML-formatted business messages with other application servers.

To support these operations, the application server can supply basic and advanced services.

Basic services include the execution of XML and XSLT processors, as well as a SOAP implementation. Whether it extracts data from XML documents, exchanges XML business documents, or produces XML business results, the application server needs the access and creation capabilities of an XML processor. Because many developers use XSLT for pre- and post-document processing, support for this standard should be part of the basic package. Interaction with XML-enabled data, application, and content servers almost certainly includes SOAP communication, so an implementation of the protocol is essential.

Theoretically, because an app server can execute any code in a language it supports, providing basic services is simply a matter of downloading XML and XSLT processors plus a SOAP implementation, then installing them. Practically, assuring the performance and quality of execution requires the vendor at the very least to certify components for use with the app server and probably include the recommended packages in the product distribution. You want to make sure that the vendor has tested the particular components, can provide estimates of how much throughput these components can handle, and knows how to support their use with its application server. For J2EE application servers, most vendors recommend the Xerces XML processor, the Xalan XSLT processor, and either their own or a particular third-party SOAP implementation. Microsoft has its own XML processor, XSLT processor, and SOAP implementation for its application server products.

Advanced services tend to vary significantly across application servers and evolve rapidly over time. Therefore, it's more appropriate to focus on the categories of advanced services rather than particular instances. Most advanced services are delivered in the form of frameworks. There are abstraction frameworks and task frameworks. Abstraction frameworks give developers more flexibility to make future changes by performing operations at a higher level. Two excellent examples are Sun's Java API for XML Processing (JAXP) and Java API for XML Messaging (JAXM). Both of these frameworks provide high-level APIs for performing specific XML-related operations. By programming to these abstract APIs rather than the concrete APIs of specific components, developers make it possible to switch their XML processor or XML messaging protocol easily.
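
The idea behind an abstraction framework can be shown with a small analogy. The sketch below is not JAXP or JAXM, which are Java APIs; it uses two XML processors from the Python standard library to show the same design choice. The `TitleReader` interface, both reader classes, and `describe` are hypothetical names invented for the example.

```python
from typing import Protocol
import xml.etree.ElementTree as ET
from xml.dom import minidom

class TitleReader(Protocol):
    """Abstract API: application code depends only on this interface."""
    def read_title(self, xml_text: str) -> str: ...

class ElementTreeReader:
    """Concrete implementation backed by ElementTree."""
    def read_title(self, xml_text: str) -> str:
        return ET.fromstring(xml_text).findtext("title", default="")

class MinidomReader:
    """Concrete implementation backed by minidom (a DOM-style processor)."""
    def read_title(self, xml_text: str) -> str:
        nodes = minidom.parseString(xml_text).getElementsByTagName("title")
        if nodes and nodes[0].firstChild is not None:
            return nodes[0].firstChild.data
        return ""

def describe(reader: TitleReader, xml_text: str) -> str:
    # Written against the abstract API, so the processor can be swapped freely.
    return f"Title: {reader.read_title(xml_text)}"

doc = "<report><title>Q3 Report</title></report>"
print(describe(ElementTreeReader(), doc))
print(describe(MinidomReader(), doc))
```

Because `describe` never names a concrete processor, replacing one implementation with another requires no change to the application logic.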

Task frameworks provide additional functionality for building specific types of applications. Personalization is a good example of a task framework used to produce XML documents for content servers. These types of applications use metadata about user preferences and metadata about content topics to generate customized content. Because XML is a convenient format for both types of metadata, there is the opportunity to deliver a package that greatly simplifies the development of such applications. But perhaps the best XML-related example of such an app is B2B messaging. This type of application touches on a host of issues, from specifying the allowable flows of messages, to generating views of executing processes, to integrating with back-end systems. Providing all this functionality would be difficult for a single application development team. By using XML, vendors can deliver a widely applicable framework that puts such apps within the reach of more organizations. All the major application server vendors -- including BEA, IBM, Microsoft, Oracle, and Sun -- provide their own flavors of both personalization and B2B messaging frameworks.

Content servers
Content servers combine data from DBMSs, results from business operations, and authored content into presentation formats for different users. XML-based technologies improve every stage of the fulfillment pipeline. At the end of the pipeline, they enable dynamic layouts that better fit users' needs. In the middle of the pipeline, they make it easier to connect a user to the exact information he wants. At the beginning of the pipeline, they make it easier to acquire the library of content necessary to satisfy the user base. Most content servers focus on one or two aspects of this pipeline, so implementing a complete XML content strategy may require several types of content servers.

The most common use for XML in content servers is applying dynamic presentation to XML content. The content server does this by using XSLT to generate pages in markup-based presentation languages such as HTML, VoiceXML, and WML. Based on variables, including the type of client device, the type of content, and the localization settings for the user, the content server selects an XSLT transform and applies it to the XML document. Because most Web servers have programming extensions that support XSLT, you won't need any additional server infrastructure if all you want is dynamic presentation.
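
The selection step might look something like the sketch below. The registry contents, the stylesheet file names, and the most-specific-match fallback order are assumptions made for illustration. Applying the chosen transform requires an XSLT processor, which the Python standard library does not include, so that step is left out.

```python
# Hypothetical registry: (device, content type, locale) -> stylesheet file name.
# None acts as a wildcard for that position.
STYLESHEETS = {
    ("phone", "article", "fr"): "article-wml-fr.xsl",
    ("phone", "article", None): "article-wml.xsl",
    ("voice", None, None): "generic-voicexml.xsl",
    (None, None, None): "default-html.xsl",
}

def select_stylesheet(device, content_type, locale):
    """Return the most specific stylesheet registered for the request."""
    candidates = [
        (device, content_type, locale),   # exact match
        (device, content_type, None),     # any locale
        (device, None, None),             # any content type on this device
        (None, None, None),               # site-wide default
    ]
    for key in candidates:
        if key in STYLESHEETS:
            return STYLESHEETS[key]
    raise LookupError("no stylesheet registered")

print(select_stylesheet("phone", "article", "de"))
```

Trying keys from most to least specific lets a site start with a single default stylesheet and add device-specific or locale-specific variants later without changing the lookup code.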

Customizing layouts for users is only part of the content delivery equation. Users also need help finding the content that addresses their immediate needs. Traditional search engines suffer from the problem of distinguishing between different contexts for the same word. With XML content, a search engine can use the element structure and attribute values to improve search precision. Using an XML-aware search engine helps maximize the benefits of an XML-based content strategy. Usually, employing such a product involves assigning a dedicated server or cluster of servers to perform searches that then refer users to the appropriate content. Such standalone solutions include DocSoft's extend XML and XML Global's GoXML Search. Of course, most of the CMS and native XML store products discussed previously can perform searches on XML document collections, but this approach works only if you store all the content you plan to search in one of these products.
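
A small example shows how element structure improves precision. The two documents and their element names are invented for illustration, and the linear scan stands in for the indexes a real search engine would build.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: the same word appears in two different contexts.
DOCS = {
    "doc1": "<article><company>Apple</company><body>Quarterly earnings rose.</body></article>",
    "doc2": "<recipe><ingredient>Apple</ingredient><body>Bake for an hour.</body></recipe>",
}

def full_text_search(term):
    """Context-blind search: match the term anywhere in the document text."""
    return sorted(
        doc_id for doc_id, xml_text in DOCS.items()
        if term in "".join(ET.fromstring(xml_text).itertext())
    )

def element_search(tag, term):
    """XML-aware search: match the term only inside elements named `tag`."""
    hits = []
    for doc_id, xml_text in DOCS.items():
        root = ET.fromstring(xml_text)
        if any(term in (elem.text or "") for elem in root.iter(tag)):
            hits.append(doc_id)
    return sorted(hits)

print(full_text_search("Apple"))           # both documents match
print(element_search("company", "Apple"))  # only the business article matches
```

The plain search cannot tell the company from the fruit, while the element-scoped search returns only the document where the word appears in the intended context.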

XML-aware search engines leverage metadata at the element and attribute levels. However, metadata can also apply to entire collections of content. The foundation of the Semantic Web is the use of metadata to provide a conceptual map of an entire site or group of sites. Another W3C Recommendation, Resource Description Framework (RDF), provides a standardized XML vocabulary for describing the types of content offered, the relationships among content, and the conditions under which content might be relevant. Most site creators use an implicit information model in selecting and organizing content. RDF makes it possible to state this model explicitly. The availability of machine-readable models facilitates automated information retrieval, filtering, and visualization capabilities far beyond those of traditional search engines. The Semantic Web is in its early development, and much of the work is in the form of research and open source projects. However, in the near future, RDF may migrate into mainstream content infrastructure. Web servers will offer RDF descriptions. Search engines will use these descriptions as part of the search criteria. Authoring tools will generate these descriptions.

In addition to making it easier to find content, XML makes it easier to acquire content. Content can come from two sources: You can create it, or you can borrow it. When creating content, the ability of multiple authors to collaborate effectively greatly enhances productivity. Web Distributed Authoring and Versioning (WebDAV), a set of XML-based extensions to HTTP from the IETF, makes it possible for authors to work together to create, enhance, and maintain content. A WebDAV server manages contributions, tracks changes, and enforces permissions. A number of portal servers, including Microsoft's SharePoint Portal Server and Oracle9iAS Portal, use WebDAV to enable the collaborative editing of portal content. Common Web servers such as Apache and IIS also support WebDAV. Any client that speaks the WebDAV protocol can use these servers to collaborate on documents. Such clients include content authoring tools such as Adobe Acrobat and Microsoft Office. Taken to an extreme, WebDAV enables the replacement of traditional document management systems with a set of distributed WebDAV-capable servers. Oracle iFS and Xythos's Web File Server use this approach.

It is often more cost-effective to borrow content from someone else than to generate it yourself. However, this type of syndication faces two problems. First, it is often difficult to fit third-party content into an application because of differences in layout. XML solves this problem by giving both parties a format for exchanging information separate from presentation. The subscriber knows the structure of each publisher's content, so it can use XSLT to integrate content from different sources and apply its preferred layout. Second, there is the problem of how to negotiate subscriptions, track usage, and update information automatically. Information and Content Exchange (ICE) addresses these issues by providing a standard XML protocol for such interactions between subscribers and publishers. ICE support is available in a wide variety of products that generate and manage content, including Interwoven's OpenSyndicate, Oracle9i, and Vignette's Content Syndication Server.