In-Depth
Look at storage issues before you leap into XML
- By Kevin Dick
- October 31, 2002
The following article is adapted from Chapter 5 of
XML: A Manager's Guide, Second Edition
by Kevin Dick. Used with the permission of the author and
Addison-Wesley.
XML documents are data that can be either at rest or in transit. Therefore,
enterprises that want to successfully deploy XML must figure out how to manage
XML in both of these states. For XML at rest, developers must first decide on
the type of store to use. For XML in transit, they must first decide on the
server infrastructure to deploy.
It is not uncommon for projects using XML to stall while figuring out how to
address the storage issue. The confusion stems from the fact that there are
three vastly different choices: a database management system (DBMS), a content
management system (CMS), or a native XML store. The appropriate choice depends
on the characteristics of your XML data.
What if you use XML as a data interchange format? In this case, a source
application encodes data from its own native format as XML, and a target
application decodes the XML data into its own native format. XML is an
intermediate data representation. Both the source and target applications
already have persistent storage mechanisms, almost certainly DBMSs of one sort
or another. There is really no need to store the XML documents persistently
themselves, except perhaps for logging purposes.
In fact, the entire purpose of the interchange format is to combine data from
an external source with the rest of the data in the DBMS. If you want to access
or search data from these interchange documents along with data already in the
DBMS, you need to convert it from XML to the DBMS's native format. You may take
this approach even further by making XML the lingua franca among different data
sources. The discussion of data servers that follows addresses this option. But
even in this sophisticated case, XML remains an intermediate data format. The
data is ultimately translated and stored in an existing DBMS.
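As a minimal sketch of this round trip, the Python below encodes a record from a source application as XML and then decodes it into the target's own database. The `order` document type, the table layout, and the function names are assumptions made for illustration; an in-memory SQLite database stands in for whatever DBMS the target application already uses.

```python
import sqlite3
import xml.etree.ElementTree as ET


def encode_order(order: dict) -> str:
    """Source side: serialize a native record as an XML interchange document."""
    root = ET.Element("order", id=str(order["id"]))
    ET.SubElement(root, "customer").text = order["customer"]
    ET.SubElement(root, "total").text = f'{order["total"]:.2f}'
    return ET.tostring(root, encoding="unicode")


def decode_order(xml_text: str, conn: sqlite3.Connection) -> None:
    """Target side: translate the XML into a row in the target's own DBMS."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
        (
            int(root.get("id")),
            root.findtext("customer"),
            float(root.findtext("total")),
        ),
    )


# The target application's persistent store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

# The XML exists only in transit; nothing stores the document itself.
message = encode_order({"id": 17, "customer": "Acme", "total": 249.5})
decode_order(message, conn)
rows = conn.execute("SELECT id, customer, total FROM orders").fetchall()
```

Once the row is inserted, the XML message can be discarded or written to a log; queries run against the `orders` table, not against the document.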
What if you use XML as a content format? In this case, authoring tools
generate content as XML, and layout tools generate stylesheets for displaying
this content. But content production usually requires higher-level features
beyond storage, such as collaborative authoring, rendering to different media,
and indexing documents. Moreover, you may also have content in other formats
that you must manage alongside the XML documents.
In this case, you probably want to use a CMS that addresses persistent
storage in conjunction with these other needs. Because most commercial CMSs
evolved with the use of SGML, vendors have found it fairly easy to add excellent
XML support. So if XML is a content and layout representation, use a CMS. CMS
products with XML capabilities include BroadVision Publishing Center, Chrystal's
Astoria, Documentum4i, Interwoven TeamSite, OmniMark Technologies' OmniMark, Red
Bridge Interactive's DynaBase, SiberLogic's SiberSafe, and Vignette's Content
Suite.
What if you use XML as an operational data format? Operational data is data
that directly drives an application or process. Usually, DBMSs maintain
operational data, but there are two cases where XML is likely to be the format.
In the first case, XML is the format for an important work product of some kind.
The business document architecture uses XML in precisely this manner. Each
document represents a completed work product exchanged between organizations.
Certainly, organizations will break down this document and map certain portions
to corresponding DBMSs. However, the XML document is the starting point for
driving this downstream processing and the ultimate point of reference for
auditing. In the second case, XML is the format for instructions used in
executing a process, and there is an emerging class of orchestration
applications that use an XML format to describe the assembly of software
components or the workflow for business processes.
In either of these situations, neither a traditional DBMS nor a CMS is
appropriate. You need to use the XML document as a single unit but still index
its internal contents. Traditional DBMSs do this poorly because they either have
to disassemble the document into their internal formats or create special
functions for treating documents as Large Objects (LOBs). Traditional CMSs do
this poorly because they are not optimized for subsecond response under high
request loads. So if XML is an operational data representation, use a native XML
store. Such products include Ipedo XML Database, IXIASOFT's TEXTML Server,
NeoCore XMS, Software AG's Tamino, and XYZFind Server. The products mentioned
later in the 'Data servers' section of this article can also store native XML
data and are particularly useful when your application has a combination of
native XML and traditional DBMS data.
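To make the requirement concrete, here is a toy Python sketch of a store that keeps each document whole while indexing its internal contents. The `TinyXmlStore` class and the purchase-order documents are hypothetical, and the in-memory index only illustrates the idea; it does not describe how any of the products named above work internally.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class TinyXmlStore:
    """Keeps every document as a single unit and indexes element contents."""

    def __init__(self):
        self.documents = {}            # doc_id -> original XML text, kept whole
        self.index = defaultdict(set)  # (element path, text) -> set of doc_ids

    def put(self, doc_id, xml_text):
        self.documents[doc_id] = xml_text
        root = ET.fromstring(xml_text)
        self._index(root, root.tag, doc_id)

    def _index(self, element, path, doc_id):
        # Record the text of each element under its path from the root.
        text = (element.text or "").strip()
        if text:
            self.index[(path, text)].add(doc_id)
        for child in element:
            self._index(child, f"{path}/{child.tag}", doc_id)

    def find(self, path, text):
        """Return the complete documents whose element at `path` holds `text`."""
        doc_ids = sorted(self.index.get((path, text), ()))
        return [self.documents[doc_id] for doc_id in doc_ids]


store = TinyXmlStore()
store.put("po-1", "<po><buyer>Acme</buyer><status>open</status></po>")
store.put("po-2", "<po><buyer>Globex</buyer><status>closed</status></po>")
hits = store.find("po/status", "open")
```

The lookup goes through the index, but what comes back is the untouched original document, which is the property that matters for auditing.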
Server infrastructure
With an appropriate place to store XML
data persistently, the next concern is distributing and manipulating this data.
In modern Web application architectures, servers play critical roles in
assembling, processing, and distributing information. Adding XML support to your
server infrastructure mostly involves making sure that existing servers are
XML-enabled, with perhaps the installation of a few XML-specific components.
Most important, you must verify that XML capabilities meet the scalability and
reliability demands of all server functions.
In general, there are three types of server components: data servers, app
servers, and content servers. Data servers access, aggregate, and format data.
App servers execute business logic components and mediate distributed business
processing. Content servers facilitate the acquisition of content, enhance its
accessibility to users, and apply formatting. Different types of servers [can]
work together in a typical Web application environment. This web of servers
provides the conduit for propagating XML documents within an enterprise and
throughout the Internet.
While each server component [can be shown] as a distinct node, this
arrangement isn't necessary. Server software may combine these components in
different ways, and different combinations lead to distinct product segments.
Integration servers combine data server functions to aggregate information from
multiple sources with application server functions to control the flow of
business processes. Portal servers combine data server functions to access
information from multiple sources with content server functions to filter this
information based on user requirements. Personalization servers combine
application server functions to calculate user needs with content server
functions to customize their experiences dynamically. By understanding the roles
of the three basic types of server functionality, you can evaluate whether such
a combination suits your needs.
Data servers
DBMSs inherently constrain the use of data. They
have to choose a particular paradigm, such as relational or object. Relational
DBMSs with normalized tables optimize the combination of data in different ways.
Object DBMSs with associated instances optimize the traversal of information
webs. Within a given paradigm, each individual database has a particular
structure limiting the types of information it can store and the access patterns
it supports. DBMSs do a wonderful job of managing data when a given database
must support only a few types of applications and when each application relies
on only a few databases. However, when a given database must support a wide
variety of application types or a given application must rely on many different
databases, satisfying these demands often taxes DBMSs to their limits. In such
cases, an XML-enabled data server can improve flexibility and performance.
XML broadens the use of data. The ability to design special purpose data
formats quickly encourages the combination of information managed in different
databases. So while data servers have existed for some time, XML's emergence as
a solution to information exchange problems has elevated their role. Data
servers perform three major functions: 1) they unify the data access interface
to simplify application development, 2) they aggregate data from different
sources to deliver customized packages of information, and 3) they consolidate
requests to DBMSs to improve performance. XML requires special support only in
the first two functions. Because optimizing performance through consolidation
strategies like data caching and connection pooling occurs internally to the
data server, the use of XML as the format does not affect this function.
An XML-enabled data server supports XML as the unified data access format.
When an application submits a request to the data server, the data server
fulfills it with an XML document. Given the rise of XML messaging, the data
server should probably support this interaction over SOAP, using an interface
specified in WSDL. Merely retrieving ad hoc bits of data as XML documents that
the application then has to translate into programming data structures doesn't
add much benefit. Programmatic solutions such as ODBC and JDBC already satisfy
this need. The more substantial benefit comes from defining synthetic XML
documents that form customized packages of data suited to a particular
purpose.
To deliver a synthetic XML document, the data server must have a mapping
between the document type and the structures managed by back-end DBMSs. A
developer defines an XML DTD or Schema for the document type and then maps
fields in the database schemas to element and attribute types. The developer
also defines the keys used to select the correct records for populating a
document instance. At runtime, an app submits a request for a synthetic document
type and the appropriate keys. The data server then looks up the mapping,
constructs queries based on the mapping and the keys, and puts the results into
an XML document. This results document is valid with respect to the specified
DTD or Schema.
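The lookup-and-populate steps just described might be sketched as follows in Python. The `customerSummary` document type, the mapping layout, and the two in-memory SQLite databases standing in for back-end DBMSs are all assumptions for illustration. The sketch also omits the final validation against a DTD or Schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping for a synthetic "customerSummary" document type.
# Each field maps an element name to a table, a column, and the key column
# used to select the correct record.
MAPPING = {
    "root": "customerSummary",
    "key_attribute": "id",
    "fields": [
        ("name", "customers", "name", "id"),
        ("balance", "accounts", "balance", "customer_id"),
    ],
}


def build_document(mapping, key, connections):
    """Construct queries from the mapping and key, then assemble the XML."""
    root = ET.Element(mapping["root"], {mapping["key_attribute"]: str(key)})
    for element, table, column, key_column in mapping["fields"]:
        conn = connections[table]
        # Table and column names come from the developer-defined mapping;
        # the key supplied at runtime is passed as a bound parameter.
        query = f"SELECT {column} FROM {table} WHERE {key_column} = ?"
        row = conn.execute(query, (key,)).fetchone()
        if row is not None:
            ET.SubElement(root, element).text = str(row[0])
    return ET.tostring(root, encoding="unicode")


# Two separate back-end databases (hypothetical schemas).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (7, 'Acme')")

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE accounts (customer_id INTEGER, balance REAL)")
billing.execute("INSERT INTO accounts VALUES (7, 120.5)")

doc = build_document(MAPPING, 7, {"customers": crm, "accounts": billing})
```

The application asks only for a document type and a key. Which database holds which field is a detail of the mapping, so a back-end change means editing the mapping, not the application.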
A DBMS vendor may include some data server capabilities with its DBMS
product. For instance, Oracle9i includes XML mapping capabilities. In cases
where the need for a data server stems from a small set of homogeneous databases
attempting to serve many different apps, this solution is sufficient. But when
the need for a data server stems from a set of apps attempting to aggregate data
across heterogeneous databases, you probably need a separate data server
product.
Such products include eXcelon's eXtensible Information Server and Versant
enJin, both of which are based on object persistence engines. Data servers
require many of the capabilities of back-end databases to provide high
availability and transactional integrity. They use their own persistence engine
as a staging area between applications and back-end DBMSs. Therefore, most of
the native XML store products discussed previously can also operate as XML data
servers by adding features for synchronizing with back-end databases. In fact,
many vendors of these products are finding that this approach drives a
substantial percentage of their sales. Conversely, data server products like
eXcelon and enJin can operate as native XML stores, so distinctions between the
two markets are blurring. When evaluating either type of product's suitability
as a data server, focus on the facilities for mapping back-end data to XML
documents and the efficiency of performance optimization strategies like caching
and pooling.
Application servers
Application servers operate in the middle
tier, applying business logic to data, then handing off the results for
presentation. In this capacity, they have three primary reasons for working with
XML documents:
* They may need to accept data as XML documents from data
servers.
* They may need to provide business results as XML documents to
content servers.
* They may have to exchange XML-formatted business messages
with other application servers.
To support these operations, the application server can supply basic and
advanced services.
Basic services include the execution of XML and XSLT processors, as well as a
SOAP implementation. Whether it extracts data from XML documents, exchanges XML
business documents, or produces XML business results, the application server
needs the access and creation capabilities of an XML processor. Because many
developers use XSLT for pre- and post-document processing, support for this
standard should be part of the basic package. Interaction with XML-enabled data,
application, and content servers almost certainly includes SOAP communication,
so an implementation of the protocol is essential.
Theoretically, because an app server can execute any code in a language it
supports, providing basic services is simply a matter of downloading XML and
XSLT processors plus a SOAP implementation, then installing them. Practically,
assuring the performance and quality of execution requires the vendor at the
very least to certify components for use with the app server and probably
include the recommended packages in the product distribution. You want to make
sure that the vendor has tested the particular components, can provide estimates
of how much throughput these components can handle, and knows how to support
their use with its application server. For J2EE application servers, most
vendors recommend the Xerces XML processor, the Xalan XSLT processor, and either
their own or a particular third-party SOAP implementation. Microsoft has its own
XML processor, XSLT processor, and SOAP implementation for its application
server products.
Advanced services tend to vary significantly across application servers and
evolve rapidly over time. Therefore, it's more appropriate to focus on the
categories of advanced services rather than particular instances. Most advanced
services are delivered in the form of frameworks. There are abstraction
frameworks and task frameworks. Abstraction frameworks give developers more
flexibility to make future changes by performing operations at a higher level.
Two excellent examples are Sun's Java API for XML Processing (JAXP) and Java API
for XML Messaging (JAXM). Both of these frameworks provide high-level APIs for
performing specific XML-related operations. By programming to these abstract
APIs rather than the concrete APIs of specific components, developers make it
possible to switch their XML processor or XML messaging protocol easily.
Task frameworks provide additional functionality for building specific types
of applications. Personalization is a good example of a task framework used to
produce XML documents for content servers. These types of applications use
metadata about user preferences and metadata about content topics to generate
customized content. Because XML is a convenient format for both types of
metadata, there is the opportunity to deliver a package that greatly simplifies
the development of such applications. But perhaps the best XML-related example
of such an app is B2B messaging. This type of application touches on a host of
issues, from specifying the allowable flows of messages, to generating views of
executing processes, to integrating with back-end systems. Providing all this
functionality would be difficult for a single application development team. By
using XML, vendors can deliver a widely applicable framework that puts such apps
within the reach of more organizations. All the major application server vendors
-- including BEA, IBM, Microsoft, Oracle, and Sun -- provide their own flavors
of both personalization and B2B messaging frameworks.
Content servers
Content servers combine data from DBMSs, results
from business operations, and authored content into presentation formats for
different users. XML-based technologies improve every stage of the fulfillment
pipeline. At the end of the pipeline, they enable dynamic layouts that better
fit users' needs. In the middle of the pipeline, they make it easier to connect
a user to the exact information he wants. At the beginning of the pipeline, they
make it easier to acquire the library of content necessary to satisfy the user
base. Most content servers focus on one or two aspects of this pipeline, so
implementing a complete XML content strategy may require several types of
content servers.
The most common use for XML in content servers is applying dynamic
presentation to XML content. This process occurs by using XSLT to generate pages
in presentation markup languages such as HTML, VoiceXML, and WML. Based on
variables, including the type of client device, the type of content, and the
localization settings for the user, the content server selects an XSLT transform
and applies it to the XML document. Because most Web servers have programming
extensions that support XSLT, you won't need any additional server
infrastructure if all you want is dynamic presentation.
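The selection step can be pictured as a lookup keyed on those variables. The Python sketch below covers only that step; the stylesheet file names and the rule table are hypothetical, and applying the chosen transform would still require an XSLT processor, which is outside the scope of the sketch.

```python
# Hypothetical rule table: (device, content type, locale) -> XSLT stylesheet.
# None acts as a wildcard, and rules are tried in order, so more specific
# rules should appear before more general ones.
STYLESHEETS = [
    (("phone", "article", None), "article-wml.xsl"),
    (("voice", None, None), "generic-voicexml.xsl"),
    ((None, "article", "fr"), "article-html-fr.xsl"),
    ((None, None, None), "default-html.xsl"),
]


def select_stylesheet(device, content_type, locale):
    """Return the first stylesheet whose pattern matches the request."""
    request = (device, content_type, locale)
    for pattern, stylesheet in STYLESHEETS:
        if all(p is None or p == r for p, r in zip(pattern, request)):
            return stylesheet
    raise LookupError("no stylesheet matches the request")


phone_choice = select_stylesheet("phone", "article", "en")
french_choice = select_stylesheet("browser", "article", "fr")
fallback_choice = select_stylesheet("browser", "faq", "en")
```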
Customizing layouts for users is only part of the content delivery equation.
Users also need help finding the content that addresses their immediate needs.
Traditional search engines suffer from the problem of distinguishing between
different contexts for the same word. With XML content, a search engine can use
the element structure and attribute values to improve search precision. Using an
XML-aware search engine helps maximize the benefits of an XML-based content
strategy. Usually, employing such a product involves assigning a dedicated
server or cluster of servers to perform searches that then refer users to the
appropriate content. Such standalone solutions include DocSoft's extend XML and
XML Global's GoXML Search. Of course, most of the CMS and native XML store
products discussed previously can perform searches on XML document collections,
but this approach works only if you store all the content you plan to search in
one of these products.
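A small Python sketch shows how element structure separates contexts for the same word. The catalog and its element names are invented for the example, and the two functions are simple stand-ins, not the behavior of any product named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical content: "London" appears once as an author and once in a title.
CATALOG = """
<catalog>
  <item><author>Jack London</author><title>White Fang</title></item>
  <item><author>Ann Smith</author><title>A Guide to London</title></item>
</catalog>
""".strip()


def full_text_search(root, word):
    """Traditional approach: match the word anywhere in an item's text."""
    return [
        item.findtext("title")
        for item in root.iter("item")
        if word.lower() in "".join(item.itertext()).lower()
    ]


def element_search(root, tag, word):
    """XML-aware approach: match the word only inside the named element."""
    return [
        item.findtext("title")
        for item in root.iter("item")
        if word.lower() in (item.findtext(tag) or "").lower()
    ]


root = ET.fromstring(CATALOG)
anywhere = full_text_search(root, "London")
as_author = element_search(root, "author", "London")
```

The full-text search returns both items, while restricting the match to the `author` element returns only the book written by someone named London.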
XML-aware search engines leverage metadata at the element and attribute
levels. However, metadata can also apply to entire collections of content. The
foundation of the Semantic Web is the use of metadata to provide a conceptual
map of an entire site or group of sites. Another W3C Recommendation, Resource
Description Framework (RDF), provides a standardized XML vocabulary for
describing the types of content offered, the relationships among content, and
the conditions under which content might be relevant. Most site creators use an
implicit information model in selecting and organizing content. RDF makes it
possible to state this model explicitly. The availability of machine-readable
models facilitates automated information retrieval, filtering, and visualization
capabilities far beyond those of traditional search engines. The Semantic Web is
in its early development, and much of the work is in the form of research and
open source projects. However, in the near future, RDF may migrate into
mainstream content infrastructure. Web servers will offer RDF descriptions.
Search engines will use these descriptions as part of the search criteria.
Authoring tools will generate these descriptions.
In addition to making it easier to find content, XML makes it easier to
acquire content. Content can come from two sources: You can create it, or you
can borrow it. When creating content, the ability of multiple authors to
collaborate effectively greatly enhances productivity. Web Distributed Authoring
and Versioning (WebDAV), a set of XML-based extensions to HTTP from the IETF,
makes it possible for authors to work together to create, enhance, and maintain
content. A WebDAV server manages contributions, tracks changes, and enforces
permissions. A number of portal servers, including Microsoft's SharePoint Portal
Server and Oracle9iAS Portal, use WebDAV to enable the collaborative editing of
portal content. Common Web servers such as Apache and IIS also support WebDAV.
Any client that speaks the WebDAV protocol can use these servers to collaborate
on documents. Such clients include content authoring tools such as Adobe Acrobat
and Microsoft Office. Taken to an extreme, WebDAV enables the replacement of
traditional document management systems with a set of distributed WebDAV-capable
servers. Oracle iFS and Xythos's Web File Server use this approach.
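To show what "XML-based extensions to HTTP" means in practice, the Python sketch below builds the XML body a client could send with WebDAV's PROPFIND method to ask a server for properties of a resource. The helper function name is an assumption; `displayname` and `getlastmodified` are standard WebDAV properties in the `DAV:` namespace. The sketch only constructs the body and does not contact a server.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"  # the XML namespace WebDAV uses for its own elements
ET.register_namespace("D", DAV)


def propfind_body(*property_names):
    """Build a PROPFIND request body asking for the named DAV: properties."""
    root = ET.Element(f"{{{DAV}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV}}}prop")
    for name in property_names:
        ET.SubElement(prop, f"{{{DAV}}}{name}")
    return ET.tostring(root, encoding="unicode")


body = propfind_body("displayname", "getlastmodified")

# Parse the body back to confirm which properties it requests.
parsed = ET.fromstring(body)
requested = [child.tag for child in parsed.find(f"{{{DAV}}}prop")]
```

A client would send this body in an HTTP request whose method is PROPFIND, and the server would answer with an XML document listing the property values for each matching resource.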
It is often more cost-effective to borrow content from someone else than to
generate it yourself. However, this type of syndication faces two problems.
First, it is often difficult to fit third-party content into an application
because of differences in layout. XML solves this problem by giving both parties
a format for exchanging information separate from presentation. The subscriber
knows the structure of each publisher's content, so it can use XSLT to integrate
content from different sources and apply its preferred layout. There is also the
problem of how to negotiate subscriptions, track usage, and update information
automatically. Information and Content Exchange (ICE) addresses these issues by
providing a standard XML protocol for such interactions between subscribers and
publishers. ICE support is available in a wide variety of products that generate
and manage content, including Interwoven's OpenSyndicate, Oracle9i, and
Vignette's Content Syndication Server.