RECENTLY, MODELISTICA HAS been conducting an evaluation and feasibility study to determine
the suitability of XML and Java for the representation and manipulation of Transport and Land Use
(TLU) modeling information as used in urban and regional planning.
XML, the Extensible Markup Language, is highly publicized as the replacement for HTML for describing
document content on the Web. But markup languages have a long history and have applications far
beyond those of the Web. They are currently being used for information description and exchange
in such diverse areas as finance and trade, mathematics, chemistry, biology, knowledge representation,
genealogy, software package description and distribution, CASE, graphics, and more.
MARKING UP
HTML is the most popular and most widely known use of markup. XML was designed by the World Wide
Web Consortium—often referred to as W3C—to enable the use of the Standard Generalized
Markup Language (SGML) on the Web. XML is a public standard: it is not a proprietary development
of any single company. The version 1.0 specification was accepted by the W3C as a formal
Recommendation on Feb. 10, 1998. The XML Web page at the W3C site is the entry point to a sea
of information about XML, SGML, and related technologies and applications.
XML is an abbreviated version of SGML—the international standard for defining the
structure and content of electronic documents. XML eliminates the more complex and rarely used
features of SGML, making it much simpler to implement while remaining compatible with its ancestor.
XML is actually not a single language but a meta-language. XML can describe both the syntax
of specific classes of documents, and their contents. The portion of XML that determines
document syntax is named the Document Type Definition language (DTD). XML supports multiple DTDs.
From the XML perspective, HTML is just one of these document types—the one most frequently
used on the Web. It defines a single, fixed type of document with markups that let you describe
a common class of simple office-style reports. Because it provides only one way of describing
information, HTML is overburdened with dozens of interesting but often incompatible inventions
from different manufacturers. In contrast, XML allows the creation of markup languages customized
to the needs of specific applications—which is what brought me to investigate the possibilities
of defining a markup language for urban planning. If you're interested, a long list of current and
under-development applications of SGML and XML can be found at OASIS' XML Web page.
DEFINING THE APPLES AND ORANGES
My first experiments showed that XML files for our area of interest, Transport and Land Use (TLU)
information, would be large: one to four megabytes looks normal. So it was important that parsers
perform well in terms of both speed and memory usage. Defining an XML document type for
TLU is among the project's long-term goals, so attention was also given to the ability to parse
XML Document Type Definitions and validate XML documents against them. I also considered
support for current and upcoming XML standards.
All the parsers and tools reviewed here are available on the Web. As of this writing, no commercial
parsers were available and most parsers were flagged with some label indicating the publisher
wasn't claiming the parsers were production-quality. These products (with the exception of
Microsoft's original parser) are freely available for download. They are releases in all but name.
CONFORMANCE
The XML standard classifies documents into one of three categories: not well formed, well formed
but invalid, and valid. A document is well formed when it meets all the syntactic and semantic
requirements described in the XML standard. A well formed XML document is also valid when no
Document Type Definition (DTD) is provided. When a DTD is provided, a valid document must also
comply with the grammar described by the DTD.
Furthermore, documents can be stand-alone or can have references to external information, and the
XML standard allows for special treatment of external definitions by non-validating parsers.
I used James Clark's XMLTest test suite to evaluate how well the parsers conformed to the XML
definition. The XMLTest suite is composed of several hundred small XML files and DTDs, each one
testing for conformance with a specific aspect of the XML standard. The tests range from simple
checks to highly contrived entity definitions and expansions. The test suite also includes
normalized versions of all the valid files so that they can be compared with the output of
the targeted parsers.
For validating parsers, I added yet another test. I introduced a simple but obvious error in the first
lines of one of the large files I used in the performance test. The name of one of the elements was
changed to one that did not appear in the DTD and was, hence, invalid. This was a trivial test and
only one of the validating parsers failed it.
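The three conformance categories, and the renamed-element check described above, can be illustrated with a minimal sketch. It assumes the JAXP API bundled with current JDKs (javax.xml.parsers), which postdates the parsers reviewed here and is not the code used in this evaluation. The tiny DTD and the element names network, link, and lnk are hypothetical stand-ins for the TLU files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ValidityCheck {
    // Hypothetical miniature DTD; the real TLU DTD is not shown in the article.
    static final String DTD =
        "<!DOCTYPE network [\n"
      + "  <!ELEMENT network (link*)>\n"
      + "  <!ELEMENT link EMPTY>\n"
      + "  <!ATTLIST link id CDATA #REQUIRED>\n"
      + "]>\n";

    /** Returns "not well formed", "well formed but invalid", or "valid". */
    static String classify(String xml) {
        final List<String> validityErrors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Recoverable errors are how validity violations are reported.
                    validityErrors.add(e.getMessage());
                }
                // fatalError() is inherited and rethrows, which aborts the parse.
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return "not well formed"; // fatal errors are well-formedness errors
        } catch (Exception e) {
            // I/O or configuration trouble; wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return validityErrors.isEmpty() ? "valid" : "well formed but invalid";
    }

    public static void main(String[] args) {
        System.out.println(classify(DTD + "<network><link id='a'/></network>"));
        System.out.println(classify(DTD + "<network><lnk id='a'/></network>"));
        System.out.println(classify(DTD + "<network><link id='a'></network>"));
    }
}
```

The second document in main mirrors the trivial test: one element name is changed to a name the DTD does not declare, so the document stays well formed but a validating parser should flag it.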
NAMESPACES, XLINK, AND XPOINTER
XML namespaces provide a simple method for qualifying names used in XML documents by associating
them with namespaces identified by URI. Namespaces are intended to avoid problems of recognition
and collision in documents with fragments of different types. An example is the case of a small
database described in XML Data, embedded in an HTML document.
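To make the collision problem concrete, here is a small sketch in which a page and an embedded data record both use an element named title. It assumes the JAXP and DOM Level 2 APIs of current JDKs (getElementsByTagNameNS is not part of DOM Level 1), and both namespace URIs are placeholders invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NamespaceDemo {
    // Both URIs are placeholders invented for this sketch.
    static final String PAGE_NS = "urn:example:page";
    static final String DATA_NS = "urn:example:data";

    // A page-like document with an embedded data record; both use <title>.
    static final String XML =
        "<html xmlns='" + PAGE_NS + "' xmlns:d='" + DATA_NS + "'>"
      + "<title>Zone report</title>"
      + "<d:record><d:title>Zone 12</d:title></d:record>"
      + "</html>";

    /** Counts the title elements that belong to the given namespace URI. */
    static int countTitles(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, "title").getLength();
        } catch (Exception e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("page titles: " + countTitles(PAGE_NS));
        System.out.println("data titles: " + countTitles(DATA_NS));
    }
}
```

Because each title is qualified by a URI, an application can ask for one kind without accidentally picking up the other.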
The XML Linking Language (XLink) consists of constructs that may be inserted into XML documents
to describe links between objects. XLink can describe the simple unidirectional hyperlinks of
today's HTML as well as more sophisticated multi-ended and typed links. The XML Pointer Language
(XPointer) allows hyperlinks that reference arbitrary document fragments.
The Namespaces, XLink, and XPointer specifications are currently at the "working draft" level,
so they were not included in the evaluation. These technologies are important, so you'll find
mention of the parsers that implement the current draft versions of the standards.
DOM LEVEL 1
The Document Object Model (DOM) is a language-neutral API that allows programs to dynamically
access and update the content, structure and style of documents. The DOM Level 1 Specification
is already a publicly available W3C Recommendation.
The DOM defines a standard set of objects for representing HTML and XML documents, a standard
model of how these objects can be combined, and a standard interface for accessing and manipulating
them. A specific library can support the DOM as an interface to proprietary data structures and
APIs. Applications that use the standard DOM interfaces rather than product-specific APIs become
independent of particular implementations. The DOM standard currently defines language bindings to
Java, CORBA IDL, and ECMAScript (the European JavaScript/JScript standard).
To evaluate the DOM compliance of the libraries, I wrote a small program to test the interfaces
defined in the DOM Java binding as they appear in the org.w3c.dom package (see Listing 1).
The program was run using each library in turn.
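For readers who have not used the org.w3c.dom interfaces, the following is a minimal sketch of the kind of calls such a test exercises. It is not Listing 1: it assumes the JAXP bootstrap classes of current JDKs to obtain a parser, and the sample document and its element names are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomWalk {
    /** Parses a string into a DOM Document using whatever parser JAXP finds. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    /** Recursively counts element nodes at or below the given node. */
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countElements(children.item(i));
        }
        return count;
    }

    public static void main(String[] args) {
        // Invented sample data, not one of the TLU test files.
        Document doc = parse(
            "<zones><zone id='1'><area>12.5</area></zone><zone id='2'/></zones>");
        Element root = doc.getDocumentElement();
        NodeList zones = doc.getElementsByTagName("zone");
        System.out.println("root element: " + root.getTagName());
        System.out.println("zone elements: " + zones.getLength());
        System.out.println("all elements: " + countElements(doc));
    }
}
```

Only interfaces from org.w3c.dom appear after the parse call, which is what lets the same program run against any library that implements the standard.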
SAX
The Simple API for XML (SAX) is a standard interface for event-based XML parsing, developed collaboratively
by the members of the XML-DEV mailing list (see the Microstar Web site). A SAX-compliant XML parser reports
parsing events to the application through callbacks, without necessarily building any internal structures.
The application implements handlers to deal with the different events, much as is done in modern
graphical user interface (GUI) toolkits such as the Java AWT.
The SAX API makes the parser layer totally independent from other application or library functionality.
A particular set of event handlers may be used to build an in-memory representation of an XML document,
while a different set of handlers may render the document on the fly. Java packages that implement a
SAX driver are in fact interchangeable, at least in theory.
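The callback style can be sketched as follows. The example assumes the SAX 2 classes that ship with current JDKs (DefaultHandler and the JAXP SAXParserFactory), which are successors of the SAX interfaces the reviewed parsers implemented. The class name and sample input are invented.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Counts elements by name as the parser reports them; no tree is built. */
final class SaxElementCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        // Called by the parser once for every start tag it encounters.
        counts.merge(qName, 1, Integer::sum);
    }

    /** Parses the text with whatever SAX parser JAXP finds. */
    static Map<String, Integer> count(String xml) {
        try {
            SaxElementCounter handler = new SaxElementCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.counts;
        } catch (Exception e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("<a><b/><b/><c/></a>"));
    }
}
```

The handler never refers to a particular parser class, which is the property that makes SAX drivers interchangeable.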
PERFORMANCE
Speed and memory usage tests were performed using two large XML files (0.8 and 1.2 MB, respectively)
from one of our in-house applications. Each file contains several thousand XML elements nested in a
four-level-deep hierarchy, and all of the elements have one or more attributes.
For each parser and file, three runs were performed: one without validation, one providing the DTD
and enabling validation, and a third run using the same scheme as in the second but introducing
a validity error in the first 10 lines of the XML file. The DTD is the same for both files. It
consists of 530 lines and makes moderate use of DTD entity definitions (a sort of DTD macro).
A separate test was performed on the parsers that provide an object model, to measure model navigation
speed and memory use. The test consisted of loading a large XML file and querying the object model
while constructing yet another application-specific structure. To force navigation of the complete
structure, the model's Document object (or its equivalent) was used to write a new XML file out to
disk. Because this was a performance and not a compatibility test, changes were made to the test
program so it would run with those parsers that didn't implement the DOM Level 1 standard, implemented
it incompletely, or had their own proprietary object model. Libraries that would have required
non-trivial changes were not tested. This test was performed with validation turned off to minimize
parser overhead and focus on object model navigation.
All tests were done using Sun's Java Runtime Environment (JRE) version 1.1.7A on a 300 MHz Intel
Pentium II with 128 MB of RAM running Windows 98. The maximum heap space for all tests was set to
64 MB. The programs were compiled using the same version of the Sun javac compiler, with
optimizations turned on. Times were measured using an external command line program called from
a batch file, so they include the time needed to load the Java VM and any required libraries.
Memory usage was obtained by examining the trace of the programs after running them with verbose
garbage collection turned on. Note that these tests were not devised as benchmarks that would
help determine the split-second fastest parser, nor byte consumption per XML element. They were
designed with the intention of exposing problems in parser design that had an obvious impact
on performance when working with large XML documents.
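Readers who want a rough in-process measurement can start from a sketch like the one below. It is a different technique from the one used here: it times a single parse inside an already running VM, so it excludes VM and library load time and its figures are not comparable with the ones in this review. The class name, the synthetic document, and the use of the JAXP SAX parser from current JDKs are all assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ParseTimer {
    /** Wall-clock time of one task and the heap in use right after it. */
    static final class Measurement {
        final long elapsedNanos;
        final long usedHeapBytes;

        Measurement(long elapsedNanos, long usedHeapBytes) {
            this.elapsedNanos = elapsedNanos;
            this.usedHeapBytes = usedHeapBytes;
        }
    }

    static Measurement measure(Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsed = System.nanoTime() - start;
        Runtime rt = Runtime.getRuntime();
        // Rough figure: includes garbage that has not been collected yet.
        long used = rt.totalMemory() - rt.freeMemory();
        return new Measurement(elapsed, used);
    }

    /** Builds a synthetic document with the given number of child elements. */
    static String syntheticDocument(int elements) {
        StringBuilder sb = new StringBuilder("<zones>");
        for (int i = 0; i < elements; i++) {
            sb.append("<zone id='").append(i).append("'/>");
        }
        return sb.append("</zones>").toString();
    }

    public static void main(String[] args) {
        final String xml = syntheticDocument(10_000);
        Measurement m = measure(() -> {
            try {
                // A do-nothing handler, so only the parser itself is exercised.
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        });
        System.out.printf("parse: %d ms, heap in use: %d KB%n",
                m.elapsedNanos / 1_000_000, m.usedHeapBytes / 1024);
    }
}
```

As with the tests in this review, a harness like this is better at exposing gross problems, such as memory use that grows out of proportion to document size, than at ranking parsers by small margins.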
DOCUVERSE DOM SDK
The DOM SDK, by Docuverse, is not an XML parser, but a DOM implementation that works on top of any parser that exposes a SAX interface. It is discussed here because the very first tests already showed that it is indeed very simple to combine the DOM SDK with different parsers. Performing the DOM test using the SDK with both the Ælfred and XP parsers shows that these combinations are serious competitors to integrated parsers like Sun's or IBM's.
The DOM SDK is available at the Docuverse DOM SDK page. The license allows free distribution of the
binaries (.class and .jar files) but is very restrictive about copying or modification of the source
code and documentation.
JAMES CLARK'S XP V0.4
XP is a "high performance" XML parser produced by James Clark, who was technical lead for the W3C
SGML activity group. This group produced the first draft of the new XML standard. XP is non-validating,
but it checks whether documents are well formed and is capable of parsing external entities, including
DTDs. The only interface XP provides for applications is a SAX driver, so it qualifies as a lightweight
parser. The documentation provided with the parser consists only of the output from JavaDoc. It is too
succinct at times and assumes familiarity with SAX.
XP performed in the top tier, along with the parsers from Microstar, IBM, and Microsoft.
XP also performed well when combined with the DOM SDK in the DOM test. Under this test,
XP and Ælfred, another lightweight parser, produced almost equivalent results. It is
expected that these two parsers will evolve in different directions in the near future. James Clark's
XP emphasizes conformance, and will probably evolve into a validating parser, while Ælfred
emphasizes efficiency, portability, and fault tolerance, and will probably evolve in that direction
without adding the complexity of new features. XP performed well under the XML conformance test,
which is not surprising. After all, James Clark himself devised the test suite. The XP parser is
free and is available at James Clark's Web site.
MICROSTAR ÆLFRED V1.1
Ælfred is a parser that concentrates on optimizing speed and size rather than error reporting. This
approach is the most useful for deployment over the Internet. Ælfred consists of only two core class
files, the main parser class (XmlParser.class) and a small interface for your own program to implement
(XmlProcessor.class). All other classes in the distribution are either optional or demonstrations.
At 31 K, Ælfred's JAR file was, by far, the smallest among all the parsers.
Ælfred uses only JDK 1.0.2 features, but testing showed that it runs fine with JDK 1.1.6, 1.1.7A,
and 1.2rc1. The documentation claims that the parser is compatible with most character encodings
available on the Internet, but no attempt was made to test that assertion.
This parser was designed to be very lightweight, very portable, and very fault tolerant. It will
produce correct output for well-formed and valid documents, but it won't necessarily reject every
document that is not valid or not well formed. Ælfred will probably never become a validating parser.
Ælfred comes with very complete API documentation in the form of HTML files generated by JavaDoc
1.1. Several simple example projects are also included. This parser was fast in the tests that
didn't involve validation, and was able to complete the DOM test when combined with the Docuverse
DOM SDK. Ælfred and XP performed almost equally.
The conformance test showed that Ælfred is not as fault tolerant as the documentation
suggests. Ælfred generated exceptions for valid documents that were not stand-alone, and
went into an endless loop of error reporting for some of them. Ælfred failed to report many
documents that weren't well formed.
Ælfred is free for both commercial and non-commercial use and redistribution. The only requirement
is that Microstar's copyrights are preserved in derivative source code, and that any modifications
are clearly documented. Ælfred can be downloaded from Microstar's site.
MICROSOFT XML (MSXML) V 1.9
MSXML is a validating XML parser produced by Microsoft as part of its Internet Explorer 5 effort.
The parser has support for
namespaces and is compliant with the XML draft specification of November 1997. The parser provides
its own Object Model, which is quite powerful but isn't DOM Level 1-compliant. MSXML does not provide
a SAX driver, but drivers are available elsewhere—check out Lars Marius Garshol's Free XML
Software page and the Microstar Web site.
MSXML's documentation consists of several sample projects and JavaDoc documentation for the API.
The API documentation is nicely laid out, but many of the methods are undocumented in this version.
The sample projects include some interesting ones like an XML viewer applet. Another set of applets
can take small databases described in XML Data and lay them out nicely using tables and dynamic HTML.
Some of the applets even allow editing of the XML Data information, from changing field
values to adding and deleting records.
MSXML was the top performer in terms of both speed and memory usage. The parser performed better
than the small SAX-driven parsers in all tests, despite the fact that MSXML always builds an in-memory
model of the document and validation was always turned on. In the DOM test, MSXML consumed only half
the memory of its closest rival. All this performance fits in a JAR file of just 101 K, which gives
the parser the smallest footprint among those that provide an object model. Also note that MSXML's
performance is provided through 100% Pure Java code. Whatever the secret is to MSXML's performance,
other parsers would do well to imitate it.
The DOM test had to be adapted to be run with MSXML. The algorithm remained the same, but many
declarations and method calls had to be changed. MSXML performed quite well on this test. Its
speed and memory performance were better than those of any of the other parsers. The API is not
DOM-compliant, but it is as expressive as DOM, so it shouldn't be difficult to make MSXML DOM
Level 1 compatible.
On the conformance test, MSXML gave incorrect warnings and errors about many valid documents.
The parser also failed to detect many of the documents that were not well formed or invalid.
MSXML does not provide a SAX driver, but drivers are available on the Web (as mentioned previously).
MSXML originally didn't work with Sun's JDK 1.1.6 or 1.1.7A, because two locations in the
library's initialization code assumed that the JDK version would be convertible to a float
value. The Integrated Development Environment (IDE) used to construct the test suites promptly
pointed me to the faulty lines, so I fixed them. Oddly, MSXML reported an invalid document with
JDK 1.2 on a test that ran to completion with JDK 1.1.7A.
Microsoft entered into an agreement with Data Channel for further development of the parser. At
this writing, MSXML had been removed from the Microsoft Web site. Unfortunately, the current beta
of the parser provided by Data Channel rates below MSXML in all regards. Fortunately, the
license Microsoft provided with its version 1.9 parser is liberal enough that you'll likely be
able to find copies of the original, or of its heirs, elsewhere.
DATA CHANNEL XML PARSER
The XML parser from Data Channel (DCXML) is derived from Microsoft's. Surprisingly, the package layout
and the methods available in the DC parser are very different from those in MSXML. DCXML performed well
below most other parsers in all tests for speed and memory use. Even though DCXML is in its first beta,
no differences from the base code in layout or performance were expected.
The documentation provided with DCXML consists of the output of JavaDoc over a set of Java files
with absolutely no JavaDoc comments. As such, the documentation is useful for browsing through
the source code and little more.
DCXML performed well below the other parsers in all tests in terms of speed, but it was able
to complete tests that Sun XML couldn't when the Sun parser ran out of memory.
Object model tests on DCXML were not performed because it lacked the equivalent of the DOM
method getElementsByTagName(). I could perform that test on MSXML because it provides the same
functionality through an Element.getChildren().item() method.
In the conformance test, DCXML failed to recognize about 15% of valid documents, generating null
pointer exceptions for several of them. DCXML had only a few problems with documents that were
not well formed. Most of the errors occurred in documents that had references to external entities.
The licensing policy for DCXML is currently unknown. The license that was bundled with the
downloaded parser is an exact copy of the liberal one that came with MSXML 1.9. A different
version of the license on Data Channel's Web site states that the parser is free for commercial
use as long as some value added is provided. Yet another version of the licensing policy was
received via email, stating that the parser was free only for non-commercial use. This parser
is in an early beta state, and its characteristics and the related policies may change considerably
by the time it's released.
SUN XML, EARLY ACCESS 1
The Sun XML Library consists of a fast parser with optional validation. It has a SAX interface and
the library provides
an object model that is DOM Level 1 compliant. Sun's XML Library is labeled "Early Access 1", which
means it's still under construction.
The parser's API documentation was generated by JavaDoc 1.2 and is very complete. Sun also provides
several sample programs that highlight library features such as DOM, namespace support, and JavaBean
support. The set of sample programs serves well as a tutorial on the library's capabilities.
As in other libraries built around the SAX API, the parser and the object model are completely
independent. SAX compatibility enables you to use the Sun parser core with other applications,
including other DOM implementations like Docuverse's DOM SDK. The class in charge of building
the in-memory object model, the DocumentBuilder class, implements the SAX DocumentHandler
interface, which enables the use of Sun's object model with other SAX-compliant parsers, like XP.
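The idea of a tree builder that is just another SAX handler can be sketched as follows. This is not Sun's DocumentBuilder class, whose source is not shown here. It is a hypothetical handler written against the SAX 2 and JAXP classes of current JDKs (where, confusingly, javax.xml.parsers.DocumentBuilder is an unrelated factory class used below only to create an empty Document). It ignores comments, processing instructions, and namespaces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that assembles a DOM tree from parse events. */
final class SaxTreeBuilder extends DefaultHandler {
    private final Document doc;
    private Node current; // node that receives the next child

    SaxTreeBuilder(Document emptyDocument) {
        this.doc = emptyDocument;
        this.current = emptyDocument;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element; // descend into the new element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        current = current.getParentNode(); // climb back up
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.appendChild(doc.createTextNode(new String(ch, start, length)));
    }

    /** Feeds events from whatever SAX parser JAXP finds into the builder. */
    static Document build(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new SaxTreeBuilder(doc));
            return doc;
        } catch (Exception e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build("<a x='1'><b>hi</b></a>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Because the builder only sees SAX events, the parser behind it can be swapped without touching the tree-building code, and the same parser can feed a different handler when no tree is wanted.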
Sun's parser performed quite well in the tests that did not involve validation. On the tests where
we included a DTD to provide validation, times were comparable to those of the fastest parsers but
memory consumption skyrocketed. With validation enabled, the parser failed with an "out of memory"
exception and was not able to complete the test with the 1.2 MB XML file. The test that involved a
file with an invalid element on the first few lines consumed as much memory as when the file was parsed
entirely.
The parser also performed very poorly on the DOM navigation test, taking more than 25 minutes to complete.
The results made Sun's DOM implementation the worst performer. On the conformance test, Sun XML failed to
recognize just one of the valid documents.
Current licensing for the Sun XML parser is restricted to evaluation purposes. The licensing policy
for the finished parser has not been disclosed by Sun. The parser can be found at the Java Developer's
Connection page.
IBM XML4J (XML FOR JAVA) V 1.1.4
XML4J is a validating XML parser produced by the IBM alphaWorks project. The parser is compliant with
the XML 1.0 standard, has support for namespaces and DTD manipulation, and implements DOM Level 1.
XML4J also provides a SAX driver. XLink/XPointer support is also provided but was not tested.
The documentation for XML4J consists of a tutorial, complete API documentation in HTML and several
sample projects. The tutorial covers all relevant features of the library with special attention
to the library's unique features. The API documentation was generated by IBM's own implementation
of JavaDoc. The API descriptions are very complete and the HTML layout has a quality comparable to
that of documentation generated by JavaDoc 1.2.
One of XML4J's unique features is the possibility of setting up event handlers and filters at the
object model level. This feature enables you to build an object model out of only selected parts of
a complex XML document. But XML4J's functionality comes at a price. The JAR file that contains IBM's
library is 460 K, which is four times the size of the libraries from Sun and Microsoft, and more
than twice the size of the libraries for DCXML and XP.
XML4J was among the top performers in the lot, falling only behind the MSXML parser overall, and
behind the lightweight parsers only in the tests that did not involve validation. IBM's parser worked
fine on the DOM test. Memory use was elevated when compared to that of MSXML and the lightweight
parsers on the simple tests but was well below that of other parsers in the validation and DOM
tests. Unlike Sun's library, which behaved like the lightweight parsers when no object model was
requested, IBM's parser consumed the same amount of memory when used as a SAX driver and when a
DOM model was requested.
This parser went through the XML conformance test with very few problems. It failed to recognize
only three documents as not being well formed. XML4J is distributed with full source code and a
free commercial license. The parser is available for download at IBM's alphaWorks' Web site.
LORIA SXP V 0.72
SXP is an XML library produced by Loria in France, as part of their ambitious XSilfide project.
SXP is implemented as a SAX driver and has support for DOM Level 1, namespaces, XLink, and XPointer.
The documentation that comes with the library consists of JavaDoc-generated HTML pages, but the comments are few
and succinct. The documentation was unhelpful in trying to discover the unique characteristics
of the library.
SXP's performance was poor in terms of both speed and memory use. SXP was notably slow in the
tests that involved validation, taking at least 15 minutes to complete any of them. But poor
performance is to be expected of a parser in beta; optimizations are best left until the end
of the development process. On the plus side, the DOM test ran unaltered with the SXP library.
On the conformance test, SXP failed to recognize more than 20 documents as valid, and it failed
to detect the errors in a like number of documents that were not well formed or were invalid.
SXP is free for academic, research, and non-commercial use. The library is available at Loria's
Web site.
CONCLUSION
Here, we've reviewed both beta versions and officially shipping products. In the case of MSXML
we even looked at a retracted product. Ordinarily, a comparative review such as this would be
patently unfair to all products involved. But this is a unique situation. All products are free
and all are available on the Web. Whatever games the producers may be playing with labels
like "beta" or "early access" or "version 0.72", the products have all been released.
That said, nothing can be concluded about their current performance or about their limitations,
if any. As an example, when evaluations began, IBM's XML4J at version 1.0.4 was one of the worst
performers in terms of speed, memory use, and correctness. Things changed considerably with the
1.1.4 release, which was one of the best parsers we reviewed. These parsers will certainly continue
to improve or "die" and do so at breakneck speed. Our intention was to help you evaluate the
current crop of parsers.
As the evaluation results show, different parsers excel under different requirements.
The lightweight Ælfred from Microstar is the obvious choice when deploying simple
applets on the Web. When XML document validation is required IBM's XML4J and Microsoft's
MSXML are probably the right choices, with excellent standards compatibility being in
favor of the former, and outstanding speed being the latter's claim to fame. For applications
like the one we're working on, DOM Level 1 compliance is of primary concern because good
compliance with that standard makes the parsers almost plug-and-play, and hence completely replaceable.
Our evaluation made it clear that combining XML and Java is definitely viable, that performance
in this combination does not have to be an issue, and that conformance with the standard is rapidly improving.
There are several possible categories of parsers. The first is that of very lightweight and very forgiving
parsers like Ælfred, for use on the Web. The second is that of welterweight parsers that
implement most core standards, and provide a reasonable balance between performance, size, and
features (I expect Sun's parser to evolve in this direction). Then there's that of heavyweight
parsers which implement all relevant standards, are very strict about validation, and provide a
wealth of additional features, at the expense of some speed and higher memory requirements. I
suspect the lightweight and welterweight parsers will converge on an ideal feature set. The
heavyweights will remain and a totally different category will result from the integration of
parsers with other types of programs, like Web clients and servers. All the lights are green on XML.
Table 1. From version 0.4 to version 1.9, the claimed capabilities are all over the map.
| PRODUCT | VERSION        | JAR SIZE (in KB) | JDK 1.2 SUPPORT | VALIDATING | DOM LEVEL 1 SUPPORT | SAX SUPPORT | NAMESPACES | XLINK/XPOINTER |
| Ælfred  | 1.2a           | 32               | Yes             | No         | No                  | Yes         | No         | No             |
| DCXML   | Beta 1         | 144              | Yes             | Yes        | Yes                 | No          | Yes        | No             |
| XML4J   | 1.1.4          | 460              | Yes             | Yes        | Yes                 | Yes         | Yes        | Yes            |
| MSXML   | 1.9            | 101              | No              | Yes        | No                  | No          | Yes        | No             |
| Sun XML | Early Access 1 | 104              | Yes             | Yes        | Yes                 | Yes         | Yes        | No             |
| SXP     | 0.72           | 196              | Yes             | Yes        | Yes                 | Yes         | Yes        | Yes            |
| XP      | 0.4            | 173              | Yes             | No         | No                  | Yes         | No         | No             |
Note: As this issue went to press, Sun released a new version of its XML parser. Although it
was released too late to be included in this review, quick tests showed that speed and memory performance
have improved to levels comparable to those of MSXML and XML4J.
Juancarlo Añez is a Consultant with Modelistica, in Caracas, Venezuela, which provides technical and professional services in urban and regional planning.