Java XML Parsers

   

Cup rating system:
4 Outstanding
3 Good
2 Acceptable
1 Poor


Product Review


RECENTLY, MODELISTICA HAS been conducting an evaluation and feasibility study to determine the suitability of XML and Java for the representation and manipulation of Transport and Land Use (TLU) modeling information as used in urban and regional planning.

XML, the Extensible Markup Language, is highly publicized as the replacement for HTML for describing document content on the Web. But markup languages have a long history and have applications far beyond those of the Web. They are currently being used for information description and exchange in such diverse areas as finance and trade, mathematics, chemistry, biology, knowledge representation, genealogy, software package description and distribution, CASE, graphics, and more.

MARKING UP
HTML is the most popular and most widely known use of markup. XML was designed by the World Wide Web Consortium—often referred to as W3C—to enable the use of the Standard Generalized Markup Language (SGML) on the Web. XML is a public standard: it is not a proprietary development of any single company. The version 1.0 specification was accepted by the W3C as a formal Recommendation on Feb. 10, 1998. The XML Web page at the W3C site is the entry point to a sea of information about XML, SGML, and related technologies and applications.

XML is an abbreviated version of SGML, the international standard for defining the structure and content of electronic documents. XML eliminates the more complex and unused features of SGML, making it much simpler to implement while remaining compatible with its ancestor. XML is actually not a single language but a meta-language: it can describe both the syntax of specific classes of documents and their contents. The portion of XML that determines document syntax is the Document Type Definition (DTD) language. XML supports multiple DTDs.

From the XML perspective, HTML is just one of these document types—the one most frequently used on the Web. It defines a single, fixed type of document with markups that let you describe a common class of simple office-style reports. Because it provides only one way of describing information, HTML is overburdened with dozens of interesting but often incompatible inventions from different manufacturers. In contrast, XML allows the creation of markup languages customized to the needs of specific applications—which is what brought me to investigate the possibilities of defining a markup language for urban planning. If you're interested, a long list of current and under-development applications of SGML and XML can be found at OASIS' XML Web page.

DEFINING THE APPLES AND ORANGES
My first experiments showed that XML files for our area of interest, Transport and Land Use (TLU) information, would be large; one to four megabytes looks normal. It was therefore important that parsers perform well in terms of both speed and memory usage. Defining an XML document type for TLU is among the project's long-term goals, so attention was also given to the ability to parse XML Document Type Definitions and to validate XML documents against them. I also considered implementation of current and upcoming XML standards.

All the parsers and tools reviewed here are available on the Web. As of this writing, no commercial parsers were available, and most parsers carried some label indicating that the publisher wasn't claiming production quality. These products (with the exception of Microsoft's original parser) are freely available for download. They are releases in all but name.

CONFORMANCE
The XML standard classifies documents into one of three categories: not well formed, well formed but invalid, and valid. A document is well formed when it meets all the syntactic and semantic requirements described in the XML standard. A well formed XML document is also valid when no Document Type Definition (DTD) is provided. When a DTD is provided, a valid document must also comply with the grammar described by the DTD.
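The three categories can be told apart programmatically. The following is only a minimal sketch, not the procedure used in this review: it assumes a modern JDK with the JAXP API (javax.xml.parsers), which postdates the parsers reviewed here, and the class name ConformanceCheck is made up for illustration. It follows the convention stated above of treating a well-formed document without a DTD as valid; because a validating parser will typically report errors when no DTD is declared, the sketch checks for a DOCTYPE before turning validation on.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; not part of any parser reviewed in this article.
class ConformanceCheck {

    enum Category { NOT_WELL_FORMED, WELL_FORMED_BUT_INVALID, VALID }

    /** Records validity errors and rethrows fatal (well-formedness) errors. */
    static class Recorder implements ErrorHandler {
        boolean validityError = false;
        public void warning(SAXParseException e) { /* ignored in this sketch */ }
        public void error(SAXParseException e) { validityError = true; }
        public void fatalError(SAXParseException e) throws SAXException { throw e; }
    }

    static Category classify(String xml) {
        try {
            // Pass 1: check well-formedness only, with validation off.
            Document doc = parse(xml, false, new Recorder());
            if (doc.getDoctype() == null) {
                return Category.VALID; // no DTD: treated as valid, as in the text above
            }
            // Pass 2: validate against the declared DTD.
            Recorder recorder = new Recorder();
            parse(xml, true, recorder);
            return recorder.validityError
                    ? Category.WELL_FORMED_BUT_INVALID
                    : Category.VALID;
        } catch (SAXException e) {
            return Category.NOT_WELL_FORMED;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser setup or I/O failure", e);
        }
    }

    static Document parse(String xml, boolean validating, ErrorHandler handler)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // DTD validation when true
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(handler);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
        System.out.println(classify("<a><b></a>"));          // mismatched tags
        System.out.println(classify(dtd + "<a><c/></a>"));   // element c is not declared
        System.out.println(classify(dtd + "<a><b/></a>"));   // matches the DTD
    }
}
```

The sample documents use a tiny inline DTD so that the sketch needs no external files; a real conformance run would iterate over the files of a test suite instead.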

Furthermore, documents can be stand-alone or can have references to external information, and the XML standard allows for special treatment of external definitions by non-validating parsers.

I used James Clark's XMLTest test suite to evaluate how well the parsers conformed to the XML definition. The XMLTest suite is composed of several hundred small XML files and DTDs, each one testing for conformance with a specific aspect of the XML standard. The tests range from simple checks to highly contrived entity definitions and expansions. The test suite also includes normalized versions of all the valid files so that they can be compared with the output of the targeted parsers.

For validating parsers, I added yet another test. I introduced a simple but obvious error in the first lines of one of the large files I used in the performance test. The name of one of the elements was changed to one that did not appear in the DTD and was, hence, invalid. This was a trivial test and only one of the validating parsers failed it.

NAMESPACES, XLINK, AND XPOINTER
XML namespaces provide a simple method for qualifying names used in XML documents by associating them with namespaces identified by URIs. Namespaces are intended to avoid problems of recognition and collision in documents with fragments of different types. An example is the case of a small database described in XML Data, embedded in an HTML document.
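To make the collision problem concrete, here is a small sketch in which two elements share the local name "table" but belong to different namespaces. It assumes the JAXP and DOM Level 2 APIs of a modern JDK (neither was available in the parsers reviewed here), and the namespace URIs urn:example:html and urn:example:db are invented placeholders, not real vocabularies.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the namespace URIs below are placeholders.
class NamespaceDemo {

    static final String SAMPLE =
        "<page xmlns='urn:example:html' xmlns:db='urn:example:db'>"
      + "<table/><db:table/></page>";

    /** Lists every element with local name "table" as {namespace}localName. */
    static List<String> qualifiedTables(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // "*" matches any namespace; the local name alone is ambiguous.
            NodeList tables = doc.getElementsByTagNameNS("*", "table");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < tables.getLength(); i++) {
                Node node = tables.item(i);
                result.add("{" + node.getNamespaceURI() + "}" + node.getLocalName());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        for (String name : qualifiedTables(SAMPLE)) {
            System.out.println(name);
        }
    }
}
```

An application that cares only about the database fragment would ask for the urn:example:db namespace explicitly instead of using the wildcard.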

The XML Linking Language (XLink) consists of constructs that may be inserted into XML documents to describe links between objects. XLink can describe the simple unidirectional hyperlinks of today's HTML as well as more sophisticated multi-ended and typed links. The XML Pointer Language (XPointer) allows hyperlinks that reference arbitrary document fragments.

The Namespaces, XLink, and XPointer specifications are currently at the "working draft" level, so they were not included in the evaluation. These technologies are important, however, so you'll find mention of the parsers that implement the current draft versions of the standards.

DOM LEVEL 1
The Document Object Model (DOM) is a language-neutral API that allows programs to dynamically access and update the content, structure and style of documents. The DOM Level 1 Specification is already a publicly available W3C Recommendation.

The DOM defines a standard set of objects for representing HTML and XML documents, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them. A specific library can support the DOM as an interface to proprietary data structures and APIs. Applications that use the standard DOM interfaces rather than product-specific APIs become independent of particular implementations. The DOM standard currently defines language bindings to Java, CORBA IDL, and ECMAScript (the European JavaScript/JScript standard).

To evaluate the DOM compliance of the libraries, I wrote a small program to test the interfaces defined in the DOM Java binding as they appear in the org.w3c.dom package (see Listing 1). The program was run using each library in turn.
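As a rough illustration of what exercising the org.w3c.dom interfaces looks like, here is a minimal sketch. It is not the test program referred to above. It assumes a modern JDK, where JAXP (which did not exist when these parsers were reviewed) supplies the Document; the class name DomSmokeTest and the zones/zone sample document are invented for the example. Everything after parsing goes through the standard DOM interfaces only, which is the property that makes such a program portable across libraries.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; not the review's actual test program.
class DomSmokeTest {

    /** Parses a string into a DOM tree; JAXP is used only to obtain the Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Counts element nodes by walking the tree through the Node interface only. */
    static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countElements(children.item(i));
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample data loosely resembling zone records.
        Document doc = parse("<zones><zone id='1'/><zone id='2'/></zones>");
        Element root = doc.getDocumentElement();
        NodeList zones = doc.getElementsByTagName("zone");

        // Exercise the mutating side of the API as well.
        Element extra = doc.createElement("zone");
        extra.setAttribute("id", "3");
        root.appendChild(extra);

        System.out.println("root element: " + root.getTagName());
        System.out.println("zone elements: " + zones.getLength());
        System.out.println("all elements: " + countElements(doc));
    }
}
```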

SAX
The Simple API for XML (SAX) is a standard interface for event-based XML parsing, developed collaboratively by the members of the XML-DEV mailing list (see the Microstar Web site). A SAX-compliant XML parser reports parsing events to the application through callbacks, without necessarily building any internal structures. The application implements handlers to deal with the different events, much as is done in modern graphical user interface (GUI) toolkits such as the Java AWT.

The SAX API makes the parser layer totally independent from other application or library functionality. A particular set of event handlers may be used to build an in-memory representation of an XML document, while a different set of handlers may render the document on the fly. Java packages that implement a SAX driver are in fact interchangeable, at least in theory.
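The callback style can be sketched as follows. This is an illustration under assumptions: it uses the SAX2 DefaultHandler class and the JAXP SAXParserFactory found in a modern JDK, whereas the parsers reviewed here exposed the original SAX DocumentHandler interface. The handler class name is invented. Note that the handler keeps only two counters and never builds a tree.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts elements and tracks nesting depth from SAX events.
class SaxElementCounter extends DefaultHandler {

    int elements = 0;   // number of start tags seen
    int maxDepth = 0;   // deepest nesting level reached
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
        depth++;
        if (depth > maxDepth) {
            maxDepth = depth;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Feeds the document to whichever SAX parser the platform provides. */
    static SaxElementCounter run(String xml) {
        SaxElementCounter handler = new SaxElementCounter();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        SaxElementCounter result = run("<a><b><c/></b><b/></a>");
        System.out.println("elements: " + result.elements);
        System.out.println("max depth: " + result.maxDepth);
    }
}
```

Swapping in a different handler, for instance one that builds an object model, would not require touching the parser side at all.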

PERFORMANCE
Speed and memory usage tests were performed by one of our in-house applications using two large XML files (0.8 and 1.2 MB, respectively). Each file contains several thousand XML elements nested in a four-level-deep hierarchy, and all of the elements have one or more attributes.

For each parser and file, three runs were performed: one without validation, one providing the DTD and enabling validation, and a third using the same scheme as the second but introducing a validity error in the first 10 lines of the XML file. The DTD is the same for both files. It consists of 530 lines and makes moderate use of DTD entity definitions (a sort of DTD macro).

A separate test was performed on the parsers that provide an Object Model to measure model navigation speed and memory use. The test consisted of loading a large XML file and querying the object model while constructing yet another application-specific structure. To force navigation of the complete structure, the model's Document object (or its equivalent) was used to write a new XML file out to disk. Because this was a performance and not a compatibility test, changes were made to the test program so it would run with those parsers that didn't implement the DOM Level 1 standard, implemented it incompletely, or had their own proprietary object model. Libraries that would have required non-trivial changes were not tested. This test was performed with validation turned off to minimize parser overhead and focus on object model navigation.

All tests were done using Sun's Java Runtime Environment (JRE) version 1.1.7A on a 300 MHz Intel Pentium II with 128 MB of RAM running Windows 98. The maximum heap space for all tests was set to 64 MB. The programs were compiled using the same version of the Sun javac compiler, with optimizations turned on. Times were measured using an external command line program called from a batch file, so they include the time needed to load the Java VM and any required libraries. Memory usage was obtained by examining the trace of the programs after running them with verbose garbage collection turned on. Note that these tests were not devised as benchmarks that would help determine the split-second fastest parser or the byte consumption per XML element. They were designed with the intention of exposing problems in parser design that had an obvious impact on performance when working with large XML documents.
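For readers who want a quick in-process approximation of this kind of measurement, the sketch below times a single parse and samples heap use before and after. It deliberately differs from the procedure described above: the timing excludes VM startup, and memory is estimated from the Runtime counters rather than from a verbose garbage collection trace, so the numbers are only indicative. The class name, the synthetic document, and the use of JAXP are all assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical measurement helper; an approximation, not the review's method.
class ParseMeasurement {

    /** Builds a synthetic document with the given number of child elements. */
    static String syntheticXml(int elements) {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < elements; i++) {
            sb.append("<item id='").append(i).append("'/>");
        }
        return sb.append("</root>").toString();
    }

    /** Returns { elapsed milliseconds, approximate growth in used heap bytes }. */
    static long[] measureParse(String xml) {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint; the VM may ignore it
        long usedBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();

        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }

        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        long usedAfter = rt.totalMemory() - rt.freeMemory();

        // Use the document after sampling so the tree is still reachable above.
        if (doc.getDocumentElement() == null) {
            throw new IllegalStateException("empty document");
        }
        // A collection during the parse can make the difference negative.
        return new long[] { elapsedMillis, Math.max(0L, usedAfter - usedBefore) };
    }

    public static void main(String[] args) {
        long[] result = measureParse(syntheticXml(20000));
        System.out.println("elapsed ms: " + result[0]);
        System.out.println("approx. heap growth (bytes): " + result[1]);
    }
}
```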

DOCUVERSE DOM SDK

The DOM SDK, by Docuverse, is not an XML parser, but a DOM implementation that works on top of any parser that exposes a SAX interface. It is discussed here because the very first tests already showed that it is indeed very simple to combine the DOM SDK with different parsers. Performing the DOM test using the SDK with both the Ælfred and XP parsers shows that these combinations are serious competitors to integrated parsers like Sun's or IBM's.

The DOM SDK is available at the Docuverse DOM SDK page. The license allows free distribution of the binaries (.class and .jar files) but is very restrictive about copying or modification of the source code and documentation.

JAMES CLARK'S XP V0.4

XP is a "high performance" XML parser produced by James Clark, who was technical lead for the W3C SGML activity group. This group produced the first draft of the new XML standard. XP is non-validating, but it checks whether documents are well formed and is capable of parsing external entities, including DTDs. The only interface XP provides for applications is a SAX driver, so it qualifies as a lightweight parser. The documentation provided with the parser consists only of the output from JavaDoc. It is too succinct at times and assumes familiarity with SAX.

XP performed in the top tier, along with the parsers from Microstar, IBM, and Microsoft. XP also performed well when combined with the DOM SDK using the DOM test suite. Under this test suite, XP and Ælfred, another lightweight parser, produced almost equivalent results. These two parsers are expected to evolve in different directions in the near future: James Clark's XP emphasizes conformance and will probably evolve into a validating parser, while Ælfred emphasizes efficiency, portability, and fault tolerance, and will probably evolve in that direction without adding the complexity of new features. XP performed well under the XML conformance test, which is not surprising. After all, James Clark himself devised the test suite. The XP parser is free and is available at James Clark's Web site.

MICROSTAR ÆLFRED V1.1

Ælfred is a parser that concentrates on optimizing speed and size rather than error reporting. This approach is the most useful for deployment over the Internet. Ælfred consists of only two core class files, the main parser class (XmlParser.class) and a small interface for your own program to implement (XmlProcessor.class). All other classes in the distribution are either optional or demonstrations. At 31 K, Ælfred's JAR file was, by far, the smallest among all the parsers.

Ælfred uses only JDK 1.0.2 features, but testing showed that it runs fine with JDK 1.1.6, 1.1.7A, and 1.2rc1. The documentation claims that the parser is compatible with most character encodings available on the Internet, but no attempt was made to test that assertion.

This parser was designed to be very lightweight, very portable, and very fault tolerant. It will produce correct output for well-formed and valid documents, but it won't necessarily reject every document that is not valid or not well formed. Ælfred will probably never become a validating parser.

Ælfred comes with very complete API documentation in the form of HTML files generated by JavaDoc 1.1. Several simple example projects are also included. This parser was fast in the tests that didn't involve validation, and was able to complete the DOM test when combined with the Docuverse DOM SDK. Ælfred and XP performed almost equally.

The conformance test showed that Ælfred is not as fault tolerant as the documentation suggests. Ælfred generated exceptions for valid documents that were not stand-alone, and went into an endless loop of error reporting for some of them. Ælfred failed to report many documents that weren't well formed.

Ælfred is free for both commercial and non-commercial use and redistribution. The only requirement is that Microstar's copyrights are preserved in derivative source code, and that any modifications are clearly documented. Ælfred can be downloaded from Microstar's site.

MICROSOFT XML (MSXML) V 1.9

MSXML is a validating XML parser produced by Microsoft as part of its Internet Explorer 5 effort. The parser has support for namespaces and is compliant with the XML draft specification of November 1997. The parser provides its own Object Model, which is quite powerful but isn't DOM Level 1-compliant. MSXML does not provide a SAX driver, but drivers are available elsewhere—check out Lars Marius Garshol's Free XML Software page and the Microstar Web site.

MSXML's documentation consists of several sample projects and JavaDoc documentation for the API. The API documentation is nicely laid out, but many of the methods are undocumented in this version. The sample projects include some interesting ones like an XML viewer applet. Another set of applets can take small databases described in XML Data and lay them out nicely using tables and dynamic HTML. Some of the applets even allow editing of the XML Data information, from changing field values to adding and deleting records.

MSXML was the top performer in terms of both speed and memory usage. The parser performed better than the small SAX-driven parsers in all tests, despite the fact that MSXML always builds an in-memory model of the document and validation was always turned on. In the DOM test, MSXML consumed only half the memory of its closest rival. All this performance fits in a JAR file of just 101 K, which gives the parser the smallest footprint among those that provide an object model. Also note that MSXML's performance is provided through 100% Pure Java code. Whatever the secret to MSXML's performance is, other parsers would do well to imitate it.

The DOM test had to be adapted to run with MSXML. The algorithm remained the same, but many declarations and method calls had to be changed. MSXML performed quite well on this test. Its speed and memory performance were better than those of any of the other parsers. The API is not DOM-compliant, but it is as expressive as DOM, so it shouldn't be difficult to make MSXML DOM Level 1 compatible.

On the conformance test, MSXML gave incorrect warnings and errors about many valid documents. The parser also failed to detect many of the documents that were not well formed or invalid. MSXML does not provide a SAX driver, but drivers are available on the Web (as mentioned previously).

MSXML originally didn't work with Sun's JDK 1.1.6 or 1.1.7A, because two locations in the library's initialization code assumed that the JDK version would be convertible to a float value. The Integrated Development Environment (IDE) used to construct the test suites promptly pointed me to the faulty lines, so I fixed them. Oddly, MSXML reported an invalid document with JDK 1.2 on a test that ran to completion with JDK 1.1.7A.
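The kind of assumption involved is easy to reproduce. The sketch below is a hypothetical reconstruction, not MSXML's actual code (which is not shown in this article): one method converts the whole version string to a float and fails on strings such as "1.1.7A", while a more tolerant variant keeps only the leading major.minor digits. The class and method names are invented.

```java
// Hypothetical reconstruction of the failure mode; not MSXML's actual code.
class JdkVersionParsing {

    /** Fragile: assumes the whole version string is a single decimal number. */
    static float fragile(String version) {
        return Float.parseFloat(version); // throws NumberFormatException for "1.1.7A"
    }

    /** More tolerant: parses only the leading "major.minor" part of the string. */
    static float tolerant(String version) {
        int end = 0;
        int dots = 0;
        while (end < version.length()) {
            char c = version.charAt(end);
            if (c == '.') {
                dots++;
                if (dots == 2) {
                    break; // stop before the second dot
                }
            } else if (!Character.isDigit(c)) {
                break; // stop at suffixes such as "A" or "rc1"
            }
            end++;
        }
        String prefix = version.substring(0, end);
        if (prefix.endsWith(".")) {
            prefix = prefix.substring(0, prefix.length() - 1);
        }
        return prefix.isEmpty() ? 0f : Float.parseFloat(prefix);
    }

    public static void main(String[] args) {
        String[] samples = { "1.1", "1.1.6", "1.1.7A", "1.2rc1" };
        for (String version : samples) {
            String fragileResult;
            try {
                fragileResult = String.valueOf(fragile(version));
            } catch (NumberFormatException e) {
                fragileResult = "NumberFormatException";
            }
            System.out.println(version + ": fragile=" + fragileResult
                    + ", tolerant=" + tolerant(version));
        }
    }
}
```

In a real library the version string would typically come from System.getProperty("java.version"), whose format has changed more than once over the years.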

Microsoft entered into an agreement with Data Channel for further development of the parser. At this writing, MSXML had been removed from the Microsoft Web site. Unfortunately, the current beta of the parser provided by Data Channel rates below MSXML in all regards. Fortunately, the license Microsoft provided with its version 1.9 parser is liberal enough that you'll likely be able to find copies of the original, or of its heirs, elsewhere.

DATA CHANNEL XML PARSER

The XML parser from Data Channel (DCXML) is derived from Microsoft's. Surprisingly, the package layout and the methods available in the DC parser are very different from those in MSXML. DCXML performed well below most other parsers in all tests for speed and memory use. Even though DCXML is in its first beta, such differences from the base code in layout and performance were not expected.

The documentation provided with DCXML consists of the output of JavaDoc over a set of Java files with absolutely no JavaDoc comments. As such, the documentation is useful for browsing through the source code and little more.

DCXML performed well below the other parsers in all tests in terms of speed, but it was able to complete tests that Sun XML couldn't when the Sun parser ran out of memory.

Object model tests on DCXML were not performed because it lacked the equivalent of the DOM method getElementsByTagName(). I could perform that test on MSXML because it provides the same functionality through an Element.getChildren().item() method.

In the conformance test, DCXML failed to recognize about 15% of valid documents, generating null pointer exceptions for several of them. DCXML had only a few problems with documents that were not well formed. Most of the errors occurred in documents that had references to external entities.

The licensing policy for DCXML is currently unknown. The license that was bundled with the downloaded parser is an exact copy of the liberal one that came with MSXML 1.9. A different version of the license in Data Channel's Web site states that the parser is free for commercial use as long as some value added is provided. Yet another version of the licensing policy was received via email, stating that the parser was free only for non-commercial use. This parser is in an early beta state, and its characteristics and the related policies may change considerably by the time it's released.

SUN XML, EARLY ACCESS 1

The Sun XML Library consists of a fast parser with optional validation. It has a SAX interface and the library provides an object model that is DOM Level 1 compliant. Sun's XML Library is labeled "Early Access 1", which means it's still under construction.

The parser's API documentation was generated by JavaDoc 1.2 and is very complete. Sun also provides several sample programs that highlight library features such as DOM, namespace support, and JavaBean support. The set of sample programs serves well as a tutorial on the library's capabilities.

As in other libraries built around the SAX API, the parser and the object model are completely independent. SAX compatibility enables you to use the Sun parser core with other applications, including other DOM implementations like Docuverse's DOM SDK. The class in charge of building the in-memory object model, the DocumentBuilder class, implements the SAX DocumentHandler interface, which enables the use of Sun's object model with other SAX-compliant parsers, like XP.

Sun's parser performed quite well in the tests that did not involve validation. On the tests where we included a DTD to provide validation, times were comparable to those of the fastest parsers but memory consumption skyrocketed. With validation enabled, the parser failed with an "out of memory" exception and was not able to complete the test with the 1.2 MB XML file. The test that involved a file with an invalid element on the first few lines consumed as much memory as when the file was parsed entirely.

The parser also performed very poorly on the DOM navigation test, taking more than 25 minutes to complete. The results made Sun's DOM implementation the worst performer. On the conformance test, Sun XML failed to recognize just one of the valid documents.

Current licensing for the Sun XML parser is restricted to evaluation purposes. The licensing policy for the finished parser has not been disclosed by Sun. The parser can be found at the Java Developer's Connection page.

IBM XML4J (XML FOR JAVA) V 1.1.4

XML4J is a validating XML parser produced by the IBM alphaWorks project. The parser is compliant with the XML 1.0 standard, supports namespaces and DTD manipulation, and implements DOM Level 1. XML4J also provides a SAX driver. XLink/XPointer support is also provided but was not tested.

The documentation for XML4J consists of a tutorial, complete API documentation in HTML, and several sample projects. The tutorial covers all relevant features of the library, with special attention to the library's unique features. The API documentation was generated by IBM's own implementation of JavaDoc. The API descriptions are very complete and the HTML layout has a quality comparable to that of documentation generated by JavaDoc 1.2.

One of XML4J's unique features is the possibility of setting up event handlers and filters at the object model level. This feature enables you to build an object model out of only selected parts of a complex XML document. But XML4J's functionality comes at a price. The JAR file that contains IBM's library is 460 K, which is four times the size of the libraries from Sun and Microsoft, and more than twice the size of the libraries for DCXML and XP.

XML4J was among the top performers in the lot, falling only behind the MSXML parser overall, and behind the lightweight parsers only in the tests that did not involve validation. IBM's parser worked fine on the DOM test. Memory use was higher than that of MSXML and the lightweight parsers on the simple tests but was well below that of other parsers in the validation and DOM tests. Unlike Sun's library, which behaved like the lightweight parsers when no object model was requested, IBM's parser consumed the same amount of memory when used as a SAX driver as when a DOM model was requested.

This parser went through the XML conformance test with very few problems. It failed to recognize only three documents as not being well formed. XML4J is distributed with full source code and a free commercial license. The parser is available for download at IBM's alphaWorks' Web site.

LORIA SXP V 0.72

SXP is an XML library produced by Loria in France, as part of their ambitious XSilfide project. SXP is implemented as a SAX driver and has support for DOM Level 1, namespaces, XLink, and XPointer. The documentation that comes with the library consists of JavaDoc generated HTML pages, but the comments are few and succinct. The documentation was unhelpful in trying to discover the unique characteristics of the library.

SXP's performance was poor in terms of both speed and memory use. SXP was notably slow in the tests that involved validation, taking at least 15 minutes to complete any of them. But poor performance is to be expected of a parser in beta; optimizations are best left until the end of the development process. On the plus side, the DOM test ran unaltered with the SXP library.

On the conformance test, SXP failed to recognize more than 20 documents as valid, and it failed to detect the errors in a like number of documents that were not well formed or were invalid. SXP is free for academic, research, and non-commercial use. The library is available at Loria's Web site.

CONCLUSION
Here, we've reviewed both beta versions and officially shipping products. In the case of MSXML we even looked at a retracted product. Ordinarily, a comparative review such as this would be patently unfair to all products involved. But this is a unique situation. All products are free and all are available on the Web. Whatever games the producers may be playing with labels like "beta" or "early access" or "version 0.72", the products have all been released.

That said, nothing can be concluded about their current performance or about their limitations, if any. As an example, when evaluations began, IBM's XML4J at version 1.0.4 was one of the worst performers in terms of speed, memory use, and correctness. Things changed considerably with the 1.1.4 release, which was one of the best parsers we reviewed. These parsers will certainly continue to improve or "die", and do so at breakneck speed. Our intention was to help you evaluate the current crop of parsers.

As the evaluation results show, different parsers excel under different requirements. The lightweight Ælfred from Microstar is the obvious choice when deploying simple applets on the Web. When XML document validation is required, IBM's XML4J and Microsoft's MSXML are probably the right choices, with excellent standards compatibility favoring the former and outstanding speed being the latter's claim to fame. For applications like the one we're working on, DOM Level 1 compliance is of primary concern because good compliance with that standard makes the parsers almost plug-and-play, and hence completely replaceable.

Our evaluation made it clear that combining XML and Java is definitely viable, that performance in this combination does not have to be an issue, and that conformance with the standard is rapidly improving.

There are several possible categories. The first one is that of very lightweight and very forgiving parsers like Ælfred, for use on the Web. The second is that of welterweight parsers, that implement most core standards, and provide a reasonable balance between performance, size, and features (I expect Sun's parser to evolve in this direction). Then there's that of heavyweight parsers which implement all relevant standards, are very strict about validation, and provide a wealth of additional features, at the expense of some speed and higher memory requirements. I suspect the lightweight and welterweight parsers will converge on an ideal feature set. The heavyweights will remain and a totally different category will result from the integration of parsers with other types of programs, like Web clients and servers. All the lights are green on XML.

Table 1. From version 0.4 to version 1.9, the claimed capabilities are all over the map.
PRODUCT   VERSION         JAR SIZE (KB)   JDK 1.2 SUPPORT   VALIDATING   DOM LEVEL 1 SUPPORT   SAX SUPPORT   NAMESPACES   XLINK/XPOINTER
Ælfred    1.2a            32              Yes               No           No                    Yes           No           No
DCXML     Beta 1          144             Yes               Yes          Yes                   No            Yes          No
XML4J     1.1.4           460             Yes               Yes          Yes                   Yes           Yes          Yes
MSXML     1.9             101             No                Yes          No                    No            Yes          No
Sun XML   EarlyAccess 1   104             Yes               Yes          Yes                   Yes           Yes          No
SXP       0.72            196             Yes               Yes          Yes                   Yes           Yes          Yes
XP        0.4             173             Yes               No           No                    Yes           No           No

Note: As this issue went to press, Sun released a new version of its XML parser. Although it was released too late to be included in this review, quick tests showed that speed and memory performance have improved to levels comparable to those of MSXML and XML4J.

URLs

W3C's XML Page
www.w3c.org/xml

OASIS' XML Web Page
www.oasis-open.org/cover/xml.html#applications

James Clark's XMLTest Suite
www.jclark.com/xml

XML Data
www.w3.org/TR/1998/NOTE-XML-data-0105/

The DOM Level 1 Specification
www.w3.org/TR/1998/PR-DOM-Level-1-19980818/

Microstar Software
www.microstar.com

Docuverse DOM SDK
www.docuverse.com/domsdk/index.html

Lars Marius Garshol's Free XML Software Page
www.stud.ifi.uio.no/~larsga/linker/XMLtools.html

Sun's XML Parser at the Java Developer's Connection Page
developer.javasoft.com/developer/earlyAccess/xml/

IBM's alphaWorks XML Parser Page
www.alphaworks.ibm.com/formula/xml/

Loria's Web Page
www.loria.fr/projets/XSilfide/EN/sxp/

Juancarlo Añez is a Consultant with Modelistica, in Caracas, Venezuela, which provides technical and professional services in urban and regional planning. He can be contacted at [email protected].