Parsing With JavaCC
- By Steve Kapp
- May 24, 2001
Many applications today need to be compatible with files of varying formats. To correctly extract information from these files, their contents must be parsed according to predefined rules. For Java-based code, the Java Compiler Compiler toolkit—also known as JavaCC—is available to facilitate the development of parsing classes.
This article presents a simple parsing problem I solved by employing a JavaCC-based solution. First, I briefly consider the problem domain—a PostScript interpreter in this case—discuss how JavaCC offered a viable solution, and show several code snippets for the parser. By the end of the article, readers will have a basic understanding of how to develop apps that employ JavaCC-based parsers.
The Problem Domain
JRIP, the Java technology-based Raster Image Processor, is a project that models the internal operations of a printer in the Java programming environment. JRIP allows print jobs to be submitted through a variety of mechanisms, such as a file or URL; processes the print job according to the contents of the job; and outputs rasterized images to a variety of output devices, such as a JFC-based window or a JPEG file.
One of the file formats JRIP supports is PostScript, a standard page description language used in many laser printers around the world. For the purposes of this article, PostScript can be considered to be entirely ASCII-based. While this makes it easier to describe, JavaCC-based parsers can handle ASCII, Unicode, or binary file formats.
The structure of a PostScript file is simple. It consists of a series of comments and tokens. Each token stands on its own and is generally independent of the tokens that immediately surround it. PostScript devices must be able to scan the input stream, pick out the individual token strings, create PostScript-specific objects from the token string, and then execute those objects. Within PostScript, each token string is transformed into an instantiation of an object, such as an integer, real, string, name, boolean or array. The PostScript language predefines these object types. The following code shows a simple PostScript file that draws the outline of a box with a 2.5 point wide pen (0.035"). (By convention, all of the listings show the line numbers to the left of the program text.)
01 %!PS
02 newpath
03 100 100 moveto
04 100 200 lineto
05 200 200 lineto
06 200 100 lineto
07 closepath
08 2.5 setlinewidth stroke
09 showpage
Compiler Construction Tools
Compiler construction tools have been around for quite a few years. The best known examples are the famous
yacc and
lex tools from the Unix domain (or their GNU brethren
bison and
flex). These tools, as well as their successors, allow a stream of input data to be parsed based upon two constructs:
- Tokens. A token is a sequence of input characters that has meaning based upon the desired syntax. The first step in parser construction is to extract tokens from the input stream. This generally involves specification of those tokens in some form of regular expressions. Token extraction is also known as scanning or lexing (for lexical analysis).
- Backus–Naur Form (BNF) Productions. A BNF production is a set of token sequences that has meaning based upon the desired syntax. For example, the string "2*3+4" can be abstractly interpreted as "INTEGER MULT INTEGER ADD INTEGER." The second step in the parser construction is to group the tokens together to form the valid sequences for the desired syntax.
Productions can be defined in terms of tokens and other productions. By allowing the definition of productions to use other production definitions, a recursive-descent parser is possible. The example below shows the power of recursive descent parsers, with an eye toward parsing "2*3+4" again.
01 OPERATOR = ADD or MULT
02 EXPRESSION = INTEGER OPERATOR INTEGER or
03 INTEGER OPERATOR EXPRESSION or
04 EXPRESSION OPERATOR INTEGER or
05 EXPRESSION OPERATOR EXPRESSION
While this is a simplistic and non-optimized example, recursive descent parsers permit the parsing problem to be broken down into more manageable chunks. Consider how the above definition would expand if EXPRESSION and OPERATOR were not allowed on the right side of the production.
Tokens and productions, however, are only half the story. Determining the correct token and production definitions for your grammar is a large task, but not the only task you want to do. Most likely, you are not developing a parser for the pure joy of programming—you are developing a parsing application so that something useful may be done with the data from the input stream. For example, the string "2*3+4" should be evaluated to yield an integer with a value of 10.
The next step is to determine what semantic actions to perform when tokens or productions are matched. Not all tokens or productions must have semantic actions associated with them, just the ones that make sense for your grammar. For example, a target grammar may allow string tokens that contain embedded escape characters (such as Java's grammar does). A common semantic action is to "un-escape" the embedded escape characters immediately. Obviously, this would not be necessary for the boolean tokens true and false.
In general, these are the steps you need to take to write a parsing application:
- Understand the grammar!
- Construct regular expressions that describe the token strings.
- Construct productions that define the valid syntax of the grammar.
- Add semantic handling as you develop.
- Test, test, and test even more.
Parsers are very complex beasts and quite difficult to implement correctly on the first pass. An incremental development model should be considered. Also, make sure to have a supply of test data available. A ready supply of test data will pay handsome dividends during development.
JavaCC in all its Glory
JavaCC offers an excellent toolkit for generating parser classes in Java. JavaCC generates top-down, recursive descent parsers. The top-down nature of JavaCC allows it to be used with a wider variety of grammars than other traditional tools, such as yacc and lex. Another difference between JavaCC and yacc/lex is that JavaCC contains all parsing information in one file. The convention is to name this file with a .jj extension.
To use JavaCC, .jj file was developed that contained the grammar rules for PostScript. .jj files generally contain several elements:
- Portions of the parser class that will be generated
- Portions of the token manager class that will be generated
- List of tokens
- List of BNF productions
For the PostScript interpreter, the tokens are generally independent of each other. This allows the main production to be extremely simple, as shown in Listing 1.
In a nutshell, this main production defines a method, translationUnit(), that may be called repeatedly until the end of the input stream has been reached. This construct allows the parser setup to be very simple:
- Open the input stream.
- Construct a parser object.
- Call the parser object's translationUnit() method until it returns false or an error is detected.
- Execute the created PostScript objects.
- Close the input stream.
We'll look at the rest of the productions later, but for now, let's finish up the mechanics of how to set up the parser. Once the .jj file is available, the javacc tool is run upon it. javacc is a Java-based tool that transforms the contents of a .jj file into a full, Java-based set of parser classes. The parser classes may then be compiled with javac and included in your code base.
JavaCC was originally developed at SunTest, the Java test automation unit of Sun Microsystems. Currently, MetaMata is freely distributing JavaCC as a self-extracting class file. The distribution includes the documentation, as well as many example grammars, such as C/C++, HTML, IDL, Java, and appropriately enough, a grammar for .jj files.
Class Hierarchy
The implementation of a PostScript interpreter was fairly complex—but don't fret, we will not examine the arcane details. On the other hand, we will need to briefly discuss the base PostScript classes. These classes implement the PSObject interface, which contains methods for executing an object, retrieving its access rights, and copying the object. The class hierarchy is depicted in Figure 1.
Figure 1. PostScript class hierarchy.
Most of the PostScript base classes encapsulate one value, such as the PSString, PSBoolean, and PSInteger classes. Other PostScript base classes, such as the PSArray class, are container classes for holding PostScript base classes. The parser is responsible for creating the objects. In and of itself, the creation of the PostScript objects is not very exciting. The interesting part for a PostScript interpreter occurs when the objects are executed. Viewed in this light, the parser acts as a producer of objects for the PostScript interpreter to consume. Next, we'll look at how to construct the parser that generates these objects.
Defining Tokens
There are four types of tokens for use with JavaCC. The most common type of token, defined by the JavaCC keyword TOKEN, defines a sequence of characters that form a valid Token object. Tokens are the objects used by the parser to formulate productions. For example, the code to the right shows the token definition for a PostScript number.
01 TOKEN : {
02 | ) ()* >
03 | | ) >
04 | )+ "#" ()+ >
05 | <#type1_real: (="">#type1_real:> | ) ()*
()* >
06 | <#type2_real:>#type2_real:> ()*>
07 | <#digit: ["0"-"9"]="">
08 | <#sign: ["-","+"]="">
09 | <#decimal: ["."]="">
10 }
#decimal:>#sign:>#digit:>
These definitions identify three types of valid tokens:
PS_INTEGER,
PS_REAL, and
PS_RADIX. These three token types may be used when defining valid productions. A number of helper definitions are also illustrated in this construct. The helper definitions have a "#" prefix and will never be matched as tokens themselves. They only exist to aid in the creation of real tokens.
The second type of token definition is used to skip the meaningless white space that surrounds valid tokens. As shown below, this type of token is defined by the JavaCC keyword SKIP. Strings matched by SKIP are simply thrown away.
01 SKIP : {
02 " "
03 | "\r"
04 | "\n"
05 | "\t"
06 }
To discuss the next two token keywords (
MORE and
SPECIAL_TOKEN), we'll use PostScript comments as an example. Oftentimes, comments are of importance to a parser—consider a "pretty print" program. The PostScript parser picks out the PostScript comments and associates them with the closest token. PostScript has two types of comments. The first is a normal type of comment, similar to the vanilla-flavor Java comment. These comments are generally used for program documentation purposes by the PostScript program developer.
The second type of comment, called Document Structuring Convention (DSC) comments, contains metadata about the PostScript file itself, such as the paper size or TrueType fonts required. Generally, PostScript interpreters ignore both standard and DSC comments for page rendering. However, the DSC comments may be used to control printing machinery, such as staplers and duplexers, that a PostScript job requires. DSC comments are analogous to Java's javadoc comments.
Both types of comments recognize three valid end-of-line sequences: "\r", "\n", and "\r\n". Standard comments begin with a "%" and run to the end of the line, while DSC comments begin with a "%%" and run to the end of the line. The token definitions for comments are shown in Listing 2.
Listing 2 illustrates several features of JavaCC-based parsers, so we will analyze them in detail. Starting at line 1, MORE is a JavaCC keyword that lets the parser know the matched string is only part of the token and not the entire token. MORE allows complex regular expressions to be broken down into simpler expressions.
In this case, we are also changing state to a user-defined lexical state, either IN_NORMAL_COMMENT or IN_SPECIAL_COMMENT, when one of the start comment tokens is matched. Just like other compiler compiler tool-kits, JavaCC parsers have the notion of a lexical state. Lexical states afford the luxury of matching a token string only when in that state. By default, the lexical state is DEFAULT (another JavaCC keyword). There is no requirement to use lexical states in a parser—the default state is fine for many parsers.
Once the lexical analyzer is in one of the comment states, characters are accumulated (thanks to the MORE keyword) if the characters do not form an end-of-line sequence (see line 6). Notice that while the characters are being accumulated, the state does not change—normal and DSC comments both accumulate characters in this manner.
When an end-of-line sequence is matched, a token of type SPECIAL_TOKEN is generated. A full-fledged token could have been generated, but in this PostScript application, the parser should ignore comments. Tokens of type SPECIAL_TOKEN are ignored by the parser, but are still available if needed by attaching the special token to the next full-fledged token that is generated. This is useful for full-fledged compilers, source code metrics programs (e.g., how many lines of code without comments), "pretty-printers," and other apps. In fact, a list of special tokens is associated with next full-fledged Token object—think of the case where many comments exist on successive lines. To access the list of special tokens associated with a Token object, loop over the list of special tokens as shown below. Token objects will be discussed in more detail later.
01 Token token = tokenFromParser;
02 while (token.specialToken != null) {
03 System.out.println(token.specialToken);
04 token = token.specialToken;
05 }
Defining Productions
So far, we have discussed defining tokens. Tokens are the low-level building blocks of any parser. The next step is to use these building blocks to construct a
parse tree. The parse tree contains all combinations of tokens permitted by a particular grammar. These combinations are defined as productions in the .jj file.
JavaCC features four types of productions. We have already covered one of them, the regular expression production, that is used to produce Token objects. Next, we'll concentrate on BNF productions, which allow developers to specify the parse tree. BNF productions are specified as normal Java methods and may take arbitrary input parameters and return an arbitrary value. This flexibility makes BNF productions very powerful and the keystone of parser development.
To understand how to structure a production, we will examine the string production from the PostScript interpreter (see Listing 3).
Line 1 defines a production named stringType(). This production will result in a full-fledged method being generated in the parser code once javacc is executed upon the .jj file. stringType() accepts one parameter and returns a PSObject. Productions do not have to accept parameters or return values. The ability to do so allows the calling object to pass values down the parse tree, as well as receive values passed back up the parse tree. We will examine this ability in more detail later.
The braces and parameters at the end of line 1 define local variables and Java code that will be generated at the beginning of the stringType() method. If the production does not require any extra variables or code, these braces may be empty.
Line 2 begins the meat of the production. The stringType() production is invoked whenever a Token object is generated of type PS_STRING (line 2) or of type HEX_STRING (line 7). In JavaCC parlance, these sequences of strings, tokens, and productions are called expansion choices. In the case of a Token object of type PS_STRING, the local variable t (defined in line 1) is set to the Token object and then t is acted upon. While it is not strictly necessary to have Java code associated with the expansion choices, it is usually required if the parser is not simplify validating the syntax of a grammar. After all, you want your parser to do something.
Both line 2 and line 7 illustrate very simple expansion choices—each is composed of only one token. Generally, there is an arbitrary sequence of token, strings, and productions that define the valid sequence of a production. Looking at the calculator example again, a valid production may be:
01 float expression() : { Token t1; Token t2; } {
02 t1 = expression() "+" t2 = expression() {
03 return (Float.valueOf(t1.image) +
Float.valueOf(t2.image));
04 }
05 t1 = expression()"*" t2 = expression() {
06 return (Float.valueOf(t1.image) *
Float.valueOf(t2.image));
07 }
08 }
Manipulating Tokens
Up to this point, we have examined how to set up regular expression productions and BNF productions. We have also glossed over an important topic: the Token and token manager classes. These classes are generated by
javacc and are the base of the lexical analysis portion of the parser.
The token manager class is responsible for the lexical analyzer portion of the parser. The token manager class contains methods and attributes that may be accessed in the semantic actions associated with regular expression productions. The most important of these attributes is image, a package-scope attribute that contains the matched sequence of characters that the regular expression production is about to turn into a Token object. The image attribute is an object of type StringBuffer. It is often convenient to manipulate this string buffer before the full-fledged Token object is created. Line 3 in the code below illustrates how the image attribute of the token manager is manipulated in the regular expression production for a PostScript string.
01 TOKEN : {
02 ) ()*
()> {
03 image = expandEscapeCharacters(image);
04 }
05 | <#not_rparen:>#not_rparen:> |
| ~[")"]) >
06 | <#esc_lparen: "\\("="">
07 | <#esc_rparen: "\\)"="">
08 | <#lparen: "("="">
09 | <#rparen: ")"="">
10 }
#rparen:>#lparen:>#esc_rparen:>#esc_lparen:>
The most compelling reason for manipulating the image attribute of the token manager is that the sequence of matched characters is in a mutable form. Once the regular expression production completes its task and creates a Token object, the sequence of characters is no longer mutable. Token objects maintain their own public image attribute; however, this attribute is an immutable String object for tokens and not a StringBuffer object. (Yes, this is an unfortunate choice of attribute names.) The image attribute of a Token object may be accessed or replaced at any time. The calculator example (see this page) illustrates an example where this attribute is parsed to create a numerical value.
In addition, tokens maintain track of their position in the input stream via the attributes beginLine, beginColumn, endLine, and endColumn. As shown in the code on special tokens (see page 32), tokens also maintain a specialToken attribute to access the list of special tokens ignored by the parser.
Adding Methods to the Parser
Thus far, only two types of productions have been covered. Regular expression productions specify tokens, while BNF productions specify the expansion choices that make up a grammar. The remaining two types of productions are used to add arbitrary Java code to the parser and token manager classes. When javacc generates the parser, several classes are generated—the two most important are the parser and token manager classes. The parser class handles the BNF productions. These classes are named according to the mandatory PARSER_BEGIN (NameOfParser) construct in the .jj file. For example, the PostScript parser's name is PSParser, which instructs javacc to generate the PSParser.java and PSParserToken-Manager.java source code files.
Token manager declaration productions, identified by the JavaCC keyword TOKEN_MGR_DECLS, insert methods and attributes to the generated token manager class. An example of where this may be useful would be inserting an expandEscapeCharacters() method into the token manager. This routine would be available to the Java code associated with the regular expression productions. The code below illustrates the token manager declaration production.
01 TOKEN_MGR_DECLS : {
02 public String expandEscapeCharacters(String s) {
03 // Perform something useful in body of method.
04 }
05 }
Javacode productions, identified by the JavaCC keyword
JAVACODE, are workaround mechanisms for building JavaCC-based grammars. It is not always easy to break down a grammar into a set of productions. Indeed, sometimes it is not possible or desirable. Javacode productions permit users to define productions that can do whatever they want. In contrast with the token manager declaration productions, Javacode productions are added into the parser class.
An excellent example (lifted directly from the JavaCC documentation) of a Javacode production is the case when you have matched an opening brace but are not interested in any other tokens until the closing brace is found. Listing 4 shows how to achieve this effect with a Javacode production.
Layering Productions
Astute readers may have noticed that we are building this parser in a bottoms-up approach. Now that you have seen how to define productions and expansion choices, you can examine how to layer productions on top of each other to build up a grammar. As mentioned earlier, PostScript has a very simple grammar, but even in a simple PostScript parser, we can take advantage of production layering.
In Listing 1, we briefly discussed translationUnit(), the main production. This production contained six expansion choices, five of which are other productions (including the stringType() production examined earlier). The last expansion choice is , a JavaCC keyword used to mark the end of the input stream. Please note that most of these productions accept an input parameter, originating from an input parameter to the translationUnit() production, and return a value, which happens to be ignored in these cases. This production will return true until the end of the input stream is reached.
A more complex parser would contain more layering, but in this case, translationUnit() is the main production. This production is called from the parser's main() routine as shown in Listing 5.
The main() routine calls the translationUnit() method repeatedly until it returns false, which indicates that the end of the input stream was encountered. The main() routine also executes the objects generated by the lower level productions. Those objects could have been passed back up the call stack, but in this instance, those objects were cached in the environment (the m_environment variable) within the translationUnit() method. The main() routine re-trieves the objects from the environment as needed (see line 10 in Listing 5).
Line 16 in Listing 5 shows a catch block for Parse-Exception objects. If the parser cannot match a grammar, an object of this type is thrown. This exception does not need to be declared for each production—javacc will insert it automatically for you.
Test, Test, and Test Again
There is no doubt about it: Parsers are tricky beasts. Any software that directly uses and relies upon regular expressions is bound to be complicated. To combat this complexity, two approaches should be taken.
- Ready supply of test cases. There should be enough test cases on hand to validate that the parser operates acceptably. Exactly how many is dependent upon the complexity of the grammar. It can even be argued that the test cases should be available before any development begins. Forget about testing every variation—this approach is generally not feasible with parsers. Pick a broad subset of test cases to validate the basic workings of the parser, and then add problem test cases as they are discovered.
- Incremental testing. In my experience, the best ap-proach is to test in an incremental fashion. Verify that the first regular expression production matched the desired string, then verify the next regular expression production, and so on. After the tokens have been successfully identified, then concentrate on the BNF productions and the expansion choices. Add Java code to the productions in an incremental fashion as well. Finally, regression tests should be performed after every change—you may find that a faulty new production will prevent your correct existing production from ever being executed.
Conclusion
This article has described how the JavaCC toolkit can be used to generate parsing classes. Once a grammar file has been defined, the javacc tool converts the .jj file into a set of Java source code files that contain pieces of the parser. Compared to hand-coding a parser, development with JavaCC is quicker and easier to maintain. If a parsing project is on your horizon, you should consider JavaCC.
Acknowledgements
The JRIP project and PostScript interpreter were initiated while I was employed at Questra Corp. I would like to thank Steve Mink, who contributed to early drafts of this article, as well as Rich Hecht, who provided management support for JRIP.
Resources
- Aho, Alfred V., R. Sethi, and J.D. Ullman. Compilers, Principles, Techniques, and Tools. Addison–Wesley, 1986.
- McManis, Chuck. "Looking for lex and yacc for Java? You don't know Jack," http://www.javaworld.com/javaworld/jw-12-1996/jw-12-jack.html.
- Enseling, Oliver. "Build your own languages with JavaCC," http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools.html.