Dynamic Java Program Corpus AnalysisPart 1: The Analyzer

Program corpus analysis is important in the optimization of runtime systems. Conventional linguistic analysis is static in nature and cannot reflect dynamic behaviors revealed by versatile object-oriented programming languages. We propose a pattern-based runtime profiler in this article. Unlike a conventional profiler or runtime visualization tools, representative program corpora accumulated and benchmarked not only shows monolithic functions that introduce excessive runtime overhead, but also reflect their correlated code patterns. We propose a pattern-based analysis to address a program runtime bottleneck in a sequence of method invocations. It will reveal more semantic meanings in performance bottleneck rendered by object-oriented programming systems.

Linguistic analysis of text corpus is done by parsing token strings sequentially, as in the spoken language of human beings. However, program execution is not sequential and polymorphic; operations largely complicate execution trace. For example, token strings A, B, and C are a program corpus. Pf(A|B) is the condition probability that A will occur consecutively before B, and Pb(C|B) with C will occur after B. If Pf(A|B) and Pb(C|B) are relatively small, B may be a frequently used candidate code pattern. However, program behavior is not static and varied due to dynamic binding and polymorphic operations associated with the context where B resides. If we want to fit the lexicon model in program corpus, we must consider using a runtime trace to detect the dominant convergence of code patterns.

Code patterns (also called idioms) are low-level patterns specific to a programming language. They reflect a style that experienced programmers routinely apply in their work. The same recurrent structures are often used and may be embedded in certain design patterns, such as the Singleton design pattern in C++ or Smalltalk. It is hard to find a specific pattern from a large program corpus due to the diversity of dynamic binding and inheritance. We have built a Java runtime profiler in a pattern-based approach to detect specific control structures that are central to what we call control patterns.1 These control patterns can be performance bottlenecks and will be highlighted in our visualization tool. They can also be recurrent structures in a problem domain and can be used as a pattern-finding system. If applied in a software vulnerability penetration tool, it is a good testing tool for finding interface incompatibility.

In the section titled "The Design and Implementation of the Analyzer," the design and implementation of the profile analyzer is discussed. The meanings of the evaluation values of the programs are also discussed in this section. In Part 2 of the article, a suite of Java programs is collected for our benchmark programs. The results of these benchmark programs by our analyzer are discussed. Finally, our system will be compared with the profile produced by the original software implementation of the JVM.2,3

The runtime information of Java programs can be obtained by running the Java programs on our modified JVM software implementation. In this section, we design and implement an analyzer to dissect several object-oriented program behaviors, seeking possible patterns to find potential performance improvement.

Object-Oriented Program Behaviors
Method Invocation Localities: Hardware caches in a computer system can improve performance because there is a reference locality of memory space in programs. Memory reference locality refers to the allocation of a small range of memory space in a given period of time. Similarly, we wonder if there exists method invocation locality during object-oriented program execution. Method invocation locality relates to method invocations confined to a small set of methods or classes over a period of time.

The program segment in Figure 1 traverses a tree and gives each node a number to represent the traversal order. Despite the number of classes and methods in the whole program during the execution of this program segment, only two classes and seven methods are involved. They are the stack and ce classes, and the stack.empty(), stack.push(), stack.pop(), ce.setOK(), ce.setValue(), ce.hasomeChildren() and ce.nextChild() methods. As a result, the method invocation locality might be a behavior to characterize object-oriented programs.

Hwang Figure 1
Figure 1. Segment of a tree traversal program.

A method invocation consists of three parts: receiver class, method class, and method. In the analyzer, localities of receiver class, method class, and method are evaluated, respectively. To define the localities, a window size is first chosen to be the range to count the number of classes or methods. For receiver class locality and method class locality, the total number of evaluated program classes is chosen as the window size. For method locality, the total number of methods of the evaluated programs is chosen as the window size. The receiver class locality {method class locality, method locality} is then defined as the ratio of the receiver class count {method class count, method count} in this window to the window size. By the definition of locality values, one can easily realize that method invocation sequences with smaller locality values exhibit better locality phenomenon.

Figure 2 shows the method invocation sequence produced by running the program segment of Figure 1. We hypothetically define the window size as 12; the receiver class locality {method class locality, method locality} of this window then is 2/12 {2/12, 7/12}.

Hwang Figure 2
Figure 2. Method invocation sequence of program segment in Figure 1.

For the whole method invocation sequence of a program execution, the window is shifted from the beginning to the end of the method invocation sequence, and the respective localities of each shift are evaluated. The average of these locality values is then used to represent the locality of a program execution.

Control Patterns: During object-oriented program execution, the action to invoke a method execution is known as a control transformation. A method invocation sequence records all the control transformations that occurred during program execution. They might exhibit some recurrence patterns in the method invocation sequence, and these recurrent patterns of control transformation are called control patterns. In this article, we evaluate three kinds of control patterns: consecutive, hierarchy consecutive, and loop-N patterns. Examples will be explained later.

Consider the method invocation sequence in Figure 2. The first and second method invocation are an instance of a receiver class consecutive pattern and a method class consecutive pattern. This is because they are consecutive and their receiver classes are the same (stack), and their method classes are the same (stack). Consider the sixth to fourteenth method invocations. The same method invocation is repeated every three method invocations. As a result, they are an instance of the receiver class loop-3 pattern, the method class loop-3 pattern, and the method loop-3 pattern.

Let's look at another example in Figure 3. At the left of this figure is a class hierarchy diagram, and at the right is a program segment. The TextArea class is a subclass of the TextComponent class; the Component class, and the Object class. The method appendText() is defined in the TextArea class, and method enable() is defined in the Component class. The correspondent method invocation sequence of the last four statements is listed in Figure 4. The first and second method invocations both invoke the enable() method of the class Component, so they are an instance of the method consecutive pattern. Now let's focus on the method classes of these four method invocations. All of the method classes belong to the same class hierarchy, although two of them are in the Component class, and two of them are in the TextArea class. As a result, these four method invocations are instances of the class hierarchy consecutive pattern. When the situation occurs on the receiver class, it is called a receiver hierarchy consecutive pattern.

Hwang Figure 3
Figure 3. UI program example.

Hwang Figure 4
Figure 4. Method invocation sequence of the program in Figure 3.

Other Behaviors: An object can receive a message (method invocation) defined either in the class of the receiver or in the superclasses of the receiver. When an object executes the method defined in the class of the object, the method class is the same as the receiver class. When an object executes the method defined in the superclasses of the object, then the method class is not the same as the receiver class, but it is in the same class hierarchy. The analyzer we designed can measure the distance between the receiver class and the method class when they are in the same class hierarchy. If the method lookup algorithm is implemented by searching the class hierarchy, then the time spent in the method lookup will be affected by the distance between the receiver class and the method class.

Take the method invocation sequence in Figure 4 as an example. The receiver class of the first and second method invocations is TextArea, while their method class is Component. Component is a direct superclass of TextComponent, and TextComponent is a direct superclass of TextArea, so the distance between Component and TextArea is 2.

Other Java program behaviors our analyzer measures and analyzes are: average method size of all methods in a program, method size distribution, the number of native methods existing in a program, and the invocation count of native methods during program execution. How these behaviors affect the program performance will be explained the Part 2 of this article.

Analyzer Architecture
The architecture overview of the analyzer is shown in Figure 5. The static information file is first read line by line to construct the database. After reading the entire static information file, the database contains the classes and method information that is used during the program execution as well as the events produced during the program execution.

Hwang Figure 5
Figure 5. Analyzer architecture overview.

Figure 6 shows the database infrastructure. The methods defined in the same class are linked together and are pointed to by the class defining them. Each method node contains its name, signature, and size. Classes are linked by their inheritance relationships. Each class node contains its name, a pointer that shows the class node of its superclass, a pointer showing the methods defined in it and two integer values (left-value and right-value) used in class inclusion tests.4 The left- and right-value of the class node will be further discussed in the "Implementation Notes" section. The database also contains an event list where the event entries are indexed by the data in the dynamic information file. Each event entry contains three pointers. The ReceiverClass points to a class node that represents the receiver class of this event. The MethodClass also points to a class node that represent the method class of this event. The Method points to a method node that represents the method of this event.

Hwang Figure 6
Figure 6. Database infrastructure.

After the database is constructed, it is ready for the analysis engine. The analysis engine is designed to analyze the various program behaviors described in the section "Object-Oriented Program Behaviors." The analysis engine takes the database and the dynamic information file as the inputs and provides its analysis results to the presentation component of the analyzer. The presentation component is responsible for organizing and showing the analysis results to the user.

A screenshot of the running analyzer is presented in Figure 7. In the center of the analyzer is a method invocation graph. The x-axis represents time. The y-axis represents the receiver class, method class, or method, depending on what the user chooses in the Graph Menu. The ranges of the method invocation graph can be appointed by the users. When a new range is appointed, the ReDraw button must be clicked to redraw the method invocation graph. When the user clicks on the method invocation graph, the position of the clicked point will be shown to the right of the ReDraw button. The right-hand side of the method invocation graph shows the various analysis results revealed when the user requests behavior analysis from the menu.

Hwang Figure 7
Figure 7. A snapshot of the analyzer.

Implementation Notes
Class Hierarchy Encoding: Class inclusion tests4,5 are usually needed during analysis. For example, when analyzing the class hierarchy consecutive pattern, two classes should be tested to see if they are in the same class hierarchy. Java is a single inheritance programming language, so the relative numbering method is adopted for the class inclusion tests in our analyzer implementation.

Each class node in the analyzer database has a pointer showing its direct superclass node (as described earlier). First, we use this pointer to collect all the direct subclasses of each class node, and perform the depth-first tree traversal to traverse the class nodes and assign the correspondent left- and right-values.

When the class hierarchy is encoded by the relative numbering method, a class inclusion test can be finished by two integer comparisons. Take the class hierarchy in Figure 8 as an example. It is already encoded by the relative numbering method. If we want to see if class F is a subclass of class A, we compare the left-value of class F with the left-value and right-value of class A. If the left-value of class F is between the left- and right-value of class A, then class F is a subclass of class A. Otherwise, class F is not a subclass of class A. When 5 is between 0 and 6, class F is a subclass of class A. As another example, consider class C and class E. The left-value of class C is 2, but 2 is not between 3 and 5. Thus, class C is not a subclass of class E.

Hwang Figure 8
Figure 8. Class hierarchy encoding example.

Control Pattern Evaluation: This describes how the percentages of control patterns in a method invocation sequence are calculated. Different control patterns are calculated separately. Only one control pattern is calculated for each pass of the method invocation sequence. Figure 8 shows how the analyzer calculates the percentages of control patterns in a method invocation sequence.

Hwang Figure 9
Figure 9. Predictor for evaluating control patterns.

The method invocation sequence is fed into the predictor. The predictor monitors several internal states to predict the next method invocation. The output of the predictor is compared with the next method invocation input from the method invocation sequence. If the output of the predictor is the same as the input method invocation, then the input method invocation and the method invocations in the predictor are an instance of the evaluated control pattern. The input method invocation is then fed into the predictor to update its internal states.

The predictor is configurable when set up with different internal state configurations that correspond to different control pattern evaluations. To evaluate the consecutive patterns, set up the predictor with one internal state to record the previous method invocation. The internal state is used as an output to compare with the next input method invocation. To evaluate the loop-3 pattern, we set up the predictor with three internal states to record the previous three method invocations. The third internal state is used as an output to compare with the next input method invocation. With these techniques, the percentages of control patterns in a method invocation sequence can be easily evaluated.


  1. Huang, Shih Kun. Optimizing Run-Time Behaviors in Object-Oriented Programing Systems. Ph.D. Dissertation, Institute of Computer Science and Information Engineering, National Chiao-Tung University, HsinChu, Taiwan, 1996.
  2. Lindholm, Tim and Frank Yellin. The Java Virtual Machine Specification, Addison–Wesley, Reading, MA,1997.
  3. Meyer, Jon and Troy Downing. Java Virtual Machine, O'Reilly and Associates, Sebastopol, CA,1997.
  4. Vitek, Jan, R. Nigel Horspool, and Andreas Krall. "Efficient Type Inclusion Tests," OOPSLA'97, Atlanta, GA, 142–157, Oct. 1997.
  5. Krall, A., J. Vitek, and N. Horspool. "Near Optimal Hierarchical Encoding of Types," ECOOP'97, Jyvaskyla, Finland, 128–145, June 1997.