Global System Analysis at WorkIntroducing the compilation strategy of SmallEiffel, the GNU Eiffel compiler

Column Editor's note: SmallEiffel is a highly successful open source implementation of Eiffel developed under the direction of Dominique Colnet at LORIA in France. It uses a number of particularly original and innovative compilation techniques. In this column, Colnet and his colleague Olivier Zendra raise the veil and show how they have taken advantage of global system analysis techniques to achieve a high degree of performance in the SmallEiffel compilation process and the resulting code.

—Bertrand Meyer

Although program optimization has always been a major preoccupation of compiler designers and implementers, most of the tools currently available do not feature one of the most efficient methods—global system analysis. On the contrary, SmallEiffel, the GNU Eiffel Compiler strongly relies on global system analysis in order to compile the system's Eiffel source code more efficiently and generate a highly optimized executable. In this column, we will present the SmallEiffel compilation strategy and highlight the benefits of global system analysis, which makes it possible to reach C-like performance at runtime, while keeping a high-level, purely object-oriented design with the Eiffel language.

EXAMPLE HIERARCHY
In order to present the SmallEiffel's compilation strategy, throughout this article we will use the following class hierarchy, where the abstract type, FRUIT, features two concrete subtypes: APPLE and PEACH (see Figure 1).

Figure 1
Figure 1. Source code of the FRUIT hierarchy.

LIVE CODE COMPUTATION
The main advantage of whole system analysis is that it allows the computation—at compile time—of all of the system's live code, that is, all the code that is present in the source code and can actually be reached at execution time. Code that cannot be executed because it is unreachable is known as dead code. By knowing at compilation time all the live code, it will be possible, as we shall see later, to customize the generated executable according to the set of live types—types that can actually be instantiated at runtime.

Global system analysis is one of the earliest stages in the SmallEiffel compilation process. Live code computation is closely linked to the inventory of classes that may have instances at runtime. Live code computation begins at the system's root or the system's entry point (the equivalent of the main function in languages such as C and C++) and recursively follows the routine calls and instantiation instructions (unless they have already been reached). Note that in order to keep live code computation simple and fast, no flow analysis is done. The result of this first computation is completely independent of the order of instructions in the compiled program. Furthermore, when a conditional instruction is reached, all the alternatives are assumed to be possible. Although this gives a non-optimal analysis, we found that practically speaking, the impact on the generated code was very limited, while allowing a significant speedup of the compilation process. At the end of the process, when no new code has to be explored, the whole live code of the system is known. As a first example, we shall use the simple root class shown in Figure 2.

Colnet Figure 2
Figure 2. The MAIN example.

When compiled with SmallEiffel, this program is translated into an executable that contains only what it actually uses at runtime: the PEACH class data and the code for its display procedure. This is the only code that can be reached from the system's root, which is procedure main in class MAIN. The code for the display_more procedure, for example, is not needed and thus not compiled. Similarly, although the source code for class APPLE may be sitting beside the source code for PEACH, it is not required and therefore it is not analyzed.

Although all this may seem fairly obvious, at least for modern compilers, there is another great benefit brought by global analysis that is hidden in the MAIN example. Indeed, you have to remember, since Eiffel is a pure object-oriented language, the peach.display call is by definition (and like every Eiffel call) a call subject to dynamic dispatch, (i.e., late binding) or—to reuse C++ terminology—is a virtual call. This design choice was made by Bertrand Meyer, the designer of Eiffel, since it gives the application developer the most powerful mechanism. However, developers are often concerned by the cost of this dynamic dispatch, which is why in C++ virtual routines must be marked explicitly by the virtual keyword, thus allowing the compiler to optimize away the other static calls. With global analysis, SmallEiffel is able to determine at compile time that this system only uses the PEACH class. As a consequence, there is no need for any code performing dynamic dispatch, and the peach.display call is compiled into a static, direct fast call.

This implementation of the peach.display call is the first example of the code customization performed by Small-Eiffel, thanks to global system analysis.

CODE CUSTOMIZATION
Code customization is one of the most important reasons for the performance of the programs compiled by the SmallEiffel compiler. Indeed, global system analysis (at the beginning of the compilation process) brings in crucial information that can be used later on, when producing the executable.* Generating code that has been customized or specialized (thanks to this information) gives a great boost in performance.

For instance, let's consider again the MAIN example of Figure 2 and the related classes. Although the code for the display procedure is defined in class FRUIT, this code is customized for a target of type PEACH.

The peach.display call will then execute the binary code corresponding to the display procedure exactly as if it had been defined as follows in the source code for the PEACH class.


 display is
  do
    print("Fruit name is ") -- (1)
    print("PEACH")          -- (2)
    print("%N")             -- (3)
  end
Note that in this example, the code that was automatically generated for line (2) directly corresponds to print("PEACH"), not to the print(name) the developer wrote. Indeed, compile time global analysis and code customization spotted the constant attribute access and replaced it directly with its statically computed value, hence, a speedup at execution time.

Actually, all live routines are customized like this, in the context of each type they belong to.

A slightly more complex example, the MAIN2 system allows us to further illustrate the benefits of this compilation method in object-oriented languages (see Figure 3).

Colnet Figure 3
Figure 3. The MAIN2 example.

As the Eiffel source code for class FRUIT (see Figure 1) shows, display_more is a template routine, which means that it is a routine's code completely written in FRUIT, but depends on information that is not completely available in FRUIT. This template routine indeed uses the deferred feature, stone_fruit of FRUIT, that is undefined and only available in its concrete descendants, PEACH and APPLE. It becomes the True constant in class PEACH and the False constant in class APPLE.

As with any other procedure, the template routine display_more is customized at generation time to each live type it belongs to. When the target of the fruit.display_more call is actually a PEACH, the stone_fruit expression inside the display_more function is always True. Thus, in this case, the then part of the if-then-else statement is always executed while the else part is never reached. Taking this into account, the display_more routine customized in the context of APPLE looks like this:


display_more is
   -- Automatically customized for PEACH
   do
    print("Fruit name is ")
    print("PEACH")
    print("%N")
    print("This is a stone fruit.")
   end
Conversely, in class APPLE, the else part is always executed and the Eiffel code for class APPLE looks like this:


display_more is
 -- Automatically customized for APPLE
   do
    print("Fruit name is ")
    print("APPLE")
    print("%N")
    print("Not a stone fruit.")
   end
In the MAIN2 example (see Figure 3), unlike in MAIN (see Figure 2), it is impossible to decide at compile time whether the target of the fruit.display_more call is going to be a PEACH or an APPLE. Consequently, actual dynamic dispatch has to take place at runtime and code has to be generated to implement it.

DYNAMIC DISPATCH IMPLEMENTATION
Many current compilers (for example, most Java and C++) rely on Virtual Function Tables (VFT) to implement dynamic dispatch. Conceptually, the VFT system works as follows. The receiver type of a call is used as an index in a two-dimensional table, while the second index is named by the routine to invoke. The VFT element pointed at by those indexes is the address of the code corresponding to the function the call has to be dispatched to. An indirect function call is then made by de-referencing this pointer, and the execution of the call goes on.

Unlike these compilers, SmallEiffel uses a Binary Tree Dispatch (BTD) technique to implement late binding. Thanks to global system analysis, the compiler knows at compile time the set of all the possible types for the receiver at any given call site. Each type is identified by a unique integer, the type ID. Since every object has a field that contains its type ID, it is possible for each call site to generate code that, at execution time, gets the type ID stored in the receiver object and compares it to the set of possible types for that specific call. Once the object type has been matched, the exact routine is called with a direct call. The following example illustrates this by showing the generated C code (in simplified C-like form) for the fruit.display_more call in MAIN2:


DISPATCH_display_more(FRUIT* receiver) {
 if (receiver->type_ID <= peachid)="" {="" peach_display_more((peach*)receiver)="" }="" else="" {="" apple_display_more((apple*)receiver)="" }="" }="">
Note that since the cast is information for the C compiler, it does not cost anything at runtime. Furthermore, this cast is safe, because the object type has been actually tested just before.

This code snippet shows one advantage of BTD over VFT dispatch—the fact that the calls are direct, and not indirect ones, and are thus faster. Another advantage with VFT is that it is very difficult to avoid having a function pointer, hence, an indirection. On the contrary, with BTD the compiler can easily decide to replace the call to the correct routine by the routine body itself, like this:


DISPATCH_display_more(FRUIT* receiver) {
 if (receiver->ID <= peachid)="" {="">print("Fruit name is ");
    print("PEACH");
    print("%N");
    print("This is a stone fruit.");
   }
 else {
    print("Fruit name is ");
    print("APPLE");
    print("%N");
    print("Not a stone fruit.");
   }
 }
This inlining of dispatched routines provides an important optimization because it removes the cost of one routine call and because it simplifies the code at each branch, making it easier for the C compiler to further optimize.

In order to make the receiver type tests as fast as possible when more than two receiver types occur at a given call site, a binary tree of tests is generated that performs a dichotomic search on the type ID. Then the set of all possible types has to be sorted and recursively split until all the subsets of possible types have only one element.

A call site with a high degree of polymorphism could look like this:


DISPATCH_mixed(FRUIT* f) {
  if (f->ID <= appleid)="" 4="" */="" if="" (f-="">ID <= peachid)="" 2="" */="" if="" (f-="">ID <= cherryid)="" 1="" */="" cherry_mixed((cherry*)f);="" else="" peach_mixed((peach*)f);="" else="" if="" (f-="">ID <= bananaid)="" 3="" */="" banana_mixed((banana*)f);="" else="" apple_mixed((apple*)f);="" else="" if="" (f-="">ID <= orangeid)="" 6="" */="" if="" (f-="">ID <= coconutid)="" 5="" */="" coconut_mixed((coconut*)f);="" else="" orange_mixed((orange*)f);="" else="" if="" (f-="">ID <= plumid)="" 7="" */="" plum_mixed((plum*)f);="" else="" olive_mixed((olive*)f);="" }="">
The fact that the dispatch code, and its execution cost, grows with the number of types could be cause for some concern. In fact, this code executes very well on modern processors and is generally faster than indirection-based VFT code. As our experiments have shown,1,2 even for pathological megamorphic cases, (with 50 possible types) the BTD code is competitive with the VFT version. Typical call sites generally have few possible types, for which the BTD technique is significantly faster than VFT-based ones.

OVERALL RESULTS
Having described global system analysis and some of the optimizations it makes possible in the context of the SmallEiffel compiler, we think it is important to briefly present some of their performance results.

Implemented entirely in Eiffel and featuring about 80,000 lines of live code, SmallEiffel is a significant benchmark to test an Eiffel compilation process, such as the one implemented in SmallEiffel. Furthermore, the Small-Eiffel code relies heavily on high-level, purely object-oriented techniques.

For instance, dynamic dispatch is a crucial tool used throughout SmallEiffel code. A good illustration of its usefulness is given by the abstract type EXPRESSION, which features more than 50 subtypes and is handled all along the compilation process. Dynamic dispatch removes the burden of manually writing long sequences of conditionals to find the actual type of expression being compiled. It also makes triggering the code generation of an expression as simple as calling "exp.compile_to_c."

On the whole code for SmallEiffel, global system analysis makes it possible to accurately predict the type of the receiver for more than 90% of the calls. This means that 90% of the calls, which in the Eiffel language definition are all polymorphic calls, can be resolved by the compiler as monomorphic, direct calls.

One of the reasons behind this impressive score is code customization, which automatically adapts code to the various type contexts it appears in. As a consequence, every call whose receiver is Current (the Eiffel equivalent of this in C++ or Java) is a perfectly predictable call for the compilation process of SmallEiffel. Such calls are very common in object-oriented programming; for example, when calling another routine in the current class or when accessing attributes of the current object.

This extremely high type prediction score with a fast dynamic dispatch technique for the remaining 10% of calls, probably accounts for an important part of the SmallEiffel compiler's speed. The latter is indeed able to read, parse, analyze, and verify 80,000 lines of live Eiffel code, as well as generate and write the corresponding C code in about 6 seconds on a Pentium III 550 Megahertz processor.

CONCLUSION
The compilation strategy of SmallEiffel, the GNU Eiffel compiler that is described in this article focuses on high-level, object-oriented optimizations. It accounts for remarkable performance, both for the generated code and for the compilation process, and is thus perfectly suited to a pure object-oriented language such as Eiffel. SmallEiffel, as well as more information about the techniques explained in this article, can be found at http://SmallEiffel.loria.fr.

References

  1. Zendra, O., D. Colnet and S. Collin. "Efficient Dynamic Dispatch Without Virtual Function Tables, The SmallEiffel Compiler," 12th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'97), Atlanta, GA, (32)10:125–141, Oct. 1997.
  2. D. Colnet and O. Zendra."Optimizations of Eiffel Programs: SmallEiffel, The GNU Eiffel Compiler," 29th Conference on Technology of Object-Oriented Languages and Systems (TOOLS Europe'99), IEEE Computer Society, Nancy, France, p. 341–350, June 1999.

FOOTNOTE
* For the sake of portability, SmallEiffel does not actually generate an executable itself, but instead compiles Eiffel into C code, which is used as a portable assembly code and is then compiled into an executable by a classical C ANSI compiler. However, this has no influence on our discussion.