Disambiguated Glommable Expression Templates Reintroduced

In 1997 the journal Computers in Physics printed an earlier version of this article entitled "Disambiguated Glommable Expression Templates,"1 followed in 1998 by its sequel.2 Since that time several people have suggested that I pursue republication in C++ Report to bring the material to a wider audience. This article reintroduces the topic of disambiguated glommable expression templates and, while based closely on the prior article in Computers in Physics, aims to present the material in a manner that will be accessible to the wider C++ populace.

Naively, one might expect that scientific applications development and C++ would be a natural union. C++ provides excellent support for object-oriented programming, abstraction, and encapsulation. Moreover, scientific computational modeling—both simulation of physical systems and abstract mathematical analysis—is replete with structure. Expressing this structure and abstraction in a language with direct support for the construction of user-defined types that model the application domain is therefore a powerful attraction for a growing number of computational scientists.

Unfortunately, this seemingly natural union has historically been more of a stormy relationship. Early efforts to apply object-oriented programming in C++ to the modeling of physical systems revealed performance concerns relating to virtual function calls. Later efforts at class library design revealed serious performance concerns relating to temporary object proliferation, especially in applications involving vector and matrix math.3,4 Programmers have been faced with some difficult choices between elegance and efficiency. To get good computational efficiency, many of us have had to keep the object-orientation at fairly high levels of structure in our codes, and stick to more Fortran-esque coding at the lowest levels. Although there is certainly much to be gained from a well-structured "top half" of your code, it has nevertheless been a point of frustration for many not to be able to use objects at the lower levels.

In my own mind, the true desperation of the situation was most clearly demonstrated when Stroustrup wrote in the final paragraph of section 6.5.2 of The Design and Evolution of C++5:

It would be nice if every kind of numeric software could be written in C++ without loss of efficiency, but unless something can be found that achieves this without compromising the C++ type system it may be preferable to rely on Fortran, assembler, or architecture-specific extensions.

Capitulate? NEVER!! If only John Paul Jones had been a software engineer!
Fortunately, the situation has been rapidly improving in recent times. Two specific events I would cite as particularly noteworthy to the community of scientific C++ researchers are

  • the availability of the Kuck and Associates C++ (KCC) compiler with dramatically improved optimization capabilities, and
  • development of the "expression templates" class design technique.
KCC was the first C++ compiler to really make dramatic progress in overcoming the abstraction penalty in C++, by implementing a variety of optimizations that substantially alleviate the "small object penalty," thereby making it feasible to consider using C++ objects at lower levels of a code.6 Likewise, the development of expression templates by Todd Veldhuizen7,8 has made it possible to implement matrix-oriented math in C++ without any matrix-sized temporaries.

The technique of expression templates has been applied to the problem of building numeric container classes for scientific modeling in C++. Todd Veldhuizen introduced this subject in the C++ Report,7 and Scott Haney took up the subject in Computers in Physics.9 It is worth mentioning here that container design is a common activity in scientific C++ framework projects, not because scientific C++ programmers are unaware of std::vector<T> or std::valarray<T>, but because those standard containers, while arguably fit for the Standard, nevertheless fall short of the requirements of computational scientists. For example, std::vector<T> is one-dimensional and provides no bounds checking, whereas computational modeling in the physical and mathematical sciences frequently mandates the use of multidimensional containers. Similarly, std::valarray<T> may be multidimensional, but is not flexible enough to support domain-specific abstractions like triangular matrices, suffers a critical deficiency in its Standard-mandated public interface, and moreover, is not actually as efficient as originally envisioned.*

I briefly summarize the work done by others in the classical discipline of building expression template-enabled containers, and then extend the technique so that it becomes container neutral, thereby increasing the accessibility of the technology. Although the machinery of expression templates is somewhat obtuse and the material in this article may be hard to understand on a first reading, the end result will be gratifying. We find in the final analysis, that the "subscription cost" for endowing code with the ability to use expression template technology is very reasonable. User-defined classes will need to obey a certain protocol that is developed in the course of this article, and which is easily satisfied.

Finally, I hope this will be useful even to people who are not specifically interested in high-performance numerics. The template tricks employed to achieve the generalization of expression templates presented here are potentially of wider applicability.

CLASSIC EXPRESSION TEMPLATE ENABLED CONTAINERS Let me now give a whirlwind recap of classic expression templates. For excruciating detail please consult the references, but the following summary should be good enough to provide orientation for the previously uninitiated.

To open, let me just acknowledge that there probably has never been an "easy reading" article printed on the subject of expression templates. It would be vain for me to seriously expect this one to be the trend-breaker. Nevertheless, I am going to attempt a top-down introduction to the subject, and hope that it is clearer in the retelling than it was to me at the beginning, three years ago.

The goal is to be able to do this:


HugeMatrix t, u, v, w, x, y z;
t = sin(u) + (v-w) / (x + y*z);
and have it run FAST! Now, this sort of syntax has been supported for years by overloading the binary mathematical operators resulting in the production of matrix-sized temporaries, but this approach doesn't result in fast code, because of a few effects.

  • Production of the temporary matrix variables pounds the heap. Each temporary matrix constructed has to allocate space. Calling operator new is extremely expensive, and its cost can dominate all other operations, especially for small matrices.
  • Each binary operator results in a single independent loop through all the elements of its own subexpression parameters to build up the temporary variable that represents that operator's node in the expression parse tree. There are thus many loops in a typical expression evaluation: one for each operator plus another for the final assignment to the result object on the left-hand side.
  • Extreme waste of cycles due to repetitive stores and loads. For example, evaluating v-w in the previous code fragment results in a loop that ranges through all the elements of v and w, storing their difference in a temporary matrix. Then at the next higher node in the parse tree, a loop runs through all the elements of this temporary matrix, refetching elements from memory that were just stored at the previous stage of the expression evaluation. This behavior results in extra trips to memory that we would like to avoid.
  • Cache thrashing. Writing results to memory when forming the temporary at one level of the expression tree results in trashing a cache line that is needed to hold operands at another level. Such an expression evaluation system literally steps on its own toes and stumbles along to the solution at a pace that only the truly disinterested can tolerate.
Computational scientists really do count loads and stores in loops, either by studying the assembly code produced by the C++ compiler or by using hardware-profiling tools, and this sort of wasted memory bandwidth is definitely unacceptable. The notational convenience of abstractions such as that shown is desirable, but not at the performance cost of using conventional binary operator overloading.

To overcome these problems, expression templates introduce a different framework for expression evaluation. In this framework we still overload the binary mathematical operators, but the object returned from these functions is not the full matrix-sized value of the subexpression at that level of the parse tree. Rather, it is an object that is able to produce the value of the subexpression at a particular location in the indexing domain, which is accomplished by applying the specified mathematical operator to the values of its subexpressions at the same index location. In other words, the overloaded global operator functions return not the full matrix-sized values, but rather objects that can produce the answer one element at a time, when requested. The actual job of composing the full, final, matrix-sized answer (the left-hand side) is reserved for the assignment operator, which itself actually loops through all the elements in the indexing domain and assigns the result of the expression on the right at that location to each element in the left-hand side. In this manner, the whole mathematical expression evaluation is accomplished in a single loop (embodied within the assignment operator) irrespective of the number of terms in the expression, and each element of each operand on the right-hand side of the expression is fetched from memory exactly once. No needless loops, no wasted trips to memory, no thrashing the cache. Now this is FAST!

That's the high-level overview. Here's how it works. We start by introducing an expression template enabled array class. We will call this ETArray<T>. As described, we will want to give ETArray<T> an assignment operator that allows it to appear on the left-hand side of the expression template statement. And what will represent the expression that is being assigned? For this we introduce a class Xpr<T,E>, which will represent the expression at any node in the parse tree. Specifically, Xpr is the class that wraps the expression, and the expression itself is represented by the template parameter E. The class Xpr is known as an anonymizing expression wrapper because it can hold any subexpression of arbitrary complexity, allowing clients to work with any expression by holding on to it via the wrapper, without having to know the name of the type object that actually implements the expression. Given the Xpr<T,E> class, we can then exhibit the ETArray<T> container:


template<class T, class E>
class Xpr
{
	E e;
  public:
	Xpr( const E& e ) : e(e) {}
	T operator[] ( int n ) const { return e[n]; }
};

template<class T>
class ETArray
{
  public:
	template<class E>
	ETArray& operator=( const Xpr<T,E>& x )
	{
		for( int i=o; i < size( ); i++ )
		  data[i] = x[i];
		return *this;
	}
// other public members, data, etc.
};
And now it is clear why we need the wrapper class in the first place. We need to tell the difference between expression template expressions and anything else that might occur to the right of an equal sign. If the assignment operator just took a const E& parameter, it would match anything, even things that didn't come from our expression evaluation framework. We can't have that.

So this is the machinery that implements the outermost layer of an expression template enabled array class. We have an expression wrapper object (we will describe later how that is constructed) and given that, we can implement the assignment operator that contains the single loop ranging over all elements of the result object on the left-hand side, assigning to each element the value of the expression on the right-hand side at the corresponding location. The exhibited code uses indexing into the container and the expression via an explicit use of operator[], but that is just an implementation detail. It could also be done with iterators at the expense of lengthening the presentation a bit to include the definitions of the iterator types themselves and the forwarding typedefs, etc. Sticking integer offsets into the domain keeps the mechanics simple, allowing us to focus on the expression template machinery rather than on the details of an iterator-based evaluation engine.

Now let us turn to the subexpressions themselves. The binary mathematical operators will take arguments of type ETArray<T> and return the wrapped subexpression object Xpr<T,E>, where E is a binary operator. To keep the total number of classes involved in this expression evaluation framework finite, we make E itself a template class, with parameters to specify the types of the subexpression parameters and the specific mathematical operation to be performed. We then need the classes that actually implement the elemental mathematical operations, and finally we can present the actual global binary operator functions:


template<class T, class Container>
class ConstRef {
	const Container& c;
  public:
	ConstRef( const Container& c_) : c(c_) {}
	T operator[] ( int n ) const { return c[n]; }
};

template<class T, class A, class B, class Op>
class XprBinOp {
	A a;
	B b;
  public:
	XprBinOp( const A& a_, const B& b_ )
		: a(a_), b(b_) {}
	T operator[] ( int n ) const
	{
		return Op::apply( a[n], b[n] );
	}
};

template<class T>
class OpAdd {
  public:
	static inline T apply( const T& a, const T& b )
	{
		return a + b;
	}
};

template<class T>
Xpr< T, XprBinOp< T, ConstRef< T, ETArray<T> >,
                  ConstRef< T, ETArray<T> >,
                  OpAdd<T> > >
operator+( const ETArray<T>& a, const ETArray<T>& b )
{

	typedef XprBinOp< T, ConstRef< T, ETArray<T> >,
                          ConstRef< T, ETArray<T> >,
                          OpAdd<T> > ExprT;
	return Xpr< T, ExprT > (
		ExprT ( ConstRef<T, ETArray<T> >(a),
                        ConstRef<T, ETArray<T> >(b) ) );
}
// And similarly for the other binary mathematical operators.
There is a reason that OpAdd is not just std::plus. The functors in the Standard Library are utilized by invoking the member function operator( ) on the functor object instance itself. This means that to use functors you must actually retain the functor object so that you can subsequently invoke its member function. In this case this would mean adding another member to class XprBinOp, even though the actual object would have no state. We avoid this needless space overhead and the runtime cost of copying these empty objects (which are never really empty since the compiler will add pad bytes to keep them from having zero size) and instead just use an applicative template class with a static apply method. Only the case for addition is shown, but the extension to the other binary mathematical operators is straightforward.

It also bears mention that in the preceding code the global operator+ function taking two ETArray<T>s must be augmented with additional variants that allow adding ETArray<T> objects to subexpressions (represented by Xpr<T,E> objects) and subexpressions to each other. Without adding global operator functions for all these combinations we wouldn't be able to write any arbitrary mathematical expression and expect it to parse.

Finally, we digress just for a moment to discuss the purpose of ConstRef and why the subexpression objects are held by value instead of by reference in the wrapper and expression evaluation classes. Look closely at the global operator+ function. Notice that the Xpr object it creates is on the stack, and must therefore be returned by value—you can't return a reference to a local variable because its lifetime is over once you exit the function's scope. Clients that need to hold on to this subexpression wrapper object will need to make a copy and hold it by value. Similarly the XprBinOp object that the Xpr object wraps is also created on the stack in the global operator function and must be held by value for the same reason. But we do not want to make the expression classes copy their subexpression parameters in the case where these parameters are the array class program variable objects in the overall expression. This is because these objects are huge, and we don't want them to be copied unnecessarily—attempting to avoid that is how this whole thing began! So we introduce the ConstRef class, which treats a reference to the matrix terminals (the leaf nodes in the expression parse tree) as a copyable object just like all the subexpression temporary objects that percolate out of the expression parse tree. Another way of understanding this is to recognize that it is just an outworking of the standard C++ rule that temporaries used to initialize reference members in classes can be destroyed immediately after the constructor ends, whereas temporaries used to initialize local references and function parameters will persist as long as the reference itself does.

There are more details to be sure. The references cover the rest of the details. But this introduction should be enough to provide the orientation and background for the rest of this article.

IMPEDANCE MISMATCH Now let us take stock of the situation. Consider the impact of the advent of expression template enabled array classes such as the one previously sketched on the existing base of scientific C++ code. How easy would it be to put ETArray<T> into service in a code base that was already written using some other arraylike container class? Would you replace the existing container wholesale with the new expression template version? What if the pre-existing array class had different index offsets, different data layout, etc.? What if the new class represents rank differently than the incumbent? And perhaps more problematic still, what if the project had made heavy use of object-oriented modeling, resulting in a well-developed family of C++ classes representing various abstractions in the problem domain, more than one of which possessed arraylike semantics, among other salient attributes?

In any of these situations, it might be hard to put an expression template enabled array class to work in the project. Yet the benefits of improved performance and increased expressiveness make them nearly irresistible.

There is also an issue even larger than merely how to put this technique to work in your code. There is the issue of how to promote the interoperability of scientific software written in C++. Although object-oriented programming techniques have done much to improve the state of software, especially in terms of raising the level of abstraction and improving the robustness of software, it is much less clear that "software reusability" has been improved one whit. In fact, a powerful argument can be made that C++ has done much to impede the reusability of scientific software. How? Through the explosion of containers that nearly every C++ programmer invents to hold arraylike data sets. The result is that practically every C++ project winds up with its own application-specific array class, and reusability of numeric processing subroutines in C++ is thwarted because there is no standard container. This phenomenon alone can be a serious liability for C++ in scientific circles where Fortran is the norm, since Fortran has no such difficulty. Note again that it is not that scientific C++ programmers are unaware of the facilities of the Standard Library, but rather that the classes in std:: do not adequately model the properties needed by numeric containers in scientific and engineering problem domains, so that numericists are forced to roll their own.

In view of these problems, it seems that it would be extremely valuable if there were a way to "glom" an expression template facility onto an existing user-defined container class. That would solve the problem of how to put expression templates to work in existing code. Moreover, if this mechanism enabled numeric C++ subroutines to be expressed in terms of the properties of arraylike containers, without reference to a specific container type, we would have the beginnings of a plan to promote the reusability of scientific C++ code.

SEARCHING FOR GLOMMABLE EXPRESSION TEMPLATES To design our improved facility, let us think back on how classic expression templates work, and then consider how to generalize that using the features of C++. In the foregoing, we built a class ETArray<T> that was a templated container for holding the data, along with a family of template classes and global operator functions that constructed the expression template (basically the assemblage of a parse tree for the mathematical expression). These parts of the system were tied together into a unified system in which program variables of the ETArray<T> class type were able to serve as terminals (leaf nodes) in the expression template parse tree.

Abstract Base Class The most immediate idea for generalization might be using inheritance to represent the concept of "eligible to be a terminal in an expression template parse tree." We identify an object's capability to be indexed as the specific property that makes it so eligible. We might then consider making an abstract base class Indexable, which could be subclassed by any class that wished to participate as a terminal in an expression template statement:


template<class T>
class Indexable {
	virtual T operator[] (int) =0;
};

class UserVec : public Indexable<double> }
	double *data;
  public:
	//...
	double operator[] ( int i ) { return data[i]; }
};
This class UserVec is so trivial as to be of no real value, but it serves as an example of a user-defined container class. You may imagine the trivial storage and accessor mechanisms of UserVec to be replaced by any more sophisticated container facility that might actually be in service.

Then imagine the machinery being re-expressed in terms of "Indexable" objects. User-defined container classes would then derive from this base class Indexable and implement the indexing operator. This would enable the expression template machinery to serve a variety of user-defined container classes.

Unfortunately, this proposed implementation must be immediately rejected. There is no way we can afford the runtime hit of a virtual function resolution on every access to an element of our container. Surely this cure is worse than the disease, as the speed of execution of such code would be unbearable.

Implicit Type Conversion Reaffixing our C++ thinking caps, we ask ourselves, "What are the available ways to make an object appear to be of a certain class?" We have rejected conventional inheritance because of the unacceptability of the virtual function call. However, there is another way: implicit type conversions. A class can provide an implicit conversion to another type; then when the compiler sees an opportunity to perform type matching through implicit conversion of arguments, it will happen transparently. We could thus imagine something like:


template<class T>
class Indexable {
	T *v;
  public:
	Indexable ( T *v_ ) : v(v_) {}
	T operator[] ( int i ) { return v[i]; }
};

class UserVec {
	double *data;
  public:
	// ...
	operator Indexable<double>( )
	{
		return Indexable<double>(data);
	}
};
Initially this looks promising, but alas, it does not work either. It fails because C++ does not allow the use of implicit conversions when deducing template arguments. This is to prevent template functions from matching too many extraneous possibilities that are unintended by the programmer. However, the entire expression template machinery is built upon global operator functions taking templated arguments, so if we cannot construct those terminal argument types using implicit conversion, then we cannot use this approach.

Combining Templates and Inheritance Having been thwarted twice, perhaps it is time to regroup. The first approach (inheriting from a base class and specializing the salient property to our specific needs) clearly works, and is a natural idea from the standpoint of object-oriented design. Moreover, it circumvents the problem that derailed the second plan, since an implicit conversion is not required to match the argument types of the global template operator functions. However, the speed penalty of the virtual functions is not acceptable in this application since it requires resolution at runtime. What we really want is a solution in which the mapping from the Indexable abstraction class to the user-defined container class is accomplished at compile time. The traditional approach to creating generalized operations that are specialized at compile time is to use templates, such as is done in the STL. But these design techniques typically employ template composition, not inheritance. We wonder: "Can inheritance and templates be combined somehow?" Indeed they can:


template<class T, class Container>
class Indexable {
  public:
	T operator [] ( int i )
	{
		return static_cast< Container * >( this )->operator[](i);
	}
};

class UserVec : public Indexable<double,UserVec> {
	double *data
  public:
	//...
	double operator [] ( int i ) { return data[i]; }
};
And here we have a winner. The concept of "Index-ability" is provided by a base class, which is templated on the container type. The Indexable class is intentionally designed to serve as a base class, so its operator[] function delegates to the derived class's own operator[] by statically casting itself to the type of the derived class, which must be provided as a template parameter. The container then derives from this base class, thereby registering intent to participate as a terminal in expression template statements, providing its own class type as the template parameter to the base class. Note that the indexing operator of the derived class does not even need to be changed.

This design pattern, which seems new to each person who discovers it, is popularly known as the "curiously recursive template pattern," and has been described previously in the C++ Report by James Coplien. The full impact of this curious technique is hard to describe succinctly. "Compile time polymorphism" seems to come about as close as anything. What is most wondrous about all this is that modern optimizing compilers such as Kuck and Associates KCC can actually inline these indexing operations so that there is no cost for the circuitous method definitions.

FLESHING OUT A GLOMMABLE EXPRESSION TEMPLATE FACILITY Armed with both an understanding of how classical expression template facilities have built expression template enabled containers and a technique for abstracting the notion of eligibility to be a terminal in an expression template statement while maintaining the performance of static dispatch, we are now in a position to separate the expression template machinery from the container. For the sake of brevity we concentrate here on the binary mathematical operators as before, but generalization to unary operators and scalar terminals is straightforward.

Amazingly, we can preserve almost all of the code unchanged: the anonymizing expression wrapper class Xpr, the binary operator node class XprBinOp, the applicative template classes OpAdd and friends, and the ConstRef class can all remain unchanged. The only part of the expression template evaluation framework that must change is the manner of writing the global operator functions. Here is the new global operator+:


template<class T, class A, class B>
Xpr< T, XprBinOp<T, ConstRef<T,Indexable<T,A> >,
	       ConstRef<T,Indexable<T,B> >, OpAdd<T> > >
operator+( const Indexable<T,A>& a, const Indexable<T,B>& b )
{
	typedef XprBinOp< T, ConstRef<T,Indexable<T,A> >,
				 ConstRef<T,Indexable<T,B> >, OpAdd<T> > ExprT;
  	return Xpr< T, ExprT >(
		ExprT( ConstRef<T,Indexable<T,A> >(a),
                       ConstRef<T,Indexable<T,B> >(b) ));
}
Here we have changed the type of arguments to the function from ETArray<T> to Indexable, thereby allowing the one version of the global operator+ to work with any container which derives from Indexable.

Of course we have to augment this with the affiliated global operator functions that combine subexpression (Xpr<T,E>) nodes with Indexables and so forth, but overall remarkably little of the core design has to change. The total amount of grunge involved in writing all the different global operator variations might drive you to use CPP macros to help write all the template functions (yikes!) or perhaps—as some have done—to use a Perl script as a program generator. Be all that as it may, the needed functions are all straightforward modifications of this one.

There is now one final piece of the machinery to put in place—the assignment operator. It would be nice if this could be placed in the base class Indexable, but C++ does not allow operator= to be inherited. So this will have to be implemented explicitly in the derived class. Fortunately however, the implementation can be trivial—it can simply return the result of a method from class Indexable with any name other than "operator=". The same trick of downcasting the base class to the derived class thus enables a trivial implementation of assignment. This is again reminiscent of techniques that have been employed by other authors in the classic expression template implementations, but by using member templates we can avoid even the single virtual function call that some classic expression template implementations have incurred. All the op equal variants can be provided directly in the base class because they can be inherited. The completed classes are as follows:


template<class T, class C>
class Indexable {
public:
	explicit Indexable ( ) {}
	T operator[] ( int n ) const
	{
		return static_cast<const C*>(this)->operator[](n);
	}
	template<class E>
	C& assign_from(const Xpr<T,E>& x )
	{
		C *me = static_cast<C*>(this);
		for( int i=o; : < mysize(i); i++ )
			me->operator[] (i) = x[i];
		return *me;
	}
	template<class E>
	Indexable& operator+=( const Xpr<T,E>& x )
	{
			C *me = static_cast<C*>(this);
			for ( int i=o; i < me->size( ); i++)
		me->operator[] (i) += x[i];
			return *this;
	}
	  // ...
};
The additional operator assignment functions are implemented by trivial modification of operator+=.

The complete UserVec class can now be exhibited:


template<class T>
class UserVec : public Indexable<T,UserVec> {
   T *data;
   int sz;
  public
	UserVec ( int siz )
	: Indexable<T,UserVec>( ), sz(siz)
	{ data = new T[ sz ] ; }
	~UserVec( ) { delete[] data; }
	template<class E>
	UserVec& operator=( const Xpr<T,E>& x )
	{
		return assign_from(x);
	}
	int size ( ) const { return sz; }
	T operator[] ( int i ) { return data[i]; }
	// other methods, copy constructor, etc.
};
Notice how low the "subscription cost" is for registering a class with the expression template facility. All we have to do is derive from a class Indexable and provide methods to tell it how big we are, to index our contained data, and to invoke the provided assignment proxy.

By this point we have actually succeeded in fleshing out a workable expression template facility that is independent of the container type. Multiple distinct user-defined container types can derive from Indexable and thereby register for participation as terminals in expression template statements.

A specific example in which it is useful to have multiple distinct arraylike classes is that of staggered-grid hydrodynamics (computational modeling of fluid flow in a discretized spatial domain). The mathematical analysis results in both quantities defined at the centers of the zones of the computational grid, and in quantities defined on the vertices. Both types of field quantities are arraylike in character but are nonetheless quite distinct mathematical entities, and it makes good sense in an object-oriented analysis to represent these concepts with separate types. For example, we may use a class zcsf to represent zone-centered scalar fields, and a separate class vcsf to represent vertex-centered scalar fields. Both of these classes can derive from Indexable, allowing variables of each type to serve as terminals in expression template statements, without having to represent both concepts with a single type or having to replicate the full expression template machinery for each distinct terminal type. Glommability definitely saves us work in this situation.

DISAMBIGUATION One property of the preceding prescription is disconcerting. Note that the expression template machinery is coded in terms of the Indexable class. In any given project, multiple user-defined container classes may inherit from this class. This means that it will now be possible to combine, in a single mathematical statement, terminals that do not "go together." For example, in the staggered-grid hydrodynamics example, we might have:


template<class T> class vcsf :
	public Indexable<T,vcsf> {/*...*/};
template<class T> class zcsf :
	public Indexable<T,zcsf> {/*...*/};

vcsf v1, v2, v3;
zcsf z1, z2, z3;

v1 = v2 + z3;
Suppose this is a typo and not what the programmer intended. Since all three terminals are of types that are derived from Indexable, the statement will compile and execute, but will almost certainly produce erroneous results. Application of runtime shape–conformality checks may catch some of this at runtime, but there is more wrong here than just a combination of misshapen terminals. These terminals are of the wrong type and should not be able to appear in a statement like this. C++ is supposed to be strongly typed. Our use of the Xpr<T,E> anonymizing template expression wrapper class was specifically designed to hide within the template parameter E any amount of complexity in the intermediate type that represents a given node in the expression parse tree. But this use of templates to match any object representing a node in an expression template parse tree irrespective of the type of its terminals is too general. It subverts the C++ type system in a very subtle and insidious way. We want to match only parse tree nodes arising from expressions over terminals of the right type, not parse tree nodes arising from terminals of any type.

Recall the earlier Stroustrup quote in which he insists that the needs of high-performance numerics cannot be allowed to compromise the C++ type system. Although he was evidently thinking about language extensions and not library facilities when he wrote that, the admonition seems relevant here. Indeed, the danger of the "match-anything" nature of the expression template anonymizing wrapper class, combined with glommability, is so great that I was literally unwilling to publish the glommability innovation until I could come up with a way to add some safety controls. Disambiguation, then, is the safeguard that makes glommable expression templates publishable.

To ensure that the expression template matching machinery combines only terminals and subexpressions of compatible types, we must hold the type of the terminal inside the template expression wrapper class but outside of the template parameter that represents the parse tree node. This additional parameter will be used as a "disambiguator," or "discriminator," to ensure that only like-typed terminals can appear in expression template statements. The most obvious choice for the disambiguator for Indexables of a given derived type is that derived type itself. This will guarantee that only terminals of the same type can be combined into a single expression template statement. This is a good default, and can actually be so designated using template parameter defaulting:


template<class T, class C, class D = C>
class Indexable { /* . . . */};

template<class T>
class UserVec : public Indexable<T,UserVec>
{

 public:
	template<class E>
	UserVec& operator=( const Xpr<T,E,UserVec>& x );
};
This says that UserVec objects can be assigned from expressions composed from terminals that are also UserVec objects.

Although making this disambiguation token default the same as the type of the terminal containers seems natural, note that many other acceptable choices for the disambiguation token are possible, and perhaps even desirable if made a formal part of the object-oriented design. For example, you might have a variety of spatial field classes. If they are all defined on the same grid, it might make very good sense to allow them to share a common disambiguator token, thereby allowing them to be mixed freely in mathematical expressions.

Note also that this now allows a "policy-free" implementation of high-performance numeric containers since the mechanism can now be separated from the policy. People who like multidimensional array classes can have them. People who prefer to see the tensor rank of the matrices propagated to the type system can have that too, simply by defining a container for each tensor rank. Both groups can have the same expression template facility glommed on to their favorite containers. The same freedom is also available for choices in storage layout (C or Fortran storage formats), index limits (0- or 1-based, or even adjustable), etc. Quite literally, the policy can be set by the container leaving the mechanism in the glommable expression template facility.

In addition to the minor changes to Indexable and the prototypical UserVec exhibited previously, it is also necessary to add the disambiguation token to the Xpr class and to modify the assortment of global template operator functions accordingly. For example, we could do the following:


template<class T, class E, class D>
class Xpr
{
   E e;
  public:
	Xpr( const E& e ) : e(e) {}
	T operator [] (int n ) const { return e[n] ; }
};
// Binary operations between Indexables and Xprs
#define XM_DEFINE_GLOBAL_OPERATOR(op,ap) \
template>class T, class A, class B, class D> \
Xpr<T, XprBinOp<T,Indexable<T,A,D> >, \
	Xpr<T,B,D>, ap<T> >, D> \
op ( const Indexable<T,A,D>& a, const Xpr<T,B,D>& b) \
{ \
	typedef XprBinOp< T, ConstRef<T,Indexable<T,A,D> >, \
	 Xpr<T,B,D>, ap<T> > ExprT; \
	return Xpr<T,ExprT,D>( ExprT( \
	 ConstRef<T,Indexable<T,A,D> >(a), b ) ) ; \
}
XM_DEFINE_GLOBAL_OPERATOR(operator+, OpAdd )
XM_DEFINE_GLOBAL_OPERATOR(operator-, OpSub )
XM_DEFINE_GLOBAL_OPERATOR(operator*, OpMul )
XM_DEFINE_GLOBAL_OPERATOR(operator/, OpDiv )
#undef XM_DEFINE_GLOBAL_OPERATOR
For the sake of sanity we will not rehash the entire facility here, but just point out that the template parameter (D in this example code) must be added to all instances of arguments or return values of types Indexable or Xpr. It is at least straightforward, if arduous.

To see the effectiveness of this disambiguation strategy, consider the following example code:


template<class T>
class UserVec : public Indexable<T,UserVec> { /* ...*/ };
template<class T>
class FooBar : public Indexable<T,FooBar> { /* ...*/ };

UserVec<float> a(5), b(5);
FooBar<float> c(5);

a = -99.; b= 2.; c=3.;

// This statement should be illegal!
a = b + c;
Compiling with KCC now produces an error:


KCC -x -no_implicit_include +KO -none +z -c tstUV.cc
"tstUV.cc", line 144: error: no operator "+" matches these operands
  operand types are: UserVec<float> + FooBar<float>

a = b + c;
That is just what we want.

At this point we finally have a watertight disambiguated glommable expression template facility. Scientific programmers can now freely embark on detailed object-oriented analysis and design, making class distinctions wherever the problem domain analysis indicates they should be. They can write code that benefits from the expressive power of array syntax mathematics without sacrificing either C++'s strong typing or suffering performance hits from the proliferation of large temporary objects.

PROMOTING THE REUSABILITY OF SCIENTIFIC C++ These are exciting days for C++. Long dismissed in scientific circles owing to criticisms ranging from performance consideration to quality-of-implementation issues, recent developments in compiler technology and programming techniques are changing the face of scientific C++. The language is now directly competitive with Fortran in performance metrics, while maintaining far superior support for the full range of modern programming paradigms: from structured programming techniques to object-oriented programming, and most recently, generic programming. With the ratification of the ANSI C++ standard, compiler vendors are deploying product offerings far superior to those available only a short while ago. Quality of template implementations is dramatically improving, with support for member templates, defaultable template arguments, traits, and sane instantiation semantics becoming generally available. Acceptable quality implementations of exceptions, namespaces, RTTI, the new keywords, and the STL are all becoming generally available.

With these improvements in the status quo comes the opportunity to revisit unsettled (or perhaps just unsatisfactorily resolved) issues in the realm of scientific C++. Just as the general purpose class libraries in the wider C++ culture have matured from the organization evident in early class libraries to the sophisticated capabilities of the STL, which exhibits a clear separation between containers and algorithms, it is now possible to re-express scientific algorithms in a container-neutral style. The benefits to be gained from such a rejuvenation of our software base are great. We can achieve expressive simplicity through the use of highly abstract index-free mathematical formalism provided by expression template technology. We can do all this with acceptable performance. And we can do it while maintaining adaptability to end-user constraints. That is to say, we no longer have to worry about developing, standardizing, and promoting "the one true container for high-performance numerics." Rather, we can concentrate on building high-performance applications and libraries, confident that we can afford to do good object-oriented domain analysis and abstraction, without either sacrificing performance or forcing clients of our software to submit to imposed interfaces. A subsequent article will further explore some of these ramifications.

Acknowledgments I would like to acknowledge valuable discussions with Scott Haney and Arch Robison during the conduct of the research described in this article. I would like to thank the editors of Computers in Physics for consenting that this material could be republished. I would like to thank Andrei Alexandrescu, who suggested a superior explanation of a particular thorny issue. And I would like to thank Eric Sather, who provided numerous constructive comments relating to the presentation of this material.

Finally, the research described here was performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

References

  1. Furnish, G. "Disambiguated Glommable Expression Templates," Computers in Physics, 11(3): 263–269, May/June 1997.
  2. Furnish, G. "Container-Free Numerical Algorithms in C++," Computers in Physics, 12(3): 258 266, May/June 1998.
  3. Haney, S. W. "Is C++ Fast Enough for Scientific Computing?" Computers in Physics, 8(6): 690–694, Nov./Dec. 1994.
  4. Forslund, D. W. et al. "Experiences in Writing a Distributed Particle Simulation Code in C++," in USENIX C++ Conference, 1990.
  5. Stroustrup, B. The Design and Evolution of C++, Addison–Wesley, Reading, MA, 1994.
  6. Robison, A. D. "C++ Gets Faster for Scientific Computing," Computers in Physics, 10(5): 458–462, Sept./Oct. 1996.
  7. Veldhuizen, T. "Expression Templates," C++ Report, 7(5): 26–31, June 1995.
  8. Veldhuizen, T. and K. Ponnambalam. "Linear Algebra with C++ Template Metaprograms," Dr. Dobb's Journal of Software Tools, 21(8): 38–44, Aug. 1996.
  9. Haney, S. W. "Beating the Abstraction Penalty in C++ Using Expression Templates," Computers in Physics, 10(6): 552–557, Nov./Dec. 1996.
  10. Koenig, A. and B. Stroustrup. "Foundations for Native C++ Styles," Software Practice and Experience, 25(S4), Dec. 1995.

FOOTNOTES
* See http://www.kai.com/C_plus_plus/v3.4/doc/UserGuide/chapter_9.html#valarray-changes for more information.

There is such a thing as the empty base optimization in C++, but there is no "empty member optimization."

This term was introduced by Koenig10 in a different context to describe algorithm generation by template techniques, but it seems also well suited for describing this combination of inheritance with templates.