Taking Testing to the Extreme

At almost every presentation I make these days someone asks me, "What do you think about Extreme Programming?" This usually occurs after I have been talking about the importance of design and the effectiveness of testing. Apparently, the assumption on the part of the questioner is that Extreme Programming (XP) downplays the importance of both design and testing. My standard answer is, "I have seen Extreme Programming used effectively, and I have seen it fail miserably." The difference I have observed is that successful projects follow the technique as it was originally defined and unsuccessful projects do not.

I wrote a column some time ago entitled "Let's Don't and Say We Did."1 Many people are eager to say they are doing Extreme Programming projects because the technique is new and sexy, but they lack the discipline to follow it completely. They want to say they are doing Extreme Programming as a way of justifying their lack of a software development process. Kent Beck, the developer of the technique, has said, in essence, that a project should follow all the steps or not attempt the technique at all.2 As defined, the technique represents a valid method but, as practiced by many, it fails to keep many of its promises.

Reaction to Extreme Programming is approaching religious fervor. One author actually referred to "Extreme testing" (his term for the testing associated with Extreme Programming) as "magic."3 So I thought I would take this opportunity to examine the technique a bit and discuss it from a testing and quality point of view.

I have labeled this section an "interpretation" because I will summarize the essentials of Extreme Programming. I may unintentionally emphasize different elements than others have, and I certainly will not cover every facet. Extreme Programming is an approach to programming that emphasizes the rapid production of code in a minimal process environment.

The four values of Extreme Programming2 are:

  • Simplicity—Make nothing more complex than it has to be to meet current needs, even if you know it will have to be more complex later.
  • Communication—Communicate early and often among the several roles associated with the project.
  • Feedback—Provide information back to the programmer who produced the code when you find a problem with the code you are using.
  • Courage—Move forward as quickly as possible, even if it means making changes later. In fact, Beck's seminal text on Extreme Programming is subtitled "Embrace Change."

Extreme Programming does not eliminate process; it makes the process support the work, rather than the other way around. The technique includes four basic activities: coding, testing, listening, and designing.

This is the order in which Beck presents these activities in his book but, in fact, we might think of each of these as an individual thread intertwined with other threads. For some projects, you may begin by prototyping unfamiliar ideas, which means coding is the first thing done. In other cases, the business rules dominate the system and you may listen to a business person, do some design, and write a few specific test cases before coding.

A fundamental assumption of Extreme Programming is that a minimalist approach to each activity in the development process is acceptable because the activities reinforce one another. For example, integration is continuous and could be error-prone; but if tests are easy to run, you know immediately when there is a problem.

Of particular interest, the description of the technique calls for test cases to be written before the code in some cases. This should be done when writing the tests clarifies the specification of the functionality being coded. We have seen this phenomenon advocated in recent years, as projects have been encouraged to have testers work during the early phases to determine whether or not the requirements are testable.
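The test-before-code practice can be made concrete with a minimal sketch. The discount rule, function names, and values below are hypothetical, invented purely for illustration; the point is the order of the steps: the test encodes the specification, then just enough code is written to satisfy it.

```python
# Step 1: the test is written first and acts as an executable specification.
def test_discount():
    # Hypothetical specification: orders of $100 or more get a 10% discount.
    assert apply_discount(100.0) == 90.0
    assert apply_discount(99.99) == 99.99

# Step 2: write only enough code to make the test pass.
def apply_discount(total):
    return total * 0.9 if total >= 100.0 else total

test_discount()  # the code now demonstrably meets the stated specification
```

Writing the test first forced the boundary ($100 exactly) to be decided before any code existed, which is precisely the clarifying effect the technique claims.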

I am not aware of any studies showing that Extreme Programming produces products faster or with better quality. Nor am I aware of studies that show it is ineffective. I do not want to waste space making judgements in the absence of quantitative data. Time and experience will show where Extreme Programming works, and where it does not.

I certainly agree with the intention to define only the minimum process needed to achieve the goals of a project. Why would you define more than is needed? What has to be clearly determined is just what those goals are. The goal of an Extreme Project seems to be producing a software product with acceptable quality in the least amount of time possible. This could be contrasted with an application project that is part of a product line effort in which explicit decisions are made to take longer to develop the initial implementation of a component for it to be usable in a number of products. Neither is necessarily a better approach; they simply seek to achieve different goals.

Three weaknesses I see in the testing component of Extreme Programming are:

  • No guidance on how to select test cases—other than to say that programmers develop unit tests and customers develop functional tests.
  • No guidance on how much testing is enough—the advice is to write all of the tests you can think of. That is pretty vague.
  • No guidance on when to test—other than when to write unit test cases.

In the following sections I will explore each of these issues, and relate them to the values of Extreme Programming.

According to the XP literature, programmers build unit tests and customers build functional tests, so a clarification is needed. These statements mix two different dimensions, as illustrated in Figure 1. The advice that programmers construct unit tests is a matter of scope; the advice that customers create functional tests refers to a specific technique for selecting tests.

Figure 1. Dimensions of test case selection.

The unit level of scope is the smallest scope the programmer thinks is meaningful to isolate and test. Traditionally, programmers performed unit tests on individual functions to ensure that a given set of inputs resulted in correct answers. In object-oriented development, the unit scope is at the object level rather than the method level. The reason is that much of the input to an individual method is actually the instance attributes and class attributes that are global across the set of methods defined on an object.

It is possible to select tests using a functional or a structural approach at each level of scope. Functional tests are based on the specification of whichever level of scope is being analyzed. For the unit level of scope, this would typically be the pre- and post-conditions of each individual method. At the system level of scope, functional tests are designed from requirements, which are based on use cases in our usual development process.
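A brief sketch may help show what "functional tests from pre- and post-conditions" looks like at the unit level. The `Stack` class and its conditions below are hypothetical; the tests are derived from the stated conditions, not from reading the implementation.

```python
class Stack:
    """Hypothetical unit under test."""
    def __init__(self):
        self._items = []

    def push(self, item):
        # Post-condition: the pushed item becomes the top of the stack.
        self._items.append(item)

    def pop(self):
        # Pre-condition: the stack is not empty.
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

# Functional test cases derived directly from the pre/post-conditions:
s = Stack()
s.push(42)
assert s.pop() == 42      # post-condition: last item pushed is returned

try:
    Stack().pop()         # a violated pre-condition must be rejected
    assert False, "expected IndexError"
except IndexError:
    pass
```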

Structural tests rely on knowledge of the underlying structure of the artifact under test. Customers cannot have this knowledge. However, the dynamically configured systems created today require a structural test component with system scope to determine whether the various configurations are possible, necessary, and sufficient.

The reason there are both functional and structural approaches is that each gives the tester a different perspective on the quality of the software. Testing based on the functional specification allows us to investigate whether the unit does what it is supposed to do. Testing based on the structure of the implementation allows us to determine whether the unit does anything it is not supposed to do. At the unit level, the structure of the code should be simple enough that the functional tests also cover the complete structure of the code.

When the Extreme Programming process says programmers write unit tests, the intent seems to be that programmers should check their own work. When combined with the continuous integration approach of Extreme Programming, this guides the programmer to create tests for increasingly larger and more complex classes. The programmer selects tests based on what the class is intended to do and on the structure of the pieces that comprise the unit under creation. Because the pieces handed to a programmer are assumed to have been unit tested themselves, objects from classes created by others are treated as black boxes.

My advice is:

  • Select test cases at each level of scope (shown in Figure 1) to ensure that as the system is integrated, it continues to work correctly.
  • Use both sources of tests, at each level of scope, to ensure that the broadest range of defects is identified.
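The two sources of tests can be contrasted on a single hypothetical unit (the function and its rates are invented for illustration). The functional test comes from the stated specification; the structural test comes from reading the code and noticing a branch the specification never mentions.

```python
def shipping_cost(weight_kg):
    """Hypothetical unit: a $5 base charge plus a per-kilogram rate."""
    if weight_kg > 20:           # surcharge branch: a structural detail
        return 5.0 + weight_kg * 0.8
    return 5.0 + weight_kg * 0.5

# Functional test: derived from the specification alone
# ("cost is a $5 base plus a per-kilogram rate").
assert shipping_cost(10) == 10.0

# Structural test: derived from the implementation, exercising the
# surcharge branch that a spec-only test set might never execute.
assert shipping_cost(25) == 25.0
```

Each perspective finds defects the other misses: the functional test would catch a wrong base charge; the structural test would catch a bug hiding in the heavy-item branch.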

How much testing is enough, and how do we know when we have achieved this goal? The Extreme Programming technique advises us to create "all the tests you can think of." It relies on the give-and-take between two programming partners, with one thinking of test cases the other has not considered. The effectiveness of the tests depends on the creativity of the programmers. Extreme Programming is dedicated to a streamlined development approach, so any metric for test coverage must be easy to compute.

Coverage can be viewed from two perspectives: internal and external. Measures such as counting the specific lines of code that are executed are internal to the product. These measures require instrumentation and can slow the development process. Counting test cases written against a method's specification is external to the product and typically easier to apply.

In keeping with the Extreme philosophy, we can begin with a very coarse-grained measure and refine it if the resulting level of quality is unacceptable. The more coarsely grained the measure, the easier and less expensive it is to apply. The trade-off is that the larger the grain size, the fewer the tests required, and the more likely the test set is to miss defects.

Here are some simple external measures of test coverage:

  1. Unit tests
    • Unit test cases are written for every important public method, according to the Extreme Programming technique. This is a subjective measure that depends on the programmer's judgement of what is important. One way to make it more systematic is to use the CRC class description, developed as part of the Extreme Programming technique, and to cover those methods that correspond to the class' domain responsibilities.
    • A more definitive test coverage criterion is to test every public method and to produce tests that include values from every equivalence class for each parameter in these methods. This is much more objective and provides a rational, and fairly quick, technique for covering the product. The equivalence classes for an attribute correspond to the states of the class for objects and can be obtained from the domain for primitive types. If all possible combinations of these values are covered, the test set gets very large even for relatively small classes. In my next column, I will revisit the OATS technique that reduces the size of this set. I will also introduce a small tool we are building to help with managing the technique. It should be available for download by then.
  2. Functional tests
    • Customers will write one test case for every requirement scenario. This is usually referred to as the "sunny day" scenario, the case that happens most often. If there is a formal use case, the test cases can be extended to include the alternative and exceptional scenarios.
    • As with the unit tests, a refinement on the initial measure is to perform further analysis on the variables contained in the scenario. In this case, we list each variable from the scenario, identify its data type, and use this to establish equivalence classes.

The weaknesses with these approaches include:

  • It is not possible to make accurate estimates of the reliability of the system, and
  • It is not possible to guarantee the safe operation of the software.

My advice is:

  • Record the coverage levels used. Simply thinking about coverage leads to test sets that cover more new ground and repeat less.
  • If you receive more feedback about the poor quality of your code than you find acceptable, increase the level of coverage.

The final level of development testing occurs as pieces of completed, i.e., unit-tested, code are assembled into larger pieces. This occurs almost continuously in object-oriented development. Classes are defined with attributes that are instances of other classes. This fundamental level of integration brings together the work of multiple teams. The questions testing answers at this point differ from those answered by the unit-test activity. Because these attributes incorporate someone else's work, the major question is whether the object will do what we need it to do. A secondary question, for the "overall good of the project," is whether the object does everything it claims to do.

This testing is continuous, in that all but a very small number of the classes defined create instances of other classes. Therefore, almost every class test involves integrating previously written classes. The testing is also recursive, in that failure of a test may result from failure of new code in the current class, or it may be a faulty interaction between the new code and the previously existing code.

The integration of the instance of the class under test takes place within the context of the unit testing of the new class being defined. For this activity to contribute value to the project, there must be feedback to the developer of the class whose object caused a test failure. Typically, a test failure in an integration environment is followed by sufficient debugging to determine which object caused the failure. Some of the objects come from classes that are "trusted," which only means we suspect them last.

My advice is:

  • Adopt a standard e-mail template that reminds the programmer what information should be fed back to the developer who created the failing class.
  • Be certain the unit tests use the attributes from the classes under development (an additional test coverage measure).
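The second bullet, driving unit tests through attributes whose types are other developers' classes, can be sketched as follows. The `Car` and `Engine` classes are hypothetical stand-ins for a class under development and a previously unit-tested class it integrates.

```python
class Engine:
    """Stands in for someone else's previously unit-tested class."""
    def start(self):
        return "running"

class Car:
    """Class under development; its `engine` attribute integrates Engine."""
    def __init__(self):
        self.engine = Engine()

    def drive(self):
        # The integration point: new code exercising the existing class.
        return self.engine.start()

# A unit test for Car that deliberately drives through the Engine attribute.
# A failure here is recursive: it may be Car's new code, or a faulty
# interaction with Engine, and debugging must determine which before
# feedback goes to the right developer.
assert Car().drive() == "running"
```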

For testing to be effective in this informal, minimal effort environment, it must be as automated as possible. For the customer's functional testing of the system, there is an array of tools from which to choose. Application-level test tools attach at the presentation layer, GUI, or Web interface, and have record-and-playback capability that allows regression testing to be fully automated and repeated at will.

Unit test tools are harder to find. I share a belief—with many Extreme practitioners—in developer-defined, minimal tools. Kent Beck, with Erich Gamma, wrote the JUnit test framework, which works well with the PACT approach we have used for some time. It supports a minimal, API-based testing technique that partially captures test cases as objects that can then be created and reused. The essential elements of a test case are encapsulated in the test class. The test class places the object under test in the appropriate state, sends the appropriate messages to the object under test, and then examines the object under test to ensure the expected result was achieved. The PACT structure, which takes advantage of OO programming features, minimizes the cost of constant change.
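The test-class structure just described can be sketched with Python's standard-library `unittest` framework, which descends directly from JUnit and follows the same state/message/examine shape. The `Account` class and its values are hypothetical.

```python
import unittest

class Account:
    """Hypothetical object under test."""
    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

class AccountTest(unittest.TestCase):
    def setUp(self):
        # Place the object under test in the appropriate state.
        self.account = Account(balance=100)

    def test_deposit(self):
        # Send the appropriate message to the object under test...
        self.account.deposit(50)
        # ...then examine the object to ensure the expected result.
        self.assertEqual(self.account.balance, 150)

# Run the suite programmatically; each test case is captured as an
# object that can be created and reused, as the framework intends.
suite = unittest.TestLoader().loadTestsFromTestCase(AccountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the fixture lives in `setUp`, every test method starts from a freshly constructed object, which is what keeps the cost of constant change low.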

The most intensive task in either of these types of testing is defining the expected results for a test case. This requires the time of domain-knowledgeable people. In the Extreme Programming technique, this knowledge is provided by a close relationship, and intense communication, with the client. Rotating client participants among the programming pairs provides some level of this communication, and the large amount of informal communication also provides ready access to these domain experts.

I have made two points in this column. First, many projects are not following the full definition of the Extreme Programming technique, which results in a less-effective technique. Second, Extreme Programming has some weaknesses in its definition of testing. I have described additions where benefits outweigh their costs. However, these techniques still do not represent a thorough test of the product; for example, I would not want the flight dynamics system of the airplane used for my next flight to have been tested in this manner. This is a trade-off each development organization should carefully consider for each project.


  1. McGregor, J. D. "Let's Don't and Say We Did," JOOP, 11(5): 6–11, 14, Sept. 1998.
  2. Beck, K. Extreme Programming Explained: Embrace Change, Addison–Wesley, 2000.
  3. Jeffries, R. E. "Extreme Testing," Software Testing & Quality Engineering, Mar./Apr., 1999.