May 10 2008

Confidence as a test code metric

With testing occupying a major part of our development process, we have often attempted to quantify test code quality. Like many things, test code is worth considering in terms of the value it ultimately adds. This is why Stuart and I have recently come to conclude that, stripped of technically granular details, test code must fundamentally contribute to building confidence that the system under test is complete: proof that what we've built works, and will continue working, as intended.

A working system fulfilling its business objectives can be considered complete enough, but, if it is not easily extensible and maintainable, it does not lend itself to the conclusion that it is as good as it can be. Advancements in software development methodologies that assist in delivering working software that is easy to extend and maintain - higher level abstractions, modeling and design - have been driven by the need to reduce technical debt. Technical debt can be viewed as the cost of change.

Test code is code, too. As code bases grow more elaborate, test code also suffers from technical debt, demanding methods to eliminate the factors that hinder its maintainability and extensibility. Current practices geared towards extensible and maintainable test code, however, are habitually inversely proportional to the amount of confidence they achieve: the easier the tests are to write and maintain, the less confidence they tend to provide.

The confidence scale

The different categories on the scale are not mutually exclusive; in fact, they are commonly combined as members of a suite that exercises the system at various degrees of instrumentation. Walking the scale from left (empty) to right (full), we move from tests that are generally easier to write, understand, run and maintain, but less representative of the real system with all its components integrated, towards tests that exercise the system closer to its fully integrated form at the cost of added complexity.

Dependency neutral tests, with all of the tested component's dependencies stubbed, are disconnected: they only vaguely describe how the component interacts with its environment and offer minimal proof that the component will work as specified once it becomes a member of the application ecosystem.
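
As a rough sketch of what this looks like in practice (the PriceCalculator and TaxRateRepository names, and JUnit as the test framework, are assumptions made here for illustration, not part of the original example), a stub based dependency neutral test might be:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class PriceCalculatorStubTest {

    // Hypothetical collaborator the component depends on.
    public interface TaxRateRepository {
        double rateFor(String region);
    }

    // Hypothetical component under test.
    static class PriceCalculator {
        private final TaxRateRepository rates;

        PriceCalculator(TaxRateRepository rates) {
            this.rates = rates;
        }

        double grossPrice(double netPrice, String region) {
            return netPrice * (1 + rates.rateFor(region));
        }
    }

    // A hand-rolled stub: returns a canned value and records nothing
    // about how the component actually talks to it.
    static class StubTaxRateRepository implements TaxRateRepository {
        public double rateFor(String region) {
            return 0.2;
        }
    }

    @Test
    public void addsTaxToNetPrice() {
        PriceCalculator calculator = new PriceCalculator(new StubTaxRateRepository());
        // Only the component's output is asserted; which region it asks for,
        // and how many times, is left entirely unspecified.
        assertEquals(120.0, calculator.grossPrice(100.0, "UK"), 0.001);
    }
}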

The fundamental difference between interaction based dependency neutral tests and their stubbed counterparts is the accurate specification of how collaborating components interact, achieved through the use of mock objects instead of stubs. Here, we concentrate on specifying the contract of communication between two components. Although much closer to how the actual system operates, these tests are still disconnected. Despite the accurate specification of the interaction, we still don't have complete proof that the pieces fit. In particular, interaction based dependency neutral tests do not offer proof that the mocked collaborators have been tested to work.
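
Building on the same hypothetical example, and assuming a mocking library such as Mockito (again, an assumption for illustration rather than anything prescribed here), an interaction based version of the test makes the contract of communication explicit:

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class PriceCalculatorInteractionTest {

    // Same hypothetical collaborator and component as in the stubbed
    // example, repeated so the sketch stands on its own.
    public interface TaxRateRepository {
        double rateFor(String region);
    }

    static class PriceCalculator {
        private final TaxRateRepository rates;

        PriceCalculator(TaxRateRepository rates) {
            this.rates = rates;
        }

        double grossPrice(double netPrice, String region) {
            return netPrice * (1 + rates.rateFor(region));
        }
    }

    @Test
    public void asksTheRepositoryForTheRegionsTaxRate() {
        TaxRateRepository rates = mock(TaxRateRepository.class);
        when(rates.rateFor("UK")).thenReturn(0.2);

        PriceCalculator calculator = new PriceCalculator(rates);
        assertEquals(120.0, calculator.grossPrice(100.0, "UK"), 0.001);

        // The contract is now specified: the calculator must ask for the rate
        // of the region it was given. The real repository, however, is still
        // nowhere in the picture, so the test remains disconnected.
        verify(rates).rateFor("UK");
    }
}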

It becomes apparent that the major flaw of interaction based dependency neutral tests is their disconnect from their peers.

As we move towards the "full" side of the confidence scale, tests tend to become larger and more complicated. Dependency wired tests draw a picture much closer to that of the system in its entirety, but suffer from poor defect localization (test failures are not always directly related to the intent of the specific test) and disrespect encapsulation (setup code often exposes the behavior of components irrelevant to the context of the current test). Their contribution to technical debt is therefore much more significant.
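
A dependency wired version of the same hypothetical test, sketched below with an assumed in-memory repository standing in for a real backing store, shows how setup starts to leak collaborator details into the test and blur defect localization:

import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.Before;
import org.junit.Test;

public class PriceCalculatorWiredTest {

    public interface TaxRateRepository {
        double rateFor(String region);
    }

    // Hypothetical "real" collaborator, standing in for something that would
    // normally be backed by a database or configuration store.
    static class InMemoryTaxRateRepository implements TaxRateRepository {
        private final Map<String, Double> rates = new HashMap<String, Double>();

        void store(String region, double rate) {
            rates.put(region, rate);
        }

        public double rateFor(String region) {
            Double rate = rates.get(region);
            if (rate == null) {
                throw new IllegalArgumentException("Unknown region: " + region);
            }
            return rate;
        }
    }

    static class PriceCalculator {
        private final TaxRateRepository rates;

        PriceCalculator(TaxRateRepository rates) {
            this.rates = rates;
        }

        double grossPrice(double netPrice, String region) {
            return netPrice * (1 + rates.rateFor(region));
        }
    }

    private InMemoryTaxRateRepository repository;

    @Before
    public void wireCollaborators() {
        // Setup has to know how to populate the collaborator, exposing details
        // that have little to do with the intent of the test below.
        repository = new InMemoryTaxRateRepository();
        repository.store("UK", 0.2);
    }

    @Test
    public void addsTaxToNetPriceUsingTheRealRepository() {
        PriceCalculator calculator = new PriceCalculator(repository);
        // A failure here could equally mean a calculator defect or a repository
        // defect: defect localization is poorer than in the disconnected tests.
        assertEquals(120.0, calculator.grossPrice(100.0, "UK"), 0.001);
    }
}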

Recognizing the importance of confidence in our system, and aiming to reduce technical debt, Synthesized Testing suggests a solution that attempts to rectify the disconnect of lightweight, interaction based dependency neutral tests and to reduce the need for overarching, technical debt prone dependency wired tests.