Difficulties with testing software
Testing software is a complex and time-consuming activity. Effective testing revolves around trying different values for the inputs to the system and checking that the system produces the correct results. For a perfectly tested system every possible combination of inputs must be tried, whether or not they are inputs the system is designed to accept – the system must still cope with invalid data and with failures of interconnected systems.
Testing every single input possibility is a futile and wasteful task, especially with very large, complex systems. The set of all possible combinations can be massive, and so take an unreasonable length of time to test. Therefore, effective testing comes down to choosing inputs carefully to model the full range of inputs the system must accept, in the hope of uncovering all the errors in the system. While reasonable testing often cannot detect every error in a system, it can detect enough errors to make the system acceptable for actual use.
Choosing the set of inputs to test has been the subject of much work. Techniques such as white- and black-box testing, unit testing and structural testing are common testing procedures. While these offer different advantages and disadvantages, it is difficult to establish how effective each technique actually is. Testing can be measured on a "rate of discovery" basis: under ideal conditions most faults are detected early in the testing process, with the rate of fault detection slowing until it is low enough to be acceptable for the intended purpose of the software.
However, this assumes the testing process picks up every kind of fault at an even rate. A weak testing process may miss whole classes of error, which will then still be present when the software is delivered or released.
A way of assessing the effectiveness of the testing process is to intentionally introduce faults into the code of the system - changes to the logic of what the program actually does. The number of artificial faults in the code is known; assuming the faults are similar to real, undiscovered faults, an estimate can be made of the number of unknown faults remaining in the system.
If realistic faults are introduced but are not found during testing, it is safe to assume the testing process is poor: if it could not find the artificial errors, the chances are it would find few if any real faults.
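The estimate works by simple proportion. A minimal sketch, where all the counts below are invented for illustration:

```python
# Hypothetical counts from one round of testing.
seeded = 20        # artificial faults deliberately introduced
seeded_found = 16  # how many of those the test set caught
real_found = 40    # genuine faults the same test set caught

# If the tests catch seeded and real faults at the same rate, the
# detection rate is seeded_found / seeded = 80%, so the 40 real
# faults found are assumed to be 80% of the real total.
estimated_total = real_found * seeded / seeded_found
estimated_remaining = estimated_total - real_found

print(estimated_total)      # 50.0
print(estimated_remaining)  # 10.0
```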
Test cases that do not find any artificial faults can be removed from the test set without harm, as they probably won't find any real faults either. In this way the test set can be assessed and improved; effectively "testing the testing".
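This "testing the testing" step can be sketched with a made-up kill matrix recording which test case detects which artificial fault:

```python
# kills[t][m] is True when test case t detects artificial fault m.
# The matrix values here are invented for illustration.
kills = [
    [True,  False, False],   # test 0 catches fault 0
    [False, True,  True],    # test 1 catches faults 1 and 2
    [False, False, False],   # test 2 catches nothing
]

# Keep only the tests that find at least one artificial fault; the
# rest probably would not find real faults either.
useful = [t for t, row in enumerate(kills) if any(row)]
print(useful)  # [0, 1]
```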
Software Mutation is a way of ensuring that artificially introduced faults accurately model the type of faults likely to appear naturally. Mutants can be generated automatically, so test sets can be assessed much more quickly than with errors introduced by hand; this also reduces the possibility of manually introduced faults being biased and unrealistic.
Even simple changes to the code are enough to model the majority of typical errors. The Competent Programmer Hypothesis suggests that code written by good programmers is generally at most a few keystrokes from being correct - a typo in an expression, an incorrect test of equality or an incorrect constant. These are exactly the type of errors mutation can introduce efficiently.
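A minimal sketch of such a one-keystroke mutation in Python, using the standard ast module to turn a `>=` comparison into `>` (the function and its test inputs are invented for illustration):

```python
import ast

class BoundaryMutator(ast.NodeTransformer):
    """Replace '>=' with '>' -- a single-keystroke fault."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op
                    for op in node.ops]
        return node

src = "def is_adult(age):\n    return age >= 18\n"

tree = ast.fix_missing_locations(BoundaryMutator().visit(ast.parse(src)))
ns = {}
exec(compile(tree, "<mutant>", "exec"), ns)
mutant_is_adult = ns["is_adult"]

print(mutant_is_adult(18))  # False: the boundary input exposes the mutant
print(mutant_is_adult(30))  # True: away from the boundary the two agree
```

Only a test case that probes the boundary value distinguishes the mutant from the original, which is exactly why such mutants are useful for judging a test set.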
The test set can then be run against the resulting collection of programs. A mutant is "killed" when some test case produces different output from the original program, showing the test set can detect that fault. Mutants that are logically equivalent to the original program - those that produce exactly the same output for every input - can never be killed; each is effectively a duplicate of the original, so tells us nothing and is removed from the set. This reduces the size of the mutant set, making the process more efficient.
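A sketch of the kill check, using hand-written stand-ins for automatically generated mutants:

```python
def original(a, b):
    return a + b

# Hand-written stand-ins for generated mutants of 'a + b'.
mutants = [
    lambda a, b: a - b,   # '+' flipped to '-'
    lambda a, b: a * b,   # '+' flipped to '*'
    lambda a, b: b + a,   # equivalent: '+' is commutative on numbers
]

tests = [(1, 2), (3, 0), (-1, -1)]

def killed(mutant):
    # A mutant is killed when any test case yields a different output.
    return any(mutant(a, b) != original(a, b) for a, b in tests)

print([killed(m) for m in mutants])  # [True, True, False]
```

The third mutant can never be killed by any test, so it is removed from the set rather than counted against the test cases.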
Issues with mutation testing
Testing programs using conventional methods takes time because many different test cases must be run against the program. Mutation testing instead runs a smaller test set against a larger population of programs.
Traditionally programs are run to completion. This has the advantage of measuring the actual output, so is more accurate: a particularly fault-tolerant program may be able to correct errors introduced along the way. However, waiting for the program to run to completion takes time. This is known as Strong mutation testing.
As the location in the code of each introduced error is known, it is possible to stop execution once the mutated code has run. This is Weak mutation testing. It makes the process much quicker, as the program does not have to run to completion; however, any fault tolerance in the rest of the program is given no chance to correct the situation, so the results of the test may not accurately reflect the program's actual behaviour.
A compromise between the two is Firm mutation testing, where execution is stopped at some point between the mutated code and the end of the program. Ideally this should be early enough to save time, yet late enough to give the program a chance to correct the error itself.
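The difference between the strong and weak verdicts can be sketched as follows; the pricing function and the sum-to-max mutation are invented for illustration:

```python
def run(prices, discount, mutated):
    # The mutated statement: 'sum' replaced by 'max' (hypothetical fault).
    subtotal = max(prices) if mutated else sum(prices)
    weak_state = subtotal             # weak mutation inspects state here
    if subtotal > 100:
        subtotal -= discount
    return weak_state, max(subtotal, 0)   # strong mutation inspects this

weak_orig, strong_orig = run([-5, -3], 10, mutated=False)
weak_mut, strong_mut = run([-5, -3], 10, mutated=True)

print(weak_orig != weak_mut)      # True: weak testing kills the mutant
print(strong_orig != strong_mut)  # False: both final outputs clamp to 0
```

Here the later clamping "corrects" the corrupted intermediate value, so strong mutation misses a mutant that weak mutation catches - the trade-off the three variants negotiate.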
Even a small program can be mutated in an enormous number of ways; modifications can be made at many different points and in many different permutations. This can result in a very large set of mutants to be examined.
Selective mutation offers an alternative. Instead of generating totally random mutants, mutation can be restricted to specific areas of the program code. Sections of code trusted to be correct can be ignored, while sections of code where faults may occur can be tested more thoroughly.
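A sketch of restricting mutation to a suspect region; the line numbers and the grading function are invented for illustration:

```python
import ast

SUSPECT_LINES = {3}   # hypothetical: only this line is distrusted

class SelectiveMutator(ast.NodeTransformer):
    """Flip '<=' to '<', but only inside the suspect region."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        if node.lineno in SUSPECT_LINES:
            node.ops = [ast.Lt() if isinstance(op, ast.LtE) else op
                        for op in node.ops]
        return node

src = (
    "def grade(score):\n"                      # line 1
    "    if score <= 0: return 'fail'\n"       # line 2: trusted, skipped
    "    if score <= 50: return 'pass'\n"      # line 3: suspect, mutated
    "    return 'merit'\n"                     # line 4
)

tree = ast.fix_missing_locations(SelectiveMutator().visit(ast.parse(src)))
ns = {}
exec(compile(tree, "<mutant>", "exec"), ns)

print(ns["grade"](0))   # 'fail': the trusted comparison was left alone
print(ns["grade"](50))  # 'merit': the suspect '<=' became '<'
```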
Adapted from the introduction to my third year Computer Science degree project. Sources include:
- Competent Programmer Hypothesis - "Theoretical and empirical studies on using program mutation to test the functional correctness of programs", Richard A. DeMillo et al, 1978
- "Software Fault Injection", Jeffrey Voas, 1998