1 Introduction

Test-driven development (TDD) is a common Agile practice for software development, introduced by Kent Beck (2002). According to the recent State of Agile Report (2018), 33% of teams use this technique in their everyday work. Mutation testing, in turn, is considered one of the most effective test techniques (Ammann and Offutt 2008). We understand test effectiveness as the ability to detect faults in code. Test thoroughness is usually measured in terms of coverage. The two most popular measures are statement coverage and, in the case of mutation testing, mutation coverage (also known as mutation score).

The recent study by Papadakis et al. (2019) gathers results on mutation testing effectiveness published between 1991 and 2018. In particular, the authors refer to (Ahmed et al. 2016; Chekam et al. 2017; Gligoric et al. 2015; Gopinath et al. 2014; Just et al. 2014; Li et al. 2009; Papadakis et al. 2018; Ramler et al. 2017) and report the following findings:

  • there is a correlation between coverage and test effectiveness;

  • both statement and mutation coverage correlate with fault detection, with mutants having higher correlation;

  • there is a weak correlation between coverage and the number of bug-fixes;

  • mutation testing provides valuable guidance toward improving the test suites of a safety-critical industrial software system;

  • mutation testing finds more faults than the prime path, branch and all-uses criteria;

  • there is a strong connection between coverage attainment and fault revelation for strong mutation but weak for statement, branch and weak mutation; fault revelation improves significantly at higher coverage;

  • mutation coverage and test suite size correlate with fault detection rates, but often the individual (and joint) correlations are weak; test suites of very high mutation coverage levels enjoy significant benefits over those with lower score levels.

In this paper, we investigate the impact of mutation testing on the overall TDD process. To do this, we modified the TDD framework by extending it with an additional step involving mutation testing. Next, we asked eight groups of students to write the same software. Four groups used the TDD approach and the other four used the modified approach with the mutation step (TDD+M). Then, using a cross-testing approach, we compared the effectiveness of tests written in these two TDD frameworks: with and without mutation.

We measure the test effectiveness (and the overall code quality) using statement and mutation coverage. In this context, the study of Papadakis et al. (2019) is important for our research, as it supports the thesis that we can measure effectiveness of test suites in terms of statement and mutation coverage. We also measure the overall code quality by analyzing the number of field defects (that is, found after the release) detected by tests written in one framework on code written by the other one.

The novelty of this paper, compared to the studies cited above, is that we do not focus on the coverage criteria themselves, but on the role of mutation in the TDD process: we investigate whether mutation testing improves the quality of code developed within the TDD approach. Also, because in our experiment all the teams independently wrote the same software, that is, the code for the same set of requirements, we were able to compare the effectiveness of mutation in a more objective way by performing a cross-experiment with cross-testing. Its concept is similar to the defect pooling technique for defect prediction. We use a test suite from one team on the code written by another team. Such an approach allows us to check the effectiveness of a test suite more fairly, because in the cross-experiment the test case design is not biased by the code for which it was written. The tests are executed on code which was not seen by the test designers. We can compare their results with the results of tests written exactly for this code. This way we can compare two test design approaches: without (TDD) and with (TDD+M) mutation involved. We measure the effectiveness of the TDD+M tests ’in itself’, without considering the code for which they were written.

The goal of our study was to answer the following research questions:

RQ1. Do the tests written with the TDD+M approach give better code coverage than the ones written in a pure TDD approach with no mutation process involved?

RQ2. Are the tests written with the TDD+M approach stronger (more effective) than the ones written using a pure TDD approach?

RQ3. Is the external code quality better when the TDD+M is used than in case of using the TDD approach only?

By ’stronger’ or ’more effective’ tests, we mean tests that have a higher probability of detecting faults and that give better coverage in terms of metrics such as statement coverage or mutation coverage (see Section 5.1 for the definition).

To verify RQ1 we use statement coverage, to verify RQ2 – mutation coverage, and to verify RQ3 – the number of field defects found by the tests in the code and their defect detection efficiency. The model for comparison should be as simple as possible to give us clear results and to avoid any biases caused by model complexity. RQ1 seems easy to answer: mutation forces the developers to cover their code more thoroughly, so by definition it will give higher coverage. But it is still interesting to measure how much better their tests are in terms of statement coverage compared to tests written without mutation. To answer RQ2 and RQ3, we will use the above-mentioned cross-testing technique.

RQ1 and RQ2 are about internal quality, and RQ3 about external quality (ISO, 2005). Internal software quality concerns the design of the software and we express it in terms of coverage. External quality is the fitness for purpose of the software and we express it in terms of the number of field defects, that is, defects detected after the release. Of course, this is a simplified view of quality, because quality is a multi-dimensional concept. However, the mentioned metrics are related to quality and are easy to calculate, so we decided to use them in our study.

The rest of the paper is organized as follows. In Sections 2 and 3, we describe the TDD framework and the mutation technique in more detail. In Section 4, we introduce the TDD+M approach, which combines the test-driven approach with the mutation testing process. Section 5 describes the experiment we performed to verify whether our approach works better than a pure TDD method. Section 6 discusses the threats to validity. Section 7 follows with the summary of our findings and some final conclusions. In the Appendix, we describe in detail the two experiments we performed.

2 Test-driven development

A developer working with the TDD framework writes tests for the code before writing this code. Next, the developer implements a part of the code for which all the tests designed earlier should pass. This iterative approach allows the developer to create the application in small pieces, although even when using TDD, the developers sometimes tend to write quite large test cases (Čaušević et al. 2012; Fucci et al. 2017). In each iteration, some part of the functionality is created, but the main rule holds all the time: before the code is written, the developer has to implement the corresponding tests.

The steps within the TDD approach are as follows:

  1. Write a test for the functionality to be implemented.

  2. Run the test (the new test should fail, because there is no code for it) – this step verifies that the tests themselves are written correctly.

  3. Implement the minimal amount of code so that all the tests pass – this step verifies that the code implements the intended functionality for a given iteration. In case of failures, modify the code until all the tests pass.

  4. Refactor the code in order to improve its readability and maintainability.

  5. Return to step 1.

Fig. 1 Test-driven development approach

Refactoring is needed because the code is implemented in a series of many short iterations. In each of them, a small portion of new functionality is added, so the frequent code changes may easily affect its clean structure. Refactoring can make the code tidy again. The TDD process is presented schematically in Fig. 1.

There is a plethora of literature on the TDD method, the seminal publication being Kent Beck’s book (Beck, 2002) mentioned earlier. Another interesting source of knowledge on TDD is (Astels, 2003), which is a practical guide to TDD from the developer’s perspective. Of course, there are also many publications that investigate the impact of TDD on the final application quality. Janzen (2005) verifies the TDD approach in practice and in particular evaluates its impact on internal software quality, also focusing on some pedagogical implications. In (Bhat and Nagappan, 2006), the use of TDD in two different Microsoft applications (Windows and MSN) is presented. The large number of publications (cf. the references in (Khanam and Ahsan, 2017)) suggests that the method is frequently used and has de facto become a standard practice for iterative software development.

However, some meta-analyses and surveys show that the TDD impact on different aspects of software development process is inconclusive. In Table 1, we reproduce the results of such a survey from (Pančur and Ciglarič, 2011). The table presents a quick overview of perceived effects on different parameters in several studies.

Table 1 Overview of perceived effects of TDD on different aspects of software development process (after (Pančur and Ciglarič 2011))

Four out of seven studies showed that TDD has a positive impact on productivity, but the other three showed a negative effect. For external quality (probably the most interesting characteristic), four out of ten studies showed a positive effect, three a negative effect, and three no effect. One study showed a positive effect of TDD on software complexity. Three out of six studies showed a positive effect on code coverage and the other three a negative one.

On the other hand, a more recent study (Munir et al. 2014) seems to come to a different conclusion. It investigates several research studies on TDD taking into account two study quality dimensions: rigor and relevance, which can be either low or high, forming four combinations of these characteristics. The authors conclude: ’We found that studies in the four categories come to different conclusions. In particular, studies with a high rigor and relevance scores show clear results for improvement in external quality, which seem to come with a loss of productivity. At the same time, high rigor and relevance studies only investigate a small set of variables. Other categories contain many studies showing no difference, hence biasing the results negatively for the overall set of primary studies.’

In another survey (Khanam and Ahsan, 2017), the authors examined the impact of TDD on different software parameters, such as software quality, cost effectiveness, speed of development, test quality, refactoring and its impact, overall effort and productivity, maintainability and time required. They conclude that using TDD improves internal and external quality, but developer productivity tends to decrease compared to ’test-last development’. The differences in metrics such as McCabe cyclomatic complexity, LOC and branch coverage are statistically insignificant.

3 Mutation testing

Mutation is typically used as a way to evaluate the adequacy of test suites, to guide the generation of test cases and to support experimentation (Papadakis et al. 2019). In the mutation testing process, we introduce a number of small structural changes into the code. These changes are called mutations and the code with one or more mutations is called a mutant (Ammann and Offutt, 2008). Each mutation represents a simulated defect in the code (Jia and Harman, 2011). This way, mutation testing tries to mimic common programmer errors, like inverting conditional boundaries in logical statements or making off-by-one mistakes (for example, writing ’if \(x>0\)’ instead of ’if \(x \ge 0\)’).

All mutations are defined by the corresponding mutation operators. A mutation operator is a set of syntactic transformation rules defined on the artifact to be tested (usually the source code) (Papadakis et al. 2019). It is crucial that after the mutation the source code can still be compiled with no problems. Each mutation operator is designed to introduce a certain type of defect in the code. For example, Arithmetic Operator Replacement changes an arithmetic operation to any other arithmetic operator. During the testing, both the original code and all the mutants are run against the same set of unit tests. When a test gives a different result on the original and the mutated code, we say that this mutation is detected or killed.
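
To make the notion of killing a mutant concrete, below is a minimal, hypothetical Java/JUnit 4 sketch (the class and test are ours, not taken from the experiment): a conditionals-boundary mutation changes \(\ge\) to >, and a test that exercises the boundary value 100 passes on the original code but fails on the mutant, so the mutant is killed.

import static org.junit.Assert.assertTrue;
import org.junit.Test;

// Hypothetical example, not taken from the experiment's code base.
public class DiscountTest {

    // Production code under test; nested here only to keep the sketch
    // self-contained (in a real project it would live in its own file).
    static class Discount {
        boolean qualifies(int amount) {
            return amount >= 100;   // original condition
            // A conditionals-boundary mutant would change this line to:
            //     return amount > 100;
        }
    }

    // This test kills the mutant: it passes on the original code
    // (100 >= 100 is true) but fails on the mutant (100 > 100 is false).
    @Test
    public void boundaryValueQualifies() {
        assertTrue(new Discount().qualifies(100));
    }
}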

Numerous types of mutation operators have been proposed in the literature. A good review of this topic is presented in (Kim et al. 2001). Mutation operators can not only generate simple syntactic mutations (e.g., by changing one relational operator to another), but also mutations that reflect the types of errors characteristic for object-oriented programming (see (Ma et al. 2002)).

Depending on the mutation detection ratio (hereinafter called the mutation coverage or mutation score), we infer directly about the quality of our tests, i.e. the ability of our tests to detect defects. If a given test did not kill any mutant, this may suggest that the test is weak and perhaps should be removed from our test suite. On the other hand, if a given test kills many mutants, this may suggest that the test is strong. However, one must be very careful with such analyses. For example, a mutant can be trivial, which means that all or almost all tests are able to kill it. In that case the corresponding mutation is very easy to detect, so it does not really bring any added value to the whole process. The decision about the strength of a given test should be based not only on its own results, but also on the performance of all the tests with respect to a given mutant.

Assuming that a given mutant is not equivalent, if no test was able to kill it, this means that our test suite should be enriched (or modified) with a test able to kill this mutant. Adding such a test is usually easy, because we know exactly what kind of defect was introduced in a given mutant and where it is located. This way, by adding new tests, we test our program better. As a result, we increase the external quality of our software.

Mutation testing has a long history. The paper of DeMillo, Lipton and Sayward (DeMillo et al. 1978) is generally considered as the seminal reference for mutation testing. Theoretical background of the mutation testing can be found in the Ammann and Offutt’s book (Ammann and Offutt, 2008). The authors also claim that mutation is widely considered a ”high-end” criterion, more effective than most other criteria but also more expensive.

From a theoretical point of view, mutation testing can be considered a white-box, fault-based testing approach. The injection of faults makes it a fault-based technique, and the fact that we must be able to operate on the source code to generate mutants makes it a white-box technique. Mutation testing can also be classified as syntax-based testing, as the mutation operators operate on strings that are fragments of the source code.

Mutation testing can be performed at all test levels, even at higher levels of testing, like integration testing, acceptance testing or system testing. But in most cases, it is used by developers at the unit testing level. We can mutate all kinds of architectural software logic, like call graphs (integration testing level), architecture design (system testing level) or business requirements specification written in a formal language (acceptance level). The mutation operators must be, of course, defined separately for each level of testing, as there will be a clear difference between the simulation of a defect in a source code and the simulation of a defect in a business requirement specification.

3.1 Mutation testing process

Fig. 2 Mutation testing process

The mutation testing process is presented in Fig. 2. The input data to this process are:

  • a source code of the original, unmodified program P called the System Under Test (SUT),

  • a test suite T written for P.

The set T can be given beforehand (these may be the unit tests written by the developers) or created/modified ’on the fly’ in each iteration of the mutation process. The first case usually takes place if our intention is to check the quality of the tests. The second one – when we want to guide the creation of a new test case.

Both P and all generated mutants are subjected to the test set T. Let M be the set of all generated mutants, \(m \in M\) – a particular mutant, and \(t \in T\) – a particular test. By P(t) (resp. m(t)) we denote the result of running t on the original program P (resp. on the mutant m). If, for a given \(m \in M\), there exists \(t \in T\) such that \(P(t) \ne m(t)\), we say that m has been killed by t. This means that the given set of tests is able to detect the injected defect represented by m (notice that there is no need to run other tests for this mutant). If, on the other hand, \(\forall t \in T:\ m(t)=P(t)\), it means that the given set of tests was not able to detect the fault simulated within m. In this case, a tester should add a new test \(t'\) (or modify some test \(t' \in T\)), so that \(m(t') \ne P(t')\). The process is repeated until a desirable mutation coverage is achieved or some other previously defined condition is fulfilled (the most obvious one being running out of time).

Let \(M_1 \subset M\) be the set of all \(m \in M\) such that \(\exists t \in T:\ m(t) \ne P(t)\). The ratio \(|M_1| / |M|\) of killed mutants to all generated mutants is called the mutation coverage. Usually, it is required that M does not contain equivalent mutants, that is, mutants that are semantically equivalent to P. This may happen during the mutant generation process, but unfortunately the problem of deciding whether a given mutant is equivalent to P is undecidable. There are some heuristics to detect some simple equivalences (for example, when a mutation changes two lines of code whose ordering is not important), but in general it is not possible to detect all of them. Hence, because we can never be sure if some mutants are not equivalent, the mutation coverage metric may be lower than the true mutation coverage (with equivalent mutants eliminated). On the other hand, the redundant mutants can inflate the mutation score (Ammann et al. 2014). Therefore, the mutation score can be disturbed in both ways and should be treated with caution.
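
The kill check and the coverage ratio above can be summarized in a short sketch. The following Java fragment is illustrative only (the verdict arrays are our own assumption; in practice a tool such as PIT computes this automatically): passOnOriginal[t] is the verdict of test t on P, and passOnMutant[m][t] is the verdict of the same test on mutant m.

public class MutationScore {

    // A mutant m is killed if some test t yields a different verdict on m
    // than on the original program P, i.e. P(t) != m(t).
    static boolean isKilled(boolean[] passOnOriginal, boolean[] passOnMutant) {
        for (int t = 0; t < passOnOriginal.length; t++) {
            if (passOnOriginal[t] != passOnMutant[t]) {
                return true;    // killed by test t; no need to run further tests
            }
        }
        return false;           // all tests agree with P: the mutant survived
    }

    // Mutation coverage = |M1| / |M|, the ratio of killed to generated mutants.
    static double mutationCoverage(boolean[] passOnOriginal, boolean[][] passOnMutants) {
        int killed = 0;
        for (boolean[] mutantVerdicts : passOnMutants) {
            if (isKilled(passOnOriginal, mutantVerdicts)) {
                killed++;
            }
        }
        return (double) killed / passOnMutants.length;
    }
}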

It is obvious that the effectiveness of the mutation process depends heavily on the types of mutation operators used. As in the case of mutants, we can introduce the notion of a trivial mutation operator, which generates only trivial mutations. Such an operator is not of much value. We can also define the effectiveness of a mutation operator. Let \(M_O \subset M\) be the set of all mutants generated by the mutation operator O and let \(M_O' \subset M_O\) be the set of all mutants \(m \in M_O\) such that \(\forall t \in T:\ m(t) = P(t)\), that is, the mutants that survived. The effectiveness eff of O can be defined as \(eff(O) = |M_O'| / |M_O|\).

In the following considerations, let us assume that we were able to detect equivalent mutants and remove them prior to further analysis. If \(eff(O)=0\), the given operator is weak, as all its mutants were killed. If \(eff(O)=1\), all mutants generated by O have survived. But, again, one must be very careful with such analyses. It may happen that, for example, all mutations generated by O were introduced in so-called dead code (that is, a part of the code which for some reason cannot be executed). In such a case \(eff(O)=1\), but the mutation operator O cannot be evaluated. This metric can be easily corrected by requiring \(M_O\) to be the set of mutants in which the mutated instruction was actually executed.

The operator’s effectiveness is measured in reference to a given program. Maintaining inefficient mutation operators may contribute to the formation of trivial mutants.

4 Test-Driven Development + Mutation Testing

To the best of our knowledge, the subject of mutation testing in the context of Agile programming techniques has barely been studied so far. In the literature, there are only a few papers related to this topic. One such work is (Derezinska and Trzpil, 2015), where the usage of mutation testing is proposed as an enrichment of some software development methodologies. However, the authors consider this problem only speculatively and theoretically, without using any empirical research to examine its effectiveness in practice.

Most of the research that combines TDD and mutation uses mutation coverage only to assess the quality of test cases or to compare the test-first vs. test-last approach (cf. (Madeyski, 2010b; Aichernig et al. 2014; Tosun et al. 2018)). As these researchers do not use a cross-testing approach, they are not able to evaluate the effectiveness of the TDD+M approach in the way we do in this research. They use mutation coverage as the quality metric, while we use mutation testing directly in the TDD process. In this case, a non-cross-testing setting does not allow us to use mutation coverage as an indicator, because of the obvious bias. The cross-testing approach allows us to do that (see Section 5).

In the literature, one can also find reports indicating that some software development companies are starting to use mutation testing as an extension of the software development methodologies used so far (Ahmed et al. 2017; Coles et al. 2016; Groce et al. 2015). One such report can be found on the website of PITest (Kirk, 2018) – a mutation testing system for Java and the JVM. The PIT tool is scalable and integrates well with modern test and build tools like Maven, Ant or Gradle. However, these studies do not report the detailed impact of mutation on test effectiveness.

The TDD methodology can be enriched with the mutation testing process. We call this the TDD+M approach. We modify the TDD practice by adding to the TDD process an additional step of mutation testing, before the code is refactored. This way, the quality and security of the software can be significantly improved, as the developers become aware early of the low quality of their tests. This gives them a chance to improve their unit tests before the next TDD iteration. It is important to remember that applying the TDD+M methodology allows us to verify the correctness and strength of our tests, but it can also be used to better control the correctness of the tested implementation. The TDD+M process is presented in Fig. 3.

Fig. 3 Test-driven development + mutation (TDD+M) process

The additional step – mutation testing – is inserted between the test execution and the code refactoring. After all the tests pass, we perform the mutation testing process. It may reveal that although the tests have passed, they are weak. If they are not able to kill some of the generated mutants, we may add new tests or modify the existing ones to kill them, or we can end the process if the desired mutation coverage has been achieved. Only after this step is finished do we refactor the code, if necessary, and repeat the whole cycle again.

The TDD process allows the developer to reduce one of the cognitive biases, the so-called confirmation bias. It is defined as the tendency of people to seek evidence that verifies a hypothesis rather than evidence to falsify it. Due to the confirmation bias, developers tend to design unit tests so that they confirm the software works as they expect it to work. This phenomenon was confirmed empirically in the context of unit testing and software quality (Calikli and Bener, 2013). TDD forces the developers to write test cases before the whole design and implementation process, allowing them to achieve greater independence from the code and thus reducing the bias.

In the TDD+M approach, mutation testing is an additional step that allows the developer to verify the bias reduction objectively, by directly evaluating the quality of the test cases. When mutants survive the mutation phase, the developer knows that the designed test cases are weak, because they are not able to detect a potential defect in the code. The test case is modified or a new test case is added so that this particular defect is detected. Test correction is done in the same way as in the original TDD approach. When no test is able to kill a mutant, there is a chance that it is an equivalent mutant. Because the problem of deciding whether a given mutant is equivalent is undecidable in general, this check must usually be done manually. This is also an opportunity for a developer to better understand their code.

The mutation step ends when a mutation score threshold is achieved. This threshold is set by the developers, and the decision should be based on their experience, historical data, source code, software development lifecycle, risk level taken into account etc. The rules here are the same as with any other white-box coverage criteria, like statement or decision coverage. The threshold can be set up for a particular project or even for a set of iterations in the project. It may be modified when the results clearly show that it may be difficult to achieve it.

The main question in this research study is: how does the TDD+M process influence the strength of the tests and, at the end of the development, the external software quality?

5 Experimental comparison of TDD and TDD+M

As stated in Section 1, the goal of our study is to answer the following research questions:

RQ1. Do the tests written with the TDD+M approach give better code coverage than the ones written in a pure TDD approach with no mutation process involved?

RQ2. Are the tests written with the TDD+M approach stronger (more effective) than the ones written using a pure TDD approach?

RQ3. Is the external code quality better when the TDD+M is used than in case of using the TDD approach only?

In order to test the hypotheses about the TDD+M approach related to RQs 1-3, we performed a pre-experiment (later called Experiment 0), followed by a controlled experiment. The pre-experiment was done on a small group of eight computer science students split into two groups. The aim was to verify whether the TDD+M approach can be applied at all. The proper experiment was then done on a larger group of 22 students. Since no statistically significant conclusions can be drawn from Experiment 0 due to its small sample size, its description and results are given in Appendix 1, so that they do not disturb the flow of the paper.

5.1 The scope of experiment

The main goal of our experimental study was to verify to what extent the quality of software and tests grows, when using the TDD+M approach.

Using the standard goal template (Basili and Rombach, 1988) we can define the scope as follows:

  • Analyze the TDD+M approach

  • for the purpose of evaluation

  • with respect to effectiveness related to code quality

  • from the point of view of the researcher

  • in the context of computer science students developing the code.

5.2 Context selection

The experiment compares the existing TDD approach with its modified version, TDD+M. The comparison is performed in the context of software quality, expressed in terms of the strength of the tests and the number of defects found.

5.3 Hypotheses

We test the following null vs. alternative hypotheses, related to Research Questions 1–3:

  • \(H^1_0\): TDD+M tests run on all codes (excluding their own code) give the same statement coverage as TDD tests run on all codes (excluding their own code) vs. \(H^1_A\): they give different statement coverage,

  • \(H^2_0\): TDD+M tests run on all codes (excluding their own code) give the same mutation coverage as TDD tests run on all codes (excluding their own code) vs. \(H^2_A\): they give different mutation coverage,

  • \(H^3_0\): TDD+M tests find the same number of defects in code as TDD tests vs. \(H^3_A\): they find a different number of defects in the code.

If rejected, the null hypotheses are rejected in favor of the alternatives. If the test statistic falls in the right tail of the distribution, it means that the TDD+M technique provides stronger tests than the ones written using the TDD approach.

5.4 Variables selection

The study of Madeyski (2010a) evaluated the TDD approach with, among others, the MSI (Mutation Score Indicator), defined as the lower bound on the ratio of the number of killed mutants to the total number of non-equivalent mutants. It is a lower bound, not the exact value, because of the possible existence of undetected equivalent mutants. The MSI metric (we call it the ’mutation coverage’) serves as a complement to code coverage in evaluating test thoroughness and effectiveness. The study of Madeyski showed a positive effect of TDD in comparison with the ’test last’ technique.
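
In the notation of Section 3.1, this lower-bound property can be restated as follows (our restatement, assuming that equivalent mutants can never be killed):

$$\begin{aligned} MSI = \frac{|M_1|}{|M|} \le \frac{|M_1|}{|M| - |M_{eq}|}, \end{aligned}$$

where M is the set of all generated mutants, \(M_1\) the set of killed mutants and \(M_{eq}\) the (unknown) set of equivalent mutants; the right-hand side is the exact mutation score computed over the non-equivalent mutants only.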

We follow this approach and propose a modification of the TDD approach, enriched with the mutation testing step. We evaluate software quality through the strength of the tests, expressed in terms of statement coverage and mutation coverage. These are the two main dependent variables. Our main hypothesis is that developers working with the TDD+M method achieve better code coverage and their tests are stronger than when only TDD is used. The third metric used is the total number of defects found.

5.5 Selection of subjects

The experiment involved 22 computer science students, split into eight 2- or 3-member teams. Four groups (numbered 1, 2, 3, 4) worked using the TDD+M approach and the other four (numbered 5, 6, 7, 8) used the ordinary TDD approach. Initially there were nine groups, but one was removed from the experiment, as during the weekly review it turned out that its members did not follow the TDD approach. Before the experiment, the students were trained in the TDD method. In the case of the TDD+M groups, the students were also trained in mutation testing.

The groups were selected using a simple random sampling technique. Before the experiment, the participants were asked to self-assess their developing and testing skills. A simple survey contained only two questions:

  1. How good, according to you, are your developing skills?

  2. How good, according to you, are your testing skills?

Both answers had to be expressed on a 5-level Likert-like scale (Likert, 1932), where 1 = no skills and 5 = expert skills. The answers (raw data) and their mean values for the teams are presented in Table 2. The TDD+M groups seem to self-assess their developing skills a bit higher than the TDD groups, but the relation is opposite for the testing skills. Due to the small sample sizes (two or three values) and the scale used (an ordinal Likert-like scale, not a ratio scale), we cannot apply the Kruskal-Wallis test to statistically verify the hypothesis about the equality of distributions.

Table 2 Team members self-assessment

5.6 Experiment design

The groups had to implement an extended version of the application from Experiment 0 (see Appendix 1). It consisted of a library implementing matrix operations, a library implementing simple geometric computations, a web interface for both libraries and a server processing HTTP requests. The last two components implemented the functionality of the user interface.

The students received only the JavaDoc file, so they had to write code from scratch. This is a technical, but very important step in our experiment. By providing the same interfaces to implement for all groups, we were able to perform the ”cross-testing” procedure described later.

The mutation was performed with the PIT tool. All TDD+M groups used an identical PIT configuration in which all of the default mutation operators were enabled. The same configuration was used for the TDD groups when checking the mutation coverage. The following set of mutation operators, the default in PIT, was used:

  • ReturnValsMutator – mutates the return value (for a bool variable it replaces TRUE with FALSE; for int, byte and short it replaces 1 with 0 and 0 with a value other than 0; for long it replaces x with \(x+1\); for float it replaces x with \(-(x+1.0)\) if x is not NAN and replaces NAN with 0; for objects it replaces non-null return values with null and throws a java.lang.RuntimeException if the unmutated method would return null);

  • IncrementsMutator – replaces increments with decrements and vice versa, for example i++ is changed to i--;

  • MathMutator – replaces binary arithmetic (int or float) operator with another operator;

  • NegateConditionalsMutator – replaces operator with its negation: \(==\) with \(!=\), \(<=\) with >, > with \(<=\) etc.;

  • InvertNegsMutator – inverts the negation of integer and floating point numbers, for example i = -j+1 will be changed to i = j+1;

  • ConditionalsBoundaryMutator – replaces open bound with closed one and vice versa, for example < with \(<=\), \(>=\) with > and so on;

  • VoidMethodCallMutator – removes method calls to void methods.

The students used the following tech stack for their projects: Java v. 1.8, PIT v. 1.3.0, JUnit v. 4.0 and Maven v. 3.5.2. The experiment lasted for three weeks. The teams implemented the application and created the tests using the iterative approach. After this time, a manual process of adjusting the test cases was performed, so that they could be run on each team’s software. This required creating 64 pairs (team X tests run on team Y software).
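
For illustration, a minimal PIT configuration of this kind could look as follows in the project’s pom.xml. This is a sketch only: the package names are hypothetical and the mutation score threshold is an example value; when no mutators are listed explicitly, PIT applies its default set, as in the experiment.

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.3.0</version>
  <configuration>
    <targetClasses>
      <param>com.example.matrix.*</param>      <!-- hypothetical production packages -->
    </targetClasses>
    <targetTests>
      <param>com.example.matrix.*</param>      <!-- hypothetical test packages -->
    </targetTests>
    <!-- Optionally fail the build below a chosen mutation score (cf. Section 4). -->
    <mutationThreshold>80</mutationThreshold>
  </configuration>
</plugin>

The mutation analysis is then run with mvn org.pitest:pitest-maven:mutationCoverage.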

Hence, this experimental design allowed us to execute tests from any group on the code from any group. We were able to measure the performance of the tests: 1) from the TDD+M groups on all codes, 2) from the TDD groups on all codes, 3) from the TDD+M groups on their own code, 4) from the TDD groups on their own code, 5) from the TDD+M groups on the TDD code and 6) from the TDD groups on the TDD+M code. This ’cross-testing’ was performed to assess the tests’ strength in a more objective way, as the tests are assessed by executing them on the code from different groups, that is, on code which was not the basis for their design.

5.7 Results

Since several groups independently developed the same application, all the applications developed in this experiment were also subjected to static analysis performed with the SonarQube tool (ver. 6.3). The aim was to verify whether the applications are similar in terms of size and complexity. To check this, two metrics were used: lines of code (LOC) and cyclomatic complexity (CC).

The results are presented in Table 3. The metrics were calculated for each file separately. The students received the pre-prepared code with the definitions of interfaces. The total LOC of this pre-prepared code was 284. The last two rows also present the sum and the mean value for all the metrics. The symbol ’(M)’ denotes that a given group worked with the TDD+M approach. A dash symbol means that a given group did not implement a given piece of code.

Table 3 Statistics for the projects

The results from Table 3 show that the biggest program was written by Group 07. Its mean cyclomatic complexity over all the modules is also the biggest one, 13.1. Its two classes, MatrixMath.java and Matrix.java, had cyclomatic complexity of 38 and 24, respectively. This suggests that the code in these classes is unstable, probably has many loops, and its control flow graph is quite complex. For this group, one can expect a large number of mutants. Group 07 is also one of the groups which did not work with the TDD+M, but with the ’normal’ TDD approach.

Group 06 seems to be the best out of all groups in terms of software complexity. Its code is quite small and the complexity is low. However, a closer analysis revealed that the reason was the inaccurate and cursory implementation of both tests and classes. Group 06 also used the TDD approach and did not use mutation testing.

The results suggest that out of all groups that did not use the mutation technique, the best one seems to be Group 05. Its mean cyclomatic complexity is 9.4, and the total number of lines of code is 789. In the self-assessment survey (see Table 2), this group did not evaluate itself highly. What may be worrying for Group 05 is the high value of the cyclomatic complexity for the classes Matrix.java and MatrixMath.java: 48 and 32, respectively. These two classes generally have a large cyclomatic complexity because they implement most of the computational code.

The manual code analysis of the TDD+M groups allows us to say that the best out of all groups was 01, and the one with the most complex code – 02. However, this is not reflected by the metrics in Table 3, maybe except for the Matrix.java class.

Due to the aim of the experiment and the techniques used (test-driven approach, mutation testing), the code quality should be, to some extent, a side effect of good, strong tests. When tests are executed and the defects found are corrected, this testing process should increase the confidence in the software quality. Hence, even if the metrics show high values for the code complexity (and, as a result, a potentially unstable or hard to maintain code), good tests may compensate for the danger of defect insertion.

In Table 4, we present the summary results on statement and mutation coverage for all 56 pairs (XY) (meaning: code from Group X, tests from Group Y, \(X, Y \in \{1, ..., 8\}\), \(X \ne Y\)) in the cross-experiment. As we want to assess the quality of tests by running them on codes from other groups, we do not take into account the results of tests written by group X run on group X’s own code. That is why we exclude from further considerations all eight pairs (XY) in which \(X=Y\). Just for informative purposes, the results for the pairs (XX) for the mutation groups 01M, 02M, 03M, 04M were, respectively: 67/70, 21/24, 45/62, 84/90. The results for groups 05, 06, 07, 08 were, respectively: 64/82, 31/21, 42/49, 28/25.

The columns in the table correspond to the codes from all eight groups, and the rows represent the tests written by these groups. In the top row, the number in brackets denotes the number of mutants generated for a given group’s code. The numbers in brackets in the leftmost column denote the number of tests written by a given group. For example, Group 04 wrote 76 tests and 392 mutants were generated for their code.

After the mutants were generated, we had to eliminate the equivalent mutants. The manual analysis detected no such mutants. The reason may be that most of the mutations were related to math or arithmetic operations, where changing the sign, relational operator or arithmetic operator usually cannot introduce equivalence of the expressions under mutation.

Each cell (XY) of the table shows the statement coverage and mutation coverage for code from Group X tested with tests from Group Y. For example, the tests from Group 03 executed on code from Group 06 achieved 65% of statement coverage and 83% of mutation coverage. For groups 05, 06, 07, 08, which did not use the mutation testing during the development process, the mutation was performed post factum on the final version of the code.

The last two columns show the mean coverage values for tests executed on all TDD+M (resp. TDD) groups. Similarly, the last two rows of the table represent the mean coverage values for a given code and all tests from the TDD+M (resp. TDD) groups.

Table 4 Statement/mutation coverage (in %) for all combinations (code from Group X, tests from Group Y)

In Table 5, the mean coverage for both TDD+M and TDD groups is compared. The group id (first column) XY encodes ’X tests on Y code’, where \(X, Y \in \{M, T, A\}\). The symbol M denotes the TDD+M teams, T – the TDD teams and A – all teams. The coverage for the MA and TA groups is averaged from 28 measurements (4 test suites times 7 groups, excluding the code of the group that wrote the tests). The coverage for the MT and TM groups is averaged from 16 measurements (4 test suites from 4 groups times 4 codes from 4 groups). In the case of MM and TT, the coverage is averaged from 12 values (excluding tests run on the code written by the same team).

Using our cross-testing approach, we can compare different groups using different comparison criteria. We will now use it to answer Research Questions RQ1, RQ2 and RQ3.

5.7.1 Answer to Research Question 1

RQ1. Do the tests written with the TDD+M approach give better code coverage than the ones written in a pure TDD approach with no mutation process involved?

In order to answer RQ1, we compare MA with TA in terms of statement coverage. The code coverage for MA is higher than for TA (49.3% vs. 31.1%; the difference is 18.2%). This shows that the tests written using the TDD+M approach are stronger, as they achieve better statement coverage. Notice that the difference is also significant when we restrict our measurements to code from the TDD groups only. In this case, the difference between the MT and TT groups is \(53.3\%-30.9\%=22.4\%\) in terms of statement coverage.

5.7.2 Answer to Research Question 2

RQ2. Are the tests written with the TDD+M approach stronger (more effective) than the ones written using a pure TDD approach?

To answer RQ2, we compared MA with TA in terms of mutation coverage. That is, we compare the test results of both the TDD+M and TDD approaches on all codes, excluding the cases of tests executed on the code for which they were written. The MA group achieved, on average, 63.3% mutation coverage. The TA group achieved only 39.4% mutation coverage, which is 23.9% less than in the MA case. The difference in mutation coverage is also significant when we restrict our measurements to code from the TDD groups only and equals \(64.9\%-35.2\%=29.7\%\). This shows that mutation analysis may be a powerful tool. When a team does not use it, the tests are much weaker than the ones written with TDD+M – the probability of detecting a fault will be lower than in the case of the TDD+M teams.

5.7.3 Statistical analysis for the results on RQ1 and RQ2

We observe the difference both in terms of statement and mutation coverage. Now, we will check whether the obtained results are statistically significant. We perform a statistical analysis to verify if the TDD+M approach allowed the teams to create stronger tests than in case of groups using only the TDD approach.

Table 5 Coverage comparison of TDD+M groups and TDD groups (in %)

We applied two-tailed, unpaired Student’s t-tests to the coverage values (statement (RQ1) and mutation (RQ2)) of the TDD+M and TDD groups applied to all eight projects to verify whether there is a statistically significant difference between the two approaches. As mentioned earlier, to avoid the obvious bias, we removed from the analysis all pairs (XY) where \(X=Y\) (that is, we excluded the data from the diagonal in Table 4). We have two t-tests: one for statement coverage and the other for mutation coverage. We compare two populations: one (\(P_M\)) with tests written by the TDD+M groups and the other (\(P_T\)) with tests written by the TDD groups.

The compared groups are formed by the values from rows 1-4 and 5-8 of Table 4, excluding the diagonal values. So, we have two samples of equal size (28) for the statement coverage and two samples of the same size for the mutation coverage. First, we have to check whether the t-test assumptions are fulfilled. These are: 1) homogeneity of variances of both populations; 2) normal distribution of the estimator of the mean value.

All four samples are close to the normal distribution (p-values of the Shapiro-Wilk normality test for statement coverage: \(P_M\) – \(p=0.1825\), \(P_T\) – \(p=0.02\); for mutation coverage: \(P_M\) – \(p=0.011\), \(P_T\) – \(p=0.08\)). None of these results is statistically significant at \(\alpha =0.01\), so normality cannot be rejected for any of the samples. Hence, we can use the F-test to check the homogeneity of variance in the samples. In the case of statement coverage \(p=0.94\), in the case of mutation coverage \(p=0.45\), so we cannot reject the hypothesis about the equality of variances for either the statement coverage or the mutation coverage population. Because the samples approximately follow the normal distribution, the estimators of the mean value will also be normally distributed. As for the power of the t-test in our case, all samples are of size 28. The power of a two-sample t-test for \(\alpha =0.05\) and effect size 0.8 is 0.836, which is considered reasonable.
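
The comparison itself is straightforward to reproduce. The sketch below shows one way to compute the two-tailed, equal-variance t-test p-value and Cohen’s d for two coverage samples; it assumes the Apache Commons Math library, which was not part of the experiment’s tech stack, and the array contents would be the 28 per-pair coverage values from Table 4.

import org.apache.commons.math3.stat.StatUtils;
import org.apache.commons.math3.stat.inference.TTest;

public class CoverageComparison {

    // Two-tailed, unpaired t-test assuming equal variances (the F-test above
    // did not reject the homogeneity of variances). Returns the p-value.
    static double twoTailedP(double[] tddmCoverage, double[] tddCoverage) {
        return new TTest().homoscedasticTTest(tddmCoverage, tddCoverage);
    }

    // Cohen's d computed with the pooled standard deviation.
    static double cohensD(double[] a, double[] b) {
        double pooledVariance = ((a.length - 1) * StatUtils.variance(a)
                               + (b.length - 1) * StatUtils.variance(b))
                               / (a.length + b.length - 2);
        return (StatUtils.mean(a) - StatUtils.mean(b)) / Math.sqrt(pooledVariance);
    }
}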

The above analysis suggests that we can use t-test to analyze the difference between means of both statement and mutation coverage. The results are shown in Figs 4 and 5, and the detailed results of the t-test are shown in Table 6.

Fig. 4 The difference in statement coverage between TDD+M and TDD groups

Fig. 5 The difference in mutation coverage between TDD+M and TDD groups

Table 6 Results of Student’s t-tests for TDD+M and TDD groups

The t-tests show a statistically significant difference in the coverage achieved by the TDD+M tests and the TDD tests, both in terms of statement and mutation coverage (\(p<0.0001\)). Cohen’s d for the statement (resp. mutation) coverage is 1.091 (resp. 1.07), which is considered to be between large and very large (Cohen, 1988; Sawilowsky, 2009).

The results answer the Research Questions RQ1 and RQ2 positively: the tests written with the TDD+M approach give higher code coverage and achieve better mutation coverage. This means the TDD+M approach allows the developers to write stronger tests in terms of their ability to detect faults.

5.7.4 Answer to Research Question 3

RQ3. Is the external code quality better when the TDD+M is used than in case of using the TDD approach only?

Table 7 presents the number of defects found by the tests for each pair (tests, code). As we can see, the tests from the TDD+M groups were able to detect, on average, 10 (=(0+11+18+11)/4) defects in the code from a TDD group. On the other hand, the tests from the TDD groups were able to detect, on average, only 1.75 defects in the code written by a TDD+M group. The tests from Groups 01, 06 and 08 were not able to detect any defects in any project.

Table 7 Defects found for each pair (tests, code)

This answers RQ3 in terms of the number of field defects: the code written with the TDD+M method seems to be of better quality than the code written with a pure TDD technique. However, due to the small sample sizes (four TDD+M teams vs. four TDD teams) we cannot perform any reasonable statistical test – we can only report the raw results in Table 7.

From Table 3, we know that the total cyclomatic complexity for the TDD+M (resp. TDD) groups was 171 and 174.5 and the average LOC – 750 and 792. This means that the TDD+M and TDD projects are similar in terms of complexity and size. Taking the LOC metric into account, we can say that, on average, the TDD+M tests were able to detect 12.62 defects per KLOC in the TDD code, while the TDD tests were able to detect only 2.33 defects per KLOC in the TDD+M code. This shows that the TDD+M tests seem to be stronger and more effective than the tests written in the ’pure’ TDD approach.

The code from Group 05 had 18 defects detected (the 11 defects detected by the tests from Groups 02 and 04 are proper subsets of the defects detected by the tests from Group 03). These results show that the code written with the TDD+M approach seems to be of better quality than the code written with the ’pure’ TDD approach. This answers our Research Question 3 positively: the external code quality is better when TDD+M is used than in the case of using the TDD approach only.

5.7.5 A note on Defect Detection Efficiency regarding RQ1, RQ2 and RQ3

We can evaluate the strength of the test cases and the code quality with yet another measure. As we have four independent TDD+M test suites for the same set of 4 TDD programs, and four independent TDD test suites for the same set of 4 TDD+M programs, we can compare the test suites written using the TDD and TDD+M approaches in terms of their defect detection ability. We can do this by calculating the ratio of the DDE (Defect Detection Efficiency) metrics of the two approaches, assuming we have only one phase/stage of development. Let \(S = \{1, 2, \ldots \}\) be the set of all teams (represented by indices) and let \(S=T \cup M\), \(T \cap M = \emptyset\), where T denotes the teams that used only TDD without mutation and M – the teams that used the TDD+M approach. Let \(d_{ij}\), \(i, j \in S\), be the number of defects found by the i-th team’s tests on the j-th team’s code. Let \(D_j\) be the total number of distinct defects found in the code of team j. In our case, by manual checking, we know that \(D_2=2\), \(D_3=5\) and \(D_5=18\). No defects were found in the code of the other teams, so we assume (for the sake of this analysis) that they are bug-free.

We can now define the metrics \(DDE_T\) and \(DDE_M\) for the TDD and TDD+M approach. We do it by averaging the defect detection efficiency for all TDD (resp. TDD+M) tests on all buggy codes:

$$\begin{aligned} DDE_T= & {} \frac{1}{|\{j \in M: D_j>0\}|} \times \sum _{j \in M: D_j>0} \frac{1}{|T|} \sum _{i \in T} \frac{d_{ij}}{D_j},\\ DDE_M= & {} \frac{1}{|\{j \in T: D_j>0\}|} \times \sum _{j \in T: D_j>0} \frac{1}{|M|} \sum _{i \in M} \frac{d_{ij}}{D_j}. \end{aligned}$$
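
The nested averaging above can also be written as a short routine. The following Java sketch is our own illustration: teams are mapped to array indices, d[i][j] holds the number of defects found by team i’s tests in team j’s code, and D[j] the total number of distinct defects known in team j’s code; then \(DDE_T\) = averageDde(T, M, d, D) and \(DDE_M\) = averageDde(M, T, d, D).

public class DefectDetectionEfficiency {

    // Average, over all codes j (from codeTeams) with at least one known defect,
    // of the mean fraction of j's defects found by the tester teams' suites.
    static double averageDde(int[] testerTeams, int[] codeTeams, int[][] d, int[] D) {
        double sum = 0.0;
        int buggyCodes = 0;
        for (int j : codeTeams) {
            if (D[j] == 0) {
                continue;                 // only codes with known defects are counted
            }
            double perCode = 0.0;
            for (int i : testerTeams) {
                perCode += (double) d[i][j] / D[j];
            }
            sum += perCode / testerTeams.length;
            buggyCodes++;
        }
        return sum / buggyCodes;
    }
}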

Using the data from Table 7, we have:

$$\begin{aligned} DDE_T= & {} \frac{1}{2} \times \left[ \frac{1}{4}\left( 0+0+\frac{2}{2}+0 \right) + \frac{1}{4}\left( \frac{5}{5}+0+0+0\right) \right] = 0.25,\\ DDE_M= & {} 1 \times \frac{1}{4} \left( 0+\frac{11}{18}+\frac{18}{18}+\frac{11}{18} \right) = 0.55. \end{aligned}$$

This analysis answers RQ1 and RQ2 in terms of the test strength measured by the DDE. It may also, indirectly, answer RQ3, when we assume that all detected defects are removed. In such case, we may claim that the test suites with higher DDE contribute better to the overall code quality than the test suites with lower DDE. In our case, the tests written with the TDD+M approach are \(\frac{DDE_M}{DDE_T}=\frac{0.55}{0.25}=2.2\) times more effective in detecting defects than test suites written with TDD without mutation.

5.7.6 Learning outcome of the students involved in the experiment

We did not measure the learning outcome of the students that actually used mutation testing as opposed to those that did not. However, during the experiment, the TDD+M students told the experimenter that they were happy with using mutation testing and that the other (TDD) groups were ’even envious’ about this fact. Moreover, the TDD+M groups usually delivered their tasks ca. 1-2 days before the deadline.

6 Threats to validity

6.1 External validity

The experimental outcomes in both experiments might be disturbed by the fact that the participants were not experts in professional software development. This refers especially to Experiment 0 described in Appendix 1, as its participants had no experience as developers in professional software houses. On the other hand, almost all participants in the main experiment had already been working in software houses, but did not have much experience as developers. Some parts of the code were of poor quality (as in the case of Group 06).

Although the number of measurements in the experiment was high enough (64 data points in the cross-experiment) to provide reliable statistical results, the experiment was performed on only one, small project. The developers were undergraduate computer science students, not professional developers. Hence, we cannot generalize that the TDD+M approach will work better in any type of project, involving people with any level of experience. However, the high statistical significance of the difference between the TDD and TDD+M approaches may lend some support to such a generalization.

The chosen problem domain (a matrix operations library) fits mutation testing well because of the many opportunities for creating different mutants. Hence, the obtained results could be influenced in part by the choice of this particular problem to solve. The results may be different for other types of software.

6.2 Internal validity

The students formed two disjoint sets of participants; hence, the reactive or interaction effect of testing was not present (this factor may jeopardize external validity, because a pretest may increase or decrease a subject’s sensitivity or responsiveness to the experimental variable (Willson and Putnam, 1982)).

However, in both experiments the students worked in teams (pairs or larger groups). This may introduce a new covariate which is not controlled and may have some impact on the observed results. This threat is minimized when we consider the results at the team level, not the individual level.

The code was measured by two simple metrics only: code coverage and mutation coverage. Although it is well known that these factors are correlated with code quality, one must remember that the notion of quality (especially external quality) is a much more complicated, multi-dimensional concept. Hence, we cannot treat the results as the final evaluation of the external code quality – only as its more or less accurate indicators. We also measured the number of defects, but due to the small number of teams and defects detected, we cannot draw definite conclusions about a significant difference in code quality between TDD and TDD+M. We can only compare these two approaches in terms of the raw data and the metrics used.

The experiment was a controlled one, performed as a so-called static group comparison. This is a two-group design, in which one group is exposed to the factor in question (using mutation testing) and tested, while a control group is not exposed to it (using simple TDD without mutation) and similarly tested, in order to compare the effects of including mutation in the TDD process. In such a setting, the threats to validity mainly include selection.

The selection of subjects to groups was done randomly, which counteracts the ’selection of subjects’ factor that jeopardizes internal validity. However, due to the small sample sizes (eight teams, each of two or three students), randomization may lead to the well-known Simpson paradox (Simpson, 1951).

Another possible factor that jeopardizes the internal validity is maturation. If an experiment lasts for a long time, the participants may improve their performance regardless of the impact of mutation testing. However, the students worked on their projects only for three weeks, hence the risk of this threat to validity is rather small.

The results might also be disturbed by the fact that some participants did not strictly follow the interface templates delivered to them. Because of the changes in these templates, in some cases it was necessary to add a setter or getter for some parameter. The corrected code might thus cause the generation of one or two additional mutants. However, in all cases these were trivial mutants and they were always detected and killed.

For Research Question 3, we could not use any statistical machinery to verify our claims due to small sample size (four data points vs. four other data points). We could only report the results in the raw format.

Some students reported in the self-assessment questionnaire that they had no prior experience in programming or testing (1 on the 1-5 scale). This would mean that they had to learn these skills from scratch before or during the experiment. Since they were 3rd year undergraduate computer science students, this seems impossible, as they had lectures on programming during the first two years of their studies. They probably misunderstood the meaning of the scale and thought that 1 means ’a little knowledge’ of development or software testing. However, we cannot prove this claim. Nevertheless, all the teams managed to successfully write their code and tests, so it is very unlikely that these students had absolutely no prior experience in programming.

The code of some groups had to be modified by adding some getters and setters, so that we could execute the tests in the cross-experiment (see Appendix 2). This introduces a (rather unlikely) risk of the unintentional introduction of some defects.

7 Conclusions

Our experiment shows that using the TDD+M approach is more effective than using only a pure TDD method. Effectiveness is understood here as the ability to write good, effective tests that achieve high code coverage and mutation coverage, and also as the ability to write good-quality code.

The tests written with the TDD+M approach achieve 18.2% better statement coverage and 23.9% better mutation coverage than the tests written with the TDD approach. The differences are statistically significant. The cross-testing (TDD+M tests on TDD code and vice versa) also shows the difference: TDD+M tests on TDD code give 22.4% more statement coverage and 29.7% more mutation coverage than TDD tests on TDD+M code.

The results of the experiment confirm the results of Experiment 0 (see Appendix 1). They clearly show that the TDD+M approach allows the teams to create stronger tests and – as a side effect – code with a lower number of defects.

Implementing the mutation testing step into the iterative test-driven development process increases confidence in the code quality. The TDD approach is not the only development technique that can be enriched with a mutation testing component. It can just as well be implemented as a developer practice in any kind of software development model, such as waterfall, V-model or spiral.

In our experiments, mutation testing allowed the developers to detect incorrect implementations of computations and to write code of better quality. The participants of the experiment were 3rd year undergraduate students (junior or less than junior level), which suggests – regarding the experiment’s results – that the TDD+M approach may be a powerful method in the hands of experienced senior developers in professional software houses.