Variation factors in the design and analysis of replicated controlled experiments

Three (dis)similar studies on inspections versus unit testing

  • Published in: Empirical Software Engineering


In formal experiments on software engineering, the number of factors that may impact an outcome is very high. Some factors are controlled and vary by design, while others are either unforeseen or due to chance. This paper aims to explore how context factors change in a series of formal experiments and to identify implications for experimentation and replication practices to enable learning from experimentation. We analyze three experiments on code inspections and structural unit testing. The first two experiments use the same experimental design and instrumentation (replication), while the third, conducted by different researchers, replaces the programs and adapts the defect detection methods accordingly (reproduction). Experimental procedures and location also differ between the experiments. Contrary to expectations, there are significant differences between the original experiment and the replication, as well as compared to the reproduction. Some of the differences are due to factors other than the ones designed to vary between experiments, indicating the sensitivity to context factors in software engineering experimentation. In aggregate, the analysis indicates that researchers who want to obtain reliable and repeatable empirical measures should consider reducing the complexity of software engineering experiments.



References

  • Anderson T, Darling D (1952) Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann Math Stat 23(2):193–212

  • Basili VR, Selby RW (1987) Comparing the effectiveness of software testing strategies. IEEE Trans Softw Eng 13(12):1278–1296

  • Basili VR, Shull F, Lanubile F (1999) Building knowledge through families of experiments. IEEE Trans Softw Eng 25(4):456–473

  • Berling T, Runeson P (2003) Evaluation of a perspective based review method applied in an industrial setting. IEE Proc SW 150(3):177–184

  • Cartwright N (1991) Replicability, reproducibility, and robustness: comments on Harry Collins. Hist Polit Econ 23(1):143–155

  • Clarke P, O’Connor RV (2012) The situational factors that affect the software development process: towards a comprehensive reference framework. Inf Softw Technol 54(5):433–447

  • da Silva FQB, Suassuna M, França ACC, Grubb AM, Gouveia TB, Monteiro CVF, dos Santos IE (2012) Replication of empirical studies in software engineering research: a systematic mapping study. Empir Softw Eng. doi:10.1007/s10664-012-9227-7

  • Dybå T, Sjøberg DIK, Cruzes DS (2012) What works for whom, where, when, and why?: on the role of context in empirical software engineering. In: Proceedings of the 11th international symposium on empirical software engineering and measurement, pp 19–28

  • Gomez OS, Juristo N, Vegas S (2010) Replications types in experimental disciplines. In: Proceedings of the fourth international symposium on empirical software engineering and measurement

  • Hannay J, Jørgensen M (2008) The role of deliberate artificial design elements in software engineering experiments. IEEE Trans Softw Eng 34(2):242–259

  • Hetzel W (1972) An experimental analysis of program verification problem solving capabilities as they relate to programmer efficiency. Comput Pers 3(3):10–15

  • Hoaglin D, Andrews D (1975) The reporting of computation-based results in statistics. Am Stat 29(3):112–126

  • Humphrey WS (1995) A discipline for software engineering. Addison-Wesley, Reading, MA

  • Jedlitschka A, Pfahl D (2005) Reporting guidelines for controlled experiments in software engineering. In: Proceedings of the 4th international symposium on empirical software engineering, pp 95–104

  • Jørgensen M, Grimstad S (2011) The impact of irrelevant and misleading information on software development effort estimates: a randomized controlled field experiment. IEEE Trans Softw Eng 37(5):695–707

  • Jørgensen M, Grimstad S (2012) Software development estimation biases: the role of interdependence. IEEE Trans Softw Eng 38(3):677–693

  • Jørgensen M, Gruschke T (2009) The impact of lessons-learned sessions on effort estimation and uncertainty assessments. IEEE Trans Softw Eng 35(3):368–383

  • Jørgensen M, Shepperd M (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33:33–53

  • Juristo N, Gomez OS (2012) Replication of software engineering experiments. In: Meyer B, Nordio M (eds) Empirical software engineering and verification. LNCS, vol 7007. Springer, pp 60–88

  • Juristo N, Vegas S (2011) The role of non-exact replications in software engineering experiments. Empir Softw Eng 16(3):295–324

  • Juristo N, Moreno AM, Vegas S (2004) Reviewing 25 years of testing technique experiments. Empir Softw Eng 9(1–2):7–44

  • Juristo N, Moreno AM, Vegas S, Solari M (2006) In search of what we experimentally know about unit testing. IEEE Softw 23:72–80

  • Juristo N, Vegas S, Solari M, Abrahao S, Ramos I (2012) Comparing the effectiveness of equivalence partitioning, branch testing and code reading by stepwise abstraction applied by subjects. In: Proceedings fifth IEEE international conference on software testing, verification and validation, Montreal, Canada, pp 330–339

  • Kitchenham BA, Fry J, Linkman SG (2003) The case against cross-over designs in software engineering. In: 11th international workshop on software technology and engineering practice (STEP 2003), Amsterdam, The Netherlands, pp 65–67

  • Kitchenham BA (2008) The role of replications in empirical software engineering—a word of warning. Empir Softw Eng 13:219–221

  • Kitchenham BA, Al-Khilidar H, Babar MA, Berry M, Cox K, Keung J, Kurniawati F, Staples M, Zhang H, Zhu L (2007) Evaluating guidelines for reporting empirical software engineering studies. Empir Softw Eng 13(1):97–121

  • Kitchenham B, Pearl Brereton O, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering—a systematic literature review. Inf Softw Technol 51(1):7–15

  • Laitenberger O (1998) Studying the effects of code inspection and structural testing on software quality. In: Proceedings 9th international symposium on software reliability engineering, pp 237–246

  • Lindsay RM, Ehrenberg ASC (1993) The design of replicated studies. Am Stat 47(3):217–227

  • Mäntylä MV, Lassenius C, Vanhanen J (2010) Rethinking replication in software engineering: can we see the forest for the trees? In: Knutson C, Krein J (eds) 1st international workshop on replication in empirical software engineering research, Cape Town, South Africa

  • Miller J (2000) Applying meta-analytical procedures to software engineering experiments. J Syst Softw 54(1):29–39

  • Miller J (2005) Replicating software engineering experiments: a poisoned chalice or the holy grail. Inf Softw Technol 47(4):233–244

  • Montgomery DC (2001) Design and analysis of experiments, 5th edn. Wiley, New York

  • Pickard L, Kitchenham BA, Jones P (1998) Combining empirical results in software engineering. Inf Softw Technol 40(14):811–821

  • Runeson P, Andrews A (2003) Detection or isolation of defects? An experimental comparison of unit testing and code inspection. In: 14th international symposium on software reliability engineering, pp 3–13

  • Runeson P, Anderson C, Thelin T, Andrews A, Berling T (2006) What do we know about defect detection methods? IEEE Softw 23(3):82–90

  • Runeson P, Stefik A, Andrews A, Grönblom S, Porres I, Siebert S (2011) A comparative analysis of three replicated experiments comparing inspection and unit testing. In: Proceedings 2nd international workshop on replication in empirical software engineering research, Banff, Canada, pp 35–42

  • Runeson P, Höst M, Rainer A, Regnell B (2012) Case study research in software engineering—guidelines and examples. Wiley, New York

  • Schmidt S (2009) Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev Gen Psychol 13(2):90–100

  • Shull F, Basili VR, Carver J, Maldonado JC, Travassos GH, Mendonca M, Fabbri S (2002) Replicating software engineering experiments: addressing the tacit knowledge problem. In: Proceedings of the 1st international symposium empirical software engineering, pp 7–16

  • Shull FJ, Carver J, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empir Softw Eng 13(2):211–218

  • Siegel S, Castellan N (1956) Nonparametric statistics for the behavioural sciences. McGraw-Hill, New York

  • Sjøberg DIK (2007) Knowledge acquisition in software engineering requires sharing of data and artifacts. In: Basili V, Rombach H, Schneider K, Kitchenham B, Pfahl D, Selby R (eds) Empirical software engineering issues: critical assessment and future directions. LNCS, vol 4336. Springer, pp 77–82

  • So S, Cha S, Shimeall T, Kwon Y (2002) An empirical evaluation of six methods to detect faults in software. Softw Test Verif Reliab 12(3):155–171

  • Teasley BE, Leventhal LM, Mynatt CR, Rohlman DS (1994) Why software testing is sometimes ineffective: two applied studies of positive test strategy. J Appl Psychol 79(1):142–155

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslen A (2012) Experimentation in software engineering. Springer

  • Yin RK (2009) Case study research design and methods, 4th edn. Sage Publications, Beverly Hills, CA



Acknowledgements

We thank Sam Grönblom and Ivan Porres, Åbo Akademi University, Finland, for providing data from experiment 3. The first author conducted parts of the work during a sabbatical at North Carolina State University, USA. We thank the anonymous reviewers for helping focus the manuscript and thereby significantly improve it.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Per Runeson.

Additional information

Communicated by: Natalia Juristo



Table 16 Defects in the PSP programs; classifications based on Basili and Selby's scheme (Basili and Selby 1987)
Table 17 Defects in the real-time programs; classifications based on Basili and Selby's scheme (Basili and Selby 1987)

About this article

Cite this article

Runeson, P., Stefik, A. & Andrews, A. Variation factors in the design and analysis of replicated controlled experiments. Empir Software Eng 19, 1781–1808 (2014).
