Empirical Software Engineering, Volume 19, Issue 6, pp 1781–1808

Variation factors in the design and analysis of replicated controlled experiments

Three (dis)similar studies on inspections versus unit testing
  • Per Runeson
  • Andreas Stefik
  • Anneliese Andrews


Abstract

In formal experiments on software engineering, the number of factors that may impact an outcome is very high. Some factors are controlled and varied by design, while others are either unforeseen or due to chance. This paper explores how context factors change across a series of formal experiments and identifies implications for experimentation and replication practices, to enable learning from experimentation. We analyze three experiments on code inspections and structural unit testing. The first two experiments use the same experimental design and instrumentation (replication), while the third, conducted by different researchers, replaces the programs and adapts the defect detection methods accordingly (reproduction). Experimental procedures and location also differ between the experiments. Contrary to expectations, there are significant differences between the original experiment and the replication, as well as between these and the reproduction. Some of the differences are due to factors other than those designed to vary between experiments, indicating how sensitive software engineering experimentation is to context factors. In aggregate, the analysis indicates that researchers who want to obtain reliable and repeatable empirical measures should consider reducing the complexity of software engineering experiments.
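To make the notion of "significant differences between experiments" concrete, the sketch below applies a Mann-Whitney U test, one common nonparametric choice for comparing a per-subject effectiveness measure between an original experiment and a replication. The data, variable names, and choice of test are illustrative assumptions, not the paper's actual measurements or analysis.

    # Illustrative sketch only: invented data, not the paper's measurements.
    from scipy.stats import mannwhitneyu

    # Hypothetical defect-detection rates (fraction of seeded defects found
    # per subject) for the two experiments; all values are made up.
    original_experiment = [0.45, 0.60, 0.55, 0.70, 0.50, 0.65]
    replication = [0.30, 0.40, 0.35, 0.50, 0.45, 0.38]

    # Two-sided test: does the effectiveness distribution differ at all?
    stat, p = mannwhitneyu(original_experiment, replication, alternative="two-sided")
    print(f"U = {stat}, p = {p:.3f}")
    # A small p-value (e.g. below 0.05) would indicate a statistically
    # significant difference between the experiments despite a shared design.

A nonparametric test is a natural fit here because per-subject detection rates from small student or practitioner samples rarely justify normality assumptions.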


Keywords: Formal experiments · Replication · Reproduction · Experiment design · Code inspection · Unit testing



Acknowledgments

We thank Sam Grönblom and Ivan Porres, Åbo Akademi University, Finland, for providing data from experiment 3. The first author conducted parts of the work during a sabbatical at North Carolina State University, USA. We thank the anonymous reviewers for helping us focus the manuscript and thereby significantly improve it.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Per Runeson (1)
  • Andreas Stefik (2)
  • Anneliese Andrews (3)

  1. Lund University, Lund, Sweden
  2. University of Nevada, Las Vegas, USA
  3. University of Denver, Denver, USA
