Empirical Software Engineering, Volume 20, Issue 1, pp 110–141

A practical guide to controlled experiments of software engineering tools with human participants

  • Andrew J. Ko
  • Thomas D. LaToza
  • Margaret M. Burnett


Abstract

Empirical studies, often in the form of controlled experiments, have been widely adopted in software engineering research as a way to evaluate the merits of new software engineering tools. However, controlled experiments involving human participants actually using new tools are still rare, and when they are conducted, some have serious validity concerns. Recent research has also shown that many software engineering researchers view this form of tool evaluation as too risky and too difficult to conduct, as such evaluations might ultimately lead to inconclusive or negative results. In this paper, we aim both to help researchers minimize the risks of this form of tool evaluation and to increase its quality, by offering practical methodological guidance on designing and running controlled experiments with developers. Our guidance fills gaps in the empirical literature by explaining, from a practical perspective, options in the recruitment and selection of human participants, informed consent, experimental procedures, demographic measurements, group assignment, training, the selection and design of tasks, the measurement of common outcome variables such as success and time on task, and study debriefing. Throughout, we situate this guidance in the results of a new systematic review of the tool evaluations reported in over 1,700 software engineering papers published from 2001 to 2011.


Keywords: Research methodology · Tools · Human participants · Human subjects · Experiments



Acknowledgments

We thank Bonnie E. John for her early contributions to this work. This material is based in part upon work supported by NSF Grant CCF-0952733 and AFOSR Grants FA9550-09-1-0213 and FA9550-10-1-0326. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Andrew J. Ko (1), corresponding author
  • Thomas D. LaToza (2)
  • Margaret M. Burnett (3)

  1. University of Washington, Seattle, USA
  2. University of California, Irvine, USA
  3. Oregon State University, Corvallis, USA
