A practical guide to controlled experiments of software engineering tools with human participants


Empirical studies, often in the form of controlled experiments, have been widely adopted in software engineering research as a way to evaluate the merits of new software engineering tools. However, controlled experiments in which human participants actually use new tools are still rare, and some of those that are conducted have serious validity concerns. Recent research has also shown that many software engineering researchers view this form of tool evaluation as too risky and too difficult to conduct, as it might ultimately lead to inconclusive or negative results. In this paper, we aim both to help researchers minimize the risks of this form of tool evaluation and to increase its quality, by offering practical methodological guidance on designing and running controlled experiments with developers. Our guidance fills gaps in the empirical literature by explaining, from a practical perspective, options in the recruitment and selection of human participants, informed consent, experimental procedures, demographic measurements, group assignment, training, the selection and design of tasks, the measurement of common outcome variables such as success and time on task, and study debriefing. Throughout, we situate this guidance in the results of a new systematic review of the tool evaluations reported in over 1,700 software engineering papers published from 2001 to 2011.


  1. Strictly speaking, an experiment is by definition quantitative (Basili 2007). Other kinds of empirical studies are not technically experiments. Thus, in this paper, when we refer to experiments, we mean quantitative experiments.

  2. Studies using this last method were often referred to by authors as “case studies,” but this usage conflicts with the notion of case studies as empirical investigations of some phenomenon within a real-life context (Yin 2003), as the tool use experience reports in these papers were not performed in real-life contexts. “Case study” was also used to refer to evaluations without human use.


  1. Anderson JR, Reiser BJ (1985) The LISP tutor. Byte 10:159–175

  2. Aranda J (2011) How do practitioners perceive Software Engineering Research? Retrieved: 08-01-2011

  3. Atkins DL, Ball T, Graves TL, Mockus A (2002) Using version control data to evaluate the impact of software tools: A case study of the version editor. IEEE Trans Softw Eng 28(7):625–637

  4. Bangor A, Kortum PT, Miller JT (2008) An empirical evaluation of the system usability scale. Int J Human-Comput Interact 24(6):574–594

  5. Basili VR (1993) The experimental paradigm in software engineering. Int Work Exp Eng Issues: Crit Assess Futur Dir 706:3–12

  6. Basili VR (1996) The role of experimentation in software engineering: Past, current, and future. International Conference on Software Engineering, 442–449

  7. Basili VR (2007) The role of controlled experiments in software engineering research. Empirical Software Engineering Issues, LNCS 4336, Basili V et al. (Eds.), Springer-Verlag, 33–37

  8. Basili VR, Selby RW, Hutchens DH (1986) Experimentation in software engineering. IEEE Trans Softw Eng, 733–743, July

  9. Basili VR, Caldiera G, Rombach HD (1994) The goal question metric approach. In Encyclopedia of Software Engineering, John Wiley and Sons, 528–532

  10. Beringer P (2004) Using students as subjects in requirements prioritization. International Symposium on Empirical Software Engineering, 167–176

  11. Boehm BW, Papaccio PN (1988) Understanding and controlling software costs. IEEE Trans Softw Eng SE-14(10):1462–1477

  12. Breaugh JA (2003) Effect size estimation: factors to consider and mistakes to avoid. J Manag 29(1):79–97

  13. Bruun A, Gull P, Hofmeister L, Stage J (2009) Let your users do the testing: a comparison of three remote asynchronous usability testing methods. ACM Conference on Human Factors in Computing Systems, 1619–1628

  14. Buse RPL, Sadowski C, Weimer W (2011) Benefits and barriers of user evaluation in software engineering research. ACM Conference on Systems, Programming, Languages and Applications

  15. Carver J, Jaccheri L, Morasca S, Shull F (2003) Issues in using students in empirical studies in software engineering education. Software Metrics Symposium, 239–249

  16. Chuttur MY (2009) Overview of the technology acceptance model: Origins, developments and future directions. Indiana University, USA, Sprouts: Working Papers on Information Systems

  17. Davis FD (1989) Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q 13(3):319

  18. Dell N, Vaidyanathan V, Medhi I, Cutrell E, Thies W (2012) “Yours is better!” Participant response bias in HCI. ACM Conference on Human Factors in Computing Systems, 1321–1330

  19. Dieste O, Grimán A, Juristo N, Saxena H (2011) Quantitative determination of the relationship between internal validity and bias in software engineering experiments: Consequences for systematic literature reviews. International Symposium on Empirical Software Engineering and Measurement, 285–294

  20. Dig D, Manzoor K, Johnson R, Nguyen TN (2008) Effective software merging in the presence of object-oriented refactorings. IEEE Trans Softw Eng 34(3):321–335

  21. Dittrich Y (ed) (2007) Special issue on qualitative software engineering research. Inf Softw Technol, 49(6):531–694. doi:10.1016/j.infsof.2007.02.009

  22. Dybå T, Kampenes V, Sjøberg D (2006) A systematic review of statistical power in software engineering experiments. Inf Softw Technol 48(8):745–755

  23. Dybå T, Prikladnicki R, Rönkkö K, Seaman C, Sillito J (2011) Qualitative research in software engineering. Empir Softw Eng 16(4):425–429

  24. Easterbrook S, Singer J, Storey M, Damian D (2008) Selecting empirical methods for software engineering research, in Guide to Advanced Empirical Software Engineering, Springer, 285–311

  25. Feigenspan J, Kästner C, Liebig J, Apel S, Hanenberg S (2012) Measuring programming experience. International Conference on Program Comprehension, 73–82

  26. Fenton N (1993) How effective are software engineering methods? J Syst Softw 22(2):141–146

  27. Flyvbjerg B (2006) Five misunderstandings about case study research. Qual Inq 12(2):219–245

  28. Glass RL, Vessey I, Ramesh V (2002) Research in software engineering: an analysis of the literature. Inf Softw Technol 44(8):491–506

  29. Golden E, John BE, Bass L (2005) The value of a usability-supporting architectural pattern in software architecture design: a controlled experiment. ACM/IEEE International Conference on Software Engineering

  30. Greenberg S, Buxton B (2008) Usability evaluation considered harmful (some of the time). ACM Conference on Human Factors in Computing Systems, 111–120

  31. Gwet KL (2010) Handbook of inter-rater reliability, 2nd edn. Advanced Analytics, Gaithersburg

  32. Hanenberg S (2010) An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time. ACM International Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), 22–35

  33. Hannay JE, Sjøberg DIK, Dyba T (2007) A systematic review of theory use in software engineering experiments. IEEE Trans Softw Eng 33(2):87–107

  34. Holmes R, Walker RJ (2013) Systematizing pragmatic software reuse. ACM Trans Softw Eng Methodol 21(4), Article 20: 44 pages

  35. John B, Packer H (1995) Learning and using the cognitive walkthrough method: a case study approach. ACM Conference on Human Factors in Computing Systems, 429–436

  36. Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Springer

  37. Kampenes V, Dybå T, Hannay J, Sjøberg D (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086

  38. Kaptein M, Robertson J (2012) Rethinking statistical analysis methods for CHI. ACM Conference on Human Factors in Computing Systems, 1105–1114

  39. Kelleher C, Pausch R (2005) Stencils-based tutorials: design and evaluation. ACM Conference on Human Factors in Computing Systems, 541–550

  40. Keller FS (1968) Good-bye teacher. J Appl Behav Anal 1:79–89

  41. Keppel G (1982) Design and analysis: a researcher’s handbook, 2nd edn. Prentice-Hall, Englewood Cliffs

  42. Kersten M, Murphy G (2006) Using task context to improve programmer productivity. ACM Symposium on Foundations of Software Engineering, 1–11

  43. Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, El Emam K, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28(8):721–734

  44. Kitchenham BA, Brereton P, Turner M, Niazi MK, Linkman S, Pretorius R, Budgen D (2010) Refining the systematic literature review process—two participant-observer case studies. Empir Softw Eng 15(6):618–653

  45. Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with Mechanical Turk. ACM Conference on Human Factors in Computing Systems, 453–456

  46. Ko AJ, Myers BA (2009) Finding causes of program output with the Java Whyline. ACM Conference on Human Factors in Computing Systems, 1569–1578

  47. Ko AJ, Wobbrock JO (2010) Cleanroom: edit-time error detection with the uniqueness heuristic. IEEE Symposium on Visual Languages and Human-Centric Computing, 7–14

  48. Ko AJ, Burnett MM, Green TRG, Rothermel KJ, Cook CR (2002) Using the Cognitive Walkthrough to improve the design of a visual programming experiment. J Vis Lang Comput 13:517–544

  49. Ko AJ, DeLine R, Venolia G (2007) Information needs in collocated software development teams. International Conference on Software Engineering (ICSE)

  50. LaToza TD, Myers BA (2010) Developers ask reachability questions. International Conference on Software Engineering (ICSE), 185–194

  51. LaToza TD, Myers BA (2011) Visualizing call graphs. IEEE Visual Languages and Human-Centric Computing (VL/HCC), Pittsburgh, PA

  52. LaToza TD, Myers BA (2011) Designing useful tools for developers. ACM SIGPLAN Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU), 45–50

  53. Lazar J, Feng JH, Hochheiser H (2010) Research methods in human-computer interaction. Wiley

  54. Lott C, Rombach D (1996) Repeatable software engineering experiments for comparing defect-detection techniques. Empir Softw Eng 1:241–277

  55. Martin DW (1996) Doing psychology experiments, 4th edn. Brooks/Cole, Pacific Grove

  56. McDowall D, McCleary R, Meidinger E, Hay RA (1980) Interrupted Time Series Analysis, 1st Edition. SAGE Publications

  57. Murphy GC, Walker RJ, Baniassad ELA (1999) Evaluating emerging software development technologies: lessons learned from assessing aspect-oriented programming. IEEE Trans Softw Eng 25(4):438–455

  58. Murphy-Hill E, Murphy GC, Griswold WG (2010) Understanding context: creating a lasting impact in experimental software engineering research. FSE/SDP Workshop on Future of Software Engineering Research

  59. Newell A (1973) You can’t play 20 questions with nature and win: projective comments on the papers of this symposium. In: Chase WG (ed) Visual information processing. Academic, New York

  60. Nickerson RS (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Rev Gen Psychol 2(2):175–220

  61. Nimmer JW, Ernst MD (2002) Invariant inference for static checking: an empirical evaluation. SIGSOFT Softw Eng Notes 27(6):11–20

  62. Olsen DR (2007) Evaluating user interface systems research. ACM Symposium on User Interface Software and Technology, 251–258

  63. Polson P, Lewis C, Rieman J, Wharton C (1992) Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int J Human-Comput Interact 36:741–773

  64. Ramesh V, Glass RL, Vessey I (2004) Research in computer science: an empirical study. J Syst Softw 70(1–2):165–176

  65. Rombach HD, Basili VR, Selby RW (1992) Experimental software engineering issues: critical assessment and future directions. International Workshop Dagstuhl Castle (Germany), Sept. 14–18

  66. Rosenthal R (1966) Experimenter effects in behavioral research. Appleton, New York

  67. Rosenthal R, Rosnow R (2007) Essentials of behavioral research: methods and data analysis. McGraw-Hill, 3rd edition

  68. Rosenthal R, Rubin DB (1978) Interpersonal expectancy effects: the first 345 studies. Behav Brain Sci 1(3):377–386

  69. Ross J, Irani L, Silberman MS, Zaldivar A, Tomlinson B (2010) Who are the crowdworkers? Shifting demographics in Mechanical Turk. ACM Conference on Human Factors in Computing Systems, 2863–2872

  70. Rothermel KJ, Cook C, Burnett MM, Schonfeld J, Green TRG, Rothermel G (2000) WYSIWYT testing in the spreadsheet paradigm: an empirical evaluation. ACM International Conference on Software Engineering, 230–239

  71. Rubin J, Chisnell D (2008) Handbook of usability testing: how to plan, design, and conduct effective tests. Wiley

  72. Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empir Softw Eng 14(2):131–164

  73. Shull F, Singer J, Sjøberg DIK (2006) Guide to advanced empirical software engineering. Springer

  74. Sillito J, Murphy G, De Volder K (2006) Questions programmers ask during software evolution tasks. ACM SIGSOFT/FSE, 23–34

  75. Sjoberg DIK, Dyba T, Jorgensen M (2007) The future of empirical methods in software engineering research. In 2007 Future of Software Engineering (FOSE ’07), 358–378

  76. Sjøberg D, Anda B, Arisholm E, Dyba T, Jorgensen M, Karahasanovic A, Koren E, Voka M (2003) Conducting realistic experiments in software engineering. Empirical Software Engineering and Measurement

  77. Sjøberg DIK, Hannay JE, Hansen O, Kampenes VB, Karahasanović A, Liborg N-K, Rekdal AC (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753

  78. Steele CM, Aronson J (1995) Stereotype threat and the intellectual test performance of African-Americans. J Personal Soc Psychol 69:797–811

  79. Tichy WF (1998) Should computer scientists experiment more? 16 excuses to avoid experimentation. IEEE Comput 31(5):32–40

  80. Tichy WF, Lukowicz P, Prechelt L, Heinz EA (1995) Experimental evaluation in computer science: a quantitative study. J Syst Softw 28(1):9–18

  81. Walther JB (2002) Research ethics in internet-enabled research: human subjects issues and methodological myopia. Ethics Inf Technol 4(3):205–216

  82. Wickelgren WA (1977) Speed-accuracy tradeoff and information processing dynamics. Acta Psychol 41(1):67–85

  83. Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Springer

  84. Yin RK (2003) Case study research: design and methods. Sage Publications

  85. Zannier C, Melnik G, Maurer F (2006) On the success of empirical studies in the international conference on software engineering. ACM/IEEE International Conference on Software Engineering, 341–350


We thank Bonnie E. John for her early contributions to this work. This material is based in part upon work supported by the NSF Grant CCF-0952733 and AFOSR FA9550-0910213 and FA9550-10-1-0326. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

Author information



Corresponding author

Correspondence to Amy J. Ko.

Additional information

Communicated by: Premkumar Thomas Devanbu



Past literature has considered the role of human participants in software engineering research before 2002 (Sjøberg et al. 2005). We still know little, however, about the trends in empirical evaluations with human participants in software engineering from the past decade. To address this gap, and to support the methodological discussion in this paper, we conducted a systematic literature review (Kitchenham et al. 2010) of research from the past decade.

We focused on publications from four of the top software engineering research venues: the International Conference on Software Engineering (ICSE), the ACM SIGSOFT Symposium on Foundations of Software Engineering (and associated European Software Engineering Conference in alternating years), ACM Transactions on Software Engineering and Methodology, and IEEE Transactions on Software Engineering. We chose these primarily because of their status as flagship software engineering journals and conferences, but also because of their focus on software engineering tools (as opposed to studies focused on understanding software engineering practice). Our sample included all of the research publications in these four venues that were published from 2001 to 2011. For conferences, we included only the technical proceedings, excluding experience reports and software engineering education research papers. The resulting set included 1,701 publications.

We first sought to identify the subset of these papers that reported on new technologies, methods, formalisms, design patterns, metrics, or other techniques intended to be used by a software professional of some kind. The first two authors redundantly classified a random sample of 99 of the 1,701 papers (Cohen’s Kappa for this classification was 0.8, considered near-perfect agreement). We then split the remaining papers and classified them independently in random order, identifying 1,392 papers (81 % of all publications) describing some form of software engineering “tool.” As seen in Fig. 3, the proportion of papers contributing tools has not changed significantly in the past decade (χ²(10, N = 1,701) = 11.1, p = .35).

Fig. 3

Proportion of papers contributing software engineering tools
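
As an illustration of the agreement statistic used for the redundant classification above, the following is a minimal sketch of Cohen’s Kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater’s label frequencies. The function name `cohens_kappa`, the labels, and the counts are hypothetical; they are not the review’s data and were chosen only so that 99 items yield a value near 0.8.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for 99 papers (NOT the data from the review):
# 78 agreed "tool", 15 agreed "other", and 6 disagreements.
rater_a = ["tool"] * 78 + ["other"] * 15 + ["tool"] * 3 + ["other"] * 3
rater_b = ["tool"] * 78 + ["other"] * 15 + ["other"] * 3 + ["tool"] * 3

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Because Kappa discounts chance agreement, a raw agreement of 93 of 99 items in this hypothetical sample corresponds to a noticeably lower Kappa when one label dominates.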

To narrow the sample of “tool papers” to “tool papers with evaluations,” we checked each paper for an evaluation of the tool (including evaluations of both internal properties such as performance or accuracy and external properties of how the tool is used by people). We counted an evaluation as “empirical” if it included any quantitative or qualitative data from observation of the tool’s behavior or use. Therefore, if a paper only described a task that could be performed with a tool, but the task was not actually performed, the paper was not included. After achieving high reliability when redundantly classifying the same 99 papers (Kappa = 0.83), we split and classified the remaining papers, identifying 1,065 papers reporting on both a tool and an empirical evaluation of it; that is, 77 % of tool papers included an empirical evaluation. As seen in Fig. 4, the proportion of papers contributing tools and reporting empirical evaluations of them has increased significantly in the past decade, from about 50 % to over 90 % (χ²(10, N = 1,392) = 117.9, p < .001). This shows that empirical evaluations are now widely adopted and perhaps even expected for publication of novel tool contributions.

Fig. 4

Proportion of papers contributing tools that also reported an empirical evaluation of the tool
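
The chi-square results reported above have 10 degrees of freedom, which is what a test over the 11 publication years from 2001 to 2011 against a two-level outcome would give ((11 − 1) × (2 − 1) = 10). As a sketch of how a reported statistic maps to its p-value, the code below uses the closed form of the chi-square survival function that holds when the degrees of freedom are even. The function name `chi2_sf_even_df` is an assumption made for this illustration, and the per-year counts behind the statistics are not reproduced here.

```python
import math


def chi2_sf_even_df(statistic, df):
    """P(X >= statistic) for a chi-square variable with an even, positive df.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to even, positive df")
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    half = statistic / 2.0
    term = 1.0   # the i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Statistics as reported in the text, all with df = 10.
p_tools = chi2_sf_even_df(11.1, 10)    # proportion of papers contributing tools
p_evals = chi2_sf_even_df(117.9, 10)   # proportion of tool papers with evaluations

print(f"chi2 = 11.1  -> p = {p_tools:.2f}")
print(f"chi2 = 117.9 -> p < .001: {p_evals < 0.001}")
```

For odd degrees of freedom this shortcut does not apply, and a statistics library would be the more practical choice.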

To further narrow our sample to tool papers that evaluated some aspect of the use of a tool, we identified the individual studies reported in each paper. We counted any study with a unique method within a paper as a separate study (lab studies that were run multiple times or case studies run on multiple programs were counted as a single study). After achieving high reliability on this definition (Kappa = 0.62), we found 1,141 studies across 1,065 papers.

With the studies in each paper identified, we then classified each of these studies as either involving the human use of the tool or not. Any person using the tool for some task, including the paper authors themselves if they described their own use of the tool, was included. After achieving high reliability on this classification (Kappa = 0.70), we classified the papers, finding 345 studies across 289 papers that involved human use of a tool. As shown in Fig. 5, the proportion of studies that involve developers or other software professionals using a tool is actually on a slow decline in our sample, from a peak in 2002 of 44 % to only 26 % of papers in 2011 (χ²(10, N = 1,141) = 21.0, p < .05). This means that although more evaluations are being done, a smaller proportion of them are evaluating a tool’s use by humans (instead evaluating its internal characteristics such as speed, accuracy, or correctness).

Fig. 5

Proportion of studies contributing evaluations examining how the authors or recruited participants used the tool. Contrary to the graph in Fig. 4, this graph shows a significant decline, suggesting that papers are increasingly evaluating the non-human use of tools

The subset of tool evaluations of human use that recruited human participants and did not use authors is shown in Fig. 6. This plot shows that although studies evaluating tool use are less common (Fig. 5), an increasing proportion of these studies involve human participants. Therefore, when software engineering researchers are studying how a tool is used, they are increasingly recruiting participants rather than using themselves.

Fig. 6

Proportion of studies evaluating tool use that involved non-author human participants. Although studies of use are less common (Fig. 5), they are increasingly involving human participants

Most of the studies used lab studies (conducting a study in a controlled setting with participants), interviews (demonstrating the tool and asking participants to respond to spoken questions about it), surveys (demonstrating the tool and asking participants to respond to written questions about it), field deployments (observations of the use of the tool in real settings on real tasks), and a method we will call tool use experience reports (see footnote 2), meaning the use of the tool with a specific program or data set, either by the authors or some other person. As shown in Fig. 7, the most common method by far was the tool use experience report, which was the method of 67 % of the 345 studies evaluating human use of a tool. The other four methods were much less common. The relative proportion of tool application studies compared to other types has not changed significantly in the past decade (χ²(10, N = 345) = 9.6, p = .473). The frequency of experiments per year has also not changed significantly over the past decade (χ²(10, N = 345) = 8.2, p = .611). Figure 8 shows that only a small subset of these studies are controlled experiments. In fact, the number of experiments evaluating tool use has ranged from 2 to 9 studies per year in these four venues, for a total of only 44 controlled experiments with human participants over 10 years.

Fig. 7

Proportion of methods used in studies evaluating human use of a tool

Fig. 8

Proportion of papers evaluating use that were experiments. The number of experiments ranged from 2 to 9 per year

These results indicate several major trends in the methods that software engineering researchers use to evaluate tools: (1) empirical evaluations are now found in nearly every paper contributing a tool, (2) the proportion of these evaluations that evaluate the human use of the tool is on the decline, (3) an increasing proportion of human use evaluations involve non-author participants, but (4) experiments evaluating human use are still quite rare.

These findings are subject to several threats to validity. We considered only four journals and conference proceedings in our review, focusing on those with a strong reputation for contributing new tools. There are many other software engineering publication venues where such work appears. It is possible that the trends we observed are particular to the venues we chose; for example, Buse et al. (2011) found that while ICSE and FSE showed no signs of increases in user evaluations, OOPSLA, ISSTA, and ASE did. There may also be evaluations with human participants that were never published or that were published in other venues after being rejected by the venues that we did analyze.

About this article

Cite this article

Ko, A.J., LaToza, T.D. & Burnett, M.M. A practical guide to controlled experiments of software engineering tools with human participants. Empir Software Eng 20, 110–141 (2015).


  • Research methodology
  • Tools
  • Human participants
  • Human subjects
  • Experiments