Empirical studies, often in the form of controlled experiments, have been widely adopted in software engineering research as a way to evaluate the merits of new software engineering tools. However, controlled experiments in which human participants actually use new tools are still rare, and some of those that are conducted have serious validity concerns. Recent research has also shown that many software engineering researchers view this form of tool evaluation as too risky and too difficult to conduct, as it might ultimately lead to inconclusive or negative results. In this paper, we aim both to help researchers minimize the risks of this form of tool evaluation and to increase its quality, by offering practical methodological guidance on designing and running controlled experiments with developers. Our guidance fills gaps in the empirical literature by explaining, from a practical perspective, options in the recruitment and selection of human participants, informed consent, experimental procedures, demographic measurements, group assignment, training, the selection and design of tasks, the measurement of common outcome variables such as success and time on task, and study debriefing. Throughout, we situate this guidance in the results of a new systematic review of the tool evaluations reported in over 1,700 software engineering papers published from 2001 to 2011.
Strictly speaking, an experiment is by definition quantitative (Basili 2007). Other kinds of empirical studies are not technically experiments. Thus, in this paper, when we refer to experiments, we mean quantitative experiments.
Studies using this last method were often referred to by authors as “case studies,” but this usage conflicts with the notion of case studies as empirical investigations of some phenomenon within a real-life context (Yin 2003), as the tool use experience reports in these papers were not performed in real life contexts. “Case study” was also used to refer to evaluations without human use.
Anderson JR, Reiser BJ (1985) The LISP tutor. Byte 10:159–175
Aranda J (2011) How do practitioners perceive Software Engineering Research? http://catenary.wordpress.com/2011/05/19/how-do-practitioners-perceive-software-engineering-research/. Retrieved: 08-01-2011
Atkins DL, Ball T, Graves TL, Mockus A (2002) Using version control data to evaluate the impact of software tools: A case study of the version editor. IEEE Trans Softw Eng 28(7):625–637
Bangor A, Kortum PT, Miller JT (2008) An empirical evaluation of the system usability scale. Int J Human-Comput Interact 24(6):574–594
Basili VR (1993) The experimental paradigm in software engineering. Int Work Exp Eng Issues: Crit Assess Futur Dir 706:3–12
Basili VR (1996) The role of experimentation in software engineering: Past, current, and future. International Conference on Software Engineering, 442–449
Basili VR (2007) The role of controlled experiments in software engineering research. Empirical Software Engineering Issues, LNCS 4336, Basili V et al. (Eds.), Springer-Verlag, 33–37
Basili VR, Selby RW, Hutchens DH (1986) Experimentation in software engineering. IEEE Trans Softw Eng, 733–743, July
Basili VR, Caldiera G, Rombach HD (1994) The goal question metric approach. In Encyclopedia of Software Engineering, John Wiley and Sons, 528–532
Beringer P (2004) Using students as subjects in requirements prioritization. International Symposium on Empirical Software Engineering, 167–176
Boehm BW, Papaccio PN (1988) Understanding and controlling software costs. IEEE Trans Softw Eng SE-14(10):1462–1477
Breaugh JA (2003) Effect size estimation: factors to consider and mistakes to avoid. J Manag 29(1):79–97
Bruun A, Gull P, Hofmeister L, Stage J (2009) Let your users do the testing: a comparison of three remote asynchronous usability testing methods. ACM Conference on Human Factors in Computing Systems, 1619–1628
Buse RPL, Sadowski C, Weimer W (2011) Benefits and barriers of user evaluation in software engineering research. ACM Conference on Systems, Programming, Languages and Applications
Carver J, Jaccheri L, Morasca S, Shull F (2003) Issues in using students in empirical studies in software engineering education. Software Metrics Symposium, 239–249
Chuttur MY (2009) Overview of the technology acceptance model: Origins, developments and future directions. Indiana University, USA, Sprouts: Working Papers on Information Systems
Davis FD (1989) Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q 13(3):319
Dell N, Vaidyanathan V, Medhi I, Cutrell E, Thies W (2012) “Yours is better!” Participant response bias in HCI. ACM Conference on Human Factors in Computing Systems, 1321–1330
Dieste O, Grimán A, Juristo N, Saxena H (2011) Quantitative determination of the relationship between internal validity and bias in software engineering experiments: Consequences for systematic literature reviews. International Symposium on Empirical Software Engineering and Measurement, 285–294
Dig D, Manzoor K, Johnson R, Nguyen TN (2008) Effective software merging in the presence of object-oriented refactorings. IEEE Trans Softw Eng 34(3):321–335
Dittrich Y (ed) (2007) Special issue on qualitative software engineering research. Inf Softw Technol, 49(6):531–694. doi:10.1016/j.infsof.2007.02.009
Dybå T, Kampenes V, Sjøberg D (2006) A systematic review of statistical power in software engineering experiments. Inf Softw Technol 48(8):745–755
Dybå T, Prikladnicki R, Rönkkö K, Seaman C, Sillito J (2011) Qualitative research in software engineering. Empir Softw Eng 16(4):425–429
Easterbrook S, Singer J, Storey M, Damian D (2008) Selecting empirical methods for software engineering research, in Guide to Advanced Empirical Software Engineering, Springer, 285–311
Feigenspan J, Kästner C, Liebig J, Apel S, Hanenberg S (2012) Measuring programming experience. International Conference on Program Comprehension, 73–82
Fenton N (1993) How effective are software engineering methods? J Syst Softw 22(2):141–146
Flyvbjerg B (2006) Five misunderstandings about case study research. Qual Inq 12(2):219–245
Glass RL, Vessey I, Ramesh V (2002) Research in software engineering: an analysis of the literature. Inf Softw Technol 44(8):491–506
Golden E, John BE, Bass L (2005) The value of a usability-supporting architectural pattern in software architecture design: a controlled experiment. ACM/IEEE International Conference on Software Engineering
Greenberg S, Buxton B (2008) Usability evaluation considered harmful (some of the time). ACM Conference on Human Factors in Computing Systems, 111–120
Gwet KL (2010) Handbook of inter-rater reliability, 2nd edn. Advanced Analytics, Gaithersburg
Hanenberg S (2010) An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time. ACM International Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), 22–35
Hannay JE, Sjøberg DIK, Dybå T (2007) A systematic review of theory use in software engineering experiments. IEEE Trans Softw Eng 33(2):87–107
Holmes R, Walker RJ (2013) Systematizing pragmatic software reuse. ACM Trans Softw Eng Methodol 21(4), Article 20: 44 pages
John B, Packer H (1995) Learning and using the cognitive walkthrough method: a case study approach. ACM Conference on Human Factors in Computing Systems, 429–436
Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Springer
Kampenes V, Dybå T, Hannay J, Sjøberg D (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086
Kaptein M, Robertson J (2012) Rethinking statistical analysis methods for CHI. ACM Conference on Human Factors in Computing Systems, 1105–1114
Kelleher C, Pausch R (2005) Stencils-based tutorials: design and evaluation. ACM Conference on Human Factors in Computing Systems, 541–550
Keller FS (1968) Good-bye teacher. J Appl Behav Anal 1:79–89
Keppel G (1982) Design and analysis: a researcher’s handbook, 2nd edn. Prentice-Hall, Englewood Cliffs
Kersten M, Murphy G (2006) Using task context to improve programmer productivity. ACM Symposium on Foundations of Software Engineering, 1–11
Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, El Emam K, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28(8):721–734
Kitchenham BA, Brereton P, Turner M, Niazi MK, Linkman S, Pretorius R, Budgen D (2010) Refining the systematic literature review process—two participant-observer case studies. Empir Softw Eng 15(6):618–653
Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with Mechanical Turk. ACM Conference on Human Factors in Computing Systems, 453–456
Ko AJ, Myers BA (2009) Finding causes of program output with the Java Whyline. ACM Conference on Human Factors in Computing Systems, 1569–1578
Ko AJ, Wobbrock JO (2010) Cleanroom: edit-time error detection with the uniqueness heuristic. IEEE Symposium on Visual Languages and Human-Centric Computing, 7–14
Ko AJ, Burnett MM, Green TRG, Rothermel KJ, Cook CR (2002) Using the Cognitive Walkthrough to improve the design of a visual programming experiment. J Vis Lang Comput 13:517–544
Ko AJ, DeLine R, Venolia G (2007) Information needs in collocated software development teams. International Conference on Software Engineering (ICSE)
LaToza TD, Myers BA (2010) Developers ask reachability questions. International Conference on Software Engineering (ICSE), 185–194
LaToza TD, Myers BA (2011) Visualizing call graphs. IEEE Visual Languages and Human-Centric Computing (VL/HCC), Pittsburgh, PA
LaToza TD, Myers BA (2011) Designing useful tools for developers. ACM SIGPLAN Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU), 45–50
Lazar J, Feng JH, Hochheiser H (2010) Research methods in human-computer interaction. Wiley
Lott C, Rombach D (1996) Repeatable software engineering experiments for comparing defect-detection techniques. Empir Softw Eng 1:241–277
Martin DW (1996) Doing psychology experiments, 4th edn. Brooks/Cole, Pacific Grove
McDowall D, McCleary R, Meidinger E, Hay RA (1980) Interrupted Time Series Analysis, 1st Edition. SAGE Publications
Murphy GC, Walker RJ, Baniassad ELA (1999) Evaluating emerging software development technologies: lessons learned from assessing aspect-oriented programming. IEEE Trans Softw Eng 25(4):438–455
Murphy-Hill E, Murphy GC, Griswold WG (2010) Understanding context: creating a lasting impact in experimental software engineering research. FSE/SDP Workshop on Future of Software Engineering Research
Newell A (1973) You can’t play 20 questions with nature and win: projective comments on the papers of this symposium. In: Chase WG (ed) Visual information processing. Academic, New York
Nickerson RS (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Rev Gen Psychol 2(2):175–220
Nimmer JW, Ernst MD (2002) Invariant inference for static checking: an empirical evaluation. SIGSOFT Softw Eng Notes 27(6):11–20
Olsen DR (2007) Evaluating user interface systems research. ACM Symposium on User Interface Software and Technology, 251–258
Polson P, Lewis C, Rieman J, Wharton C (1992) Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int J Human-Comput Interact 36:741–773
Ramesh V, Glass RL, Vessey I (2004) Research in computer science: an empirical study. J Syst Softw 70(1–2):165–176
Rombach HD, Basili VR, Selby RW (1992) Experimental software engineering issues: critical assessment and future directions. International Workshop Dagstuhl Castle (Germany), Sept. 14–18
Rosenthal R (1966) Experimenter effects in behavioral research. Appleton, New York
Rosenthal R, Rosnow R (2007) Essentials of behavioral research: methods and data analysis. McGraw-Hill, 3rd edition
Rosenthal R, Rubin DB (1978) Interpersonal expectancy effects: the first 345 studies. Behav Brain Sci 1(3):377–386
Ross J, Irani L, Silberman MS, Zaldivar A, Tomlinson B (2010) Who are the crowdworkers? Shifting demographics in Mechanical Turk. ACM Conference on Human Factors in Computing Systems, 2863–2872
Rothermel KJ, Cook C, Burnett MM, Schonfeld J, Green TRG, Rothermel G (2000) WYSIWYT testing in the spreadsheet paradigm: an empirical evaluation. ACM International Conference on Software Engineering, 230–239
Rubin J, Chisnell D (2008) Handbook of usability testing: how to plan, design, and conduct effective tests. Wiley
Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empir Softw Eng 14(2):131–164
Shull F, Singer J, Sjøberg DIK (2006) Guide to advanced empirical software engineering. Springer
Sillito J, Murphy G, De Volder K (2006) Questions programmers ask during software evolution tasks. ACM SIGSOFT/FSE, 23–34
Sjøberg DIK, Dybå T, Jørgensen M (2007) The future of empirical methods in software engineering research. In 2007 Future of Software Engineering (FOSE ’07), 358–378
Sjøberg D, Anda B, Arisholm E, Dybå T, Jørgensen M, Karahasanović A, Koren E, Vokáč M (2003) Conducting realistic experiments in software engineering. Empirical Software Engineering and Measurement
Sjøberg DIK, Hannay JE, Hansen O, Kampenes VB, Karahasanović A, Liborg N-K, Rekdal AC (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753
Steele CM, Aronson J (1995) Stereotype threat and the intellectual test performance of African-Americans. J Personal Soc Psychol 69:797–811
Tichy WF (1998) Should computer scientists experiment more? 16 excuses to avoid experimentation. IEEE Comput 31(5):32–40
Tichy WF, Lukowicz P, Prechelt L, Heinz EA (1995) Experimental evaluation in computer science: a quantitative study. J Syst Softw 28(1):9–18
Walther JB (2002) Research ethics in internet-enabled research: human subjects issues and methodological myopia. Ethics Inf Technol 4(3):205–216
Wickelgren WA (1977) Speed-accuracy tradeoff and information processing dynamics. Acta Psychol 41(1):67–85
Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Springer
Yin RK (2003) Case study research: design and methods. Sage Publications
Zannier C, Melnik G, Maurer F (2006) On the success of empirical studies in the international conference on software engineering. ACM/IEEE International Conference on Software Engineering, 341–350
We thank Bonnie E. John for her early contributions to this work. This material is based in part upon work supported by the NSF Grant CCF-0952733 and AFOSR FA9550-0910213 and FA9550-10-1-0326. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
Communicated by: Premkumar Thomas Devanbu
Past literature has considered the role of human participants in software engineering research before 2002 (Sjøberg et al. 2005). We still know little, however, about the trends in empirical evaluations with human participants in software engineering from the past decade. To address this gap, and to support the methodological discussion in this paper, we conducted a systematic literature review (Kitchenham et al. 2010) of research from the past decade.
We focused on publications from four of the top software engineering research venues: the International Conference on Software Engineering (ICSE), the ACM SIGSOFT Symposium on Foundations of Software Engineering (and associated European Software Engineering Conference in alternating years), ACM Transactions on Software Engineering and Methodology, and IEEE Transactions on Software Engineering. We chose these primarily because of their status as flagship software engineering journals and conferences, but also because of their focus on software engineering tools (as opposed to studies focused on understanding software engineering practice). Our sample included all of the research publications in these four venues that were published from 2001 to 2011. For conferences, we included only the technical proceedings, excluding experience reports and software engineering education research papers. The resulting set included 1,701 publications.
We first sought to identify the subset of these papers that reported on new technologies, methods, formalisms, design patterns, metrics, or other techniques intended to be used by a software professional of some kind. The first two authors redundantly classified a random sample of 99 of the 1,701 papers (the Cohen’s Kappa of this classification was 0.8, considered near perfect agreement). We then split the remaining papers and classified them independently in random order, identifying 1,392 papers (81 % of all publications) describing some form of software engineering “tool.” As seen in Fig. 3, the proportion of papers contributing tools has not changed significantly in the past decade (χ²(10, N = 1,701) = 11.1, p = .35).
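As an illustration of the agreement statistic reported above, the following minimal Python sketch computes Cohen's Kappa for two raters who label the same items. The function name and the ten example labels are hypothetical and chosen only for illustration; they are not data from the review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten papers ("tool" vs. "other"); illustration only.
rater_1 = ["tool"] * 7 + ["other"] * 3
rater_2 = ["tool"] * 6 + ["other"] * 3 + ["tool"]

print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

In this made-up example the raters agree on 8 of 10 items, yet Kappa is much lower than 0.8 because both raters assign "tool" most of the time, so a large share of that agreement would be expected by chance.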
To narrow the sample of “tool papers” to “tool papers with evaluations,” we checked each paper for an evaluation of the tool (including evaluations of both internal properties such as performance or accuracy and external properties of how the tool is used by people). We counted an evaluation as “empirical” if it included any quantitative or qualitative data from observation of the tool’s behavior or use. Therefore, if a paper only described a task that could be performed with a tool, but the task was not actually performed, the paper was not included. After achieving high reliability when redundantly classifying the same 99 papers (Kappa = 0.83), we split and classified the remaining papers, identifying 1,065 papers reporting on both a tool and an empirical evaluation of it; 77 % of tool papers included an empirical evaluation. As seen in Fig. 4, the proportion of papers contributing tools and reporting empirical evaluations of them has increased significantly in the past decade from about 50 % to over 90 % (χ²(10, N = 1,392) = 117.9, p < .001). This shows that empirical evaluations are now widely adopted and perhaps even expected for publication of novel tool contributions.
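The p-values attached to the trend tests in this appendix can be checked from the reported χ² statistic and degrees of freedom alone. The sketch below assumes a standard χ² distribution and uses the closed-form upper tail that holds when the degrees of freedom are even (as with the 10 reported here). The function name is hypothetical, not taken from any library.

```python
import math


def chi2_upper_tail_even_df(statistic, df):
    """P(X >= statistic) for a chi-square variable with an even number of df.

    For df = 2k the tail is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    half = statistic / 2.0
    term = 1.0   # the i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Statistics as reported in the text, all with 10 degrees of freedom.
for label, statistic in [
    ("papers contributing tools", 11.1),
    ("tool papers with empirical evaluations", 117.9),
]:
    p = chi2_upper_tail_even_df(statistic, 10)
    print(f"{label}: chi2 = {statistic}, p = {p:.3g}")
```

Under these assumptions, the first statistic reproduces the reported p = .35 to two decimal places, and the second falls far below the reported .001 threshold.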
To further narrow our sample to tool papers that evaluated some aspect of the use of a tool, we identified the individual studies reported in each paper. We counted any study with a unique method within a paper as a separate study (lab studies that were run multiple times or case studies run on multiple programs were counted as a single study). After achieving high reliability on this definition (Kappa = 0.62), we found 1,141 studies across 1,065 papers.
With the studies in each paper identified, we then classified each of these studies as either involving the human use of the tool or not. Any person using the tool for some task, including the paper authors themselves if they described their own use of the tool, was included. After achieving high reliability on this classification (Kappa = 0.70), we classified the papers, finding 345 studies across 289 papers that involved human use of a tool. As shown in Fig. 5, the proportion of studies that involve developers or other software professionals using a tool is actually on a slow decline in our sample, from a peak in 2002 of 44 % to only 26 % of papers in 2011 (χ²(10, N = 1,141) = 21.0, p < .05). This means that although more evaluations are being done, a smaller proportion of them are evaluating a tool’s use by humans (instead evaluating its internal characteristics such as speed, accuracy, or correctness).
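A test of whether a proportion differs across years can be framed as a Pearson χ² test on a contingency table with one column per year and one row per category (for example, studies with and without human use). The sketch below is one way to compute the statistic, assuming that framing; the per-year counts behind the figures are not given in the text, so the three-column table here is entirely hypothetical. With 11 publication years and two categories, the same formula would presumably give the 10 degrees of freedom reported above.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    `table` is a list of rows; each row holds the observed counts for one
    category across the columns (e.g., publication years).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row proportions were identical in every column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical counts for three years; NOT data from the review.
hypothetical_counts = [
    [10, 20, 30],  # studies involving human use of the tool
    [30, 20, 10],  # studies not involving human use
]
statistic, df = pearson_chi_square(hypothetical_counts)
print(f"chi2({df}) = {statistic:.1f}")
```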
The subset of tool evaluations of human use that recruited human participants and did not use authors is shown in Fig. 6. This plot shows that although studies evaluating tool use are less common (Fig. 5), an increasing proportion of these studies involve human participants. Therefore, when software engineering researchers are studying how a tool is used, they are increasingly recruiting participants rather than using themselves.
Rather than experiments, most of the studies used lab studies (conducting a study in a controlled setting with participants), interviews (demonstrating the tool and asking participants to respond to spoken questions about it), surveys (demonstrating the tool and asking participants to respond to written questions about it), field deployments (observations of the use of the tool in real settings on real tasks), and a method we will call tool use experience reports (see Footnote 2): the use of the tool with a specific program or data set, either by the authors or some other person. As shown in Fig. 7, the most common method by far was the tool use experience report, which was the method of 67 % of the 345 studies evaluating human use of a tool; the other four methods were much less common. The relative proportion of tool use experience reports compared to other types has not changed significantly in the past decade (χ²(10, N = 345) = 9.6, p = .473). The frequency of experiments per year has also not changed significantly over the past decade (χ²(10, N = 345) = 8.2, p = .611). Figure 8 shows that only a small subset of these studies are controlled experiments. In fact, the number of experiments evaluating tool use has ranged from 2 to 9 studies per year in these four venues, for a total of only 44 controlled experiments with human participants over 10 years.
These results indicate several major trends in the methods that software engineering researchers use to evaluate tools: (1) empirical evaluations are now found in nearly every paper contributing a tool, (2) the proportion of these evaluations that evaluate the human use of the tool is on the decline, (3) an increasing proportion of human use evaluations involve non-author participants, but (4) experiments evaluating human use are still quite rare.
These findings are subject to several threats to validity. We considered only four journals and conference proceedings in our review, focusing on those with a strong reputation for contributing new tools. There are many other software engineering publication venues where such work appears. It is possible that the trends we observed are particular to the venues we chose; for example, Buse et al. (2011) found that while ICSE and FSE showed no signs of increases in user evaluations, OOPSLA, ISSTA, and ASE did. There may also be evaluations with human participants that were never published or that were published in other venues after being rejected by the venues that we did analyze.
Ko, A.J., LaToza, T.D. & Burnett, M.M. A practical guide to controlled experiments of software engineering tools with human participants. Empir Software Eng 20, 110–141 (2015). https://doi.org/10.1007/s10664-013-9279-3
- Research methodology
- Human participants
- Human subjects