
An experiment on the effectiveness and efficiency of exploratory testing


The exploratory testing (ET) approach is commonly applied in industry, but lacks scientific research. The scientific community needs quantitative results on the performance of ET taken from realistic experimental settings. The objective of this paper is to quantify the effectiveness and efficiency of ET vs. testing with documented test cases (test case based testing, TCT). We performed four controlled experiments where a total of 24 practitioners and 46 students performed manual functional testing using ET and TCT. We measured the number of identified defects in the 90-minute testing sessions, the detection difficulty, severity and types of the detected defects, and the number of false defect reports. The results show that ET found a significantly greater number of defects. ET also found significantly more defects of varying levels of difficulty, types and severity levels. However, the two testing approaches did not differ significantly in terms of the number of false defect reports submitted. We conclude that ET was more efficient than TCT in our experiment. ET was also more effective than TCT when detection difficulty, type of defects and severity levels are considered. The two approaches are comparable when it comes to the number of false defect reports submitted.




  1. Such knowledge, where it exists, will obviously help a tester to find the expected risks.

  2. For recent reviews on software testing techniques, see Jia and Harman (2011); Ali et al. (2010); da Mota Silveira Neto et al. (2011); Nie and Leung (2011); Dias Neto et al. (2007).

  3. The 90-minute session length was chosen as suggested by Bach (2000), but is not a strict requirement (we were also constrained by the limited time our industrial and academic subjects had available for the experiments).

  4. The C++ source code files were given to the subjects as examples of code formatting and indentation, to guide them in detecting formatting and indentation defects.

  5. jEdit version 4.2

  6. The exploratory charter provided the subjects with high-level test guidelines.

  7. Cohen’s d expresses the mean difference between the two groups in standard-deviation units. The interpretation of d differs across research questions; we follow the standard interpretation offered by Cohen (1988), where 0.8, 0.5 and 0.2 indicate large, moderate and small practical significance, respectively.

  8. The median is a closer indication of the true average than the mean here, owing to the presence of extreme values.

  9. The FTFI number is somewhat ambiguously named in the original article, since the metric is not about fault interactions, but about interactions of the inputs or conditions that trigger the failure.

  10. η² is a commonly used effect-size measure in analysis of variance and represents an estimate of the degree of association for the sample. We follow the interpretation of Cohen (1988), where 0.0099 constitutes a small effect, 0.0588 a medium effect and 0.1379 a large effect.

  11. The term mean rank is used in the Tukey-Kramer test for multiple comparisons. The test ranks the set of means in ascending order to reduce the number of comparisons to be tested; e.g., given the ranking of means W > X > Y > Z, if there is no difference between the two means with the largest difference (W and Z), there is no point in comparing means with smaller differences, as we would reach the same conclusion.

  12. Vargha and Delaney (2000) suggest that Â12 values of 0.56, 0.64 and 0.71 represent small, medium and large effect sizes, respectively.
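The three effect-size measures described in the footnotes above (Cohen's d, η², and the Vargha-Delaney Â12) can be sketched in plain Python. This is a minimal illustration of the standard textbook definitions, not the paper's actual analysis scripts; the sample data in the tests below are hypothetical.

```python
import math

def cohens_d(a, b):
    """Cohen's d: mean difference in pooled-standard-deviation units
    (thresholds per Cohen 1988: 0.2 small, 0.5 moderate, 0.8 large)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

def eta_squared(groups):
    """η² for a one-way ANOVA: between-groups sum of squares over the
    total sum of squares (0.0099 small, 0.0588 medium, 0.1379 large)."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    return ss_between / ss_total

def a12(treatment, control):
    """Vargha-Delaney Â12: probability that a random observation from
    `treatment` exceeds one from `control`, ties counting half
    (0.56 small, 0.64 medium, 0.71 large)."""
    wins = sum(1.0 if t > c else 0.5 if t == c else 0.0
               for t in treatment for c in control)
    return wins / (len(treatment) * len(control))
```

Note that Â12 for two identical samples is exactly 0.5, i.e., no effect, which is what makes the measure easy to read off directly as a probability.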



  • Abran A, Bourque P, Dupuis R, Moore JW, Tripp LL (eds) (2004) Guide to the software engineering body of knowledge – SWEBOK. IEEE Press, Piscataway


  • Agruss C, Johnson B (2000) Ad hoc software testing: a perspective on exploration and improvisation.

  • Ahonen J, Junttila T, Sakkinen M (2004) Impacts of the organizational model on testing: three industrial cases. Empir Softw Eng 9(4):275–296


  • Ali S, Briand L, Hemmati H, Panesar-Walawege R (2010) A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans Softw Eng 36(6):742–762


  • Andersson C, Runeson P (2002) Verification and validation in industry – a qualitative survey on the state of practice. In: Proceedings of the 2002 international symposium on empirical software engineering (ISESE’02). IEEE Computer Society, Washington, DC

  • Arisholm E, Gallis H, Dybå T, Sjøberg DIK (2007) Evaluating pair programming with respect to system complexity and programmer expertise. IEEE Trans Softw Eng 33:65–86


  • Bach J (2000) Session-based test management. Software Testing and Quality Engineering Magazine, vol 2, no 6

  • Bach J (2003) Exploratory testing explained.

  • Basili V, Shull F, Lanubile F (1999) Building knowledge through families of experiments. IEEE Trans Softw Eng 25(4):456–473


  • Beer A, Ramler R (2008) The role of experience in software testing practice. In: Proceedings of euromicro conference on software engineering and advanced applications

  • Berner S, Weber R, Keller RK (2005) Observations and lessons learned from automated testing. In: Proceedings of the 27th international conference on software engineering (ICSE’05). ACM, New York

  • Bertolino A (2007) Software testing research: achievements, challenges, dreams. In: Proceedings of the 2007 international conference on future of software engineering (FOSE’07)

  • Bertolino A (2008) Software testing forever: old and new processes and techniques for validating today’s applications. In: Jedlitschka A, Salo O (eds) Product-focused software process improvement, lecture notes in computer science, vol 5089. Springer, Berlin Heidelberg


  • Bhatti K, Ghazi AN (2010) Effectiveness of exploratory testing: an empirical scrutiny of the challenges and factors affecting the defect detection efficiency. Master’s thesis, Blekinge Institute of Technology

  • Briand LC (2007) A critical analysis of empirical research in software testing. In: Proceedings of the 1st international symposium on empirical software engineering and measurement (ESEM’07). IEEE Computer Society, Washington, DC


  • Brooks A, Roper M, Wood M, Daly J, Miller J (2008) Replication’s role in software engineering. In: Shull F, Singer J, Sjøberg DI (eds) Guide to advanced empirical software engineering. Springer, London, pp 365–379 doi:10.1007/978-1-84800-044-5_14


  • Chillarege R, Bhandari I, Chaar J, Halliday M, Moebus D, Ray B, Wong MY (1992) Orthogonal defect classification–a concept for in-process measurements. IEEE Trans Softw Eng 18(11):943–956


  • Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum

  • da Mota Silveira Neto PA, do Carmo Machado I, McGregor JD, de Almeida ES, de Lemos Meira SR (2011) A systematic mapping study of software product lines testing. Inf Softw Technol 53(5):407–423


  • Dias Neto AC, Subramanyan R, Vieira M, Travassos GH (2007) A survey on model-based testing approaches: a systematic review. In: Proceedings of the 1st ACM international workshop on empirical assessment of software engineering languages and technologies (WEASELTech’07): held in conjunction with the 22nd IEEE/ACM international conference on automated software engineering (ASE) 2007. ACM, New York

  • do Nascimento LHO, Machado PDL (2007) An experimental evaluation of approaches to feature testing in the mobile phone applications domain. In: Workshop on domain specific approaches to software test automation (DOSTA’07): in conjunction with the 6th ESEC/FSE joint meeting. ACM, New York

  • Dustin E, Rashka J, Paul J (1999) Automated software testing: introduction, management, and performance. Addison-Wesley Professional

  • Houdek F, Ernst D, Schwinn T (2002) Defect detection for executable specifications – an experiment. Int J Softw Eng Knowl Eng 12(6):637–655


  • Galletta DF, Abraham D, El Louadi M, Lekse W, Pollalis YA, Sampler JL (1993) An empirical study of spreadsheet error-finding performance. Account Manag Inf Technol 3(2):79–95


  • Goodenough JB, Gerhart SL (1975) Toward a theory of test data selection. SIGPLAN Notes 10(6):493–510


  • Graves TL, Harrold MJ, Kim JM, Porter A, Rothermel G (2001) An empirical study of regression test selection techniques. ACM Trans Softw Eng Methodol 10:184–208


  • Grechanik M, Xie Q, Fu C (2009) Maintaining and evolving GUI-directed test scripts. In: Proceedings of the 31st international conference on software engineering (ICSE’09). IEEE Computer Society, Washington, DC, pp 408–418

  • Hartman A (2002) Is issta research relevant to industry? SIGSOFT Softw Eng Notes 27(4):205–206


  • Höst M, Wohlin C, Thélin T (2005) Experimental context classification: incentives and experience of subjects. In: Proceedings of the 27th international conference on software engineering (ICSE’05)

  • Hutchins M, Foster H, Goradia T, Ostrand T (1994) Experiments of the effectiveness of data flow and control flow based test adequacy criteria. In: Proceedings of the 16th international conference on software engineering (ICSE’94). IEEE Computer Society Press, Los Alamitos, pp 191–200

  • IEEE 1044-2009 (2010) IEEE standard classification for software anomalies

  • Itkonen J (2008) Do test cases really matter? An experiment comparing test case based and exploratory testing. Licentiate Thesis, Helsinki University of Technology

  • Itkonen J, Rautiainen K (2005) Exploratory testing: a multiple case study. In: 2005 international symposium on empirical software engineering (ISESE’05), pp 84–93

  • Itkonen J, Mäntylä M, Lassenius C (2007) Defect detection efficiency: test case based vs. exploratory testing. In: 1st international symposium on empirical software engineering and measurement (ESEM’07), pp 61–70

  • Itkonen J, Mäntylä MV, Lassenius C (2009) How do testers do it? An exploratory study on manual testing practices. In: 3rd international symposium on empirical software engineering and measurement (ESEM’09), pp 494–497

  • Itkonen J, Mäntylä M, Lassenius C (2013) The role of the tester’s knowledge in exploratory software testing. IEEE Trans Softw Eng 39(5):707–724


  • Jia Y, Harman M (2011) An analysis and survey of the development of mutation testing. IEEE Trans Softw Eng 37(5):649–678


  • Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Kluwer, Boston


  • Juristo N, Moreno A, Vegas S (2004) Reviewing 25 years of testing technique experiments. Empir Softw Eng 9(1):7–44


  • Kamsties E, Lott CM (1995) An empirical evaluation of three defect detection techniques. In: Proceedings of the 5th European software engineering conference (ESEC’95). Springer, London, pp 362–383


  • Kaner C, Bach J, Pettichord B (2008) Lessons learned in software testing, 1st edn. Wiley-India


  • Kettunen V, Kasurinen J, Taipale O, Smolander K (2010) A study on agility and testing processes in software organizations. In: Proceedings of the international symposium on software testing and analysis

  • Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28:721–734


  • Kuhn D, Wallace D, Gallo A (2004) Software fault interactions and implications for software testing. IEEE Trans Softw Eng 30(6):418–421


  • Lung J, Aranda J, Easterbrook S, Wilson G (2008) On the difficulty of replicating human subjects studies in software engineering. In: ACM/IEEE 30th international conference on software engineering (ICSE’08)

  • Lyndsay J, van Eeden N (2003) Adventures in session-based testing.

  • Myers GJ, Sandler C, Badgett T (1979) The art of software testing. Wiley, New York


  • Naseer A, Zulfiqar M (2010) Investigating exploratory testing in industrial practice. Master’s thesis, Blekinge Institute of Technology

  • Nie C, Leung H (2011) A survey of combinatorial testing. ACM Comput Surv 43(2):1–29


  • Poon P, Tse TH, Tang S, Kuo F (2011) Contributions of tester experience and a checklist guideline to the identification of categories and choices for software testing. Softw Qual J 19(1):141–163


  • Ryber T (2007) Essential software test design. Unique Publishing Ltd.

  • Sjøberg D, Hannay J, Hansen O, Kampenes V, Karahasanovic A, Liborg NK, Rekdal A (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753


  • Svahnberg M, Aurum A, Wohlin C (2008) Using students as subjects – An empirical evaluation. In: Proceedings of the 2nd ACM-IEEE international symposium on empirical software engineering and measurement (ESEM’08). ACM, New York

  • Taipale O, Kalviainen H, Smolander K (2006) Factors affecting software testing time schedule. In: Proceedings of the Australian software engineering conference (ASE’06). IEEE Computer Society, Washington, DC, pp 283–291

  • van Veenendaal E, Bach J, Basili V, Black R, Comey C, Dekkers T, Evans I, Gerard P, Gilb T, Hatton L, Hayman D, Hendriks R, Koomen T, Meyerhoff D, Pol M, Reid S, Schaefer H, Schotanus C, Seubers J, Shull F, Swinkels R, Teunissen R, van Vonderen R, Watkins J, van der Zwan M (2002) The testing practitioner. UTN Publishers

  • Våga J, Amland S (2002) Managing high-speed web testing. Springer, New York, pp 23–30


  • Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132


  • Weyuker EJ (1993) More experience with data flow testing. IEEE Trans Softw Eng 19:912–919


  • Whittaker JA (2010) Exploratory software testing. Addison-Wesley

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer, Norwell


  • Wood M, Roper M, Brooks A, Miller J (1997) Comparing and combining software defect detection techniques: a replicated empirical study. In: Proceedings of the 6th European software engineering conference (ESEC’97) held jointly with the 5th ACM SIGSOFT international symposium on foundations of software engineering (FSE’97). Springer New York, pp 262–277


  • Yamaura T (2002) How to design practical test cases. IEEE Softw 15(6):30–36


  • Yang B, Hu H, Jia L (2008) A study of uncertainty in software cost and its impact on optimal software release time. IEEE Trans Softw Eng 34(6):813–825



Author information



Corresponding author

Correspondence to Wasif Afzal.

Additional information

Communicated by: José Carlos Maldonado


Appendix A: Test Case Template for TCT

Please use this template to design the test cases. Fill the fields accordingly.

  • Date:

  • Name:

  • Subject ID:

Table 14 Test case template
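Table 14 itself is not reproduced in this excerpt. For illustration only, a typical TCT test-case template of the kind the appendix describes might be sketched as a record like the following; the field names and the example values are assumptions, not the paper's actual template:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """Hypothetical test-case record for documented (TCT) testing.
    Fields are a common textbook layout, assumed for illustration."""
    case_id: str
    feature: str                                  # feature under test
    preconditions: str                            # state required before execution
    steps: list = field(default_factory=list)     # ordered manual steps
    expected_result: str = ""                     # pass/fail oracle
    actual_result: str = ""                       # filled in during the session
    verdict: str = "not run"                      # "pass", "fail", or "not run"

# Example entry a subject might write before the testing session.
tc = TestCase(
    case_id="TC-01",
    feature="Find and replace",
    preconditions="jEdit is open with a non-empty text buffer",
    steps=["Open the Search dialog",
           "Enter a search term and a replacement",
           "Click Replace All"],
    expected_result="All occurrences are replaced and a count is reported",
)
```

The point of such a record is that the expected result is fixed before execution, which is precisely what distinguishes TCT from the exploratory charter in Appendix C.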

Appendix B: Defect Report

Please report your found defects in this document. Once you are done, please return the document to the instructor.

  • Name:

  • Subject ID:

Table 15 Defect types
Table 16 Defects in the experimental object

Appendix C: ET – Test Session Charter

  • Description: In this test session your task is to perform functional testing of the jEdit application feature set from the viewpoint of a typical user. Your goal is to analyse the system’s suitability for its intended use from the viewpoint of a typical text editor user. Take into account the needs both of an occasional user who is not familiar with all of jEdit’s features and of an advanced user.

  • What – Tested areas: Try to cover all the features listed below in your testing. Focus on the first-priority functions, but make sure that you also cover the second-priority functions at some level during the fixed-length session.

    • First priority functions (refer to Section 3.5).

    • Second priority functions (refer to Section 3.5).

  • Why – Goal: Your goal is to reveal as many defects in the system as possible. Describe the defects you find briefly; detailed analysis of the found defects is left out of this test session.

  • How – Approach: The focus is on testing the functionality. Try to test exceptional cases, valid as well as invalid inputs, typical error situations, and things that the user could do wrong. Use manual testing, and try to form equivalence classes and test their boundaries. Also try to test relevant combinations of the features.

  • Focus – What problem to look for: Pay attention to the following issues:

    • Does the function work as described in the user manual?

    • Does the function do things that it should not?

    • From the viewpoint of a typical user, does the function work as the user would expect?

    • What interactions does the function have, or might it have, with other functions? Do these interactions work correctly, as a user would expect?

  • Exploratory log: Write your log in a separate document.
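The “How” guidance in the charter above — forming equivalence classes and testing their boundaries — can be sketched as follows. The integer range and the “tab size” setting are hypothetical examples chosen for illustration; they are not part of the jEdit charter:

```python
def derive_boundary_tests(low, high):
    """Classic boundary-value inputs around a valid integer range
    [low, high]: one value just outside each boundary, the boundaries
    themselves, and a representative of the valid equivalence class."""
    return {
        "invalid_below": low - 1,            # invalid class: too small
        "lower_boundary": low,               # smallest valid value
        "valid_typical": (low + high) // 2,  # representative valid value
        "upper_boundary": high,              # largest valid value
        "invalid_above": high + 1,           # invalid class: too large
    }

# e.g. a hypothetical setting that accepts integers from 1 to 100
cases = derive_boundary_tests(1, 100)
```

Exercising one value per equivalence class plus the values at and just beyond each boundary is a compact way to satisfy the charter’s instruction to cover valid inputs, invalid inputs, and exceptional cases.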


About this article


Cite this article

Afzal, W., Ghazi, A.N., Itkonen, J. et al. An experiment on the effectiveness and efficiency of exploratory testing. Empir Software Eng 20, 844–878 (2015).
