
An experiment on the effectiveness and efficiency of exploratory testing


The exploratory testing (ET) approach is commonly applied in industry, but lacks scientific research. The scientific community needs quantitative results on the performance of ET taken from realistic experimental settings. The objective of this paper is to quantify the effectiveness and efficiency of ET vs. testing with documented test cases (test case based testing, TCT). We performed four controlled experiments where a total of 24 practitioners and 46 students performed manual functional testing using ET and TCT. We measured the number of identified defects in the 90-minute testing sessions, the detection difficulty, severity and types of the detected defects, and the number of false defect reports. The results show that ET found a significantly greater number of defects. ET also found significantly more defects of varying levels of difficulty, types and severity levels. However, the two testing approaches did not differ significantly in terms of the number of false defect reports submitted. We conclude that ET was more efficient than TCT in our experiment. ET was also more effective than TCT when detection difficulty, type of defects and severity levels are considered. The two approaches are comparable when it comes to the number of false defect reports submitted.




  1. Obviously it will help a tester if such knowledge exists (to find expected risks).

  2. For recent reviews of software testing techniques, see Jia and Harman (2011); Ali et al. (2010); da Mota Silveira Neto et al. (2011); Nie and Leung (2011); Dias Neto et al. (2007).

  3. The 90-minute session length was chosen as suggested by Bach (2000) but is not a strict requirement (we were constrained by the limited time available from our industrial and academic subjects).

  4. The C++ source code files were given to the subjects as an example of code formatting and indentation, to guide them in detecting formatting and indentation defects.

  5. jEdit version 4.2.

  6. The exploratory charter provided the subjects with high-level test guidelines.

  7. Cohen's d expresses the mean difference between the two groups in standard-deviation units. The values of d are interpreted differently for different research questions; we have followed the standard interpretation offered by Cohen (1988), where 0.8, 0.5 and 0.2 indicate large, moderate and small practical significance, respectively.
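    As an illustration of the measure (the defect counts below are invented, not data from the experiment), Cohen's d is the difference of group means divided by the pooled standard deviation:

    ```python
    import statistics

    def cohens_d(group_a, group_b):
        """Cohen's d: mean difference between two groups in pooled-SD units."""
        na, nb = len(group_a), len(group_b)
        mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
        # Pooled variance: sample variances weighted by degrees of freedom
        pooled_var = ((na - 1) * statistics.variance(group_a)
                      + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
        return mean_diff / pooled_var ** 0.5

    # Hypothetical defect counts for two groups of testers
    et_defects = [12, 15, 9, 14, 11]
    tct_defects = [8, 10, 7, 9, 6]
    print(round(cohens_d(et_defects, tct_defects), 2))  # → 2.07, a large effect by Cohen's thresholds
    ```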

  8. The median is a closer indication of the true average than the mean in the presence of extreme values.

  9. The FTFI number is somewhat ambiguously named in the original article: the metric is not about fault interactions but about interactions of the inputs or conditions that trigger the failure.

  10. η² is a commonly used effect size measure in analysis of variance and represents an estimate of the degree of association for the sample. We have followed Cohen's (1988) interpretation of η², where 0.0099 constitutes a small effect, 0.0588 a medium effect and 0.1379 a large effect.
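    For concreteness (with invented samples, not the paper's data), η² is the between-groups sum of squares divided by the total sum of squares:

    ```python
    import statistics

    def eta_squared(*groups):
        """η²: between-groups sum of squares over total sum of squares."""
        all_values = [x for group in groups for x in group]
        grand_mean = statistics.fmean(all_values)
        ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                         for g in groups)
        ss_total = sum((x - grand_mean) ** 2 for x in all_values)
        return ss_between / ss_total

    # Three hypothetical samples
    print(eta_squared([1, 2, 3], [2, 3, 4], [3, 4, 5]))  # → 0.5, a large effect by Cohen's thresholds
    ```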

  11. The term mean rank is used in the Tukey-Kramer test for multiple comparisons. This test ranks the set of means in ascending order to reduce the number of comparisons to be tested. For example, given the ranking W > X > Y > Z, if there is no difference between the two means with the largest difference (W and Z), comparing means with smaller differences is of no use, as we would reach the same conclusion.

  12. Vargha and Delaney suggest that Â12 values of 0.56, 0.64 and 0.71 represent small, medium and large effect sizes, respectively (Vargha and Delaney 2000).
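    The Â12 statistic is the probability that a randomly chosen observation from one group exceeds a randomly chosen observation from the other, with ties counted as one half. A minimal sketch, using invented samples rather than the paper's data:

    ```python
    def vargha_delaney_a12(group_a, group_b):
        """Â12: probability that a random observation from group_a exceeds one
        from group_b, counting ties as one half."""
        wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
                   for a in group_a for b in group_b)
        return wins / (len(group_a) * len(group_b))

    # Â12 = 0.5 means no difference; 1.0 means group_a always dominates
    print(round(vargha_delaney_a12([2, 3, 4], [1, 2, 3]), 3))  # → 0.778
    ```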


  1. Abran A, Bourque P, Dupuis R, Moore JW, Tripp LL (eds) (2004) Guide to the software engineering body of knowledge – SWEBOK. IEEE Press, Piscataway


  2. Agruss C, Johnson B (2000) Ad hoc software testing: a perspective on exploration and improvisation.

  3. Ahonen J, Junttila T, Sakkinen M (2004) Impacts of the organizational model on testing: three industrial cases. Empir Softw Eng 9(4):275–296


  4. Ali S, Briand L, Hemmati H, Panesar-Walawege R (2010) A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans Softw Eng 36(6):742–762


  5. Andersson C, Runeson P (2002) Verification and validation in industry – a qualitative survey on the state of practice. In: Proceedings of the 2002 international symposium on empirical software engineering (ISESE’02). IEEE Computer Society, Washington, DC

  6. Arisholm E, Gallis H, Dybå T, Sjøberg DIK (2007) Evaluating pair programming with respect to system complexity and programmer expertise. IEEE Trans Softw Eng 33:65–86


  7. Bach J (2000) Session-based test management. Software Testing and Quality Engineering Magazine, vol 2, no 6

  8. Bach J (2003) Exploratory testing explained.

  9. Basili V, Shull F, Lanubile F (1999) Building knowledge through families of experiments. IEEE Trans Softw Eng 25(4):456–473


  10. Beer A, Ramler R (2008) The role of experience in software testing practice. In: Proceedings of euromicro conference on software engineering and advanced applications

  11. Berner S, Weber R, Keller RK (2005) Observations and lessons learned from automated testing. In: Proceedings of the 27th international conference on software engineering (ICSE’05). ACM, New York

  12. Bertolino A (2007) Software testing research: achievements, challenges, dreams. In: Proceedings of the 2007 international conference on future of software engineering (FOSE’07)

  13. Bertolino A (2008) Software testing forever: old and new processes and techniques for validating today’s applications. In: Jedlitschka A, Salo O (eds) Product-focused software process improvement, lecture notes in computer science, vol 5089. Springer, Berlin Heidelberg


  14. Bhatti K, Ghazi AN (2010) Effectiveness of exploratory testing: an empirical scrutiny of the challenges and factors affecting the defect detection efficiency. Master’s thesis, Blekinge Institute of Technology

  15. Briand LC (2007) A critical analysis of empirical research in software testing. In: Proceedings of the 1st international symposium on empirical software engineering and measurement (ESEM’07). IEEE Computer Society, Washington, DC


  16. Brooks A, Roper M, Wood M, Daly J, Miller J (2008) Replication’s role in software engineering. In: Shull F, Singer J, Sjøberg DI (eds) Guide to advanced empirical software engineering. Springer, London, pp 365–379 doi:10.1007/978-1-84800-044-5_14


  17. Chillarege R, Bhandari I, Chaar J, Halliday M, Moebus D, Ray B, Wong MY (1992) Orthogonal defect classification–a concept for in-process measurements. IEEE Trans Softw Eng 18(11):943–956


  18. Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum

  19. da Mota Silveira Neto PA, do Carmo Machado I, McGregor JD, de Almeida ES, de Lemos Meira SR (2011) A systematic mapping study of software product lines testing. Inf Softw Technol 53(5):407–423


  20. Dias Neto AC, Subramanyan R, Vieira M, Travassos GH (2007) A survey on model-based testing approaches: a systematic review. In: Proceedings of the 1st ACM international workshop on empirical assessment of software engineering languages and technologies (WEASELTech’07): held in conjunction with the 22nd IEEE/ACM international conference on automated software engineering (ASE) 2007. ACM, New York

  21. do Nascimento LHO, Machado PDL (2007) An experimental evaluation of approaches to feature testing in the mobile phone applications domain. In: Workshop on domain specific approaches to software test automation (DOSTA’07): in conjunction with the 6th ESEC/FSE joint meeting. ACM, New York

  22. Dustin E, Rashka J, Paul J (1999) Automated software testing: introduction, management, and performance. Addison-Wesley Professional

  23. Houdek F, Ernst D, Schwinn T (2002) Defect detection for executable specifications – an experiment. Int J Softw Eng Knowl Eng 12(6):637–655


  24. Galletta DF, Abraham D, El Louadi M, Lekse W, Pollalis YA, Sampler JL (1993) An empirical study of spreadsheet error-finding performance. Account Manag Inf Technol 3(2):79–95


  25. Goodenough JB, Gerhart SL (1975) Toward a theory of test data selection. SIGPLAN Notices 10(6):493–510


  26. Graves TL, Harrold MJ, Kim JM, Porter A, Rothermel G (2001) An empirical study of regression test selection techniques. ACM Trans Softw Eng Methodol 10:184–208


  27. Grechanik M, Xie Q, Fu C (2009) Maintaining and evolving GUI-directed test scripts. In: Proceedings of the 31st international conference on software engineering (ICSE’09). IEEE Computer Society, Washington, DC, pp 408–418

  28. Hartman A (2002) Is issta research relevant to industry? SIGSOFT Softw Eng Notes 27(4):205–206


  29. Höst M, Wohlin C, Thélin T (2005) Experimental context classification: incentives and experience of subjects. In: Proceedings of the 27th international conference on software engineering (ICSE’05)

  30. Hutchins M, Foster H, Goradia T, Ostrand T (1994) Experiments of the effectiveness of data flow and control flow based test adequacy criteria. In: Proceedings of the 16th international conference on software engineering (ICSE’94). IEEE Computer Society Press, Los Alamitos, pp 191–200

  31. IEEE 1044-2009 (2010) IEEE standard classification for software anomalies

  32. Itkonen J (2008) Do test cases really matter? An experiment comparing test case based and exploratory testing. Licentiate Thesis, Helsinki University of Technology

  33. Itkonen J, Rautiainen K (2005) Exploratory testing: a multiple case study. In: 2005 international symposium on empirical software engineering (ISESE’05), pp 84–93

  34. Itkonen J, Mäntylä M, Lassenius C (2007) Defect detection efficiency: test case based vs. exploratory testing. In: 1st international symposium on empirical software engineering and measurement (ESEM’07), pp 61–70

  35. Itkonen J, Mäntylä MV, Lassenius C (2009) How do testers do it? An exploratory study on manual testing practices. In: 3rd international symposium on empirical software engineering and measurement (ESEM’09), pp 494–497

  36. Itkonen J, Mäntylä M, Lassenius C (2013) The role of the tester’s knowledge in exploratory software testing. IEEE Trans Softw Eng 39(5):707–724


  37. Jia Y, Harman M (2011) An analysis and survey of the development of mutation testing. IEEE Trans Softw Eng 37(5):649–678


  38. Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Kluwer, Boston


  39. Juristo N, Moreno A, Vegas S (2004) Reviewing 25 years of testing technique experiments. Empir Softw Eng 9(1):7–44


  40. Kamsties E, Lott CM (1995) An empirical evaluation of three defect detection techniques. In: Proceedings of the 5th European software engineering conference (ESEC’95). Springer, London, pp 362–383


  41. Kaner C, Bach J, Pettichord B (2008) Lessons learned in software testing, 1st edn. Wiley-India


  42. Kettunen V, Kasurinen J, Taipale O, Smolander K (2010) A study on agility and testing processes in software organizations. In: Proceedings of the international symposium on software testing and analysis

  43. Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28:721–734


  44. Kuhn D, Wallace D, Gallo A (2004) Software fault interactions and implications for software testing. IEEE Trans Softw Eng 30(6):418–421


  45. Lung J, Aranda J, Easterbrook S, Wilson G (2008) On the difficulty of replicating human subjects studies in software engineering. In: ACM/IEEE 30th international conference on software engineering (ICSE’08)

  46. Lyndsay J, van Eeden N (2003) Adventures in session-based testing.

  47. Myers GJ, Sandler C, Badgett T (1979) The art of software testing. Wiley, New York


  48. Naseer A, Zulfiqar M (2010) Investigating exploratory testing in industrial practice. Master’s thesis, Blekinge Institute of Technology

  49. Nie C, Leung H (2011) A survey of combinatorial testing. ACM Comput Surv 43(2):1–29


  50. Poon P, Tse TH, Tang S, Kuo F (2011) Contributions of tester experience and a checklist guideline to the identification of categories and choices for software testing. Softw Qual J 19(1):141–163


  51. Ryber T (2007) Essential software test design. Unique Publishing Ltd.

  52. Sjøberg D, Hannay J, Hansen O, Kampenes V, Karahasanovic A, Liborg NK, Rekdal A (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753


  53. Svahnberg M, Aurum A, Wohlin C (2008) Using students as subjects – An empirical evaluation. In: Proceedings of the 2nd ACM-IEEE international symposium on empirical software engineering and measurement (ESEM’08). ACM, New York

  54. Taipale O, Kalviainen H, Smolander K (2006) Factors affecting software testing time schedule. In: Proceedings of the Australian software engineering conference (ASE’06). IEEE Computer Society, Washington, DC, pp 283–291

  55. van Veenendaal E, Bach J, Basili V, Black R, Comey C, Dekkers T, Evans I, Gerard P, Gilb T, Hatton L, Hayman D, Hendriks R, Koomen T, Meyerhoff D, Pol M, Reid S, Schaefer H, Schotanus C, Seubers J, Shull F, Swinkels R, Teunissen R, van Vonderen R, Watkins J, van der Zwan M (2002) The testing practitioner. UTN Publishers

  56. Våga J, Amland S (2002) Managing high-speed web testing. Springer, New York, pp 23–30


  57. Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132


  58. Weyuker EJ (1993) More experience with data flow testing. IEEE Trans Softw Eng 19:912–919


  59. Whittaker JA (2010) Exploratory software testing. Addison-Wesley

  60. Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer, Norwell


  61. Wood M, Roper M, Brooks A, Miller J (1997) Comparing and combining software defect detection techniques: a replicated empirical study. In: Proceedings of the 6th European software engineering conference (ESEC’97) held jointly with the 5th ACM SIGSOFT international symposium on foundations of software engineering (FSE’97). Springer New York, pp 262–277


  62. Yamaura T (2002) How to design practical test cases. IEEE Softw 15(6):30–36


  63. Yang B, Hu H, Jia L (2008) A study of uncertainty in software cost and its impact on optimal software release time. IEEE Trans Softw Eng 34(6):813–825



Author information



Corresponding author

Correspondence to Wasif Afzal.

Additional information

Communicated by: José Carlos Maldonado


Appendix A: Test Case Template for TCT

Please use this template to design the test cases. Fill in the fields accordingly.

  • Date:

  • Name:

  • Subject ID:

Table 14 Test case template

Appendix B: Defect Report

Please report the defects you found in this document. Once you are done, please return the document to the instructor.

  • Name:

  • Subject ID:

Table 15 Defect types
Table 16 Defects in the experimental object

Appendix C: ET – Test Session Charter

  • Description: In this test session your task is to do functional testing of the jEdit application feature set from the viewpoint of a typical user. Your goal is to analyse the system’s suitability for its intended use from the viewpoint of a typical text editor user. Take into account the needs both of an occasional user who is not familiar with all the features of jEdit and of an advanced user.

  • What – Tested areas: Try to cover all the features listed below in your testing. Focus on the first priority functions, but make sure that you also cover the second priority functions at some level during the fixed-length session.

    • First priority functions (refer to Section 3.5).

    • Second priority functions (refer to Section 3.5).

  • Why – Goal: Your goal is to reveal as many defects in the system as possible. Describe the found defects briefly; detailed analysis of the found defects is left out of this test session.

  • How – Approach: The focus is on testing the functionality. Try to test exceptional cases, valid as well as invalid inputs, typical error situations, and things that the user could do wrong. Test manually, try to form equivalence classes and test boundaries, and try to test relevant combinations of the features.

  • Focus – What problem to look for: Pay attention to the following issues:

    • Does the function work as described in the user manual?

    • Does the function do things that it should not?

    • From the viewpoint of a typical user, does the function work as the user would expect?

    • What interactions does the function have, or might it have, with other functions? Do these interactions work correctly, as a user would expect?

  • Exploratory log: Write your log in a separate document.
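The equivalence-class and boundary-value tactic the charter asks for can be sketched as follows (the "go to line" feature and its valid range are hypothetical, chosen only for illustration; they are not part of the charter):

```python
# Equivalence classes for a hypothetical "go to line" input in a text editor:
# below range (invalid), 1..total_lines (valid), above range (invalid).

def is_valid_goto_line(line_number, total_lines):
    return 1 <= line_number <= total_lines

# Boundary values sit just on and just outside the edges of the valid class
boundary_cases = {
    0: False,    # just below the lower boundary
    1: True,     # lower boundary
    100: True,   # upper boundary
    101: False,  # just above the upper boundary
}
for value, expected in boundary_cases.items():
    assert is_valid_goto_line(value, total_lines=100) == expected
print("all boundary cases pass")
```

Testing one representative per class plus the boundary values keeps the number of inputs small while still probing where off-by-one defects typically hide.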

Rights and permissions




Cite this article

Afzal, W., Ghazi, A.N., Itkonen, J. et al. An experiment on the effectiveness and efficiency of exploratory testing. Empir Software Eng 20, 844–878 (2015).



  • Software testing
  • Experiment
  • Exploratory testing
  • Efficiency
  • Effectiveness