
An experiment on the effectiveness and efficiency of exploratory testing


The exploratory testing (ET) approach is commonly applied in industry, but lacks scientific research. The scientific community needs quantitative results on the performance of ET taken from realistic experimental settings. The objective of this paper is to quantify the effectiveness and efficiency of ET vs. testing with documented test cases (test case based testing, TCT). We performed four controlled experiments where a total of 24 practitioners and 46 students performed manual functional testing using ET and TCT. We measured the number of identified defects in the 90-minute testing sessions, the detection difficulty, severity and types of the detected defects, and the number of false defect reports. The results show that ET found a significantly greater number of defects. ET also found significantly more defects of varying levels of difficulty, types and severity levels. However, the two testing approaches did not differ significantly in terms of the number of false defect reports submitted. We conclude that ET was more efficient than TCT in our experiment. ET was also more effective than TCT when detection difficulty, type of defects and severity levels are considered. The two approaches are comparable when it comes to the number of false defect reports submitted.




  1. Such knowledge, where it exists, will obviously help a tester to find the expected risks.

  2. For recent reviews on software testing techniques, see Jia and Harman (2011); Ali et al. (2010); da Mota Silveira Neto et al. (2011); Nie and Leung (2011); Dias Neto et al. (2007).

  3. The 90-minute session length was chosen as suggested by Bach (2000), but is not a strict requirement (we were also constrained by the limited time our industrial and academic subjects had available for the experiments).

  4. The C++ source code files were given to the subjects as examples of code formatting and indentation, to guide them in detecting formatting and indentation defects.

  5. jEdit version 4.2

  6. The exploratory charter provided the subjects with high-level test guidelines.

  7. Cohen’s d expresses the mean difference between the two groups in standard-deviation units. The interpretation of d differs across research questions; we follow the standard interpretation offered by Cohen (1988), where 0.8, 0.5 and 0.2 indicate large, moderate and small practical significance, respectively.

  8. The median is a closer indication of the true average than the mean here, owing to the presence of extreme values.

  9. The FTFI number is somewhat ambiguously named in the original article, since the metric is not about fault interactions, but about interactions of the inputs or conditions that trigger the failure.

  10. η² is a commonly used effect-size measure in analysis of variance and represents an estimate of the degree of association for the sample. We follow the interpretation of Cohen (1988), where 0.0099 constitutes a small effect, 0.0588 a medium effect and 0.1379 a large effect.

  11. The term mean rank is used in the Tukey-Kramer test for multiple comparisons. The test ranks the set of means in ascending order to reduce the number of comparisons to be tested; e.g., given the ranking of means W > X > Y > Z, if there is no difference between the two means with the largest difference (W and Z), there is no point in comparing means with smaller differences, as we would reach the same conclusion.

  12. Vargha and Delaney (2000) suggest that Â12 values of 0.56, 0.64 and 0.71 represent small, medium and large effect sizes, respectively.
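The three effect-size measures described in the footnotes above (Cohen's d, η², and the Vargha-Delaney Â12) can be sketched in plain Python. This is a minimal illustration of the standard textbook definitions, not the paper's actual analysis scripts; the sample data in the tests below are hypothetical.

```python
import math

def cohens_d(a, b):
    """Cohen's d: mean difference in pooled-standard-deviation units
    (thresholds per Cohen 1988: 0.2 small, 0.5 moderate, 0.8 large)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

def eta_squared(groups):
    """η² for a one-way ANOVA: between-groups sum of squares over the
    total sum of squares (0.0099 small, 0.0588 medium, 0.1379 large)."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    return ss_between / ss_total

def a12(treatment, control):
    """Vargha-Delaney Â12: probability that a random observation from
    `treatment` exceeds one from `control`, ties counting half
    (0.56 small, 0.64 medium, 0.71 large)."""
    wins = sum(1.0 if t > c else 0.5 if t == c else 0.0
               for t in treatment for c in control)
    return wins / (len(treatment) * len(control))
```

Note that Â12 for two identical samples is exactly 0.5, i.e., no effect, which is what makes the measure easy to read off directly as a probability.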



  • Abran A, Bourque P, Dupuis R, Moore JW, Tripp LL (eds) (2004) Guide to the software engineering body of knowledge – SWEBOK. IEEE Press, Piscataway


  • Agruss C, Johnson B (2000) Ad hoc software testing: a perspective on exploration and improvisation.

  • Ahonen J, Junttila T, Sakkinen M (2004) Impacts of the organizational model on testing: three industrial cases. Empir Softw Eng 9(4):275–296


  • Ali S, Briand L, Hemmati H, Panesar-Walawege R (2010) A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans Softw Eng 36(6):742–762


  • Andersson C, Runeson P (2002) Verification and validation in industry – a qualitative survey on the state of practice. In: Proceedings of the 2002 international symposium on empirical software engineering (ISESE’02). IEEE Computer Society, Washington, DC

  • Arisholm E, Gallis H, Dybå T, Sjøberg DIK (2007) Evaluating pair programming with respect to system complexity and programmer expertise. IEEE Trans Softw Eng 33:65–86


  • Bach J (2000) Session-based test management. Software Testing and Quality Engineering Magazine, vol 2, no 6

  • Bach J (2003) Exploratory testing explained.

  • Basili V, Shull F, Lanubile F (1999) Building knowledge through families of experiments. IEEE Trans Softw Eng 25(4):456–473


  • Beer A, Ramler R (2008) The role of experience in software testing practice. In: Proceedings of euromicro conference on software engineering and advanced applications

  • Berner S, Weber R, Keller RK (2005) Observations and lessons learned from automated testing. In: Proceedings of the 27th international conference on software engineering (ICSE’05). ACM, New York

  • Bertolino A (2007) Software testing research: achievements, challenges, dreams. In: Proceedings of the 2007 international conference on future of software engineering (FOSE’07)

  • Bertolino A (2008) Software testing forever: old and new processes and techniques for validating today’s applications. In: Jedlitschka A, Salo O (eds) Product-focused software process improvement, lecture notes in computer science, vol 5089. Springer, Berlin Heidelberg


  • Bhatti K, Ghazi AN (2010) Effectiveness of exploratory testing: an empirical scrutiny of the challenges and factors affecting the defect detection efficiency. Master’s thesis, Blekinge Institute of Technology

  • Briand LC (2007) A critical analysis of empirical research in software testing. In: Proceedings of the 1st international symposium on empirical software engineering and measurement (ESEM’07). IEEE Computer Society, Washington, DC


  • Brooks A, Roper M, Wood M, Daly J, Miller J (2008) Replication’s role in software engineering. In: Shull F, Singer J, Sjøberg DI (eds) Guide to advanced empirical software engineering. Springer, London, pp 365–379 doi:10.1007/978-1-84800-044-5_14


  • Chillarege R, Bhandari I, Chaar J, Halliday M, Moebus D, Ray B, Wong MY (1992) Orthogonal defect classification–a concept for in-process measurements. IEEE Trans Softw Eng 18(11):943–956


  • Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum

  • da Mota Silveira Neto PA, do Carmo Machado I, McGregor JD, de Almeida ES, de Lemos Meira SR (2011) A systematic mapping study of software product lines testing. Inf Softw Technol 53(5):407–423


  • Dias Neto AC, Subramanyan R, Vieira M, Travassos GH (2007) A survey on model-based testing approaches: a systematic review. In: Proceedings of the 1st ACM international workshop on empirical assessment of software engineering languages and technologies (WEASELTech’07): held in conjunction with the 22nd IEEE/ACM international conference on automated software engineering (ASE) 2007. ACM, New York

  • do Nascimento LHO, Machado PDL (2007) An experimental evaluation of approaches to feature testing in the mobile phone applications domain. In: Workshop on domain specific approaches to software test automation (DOSTA’07): in conjunction with the 6th ESEC/FSE joint meeting. ACM, New York

  • Dustin E, Rashka J, Paul J (1999) Automated software testing: introduction, management, and performance. Addison-Wesley Professional

  • Houdek F, Ernst D, Schwinn T (2002) Defect detection for executable specifications – an experiment. Int J Softw Eng Knowl Eng 12(6):637–655


  • Galletta DF, Abraham D, El Louadi M, Lekse W, Pollalis YA, Sampler JL (1993) An empirical study of spreadsheet error-finding performance. Account Manag Inf Technol 3(2):79–95


  • Goodenough JB, Gerhart SL (1975) Toward a theory of test data selection. SIGPLAN Notes 10(6):493–510


  • Graves TL, Harrold MJ, Kim JM, Porter A, Rothermel G (2001) An empirical study of regression test selection techniques. ACM Trans Softw Eng Methodol 10:184–208


  • Grechanik M, Xie Q, Fu C (2009) Maintaining and evolving GUI-directed test scripts. In: Proceedings of the 31st international conference on software engineering (ICSE’09). IEEE Computer Society, Washington, DC, pp 408–418

  • Hartman A (2002) Is issta research relevant to industry? SIGSOFT Softw Eng Notes 27(4):205–206


  • Höst M, Wohlin C, Thélin T (2005) Experimental context classification: incentives and experience of subjects. In: Proceedings of the 27th international conference on software engineering (ICSE’05)

  • Hutchins M, Foster H, Goradia T, Ostrand T (1994) Experiments of the effectiveness of data flow and control flow based test adequacy criteria. In: Proceedings of the 16th international conference on software engineering (ICSE’94). IEEE Computer Society Press, Los Alamitos, pp 191–200

  • IEEE 1044-2009 (2010) IEEE standard classification for software anomalies

  • Itkonen J (2008) Do test cases really matter? An experiment comparing test case based and exploratory testing. Licentiate Thesis, Helsinki University of Technology

  • Itkonen J, Rautiainen K (2005) Exploratory testing: a multiple case study. In: 2005 international symposium on empirical software engineering (ISESE’05), pp 84–93

  • Itkonen J, Mäntylä M, Lassenius C (2007) Defect detection efficiency: test case based vs. exploratory testing. In: 1st international symposium on empirical software engineering and measurement (ESEM’07), pp 61–70

  • Itkonen J, Mäntylä MV, Lassenius C (2009) How do testers do it? An exploratory study on manual testing practices. In: 3rd international symposium on empirical software engineering and measurement (ESEM’09), pp 494–497

  • Itkonen J, Mäntylä M, Lassenius C (2013) The role of the tester’s knowledge in exploratory software testing. IEEE Trans Softw Eng 39(5):707–724


  • Jia Y, Harman M (2011) An analysis and survey of the development of mutation testing. IEEE Trans Softw Eng 37(5):649–678


  • Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Kluwer, Boston


  • Juristo N, Moreno A, Vegas S (2004) Reviewing 25 years of testing technique experiments. Empir Softw Eng 9(1):7–44


  • Kamsties E, Lott CM (1995) An empirical evaluation of three defect detection techniques. In: Proceedings of the 5th European software engineering conference (ESEC’95). Springer, London, pp 362–383


  • Kaner C, Bach J, Pettichord B (2008) Lessons learned in software testing, 1st edn. Wiley-India


  • Kettunen V, Kasurinen J, Taipale O, Smolander K (2010) A study on agility and testing processes in software organizations. In: Proceedings of the international symposium on software testing and analysis

  • Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28:721–734


  • Kuhn D, Wallace D, Gallo A (2004) Software fault interactions and implications for software testing. IEEE Trans Softw Eng 30(6):418–421


  • Lung J, Aranda J, Easterbrook S, Wilson G (2008) On the difficulty of replicating human subjects studies in software engineering. In: ACM/IEEE 30th international conference on software engineering (ICSE’08)

  • Lyndsay J, van Eeden N (2003) Adventures in session-based testing.

  • Myers GJ, Sandler C, Badgett T (1979) The art of software testing. Wiley, New York


  • Naseer A, Zulfiqar M (2010) Investigating exploratory testing in industrial practice. Master’s thesis, Blekinge Institute of Technology

  • Nie C, Leung H (2011) A survey of combinatorial testing. ACM Comput Surv 43(2):1–29


  • Poon P, Tse TH, Tang S, Kuo F (2011) Contributions of tester experience and a checklist guideline to the identification of categories and choices for software testing. Softw Qual J 19(1):141–163


  • Ryber T (2007) Essential software test design. Unique Publishing Ltd.

  • Sjøberg D, Hannay J, Hansen O, Kampenes V, Karahasanovic A, Liborg NK, Rekdal A (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753


  • Svahnberg M, Aurum A, Wohlin C (2008) Using students as subjects – An empirical evaluation. In: Proceedings of the 2nd ACM-IEEE international symposium on empirical software engineering and measurement (ESEM’08). ACM, New York

  • Taipale O, Kalviainen H, Smolander K (2006) Factors affecting software testing time schedule. In: Proceedings of the Australian software engineering conference (ASE’06). IEEE Computer Society, Washington, DC, pp 283–291

  • van Veenendaal E, Bach J, Basili V, Black R, Comey C, Dekkers T, Evans I, Gerard P, Gilb T, Hatton L, Hayman D, Hendriks R, Koomen T, Meyerhoff D, Pol M, Reid S, Schaefer H, Schotanus C, Seubers J, Shull F, Swinkels R, Teunissen R, van Vonderen R, Watkins J, van der Zwan M (2002) The testing practitioner. UTN Publishers

  • Våga J, Amland S (2002) Managing high-speed web testing. Springer, New York, pp 23–30


  • Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132


  • Weyuker EJ (1993) More experience with data flow testing. IEEE Trans Softw Eng 19:912–919


  • Whittaker JA (2010) Exploratory software testing. Addison-Wesley

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer, Norwell


  • Wood M, Roper M, Brooks A, Miller J (1997) Comparing and combining software defect detection techniques: a replicated empirical study. In: Proceedings of the 6th European software engineering conference (ESEC’97) held jointly with the 5th ACM SIGSOFT international symposium on foundations of software engineering (FSE’97). Springer New York, pp 262–277


  • Yamaura T (2002) How to design practical test cases. IEEE Softw 15(6):30–36


  • Yang B, Hu H, Jia L (2008) A study of uncertainty in software cost and its impact on optimal software release time. IEEE Trans Softw Eng 34(6):813–825



Author information



Corresponding author

Correspondence to Wasif Afzal.

Additional information

Communicated by: José Carlos Maldonado


Appendix A: Test Case Template for TCT

Please use this template to design the test cases. Fill the fields accordingly.

  • Date:

  • Name:

  • Subject ID:

Table 14 Test case template
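Table 14 itself is not reproduced in this excerpt. For illustration only, a typical TCT test-case template of the kind the appendix describes might be sketched as a record like the following; the field names and the example values are assumptions, not the paper's actual template:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """Hypothetical test-case record for documented (TCT) testing.
    Fields are a common textbook layout, assumed for illustration."""
    case_id: str
    feature: str                                  # feature under test
    preconditions: str                            # state required before execution
    steps: list = field(default_factory=list)     # ordered manual steps
    expected_result: str = ""                     # pass/fail oracle
    actual_result: str = ""                       # filled in during the session
    verdict: str = "not run"                      # "pass", "fail", or "not run"

# Example entry a subject might write before the testing session.
tc = TestCase(
    case_id="TC-01",
    feature="Find and replace",
    preconditions="jEdit is open with a non-empty text buffer",
    steps=["Open the Search dialog",
           "Enter a search term and a replacement",
           "Click Replace All"],
    expected_result="All occurrences are replaced and a count is reported",
)
```

The point of such a record is that the expected result is fixed before execution, which is precisely what distinguishes TCT from the exploratory charter in Appendix C.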

Appendix B: Defect Report

Please report your found defects in this document. Once you are done, please return the document to the instructor.

  • Name:

  • Subject ID:

Table 15 Defect types
Table 16 Defects in the experimental object

Appendix C: ET – Test Session Charter

  • Description: In this test session your task is to perform functional testing of the jEdit application feature set from the viewpoint of a typical user. Your goal is to analyse the system’s suitability for its intended use from the viewpoint of a typical text editor user. Take into account the needs both of an occasional user who is not familiar with all of jEdit’s features and of an advanced user.

  • What – Tested areas: Try to cover all the features listed below in your testing. Focus on the first-priority functions, but make sure that you also cover the second-priority functions at some level during the fixed-length session.

    • First priority functions (refer to Section 3.5).

    • Second priority functions (refer to Section 3.5).

  • Why – Goal: Your goal is to reveal as many defects in the system as possible. Describe the defects you find briefly; detailed analysis of the found defects is left out of this test session.

  • How – Approach: The focus is on testing the functionality. Try to test exceptional cases, valid as well as invalid inputs, typical error situations, and things that the user could do wrong. Use manual testing, and try to form equivalence classes and test their boundaries. Also try to test relevant combinations of the features.

  • Focus – What problem to look for: Pay attention to the following issues:

    • Does the function work as described in the user manual?

    • Does the function do things that it should not?

    • From the viewpoint of a typical user, does the function work as the user would expect?

    • What interactions does the function have, or might it have, with other functions? Do these interactions work correctly, as a user would expect?

  • Exploratory log: Write your log in a separate document.
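The “How” guidance in the charter above — forming equivalence classes and testing their boundaries — can be sketched as follows. The integer range and the “tab size” setting are hypothetical examples chosen for illustration; they are not part of the jEdit charter:

```python
def derive_boundary_tests(low, high):
    """Classic boundary-value inputs around a valid integer range
    [low, high]: one value just outside each boundary, the boundaries
    themselves, and a representative of the valid equivalence class."""
    return {
        "invalid_below": low - 1,            # invalid class: too small
        "lower_boundary": low,               # smallest valid value
        "valid_typical": (low + high) // 2,  # representative valid value
        "upper_boundary": high,              # largest valid value
        "invalid_above": high + 1,           # invalid class: too large
    }

# e.g. a hypothetical setting that accepts integers from 1 to 100
cases = derive_boundary_tests(1, 100)
```

Exercising one value per equivalence class plus the values at and just beyond each boundary is a compact way to satisfy the charter’s instruction to cover valid inputs, invalid inputs, and exceptional cases.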


About this article


Cite this article

Afzal, W., Ghazi, A.N., Itkonen, J. et al. An experiment on the effectiveness and efficiency of exploratory testing. Empir Software Eng 20, 844–878 (2015).
