
An experiment on the effectiveness and efficiency of exploratory testing


Abstract

The exploratory testing (ET) approach is commonly applied in industry, but it lacks scientific research. The scientific community needs quantitative results on the performance of ET taken from realistic experimental settings. The objective of this paper is to quantify the effectiveness and efficiency of ET vs. testing with documented test cases (test case based testing, TCT). We performed four controlled experiments in which a total of 24 practitioners and 46 students performed manual functional testing using ET and TCT. We measured the number of defects identified in the 90-minute testing sessions, the detection difficulty, severity and types of the detected defects, and the number of false defect reports. The results show that ET found a significantly greater number of defects. ET also found significantly more defects across detection difficulty levels, defect types and severity levels. However, the two testing approaches did not differ significantly in the number of false defect reports submitted. We conclude that ET was more efficient than TCT in our experiment. ET was also more effective than TCT when detection difficulty, defect type and severity level are considered. The two approaches are comparable with respect to the number of false defect reports submitted.

Notes

  1. Obviously it will help a tester if such knowledge exists (to find expected risks).

  2. For recent reviews on software testing techniques, see Jia and Harman (2011); Ali et al. (2010); da Mota Silveira Neto et al. (2011); Nie and Leung (2011); Dias Neto et al. (2007).

  3. The 90-minute session length was chosen as suggested by Bach (2000), but it is not a strict requirement (we were also constrained by the limited time our industrial and academic subjects had available for the experiments).

  4. The C++ source code files were given to the subjects as examples of code formatting and indentation, to guide them in detecting formatting and indentation defects.

  5. jEdit version 4.2

  6. The exploratory charter provided the subjects with high-level test guidelines.

  7. Cohen’s d expresses the mean difference between the two groups in standard deviation units. The values of d can be interpreted differently for different research questions; we have followed the standard interpretation offered by Cohen (1988), where 0.8, 0.5 and 0.2 indicate large, moderate and small practical significance, respectively. (A computational sketch of the effect size measures used in these notes follows the notes.)

  8. The median is a closer indication of the true average than the mean in the presence of extreme values.

  9. The FTFI number is somewhat ambiguously named in the original article, since the metric is not about fault interactions but about the interactions of inputs or conditions that trigger the failure.

  10. η² is a commonly used effect size measure in analysis of variance; it estimates the degree of association for the sample. We have followed Cohen’s (1988) interpretation of η², where 0.0099 constitutes a small effect, 0.0588 a medium effect and 0.1379 a large effect.

  11. The term mean rank is used in the Tukey-Kramer test for multiple comparisons. The test ranks the set of means in ascending order to reduce the number of comparisons to be tested; e.g., given the ranking of means W > X > Y > Z, if there is no difference between the two means with the largest difference (W and Z), comparing means with smaller differences is of no use, as it would lead to the same conclusion.

  12. Vargha and Delaney suggest that Â12 statistic values of 0.56, 0.64 and 0.71 represent small, medium and large effect sizes, respectively (Vargha and Delaney 2000).

  13. http://www.jedit.org/
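
The following is a minimal computational sketch of the three effect size measures referred to in notes 7, 10 and 12: Cohen’s d, η² for a one-way analysis of variance, and the Vargha-Delaney Â12 statistic. It is not taken from the paper’s analysis scripts; the function names and the example data are illustrative only.

```python
# Illustrative effect size calculations (standard formulas, not the paper's scripts).
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Mean difference between two groups in pooled standard deviation units."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def eta_squared(groups):
    """Between-group sum of squares divided by total sum of squares (one-way ANOVA)."""
    all_obs = [x for g in groups for x in g]
    grand_mean = mean(all_obs)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_obs)
    return ss_between / ss_total

def a12(a, b):
    """Probability that a random observation from a exceeds one from b (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))

# Made-up defect counts for two hypothetical groups, for illustration only.
et = [12, 9, 15, 11, 14]
tct = [7, 8, 6, 10, 9]
print(cohens_d(et, tct), eta_squared([et, tct]), a12(et, tct))
```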

References

  • Abran A, Bourque P, Dupuis R, Moore JW, Tripp LL (eds) (2004) Guide to the software engineering body of knowledge – SWEBOK. IEEE Press, Piscataway

  • Agruss C, Johnson B (2000) Ad hoc software testing: a perspective on exploration and improvisation. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.2070

  • Ahonen J, Junttila T, Sakkinen M (2004) Impacts of the organizational model on testing: three industrial cases. Empir Softw Eng 9(4):275–296

  • Ali S, Briand L, Hemmati H, Panesar-Walawege R (2010) A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans Softw Eng 36(6):742–762

  • Andersson C, Runeson P (2002) Verification and validation in industry – a qualitative survey on the state of practice. In: Proceedings of the 2002 international symposium on empirical software engineering (ISESE’02). IEEE Computer Society, Washington, DC

  • Arisholm E, Gallis H, Dybå T, Sjøberg DIK (2007) Evaluating pair programming with respect to system complexity and programmer expertise. IEEE Trans Softw Eng 33:65–86

  • Bach J (2000) Session-based test management. Software Testing and Quality Engineering Magazine, vol 2, no 6

  • Bach J (2003) Exploratory testing explained. http://www.satisfice.com/articles/et-article.pdf

  • Basili V, Shull F, Lanubile F (1999) Building knowledge through families of experiments. IEEE Trans Softw Eng 25(4):456–473

  • Beer A, Ramler R (2008) The role of experience in software testing practice. In: Proceedings of euromicro conference on software engineering and advanced applications

  • Berner S, Weber R, Keller RK (2005) Observations and lessons learned from automated testing. In: Proceedings of the 27th international conference on software engineering (ICSE’05). ACM, New York

  • Bertolino A (2007) Software testing research: achievements, challenges, dreams. In: Proceedings of the 2007 international conference on future of software engineering (FOSE’07)

  • Bertolino A (2008) Software testing forever: old and new processes and techniques for validating today’s applications. In: Jedlitschka A, Salo O (eds) Product-focused software process improvement, lecture notes in computer science, vol 5089. Springer, Berlin Heidelberg

  • Bhatti K, Ghazi AN (2010) Effectiveness of exploratory testing: an empirical scrutiny of the challenges and factors affecting the defect detection efficiency. Master’s thesis, Blekinge Institute of Technology

  • Briand LC (2007) A critical analysis of empirical research in software testing. In: Proceedings of the 1st international symposium on empirical software engineering and measurement (ESEM’07). IEEE Computer Society, Washington, DC

  • Brooks A, Roper M, Wood M, Daly J, Miller J (2008) Replication’s role in software engineering. In: Shull F, Singer J, Sjøberg DI (eds) Guide to advanced empirical software engineering. Springer, London, pp 365–379 doi:10.1007/978-1-84800-044-5_14

  • Chillarege R, Bhandari I, Chaar J, Halliday M, Moebus D, Ray B, Wong MY (1992) Orthogonal defect classification–a concept for in-process measurements. IEEE Trans Softw Eng 18(11):943–956

  • Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum

  • da Mota Silveira Neto PA, do Carmo Machado I, McGregor JD, de Almeida ES, de Lemos Meira SR (2011) A systematic mapping study of software product lines testing. Inf Softw Technol 53(5):407–423

  • Dias Neto AC, Subramanyan R, Vieira M, Travassos GH (2007) A survey on model-based testing approaches: a systematic review. In: Proceedings of the 1st ACM international workshop on empirical assessment of software engineering languages and technologies (WEASELTech’07): held in conjunction with the 22nd IEEE/ACM international conference on automated software engineering (ASE) 2007. ACM, New York

  • do Nascimento LHO, Machado PDL (2007) An experimental evaluation of approaches to feature testing in the mobile phone applications domain. In: Workshop on domain specific approaches to software test automation (DOSTA’07): in conjunction with the 6th ESEC/FSE joint meeting. ACM, New York

  • Dustin E, Rashka J, Paul J (1999) Automated software testing: introduction, management, and performance. Addison-Wesley Professional

  • Houdek F, Ernst D, Schwinn T (2002) Defect detection for executable specifications – an experiment. Int J Softw Eng Knowl Eng 12(6):637–655

  • Galletta DF, Abraham D, El Louadi M, Lekse W, Pollalis YA, Sampler JL (1993) An empirical study of spreadsheet error-finding performance. Account Manag Inf Technol 3(2):79–95

  • Goodenough JB, Gerhart SL (1975) Toward a theory of test data selection. SIGPLAN Notes 10(6):493–510

  • Graves TL, Harrold MJ, Kim JM, Porter A, Rothermel G (2001) An empirical study of regression test selection techniques. ACM Trans Softw Eng Methodol 10:184–208

  • Grechanik M, Xie Q, Fu C (2009) Maintaining and evolving GUI-directed test scripts. In: Proceedings of the 31st international conference on software engineering (ICSE’09). IEEE Computer Society, Washington, DC, pp 408–418

  • Hartman A (2002) Is issta research relevant to industry? SIGSOFT Softw Eng Notes 27(4):205–206

  • Höst M, Wohlin C, Thélin T (2005) Experimental context classification: incentives and experience of subjects. In: Proceedings of the 27th international conference on software engineering (ICSE’05)

  • Hutchins M, Foster H, Goradia T, Ostrand T (1994) Experiments of the effectiveness of data flow and control flow based test adequacy criteria. In: Proceedings of the 16th international conference on software engineering (ICSE’94). IEEE Computer Society Press, Los Alamitos, pp 191–200

  • IEEE 1044-2009 (2010) IEEE standard classification for software anomalies

  • Itkonen J (2008) Do test cases really matter? An experiment comparing test case based and exploratory testing. Licentiate Thesis, Helsinki University of Technology

  • Itkonen J, Rautiainen K (2005) Exploratory testing: a multiple case study. In: 2005 international symposium on empirical software engineering (ISESE’05), pp 84–93

  • Itkonen J, Mäntylä M, Lassenius C (2007) Defect detection efficiency: test case based vs. exploratory testing. In: 1st international symposium on empirical software engineering and measurement (ESEM’07), pp 61–70

  • Itkonen J, Mäntylä MV, Lassenius C (2009) How do testers do it? An exploratory study on manual testing practices. In: 3rd international symposium on empirical software engineering and measurement (ESEM’09), pp 494–497

  • Itkonen J, Mäntylä M, Lassenius C (2013) The role of the tester’s knowledge in exploratory software testing. IEEE Trans Softw Eng 39(5):707–724

  • Jia Y, Harman M (2011) An analysis and survey of the development of mutation testing. IEEE Trans Softw Eng 37(5):649–678

  • Juristo N, Moreno AM (2001) Basics of software engineering experimentation. Kluwer, Boston

  • Juristo N, Moreno A, Vegas S (2004) Reviewing 25 years of testing technique experiments. Empir Softw Eng 9(1):7–44

  • Kamsties E, Lott CM (1995) An empirical evaluation of three defect detection techniques. In: Proceedings of the 5th European software engineering conference (ESEC’95). Springer, London, pp 362–383

  • Kaner C, Bach J, Pettichord B (2008) Lessons learned in software testing, 1st edn. Wiley-India

  • Kettunen V, Kasurinen J, Taipale O, Smolander K (2010) A study on agility and testing processes in software organizations. In: Proceedings of the international symposium on software testing and analysis

  • Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28:721–734

  • Kuhn D, Wallace D, Gallo A (2004) Software fault interactions and implications for software testing. IEEE Trans Softw Eng 30(6):418–421

  • Lung J, Aranda J, Easterbrook S, Wilson G (2008) On the difficulty of replicating human subjects studies in software engineering. In: ACM/IEEE 30th international conference on software engineering (ICSE’08)

  • Lyndsay J, van Eeden N (2003) Adventures in session-based testing. www.workroom-productions.com/papers/AiSBTv1.2.pdf

  • Myers GJ, Sandler C, Badgett T (1979) The art of software testing. Wiley, New York

  • Naseer A, Zulfiqar M (2010) Investigating exploratory testing in industrial practice. Master’s thesis, Blekinge Institute of Technology

  • Nie C, Leung H (2011) A survey of combinatorial testing. ACM Comput Surv 43(2):1–29

  • Poon P, Tse TH, Tang S, Kuo F (2011) Contributions of tester experience and a checklist guideline to the identification of categories and choices for software testing. Softw Qual J 19(1):141–163

  • Ryber T (2007) Essential software test design. Unique Publishing Ltd.

  • Sjøberg D, Hannay J, Hansen O, Kampenes V, Karahasanovic A, Liborg NK, Rekdal A (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753

  • Svahnberg M, Aurum A, Wohlin C (2008) Using students as subjects – An empirical evaluation. In: Proceedings of the 2nd ACM-IEEE international symposium on empirical software engineering and measurement (ESEM’08). ACM, New York

  • Taipale O, Kalviainen H, Smolander K (2006) Factors affecting software testing time schedule. In: Proceedings of the Australian software engineering conference (ASE’06). IEEE Computer Society, Washington, DC, pp 283–291

  • van Veenendaal E, Bach J, Basili V, Black R, Comey C, Dekkers T, Evans I, Gerard P, Gilb T, Hatton L, Hayman D, Hendriks R, Koomen T, Meyerhoff D, Pol M, Reid S, Schaefer H, Schotanus C, Seubers J, Shull F, Swinkels R, Teunissen R, van Vonderen R, Watkins J, van der Zwan M (2002) The testing practitioner. UTN Publishers

  • Våga J, Amland S (2002) Managing high-speed web testing. Springer, New York, pp 23–30

  • Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132

  • Weyuker EJ (1993) More experience with data flow testing. IEEE Trans Softw Eng 19:912–919

  • Whittaker JA (2010) Exploratory software testing. Addison-Wesley

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer, Norwell

  • Wood M, Roper M, Brooks A, Miller J (1997) Comparing and combining software defect detection techniques: a replicated empirical study. In: Proceedings of the 6th European software engineering conference (ESEC’97) held jointly with the 5th ACM SIGSOFT international symposium on foundations of software engineering (FSE’97). Springer New York, pp 262–277

  • Yamaura T (2002) How to design practical test cases. IEEE Softw 15(6):30–36

  • Yang B, Hu H, Jia L (2008) A study of uncertainty in software cost and its impact on optimal software release time. IEEE Trans Softw Eng 34(6):813–825

Author information

Corresponding author

Correspondence to Wasif Afzal.

Additional information

Communicated by: José Carlos Maldonado

Appendices

Appendix A: Test Case Template for TCT

Please use this template to design the test cases. Fill in the fields accordingly.

  • Date:

  • Name:

  • Subject ID:

Table 14 Test case template

Appendix B: Defect Report

Please report the defects you found in this document. Once you are done, return the document to the instructor.

  • Name:

  • Subject ID:

Table 15 Defect types
Table 16 Defects in the experimental object

Appendix C: ET – Test Session Charter

  • Description: In this test session your task is to do functional testing of the jEdit application feature set from the viewpoint of a typical user. Your goal is to analyse the system’s suitability for its intended use from the viewpoint of a typical text editor user. Take into account the needs of both an occasional user who is not familiar with all the features of jEdit and an advanced user.

  • What – Tested areas: Try to cover all the features listed below in your testing. Focus on the first priority functions, but make sure that you also cover the second priority functions at some level during the fixed-length session.

    • First priority functions (refer to Section 3.5).

    • Second priority functions (refer to Section 3.5).

  • Why – Goal: Your goal is to reveal as many defects in the system as possible. Describe the found defects briefly; detailed analysis of the defects is left out of this test session.

  • How – Approach: The focus is on testing the functionality. Try to test exceptional cases, valid as well as invalid inputs, typical error situations, and things that the user could do wrong. Use manual testing, and try to form equivalence classes and test boundaries (see the illustrative sketch at the end of this appendix). Try also to test relevant combinations of the features.

  • Focus – What problem to look for: Pay attention to the following issues:

    • Does the function work as described in the user manual?

    • Does the function do things that it should not?

    • From the viewpoint of a typical user, does the function work as the user would expect?

    • What interactions does the function have, or might it have, with other functions? Do these interactions work correctly, as a user would expect?

  • Exploratory log: Write your log in a separate document.
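
As an illustration of the equivalence class and boundary testing asked for in the “How – Approach” item above, the short sketch below derives candidate inputs for a numeric settings field. The field and its valid range of 1–50 are assumptions made for illustration only; they are not taken from jEdit’s documentation.

```python
# Illustrative equivalence-class and boundary-value selection for a hypothetical
# numeric input field with an assumed valid range of lo..hi.
def boundary_values(lo, hi):
    """Pick one representative per equivalence class plus the boundary cases."""
    return {
        "below_range":   [lo - 1],                  # invalid class: just below minimum
        "at_boundaries": [lo, lo + 1, hi - 1, hi],  # valid class: edges of the range
        "typical":       [(lo + hi) // 2],          # valid class: a representative middle value
        "above_range":   [hi + 1],                  # invalid class: just above maximum
    }

if __name__ == "__main__":
    for eq_class, values in boundary_values(1, 50).items():
        print(eq_class, values)
```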

Cite this article

Afzal, W., Ghazi, A.N., Itkonen, J. et al. An experiment on the effectiveness and efficiency of exploratory testing. Empir Software Eng 20, 844–878 (2015). https://doi.org/10.1007/s10664-014-9301-4
