Abstract
A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers’ perceptions about them. A factor influencing people’s perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience. To do this, we conduct an empirical study plus a replication. During the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique. At the end of the experiment, they take a survey to find out which technique they perceive to be most effective. The results show that participants’ perceptions are wrong and that this mismatch is costly in terms of quality. In order to gain further insight into the results, we replicate the controlled experiment and extend the survey to include questions about participants’ opinions on the techniques and programs. The results of the replicated study confirm the findings of the original study and suggest that participants’ perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques.
Notes
This has been done for learning purposes, as we have noticed that students sometimes do not report failures that are exercised by test cases. Since this is a learning goal of the course and not relevant for the study, we measure it separately and do not use it here.
Note that it is not possible to take measurements on the failures reported by participants, as they do not run their own test cases, but rather the ones we have given them.
During the training, definitions of the terms error, fault and failure are introduced. Additionally, we explain to participants that the generic term defect is used to refer to both faults and failures without distinction.
Available at http://code.google.com/p/prest/
One of the versions in one of the programs contains only six faults. Due to a mistake we made, one of the failures was concealed by another.
They have participated in development projects in teams (as in the Artificial Intelligence, Compiler and Operating Systems courses).
All analyses are performed using IBM SPSS v26.
For this reason, we need to check both.
Retrieved from: http://afhayes.com/spss-sas-and-mplus-macros-and-code.html
Meaning they were not giving consent to participate in the study.
In a uniform distribution, 33.3% of participants should choose each technique.
Note that the fact that all three techniques are classed as the most effective the same number of times is not incompatible with there being techniques that are more effective than others.
Except in the case of Group 2, where there is agreement for the EP-BT techniques. Since this is the only agreement found, we think it could be spurious.
Note that the mismatch cost is 0 when there is a match.
Note that the median here is not very informative. In this particular case it is 0pp. This happens when there are more matches than mismatches.
Acknowledgments
This research was funded by Spanish Ministry of Science, Innovation and Universities research grant PGC2018-097265-B-I00, the Regional Government of Madrid, under the FORTE-CM project (S2018/TCS-4314) and the Spanish Ministry of Economy and Business, under the MADRID project (TIN2017-88557-R).
Communicated by: Per Runeson
Appendices
Appendix A: Program Metrics
Table 39 shows the metrics collected for each program with the PREST tool. Note that all three programs show similar results for all metrics, except that ntree shows higher Halstead metrics. The size and complexity of cmdline are slightly higher than those of the other two programs.
Appendix B: Analysis of the Original Experiment
Figure 3 shows the boxplot, and Table 40 shows the descriptive statistics for observed technique effectiveness.
We find that the mean and median effectiveness of BT is highest, followed by EP and then by CR. Additionally, EP has a lower variance. The 95% confidence interval suggests that EP is more effective than CR and that BT is as effective as EP and CR. This is an interesting result, as all faults injected in the code could be detected by all techniques. Additionally, it could indicate that code reading depends more on experience, which cannot be acquired in a 4-hour training session.
All three techniques have outliers corresponding to participants who performed exceptionally badly: 3 in the case of BT (all 3 participants scored 0), 2 in the case of EP and 1 in the case of CR (also scoring 0). Additionally, EP has 3 outliers corresponding to participants who performed exceptionally well (all scoring 100). In no case do these values belong to the same participant, which suggests that the outliers correspond to participants who performed exceptionally badly with one technique, but not with all three. Additionally, CR shows a higher variability, which could indicate that it is more dependent on the person applying the technique.
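The outliers discussed above are the points a boxplot flags beyond the Tukey fences, i.e. more than 1.5 × IQR outside the quartiles. A minimal sketch of that rule (the scores below are invented for illustration, not the experiment's data):

```python
def boxplot_outliers(scores):
    """Return values outside the Tukey fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    s = sorted(scores)

    def quantile(q):  # linear interpolation between order statistics
        pos = q * (len(s) - 1)
        lo, frac = int(pos), pos - int(pos)
        return s[lo] + frac * (s[min(lo + 1, len(s) - 1)] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [x for x in scores if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Hypothetical effectiveness scores (%); the three zeros get flagged
scores = [0, 0, 0, 55, 60, 62, 65, 66, 70, 72, 75, 78, 80]
print(boxplot_outliers(scores))  # -> [0, 0, 0]
```

Statistical packages differ slightly in how they interpolate quantiles, so fence positions (and hence which borderline points count as outliers) can vary marginally between tools.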
The experimental data have been analysed with a linear mixed-effects model (SPSS v26 MIXED procedure). Group, program, technique and the program by technique interaction are fixed effects, and subject is a random effect. Eleven models were tried, and the one with the lowest AIC was chosen:
One pure random effects model and no repeated measures.
Five models specifying program as repeated measures effect.
Five models specifying technique as repeated measures effect.
The five models differed in the covariance structures used (identity, diagonal, first-order autoregressive, compound symmetry and unstructured).
In this first analysis, the MIXED procedure did not achieve convergence. We then decided to relax the model by removing subject from the random effects list. We re-ran the analyses, but this time we found severe departures from normality, leading to unreliable results. The next step consisted of a data transformation. The chosen transformation, square, solved the normality issues, and the MIXED procedure converged.
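The AIC-based selection above can be illustrated with a toy sketch. The snippet below compares cell-means candidate models (a deliberate simplification of the mixed models fitted in SPSS: no random effects or covariance structures) using AIC = n·ln(RSS/n) + 2k; the data and factor labels are invented:

```python
import math

def aic(y, labels):
    """AIC of a cell-means model (one mean per label): n*ln(RSS/n) + 2k."""
    cells = {}
    for v, g in zip(y, labels):
        cells.setdefault(g, []).append(v)
    rss = sum((v - sum(vs) / len(vs)) ** 2 for vs in cells.values() for v in vs)
    return len(y) * math.log(rss / len(y)) + 2 * len(cells)

# Made-up effectiveness scores for 12 subjects (not the experiment's data):
# technique explains the scores, program does not
y         = [60, 62, 80, 82, 40, 41, 62, 60, 82, 80, 41, 40]
technique = ["CR", "CR", "EP", "EP", "BT", "BT"] * 2
program   = ["run1"] * 6 + ["run2"] * 6

candidates = {
    "null": ["all"] * 12,
    "technique": technique,
    "program": program,
    "technique x program": [t + "/" + p for t, p in zip(technique, program)],
}
best = min(candidates, key=lambda name: aic(y, candidates[name]))
print(best)  # -> technique
```

The interaction model fits the toy data equally well but pays a higher parameter penalty, so the simpler technique-only model wins, which is the trade-off AIC formalizes.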
Table 41 shows the results for the model chosen (program is a repeated measures effect, and the covariance structure is Diagonal). Figure 4 shows residuals normality.
Table 41 shows that technique and program are both statistically significant. Group and the technique by program interaction are not significant. The Bonferroni multiple comparisons tests show that:
All three techniques show different effectiveness (EP > BT > CR).
ntree shows a higher defect detection rate than the other two programs (ntree > (cmdline = nametbl)).
Tables 42 and 43 show the estimated marginal means for both technique and program. Note that the mean, std. error and 95% confidence interval bounds have been back-transformed to the original scale.
Finally, Fig. 5 shows the profile plot with error bars for the program by technique interaction. Although the interaction is not statistically significant, the profile plot suggests that nametbl could be behaving differently for CR. This could mean that the lack of significance of the interaction in the analysis is due to the sample size.
Appendix C: Analysis of the Replicated Experiment
Figure 6 shows the boxplot for observed technique effectiveness. All three techniques show a similar median, and the same range. Compared to the original study, the behaviour of the techniques is much more homogeneous.
Table 44 shows the descriptive statistics for observed technique effectiveness. We find that the effectiveness of all three techniques is similar. CR has a similar effectiveness as in the original study (suggesting again that perhaps 4 hours of training are not enough to learn the code review technique). However, the mean effectiveness of BT and EP has dropped from 66% and 82%, respectively, in the original study to 46%. Note that the nature of the faults changed in this study. While in the original study all defects could be detected by all techniques, in this study some defects cannot be exercised by the testing techniques. Therefore, a possible explanation for the change in the effectiveness of the testing techniques could be the faults seeded in the programs. However, this is a hypothesis that needs to be tested.
Figure 7 shows the boxplot for the percentage of defects found in each program. It is interesting to see that the median detection rate for nametbl and ntree is higher than for cmdline. These results could be attributed to cmdline having a slightly higher size and complexity. Additionally, cmdline shows outliers: 4 people who performed exceptionally well and 2 who performed exceptionally badly (scoring 0). Finally, ntree shows a higher range than the other two programs (note that ntree shows higher Halstead metrics). This is an unexpected result, as nametbl and ntree are very similar in terms of complexity.
Table 45 shows the descriptive statistics for defect detection rate in programs. It is higher for nametbl and ntree than for cmdline. This result could be due to cmdline having a slightly higher complexity than the other two programs.
We started the analysis of the replicated study data with the same model as in the original study. This time we had neither convergence nor normality issues. The model chosen was: group, program, technique and the program by technique interaction are fixed effects and subject is a random effect. Program is a repeated measures effect, and the covariance structure is Diagonal. Table 46 shows the results of the analysis of the best model. Figure 8 shows residuals normality.
Table 46 shows that program is statistically significant. Group, technique and the technique by program interaction are not significant. The Bonferroni multiple comparisons tests show that cmdline shows a lower defect detection rate than the other two programs (cmdline < (nametbl = ntree)).
Tables 47 and 48 show the estimated marginal means for both technique and program.
Finally, Fig. 9 shows the profile plot with error bars for the program by technique interaction. In this case, the profile plot suggests that there is no interaction, which means that the non-significance of the interaction in the analysis is unlikely to be due to sample size.
Appendix D: Joint Analyses
D.1 RQ1.1: Participants’ Perceptions
Table 49 shows the percentage of participants that perceive each technique to be the most effective. We cannot reject the null hypothesis that the frequency distribution of the responses to the questionnaire item (Using which technique did you detect most defects?) follows a uniform distribution (χ2(2,N = 60)= 1.900, p = 0.387). This means that the number of participants perceiving a particular technique as the most effective cannot be considered different across the three techniques. Our data do not support the conclusion that some techniques are more frequently perceived as the most effective than others.
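The uniformity check is a plain χ² goodness-of-fit test, and for df = 2 the p-value has the closed form exp(−χ²/2). A sketch with hypothetical counts (not the study's actual responses, which are in Table 49) chosen so the statistic lands on the reported value:

```python
import math

def chisq_uniform(counts):
    """Chi-square goodness-of-fit statistic against a uniform distribution."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)

# Hypothetical split of the 60 answers over the three techniques (expected: 20 each)
counts = [25, 18, 17]
stat = chisq_uniform(counts)
p = math.exp(-stat / 2)  # survival function of the chi-square with df = 2
print(round(stat, 3), round(p, 3))  # -> 1.9 0.387
```

With p well above 0.05, a uniform spread of perceptions cannot be rejected, matching the result reported above.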
D.2 RQ1.2: Comparing Perceptions with Reality
Table 50 shows the value of kappa and its 95% CI, overall and for each technique separately. We find that all values for kappa with respect to the questionnaire item (Using which technique did you detect most defects?) are consistent with lack of agreement, except for CR (κ < 0.4, poor). This means that our data do not support the conclusion that participants correctly perceive the most effective technique for them.
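Cohen's kappa corrects the raw proportion of matches for the agreement expected by chance: κ = (p_o − p_e)/(1 − p_e). A minimal sketch over a made-up perception-vs-reality confusion matrix (not the study's data; a κ near 0 signals chance-level agreement):

```python
def cohens_kappa(matrix):
    """Kappa from a square confusion matrix: rows = perceived best, cols = observed best."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    p_obs = sum(matrix[i][i] for i in range(k)) / n          # diagonal = matches
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts: perceived (rows) vs observed (cols) most effective technique
m = [[8, 6, 6],   # CR
     [5, 7, 8],   # EP
     [7, 6, 7]]   # BT
print(round(cohens_kappa(m), 2))  # -> 0.05
```

A value this close to 0 falls in the "poor" band (κ < 0.4) of the Landis-Koch style scales used in the paper, i.e. perceptions no better than chance.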
As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ2(2,N = 60)= 2.423, p = 0.298). Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ2(3,N = 60)= 2.552, p = 0.466). This means that we cannot conclude that there is directionality when participants’ perceptions are wrong. These two results suggest that participants are not more mistaken about one technique than about the others; techniques are not differently subject to misperceptions.
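The McNemar-Bowker test checks the symmetry of the off-diagonal cells of the confusion matrix: χ² = Σ_{i&lt;j} (n_ij − n_ji)²/(n_ij + n_ji), with k(k−1)/2 degrees of freedom. A sketch, reusing a made-up matrix (not the study's data):

```python
def bowker(matrix):
    """McNemar-Bowker symmetry statistic and its degrees of freedom."""
    k = len(matrix)
    stat = sum(
        (matrix[i][j] - matrix[j][i]) ** 2 / (matrix[i][j] + matrix[j][i])
        for i in range(k) for j in range(i + 1, k)
        if matrix[i][j] + matrix[j][i] > 0)   # skip empty cell pairs
    return stat, k * (k - 1) // 2

# Hypothetical perceived-vs-observed counts for the three techniques
m = [[8, 6, 6],
     [5, 7, 8],
     [7, 6, 7]]
stat, df = bowker(m)
print(round(stat, 3), df)  # -> 0.454 3
```

A small statistic on 3 degrees of freedom is far from significance, i.e. no evidence that mistaken perceptions lean systematically toward one technique.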
D.3 RQ1.3: Comparing the Effectiveness of Techniques
We check whether misperceptions could be due to participants detecting the same number of defects with all three techniques, which would make it impossible for them to make the right decision. Table 51 shows the value and 95% CI of Krippendorff’s α, overall and for each pair of techniques, for all participants and for every design group (participants who applied the same technique on the same program) separately, and Table 52 shows the value and 95% CI of Krippendorff’s α overall and for each program/session. For the values computed over all participants, we can rule out agreement (α < 0.4), except for EP-BT and nametbl-ntree, for which the upper bounds of the 95% CIs are consistent with fair to good agreement. However, even in these two cases, 0 belongs to the 95% CI, meaning that agreement by chance cannot be ruled out. This means that participants do not obtain effectiveness values so similar across techniques (programs) that it would be difficult to discriminate among them. As regards the results for groups, the 95% CIs are too wide to yield reliable results.
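Krippendorff's α compares the observed within-unit disagreement with the disagreement expected from the pooled values: α = 1 − D_o/D_e. A compact sketch for nominal data with no missing values (the ratings below are invented; in the study's setting a "unit" would be a participant and the values the effectiveness outcomes being compared):

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the values observed for one unit."""
    values = [v for unit in units for v in unit]
    n = len(values)
    # Observed disagreement: ordered disagreeing pairs within each unit, / (m_u - 1)
    d_obs = sum(
        sum(1 for i in range(len(u)) for j in range(len(u))
            if i != j and u[i] != u[j]) / (len(u) - 1)
        for u in units if len(u) > 1) / n
    # Expected disagreement: disagreeing ordered pairs over the pooled values
    counts = Counter(values)
    d_exp = sum(counts[a] * counts[b]
                for a in counts for b in counts if a != b) / (n * (n - 1))
    return 1 - d_obs / d_exp

# Hypothetical per-participant labels (e.g. "high"/"low" effectiveness per technique)
units = [["high", "high"], ["low", "low"], ["high", "low"], ["low", "high"]]
print(round(krippendorff_alpha_nominal(units), 3))  # -> 0.125
```

Values near 0 mean agreement no better than chance, and α ≥ 0.8 is the conventional threshold for reliable agreement; the paper's interval-scale effectiveness data would use a different distance function, so this nominal version is only a structural sketch.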
D.4 RQ1.4: Cost of Mismatch
Table 53 and Fig. 10 show the cost of mismatch. We can see that the CR technique has fewer mismatches compared to the other two. Although the BT and EP techniques have the same number of mismatches, BT shows a higher dispersion. The results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of techniques having the same mismatch cost (H(2)= 0.034, p = 0.983). This means that we cannot claim a difference in mismatch cost between the techniques. The estimated mean mismatch cost is 27pp (median 17pp).
These results suggest that the mismatch cost is not negligible (27pp), and is not related to the technique perceived as most effective.
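The Kruskal-Wallis test ranks all mismatch costs jointly and compares mean ranks across techniques: H = 12/(N(N+1)) · Σ R_g²/n_g − 3(N+1). A sketch without tie correction (the costs below are invented; statistical packages additionally correct H when ties are present):

```python
def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic over a list of samples (no tie correction)."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    h = 12 / (n * (n + 1)) * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
    return h - 3 * (n + 1)

# Invented mismatch costs (pp) for the three techniques
cr, ep, bt = [5, 10, 15], [8, 12, 20], [7, 11, 18]
print(round(kruskal_wallis([cr, ep, bt]), 3))  # -> 0.8
```

A small H relative to the χ² threshold for 2 degrees of freedom (5.99 at the 0.05 level) means no detectable difference between groups, which is the situation reported for the mismatch costs above.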
D.5 RQ1.5: Expected Loss of Effectiveness
Table 54 shows the average loss of effectiveness that should be expected in a project. Again, the results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of techniques having the same expected reduction in technique effectiveness for a project (H(2)= 5.680, p = 0.058). This means we cannot claim a difference in project effectiveness loss between techniques. The mean expected loss in effectiveness in the project is estimated as 13pp.
These results suggest that the expected loss in effectiveness in a project is not negligible (13pp) and is not related to the technique perceived as most effective.
Vegas, S., Riofrío, P., Marcos, E. et al. On (Mis)perceptions of testing effectiveness: an empirical study. Empir Software Eng 25, 2844–2896 (2020). https://doi.org/10.1007/s10664-020-09805-y