
On (Mis)perceptions of testing effectiveness: an empirical study


Abstract

A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers’ perceptions about them. A factor influencing people’s perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience. To do this, we conduct an empirical study plus a replication. During the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique. At the end of the experiment, they take a survey to find out which technique they perceive to be most effective. The results show that participants’ perceptions are wrong and that this mismatch is costly in terms of quality. In order to gain further insight into the results, we replicate the controlled experiment and extend the survey to include questions about participants’ opinions on the techniques and programs. The results of the replicated study confirm the findings of the original study and suggest that participants’ perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques.


Notes

  1. This has been done for learning purposes, as we have noticed that students sometimes do not report failures that are exercised by test cases. Since this is a learning goal of the course and not relevant to the study, we measure it separately and do not use it here.

  2. Note that it is not possible to take measurements on the failures reported by participants, as they do not run their own test cases, but the ones we have given them.

  3. During the training, definitions for the terms error, fault and failure are introduced. Additionally, we explain to participants that the generic term defect is used to refer to both faults and failures indistinctly.

  4. Available at http://code.google.com/p/prest/

  5. One of the versions in one of the programs contains only six faults. Due to a mistake we made, one of the failures was concealed by another.

  6. They have participated in development projects in teams (as in the Artificial Intelligence, Compiler and Operating Systems courses).

  7. All analyses are performed using IBM SPSS v26.

  8. For example, Octaviano et al. (2015) use Landis & Koch, but Massey et al. (2015) use Fleiss et al. as we do.

  9. For this reason, we need to check both.

  10. Retrieved from: http://afhayes.com/spss-sas-and-mplus-macros-and-code.html

  11. Meaning they were not giving consent to participate in the study.

  12. In a uniform distribution, 33.3% of participants should choose each technique.

  13. Note that the fact that all three techniques are classed as the most effective the same number of times is not incompatible with there being techniques that are more effective than others.

  14. Except in the case of Group 2, where there is agreement for the EP-BT techniques. Since this is the only agreement found, we think it could be spurious.

  15. Note that the mismatch cost is 0 when there is a match.

  16. Note that the median here is not very informative. In this particular case it is 0pp. This happens when there are more matches than mismatches.

  17. Meaning they were not giving consent to participate in the study.

References

  • Altman D (1991) Practical statistics for medical research. Chapman and Hall

  • Aurum A, Wohlin C (2002) Applying decision-making models in requirements engineering. In: Proceedings of requirements engineering for software quality

  • Banerjee MV, Capozzoli M, McSweeney L, Sinha D (1999) Beyond kappa: a review of interrater agreement measures. Can J Stat 27:3–23

  • Basili V, Selby R (1987) Comparing the effectiveness of software testing strategies. IEEE Trans Softw Eng 13(12):1278–1296

  • Basili V, Green S, Laitenberger O, Lanubile F, Shull F, Sorumgard S, Zelkowitz M (1996) The empirical investigation of perspective based reading. Empir Softw Eng 1(2):133–164

  • Beizer B (1990) Software testing techniques, 2nd edn. International Thomson Computer Press

  • Bhattacharya P (2012) Quantitative decision-making in software engineering. Ph.D. thesis University of California Riverside

  • Bieman J, Schultz J (1992) An empirical evaluation (and specification) of the all-du-paths testing criterion. Softw Eng J, 43–51

  • Biffl S (2000) Analysis of the impact of reading technique and inspector capability on individual inspection performance. In: 7th Asia-Pacific software engineering conference, pp 136–145

  • Briand L, Penta M, Labiche Y (2004) Assessing and improving state-based class testing: a series of experiments. IEEE Trans Softw Eng 30(11):770–793

  • Capretz L, Varona D, Raza A (2015) Influence of personality types in software tasks choices. Comput Hum Behav 52:373–378

  • Cotroneo D, Pietrantuono R, Russo S (2013) Testing techniques selection based on odc fault types and software metrics. J Syst Softw 86(6):1613–1637

  • Deak A (2012) Understanding socio-technical factors influencing testers in software development organizations. In: 36th Annual computer software and applications conference (COMPSAC’12), pp 438–441

  • Devanbu P, Zimmermann T, Bird C (2016) Belief & evidence in empirical software engineering. In: Proceedings of the 38th international conference on software engineering, pp 108–119

  • Dias-Neto A, Travassos G (2014) Supporting the combined selection of model-based testing techniques. IEEE Trans Softw Eng 40(10):1025–1041

  • Dias-Neto A, Matalonga S, Solari M, Robiolo G, Travassos G (2016) Toward the characterization of software testing practices in South America: looking at Brazil and Uruguay. Softw Qual J, 1–39

  • Dieste O, Aranda A, Uyaguari F, Turhan B, Tosun A, Fucci D, Oivo M, Juristo N (2017) Empirical evaluation of the effects of experience on code quality and programmer productivity: an exploratory study. Empirical Software Engineering. https://doi.org/10.1007/s10664-016-9471-3

  • Dunsmore A, Roper M, Wood M (2002) Further investigations into the development and evaluation of reading techniques for object-oriented code inspection. In: 24th International conference on software engineering, pp 47–57

  • Dybå T, Kitchenham B, Jorgensen M (2005) Evidence-based software engineering for practitioners. IEEE Softw 22(1):58–65

  • Everitt B (2000) The analysis of contingency tables. In: Monographs statistics and applied probability, vol 45. Chapman & Hall/CRC

  • Falessi D, Juristo N, Wohlin C, Turhan B, Münch J, Jedlitschka A, Oivo M (2017) Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering. https://doi.org/10.1007/s10664-017-9523-3

  • Fleiss J, Levin B, Paik M (2003) Statistical methods for rates and proportions, 3rd edn. Wiley

  • Garousi V, Felderer M, Kuhrmann M, Herkiloğlu K (2017) What industry wants from academia in software testing?: hearing practitioners’ opinions. In: Proceedings of the 21st international conference on evaluation and assessment in software engineering, EASE’17, pp 65–69

  • Gonçalves W, de Almeida C, de Araújo LL, Ferraz M, Xandú R, de Farias I (2017) The influence of human factors on the software testing process: the impact of these factors on the software testing process. In: 2017 12th Iberian conference on information systems and technologies (CISTI), pp 1–6

  • Guaiani F, Muccini H (2015) Crowd and laboratory testing, can they co-exist? An exploratory study. In: 2nd International workshop on crowdsourcing in software engineering (CSI-SE), pp 32–37

  • Hayes A, Krippendorff K (2007) Answering the call for a standard reliability measure for coding data. Commun Methods Meas 1:77–89

  • Hutchins M, Foster H, Goradia T, Ostrand T (1994) Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In: Proceedings of the 16th international conference on software engineering, pp 191–200

  • Jedlitschka A, Juristo N, Rombach D (2014) Reporting experiments to satisfy professionals’ information needs. Empir Softw Eng 19(6):1921–1955

  • Kamsties E, Lott C (1995) An empirical evaluation of three defect-detection techniques. In: Proceedings of the Fifth European software engineering conference, pp 84–89

  • Kanij T, Merkel R, Grundy J (2015) An empirical investigation of personality traits of software testers. In: 8th International workshop on cooperative and human aspects of software engineering (CHASE’15), pp 1–7

  • Khan T, Pezeshki V, Clear F, Al-Kaabi A (2010) Diverse virtual social networks: implications for remote software testing teams. In: European, mediterranean & middle eastern conference on information systems

  • Kocaguneli E, Tosun A, Bener A, Turhan B, Caglayan B (2009) Prest: an intelligent software metrics extraction, analysis and defect prediction tool, 637–642

  • Kosti M, Feldt R, Angelis L (2014) Personality, emotional intelligence and work preferences in software engineering: an empirical study. Inf Softw Technol 56(8):973–990

  • Kuehl R (2000) Design of experiments: statistical principles of research design and analysis, 2nd edn. Duxbury Thomson Learning

  • Landis J, Koch G (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174

  • Linger R (1979) Structured programming: theory and practice (the systems programming series). Addison-Wesley

  • Maldonado J, Carver J, Shull F, Fabbri S, Dória E., Martimiano L, Mendonça M, Basili V (2006) Perspective-based reading: a replicated experiment focused on individual reviewer effectiveness. Empir Softw Eng 11(1):119–142

  • Marsden N, Pérez Rentería y Hernández T (2014) Understanding software testers in the automotive industry a mixed-method case study. In: 9th International conference on software engineering and applications (ICSOFT-EA), pp 305–314

  • Massey A, Otto P, Antón A (2015) Evaluating legal implementation readiness decision-making. IEEE Trans Softw Eng 41(6):545–564

  • Myers G (1978) A controlled experiment in program testing and code walkthroughs/inspections. Commun ACM 21(9):760–768

  • Myers G, Badgett T, Sandler C (2004) The art of software testing, 2nd edn. Wiley-Interscience

  • Octaviano F, Felizardo K, Maldonado J, Fabbri S (2015) Semi-automatic selection of primary studies in systematic literature reviews: is it reasonable? Empir Softw Eng 20(6):1898–1917

  • Offutt A, Lee S (1994) An empirical evaluation of weak mutation. IEEE Trans Softw Eng 20(5):337–344

  • Offutt A, Lee A, Rothermel G, Untch R, Zapf C (1996) An experimental determination of sufficient mutant operators. ACM Trans Softw Eng Methodol 5(2):99–118

  • Porter A, Votta L, Basili V (1995) Comparing detection methods for software requirements inspection: a replicated experiment. IEEE Trans Softw Eng 21(6):563–575

  • Roper M, Wood M, Miller J (1997) An empirical evaluation of defect detection techniques. Inf Softw Technol 39:763–775

  • Shull F, Carver J, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empir Softw Eng 13:211–218

  • Thelin T, Runeson P, Wohlin C, Olsson T, Andersson C (2004) Evaluation of usage-based reading—conclusions after three experiments. Empir Softw Eng 9:77–110

  • Vegas S, Basili V (2005) A characterisation schema for software testing techniques. Empir Softw Eng 10(4):437–466

  • Vegas S, Juristo N, Basili V (2009) Maturing software engineering knowledge through classifications: a case study on unit testing techniques. IEEE Trans Softw Eng 35(4):551–565

  • Weyuker E (1984) The complexity of data flow criteria for test data selection. Inf Process Lett 19(2):103–109

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2014) Experimentation in software engineering: an introduction, 2nd edn. Springer

  • Wong E, Mathur A (1995) Fault detection effectiveness of mutation and data-flow testing. Softw Qual J 4:69–83

  • Zapf A, Castell S, Morawietz L, Karch A (2016) Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Med Res Methodol, 16(93)

  • Zelkowitz M, Wallace D, Binkley D (2003) Experimental validation of new software technology. Series Softw Eng Knowl Eng 12:229–263

Acknowledgments

This research was funded by the Spanish Ministry of Science, Innovation and Universities (research grant PGC2018-097265-B-I00), the Regional Government of Madrid under the FORTE-CM project (S2018/TCS-4314), and the Spanish Ministry of Economy and Business under the MADRID project (TIN2017-88557-R).

Author information

Corresponding author

Correspondence to Sira Vegas.

Additional information

Communicated by: Per Runeson


Appendices

Appendix A: Program Metrics

Table 39 shows the metrics collected for each program with the PREST tool. Note that all three programs show similar results for all metrics, except ntree, which shows higher Halstead metrics. The size and complexity of cmdline are slightly higher compared to the other two programs.

Table 39 Metrics obtained with PREST

Appendix B: Analysis of the Original Experiment

Figure 3 shows the boxplot, and Table 40 shows the descriptive statistics for observed technique effectiveness.

Fig. 3
figure 3

Boxplot for observed technique effectiveness in the original study

Table 40 Descriptive statistics for observed technique effectiveness in the original study

We find that the mean and median effectiveness of BT is highest, followed by EP and then by CR. Additionally, EP has a lower variance. The 95% confidence interval suggests that EP is more effective than CR and that BT is as effective as EP and CR. This is an interesting result, as all faults injected in the code could be detected by all techniques. Additionally, it could indicate that code reading depends more on experience, which cannot be acquired in a 4-hour training session.

All three techniques have outliers corresponding to participants who performed exceptionally poorly: 3 in the case of BT (all 3 participants scored 0), 2 in the case of EP and 1 in the case of CR (also scoring 0). Additionally, EP has 3 outliers corresponding to participants who performed exceptionally well (all scoring 100). In no case do these values belong to the same participant, which could suggest that the outliers correspond to participants who performed exceptionally poorly with one technique, but not with all three. Additionally, CR shows higher variability, which could indicate that it is more dependent on the person applying the technique.

The experimental data have been analysed with a linear mixed-effects model (SPSS v26 MIXED procedure). Group, program, technique and the program by technique interaction are fixed effects, and subject is a random effect. Eleven models were fitted, and the one with the lowest AIC was chosen:

  • One pure random effects model and no repeated measures.

  • Five models specifying program as repeated measures effect.

  • Five models specifying technique as repeated measures effect.

The five models in each case differed in the covariance structure used (identity, diagonal, first-order autoregressive, compound symmetry and unstructured).

In this first analysis, the MIXED procedure did not achieve convergence. We then decided to relax the model by removing subject from the random effects list. We re-ran the analyses, but this time we found severe departures from normality, leading to unreliable results. The next step consisted of transforming the data. The chosen transformation, squaring the response, solved the normality issues, and the MIXED procedure converged.
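For illustration, the following is a minimal sketch of an analogous mixed-model fit in Python with statsmodels. It is not the SPSS MIXED setup used in the paper (statsmodels does not offer the same repeated-measures covariance structures), and the file and column names are hypothetical.

```python
# Sketch of an analogous linear mixed-effects analysis (not the paper's SPSS MIXED
# setup); the data file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("experiment_data.csv")  # one row per subject x program/technique combination

# Square the effectiveness scores, mirroring the transformation the paper applies
# to remove the departures from normality.
data["eff_sq"] = data["effectiveness"] ** 2

# Fixed effects: group, program, technique and the program-by-technique interaction;
# random intercept per subject.
model = smf.mixedlm("eff_sq ~ group + program * technique", data, groups="subject")
result = model.fit(reml=False)  # ML fit so that models with different fixed effects are comparable

print(result.summary())
# Candidate models (e.g. with different covariance assumptions) would then be
# compared by information criteria, as the paper does with AIC.
```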

Table 41 shows the results for the model chosen (program is a repeated measures effect, and the covariance structure is Diagonal). Figure 4 shows residuals normality.

Fig. 4
figure 4

Normal Q-Q plot of residuals in the original study

Table 41 Type III tests of fixed effects in the original study

Table 41 shows that technique and program are both statistically significant. Group and the technique by program interaction are not significant. The Bonferroni multiple comparisons tests show that:

  • All three techniques show different effectiveness (EP > BT > CR).

  • ntree shows a higher defect detection rate than the other two programs (ntree > (cmdline = nametbl)).

Tables 42 and 43 show the estimated marginal means for both technique and program. Note that the mean, standard error and 95% confidence interval bounds have been back-transformed to the original scale.

Table 42 Estimated marginal means for technique in the original study
Table 43 Estimated marginal means for program in the original study
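As a small illustration of this back-transformation (a sketch assuming the analysis was run on squared effectiveness values, as described above; the numbers below are invented), means and confidence bounds on the squared scale map back to the original percentage scale by taking square roots:

```python
# Back-transform illustrative estimated marginal means from the squared scale
# (on which the model was fitted) to the original effectiveness scale (percentages).
# The numbers below are made up for illustration only.
import math

emm_sq, ci_low_sq, ci_high_sq = 4900.0, 3600.0, 6400.0  # hypothetical values on the squared scale
emm, ci_low, ci_high = (math.sqrt(v) for v in (emm_sq, ci_low_sq, ci_high_sq))
print(f"mean = {emm:.1f}%, 95% CI = [{ci_low:.1f}%, {ci_high:.1f}%]")  # 70.0%, [60.0%, 80.0%]
```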

Finally, Fig. 5 shows the profile plot with error bars for the program by technique interaction. Although the interaction is not statistically significant, the profile plot suggests that nametbl could be behaving differently for CR. This could mean that the lack of significance of the interaction in the analysis is due to the sample size.

Fig. 5
figure 5

Profile plot for program by technique interaction in the original study

Appendix C: Analysis of the Replicated Experiment

Figure 6 shows the boxplot for observed technique effectiveness. All three techniques show a similar median, and the same range. Compared to the original study, the behaviour of the techniques is much more homogeneous.

Fig. 6
figure 6

Boxplot for observed technique effectiveness in the replicated study

Table 44 shows the descriptive statistics for observed technique effectiveness. We find that the effectiveness of all three techniques is similar. CR has a similar effectiveness as in the original study, suggesting again that perhaps 4 hours of training are not enough to learn the code review technique. However, the mean effectiveness for BT and EP has dropped from 66% and 82%, respectively, in the original study to 46%. Note that in this study the nature of the faults has changed: while in the original study all defects could be detected by all techniques, in this study some defects cannot be exercised by the testing techniques. Therefore, a possible explanation for the change in the effectiveness of the testing techniques could be the faults seeded in the programs. However, this is a hypothesis that needs to be tested.

Table 44 Descriptive statistics for observed technique effectiveness in the replicated study

Figure 7 shows the boxplot for the percentage of defects found in each program. It is interesting to see that the median detection rate for nametbl and ntree is higher than for cmdline. These results could be attributed to cmdline having a slightly larger size and higher complexity. Additionally, cmdline shows 4 outliers, reflecting 4 people who performed exceptionally well, and 2 people who performed exceptionally badly (scoring 0). Finally, ntree shows a higher range compared to the other two programs (note that ntree shows higher Halstead metrics). This is an unexpected result, as nametbl and ntree are very similar in terms of complexity.

Fig. 7
figure 7

Boxplot for observed program detection rate in the Replicated Study

Table 45 shows the descriptive statistics for defect detection rate in programs. It is higher for nametbl and ntree than for cmdline. This result could be due to cmdline having a slightly higher complexity than the other two programs.

Table 45 Descriptive statistics for program defect detection rate in the replicated study

We started the analysis of the replicated study data with the same model as in the original study. This time we had neither convergence nor normality issues. The model chosen was: group, program, technique and the program by technique interaction are fixed effects and subject is a random effect. Program is a repeated measures effect, and the covariance structure is Diagonal. Table 46 shows the results of the analysis of the best model. Figure 8 shows residuals normality.

Fig. 8
figure 8

Normal Q-Q plot of residuals in the replicated study

Table 46 Type III tests of fixed effects in the replicated study

Table 46 shows that program is statistically significant. Group, technique and the technique by program interaction are not significant. The Bonferroni multiple comparisons tests show that cmdline has a lower defect detection rate than the other two programs (cmdline < (ntree = nametbl)).

Tables 47 and 48 show the estimated marginal means for both technique and program.

Table 47 Estimated marginal means for technique in the replicated study
Table 48 Estimated marginal means for program in the replicated study

Finally, Fig. 9 shows the profile plot with error bars for the program by technique interaction. In this case, the profile plot suggests that no interaction exists, which means that the non-significance of the interaction in the analysis is unlikely to be due to the sample size.

Fig. 9
figure 9

Profile plot for program by technique interaction in the replicated study

Appendix D: Joint Analyses

1.1 D.1 RQ1.1: Participants’ Perceptions

Table 49 shows the percentage of participants that perceive each technique to be the most effective. We cannot reject the null hypothesis that the frequency distribution of the responses to the questionnaire item (Using which technique did you detect most defects?) follows a uniform distribution (χ2(2, N = 60) = 1.900, p = 0.387). This means that the number of participants perceiving a particular technique as the most effective cannot be considered to differ across the three techniques. Our data do not support the conclusion that some techniques are more frequently perceived as the most effective than others.

Table 49 Participants’ perceptions of technique effectiveness in the joint analysis
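A minimal sketch of this uniformity check with SciPy follows; the observed counts are illustrative placeholders, not the study’s responses.

```python
# Chi-square goodness-of-fit test against a uniform distribution over the three
# techniques; the observed counts are hypothetical (the study has N = 60).
from scipy.stats import chisquare

observed = [24, 20, 16]              # hypothetical counts for CR, BT, EP
chi2, p = chisquare(observed)        # expected frequencies default to uniform (20 each)
print(f"chi2(2) = {chi2:.3f}, p = {p:.3f}")
```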

1.2 D.2 RQ1.2: Comparing Perceptions with Reality

Table 50 shows the value of kappa and its 95% CI, overall and for each technique separately. We find that all values for kappa with respect to the questionnaire item (Using which technique did you detect most defects?) are consistent with lack of agreement, except for CR (κ < 0.4, poor). This means that our data do not support the conclusion that participants correctly perceive the most effective technique for them.

Table 50 Agreement between perceived and real technique effectiveness in the joint analysis

As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ2(2, N = 60) = 2.423, p = 0.298). Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ2(3, N = 60) = 2.552, p = 0.466). This means that we cannot conclude that there is directionality when participants’ perceptions are wrong. These two results suggest that participants are not more mistaken about one technique than about the others; no technique is more subject to misperceptions than the others.
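For illustration, the agreement and bias checks above can be approximated with standard Python libraries as sketched below, using scikit-learn for kappa and statsmodels for the marginal homogeneity and symmetry tests; the response vectors are invented and do not come from the study.

```python
# Agreement between perceived and observed "most effective" technique (kappa),
# plus Stuart-Maxwell and Bowker tests for bias; the data below are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import SquareTable

labels = ["CR", "BT", "EP"]
perceived = ["EP", "BT", "CR", "EP", "BT", "CR", "EP", "BT", "CR", "EP"]  # hypothetical
observed  = ["BT", "BT", "CR", "CR", "EP", "CR", "EP", "EP", "BT", "EP"]  # hypothetical

print("kappa:", cohen_kappa_score(perceived, observed, labels=labels))

# Build the 3x3 perceived-vs-observed contingency table (rows: perceived, cols: observed).
table = np.zeros((3, 3))
for p, o in zip(perceived, observed):
    table[labels.index(p), labels.index(o)] += 1

st = SquareTable(table)
homog = st.homogeneity()               # Stuart-Maxwell test of marginal homogeneity
symm = st.symmetry(method="bowker")    # McNemar-Bowker test of symmetry
print("Stuart-Maxwell:", homog.statistic, homog.pvalue)
print("Bowker symmetry:", symm.statistic, symm.pvalue)
```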

1.3 D.3 RQ1.3: Comparing the Effectiveness of Techniques

We check whether misperceptions could be due to participants detecting the same number of defects with all three techniques, which would make it impossible for them to make the right choice. Table 51 shows the value and 95% CI of Krippendorff’s α overall and for each pair of techniques, for all participants and for every design group (participants that applied the same technique on the same program) separately, and Table 52 shows the value and 95% CI of Krippendorff’s α overall and for each program/session. For the values over all participants, we can rule out agreement (α < 0.4) except for EP-BT and nametbl-ntree, for which the upper bounds of the 95% CIs are consistent with fair to good agreement. However, even in these two cases, 0 belongs to the 95% CIs, meaning that agreement by chance cannot be ruled out. This means that participants do not obtain effectiveness values so similar across the techniques (programs) that it would be difficult for them to discriminate among techniques (programs). As regards the results for groups, the 95% CIs are too wide to yield reliable results.

Table 51 Agreement between percentage of defects found with each technique in the joint analysis
Table 52 Agreement between percentage of defects found with each program in the joint analysis (N = 61)
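The paper computes Krippendorff’s α with Hayes’ SPSS macro (see footnote 10). As a rough open-source analogue, the third-party krippendorff Python package can be used as sketched below; the effectiveness values are invented for illustration.

```python
# Krippendorff's alpha for the agreement between effectiveness values obtained
# with the three techniques; the values below are hypothetical, not study data.
import numpy as np
import krippendorff

# Rows = "raters" (here: the three techniques), columns = units (participants);
# each cell is the effectiveness a participant obtained with that technique.
effectiveness = np.array([
    [57., 71., 43., 86., 100.],  # CR  (hypothetical values)
    [43., 71., 57., 71.,  86.],  # BT  (hypothetical values)
    [57., 86., 43., 86., 100.],  # EP  (hypothetical values)
])

alpha = krippendorff.alpha(reliability_data=effectiveness,
                           level_of_measurement="ratio")
print(f"Krippendorff's alpha = {alpha:.3f}")
```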

1.4 D.4 RQ1.4: Cost of Mismatch

Table 53 and Fig. 10 show the cost of mismatch. We can see that the CR technique has fewer mismatches than the other two. Although the BT and EP techniques have the same number of mismatches, BT shows a higher dispersion. The results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of the techniques having the same mismatch cost (H(2) = 0.034, p = 0.983). This means that we cannot claim a difference in mismatch cost between the techniques. The estimated mean mismatch cost is 27pp (median 17pp).

Fig. 10
figure 10

Scatterplot for observed mismatch cost in the original study

Table 53 Observed reduction in technique effectiveness for mismatch

These results suggest that the mismatch cost is not negligible (27pp), and is not related to the technique perceived as most effective.
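A minimal sketch of the Kruskal-Wallis comparison used here (and again in Appendix D.5) with SciPy follows; the mismatch-cost samples are hypothetical.

```python
# Kruskal-Wallis comparison of mismatch cost (in percentage points) across the
# techniques perceived as most effective; the samples below are illustrative only.
from scipy.stats import kruskal

cost_cr = [0, 14, 29, 0, 43]   # hypothetical mismatch costs when CR was perceived best
cost_bt = [17, 0, 57, 29, 0]   # hypothetical
cost_ep = [0, 33, 14, 0, 25]   # hypothetical

H, p = kruskal(cost_cr, cost_bt, cost_ep)
print(f"H(2) = {H:.3f}, p = {p:.3f}")
```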

1.5 D.5 RQ1.5: Expected Loss of Effectiveness

Table 54 shows the average loss of effectiveness that should be expected in a project. Again, the results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of the techniques having the same expected reduction in technique effectiveness for a project (H(2) = 5.680, p = 0.058). This means we cannot claim a difference in project effectiveness loss between techniques. The mean expected loss in effectiveness in a project is estimated as 13pp.

Table 54 Observed reduction in technique effectiveness in a software project

These results suggest that the expected loss in effectiveness in a project is not negligible (13pp), and is not related to the technique perceived as most effective.


Cite this article

Vegas, S., Riofrío, P., Marcos, E. et al. On (Mis)perceptions of testing effectiveness: an empirical study. Empir Software Eng 25, 2844–2896 (2020). https://doi.org/10.1007/s10664-020-09805-y
