Abstract
A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers’ perceptions about them. A factor influencing people’s perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience. To do this, we conduct an empirical study plus a replication. During the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique. At the end of the experiment, they take a survey to find out which technique they perceive to be most effective. The results show that participants’ perceptions are wrong and that this mismatch is costly in terms of quality. In order to gain further insight into the results, we replicate the controlled experiment and extend the survey to include questions about participants’ opinions on the techniques and programs. The results of the replicated study confirm the findings of the original study and suggest that participants’ perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques.
Notes
This has been done for learning purposes, as we have noticed that students sometimes do not report failures that are exercised by test cases. Since this is a learning goal of the course and not relevant for the study, we measure it separately and do not use it here.
Note that it is not possible to take measurements on the failures reported by participants, as they do not run their own test cases, but rather the ones we have given them.
During the training, definitions of the terms error, fault and failure are introduced. Additionally, we explain to participants that the generic term defect is used to refer to both faults and failures without distinction.
Available at http://code.google.com/p/prest/
One of the versions in one of the programs contains only six faults. Due to a mistake we made, one of the failures was concealed by another.
They have participated in development projects in teams (as in the Artificial Intelligence, Compiler and Operating Systems courses).
All analyses are performed using IBM SPSS v26.
For this reason, we need to check both.
Retrieved from: http://afhayes.com/spss-sas-and-mplus-macros-and-code.html
Meaning they were not giving consent to participate in the study.
In a uniform distribution, 33.3% of participants should choose each technique.
Note that the fact that all three techniques are classed as the most effective the same number of times is not incompatible with there being techniques that are more effective than others.
Except in the case of Group 2, where there is agreement for the EP-BT techniques. Since this is the only agreement found, we think it could be spurious.
Note that the mismatch cost is 0 when there is a match.
Note that the median here is not very informative. In this particular case it is 0pp. This happens when there are more matches than mismatches.
Acknowledgments
This research was funded by Spanish Ministry of Science, Innovation and Universities research grant PGC2018-097265-B-I00, the Regional Government of Madrid, under the FORTE-CM project (S2018/TCS-4314) and the Spanish Ministry of Economy and Business, under the MADRID project (TIN2017-88557-R).
Communicated by: Per Runeson
Appendices
Appendix A: Program Metrics
Table 39 shows the metrics collected for each program with the PREST tool. Note that all three programs show similar results for all metrics, except that ntree shows higher Halstead metrics. The size and complexity of cmdline are slightly higher than those of the other two programs.
Appendix B: Analysis of the Original Experiment
Figure 3 shows the boxplot, and Table 40 shows the descriptive statistics for observed technique effectiveness.
We find that the mean and median effectiveness of BT is highest, followed by EP and then by CR. Additionally, EP has a lower variance. The 95% confidence interval suggests that EP is more effective than CR and that BT is as effective as EP and CR. This is an interesting result, as all faults injected in the code could be detected by all techniques. Additionally, it could indicate that code reading depends more on experience, which cannot be acquired in a 4-hour training session.
All three techniques have outliers corresponding to participants who performed exceptionally badly: 3 in the case of BT (all 3 participants scored 0), 2 in the case of EP and 1 in the case of CR (also scoring 0). Additionally, EP has 3 outliers corresponding to participants who performed exceptionally well (all scoring 100). In no case do these values belong to the same participant, which suggests that the outliers correspond to participants who performed exceptionally badly with one technique, but not with all three. Additionally, CR shows a higher variability, which could indicate that it is more dependent on the person applying the technique.
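The outliers discussed above are the points a boxplot flags beyond the Tukey fences, i.e. more than 1.5 × IQR outside the quartiles. A minimal sketch of that rule (the scores below are invented for illustration, not the experiment's data):

```python
def boxplot_outliers(scores):
    """Return values outside the Tukey fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    s = sorted(scores)

    def quantile(q):  # linear interpolation between order statistics
        pos = q * (len(s) - 1)
        lo, frac = int(pos), pos - int(pos)
        return s[lo] + frac * (s[min(lo + 1, len(s) - 1)] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [x for x in scores if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Hypothetical effectiveness scores (%); the three zeros get flagged
scores = [0, 0, 0, 55, 60, 62, 65, 66, 70, 72, 75, 78, 80]
print(boxplot_outliers(scores))  # -> [0, 0, 0]
```

Statistical packages differ slightly in how they interpolate quantiles, so fence positions (and hence which borderline points count as outliers) can vary marginally between tools.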
The experimental data have been analysed with a linear mixed-effects model (SPSS v26 MIXED procedure). Group, program, technique and the program by technique interaction are fixed effects, and subject is a random effect. Eleven models were tried, and the one with the lowest AIC was chosen:
One pure random effects model and no repeated measures.
Five models specifying program as repeated measures effect.
Five models specifying technique as repeated measures effect.
The five models differed in the covariance structures used (identity, diagonal, first-order autoregressive, compound symmetry and unstructured).
In this first analysis, the MIXED procedure did not achieve convergence. We then decided to relax the model by removing subject from the random effects list. We re-ran the analyses, but this time we found severe departures from normality, leading to unreliable results. The next step consisted of a data transformation. The chosen transformation, square, solved the normality issues, and the MIXED procedure converged.
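The AIC-based selection above can be illustrated with a toy sketch. The snippet below compares cell-means candidate models (a deliberate simplification of the mixed models fitted in SPSS: no random effects or covariance structures) using AIC = n·ln(RSS/n) + 2k; the data and factor labels are invented:

```python
import math

def aic(y, labels):
    """AIC of a cell-means model (one mean per label): n*ln(RSS/n) + 2k."""
    cells = {}
    for v, g in zip(y, labels):
        cells.setdefault(g, []).append(v)
    rss = sum((v - sum(vs) / len(vs)) ** 2 for vs in cells.values() for v in vs)
    return len(y) * math.log(rss / len(y)) + 2 * len(cells)

# Made-up effectiveness scores for 12 subjects (not the experiment's data):
# technique explains the scores, program does not
y         = [60, 62, 80, 82, 40, 41, 62, 60, 82, 80, 41, 40]
technique = ["CR", "CR", "EP", "EP", "BT", "BT"] * 2
program   = ["run1"] * 6 + ["run2"] * 6

candidates = {
    "null": ["all"] * 12,
    "technique": technique,
    "program": program,
    "technique x program": [t + "/" + p for t, p in zip(technique, program)],
}
best = min(candidates, key=lambda name: aic(y, candidates[name]))
print(best)  # -> technique
```

The interaction model fits the toy data equally well but pays a higher parameter penalty, so the simpler technique-only model wins, which is the trade-off AIC formalizes.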
Table 41 shows the results for the model chosen (program is a repeated measures effect, and the covariance structure is Diagonal). Figure 4 shows residuals normality.
Table 41 shows that technique and program are both statistically significant. Group and the technique by program interaction are not significant. The Bonferroni multiple comparisons tests show that:
All three techniques show different effectiveness (EP > BT > CR).
ntree shows a higher defect detection rate than the other two programs (ntree > (cmdline = nametbl)).
Tables 42 and 43 show the estimated marginal means for both technique and program. Note that the mean, std. error and 95% confidence interval bounds have been back-transformed to the original scale.
Finally, Fig. 5 shows the profile plot with error bars for the program by technique interaction. Although the interaction is not statistically significant, the profile plot suggests that nametbl could be behaving differently for CR. This could mean that the lack of significance of the interaction in the analysis is due to the sample size.
Appendix C: Analysis of the Replicated Experiment
Figure 6 shows the boxplot for observed technique effectiveness. All three techniques show a similar median, and the same range. Compared to the original study, the behaviour of the techniques is much more homogeneous.
Table 44 shows the descriptive statistics for observed technique effectiveness. We find that the effectiveness of all three techniques is similar. CR has a similar effectiveness as in the original study (suggesting again that perhaps 4 hours of training are not enough to learn the code review technique). However, the mean effectiveness of BT and EP has dropped from 66% and 82%, respectively, in the original study to 46%. Note that the nature of the faults changed in this study. While in the original study all defects could be detected by all techniques, in this study some defects cannot be exercised by the testing techniques. Therefore, a possible explanation for the change in the effectiveness of the testing techniques could be the faults seeded in the programs. However, this is a hypothesis that needs to be tested.
Figure 7 shows the boxplot for the percentage of defects found in each program. It is interesting to see that the median detection rate for nametbl and ntree is higher than for cmdline. These results could be attributed to cmdline having a slightly higher size and complexity. Additionally, cmdline shows outliers: 4 people who performed exceptionally well and 2 who performed exceptionally badly (scoring 0). Finally, ntree shows a higher range than the other two programs (note that ntree shows higher Halstead metrics). This is an unexpected result, as nametbl and ntree are very similar in terms of complexity.
Table 45 shows the descriptive statistics for defect detection rate in programs. It is higher for nametbl and ntree than for cmdline. This result could be due to cmdline having a slightly higher complexity than the other two programs.
We started the analysis of the replicated study data with the same model as in the original study. This time we had neither convergence nor normality issues. The model chosen was: group, program, technique and the program by technique interaction are fixed effects and subject is a random effect. Program is a repeated measures effect, and the covariance structure is Diagonal. Table 46 shows the results of the analysis of the best model. Figure 8 shows residuals normality.
Table 46 shows that program is statistically significant. Group, technique and the technique by program interaction are not significant. The Bonferroni multiple comparisons tests show that cmdline shows a lower defect detection rate than the other two programs (cmdline < (nametbl = ntree)).
Tables 47 and 48 show the estimated marginal means for both technique and program.
Finally, Fig. 9 shows the profile plot with error bars for the program by technique interaction. In this case, the profile plot suggests that there is no interaction, which means that the non-significance of the interaction in the analysis is unlikely to be due to sample size.
Appendix D: Joint Analyses
D.1 RQ1.1: Participants’ Perceptions
Table 49 shows the percentage of participants that perceive each technique to be the most effective. We cannot reject the null hypothesis that the frequency distribution of the responses to the questionnaire item (Using which technique did you detect most defects?) follows a uniform distribution (χ2(2,N = 60)= 1.900, p = 0.387). This means that the number of participants perceiving a particular technique as the most effective cannot be considered different across the three techniques. Our data do not support the conclusion that some techniques are more frequently perceived as the most effective than others.
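The uniformity check is a plain χ² goodness-of-fit test, and for df = 2 the p-value has the closed form exp(−χ²/2). A sketch with hypothetical counts (not the study's actual responses, which are in Table 49) chosen so the statistic lands on the reported value:

```python
import math

def chisq_uniform(counts):
    """Chi-square goodness-of-fit statistic against a uniform distribution."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)

# Hypothetical split of the 60 answers over the three techniques (expected: 20 each)
counts = [25, 18, 17]
stat = chisq_uniform(counts)
p = math.exp(-stat / 2)  # survival function of the chi-square with df = 2
print(round(stat, 3), round(p, 3))  # -> 1.9 0.387
```

With p well above 0.05, a uniform spread of perceptions cannot be rejected, matching the result reported above.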
D.2 RQ1.2: Comparing Perceptions with Reality
Table 50 shows the value of kappa and its 95% CI, overall and for each technique separately. We find that all values for kappa with respect to the questionnaire item (Using which technique did you detect most defects?) are consistent with lack of agreement, except for CR (κ < 0.4, poor). This means that our data do not support the conclusion that participants correctly perceive the most effective technique for them.
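Cohen's kappa corrects the raw proportion of matches for the agreement expected by chance: κ = (p_o − p_e)/(1 − p_e). A minimal sketch over a made-up perception-vs-reality confusion matrix (not the study's data; a κ near 0 signals chance-level agreement):

```python
def cohens_kappa(matrix):
    """Kappa from a square confusion matrix: rows = perceived best, cols = observed best."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    p_obs = sum(matrix[i][i] for i in range(k)) / n          # diagonal = matches
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts: perceived (rows) vs observed (cols) most effective technique
m = [[8, 6, 6],   # CR
     [5, 7, 8],   # EP
     [7, 6, 7]]   # BT
print(round(cohens_kappa(m), 2))  # -> 0.05
```

A value this close to 0 falls in the "poor" band (κ < 0.4) of the Landis-Koch style scales used in the paper, i.e. perceptions no better than chance.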
As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ2(2,N = 60)= 2.423, p = 0.298). Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ2(3,N = 60)= 2.552, p = 0.466). This means that we cannot conclude that there is directionality when participants’ perceptions are wrong. These two results suggest that participants are not more mistaken about one technique than about the others; techniques are not differently subject to misperceptions.
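The McNemar-Bowker test checks the symmetry of the off-diagonal cells of the confusion matrix: χ² = Σ_{i&lt;j} (n_ij − n_ji)²/(n_ij + n_ji), with k(k−1)/2 degrees of freedom. A sketch, reusing a made-up matrix (not the study's data):

```python
def bowker(matrix):
    """McNemar-Bowker symmetry statistic and its degrees of freedom."""
    k = len(matrix)
    stat = sum(
        (matrix[i][j] - matrix[j][i]) ** 2 / (matrix[i][j] + matrix[j][i])
        for i in range(k) for j in range(i + 1, k)
        if matrix[i][j] + matrix[j][i] > 0)   # skip empty cell pairs
    return stat, k * (k - 1) // 2

# Hypothetical perceived-vs-observed counts for the three techniques
m = [[8, 6, 6],
     [5, 7, 8],
     [7, 6, 7]]
stat, df = bowker(m)
print(round(stat, 3), df)  # -> 0.454 3
```

A small statistic on 3 degrees of freedom is far from significance, i.e. no evidence that mistaken perceptions lean systematically toward one technique.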
D.3 RQ1.3: Comparing the Effectiveness of Techniques
We check whether misperceptions could be due to participants detecting the same number of defects with all three techniques, which would make it impossible for them to make the right decision. Table 51 shows the value and 95% CI of Krippendorff’s α, overall and for each pair of techniques, for all participants and for every design group (participants who applied the same technique on the same program) separately, and Table 52 shows the value and 95% CI of Krippendorff’s α overall and for each program/session. For the values computed over all participants, we can rule out agreement (α < 0.4), except for EP-BT and nametbl-ntree, for which the upper bounds of the 95% CIs are consistent with fair to good agreement. However, even in these two cases, 0 belongs to the 95% CI, meaning that agreement by chance cannot be ruled out. This means that participants do not obtain effectiveness values so similar across techniques (programs) that it would be difficult to discriminate among them. As regards the results for groups, the 95% CIs are too wide to yield reliable results.
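Krippendorff's α compares the observed within-unit disagreement with the disagreement expected from the pooled values: α = 1 − D_o/D_e. A compact sketch for nominal data with no missing values (the ratings below are invented; in the study's setting a "unit" would be a participant and the values the effectiveness outcomes being compared):

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the values observed for one unit."""
    values = [v for unit in units for v in unit]
    n = len(values)
    # Observed disagreement: ordered disagreeing pairs within each unit, / (m_u - 1)
    d_obs = sum(
        sum(1 for i in range(len(u)) for j in range(len(u))
            if i != j and u[i] != u[j]) / (len(u) - 1)
        for u in units if len(u) > 1) / n
    # Expected disagreement: disagreeing ordered pairs over the pooled values
    counts = Counter(values)
    d_exp = sum(counts[a] * counts[b]
                for a in counts for b in counts if a != b) / (n * (n - 1))
    return 1 - d_obs / d_exp

# Hypothetical per-participant labels (e.g. "high"/"low" effectiveness per technique)
units = [["high", "high"], ["low", "low"], ["high", "low"], ["low", "high"]]
print(round(krippendorff_alpha_nominal(units), 3))  # -> 0.125
```

Values near 0 mean agreement no better than chance, and α ≥ 0.8 is the conventional threshold for reliable agreement; the paper's interval-scale effectiveness data would use a different distance function, so this nominal version is only a structural sketch.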
D.4 RQ1.4: Cost of Mismatch
Table 53 and Fig. 10 show the cost of mismatch. We can see that the CR technique has fewer mismatches compared to the other two. Although the BT and EP techniques have the same number of mismatches, BT shows a higher dispersion. The results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of techniques having the same mismatch cost (H(2)= 0.034, p = 0.983). This means that we cannot claim a difference in mismatch cost between the techniques. The estimated mean mismatch cost is 27pp (median 17pp).
These results suggest that the mismatch cost is not negligible (27pp), and is not related to the technique perceived as most effective.
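The Kruskal-Wallis test ranks all mismatch costs jointly and compares mean ranks across techniques: H = 12/(N(N+1)) · Σ R_g²/n_g − 3(N+1). A sketch without tie correction (the costs below are invented; statistical packages additionally correct H when ties are present):

```python
def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic over a list of samples (no tie correction)."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    h = 12 / (n * (n + 1)) * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
    return h - 3 * (n + 1)

# Invented mismatch costs (pp) for the three techniques
cr, ep, bt = [5, 10, 15], [8, 12, 20], [7, 11, 18]
print(round(kruskal_wallis([cr, ep, bt]), 3))  # -> 0.8
```

A small H relative to the χ² threshold for 2 degrees of freedom (5.99 at the 0.05 level) means no detectable difference between groups, which is the situation reported for the mismatch costs above.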
D.5 RQ1.5: Expected Loss of Effectiveness
Table 54 shows the average loss of effectiveness that should be expected in a project. Again, the results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of techniques having the same expected reduction in technique effectiveness for a project (H(2)= 5.680, p = 0.058). This means we cannot claim a difference in project effectiveness loss between techniques. The mean expected loss in effectiveness in the project is estimated as 13pp.
These results suggest that the expected loss in effectiveness in a project is not negligible (13pp) and is not related to the technique perceived as most effective.
Vegas, S., Riofrío, P., Marcos, E. et al. On (Mis)perceptions of testing effectiveness: an empirical study. Empir Software Eng 25, 2844–2896 (2020). https://doi.org/10.1007/s10664-020-09805-y