Real world projects, real faults: evaluating spectrum based fault localization techniques on Python projects

Widyasari, Ratnadira; Prana, Gede Artha Azriadi; Haryono, Stefanus Agus; Wang, Shaowei; Lo, David

doi:10.1007/s10664-022-10189-4

Real world projects, real faults: evaluating spectrum based fault localization techniques on Python projects

Published: 06 August 2022

Volume 27, article number 147, (2022)
Cite this article

Empirical Software Engineering Aims and scope Submit manuscript

Ratnadira Widyasari ORCID: orcid.org/0000-0001-8190-5458¹,
Gede Artha Azriadi Prana¹,
Stefanus Agus Haryono¹,
Shaowei Wang² &
…
David Lo¹

746 Accesses
3 Citations
Explore all metrics

Abstract

Spectrum Based Fault Localization (SBFL) is a statistical approach to identify faulty code within a program given a program spectra (i.e., records of program elements executed by passing and failing test cases). Several SBFL techniques have been proposed over the years, but most evaluations of those techniques were done only on Java and C programs, and frequently involve artificial faults. Considering the current popularity of Python, indicated by the results of the Stack Overflow survey among developers in 2020, it becomes increasingly important to understand how SBFL techniques perform on Python projects. However, this remains an understudied topic. In this work, our objective is to analyze the effectiveness of popular SBFL techniques in real-world Python projects. We also aim to compare our observed performance on Python to previously-reported performance on Java. Using the recently-built bug benchmark BugsInPy as our fault dataset, we apply five popular SBFL techniques (Tarantula, Ochiai, O^P, Barinel, and DStar) and analyze their performances. We subsequently compare our results with results from Java and C projects reported in earlier related works. We find that 1) the real faults in BugsInPy are harder to identify using SBFL techniques compared to the real faults in Defects4J, indicated by the lower performance of the evaluated SBFL techniques on BugsInPy; 2) older techniques such as Tarantula, Barinel, and Ochiai consistently outperform newer techniques (i.e., O^P and DStar) in a variety of metrics and debugging scenarios; 3) claims in preceding studies done on artificial faults in C and Java (such as “O^P outperforms Tarantula”) do not hold on Python real faults; 4) lower-performing techniques can outperform higher-performing techniques in some cases, emphasizing the potential benefit of combining SBFL techniques. Our results yield insight into how popular SBFL techniques perform in real Python faults and emphasize the importance of conducting SBFL evaluations on real faults.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 7

Fig. 9

How different are different diff algorithms in Git?

Article Open access 11 September 2019

Software defect prediction: future directions and challenges

Article 27 February 2024

APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities

Article Open access 06 December 2023

Notes

References

Abreu R, Van Gemund AJ (2009) A low-cost approximate minimal hitting set algorithm and its application to model-based diagnosis.. In: SARA, vol 9, Citeseer, pp 2–9
Abreu R, Zoeteweij P, Golsteijn R, Van Gemund ArjanJC (2009a) A practical evaluation of spectrum-based fault localization. J Syst Softw 82 (11):1780–1792
Article Google Scholar
Abreu R, Zoeteweij P, van Gemund AJC (2007) On the accuracy of spectrum-based fault localization. In: Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, IEEE Computer Society, USA, TAICPART-MUTATION ’07, pp 89–98
Abreu R, Zoeteweij P, Van Gemund AJ (2006) An evaluation of similarity coefficients for software fault localization. In: 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC’06), IEEE, pp 39–46
Abreu R, Zoeteweij P, Van Gemund AJ (2009b) Spectrum-based multiple fault localization. In: 2009 IEEE/ACM International Conference on Automated Software Engineering, pp 88–99, IEEE
Ali S, Andrews JH, Dhandapani T, Wang W (2009) Evaluating the accuracy of fault localization techniques. In: 2009 IEEE/ACM International Conference on Automated Software Engineering, IEEE, PP 76–87
Baah GK, Podgurski A, Harrold MJ (2010) The probabilistic program dependence graph and its application to fault diagnosis. IEEE Trans Softw Eng 36(4):528–545
Article Google Scholar
Bouillon P, Krinke J, Meyer N, Steimann F (2007) Ezunit: A framework for associating failed unit tests with potential programming errors. In: International Conference on Extreme Programming and Agile Processes in Software Engineering, Springer, PP 101–104
Briand LC, Labiche Y, Liu X (2007) Using machine learning to support debugging with tarantula. In: The 18th IEEE International Symposium on Software Reliability (ISSRE’07), pp 137–146
Cantor AB (1996) Sample-size calculations for cohen’s kappa. Psychol Methods 1(2):150
Article Google Scholar
Chaki S, Groce A, Strichman O (2004) Explaining abstract counterexamples. In: Proceedings of the 12th ACM SIGSOFT twelfth international symposium on Foundations of software engineering, pp 73–82
Chen D, Stolee KT, Menzies T (2019) Replication can improve prior results: A github study of pull request acceptance. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp 179–190, IEEE
Cifuentes C, Hoermann C, Keynes N, Li L, Long S, Mealy E, Mounteney M, Scholz B (2009) Begbunch: Benchmarking for c bug detection tools. In: Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2009), pp 16–20
Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol Bull 114(3):494
Article Google Scholar
D’Agostino R, Pearson ES (1973) Tests for departure from normality. Empirical results for the distributions of b² and \(\sqrt {b^1}\). Biometrika 60(3):613–622
MathSciNet MATH Google Scholar
D’Agostino RB (1971) An omnibus test of normality for moderate and large sample sizes. Biometrika 58(34):1–348
MathSciNet Google Scholar
Debroy V, Wong WE, Xu X, Choi B (2010) A grouping-based strategy to improve the effectiveness of fault localization techniques. In: 2010 10th International Conference on Quality Software, IEEE, pp 13–22
DeVellis RF (2005) Inter-rater reliability. encyclopedia of social measurement. Elsevier Academic Press, Oxford
Google Scholar
Durieux T, Abreu R (2019) Critical review of bugswarm for fault localization and program repair. arXiv preprint arXiv:1905.09375
Ghanbari A, Benton S, Zhang L (2019) Practical program repair via bytecode mutation. In: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp 19–30
Gouveia C, Campos J, Abreu R (2013) Using html5 visualizations in software fault localization. In: 2013 First IEEE Working Conference on Software Visualization (VISSOFT), pp 1–10. , DOI , (to appear in print)
Hao D, Zhang L, Zhang L, Sun J, Mei H (2009) Vida: Visual interactive debugging. In: 2009 IEEE 31st International Conference on Software Engineering, IEEE, pp 583–586
He H, Ren J, Zhao G, He H (2020) Enhancing spectrum-based fault localization using fault influence propagation. IEEE Access 8:18497–18513
Article Google Scholar
Horváth F, Beszédes A, Vancsics B, Balogh G, Vidács L, Gyimóthy T (2020) Experiments with interactive fault localization using simulated and real users. In: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, pp 290–300
Hutchins M, Foster H, Goradia T, Ostrand T (1994) Experiments on the effectiveness of dataflow-and control-flow-based test adequacy criteria. In: Proceedings of 16th International conference on Software engineering, IEEE, pp 191–200
Jiang J, Xiong Y, Zhang H, Gao Q, Chen X (2018) Shaping program repair space with existing patches and similar code. In: Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis, pp 298–309
Jones JA, Harrold MJ (2005) Empirical evaluation of the tarantula automatic fault-localization technique. In: Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, pp 273–282
Jones JA, Harrold MJ, Stasko JT (2001) Visualization for fault localization. In: Proceedings of ICSE 2001 Workshop on Software Visualization, Citeseer
Ju X, Jiang S, Chen X, Wang X, Zhang Y, Cao H (2014) Hsfal: Effective fault localization using hybrid spectrum of full slices and execution slices. J Syst Softw 90:3–17
Article Google Scholar
Just R (2014) The major mutation framework: Efficient and scalable mutation analysis for java. In: Proceedings of the 2014 international symposium on software testing and analysis, pp 433–436
Just R, Jalali D, Ernst MD (2014a) Defects4j: A database of existing faults to enable controlled testing studies for java programs. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis, Association for Computing Machinery, New York, NY, USA, ISSTA 2014, pp 437–440, DOI https://doi.org/10.1145/2610384.2628055, (to appear in print)
Just R, Jalali D, Ernst MD (2014b) Defects4J: A database of existing faults to enable controlled testing studies for java programs. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp 437–440
Just R, Parnin C, Drosos I, Ernst MD (2018) Comparing developer-provided to user-provided tests for fault localization and automated program repair. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, Association for Computing Machinery, New York, NY, USA, ISSTA 2018, pp 287–297. https://doi.org/10.1145/3213846.3213870
Kim J, Lee E (2014) Empirical evaluation of existing algorithms of spectrum based fault localization. In: The International Conference on Information Networking 2014 (ICOIN2014), IEEE, pp 346–351
Kitchenham B (2008) The role of replications in empirical software engineering—word of warning. Empir Softw Eng 13(2):219–221
Article Google Scholar
Koca F, Sözer H, Abreu R (2013) Spectrum-based fault localization for diagnosing concurrency faults. In: IFIP International Conference on Testing Software and Systems, Springer, pp 239–254
Kochhar PS, Xia X, Lo D, Li S (2016) Practitioners’ expectations on automated fault localization. In: Proceedings of the 25th International Symposium on Software Testing and Analysis, pp 165–176
Könighofer R, Bloem R (2011) Automated error localization and correction for imperative programs. In: 2011 Formal Methods in Computer-Aided Design (FMCAD), IEEE, pp 91–100
Le TB, Thung F, Lo D (2013) Theory and practice, do they match? a case with spectrum-based fault localization. In: 2013 IEEE International Conference on Software Maintenance, pp 380–383.
Le T-DB, Lo D, Li M (2015a) Constrained feature selection for localizing faults. In: 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, pp 501–505
Le T-DB, Lo D, Thung F (2015b) Should i follow this fault localization tool’s output?. Empirical Softw. Engg. 20(5):1237–1274. https://doi.org/10.1007/s10664-014-9349-1
Le T-DB, Thung F, Lo D (2013) Theory and practice, do they match? a case with spectrum-based fault localization. In: 2013 IEEE International Conference on Software Maintenance, IEEE, pp 380–383
Le Goues C, Holtschulte N, Smith EK, Brun Y, Devanbu P, Forrest S, Weimer W (2015) The manybugs and introclass benchmarks for automated repair of c programs. IEEE Trans Softw Eng 41(12):1236–1256
Article Google Scholar
Lindsay RM, Ehrenberg AS (1993) The design of replicated studies. The American Statistician 47(3):217–228
Google Scholar
Lo D, Jiang L, Budi A, et al. (2010) Comprehensive evaluation of association measures for fault localization. In: 2010 IEEE International Conference on Software Maintenance, IEEE, pp 1–10
Long F, Rinard M (2016) An analysis of the search spaces for generate and validate patch generation systems. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, pp 702–713
Lu S, Li Z, Qin F, Tan L, Zhou P, Zhou Y (2005) Bugbench: Benchmarks for evaluating bug detection tools. In: Workshop on the evaluation of software defect detection tools, vol 5
Lucia, Lo D, Xia X (2014) Fusion fault localizers. In: Proceedings of the 29th ACM/IEEE international conference on Automated software engineering, pp 127–138
Martinez M, Monperrus M (2015) Mining software repair models for reasoning on the search space of automated program fixing. Empir Softw Eng 20 (1):176–205
Article Google Scholar
Moon S, Kim Y, Kim M, Yoo S (2014) Ask the mutants: Mutating faulty programs for fault localization. In: 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation, IEEE, pp 153–162
Naish L, Lee HJ, Ramamohanarao K (2011a) A model for spectra-based software diagnosis. ACM Trans. Softw. Eng. Methodol. 20(3)
Naish L, Lee HJ, Ramamohanarao K (2011b) A model for spectra-based software diagnosis. ACM Transactions on software engineering and methodology (TOSEM) 20(3):1–32
Article Google Scholar
Pan K, Kim S, Whitehead EJ (2009) Toward an understanding of bug fix patterns. Empirical Softw. Engg. 14(3):286–315. https://doi.org/10.1007/s10664-008-9077-5
Article Google Scholar
Parnin C, Orso A (2011) Are automated debugging techniques actually helping programmers?. In: Proceedings of the 2011 international symposium on software testing and analysis, pp 199–209
Patra J, Pradel M (2021) Semantic bug seeding: a learning-based approach for creating realistic bugs. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 906–918
Pearson S, Campos J, Just R, Fraser G, Abreu R, Ernst MD, Pang D, Keller B (2017) Evaluating and improving fault localization. In: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp 609–620.
Pearson S, Campos J, Just R, Fraser G, Abreu R, Ernst MD, Pang D, Keller B (2017) Evaluating and improving fault localization. In: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), IEEE, pp 609–620
Planning S (2002) The economic impacts of inadequate infrastructure for software testing. National Institute of Standards and Technology
Rayson P, Berridge D, Francis B (2004) Extending the cochran rule for the comparison of word frequencies between corpora. In: 7th International Conference on Statistical analysis of textual data (JADT 2004), pp 926–936
Ren L, Shan S, xu X, Liu (2020) Starin: An approach to predict the popularity of github repository, pp 258–273. https://doi.org/10.1007/978-981-15-7984-4_20
Renieres M, Reiss SP (2003) Fault localization with nearest neighbor queries. In: 18th IEEE International Conference on Automated Software Engineering, 2003. Proceedings., IEEE, pp 30–39
Romano J, Kromrey JD, Coraggio J, Skowronek J, Devine L (2006) Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices. In: annual meeting of the Southern Association for Institutional Research, Citeseer, pp 1–51
Ruthruff JR, Burnett M, Rothermel G (2005) An empirical study of fault localization for end-user programmers. In: Proceedings of the 27th International Conference on Software Engineering, pp 352–361
Saha RK, Lyu Y, Lam W, Yoshida H, Prasad MR (2018) Bugs. jar: a large-scale, diverse dataset of real-world java bugs. In: Proceedings of the 15th International Conference on Mining Software Repositories, pp 10–13
Santos A, Vegas S, Uyaguari F, Dieste O, Turhan B, Juristo N (2020) Increasing validity through replication: an illustrative tdd case. arXiv preprint arXiv:2004.05335
Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empir Softw Eng 13(2):211–218
Article Google Scholar
Sobreira V, Durieux T, Madeiral F, Monperrus M, de Almeida Maia M (2018) Dissection of a bug dataset: Anatomy of 395 patches from defects4j. In: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, pp 130–140
Sohn J, Yoo S (2017) Fluccs: Using code and change metrics to improve fault localization. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp 273–283
Steimann F, Frenkel M, Abreu R (2013) Threats to the validity and value of empirical assessments of the accuracy of coverage-based fault locators. In: Proceedings of the 2013 International Symposium on Software Testing and Analysis, pp 314–324
Tallarida RJ, Murray RB (1987) Chi-square test. In: Manual of pharmacologic calculations, Springer, pp 140–142
Tomassi DA, Dmeiri N, Wang Y, Bhowmick A, Liu Y-C, Devanbu PT, Vasilescu B, Rubio-González C (2019) Bugswarm: Mining and continuously growing a dataset of reproducible failures and fixes. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, pp 339–349
Tufano M, Kimko J, Wang S, Watson C, Bavota G, Di Penta M, Poshyvanyk D (2020) Deepmutation: A neural mutation tool. In: 42nd ACM/IEEE International Conference on Software Engineering: Companion, ICSE-Companion 2020, Institute of Electrical and Electronics Engineers Inc., pp 29–33
Vancsics B, Szatmári A, Beszédes A (2020) Relationship between the effectiveness of spectrum-based fault localization and bug-fix types in javascript programs. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, pp 308–319
Vessey I (1985) Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies 23(5):459–494
Article Google Scholar
Wen M, Chen J, Wu R, Hao D, Cheung S-C (2018) Context-aware patch generation for better automated program repair. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), IEEE, pp 1–11
Widyasari R, Sim SQ, Lok C, Qi H, Phan J, Tay Q, Tan C, Wee F, Tan JE, Yieh Y, et al (2020) Bugsinpy: a database of existing bugs in python programs to enable controlled testing and debugging studies. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 1556–1560
Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in statistics, Springer, pp 196–202
Wong E, Wei T, Qi Y, Zhao L (2008) A crosstab-based statistical method for effective fault localization. In: 2008 1st international conference on software testing, verification, and validation, IEEE, pp 42–51
Wong WE, Debroy V, Gao R, Li Y (2013) The dstar method for effective software fault localization. IEEE Trans Reliab 63(1):290–308
Article Google Scholar
Wong WE, Debroy V, Golden R, Xu X, Thuraisingham B (2011) Effective software fault localization using an rbf neural network. IEEE Trans Reliab 61(1):149–169
Article Google Scholar
Wong WE, Debroy V, Surampudi A, Kim H, Siok MF (2010) Recent catastrophic accidents: Investigating how software was responsible. In: 2010 Fourth International Conference on Secure Software Integration and Reliability Improvement, IEEE, pp 14–22
Wong WE, Gao R, Li Y, Abreu R, Wotawa F (2016) A survey on software fault localization. IEEE Trans Softw Eng 42(8):707–740
Article Google Scholar
Wright CS, Zia TA (2011) A quantitative analysis into the economics of correcting software bugs. In: Computational Intelligence in Security for Information Systems, Springer, pp 198–205
Xia X, Bao L, Lo D, Li S (2016) “automated debugging considered harmful” considered harmful: A user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems. In: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp 267–278, IEEE
Xie X, Chen TY, Kuo F-C, Xu B (2013) A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization. ACM Transactions on Software Engineering and Methodology (TOSEM) 22(4):1–40
Article Google Scholar
Xie X, Liu Z, Song S, Chen Z, Xuan J, Xu B (2016) Revisit of automatic debugging via human focus-tracking analysis. In: Proceedings of the 38th International Conference on Software Engineering, pp 808–819
Xuan J, Monperrus M (2014a) Learning to combine multiple ranking metrics for fault localization. In: 2014 IEEE International Conference on Software Maintenance and Evolution, pp 191–200, IEEE
Xuan J, Monperrus M (2014b) Test case purification for improving fault localization. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp 52–63
Zhang M, Li X, Zhang L, Khurshid S (2017) Boosting spectrum-based fault localization using pagerank. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp 261–272
Zou D, Liang J, Xiong Y, Ernst MD, Zhang L (2019) An empirical study of fault localization families and their combinations. IEEE Trans Softw Eng 47(2):332–347
Article Google Scholar

Download references

Author information

Authors and Affiliations

School of Computing and Information Systems, Singapore Management University, 80 Stamford Rd, Stamford, 178902, Singapore
Ratnadira Widyasari, Gede Artha Azriadi Prana, Stefanus Agus Haryono & David Lo
Department of Computer Science, University of Manitoba, Winnipeg, Canada
Shaowei Wang

Authors

Ratnadira Widyasari
View author publications
You can also search for this author in PubMed Google Scholar
Gede Artha Azriadi Prana
View author publications
You can also search for this author in PubMed Google Scholar
Stefanus Agus Haryono
View author publications
You can also search for this author in PubMed Google Scholar
Shaowei Wang
View author publications
You can also search for this author in PubMed Google Scholar
David Lo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ratnadira Widyasari.

Additional information

Communicated by: Martin Monperrus

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 20 Cliff’s d effect size and Wilcoxon rank-sum test results on top-k metrics in all debugging scenarios (i.e., best-case, average-case, worst-case)

Full size table

Table 21 Improvement on SBFL Techniques

Full size table

Table 22 Improvement on Average-case Debugging Scenario

Full size table

Table 23 Improvement on Worst-case Debugging Scenario

Full size table

Table 24 Prior Results Comparison on Average-case Scenario

Full size table

Table 25 Prior Results Comparison on Worst-case Scenario

Full size table

Rights and permissions

Reprints and permissions

About this article

Cite this article

Widyasari, R., Prana, G.A.A., Haryono, S.A. et al. Real world projects, real faults: evaluating spectrum based fault localization techniques on Python projects. Empir Software Eng 27, 147 (2022). https://doi.org/10.1007/s10664-022-10189-4

Download citation

Accepted: 26 May 2022
Published: 06 August 2022
DOI: https://doi.org/10.1007/s10664-022-10189-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Real world projects, real faults: evaluating spectrum based fault localization techniques on Python projects

Abstract

Access this article

Similar content being viewed by others

How different are different diff algorithms in Git?

Software defect prediction: future directions and challenges

APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Appendix

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Real world projects, real faults: evaluating spectrum based fault localization techniques on Python projects

Abstract

Access this article

Similar content being viewed by others

How different are different diff algorithms in Git?

Software defect prediction: future directions and challenges

APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation