Evaluating defect prediction approaches: a benchmark and an extensive comparison

Empirical Software Engineering

Abstract

Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches. We present a benchmark for defect prediction, in the form of a publicly available dataset consisting of several software systems, and provide an extensive comparison of well-known bug prediction approaches, together with novel approaches we devised. We evaluate the performance of the approaches using different performance indicators: classification of entities as defect-prone or not, and ranking of the entities, each with and without taking into account the effort needed to review an entity. We performed three sets of experiments aimed at (1) comparing the approaches across different systems, (2) testing whether the differences in performance are statistically significant, and (3) investigating the stability of approaches across different learners. Our results indicate that, while some approaches perform better than others in a statistically significant manner, external validity in defect prediction is still an open problem, as generalizing results to different contexts/learners proved to be a partially unsuccessful endeavor.
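
To make the evaluation scenarios named above concrete, the following is a minimal, self-contained sketch of how a prediction could be scored both as a binary classification and as an effort-aware ranking. It is an illustration only, not the paper's actual metrics or data: the toy entities, the 0.5 threshold, and the risk-per-line ordering are our own assumptions.

```python
from typing import List, Tuple

# Each entity: (predicted risk, actual post-release defects, lines of code).
# The values below are made up purely for illustration.
entities: List[Tuple[float, int, int]] = [
    (0.90, 5, 1200), (0.75, 0, 300), (0.60, 2, 800),
    (0.40, 1, 150),  (0.20, 0, 400), (0.05, 0, 100),
]

def classify(threshold: float = 0.5) -> Tuple[float, float]:
    """Binary view: entities above the threshold are predicted defect-prone."""
    tp = sum(1 for r, d, _ in entities if r >= threshold and d > 0)
    fp = sum(1 for r, d, _ in entities if r >= threshold and d == 0)
    fn = sum(1 for r, d, _ in entities if r < threshold and d > 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def effort_aware_curve() -> List[Tuple[float, float]]:
    """Ranking view: order entities by risk per line of code (one common
    effort-aware choice) and track defects found vs. code reviewed."""
    ranked = sorted(entities, key=lambda e: e[0] / e[2], reverse=True)
    total_defects = sum(d for _, d, _ in ranked)
    total_loc = sum(loc for _, _, loc in ranked)
    curve, found, reviewed = [], 0, 0
    for _, defects, loc in ranked:
        found += defects
        reviewed += loc
        curve.append((reviewed / total_loc, found / total_defects))
    return curve

if __name__ == "__main__":
    print("precision, recall:", classify())
    print("effort vs. defects found:", effort_aware_curve())
```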

Notes

  1. http://promisedata.org

  2. http://bug.inf.usi.ch

  3. http://promisedata.org/data

  4. http://mdp.ivv.nasa.gov, also part of PROMISE.

  5. We employ JUnit 3 naming conventions to detect test classes, i.e., classes whose names end with “Test” are detected as tests (see the sketch after this list).

  6. Available at http://www.intooitus.com/.

  7. Available at http://www.moosetechnology.org.

  8. Available at http://churrasco.inf.usi.ch.

  9. For instance, the linearly decayed churn of FanIn and the logarithmically decayed churn of FanIn have a very high correlation.

  10. This is not in contradiction with Antoniol et al. (2008): Bugs mentioned as fixes in CVS comments are intuitively more likely to be real bugs, as they got fixed.
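
The following is a minimal sketch of the naming-convention check described in note 5, assuming a plain-text scan of Java sources; the regular expression, function name, and example path are hypothetical and not the authors' actual tooling.

```python
import re
from pathlib import Path
from typing import Iterator

# A deliberately simple pattern for class declarations in Java source;
# real detection would use a parser or the FAMIX model instead.
CLASS_DECL = re.compile(r"\bclass\s+(\w+)")

def find_test_classes(source_root: str) -> Iterator[str]:
    """Yield class names that follow the JUnit 3 '*Test' naming convention."""
    for java_file in Path(source_root).rglob("*.java"):
        text = java_file.read_text(encoding="utf-8", errors="ignore")
        for match in CLASS_DECL.finditer(text):
            name = match.group(1)
            if name.endswith("Test"):
                yield name

# Example usage (the path is hypothetical):
# for cls in find_test_classes("eclipse/plugins/org.eclipse.jdt.core"):
#     print("test class:", cls)
```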

References

  • Antoniol G, Ayari K, Di Penta M, Khomh F, Guéhéneuc Y-G (2008) Is it a bug or an enhancement?: a text-based approach to classify change requests. In: Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds (CASCON 2008). ACM, New York, pp 304–318

  • Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. In: Proceedings of the 2006 ACM/IEEE international symposium on empirical software engineering (ISESE 2006). ACM, New York, pp 8–17

  • Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83(1):2–17

  • Arnaoudova V, Eshkevari L, Oliveto R, Gueheneuc Y-G, Antoniol G (2010) Physical and conceptual identifier dispersion: measures and relation to fault proneness. In: Proceedings of the 26th IEEE international conference on software maintenance (ICSM 2010). IEEE CS, Washington, pp 1–5

  • Bacchelli A, D’Ambros M, Lanza M (2010) Are popular classes more defect prone? In: Proceedings of the 13th international conference on fundamental approaches to software engineering (FASE 2010). Springer, Berlin, pp 59–73

  • Basili VR, Briand LC, Melo WL (1996) A validation of object-oriented design metrics as quality indicators. IEEE Trans Softw Eng 22(10):751–761

  • Bernstein A, Ekanayake J, Pinzger M (2007) Improving defect prediction using temporal features and non linear models. In: Proceedings of the ninth international workshop on principles of software evolution (IWPSE 2007). ACM, New York, pp 11–18

  • Binkley AB, Schach SR (1998) Validation of the coupling dependency metric as a predictor of run-time failures and maintenance measures. In: Proceedings of the 20th international conference on software engineering (ICSE 1998). IEEE CS, Washington, pp 452–455

  • Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: bias in bug-fix datasets. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE 2009). ACM, New York, pp 121–130

  • Briand LC, Daly JW, Wüst J (1999) A unified framework for coupling measurement in object-oriented systems. IEEE Trans Softw Eng 25(1):91–121

  • Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Softw Eng 20(6):476–493

  • D’Ambros M, Lanza M (2010) Distributed and collaborative software evolution analysis with churrasco. J Sci Comput Program (SCP) 75(4):276–287

  • D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches. In: Proceedings of the 7th international working conference on mining software repositories (MSR 2010). IEEE CS, Washington, pp 31–41

  • Demeyer S, Tichelaar S, Ducasse S (2001) FAMIX 2.1—The FAMOOS information exchange model. Technical report, University of Bern

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Ducasse S, Gîrba T, Nierstrasz O (2005) Moose: an agile reengineering environment. In: Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on foundations of software engineering (ESEC/FSE 2005). ACM, New York, pp 99–102. Tool demo

  • El Emam K, Melo W, Machado JC (2001) The prediction of faulty classes using object-oriented design metrics. J Syst Softw 56(1):63–75

  • Fenton NE, Ohlsson N (2000) Quantitative analysis of faults and failures in a complex software system. IEEE Trans Softw Eng 26(8):797–814

  • Fischer M, Pinzger M, Gall H (2003) Populating a release history database from version control and bug tracking systems. In: Proceedings of the international conference on software maintenance (ICSM 2003). IEEE CS, Washington, pp 23–32

  • Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995

  • Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  • Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661

  • Gyimóthy T, Ferenc R, Siket I (2005) Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Softw Eng 31(10):897–910

  • Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the seventeenth international conference on machine learning (ICML 2000). Morgan Kaufmann, San Mateo, pp 359–366

  • Hall M, Holmes G (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 15(6):1437–1447

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering (ICSE 2009). IEEE CS, Washington, pp 78–88

  • Hassan AE, Holt RC (2005) The top ten list: dynamic fault prediction. In: Proceedings of the 21st IEEE international conference on software maintenance (ICSM 2005). IEEE CS, Washington, pp 263–272

  • Ho YC, Pepyne DL (2002) Simple explanation of the no-free-lunch theorem and its implications. J Optim Theory Appl 115(3):549–570

  • Jackson EJ (2003) A users guide to principal components. Wiley, New York

  • Jiang Y, Cukic B, Ma Y (2008) Techniques for evaluating fault prediction models. Empir Software Eng 13:561–595

  • Juristo NJ, Vegas S (2009) Using differences among replications of software engineering experiments to gain knowledge. In: Proceedings of the 3rd international symposium on empirical software engineering and measurement (ESEM 2009). IEEE CS, Washington, pp 356–366

  • Kamei Y, Matsumoto S, Monden A, Matsumoto K-i, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort aware models. In: Proceedings of the 26th IEEE international conference on software maintenance (ICSM 2010). IEEE CS, Washington, pp 1–10

  • Kim S, Zimmermann T, Whitehead J, Zeller A (2007) Predicting faults from cached history. In: Proceedings of the 29th international conference on software engineering (ICSE 2007). IEEE CS, Washington, pp 489–498

  • Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on machine learning (ML 1992). Morgan Kaufmann, San Mateo, pp 249–256

  • Khoshgoftaar TM, Allen EB (1999) A comparative study of ordering and classification of fault-prone software modules. Empir Software Eng 4:159–186

  • Khoshgoftaar TM, Allen EB, Goel N, Nandi A, McMullan J (1996) Detection of software modules with high debug code churn in a very large legacy system. In: Proceedings of the seventh international symposium on software reliability engineering (ISSRE 1996). IEEE CS, Washington, pp 364–371

  • Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324

  • Kollmann R, Selonen P, Stroulia E (2002) A study on the current state of the art in tool-supported UML-based static reverse engineering. In: Proceedings of the ninth working conference on reverse engineering (WCRE 2002). IEEE CS, Washington, pp 22–32

  • Kononenko I (1994) Estimating attributes: analysis and extensions of RELIEF. Springer, Berlin, pp 171–182

  • Koru AG, Zhang D, Liu H (2007) Modeling the effect of size on defect proneness for open-source software. In: Proceedings of the third international workshop on predictor models in software engineering (PROMISE 2007). IEEE CS, Washington, pp 10–19

  • Koru AG, El Emam K, Zhang D, Liu H, Mathew D (2008) Theory of relative defect proneness. Empir Software Eng 13:473–498

  • Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496

  • Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Trans Softw Eng 34(2):287–300

  • Mende T (2010) Replication of defect prediction studies: problems, pitfalls and recommendations. In: Proceedings of the 6th international conference on predictive models in software engineering (PROMISE 2010). ACM, New York, pp 1–10

  • Mende T, Koschke R (2009) Revisiting the evaluation of defect prediction models. In: Proceedings of the 5th international conference on predictive models in software engineering (PROMISE 2009). ACM, New York, pp 1–10

  • Mende T, Koschke R (2010) Effort-aware defect prediction models. In: Proceedings of the 14th European conference on software maintenance and reengineering (CSMR 2010). IEEE CS, Washington, pp 109–118

  • Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13

  • Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17:375–407

  • Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th international conference on software engineering (ICSE 2008). ACM, New York, pp 181–190

  • Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391

  • Nagappan N, Ball T (2005a) Static analysis tools as early indicators of pre-release defect density. In: Proceedings of the 27th international conference on software engineering (ICSE 2005). ACM, New York, pp 580–586

  • Nagappan N, Ball T (2005b) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on software engineering (ICSE 2005). ACM, New York, pp 284–292

  • Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering (ICSE 2006). ACM, New York, pp 452–461

  • Nikora AP, Munson JC (2003) Developing fault predictors for evolving software systems. In: Proceedings of the 9th international symposium on software metrics (METRICS 2003). IEEE CS, Washington, pp 338–349

  • Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM conference on computer and communications security (CCS 2007). ACM, New York, pp 529–540

  • Ohlsson N, Alberg H (1996) Predicting fault-prone software modules in telephone switches. IEEE Trans Softw Eng 22(12):886–894

  • Ostrand TJ, Weyuker EJ (2002) The distribution of faults in a large industrial software system. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2002). ACM, New York, pp 55–64

  • Ostrand TJ, Weyuker EJ, Bell RM (2004) Where the bugs are. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2004). ACM, New York, pp 86–96

  • Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355

  • Ostrand TJ, Weyuker EJ, Bell RM (2007) Automating algorithms for the identification of fault-prone files. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2007). ACM, New York, pp 219–227

  • Pinzger M, Nagappan N, Murphy B (2008) Can developer-module networks predict failures? In: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE 2008). ACM, New York, pp 2–12

  • Robles G (2010) Replicating MSR: a study of the potential replicability of papers published in the mining software repositories proceedings. In: Proceedings of the 7th international working conference on mining software repositories (MSR 2010). IEEE CS, Washington, pp 171–180

  • Shin Y, Bell RM, Ostrand TJ, Weyuker EJ (2009) Does calling structure information improve the accuracy of fault prediction? In: Proceedings of the 7th international working conference on mining software repositories (MSR 2009). IEEE CS, Washington, pp 61–70

  • Sim SE, Easterbrook SM, Holt RC (2003) Using benchmarking to advance research: a challenge to software engineering. In: Proceedings of the 25th international conference on software engineering (ICSE 2003). IEEE CS, Washington, pp 74–83

  • Subramanyam R, Krishnan MS (2003) Empirical analysis of CK metrics for object-oriented design complexity: implications for software defects. IEEE Trans Softw Eng 29(4):297–310

  • Turhan B, Menzies T, Bener AB, Di Stefano JS (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Software Eng 14(5):540–578

  • Turhan B, Bener AB, Menzies T (2010) Regularities in learning defect predictors. In: Proceedings of the 11th international conference on product-focused software process improvement (PROFES 2010). Springer, Berlin, pp 116–130

  • Wolf T, Schröter A, Damian D, Nguyen THD (2009) Predicting build failures using social network analysis on developer communication. In: Proceedings of the 31st international conference on software engineering (ICSE 2009). IEEE CS, Washington, pp 1–11

  • Zimmermann T, Nagappan N (2008) Predicting defects using network analysis on dependency graphs. In: Proceedings of the 30th international conference on software engineering (ICSE 2008). ACM, New York, pp 531–540

  • Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: Proceedings of the 3rd international workshop on predictive models in software engineering (PROMISE 2007). IEEE CS, Washington, pp 9–15

  • Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE 2009). ACM, New York, pp 91–100

Acknowledgement

We acknowledge the financial support of the Swiss National Science Foundation for the project “SOSYA” (SNF Project No. 132175).

Author information

Corresponding author

Correspondence to Marco D’Ambros.

Additional information

Editors: Jim Whitehead and Tom Zimmermann

About this article

Cite this article

D’Ambros, M., Lanza, M. & Robbes, R. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Software Eng 17, 531–577 (2012). https://doi.org/10.1007/s10664-011-9173-9
