Evaluating defect prediction approaches: a benchmark and an extensive comparison

Empirical Software Engineering

Abstract

Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches. We present a benchmark for defect prediction, in the form of a publicly available dataset consisting of several software systems, and provide an extensive comparison of well-known bug prediction approaches, together with novel approaches we devised. We evaluate the performance of the approaches using different performance indicators: classification of entities as defect-prone or not, and ranking of the entities, each with and without taking into account the effort needed to review an entity. We performed three sets of experiments aimed at (1) comparing the approaches across different systems, (2) testing whether the differences in performance are statistically significant, and (3) investigating the stability of approaches across different learners. Our results indicate that, while some approaches perform better than others in a statistically significant manner, external validity in defect prediction is still an open problem, as generalizing results to different contexts/learners proved to be a partially unsuccessful endeavor.
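
To make the evaluation scenarios named above concrete, the following is a minimal, self-contained sketch of how a prediction could be scored both as a binary classification and as an effort-aware ranking. It is an illustration only, not the paper's actual metrics or data: the toy entities, the 0.5 threshold, and the risk-per-line ordering are our own assumptions.

```python
from typing import List, Tuple

# Each entity: (predicted risk, actual post-release defects, lines of code).
# The values below are made up purely for illustration.
entities: List[Tuple[float, int, int]] = [
    (0.90, 5, 1200), (0.75, 0, 300), (0.60, 2, 800),
    (0.40, 1, 150),  (0.20, 0, 400), (0.05, 0, 100),
]

def classify(threshold: float = 0.5) -> Tuple[float, float]:
    """Binary view: entities above the threshold are predicted defect-prone."""
    tp = sum(1 for r, d, _ in entities if r >= threshold and d > 0)
    fp = sum(1 for r, d, _ in entities if r >= threshold and d == 0)
    fn = sum(1 for r, d, _ in entities if r < threshold and d > 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def effort_aware_curve() -> List[Tuple[float, float]]:
    """Ranking view: order entities by risk per line of code (one common
    effort-aware choice) and track defects found vs. code reviewed."""
    ranked = sorted(entities, key=lambda e: e[0] / e[2], reverse=True)
    total_defects = sum(d for _, d, _ in ranked)
    total_loc = sum(loc for _, _, loc in ranked)
    curve, found, reviewed = [], 0, 0
    for _, defects, loc in ranked:
        found += defects
        reviewed += loc
        curve.append((reviewed / total_loc, found / total_defects))
    return curve

if __name__ == "__main__":
    print("precision, recall:", classify())
    print("effort vs. defects found:", effort_aware_curve())
```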

Notes

  1. http://promisedata.org

  2. http://bug.inf.usi.ch

  3. http://promisedata.org/data

  4. http://mdp.ivv.nasa.gov, also part of PROMISE.

  5. We employ JUnit 3 naming conventions to detect test classes, i.e., classes whose names end with “Test” are detected as tests (see the sketch after this list).

  6. Available at http://www.intooitus.com/.

  7. Available at http://www.moosetechnology.org.

  8. Available at http://churrasco.inf.usi.ch.

  9. For instance, the linearly decayed churn of FanIn and the logarithmically decayed churn of FanIn have a very high correlation.

  10. This is not in contradiction with Antoniol et al. (2008): Bugs mentioned as fixes in CVS comments are intuitively more likely to be real bugs, as they got fixed.
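
The following is a minimal sketch of the naming-convention check described in note 5, assuming a plain-text scan of Java sources; the regular expression, function name, and example path are hypothetical and not the authors' actual tooling.

```python
import re
from pathlib import Path
from typing import Iterator

# A deliberately simple pattern for class declarations in Java source;
# real detection would use a parser or the FAMIX model instead.
CLASS_DECL = re.compile(r"\bclass\s+(\w+)")

def find_test_classes(source_root: str) -> Iterator[str]:
    """Yield class names that follow the JUnit 3 '*Test' naming convention."""
    for java_file in Path(source_root).rglob("*.java"):
        text = java_file.read_text(encoding="utf-8", errors="ignore")
        for match in CLASS_DECL.finditer(text):
            name = match.group(1)
            if name.endswith("Test"):
                yield name

# Example usage (the path is hypothetical):
# for cls in find_test_classes("eclipse/plugins/org.eclipse.jdt.core"):
#     print("test class:", cls)
```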

References

  • Antoniol G, Ayari K, Di Penta M, Khomh F, Guéhéneuc Y-G (2008) Is it a bug or an enhancement?: a text-based approach to classify change requests. In: Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds (CASCON 2008). ACM, New York, pp 304–318

  • Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. In: Proceedings of the 2006 ACM/IEEE international symposium on empirical software engineering (ISESE 2006). ACM, New York, pp 8–17

  • Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83(1):2–17

  • Arnaoudova V, Eshkevari L, Oliveto R, Gueheneuc Y-G, Antoniol G (2010) Physical and conceptual identifier dispersion: measures and relation to fault proneness. In: Proceedings of the 26th IEEE international conference on software maintenance (ICSM 2010). IEEE CS, Washington, pp 1–5

  • Bacchelli A, D’Ambros M, Lanza M (2010) Are popular classes more defect prone? In: Proceedings of the 13th international conference on fundamental approaches to software engineering (FASE 2010). Springer, Berlin, pp 59–73

  • Basili VR, Briand LC, Melo WL (1996) A validation of object-oriented design metrics as quality indicators. IEEE Trans Softw Eng 22(10):751–761

  • Bernstein A, Ekanayake J, Pinzger M (2007) Improving defect prediction using temporal features and non linear models. In: Proceedings of the ninth international workshop on principles of software evolution (IWPSE 2007). ACM, New York, pp 11–18

  • Binkley AB, Schach SR (1998) Validation of the coupling dependency metric as a predictor of run-time failures and maintenance measures. In: Proceedings of the 20th international conference on software engineering (ICSE 1998). IEEE CS, Washington, pp 452–455

  • Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: bias in bug-fix datasets. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE 2009). ACM, New York, pp 121–130

  • Briand LC, Daly JW, Wüst J (1999) A unified framework for coupling measurement in object-oriented systems. IEEE Trans Softw Eng 25(1):91–121

  • Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Softw Eng 20(6):476–493

  • D’Ambros M, Lanza M (2010) Distributed and collaborative software evolution analysis with churrasco. J Sci Comput Program (SCP) 75(4):276–287

  • D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches. In: Proceedings of the 7th international working conference on mining software repositories (MSR 2010). IEEE CS, Washington, pp 31–41

  • Demeyer S, Tichelaar S, Ducasse S (2001) FAMIX 2.1—The FAMOOS information exchange model. Technical report, University of Bern

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Ducasse S, Gîrba T, Nierstrasz O (2005) Moose: an agile reengineering environment. In: Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on foundations of software engineering (ESEC/FSE 2005). ACM, New York, pp 99–102. Tool demo

  • El Emam K, Melo W, Machado JC (2001) The prediction of faulty classes using object-oriented design metrics. J Syst Softw 56(1):63–75

  • Fenton NE, Ohlsson N (2000) Quantitative analysis of faults and failures in a complex software system. IEEE Trans Softw Eng 26(8):797–814

  • Fischer M, Pinzger M, Gall H (2003) Populating a release history database from version control and bug tracking systems. In: Proceedings of the international conference on software maintenance (ICSM 2003). IEEE CS, Washington, pp 23–32

  • Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995

  • Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  • Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661

  • Gyimóthy T, Ferenc R, Siket I (2005) Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Softw Eng 31(10):897–910

  • Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the seventeenth international conference on machine learning (ICML 2000). Morgan Kaufmann, San Mateo, pp 359–366

  • Hall M, Holmes G (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 15(6):1437–1447

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering (ICSE 2009). IEEE CS, Washington, pp 78–88

  • Hassan AE, Holt RC (2005) The top ten list: dynamic fault prediction. In: Proceedings of the 21st IEEE international conference on software maintenance (ICSM 2005). IEEE CS, Washington, pp 263–272

  • Ho YC, Pepyne DL (2002) Simple explanation of the no-free-lunch theorem and its implications. J Optim Theory Appl 115(3):549–570

  • Jackson EJ (2003) A users guide to principal components. Wiley, New York

  • Jiang Y, Cukic B, Ma Y (2008) Techniques for evaluating fault prediction models. Empir Software Eng 13:561–595

  • Juristo NJ, Vegas S (2009) Using differences among replications of software engineering experiments to gain knowledge. In: Proceedings of the 3rd international symposium on empirical software engineering and measurement (ESEM 2009). IEEE CS, Washington, pp 356–366

  • Kamei Y, Matsumoto S, Monden A, Matsumoto K-i, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort aware models. In: Proceedings of the 26th IEEE international conference on software maintenance (ICSM 2010). IEEE CS, Washington, pp 1–10

  • Kim S, Zimmermann T, Whitehead J, Zeller A (2007) Predicting faults from cached history. In: Proceedings of the 29th international conference on software engineering (ICSE 2007). IEEE CS, Washington, pp 489–498

  • Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on machine learning (ML 1992). Morgan Kaufmann, San Mateo, pp 249–256

  • Khoshgoftaar TM, Allen EB (1999) A comparative study of ordering and classification of fault-prone software modules. Empir Software Eng 4:159–186

  • Khoshgoftaar TM, Allen EB, Goel N, Nandi A, McMullan J (1996) Detection of software modules with high debug code churn in a very large legacy system. In: Proceedings of the seventh international symposium on software reliability engineering (ISSRE 1996). IEEE CS, Washington, pp 364–371

  • Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324

  • Kollmann R, Selonen P, Stroulia E (2002) A study on the current state of the art in tool-supported UML-based static reverse engineering. In: Proceedings of the ninth working conference on reverse engineering (WCRE 2002). IEEE CS, Washington, pp 22–32

  • Kononenko I (1994) Estimating attributes: analysis and extensions of RELIEF. Springer, Berlin, pp 171–182

  • Koru AG, Zhang D, Liu H (2007) Modeling the effect of size on defect proneness for open-source software. In: Proceedings of the third international workshop on predictor models in software engineering (PROMISE 2007). IEEE CS, Washington, pp 10–19

  • Koru AG, El Emam K, Zhang D, Liu H, Mathew D (2008) Theory of relative defect proneness. Empir Software Eng 13:473–498

  • Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496

  • Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Trans Softw Eng 34(2):287–300

  • Mende T (2010) Replication of defect prediction studies: problems, pitfalls and recommendations. In: Proceedings of the 6th international conference on predictive models in software engineering (PROMISE 2010). ACM, New York, pp 1–10

  • Mende T, Koschke R (2009) Revisiting the evaluation of defect prediction models. In: Proceedings of the 5th international conference on predictive models in software engineering (PROMISE 2009). ACM, New York, pp 1–10

  • Mende T, Koschke R (2010) Effort-aware defect prediction models. In: Proceedings of the 14th European conference on software maintenance and reengineering (CSMR 2010). IEEE CS, Washington, pp 109–118

  • Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13

  • Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17:375–407

  • Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th international conference on software engineering (ICSE 2008). ACM, New York, pp 181–190

  • Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391

  • Nagappan N, Ball T (2005a) Static analysis tools as early indicators of pre-release defect density. In: Proceedings of the 27th international conference on software engineering (ICSE 2005). ACM, New York, pp 580–586

  • Nagappan N, Ball T (2005b) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on software engineering (ICSE 2005). ACM, New York, pp 284–292

  • Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering (ICSE 2006). ACM, New York, pp 452–461

  • Nikora AP, Munson JC (2003) Developing fault predictors for evolving software systems. In: Proceedings of the 9th international symposium on software metrics (METRICS 2003). IEEE CS, Washington, pp 338–349

  • Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM conference on computer and communications security (CCS 2007). ACM, New York, pp 529–540

  • Ohlsson N, Alberg H (1996) Predicting fault-prone software modules in telephone switches. IEEE Trans Softw Eng 22(12):886–894

  • Ostrand TJ, Weyuker EJ (2002) The distribution of faults in a large industrial software system. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2002). ACM, New York, pp 55–64

  • Ostrand TJ, Weyuker EJ, Bell RM (2004) Where the bugs are. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2004). ACM, New York, pp 86–96

  • Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355

  • Ostrand TJ, Weyuker EJ, Bell RM (2007) Automating algorithms for the identification of fault-prone files. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2007). ACM, New York, pp 219–227

  • Pinzger M, Nagappan N, Murphy B (2008) Can developer-module networks predict failures? In: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE 2008). ACM, New York, pp 2–12

  • Robles G (2010) Replicating MSR: a study of the potential replicability of papers published in the mining software repositories proceedings. In: Proceedings of the 7th international working conference on mining software repositories (MSR 2010). IEEE CS, Washington, pp 171–180

  • Shin Y, Bell RM, Ostrand TJ, Weyuker EJ (2009) Does calling structure information improve the accuracy of fault prediction? In: Proceedings of the 7th international working conference on mining software repositories (MSR 2009). IEEE CS, Washington, pp 61–70

  • Sim SE, Easterbrook SM, Holt RC (2003) Using benchmarking to advance research: a challenge to software engineering. In: Proceedings of the 25th international conference on software engineering (ICSE 2003). IEEE CS, Washington, pp 74–83

  • Subramanyam R, Krishnan MS (2003) Empirical analysis of CK metrics for object-oriented design complexity: implications for software defects. IEEE Trans Softw Eng 29(4):297–310

  • Turhan B, Menzies T, Bener AB, Di Stefano JS (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Software Eng 14(5):540–578

  • Turhan B, Bener AB, Menzies T (2010) Regularities in learning defect predictors. In: Proceedings of the 11th international conference on product-focused software process improvement (PROFES 2010). Springer, Berlin, pp 116–130

  • Wolf T, Schröter A, Damian D, Nguyen THD (2009) Predicting build failures using social network analysis on developer communication. In: Proceedings of the 31st international conference on software engineering (ICSE 2009). IEEE CS, Washington, pp 1–11

  • Zimmermann T, Nagappan N (2008) Predicting defects using network analysis on dependency graphs. In: Proceedings of the 30th international conference on software engineering (ICSE 2008). ACM, New York, pp 531–540

  • Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: Proceedings of the 3rd international workshop on predictive models in software engineering (PROMISE 2007). IEEE CS, Washington, pp 9–15

  • Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE 2009). ACM, New York, pp 91–100

Acknowledgement

We acknowledge the financial support of the Swiss National Science Foundation for the project “SOSYA” (SNF Project No. 132175).

Author information

Corresponding author

Correspondence to Marco D’Ambros.

Additional information

Editors: Jim Whitehead and Tom Zimmermann

About this article

Cite this article

D’Ambros, M., Lanza, M. & Robbes, R. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Software Eng 17, 531–577 (2012). https://doi.org/10.1007/s10664-011-9173-9
