Abstract
Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity, and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches. We present a benchmark for defect prediction, in the form of a publicly available dataset consisting of several software systems, and provide an extensive comparison of well-known bug prediction approaches, together with novel approaches we devised. We evaluate the performance of the approaches using different performance indicators: classification of entities as defect-prone or not, and ranking of entities, with and without taking into account the effort needed to review an entity. We performed three sets of experiments aimed at (1) comparing the approaches across different systems, (2) testing whether the differences in performance are statistically significant, and (3) investigating the stability of approaches across different learners. Our results indicate that, while some approaches perform better than others in a statistically significant manner, external validity in defect prediction is still an open problem, as generalizing results to different contexts/learners proved to be a partially unsuccessful endeavor.
Notes
http://mdp.ivv.nasa.gov, also part of PROMISE.
We employ JUnit 3 naming conventions to detect test classes, i.e., classes whose names end with “Test” are detected as tests.
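The naming-convention check described in this note is a simple suffix test; the sketch below illustrates it (the class names are hypothetical, and the check mirrors only the convention stated above, not the paper's full detection pipeline):

```python
# A class is flagged as a test class if its name ends with "Test",
# following the JUnit 3 naming convention mentioned in the note.
def is_test_class(class_name: str) -> bool:
    return class_name.endswith("Test")

# Hypothetical class names for illustration.
classes = ["ArrayListTest", "HashMap", "ParserTest", "TestUtil"]
tests = [c for c in classes if is_test_class(c)]
# tests == ["ArrayListTest", "ParserTest"]; "TestUtil" is not flagged,
# since only the suffix is checked.
```

Note that a prefix-named class such as "TestUtil" is deliberately not flagged, as the convention matches suffixes only.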
Available at http://www.intooitus.com/.
Available at http://www.moosetechnology.org.
Available at http://churrasco.inf.usi.ch.
For instance, the churn of FanIn linearly decayed and churn of FanIn logarithmically decayed have a very high correlation.
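The high correlation between the two decayed variants can be illustrated with a small sketch. The weighting functions below (1/age for linear decay, 1/log(age+1) for logarithmic decay) and the churn histories are illustrative assumptions, not the paper's exact definitions:

```python
import math

def decayed_churn(samples, weight):
    # samples are ordered oldest-first; age 1 = most recent period.
    n = len(samples)
    return sum(v * weight(n - i) for i, v in enumerate(samples))

linear = lambda age: 1.0 / age            # assumed linear decay
logarithmic = lambda age: 1.0 / math.log(age + 1)  # assumed log decay

# Hypothetical churn-of-FanIn histories for a few classes.
histories = [[3, 0, 5, 2], [1, 1, 0, 7], [4, 2, 2, 2], [0, 6, 1, 3]]
xs = [decayed_churn(h, linear) for h in histories]
ys = [decayed_churn(h, logarithmic) for h in histories]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

r = pearson(xs, ys)  # close to 1: both weightings downweight old changes
```

Since both weighting functions are monotonically decreasing in age, the two variants rank and weight the same changes similarly, which is why their correlation is high and one of them is redundant as a feature.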
This is not in contradiction with Antoniol et al. (2008): Bugs mentioned as fixes in CVS comments are intuitively more likely to be real bugs, as they got fixed.
References
Antoniol G, Ayari K, Di Penta M, Khomh F, Guéhéneuc Y-G (2008) Is it a bug or an enhancement?: a text-based approach to classify change requests. In: Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds (CASCON 2008). ACM, New York, pp 304–318
Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. In: Proceedings of the 2006 ACM/IEEE international symposium on empirical software engineering (ISESE 2006). ACM, New York, pp 8–17
Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83(1):2–17
Arnaoudova V, Eshkevari L, Oliveto R, Gueheneuc Y-G, Antoniol G (2010) Physical and conceptual identifier dispersion: measures and relation to fault proneness. In: Proceedings of the 26th IEEE international conference on software maintenance (ICSM 2010). IEEE CS, Washington, pp 1–5
Bacchelli A, D’Ambros M, Lanza M (2010) Are popular classes more defect prone? In: Proceedings of the 13th international conference on fundamental approaches to software engineering (FASE 2010). Springer, Berlin, pp 59–73
Basili VR, Briand LC, Melo WL (1996) A validation of object-oriented design metrics as quality indicators. IEEE Trans Softw Eng 22(10):751–761
Bernstein A, Ekanayake J, Pinzger M (2007) Improving defect prediction using temporal features and non linear models. In: Proceedings of the ninth international workshop on principles of software evolution (IWPSE 2007). ACM, New York, pp 11–18
Binkley AB, Schach SR (1998) Validation of the coupling dependency metric as a predictor of run-time failures and maintenance measures. In: Proceedings of the 20th international conference on software engineering (ICSE 1998). IEEE CS, Washington, pp 452–455
Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: bias in bug-fix datasets. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE 2009). ACM, New York, pp 121–130
Briand LC, Daly JW, Wüst J (1999) A unified framework for coupling measurement in object-oriented systems. IEEE Trans Softw Eng 25(1):91–121
Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Softw Eng 20(6):476–493
D’Ambros M, Lanza M (2010) Distributed and collaborative software evolution analysis with churrasco. J Sci Comput Program (SCP) 75(4):276–287
D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches. In: Proceedings of the 7th international working conference on mining software repositories (MSR 2010). IEEE CS, Washington, pp 31–41
Demeyer S, Tichelaar S, Ducasse S (2001) FAMIX 2.1—The FAMOOS information exchange model. Technical report, University of Bern
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Ducasse S, Gîrba T, Nierstrasz O (2005) Moose: an agile reengineering environment. In: Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on foundations of software engineering (ESEC/FSE 2005). ACM, New York, pp 99–102. Tool demo
El Emam K, Melo W, Machado JC (2001) The prediction of faulty classes using object-oriented design metrics. J Syst Softw 56(1):63–75
Fenton NE, Ohlsson N (2000) Quantitative analysis of faults and failures in a complex software system. IEEE Trans Softw Eng 26(8):797–814
Fischer M, Pinzger M, Gall H (2003) Populating a release history database from version control and bug tracking systems. In: Proceedings of the international conference on software maintenance (ICSM 2003). IEEE CS, Washington, pp 23–32
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701
Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(07):653–661
Gyimóthy T, Ferenc R, Siket I (2005) Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Softw Eng 31(10):897–910
Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the seventeenth international conference on machine learning (ICML 2000). Morgan Kaufmann, San Mateo, pp 359–366
Hall M, Holmes G (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 15(6):1437–1447
Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering (ICSE 2009). IEEE CS, Washington, pp 78–88
Hassan AE, Holt RC (2005) The top ten list: dynamic fault prediction. In: Proceedings of the 21st IEEE international conference on software maintenance (ICSM 2005). IEEE CS, Washington, pp 263–272
Ho YC, Pepyne DL (2002) Simple explanation of the no-free-lunch theorem and its implications. J Optim Theory Appl 115(3):549–570
Jackson EJ (2003) A user's guide to principal components. Wiley, New York
Jiang Y, Cukic B, Ma Y (2008) Techniques for evaluating fault prediction models. Empir Software Eng 13:561–595
Juristo NJ, Vegas S (2009) Using differences among replications of software engineering experiments to gain knowledge. In: Proceedings of the 3rd international symposium on empirical software engineering and measurement (ESEM 2009). IEEE CS, Washington, pp 356–366
Kamei Y, Matsumoto S, Monden A, Matsumoto K-i, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort aware models. In: Proceedings of the 26th IEEE international conference on software maintenance (ICSM 2010). IEEE CS, Washington, pp 1–10
Kim S, Zimmermann T, Whitehead J, Zeller A (2007) Predicting faults from cached history. In: Proceedings of the 29th international conference on software engineering (ICSE 2007). IEEE CS, Washington, pp 489–498
Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on machine learning (ML 1992). Morgan Kaufmann, San Mateo, pp 249–256
Khoshgoftaar TM, Allen EB (1999) A comparative study of ordering and classification of fault-prone software modules. Empir Software Eng 4:159–186
Khoshgoftaar TM, Allen EB, Goel N, Nandi A, McMullan J (1996) Detection of software modules with high debug code churn in a very large legacy system. In: Proceedings of the seventh international symposium on software reliability engineering (ISSRE 1996). IEEE CS, Washington, pp 364–371
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324
Kollmann R, Selonen P, Stroulia E (2002) A study on the current state of the art in tool-supported UML-based static reverse engineering. In: Proceedings of the ninth working conference on reverse engineering (WCRE 2002). IEEE CS, Washington, pp 22–32
Kononenko I (1994) Estimating attributes: analysis and extensions of RELIEF. In: Proceedings of the European conference on machine learning (ECML 1994). Springer, Berlin, pp 171–182
Koru AG, Zhang D, Liu H (2007) Modeling the effect of size on defect proneness for open-source software. In: Proceedings of the third international workshop on predictor models in software engineering (PROMISE 2007). IEEE CS, Washington, pp 10–19
Koru AG, El Emam K, Zhang D, Liu H, Mathew D (2008) Theory of relative defect proneness. Empir Software Eng 13:473–498
Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496
Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Trans Softw Eng 34(2):287–300
Mende T (2010) Replication of defect prediction studies: problems, pitfalls and recommendations. In: Proceedings of the 6th international conference on predictive models in software engineering (PROMISE 2010). ACM, New York, pp 1–10
Mende T, Koschke R (2009) Revisiting the evaluation of defect prediction models. In: Proceedings of the 5th international conference on predictive models in software engineering (PROMISE 2009). ACM, New York, pp 1–10
Mende T, Koschke R (2010) Effort-aware defect prediction models. In: Proceedings of the 14th European conference on software maintenance and reengineering (CSMR 2010). IEEE CS, Washington, pp 109–118
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13
Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17:375–407
Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th international conference on software engineering (ICSE 2008). ACM, New York, pp 181–190
Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391
Nagappan N, Ball T (2005a) Static analysis tools as early indicators of pre-release defect density. In: Proceedings of the 27th international conference on software engineering (ICSE 2005). ACM, New York, pp 580–586
Nagappan N, Ball T (2005b) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on software engineering (ICSE 2005). ACM, New York, pp 284–292
Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering (ICSE 2006). ACM, New York, pp 452–461
Nikora AP, Munson JC (2003) Developing fault predictors for evolving software systems. In: Proceedings of the 9th international symposium on software metrics (METRICS 2003). IEEE CS, Washington, pp 338–349
Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM conference on computer and communications security (CCS 2007). ACM, New York, pp 529–540
Ohlsson N, Alberg H (1996) Predicting fault-prone software modules in telephone switches. IEEE Trans Softw Eng 22(12):886–894
Ostrand TJ, Weyuker EJ (2002) The distribution of faults in a large industrial software system. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2002). ACM, New York, pp 55–64
Ostrand TJ, Weyuker EJ, Bell RM (2004) Where the bugs are. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2004). ACM, New York, pp 86–96
Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355
Ostrand TJ, Weyuker EJ, Bell RM (2007) Automating algorithms for the identification of fault-prone files. In: Proceedings of the ACM SIGSOFT international symposium on software testing and analysis (ISSTA 2007). ACM, New York, pp 219–227
Pinzger M, Nagappan N, Murphy B (2008) Can developer-module networks predict failures? In: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE 2008). ACM, New York, pp 2–12
Robles G (2010) Replicating MSR: a study of the potential replicability of papers published in the mining software repositories proceedings. In: Proceedings of the 7th international working conference on mining software repositories (MSR 2010). IEEE CS, Washington, pp 171–180
Shin Y, Bell RM, Ostrand TJ, Weyuker EJ (2009) Does calling structure information improve the accuracy of fault prediction? In: Proceedings of the 7th international working conference on mining software repositories (MSR 2009). IEEE CS, Washington, pp 61–70
Sim SE, Easterbrook SM, Holt RC (2003) Using benchmarking to advance research: a challenge to software engineering. In: Proceedings of the 25th international conference on software engineering (ICSE 2003). IEEE CS, Washington, pp 74–83
Subramanyam R, Krishnan MS (2003) Empirical analysis of CK metrics for object-oriented design complexity: implications for software defects. IEEE Trans Softw Eng 29(4):297–310
Turhan B, Menzies T, Bener AB, Di Stefano JS (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Software Eng 14(5):540–578
Turhan B, Bener AB, Menzies T (2010) Regularities in learning defect predictors. In: Proceedings of the 11th international conference on product-focused software process improvement (PROFES 2010). Springer, Berlin, pp 116–130
Wolf T, Schröter A, Damian D, Nguyen THD (2009) Predicting build failures using social network analysis on developer communication. In: Proceedings of the 31st international conference on software engineering (ICSE 2009). IEEE CS, Washington, pp 1–11
Zimmermann T, Nagappan N (2008) Predicting defects using network analysis on dependency graphs. In: Proceedings of the 30th international conference on software engineering (ICSE 2008). ACM, New York, pp 531–540
Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: Proceedings of the 3rd international workshop on predictive models in software engineering (PROMISE 2007). IEEE CS, Washington, pp 9–15
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE 2009). ACM, New York, pp 91–100
Acknowledgement
We acknowledge the financial support of the Swiss National Science Foundation for the project “SOSYA” (SNF Project No. 132175).
Additional information
Editors: Jim Whitehead and Tom Zimmermann
D’Ambros, M., Lanza, M. & Robbes, R. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Software Eng 17, 531–577 (2012). https://doi.org/10.1007/s10664-011-9173-9