Many hundreds of fault prediction models have been published in the literature. The predictive performance of these models is often reported using a variety of different measures, most of which are not directly comparable. This lack of comparability means that it is often difficult to evaluate the performance of one model against another. Our aim is to present an approach that allows researchers and practitioners to transform many performance measures back into a confusion matrix. Once performance is expressed as a confusion matrix, alternative preferred performance measures can then be derived. Our approach has enabled us to compare the performance of 600 models published in 42 studies. We demonstrate the application of our approach on 8 case studies, and discuss the advantages and implications of doing this.
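The core transformation described above can be sketched in a few lines. Assuming a study reports recall, precision, the total number of code units n, and the proportion of faulty units d, the four confusion-matrix cells can be recovered; the function name and rounding strategy below are illustrative, not the paper's DConfusion tool itself:

```python
def recompute_confusion_matrix(recall, precision, n, d):
    """Recover (TP, FP, FN, TN) from recall, precision, dataset size n,
    and the proportion of faulty units d. Illustrative sketch only."""
    faulty = d * n                            # total actually faulty units
    tp = recall * faulty                      # recall = TP / (TP + FN)
    fn = faulty - tp
    fp = tp * (1 - precision) / precision     # precision = TP / (TP + FP)
    tn = n - tp - fn - fp
    # Cell counts must be whole numbers, so round to the nearest integer.
    return tuple(round(x) for x in (tp, fp, fn, tn))

# Example: 1000 units, 20% faulty, reported recall 0.7 and precision 0.5
tp, fp, fn, tn = recompute_confusion_matrix(0.7, 0.5, 1000, 0.2)
print(tp, fp, fn, tn)  # 140 140 60 660
```

Different pairs of reported measures require different rearrangements, but the principle is the same: each measure is an equation over TP, FP, FN and TN, and enough independent equations pin down all four cells.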
Definitions of particular measures are given in Sect. 2.
Binary models are those predicting that code units (e.g. modules or classes) are either fault prone (fp) or not fault prone (nfp). Binary models do not predict the number of faults in code units. In this paper we restrict ourselves to considering only binary models that are based on machine learning techniques.
When this class distribution is not reported, it is often possible to calculate the proportion of faulty units directly from the data set.
This tool is available at: https://bugcatcher.stca.herts.ac.uk/DConfusion/.
Full results can be found in the supporting material at https://bugcatcher.stca.herts.ac.uk/DConfusion/analysis.html.
The use of percentage error is not entirely appropriate here: as p approaches 1, many of the performance measures leave increasingly little room for a proportionally large error, so p values close to 1 can be expected to have small perr values.
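The effect this footnote describes can be illustrated numerically. Assuming perr is defined as the absolute difference between the recomputed and reported values, as a percentage of the reported value (an assumption for illustration; the paper may normalise differently), a measure bounded above by 1 caps the worst-case perr ever more tightly as p approaches 1:

```python
def perr(reported, recomputed):
    """Percentage error of a recomputed measure against the reported one
    (assumed definition, for illustration only)."""
    return abs(recomputed - reported) / reported * 100

# For a measure bounded by 1, the worst-case recomputed value is 1.0,
# so the maximum possible perr shrinks as the reported value nears 1.
for p in (0.5, 0.9, 0.99):
    print(f"p = {p}: max perr = {perr(p, 1.0):.1f}%")
```

At p = 0.5 the maximum possible perr is 100%, at p = 0.9 it is about 11%, and at p = 0.99 only about 1%, which is why small perr values near p = 1 carry little information.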
The aim of this paper is to provide an approach by which others may perform comparative analysis of fault prediction models. A comparative analysis is complex and requires many factors to be taken into account, e.g. the aims of the predictive model. It is beyond the scope of this paper to provide a full comparative analysis of studies against each other.
Now based at https://code.google.com/p/promisedata/.
Arisholm, E., Briand, L.C., Fuglerud, M.: Data mining techniques for building fault-proneness models in telecom Java software. In: The 18th IEEE International Symposium on Software Reliability, 2007, ISSRE ’07, pp. 215–224 (2007)
Arisholm, E., Briand, L.C., Johannessen, E.B.: A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J. Syst. Softw. 83(1), 2–17 (2010)
Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., Nielsen, H.: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5), 412–424 (2000)
Batista, G., Prati, R., Monard, M.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)
Bowes, D., Hall, T., Gray, D.: Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix. In: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, PROMISE ’12, New York, NY, USA, pp. 109–118. ACM, New York (2012)
Catal, C., Diri, B., Ozumut, B.: An artificial immune system approach for fault prediction in object-oriented software. In: 2nd International Conference on Dependability of Computer Systems, 2007, DepCoS-RELCOMEX ’07, pp. 238–245 (2007)
Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
Cruzes, D.S., Dybå, T.: Research synthesis in software engineering: a tertiary study. Inf. Softw. Technol. 53, 440–455 (2011)
de Carvalho, A.B., Pozo, A., Vergilio, S., Lenz, A.: Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In: 20th IEEE International Conference on Tools with Artificial Intelligence, 2008, ICTAI ’08, vol. 2, pp. 387–394 (2008)
de Carvalho, A.B., Pozo, A., Vergilio, S.R.: A symbolic fault-prediction model based on multiobjective particle swarm optimization. J. Syst. Softw. 83(5), 868–882 (2010)
Denaro, G., Pezzè, M.: An empirical evaluation of fault-proneness models. In: Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, New York, NY, USA, pp. 241–251. ACM, New York (2002)
Elish, K., Elish, M.: Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81(5), 649–660 (2008)
Fenton, N., Neil, M.: A critique of software defect prediction models. IEEE Trans. Softw. Eng. 25(5), 675–689 (1999)
Forman, G., Scholz, M.: Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explor. Newsl. 12(1), 49–57 (2010)
Gray, D., Bowes, D., Davey, N., Sun, Y., Christianson, B.: Further thoughts on precision. In: Evaluation and Assessment in Software Engineering (EASE) (2011)
Gray, D., Bowes, D., Davey, N., Sun, Y., Christianson, B.: Reflections on the NASA MDP data sets. IET Softw. 6(6), 549–558 (2012)
Hall, T., Bowes, D.: The state of machine learning methodology in software fault prediction. In: International Conference on Machine Learning and Applications (ICMLA) (2012)
Hall, T., Beecham, S., Bowes, D., Gray, D., Counsell, S.: A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38(6), 1276–1304 (2012)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Jiang, Y., Cukic, B., Ma, Y.: Techniques for evaluating fault prediction models. Empir. Softw. Eng. 13(5), 561–595 (2008)
Kamei, Y., Monden, A., Matsumoto, S., Kakimoto, T., Matsumoto, K.: The effects of over and under sampling on fault-prone module detection. In: First International Symposium on Empirical Software Engineering and Measurement, 2007, ESEM 2007, pp. 196–204 (2007)
Kaur, A., Sandhu, P.S., Bra, A.S.: Early software fault prediction using real time defect data. In: Second International Conference on Machine Vision, 2009, ICMV ’09, pp. 242–245 (2009)
Khoshgoftaar, T., Seliya, N.: Comparative assessment of software quality classification techniques: an empirical case study. Empir. Softw. Eng. 9(3), 229–257 (2004)
Koru, A., Liu, H.: Building effective defect-prediction models in practice. IEEE Softw. 22(6), 23–29 (2005)
Kutlubay, O., Turhan, B., Bener, A.: A two-step model for defect density estimation. In: 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, 2007, pp. 322–332 (2007)
Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans. Softw. Eng. 34(4), 485–496 (2008)
Ma, Y., Guo, L., Cukic, B.: A statistical framework for the prediction of fault-proneness. In: Advances in Machine Learning Applications in Software Engineering, IGI Global, pp. 237–265 (2006)
Mende, T., Koschke, R.: Effort-aware defect prediction models. In: 14th European Conference on Software Maintenance and Reengineering (CSMR), 2010, pp. 107–116 (2010)
Menzies, T., Dekhtyar, A., Distefano, J., Greenwald, J.: Problems with precision: a response to “comments on ‘data mining static code attributes to learn defect predictors’ ”. IEEE Trans. Softw. Eng. 33(9), 637–640 (2007a)
Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 33(1), 2–13 (2007b)
Ostrand, T., Weyuker, E.: How to measure success of fault prediction models. In: Fourth International Workshop on Software Quality Assurance: in Conjunction with the 6th ESEC/FSE Joint Meeting, pp. 25–30. ACM, New York (2007)
Ostrand, T., Weyuker, E., Bell, R.: Where the bugs are. In: ACM SIGSOFT Software Engineering Notes, vol. 29, pp. 86–96. ACM, New York (2004)
Pai, G., Dugan, J.: Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Trans. Softw. Eng. 33(10), 675–686 (2007)
Pizzi, N., Summers, A., Pedrycz, W.: Software quality prediction using median-adjusted class labels. In: Proceedings of the 2002 International Joint Conference on Neural Networks, 2002, IJCNN ’02, vol. 3, pp. 2405–2409 (2002)
Seliya, N., Khoshgoftaar, T., Zhong, S.: Analyzing software quality with limited fault-proneness defect data. In: Ninth IEEE International Symposium on High-Assurance Systems Engineering, 2005, HASE 2005, pp. 89–98 (2005)
Sun, Y., Castellano, C., Robinson, M., Adams, R., Rust, A., Davey, N.: Using pre & post-processing methods to improve binding site predictions. Pattern Recognit. 42(9), 1949–1958 (2009)
Turhan, B., Kocak, G., Bener, A.: Data mining source code for locating software bugs: a case study in telecommunication industry. Expert Syst. Appl. 36(6), 9986–9990 (2009)
Wang, T., Li, W.: Naïve Bayes software defect prediction model. In: International Conference on Computational Intelligence and Software Engineering (CiSE), 2010, pp. 1–4 (2010). doi:10.1109/CISE.2010.5677057
Yi, L., Khoshgoftaar, T.M., Seliya, N.: Evolutionary optimization of software quality modelling with multiple repositories. IEEE Trans. Softw. Eng. 36(6), 852–864 (2010)
Zeller, A., Zimmermann, T., Bird, C.: Failure is a four-letter word: a parody in empirical research. In: Proceedings of the 7th International Conference on Predictive Models in Software Engineering, Promise ’11, New York, NY, USA, pp. 5:1–5:7. ACM, New York (2011)
Zhang, H., Zhang, X.: Comments on “data mining static code attributes to learn defect predictors”. IEEE Trans. Softw. Eng. 33(9), 635–637 (2007)
Zhou, Y., Leung, H.: Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans. Softw. Eng. 32(10), 771–789 (2006)
Many thanks to Prof. Barbara Kitchenham for double checking the equations used to re-compute the confusion matrix.
Appendix A: List of papers from which we have re-computed confusion matrices
E. Arisholm, L. C. Briand, and M. Fuglerud. Data mining techniques for building fault-proneness models in telecom Java software. In The 18th IEEE International Symposium on Software Reliability Engineering (ISSRE ’07), pages 215–224, 2007.
E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2–17, 2010.
C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu. Putting it all together: Using socio-technical networks to predict failures. In 20th International Symposium on Software Reliability Engineering, pages 109–119. IEEE, 2009.
L. Briand, W. Melo, and J. Wust. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering, 28(7):706–720, 2002.
B. Caglayan, A. Bener, and S. Koch. Merits of using repository metrics in defect prediction for open source projects. In ICSE Workshop on FLOSS ’09, pages 31–36, 2009.
G. Calikli, A. Tosun, A. Bener, and M. Celik. The effect of granularity level on software defect prediction. In 24th International Symposium on Computer and Information Sciences, 2009. ISCIS 2009, pages 531–536, 2009.
C. Catal, B. Diri, and B. Ozumut. An artificial immune system approach for fault prediction in object-oriented software. In 2nd International Conference on Dependability of Computer Systems, 2007. DepCoS-RELCOMEX ’07, pages 238–245, 2007.
C. Cruz and A. Erika. Exploratory study of a UML metric for fault prediction. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, pages 361–364, 2010.
A. B. de Carvalho, A. Pozo, S. Vergilio, and A. Lenz. Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In 20th IEEE International Conference on Tools with Artificial Intelligence, 2008. ICTAI ’08, volume 2, pages 387–394, 2008.
A. B. de Carvalho, A. Pozo, and S. R. Vergilio. A symbolic fault-prediction model based on multiobjective particle swarm optimization. J of Sys & Soft, 83(5):868–882, 2010.
G. Denaro and M. Pezzè. An empirical evaluation of fault-proneness models. In Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, pages 241–251, NY, USA, 2002. ACM.
L. Guo, Y. Ma, B. Cukic, and H. Singh. Robust prediction of fault-proneness by random forests. In 15th International Symposium on Software Reliability Engineering, 2004. ISSRE 2004, pages 417–428, 2004.
T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, 2005.
H. Zhang. An investigation of the relationships between lines of code and defects. In IEEE International Conference on Software Maintenance, 2009, pages 274–283, 2009.
Y. Jiang, B. Cukic, and Y. Ma. Techniques for evaluating fault prediction models. Empirical Software Engineering, 13(5):561–595, 2008.
S. Kanmani, V. Uthariaraj, V. Sankaranarayanan, and P. Thambidurai. Object-oriented software fault prediction using neural networks. Information and Software Technology, 49(5):483–492, 2007.
A. Kaur and R. Malhotra. Application of random forest in predicting fault-prone classes. In International Conference on Advanced Computer Theory and Engineering, 2008, pages 37–43, 2008.
A. Kaur, P. S. Sandhu, and A. S. Bra. Early software fault prediction using real time defect data. In Intern Conf on Machine Vision, 2009, pages 242–245.
T. Khoshgoftaar and N. Seliya. Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering, 9(3):229–257, 2004.
T. Khoshgoftaar, X. Yuan, E. Allen, W. Jones, and J. Hudepohl. Uncertain classification of fault-prone software modules. Empirical Software Engineering, 7(4):297–318, 2002.
A. Koru and H. Liu. Building effective defect-prediction models in practice. Software, IEEE, 22(6):23–29, 2005.
O. Kutlubay, B. Turhan, and A. Bener. A two-step model for defect density estimation. In 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, 2007, pages 322–332, 2007.
Y. Ma, L. Guo, and B. Cukic. Advances in Machine Learning Applications in Software Engineering, chapter A statistical framework for the prediction of fault-proneness, pages 237–265. IGI Global, 2006.
T. Mende and R. Koschke. Effort-aware defect prediction models. In 14th European Conference on Software Maintenance and Reengineering (CSMR), 2010, pages 107–116, 2010.
T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, 2007.
O. Mizuno, S. Ikami, S. Nakaichi, and T. Kikuno. Spam filter based approach for finding fault-prone software modules. In International Workshop on Mining Software Repositories, 2007. ICSE ’07, page 4, 2007.
O. Mizuno and T. Kikuno. Training on errors experiment to detect fault-prone software modules by spam filter. In Procs European Software Engineering Conf and the ACM SIGSOFT symp on The foundations of software engineering, ESEC-FSE ’07, pages 405–414, 2007. ACM.
R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In ACM/IEEE 30th Intern Conf on Software Engineering, 2008. ICSE ’08, pages 181–190.
N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Murphy. Change bursts as defect predictors. In 2010 IEEE 21st International Symposium on Software Reliability Engineering (ISSRE), pages 309–318.
A. Nugroho, M. R. V. Chaudron, and E. Arisholm. Assessing UML design metrics for predicting fault-prone classes in a Java system. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR), pages 21–30.
G. Pai and J. Dugan. Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Trans on Software Engineering, 33(10):675–686, 2007.
A. Schröter, T. Zimmermann, and A. Zeller. Predicting component failures at design time. In Proceedings of the 2006 ACM/IEEE international symposium on Empirical software engineering, pages 18–27. ACM, 2006.
N. Seliya, T. Khoshgoftaar, and S. Zhong. Analyzing software quality with limited fault-proneness defect data. In IEEE Internl Symp on High-Assurance Systems Engineering, 2005, pages 89–98, 2005.
S. Shivaji, E. J. Whitehead, R. Akella, and S. Kim. Reducing features to improve bug prediction. In 24th IEEE/ACM International Conference on Automated Software Engineering, 2009. ASE ’09, pages 600–604.
Y. Singh, A. Kaur, and R. Malhotra. Predicting software fault proneness model using neural network. Product-Focused Software Process Improvement, 5089:204–214, 2008.
A. Tosun and A. Bener. Reducing false alarms in software defect prediction by decision threshold optimization. In International Symposium on Empirical Software Engineering and Measurement, ESEM 2009, pages 477–480.
B. Turhan and A. Bener. A multivariate analysis of static code attributes for defect prediction. In Intern Conf on Quality Software, 2007, pages 231–237, 2007.
O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer, and R. Haesen. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software, 81(5):823–839, 2008.
R. Vivanco, Y. Kamei, A. Monden, K. Matsumoto, and D. Jin. Using search-based metric selection and oversampling to predict fault prone modules. In Canadian Conf on Electrical and Computer Engineering, 2010, pages 1–6.
L. Yi, T. M. Khoshgoftaar, and N. Seliya. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Transactions on Software Engineering, 36(6):852–864, 2010.
Y. Zhou and H. Leung. Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans on Software Engineering, 32(10):771–789, 2006.
T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07, page 9, 2007.
Appendix B: Fault prediction experiment algorithm
Appendix C: Re-computing the confusion matrix algorithm
Bowes, D., Hall, T. & Gray, D. DConfusion: a technique to allow cross study performance evaluation of fault prediction studies. Autom Softw Eng 21, 287–313 (2014). https://doi.org/10.1007/s10515-013-0129-8
Keywords: Confusion matrix · Machine learning