DConfusion: a technique to allow cross study performance evaluation of fault prediction studies

Abstract

There are many hundreds of fault prediction models published in the literature. The predictive performance of these models is often reported using a variety of different measures, most of which are not directly comparable. This lack of comparability means that it is often difficult to evaluate the performance of one model against another. Our aim is to present an approach that allows other researchers and practitioners to transform many performance measures back into a confusion matrix. Once performance is expressed as a confusion matrix, alternative preferred performance measures can then be derived. Our approach has enabled us to compare the performance of 600 models published in 42 studies. We demonstrate the application of our approach on 8 case studies, and discuss the advantages and implications of doing this.
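To make the transformation concrete, here is a minimal Python sketch (not the authors' DConfusion tool itself) that rebuilds a confusion matrix from a study reporting precision, recall, the number of code units n, and the proportion d of faulty units; all names and the worked numbers are illustrative assumptions:

```python
def confusion_from_precision_recall(n, d, precision, recall):
    """Rebuild a confusion matrix from reported precision and recall,
    given the data set size n and the proportion d of faulty units.

    TP/FP/FN/TN denote confusion-matrix cells (not the paper's
    fp = "fault prone" labelling). Counts are rounded to integers,
    so rounding in the reported measures propagates into the matrix.
    """
    pos = d * n                             # actually faulty units
    tp = recall * pos                       # recall = TP / (TP + FN)
    fp = tp * (1 - precision) / precision   # precision = TP / (TP + FP)
    fn = pos - tp
    tn = (n - pos) - fp
    return {k: round(v) for k, v in
            {"TP": tp, "FP": fp, "FN": fn, "TN": tn}.items()}

# Example: 1,000 units, 20% faulty, reported precision 0.70, recall 0.55
m = confusion_from_precision_recall(1000, 0.20, 0.70, 0.55)
f1 = 2 * m["TP"] / (2 * m["TP"] + m["FP"] + m["FN"])  # a derived measure
print(m, round(f1, 3))
```

From the rebuilt matrix, any preferred measure (the F-measure above, MCC, balance, etc.) can then be derived and compared across studies.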

Notes

  1. Definitions of particular measures are given in Sect. 2.

  2. Binary models are those predicting that code units (e.g. modules or classes) are either fault prone (fp) or not fault prone (nfp). Binary models do not predict the number of faults in code units. In this paper we restrict ourselves to considering only binary models that are based on machine learning techniques.

  3. Data imbalance also has serious implications for the training of prediction models. Discussion of this is beyond the scope of this work (instead see Gray et al. 2011; Zhang and Zhang 2007; Turhan et al. 2009; Batista et al. 2004 and Kamei et al. 2007).

  4. When this class distribution is not known, it is often possible to calculate the proportion of faulty units directly from the data set.

  5. This tool is available at: https://bugcatcher.stca.herts.ac.uk/DConfusion/.

  6. Full results can be found in the supporting material at https://bugcatcher.stca.herts.ac.uk/DConfusion/analysis.html.

  7. The use of percentage error is not completely correct: as p approaches 1, it becomes increasingly difficult for many of the performance measures to have a proportionally increasing error. It is therefore to be expected that p values close to 1 will have small perr values.
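For concreteness, a small sketch of how such a percentage error (perr) can be computed; the exact formula used in the paper is not reproduced here, so this form is an assumption:

```python
def perr(p_original, p_recomputed):
    """Assumed form of the percentage error between a measure value
    reported in a study and the value recomputed via the approach."""
    return abs(p_original - p_recomputed) / p_original * 100

# As p approaches 1, a recomputed value can overshoot by at most
# (1 - p), so the largest possible overestimation error shrinks:
for p in (0.50, 0.90, 0.99):
    print(p, round(perr(p, 1.0), 2))   # 100.0, 11.11, 1.01
```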

  8. The aim of this paper is to provide an approach by which others may perform comparative analysis of fault prediction models. A comparative analysis is complex and requires many factors to be taken into account, e.g. the aims of the predictive model. It is beyond the scope of this paper to provide a full comparative analysis of studies against each other.

  9. Now based at https://code.google.com/p/promisedata/.

References

  • Arisholm, E., Briand, L.C., Fuglerud, M.: Data mining techniques for building fault-proneness models in telecom Java software. In: The 18th IEEE International Symposium on Software Reliability, 2007, ISSRE ’07, pp. 215–224 (2007)

  • Arisholm, E., Briand, L.C., Johannessen, E.B.: A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J. Syst. Softw. 83(1), 2–17 (2010)

  • Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., Nielsen, H.: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5), 412–424 (2000)

  • Batista, G., Prati, R., Monard, M.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)

  • Bowes, D., Hall, T., Gray, D.: Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix. In: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, PROMISE ’12, New York, NY, USA, pp. 109–118. ACM, New York (2012)

  • Catal, C., Diri, B., Ozumut, B.: An artificial immune system approach for fault prediction in object-oriented software. In: 2nd International Conference on Dependability of Computer Systems, 2007, DepCoS-RELCOMEX ’07, pp. 238–245 (2007)

  • Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)

  • Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)

  • Cruzes, D.S., Dybå, T.: Research synthesis in software engineering: a tertiary study. Inf. Softw. Technol. 53, 440–455 (2011)

  • de Carvalho, A.B., Pozo, A., Vergilio, S., Lenz, A.: Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In: 20th IEEE International Conference on Tools with Artificial Intelligence, 2008, ICTAI ’08, vol. 2, pp. 387–394 (2008)

  • de Carvalho, A.B., Pozo, A., Vergilio, S.R.: A symbolic fault-prediction model based on multiobjective particle swarm optimization. J. Syst. Softw. 83(5), 868–882 (2010)

  • Denaro, G., Pezzè, M.: An empirical evaluation of fault-proneness models. In: Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, New York, NY, USA, pp. 241–251. ACM, New York (2002)

  • Elish, K., Elish, M.: Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81(5), 649–660 (2008)

  • Fenton, N., Neil, M.: A critique of software defect prediction models. IEEE Trans. Softw. Eng. 25(5), 675–689 (1999)

  • Forman, G., Scholz, M.: Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explor. Newsl. 12(1), 49–57 (2010)

  • Gray, D., Bowes, D., Davey, N., Sun, Y., Christianson, B.: Further thoughts on precision. In: Evaluation and Assessment in Software Engineering (EASE) (2011)

  • Gray, D., Bowes, D., Davey, N., Sun, Y., Christianson, B.: Reflections on the NASA MDP data sets. IET Softw. 6(6), 549–558 (2012)

  • Hall, T., Bowes, D.: The state of machine learning methodology in software fault prediction. In: International Conference on Machine Learning and Applications (ICMLA) (2012)

  • Hall, T., Beecham, S., Bowes, D., Gray, D., Counsell, S.: A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38(6), 1276–1304 (2012)

  • He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)

  • Jiang, Y., Cukic, B., Ma, Y.: Techniques for evaluating fault prediction models. Empir. Softw. Eng. 13(5), 561–595 (2008)

  • Kamei, Y., Monden, A., Matsumoto, S., Kakimoto, T., Matsumoto, K.: The effects of over and under sampling on fault-prone module detection. In: First International Symposium on Empirical Software Engineering and Measurement, 2007, ESEM 2007, pp. 196–204 (2007)

  • Kaur, A., Sandhu, P.S., Bra, A.S.: Early software fault prediction using real time defect data. In: Second International Conference on Machine Vision, 2009, ICMV ’09, pp. 242–245 (2009)

  • Khoshgoftaar, T., Seliya, N.: Comparative assessment of software quality classification techniques: an empirical case study. Empir. Softw. Eng. 9(3), 229–257 (2004)

  • Koru, A., Liu, H.: Building effective defect-prediction models in practice. IEEE Softw. 22(6), 23–29 (2005)

  • Kutlubay, O., Turhan, B., Bener, A.: A two-step model for defect density estimation. In: 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, 2007, pp. 322–332 (2007)

  • Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans. Softw. Eng. 34(4), 485–496 (2008)

  • Ma, Y., Guo, L., Cukic, B.: A statistical framework for the prediction of fault-proneness. In: Advances in Machine Learning Applications in Software Engineering, IGI Global, pp. 237–265 (2006)

  • Mende, T., Koschke, R.: Effort-aware defect prediction models. In: 14th European Conference on Software Maintenance and Reengineering (CSMR), 2010, pp. 107–116 (2010)

  • Menzies, T., Dekhtyar, A., Distefano, J., Greenwald, J.: Problems with precision: a response to “comments on ‘data mining static code attributes to learn defect predictors’ ”. IEEE Trans. Softw. Eng. 33(9), 637–640 (2007a)

  • Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 33(1), 2–13 (2007b)

  • Ostrand, T., Weyuker, E.: How to measure success of fault prediction models. In: Fourth International Workshop on Software Quality Assurance: in Conjunction with the 6th ESEC/FSE Joint Meeting, pp. 25–30. ACM, New York (2007)

  • Ostrand, T., Weyuker, E., Bell, R.: Where the bugs are. In: ACM SIGSOFT Software Engineering Notes, vol. 29, pp. 86–96. ACM, New York (2004)

  • Pai, G., Dugan, J.: Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Trans. Softw. Eng. 33(10), 675–686 (2007)

  • Pizzi, N., Summers, A., Pedrycz, W.: Software quality prediction using median-adjusted class labels. In: Proceedings of the 2002 International Joint Conference on Neural Networks, 2002, IJCNN ’02, vol. 3, pp. 2405–2409 (2002)

  • Seliya, N., Khoshgoftaar, T., Zhong, S.: Analyzing software quality with limited fault-proneness defect data. In: Ninth IEEE International Symposium on High-Assurance Systems Engineering, 2005, HASE 2005, pp. 89–98 (2005)

  • Sun, Y., Castellano, C., Robinson, M., Adams, R., Rust, A., Davey, N.: Using pre & post-processing methods to improve binding site predictions. Pattern Recognit. 42(9), 1949–1958 (2009)

  • Turhan, B., Kocak, G., Bener, A.: Data mining source code for locating software bugs: a case study in telecommunication industry. Expert Syst. Appl. 36(6), 9986–9990 (2009)

  • Wang, T., Li, W.: Naïve Bayes software defect prediction model. In: International Conference on Computational Intelligence and Software Engineering (CiSE), 2010, pp. 1–4 (2010). doi:10.1109/CISE.2010.5677057

  • Yi, L., Khoshgoftaar, T.M., Seliya, N.: Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans. Softw. Eng. 36(6), 852–864 (2010)

  • Zeller, A., Zimmermann, T., Bird, C.: Failure is a four-letter word: a parody in empirical research. In: Proceedings of the 7th International Conference on Predictive Models in Software Engineering, Promise ’11, New York, NY, USA, pp. 5:1–5:7. ACM, New York (2011)

  • Zhang, H., Zhang, X.: Comments on “data mining static code attributes to learn defect predictors”. IEEE Trans. Softw. Eng. 33(9), 635–637 (2007)

  • Zhou, Y., Leung, H.: Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans. Softw. Eng. 32(10), 771–789 (2006)

Acknowledgements

Many thanks to Prof. Barbara Kitchenham for double checking the equations used to re-compute the confusion matrix.

Author information

Corresponding author

Correspondence to David Bowes.

Appendices

Appendix A: List of papers from which we have re-computed confusion matrices

E. Arisholm, L. C. Briand, and M. Fuglerud. Data mining techniques for building fault-proneness models in telecom Java software. In The 18th IEEE International Symposium on Software Reliability Engineering (ISSRE ’07), pages 215–224, 2007.

E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2–17, 2010.

C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu. Putting it all together: Using socio-technical networks to predict failures. In 20th International Symposium on Software Reliability Engineering, pages 109–119. IEEE, 2009.

L. Briand, W. Melo, and J. Wust. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering, 28(7):706–720, 2002.

B. Caglayan, A. Bener, and S. Koch. Merits of using repository metrics in defect prediction for open source projects. In ICSE Workshop on FLOSS ’09, pages 31–36, 2009.

G. Calikli, A. Tosun, A. Bener, and M. Celik. The effect of granularity level on software defect prediction. In 24th International Symposium on Computer and Information Sciences, 2009. ISCIS 2009, pages 531–536, 2009.

C. Catal, B. Diri, and B. Ozumut. An artificial immune system approach for fault prediction in object-oriented software. In 2nd International Conference on Dependability of Computer Systems, 2007. DepCoS-RELCOMEX ’07, pages 238–245, 2007.

C. Cruz and A. Erika. Exploratory study of a UML metric for fault prediction. In Proceedings of the 32nd ACM/IEEE Intern Conf on Software Engineering, pages 361–364. 2010.

A. B. de Carvalho, A. Pozo, S. Vergilio, and A. Lenz. Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In 20th IEEE International Conference on Tools with Artificial Intelligence, 2008. ICTAI ’08, volume 2, pages 387–394, 2008.

A. B. de Carvalho, A. Pozo, and S. R. Vergilio. A symbolic fault-prediction model based on multiobjective particle swarm optimization. Journal of Systems and Software, 83(5):868–882, 2010.

G. Denaro and M. Pezzè. An empirical evaluation of fault-proneness models. In Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, pages 241–251, NY, USA, 2002. ACM.

L. Guo, Y. Ma, B. Cukic, and H. Singh. Robust prediction of fault-proneness by random forests. In 15th International Symposium on Software Reliability Engineering, 2004. ISSRE 2004, pages 417–428, 2004.

T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, 2005.

Z. Hongyu. An investigation of the relationships between lines of code and defects. In IEEE Intern Conf on Software Maintenance, 2009, pages 274–283, 2009.

Y. Jiang, B. Cukic, and Y. Ma. Techniques for evaluating fault prediction models. Empirical Software Engineering, 13(5):561–595, 2008.

S. Kanmani, V. Uthariaraj, V. Sankaranarayanan, and P. Thambidurai. Object-oriented software fault prediction using neural networks. Information and Software Technology, 49(5):483–492, 2007.

A. Kaur and R. Malhotra. Application of random forest in predicting fault-prone classes. In Internl Conf on Advanced Computer Theory and Engineering, 2008, 37–43, 2008.

A. Kaur, P. S. Sandhu, and A. S. Bra. Early software fault prediction using real time defect data. In Intern Conf on Machine Vision, 2009, pages 242–245.

T. Khoshgoftaar and N. Seliya. Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering, 9(3):229–257, 2004.

T. Khoshgoftaar, X. Yuan, E. Allen, W. Jones, and J. Hudepohl. Uncertain classification of fault-prone software modules. Empirical Software Engineering, 7(4):297–318, 2002.

A. Koru and H. Liu. Building effective defect-prediction models in practice. Software, IEEE, 22(6):23–29, 2005.

O. Kutlubay, B. Turhan, and A. Bener. A two-step model for defect density estimation. In 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, 2007, pages 322–332, 2007.

Y. Ma, L. Guo, and B. Cukic. Advances in Machine Learning Applications in Software Engineering, chapter A statistical framework for the prediction of fault-proneness, pages 237–265. IGI Global, 2006.

T. Mende and R. Koschke. Effort-aware defect prediction models. In 14th European Conference on Software Maintenance and Reengineering (CSMR), 2010, pages 107–116, 2010.

T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, 2007.

O. Mizuno, S. Ikami, S. Nakaichi, and T. Kikuno. Spam filter based approach for finding fault-prone software modules. In International Workshop on Mining Software Repositories, 2007. ICSE ’07, page 4, 2007.

O. Mizuno and T. Kikuno. Training on errors experiment to detect fault-prone software modules by spam filter. In Procs European Software Engineering Conf and the ACM SIGSOFT symp on The foundations of software engineering, ESEC-FSE ’07, pages 405–414, 2007. ACM.

R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In ACM/IEEE 30th Intern Conf on Software Engineering, 2008. ICSE ’08, pages 181–190.

N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Murphy. Change bursts as defect predictors. In 2010 IEEE 21st International Symposium on Software Reliability Engineering (ISSRE), pages 309–318.

A. Nugroho, M. R. V. Chaudron, and E. Arisholm. Assessing UML design metrics for predicting fault-prone classes in a Java system. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR), pages 21–30.

G. Pai and J. Dugan. Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Trans on Software Engineering, 33(10):675–686, 2007.

A. Schröter, T. Zimmermann, and A. Zeller. Predicting component failures at design time. In Proceedings of the 2006 ACM/IEEE international symposium on Empirical software engineering, pages 18–27. ACM, 2006.

N. Seliya, T. Khoshgoftaar, and S. Zhong. Analyzing software quality with limited fault-proneness defect data. In IEEE Internl Symp on High-Assurance Systems Engineering, 2005, pages 89–98, 2005.

S. Shivaji, E. J. Whitehead, R. Akella, and K. Sunghun. Reducing features to improve bug prediction. In 24th IEEE/ACM International Conference on Automated Software Engineering, 2009. ASE ’09, pages 600–604.

Y. Singh, A. Kaur, and R. Malhotra. Predicting software fault proneness model using neural network. Product-Focused Software Process Improvement, 5089:204–214, 2008.

A. Tosun and A. Bener. Reducing false alarms in software defect prediction by decision threshold optimization. In International Symposium on Empirical Software Engineering and Measurement, ESEM 2009, pages 477–480.

B. Turhan and A. Bener. A multivariate analysis of static code attributes for defect prediction. In Intern Conf on Quality Software, 2007, pages 231–237, 2007.

O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer, and R. Haesen. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software, 81(5):823–839, 2008.

R. Vivanco, Y. Kamei, A. Monden, K. Matsumoto, and D. Jin. Using search-based metric selection and oversampling to predict fault prone modules. In Canadian Conf on Electrical and Computer Engineering, 2010, pages 1–6.

L. Yi, T. M. Khoshgoftaar, and N. Seliya. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans on Software Engineering, 36(6):852–864, 2010.

Y. Zhou and H. Leung. Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans on Software Engineering, 32(10):771–789, 2006.

T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07, page 9, 2007.

Appendix B: Fault prediction experiment algorithm

Algorithm 1 (presented as a figure in the original article): Generating the prediction data using a 10×10-fold stratified cross-validation experiment
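Algorithm 1 itself is given as a figure in the original article. As a rough sketch of the same experimental setup, a 10×10-fold stratified cross-validation can be run as follows; the use of scikit-learn and a decision-tree learner are assumptions, not the paper's prescribed implementation:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def generate_prediction_data(X, y, learner=None):
    """10x10-fold stratified cross-validation over NumPy arrays X, y:
    ten repeats of a 10-fold split that preserves the fp/nfp class
    ratio in every fold. Each held-out fold yields one confusion
    matrix (100 in total)."""
    if learner is None:
        learner = DecisionTreeClassifier(random_state=0)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    matrices = []
    for train_idx, test_idx in cv.split(X, y):
        learner.fit(X[train_idx], y[train_idx])
        predicted = learner.predict(X[test_idx])
        # rows = actual (nfp, fp), columns = predicted (nfp, fp)
        matrices.append(confusion_matrix(y[test_idx], predicted, labels=[0, 1]))
    return matrices
```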

Appendix C: Re-computing the confusion matrix algorithm

Algorithm 2 (presented as a figure in the original article): Evaluating the conversion from performance measure to a confusion matrix and then to other performance measures
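Algorithm 2 is likewise given as a figure. A minimal sketch of the round-trip evaluation it describes, reusing the confusion_from_precision_recall and perr helpers sketched earlier; treating precision and recall as the reported measures is an assumption:

```python
def derive_measures(tp, fp, fn, tn):
    """Standard measures derived from a confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f-measure": f_measure}

def round_trip_error(tp, fp, fn, tn):
    """True matrix -> reported measures -> reconstructed matrix ->
    recomputed measures, then the percentage error per measure."""
    n = tp + fp + fn + tn
    d = (tp + fn) / n                       # proportion of faulty units
    reported = derive_measures(tp, fp, fn, tn)
    rebuilt = confusion_from_precision_recall(
        n, d, reported["precision"], reported["recall"])
    recomputed = derive_measures(rebuilt["TP"], rebuilt["FP"],
                                 rebuilt["FN"], rebuilt["TN"])
    return {k: perr(reported[k], recomputed[k]) for k in reported}

# e.g. round_trip_error(110, 47, 90, 753) -> near-zero perr values,
# with any residual error coming from rounding counts to integers
```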

Cite this article

Bowes, D., Hall, T. & Gray, D. DConfusion: a technique to allow cross study performance evaluation of fault prediction studies. Autom Softw Eng 21, 287–313 (2014). https://doi.org/10.1007/s10515-013-0129-8

