DConfusion: a technique to allow cross study performance evaluation of fault prediction studies

Abstract

There are many hundreds of fault prediction models published in the literature. The predictive performance of these models is often reported using a variety of different measures, most of which are not directly comparable. This lack of comparability means that it is often difficult to evaluate the performance of one model against another. Our aim is to present an approach that allows other researchers and practitioners to transform many performance measures back into a confusion matrix. Once performance is expressed as a confusion matrix, alternative preferred performance measures can then be derived. Our approach has enabled us to compare the performance of 600 models published in 42 studies. We demonstrate the application of our approach on 8 case studies, and discuss the advantages and implications of doing this.
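To make the transformation concrete, here is a minimal Python sketch (not the authors' DConfusion tool itself) that rebuilds a confusion matrix from a study reporting precision, recall, the number of code units n, and the proportion d of faulty units; all names and the worked numbers are illustrative assumptions:

```python
def confusion_from_precision_recall(n, d, precision, recall):
    """Rebuild a confusion matrix from reported precision and recall,
    given the data set size n and the proportion d of faulty units.

    TP/FP/FN/TN denote confusion-matrix cells (not the paper's
    fp = "fault prone" labelling). Counts are rounded to integers,
    so rounding in the reported measures propagates into the matrix.
    """
    pos = d * n                             # actually faulty units
    tp = recall * pos                       # recall = TP / (TP + FN)
    fp = tp * (1 - precision) / precision   # precision = TP / (TP + FP)
    fn = pos - tp
    tn = (n - pos) - fp
    return {k: round(v) for k, v in
            {"TP": tp, "FP": fp, "FN": fn, "TN": tn}.items()}

# Example: 1,000 units, 20% faulty, reported precision 0.70, recall 0.55
m = confusion_from_precision_recall(1000, 0.20, 0.70, 0.55)
f1 = 2 * m["TP"] / (2 * m["TP"] + m["FP"] + m["FN"])  # a derived measure
print(m, round(f1, 3))
```

From the rebuilt matrix, any preferred measure (the F-measure above, MCC, balance, etc.) can then be derived and compared across studies.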

Notes

  1. Definitions of particular measures are given in Sect. 2.

  2. Binary models are those predicting that code units (e.g. modules or classes) are either fault prone (fp) or not fault prone (nfp). Binary models do not predict the number of faults in code units. In this paper we restrict ourselves to considering only binary models that are based on machine learning techniques.

  3. Data imbalance also has serious implications for the training of prediction models. Discussion of this is beyond the scope of this work (instead see Gray et al. 2011; Zhang and Zhang 2007; Turhan et al. 2009; Batista et al. 2004 and Kamei et al. 2007).

  4. When this class distribution is not known, it is often possible to calculate the proportion of faulty units directly from the data set.

  5. This tool is available at: https://bugcatcher.stca.herts.ac.uk/DConfusion/.

  6. Full results can be found in the supporting material at https://bugcatcher.stca.herts.ac.uk/DConfusion/analysis.html.

  7. The use of percentage error is not completely correct: as p approaches 1, it becomes increasingly difficult for many of the performance measures to have a proportionally increasing error. It is therefore to be expected that p values close to 1 will have small perr values.
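For concreteness, a small sketch of how such a percentage error (perr) can be computed; the exact formula used in the paper is not reproduced here, so this form is an assumption:

```python
def perr(p_original, p_recomputed):
    """Assumed form of the percentage error between a measure value
    reported in a study and the value recomputed via the approach."""
    return abs(p_original - p_recomputed) / p_original * 100

# As p approaches 1, a recomputed value can overshoot by at most
# (1 - p), so the largest possible overestimation error shrinks:
for p in (0.50, 0.90, 0.99):
    print(p, round(perr(p, 1.0), 2))   # 100.0, 11.11, 1.01
```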

  8. The aim of this paper is to provide an approach by which others may perform comparative analysis of fault prediction models. A comparative analysis is complex and requires many factors to be taken into account, e.g. the aims of the predictive model. It is beyond the scope of this paper to provide a full comparative analysis of studies against each other.

  9. Now based at https://code.google.com/p/promisedata/.

References

  • Arisholm, E., Briand, L.C., Fuglerud, M.: Data mining techniques for building fault-proneness models in telecom Java software. In: The 18th IEEE International Symposium on Software Reliability, 2007, ISSRE ’07, pp. 215–224 (2007)

  • Arisholm, E., Briand, L.C., Johannessen, E.B.: A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J. Syst. Softw. 83(1), 2–17 (2010)

  • Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., Nielsen, H.: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5), 412–424 (2000)

  • Batista, G., Prati, R., Monard, M.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)

  • Bowes, D., Hall, T., Gray, D.: Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix. In: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, PROMISE ’12, New York, NY, USA, pp. 109–118. ACM, New York (2012)

  • Catal, C., Diri, B., Ozumut, B.: An artificial immune system approach for fault prediction in object-oriented software. In: 2nd International Conference on Dependability of Computer Systems, 2007, DepCoS-RELCOMEX ’07, pp. 238–245 (2007)

  • Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)

  • Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)

  • Cruzes, D.S., Dybå, T.: Research synthesis in software engineering: a tertiary study. Inf. Softw. Technol. 53, 440–455 (2011)

  • de Carvalho, A.B., Pozo, A., Vergilio, S., Lenz, A.: Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In: 20th IEEE International Conference on Tools with Artificial Intelligence, 2008, ICTAI ’08, vol. 2, pp. 387–394 (2008)

  • de Carvalho, A.B., Pozo, A., Vergilio, S.R.: A symbolic fault-prediction model based on multiobjective particle swarm optimization. J. Syst. Softw. 83(5), 868–882 (2010)

  • Denaro, G., Pezzè, M.: An empirical evaluation of fault-proneness models. In: Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, New York, NY, USA, pp. 241–251. ACM, New York (2002)

  • Elish, K., Elish, M.: Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81(5), 649–660 (2008)

  • Fenton, N., Neil, M.: A critique of software defect prediction models. IEEE Trans. Softw. Eng. 25(5), 675–689 (1999)

  • Forman, G., Scholz, M.: Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explor. Newsl. 12(1), 49–57 (2010)

  • Gray, D., Bowes, D., Davey, N., Sun, Y., Christianson, B.: Further thoughts on precision. In: Evaluation and Assessment in Software Engineering (EASE) (2011)

  • Gray, D., Bowes, D., Davey, N., Sun, Y., Christianson, B.: Reflections on the NASA MDP data sets. IET Softw. 6(6), 549–558 (2012)

  • Hall, T., Bowes, D.: The state of machine learning methodology in software fault prediction. In: International Conference on Machine Learning and Applications (ICMLA) (2012)

  • Hall, T., Beecham, S., Bowes, D., Gray, D., Counsell, S.: A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38(6), 1276–1304 (2012)

  • He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)

  • Jiang, Y., Cukic, B., Ma, Y.: Techniques for evaluating fault prediction models. Empir. Softw. Eng. 13(5), 561–595 (2008)

  • Kamei, Y., Monden, A., Matsumoto, S., Kakimoto, T., Matsumoto, K.: The effects of over and under sampling on fault-prone module detection. In: First International Symposium on Empirical Software Engineering and Measurement, 2007, ESEM 2007, pp. 196–204 (2007)

  • Kaur, A., Sandhu, P.S., Bra, A.S.: Early software fault prediction using real time defect data. In: Second International Conference on Machine Vision, 2009, ICMV ’09, pp. 242–245 (2009)

  • Khoshgoftaar, T., Seliya, N.: Comparative assessment of software quality classification techniques: an empirical case study. Empir. Softw. Eng. 9(3), 229–257 (2004)

  • Koru, A., Liu, H.: Building effective defect-prediction models in practice. IEEE Softw. 22(6), 23–29 (2005)

  • Kutlubay, O., Turhan, B., Bener, A.: A two-step model for defect density estimation. In: 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, 2007, pp. 322–332 (2007)

  • Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans. Softw. Eng. 34(4), 485–496 (2008)

  • Ma, Y., Guo, L., Cukic, B.: A statistical framework for the prediction of fault-proneness. In: Advances in Machine Learning Applications in Software Engineering, IGI Global, pp. 237–265 (2006)

  • Mende, T., Koschke, R.: Effort-aware defect prediction models. In: 14th European Conference on Software Maintenance and Reengineering (CSMR), 2010, pp. 107–116 (2010)

  • Menzies, T., Dekhtyar, A., Distefano, J., Greenwald, J.: Problems with precision: a response to “comments on ‘data mining static code attributes to learn defect predictors’ ”. IEEE Trans. Softw. Eng. 33(9), 637–640 (2007a)

  • Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 33(1), 2–13 (2007b)

  • Ostrand, T., Weyuker, E.: How to measure success of fault prediction models. In: Fourth International Workshop on Software Quality Assurance: in Conjunction with the 6th ESEC/FSE Joint Meeting, pp. 25–30. ACM, New York (2007)

  • Ostrand, T., Weyuker, E., Bell, R.: Where the bugs are. In: ACM SIGSOFT Software Engineering Notes, vol. 29, pp. 86–96. ACM, New York (2004)

  • Pai, G., Dugan, J.: Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Trans. Softw. Eng. 33(10), 675–686 (2007)

  • Pizzi, N., Summers, A., Pedrycz, W.: Software quality prediction using median-adjusted class labels. In: Proceedings of the 2002 International Joint Conference on Neural Networks, 2002, IJCNN ’02, vol. 3, pp. 2405–2409 (2002)

  • Seliya, N., Khoshgoftaar, T., Zhong, S.: Analyzing software quality with limited fault-proneness defect data. In: Ninth IEEE International Symposium on High-Assurance Systems Engineering, 2005, HASE 2005, pp. 89–98 (2005)

  • Sun, Y., Castellano, C., Robinson, M., Adams, R., Rust, A., Davey, N.: Using pre & post-processing methods to improve binding site predictions. Pattern Recognit. 42(9), 1949–1958 (2009)

  • Turhan, B., Kocak, G., Bener, A.: Data mining source code for locating software bugs: a case study in telecommunication industry. Expert Syst. Appl. 36(6), 9986–9990 (2009)

  • Wang, T., Li, W.: Naïve Bayes software defect prediction model. In: International Conference on Computational Intelligence and Software Engineering (CiSE), 2010, pp. 1–4 (2010). doi:10.1109/CISE.2010.5677057

  • Yi, L., Khoshgoftaar, T.M., Seliya, N.: Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans. Softw. Eng. 36(6), 852–864 (2010)

  • Zeller, A., Zimmermann, T., Bird, C.: Failure is a four-letter word: a parody in empirical research. In: Proceedings of the 7th International Conference on Predictive Models in Software Engineering, Promise ’11, New York, NY, USA, pp. 5:1–5:7. ACM, New York (2011)

  • Zhang, H., Zhang, X.: Comments on “data mining static code attributes to learn defect predictors”. IEEE Trans. Softw. Eng. 33(9), 635–637 (2007)

  • Zhou, Y., Leung, H.: Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans. Softw. Eng. 32(10), 771–789 (2006)

Acknowledgements

Many thanks to Prof. Barbara Kitchenham for double checking the equations used to re-compute the confusion matrix.

Author information

Corresponding author

Correspondence to David Bowes.

Appendices

Appendix A: List of papers from which we have re-computed confusion matrices

E. Arisholm, L. C. Briand, and M. Fuglerud. Data mining techniques for building fault-proneness models in telecom Java software. In The 18th IEEE International Symposium on Software Reliability Engineering (ISSRE ’07), pages 215–224, 2007.

E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2–17, 2010.

C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu. Putting it all together: Using socio-technical networks to predict failures. In 20th International Symposium on Software Reliability Engineering, pages 109–119. IEEE, 2009.

L. Briand, W. Melo, and J. Wust. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering, 28(7):706–720, 2002.

B. Caglayan, A. Bener, and S. Koch. Merits of using repository metrics in defect prediction for open source projects. In ICSE Workshop on FLOSS ’09, pages 31–36, 2009.

G. Calikli, A. Tosun, A. Bener, and M. Celik. The effect of granularity level on software defect prediction. In 24th International Symposium on Computer and Information Sciences, 2009. ISCIS 2009, pages 531–536, 2009.

C. Catal, B. Diri, and B. Ozumut. An artificial immune system approach for fault prediction in object-oriented software. In 2nd International Conference on Dependability of Computer Systems, 2007. DepCoS-RELCOMEX ’07, pages 238–245, 2007.

C. Cruz and A. Erika. Exploratory study of a UML metric for fault prediction. In Proceedings of the 32nd ACM/IEEE Intern Conf on Software Engineering, pages 361–364. 2010.

A. B. de Carvalho, A. Pozo, S. Vergilio, and A. Lenz. Predicting fault proneness of classes trough a multiobjective particle swarm optimization algorithm. In 20th IEEE International Conference on Tools with Artificial Intelligence, 2008. ICTAI ’08, volume 2, pages 387–394, 2008.

A. B. de Carvalho, A. Pozo, and S. R. Vergilio. A symbolic fault-prediction model based on multiobjective particle swarm optimization. Journal of Systems and Software, 83(5):868–882, 2010.

G. Denaro and M. Pezzè. An empirical evaluation of fault-proneness models. In Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, pages 241–251, NY, USA, 2002. ACM.

L. Guo, Y. Ma, B. Cukic, and H. Singh. Robust prediction of fault-proneness by random forests. In 15th International Symposium on Software Reliability Engineering, 2004. ISSRE 2004, pages 417–428, 2004.

T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, 2005.

Z. Hongyu. An investigation of the relationships between lines of code and defects. In IEEE Intern Conf on Software Maintenance, 2009, pages 274–283, 2009.

Y. Jiang, B. Cukic, and Y. Ma. Techniques for evaluating fault prediction models. Empirical Software Engineering, 13(5):561–595, 2008.

S. Kanmani, V. Uthariaraj, V. Sankaranarayanan, and P. Thambidurai. Object-oriented software fault prediction using neural networks. Information and Software Technology, 49(5):483–492, 2007.

A. Kaur and R. Malhotra. Application of random forest in predicting fault-prone classes. In Internl Conf on Advanced Computer Theory and Engineering, 2008, 37–43, 2008.

A. Kaur, P. S. Sandhu, and A. S. Bra. Early software fault prediction using real time defect data. In Intern Conf on Machine Vision, 2009, pages 242–245.

T. Khoshgoftaar and N. Seliya. Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering, 9(3):229–257, 2004.

T. Khoshgoftaar, X. Yuan, E. Allen, W. Jones, and J. Hudepohl. Uncertain classification of fault-prone software modules. Empirical Software Engineering, 7(4):297–318, 2002.

A. Koru and H. Liu. Building effective defect-prediction models in practice. Software, IEEE, 22(6):23–29, 2005.

O. Kutlubay, B. Turhan, and A. Bener. A two-step model for defect density estimation. In 33rd EUROMICRO Conference on Software Engineering and Advanced Applications, 2007, pages 322–332, 2007.

Y. Ma, L. Guo, and B. Cukic. Advances in Machine Learning Applications in Software Engineering, chapter A statistical framework for the prediction of fault-proneness, pages 237–265. IGI Global, 2006.

T. Mende and R. Koschke. Effort-aware defect prediction models. In 14th European Conference on Software Maintenance and Reengineering (CSMR), 2010, pages 107–116, 2010.

T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, 2007.

O. Mizuno, S. Ikami, S. Nakaichi, and T. Kikuno. Spam filter based approach for finding fault-prone software modules. In International Workshop on Mining Software Repositories, 2007. ICSE ’07, page 4, 2007.

O. Mizuno and T. Kikuno. Training on errors experiment to detect fault-prone software modules by spam filter. In Procs European Software Engineering Conf and the ACM SIGSOFT symp on The foundations of software engineering, ESEC-FSE ’07, pages 405–414, 2007. ACM.

R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In ACM/IEEE 30th Intern Conf on Software Engineering, 2008. ICSE ’08, pages 181–190.

N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Murphy. Change bursts as defect predictors. In 2010 IEEE 21st International Symposium on Software Reliability Engineering (ISSRE), pages 309–318.

A. Nugroho, M. R. V. Chaudron, and E. Arisholm. Assessing UML design metrics for predicting fault-prone classes in a Java system. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR), pages 21–30.

G. Pai and J. Dugan. Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Trans on Software Engineering, 33(10):675–686, 2007.

A. Schröter, T. Zimmermann, and A. Zeller. Predicting component failures at design time. In Proceedings of the 2006 ACM/IEEE international symposium on Empirical software engineering, pages 18–27. ACM, 2006.

N. Seliya, T. Khoshgoftaar, and S. Zhong. Analyzing software quality with limited fault-proneness defect data. In IEEE Internl Symp on High-Assurance Systems Engineering, 2005, pages 89–98, 2005.

S. Shivaji, E. J. Whitehead, R. Akella, and K. Sunghun. Reducing features to improve bug prediction. In 24th IEEE/ACM International Conference on Automated Software Engineering, 2009. ASE ’09, pages 600–604.

Y. Singh, A. Kaur, and R. Malhotra. Predicting software fault proneness model using neural network. Product-Focused Software Process Improvement, 5089:204–214, 2008.

A. Tosun and A. Bener. Reducing false alarms in software defect prediction by decision threshold optimization. In International Symposium on Empirical Software Engineering and Measurement, ESEM 2009, pages 477–480.

B. Turhan and A. Bener. A multivariate analysis of static code attributes for defect prediction. In Intern Conf on Quality Software, 2007, pages 231–237, 2007.

O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer, and R. Haesen. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software, 81(5):823–839, 2008.

R. Vivanco, Y. Kamei, A. Monden, K. Matsumoto, and D. Jin. Using search-based metric selection and oversampling to predict fault prone modules. In Canadian Conf on Electrical and Computer Engineering, 2010, pages 1–6.

L. Yi, T. M. Khoshgoftaar, and N. Seliya. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans on Software Engineering, 36(6):852–864, 2010.

Y. Zhou and H. Leung. Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans on Software Engineering, 32(10):771–789, 2006.

T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07, page 9, 2007.

Appendix B: Fault prediction experiment algorithm

Algorithm 1 (presented as a figure in the original article): Generating the prediction data using a 10×10-fold stratified cross-validation experiment
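Algorithm 1 itself is given as a figure in the original article. As a rough sketch of the same experimental setup, a 10×10-fold stratified cross-validation can be run as follows; the use of scikit-learn and a decision-tree learner are assumptions, not the paper's prescribed implementation:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def generate_prediction_data(X, y, learner=None):
    """10x10-fold stratified cross-validation over NumPy arrays X, y:
    ten repeats of a 10-fold split that preserves the fp/nfp class
    ratio in every fold. Each held-out fold yields one confusion
    matrix (100 in total)."""
    if learner is None:
        learner = DecisionTreeClassifier(random_state=0)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    matrices = []
    for train_idx, test_idx in cv.split(X, y):
        learner.fit(X[train_idx], y[train_idx])
        predicted = learner.predict(X[test_idx])
        # rows = actual (nfp, fp), columns = predicted (nfp, fp)
        matrices.append(confusion_matrix(y[test_idx], predicted, labels=[0, 1]))
    return matrices
```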

Appendix C: Re-computing the confusion matrix algorithm

Algorithm 2 (presented as a figure in the original article): Evaluating the conversion from performance measure to a confusion matrix and then to other performance measures
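Algorithm 2 is likewise given as a figure. A minimal sketch of the round-trip evaluation it describes, reusing the confusion_from_precision_recall and perr helpers sketched earlier; treating precision and recall as the reported measures is an assumption:

```python
def derive_measures(tp, fp, fn, tn):
    """Standard measures derived from a confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f-measure": f_measure}

def round_trip_error(tp, fp, fn, tn):
    """True matrix -> reported measures -> reconstructed matrix ->
    recomputed measures, then the percentage error per measure."""
    n = tp + fp + fn + tn
    d = (tp + fn) / n                       # proportion of faulty units
    reported = derive_measures(tp, fp, fn, tn)
    rebuilt = confusion_from_precision_recall(
        n, d, reported["precision"], reported["recall"])
    recomputed = derive_measures(rebuilt["TP"], rebuilt["FP"],
                                 rebuilt["FN"], rebuilt["TN"])
    return {k: perr(reported[k], recomputed[k]) for k in reported}

# e.g. round_trip_error(110, 47, 90, 753) -> near-zero perr values,
# with any residual error coming from rounding counts to integers
```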

Cite this article

Bowes, D., Hall, T. & Gray, D. DConfusion: a technique to allow cross study performance evaluation of fault prediction studies. Autom Softw Eng 21, 287–313 (2014). https://doi.org/10.1007/s10515-013-0129-8

