
Software Defect Prediction Through a Hybrid Approach Comprising of a Statistical Tool and a Machine Learning Model

  • Conference paper
  • First Online:
Applications of Operational Research in Business and Industries

Part of the book series: Lecture Notes in Operations Research ((LNOR))

Abstract

Traditional statistical learning algorithms perform poorly when trained on imbalanced datasets. Software defect prediction (SDP) identifies defects in the early phases of the software development life cycle, helping to remove defects early and thereby supporting the development of cost-effective, high-quality software products. Several statistical and machine learning models have been employed to predict defects in software modules, but the imbalanced nature of such datasets is a key characteristic that must be handled for a defect prediction model to succeed. Imbalanced software datasets have non-uniform class distributions, with most instances belonging to one class and only a few to the other. We propose a novel hybrid model based on the Hellinger distance-based decision tree (HDDT) and an artificial neural network (ANN), which we call the hybrid HDDT-ANN model, for the analysis of SDP data. This newly developed model proves quite effective in predicting software bugs. We also present a comparative study, across several performance measures, of the proposed model against a range of supervised machine learning models. The hybrid HDDT-ANN exploits the strength of a skew-insensitive distance measure, the Hellinger distance, in handling class imbalance. A detailed experiment on ten NASA SDP datasets demonstrates the superiority of the proposed method.
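The skew-insensitivity claimed for the Hellinger distance can be illustrated with a minimal sketch (our illustration, not the authors' HDDT-ANN implementation; the branch probabilities below are made-up numbers). For a candidate split, HDDT compares the class-conditional distributions of instances over the branches; because only these conditional distributions enter the formula, the class priors cancel out of the split criterion:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Ranges from 0 (identical) to 1 (disjoint support). It compares the
    class-conditional distributions directly and does not involve the
    class priors -- the property that makes HDDT splits skew-insensitive.
    """
    return math.sqrt(
        sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    ) / math.sqrt(2)

# Hypothetical binary split: fraction of each class sent to each branch.
defective = [0.8, 0.2]   # P(branch | defective module)
clean     = [0.3, 0.7]   # P(branch | non-defective module)

print(round(hellinger(defective, clean), 3))  # quality of this split

# The value is the same whether defective modules make up 50% or 2% of
# the training data, since only the conditional branch distributions
# appear in the formula -- unlike entropy-based criteria such as
# information gain, which depend on the class priors.
```

Under this criterion, a split that separates the rare (defective) class well scores highly even when that class is a tiny fraction of the data.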



Author information

Correspondence to Barin Karmakar.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chakraborty, A.K., Karmakar, B. (2023). Software Defect Prediction Through a Hybrid Approach Comprising of a Statistical Tool and a Machine Learning Model. In: Gunasekaran, A., Sharma, J.K., Kar, S. (eds) Applications of Operational Research in Business and Industries. Lecture Notes in Operations Research. Springer, Singapore. https://doi.org/10.1007/978-981-19-8012-1_1
