
Software Defect Prediction Through a Hybrid Approach Comprising of a Statistical Tool and a Machine Learning Model

  • Conference paper
  • First Online:
Applications of Operational Research in Business and Industries

Part of the book series: Lecture Notes in Operations Research ((LNOR))

Abstract

Traditional statistical learning algorithms perform poorly when trained on imbalanced datasets. Software defect prediction (SDP) identifies defects in the early phases of the software development life cycle, helping to remove defects early and thereby supporting the development of cost-effective, high-quality software products. Several statistical and machine learning models have been employed to predict defects in software modules, but the imbalanced nature of such datasets is a key characteristic that must be handled for a defect prediction model to succeed. Imbalanced software datasets have non-uniform class distributions, with most instances belonging to one class and only a few to the other. We propose a novel hybrid model based on the Hellinger distance-based decision tree (HDDT) and an artificial neural network (ANN), which we call the hybrid HDDT-ANN model, for the analysis of SDP data. This newly developed model proves quite effective in predicting software bugs. We also present a comparative study, across several performance measures, of the proposed model against a range of supervised machine learning models. The hybrid HDDT-ANN exploits the strength of a skew-insensitive distance measure, the Hellinger distance, in handling class imbalance. A detailed experiment on ten NASA SDP datasets demonstrates the superiority of the proposed method.
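The skew-insensitivity claimed for the Hellinger distance can be illustrated with a minimal sketch (our illustration, not the authors' HDDT-ANN implementation; the branch probabilities below are made-up numbers). For a candidate split, HDDT compares the class-conditional distributions of instances over the branches; because only these conditional distributions enter the formula, the class priors cancel out of the split criterion:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Ranges from 0 (identical) to 1 (disjoint support). It compares the
    class-conditional distributions directly and does not involve the
    class priors -- the property that makes HDDT splits skew-insensitive.
    """
    return math.sqrt(
        sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    ) / math.sqrt(2)

# Hypothetical binary split: fraction of each class sent to each branch.
defective = [0.8, 0.2]   # P(branch | defective module)
clean     = [0.3, 0.7]   # P(branch | non-defective module)

print(round(hellinger(defective, clean), 3))  # quality of this split

# The value is the same whether defective modules make up 50% or 2% of
# the training data, since only the conditional branch distributions
# appear in the formula -- unlike entropy-based criteria such as
# information gain, which depend on the class priors.
```

Under this criterion, a split that separates the rare (defective) class well scores highly even when that class is a tiny fraction of the data.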



Author information

Correspondence to Barin Karmakar.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chakraborty, A.K., Karmakar, B. (2023). Software Defect Prediction Through a Hybrid Approach Comprising of a Statistical Tool and a Machine Learning Model. In: Gunasekaran, A., Sharma, J.K., Kar, S. (eds) Applications of Operational Research in Business and Industries. Lecture Notes in Operations Research. Springer, Singapore. https://doi.org/10.1007/978-981-19-8012-1_1
