Abstract
Traditional statistical learning algorithms perform poorly when trained on imbalanced datasets. Software defect prediction (SDP) is a useful way to identify defects in the early phases of the software development life cycle. It helps remove software defects and supports the development of cost-effective, good-quality software products. Several statistical and machine learning models have been employed to predict defects in software modules. However, the imbalanced nature of such datasets is a key characteristic that must be taken into account to develop a successful defect prediction model. Imbalanced software datasets have non-uniform class distributions, with most instances belonging to one class rather than the other. We propose a novel hybrid model based on the Hellinger distance-based decision tree (HDDT) and the artificial neural network (ANN), which we call the hybrid HDDT-ANN model, for the analysis of SDP data. This newly developed model is found to be quite effective in predicting software bugs. We also present a comparative study of several supervised machine learning models against the proposed model using different performance measures. The hybrid HDDT-ANN model also draws on the strength of a skew-insensitive distance measure, the Hellinger distance, in handling class imbalance problems. A detailed experiment was performed on ten NASA SDP datasets to demonstrate the superiority of the proposed method.
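To illustrate why the Hellinger distance is described as skew-insensitive, the following minimal Python sketch computes the Hellinger split value for a binary split in the form commonly used for HDDT. It is an illustration under stated assumptions, not the chapter's implementation: the function name `hellinger_split_value` and the toy counts are hypothetical.

```python
import math


def hellinger_split_value(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the class-conditional branch
    distributions of a binary split (hypothetical helper).

    Arguments are instance counts: defective (positive) and
    non-defective (negative) modules sent to each branch.
    """
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must be present at the node")

    total = 0.0
    for pos, neg in ((pos_left, neg_left), (pos_right, neg_right)):
        # Only within-class proportions enter the formula, so the
        # class prior (the imbalance ratio) does not affect the value.
        total += (math.sqrt(pos / pos_total) - math.sqrt(neg / neg_total)) ** 2
    return math.sqrt(total)


# Toy counts (assumed for illustration): the same split evaluated on a
# balanced node and on a node with ten times as many negative instances.
balanced = hellinger_split_value(40, 10, 10, 40)
skewed = hellinger_split_value(40, 10, 100, 400)
print(balanced, skewed)
```

Both calls give the same value because the split sends the same fraction of each class to each branch; only the class ratio differs. The value ranges from 0 (the split does not separate the classes) to the square root of 2 (perfect separation).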
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chakraborty, A.K., Karmakar, B. (2023). Software Defect Prediction Through a Hybrid Approach Comprising of a Statistical Tool and a Machine Learning Model. In: Gunasekaran, A., Sharma, J.K., Kar, S. (eds) Applications of Operational Research in Business and Industries. Lecture Notes in Operations Research. Springer, Singapore. https://doi.org/10.1007/978-981-19-8012-1_1
DOI: https://doi.org/10.1007/978-981-19-8012-1_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8011-4
Online ISBN: 978-981-19-8012-1
eBook Packages: Business and Management (R0)