Abstract
Software defect prediction (SDP) aims to identify defects in a software system in the early stages of development; finding and removing these defects early yields more cost-efficient software. Machine learning (ML) techniques have been used successfully to build defect prediction models, but they deliver off-target results when trained on imbalanced datasets, i.e., datasets with an unequal class distribution. On such data, ML techniques produce predictions biased against minority-class instances, which are more important than majority-class instances. The imbalance problem must therefore be resolved to build an effective SDP model. In this study, we evaluated the prediction capability of ML classifiers for software defect prediction on nine imbalanced NASA datasets before and after applying five oversampling methods, which replicate or synthesize minority-class instances to balance a dataset. Once the datasets were balanced, the ML classifiers were used to build defect prediction models. The experimental results on the imbalanced and balanced data show that the sampling techniques enhanced the learning capability of the ML techniques, and that the oversampling methods considerably improved the prediction performance of the ML classifiers.
Appendix
Table 7 tabulates the performance of the software defect prediction models built by the various ML classifiers on the imbalanced datasets. The average performance is calculated to study the range in which the values lie when SDP models are created from imbalanced datasets.
Figure 3 corresponds to the values in Table 6. Figure 4, which also corresponds to the values in Table 6, shows the prediction capability of SDP models built from the imbalanced datasets and from the datasets balanced with the ROS, SMOTE, ADASYN, SL-SM, and SVM-SMOTE oversampling techniques. In Fig. 4, each graph compares the performance of the SDP model built with a specific ML classifier when the dataset is balanced using the various techniques.
Reproducibility Strategy
Our experiments use nine publicly available NASA datasets, as described in section 4.1. The oversampling techniques and machine learning classifiers are described in section 3, and the performance measures in section 4.4. The hyperparameters used for the different datasets are listed in Table 4. The code is available at: https://github.com/karunyat/Oversampling-Techniques.
Cite this article
Benala, T.R., Tantati, K. Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data. Innovations Syst Softw Eng 19, 247–263 (2023). https://doi.org/10.1007/s11334-022-00457-3