
Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data

  • Original Article
  • Published:
Innovations in Systems and Software Engineering

Abstract

Software defect prediction (SDP) is essential for analyzing and identifying defects in a software model in the early stages of software development; identifying and removing these defects early yields cost-efficient software. Machine learning (ML) techniques have been used successfully to build defect prediction models. However, these techniques deliver off-target results when applied to imbalanced datasets, i.e., datasets with an unequal class distribution. ML techniques trained on such data produce predictions biased against the minority class instances, which are more important than the majority class instances. The imbalance problem must therefore be resolved to build an effective SDP model. In this study, we evaluated the prediction capability of ML classifiers for software defect prediction on nine imbalanced NASA datasets, using five oversampling methods that replicate or synthesize minority class instances to balance the data. Once the datasets were balanced, the ML classifiers were used to build defect prediction models. The experimental results on the imbalanced and balanced data show that the sampling techniques enhance the learning capability of the ML techniques and considerably improve the prediction performance of the classifiers.
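To make the balancing step concrete, the following minimal sketch (our illustration, not code from the study) oversamples the minority, defect-prone class with SMOTE before training a classifier; it assumes the scikit-learn and imbalanced-learn packages, and a synthetic dataset stands in for a real NASA defect dataset.

    # Minimal sketch (not from the paper): balance a defect dataset with SMOTE,
    # then train a classifier. Assumes scikit-learn and imbalanced-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Synthetic stand-in for an imbalanced defect dataset (~10% defective modules).
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.9, 0.1], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)

    # Oversample only the training split; the test set keeps its original distribution.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
    print(classification_report(y_test, clf.predict(X_test)))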



Author information

Corresponding author

Correspondence to Tirimula Rao Benala.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Figs. 3, 4 and Table 7.

Fig. 3 Graphical analysis of performance accuracy results of the datasets before and after applying the oversampling methods

Fig. 4 Graphical analysis of performance results of the defect prediction models generated using the imbalanced and balanced data samples

Table 7 Performance results of defect prediction models before sampling

Table 7 tabulates the performance of the software defect prediction models built by the various ML classifiers on the imbalanced datasets. The average of their performance is calculated to show the range in which the values lie when SDP models are created from imbalanced datasets.

Figure 3 corresponds to the values in Table 6. Figure 4 shows the prediction capability of SDP models built from the imbalanced and the balanced datasets; the graphs plot the performance for each dataset when it is left imbalanced and when it is balanced using the ROS, SMOTE, ADASYN, SL-SM, and SVM-SMOTE oversampling techniques.

Figure 4 also corresponds to the values in Table 6; each graph compares the performance of the SDP models built with a specific ML classifier when the dataset is balanced using the various techniques.

Reproducibility Strategy

Our experiment uses nine publicly available NASA datasets, as described in Section 4.1. The oversampling techniques and machine learning classifiers are described in Section 3, and the performance measure used is briefly described in Section 4.4. The hyperparameters used for the different datasets are listed in Table 4. The code is available at: https://github.com/karunyat/Oversampling-Techniques.
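As a rough, self-contained approximation of this experimental loop (our own sketch, not the released code), the example below cross-validates each combination of oversampler and classifier on a generic defect dataset with a binary "defective" label. It assumes the imbalanced-learn and scikit-learn packages, uses default hyperparameters rather than those in Table 4, and substitutes Borderline-SMOTE for Safe-Level SMOTE, which imbalanced-learn does not provide (the smote-variants package does, should an exact match be required).

    # Illustrative sketch only, not the released code. Assumes imbalanced-learn,
    # scikit-learn, pandas, and a CSV file with a binary "defective" label column.
    import pandas as pd
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from imblearn.over_sampling import (ADASYN, BorderlineSMOTE, RandomOverSampler,
                                        SMOTE, SVMSMOTE)
    from imblearn.pipeline import Pipeline

    def evaluate(csv_path):
        df = pd.read_csv(csv_path)                     # hypothetical dataset export
        X, y = df.drop(columns=["defective"]), df["defective"]

        samplers = {
            "ROS": RandomOverSampler(random_state=1),
            "SMOTE": SMOTE(random_state=1),
            "ADASYN": ADASYN(random_state=1),
            "SL-SM (approx.)": BorderlineSMOTE(random_state=1),  # Safe-Level SMOTE stand-in
            "SVM-SMOTE": SVMSMOTE(random_state=1),
        }
        classifiers = {
            "RF": RandomForestClassifier(random_state=1),
            "NB": GaussianNB(),
            "Bagging": BaggingClassifier(random_state=1),
        }
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

        for s_name, sampler in samplers.items():
            for c_name, clf in classifiers.items():
                # The pipeline applies oversampling inside each training fold only,
                # so validation folds keep the original class distribution.
                pipe = Pipeline([("sampler", sampler), ("clf", clf)])
                scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=cv)
                print(f"{s_name:>16} + {c_name:<8} accuracy = {scores.mean():.3f}")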

About this article

Cite this article

Benala, T.R., Tantati, K. Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data. Innovations Syst Softw Eng 19, 247–263 (2023). https://doi.org/10.1007/s11334-022-00457-3

