
Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data

  • Original Article
  • Published:
Innovations in Systems and Software Engineering

Abstract

Software defect prediction (SDP) is essential for analyzing and identifying defects in a software model in the early stages of software development; identifying and removing these defects early yields cost-efficient software. Machine learning (ML) techniques have been used successfully to build defect prediction models. However, these techniques deliver off-target results when applied to imbalanced datasets, i.e., datasets with an unequal class distribution. ML techniques trained on such data produce predictions biased against the minority class instances, which are more important than the majority class instances. The imbalance problem must therefore be resolved to build an effective SDP model. In this study, we evaluated the prediction capability of ML classifiers for software defect prediction on nine imbalanced NASA datasets, using five oversampling methods that replicate or synthesize minority class instances to balance the data. Once the datasets were balanced, the ML classifiers were used to build defect prediction models. The experimental results on the imbalanced and balanced data show that the sampling techniques enhance the learning capability of the ML techniques and considerably improve the prediction performance of the classifiers.
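To make the balancing step concrete, the following minimal sketch (our illustration, not code from the study) oversamples the minority, defect-prone class with SMOTE before training a classifier; it assumes the scikit-learn and imbalanced-learn packages, and a synthetic dataset stands in for a real NASA defect dataset.

    # Minimal sketch (not from the paper): balance a defect dataset with SMOTE,
    # then train a classifier. Assumes scikit-learn and imbalanced-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Synthetic stand-in for an imbalanced defect dataset (~10% defective modules).
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.9, 0.1], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)

    # Oversample only the training split; the test set keeps its original distribution.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
    print(classification_report(y_test, clf.predict(X_test)))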



Author information

Corresponding author

Correspondence to Tirimula Rao Benala.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Figs. 3, 4 and Table 7.

Fig. 3 Graphical analysis of performance accuracy results of the datasets before and after applying the oversampling methods

Fig. 4 Graphical analysis of performance results of the defect prediction models generated using the imbalanced and balanced data samples

Table 7 Performance results of defect prediction models before sampling

Table 7 tabulates the performance of the software defect prediction models built by the various ML classifiers on the imbalanced datasets. The average of their performance is calculated to show the range in which the values lie when SDP models are created from imbalanced datasets.

Figure 3 corresponds to the values in Table 6. Figure 4 shows the prediction capability of SDP models built from the imbalanced and the balanced datasets; the graphs plot the performance for each dataset when it is left imbalanced and when it is balanced using the ROS, SMOTE, ADASYN, SL-SM, and SVM-SMOTE oversampling techniques.

Figure 4 also corresponds to the values in Table 6; each graph compares the performance of the SDP models built with a specific ML classifier when the dataset is balanced using the various techniques.

Reproducibility Strategy

Our experiment uses nine publicly available NASA datasets, as described in Section 4.1. The oversampling techniques and machine learning classifiers are described in Section 3, and the performance measure used is briefly described in Section 4.4. The hyperparameters used for the different datasets are listed in Table 4. The code is available at: https://github.com/karunyat/Oversampling-Techniques.
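As a rough, self-contained approximation of this experimental loop (our own sketch, not the released code), the example below cross-validates each combination of oversampler and classifier on a generic defect dataset with a binary "defective" label. It assumes the imbalanced-learn and scikit-learn packages, uses default hyperparameters rather than those in Table 4, and substitutes Borderline-SMOTE for Safe-Level SMOTE, which imbalanced-learn does not provide (the smote-variants package does, should an exact match be required).

    # Illustrative sketch only, not the released code. Assumes imbalanced-learn,
    # scikit-learn, pandas, and a CSV file with a binary "defective" label column.
    import pandas as pd
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from imblearn.over_sampling import (ADASYN, BorderlineSMOTE, RandomOverSampler,
                                        SMOTE, SVMSMOTE)
    from imblearn.pipeline import Pipeline

    def evaluate(csv_path):
        df = pd.read_csv(csv_path)                     # hypothetical dataset export
        X, y = df.drop(columns=["defective"]), df["defective"]

        samplers = {
            "ROS": RandomOverSampler(random_state=1),
            "SMOTE": SMOTE(random_state=1),
            "ADASYN": ADASYN(random_state=1),
            "SL-SM (approx.)": BorderlineSMOTE(random_state=1),  # Safe-Level SMOTE stand-in
            "SVM-SMOTE": SVMSMOTE(random_state=1),
        }
        classifiers = {
            "RF": RandomForestClassifier(random_state=1),
            "NB": GaussianNB(),
            "Bagging": BaggingClassifier(random_state=1),
        }
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

        for s_name, sampler in samplers.items():
            for c_name, clf in classifiers.items():
                # The pipeline applies oversampling inside each training fold only,
                # so validation folds keep the original class distribution.
                pipe = Pipeline([("sampler", sampler), ("clf", clf)])
                scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=cv)
                print(f"{s_name:>16} + {c_name:<8} accuracy = {scores.mean():.3f}")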

About this article

Cite this article

Benala, T.R., Tantati, K. Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data. Innovations Syst Softw Eng 19, 247–263 (2023). https://doi.org/10.1007/s11334-022-00457-3

