On the relative value of data resampling approaches for software defect prediction


Software defect data sets are typically characterized by an unbalanced class distribution in which defective modules are fewer than non-defective modules. The prediction performance of defect prediction models is detrimentally affected by the skewed distribution of the faulty minority modules in the data set, since most algorithms assume that both classes are equally balanced. Resampling approaches address this concern by modifying the class distribution to balance the minority and majority classes. However, very little is known about the best distribution for attaining high performance, especially in a more practical scenario. Results remain inconclusive regarding the suitable ratio of defective to clean instances (Pfp), the statistical and practical impacts of resampling approaches on prediction performance, and which resampling approach is more stable across several performance measures. To assess the impact of resampling approaches, we investigated the bias and effect of commonly used resampling approaches on prediction accuracy in software defect prediction. Analyses of six resampling approaches on 40 releases of 20 open-source projects across five performance measures and five imbalance rates were performed. The experimental results indicate that there were statistical differences between the prediction results with and without resampling methods when evaluated with the geometric mean (g-mean), recall (pd), probability of false alarm (pf) and balance performance measures. However, resampling methods could not improve the AUC values across all prediction models, implying that resampling methods can help in defect classification but not defect prioritization. A stable Pfp rate was dependent on the performance measure used. Lower Pfp rates are required for lower pf values, while higher Pfp values are required for higher pd values. Random Under-Sampling and Borderline-SMOTE proved to be the more stable resampling methods across several performance measures among the studied resampling methods. The performance of resampling methods is dependent on the imbalance ratio, the evaluation measure and, to some extent, the prediction model. Newer oversampling methods should aim at generating relevant and informative data samples and not just increasing the minority samples.
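To make the role of the Pfp rate concrete, the following is a minimal sketch of random under-sampling to a chosen Pfp. It is not the authors' implementation. It assumes that Pfp is the percentage of fault-prone modules, computed as 100 × defective / total, and that labels are encoded as 1 (defective) and 0 (clean). The function names `pfp` and `random_under_sample` are hypothetical and chosen only for illustration.

```python
import random


def pfp(labels):
    """Percentage of fault-prone modules (assumed: 1 = defective, 0 = clean)."""
    return 100.0 * sum(labels) / len(labels)


def random_under_sample(rows, labels, target_pfp, seed=0):
    """Randomly drop clean modules until defective ones reach about target_pfp percent.

    Hypothetical helper for illustration; defective modules are never removed.
    """
    if not 0 < target_pfp < 100:
        raise ValueError("target_pfp must be strictly between 0 and 100")
    rng = random.Random(seed)
    defective = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    # Number of clean modules that gives the requested share of defective modules.
    n_clean = round(len(defective) * (100 - target_pfp) / target_pfp)
    # Under-sampling can only remove modules, so never ask for more than exist.
    n_clean = min(n_clean, len(clean))
    kept = sorted(defective + rng.sample(clean, n_clean))
    return [rows[i] for i in kept], [labels[i] for i in kept]


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 90
    rows = list(range(len(labels)))
    _, resampled = random_under_sample(rows, labels, target_pfp=50)
    print(pfp(labels), pfp(resampled))
```

If the requested Pfp is below the Pfp already present in the data, this sketch returns the data unchanged, because under-sampling alone cannot lower the share of defective modules.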





This work is supported in part by the General Research Fund of the Research Grants Council of Hong Kong (No. 11208017), and the research funds of City University of Hong Kong (No. 7004474, 7004683) and JSPS KAKENHI 17K00102.

Author information



Corresponding author

Correspondence to Kwabena Ebo Bennin.

Additional information

Communicated by: Nachiappan Nagappan


Appendix A: Process Metrics Data Used in this Study

The static metric datasets are available in the PROMISE repository. The process metric datasets were manually extracted. The keywords used to match commits in the commit logs during data extraction were “errors, bugs, fix, fixes, issues, bugfixes, refactoring(s)”. Below are sample regular expressions used in the commit-matching process.

[(bugs?|fix(es|ed)?)[\s:_#]*(\d+)], [#(\d+)(issue|bug)s?[\s#-]*(\d+)show_bug\.cgi\?id=(\d+)PR:? (\d+)].
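The following is a minimal sketch of how such patterns could be applied to commit messages. It rests on two assumptions that the text above does not confirm: that the separators inside the groups are regular-expression alternation (`|`), and that the second bracket lists four alternative patterns whose separators were lost. Case-insensitive matching for the keyword patterns is also an assumption, and the name `extract_issue_ids` is hypothetical.

```python
import re

# Assumed reconstruction of the quoted patterns; the split into five
# separate alternatives is an assumption, not confirmed by the source.
PATTERNS = [
    re.compile(r"(bugs?|fix(es|ed)?)[\s:_#]*(\d+)", re.IGNORECASE),
    re.compile(r"#(\d+)"),
    re.compile(r"(issue|bug)s?[\s#-]*(\d+)", re.IGNORECASE),
    re.compile(r"show_bug\.cgi\?id=(\d+)"),
    re.compile(r"PR:? (\d+)"),
]


def extract_issue_ids(message):
    """Return the set of issue numbers referenced in one commit message."""
    ids = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(message):
            # In every pattern the issue number is the last capturing group.
            ids.add(match.group(pattern.groups))
    return ids


if __name__ == "__main__":
    for msg in ["Fixed bug 1234 in parser", "fixes #42", "Refactor build scripts"]:
        print(msg, "->", sorted(extract_issue_ids(msg)))
```

A commit would then be treated as fault-related when the returned set is non-empty. Linking the extracted numbers to an issue tracker is outside the scope of this sketch.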

Table 6 presents the metrics and fault collection period for the extracted datasets. The process metric datasets are available at http://analytics.jpn.org/SEdata/.

Appendix B: Extra Results for RQ1

In the figures below, we present the AUC, g-mean, balance, pd and pf results for the sampling approaches across the prediction models on the 20 imbalanced datasets for different percentages of fault-prone modules (Pfp).
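As a reminder of how the threshold-based measures relate to one another, the sketch below computes pd, pf, g-mean and balance from actual and predicted labels. The formulas are the definitions commonly used in the defect prediction literature and are assumed here, since this section does not state them: pd = TP / (TP + FN), pf = FP / (FP + TN), g-mean = sqrt(pd × (1 − pf)), and balance = 1 − sqrt((0 − pf)² + (1 − pd)²) / sqrt(2). AUC is omitted because it requires ranking scores rather than hard labels. The function name `defect_measures` is hypothetical.

```python
import math


def defect_measures(actual, predicted):
    """Compute pd, pf, g-mean and balance (assumed: 1 = defective, 0 = clean)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    # Guard against division by zero when a class is absent.
    recall_pd = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_pf = fp / (fp + tn) if (fp + tn) else 0.0
    g_mean = math.sqrt(recall_pd * (1 - false_alarm_pf))
    # Balance is one minus the normalized distance to the ideal point pd = 1, pf = 0.
    distance = math.sqrt((0 - false_alarm_pf) ** 2 + (1 - recall_pd) ** 2)
    balance = 1 - distance / math.sqrt(2)
    return {"pd": recall_pd, "pf": false_alarm_pf, "g_mean": g_mean, "balance": balance}


if __name__ == "__main__":
    actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
    print(defect_measures(actual, predicted))
```

Under these definitions, raising pd usually raises pf as well, which is consistent with a stable Pfp rate depending on the measure being optimized.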

Table 6 Summary of versions and collection period for the 10 OSS projects used (Refer to Section 4.2.1)
Fig. 11 Performance plot of the sampling approaches for each prediction model for the AUC performance measure on 20 imbalanced datasets across different percentages of fault-prone modules (Pfp)

Fig. 12 Performance plot of the sampling approaches for each prediction model for the g-mean performance measure on 20 imbalanced datasets across different percentages of fault-prone modules (Pfp)

Fig. 13 Performance plot of the sampling approaches for each prediction model for the balance performance measure on 20 imbalanced datasets across different percentages of fault-prone modules (Pfp)

Fig. 14 Performance plot of the sampling approaches for each prediction model for the pd performance measure on 20 imbalanced datasets across different percentages of fault-prone modules (Pfp)

Fig. 15 Performance plot of the sampling approaches for each prediction model for the pf performance measure on 20 imbalanced datasets across different percentages of fault-prone modules (Pfp)

Cite this article

Bennin, K.E., Keung, J.W. & Monden, A. On the relative value of data resampling approaches for software defect prediction. Empir Software Eng 24, 602–636 (2019). https://doi.org/10.1007/s10664-018-9633-6


  • Software defect prediction
  • Imbalanced data
  • Data resampling approaches
  • Class imbalance
  • Empirical study