
Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance



As software projects grow larger, software defect prediction (SDP) plays a key role in allocating testing resources sensibly, reducing testing costs, and speeding up the development process. Most SDP methods apply machine learning techniques to common software metrics such as Halstead's measures and McCabe's cyclomatic complexity. Datasets produced by these metrics usually do not follow a Gaussian distribution, and the defective and non-defective classes overlap. In addition, in many software defect datasets the number of defective modules (the minority class) is considerably smaller than the number of non-defective modules (the majority class). Under these conditions, the performance of machine learning methods degrades dramatically. Therefore, we first need to balance the minority and majority classes and then transfer the samples into a new space in which sample pairs of the same class (the must-link set) are as close to each other as possible and sample pairs of different classes (the cannot-link set) are as far apart as possible. To achieve these objectives, this paper uses the Mahalanobis distance in two ways. First, the minority class is over-sampled based on the Mahalanobis distance such that the generated synthetic samples are diverse with respect to the existing minority samples while the minority class distribution does not change significantly. Second, a feature extraction method based on Mahalanobis distance metric learning is used, which tries to minimize the distances between sample pairs in must-links and maximize the distances between sample pairs in cannot-links. To demonstrate the effectiveness of the proposed method, we performed experiments on 12 publicly available datasets collected from NASA repositories and compared its results with those of several strong previous methods. Performance is evaluated in terms of F-measure, G-mean, and Matthews correlation coefficient. Overall, the proposed method performs better than the compared methods.
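The Mahalanobis distance underlying both steps, together with a diversity-driven over-sampling pass, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairing rule (interpolating from a random minority sample toward its farthest Mahalanobis neighbour) and all function names are hypothetical.

```python
import numpy as np

def mahalanobis(x, y, VI):
    """Mahalanobis distance between vectors x and y, given the
    (pseudo-)inverse covariance matrix VI of the data."""
    d = x - y
    return float(np.sqrt(d @ VI @ d))

def oversample_minority(X_min, n_new, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    randomly chosen sample toward its farthest Mahalanobis neighbour,
    favouring diversity among the offspring (hypothetical sketch)."""
    rng = np.random.default_rng(rng)
    cov = np.cov(X_min, rowvar=False)
    VI = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = [mahalanobis(X_min[i], x, VI) for x in X_min]
        j = int(np.argmax(dists))           # farthest sample -> more diverse offspring
        lam = rng.random()                  # random interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because the distance is computed under the minority class's own covariance, interpolated samples stay oriented along the class's spread, which is one way to keep the minority distribution from shifting significantly.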





Author information

Correspondence to Abbas Rasoolzadegan.



Appendix 1

For the NASA datasets, the number of modes in the histogram of each feature is reported in Tables 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18.

Table 9 Number of modes in CM1’s features
Table 10 Number of modes in KC1’s features
Table 11 Number of modes in KC2’s features
Table 12 Number of modes in KC3’s features
Table 13 Number of modes in MC1’s features
Table 14 Number of modes in MC2’s features
Table 15 Number of modes in MW1’s features
Table 16 Number of modes in PC2’s features
Table 17 Number of modes in PC3’s features
Table 18 Number of modes in PC4’s features
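As a rough illustration of how a feature's mode count could be obtained, the sketch below counts local maxima in a histogram of the feature's values. The binning rule (`bins=20`) and the function name are assumptions; the paper does not state how its mode counts were computed.

```python
import numpy as np

def count_modes(values, bins=20):
    """Count modes (local maxima) in a feature's histogram.
    bins=20 is an assumed binning rule, for illustration only."""
    counts, _ = np.histogram(values, bins=bins)
    modes = 0
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if c > left and c > right:  # strictly higher than both neighbours
            modes += 1
    return modes
```

A feature whose values cluster around two separated centres would yield two modes, flagging the kind of multimodal, non-Gaussian distribution the paper observes in these metrics.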

Appendix 2

The p values of the Shapiro–Wilk normality test for each dataset in the NASA repository are reported in Tables 19, 20, 21, 22, 23, 24, 25, 26, 27 and 28.

Table 19 P-values of Shapiro–Wilk normality test for CM1’s features
Table 20 P-values of Shapiro–Wilk normality test for KC1’s features
Table 21 P-values of Shapiro–Wilk normality test for KC2’s features
Table 22 P-values of Shapiro–Wilk normality test for KC3’s features
Table 23 P-values of Shapiro–Wilk normality test for MC1’s features
Table 24 P-values of Shapiro–Wilk normality test for MC2’s features
Table 25 P-values of Shapiro–Wilk normality test for MW1’s features
Table 26 P-values of Shapiro–Wilk normality test for PC2’s features
Table 27 P-values of Shapiro–Wilk normality test for PC3’s features
Table 28 P-values of Shapiro–Wilk normality test for PC4’s features
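Per-feature p values of this kind can be computed with SciPy's Shapiro–Wilk implementation; the sketch below assumes the features are the columns of a NumPy array, and the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def shapiro_pvalues(X):
    """Shapiro-Wilk normality test applied to each feature (column) of X.
    Returns a dict mapping column index -> p value; a small p value
    (e.g. below 0.05) suggests the feature is unlikely to be Gaussian."""
    pvals = {}
    for j in range(X.shape[1]):
        _, p = stats.shapiro(X[:, j])  # (W statistic, p value)
        pvals[j] = p
    return pvals
```

Applied to defect datasets, consistently small p values across features would support the paper's premise that these metric distributions are non-Gaussian.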


About this article


Cite this article

NezhadShokouhi, M.M., Majidi, M.A. & Rasoolzadegan, A. Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance. J Supercomput 76, 602–635 (2020). https://doi.org/10.1007/s11227-019-03051-w



Keywords

  • Software defect prediction
  • Software metrics
  • Mahalanobis distance
  • Over-sampling
  • Feature extraction