
Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance

The Journal of Supercomputing

Abstract

As software projects grow larger, software defect prediction (SDP) plays a key role in allocating testing resources sensibly, reducing testing costs, and speeding up the development process. Most SDP methods apply machine learning techniques to common software metrics such as Halstead's metrics and McCabe's cyclomatic complexity. Datasets produced by these metrics usually do not follow a Gaussian distribution, and their defective and non-defective classes overlap. In addition, in many software defect datasets the number of defective modules (the minority class) is considerably smaller than the number of non-defective modules (the majority class). Under these conditions, the performance of machine learning methods degrades dramatically. We therefore first need to balance the minority and majority classes and then map the samples into a new space in which pairs of samples from the same class (the must-link set) are as close as possible and pairs of samples from different classes (the cannot-link set) are as far apart as possible. To achieve these objectives, this paper uses the Mahalanobis distance in two ways. First, the minority class is oversampled based on the Mahalanobis distance so that the generated synthetic instances are diverse with respect to the existing minority instances while the minority-class distribution is not changed significantly. Second, a feature extraction method based on Mahalanobis distance metric learning is used, which tries to minimize the distances of sample pairs in the must-link set and maximize the distances of sample pairs in the cannot-link set. To demonstrate the effectiveness of the proposed method, we conducted experiments on 12 publicly available datasets collected from the NASA repository and compared the results with several strong previous methods. Performance is evaluated with the F-measure, G-mean, and Matthews correlation coefficient. Overall, the proposed method outperforms the compared methods.
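To make the paper's two uses of the Mahalanobis distance more concrete, the sketch below illustrates (a) a diversity-oriented oversampling step that ranks minority samples by their Mahalanobis distance from the minority-class mean and averages paired parents, and (b) evaluation of the must-link/cannot-link trade-off under a linear transform. This is a minimal sketch of the general ideas only, not the authors' exact algorithm; the function names, the pairing scheme, and the random data are assumptions.

```python
import numpy as np

def mahalanobis_oversample(X_min, n_new, seed=0):
    """Sketch of Mahalanobis-distance-based minority oversampling.

    Minority samples are ranked by Mahalanobis distance from the
    minority-class mean; the two halves of the ranking are paired and
    averaged, so each synthetic child combines a near and a far parent.
    This mirrors the diversity idea only, not the paper's exact steps.
    """
    rng = np.random.default_rng(seed)
    mu = X_min.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_min, rowvar=False))  # pseudo-inverse for stability
    diff = X_min - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
    X_sorted = X_min[np.argsort(d)[::-1]]                   # most distant first
    half = len(X_sorted) // 2
    a, b = X_sorted[:half], X_sorted[half:2 * half]
    idx = rng.integers(half, size=n_new)
    return (a[idx] + b[idx]) / 2.0                          # children between paired parents

def link_objective(X, L, must_links, cannot_links):
    """Quantities traded off by Mahalanobis metric learning: the sum of
    squared distances over must-link pairs (to be minimized) and over
    cannot-link pairs (to be maximized) under a linear transform L."""
    Z = X @ L.T
    sq_sum = lambda pairs: sum(float(np.sum((Z[i] - Z[j]) ** 2)) for i, j in pairs)
    return sq_sum(must_links), sq_sum(cannot_links)

# Hypothetical usage on random data (not a NASA dataset)
X_min = np.random.default_rng(1).normal(size=(30, 5))
X_syn = mahalanobis_oversample(X_min, n_new=20)
print(X_syn.shape)                                          # (20, 5)
print(link_objective(X_min, np.eye(5), [(0, 1)], [(0, 29)]))
```

In the paper the transform is learned by optimizing such a must-link/cannot-link objective; the identity matrix above is only a placeholder input.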




Author information

Corresponding author

Correspondence to Abbas Rasoolzadegan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

For the NASA datasets, the number of modes in the histogram of each feature is reported in Tables 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18; a minimal sketch of how such mode counts can be obtained follows the table list.

Table 9 Number of modes in CM1’s features
Table 10 Number of modes in KC1’s features
Table 11 Number of modes in KC2’s features
Table 12 Number of modes in KC3’s features
Table 13 Number of modes in MC1’s features
Table 14 Number of modes in MC2’s features
Table 15 Number of modes in MW1’s features
Table 16 Number of modes in PC2’s features
Table 17 Number of modes in PC3’s features
Table 18 Number of modes in PC4’s features
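The paper does not state how the mode counts in these tables were computed. The snippet below is a rough sketch that counts the strict local maxima of a fixed-bin histogram for a single feature; the bin count (20) and the synthetic bimodal data are assumptions.

```python
import numpy as np

def count_modes(values, bins=20):
    """Rough mode count: number of strict local maxima in the histogram.
    The 20-bin choice is an assumption; the paper does not specify binning."""
    counts, _ = np.histogram(values, bins=bins)
    padded = np.concatenate(([0], counts, [0]))      # guard the edges
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] > padded[2:])
    return int(is_peak.sum())

# Hypothetical bimodal feature (not taken from a NASA dataset)
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
print(count_modes(feature))                          # typically 2 for well-separated components
```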

Appendix 2

The p-values of the Shapiro–Wilk normality test for the features of each NASA dataset are reported in Tables 19, 20, 21, 22, 23, 24, 25, 26, 27 and 28; a short sketch of how such p-values can be computed follows the table list.

Table 19 P-values of Shapiro–Wilk normality test for CM1’s features
Table 20 P-values of Shapiro–Wilk normality test for KC1’s features
Table 21 P-values of Shapiro–Wilk normality test for KC2’s features
Table 22 P-values of Shapiro–Wilk normality test for KC3’s features
Table 23 P-values of Shapiro–Wilk normality test for MC1’s features
Table 24 P-values of Shapiro–Wilk normality test for MC2’s features
Table 25 P-values of Shapiro–Wilk normality test for MW1’s features
Table 26 P-values of Shapiro–Wilk normality test for PC2’s features
Table 27 P-values of Shapiro–Wilk normality test for PC3’s features
Table 28 P-values of Shapiro–Wilk normality test for PC4’s features
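For illustration, the snippet below applies the Shapiro–Wilk test to every column of a feature matrix, which is the kind of per-feature computation that presumably produced the p-values in the tables above; the random data and the 0.05 threshold mentioned in the comment are assumptions.

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical feature matrix standing in for one NASA dataset
X = np.random.default_rng(0).lognormal(size=(200, 5))

p_values = [shapiro(X[:, j]).pvalue for j in range(X.shape[1])]
for j, p in enumerate(p_values):
    # p < 0.05 suggests the feature deviates from a Gaussian distribution
    print(f"feature {j}: p = {p:.4g}")
```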


About this article


Cite this article

NezhadShokouhi, M.M., Majidi, M.A. & Rasoolzadegan, A. Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance. J Supercomput 76, 602–635 (2020). https://doi.org/10.1007/s11227-019-03051-w
