
A training sample selection method for predicting software defects


Abstract

Software Defect Prediction (SDP) is an important method for analyzing software quality and reducing development cost. Data from the software life cycle have been widely used to predict the defect proneness of software modules, and although many machine learning-based SDP models have been proposed, their predictive performance is not always satisfactory. Traditional machine learning-based classifiers usually assume that all samples contribute equally to the training of an SDP model, which is not true. In fact, different training samples affect the performance of the SDP model differently, and the performance of machine learning-based SDP models depends heavily on the quality of the training samples. To address this shortcoming of traditional machine learning-based classifiers, this paper makes the following contributions: (1) Inspired by clustering algorithms, a method is proposed to calculate the contribution of each training sample to the SDP model. It considers not only the relationships among the contributions of the training samples but also the influence of a sample's distance from the category boundary on the performance of the SDP model, and therefore differs from existing methods for calculating sample contribution. (2) A Sample Selection (SS) method is proposed to improve the performance of the SDP model. It first calculates the contribution of each training sample based on several of the sample's nearest neighbors and the label information of these neighbors, and then performs SS according to the Hoeffding probability inequality and the contribution of each sample. To confirm the validity of the proposed SDP model, experimental results are reported. Both direct observation and statistical tests of the experimental results show that the SS method is very effective at improving the predictive performance of the SDP model.
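The abstract describes the contribution calculation (cf. Algorithm CTC in Fig. 1 below) only at a high level: each training sample's score is derived from several nearest neighbors and their labels. As a reading aid, a minimal Python sketch of that idea follows; the neighborhood label-agreement score and the function name sample_contributions are assumptions for illustration, not the paper's actual CTC formula.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sample_contributions(X, y, k=5):
        """Score each training sample by label agreement among its k nearest neighbors.

        Samples whose neighbors mostly share their label lie far from the class
        boundary; mixed neighborhoods indicate boundary (or noisy) samples. How
        the paper maps such agreement to a contribution value is an assumption.
        """
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)        # idx[:, 0] is each sample itself
        neighbor_labels = y[idx[:, 1:]]  # labels of the k true neighbors
        return (neighbor_labels == y[:, None]).mean(axis=1)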


Fig. 1 Algorithm CTC (figure omitted)

Fig. 2 Algorithm STS (figure omitted)
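Algorithm STS (Fig. 2) is described as selecting samples using the Hoeffding probability inequality together with the per-sample contributions. The sketch below shows one standard way to pair the two, assuming contribution scores bounded in [0, 1]: Hoeffding's inequality, P(|mean_n - mu| >= eps) <= 2*exp(-2*n*eps^2), gives a minimum sample count n for tolerance eps and confidence 1 - delta, and the n highest-contribution samples are then retained. Both the pairing and the function names are assumptions; the paper's actual STS procedure may differ.

    import math
    import numpy as np

    def hoeffding_sample_size(eps=0.05, delta=0.05):
        """Smallest n with 2*exp(-2*n*eps**2) <= delta, for scores in [0, 1]."""
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    def select_training_samples(contributions, eps=0.05, delta=0.05):
        """Keep the highest-contribution samples; count set by the Hoeffding bound."""
        n = min(hoeffding_sample_size(eps, delta), len(contributions))
        return np.argsort(contributions)[::-1][:n]  # indices of retained samples

For example, eps = 0.05 and delta = 0.05 give n = ceil(ln(40) / 0.005) = 738 retained samples, capped at the training-set size.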


Notes

  1. http://mdp.ivv.nasa.gov/

  2. http://j.mp/scvvIU


Author information

Corresponding author

Correspondence to Cong Jin.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jin, C. A training sample selection method for predicting software defects. Appl Intell 53, 12015–12031 (2023). https://doi.org/10.1007/s10489-022-04044-8

