
A training sample selection method for predicting software defects


Abstract

Software Defect Prediction (SDP) is an important method for analyzing software quality and reducing development cost. Data from the software life cycle have been widely used to predict the defect proneness of software modules, and although many machine learning-based SDP models have been proposed, their predictive performance is not always satisfactory. Traditional machine learning-based classifiers usually assume that all samples contribute equally to the training of an SDP model, which is not true. In fact, different training samples affect the performance of the SDP model differently, and the performance of machine learning-based SDP models depends heavily on the quality of the training samples. To address this shortcoming of traditional machine learning-based classifiers, this paper makes the following contributions: (1) Inspired by clustering algorithms, a method is proposed to calculate the contribution of each training sample to the SDP model. It considers not only the relationships among the contributions of the training samples but also the influence of a sample's distance from the category boundary on the performance of the SDP model, and therefore differs from existing methods for calculating sample contribution. (2) A Sample Selection (SS) method is proposed to improve the performance of the SDP model. It first calculates the contribution of each training sample based on several of the sample's nearest neighbors and the label information of these neighbors, and then performs SS according to the Hoeffding probability inequality and the contribution of each sample. To confirm the validity of the proposed SDP model, experimental results are reported. Both direct observation and statistical tests of the experimental results show that the SS method is very effective at improving the predictive performance of the SDP model.
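The abstract describes the contribution calculation (cf. Algorithm CTC in Fig. 1 below) only at a high level: each training sample's score is derived from several nearest neighbors and their labels. As a reading aid, a minimal Python sketch of that idea follows; the neighborhood label-agreement score and the function name sample_contributions are assumptions for illustration, not the paper's actual CTC formula.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sample_contributions(X, y, k=5):
        """Score each training sample by label agreement among its k nearest neighbors.

        Samples whose neighbors mostly share their label lie far from the class
        boundary; mixed neighborhoods indicate boundary (or noisy) samples. How
        the paper maps such agreement to a contribution value is an assumption.
        """
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)        # idx[:, 0] is each sample itself
        neighbor_labels = y[idx[:, 1:]]  # labels of the k true neighbors
        return (neighbor_labels == y[:, None]).mean(axis=1)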


Fig. 1 Algorithm CTC (figure omitted)

Fig. 2 Algorithm STS (figure omitted)
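Algorithm STS (Fig. 2) is described as selecting samples using the Hoeffding probability inequality together with the per-sample contributions. The sketch below shows one standard way to pair the two, assuming contribution scores bounded in [0, 1]: Hoeffding's inequality, P(|mean_n - mu| >= eps) <= 2*exp(-2*n*eps^2), gives a minimum sample count n for tolerance eps and confidence 1 - delta, and the n highest-contribution samples are then retained. Both the pairing and the function names are assumptions; the paper's actual STS procedure may differ.

    import math
    import numpy as np

    def hoeffding_sample_size(eps=0.05, delta=0.05):
        """Smallest n with 2*exp(-2*n*eps**2) <= delta, for scores in [0, 1]."""
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    def select_training_samples(contributions, eps=0.05, delta=0.05):
        """Keep the highest-contribution samples; count set by the Hoeffding bound."""
        n = min(hoeffding_sample_size(eps, delta), len(contributions))
        return np.argsort(contributions)[::-1][:n]  # indices of retained samples

For example, eps = 0.05 and delta = 0.05 give n = ceil(ln(40) / 0.005) = 738 retained samples, capped at the training-set size.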


Notes

  1. http://mdp.ivv.nasa.gov/

  2. http://j.mp/scvvIU


Author information

Corresponding author

Correspondence to Cong Jin.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jin, C. A training sample selection method for predicting software defects. Appl Intell 53, 12015–12031 (2023). https://doi.org/10.1007/s10489-022-04044-8

