Label propagation based semi-supervised learning for software defect prediction

Automated Software Engineering

Abstract

Software defect prediction automatically identifies defect-prone software modules so that testing effort can be allocated efficiently. When few modules have known defect labels, predicting the defect-prone modules becomes challenging. Static software defect prediction data exhibit two characteristics: similarity among software modules, in that a module can be approximated by a sparse representation of other modules, and class imbalance, in that defect-free modules far outnumber defective ones. In this paper, we propose a graph based semi-supervised learning technique for software defect prediction. First, we apply a Laplacian score sampling strategy to the labeled defect-free modules to construct a class-balanced labeled training dataset. Then, we use a nonnegative sparse algorithm to compute the nonnegative sparse weights of a relationship graph, which serve as clustering indicators. Finally, on the nonnegative sparse graph, we use a label propagation algorithm to iteratively predict the labels of the unlabeled software modules. The resulting nonnegative sparse graph based label propagation (NSGLP) approach for software defect classification and prediction uses not only the few labeled data but also the abundant unlabeled data to improve generalization capability. We vary the proportion of labeled software modules from 10% to 30% across all the datasets of the widely used NASA projects. Experimental results show that NSGLP outperforms several representative state-of-the-art semi-supervised software defect prediction methods, and that it can fully exploit the characteristics of static code metrics and improve the generalization capability of the software defect prediction model.
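To make the last two steps of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a projected-gradient solver for an L1-penalized nonnegative least-squares problem as the "nonnegative sparse algorithm", and a clamped, row-normalized propagation rule as the "label propagation algorithm". The abstract specifies neither, so both are stand-ins. The function names, the parameters `lam`, `steps` and `iters`, and the six two-feature toy modules are all hypothetical. The Laplacian score sampling step is omitted; the toy labeled set is simply chosen to be balanced.

```python
def nonneg_sparse_graph(X, lam=0.01, steps=300):
    """Row i holds weights w that approximately minimise
    0.5 * ||x_i - sum_j w_j x_j||^2 + lam * sum_j w_j, with w >= 0 and w_i = 0.
    (Assumed objective; the paper's exact formulation is not given in the abstract.)"""
    n = len(X)
    # Gram matrix of inner products between modules.
    G = [[sum(a * b for a, b in zip(X[j], X[k])) for k in range(n)] for j in range(n)]
    # trace(G) bounds the largest eigenvalue of G, so 1 / trace(G) is a safe step size.
    step = 1.0 / sum(G[j][j] for j in range(n))
    W = []
    for i in range(n):
        w = [0.0] * n
        for _ in range(steps):
            # Gradient of the smooth term is G w - G[:, i]; the penalty adds lam.
            grad = [sum(G[j][k] * w[k] for k in range(n)) - G[j][i] + lam
                    for j in range(n)]
            # Gradient step, then projection onto w >= 0 with no self-loop.
            w = [0.0 if j == i else max(0.0, w[j] - step * grad[j]) for j in range(n)]
        W.append(w)
    return W


def propagate_labels(W, labels, iters=200):
    """labels[i] is 1 (defective), 0 (defect-free) or None (unlabeled).
    Returns a score in [0, 1] per module; higher means more likely defective."""
    n = len(W)
    f = [0.5 if y is None else float(y) for y in labels]
    for _ in range(iters):
        new_f = []
        for i in range(n):
            total = sum(W[i])
            if labels[i] is not None or total == 0.0:
                # Labeled modules stay clamped; isolated modules keep their score.
                new_f.append(f[i])
            else:
                # Weighted average of neighbour scores (row-normalized weights).
                new_f.append(sum(W[i][j] * f[j] for j in range(n)) / total)
        f = new_f
    return f


# Hypothetical toy data: modules 0-2 resemble each other, as do modules 3-5.
X = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0],
     [0.1, 1.0], [0.2, 0.9], [0.0, 1.1]]
labels = [1, None, None, 0, None, None]  # one labeled module per class

W = nonneg_sparse_graph(X)
scores = propagate_labels(W, labels)
predicted = [1 if s > 0.5 else 0 for s in scores]
print(predicted)
```

In this toy setting each unlabeled module draws most of its reconstruction weight from modules in its own group, so the single label in each group spreads to its neighbours. On real static code metrics one would normally standardize the features first and tune `lam`, which trades sparsity of the graph against reconstruction accuracy.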

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions to improve this paper. We also thank the staff of the NASA Metrics Data Program for making the software measurement data available. The work described in this paper was partially supported by the NSFC under Project Nos. 61272273 and 61073113, the 333 Engineering of Jiangsu Province under Project No. BRA2011175, and the Graduate Student Innovation Research Project of Jiangsu Province under Grant No. CXZZ12_0478.

Author information

Corresponding author

Correspondence to Zhi-Wu Zhang.

About this article

Cite this article

Zhang, ZW., Jing, XY. & Wang, TJ. Label propagation based semi-supervised learning for software defect prediction. Autom Softw Eng 24, 47–69 (2017). https://doi.org/10.1007/s10515-016-0194-x

  • DOI: https://doi.org/10.1007/s10515-016-0194-x
