Label propagation based semi-supervised learning for software defect prediction

Zhang, Zhi-Wu; Jing, Xiao-Yuan; Wang, Tie-Jian

doi:10.1007/s10515-016-0194-x

Published: 22 March 2016

Automated Software Engineering, Volume 24, pages 47–69 (2017)


Abstract

Software defect prediction automatically identifies defect-prone software modules so that testing effort can be allocated efficiently. When few previous defect labels are available for the modules, predicting the defect-prone ones becomes a challenging problem. Data in static software defect prediction exhibit two characteristics: similarity among software modules (a module can be approximated by a sparse representation of other modules) and class imbalance (defect-free modules far outnumber defective ones). In this paper, we propose a graph-based semi-supervised learning technique for software defect prediction. First, we apply a Laplacian score sampling strategy to the labeled defect-free modules to construct a class-balanced labeled training dataset. Next, we use a nonnegative sparse algorithm to compute the nonnegative sparse weights of a relationship graph, which serve as clustering indicators. Finally, on this nonnegative sparse graph, we use a label propagation algorithm to iteratively predict the labels of unlabeled software modules. The resulting nonnegative sparse graph based label propagation (NSGLP) approach for software defect classification and prediction uses not only a few labeled data but also abundant unlabeled data to improve generalization capability. We vary the proportion of labeled software modules from 10 to 30% of all the datasets in the widely used NASA projects. Experimental results show that NSGLP outperforms several representative state-of-the-art semi-supervised software defect prediction methods, exploits the characteristics of static code metrics, and improves the generalization capability of the software defect prediction model.
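The final step of the pipeline, label propagation over a nonnegative weighted graph, can be illustrated with a minimal sketch. This is not the authors' NSGLP implementation. It assumes the generic clamped propagation scheme, in which each unlabeled node repeatedly takes the weighted average of its neighbours' label scores while labeled nodes are reset to their known labels. The function name `propagate_labels`, the hand-written weight matrix `W`, and the choice of labeled modules are all hypothetical; in the paper the weights would come from the nonnegative sparse algorithm, and the labeled set from Laplacian score sampling.

```python
from typing import Dict, List


def propagate_labels(
    weights: List[List[float]],
    known: Dict[int, int],
    n_classes: int = 2,
    max_iter: int = 1000,
    tol: float = 1e-9,
) -> List[int]:
    """Clamped label propagation on a graph with nonnegative edge weights.

    weights[i][j] >= 0 is an assumed edge weight between modules i and j.
    known maps a labeled module index to its class (0 = defect-free, 1 = defective).
    """
    n = len(weights)

    # Row-normalize the weights so each node averages over its neighbours.
    # A node with no edges gets a self-loop and simply keeps its own scores.
    trans = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total > 0:
            trans.append([w / total for w in row])
        else:
            trans.append([1.0 if j == i else 0.0 for j in range(n)])

    def one_hot(c: int) -> List[float]:
        return [1.0 if k == c else 0.0 for k in range(n_classes)]

    # Soft labels: one-hot for labeled nodes, uniform for unlabeled ones.
    scores = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for i, c in known.items():
        scores[i] = one_hot(c)

    for _ in range(max_iter):
        # Propagate: every node takes the weighted average of its neighbours.
        new = [
            [sum(trans[i][j] * scores[j][k] for j in range(n)) for k in range(n_classes)]
            for i in range(n)
        ]
        # Clamp: labeled nodes are reset to their known labels.
        for i, c in known.items():
            new[i] = one_hot(c)
        delta = max(
            abs(new[i][k] - scores[i][k]) for i in range(n) for k in range(n_classes)
        )
        scores = new
        if delta < tol:
            break

    # Hard decision: the class with the highest score (ties go to the lower index).
    return [max(range(n_classes), key=lambda k: scores[i][k]) for i in range(n)]


# Hypothetical graph of six modules: two tight groups joined by one weak edge.
W = [
    [0.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.7, 0.8],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.8, 0.9, 0.0],
]

# Only module 0 (defect-free) and module 5 (defective) are labeled.
predicted = propagate_labels(W, known={0: 0, 5: 1})
print(predicted)
```

In this toy graph the labels spread within each tightly connected group, because the single weak edge between the groups carries little weight. Sparse weights have the same effect at scale: each module is influenced only by the few modules that best represent it.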

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions to improve this paper. We also thank the staff of the NASA Metrics Data Program for making the software measurement data available. The work described in this paper was partially supported by the NSFC under Project Nos. 61272273 and 61073113, the 333 Engineering of Jiangsu Province under Project No. BRA2011175, and the Graduate Student Innovation Research Project of Jiangsu Province under Grant No. CXZZ12_0478.

Author information

Authors and Affiliations

School of Computer, Nanjing University of Posts and Telecommunications, Nanjing, 210003, People’s Republic of China
Zhi-Wu Zhang & Xiao-Yuan Jing
State Key Laboratory of Software Engineering, School of Computer, Wuhan University, Wuhan, 430072, People’s Republic of China
Xiao-Yuan Jing & Tie-Jian Wang

Corresponding author

Correspondence to Zhi-Wu Zhang.

About this article

Cite this article

Zhang, ZW., Jing, XY. & Wang, TJ. Label propagation based semi-supervised learning for software defect prediction. Autom Softw Eng 24, 47–69 (2017). https://doi.org/10.1007/s10515-016-0194-x

Received: 22 December 2014
Accepted: 09 March 2016
Published: 22 March 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s10515-016-0194-x
