Software defect prediction employing BiLSTM and BERT-based semantic feature

Uddin, Md Nasir; Li, Bixin; Ali, Zafar; Kefalas, Pavlos; Khan, Inayat; Zada, Islam

doi:10.1007/s00500-022-06830-5

Focus
Published: 21 February 2022

Volume 26, pages 7877–7891, (2022)

Soft Computing

Md Nasir Uddin ORCID: orcid.org/0000-0002-0493-9803¹,
Bixin Li¹,
Zafar Ali¹,
Pavlos Kefalas²,
Inayat Khan³ &
Islam Zada⁴

1513 Accesses
22 Citations

Abstract

In recent years, software defect prediction systems have become popular because they improve software reliability by identifying potential bugs in code. Several models have been introduced in the literature to support developers. Unfortunately, these models rely on manually constructed code features that are fed into machine learning-based classifiers. Moreover, these baseline approaches ignore the semantic and contextual information of the source code. In this paper, we present a software defect prediction model that addresses these issues. The model, SDP-BB, employs a bidirectional long short-term memory network (BiLSTM) and BERT-based semantic features to capture the semantics of code and predict defects in the corresponding software. In particular, it uses the BiLSTM to exploit contextual information from the embedded token vectors learned through the BERT model. It also uses an attention mechanism to capture salient features of the nodes. In addition, a data augmentation technique is used to generate more training data. We evaluated our approach against state-of-the-art models on ten open-source projects in terms of F1-score in fault prediction. The experiments evaluated full-token and AST-node data processing methods, varying the length coverage on each project from 50 to 90%, in both within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) settings. The results indicate that the proposed method outperforms competing models.
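The abstract does not define "length coverage" or give the evaluation code, so the following is only a minimal sketch under stated assumptions. It assumes that a coverage of p% means choosing the shortest token-sequence length that is long enough for at least p% of the files in a project, and that shorter sequences are padded while longer ones are truncated. The function names (`coverage_length`, `pad_or_truncate`, `f1_score`) and the `<pad>` token are hypothetical and are not taken from the paper. The F1-score is the standard harmonic mean of precision and recall for the defective class.

```python
def coverage_length(lengths, percent):
    """Return the smallest length that covers at least `percent` % of sequences.

    Assumption: "coverage" is read as a percentile of sequence lengths.
    `percent` is expected to be an integer in the range 1..100.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(lengths)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def pad_or_truncate(tokens, length, pad="<pad>"):
    """Cut a token list to `length`, or right-pad it with a hypothetical pad token."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))


def f1_score(y_true, y_pred):
    """F1-score for the positive (defective = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only; these numbers are not results from the paper.
file_lengths = [3, 5, 8, 10, 12, 20, 25, 40, 60, 100]
len_50 = coverage_length(file_lengths, 50)
len_90 = coverage_length(file_lengths, 90)
padded = pad_or_truncate(["if", "x", "return"], 5)
score = f1_score([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0])
```

Under this reading, a lower coverage value gives shorter inputs, which truncates more of the long files, while a higher value keeps more tokens at the cost of more padding.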

Data availability

Enquiries about data availability should be directed to the authors.

Notes

Bug prediction at Google. http://google-engtools.blogspot.com/.
http://openscience.us/repo/defect.
https://github.com/codertimo/BERT-pytorch.

Funding

No funding

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, China
Md Nasir Uddin, Bixin Li & Zafar Ali
Department of Informatics, Aristotle University, Thessaloniki, Greece
Pavlos Kefalas
Department of Computer Science, University of Buner, Buner, 19290, Pakistan
Inayat Khan
Department of Computer Science and Software Engineering, International Islamic University (IIU), Islamabad, Pakistan
Islam Zada

Contributions

Md Nasir Uddin helped in conceptualization, methodology, software, data curation, writing—original draft. Bixin Li supervised and conceptualized the study. Zafar Ali was involved in conceptualization, methodology, writing—review & editing. Pavlos Kefalas contributed to formal analysis, writing—review & editing. Inayat Khan helped in analysis and manuscript preparation. Islam Zada was involved in resources and analysis with constructive discussions.

Corresponding authors

Correspondence to Md Nasir Uddin or Zafar Ali.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Jia-Bao Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Uddin, M.N., Li, B., Ali, Z. et al. Software defect prediction employing BiLSTM and BERT-based semantic feature. Soft Comput 26, 7877–7891 (2022). https://doi.org/10.1007/s00500-022-06830-5

Accepted: 24 January 2022
Published: 21 February 2022
Issue Date: August 2022
DOI: https://doi.org/10.1007/s00500-022-06830-5
