Abstract
In recent years, software defect prediction systems have become popular because they improve software reliability by identifying potential bugs in the code. Several models that aim to support developers have been introduced in the literature. Unfortunately, these models rely on manually constructed code features that are fed into machine learning-based classifiers. Moreover, these baseline approaches ignore the semantic and contextual information of the source code. In this paper, we present a software defect prediction model that addresses these issues. The model, SDP-BB, employs a bidirectional long short-term memory network (BiLSTM) and BERT-based semantic features to capture the semantics of code and predict defects in the corresponding software. In particular, it uses the BiLSTM to exploit contextual information from the embedded token vectors learned through the BERT model. It also uses an attention mechanism to capture salient features of the nodes. In addition, a data augmentation technique is used to generate more training data. We evaluated our approach against state-of-the-art models on ten open-source projects in terms of F1-score in fault prediction. The experiments evaluated the performance of the full-token and AST-node data processing methods, varying the length coverage on each project from 50% to 90%, in both within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) settings. The results indicate that the proposed method outperforms the competing models.
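Since the evaluation is reported in terms of F1-score, a minimal sketch of how that metric is computed for a binary defect label may help. The function name `f1_score` and the toy labels are illustrative assumptions and are not taken from the paper's experiments; the label 1 is assumed to mean "defective".

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (assumed here: 1 = defective file)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defects found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed defects
    if tp == 0:
        # Precision and recall are both zero or undefined; report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 ignores true negatives, it is commonly preferred over accuracy when defective files are a small minority of the data set.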
Data availability
Enquiries about data availability should be directed to the authors.
Notes
Bug prediction at Google. http://google-engtools.blogspot.com/.
Funding
No funding
Author information
Contributions
Md Nasir Uddin helped in conceptualization, methodology, software, data curation, writing—original draft. Bixin Li supervised and conceptualized the study. Zafar Ali was involved in conceptualization, methodology, writing—review & editing. Pavlos Kefalas contributed to formal analysis, writing—review & editing. Inayat Khan helped in analysis and manuscript preparation. Islam Zada was involved in resources and analysis with constructive discussions.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Communicated by Jia-Bao Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Uddin, M.N., Li, B., Ali, Z. et al. Software defect prediction employing BiLSTM and BERT-based semantic feature. Soft Comput 26, 7877–7891 (2022). https://doi.org/10.1007/s00500-022-06830-5
DOI: https://doi.org/10.1007/s00500-022-06830-5