An empirical study of text-based machine learning models for vulnerability detection

Napier, Kollin; Bhowmik, Tanmay; Wang, Shaowei

doi:10.1007/s10664-022-10276-6


Published: 03 February 2023

Volume 28, article number 38, (2023)

Empirical Software Engineering


Abstract

As software vulnerabilities increase in complexity and severity, they are becoming harder to identify and mitigate. Although traditional tools remain useful, machine learning models are being adopted to expand detection efforts. To help explore methods of vulnerability detection, we present an empirical study of the effectiveness of text-based machine learning models using 344 open-source projects, 2,182 vulnerabilities and 38 vulnerability types. Because vulnerabilities are often available in forms such as code snippets, we construct a methodology based on extracted source code functions and create equal pairings. We conduct experiments using seven machine learning models, five natural language processing techniques and three data processing methods. First, we present results based on full context function pairings. Next, we introduce condensed functions and conduct a statistical analysis to determine whether there is a significant difference between the models, techniques, or methods. Based on these results, we answer research questions regarding model prediction when testing within and across projects and vulnerability types. Our results show that condensed functions with fewer features may achieve greater prediction results when testing within, rather than across, projects and vulnerability types. Overall, we conclude that text-based machine learning models are not effective in detecting vulnerabilities within or across projects and vulnerability types.
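To make the setup in the abstract more concrete, the following is a minimal sketch, not the study's implementation, of treating source code functions as plain text and of separating a cross-project test set from the training data. The toy records, the tokenizer, and the helper names (`tokenize`, `bag_of_words`, `is_balanced`, `cross_project_split`) are all hypothetical, and "equal pairings" is interpreted here, as an assumption, as an equal number of vulnerable and non-vulnerable functions.

```python
import re
from collections import Counter

# Hypothetical toy records: (project, function source, label).
# Label 1 marks a vulnerable function, 0 its non-vulnerable counterpart.
PAIRS = [
    ("proj_a", "void f(char *s) { char b[8]; strcpy(b, s); }", 1),
    ("proj_a", "void f(char *s) { char b[8]; strncpy(b, s, 7); b[7] = 0; }", 0),
    ("proj_b", "int g(int n) { int a[4]; return a[n]; }", 1),
    ("proj_b", "int g(int n) { int a[4]; if (n < 0 || n >= 4) return -1; return a[n]; }", 0),
]


def tokenize(source):
    """Split source text into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def bag_of_words(source):
    """Count token occurrences, treating the function purely as text."""
    return Counter(tokenize(source))


def is_balanced(records):
    """Check that vulnerable and non-vulnerable functions appear in equal numbers."""
    vulnerable = sum(1 for _, _, label in records if label == 1)
    return vulnerable * 2 == len(records)


def cross_project_split(records, held_out):
    """Train on every project except `held_out`; test only on `held_out`."""
    train = [r for r in records if r[0] != held_out]
    test = [r for r in records if r[0] == held_out]
    return train, test


if __name__ == "__main__":
    train, test = cross_project_split(PAIRS, "proj_b")
    print("train projects:", sorted({r[0] for r in train}))
    print("test projects:", sorted({r[0] for r in test}))
    print("balanced train set:", is_balanced(train))
    print("most common tokens:", bag_of_words(train[0][1]).most_common(3))
```

A within-project evaluation would instead draw both the training and the test functions from the same project; the same idea applies when the grouping key is a vulnerability type rather than a project.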




Data Availability

The datasets generated during and/or analyzed during the current study are available in the “emse_data” repository, https://github.com/krn65/emse_data

Notes

https://haveibeenpwned.com
https://owasp.org/www-project-top-ten/
https://cve.mitre.org/cve/
https://cvedetails.com/browse-by-date.php
https://scitools.com
https://tomcat.apache.org/
https://nvd.nist.gov
https://samate.nist.gov/SARD/
https://dwheeler.com/flawfinder/
The original database link provided by the paper is unavailable, but an alternative link was found: https://github.com/announce/vcc-base
https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/doc2vec.html
https://scikit-learn.org/
https://radimrehurek.com/gensim/
https://keras.io
https://tensorflow.org
https://wikipedia.org
https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
https://github.com/krn65/emse_data
https://cvedetails.com
CVE Details does provide a disclaimer that the site and all data are provided “as is”, meaning it is not guaranteed to be accurate or complete.
https://github.com/FFmpeg/FFmpeg
https://github.com/bonzini/qemu
https://cwe.mitre.org/data/definitions/119.html
https://cwe.mitre.org/data/definitions/20.html


Author information

Authors and Affiliations

Department of Computer Science and Engineering, Mississippi State University, Mississippi State, MS, USA
Kollin Napier & Tanmay Bhowmik
Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
Shaowei Wang


Corresponding author

Correspondence to Kollin Napier.

Ethics declarations

Conflict of Interests

The authors of this manuscript have no conflicts of interest.

Additional information

Communicated by: Yuan Zhang

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Preliminary Experiment Additional Metrics

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Napier, K., Bhowmik, T. & Wang, S. An empirical study of text-based machine learning models for vulnerability detection. Empir Software Eng 28, 38 (2023). https://doi.org/10.1007/s10664-022-10276-6


Accepted: 13 December 2022
Published: 03 February 2023
DOI: https://doi.org/10.1007/s10664-022-10276-6

