Abstract
As software vulnerabilities increase in complexity and severity, they are becoming harder to identify and mitigate. Although traditional tools remain useful, machine learning models are being adopted to extend detection efforts. To help explore methods of vulnerability detection, we present an empirical study on the effectiveness of text-based machine learning models using 344 open-source projects, 2,182 vulnerabilities, and 38 vulnerability types. Because vulnerabilities are often available in forms such as code snippets, we construct a methodology based on extracted source code functions and create equal pairings. We conduct experiments using seven machine learning models, five natural language processing techniques, and three data processing methods. First, we present results based on full-context function pairings. Next, we introduce condensed functions and conduct a statistical analysis to determine whether there is a significant difference between the models, techniques, or methods. Based on these results, we answer research questions regarding model prediction when testing within and across projects and vulnerability types. Our results show that condensed functions with fewer features may achieve better prediction results when testing within, rather than across, projects and vulnerability types. Overall, we conclude that text-based machine learning models are not effective in detecting vulnerabilities within or across projects and vulnerability types.
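The distinction between testing within and across projects can be illustrated with a small sketch. It is not the study's actual pipeline: the sample layout, the project names, the function names `within_project_split` and `cross_project_split`, and the 25% test fraction are all assumptions made for illustration. The same idea would apply to vulnerability types by grouping on a type field instead of a project field.

```python
from collections import defaultdict
import random


def within_project_split(samples, test_fraction=0.25, seed=0):
    """Split each project's samples so every project appears in train and test."""
    rng = random.Random(seed)
    by_project = defaultdict(list)
    for sample in samples:
        by_project[sample["project"]].append(sample)

    train, test = [], []
    for project in sorted(by_project):
        group = by_project[project][:]
        rng.shuffle(group)
        # Keep at least one test sample per project when it has 2+ samples.
        n_test = max(1, int(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def cross_project_split(samples, held_out_projects):
    """Hold out whole projects so no test project is seen during training."""
    held_out = set(held_out_projects)
    train = [s for s in samples if s["project"] not in held_out]
    test = [s for s in samples if s["project"] in held_out]
    return train, test


# Toy data: hypothetical project names and function bodies, not from the study.
samples = [
    {"project": p, "code": f"void f{i}() {{}}", "label": i % 2}
    for p in ("proj_a", "proj_b", "proj_c")
    for i in range(4)
]

within_train, within_test = within_project_split(samples)
cross_train, cross_test = cross_project_split(samples, ["proj_c"])
```

In the within-project split, a model sees functions from every project during training. In the cross-project split, it is evaluated only on projects it has never seen, which is the harder setting.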
Data Availability
The datasets generated during and/or analyzed during the current study are available in the “emse_data” repository, https://github.com/krn65/emse_data
Notes
The original database link provided by the paper is unavailable, but an alternative link was found: https://github.com/announce/vcc-base
CVE Details does provide a disclaimer that the site and all data are provided “as is”, meaning it is not guaranteed to be accurate or complete.
Ethics declarations
Conflict of Interests
The authors of this manuscript have no conflicts of interest.
Additional information
Communicated by: Yuan Zhang
Appendix A: Preliminary Experiment Additional Metrics
About this article
Cite this article
Napier, K., Bhowmik, T. & Wang, S. An empirical study of text-based machine learning models for vulnerability detection. Empir Software Eng 28, 38 (2023). https://doi.org/10.1007/s10664-022-10276-6
DOI: https://doi.org/10.1007/s10664-022-10276-6