Abstract
Cyber security has grown steadily in importance as information technologies have become an integral part of modern life. Software security is one of its fundamental areas, and security vulnerabilities in software are one of the main reasons information systems are exploited. For this reason, vulnerabilities have long been systematically reported, analyzed and classified under a protocol established between states and the stakeholders of the issue. Today, all of these processes are carried out manually by human experts, which leads to errors and delays. The current study therefore aims to assist experts by speeding up these processes and increasing the accuracy of the analysis results. To this end, we propose a model that uses the technical descriptions in security reports written in natural language. The model combines word embedding approaches from natural language processing with multi-class classification algorithms. To allow a sound evaluation, we chose the NVD database, which is publicly available and accepted as a reference. We also compared the proposed model with previous studies in the literature. To make this comparison fair, our model was trained on the data sets used in those studies, and the results are reported. The proposed method achieved estimation success in the range of 87.34–96.25% for CVSS 2.0 metrics and 84–90% for CVSS 3.1 metrics. This study, which combines different word embedding and classification algorithms, is one of the few studies on the latest version of the official scoring system used to classify software security vulnerabilities. Moreover, in terms of the size of the dataset it uses and the number of databases evaluated, it is the most comprehensive and original study in its field.
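The general idea of mapping a vulnerability description to a CVSS metric class can be illustrated with a minimal sketch. It is not the model evaluated in this study: a bag-of-words count vector stands in for a learned word embedding, a cosine-similarity nearest-neighbour vote stands in for the multi-class classifiers, and the training descriptions, the names `TRAIN`, `embed`, `cosine` and `predict`, and the choice of the access vector metric are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy data: invented descriptions, NOT real NVD records.
# Labels mimic two classes of the CVSS access vector metric.
TRAIN = [
    ("remote attacker sends crafted http request to execute arbitrary code", "NETWORK"),
    ("remote attackers cause denial of service via crafted packet", "NETWORK"),
    ("local user gains root privileges via symlink attack on temporary file", "LOCAL"),
    ("local users read sensitive files through insecure file permissions", "LOCAL"),
]


def embed(text):
    """Turn a description into a bag-of-words count vector.

    Assumption: this is only a stand-in for a learned word embedding.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(text, train=TRAIN, k=1):
    """Return the majority label among the k most similar training texts."""
    query = embed(text)
    ranked = sorted(train, key=lambda ex: cosine(query, embed(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict("a remote attacker can execute arbitrary code via crafted request"))
    print(predict("local user can overwrite a temporary file"))
```

In a setting closer to the one described above, one such classifier would be trained per CVSS metric, with learned embeddings and a much larger labelled corpus in place of the toy data.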
Funding
This study is supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) with project number 121E298.
Author information
Contributions
Dr. Hakan KEKÜL was involved in conceptualization, methodology, validation, formal analysis, resources, data curation, writing (original draft), writing (review and editing), and visualization. Dr. Burhan ERGEN was involved in conceptualization, methodology, validation, formal analysis, writing (review and editing), supervision, and project administration. Dr. Halil ARSLAN was involved in conceptualization, methodology, validation, formal analysis, writing (original draft), writing (review and editing), and supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
About this article
Cite this article
Kekül, H., Ergen, B. & Arslan, H. Estimating vulnerability metrics with word embedding and multiclass classification methods. Int. J. Inf. Secur. 23, 247–270 (2024). https://doi.org/10.1007/s10207-023-00734-7
DOI: https://doi.org/10.1007/s10207-023-00734-7