
1 Introduction

With the rapid development of the Internet industry, a large number of cybersecurity vulnerabilities have been discovered and exploited in companies' products, posing potential risks to production and daily life. Vulnerability threat discovery and traceability have therefore become common challenges and work requirements for personnel such as system operation and maintenance staff and network administrators. Vulnerability information comes from many sources, including vulnerability reports from open source communities, public vulnerability databases, and product patch information. These sources are scattered, incomplete, and heterogeneous in structure, and the vulnerability knowledge drawn from different Internet community platforms mixes high-quality and low-quality information, is highly repetitive, and lacks clear correlations. As a result, data quality cannot be guaranteed, and the data cannot effectively support the needs of cybersecurity personnel for vulnerability detection, analysis, and judgment.

In recent years, knowledge graphs have used deep learning to turn data into valuable information and knowledge models through data collection, analysis, and mining. Since Google proposed the knowledge graph concept and applied it to intelligent search [1], it was first applied effectively in the commercial field, for example LinkedIn's economic graph (user profiles) in the social field and the Tianyancha enterprise graph (enterprise profiles) in the field of enterprise information.

In various vertical fields in China, there has been research and exploration on applying knowledge graphs. An Ning et al. [2] proposed a cross-platform network public opinion knowledge graph, using Sina Weibo and Douyin short videos as data sources, mainly for the management and guidance of online public opinion. Xiao Le et al. [3] proposed a grain situation knowledge graph, built mainly on a grain situation dictionary and the Flat-Lattice model for extracting grain situation entities, to assist grain situation decision-making. Mou Tianhao et al. [4] proposed a knowledge graph of process industrial control systems based on cyber-physical asset management tasks to solve business problems related to industrial control systems. Zhang Kunli et al. [5] took obstetric diseases as the core and proposed a Chinese obstetric knowledge graph to facilitate medical question answering and auxiliary diagnosis and treatment.

There are still few applications of knowledge graphs in the field of cybersecurity. This paper uses a knowledge graph to correlate numerous isolated pieces of vulnerability intelligence and present a panorama of vulnerability entities, which provides a new idea for vulnerability research and analysis and helps address difficulties in cybersecurity operations.

2 Vulnerability Knowledge Graph Construction Route

Large-scale domestic vulnerability databases include the China National Vulnerability Database (CNVD) and the China National Vulnerability Database of Information Security (CNNVD), which are the main channels for constructing and sharing vulnerability intelligence [6]. Considering the current state of information security development, the sources of vulnerability intelligence in this paper are CNVD, CNNVD, and CVE (Common Vulnerabilities and Exposures). After the vulnerability knowledge is integrated, manual proofreading is performed, and data with low confidence is discarded to ensure the quality of the vulnerability knowledge base. At the same time, the knowledge extraction model is continuously trained under supervision with new intelligence. As data accumulates, new knowledge base data sources such as open source security websites are added as appropriate, and the entire system is iteratively updated.

2.1 Schema Layer Design

The schema layer of the vulnerability knowledge graph sits above the data layer, and its core is the ontology library, an abstract representation of vulnerability knowledge analogous to a "class" in object-oriented programming. The schema layer mainly consists of entity-relation-entity and entity-attribute-value triples. Based on "Information security technology—Cybersecurity vulnerability identification and description specification" (GB/T 28458–2020) [8], the framework for vulnerability identification and description is composed of identification items and description items. Taking into account the actual situation of domestic vulnerabilities, mainly from the perspective of vulnerability management and emergency response [9], the main attribute of a vulnerability is its CNVD_ID. The preliminary entity and relationship framework is shown in Fig. 1.

Fig. 1. The framework of entity and relationship

Based on the graph structure, entities represent objects or abstract concepts in the vulnerability space, and relationships model interactions between entities; the framework follows the (head entity, relation, tail entity) triple. In Fig. 1, entities are drawn as boxes, each row under an entity name lists one of its attributes, PK marks the primary attribute, and arrows represent relationships. Five entities are defined: vulnerability = {CNVD_ID, title, date, level, product, description, solution, patch, CVE_ID}; event = {event_id, description, time, URL, victim}; company = {name}; product = {name}; victim = {name}. Four relationships are defined: influence, raise, belong to, use. More entities, attributes, and relationships can be gradually added within this framework.
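To make the schema concrete, the following is a minimal Python sketch of the five entity types and four relationship types described above. The class and field names simply mirror Fig. 1; the example instances at the end (including the CNVD identifier) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Entity types from the schema layer; the primary attribute (PK) comes first in each class.
@dataclass
class Vulnerability:
    cnvd_id: str                      # PK
    title: str = ""
    date: str = ""
    level: str = ""
    product: str = ""
    description: str = ""
    solution: str = ""
    patch: str = ""
    cve_id: Optional[str] = None

@dataclass
class Event:
    event_id: str                     # PK
    description: str = ""
    time: str = ""
    url: str = ""
    victim: str = ""

@dataclass
class Company:
    name: str                         # PK

@dataclass
class Product:
    name: str                         # PK

@dataclass
class Victim:
    name: str                         # PK

# A relationship instance is a (head entity, relation, tail entity) triple.
RELATIONS = ("influence", "raise", "belong to", "use")

@dataclass
class Triple:
    head: object
    relation: str                     # one of RELATIONS
    tail: object

# Hypothetical example: vulnerability V "raises" event E.
v = Vulnerability(cnvd_id="CNVD-2021-00001", title="Example remote code execution vulnerability")
e = Event(event_id="E-001", description="Intrusion caused by the example vulnerability")
print(Triple(head=v, relation="raise", tail=e))
```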

2.2 Data Layer Construction

The construction of the vulnerability knowledge graph data layer consists of three steps: data collection, knowledge extraction, and knowledge fusion.

2.2.1 Data Collection

Vulnerability, company, and product data are obtained from the unstructured text of the China National Vulnerability Database (CNVD) and the semi-structured text of CVE (Common Vulnerabilities and Exposures) [10]. Data for the two remaining entities, events and victims, can be collected in a compliant manner, according to each organization's own circumstances, by units that manage vulnerabilities for themselves and their subordinate units, or by vulnerability managers.
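As an illustration of the collection step, the sketch below downloads a JSON vulnerability feed and maps each raw record onto the schema-layer attributes of the vulnerability entity. The feed URL and the raw field names are assumptions made for the example only; the real CNVD and CVE sources have their own formats and must be consulted directly.

```python
import json
import requests  # third-party HTTP client, used here only for illustration

# Hypothetical feed URL and field names; they do not describe any real CNVD/CVE endpoint.
FEED_URL = "https://example.org/vuln-feed.json"

def fetch_raw_records(url: str = FEED_URL) -> list:
    """Download a JSON vulnerability feed and return its list of raw records."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json().get("records", [])

def to_vulnerability(record: dict) -> dict:
    """Map one raw record onto the schema-layer attributes of the vulnerability entity."""
    return {
        "CNVD_ID": record.get("cnvd_id", ""),
        "CVE_ID": record.get("cve_id", ""),
        "title": record.get("title", ""),
        "date": record.get("published", ""),
        "level": record.get("severity", ""),
        "description": record.get("description", ""),
    }

if __name__ == "__main__":
    rows = [to_vulnerability(r) for r in fetch_raw_records()]
    print(json.dumps(rows[:3], ensure_ascii=False, indent=2))
```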

2.2.2 Knowledge Extraction

Knowledge extraction is a method for automatically obtaining structured information such as entities, relationships, and entity attributes from heterogeneous, i.e. semi-structured or unstructured, data. According to the characteristics of vulnerability intelligence text, this paper labels the text with the BIOES scheme [11] and then performs three main operations: entity extraction, attribute extraction, and relation extraction, each introduced below. A minimal labeling sketch is given first.
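The following sketch shows BIOES labeling: given a tokenized sentence and the entity spans marked in it, each token receives a B (begin), I (inside), O (outside), E (end), or S (single) tag. The sentence and the "PRO" product tag are hypothetical examples.

```python
from typing import List, Tuple

def bioes_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert labeled entity spans into BIOES tags, one tag per token.

    Each span is (start index, end index inclusive, entity type); for example
    (0, 1, "PRO") marks the product mention "Apache Log4j" in the tokens below.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if start == end:                          # single-token entity
            tags[start] = f"S-{etype}"
        else:                                     # multi-token entity
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"
            tags[end] = f"E-{etype}"
    return tags

# Hypothetical sentence from a vulnerability report, containing one product entity.
tokens = ["Apache", "Log4j", "has", "a", "remote", "code", "execution", "vulnerability"]
print(list(zip(tokens, bioes_tags(tokens, [(0, 1, "PRO")]))))
# [('Apache', 'B-PRO'), ('Log4j', 'E-PRO'), ('has', 'O'), ...]
```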

1)

    Entity extraction, namely named entity recognition (NER), refers to automatically recognizing named entities from text datasets. The main technical approaches are: rule-based and dictionary-based methods, which rely on manually constructed rule templates and on pattern and string matching; statistical methods, including the Hidden Markov Model (HMM), Maximum Entropy Model (MEM), Support Vector Machine (SVM), and Conditional Random Field (CRF); and neural network methods, whose main models are NN/CNN-CRF, RNN-CRF, and LSTM-CRF. The goal of attribute extraction is to collect the attribute information of a specific entity from different information sources; for example, for a specific vulnerability, attributes such as its name and affected products can be obtained from public information on the Internet. For entity and attribute extraction, this paper adopts the BLSTM-CRF model (Bidirectional Long Short-Term Memory network with a Conditional Random Field layer) [12], which currently performs well in the field of security vulnerabilities. Taking the product entity (Apache Log4j) as an example, the model structure is shown in Fig. 2, and a model sketch follows the figure.

Fig. 2. The structure of BLSTM-CRF model
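The paper adopts the BLSTM-CRF model but does not include code, so the following is only a minimal PyTorch sketch of the structure in Fig. 2, assuming the third-party pytorch-crf package for the CRF layer; the BIOES tag set for the product entity and the toy batch are hypothetical.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party "pytorch-crf" package, assumed to be installed

class BiLSTMCRF(nn.Module):
    """A minimal BLSTM-CRF tagger: embedding -> bidirectional LSTM -> linear emissions -> CRF."""

    def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, tokens):
        return self.emission(self.lstm(self.embed(tokens))[0])

    def loss(self, tokens, tags, mask):
        return -self.crf(self._emissions(tokens), tags, mask=mask)   # negative log-likelihood

    def decode(self, tokens, mask):
        return self.crf.decode(self._emissions(tokens), mask=mask)   # best BIOES tag sequence

# Hypothetical BIOES tag set for the product entity, e.g. "Apache Log4j" -> B-PRO E-PRO.
TAGS = ["O", "B-PRO", "I-PRO", "E-PRO", "S-PRO"]
model = BiLSTMCRF(vocab_size=5000, num_tags=len(TAGS))

tokens = torch.randint(1, 5000, (2, 12))            # toy batch: 2 sentences of 12 token ids
tags = torch.randint(0, len(TAGS), (2, 12))         # toy gold BIOES tag ids
mask = torch.ones(2, 12, dtype=torch.bool)          # no padding in this toy batch
print(model.loss(tokens, tags, mask).item())
print(model.decode(tokens, mask)[0])
```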

2)

    Relation extraction. After entity and attribute extraction, the vulnerability intelligence text yields a series of discrete named entities. Obtaining further semantic information requires relation extraction: extracting the interrelationships between entities from the relevant texts and connecting entities through relationships to form a networked knowledge structure. Unlike a social character graph, the vulnerability knowledge graph has relatively few and simple relationships, for example, vulnerability A "raises" event B. Since the relationships defined in the schema layer are easy to distinguish in text data such as vulnerability reports, this paper chooses rule matching: recognized entities are automatically matched, according to their categories and the relationship definitions in the schema layer, and fine-tuning is performed afterwards. Because the entities conform to pattern-based rules, the relationship between entities is determined by trigger words; the designed rule samples are shown in Table 1, and an illustrative matching sketch follows the table.

Table 1. Samples of trigger word rules
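Since the exact trigger-word rules of Table 1 are not reproduced here, the sketch below uses illustrative patterns to show how rule matching can work: when a trigger word appears in a sentence, recognized entity pairs whose types fit the rule are linked by the corresponding schema relation. The regular expressions, the example sentence, and the CNVD identifier are assumptions for the example.

```python
import re
from typing import List, Tuple

# Illustrative trigger-word rules (the real rules are those listed in Table 1).
# Each rule: (regex over the sentence, head entity type, relation, tail entity type).
RULES = [
    (re.compile(r"affects|influences"), "vulnerability", "influence", "product"),
    (re.compile(r"caused|led to|raised"), "vulnerability", "raise", "event"),
    (re.compile(r"developed by|belongs to"), "product", "belong to", "company"),
    (re.compile(r"uses|deploys"), "victim", "use", "product"),
]

def match_relations(sentence: str, entities: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (head, relation, tail) triples for entity pairs whose types and
    trigger words match a rule; `entities` holds (mention, type) pairs from NER."""
    by_type = {}
    for mention, etype in entities:
        by_type.setdefault(etype, []).append(mention)
    triples = []
    for pattern, head_type, rel, tail_type in RULES:
        if pattern.search(sentence):
            for head in by_type.get(head_type, []):
                for tail in by_type.get(tail_type, []):
                    triples.append((head, rel, tail))
    return triples

# Hypothetical sentence and recognized entities.
sentence = "CNVD-2021-00001 affects Apache Log4j, which belongs to the Apache Software Foundation."
entities = [("CNVD-2021-00001", "vulnerability"), ("Apache Log4j", "product"),
            ("Apache Software Foundation", "company")]
print(match_relations(sentence, entities))
# [('CNVD-2021-00001', 'influence', 'Apache Log4j'),
#  ('Apache Log4j', 'belong to', 'Apache Software Foundation')]
```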

2.2.3 Knowledge Fusion

After data collection and knowledge extraction, entities, relationships, and entity attributes have been obtained from the original unstructured and semi-structured vulnerability intelligence data. However, the relationships across multiple sources remain flat and lack hierarchy and logic, and the extracted knowledge still contains considerable redundancy and misinformation. Knowledge fusion solves this problem through entity disambiguation and coreference resolution, integrating the vulnerability knowledge. For example, a company mentioned by its Chinese name and the company "Apple" are synonymous mentions of the same entity and need to be merged. After knowledge fusion, noise and redundancy in the data are removed and the quality of the vulnerability knowledge is improved.
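A minimal sketch of the fusion step is given below, assuming a hand-built alias table for entity disambiguation; in practice coreference resolution and similarity models would supplement such a table. All alias entries and triples are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical alias table: every known surface form maps to one canonical entity name.
ALIASES: Dict[str, str] = {
    "apple": "Apple",
    "apple inc.": "Apple",
    "apache": "Apache Software Foundation",
    "apache software foundation": "Apache Software Foundation",
}

def canonical(name: str) -> str:
    """Resolve a surface form to its canonical name; unknown names pass through unchanged."""
    return ALIASES.get(name.strip().lower(), name.strip())

def fuse_triples(triples: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Normalize entity mentions and drop duplicate triples produced by multiple sources."""
    seen = set()
    fused = []
    for head, rel, tail in triples:
        t = (canonical(head), rel, canonical(tail))
        if t not in seen:
            seen.add(t)
            fused.append(t)
    return fused

raw = [("Apache Log4j", "belong to", "apache"),
       ("Apache Log4j", "belong to", "Apache Software Foundation")]
print(fuse_triples(raw))   # the two source records collapse into one triple
```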

3 Vulnerability Knowledge Graph Construction Results

3.1 Experimental Environment

The experimental environment of this paper is as follows: the operating system is Windows 10; the CPU is an AMD Ryzen 7 5800H @ 3.2 GHz; the GPU is an RTX 3050 Ti (4 GB); the memory is 64 GB; the Python version is 3.7; the Neo4j version is 3.1.1.

3.2 Knowledge Graph Display

Taking some generic vulnerability data and a small number of affected victims under Apache as an example (the entities are vulnerabilities, historical events, involved victims, companies, and products; the relationships are the edges of a directed graph), the constructed visual interface is shown in Fig. 3, and a loading sketch follows the figure.

Fig. 3. Vulnerability knowledge graph
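To reproduce a display like Fig. 3, the fused triples can be written into Neo4j. The sketch below uses the official Neo4j Python driver with a hypothetical local connection; for brevity every node is keyed by a single name property, whereas the real schema keys vulnerabilities by CNVD_ID.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver; connection details are assumptions

URI, AUTH = "bolt://localhost:7687", ("neo4j", "password")  # hypothetical local instance

# Map schema relations to node labels and Cypher relationship types.
REL_MAP = {
    "influence": ("Vulnerability", "INFLUENCE", "Product"),
    "raise":     ("Vulnerability", "RAISE", "Event"),
    "belong to": ("Product", "BELONG_TO", "Company"),
    "use":       ("Victim", "USE", "Product"),
}

def load_triples(triples):
    """MERGE each (head, relation, tail) triple into the graph so duplicates are not created."""
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with driver.session() as session:
            for head, rel, tail in triples:
                head_label, rel_type, tail_label = REL_MAP[rel]
                session.run(
                    f"MERGE (h:{head_label} {{name: $head}}) "
                    f"MERGE (t:{tail_label} {{name: $tail}}) "
                    f"MERGE (h)-[:{rel_type}]->(t)",
                    head=head, tail=tail,
                )
    finally:
        driver.close()

# Hypothetical triples; the CNVD identifier and victim name are placeholders.
load_triples([("CNVD-2021-00001", "influence", "Apache Log4j"),
              ("Apache Log4j", "belong to", "Apache Software Foundation"),
              ("Example Victim", "use", "Apache Log4j")])
```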

3.3 Application Analysis

In terms of vulnerability threat discovery and analysis, correlating and analyzing vulnerability information by constructing the graph allows hidden information to be mined and effective judgments to be made. Referring to Fig. 3, each type of entity is a node in the graph and each type of relationship between entities is an edge. Starting from a given entity, such as a victim operating critical infrastructure, one can see which products of which companies the victim uses, and which security events have occurred at which times due to which vulnerabilities. Once a 0-day vulnerability appears again in a corresponding product of that company, it can reasonably be predicted that the victim will be affected by this vulnerability, and an early warning can be issued before a possible cybersecurity event to avoid major losses; a query sketch is given below. Such information is often unavailable from a single vulnerability report, and the knowledge graph organically connects the many pieces of vulnerability information.
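As a concrete form of the early-warning analysis described above, the following sketch queries the graph for every victim that uses a product influenced by a given vulnerability. The labels and relationship types match the loading sketch after Fig. 3, and the connection details and identifier are hypothetical.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver; connection details are assumptions

# Find every victim that uses a product influenced by the given vulnerability,
# i.e. the organizations that should receive an early warning.
WARNING_QUERY = """
MATCH (v:Vulnerability {name: $cnvd_id})-[:INFLUENCE]->(p:Product)<-[:USE]-(victim:Victim)
RETURN victim.name AS victim, p.name AS product
"""

def victims_to_warn(cnvd_id: str):
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(WARNING_QUERY, cnvd_id=cnvd_id)]
    finally:
        driver.close()

print(victims_to_warn("CNVD-2021-00001"))  # hypothetical identifier
```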

4 Conclusion

Based on the characteristics of the vulnerability field, this paper first integrates multi-source vulnerability intelligence data and designs a vulnerability knowledge graph framework; it then uses a deep learning model to extract entities and attributes, extracts relationships with pattern-based rules, and constructs, checks, and analyzes the vulnerability knowledge ontology; finally, it completes the multi-source knowledge graph. In the future, by further adding vulnerability threat intelligence data sources, a larger and more complete vulnerability knowledge graph can be formed, effectively providing more cybersecurity decision support for information workers.