
1 Introduction

With the rapid development of the Internet industry, a large number of cybersecurity vulnerabilities have been discovered and exploited in companies' products, posing potential risks to production and daily life. Vulnerability threat discovery and traceability have therefore become common challenges and work requirements for personnel such as system operation and maintenance staff and network administrators. Vulnerability information comes from many sources, including vulnerability reports from open source communities, public vulnerability databases, and product patch information. These sources are scattered, incomplete, and heterogeneous in structure, and the vulnerability knowledge drawn from different Internet community platforms mixes high-quality and low-quality information, is highly repetitive, and lacks clear correlations. As a result, data quality cannot be guaranteed, and the data cannot effectively support the needs of cybersecurity personnel for vulnerability detection, analysis, and judgment.

In recent years, knowledge graphs have used deep learning to turn data into valuable information and knowledge models through data collection, analysis, and mining. Since Google proposed the knowledge graph concept and applied it to intelligent search [1], it was first applied effectively in the commercial field, for example LinkedIn's economic graph (user profiles) in the social field and the Tianyancha enterprise graph (enterprise profiles) in the field of enterprise information.

In various vertical fields in China, there has been research and exploration on applying knowledge graphs. An Ning et al. [2] proposed a cross-platform network public opinion knowledge graph, using Sina Weibo and Douyin short videos as data sources, mainly for the management and guidance of online public opinion. Xiao Le et al. [3] proposed a grain situation knowledge graph, built mainly on a grain situation dictionary and the Flat-Lattice model for extracting grain situation entities, to assist grain situation decision-making. Mou Tianhao et al. [4] proposed a knowledge graph of process industrial control systems based on cyber-physical asset management tasks to solve business problems related to industrial control systems. Zhang Kunli et al. [5] took obstetric diseases as the core and proposed a Chinese obstetric knowledge graph to facilitate medical question answering and auxiliary diagnosis and treatment.

There are still few applications of knowledge graphs in the field of cybersecurity. This paper uses a knowledge graph to correlate numerous isolated pieces of vulnerability intelligence and present a panorama of vulnerability entities, which provides a new idea for vulnerability research and analysis and helps address difficulties in cybersecurity operations.

2 Vulnerability Knowledge Graph Construction Route

Large-scale domestic vulnerability databases include the China National Vulnerability Database (CNVD) and the China National Vulnerability Database of Information Security (CNNVD), which are the main channels for constructing and sharing vulnerability intelligence [6]. Considering the current state of information security development, the sources of vulnerability intelligence in this paper are CNVD, CNNVD, and CVE (Common Vulnerabilities and Exposures). After the vulnerability knowledge is integrated, manual proofreading is performed, and data with low confidence is discarded to ensure the quality of the vulnerability knowledge base. At the same time, the knowledge extraction model is continuously trained under supervision with new intelligence. As data accumulates, new knowledge base data sources such as open source security websites are added as appropriate, and the entire system is iteratively updated.

2.1 Schema Layer Design

The schema layer of the vulnerability knowledge graph sits above the data layer, and its core is the ontology library, an abstract representation of vulnerability knowledge analogous to a "class" in object-oriented programming. The schema layer mainly consists of entity-relation-entity and entity-attribute-value triples. Based on "Information security technology—Cybersecurity vulnerability identification and description specification" (GB/T 28458–2020) [8], the framework for vulnerability identification and description is composed of identification items and description items. Taking into account the actual situation of domestic vulnerabilities, mainly from the perspective of vulnerability management and emergency response [9], the main attribute of a vulnerability is its CNVD_ID. The preliminary entity and relationship framework is shown in Fig. 1.

Fig. 1. The framework of entity and relationship

Based on the graph structure, entities represent objects or abstract concepts in the vulnerability space, and relationships model interactions between entities; the framework follows the (head entity, relation, tail entity) triple. In Fig. 1, entities are drawn as boxes, each row under an entity name lists one of its attributes, PK marks the primary attribute, and arrows represent relationships. Five entities are defined: vulnerability = {CNVD_ID, title, date, level, product, description, solution, patch, CVE_ID}; event = {event_id, description, time, URL, victim}; company = {name}; product = {name}; victim = {name}. Four relationships are defined: influence, raise, belong to, use. More entities, attributes, and relationships can be gradually added within this framework.
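To make the schema concrete, the following is a minimal Python sketch of the five entity types and four relationship types described above. The class and field names simply mirror Fig. 1; the example instances at the end (including the CNVD identifier) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Entity types from the schema layer; the primary attribute (PK) comes first in each class.
@dataclass
class Vulnerability:
    cnvd_id: str                      # PK
    title: str = ""
    date: str = ""
    level: str = ""
    product: str = ""
    description: str = ""
    solution: str = ""
    patch: str = ""
    cve_id: Optional[str] = None

@dataclass
class Event:
    event_id: str                     # PK
    description: str = ""
    time: str = ""
    url: str = ""
    victim: str = ""

@dataclass
class Company:
    name: str                         # PK

@dataclass
class Product:
    name: str                         # PK

@dataclass
class Victim:
    name: str                         # PK

# A relationship instance is a (head entity, relation, tail entity) triple.
RELATIONS = ("influence", "raise", "belong to", "use")

@dataclass
class Triple:
    head: object
    relation: str                     # one of RELATIONS
    tail: object

# Hypothetical example: vulnerability V "raises" event E.
v = Vulnerability(cnvd_id="CNVD-2021-00001", title="Example remote code execution vulnerability")
e = Event(event_id="E-001", description="Intrusion caused by the example vulnerability")
print(Triple(head=v, relation="raise", tail=e))
```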

2.2 Data Layer Construction

The construction of the vulnerability knowledge graph data layer consists of three steps: data collection, knowledge extraction, and knowledge fusion.

2.2.1 Data Collection

Vulnerability, company, and product data are obtained from the unstructured text of the China National Vulnerability Database (CNVD) and the semi-structured text of CVE (Common Vulnerabilities and Exposures) [10]. Data for the two remaining entities, events and victims, can be collected in a compliant manner, according to each organization's own circumstances, by units that manage vulnerabilities for themselves and their subordinate units, or by vulnerability managers.
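As an illustration of the collection step, the sketch below downloads a JSON vulnerability feed and maps each raw record onto the schema-layer attributes of the vulnerability entity. The feed URL and the raw field names are assumptions made for the example only; the real CNVD and CVE sources have their own formats and must be consulted directly.

```python
import json
import requests  # third-party HTTP client, used here only for illustration

# Hypothetical feed URL and field names; they do not describe any real CNVD/CVE endpoint.
FEED_URL = "https://example.org/vuln-feed.json"

def fetch_raw_records(url: str = FEED_URL) -> list:
    """Download a JSON vulnerability feed and return its list of raw records."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json().get("records", [])

def to_vulnerability(record: dict) -> dict:
    """Map one raw record onto the schema-layer attributes of the vulnerability entity."""
    return {
        "CNVD_ID": record.get("cnvd_id", ""),
        "CVE_ID": record.get("cve_id", ""),
        "title": record.get("title", ""),
        "date": record.get("published", ""),
        "level": record.get("severity", ""),
        "description": record.get("description", ""),
    }

if __name__ == "__main__":
    rows = [to_vulnerability(r) for r in fetch_raw_records()]
    print(json.dumps(rows[:3], ensure_ascii=False, indent=2))
```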

2.2.2 Knowledge Extraction

Knowledge extraction is a method for automatically obtaining structured information such as entities, relationships, and entity attributes from heterogeneous, i.e. semi-structured or unstructured, data. According to the characteristics of vulnerability intelligence text, this paper labels the text with the BIOES scheme [11] and then performs three main operations: entity extraction, attribute extraction, and relation extraction, each introduced below. A minimal labeling sketch is given first.
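The following sketch shows BIOES labeling: given a tokenized sentence and the entity spans marked in it, each token receives a B (begin), I (inside), O (outside), E (end), or S (single) tag. The sentence and the "PRO" product tag are hypothetical examples.

```python
from typing import List, Tuple

def bioes_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert labeled entity spans into BIOES tags, one tag per token.

    Each span is (start index, end index inclusive, entity type); for example
    (0, 1, "PRO") marks the product mention "Apache Log4j" in the tokens below.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if start == end:                          # single-token entity
            tags[start] = f"S-{etype}"
        else:                                     # multi-token entity
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"
            tags[end] = f"E-{etype}"
    return tags

# Hypothetical sentence from a vulnerability report, containing one product entity.
tokens = ["Apache", "Log4j", "has", "a", "remote", "code", "execution", "vulnerability"]
print(list(zip(tokens, bioes_tags(tokens, [(0, 1, "PRO")]))))
# [('Apache', 'B-PRO'), ('Log4j', 'E-PRO'), ('has', 'O'), ...]
```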

1)

    Entity extraction, namely named entity recognition (NER), refers to automatically recognizing named entities from text datasets. The main technical approaches are: rule-based and dictionary-based methods, which rely on manually constructed rule templates and on pattern and string matching; statistical methods, including the Hidden Markov Model (HMM), Maximum Entropy Model (MEM), Support Vector Machine (SVM), and Conditional Random Field (CRF); and neural network methods, whose main models are NN/CNN-CRF, RNN-CRF, and LSTM-CRF. The goal of attribute extraction is to collect the attribute information of a specific entity from different information sources; for example, for a specific vulnerability, attributes such as its name and affected products can be obtained from public information on the Internet. For entity and attribute extraction, this paper adopts the BLSTM-CRF model (Bidirectional Long Short-Term Memory network with a Conditional Random Field layer) [12], which currently performs well in the field of security vulnerabilities. Taking the product entity (Apache Log4j) as an example, the model structure is shown in Fig. 2, and a model sketch follows the figure.

Fig. 2. The structure of BLSTM-CRF model
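The paper adopts the BLSTM-CRF model but does not include code, so the following is only a minimal PyTorch sketch of the structure in Fig. 2, assuming the third-party pytorch-crf package for the CRF layer; the BIOES tag set for the product entity and the toy batch are hypothetical.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party "pytorch-crf" package, assumed to be installed

class BiLSTMCRF(nn.Module):
    """A minimal BLSTM-CRF tagger: embedding -> bidirectional LSTM -> linear emissions -> CRF."""

    def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, tokens):
        return self.emission(self.lstm(self.embed(tokens))[0])

    def loss(self, tokens, tags, mask):
        return -self.crf(self._emissions(tokens), tags, mask=mask)   # negative log-likelihood

    def decode(self, tokens, mask):
        return self.crf.decode(self._emissions(tokens), mask=mask)   # best BIOES tag sequence

# Hypothetical BIOES tag set for the product entity, e.g. "Apache Log4j" -> B-PRO E-PRO.
TAGS = ["O", "B-PRO", "I-PRO", "E-PRO", "S-PRO"]
model = BiLSTMCRF(vocab_size=5000, num_tags=len(TAGS))

tokens = torch.randint(1, 5000, (2, 12))            # toy batch: 2 sentences of 12 token ids
tags = torch.randint(0, len(TAGS), (2, 12))         # toy gold BIOES tag ids
mask = torch.ones(2, 12, dtype=torch.bool)          # no padding in this toy batch
print(model.loss(tokens, tags, mask).item())
print(model.decode(tokens, mask)[0])
```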

2)

    Relation extraction. After entity and attribute extraction, the vulnerability intelligence text yields a series of discrete named entities. Obtaining further semantic information requires relation extraction: extracting the interrelationships between entities from the relevant texts and connecting entities through relationships to form a networked knowledge structure. Unlike a social character graph, the vulnerability knowledge graph has relatively few and simple relationships, for example, vulnerability A "raises" event B. Since the relationships defined in the schema layer are easy to distinguish in text data such as vulnerability reports, this paper chooses rule matching: recognized entities are automatically matched, according to their categories and the relationship definitions in the schema layer, and fine-tuning is performed afterwards. Because the entities conform to pattern-based rules, the relationship between entities is determined by trigger words; the designed rule samples are shown in Table 1, and an illustrative matching sketch follows the table.

Table 1. Samples of trigger word rules
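Since the exact trigger-word rules of Table 1 are not reproduced here, the sketch below uses illustrative patterns to show how rule matching can work: when a trigger word appears in a sentence, recognized entity pairs whose types fit the rule are linked by the corresponding schema relation. The regular expressions, the example sentence, and the CNVD identifier are assumptions for the example.

```python
import re
from typing import List, Tuple

# Illustrative trigger-word rules (the real rules are those listed in Table 1).
# Each rule: (regex over the sentence, head entity type, relation, tail entity type).
RULES = [
    (re.compile(r"affects|influences"), "vulnerability", "influence", "product"),
    (re.compile(r"caused|led to|raised"), "vulnerability", "raise", "event"),
    (re.compile(r"developed by|belongs to"), "product", "belong to", "company"),
    (re.compile(r"uses|deploys"), "victim", "use", "product"),
]

def match_relations(sentence: str, entities: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (head, relation, tail) triples for entity pairs whose types and
    trigger words match a rule; `entities` holds (mention, type) pairs from NER."""
    by_type = {}
    for mention, etype in entities:
        by_type.setdefault(etype, []).append(mention)
    triples = []
    for pattern, head_type, rel, tail_type in RULES:
        if pattern.search(sentence):
            for head in by_type.get(head_type, []):
                for tail in by_type.get(tail_type, []):
                    triples.append((head, rel, tail))
    return triples

# Hypothetical sentence and recognized entities.
sentence = "CNVD-2021-00001 affects Apache Log4j, which belongs to the Apache Software Foundation."
entities = [("CNVD-2021-00001", "vulnerability"), ("Apache Log4j", "product"),
            ("Apache Software Foundation", "company")]
print(match_relations(sentence, entities))
# [('CNVD-2021-00001', 'influence', 'Apache Log4j'),
#  ('Apache Log4j', 'belong to', 'Apache Software Foundation')]
```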

2.2.3 Knowledge Fusion

After data collection and knowledge extraction, entities, relationships, and entity attributes have been obtained from the original unstructured and semi-structured vulnerability intelligence data. However, the relationships across multiple sources remain flat and lack hierarchy and logic, and the extracted knowledge still contains considerable redundancy and misinformation. Knowledge fusion solves this problem through entity disambiguation and coreference resolution, integrating the vulnerability knowledge. For example, a company mentioned by its Chinese name and the company "Apple" are synonymous mentions of the same entity and need to be merged. After knowledge fusion, noise and redundancy in the data are removed and the quality of the vulnerability knowledge is improved.
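A minimal sketch of the fusion step is given below, assuming a hand-built alias table for entity disambiguation; in practice coreference resolution and similarity models would supplement such a table. All alias entries and triples are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical alias table: every known surface form maps to one canonical entity name.
ALIASES: Dict[str, str] = {
    "apple": "Apple",
    "apple inc.": "Apple",
    "apache": "Apache Software Foundation",
    "apache software foundation": "Apache Software Foundation",
}

def canonical(name: str) -> str:
    """Resolve a surface form to its canonical name; unknown names pass through unchanged."""
    return ALIASES.get(name.strip().lower(), name.strip())

def fuse_triples(triples: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Normalize entity mentions and drop duplicate triples produced by multiple sources."""
    seen = set()
    fused = []
    for head, rel, tail in triples:
        t = (canonical(head), rel, canonical(tail))
        if t not in seen:
            seen.add(t)
            fused.append(t)
    return fused

raw = [("Apache Log4j", "belong to", "apache"),
       ("Apache Log4j", "belong to", "Apache Software Foundation")]
print(fuse_triples(raw))   # the two source records collapse into one triple
```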

3 Vulnerability Knowledge Graph Construction Results

3.1 Experimental Environment

The experimental environment of this paper is as follows: the operating system is Windows 10; the CPU is an AMD Ryzen 7 5800H @ 3.2 GHz; the GPU is an RTX 3050 Ti (4 GB); the memory is 64 GB; the Python version is 3.7; the Neo4j version is 3.1.1.

3.2 Knowledge Graph Display

Taking some generic vulnerability data and a small number of affected victims under Apache as an example (the entities are vulnerabilities, historical events, involved victims, companies, and products; the relationships are the edges of a directed graph), the constructed visual interface is shown in Fig. 3, and a loading sketch follows the figure.

Fig. 3. Vulnerability knowledge graph
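To reproduce a display like Fig. 3, the fused triples can be written into Neo4j. The sketch below uses the official Neo4j Python driver with a hypothetical local connection; for brevity every node is keyed by a single name property, whereas the real schema keys vulnerabilities by CNVD_ID.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver; connection details are assumptions

URI, AUTH = "bolt://localhost:7687", ("neo4j", "password")  # hypothetical local instance

# Map schema relations to node labels and Cypher relationship types.
REL_MAP = {
    "influence": ("Vulnerability", "INFLUENCE", "Product"),
    "raise":     ("Vulnerability", "RAISE", "Event"),
    "belong to": ("Product", "BELONG_TO", "Company"),
    "use":       ("Victim", "USE", "Product"),
}

def load_triples(triples):
    """MERGE each (head, relation, tail) triple into the graph so duplicates are not created."""
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with driver.session() as session:
            for head, rel, tail in triples:
                head_label, rel_type, tail_label = REL_MAP[rel]
                session.run(
                    f"MERGE (h:{head_label} {{name: $head}}) "
                    f"MERGE (t:{tail_label} {{name: $tail}}) "
                    f"MERGE (h)-[:{rel_type}]->(t)",
                    head=head, tail=tail,
                )
    finally:
        driver.close()

# Hypothetical triples; the CNVD identifier and victim name are placeholders.
load_triples([("CNVD-2021-00001", "influence", "Apache Log4j"),
              ("Apache Log4j", "belong to", "Apache Software Foundation"),
              ("Example Victim", "use", "Apache Log4j")])
```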

3.3 Application Analysis

In terms of vulnerability threat discovery and analysis, correlating and analyzing vulnerability information by constructing the graph allows hidden information to be mined and effective judgments to be made. Referring to Fig. 3, each type of entity is a node in the graph and each type of relationship between entities is an edge. Starting from a given entity, such as a victim operating critical infrastructure, one can see which products of which companies the victim uses, and which security events have occurred at which times due to which vulnerabilities. Once a 0-day vulnerability appears again in a corresponding product of that company, it can reasonably be predicted that the victim will be affected by this vulnerability, and an early warning can be issued before a possible cybersecurity event to avoid major losses; a query sketch is given below. Such information is often unavailable from a single vulnerability report, and the knowledge graph organically connects the many pieces of vulnerability information.
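As a concrete form of the early-warning analysis described above, the following sketch queries the graph for every victim that uses a product influenced by a given vulnerability. The labels and relationship types match the loading sketch after Fig. 3, and the connection details and identifier are hypothetical.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver; connection details are assumptions

# Find every victim that uses a product influenced by the given vulnerability,
# i.e. the organizations that should receive an early warning.
WARNING_QUERY = """
MATCH (v:Vulnerability {name: $cnvd_id})-[:INFLUENCE]->(p:Product)<-[:USE]-(victim:Victim)
RETURN victim.name AS victim, p.name AS product
"""

def victims_to_warn(cnvd_id: str):
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(WARNING_QUERY, cnvd_id=cnvd_id)]
    finally:
        driver.close()

print(victims_to_warn("CNVD-2021-00001"))  # hypothetical identifier
```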

4 Conclusion

Based on the characteristics of the vulnerability field, this paper first integrates multi-source vulnerability intelligence data and designs a vulnerability knowledge graph framework; it then uses a deep learning model to extract entities and attributes, extracts relationships with pattern-based rules, and constructs, checks, and analyzes the vulnerability knowledge ontology; finally, it completes the multi-source knowledge graph. In the future, by further adding vulnerability threat intelligence data sources, a larger and more complete vulnerability knowledge graph can be formed, effectively providing more cybersecurity decision support for information workers.