1 Introduction

With the accelerating digitization of the world, the network environment is becoming increasingly complex. At the same time, network attacks are becoming industrialized and attack methods increasingly diverse. Traditional approaches that build defense strategies and deploy products based on experience struggle to detect [1], intercept, analyze, and respond in a timely and effective manner to emerging, persistent, and advanced threats [2]. In this context, Cyber Threat Intelligence (CTI) [3] technology has emerged. As an important form of network security knowledge, it supports the construction of a more active network security defense [4] posture. Based on all-around intelligence perception and multi-dimensional fusion analysis, it can assess the overall network security situation and reasonably predict threat trends, enabling a dynamic and accurate response to network security threats. However, existing cyber threat intelligence is mixed with a large amount of invalid or distracting information. How to extract the most critical threat intelligence entities (such as organizations, software, and vulnerability identifiers) from threat intelligence has therefore become a focus of current research. Applying named entity recognition (NER) [4] to the field of cyber threat intelligence can effectively solve the problem of extracting important security entities from unstructured threat intelligence text. Automatically identifying and classifying network security entities from Internet information, such as software, vulnerabilities, attack methods, and related network terms, is an important step in constructing a network security knowledge graph [5].

2 Related Work

Early NER systems were rule-based and dictionary-based. This approach achieved good results when very comprehensive rules and dictionaries were formulated, but at great cost, so machine learning methods were adopted to improve NER accuracy. Mulwad et al. [6] identified potential vulnerability descriptions with an SVM classifier and used the Wikitology knowledge base to identify vulnerabilities, threats, and attacks in Web text. Since SVMs cannot take context information into account, Joshi et al. [7] used a CRF-based system to identify important entities and concepts related to network security in a given text. Part-of-speech (POS) features can also be added to improve NER performance: Weerawardhana et al. [8] used machine learning and POS tagging to identify the key PAG parameters embedded in vulnerability description text, including software name, version, impact, attacker actions, and user actions. Their experiments showed that, for entity recognition in the network security domain, the POS-based method provides a viable alternative to machine learning.

Although machine learning [9] brings some improvement to NER tasks in network security, it requires network security researchers to label security data, which is extremely costly. As a branch of machine learning, deep learning has become increasingly popular in recent years, and some researchers have applied it to named entity recognition for cyber threat intelligence. Pingchuan Ma et al. [10] proposed a BiLSTM-CRF method to extract security-related concepts and entities from unstructured text and evaluated the model on open-source data, achieving good precision, recall, and F1-score. Wu H et al. [11] added a domain-dictionary matching correction method on top of BiLSTM-CRF, using BiLSTM to automatically capture context features, CRF to learn label constraint rules, and an ontology domain dictionary for matching correction. Qin Y et al. [12] added a feature template (FT) to BiLSTM-CRF to extract local context features and a CNN to extract character-level features of security entities such as malware names and English-named vulnerabilities. Li T et al. [13] proposed a self-attention-based neural network model for entity recognition: on top of the existing BiLSTM-CRF model, a self-attention mechanism was added to extract more context information related to the current word in a sentence. Han Zhang et al. [14] added a GAN to BiLSTM-Attention-CRF to generate labeled data and alleviate the shortage of labeled data in network security. P. Evangelatos et al. [15] proposed using a Transformer to extract named entities from threat intelligence and verified its effectiveness through experiments on the DNRTI threat intelligence dataset [16].

However, named entities in cyber threat intelligence exhibit polysemy, and the static word vectors produced by word2vec and GloVe cannot solve this problem. At the same time, BiLSTM alone cannot obtain enough information about the current word. Therefore, this paper proposes a BERT-BiLSTM-CRF named entity recognition method that incorporates a self-attention mechanism. The BERT (Bidirectional Encoder Representations from Transformers) [17] pre-trained language model produces dynamic word vectors based on a language model: it adjusts a word's embedding according to the semantics of its context, better expresses the relationships between words and sentences, and thus resolves polysemy. In addition, the self-attention mechanism pays more attention to the words in a sentence that are important to the target entity, better capturing the interdependence between the current word and other words and extracting more context information related to the current word.

3 BERT-BiLSTM-Self-attention-CRF Model

The BERT-BiLSTM-Self-attention-CRF model consists of four parts: the BERT pre-trained language model, the BiLSTM layer, the self-attention layer, and the CRF layer. Unstructured text is converted into dynamic word vectors by BERT, and these word vectors are fed into the BiLSTM, which obtains context feature information from a forward LSTM and a backward LSTM. The self-attention mechanism then selectively attends to the important information and assigns it higher weight. Finally, the CRF layer produces labels in the BIO scheme. The model structure is shown in Fig. 1.

Fig. 1. BERT-BiLSTM-Self-attention-CRF model architecture
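As a concrete illustration of this pipeline, the following is a minimal PyTorch sketch of the four-layer architecture. The hidden sizes, the "bert-base-chinese" checkpoint, and the use of the third-party pytorch-crf package are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf


class BertBiLSTMAttnCRF(nn.Module):
    def __init__(self, num_tags, hidden_dim=256, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)       # dynamic word vectors
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=1,
                                          batch_first=True)    # self-attention layer
        self.fc = nn.Linear(2 * hidden_dim, num_tags)          # emission scores
        self.crf = CRF(num_tags, batch_first=True)             # label-constraint layer

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(x)                 # H_t = [h_t ; h'_t]
        h, _ = self.attn(h, h, h)             # Q = K = V = H_t
        emissions = self.fc(h)
        mask = attention_mask.bool()
        if tags is not None:                  # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: Viterbi decoding
```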

3.1 BERT Model

The language model is the most important part of named entity recognition; it transforms the input unstructured text into word vectors. Word2Vec [18] was originally used to obtain word vector representations in research on named entity recognition for cyber threat intelligence. Its core idea is to derive a word's vectorized representation from its context, using either Skip-gram or CBOW: the former predicts the surrounding words from a given central word, and the latter predicts the central word from the given context. The GloVe [19] method instead builds word vectors from a co-occurrence matrix, considering both local and global information. However, Word2Vec and GloVe both produce static word vectors, so a word's representation is the same in every context, while complex network security texts exhibit polysemy. To solve this problem, this paper adopts the BERT pre-trained language model, which generates dynamic word vector representations and thereby resolves polysemy.

BERT adopts the encoder part of the bidirectional Transformer and has two pre-training tasks. The first is the Masked Language Model: 15% of the words in the input text are randomly replaced with a [MASK] token, and the masked words are inferred from the context. The second is Next Sentence Prediction: pairs of sentences are randomly selected from the pre-training text and labeled IsNext or NotNext according to whether the second sentence actually follows the first. Figure 2 shows the structure of the BERT model.

Fig. 2. BERT architecture
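The masked-language-model task can be illustrated with a short fill-mask demo. The Hugging Face transformers library and the "bert-base-chinese" checkpoint are assumptions for illustration, not necessarily what was used in this work.

```python
from transformers import pipeline

# BERT infers the masked token from bidirectional context.
fill = pipeline("fill-mask", model="bert-base-chinese")
for pred in fill("攻击者利用该漏[MASK]执行任意代码。")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```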

The input representation of BERT consists of three parts: token embeddings, segment embeddings, and position embeddings. These three vectors are summed to form the final input, feature extraction is performed by the bidirectional Transformer encoder, and the result is a semantically rich sequence vector. The input representation is shown in Fig. 3.

Fig. 3. Input representation of BERT
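A minimal sketch of the summation described above, under assumed bert-base-chinese dimensions; real BERT additionally applies layer normalization and dropout after the sum.

```python
import torch
import torch.nn as nn

vocab_size, max_len, dim = 21128, 512, 768   # bert-base-chinese sizes (assumed)
token_emb = nn.Embedding(vocab_size, dim)
segment_emb = nn.Embedding(2, dim)           # sentence A / sentence B
position_emb = nn.Embedding(max_len, dim)

input_ids = torch.tensor([[101, 3198, 102]])              # toy token ids
segment_ids = torch.zeros_like(input_ids)
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# element-wise sum of the three embeddings (LayerNorm and dropout omitted)
embedding = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embedding.shape)                       # torch.Size([1, 3, 768])
```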

3.2 BiLSTM Layer

Traditional neural networks cannot memorize the input context or infer the current content from earlier information. This paper uses LSTM to address this problem: the model has a memory function, captures long-distance dependencies well, and learns through training which information to forget and which to remember. Its structure is shown in Fig. 4.

Fig. 4. LSTM structure

An LSTM cell is composed of a forget gate, a memory gate, and an output gate, coordinated through the cell state. The LSTM is defined as follows:

$$ f_t = \sigma \left( {W_f \cdot \left[ {h_{t - 1} ,x_t } \right] + b_f } \right) $$
(1)
$$ i_t = \sigma \left( {W_i \cdot \left[ {h_{t - 1} ,x_t } \right] + b_i } \right) $$
(2)
$$ \tilde{C}_t = \tanh \left( {W_C \cdot \left[ {h_{t - 1} ,x_t } \right] + b_C } \right) $$
(3)
$$ C_t = f_t *C_{t - 1} + i_t *\tilde{C}_t $$
(4)
$$ o_t = \sigma \left( {W_o \cdot \left[ {h_{t - 1} ,x_t } \right] + b_o } \right) $$
(5)
$$ h_t = o_t *\tanh \left( {C_t } \right) $$
(6)

where \(x_t\) is the input vector, \(f_t\) is the forget gate, \(i_t\) is the memory gate, \(o_t\) is the output gate, \(C_t\) is the cell state at time \(t\), and \(h_t\) is the hidden state at time \(t\).
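The six equations translate directly into code. The following single-step sketch is illustrative (the weight shapes are assumed); in practice PyTorch's nn.LSTM implements the same gates internally.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b hold the four gate parameters."""
    z = torch.cat([h_prev, x_t], dim=-1)        # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W["f"] + b["f"])    # forget gate, Eq. (1)
    i_t = torch.sigmoid(z @ W["i"] + b["i"])    # memory (input) gate, Eq. (2)
    c_tilde = torch.tanh(z @ W["c"] + b["c"])   # candidate cell state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde          # new cell state, Eq. (4)
    o_t = torch.sigmoid(z @ W["o"] + b["o"])    # output gate, Eq. (5)
    h_t = o_t * torch.tanh(c_t)                 # new hidden state, Eq. (6)
    return h_t, c_t
```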

However, the LSTM cannot encode information from back to front. Adding a reverse LSTM captures the following context as well, so the BiLSTM model better captures bidirectional semantics, as shown in Fig. 5.

Fig. 5. BiLSTM model structure

In this paper, the word vectors output by the BERT layer are fed into the forward and backward LSTMs to obtain the forward feature information \(h_t\) and the backward feature information \(h'_t\), which are then concatenated into the final hidden state \(H_t\), as shown below:

$$ H_t = \left[ {h_t ,h^{\prime}_t } \right] $$
(7)
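A minimal sketch of this concatenation using a bidirectional nn.LSTM; the dimensions are arbitrary assumptions.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=768, hidden_size=256,
                 batch_first=True, bidirectional=True)
word_vectors = torch.randn(1, 20, 768)   # one sentence of 20 BERT word vectors
H, _ = bilstm(word_vectors)
print(H.shape)                           # torch.Size([1, 20, 512]): [h_t ; h'_t]
```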

3.3 Self-attention Layer

To better exploit the effective information in threat intelligence text, this paper adds a self-attention mechanism after the BiLSTM. It captures the correlations between vectors, selectively attends to the important information in the feature vectors output by the BiLSTM layer and assigns it higher weight, while giving lower weight to other information. The self-attention mechanism is computed as follows.

First, the hidden state output by the BiLSTM layer is denoted \(H_t\), and the matrices Q, K, and V are obtained by mapping \(H_t\):

$$ \begin{gathered} Q = H_t W^Q \hfill \\ K = H_t W^K \hfill \\ V = H_t W^V \hfill \\ \end{gathered} $$
(8)

where \(W^Q\), \(W^K\), and \(W^V\) are parameters learned during training. The scaled dot-product attention is then calculated as follows:

$$ \mathrm{Attention}\left( {Q,K,V} \right) = \mathrm{Softmax}\left( {\frac{QK^T }{{\sqrt {d_k } }}} \right)V $$
(9)

The factor \(1/\sqrt{d_k}\) prevents the dot products from growing too large; the scores are then normalized with the Softmax function and multiplied by \(V\) to obtain the output.
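Equations (8) and (9) can be sketched as follows; the projection dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 512, 64
W_q, W_k, W_v = (nn.Linear(d_model, d_k, bias=False) for _ in range(3))

H = torch.randn(1, 20, d_model)              # BiLSTM hidden states H_t
Q, K, V = W_q(H), W_k(H), W_v(H)             # Eq. (8)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scaled dot products
weights = torch.softmax(scores, dim=-1)      # normalize over key positions
out = weights @ V                            # Eq. (9)
print(out.shape)                             # torch.Size([1, 20, 64])
```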

3.4 CRF Layer

A conditional random field (CRF) is a conditional probability model used to find the most probable label sequence. In the threat intelligence NER task, BiLSTM handles long-distance text information well but cannot model the dependencies between adjacent tags. A CRF obtains the best prediction sequence through the relationships between adjacent tags, making up for this deficiency of BiLSTM. The CRF ensures the validity of the predicted tags by imposing constraint rules on them; these rules are learned automatically during training, and the Viterbi algorithm is used to find the most likely tag sequence.

Given the input sequence \(X = \{x_1, x_2, \ldots, x_n\}\) of a sentence and its corresponding prediction sequence \(Y = \{y_1, y_2, \ldots, y_n\}\), the score of the prediction sequence \(Y\) is calculated as follows:

$$ s\left( {X,Y} \right) = \sum_{i = 0}^n {A_{y_i ,y_{i + 1} } } + \sum_{i = 1}^n {P_{i,y_i } } $$
(10)

where \(A\) is the label transition matrix and \(P\) holds the label (emission) scores. The probability of the sequence \(Y\) is then:

$$ P\left( {Y|X} \right) = \frac{{e^{s\left( {X,Y} \right)} }}{{\sum\limits_{\tilde{Y} \in Y_X } {e^{s\left( {X,\tilde{Y}} \right)} } }} $$
(11)

where \(\tilde{Y}\) ranges over candidate tag sequences and \(Y_X\) is the set of all possible tag sequences for \(X\). Taking the logarithm of both sides yields the log-likelihood of the prediction sequence:

$$ \ln \left( {P\left( {Y|X} \right)} \right) = s\left( {X,Y} \right) - \ln \left( {\sum_{\tilde{Y} \in Y_X } {e^{s\left( {X,\tilde{Y}} \right)} } } \right) $$
(12)

Finally, the tag sequence with the highest probability is found with the Viterbi algorithm.
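For illustration, a compact Viterbi decoder over an emission matrix P and a transition matrix A (matching Eq. (10)) might look as follows; a production system would more likely rely on an existing CRF implementation such as pytorch-crf.

```python
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags) label scores P;
    transitions: (num_tags, num_tags) transfer matrix A."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].clone()          # best score ending at each tag so far
    backpointers = []
    for t in range(1, seq_len):
        # score of every (prev_tag -> tag) path at step t
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):     # follow backpointers to recover the path
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```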

4 Experimental Analysis

4.1 Dataset Construction

Since there is no public Chinese named entity recognition dataset for network security, this paper obtains the required data with Python crawlers from websites related to network security vulnerabilities, such as the National Information Security Vulnerability Sharing Platform (www.cnvd.org.cn), the Information Security Vulnerability Portal (http://cve.scap.org.cn), the 360 Network Security Response Center (cert.360.cn), and the National Internet Emergency Center (www.cert.org.cn). The entities are divided into nine types (as shown in Table 1) and labeled with the BIO scheme, where B marks the first token of an entity, I marks a token inside an entity, and O marks a non-entity token.

Table 1. Entity labeling scheme

The labeled dataset is divided into training, test, and validation sets in a 7:2:1 ratio (as shown in Table 2).

Table 2. Dataset size
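A hedged sketch of BIO-labeled samples and the 7:2:1 split; the example sentence and the entity-type labels (B-VUL, B-SW) are invented for illustration, since Table 1's exact label set is not reproduced here.

```python
import random

# each sample is a list of (token, BIO-tag) pairs; entity types are hypothetical
sample = [("CVE-2021-44228", "B-VUL"), ("影响", "O"),
          ("Apache", "B-SW"), ("Log4j", "I-SW")]
sentences = [sample] * 100               # stand-in for the real labeled corpus

random.seed(42)
random.shuffle(sentences)
n = len(sentences)
train = sentences[:int(0.7 * n)]                  # 70% training
test = sentences[int(0.7 * n):int(0.9 * n)]       # 20% test
valid = sentences[int(0.9 * n):]                  # 10% validation
print(len(train), len(test), len(valid))          # 70 20 10
```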

4.2 Evaluation Metrics

Evaluating NER performance is a crucial step in the NER task: through evaluation we can analyze the strengths and weaknesses of the proposed algorithm. Three main metrics are used to measure NER performance: precision, recall, and F1-score.

Precision is the proportion of samples predicted to be positive that are actually positive. The formula is as follows:

$$ P = \frac{TP}{{TP + FP}} \ast 100\% $$
(13)

Recall is the proportion of actually positive samples that are correctly predicted to be positive. The formula is as follows:

$$ R = \frac{TP}{{TP + FN}} \ast 100\% $$
(14)

The two metrics above are in tension, and precision and recall cannot generally both reach their best values at once. The F1-score balances them, taking both into account as a single comprehensive index. Its formula is as follows:

$$ F1 = \frac{2 \ast P \ast R}{{P + R}} \ast 100\% $$
(15)

where TP is the number of samples that are actually positive and predicted positive, FP is the number that are actually negative but predicted positive, and FN is the number that are actually positive but predicted negative.
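Equations (13)-(15) computed directly from the confusion counts; the counts below are invented for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0        # Eq. (13)
    r = tp / (tp + fn) if tp + fn else 0.0        # Eq. (14)
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # Eq. (15)
    return p, r, f1

# invented counts for illustration
print(precision_recall_f1(tp=90, fp=10, fn=20))   # (0.9, 0.818..., 0.857...)
```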

4.3 Experimental Results

Experiments are carried out on the constructed network security dataset. To verify the rationality of the proposed model, it is compared with several classical named entity recognition models. The comparison results are shown in Table 3.

Table 3. Comparison of different models (%)

The task of named entity recognition in network security requires rich features, and the state at the current time step should be related to both the previous and the following states, whereas the current state in an HMM depends only on the previous state. The experimental results accordingly show that the F1-score of BiLSTM is higher than that of HMM. BiLSTM, however, cannot learn the constraints between adjacent labels; adding a CRF layer remedies this. Comparing BiLSTM-CRF with BERT-BiLSTM-CRF, the results show that because BERT deeply extracts the semantic information of network security text and fully reflects the polysemy of words, the F1-score improves significantly.

Comparing BERT-BiLSTM-CRF with the model proposed in this paper, which adds the self-attention mechanism, precision, recall, and F1-score all improve. With self-attention, the model better captures the correlations across the full text of a network security document by computing the interactions between words, so the F1-score of the proposed model is 2.28% higher than that of BERT-BiLSTM-CRF. The model achieves good results on the network security named entity recognition task.

5 Conclusion and Future Work

Threat intelligence has gradually become one of the hot areas of network security. Government departments and network security enterprises are paying increasing attention to its development, and the demand for threat intelligence is growing in all walks of life. However, cyber threat intelligence entities present problems such as ambiguous words, mixed Chinese and English, and blurred boundaries. To solve these problems, this paper presents a network security named entity recognition model based on BERT-BiLSTM-CRF combined with a self-attention mechanism: the BERT pre-trained language model generates word vectors dynamically through its bidirectional Transformer structure, mining syntactic structure and semantic information, while the self-attention mechanism computes the correlations between words and assigns them different weights according to their degree of association, thereby better handling long-distance dependencies. Experiments show that the model improves precision, recall, and F1-score and has a good recognition effect. It can perform practical cyber threat intelligence entity recognition and resolve the difficulties of threat intelligence entity recognition, including word ambiguity.

However, there is still much room to improve named entity recognition for cyber threat intelligence. Because a large amount of network security text in specific areas remains unlabeled, transfer learning can be considered in future research to address the lack of labeled data, and recognition performance can be further improved by expanding the corpus.