Abstract
As new network threats continue to emerge, turning passive defense into active prediction has become a pressing problem, and Cyber Threat Intelligence (CTI) technology offers a new approach. CTI technology can obtain network security threat intelligence in a timely and effective manner, helping security personnel quickly identify attacks and make effective decisions. However, threat intelligence contains a large amount of redundant information, and the related security entities suffer from mixed Chinese and English text, fuzzy boundaries, and polysemy. Identifying the complex and valuable information in this data is therefore a major challenge. To address these problems, a named entity recognition model for network threat intelligence based on BERT-BiLSTM-Self-Attention-CRF is proposed to identify complex network threat intelligence entities in text. First, dynamic word vectors are obtained through BERT to fully represent semantic information and address polysemy. The word vectors are then fed into BiLSTM, which produces context feature vectors. The BiLSTM output is passed to a self-attention mechanism to capture correlations within the features, and finally the result is input into a CRF for sequence labeling. To verify the effectiveness of the model, experiments are carried out on the constructed network threat intelligence dataset. The results show that the model significantly improves threat intelligence named entity recognition compared with several classical models.
1 Introduction
As the world's digitization accelerates, the network environment is becoming increasingly complex. At the same time, network attacks are becoming industrialized and attack methods increasingly diverse. Traditional defense strategies and product deployments built on experience struggle to detect [1], intercept, analyze, and respond to new, persistent, and advanced threats in a timely and effective manner [2]. In this context, Cyber Threat Intelligence (CTI) [3] technology came into being. As an important form of network security knowledge, it supports the construction of a more active network security defense [4] mode. Based on all-around intelligence perception and multi-dimensional fusion analysis, it can assess the overall network security situation and reasonably predict threat trends, enabling a dynamic and accurate response to network security threats. However, existing network threat intelligence is mixed with a large amount of invalid or interfering information. How to effectively extract the critical threat intelligence entities (such as organizations, software, and vulnerability numbers) from threat intelligence has become a focus of current research. Applying named entity recognition (NER) technology [4] to network threat intelligence can effectively address the problem of extracting important security entity information from unstructured threat intelligence text. Automatically identifying and classifying network security entities in Internet text, such as software, vulnerabilities, attack methods, and related network terms, is an important step in constructing a knowledge graph for network security [5].
2 Related Work
Early NER systems relied on rule-based and dictionary-based approaches, which achieve good results when the rules and dictionaries are very comprehensive, but at great cost, so machine learning methods were introduced to improve NER accuracy. Mulwad V et al. [6] identified potential vulnerability descriptions with an SVM classifier and used the Wikitology knowledge base to identify vulnerabilities, threats, and attacks in Web text. Since SVM cannot take context information into account, Joshi A et al. [7] used a CRF-based system to identify important entities and concepts related to network security in a given text. Part-of-speech (POS) tagging has also been used to improve NER performance: Weerawardhana S et al. [8] identified the key PAG parameters embedded in vulnerability description text using machine learning and POS tagging, including software name, version, impact, attacker operation, and user operation. Their experiments on entity recognition in the network security field show that the POS method provides a viable alternative to machine learning.
Although machine learning [9] has improved NER in network security, it requires network security researchers to label security data, which is extremely costly. Deep learning, a branch of machine learning, has become increasingly popular in recent years, and some researchers have applied it to named entity recognition for network threat intelligence. Pingchuan Ma et al. [10] proposed a BiLSTM-CRF method to extract security-related concepts and entities from unstructured text and evaluated the model on open-source data in terms of precision (P), recall (R), and F1-score, with good results. Wu H et al. [11] added a domain dictionary matching correction method to BiLSTM-CRF, using BiLSTM to automatically capture context features, CRF to learn label constraint rules, and an ontology domain dictionary for matching correction. Qin Y et al. [12] added a feature template (FT) to BiLSTM-CRF to extract local context features and a CNN to extract character-level features of security entities, such as malware and vulnerabilities with English names. Li T et al. [13] proposed a self-attention-based neural network model for entity recognition: building on the existing BiLSTM-CRF model, a self-attention mechanism is added to extract more context information related to the current word in a sentence. Han Zhang et al. [14] added a GAN to BiLSTM-Attention-CRF to obtain labeled data and alleviate the lack of labeled data in network security. P Evangelatos et al. [15] proposed using transformer-based models to extract named entities from threat intelligence and verified their effectiveness through experiments on the threat intelligence dataset DNRTI [16].
However, named entities in network threat intelligence exhibit polysemy. The word vectors obtained by word2vec and GloVe are static and cannot address this problem. At the same time, BiLSTM alone cannot capture enough information about the current word. Therefore, this paper proposes a BERT-BiLSTM-CRF named entity recognition method that incorporates a self-attention mechanism. The pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) [17] produces dynamic word vectors: it adjusts the embedding of each word according to the semantics of its context, better represents the relationships between words and sentences, and addresses polysemy. In addition, the self-attention mechanism focuses on the important words related to the target entity in a sentence, which better captures the interdependence between the current word and other words and extracts more context information related to the current word.
3 BERT-BiLSTM-Self-attention-CRF Model
The BERT-BiLSTM-Self-attention-CRF model consists of four parts: the BERT pre-trained language model, the BiLSTM layer, the self-attention layer, and the CRF layer. The unstructured text is converted into dynamic word vectors by BERT, and these word vectors are used as the input to BiLSTM. Context features are obtained from the forward and backward LSTMs, and the self-attention mechanism then selectively focuses on the important information and assigns it higher weights. Finally, the CRF layer labels the sequence using the BIO scheme. The model structure is shown in Fig. 1.
3.1 BERT Model
The language model is a key component of named entity recognition: it transforms the input unstructured text into word vectors. Earlier research on named entity recognition for network threat intelligence used Word2Vec [18] to obtain word vector representations. Its core idea is to obtain the vectorized representation of a word from its context, using either the Skip-gram or the CBOW model. The former predicts the surrounding words from a given central word, and the latter predicts the central word from the given context. The GloVe [19] method instead obtains word vectors from a co-occurrence matrix, taking both local and global information into account. However, Word2Vec and GloVe both produce static word vectors: the representation of a word is the same in every context. Complex network security texts contain polysemous words, which static vectors cannot distinguish. To solve this problem, this paper adopts the BERT pre-trained language model, which generates dynamic, context-dependent word vector representations.
BERT adopts the encoder part of the bidirectional Transformer and has two pre-training tasks. The first task is the masked language model (Masked LM), which randomly replaces 15% of the words in the input text with [MASK] and then infers the masked words from the context. The second task is next sentence prediction, which predicts whether the second sentence follows the first: sentence pairs are selected from the pre-training text and labeled IsNext or NotNext. Figure 2 shows the structure of the BERT model.
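The masking step of the first pre-training task can be illustrated with a minimal sketch. It follows the simplified description in the text (every selected word becomes [MASK]); the original BERT recipe additionally replaces some selected words with random words or leaves them unchanged. The function name `mask_tokens` and its parameters are illustrative assumptions, not part of any BERT library.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace roughly `mask_rate` of the tokens with `mask_token`.

    Returns the masked sequence and a dict {position: original token}
    holding the words a model would be trained to recover.
    Simplified: real BERT does not always substitute [MASK].
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # remember the word to be predicted
        masked[pos] = mask_token     # hide it from the model
    return masked, targets


if __name__ == "__main__":
    sentence = "the attacker exploited a remote code execution flaw".split()
    masked, targets = mask_tokens(sentence)
    print(masked)
    print(targets)
```

The `targets` dictionary plays the role of the training labels: the model sees `masked` and is asked to reconstruct the hidden words from their context.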
The input representation of BERT consists of three parts: Token Embedding, Segment Embedding, and Position Embedding. These three vectors are summed to form the final input, features are extracted by the encoder of the bidirectional Transformer, and the output is a sequence of semantically rich vectors. The input representation is shown in Fig. 3.
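As a small illustration of how the three embeddings combine, the sketch below sums them element-wise for each position. The lookup tables are toy values chosen for readability and the function name is hypothetical; in BERT the tables are learned parameters with hundreds of dimensions.

```python
def bert_input_embeddings(token_ids, segment_ids,
                          token_table, segment_table, position_table):
    """Sum token, segment, and position embeddings for every position."""
    embeddings = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + s + p for t, s, p in zip(token_table[tok],
                                            segment_table[seg],
                                            position_table[pos])]
        embeddings.append(vec)
    return embeddings


if __name__ == "__main__":
    # Toy 2-dimensional tables (assumed values, for illustration only).
    token_table = {0: [1, 0], 1: [0, 1]}
    segment_table = {0: [10, 10], 1: [20, 20]}
    position_table = [[100, 100], [200, 200]]
    print(bert_input_embeddings([0, 1], [0, 1],
                                token_table, segment_table, position_table))
```

Because the three vectors are added rather than concatenated, they must all share the same dimensionality.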
3.2 BiLSTM Layer
Traditional neural networks cannot memorize the context of the input or infer content from earlier information. This paper uses LSTM to address this problem. The model has a memory function and better captures long-distance dependencies; through training, it learns which information to forget and which to remember. Its structure is shown in Fig. 4.
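Its structure consists of a forgetting gate, a memory gate, and an output gate, and it is controlled by the unit status. The LSTM is computed as follows:
\[
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
\]
where \(x_t\) is the input vector, \(f_t\) is the forgetting gate, \(i_t\) is the memory gate, \(o_t\) is the output gate, \(C_t\) is the unit status at time t, and \(h_t\) is the hidden state at time t. \(\sigma\) is the sigmoid function, \(\tilde{C}_t\) is the candidate unit status, \(\odot\) denotes element-wise multiplication, and the \(W\) and \(b\) terms are the weight matrices and biases of the corresponding gates.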
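To make the roles of the three gates concrete, the following sketch computes a single LSTM time step in plain Python using the standard LSTM formulation. The parameter layout (a dictionary mapping the hypothetical gate names "f", "i", "o", "c" to a weight matrix and bias acting on the concatenation of the previous hidden state and the input) is an assumption made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and unit status."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a + bias) for a, bias in zip(matvec(W, z), b)]

    f_t = gate("f", sigmoid)       # forgetting gate: what to drop from c_prev
    i_t = gate("i", sigmoid)       # memory gate: how much new content to store
    o_t = gate("o", sigmoid)       # output gate: what to expose as h_t
    c_hat = gate("c", math.tanh)   # candidate unit status
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


if __name__ == "__main__":
    # Hidden size 1, input size 1, all-zero parameters (toy values).
    zero = ([[0.0, 0.0]], [0.0])
    params = {"f": zero, "i": zero, "o": zero, "c": zero}
    print(lstm_step([1.0], [0.0], [2.0], params))
```

With all-zero parameters every gate outputs 0.5 and the candidate is 0, so the unit status is simply halved, which makes the effect of the forgetting gate easy to see.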
However, a single LSTM cannot encode information from back to front. Adding a reverse LSTM allows the model to use the following context as well; that is, the BiLSTM model better captures bidirectional semantics, as shown in Fig. 5.
In this paper, the word vectors output by the BERT layer are fed to the forward LSTM and the reverse LSTM to obtain the forward feature information \(h_t\) and the reverse feature information \(h_t'\), and the two are then concatenated to obtain the final hidden state \(H_t\), as shown below:
\[
H_t = [h_t ; h_t']
\]
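The forward pass, reverse pass, and concatenation can be sketched independently of the cell type. In the example below the recurrent step is a toy running sum rather than a real LSTM cell, chosen only so the two directions are easy to follow by hand; the function names are illustrative.

```python
def run_direction(inputs, step, h0, reverse=False):
    """Apply step(x, h) -> h along the sequence, optionally right to left.

    The returned states are aligned with the original token positions.
    """
    order = range(len(inputs) - 1, -1, -1) if reverse else range(len(inputs))
    states = [None] * len(inputs)
    h = h0
    for t in order:
        h = step(inputs[t], h)
        states[t] = h
    return states


def bidirectional(inputs, forward_step, backward_step, h0):
    """Concatenate forward and reverse states per position: H_t = [h_t ; h_t']."""
    forward = run_direction(inputs, forward_step, h0)
    backward = run_direction(inputs, backward_step, h0, reverse=True)
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    running_sum = lambda x, h: [h[0] + x[0]]  # toy stand-in for an LSTM cell
    print(bidirectional([[1], [2], [3]], running_sum, running_sum, [0]))
```

At each position the first half of the output summarizes everything to the left and the second half everything to the right, which is what lets later layers use both sides of the context.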
3.3 Self-attention Layer
To better capture the useful information in threat intelligence text, this paper adds a self-attention mechanism after BiLSTM. It captures the correlation between vectors, selectively focusing on the important information in the feature vectors output by the BiLSTM layer by giving it higher weights and giving other information lower weights. The self-attention mechanism in this paper is computed as follows.
First, the hidden state output by the BiLSTM layer is denoted \(H_t\), and the matrices Q, K, and V are obtained by mapping \(H_t\):
\[
Q = H_t W_Q, \quad K = H_t W_K, \quad V = H_t W_V
\]
where \(W_Q\), \(W_K\), and \(W_V\) are parameters learned in the training process. The output is then calculated by scaled dot-product attention:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]
The factor \(1/\sqrt{d_k}\), where \(d_k\) is the dimension of K, prevents the dot products from becoming too large. The scaled scores are normalized with the Softmax function and multiplied by V to obtain the result.
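A compact way to check one's understanding of scaled dot-product self-attention is to compute it on a tiny matrix. The sketch below is written in plain Python with nested lists; the helper names and the toy inputs are assumptions for illustration, and a real implementation would use a tensor library.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(H, W_q, W_k, W_v):
    """Return the attention output and the attention weight matrix."""
    Q, K, V = matmul(H, W_q), matmul(H, W_k), matmul(H, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    H = [[1.0, 0.0], [0.0, 1.0]]  # two positions, feature size 2 (toy values)
    output, weights = self_attention(H, identity, identity, identity)
    print(weights)
    print(output)
```

Each row of `weights` sums to 1 and says how strongly that position attends to every other position; the output row is the corresponding weighted average of the rows of V.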
3.4 CRF Layer
Conditional random field (CRF) is a conditional probability model used to find the most probable label sequence. In the threat intelligence NER task, BiLSTM is good at processing long-distance text information but cannot model the dependency between adjacent tags. CRF obtains the best prediction sequence by using the relationship between adjacent tags, which makes up for this deficiency of BiLSTM. CRF ensures the validity of the predicted tags by adding constraints to the final prediction. These constraints are learned automatically by the CRF layer during training, and the Viterbi algorithm is used to find the most likely tag sequence.
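One example of the kind of constraint involved is that, under BIO labeling, an I tag may only continue an entity of the same type. The sketch below hand-codes that rule purely for illustration; a CRF learns such preferences from data as transition scores rather than using explicit rules, and the entity type names "SW" and "VUL" are hypothetical.

```python
def is_valid_transition(prev_tag, tag):
    """An I-X tag is valid only directly after B-X or I-X."""
    if tag.startswith("I-"):
        entity_type = tag[2:]
        return prev_tag in ("B-" + entity_type, "I-" + entity_type)
    return True  # O and B-X may follow any tag


def is_valid_sequence(tags):
    previous = "O"  # treat the sentence start like a non-entity tag
    for tag in tags:
        if not is_valid_transition(previous, tag):
            return False
        previous = tag
    return True


if __name__ == "__main__":
    print(is_valid_sequence(["B-SW", "I-SW", "O"]))   # well-formed
    print(is_valid_sequence(["O", "I-SW"]))           # I without a preceding B
    print(is_valid_sequence(["B-SW", "I-VUL"]))       # type changes mid-entity
```

A trained CRF typically assigns strongly negative transition scores to pairs like O followed by I-SW, so sequences that violate the rule are unlikely to be selected.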
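Given an input sentence X = {x1, x2,…, xn} and a corresponding prediction sequence Y = {y1, y2,…, yn}, the score of the prediction sequence Y is calculated as follows:
\[
s(X, Y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
\]
where A represents the transfer matrix of the labels and P represents the label scores, with \(P_{i, y_i}\) the score of label \(y_i\) for the i-th word. The score is used to compute the probability of sequence Y:
\[
p(Y \mid X) = \frac{\exp\left(s(X, Y)\right)}{\sum_{\tilde{Y} \in Y_X} \exp\left(s(X, \tilde{Y})\right)}
\]
where \(Y_X\) represents the set of all possible marked sequences for X and \(\tilde{Y}\) ranges over that set. Taking the logarithm of both sides of the above formula gives the likelihood function of the prediction sequence:
\[
\ln p(Y \mid X) = s(X, Y) - \ln \sum_{\tilde{Y} \in Y_X} \exp\left(s(X, \tilde{Y})\right)
\]
During training, Y is the correctly marked sequence and this log-likelihood is maximized. Finally, the tag sequence with the highest score is obtained with the Viterbi algorithm:
\[
Y^{*} = \mathop{\arg\max}_{\tilde{Y} \in Y_X} s(X, \tilde{Y})
\]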
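The scoring and decoding steps can be sketched in a few lines. The example assumes the score of a tag sequence is the sum of per-position label scores plus transition scores between adjacent tags, without separate start and end transitions; the function names and the toy matrices are illustrative, not taken from the paper.

```python
def sequence_score(emissions, transitions, tags):
    """Sum of label scores plus transition scores along one tag path."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score."""
    n_tags = len(emissions[0])
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_best, pointers = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags),
                       key=lambda p: best[p] + transitions[p][cur])
            new_best.append(best[prev] + transitions[prev][cur] + emit[cur])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_tags), key=lambda t: best[t])
    path = [last]
    for pointers in reversed(backpointers):  # follow the pointers backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


if __name__ == "__main__":
    emissions = [[1, 0], [0, 1], [1, 0]]   # 3 words, 2 tags (toy scores)
    transitions = [[0, -2], [-2, 0]]       # switching tags is penalized
    print(viterbi(emissions, transitions))
```

In this toy case the label scores alone would favor the path 0, 1, 0, but the transition penalties make staying on tag 0 the best overall choice, which is the effect the CRF layer adds on top of per-word predictions.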
4 Experimental Analysis
4.1 Dataset Construction
Since there is no public Chinese named entity recognition dataset for network security, this paper collects the required data with Python from websites related to network security vulnerabilities, such as the National Information Security Vulnerability Sharing Platform (www.cnvd.org.cn), the Information Security Vulnerability Portal (http://cve.scap.org.cn), the 360 Network Security Response Center (cert.360.cn), and the National Internet Emergency Center (www.cert.org.cn). The entities are divided into nine types (as shown in Table 1) and labeled with the BIO scheme: B represents the first word of an entity, I represents a subsequent word inside the entity, and O represents a non-entity word.
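BIO labeling can be illustrated with a small helper that turns entity spans into per-word tags. The sentence and the entity type names "SW" and "VUL" below are made up for the example and are not the nine types of Table 1.

```python
def bio_tags(tokens, entities):
    """Convert (start, end, type) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        tags[start] = "B-" + entity_type          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type          # remaining words of the entity
    return tags


if __name__ == "__main__":
    tokens = ["Apache", "Log4j", "is", "affected", "by", "CVE-2021-44228"]
    entities = [(0, 2, "SW"), (5, 6, "VUL")]
    for token, tag in zip(tokens, bio_tags(tokens, entities)):
        print(token, tag)
```

Every word receives exactly one tag, so the recognition task becomes a sequence labeling problem with one label per input position.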
The labeled dataset is divided into a training set, a test set, and a validation set in a 7:2:1 ratio (as shown in Table 2).
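A 7:2:1 split of this kind can be produced with a short shuffle-and-slice routine. The sketch below is one possible way to do it, not the authors' procedure; the fixed seed and the function name are assumptions made so the example is reproducible.

```python
import random


def split_dataset(samples, ratios=(7, 2, 1), seed=42):
    """Shuffle and split samples into training, test, and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_test = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder goes to validation
    return train, test, validation


if __name__ == "__main__":
    train, test, validation = split_dataset(range(100))
    print(len(train), len(test), len(validation))
```

Shuffling before slicing avoids putting all sentences from one source website into the same subset.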
4.2 Evaluation Metrics
How to evaluate the performance of NER is a crucial step in the NER task. Through evaluation, we can analyze the advantages and existing problems of the proposed algorithm. At present, there are three main evaluation indicators to measure the performance of NER tasks: Precision, Recall, and F1-score.
Precision is the proportion of samples predicted to be positive that are actually positive. The formula is as follows:
\[
P = \frac{TP}{TP + FP}
\]
Recall is the proportion of actually positive samples that are predicted to be positive. The formula is as follows:
\[
R = \frac{TP}{TP + FN}
\]
Precision and recall tend to trade off against each other, so the two usually cannot both reach their best values at the same time. The F1-score, the harmonic mean of precision and recall, is therefore used as a comprehensive index that balances the two. Its formula is as follows:
\[
F1 = \frac{2 \times P \times R}{P + R}
\]
where TP refers to the number of samples that are actually positive and predicted to be positive, FP refers to the number of samples that are actually negative and predicted to be positive, and FN refers to the number of samples that are actually positive and predicted to be negative.
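For NER these counts are often taken over whole entities: a predicted entity counts as a true positive only if its span and type both match the gold annotation. The text does not state whether evaluation is at the word level or the entity level, so the entity-level, exact-match convention in the sketch below is an assumption, and the tag names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by BIO tags."""
    entities, start, entity_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if start is not None and not continues:
            entities.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)   # predicted and correct
    fp = len(pred - gold)   # predicted but wrong
    fn = len(gold - pred)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["B-SW", "I-SW", "O", "B-VUL"]
    pred = ["B-SW", "I-SW", "O", "O"]
    print(precision_recall_f1(gold, pred))
```

In the example the model finds one of the two gold entities and predicts nothing spurious, so precision is perfect while recall is one half, and the F1-score falls between the two.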
4.3 Experimental Results
Experiments are carried out on the constructed network security data set. In order to verify the rationality of the proposed model, the model is compared with several classical models in the named entity recognition task. The comparison results are shown in Table 3.
Named entity recognition in network security requires rich features, and the state at the current time step should depend on both the previous and the next time steps, whereas the current state in an HMM depends only on the previous state. Accordingly, the experimental results show that the F1 value of BiLSTM is higher than that of HMM. BiLSTM alone cannot learn the relationships within the tag sequence; adding CRF allows the model to learn them. Comparing BiLSTM-CRF with BERT-BiLSTM-CRF, the experimental results show that the F1-score improves significantly, because BERT extracts the semantic information of network security text in depth and represents polysemous words according to their context.
Comparing the BERT-BiLSTM-CRF model with the model proposed in this paper, which adds the self-attention mechanism to BERT-BiLSTM-CRF, precision, recall, and F1-score all improve. With the self-attention mechanism, the model better captures the correlations within the network security text by computing the interactions between words, so that the F1-score of the proposed model is 2.28% higher than that of the BERT-BiLSTM-CRF model. The model achieves good results on the network security named entity recognition task.
5 Conclusion and Future Work
Threat intelligence has gradually become one of the hot areas of network security. Government departments and network security enterprises are paying more attention to its development, and demand for threat intelligence is growing across industries. However, network threat intelligence entities pose several problems, such as ambiguous words, mixed Chinese and English, and blurred boundaries. To solve these problems, this paper presents a network security named entity recognition model based on BERT-BiLSTM-CRF combined with a self-attention mechanism. The model uses the BERT pre-trained language model to generate word vectors dynamically through the bidirectional Transformer structure, mining syntactic structure and semantic information, and introduces a self-attention mechanism to calculate the correlation between words. Assigning different weights to different words according to their degree of association better handles long-distance dependencies. Experiments show that the model improves P, R, and F1-score and recognizes entities well. It can be applied to practical network threat intelligence entity recognition and addresses the difficulty of recognizing threat intelligence entities and the polysemy of words.
However, there is still much room for improvement in named entity recognition for network threat intelligence. Because a large amount of unlabeled network security corpus remains in specific areas, transfer learning can be considered in future research to alleviate the lack of labeled data. Recognition performance for network threat intelligence entities can also be further improved by expanding the corpus.
References
Han, X., Li, C., Li, X., Lu, T.: Research on APT attack detection technology based on DenseNet convolutional neural network. In: 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI), pp. 440–448 (2021)
Pujol-Perich, D., Suárez-Varela, J., Cabellos-Aparicio, A., et al.: Unveiling the potential of graph neural networks for robust intrusion detection. ACM SIGMETRICS Perf. Eval. Rev. 49(4), 111–117 (2021)
Schlette, D., Caselli, M., Pernul, G.: A comparative study on cyber threat intelligence: the security incident response perspective. IEEE Commun. Surv. Tutor. 23(4), 2525–2556 (2021)
Nozza, D., Manchanda, P., Fersini, E., et al.: Learning to adapt with word embeddings: domain adaptation of named entity recognition systems. Inf. Process. Manag. 58(3), 102537 (2021)
Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 33(2), 494–514 (2022)
Mulwad, V., Li, W., Joshi, A., et al.: Extracting information about security vulnerabilities from web text. In: 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, pp. 257–260 (2011)
Joshi, A., Lal, R., Finin, T., et al.: Extracting cybersecurity-related linked data from text. In: 2013 IEEE Seventh International Conference on Semantic Computing, pp. 252–259 (2013)
Weerawardhana, S., Mukherjee, S., Ray, I., Howe, A.: Automated extraction of vulnerability information for home computer security. In: Cuppens, F., Garcia-Alfaro, J., Zincir Heywood, N., Fong, P.W.L. (eds.) FPS 2014. LNCS, vol. 8930, pp. 356–366. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17040-4_24
Zhao, Q., Sun, J., Ren, H., Sun, G.: Machine-learning based TCP security action prediction. In: 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), pp. 1329–1333 (2020)
Ma, P., Jiang, B., Lu, Z., et al.: Cybersecurity named entity recognition using bidirectional long short-term memory with conditional random fields. Tsinghua Sci. Technol. 26(3), 259–265 (2020)
Wu, H., Li, X., Gao, Y.: An effective approach of named entity recognition for cyber threat intelligence. In: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 1370–1374 (2020)
Qin, Y., Shen, G.-W., Zhao, W., Chen, Y.-P., Yu, M., Jin, X.: A network security entity recognition method based on feature template and CNN-BiLSTM-CRF. Front. Inf. Technol. Electron. Eng. 20(6), 872–884 (2019). https://doi.org/10.1631/FITEE.1800520
Li, T., Guo, Y., Ju, A.: A self-attention-based approach for named entity recognition in cybersecurity. In: 2019 15th International Conference on Computational Intelligence and Security (CIS), pp. 147–150 (2019)
Zhang, H., Guo, Y., Li, T.: Domain named entity recognition combining Gan and BiLSTM-Attention-CRF. Comput. Res. Dev. 56(9), 8 (2019)
Evangelatos, P., et al.: Named entity recognition in cyber threat intelligence using transformer-based models. In: 2021 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 348–353 (2021)
Wang, X., et al.: DNRTI: a large-scale dataset for named entity recognition in threat intelligence. In: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 1842–1848 (2020)
Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding (2018)
Ren, F., Jiang, Z., Liu, J.: A bi-directional LSTM model with attention for malicious URL detection. In: 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), pp. 300–305 (2019)
Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Conference on Empirical Methods in Natural Language Processing (2014)
Acknowledgement
The work is supported by the Fundamental Research Funds for the Central Universities, North Minzu University (2022PT_S04), the Natural Science Foundation of Ningxia Province (No. 2020AAC03212), and the Innovation Projects for Graduate Students of North Minzu University (Project No. YCX21087).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this paper
Cite this paper
Zhang, K., Chen, X., Jing, Y., Wang, S., Tang, L. (2022). Research on Named Entity Recognition Method of Network Threat Intelligence. In: Lu, W., Zhang, Y., Wen, W., Yan, H., Li, C. (eds) Cyber Security. CNCERT 2022. Communications in Computer and Information Science, vol 1699. Springer, Singapore. https://doi.org/10.1007/978-981-19-8285-9_16
DOI: https://doi.org/10.1007/978-981-19-8285-9_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8284-2
Online ISBN: 978-981-19-8285-9
eBook Packages: Computer Science, Computer Science (R0)