1 Background Introduction

With the continuous development of Internet technology, people face various security threats even as they rely on the great convenience of the network. Network security testing is therefore of great significance to national security and people's lives. How to quickly identify various attacks in real time, especially unpredictable attacks, is an unavoidable problem today. The intrusion detection system (IDS) is an important achievement in the information security field. Compared with traditional static security technologies [1], such as firewalls and vulnerability scanners, it can identify intrusions that have already occurred or are currently occurring. The network intrusion detection system [2] is an active cybersecurity defense tool that monitors and analyzes key nodes in a network environment in real time, detects signs of attacks or violations of security policies, and responds to such behavior accordingly. To effectively improve the detection performance of intrusion detection systems in a network environment, many researchers have applied machine learning technology to the research and development of intelligent detection systems. For example, literature [3] applies support vector machines to intrusion detection, introducing statistical learning theory into intrusion detection research; literature [4] introduces a naive Bayesian kernel density estimation algorithm into intrusion detection; literature [5] introduces random forests to deal with class imbalance in attack detection and to shorten attack response time. However, most traditional machine learning algorithms are shallow learning algorithms: they emphasize feature engineering and feature selection and cannot handle the classification of the massive intrusion data found in actual networks, so their accuracy steadily declines as network data grows. Deep learning [6] is one of the most widely used technologies in the AI field, and many scholars have applied it to intrusion detection and achieved better accuracy. Deep learning is a kind of machine learning whose concept comes from the study of artificial neural networks; its structure is essentially a multi-layer perceptron with multiple hidden layers. Convolutional neural networks (CNN) require fewer parameters and are well suited to processing data with statistical stationarity and local correlations. In Ref. [7], applying convolutional neural networks to intrusion detection improved the detection rate for the sparse attack type U2R, but detection of the sparse attack type R2L still requires further improvement. The long short-term memory network (LSTM) is a special recurrent neural network and one of the classical deep learning methods; it is specifically suited to learning time-series data with long dependencies and has great advantages in capturing long-term dependencies and ordering in high-level feature sequences. Literature [8] applied LSTM to intrusion detection, effectively alleviating the problems of vanishing and exploding gradients during training and handling input sequence features well; however, the model is still not accurate enough at feature extraction on small and medium-sized datasets.

The model proposed in this paper exploits the advantages of convolutional neural networks in processing locally correlated data and extracting features, and those of long short-term memory networks in capturing sequence order and long-term dependencies. Combined with the self-attention mechanism [9], which is well suited to processing serialized data and classification, this paper proposes a CNN-SALSTM-based intrusion detection model to further improve accuracy and reduce the false alarm rate.

2 Related Theories

2.1 Long Short-Term Memory Network

The long short-term memory network, commonly known as LSTM, is a special RNN [10] that can learn long-term dependencies. It was introduced by Hochreiter & Schmidhuber [11] and has been improved and popularized by many researchers. LSTMs work well on a variety of problems and are now widely used. RNNs are good at processing sequence data, but suffer from vanishing or exploding gradients as well as the long-term dependence problem during training. The LSTM is carefully designed to avoid the long-term dependence problem; retaining long-term historical information is effectively its default behavior, not something it struggles to learn. All recurrent neural networks have the form of a chain of repeating neural network modules. In a standard RNN, the repeating module has a very simple structure, such as a single tanh layer (Fig. 1).

Fig. 1. Single-layer neural network with repeated modules in a standard RNN

LSTM also has this chain structure, but the structure of the repeating module is different. Instead of a single simple neural network layer, an LSTM repeating module has four layers that interact in special ways (Fig. 2).

Fig. 2. Four interacting neural network layers in the repeating module of an LSTM

The long short-term memory model adds three gates to the hidden layer of the RNN model, namely the input gate, the output gate, and the forget gate, together with a cell state update, as shown below (Fig. 3).

Fig. 3. Long short-term memory module

Through the forget gate, the cell state from the previous step is screened, retaining the desired information and discarding useless information. The formula is as follows:

$${f}_{t}=\sigma ({w}_{f}*[{h}_{t-1},{x}_{t}]+{b}_{f})$$
(1)

Here, \({w}_{f}\) and \({b}_{f}\) are the weight matrix and bias term of the forget gate, \(\sigma\) is the sigmoid activation function, and \([\cdot ,\cdot ]\) denotes concatenating two vectors into one. The input gate determines the importance of the information and sends the important information on to where the cell state is updated. This process consists of two parts: the first part uses the sigmoid function to determine which new information needs to be added to the cell state, and the second part uses the tanh function to generate a new candidate vector. The calculation formula is as follows:

$$\left\{\begin{array}{c}{i}_{t}=\sigma ({w}_{i}*[{h}_{t-1}, {x}_{t}]+{b}_{i})\\ {\tilde{c}}_{t}=tanh({w}_{c}*[{h}_{t-1}, {x}_{t}]+{b}_{c})\end{array}\right.$$
(2)

Among them, \({w}_{i}\) and \({b}_{i}\) are the weight and bias of the input gate, and \({w}_{c}\) and \({b}_{c}\) are the weight and bias of the candidate cell state. After the above processing, the cell state is updated to \({c}_{t}\), with the formula as follows:

$${c}_{t}={f}_{t}*{c}_{t-1}+{i}_{t}*{\tilde{c }}_{t}$$
(3)

Among them, * denotes element-wise multiplication, \({f}_{t}*{c}_{t-1}\) represents the discarded information, and \({i}_{t}*{\tilde{c}}_{t}\) represents the newly added information.

The output gate controls the output of the cell state of the present layer and determines which cell state enters the next layer. The calculation formula is as follows:

$$\left\{\begin{array}{c}{o}_{t}=\sigma ({w}_{o}*[{h}_{t-1}, {x}_{t}]+{b}_{o})\\ {h}_{t}={o}_{t}*tanh({c}_{t})\end{array}\right.$$
(4)
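Taken together, Eqs. (1)–(4) define a single LSTM time step. The following minimal NumPy sketch (weight shapes and the dictionary layout are illustrative assumptions, not the implementation used in the experiments) shows how the gate equations compose:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(4).

    W and b are dicts holding the forget/input/candidate/output
    parameters under keys 'f', 'i', 'c', 'o'; shapes are illustrative.
    """
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])      # forget gate, Eq. (1)
    i_t = sigmoid(W['i'] @ z + b['i'])      # input gate, Eq. (2)
    c_hat = np.tanh(W['c'] @ z + b['c'])    # candidate state, Eq. (2)
    c_t = f_t * c_prev + i_t * c_hat        # cell state update, Eq. (3)
    o_t = sigmoid(W['o'] @ z + b['o'])      # output gate, Eq. (4)
    h_t = o_t * np.tanh(c_t)                # hidden state, Eq. (4)
    return h_t, c_t
```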

In the LSTM-based network intrusion detection method, the initial dataset is first digitized, standardized, and normalized; the preprocessed dataset is then input into the trained LSTM model; finally, the results are fed into a softmax classifier to obtain the classification results. Although this method can extract more comprehensive features and improve the accuracy of network intrusion detection when processing sequence data, it suffers from a high false alarm rate.
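As a hedged illustration of this pipeline (treating the 41 KDD99 features as a length-41 sequence and using 5 classes per the label description in Sect. 2.4; the layer sizes are assumptions, not the exact configuration of [8]), a Keras version might look like:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Each preprocessed KDD99 record: 41 features as a sequence of scalars;
# 5 output classes (normal + 4 attack categories), assumed here.
model = keras.Sequential([
    layers.Input(shape=(41, 1)),
    layers.LSTM(64),                       # sequence feature extraction
    layers.Dense(5, activation='softmax')  # softmax classification
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```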

2.2 Convolutional Neural Network

The convolutional neural network is a hierarchical computational model: as the number of network layers increases, increasingly complex abstract patterns can be extracted. The emergence of convolutional neural networks was inspired by biological processing, as the connectivity between neurons resembles the organization of the animal visual cortex. The typical CNN architecture is input → conv → pool → fully connected, which combines the ideas of local receptive fields, shared weights, and spatial or temporal subsampling. This architecture makes CNNs well suited to processing data with statistical stationarity and local correlations, and gives them a high degree of invariance to translation, scaling, and tilting. A CNN is a deep feedforward neural network. Each layer contains a population of neurons; each neuron receives only the output of the previous layer, and after the layer's computation the results are passed to the next layer. Neurons within the same layer are not connected to each other. A trained multi-layer network can thus compute the output from the input data. A convolutional neural network comprises an input layer, convolutional layers, pooling layers, and fully connected layers, with the structure shown in Fig. 4.

Fig. 4. Convolutional neural network structure

Input Layer.

It can be represented as the beginning of the entire neural network. In the field of data processing, the input to convolutional neural networks can be viewed as a data matrix.

Convolutional Layer.

As the most important part of the convolutional neural network, each convolutional layer comprises several convolutional units whose parameters are optimized by the backpropagation algorithm. The purpose of the convolution operation is to extract different features of the input. The first convolutional layer can only extract low-level features such as edges, lines, and corners; deeper layers iteratively extract more complex features from these low-level features. Convolutional layers analyze each small block thoroughly to obtain more abstract features. Convolutional neural networks first extract local features and then fuse them at a higher level, which not only yields global features but also reduces the number of neuron connections. Since the number of parameters would still be very large, sharing the same weights among neurons further greatly reduces the number of network parameters. For the m-th convolutional layer, let its output be \({\mathrm{y}}^{\mathrm{m}}\); then the output \({\mathrm{y}}_{\mathrm{k}}^{\mathrm{m}}\) of the k-th convolution kernel is given below, where \({\mathrm{M}}_{\mathrm{k}}\) denotes the set of input feature maps and \(\updelta\) the activation function:

$${\mathrm{y}}_{\mathrm{k}}^{\mathrm{m}}=\updelta \left({\sum }_{{\mathrm{y}}_{\mathrm{i}}^{\mathrm{m}-1}\in {\mathrm{M}}_{\mathrm{k}}}{\mathrm{y}}_{\mathrm{i}}^{\mathrm{m}-1}*{\mathrm{W}}_{\mathrm{ik}}^{\mathrm{m}}+{\mathrm{b}}_{\mathrm{k}}^{\mathrm{m}}\right)$$
(5)

Pooling Layer.

The pooling layer can reduce the size of the data matrix very efficiently. The two most commonly used methods are max pooling and average pooling, which further reduce the number of nodes in the fully connected layer and thus the number of parameters in the entire network.

Fully Connected Layer and Output Layer.

The fully connected layer classifies the features extracted from the data, and the output layer produces the final probability distribution over the classes.
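To make the input → conv → pool → fully connected pipeline concrete, here is a minimal Keras sketch for one-dimensional data such as the 41-feature KDD99 records (the layer sizes and 5-class output are illustrative assumptions, not the paper's exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(41, 1)),                 # input layer: data matrix
    layers.Conv1D(32, kernel_size=3, strides=2,
                  activation='relu'),            # convolutional layer
    layers.MaxPooling1D(pool_size=2),            # pooling layer
    layers.Flatten(),
    layers.Dense(64, activation='relu'),         # fully connected layer
    layers.Dense(5, activation='softmax'),       # output layer
])
```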

2.3 Attention

The attention mechanism was first proposed in the field of image recognition. The idea is that when humans deal with certain things or images, they allocate more attention to the key parts; once attention is concentrated, information can be acquired more efficiently. When processing a large amount of input information, a neural network can likewise borrow the attention mechanism of the human brain and select only some key input information for processing, thus improving its efficiency (Fig. 5).

Fig. 5. The essence of the attention mechanism: addressing

The essence of the attention mechanism is an addressing process [12], as shown above: given a task-related query vector Q, the attention value is computed by calculating the attention distribution over the keys and applying it to the values. This is how the attention mechanism reduces the complexity of the neural network model: there is no need to feed all N pieces of input information into the neural network for computation; it suffices to select some task-related information x and input it. The attention mechanism can be divided into three steps: first, input the information; second, calculate the attention distribution α; third, use the attention distribution α to compute the weighted average of the input information. When using neural networks, we can usually encode a variable-length sequence with a convolutional or recurrent network to obtain an output vector sequence of the same length, as shown in Fig. 6:

Fig. 6. Variable-length sequence coding based on convolutional and recurrent networks

As can be seen from the figure above, both convolutional and recurrent neural networks actually perform "local coding" of the variable-length sequence: the convolutional neural network is clearly based on n-gram local coding, while the recurrent neural network can establish only short-range dependencies because of vanishing gradients (Fig. 7).

Fig. 7. Self-attention model

In this case, we can use the attention mechanism to "dynamically" generate weights for different connections; this is the self-attention model. Since the weights of the self-attention model are generated dynamically, longer information sequences can be processed, which is precisely what makes self-attention models so powerful. The self-attention model is calculated as follows. Let X = [x1, · · ·, xN] denote N pieces of input information; the query vector sequence, key vector sequence, and value vector sequence are obtained through linear transformations:

$${\text{Q}} = {\text{w}}_{{\text{Q}}} {\text{X}}\;{\text{K}} = {\text{w}}_{{\text{K}}} {\text{X}}\;{\text{V}} = {\text{w}}_{{\text{V}}} {\text{X}}$$
(6)

From the above formula, Q, K, and V in self-attention are all transformations of the input itself, and the attention is calculated as:

$${\mathrm{h}}_{\mathrm{i}}=\mathrm{att}((\mathrm{K},\mathrm{V}),{\mathrm{q}}_{\mathrm{i}})={\sum }_{\mathrm{j}=1}^{\mathrm{N}}{\mathrm{a}}_{\mathrm{ij}}{\mathrm{v}}_{\mathrm{j}}={\sum }_{\mathrm{j}=1}^{\mathrm{N}}\mathrm{softmax}(\mathrm{s}({\mathrm{k}}_{\mathrm{j}},{\mathrm{q}}_{\mathrm{i}})){\mathrm{v}}_{\mathrm{j}}$$
(7)

In self-attention models, the scaled dot product is usually used as the attention scoring function, and the output vector sequence can be written as:

$$\mathrm{H}=\mathrm{V}\;\mathrm{softmax}\left(\frac{{\mathrm{K}}^{\mathrm{T}}\mathrm{Q}}{\sqrt{{\mathrm{d}}_{\mathrm{k}}}}\right)$$
(8)
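A minimal NumPy sketch of Eqs. (6)–(8) follows (dimensions are illustrative; it uses the column-vector convention from the formulas above, where each column of X is one input):

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention, Eqs. (6)-(8).

    X: (d_in, N) matrix whose columns are the N inputs.
    """
    Q = W_q @ X                        # query sequence, Eq. (6)
    K = W_k @ X                        # key sequence
    V = W_v @ X                        # value sequence
    d_k = K.shape[0]
    scores = K.T @ Q / np.sqrt(d_k)    # scaled dot-product scoring
    A = softmax(scores, axis=0)        # attention distribution alpha
    return V @ A                       # output H, Eq. (8)

# Example: 6 inputs of dimension 8, attention dimension 4 (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
W_q, W_k, W_v = (rng.normal(size=(4, 8)) for _ in range(3))
H = self_attention(X, W_q, W_k, W_v)   # shape (4, 6)
```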

2.4 Data Pre-processing

In this paper, the KDD99 [13] dataset is used as the training and test dataset. The dataset consists of nine weeks of network connection data collected from a simulated U.S. Air Force LAN, divided into training data with label information and test data without label information. The test and training data have different probability distributions, and the test data contains attack types that do not appear in the training data, which makes intrusion detection more realistic. Each connection record in the dataset has 41 features and 1 attack-type label. The dataset covers one normal type and 36 attack types in total: 22 attack types appear in the training data, and the remaining 14 appear only in the test data (Fig. 8).

Fig. 8. Details of five labels

The 41 features fall into four groups. Basic TCP connection features (9 kinds) include basic connection attributes such as duration, protocol type, and number of transmitted bytes. TCP connection content features (13 kinds in total) are extracted from content that may reflect intrusion behavior, such as the number of failed logins. Time-based network traffic statistics (9 kinds, features 23 to 31) exploit the strong temporal correlation of network attack events: there is a certain connection between the current connection record and the preceding ones, and statistical calculation better reflects the relationship between connections. Host-based network traffic statistics (features 32–41) complement the time-based statistics, which only capture the relationship between the current connection and the previous two seconds, as shown in the following figure (Fig. 9). An original intrusion data record looks like: x = {0, icmp, ecr_i, SF, 1032, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 511, 511, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 255, 255, 1.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, smurf}, i.e., 41 feature fields and one label.

Fig. 9. Details of forty-one features

2.5 Numericalization of Character Features

First, duplicates should be removed. In actually collected data, many intrusion records are identical, so deduplication [14] can be used to reduce the amount of input data and eliminate information redundancy. The KDD99 dataset has already been processed in this way, so no filtering is required in this paper. However, some features in the KDD99 dataset are numeric while others are symbolic, so all symbolic features collected from different sources are converted into numeric form to simplify data processing. The rule for numericalizing symbolic features is attribute mapping. For example, attribute 2 is the protocol type protocol_type, with three values TCP, UDP, and ICMP, each represented by its position: TCP is 1, UDP is 2, and ICMP is 3. Similarly, the mapping establishes the correspondence between symbol values and numeric values for the 70 symbol values of the service attribute and the 11 symbol values of the flag attribute. Labels are processed as shown below (Fig. 10).
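A small sketch of this attribute mapping (column names follow the usual KDD99 conventions; the positional scheme is the one described above):

```python
import pandas as pd

# Positional mapping for the protocol attribute, as described above
protocol_map = {'tcp': 1, 'udp': 2, 'icmp': 3}

def numericalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                    # deduplication
    df['protocol_type'] = df['protocol_type'].map(protocol_map)
    # service (70 values) and flag (11 values) are mapped the same way,
    # by each symbol's position in its sorted value list
    for col in ('service', 'flag'):
        values = sorted(df[col].unique())
        df[col] = df[col].map({v: i + 1 for i, v in enumerate(values)})
    return df
```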

Fig. 10. Description of five labels

2.6 Normalization

Because some features take values of 0 or 1 while others have large value ranges, the influence of large-valued features would be too great and the effect of small-valued features would vanish. Each feature value therefore needs to be normalized into [0, 1]:

$${\text{y}} = \frac{{\text{x}} - {\text{x}}_{\text{min}}}{{\text{x}}_{\text{max}} - {\text{x}}_{\text{min}}}$$
(9)

After normalization, the record above becomes:

x = {0.0, 3.38921626955e−07, 0.00128543131293, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00195694716243, 0.00195694716243, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.125490196078, 1.0, 1.0, 0.0, 0.03, 0.05, 0.0, 0.0, 0.0, 0.0, 0}
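A per-feature implementation of this min-max scaling (a sketch; sklearn's MinMaxScaler is an equivalent alternative):

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each feature (column) of X into [0, 1] per Eq. (9)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid divide-by-zero
    return (X - x_min) / span
```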

3 Model Establishment

Fig. 11. CNN-SALSTM network structure

3.1 CNN-SALSTM Network Structure

The overall structure of the proposed model is shown in Fig. 11, and its workflow is as follows (a minimal end-to-end sketch is given after the list):

  • Step 1. Data preprocessing. One-hot encoding is applied to the text-type features: network protocol, network service type, and network connection state. Meanwhile, continuous numerical features such as the connection duration are normalized according to Eq. 10:

    $${\mathrm{x}}_{\mathrm{n}}=\frac{x-{\mathrm{x}}_{\mathrm{min}}}{{\mathrm{x}}_{\mathrm{max}}-{\mathrm{x}}_{\mathrm{min}}}$$
    (10)
  • Step 2. High-level feature extraction. The basic features of the preprocessed packets are fed into a LeNet-style network, which outputs high-level features via one-dimensional convolution operations. Each convolutional layer is followed by a BN layer and a LeakyReLU activation function to speed up training and avoid collapse as much as possible.

  • Step 3. The self-attention mechanism highlights high-weight features. Each input vector is multiplied by the three matrices WQ, WK, and WV to obtain its query, key, and value vectors; the resulting attention probabilities are then multiplied with the CNN convolution output and passed to the next layer.

  • Step 4. Classify the network connections. The high-level features are fed into the LSTM, and the classification results of the network data are obtained through the softmax function.
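The following Keras sketch assembles Steps 1–4 under stated assumptions: the 122-dimensional one-hot input size, the layer widths, and the use of Keras's built-in dot-product Attention layer for Step 3 are illustrative choices, not the paper's exact published configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 122   # 41 KDD99 features after one-hot encoding (assumed)
n_classes = 23     # normal + 22 attack types

inputs = keras.Input(shape=(n_features, 1))

# Step 2: 1-D convolution + BN + LeakyReLU for high-level features
x = layers.Conv1D(32, kernel_size=3, strides=2)(inputs)
x = layers.BatchNormalization(momentum=0.85)(x)
x = layers.LeakyReLU(alpha=0.2)(x)

# Step 3: scaled dot-product self-attention over the conv feature maps
# (query = value = conv output)
x = layers.Attention(use_scale=True)([x, x])

# Step 4: LSTM + softmax classification
x = layers.LSTM(64, recurrent_dropout=0.01)(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(n_classes, activation='softmax')(x)

model = keras.Model(inputs, outputs)
```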

3.2 Evaluation Method

Precision, recall, and F-measure are used in this experiment to judge the classification effect of the model. TP denotes the number of samples correctly identified as attacks, FP the number of samples incorrectly identified as attacks, TN the number of samples correctly identified as normal, and FN the number of samples incorrectly identified as normal. Precision represents the proportion of the data classified as attacks that are actually attacks. The calculation formula is as follows:

$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(11)

Recall represents the proportion of all attack data that is correctly classified as attacks. The calculation formula is:

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(12)

F-Measure is the weighted harmonic mean of Precision and Recall, used to combine their scores. The calculation formula is:

$$\mathrm{F }-\mathrm{ Measure}=\frac{(1+{\upbeta }^{2})\times \mathrm{Precision}\times \mathrm{Recall}}{{\upbeta }^{2}\times (\mathrm{Precision}+\mathrm{Recall})}$$
(13)

β is used to adjust the relative weight of precision and recall. When β = 1, F-Measure is the F1 score.
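These metrics can be computed directly, for example with scikit-learn (a sketch with hypothetical label arrays; macro averaging over classes is one possible choice):

```python
from sklearn.metrics import precision_score, recall_score, fbeta_score

# y_true / y_pred are hypothetical label arrays (0 = normal, 1..22 = attacks)
y_true = [0, 1, 1, 2, 0, 2]
y_pred = [0, 1, 2, 2, 0, 1]

precision = precision_score(y_true, y_pred, average='macro', zero_division=0)
recall = recall_score(y_true, y_pred, average='macro', zero_division=0)
f1 = fbeta_score(y_true, y_pred, beta=1.0, average='macro')  # F1 when beta = 1
```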

3.3 Experimental Parameter Setting and Result Analysis

The software environment used in this paper is Python 3.7, TensorFlow 2.1, and Keras 2.2.4; the experimental hardware is an Intel Core i7-8700 CPU with 16 GB RAM. The model was trained with the Adam optimizer and the categorical cross-entropy loss function. Adam's learning rate is 0.0001, the number of epochs is 2000, the batch size is 128, the momentum in batch normalization is 0.85, and the alpha of LeakyReLU is 0.2. Dropout is set to 0.4, and the LSTM recurrent dropout is set to 0.01. For the experiment, 300,000 records are selected from the KDD99 training set to train the model, and the remaining 194,021 records are used to test it. The Sklearn toolkit is used to encode the 22 attack types in the training set. The results are shown in Fig. 12, which compares the intrusion detection accuracy of CNN+LSTM and CNN+SA+LSTM.
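A sketch of the training configuration with the hyperparameters listed above (the `model` variable refers to a CNN-SALSTM model such as the one sketched in Sect. 3.1; `x_train`, `y_train`, `x_test`, and `y_test` are placeholders for the preprocessed KDD99 splits):

```python
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# x_train / y_train: 300,000 preprocessed KDD99 records with one-hot labels
history = model.fit(x_train, y_train,
                    epochs=2000, batch_size=128,
                    validation_data=(x_test, y_test))
```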

Fig. 12. Intrusion detection accuracy of CNN+LSTM and CNN+SA+LSTM

In the experiments, the CNN used a 3 × 3 convolutional kernel with a stride of 2, each convolutional layer being followed by a BN layer and a dropout layer. In Table 2, label0 represents normal network traffic and label1–label22 represent the 22 different attack types. From the experimental results, the CNN+SA+LSTM hybrid model has higher accuracy than the LSTM and CNN+LSTM models, and its convergence rate is significantly better than that of the CNN+LSTM model. The iterative process of model training is shown in Figs. 13 and 14.

Fig. 13. CNN+LSTM model accuracy graph

Fig. 14. CNN+SALSTM model accuracy graph

4 Conclusion

In view of the current state of intrusion detection research, a neural network model combining a CNN with a self-attention LSTM is proposed to address the problems of imbalanced intrusion data and inaccurate feature representation. Convolutional neural networks are used to extract features from the raw data; features that strongly affect the classification result are given higher weight by the self-attention mechanism; the processed high-level features are then fed into the LSTM network as input for prediction. The KDD99 training set is used for model training and testing in a comparative analysis of the CNN+LSTM and CNN+SALSTM models. Experiments show that intrusion detection based on the CNN+SALSTM model achieves better accuracy and F1 metrics than the pure CNN+LSTM model.