1 Introduction

Emerging IoT applications such as augmented reality (AR) and virtual reality (VR) wearables, smart homes, and the Industrial Internet all require low-latency, low-energy data communication and processing. In traditional cloud computing, data is stored and processed far from the end device, which makes network communication latency high [1–3]. Fog computing, proposed by Cisco as an extension of cloud computing, is expected to meet these needs, as shown in Fig. 1. It places small servers with storage, routing devices, gateways and other devices close to the end-user to process and respond to data directly, while allowing the cloud computing centre to manage the state of these devices and process data asynchronously [4, 5]. In this way, the fog devices can respond to user requests in near real-time in general situations, while special situations can be handled with the help of the cloud computing centre [6–11]. However, the application of fog computing also brings new issues, one of which is security. Because fog nodes have hardware resources and hold valuable data, they are an attractive target for malicious entities [12]. Attackers exploit the characteristics of network communication protocols to launch denial of service (DoS) attacks, using large botnets to send carefully constructed network messages that leave the requests of normal user devices unanswered [9, 13–19]. Such attacks can be catastrophic in an industrial IoT scenario [3, 15, 20–24]. In addition, there are other attacks such as override attacks and sniffing attacks.

Fig. 1 Cisco’s proposed fog computing model

A number of countermeasures have been proposed to deal with cyber attacks, such as reducing the risk of intrusion by regularly updating passwords, strictly verifying the identity of users and limiting their permissions at a fine-grained level, or using encryption to protect network communications so that attackers cannot extract useful data [25–28]. In addition to preventive methods, there are also countermeasures for network attacks that have already been launched, namely intrusion detection techniques [4, 29]. Intrusion detection techniques analyse certain indicators in network messages to determine whether they are attack traffic, and then filter and block them. Depending on how they are implemented, they can be divided into intrusion detection techniques based on feature recognition and those based on anomaly detection [30–32]. Feature-based intrusion detection requires that the characteristics of various network attacks are stored in a database in advance; incoming network traffic is then tested for the presence of these characteristics to determine whether it is an attack [33–35]. Feature-based techniques can accurately identify known attacks but cannot detect unknown ones, so the database of attack features must be constantly updated for new attacks to be detected. Anomaly-based intrusion detection identifies network attacks by determining how much network traffic deviates from normal traffic. It does not require storing the characteristics of various network attacks in advance or updating an attack feature database in real time, so it is cheaper and more suitable for the current network environment.

Besides, the rapid development of data science and artificial intelligence and their excellent performance in natural language processing, image processing and other fields have made them a research hotspot. By exploiting the specific characteristics of network traffic, many researchers have proposed network intrusion detection methods based on artificial intelligence. These methods show that data science and artificial intelligence are powerful tools for addressing the problems and challenges brought by network attacks. Deep learning, a machine learning technology based on representation learning, does not require manual feature design and feature extraction. The raw data is simply fed into the neural network, which automatically learns the high-level information in it [36, 37].

Although many researchers have done extensive research on the application of machine learning and deep learning in network intrusion detection and achieved remarkable results, semi-supervised autoencoder models still suffer from low accuracy. In this paper, we propose a semi-supervised intrusion detection model (NADLA) combining a long short-term memory neural network (LSTM) and an autoencoder (AE). It is a two-stage detection technique for classifying network traffic. In terms of data pre-processing, NADLA not only transforms the non-numerical features in the samples into numerical features and scales the data values into a small interval, but also increases the sensitivity of the model to anomalies by using the 3-sigma rule to remove outliers from the AE sub-model training data. For anomaly detection, the LSTM sub-model is first used to separate the sample set into normal and anomalous samples by using high-level information from the time series, and then the AE sub-model is used to extract the high-level information from the samples considered normal by the LSTM sub-model. The classification results are obtained by comparing the error between the input samples and the reconstructed samples with a set threshold. Experimental results on the NSL-KDD dataset show that the NADLA model achieves an average accuracy of 92.79% and an F1-score of 93.73%. The advantage of the model is that it uses an autoencoder to reduce the reliance on labelled datasets to a certain extent, while the LSTM is used to learn the temporal features in the dataset to further improve accuracy and recall. In addition, the model structure is optimised through a large number of comparison experiments, allowing the model to be trained efficiently.

Specifically, the main contributions of this study are as follows.

  • This paper proposes a new network anomaly detection model (NADLA) incorporating LSTM and AE, which is capable of learning both temporal information in the data and high-level features of normal data.

  • In this paper, NADLA improves the data pre-processing method by not only performing data encoding and normalisation but also removing outliers, which significantly improves the accuracy of the trained model in detecting anomalies.

  • We further investigated the effect of different design structures and parameter settings in the NADLA model on the accuracy of the model in detecting anomalies.

The rest of this paper is organized as follows. Section 2 discusses the related work. This is followed by Sect. 3, which specifies the design details of the proposed NADLA model. In Sect. 4, we describe the experimental methods, including the dataset, the pre-processing operations, and the evaluation metrics. Then, Sect. 5 presents the experimental results. Finally, Sect. 6 summarises the work and suggests directions for future work.

2 Related Work

Research in the field of network intrusion detection can be divided into two parts. The first part uses traditional machine learning methods to detect anomalous network traffic, and the other part uses deep learning methods. This section introduces the related research results of these two parts.

2.1 The machine learning-based approach to network anomaly detection

Over the past two decades, many researchers have conducted various studies on network anomaly detection. Tavallaee et al. [38] conducted in-depth research on the KDD-CUP99 and NSL-KDD datasets, and compared the performance of various machine learning methods on these datasets. Assiri et al. [39] proposed a random forest-based genetic algorithm for anomaly classification in network intrusion detection. Tao et al. [40] combined the genetic algorithm with SVM to improve the accuracy of the model and reduce the model construction time. Chand et al. [41] proposed a stacked SVM model, which can effectively identify intrusions and outperforms BayesNet, AdaBoost, SimpleCart and other classifiers proposed in past studies for intrusion detection.

After single models encountered a bottleneck in improving classification accuracy, researchers turned to combining models to further improve detection accuracy. Agarwal et al. [42] proposed an ensemble method combining three machine learning methods, Naïve Bayes, SVM and K-nearest neighbour (KNN), to improve classification accuracy and reduce processing time. Ling and Wu [43] used the random forest and trained an ensemble method based on multiple classifiers using the selected features, which is effective in improving intrusion detection accuracy. Kim et al. [44] developed a hybrid system that improves the accuracy of attack detection. Machine learning methods require manual feature selection for network traffic, which may be complex and time-consuming. Therefore, researchers began to apply deep learning to network anomaly detection.

2.2 The deep learning-based approach to network anomaly detection

Researchers have paid increasing attention to deep learning methods, especially after the 2012 image classification competition made a splash and created a research boom in academia and industry [45]. Zhang et al. [46] proposed a hierarchical neural network model by fusing a modified LeNet-5 with an LSTM, allowing the model to learn high-level features by extracting key components as training samples. The model performed well in experiments. To address the problem of a high false alarm rate, Imrana et al. [47] proposed a bidirectional long short-term memory neural network model (BiDLSTM) for intrusion detection, which builds on the long short-term memory neural network model. Their experimental results showed good performance in metrics such as recall and F1-score.

For deep learning models, Shone et al. [48] evaluated their proposed model on the benchmark dataset KDD-CUP99, obtaining an attack classification accuracy of 97.87% and a false alarm rate of 2.15%. Ieracitano et al. [49] achieved an accuracy of 84.21% on the NSL-KDD dataset using data analysis and statistical methods together with a three-layer autoencoder model. In [50], the authors implemented different deep learning models, including autoencoders, RNNs and convolutional neural networks. The authors of [51] implemented an autoencoder-based anomaly detection model with automatic threshold learning and achieved an accuracy of 88.98%. Farahnakian and Heikkonen [52] proposed a deep autoencoder (DAE) model containing four autoencoders, which are trained layer by layer to prevent overfitting and local optima. In addition, they classify the incoming network flows into normal and abnormal using a softmax activation function.

3 Design and implementation of NADLA

This section describes the structure of the NADLA model and the relevant details. In Sect. 3.1, the structure and workflow of the proposed model are presented. In Sects. 3.2 and 3.3, the specific details of the sub-modules of the model are described.

3.1 Model structure

The proposed NADLA model combines an LSTM sub-model with an autoencoder sub-model, with each sub-model contributing a fine-grained, quantifiable improvement. It has undergone extensive testing, and the results on the intrusion detection benchmark dataset NSL-KDD are encouraging. Figure 2 depicts the overall structure of the model.

Fig. 2 Model structure of the NADLA

A data sample first enters the LSTM module of the NADLA model after being pre-processed and converted into time-series format. The module is composed of two layers of LSTM cells. The gating mechanism and cell state of the LSTM unit allow information to be extracted from the sample effectively. The gating mechanism consists of three gates. The input gate processes the incoming data sample through vector operations and activation functions and passes the information worth keeping into the cell. The forget gate examines the incoming data to decide how much of the existing information to retain. The output gate applies activation functions to the cell state to produce the output. The LSTM sub-model uses the temporal information in the data to label each data sample. The samples that the LSTM pre-labels as normal are then passed to the autoencoder module. This module is made up of two parts: an encoder and a decoder. The encoder compresses a sample into a low-dimensional representation that retains its essential information. The decoder uses only this essential information to reconstruct a sample of the same dimension as the input sample. Because information is lost during compression, the reconstructed sample differs from the original sample by some error. The sample can then be classified by comparing this reconstruction error against the threshold obtained during model training.

Data pre-processing is also improved. So that the features of the original samples in the dataset can be computed and quantified, the non-numerical (string) feature values are first transformed into numerical features using one-hot encoding. The data are then normalised to ensure fast model convergence. Since the classification mechanism of the autoencoder module relies on abnormal samples having a higher reconstruction error than normal samples, filtering the data that enter the module improves its generalisation capability, so the training samples must be filtered to remove specific points (outliers). Figure 3 displays the overall process. The designs of the LSTM and autoencoder modules are addressed in detail in the following subsections, and Sect. 4 discusses the data pre-processing technique.

Fig. 3 The overall workflow of the model designed in this paper

3.2 LSTM sub-model

LSTM is a variant of the RNN model. Conventional RNN models are trained only with the BPTT approach, a backpropagation (BP) algorithm that treats the data as a time series, and therefore suffer from vanishing gradients during training. The LSTM addresses this problem with a gating mechanism and a cell state. The three control gates that make up the LSTM are the input gate, the output gate, and the forget gate. The input gate is made up of a sigmoid function and a tanh function, which process the current input and the earlier information to produce the information that has to be remembered at the current step. The forget gate selects which information should be forgotten by using the sigmoid function as an activation function. As Fig. 4 illustrates, the output gate multiplies the outputs of the sigmoid and tanh functions to create the information that is delivered to the next LSTM unit [37].

Fig. 4 LSTM structure

The workflow of the LSTM sub-model in the NADLA model, which contains two LSTM hidden layers, is shown in Fig. 5. After data pre-processing, the dataset contains n samples, and each sample \(X_{i}\) is a d-dimensional vector, as shown in Eqs. 1 and 2.

$$\begin{aligned}&\hbox{dataset}=\left\{ X_{1}, X_{2}, \ldots , X_{n}\right\} \end{aligned}$$
(1)
$$\begin{aligned}&X_{i}=\left\{ x_{i, 1}, x_{i, 2}, \ldots , x_{i, d}\right\} , X_{i} \in {\text{dataset}} \end{aligned}$$
(2)

Since the input format of the LSTM sub-model is a time series with a fixed number of time steps, the original input data must be regrouped into LSTM input samples of that length. We set the number of time steps of the LSTM sub-model to 128. Every 128 consecutive original samples are merged to form one input sample of the LSTM sub-model, as shown in Eq. 3.

$$\begin{aligned} \hbox{LSTM input data}=\left\{ S_{1}, S_{2}, \ldots , S_{\lfloor n / 128 \rfloor }\right\} \end{aligned}$$
(3)

where \(S_{i}=\left\{ X_{128 \times i}, X_{128 \times i+1}, \ldots , X_{128 \times i+127}\right\}\).
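The grouping described by Eq. 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `make_sequences` is hypothetical, indexing is zero-based, and dropping the trailing samples that do not fill a complete window is an assumption, since the text does not say how a remainder is handled.

```python
def make_sequences(samples, step=128):
    """Group consecutive samples into non-overlapping windows of `step` samples.

    Assumption: trailing samples that do not fill a whole window are dropped.
    """
    n_windows = len(samples) // step  # number of complete windows
    return [samples[w * step:(w + 1) * step] for w in range(n_windows)]


# Toy data: 300 samples, each a 3-dimensional feature vector.
toy = [[float(i), float(i) + 0.1, float(i) + 0.2] for i in range(300)]
windows = make_sequences(toy, step=128)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 2 128 3
```

With 300 toy samples, two complete windows are produced and the last 44 samples are left out under the stated assumption.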

Fig. 5 Details of the designed LSTM sub-model

The first step for each sample \(S_{t}\) entering the LSTM sub-model is to determine, through the forget gate, how much information to discard. The gate reads the information \(H_{t-1}\) of the previous hidden layer and the information \(S_{t}\) of the input layer and uses the activation function to output a value between 0 and 1, which indicates how much of the information held by the previous LSTM unit is retained, as shown in Eq. 4.

$$\begin{aligned} f_{t}=\sigma \left( W_{f} \cdot \left[ H_{t-1}, S_{t}\right] +b_{f}\right) \end{aligned}$$
(4)

Using Eq. 5, the output \(i_{t}\) of the input gate can be calculated from the information \(H_{t-1}\) of the previous hidden layer and the information \(S_{t}\) of the input layer. The tanh function is then used to create a candidate value vector \(\widetilde{C}_{t}\), which is added to the LSTM cell state, as shown in Eqs. 6 and 7.

$$\begin{aligned}i_{t}=\sigma \left( W_{i} \cdot \left[ H_{t-1}, S_{t}\right] +b_{i}\right) \end{aligned}$$
(5)
$$\begin{aligned}\widetilde{C}_{t}=\tanh \left( W_{C} \cdot \left[ H_{t-1}, S_{t}\right] +b_{C}\right) \end{aligned}$$
(6)
$$\begin{aligned}C_{t}=f_{t} * C_{t-1}+i_{t} * \widetilde{C}_{t} \end{aligned}$$
(7)

Finally, the information enters the output gate, and the output value is determined according to the cell state of the LSTM. We pass the cell state \(C_{t}\) through the tanh function and multiply it by \(o_{t}\) to obtain \(H_{t}\), the output of this LSTM hidden layer, as shown in Eqs. 8 and 9.

$$\begin{aligned}o_{t}=\sigma \left( W_{o} \cdot \left[ H_{t-1}, S_{t}\right] +b_{o}\right) \end{aligned}$$
(8)
$$\begin{aligned}H_{t}=o_{t} * \tanh \left( C_{t}\right) \end{aligned}$$
(9)
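To make the gate computations concrete, the following sketch performs one LSTM time step in plain Python, following Eqs. 4 to 9 and using the standard cell-state update in which the current input gate \(i_{t}\) scales the candidate vector. It is an illustration only: the names `lstm_step` and `affine` and the parameter dictionary layout are hypothetical, `s_t` stands for the input vector at one time step, and the all-zero weights are chosen so that the result can be checked by hand.

```python
import math


def affine(W, x, b):
    """Return W·x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def lstm_step(s_t, h_prev, c_prev, params):
    """One LSTM time step; every weight matrix acts on [h_prev, s_t]."""
    z = h_prev + s_t  # concatenation [H_{t-1}, S_t]
    f_t = sigmoid(affine(params["W_f"], z, params["b_f"]))  # Eq. 4, forget gate
    i_t = sigmoid(affine(params["W_i"], z, params["b_i"]))  # Eq. 5, input gate
    c_hat = [math.tanh(x) for x in affine(params["W_C"], z, params["b_C"])]  # Eq. 6
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]  # Eq. 7
    o_t = sigmoid(affine(params["W_o"], z, params["b_o"]))  # Eq. 8, output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # Eq. 9
    return h_t, c_t


# Hypothetical toy sizes: 2 hidden units and 3 input features, all-zero weights.
hidden, inputs = 2, 3
zeros_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zeros_b = [0.0] * hidden
params = {name: zeros_W for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: zeros_b for name in ("b_f", "b_i", "b_C", "b_o")})

h_t, c_t = lstm_step([0.3, 0.7, 0.1], [0.0, 0.0], [1.0, -1.0], params)
print(c_t)  # [0.5, -0.5]
```

With zero weights every gate outputs 0.5 and the candidate vector is zero, so the new cell state is simply half of the previous one, which matches Eq. 7.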

The LSTM sub-model of the NADLA model contains two LSTM layers: the first uses 16 hidden units and the second uses 8 hidden units. In addition, each layer applies batch normalisation and 20% dropout to avoid the vanishing gradient problem and overfitting. Applying these operations to \(H_{t}\) yields \(H_{t}^{\prime }\), and the classification results are then obtained from a fully connected layer with a softmax activation function, as shown in Eq. 10.

$$\begin{aligned} \{\{\hbox{Attack}\},\{\hbox{Normal}\}\} \leftarrow \hbox{Softmax} \left( H_{t}^{\prime }\right) \end{aligned}$$
(10)

where \(\{\hbox{Attack}\}\) and \(\{\hbox{Normal}\}\) are the sets of LSTM samples \(\left\{ S_{1}, S_{2}, \ldots , S_{k}\right\} \subset \hbox{LSTM input data}\) classified as attack and normal, respectively.

3.3 AE sub-model

An autoencoder is a self-supervised learning model that is trained using only input data and is widely used in semi-supervised and unsupervised machine learning. The autoencoder network model \(f_\theta\) can be split into two parts: the first part learns the mapping \(g_{\theta 1}: x \rightarrow z\), while the second part learns the mapping \(h_{\theta 2}: z \rightarrow \hat{x}\). The ultimate aim of the model is to make \(\hat{x}\) and x as similar as possible. Thus, \(g_{\theta 1}\) can be seen as a data encoding process that encodes the original high-dimensional sample data into a low-dimensional hidden variable z, and \(h_{\theta 2}\) as a data decoding process that uses the encoded low-dimensional hidden variable z to decode a reconstructed sample \(\hat{x}\) with the same dimensionality as the input sample x. The feedback mechanism for model learning then relies on the magnitude of the error between \(\hat{x}\) and x. Therefore, an autoencoder usually has input and output layers of the same dimensionality and contains multiple hidden layers whose dimensionality first decreases and then increases. A simplified model is shown in Fig. 6.

Fig. 6 Classical structure of AE

The AE sub-model processing flow in the NADLA model is shown in Fig. 7. The autoencoder consists of two operations, encoding and decoding. The AE sub-model is a five-layer structure. The input and output layers are both 122-dimensional, and there are three hidden layers with 32, 10, and 32 dimensions, in that order. Training is performed unsupervised using mini-batch stochastic gradient descent [44]. Each layer applies a normalisation operation and 20% dropout to avoid the vanishing gradient problem and overfitting.

Fig. 7 The designed AE sub-model data processing process

The samples in the dataset are divided into normal and abnormal sample sets after processing by the LSTM sub-model, as shown in Eq. 10. After analysing the LSTM classification results, we found that some of the samples classified as normal are actually abnormal. The NADLA model uses the AE sub-model to further classify this part of the sample set. The input data format of the AE sub-model is the original data format. Thus, to match the input format of the AE sub-model, each LSTM sample \(S_{i}\) in the LSTM-judged normal sample set \(\{{\text{Normal}} \}=\left\{ S_{1}, S_{2}, \ldots , S_{k}\right\} , k \le n\) needs to be decomposed into the original samples, as shown in Eq. 11.

$$\begin{aligned} X_{i * 128}, X_{i * 128+1}, \ldots , X_{i * 128+127} \leftarrow S_{i}, i \in [1, k] \end{aligned}$$
(11)

where \(X_{i}=\left\{ x_{i, 1}, x_{i, 2}, \ldots , x_{i, d}\right\}\) and d indicates the number of sample features.
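The decomposition in Eq. 11 amounts to flattening the sequences that the LSTM labelled as normal back into individual samples. A minimal sketch follows; the function name `normal_samples_for_ae` and the string labels are hypothetical, and a window length of 2 is used in place of 128 to keep the toy data readable.

```python
def normal_samples_for_ae(sequences, labels):
    """Flatten the sequences labelled "Normal" back into individual samples."""
    samples = []
    for seq, label in zip(sequences, labels):
        if label == "Normal":
            samples.extend(seq)
    return samples


# Toy example: three sequences of two 2-dimensional samples each.
sequences = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
    [[0.9, 1.0], [1.1, 1.2]],
]
labels = ["Normal", "Attack", "Normal"]
ae_input = normal_samples_for_ae(sequences, labels)
print(len(ae_input))  # 4 samples, taken from the two "Normal" sequences
```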

In the AE sub-model, during the encoding phase, the d-dimensional input layer data \(X_{i}\) passes through two hidden layers [39, 53] that compress its dimensionality, yielding the high-level representation \(Y_{i}\) of the data, as shown in Eq. 12.

$$\begin{aligned} Y_{i}=F_{1}\left( W X_{i}+b\right) , i \in [1, k] \end{aligned}$$
(12)

where \(F_{1}\) is the encoder function, W denotes the weight matrix, and b denotes the bias vector.

In the decoding operation, the high-level information \(Y_{i}\) is remapped as a d-dimensional vector \(\widehat{X}_{i}=\left( \hat{x}_{i, 1}, \hat{x}_{i, 2}, \ldots , \hat{x}_{i, d}\right)\), as shown in Eq. 13.

$$\begin{aligned} \widehat{X}_{i}=F_{2}\left( W^{\prime } Y_{i}+b^{\prime }\right) , i \in [1, k] \end{aligned}$$
(13)

where \(F_{2}\) is the decoder function, \(W^{\prime }\) and \(b^{\prime }\) denote the decoder weight and bias respectively.
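The encoding and decoding passes of Eqs. 12 and 13 can be illustrated with an untrained network that has the 122, 32, 10, 32, 122 layer sizes given above. This is only a sketch of the data flow: the weights are random, normalisation and dropout are omitted, and the choice of tanh for the hidden layers and sigmoid for the output layer is an assumption, because the text does not name the activation functions.

```python
import math
import random


def dense(x, W, b, act):
    """Apply one fully connected layer: act(W·x + b)."""
    return [act(sum(w * v for w, v in zip(row, x)) + bias) for row, bias in zip(W, b)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
sizes = [122, 32, 10, 32, 122]  # input, three hidden layers, output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def encode(x):
    """Compress a 122-dimensional sample to the 10-dimensional code Y_i."""
    for W, b in layers[:2]:
        x = dense(x, W, b, math.tanh)
    return x


def decode(y):
    """Map the code back to a 122-dimensional reconstruction."""
    W, b = layers[2]
    y = dense(y, W, b, math.tanh)
    W, b = layers[3]
    return dense(y, W, b, sigmoid)  # sigmoid keeps outputs in (0, 1)


x = [rng.random() for _ in range(122)]  # one normalised toy sample
y = encode(x)
x_hat = decode(y)
print(len(y), len(x_hat))  # 10 122
```

The reconstruction here is meaningless because nothing has been trained; the point is only that the code has 10 dimensions and the output has the same dimensionality as the input.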

In the encoding and decoding operations, the dissimilarity between the input data and the reconstructed data is reduced by continuously optimising the parameters of the neural network \(\theta =\left( W, W^{\prime }, b, b^{\prime }\right)\). In the NADLA model, mean absolute error (MAE) is used to calculate the degree of dissimilarity between the data samples \(X_{i}\) and the reconstructed samples \(\widehat{X}_{i}\). The calculation process of MAE is shown in Eq. 14.

$$\begin{aligned} {{\hbox{Loss}}}\left( X_{i}, \widehat{X}_{i}\right) =\frac{1}{d} \sum _{j=1}^{d}\left| x_{i, j}-\hat{x}_{i, j}\right| \end{aligned}$$
(14)

The reconstruction error \(\left\{ l_{1}, l_{2}, \ldots , l_{k}\right\}\) of each sample can be obtained by Eq. 14. We choose the maximum value of the reconstruction error in the training set samples as the threshold q to determine whether the data samples are anomalous, as shown in Eq. 15.

$$\begin{aligned} q=\max \left\{ l_{1}, l_{2}, \ldots , l_{k}\right\} , \quad X_{1}, \ldots , X_{k} \in \hbox{AE train dataset} \end{aligned}$$
(15)

For the samples judged normal by the LSTM, anomalies are identified by comparing their reconstruction error \(l_{i}\) with the threshold q, as shown in Eqs. 16 and 17.

$$\begin{aligned}&X_{i} \in \{\hbox{Normal}\}, l_{i} \le q \end{aligned}$$
(16)
$$\begin{aligned}&X_{i} \in \{\hbox{Attack}\}, \quad l_{i}>q \end{aligned}$$
(17)
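Equations 14 to 17 can be traced with a small worked example. The sketch below assumes that reconstructions are already available from a trained autoencoder; the function names and the toy vectors are hypothetical, and the values are chosen so that the arithmetic is exact.

```python
def mae(x, x_hat):
    """Mean absolute error between a sample and its reconstruction (Eq. 14)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def fit_threshold(train_samples, train_recons):
    """Threshold q: the largest reconstruction error on the training set (Eq. 15)."""
    return max(mae(x, r) for x, r in zip(train_samples, train_recons))


def classify(sample, recon, q):
    """Label a sample by comparing its error with q (Eqs. 16 and 17)."""
    return "Normal" if mae(sample, recon) <= q else "Attack"


# Toy training data (normal samples only) with made-up reconstructions.
train_samples = [[0.0, 1.0], [1.0, 1.0]]
train_recons = [[0.25, 0.75], [1.0, 0.75]]
q = fit_threshold(train_samples, train_recons)
print(q)  # 0.25

print(classify([1.0, 0.0], [1.0, 0.25], q))  # Normal (error 0.125)
print(classify([0.0, 0.0], [0.5, 0.5], q))   # Attack (error 0.5)
```

Taking the maximum training error as q means that, by construction, every training sample falls on the normal side of the threshold.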

4 Experimental methods

In this section, we first introduce the data set used in this experiment, and then introduce the preprocessing method designed in this paper and the evaluation index of the experiment. The environment for our experiments is shown in Table 1.

Table 1 Environment of the experiment

4.1 Dataset

The dataset for the experiments in this paper is the NSL-KDD dataset, which is a benchmark dataset in the field of network intrusion detection that addresses some of the problems inherent in the KDD99 dataset [45]. Although the NSL-KDD dataset still has some problems [54] and may not reflect the current network environment well, it is still a valid benchmark dataset.

The dataset has two files, \(\hbox{KDD}_{{\rm Train}+}\) and \(\hbox{KDD}_{{\rm Test}+}\). These two files are the complete training set and test set of NSL-KDD, which contain a variety of attack types. Since the data is already divided into a training set and a test set, we directly use the samples in \(\hbox{KDD}_{{\rm Train}+}\) for training and use \(\hbox{KDD}_{{\rm Test}+}\) to evaluate the NADLA model. The multiple category labels are merged into two categories, namely normal samples and abnormal samples. Each sample in this dataset contains 41 features, including 38 numerical features and 3 string features. The distribution of the dataset after relabelling is shown in Table 2.

Table 2 Data distribution of NSL-KDD data

4.2 Pre-processing

Before model training, the NSL-KDD dataset must be pre-processed. The pre-processing procedures include one-hot encoding and normalisation. In addition, for the autoencoder model, label filtering and outlier removal are performed. To improve the efficiency of model training, we convert non-numerical features into numerical features using one-hot encoding, which converts a non-numerical feature into n features, with n representing the number of values taken by that feature. The three string features in the NSL-KDD sample set therefore become 84 features after one-hot encoding, of which 3 come from protocol_type, 70 from service and 11 from flag. Besides these three non-numerical features, there are 38 numerical features in the NSL-KDD dataset, so after one-hot encoding there are 84 + 38 = 122 features in total. Since each feature has a different range of values, we use min-max normalisation to eliminate the effect of different scales for different features and thus reduce the execution time of model training, as shown in Eqs. 18 and 19. This method maps each feature into a new interval.

$$\begin{aligned}&X_{\rm std}=\frac{X-X_{\min }}{X_{\max }-X_{\min }} \end{aligned}$$
(18)
$$\begin{aligned}&X_{\rm scaled}=X_{\rm std} *(\max -\min )+\min \end{aligned}$$
(19)

where min and max are 0 and 1, respectively, so each feature is mapped to a value in [0, 1].
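The two pre-processing steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `one_hot` and `min_max_scale` are hypothetical, and mapping a constant feature to the lower bound is an assumption made here to avoid division by zero.

```python
def one_hot(value, categories):
    """Encode one categorical value as a 0/1 vector over `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_scale(column, lo=0.0, hi=1.0):
    """Scale one feature column into [lo, hi] following Eqs. 18 and 19."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:
        # Assumption: a constant feature is mapped to the lower bound.
        return [lo for _ in column]
    return [(v - c_min) / (c_max - c_min) * (hi - lo) + lo for v in column]


# protocol_type takes three values in NSL-KDD, so it becomes three features.
protocols = ["tcp", "udp", "icmp"]
print(one_hot("udp", protocols))         # [0.0, 1.0, 0.0]
print(min_max_scale([0.0, 5.0, 10.0]))   # [0.0, 0.5, 1.0]

# Feature count after encoding: 3 + 70 + 11 one-hot columns plus 38 numerical ones.
n_features = 3 + 70 + 11 + 38
print(n_features)  # 122
```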

For the LSTM model, only the above two operations are needed to complete the pre-processing, but for the autoencoder (AE), labels must be filtered and outliers removed before the min-max normalisation. Label filtering retains only the samples labelled “normal”, because the autoencoder is trained only on normal samples so that it produces a smaller reconstruction error for “normal” samples and a larger reconstruction error for “abnormal” samples, which allows attacks to be identified. By analysing the features in the NSL-KDD dataset, we assume that the features in the dataset are independent of each other and obey a normal distribution. Moreover, normal network behaviour, because it is specified at design time, will in most cases not deviate too much from the value of the reference metric, so we use the 3-sigma rule to make the determination [56]. The 3-sigma rule is also known as the 68–95–99.7 rule. It states that 68% of instances lie within one standard deviation of the mean, 95% lie within two standard deviations, and 99.7% lie within three standard deviations [55, 57, 58].

In this paper, we specify that if a feature of a sample takes a value outside the 3-sigma range (99.7%), it is an outlier and that sample is removed [53]. Algorithm 1 describes the outlier removal process. Since outlier removal is applied to the “normal” samples in the KDDTrain+ dataset, the number of training samples is reduced from 67,343 to 41,761. The remaining data is used to train the autoencoder.
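The 3-sigma filtering rule described above can be sketched as follows. This is not a transcription of Algorithm 1: the function name `remove_outliers_3sigma` is hypothetical, and computing the mean and the population standard deviation per feature over all “normal” training samples is an assumption about how the statistics are obtained.

```python
import statistics


def remove_outliers_3sigma(samples):
    """Keep only samples whose every feature lies within mean ± 3·std.

    Assumption: mean and population standard deviation are computed
    per feature over the whole input set.
    """
    columns = list(zip(*samples))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        s for s in samples
        if all(abs(v - m) <= 3 * sd for v, m, sd in zip(s, means, stds))
    ]


# Toy data: 20 ordinary samples plus one sample with an extreme first feature.
toy = [[1.0, float(i % 2)] for i in range(20)] + [[100.0, 0.0]]
kept = remove_outliers_3sigma(toy)
print(len(toy), len(kept))  # 21 20
```

A sample is dropped as soon as any single feature falls outside the range, which is consistent with the large reduction in training samples reported above.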

figure a

4.3 Evaluation metrics

To verify the performance of the model proposed in this paper, classification accuracy, precision, recall, F1-score, and other indicators are used. The attack sample is regarded as class 0 (the positive class), and the normal sample is regarded as class 1. The confusion matrix is shown in Table 3. True positive (TP) denotes an attack sample correctly marked as an attack (class 0). True negative (TN) denotes a normal sample correctly marked as normal (class 1). False positive (FP) denotes a class 1 sample incorrectly marked as class 0, while false negative (FN) denotes a class 0 sample incorrectly marked as class 1.

Table 3 Confusion matrix

Accuracy (Acc) measures the proportion of correct predictions and represents the number of correctly classified samples as a proportion of the total number of samples in a given dataset, as shown in Eq. 20.

$$\begin{aligned} \hbox{Accuracy}=\frac{\hbox{TP}+\hbox{TN}}{\hbox{TP}+\hbox{TN}+\hbox{FP}+\hbox{FN}} \end{aligned}$$
(20)

True positive rate (TPR), also known as recall or sensitivity, reflects how many abnormal samples are identified out of all abnormal samples and is calculated as shown in Eq. 21.

$$\begin{aligned} \hbox{TPR}/\hbox{Recall}=\frac{\hbox{TP}}{\hbox{TP}+\hbox{FN}} \end{aligned}$$
(21)

Precision indicates how many of the data marked as attack samples are true attack samples out of the total number of data marked as attack samples, as shown in Eq. 22.

$$\begin{aligned} \hbox{Precision} = \frac{\hbox{TP}}{\hbox{TP}+\hbox{FP}} \end{aligned}$$
(22)

F1-score is a measure of test accuracy, calculated as the harmonic mean of precision and recall, as shown in Eq. 23.

$$\begin{aligned} F1=\frac{2 \times \hbox{Precision} \times \hbox{Recall}}{\hbox{Precision} + \hbox{Recall}} \end{aligned}$$
(23)
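As a worked example of Eqs. 20 to 23, the following sketch computes the four metrics from confusion-matrix counts. The counts are made up for illustration and are not results from this paper; the function name `evaluate` is hypothetical.

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. 20
    recall = tp / (tp + fn)                             # Eq. 21
    precision = tp / (tp + fp)                          # Eq. 22
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 23
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Illustrative counts only: 100 attack samples and 100 normal samples.
scores = evaluate(tp=80, tn=90, fp=10, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is a harmonic mean, it equals 2TP / (2TP + FP + FN), which is 160 / 190 for these counts and lies between the precision and the recall.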

5 Experiment and results

In this section, we present the exact steps of the experimental execution and discuss the results.

5.1 The performance of NADLA

We first measured the performance of the model proposed in this paper (NADLA) using the method described above. To account for possible volatility and chance in the performance of the model, all experiments below are repeated ten times and the results averaged. Table 4 shows the performance of the NADLA model on the above-mentioned metrics. The model achieved an average accuracy of 92.79% and an average F1-score of 93.73%.

Table 4 Performance of NADLA Model

Because the NADLA model combines an LSTM sub-module and an autoencoder (AE) sub-module, we also evaluated each sub-module on its own. The results in Table 5 quantify the contribution of each module on every metric: taken individually, the AE and LSTM sub-modules each underperform the full NADLA model by more than 3% on every metric. We then examined how the two sub-modules complement each other by comparing the confusion matrices of NADLA and of the two sub-modules. The LSTM sub-module tends to misclassify normal samples as attack samples, whereas the AE sub-module tends to misclassify attack samples as normal samples because it cannot distinguish them from other samples. The confusion matrices are shown in Figs. 8 and 9.

Fig. 8
figure 8

Confusion matrix for each sub-model classification of the NADLA model

Fig. 9
figure 9

Confusion matrix for NADLA model classification

Table 5 Comparison of individual sub-models and overall model metrics

We further analysed how the model labelled individual samples and found that the LSTM and AE sub-modules were highly reliable when they made consistent judgements on attack samples. The NADLA model exploits this agreement, which significantly improves its overall performance compared with other semi-supervised machine learning methods, as shown in Table 6. NADLA shows a 10–20% improvement over traditional machine learning methods and a 2–15% improvement over similar deep learning models. It also outperforms the autoencoder model proposed by Wu et al. [55] in 2021 by 2%.

Table 6 Summary of the performance of each model on the NSL-KDD dataset

Because the LSTM takes a time series as input, n consecutive samples are first concatenated end to end into one LSTM sample, which is then fed to the LSTM model. The input and forget gates of the LSTM sub-model selectively extract information from the input samples and add it to the memory cell, so the length of the input sequence affects how the LSTM sub-model extracts information. To explore the effect of the number of original samples contained in an LSTM sample on the performance of the NADLA model, we conducted further experiments, the results of which are shown in Fig. 10. The experiments show no significant correlation between the number of original samples in an LSTM input sample and model performance, so we selected a time series length of 128 for the LSTM input samples.

Fig. 10
figure 10

Performance of the proposed NADLA model with different numbers of original samples per LSTM sample
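The grouping of original samples into LSTM input sequences can be sketched as follows. This is only an illustration: it assumes non-overlapping groups and drops any trailing samples that do not fill a complete sequence, neither of which is stated in the text, and the function name `make_sequences` and the toy records are hypothetical. The toy example uses a sequence length of 4 for readability, whereas the experiments settle on 128.

```python
def make_sequences(samples, seq_len):
    """Group consecutive feature vectors into non-overlapping sequences.

    Each returned sequence holds `seq_len` original samples in order.
    Trailing samples that cannot fill a whole sequence are dropped
    (an assumption of this sketch).
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    usable = len(samples) - len(samples) % seq_len
    return [samples[i:i + seq_len] for i in range(0, usable, seq_len)]


# Ten toy records with two features each, invented for illustration.
records = [[float(i), float(i) * 0.5] for i in range(10)]
sequences = make_sequences(records, seq_len=4)
```

With ten records and a sequence length of 4, two complete sequences are produced and the last two records are left out.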

5.2 The metrics results of NADLA

Because the LSTM sub-module expects a time sequence as input, we treat each data sample as one time point, so that multiple data samples form a time sequence that serves as the input of the LSTM sub-module. The input gate of the LSTM sub-module processes the time sequence to select the information to be memorized, and the forget gate discards part of the information in the sequence, so the length of the time sequence has some impact on the NADLA model. To quantify the relationship between the number of time points in the sequence and model performance, we selected seven powers of two as sequence lengths for analysis. The experimental results are shown in Fig. 11. The results show no clear correlation between sequence length and model performance, and a longer sequence is not necessarily better, so we set the time sequence length for the LSTM input to 128.

Fig. 11
figure 11

Performance of the proposed NADLA model under different LSTM structures

Additionally, we investigated the effect of the sub-module design on the performance of the NADLA model. The major factors in architectural design are the number of layers in the network and the number of neurons in each layer. A well-chosen architecture makes it easier to avoid both overfitting and underfitting. For the LSTM sub-module, which we built following Sunanda Gamage's paper [63], we examined how the number of neurons in each layer affects the model metrics; the experimental results are shown in Fig. 11. The performance of the LSTM sub-module degrades when more neurons are added, so we chose a structure in which the first and second LSTM cells use 16 and 8 hidden units, respectively. For the autoencoder structure comparison, we selected the structures most commonly used in comparable work. The outcomes are shown in Table 7. The autoencoder performs better on average with three hidden layers, and it performs best overall when the innermost layer reduces the dimensionality to 10, reaching an accuracy of 92.13%.

Table 7 Performance of the proposed model under different AE structures

Finally, we investigated the threshold on the reconstruction error used by the autoencoder sub-module. In the proposed model, training yields an autoencoder fitted to normal samples together with an anomaly score threshold. During testing, a sample whose reconstruction loss exceeds this threshold is judged to be an attack sample. The choice of the anomaly score threshold therefore affects the classification accuracy of the autoencoder. The threshold is derived from the reconstruction errors that the autoencoder produces for all normal samples used in training. To investigate its impact on the proposed model, we compared three ways of selecting it: the maximum, the minimum, and the average of the reconstruction errors of the normal training samples. The results obtained by repeating the experiment are shown in Table 8. The best classification result was obtained by using the maximum reconstruction error of the normal samples as the anomaly score threshold.

Table 8 Performance of the proposed model under different thresholds
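The three threshold choices compared in Table 8 can be sketched as below. This is a minimal illustration rather than the NADLA implementation: the use of mean squared error as the reconstruction loss is an assumption, and the names `reconstruction_error`, `pick_threshold`, and `is_attack`, as well as all error values, are hypothetical.

```python
from statistics import mean


def reconstruction_error(original, reconstructed):
    """Mean squared error between a sample and its reconstruction.

    MSE is an assumption of this sketch; the loss actually used may differ.
    """
    return mean((o - r) ** 2 for o, r in zip(original, reconstructed))


def pick_threshold(normal_errors, strategy="max"):
    """Derive the anomaly score threshold from normal-sample errors."""
    strategies = {"max": max, "min": min, "mean": mean}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    return strategies[strategy](normal_errors)


def is_attack(error, threshold):
    """A sample is judged an attack if its error exceeds the threshold."""
    return error > threshold


# Reconstruction errors of normal training samples (invented values).
normal_errors = [0.02, 0.05, 0.03, 0.10]
thresholds = {
    name: pick_threshold(normal_errors, name)
    for name in ("max", "min", "mean")
}

# Errors of two unseen samples (invented values).
test_errors = [0.04, 0.12]
flags = {
    name: [is_attack(e, t) for e in test_errors]
    for name, t in thresholds.items()
}
```

In this toy data the minimum threshold flags both test samples as attacks, while the maximum and mean thresholds flag only the second. Lower thresholds flag more samples overall, which is the trade-off that the comparison in Table 8 evaluates.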

6 Conclusion

This paper proposes a semi-supervised network anomaly detection model, NADLA, which combines a long short-term memory (LSTM) neural network and an autoencoder (AE). The LSTM sub-model uses its temporal feature learning capability for anomaly detection and classification. When the LSTM considers a sample to be normal, the sample is passed to the AE sub-model, which learns its high-level information and reconstructs it; the reconstruction error is then compared with a set threshold to obtain the final anomaly determination. Averaged over repeated runs on the NSL-KDD dataset, the NADLA model achieves an accuracy of 92.79% and an F1 score of 93.73%, which is better than other semi-supervised machine learning models. Because network attacks keep changing, a semi-supervised model still requires network traffic to be labelled, and labelling is itself complicated and time-consuming. Future research will therefore focus on further improving the accuracy of the model, refining the model structure, and exploring unsupervised learning methods for network intrusion detection.