1 Introduction

A network intrusion detection system (NIDS) monitors network activities and classifies them as “benign” or “malicious” [1, 2]. Recently, machine learning has been applied to enhance the capabilities of anomaly-detecting NIDS [3]. A problem that hinders widespread adoption is the number of labeled training samples required to achieve practical performance. The more training data used, the better the performance. However, labeling data samples is a costly task that requires a human operator to examine each data item, classify it, and label it. Moreover, the trends in network traffic that are subject to NIDS auditing change daily, and new attacks continue to emerge. Hence, labeling work must be performed constantly, which imposes a serious operational burden.

To address this issue, we propose a semi-supervised machine-learning-based NIDS that reduces the required number of labeled training samples. We use a labeled set small enough that it would ordinarily yield poor performance under supervised learning. To compensate, our semi-supervised learning method exploits unlabeled training samples, which do not require costly human labor. We use an adversarial auto-encoder (AAE) [5] to realize semi-supervised learning in this fashion; the AAE combines an auto-encoder (AE) with a generative adversarial network (GAN) [4], and these two components are the key building blocks of our method. The AE reduces the dimensionality of the input data by extracting and maintaining important features as a latent-variable vector, whereas the GAN employs a generator and a discriminator so that the latent-variable vector of the AE follows an arbitrary distribution for regularization. In our proposed method, we divide the latent-variable vector into two subset vectors: one for classification and the other for traffic-feature representation. Using only unlabeled data samples, the AE is trained to extract the two latent-variable vectors, and the GAN is trained to make them follow categorical and Gaussian distributions, respectively. Then, using labeled data samples, the AE is trained to minimize the cross-entropy error. Finally, the latent-variable vector for classification is used to classify the input data as normal or attack.

In our earlier work [6], we reported preliminary evaluation results. In this study, we investigate the performance of the proposed method through a series of detailed experiments to answer the following questions:

  • How many unlabeled training samples are required to obtain practical performance?

  • Does the selection of labeled training samples affect performance?

  • How many dimensions does the latent-variable vector require to extract the traffic features needed to build the NIDS?

We evaluate the proposed method through a series of experiments and confirm that the proposed AAE-based NIDS achieves performance comparable to that of a multi-layer perceptron (MLP)-based NIDS with only 0.1% of the labeled training samples. We confirm that the accuracy does not diminish when the number of labeled data samples is small. Moreover, the selection of data samples for annotation does not affect the performance of our proposed AAE-based NIDS. We also confirm that the best performance as measured by recall and F1 score occurs when the dimensionality of the AE's latent-variable vector is 10, which suggests that this structure separates attack and normal communications effectively. Hence, we show that the proposed AAE-based NIDS effectively uses a small number of labeled training samples to reduce the need for costly human labor while improving performance with the support of unlabeled data. This study demonstrates promising results that will be useful to industry and academia.

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 discusses the dataset used for training the machine-learning-based NIDS. Section 4 presents the proposed method. Section 5 evaluates the proposed method via experiments. Section 6 concludes the paper.

2 Related Work

Various machine-learning algorithms have been applied to anomaly-type NIDS [3, 7, 8]. In the early stages of research on the application of machine learning to NIDS, approximately 20 years ago, attention was focused on shallow learning methods [9] such as K-nearest-neighbor, naive Bayes, the C4.5 decision-tree algorithm [10], principal component analysis [11], and support vector machines (SVMs) [12, 13]. Feature engineering is important in shallow machine learning, and methods have been proposed to extract the 10 top-ranked features from the Network Security Laboratory Knowledge Discovery in Databases (NSL-KDD) dataset by making use of information gain [14, 15]. Multivariate correlation analyses supported by a dissimilarity measure have also been conducted to improve accuracy [16].

It is generally believed that there is no single model that will solve different problems simultaneously. Indeed, even if multiple models were highly effective for a given problem, finding the best model for different data distributions or statistical mixtures would be difficult. Ensemble learning is the practice of combining multiple models to improve predictive performance. Several studies have been conducted on the application of ensemble learning to NIDS [17,18,19,20,21].

Recent advances in deep learning have stimulated research on how to realize NIDS without having to conduct feature engineering. Spectral clustering and a deep neural network have been applied to a NIDS for sensor networks [22]. Recurrent neural network (RNN)-based NIDSes [23,24,25], including a long short-term memory (LSTM)-based one [26], have been proposed. A convolutional neural network (CNN) was applied [27], and a CNN and LSTM were combined for feature extraction to capture spatio-temporal features [28].

A GAN was used to extract statistical features [29], and the attention mechanism was used to perform feature learning by highlighting the key input of sequential network-flow data composed of packet vectors in a bidirectional LSTM model [30].

Auto-encoders (AEs) are considered suitable for anomaly-type NIDS because they can determine whether a deviation from a previously learned normal state is detected based on whether the reconstruction error is within a threshold value [31].

A robust AE was proposed [32] to address a limitation of denoising AEs [33, 34], which require noise-free training data. A maximum-correntropy AE was proposed [35] to provide representations that are robust to outliers and noise. AE reconstruction errors were used to augment the training of a classifier [15]. A stacked AE was proposed to extract features from n-gram log data of the hypertext transfer protocol [36].

Several authors proposed NIDSes comprising AEs and/or deep belief networks to extract feature representations, followed by classifiers. NIDSes have been proposed based on an asymmetric AE [37], a stacked AE [38], and a sparse AE [39,40,41]. A stochastic denoising AE [42] was applied to different NIDSes [43,44,45], and a stacked contractive AE [46] was applied to yet another NIDS [47]. A deep belief network was also employed [48], using a second-stage classifier based on various shallow-learning algorithms (e.g., random forests [37], SVMs [39, 47, 48], and soft-max classifiers [38, 40, 41, 44]). Self-taught learning [49]-based NIDSes were proposed [39,40,41, 50].

Although there have been several studies in which machine learning algorithms were applied, few have investigated the problems caused by having only a small number of labeled training samples.

Overfitting occurs when a model is trained on a small sample. One-shot and few-shot learning, inspired by the human ability to learn from a few examples, have been shown to train models with a small number of data instances while achieving high performance in image recognition [51, 52]. Recently, one-shot and few-shot learning have been applied to network security in the area of malware detection, where malware is converted to image data [53,54,55].

More recently, one-shot learning with Siamese networks was applied to a NIDS [56] to learn pair similarities rather than features unique to each class. Open questions remain, however; for example, careful consideration must be given to ensuring the same number of training pairs for all class combinations when creating the training set.

Semi-supervised learning is another approach to solving the problem of overfitting due to a small quantity of labeled training data.

Semi-supervised learning models based on the variational auto-encoder (VAE) have been proposed [57]. In our previous work [6], we proposed an adversarial auto-encoder (AAE)-based NIDS for semi-supervised learning and evaluated it preliminarily. Whereas the VAE assumes that the underlying distribution of the latent variable is Gaussian, our AAE-based method is more general in that any distribution can be imposed on the latent variable.

A hybrid method using an LSTM auto-encoder and a one-class SVM that uses only normal class examples in the training dataset has also been proposed [58].

Stacked sparse autoencoder (SSAE)-based semi-supervised deep learning models [59] and their extension to federated learning (FL) [60], which combine unsupervised feature extraction with supervised classification algorithms to make use of information from both unlabeled and labeled data, have been proposed in recent years.

These approaches are in the early stages of research. Even if a model is very effective for a given problem, finding the best model for different data distributions and statistical mixtures is very difficult. To find the appropriate approach for a given problem, the characteristics of these approaches must be investigated in detail. In this paper, we present the performance of the method proposed in our earlier work [6] through a series of detailed experiments.

3 Training Dataset

3.1 Limited Quantity of Labeled Data

A typical machine-learning workflow comprises a pre-training phase in which clustering is performed using unsupervised learning, a data-annotation phase in which human operators manually examine data samples and attach labels, and a supervised-learning phase in which the classifier is trained using the set of labeled data. For the classification task, existing supervised-learning algorithms require a high-quality dataset containing a sufficient quantity of human-annotated data samples for training. Because annotation is an extremely costly and time-consuming task, a new method is needed to enable more efficient classifier training. Furthermore, obtaining anomaly data samples is difficult because the trends in network traffic that are subject to NIDS oversight change daily, and new attacks continue to be generated. This quickly leads to unbalanced datasets that must be updated continuously.

3.2 The NSL-KDD Dataset

To investigate the effect of the number of labeled data samples, we use the NSL-KDD dataset [9] as a benchmark. This dataset has been used extensively to evaluate machine-learning-based NIDS methods. NSL-KDD is an enhanced version of the KDD CUP 99 dataset [61, 62], which was used in the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99). The competition task was to build a predictive network intrusion detector capable of distinguishing between intrusions and normal connections. The dataset contains a standard set of data to be audited, including a wide variety of intrusions simulated in a military network environment. A major criticism of the KDD CUP 99 dataset [61] pertains to its large degree of redundancy. Therefore, the authors of the NSL-KDD paper [9] removed duplicates and created more sophisticated subsets. The dataset is divided into KDDTrain+ (125,973 data records) for training and KDDTest+ (22,544 data records) for testing.

This dataset consists of records of traffic sent and received between source and destination internet protocol (IP) addresses. It was created from transmission control protocol (TCP) dump data: seven weeks of network traffic were processed into approximately five million connection records, and two weeks of testing data yielded approximately two million connection records. Each traffic sample has 41 features categorized into three types: basic, content-based, and traffic-based. Among these, three are categorical: the protocol type, which has three possible values (tcp, udp, and icmp); the flag, which has 11 possible values (SF, S1, REJ, etc.); and the service, which has 70 possible values (http, telnet, ftp, etc.). Instead of encoding each categorical feature as a scalar value, we adopt a one-hot vector representation, resulting in 122 features.
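As an illustration, the following is a minimal sketch of this one-hot encoding step; the file name and column positions are assumptions based on the standard NSL-KDD layout (no header row; protocol type, service, and flag in the second through fourth columns), not part of the original experiments.

```python
# Sketch of the one-hot encoding step; the file name and column
# positions are assumptions based on the standard NSL-KDD layout.
import pandas as pd

df = pd.read_csv("KDDTrain+.txt", header=None)  # hypothetical path
features = df.iloc[:, :41]                      # drop the label columns
categorical = [1, 2, 3]  # protocol_type (3), service (70), flag (11)
encoded = pd.get_dummies(features, columns=categorical)
# 38 numeric features + (3 + 70 + 11) one-hot columns = 122 inputs
print(encoded.shape[1])  # expected: 122
```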

All data are labeled as “normal” or “anomaly”. Attacks in the dataset are classified into four categories according to their characteristics: denial-of-service (DoS), remote-to-local (R2L) (i.e., unauthorized access from a remote machine), user-to-root (U2R) (i.e., unauthorized access to local superuser (root) privileges), and probes (e.g., surveillance and port scanning).

The details of each category are described in Table 1. The KDDTrain+ dataset contains 22 attack types, and the KDDTest+ dataset contains 38. Some attack types (boldface) in the testing dataset do not appear in the training dataset: KDDTest+ contains 17 attack types that are not in KDDTrain+, and KDDTrain+ contains two attack types that are not in KDDTest+. This renders the detection scenario more realistic. Therefore, the KDDTest+ dataset is a reliable indicator of performance against previously unseen attacks, such as zero-day attacks and variants of existing attack types.

Table 1 Attack types in the NSL-KDD dataset [9]. Attack types written in bold in the testing dataset do not appear in the training dataset.

4 Semi-supervised Learning

4.1 AAE

Supervised learning requires a large number of data instances to achieve practical performance. The more training data used, the better the classifier performance. Unfortunately, obtaining a large quantity of training data is costly. Unlike supervised models, semi-supervised learning requires just a small set of labeled data for training.

Therefore, to improve the performance of NIDS, we propose to use semi-supervised learning to take advantage of unlabeled training data and reduce the need for human intervention.

We apply AAE to implement the semi-supervised learning algorithm. Figure 1 presents the architecture of the AAE, which comprises an AE and a GAN as its key building blocks. The AE reduces the dimensionality of input data by extracting and maintaining important features as a latent variable vector, z, whereas the GAN employs the generator and discriminator so that z follows an arbitrary distribution for regularization. Hence, an AAE can be viewed as an AE that forces hidden variables to follow any desired distribution.

Fig. 1 AAE architecture

In an AAE, the latent variable vector of the AE is regularized by the discriminator of the GAN in order to match an arbitrary prior, p(z), to the aggregated posterior, q(z), of the latent variable vector, z. Let x be the input and z be the latent variable vector of the AE. Let \(q(z\mid x)\) be an encoding distribution and \(p(x\mid z)\) be a decoding distribution. Further, let \(p_d(x)\) be the data distribution and p(z) be the prior distribution we want to force the latent variable vector to follow. The encoding function of the AE, \(q(z\mid x)\), defines a posterior distribution of q(z) on the latent variable vector of the AE as follows:

$$\begin{aligned} q(z) = \int _x q(z\mid x)\, p_d(x)\, dx \end{aligned}$$
(1)

The adversarial network and the AE are trained jointly using stochastic gradient descent (SGD). As learning progresses, the AAE matches the aggregated posterior, q(z), to the arbitrary prior, p(z), so that z comes to follow the prior distribution. Thus, the encoder of the AE acts as the GAN generator, and z is regarded as the generated data. The discriminator distinguishes the z generated by the encoder, which follows q(z), from samples drawn from the prior, p(z). It outputs the probability that its input comes from the arbitrary prior, p(z).
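To make the adversarial regularization concrete, the following is a minimal PyTorch sketch of one regularization step for a single latent vector; the module names, the optimizer handling, and the `sample_prior` helper are illustrative assumptions rather than the exact implementation.

```python
# Hedged sketch: one adversarial update. `encode` maps a batch x to a
# latent vector z ~ q(z); `disc` outputs the probability that its input
# was drawn from the prior p(z); `sample_prior(n)` draws n prior samples.
import torch
import torch.nn.functional as F

def regularization_step(encode, disc, opt_d, opt_g, x, sample_prior):
    # Train the discriminator: prior samples are "real", encoder outputs "fake".
    z_fake = encode(x).detach()
    z_real = sample_prior(x.size(0))
    ones = torch.ones(x.size(0), 1)
    zeros = torch.zeros(x.size(0), 1)
    d_loss = (F.binary_cross_entropy(disc(z_real), ones)
              + F.binary_cross_entropy(disc(z_fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the encoder (generator) to make q(z) indistinguishable from p(z).
    g_loss = F.binary_cross_entropy(disc(encode(x)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```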

4.2 Proposed Method

4.2.1 Architecture

Figure 2 shows the architecture of the proposed method. We assume that complex latent expressions underlie normal and anomaly communications. The AE extracts two latent-variable vectors, \(z_1\) and \(z_2\), to hold the features of the input data. The encoder, \(q(z_1,z_2\mid x)\), generates \(z_1\) and \(z_2\), where \(z_1\) holds the features representing the class information (i.e., “normal” or “attack”) and \(z_2\) holds the other features. The \(z_1\), which corresponds to the categorical distribution, is designed to record the label associated with the input data. We impose a Gaussian distribution on \(z_2\) in order to preserve detailed features other than class information; this \(z_2\) is designed to separate clusters.

The proposed AAE employs two pairs of generative models and discriminators to regularize the latent variables \(z_1\) and \(z_2\): one follows a categorical distribution and the other a Gaussian distribution. A categorical distribution, \(cat(z_1)\), is used to force \(z_1\) to represent only the class information, whereas a Gaussian distribution, \(N(z_2\mid 0,I)\), is used to force \(z_2\) to represent the other information. The categorical distribution takes as many one-hot values as there are classes; in our method there are two classes, “normal” and “attack”. The latent variable \(z_1\) is driven by the discriminator to hold a one-hot vector. As such, classification can be performed by referring to the value of \(z_1\) estimated by the encoder.
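A minimal sketch of such a two-headed encoder is shown below, assuming 122 inputs and a 50-dimensional \(z_2\); the layer sizes follow Sect. 5.1, but the exact layer ordering and activations are assumptions of this illustration.

```python
# Sketch of an encoder with two latent heads: z1 is a softmax over the
# two classes ("normal"/"attack"), z2 is regularized toward N(0, I).
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim=122, z2_dim=50):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
        )
        self.head_z1 = nn.Sequential(nn.Linear(1000, 2), nn.Softmax(dim=1))
        self.head_z2 = nn.Linear(1000, z2_dim)

    def forward(self, x):
        h = self.body(x)
        return self.head_z1(h), self.head_z2(h)  # (z1, z2)
```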

4.2.2 Workflow of the Proposed Method

We train the AAE using unlabeled data. The latent variable \(z_1\), corresponding to the categorical distribution, is designed to record the label associated with the input data, and the latent variable \(z_2\), corresponding to the Gaussian distribution, is designed to separate clusters. It is therefore assumed that input data are generated by a latent class variable \(z_1\) drawn from a categorical distribution and a latent variable \(z_2\) drawn from a Gaussian distribution. When labeled data are available, we train the AAE using the label instead of the categorical generative model. Once the AAE is trained, it is used to classify new incoming data: the latent variable \(z_1\) in the middle hidden layer indicates the inferred class of the input data. As such, at detection time, classification is performed by the latent variable \(z_1\), which represents the class information, indicating whether the input data are normal or anomalous.

Fig. 2 Architecture of an AAE using semi-supervised learning

The middle of Fig. 2 shows the neural network of the AE, which reduces the dimensionality of the input data (Input(X)) and generates the latent-variable vectors \((z_1,z_2)\). The top of Fig. 2 shows the neural network of the discriminator that imposes a categorical distribution on the latent class-variable vector \(z_1\) as a prior distribution; the distribution of \(z_1\) is thereby guaranteed to match the categorical distribution, so only the class information required to detect an attack is extracted into \(z_1\). The bottom of Fig. 2 shows the neural network of the discriminator that imposes a Gaussian distribution on the latent variable \(z_2\) as a prior distribution; the distribution of \(z_2\) is thereby guaranteed to match the Gaussian distribution. The semi-supervised AAE is trained with SGD in three phases as follows:

  1. Reconstruction phase: the encoder and decoder are updated to minimize the reconstruction error between the input and output data. Only unlabeled data are used in this phase; the AE generates the latent-variable vectors \(z_1\) and \(z_2\) from the unlabeled data.

  2. Regularization phase: each discriminator is trained to distinguish the latent-variable vector \(z_1\) from samples of the categorical distribution, and the latent-variable vector \(z_2\) from samples of the Gaussian distribution. The encoder is then updated based on the discriminators' outputs. This training is based on Eq. (1).

  3. Semi-supervised classification phase: the AE is updated to minimize the cross-entropy error on labeled data. In this phase, we conduct semi-supervised learning using the labeled data.

After the AAE is trained, the latent variable \(z_1\) is used to classify the input data as normal or attack.
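The following is a minimal sketch of the three phases for one mini-batch, reusing the `Encoder` and `regularization_step` sketches above; the decoder `dec`, the discriminators `d_cat` and `d_gauss`, the optimizer dictionary, and the mean-squared reconstruction loss are all illustrative assumptions.

```python
# Hedged sketch of one semi-supervised AAE training step.
import torch
import torch.nn.functional as F

def train_step(x_u, labeled, enc, dec, d_cat, d_gauss, opts, z2_dim=50):
    # Phase 1 -- reconstruction (unlabeled data only).
    z1, z2 = enc(x_u)
    recon_loss = F.mse_loss(dec(torch.cat([z1, z2], dim=1)), x_u)
    opts["ae"].zero_grad(); recon_loss.backward(); opts["ae"].step()

    # Phase 2 -- regularization: push z1 toward Cat(2) and z2 toward N(0, I).
    regularization_step(lambda v: enc(v)[0], d_cat, opts["d_cat"], opts["enc"],
                        x_u, lambda n: torch.eye(2)[torch.randint(0, 2, (n,))])
    regularization_step(lambda v: enc(v)[1], d_gauss, opts["d_gauss"], opts["enc"],
                        x_u, lambda n: torch.randn(n, z2_dim))

    # Phase 3 -- semi-supervised classification (labeled data only).
    if labeled is not None:
        x_l, y_l = labeled
        probs, _ = enc(x_l)            # the z1 head gives class probabilities
        cls_loss = F.nll_loss(torch.log(probs + 1e-8), y_l)
        opts["cls"].zero_grad(); cls_loss.backward(); opts["cls"].step()
```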

5 Evaluation

5.1 Neural Network Parameters

We implemented the proposed AAE method using PyTorch [63]. To compare the proposed method with an existing one, we also implemented an MLP-based deep neural network (DNN). We use the Adam optimizer [64] for both models and add dropout and batch normalization to prevent overfitting during the training phase. The architectures of the MLP model and our AAE model are shown in Fig. 3. In this example, the AAE encoder has 122 inputs that are compressed into the important features, yielding 52 (= 2 + 50) outputs (two for \(z_1\) and 50 for \(z_2\)). The decoder receives 52 inputs from the hidden middle layer that holds the latent-variable vectors (\(z_1\) and \(z_2\)) and yields 122 outputs. Both the encoder and the decoder have a middle fully-connected layer of size \(1000\times 1000\), with dropout and batch normalization between layers. The discriminator for the categorical distribution (\(z_1\)) receives two inputs, and the discriminator for the Gaussian distribution (\(z_2\)) receives 50 inputs; each yields one output (“Fake” or “Real”). Both discriminators have a middle fully-connected layer of size \(1000\times 1000\) with batch normalization between layers.
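For concreteness, a discriminator matching the sizes described above can be sketched as follows; the exact layer ordering and activation functions are assumptions of this illustration.

```python
# Hedged sketch of the two discriminators (2 inputs for z1, 50 for z2),
# each with 1000-unit middle layers and batch normalization.
import torch.nn as nn

def make_discriminator(z_dim):
    return nn.Sequential(
        nn.Linear(z_dim, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
        nn.Linear(1000, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
        nn.Linear(1000, 1), nn.Sigmoid(),  # P(input is a prior sample)
    )

d_cat, d_gauss = make_discriminator(2), make_discriminator(50)
```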

Fig. 3 Architecture of the AAE and DNN

5.2 Dataset

To investigate the impact of a small number of labeled data samples, we created 10 training datasets by randomly selecting data samples from KDDTrain+ (125,973 data items) as labeled data. \(p_{sample}\) denotes the fraction of labeled samples randomly selected from the total of KDDTrain+. We used KDDTest+ (22,544 data items) for testing.
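A minimal sketch of drawing one such labeled subset is shown below; `features` and `labels` stand for the preprocessed KDDTrain+ arrays and are assumptions of this illustration.

```python
# Hedged sketch: draw one of the 10 random labeled subsets.
import numpy as np

p_sample = 0.001                       # fraction of KDDTrain+ to label
rng = np.random.default_rng(seed=0)    # a different seed per dataset
idx = rng.permutation(len(features))   # len(features) = 125,973
n_labeled = int(p_sample * len(features))
labeled_idx, unlabeled_idx = idx[:n_labeled], idx[n_labeled:]
```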

When \(p_{sample}\) is small, the way the selected data samples are distributed strongly affects the performance of the machine-learning-based NIDS. If we were to use supervised learning instead of our proposed semi-supervised learning, we would need to select the labeled data carefully: if by chance we select “good” or “representative” data samples for training, supervised learning may succeed in constructing a valid model; conversely, if “bad” data samples are selected, a valid model cannot be obtained. Therefore, under the constraint that only a small number of data samples can be labeled, supervised learning requires careful selection of the samples to be labeled. In contrast, our proposed semi-supervised learning is expected to require less careful selection of the samples to label.

5.3 Effectiveness of the Proposed Method

To reduce the required number of labeled data samples in the training dataset, we propose a semi-supervised machine-learning-based NIDS that applies an AAE technique. The first question is “How many labeled data samples are required in the training dataset?” The AAE-based NIDS uses the structure of the unlabeled data-sample distribution to improve the accuracy of the boundaries between adjacent classes calculated from the labeled samples. We examine how the AAE-based NIDS improves its performance by taking advantage of the unlabeled samples, avoiding the need for costly human labor.

Figure 4 shows the performance of the AAE-based NIDS, including accuracy, precision, recall, and F1 score, when using 0.1% of the training data of the NSL-KDD dataset as labeled (i.e., \(p_{sample}\) = 0.001). The horizontal axis represents the number of unlabeled training samples, and the performance of the MLP is shown for comparison (the four rightmost bars). Among the performance measures, the false-negative and false-positive ratios are important because the risk associated with false detections is very high for NIDS [65]. Focusing on the recall in Fig. 4, we observe that the AAE-based NIDS achieves higher recall than the MLP-based NIDS. We confirm that the proposed AAE-based NIDS achieves performance comparable to that of the MLP-based NIDS with only 0.1% of labeled data samples in the training dataset.
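For reference, the reported metrics can be computed from the test predictions as sketched below, treating the “attack” class as positive so that recall directly reflects the false-negative ratio; the use of scikit-learn here is an illustration, not a statement about the original implementation.

```python
# Hedged sketch of the evaluation metrics.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def report(y_true, y_pred):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "recall":    recall_score(y_true, y_pred),     # TP / (TP + FN)
        "f1":        f1_score(y_true, y_pred),
    }
```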

The next question is “How many unlabeled data samples are required in the training dataset?”. Figure 4 shows that adding unlabeled data samples causes the performance of the AAE-based NIDS to improve. Thus, we confirm that the AAE-based NIDS takes advantage of the structure of the unlabeled data sample distribution to improve the accuracy of the boundaries between adjacent classes calculated from the labeled data samples. We note that the addition of a small number of unlabeled data samples effectively improves performance.

Fig. 4 Performance of the AAE-based NIDS (accuracy, precision, recall, and F1 score) as a function of the number of unlabeled training samples (\(p_{sample}\)=0.001)

We investigated the effect of \(p_{sample}\) on the AAE-based NIDS to determine the number of labeled data samples required to improve performance. Figures 5 and 6 show the results with \(p_{sample}\) = 0.01 and 0.0001 (i.e., using 1% and 0.01% of the training data of the NSL-KDD dataset as labeled samples). The horizontal axis again represents the number of unlabeled data samples. As observed, the AAE-based NIDS improves its performance as the number of unlabeled data samples increases. By comparing Figs. 5 and 6, we observe that both the AAE- and MLP-based NIDS yield more variable results with wider confidence intervals with \(p_{sample}\) = 0.0001 than with \(p_{sample}\) = 0.01. We believe the reason is as follows. When \(p_{sample}\) = 0.0001, the number of labeled data samples is so small that the way the selected data samples are distributed strongly influences the performance of the machine-learning-based NIDS; the number of labeled data samples per class is only 12. If we coincidentally select and label representative training samples, we might succeed in building a good model, whereas if we fail to select representative samples, we might not. When \(p_{sample}\) = 0.01, by contrast, the number of labeled data samples per class is 1,259, which is sufficient to include representative samples and avoid such variance. Thus, both the AAE- and MLP-based NIDS yield more variable results with wider confidence intervals at \(p_{sample}\) = 0.0001 than at \(p_{sample}\) = 0.01.

Fig. 5 Performance of the AAE-based NIDS (accuracy, precision, recall, and F1 score) as a function of the number of unlabeled training samples (\(p_{sample}\)=0.01)

Fig. 6 Performance of the AAE-based NIDS (accuracy, precision, recall, and F1 score) as a function of the number of unlabeled training samples (\(p_{sample}\)=0.0001)

5.4 Dimensionality of the Latent Variable

We are interested in how the proposed AAE successfully extracts feature representations of normal and anomaly traffic. Hence, we evaluated the effect of the dimensionality of the latent variable. Figure 7 shows the performance of the proposed AAE-based NIDS for dimension sizes of \(z_2\) = 2, 10, and 50. As observed in Fig. 7, \(z_2\) = 10 achieves the best performance in terms of recall and F1 score.

Fig. 7 Performance of the AAE-based NIDS (accuracy, precision, recall, and F1 score) as a function of the dimensionality of the latent variable \(z_2\)

To investigate the mechanism behind the AAE's successful classification, we visualize how well the latent variable represents a distinctive set of features of the input traffic. To visualize the multidimensional latent variable, we employed t-distributed stochastic neighbor embedding (t-SNE) [66] to reduce the dimensionality of the data. t-SNE is well suited to embedding high-dimensional data in a low-dimensional space for visualization. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points. We used t-SNE with a perplexity of 50 and a random state of zero, with dimension sizes \(z_2\) = 2, 10, and 50 and \(p_{sample}\)=0.001. Figures 8, 9, and 10 show how the latent variable \(z_2\) represents a distinctive set of features of the input traffic when the dimensionality of \(z_2\) is 2, 10, and 50, respectively. As observed in Fig. 9, \(z_2\) = 10 produces the clearest separation between normal and attack designations, whereas Figs. 8 and 10 show that \(z_2\) = 2 and 50 produce less clear separations.
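A minimal sketch of this projection is shown below; `z2` denotes the latent matrix extracted by the trained encoder and `y` the ground-truth labels, both assumed to be available from the experiments above.

```python
# Hedged sketch of the t-SNE visualization used for Figs. 8-10.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(z2)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=2, cmap="coolwarm")
plt.title("Latent variable z2 projected by t-SNE")
plt.show()
```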

Fig. 8 Visualization of the latent variable by t-SNE (\(z_2\)=2)

Fig. 9 Visualization of the latent variable by t-SNE (\(z_2\)=10)

Fig. 10 Visualization of the latent variable by t-SNE (\(z_2\)=50)

5.5 Computation Time

We next discuss the computation time required to train the AAE and MLP. Figure 11 shows the time required to train the AAE and MLP for various ratios of labeled data. The computing environment used in the experiment is shown in Table 2.

Table 2 Computing Environment

The learning rate and number of epochs were as shown in Table 3.

Table 3 Hyperparameter values used for AAE and MLP training
Fig. 11 Computing time required to train the AAE and MLP

If the AAE is used for supervised learning only, only the AE parameters need to be trained. Hence, as the ratio of labeled data increases, fewer training epochs are required, and less computation time is needed to train the AAE. However, when the AAE uses unlabeled data, the two discriminators must be trained in addition to the AE. As such, as the ratio of unlabeled data increases, more training epochs are required, and more computation time is needed to train the AAE. As shown in Fig. 11, when the ratio of labeled data is less than 5.0%, the AAE requires more training time than the MLP. This is because the MLP trains one neural network, whereas the AAE trains one neural network for the AE and two for the discriminators.

6 Conclusions

To reduce the required number of labeled training samples, we proposed a semi-supervised machine-learning-based NIDS that applies an AAE technique. We evaluated the proposed method with a series of experiments and obtained the following results:

  • The proposed AAE-based NIDS achieved performance comparable to that of an MLP-based NIDS with only 0.1% of the labeled data samples and the addition of a small number of unlabeled data samples.

  • When the number of labeled data samples was small, the accuracy did not change, irrespective of the samples selected to be labeled; hence, the selection of data samples for annotation does not affect the performance of the proposed AAE-based NIDS.

  • We demonstrated that a latent-variable dimensionality of \(z_2\)=10 successfully extracted the essential features from the data samples and produced the clearest separation between normal and attack classifications, resulting in the best performance in terms of recall and F1 score.

This study demonstrates promising results obtained by our novel semi-supervised learning method, which reduces the number of labeled training samples and greatly offsets the operational costs of a machine-learning-based NIDS.

Regarding future research directions, we should point out the applicability to real-world environments. In recent years, several publicly available datasets have been published [67,68,69]. These datasets may help us assess applicability to real-world environments, even though efforts to assess the suitability of public datasets for NIDS evaluation are still ongoing [70]. More importantly, a prototype system will be deployed in a real-world environment.