Few-Shot Abnormal Network Traffic Detection Based on Multi-scale Deep-CapsNet and Adversarial Reconstruction

Detecting attacks and abnormal traffic is extremely important to network security. Existing detection models rely on massive amounts of data to detect abnormal traffic. However, in certain special scenarios only few-shot attack samples can be intercepted. In addition, the discrimination of traffic attributes is affected by changes in feature attitude, which traditional neural network models cannot detect. As a result, the accuracy and efficiency of few-shot abnormal traffic detection are very low. In this paper, we propose a few-shot abnormal network traffic detection method composed of a multi-scale Deep-CapsNet and adversarial reconstruction. First, we design an improved EM vector clustering for the Deep-CapsNet, in which an attitude transformation matrix completes the prediction from low-level to high-level features. Second, a multi-scale convolutional capsule is designed to optimize the Deep-CapsNet. Third, an adversarial reconstruction classification network (ARCN) is proposed, which achieves supervised classification of the source data and unsupervised reconstruction of the target data; an adversarial training strategy alleviates noise interference during reconstruction. Fourth, few-shot sample classification is obtained by combining the multi-scale Deep-CapsNet with adversarial reconstruction. The ISCX2012 and CICIDS2017 datasets are used to verify the performance. The experimental results show that our method trains better and achieves the highest accuracy in both two-class and multi-class settings. In particular, it has good anti-noise performance and a short running time, so it can be used for real-time few-shot abnormal network traffic detection.


Introduction
The problem of network security has become increasingly prominent as the network brings us a richer and faster life. Network security risks pose new challenges to information security in politics, economy, culture, and society. Network security comprises the confidentiality, integrity, and availability of the carried information, and any attempt to destroy these properties can be considered a network intrusion. Meanwhile, network security is relative rather than absolute, and network intrusions exist objectively, so enterprises and individuals must detect them. Detecting attacks and abnormal traffic is therefore extremely important to network security. We must study the security threats and malicious attacks that the network faces; accurately detecting the various types of attacks and providing effective countermeasures has become an important research direction for ensuring network security. However, traditional security techniques are difficult to apply to current complex networks, which makes network security issues even more prominent. As an important method of network intrusion detection, network abnormal traffic detection has become a focus of current network security research. Through it, network managers can effectively identify various network attacks and adopt effective methods for intrusion prevention. However, existing detection methods [1][2][3] have many problems in abnormal network traffic detection. For instance, attack traffic has many features, strong uncertainty, strong concealment, and a wide attack range. Existing methods [3,4] often extract only a single traffic feature, so it is difficult to accurately extract the high-level key features of attacks from large-scale traffic in real time.
As a commonly used technique for feature extraction, the convolutional neural network (CNN) [5][6][7] has achieved good results, but the limitations of its structure cannot be ignored. First, CNN strongly depends on the number of samples. Experimental results [5] show that the more training samples and the fewer target categories, the higher the classification accuracy a CNN can achieve. When dealing with few-shot datasets and multi-classification tasks, CNN is prone to overfitting because the model complexity is high and the learnable features are few. Second, CNN has no structure for encoding spatial hierarchical information [6], so it is not good at learning the spatial relationships between parts, or between a part and the whole. Relying only on pooling and fully connected layers, it maintains invariance only under small-amplitude translations. When the position of a traffic feature changes, the traffic attributes also change; we call this change in feature position a feature attitude transition. Therefore, the changes in network attributes brought about by the feature attitude of network traffic must be taken into account. Third, CNN is vulnerable to adversarial samples [7] and has major security flaws: a slight perturbation of a sample may interfere with the trained network. Aiming at these limitations, Sabour et al. [8] proposed CapsNet. It replaces the scalar neurons of CNN with vector neurons (capsules) and uses dynamic routing instead of pooling to represent the part-whole relationships in the data, thereby overcoming CNN's low robustness to changes in direction and position. A capsule contains a group of neurons.
Therefore, when constructing higher-level features, a capsule can use vectors to represent the various attributes of an entity in the feature attitude; the direction of the vector represents certain attributes of the entity. For attack detection, the capsule structure increases the capacity of the feature representation and deals more accurately with complicated spatial position relationships, improving the model's ability to recognize position information. Therefore, we use the capsule network as the feature extraction framework in this paper.
Furthermore, current detection methods [8][9][10][11] based on deep learning require a large number of attack samples for training. In many special scenarios, however, security agencies can intercept only a few attack samples; such an attack is called a few-shot attack, and detection accuracy for it is low. For instance, only few-shot attack samples can be obtained by security agencies during a zero-day attack. Unfortunately, there is little research on few-shot abnormal traffic detection, although the problem can theoretically be attributed to few-shot sample classification. After reviewing a large amount of literature, we adopt the sample augmentation classification method in this paper: auxiliary data or information is used to expand the original samples, through either data expansion or feature augmentation. The former adds new data, generated from unlabeled data or synthesized labeled data; the latter adds features to the feature space that are convenient for classification.
In summary, we use the capsule network to recognize the attitude changes in traffic features. Moreover, a new classification network is designed based on the DRCN [12] network. It is called the multi-scale Deep-CapsNet with adversarial reconstruction and is used for few-shot attack detection. Compared with current detection models, our method has the following highlights and contributions:
• The iterative routing is optimized by the improved EM vector clustering of the Deep-CapsNet. The attitude transformation matrix is used to complete the prediction from low-level features to high-level features. Moreover, a three-dimensional multi-scale convolutional capsule is designed to optimize the Deep-CapsNet. In the end, the multi-level features of few-shot abnormal traffic are effectively extracted.
• After analyzing DRCN, an ARCN is proposed. The ARCN updates the loss function of the RCN, which is used to reconstruct the high-order inconsistency between the generated data and the original data. Moreover, adversarial training is added to alleviate noise interference during data reconstruction. In the end, few-shot abnormal traffic detection is stable.
• We verify the effectiveness of our method in few-shot attack scenarios on public benchmark datasets. The results show that our method is feasible and effective for few-shot abnormal traffic detection, with higher detection accuracy and better efficiency.
This paper is organized as follows: the purpose of the research and our ideas are introduced in Sect. 1. Abnormal traffic detection, the CapsNet network, and few-shot sample classification methods are reviewed in Sect. 2. In Sect. 3, the theory and implementation of the multi-scale Deep-CapsNet are introduced. In Sect. 4, the specific process of the ARCN is described. In Sect. 5, a series of comparative experiments verifies the performance of our method. Finally, we conclude this paper and put forward our next research plan in Sect. 6.

Related Works
In Sect. 1, we explained that our method is based on the capsule network and the sample augmentation classification method, and that classification is performed in a few-shot scenario. Therefore, we divide the related work into three sections: Sect. 2.1, abnormal traffic detection; Sect. 2.2, feature extraction based on the capsule network; Sect. 2.3, few-shot sample classification based on sample augmentation.

Abnormal Traffic Detection
Network intrusion detection can be defined as a system that achieves the classification of abnormal traffic, on the premise that a large number of samples has been obtained. A two-classification model can divide traffic into normal and attack traffic to achieve intrusion detection. The detection process generally consists of two steps: feature extraction and classification. For instance, Hamed et al. [2] proposed an encoded-payload method to achieve intrusion detection based on recursive feature addition (RFA), achieving high test accuracy on the ISCX2012 dataset. Wang et al. [3] proposed the HAST-IDS system, in which traffic features are divided into spatial and temporal features; the former are extracted by a CNN and the latter by a recurrent neural network (RNN). An accuracy of 90.23% is achieved on the ISCX2012 dataset. Min et al. [4] proposed the TR-IDS system, which combines statistical features and CNN-extracted features and obtains an accuracy of 92.31% on the ISCX2012 dataset. In 2017, Vinayakumar et al. [13] also used a CNN for intrusion detection, modeling traffic as a time series. Supervised learning is used to model TCP/IP packets within a predefined time range, and the effectiveness is proved on the KDD99 dataset. In addition, Wang et al. [5] first converted raw network traffic into images, and botnet traffic is classified through a CNN. But there is a huge difference between network traffic and images, so simply visualizing traffic to mimic an image classification task cannot meet the requirements of practical applications. Furthermore, there are many other studies of this kind, using autoencoders [9], long short-term memory networks, support vector machines [10], restricted Boltzmann machines [11], Earth Mover's distance, hybrid neural networks, and other models [8]. Moreover, the newer CICIDS2017 dataset is also used for detection in addition to ISCX2012.
The following conclusions can be drawn: attack traffic can be accurately identified as long as a large number of samples is available. However, in certain scenarios very few traffic samples can be obtained. In a complex environment, traffic has many high-level semantic features, and their discrimination is affected by the feature attitude. Current deep-learning-based abnormal traffic detection methods are therefore no longer suitable for few-shot samples, so a few-shot learning method and a feature extraction method that considers the feature attitude need to be constructed. The following sections describe the existing related work on the capsule network and on few-shot sample classification based on sample augmentation.

Feature Optimal Extraction Based on Capsule Network
CapsNet was proposed by Geoffrey Hinton's group to solve the inherent shortcomings of CNN. Related experiments prove that this structure achieves better feature extraction, and it has been verified to consider changes in feature attitude during extraction. But most capsules cannot share weights in dynamic routing, and the optimization ability of dynamic routing is weak. Meanwhile, the capsule network has not yet been widely applied to network traffic detection because of its heavy computation, although it performs excellently when applied to optimization tasks. For instance, multiple subspaces can be used to represent a capsule, which serves as the input feature vector of the next layer. In [14], a twin network called SCNet is proposed; cosine similarity, Euclidean distance, and Manhattan distance are used to describe the similarity of the output capsules. A discarding operation is added, and the sigmoid function is used to learn the discarding probability of different capsules. Furthermore, capsule pooling is proposed to study how to reduce the number of capsules [15]. Regarding the optimization of dynamic routing, LaLonde and Rajasegaran deepen the capsule network [16,17], converting all capsules along the depth direction. However, the number of dynamic routing iterations increases as the network deepens, and the computational complexity grows greatly. Therefore, Rajasegaran et al. [18] proposed an improved strategy to reduce the routing iterations: three-dimensional convolution is used in the middle routing layer, and parameter sharing is used to reduce the parameters. The deepened capsule network captures more detailed information. To further optimize the performance of the capsule network, batch normalization and pooling are added, and verification tests are completed with few-shot samples. When the test samples far outnumber the training samples, better performance can still be achieved.
In recent years, the combination of the capsule network and the generative adversarial network (GAN) has also received attention; specifically, the capsule network can be used as the discriminator in a GAN. For instance, a capsule network suitable for text classification is proposed in [19]. First, the dynamic routing mechanism is optimized by the network. Then, independent categories are added, and the routing parameter training process is optimized, making the network more suitable for the application environment. The experiment is conducted on six text classification datasets, and on four of them the performance is better than CNN.

Few-Shot Sample Classification Based on Sample Augmentation
Few-shot sample classification does not require a large data volume for the task objective; a good classification effect can be obtained with only a few, or even a single, labeled sample. It can be divided into three mainstream approaches: metric-based [20][21][22], meta-learning [23][24][25], and sample augmentation. In this paper, the third approach is used. The current related research is described in three ways: unlabeled data augmentation, data synthesis, and feature augmentation.
The first way uses unlabeled data to expand the few-shot samples, of which semi-supervised learning is the most representative. In 2016, Wang et al. [26] used an additional unsupervised meta-training network, ensuring that a large amount of unlabeled data is touched by multiple top-level units; diverse sets of low-density separators are learned by these units. Boney et al. [27] used the MAML [28] model for semi-supervised learning in 2018: first, unlabeled data is used to adjust the parameters; then labeled data is used to adjust the classifier parameters; finally, better few-shot classification is achieved. Ren et al. [29] improved the prototype network [21] and added unlabeled data to achieve higher accuracy in few-shot classification. To further optimize the semi-supervised approach, transductive learning is studied as a sub-problem of semi-supervised learning. Liu et al. [30] proposed a transductive propagation network divided into four stages: feature embedding, graph construction, label propagation, and loss calculation. All labeled and unlabeled data are mapped into the vector space, and a construction function is used to complete the learning. Hou et al. [31] also proposed a cross attention network based on transductive learning; the attention mechanism is used to generate cross-attention maps for feature pairs, making the extracted features more discriminative.
The specific idea of data synthesis is that the training data is expanded by synthesizing new labeled data; the commonly used algorithms are mainly variants of GAN. Mehrotra et al. [32] proposed a generative adversarial residual pairwise network (GAN-RPN). First, the generator provides an effective regularized representation of the unseen data distribution; then the RPN is used as a discriminator to measure the similarity of paired samples. A few-shot classifier GAN [33] has also been proposed, in which two adversarial CNNs generate data and discriminate categories, and the overall objective is solved for its max-min solution. Antoniou et al. [34] proposed a data augmentation GAN (DAGAN) to generate new samples, improving few-shot accuracy. Zhu et al. [35] constructed GAN models of the spectral dimension and the space-spectral dimension; the generated samples are added to the classification process for training, and the results show that few-shot classification can be effectively improved. Zhang et al. [36] constructed an unsupervised feature extraction method by combining FCN and WGAN. First, the features of the hierarchically connected discriminator and the extracted multi-convolutional features are used as space-spectral features; then shallow and deep networks are compared; finally, the results show that it can be trained with few labeled samples. Zhong et al. [37] designed a semi-supervised GAN whose discriminator is extended to contain label and authenticity information; deep features are extracted while adversarial training is added, and the classification accuracy is improved. In addition to data synthesis, Hariharan et al. [38] proposed an improved data augmentation method divided into two stages: representation learning and few-shot learning. The first model is learned from a source dataset in the first stage, and a squared gradient magnitude loss is proposed to improve the representation learning. Wang et al. [39] generated virtual data to expand the diversity of samples by combining meta-learning and data augmentation; an end-to-end approach is used to train the generation and classification models. Xian et al. [40] integrated an f-VAEGAN-D2 network by combining a variational autoencoder (VAE) with a GAN. When few-shot learning and classification are completed, the generated sample feature space is formally expressed, and the final result has good interpretability. Furthermore, Chen et al. [41] formed an extended support set by interpolating the support set with a meta-learning training set; the expanded samples are used to calculate the classification loss, which in turn optimizes the weight-generation sub-network.
In the above methods, generated samples are used to enhance the sample set. In addition, the diversity of samples can also be improved by enhancing the sample features; the key to few-shot classification is to obtain a feature extractor with good generalization. Dixit et al. [42] proposed the attribute-guided augmentation (AGA) model, which learns a mapping for synthetic data. Although it can solve the migration problem of attitude objects, the trajectory is discrete rather than continuous. Therefore, Liu et al. [43] proposed a feature transfer network (FATTEN) composed of an encoder and a decoder. First, the features of the target data are mapped to a pair of attitude parameters; then the feature vector is generated by the decoder, completing feature extraction and few-shot classification. Schwartz et al. [44] proposed the Delta encoder to synthesize new samples for unseen categories; the transferable variables between similar samples can be extracted by the encoder. However, this feature augmentation is too simple to significantly improve the boundary problem. Therefore, Chen et al. [45] proposed a two-way network called TriNet, in which the data features are enhanced through mutual mapping. However, classification networks usually focus on the most discriminative feature areas; other areas are ignored, which is not conducive to generalization. Therefore, Shen et al. [46] proposed a strategy that replaces the fixed attention mechanism with an uncertain attention mechanism, convolving the initial features multiple times to obtain a one-dimensional feature.
In summary, few-shot sample classification can be solved to a certain extent by these methods. However, they still have shortcomings: many complicated data distributions cannot be captured, and the generated features are not interpretable. Especially when there is noise interference, the reconstructed data is unreliable. Therefore, we propose an adversarial reconstruction classification network (ARCN) to address these problems in this paper.

Multi-scale Deep-CapsNet Construction
The capsule network is a deep network consisting of stacked non-linear layers. Several hidden layers are coded from the bottom up to encode the input data features. It is mainly composed of three layers: a convolutional layer, an initial capsule layer, and a digital capsule layer. The parameters consist of the convolutional layer weights, the capsule layer weights, and the capsule layer coupling coefficients. The coupling coefficients are updated through the iterative dynamic routing mechanism, and the attitude transformation matrix between the capsule layers is updated by the margin loss function:

L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2

In the formula, c is the classification category and L_c is the loss of the c-th category. T_c is an indicator function: when the entity belongs to category c, T_c is 1; otherwise it is 0. The output capsule of category c is v_c, and its length \|v_c\| is the probability that the entity belongs to that category. m^+ is the upper boundary, with an empirical value of 0.9; m^- is the lower boundary, with an empirical value of 0.1. When the entity is of type c, \|v_c\| should be greater than 0.9; when it is not, \|v_c\| should be less than 0.1. The purpose is to make the loss almost 0 when the prediction is correct and large when it is wrong. \lambda is an adjustment coefficient used to balance the loss contributions of correct and incorrect categories.
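As a concrete reference, the margin loss above can be sketched in a few lines of NumPy. This is a minimal sketch: the function and argument names are ours, not the paper's, and the default \lambda of 0.5 follows the common CapsNet convention.

```python
import numpy as np

def margin_loss(v_norms, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss over output capsule lengths.

    v_norms: (batch, classes) array of capsule lengths ||v_c||.
    labels:  (batch, classes) one-hot indicators T_c.
    Returns the per-sample loss summed over classes.
    """
    # Present classes are pushed above m_pos, absent classes below m_neg.
    pos = labels * np.maximum(0.0, m_pos - v_norms) ** 2
    neg = lam * (1.0 - labels) * np.maximum(0.0, v_norms - m_neg) ** 2
    return np.sum(pos + neg, axis=-1)
```

A confident correct prediction (e.g. length 0.95 for the true class, 0.05 for the other) incurs zero loss, while a confidently wrong one incurs a large loss, matching the behavior described above.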
Thus capsule neurons carry the connections of the network weights, and the features are also enriched, avoiding misjudgment of abnormal traffic caused by changes in network traffic attributes. But when the capsule network is used for traffic feature extraction, the scale of the intermediate vectors is too small because the network depth is insufficient. Moreover, the inner-product routing algorithm is inefficient. These factors reduce the speed of the network, affect its clustering, and leave the traffic feature extraction insufficient. Therefore, the multi-scale Deep-CapsNet is proposed to optimize the capsule network. Its implementation is described in two sections. Section 3.1: the Deep-CapsNet network is designed; it replaces the iterative routing in the capsule network and improves operating efficiency. Section 3.2: a multi-scale capsule structure is designed; it extracts the high-level features of various abnormal traffic and optimizes the classification performance of the model.

Deep-CapsNet Network
When the target object has a complex internal representation, its category cannot be correctly predicted by the original capsule. Therefore, a Deep-CapsNet (Fig. 1) is first constructed in this paper. The network traffic contains N labeled samples of L categories, and the feature vector is used as the input of the capsule network. The main idea of the feature extraction is that convolution handles the low-level features while capsules characterize the higher-level features. Two convolutional layers therefore precede the capsule network and complete the local feature detection of the traffic. The two layers have 512 and 256 channels, convolution kernels of size 9×9 and 5×5, a stride of 1, and ReLU as the activation function. Then a primary capsule layer is constructed to connect the convolutional layers and the capsule module. A set of 32 × 16 × 6 × 6 scalars is obtained by multiple convolutions and grouped into 1152 vector neurons, each outputting a 16-dimensional vector. In each 6 × 6 grid, the weights are shared across the capsules, and each output vector is produced. Meanwhile, the vectors are optimized by clustering capsule layer 1: the features are preliminarily screened by EM clustering, forming high-level vectors with an obvious tendency for clustering capsule layer 2. Then the vectors that represent different information are selected, and compression is performed in each clustering layer. The clustering makes the high-level features more concentrated, and the compression limits the length of the vectors. Unlike plain forward and backward propagation, the transformation of the traffic features is a judgment of the attitude transformation.
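The dimensions quoted above can be checked with a small NumPy sketch: the 32 × 16 × 6 × 6 block of scalars regroups into 1152 sixteen-dimensional capsules, since 32 × 6 × 6 = 1152. The array here is random placeholder data, not real network output.

```python
import numpy as np

# Placeholder for the 32 x 16 x 6 x 6 scalars produced by the convolutions:
# 32 capsule channels, 16 pose components, and a 6 x 6 spatial grid.
features = np.random.randn(32, 16, 6, 6)

# Group the scalars into vector neurons: one 16-dimensional capsule per
# (channel, row, column) position, giving 32 * 6 * 6 = 1152 capsules.
capsules = features.transpose(0, 2, 3, 1).reshape(-1, 16)
```

The transpose moves the 16 pose components to the last axis so that each row of `capsules` is one vector neuron.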
Furthermore, the connection between different capsule layers is completed through EM vector clustering. The last layer is the category capsule layer; the dimension of each capsule vector neuron is denoted d_2, and the number of capsules equals the number of categories L. The following content therefore describes the attitude transformation and the EM vector clustering.

Attitude Transformation
As shown in Fig. 1, when the features pass through the two clustering capsule layers, the feature attitude is analyzed (the affine transformation in the figure is the feature attitude transformation process). The specific process is as follows: in the propagation between the initial capsule layer and the class capsule layer, the vector neurons need to be fully connected. Therefore, an affine transformation coefficient W_{i,j} is added, where W_{i,j} is the attitude transformation matrix from the output of the i-th capsule at level l to the j-th capsule. The matrix converts the vector neuron of length d_1 in the initial capsule layer (denoted in_i) so that it matches the length d_2 of the vector neurons in the class capsule layer (denoted cap_j). The input vector after the affine transformation is

U_{i,j} = W_{i,j} in_i

U_{i,j} and cap_j are vectors of equal length, and U_{i,j} is regarded as the prediction of the j-th capsule from the i-th capsule. All inputs of cap_j are then combined in a weighted sum, so the total input s_j of each cap_j can be expressed as

s_j = \sum_i c_{i,j} U_{i,j}

where c_{i,j} are the coupling coefficients.
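The affine step can be sketched in NumPy as follows. The capsule counts and dimensions here are illustrative assumptions (not taken from the paper), and uniform coupling coefficients stand in for whatever routing later produces.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 16, 8          # low-level / high-level capsule dimensions (illustrative)
n_in, n_out = 1152, 10  # number of input capsules / output categories (illustrative)

in_caps = rng.standard_normal((n_in, d1))        # in_i, one row per input capsule
W = rng.standard_normal((n_in, n_out, d2, d1))   # attitude matrices W_{i,j}

# Prediction vectors U_{i,j} = W_{i,j} @ in_i for every (i, j) pair.
U = np.einsum('ijab,ib->ija', W, in_caps)

# Weighted sum into each high-level capsule, s_j = sum_i c_{i,j} U_{i,j},
# using uniform coupling coefficients as a starting point.
c = np.full((n_in, n_out), 1.0 / n_out)
s = np.einsum('ij,ija->ja', c, U)
```

With uniform coefficients the weighted sum reduces to a scaled column sum of the prediction vectors, which makes the shapes easy to verify.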

EM Vector Clustering Optimization
Low-level features are predicted as high-level features through the vector neurons in the capsule network. The distribution of the output vectors conforms to a Gaussian mixture model [47] in which the different high-level features act as the expectations. Therefore, EM clustering is used to replace the iterative routing, which improves operating efficiency. However, plain EM has two disadvantages: first, the algorithm is computationally complicated and unsuitable for large-scale and high-dimensional data; second, the function to be optimized is not convex, so it is prone to locally rather than globally optimal solutions. Therefore, the EM clustering is redesigned as EM vector clustering and used to replace the iterative routing in the capsule network; it corresponds to the two clustering capsule layers in Fig. 1. The specific implementation is as follows: the set of prediction vectors generated by the attitude transformation conforms to a Gaussian mixture distribution, and the distribution component with the highest probability is taken as the capsule output:

p(X) = \sum_j \pi_j N(X; \mu_j, \Sigma_j)

where j indexes the category and X is the input vector. \pi_j is the probability of class j, with \sum_j \pi_j = 1; \mu_j is the expectation vector of category j, and \Sigma_j is the covariance matrix. Because the low-level features are transformations of the traffic features, the vectors are independent of each other; the covariance matrix is therefore diagonal, which is equivalent to a decoupled calculation of the input X, and formula (7) is modified accordingly. The input distribution is regarded as a mixed Gaussian distribution for clustering, so the center vector is the weighted average of the vectors within the cluster. The significance of a capsule cannot be measured by the vector length alone, so a scalar a_j is added as a scaling scale to measure the significance.
The output is substituted into the asquashing function, which controls the length of the output vector. The EM clustering result gives the variance of the output Gaussian distribution: the larger the variance, the closer the prediction vector distribution is to a uniform distribution. This shows that the predictions do not agree on the same feature, so a_j should be small.
Therefore, the information entropy C_j is selected to assist a_j. Here \sigma_j^l is the mean square error of the l-th layer in the j-th capsule, and r_{i,j} is the ratio of the bottom-layer capsule i assigned to capsule j. C_j expresses the distinctiveness of the j-th capsule, and the scaling scale a_j is obtained from C_j. When the distribution variance is smaller, C_j is smaller, so iterative optimization is achieved by maximizing −C_j. Given U_{i,j}, S_j, and a_j: U_{i,j} is the vector prediction output from capsule i to the high-layer capsule j; S_j is the output direction of the layer-(l+1) capsule; \sigma_j^2 is the variance of that output direction; and a_j represents the prominence of the features. The process of the EM vector clustering algorithm is as follows: first, each prediction vector is weighted by the length of the input vector, and the assignment probability r_{i,j} of each high-level capsule is obtained. Second, the parameters S_j and \sigma_j^2 of the Gaussian mixture model are computed from r_{i,j} and the prediction vector distribution U_{i,j}. After multiple iterations, the probability a_j is used as the output scale of the j-th capsule, and the output direction vector is scaled by the asquashing(·) compression function to obtain the final output vector.
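The iteration described above can be sketched as follows. This is an illustrative E/M-style pass with our own simplified update rules (uniform initial assignments, a variance-based significance a_j, and the standard squashing in place of the paper's asquashing); it is not the authors' exact algorithm.

```python
import numpy as np

def squash(v, eps=1e-8):
    # Standard squashing nonlinearity: keeps direction, limits length to < 1.
    n2 = np.sum(v ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def em_vector_routing_step(U, r):
    """One illustrative E/M pass.

    U: (n_in, n_out, d) prediction vectors U_{i,j}.
    r: (n_in, n_out) nonnegative assignment ratios r_{i,j}.
    Returns scaled output vectors and significances a_j.
    """
    r = r / (r.sum(axis=1, keepdims=True) + 1e-8)  # normalize per input capsule
    w = r.sum(axis=0)[:, None] + 1e-8              # total weight per cluster j
    S = np.einsum('ij,ijd->jd', r, U) / w          # cluster means S_j
    var = np.einsum('ij,ijd->jd', r, (U - S[None]) ** 2) / w
    a = np.exp(-var.mean(axis=-1))                 # small variance -> large a_j
    out = a[:, None] * squash(S)                   # scaled, length-limited output
    return out, a

rng = np.random.default_rng(1)
U = rng.standard_normal((1152, 10, 8))
out, a = em_vector_routing_step(U, np.full((1152, 10), 1.0))
```

Because the squashing bounds each vector below unit length and a_j lies in (0, 1], the final outputs always have length below 1, matching the role of the compression step.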

Multi-scale Capsule Optimization
To solve the problem of insufficient network parameter training, a multi-scale capsule (Fig. 2) is designed to optimize Deep-CapsNet. The commonality and features of various abnormal traffic are fully extracted, and the classification performance of the model is optimized. The module is composed of two sections. First, structural and semantic information is obtained through multi-scale convolution kernels. In Fig. 2, the two convolution kernels of the top branch perform advanced feature extraction and are used to extract semantic information. The first convolution kernel of the intermediate branch is used to extract intermediate features.
Meanwhile, the first convolution kernel of the bottom branch is used to extract low-level features. Then, the feature level is encoded by the multi-dimensional initial capsule. The initial capsule layers of the three branches are used to encode the high-, medium-, and low-level features. The initial capsule layer can be seen as an extension of the two-dimensional convolutional layer. Therefore, the size of the two-dimensional convolution kernel is 4 × 4, and the number of channels is 4. The three branches apply 12, 8, and 4 such convolution operations, respectively. Finally, a splicing operation is used to merge the three initial capsules, and four fusion capsules with a dimension of 24 are obtained.
To further improve the reliability of the capsule network, multi-scale feature learning is added to the training process. The multi-scale convolution kernel is used to extract multi-scale features. Therefore, the multi-scale Deep-CapsNet is proposed, and its structure is shown in Fig. 3. It is mainly composed of a multi-layer convolution structure, multi-scale capsule extraction of deep features, and clustering capsule layers. The specific steps are as follows: Step 1: the input traffic features are extracted by the multi-scale Deep-CapsNet, and the convolution kernel has 16 channels. This process is a preliminary feature mapping of the traffic data, so the primary feature extraction captures common features over a larger range.
Step 2: the multi-scale capsule structure is set to include three convolution kernels with dimensions 1 × 1 × 3, 1 × 1 × 5, and 1 × 1 × 1. The convolution kernel channels are also set to 16. This enables different local structures to be used for feature extraction. The output features then pass through three primary capsule networks. Finally, the three output features are spliced to fuse traffic features of different scales.
Step 3: this is the clustering capsule layer optimization. After the shallow information is extracted by a three-dimensional convolutional capsule, the optimized capsule is used to replace iterative routing. This section is composed of two clustering capsule layers, whose output is produced at the classification capsule layer.
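Step 2's three parallel branches can be sketched as follows. The 1-D "same" convolution, the random kernel weights, and the channel-axis splicing are illustrative assumptions standing in for the trained 1 × 1 × 3, 1 × 1 × 5, and 1 × 1 × 1 kernels.

```python
import numpy as np

def conv1d_same(x, k):
    """'Same'-padded 1-D convolution along the last axis of x (channels, length)."""
    pad = len(k) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    for t in range(x.shape[1]):
        out[:, t] = (xp[:, t:t + len(k)] * k).sum(axis=1)
    return out

def multiscale_branches(x, rng):
    """Sketch of Step 2: three parallel branches with kernel sizes 3, 5, and 1
    (the paper's 1x1x3, 1x1x5, 1x1x1 kernels), spliced along the channel axis
    to fuse traffic features of different scales."""
    outs = []
    for ksize in (3, 5, 1):
        k = rng.standard_normal(ksize)   # illustrative random kernel weights
        outs.append(conv1d_same(x, k))
    return np.concatenate(outs, axis=0)
```

The size-1 branch passes each position through unchanged in shape, the size-3 and size-5 branches aggregate progressively wider local structures, and the final concatenation triples the channel count, mirroring the splicing of the three primary capsule outputs.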

Few-Shot Sample Classification Based on ARCN
After the network traffic features are extracted by the multi-scale Deep-CapsNet, a method suitable for few-shot sample classification is proposed. Since only few-shot network traffic samples are available, feature training tends to fall into under-fitting. Moreover, many enhanced network structures are too complicated to obtain good classification results on few-shot samples. To address this, Ghifary et al. [12] proposed the deep reconstruction classification network (DRCN) model based on deep learning. Two tasks can be learned by DRCN with a set of shared features: supervised source data classification and unsupervised target domain data reconstruction. The extracted features can retain information while maintaining discrimination. However, the classification depth of DRCN is not enough, so deep features cannot be effectively identified. Furthermore, the abnormal traffic of few-shot samples is susceptible to noise interference, so DRCN is not stable in the classification process. Therefore, an adversarial reconstruction classification network (ARCN) is proposed based on DRCN.
It is composed of the reconstruction classification network (RCN) and adversarial training. In the end, the above problems are solved through these two improvement strategies. Furthermore, the implementation of ARCN is described in Sects. 4.1 and 4.2.

RCN Network Construction
First, the reconstruction classification network (RCN) is designed. The reconstruction network comes from the DRCN network in Fig. 4, while the classification network is designed in this paper. The specific structure of the RCN is shown in Fig. 4. It is composed of two sections: a supervised classification network and an unsupervised reconstruction network. The two sections can be divided into three functions: the coding function is represented by the overlapping section of the blue box and the red box in Fig. 4; the classification function is represented by the remaining section of the red box; and the reconstruction function is represented by the remaining section of the blue box. The two sections of the RCN share a coding feature representation, which enables the coding function to learn the commonalities between the two tasks. Let 𝒳 and 𝒞 represent the sample space and label space, and let ℱ represent the coding feature space. First, f_c : 𝒳 → 𝒞 is set as the supervised sample classification section. Then, f_r : 𝒳 → 𝒳 is set as the sample reconstruction section. They can be defined by the following mathematical model: the encoding function is described as h_enc : 𝒳 → ℱ, the reconstruction function as h_rec : ℱ → 𝒳, and the classification function as h_cla : ℱ → 𝒞. When an input sample x ∈ 𝒳 is given, f_c and f_r can be expressed as:

f_c(x; θ_c) = h_cla(h_enc(x; θ_enc); θ_cla), f_r(x; θ_r) = h_rec(h_enc(x; θ_enc); θ_rec),

where θ_c = {θ_enc, θ_cla} and θ_r = {θ_enc, θ_rec} represent the parameters of the supervised classification and the unsupervised reconstruction, and θ_enc, θ_rec, and θ_cla are the parameters of the coding, reconstruction, and classification functions. The goal is to find a set of shared coding parameters for h_enc that supports both f_c and f_r.
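The shared-encoder layout above can be sketched as follows. The single-layer heads, the ReLU encoder, and the random weights are illustrative assumptions, not the paper's architecture; they only show how one encoder feeds both f_c and f_r.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class RCN:
    """Sketch of the RCN: one shared encoder h_enc feeding a classification
    head h_cla and a reconstruction head h_rec. Layer sizes are illustrative."""
    def __init__(self, d_in, d_feat, n_classes, rng):
        self.W_enc = rng.standard_normal((d_in, d_feat)) * 0.1      # theta_enc (shared)
        self.W_cla = rng.standard_normal((d_feat, n_classes)) * 0.1 # theta_cla
        self.W_rec = rng.standard_normal((d_feat, d_in)) * 0.1      # theta_rec

    def encode(self, x):        # h_enc: sample space -> coding feature space
        return relu(x @ self.W_enc)

    def classify(self, x):      # f_c = h_cla o h_enc
        logits = self.encode(x) @ self.W_cla
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)   # softmax class probabilities

    def reconstruct(self, x):   # f_r = h_rec o h_enc
        return self.encode(x) @ self.W_rec
```

Since both heads read the same `encode(x)`, gradients from the supervised classification loss and the unsupervised reconstruction loss both update `W_enc`, which is exactly the shared-representation idea the text describes.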

Adversarial Training
In the unsupervised reconstruction model, adversarial training is designed to mitigate the influence of noise and jitter interference. Therefore, we propose ARCN, which adds adversarial training to RCN. The RCN model is represented by the green box in Fig. 5. Its features are used to generate classification results and reconstruct samples. Moreover, the adversarial network is represented by the blue box on the right in Fig. 5.
The traffic is used to generate category labels (1 = real sample, 0 = synthetic sample). θ_a represents the adversarial network parameters, and f_a : 𝒳 → d represents the adversarial network, where d ∈ [0, 1]. The ARCN process is mainly as follows: when an input sample x is given, f_a outputs the probability that x is a real sample. When N_1 labeled training samples (x_i, y_i) are given, y_i ∈ {0, 1}^K is a one-hot code; N_2 unlabeled training samples are also given. In the loss function, l_c(ŷ, y) = −∑_{k=1}^{K} y_k ln ŷ_k represents the multi-class cross-entropy loss used to score the prediction ŷ, and l_a(ẑ, z) = −[z ln ẑ + (1 − z) ln(1 − ẑ)] is the binary cross-entropy loss. The loss over the RCN parameters θ_c and θ_r is minimized, while the loss over the adversarial network parameter θ_a is maximized. Since the term containing l_a in (14) is related to the adversarial network, the adversarial loss function can be expressed as (15), and the training process of the adversarial network minimizes it. Under the given conditions of the adversarial network, minimizing the RCN model is equivalent to minimizing the multi-class loss function. Therefore, in the loss function of the RCN model, the term −l_a(f_a(f_r(x_i)), 0) is updated to −l_a(f_a(f_r(x_i)), 1): the adversarial model is pushed to predict f_r(x) as a true sample, which generates a strong gradient signal. Experiments have proved that this update is of great significance for accelerating the training process. Therefore, the loss function of the RCN model is updated as in (17). To further optimize the network, the Adam algorithm is used to alternately minimize L_A and L_RCN, which achieves the optimization of L_ARCN. The algorithm stops in two cases: first, the training accuracy exceeds the preset accuracy for multiple consecutive iterations; second, the iterations reach the maximum number.
The specific implementation of ARCN is given in Algorithm 2. In the optimization of L_RCN, dropout regularization is used to prevent overfitting.
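The alternating objectives can be sketched as follows. The equal weighting of the terms and the function names are illustrative assumptions, since the exact forms of (15) and (17) are defined in the paper; the sketch only shows the two cross-entropy losses and the non-saturating label swap.

```python
import numpy as np

def l_c(y_hat, y):
    """Multi-class cross-entropy: -sum_k y_k ln y_hat_k (y is one-hot)."""
    return -np.sum(y * np.log(y_hat + 1e-12), axis=-1)

def l_a(z_hat, z):
    """Binary cross-entropy: -[z ln z_hat + (1 - z) ln(1 - z_hat)]."""
    return -(z * np.log(z_hat + 1e-12) + (1 - z) * np.log(1 - z_hat + 1e-12))

def adversarial_losses(y_hat, y, d_real, d_fake):
    """One alternating step of the ARCN objectives (shapes illustrative).
    The adversarial net minimizes L_A; the RCN side minimizes classification
    loss plus the non-saturating term l_a(f_a(f_r(x)), 1), i.e. the label 0
    has been swapped to 1 to strengthen the gradient signal."""
    L_A = np.mean(l_a(d_real, 1.0) + l_a(d_fake, 0.0))          # adversarial net
    L_RCN = np.mean(l_c(y_hat, y)) + np.mean(l_a(d_fake, 1.0))  # RCN side
    return L_A, L_RCN
```

In training, an optimizer such as Adam would alternate: one step lowering `L_A` with the RCN frozen, then one step lowering `L_RCN` with the adversarial network frozen.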

Experimental Dataset and Settings
Our experiment is implemented on the following platform. Hardware: Intel Xeon E5-2600 V3 @ 2.6 GHz, 256 GB RAM; software: Ubuntu 16.04 LTS, CUDA 10.0, cuDNN 7.5, and PyTorch 0.4.1. A dataset that can be used to verify the proposed model should have the following three features: first, the dataset should contain the complete original network traffic, not feature vectors obtained through feature extraction. Second, the network traffic must have corresponding category labels. Third, the network traffic should be packaged into few-shot sample learning tasks, which is used to evaluate whether the algorithm is still effective. In the field of network intrusion detection, commonly used datasets include KDD99, NSL-KDD, DEFCON, LBNL, Kyoto, ADFA, ISCX2012, and CICIDS2017. However, only ISCX2012 and CICIDS2017 have features (1) and (2), and for research in the few-shot sample scenario with feature (3), there is no relevant public dataset. To satisfy feature (3), we used the complete raw network traffic in ISCX2012 and CICIDS2017 to construct intrusion detection datasets suitable for a few-shot sample environment, called ICSX2012FS and CICIDS2017FS, where the suffix "FS" means few-shot. On this basis, three types of experimental datasets are established.
Types I and II are single datasets, and type III is a cross dataset. Type I: the ICSX2012FS dataset contains 4 types of attacks. Three types of attacks and normal traffic are selected as the training set, and the remaining one type of attack plus normal traffic is used as the test set. Few-shot sample training and testing are thus constructed to perform two-classification experiments. Type II: the CICIDS2017FS dataset contains 5 types of attacks. Three types of attacks and normal traffic are selected as the training set. One of the remaining two types of attacks plus normal traffic samples is used as the test set for a two-classification experiment. Type III: the data form a cross-experimental dataset for cross-network dataset detection. Five types of attacks and normal traffic are selected as the training set from the CICIDS2017FS dataset. Any one type of attack plus normal traffic from the ICSX2012FS dataset is selected as the test set for a two-classification experiment. We call this type of experimental dataset the "cross dataset". The two datasets come from experimental networks with different hardware and software environments, and the types of attacks are also different. Therefore, the adaptability of the proposed method can be fully verified in few-shot sample classification. The specific explanations of the various types of attacks are shown in Table 1. Moreover, a unique code is assigned to each type of attack to facilitate the experiment. The code prefix "i" indicates that the data come from ICSX2012FS, and the prefix "c" indicates that the data come from CICIDS2017FS. Therefore, ICSX2012FS is composed of normal traffic samples and four types of attack samples (iA, iB, iC, and iD). CICIDS2017FS is composed of normal traffic samples and five types of attack samples (cA, cB, cC, cD, and cE). For each task in the training set and test set, we set K = 5 or 10 (called 5-shot and 10-shot), meaning that the number of samples in each category is 5 or 10.
Therefore, the "few-shot" sample scenario of an actual network is accurately simulated.
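The K-shot task construction above can be sketched as follows; the class codes and the dictionary layout are illustrative assumptions.

```python
import random

def make_few_shot_task(samples_by_class, k):
    """Sketch of constructing a K-shot task (K = 5 or 10): draw exactly k
    samples per category. `samples_by_class` maps a class code (e.g.
    'normal', 'iA') to its list of traffic samples; names are illustrative."""
    task = {}
    for cls, samples in samples_by_class.items():
        task[cls] = random.sample(samples, k)   # k samples per category
    return task
```

For a Type I training task, `samples_by_class` would hold normal traffic plus three of the four ICSX2012FS attack codes, and the held-out code would appear only in the test task.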

Multi-scale Deep-CapsNet Performance Verification
To verify the influence of the EM parameters in the multi-scale Deep-CapsNet, an EM parameter selection experiment is constructed. The parameter of interest is the clustering length, which is used to measure the probability of classification: the longer the clustering length, the better the clustering effect.
The experiment is trained on the three datasets. The clustering length is mainly tested under different R (the number of clustering iterations). To further illustrate the effectiveness, the clustering length of CapsNet is also used for comparison. The experiment is performed ten times, and the specific results are shown in Fig. 6. The left side of Fig. 6a-c is the comparison of the model length, and the right side is the clustering result. The conclusions that can be drawn from Fig. 6a-c are as follows: first, different numbers of clustering iterations have a great influence on the clustering length. Before the CapsNet is optimized by the improved EM clustering, the longest clustering length is less than 0.1; after three clustering iterations, the average clustering length reaches 0.45 in the CapsNet. This shows that CapsNet can be optimized by the improved EM clustering. Second, the clustering length continues to increase as the iterations increase, but when the iterations increase to 6, the clustering length decreases, because too many clustering iterations harm the clustering effect and reduce the length. Third, the clustering length of the EM clustering is significantly higher than that of routing clustering, which shows that the improved EM clustering effect is better than traditional iterative routing. From the training curves in Fig. 7a-c, the following conclusions can be drawn: first, the training accuracy of the capsule network is significantly higher than that of CNN. Although the accuracy of 3DCNN is higher than that of the plain capsule, it is lower than the other capsule optimization methods. Second, when the capsule is optimized by the improved EM vector clustering, the accuracy is significantly improved, and when the Deep-CapsNet is further designed as 3DConv-Deep-CapsNet, the accuracy is the highest: the average accuracy rate reaches 0.96 in the 5-shot training task.
Third, the corresponding training accuracy continues to be increased as the epochs are increased.
However, the accuracy curves of CNN and 3DCNN have poor convergence: when the training epochs increase, the accuracy decreases, because the gradient of CNN vanishes as training epochs increase. Moreover, CNN falls into a locally optimal solution, which affects the final training performance. In contrast, the proposed 3DConv-Deep-CapsNet accuracy curve has the best convergence: when the epochs increase to 16, the accuracy rate remains stable. Therefore, the training performance of 3DConv-Deep-CapsNet is verified on the 5-shot task. The same conclusion is also obtained from the loss rate curves in Fig. 7d-f. Correspondingly, we can draw the following conclusions from Fig. 8a-c: first, the accuracy of the capsule is also significantly higher than that of the CNN, and the accuracy under 10-shot is slightly higher than under 5-shot, because more traffic features are available under 10-shot; more types of features can be obtained, so the accuracy improves. Moreover, the accuracy of 3DCNN matches that of the capsule in the 10-shot training task, so the advantage of the capsule over CNN is proven again. Second, when the capsule is optimized by the improved EM vector clustering, the accuracy is significantly improved, and when the Deep-CapsNet is further designed as 3DConv-Deep-CapsNet, the accuracy is the highest: the average accuracy rate reaches 0.975 in the 10-shot training task. Third, the accuracy curves of CNN, 3DCNN, and traditional capsules have poor convergence as epochs increase. There are some malicious traffic features in the 10-shot training, so traffic features are misjudged by the capsule. The proposed 3DConv-Deep-CapsNet accuracy curve has the best convergence: when the epochs increase to 16, the accuracy rate remains stable.
The conclusion is consistent with the 5-shot training situation, and the same conclusion is also obtained from the loss rate curve in Fig. 8d-f.

ARCN Training Performance Verification
Similarly, the same experiment is constructed to verify the performance of ARCN. In the experiment, GAN is used as the baseline to compare the training performance of GAN, RCN, DRCN, CapsNet-GAN, and ARCN. The learning rates are also set to 0.001, and the iterations are 100. The accuracy and loss rates under 5-shot and 10-shot are compared in Figs. 9 and 10. First, when only the GAN is used to train the dataset, its training performance is relatively poor: the accuracy rate of GAN is only 76.5% under 5-shot. The other networks are optimized based on the GAN network, and their accuracy is better, but the highest accuracy, 97.7%, is achieved by the ARCN network. Second, although DRCN and RCN are reconstruction classification networks, neither includes adversarial training; the data are too noisy in the reconstruction, so their performance is average. Third, the accuracy rate continues to increase as the epochs increase. However, the accuracy curves of the other networks have poor convergence except for ARCN and DRCN: when epochs increase further, their accuracy decreases. The ARCN accuracy curve has the best convergence. The same conclusion is also obtained from the loss rate curves in Fig. 9d-f. The following conclusions can also be drawn from Fig. 10a-c: the accuracy under 10-shot is slightly higher than under the 5-shot training tasks, because more traffic features are available under 10-shot. The other methods have poor convergence of their accuracy curves except for ARCN. The ARCN accuracy curve has the best convergence, with an average accuracy rate of 98.49%; when epochs increase, the accuracy rate remains stable.

Two-Classification experiments
The few-shot sample classification comparison experiments are constructed in this section; the results under different classification methods are given in Tables 2, 3 and 4. The experiment is performed under 5-shot and 10-shot settings. Normal traffic is denoted normal, and abnormal traffic is denoted attack. The Adam optimization method is used to train the network, and the initial learning rate is set to 1 × 10⁻³; the learning rate is decayed by a factor of 0.5 every 10,000 iterations. The classification results on the single datasets are compared in Tables 2 and 3. As can be seen from Table 2, the accuracy of the meta-learning methods is poor on ICSX2012FS: their accuracy rates are 64.485 and 68.645% under the 10-shot setting, lower than the other methods. The average accuracy rate of the proposed method reaches 85.51% under the 10-shot setting, which is 21.025 and 16.865% higher than Meta LSTM and MAML, respectively. Because they are meta-learning methods, a large number of categories are required for training; however, the dataset constructed in this paper has only few-shot categories, so meta-learning often suffers from underfitting. Moreover, the proposed method is 14.15 and 18.765% higher than Matching Nets and Proto Nets, respectively, and it also achieves the highest accuracy among the data augmentation methods. In comparison, the accuracy rate of each method under the 5-shot setting is lower than under the 10-shot setting, but the proposed method still has the highest accuracy, with an average accuracy rate of 82.805%; although this is 2.705% lower than the 10-shot result, it is higher than the other methods. Similarly, the same conclusions can be obtained from Table 3. For instance, the accuracy rates of the two meta-learning methods under the 10-shot setting are 69.985% and 70.065%, which are lower than the other methods. The average accuracy rate of the proposed method is 89.27% under the 10-shot setting.
They are 19.285 and 19.205% higher than Meta LSTM and MAML, respectively, and 12.15 and 10.9% higher than Matching Nets and Proto Nets, respectively. The reasons for the low accuracy are consistent with the ICSX2012FS experiment. The average accuracy of the proposed method is 86.83% under the 5-shot setting; although this is 2.44% lower than 10-shot, it is higher than the other methods. The traffic features are extracted through the multi-scale Deep-CapsNet, and the ARCN is used to reconstruct and classify the data; noise interference is removed during data generation, so the final classification effect is better. Therefore, it can be observed from the classification experiments on the single datasets that, regardless of 5-shot or 10-shot, the proposed method has the highest accuracy in the few-shot sample setting. The classification results on the cross dataset are compared in Table 4, and the conclusions are as follows: first, the accuracy of each method on the cross dataset is lower than on the single datasets. The cross dataset spans two databases, so its accuracy is much lower than that of a single dataset. Moreover, the features of some attack traffic in CICIDS2017FS are similar to those in ICSX2012FS, which often causes misjudgments. Second, compared with the data augmentation methods, the four methods based on metric learning and meta-learning have lower accuracy; the reason for the poor performance is the same as described above. Third, the proposed method has the highest accuracy under the 10-shot and 5-shot settings, with average accuracy rates of 80.48 and 76.32%, respectively. Therefore, the experimental results show that it has better performance on the cross dataset.
The two-classification results on the three datasets prove the effectiveness of the proposed method again. Further explanation: first, the improvements to the capsule and GAN are effective. After the multi-scale Deep-CapsNet is used to extract traffic features, the discrimination of traffic features is enhanced. Second, complex high-level features can be further recognized by the attitude transformation. Third, high accuracy can still be obtained on few-shot samples through reconstruction classification and adversarial training.

Multi-classification Experiments
To further verify the performance in few-shot sample classification, all attack categories are combined as the training set on ICSX2012FS and CICIDS2017FS. The attack categories of the ICSX2012FS dataset and the CICIDS2017FS dataset are used as the test sets. The multi-classification results for attack traffic under each method are shown in Tables 5 and 6. The main conclusions drawn from Tables 5 and 6 are as follows: first, the training and testing of cross-database datasets are feasible. Although the two datasets are obtained by different software and hardware, the traffic has certain commonalities; therefore, the experiment in this paper is feasible. Second, it can be observed that the accuracy results are obviously inferior to two-classification, because the similarity of the two datasets easily causes misjudgment in classification, especially when each attack traffic type must be classified. Moreover, the training set and the test set also share a cross-validation set in their distribution; when multi-classification is performed, the results are poor due to an insufficient cross-validation set. Third, the proposed method has the best performance compared with the other methods, with average accuracy rates of 77.585% and 78.084% on ICSX2012FS and CICIDS2017FS. The multi-classification results on these datasets verify the effectiveness of the proposed method again, as well as its validity and universality in the few-shot sample scene.

Ablation Experiments
To further verify the performance of the optimization strategy, we have constructed an overall ablation experiment.
In the experiment, the accuracy without any optimization strategy is used as the baseline. The accuracy on the cross dataset is given as each optimization strategy is gradually added or removed (Table 7). The upper section of Table 7 compares the results as the optimization strategies are gradually added; the lower section compares the results as each strategy is removed. The conclusions from the upper section of Table 7 are as follows: first, the average accuracy rate is 67% in the baseline experiment. The average accuracy rate reaches 72.52% when the improved EM vector clustering is added, which is 5.52% higher than the baseline; this first proves the improved EM vector clustering for capsule optimization. Second, when the multi-scale convolutional capsule, RCN, and adversarial training are added, the accuracy is further improved at each step. The average accuracy rate is 85.335% when all four strategies are added, which is 18.335% higher than the baseline experiment. Third, the performance is poor when only RCN is used, because RCN is easily disturbed by noise during reconstruction; after adversarial training is added, the accuracy is greatly improved. The importance of adversarial training for RCN is thus proven once again.
Similarly, the lower section of Table 7 compares the results as optimization strategies are removed; here, the configuration with all strategies added is used as the baseline. We found that the accuracy is reduced when any one of the optimization strategies is removed, but the most influential is adversarial training: the average accuracy rate is 78.405% when it is removed, which is 6.93% lower than when all four strategies are added. This fully explains the importance of adversarial training for noisy few-shot samples. In summary, the necessity of each strategy is proved by the overall ablation experiment.
To more intuitively verify the performance of each optimization strategy, the visualization experiment is constructed. The visualization results of the three datasets are given when each optimization strategy is gradually added. The visualization of larger traffic samples is also given as a comparison, which is shown in Figs. 11, 12 and 13.
From the large-sample visualization results of ICSX2012FS and CICIDS2017FS, it can be observed that the classification effect improves as the optimization strategies are added. When adversarial training is added, attack and normal samples achieve effective classification, and the data samples become more convergent and compact. From the few-shot sample visualization results shown in Figs. 11 and 12 (type 1 samples are attack traffic, and type 2 samples are normal traffic), it can be observed that attack and normal traffic achieve effective classification as each optimization strategy is added. However, the data distribution is relatively scattered and messy, which shows that the effect of feature extraction and training is poor in the few-shot sample setting. When the four optimization strategies are added, the samples not only achieve effective classification but also become more compact; therefore, the necessity of the optimization strategies is proved. From the large-sample visualization of the cross dataset in Fig. 13a-d, the following conclusions can be drawn: the classification effect improves as the optimization strategies are added. When adversarial training is added, most attack and normal samples achieve effective classification, but individual samples are still not effectively classified. Because the CICIDS2017FS dataset contains 5 types of attacks, individual attacks are disguised as normal samples; when an attack from ICSX2012FS and normal traffic are used for testing, misjudgments are caused. But when the four optimization strategies are added, the visualization results are best and the samples are more convergent and compact. The same conclusion is obtained from the few-shot sample visualization, so the effectiveness of the strategies is also proved in the visualization of the cross dataset. From the visualization results of the three datasets in Figs. 11, 12 and 13, it can be observed that the classification effect improves as the optimization strategies are added; in particular, adversarial training is most important for traffic classification.

Generated Data Quality Evaluation
To verify the generated data quality, a generated-data experiment is constructed. Batch training is used in the experiment, with each batch composed of 128 sampled data. Two evaluation indexes are used: the generative adversarial metric (GAM) and the Inception score (IS). GAM includes a training phase and a testing phase of ARCN. M_1 = (G_1, D_1) and M_2 = (G_2, D_2) are two different generative adversarial networks. In the training phase, M_1 and M_2 are trained separately. In the test phase, each opponent's discriminator is used to discriminate the samples generated by the other's generator. The same random noise z is used as the input of the generators in M_1 and M_2, and G_1(z) and G_2(z) represent the samples generated by M_1 and M_2. The real samples input to the discriminators during training and testing are denoted x_train and x_test. In the GAM comparison, there are two ratios, r_test and r_sample, where err(·) represents the classification error rate:

r_test = err(D_1(x_test)) / err(D_2(x_test)), r_sample = err(D_1(G_2(z))) / err(D_2(G_1(z))). (18)

The generalization ability of M_1 and M_2 is reflected by r_test; to reduce discriminator bias, r_test is used to ensure the fairness of the comparison. r_sample is used to determine the winning model according to the specific discriminant rules. First, the GAM comparison experiment is constructed between ARCN and DRCN. The time step of the generator is set to T = {1, 3, 5}. Ten test samples are selected as real samples, and ten samples from the generator are used as generated samples. The former model is regarded as M_1 = (G_1, D_1), and the latter model as M_2 = (G_2, D_2). The comparison results are shown in Table 8.
The main conclusions are as follows: first, M_2 = (G_2, D_2) is the winner at every time step, which shows that the quality of the data generated by ARCN is better than that of DRCN. Second, the quality of the generated data improves as the time steps increase; however, it can be observed that the quality is best when T = 3. Third, the best GAM among the three datasets is obtained on CICIDS2017FS. Because this dataset has more attack categories, the discriminator in the other model is more likely to be "spoofed" by the generated samples. To further verify the performance of ARCN, an Inception score (IS) rate experiment is also constructed. The higher the IS, the higher the diversity of the generated samples; conversely, a lower IS means lower diversity. Table 9 gives the IS rates obtained by DRCN and ARCN under different time steps. The IS rate of both models increases as the time step increases, which shows that the proposed structure is beneficial to increasing diversity. Furthermore, we can observe that the IS rate of the cross dataset is much higher, because the cross dataset contains more categories than the other datasets; the span between categories is larger, and different samples are easier to obtain. Moreover, it can be observed that the IS rate of ARCN is higher than that of DRCN.
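The GAM ratios in (18) can be computed as in the following sketch. It assumes the discriminator outputs have already been thresholded to hard 0/1 decisions, and all variable names are illustrative.

```python
def err(predictions, labels):
    """Classification error rate of a discriminator over a batch."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

def gam_ratios(d1_on_test, d2_on_test, test_labels, d1_on_g2, d2_on_g1):
    """GAM battle between M1 = (G1, D1) and M2 = (G2, D2):
    r_test compares errors on real held-out samples (fairness check),
    r_sample compares each discriminator's error on the opponent
    generator's samples, whose true label is 0 (synthetic)."""
    fake = [0] * len(d1_on_g2)
    r_test = err(d1_on_test, test_labels) / err(d2_on_test, test_labels)
    r_sample = err(d1_on_g2, fake) / err(d2_on_g1, fake)
    return r_test, r_sample
```

Intuitively, if r_sample is well below 1, D_1 is rarely fooled by G_2's samples while D_2 often mistakes G_1's samples for real, so M_2's generator is the stronger one.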
It can be observed from Table 10 that the IS rate also increases as the time step increases, and ARCN has the highest IS rate. This is an obvious advantage, conducive to improving diversity. ARCN can be trained with very few labeled samples, and high-order inconsistencies between the reconstructed and original data are detected; therefore, the data quality is improved.
To verify the data quality more intuitively, noise is added to the cross dataset for visualization analysis. The visualizations of the seven data augmentation methods and the proposed method are shown in Fig. 14.
Type 1 is normal traffic, type 2 is attack traffic, and type 3 is noise. From Fig. 14a-g, the quality of the data generated by the other methods is mediocre when noise is added, and the samples are not effectively separated. On the contrary, the three types can be effectively distinguished by the proposed method when noise traffic is added, and the samples are more convergent. Therefore, the proposed method generates data of the best quality.
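The kind of 2-D scatter shown in Fig. 14 can be reproduced with any low-dimensional embedding of the traffic features. A minimal sketch using a PCA projection — a stand-in assumption, since the paper does not name the embedding method — with three synthetic groups standing in for the normal, attack, and noise types:

```python
import numpy as np

def project_2d(features):
    """Project high-dimensional traffic features to 2-D via PCA
    (a simple stand-in for the embedding used to draw Fig. 14)."""
    x = np.asarray(features, dtype=float)
    x = x - x.mean(axis=0)                  # center the data
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T                     # coordinates on the top-2 components

# Hypothetical example: 3 well-separated clusters of 8-D feature vectors
# (type 1 normal, type 2 attack, type 3 noise), 20 samples each.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=c, scale=0.1, size=(20, 8)) for c in (0.0, 1.0, 2.0)]
coords = project_2d(np.vstack(groups))      # shape (60, 2), one point per sample
```

Plotting `coords` colored by type would yield a scatter in the style of Fig. 14; well-separated clusters in this projection correspond to the "effectively distinguished" behavior described above.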

Anti-noise and Time Analysis
To verify the anti-noise performance of the method more effectively, the signal of the cross dataset is input into the Netflow Analyzer over a certain period of time to form a traffic signal. Noise is then added to this signal to form a noisy signal (the upper section of Fig. 15). This signal is processed by the proposed method, and the processed signal is shown in the lower section of Fig. 15. The following conclusion can be drawn from Fig. 15: the noise in the signal is removed by adversarial training, and the processed signal is smooth and free of distortion. Furthermore, malicious noise traffic is added to the three datasets, and the classification accuracy and running time of each method are compared to verify robustness. Table 11 gives the average accuracy, running time, and testing time of the seven data augmentation methods on the three datasets when noise is present.
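Injecting noise into a traffic signal at a controlled level, as done to produce the upper section of Fig. 15, can be sketched as follows. The SNR-based white-Gaussian formulation is an assumption for illustration; the paper does not specify its noise model:

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise to a traffic signal at a given
    signal-to-noise ratio in dB (assumed noise model), mimicking
    the noisy signal in the upper section of Fig. 15."""
    if rng is None:
        rng = np.random.default_rng()
    s = np.asarray(signal, dtype=float)
    p_signal = np.mean(s ** 2)                       # average signal power
    p_noise = p_signal / (10 ** (snr_db / 10))       # target noise power
    return s + rng.normal(scale=np.sqrt(p_noise), size=s.shape)
```

A denoising model is then evaluated on `add_noise(clean, snr_db)` against the original `clean` signal, which is the setup behind the upper/lower comparison in Fig. 15.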
The conclusions drawn from Table 11 are as follows. First, the accuracy decreases when malicious traffic noise is added to the three datasets. The most affected of the three is the cross dataset: as the number of training categories increases, its performance under malicious noise traffic degrades. Second, the proposed method obtains the best performance when noise traffic is added, achieving the highest accuracy on all three datasets. Third, although the training time of the CapsNet-GAN is 0.702 s shorter than that of the proposed method, the testing time of the proposed method is the shortest. Because EM vector clustering is designed to optimize the dynamic iterative routing, the final efficiency is improved. This shows that the proposed model can be used for real-time few-shot traffic classification and has good robustness.
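The running-time and testing-time figures compared in Table 11 come down to wall-clock measurement of a training or inference routine; a minimal timing helper, with `sum` over a range as a hypothetical stand-in workload:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds),
    measured with a monotonic wall-clock timer, as needed for the
    running-time and testing-time comparison in Table 11."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example with a stand-in workload in place of a training/testing routine
result, seconds = timed(sum, range(100000))
```

In practice `fn` would be the model's training loop or its test-set forward pass, timed separately to populate the two time columns.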

Conclusions
To address the scenario in which only few-shot attack samples can be obtained, the capsule network and reconstruction classification are combined to propose a few-shot abnormal traffic detection method. Two internationally public datasets, ICSX2012 and CICIDS2017, are used to complete few-shot training and testing. The main conclusions are as follows:

1. The proposed method has better training performance on few-shot samples of the three datasets. Few-shot training is more effective when the multi-scale Deep-CapsNet and the ARCN are combined, yielding higher training accuracy and a lower loss rate.

2. The method achieves the highest classification accuracy under the 5-shot and 10-shot settings compared to other methods. The ablation experiment shows that the improvements to the capsule network and the GAN are both effective. After the multi-scale Deep-CapsNet extracts traffic features, the discrimination of the features is enhanced; moreover, the ARCN can still obtain high classification accuracy with few-shot samples.

3. The visualization shows the best classification effect on both large and few-shot samples: normal and attack traffic can be effectively classified. Especially when noise is added to the dataset, the proposed method has good anti-noise performance and a short running time. This shows that the proposed method can be used for real-time few-shot traffic detection and has good robustness.

In future research, abnormal traffic detection will be studied in a real network environment.