1 Introduction

Anomaly detection is becoming increasingly important in visual tasks. In industrial production, detecting faults in machine parts by means of anomaly detection can greatly improve production efficiency. Over the years, scholars have carried out a great deal of preliminary work [1,2,3,4,5,6] exploring the direction of the field. The development of CNNs offers new ideas for image anomaly detection. From LeNet [7] to AlexNet [8], VGG [9] and the Inception series [10,11,12], the performance of CNNs has steadily improved. Supervised learning methods based on CNNs have been widely used to detect anomalies. However, in some engineering areas, the lack of anomaly samples hinders the development of supervised anomaly detection methods: without enough abnormal samples, conventional approaches such as object detection, semantic segmentation and image classification are difficult to train. Therefore, anomaly detection methods that rely only on normal samples are urgently needed.

The development of GAN in recent years has provided new ideas for research on anomaly detection methods based on normal samples. GAN, an unsupervised generative method, was proposed by Ian Goodfellow et al. [13] in 2014. Subsequently, methods such as LAPGAN, CGAN, InfoGAN, and CycleGAN [14,15,16,17] gradually enhanced the performance of GAN. AnoGAN [18] applied GAN to image anomaly detection and realized anomaly detection without abnormal samples. This method uses only normal samples to train DCGAN [19], and introduces an image distance measurement model to judge whether a sample is abnormal. Later, Efficient-GAN [20], ALAD [21] and f-AnoGAN [22] further improved the performance of GAN-based anomaly detection models.

Building on GAN as the backbone network, Akcay et al. proposed GANomaly [23], which trains an autoencoder with an adversarial mechanism to reconstruct images. Skip-GANomaly [24] adds skip connections between the encoding part and the decoding part of the GANomaly generator to reduce information loss and enhance model performance. However, in some small-target anomaly detection tasks, such as the bird class in the CIFAR-10 dataset [25], the performance of f-AnoGAN, Skip-GANomaly and GANomaly is not satisfactory. Moreover, current encoder-decoder networks lack stability and robustness during training.

In this paper, we study anomaly detection based on unsupervised learning and propose a Fully-Nested Encoder-decoder Framework. The anomaly detection method consists of a generating model and a distance measurement model. The generating model includes a generator and a discriminator, and the distance measurement model is used to detect anomalous data. In the generating model, we design a Fully-Residual Encoder-decoder Network as the generator. Because different datasets call for different network depths, the generator nests encoder-decoder networks of different depths, which lets the model select the best-depth encoder-decoder network for each dataset. We then choose the discriminator network of DCGAN as the discriminator of the model. Experiments on the CIFAR-10 dataset demonstrate the excellent performance of our method.

2 Proposed Method

This paper proposes a Fully-Nested Encoder-decoder Framework for anomaly detection. As shown in Fig. 1, the anomaly detection method consists of two parts: a generating model and a distance measurement model. The generating model learns the distribution of the normal data in order to reconstruct normal samples. While training the generator, the model uses a classification network as the discriminator and trains with the adversarial mechanism. The distance measurement model is a distance calculation method: in the test phase, the distance between the reconstructed image and the real image is used to determine whether the test sample is abnormal.

Fig. 1. Pipeline of our proposed framework for anomaly detection

2.1 Generating Model

The generating model reconstructs the image by learning the distribution of normal samples. Choosing a high-performance encoder-decoder network is essential for image reconstruction, because the composition of the encoder and decoder directly affects the quality of the reconstructed image.

In the generating model, the generator is a fully nested residual network, which can be divided into an encoding part and a decoding part, as shown in Fig. 2. The network can be regarded as several nested encoder-decoder networks of different scales. The encoder is a shared branch. The decoder decodes the deep semantic feature maps of four different scales generated by the encoder, and produces four parallel decoding branches. The generating model uses a classification network as the discriminator and is trained with the adversarial mechanism. Batch Normalization [26] and ReLU activation functions [27] are used throughout the network.

Fig. 2. The architecture of our proposed generator

The encoder is the shared part, shown in the black dotted box in Fig. 1 and denoted by \({G}_{E}\). It reads the input image \({x}_{real}\) and generates the deep semantic feature maps \(z=({z}_{1},{z}_{2},{z}_{3},{z}_{4})\), as expressed in Formula (1):

$$z={G}_{E}\left({x}_{real}\right)$$
(1)

The decoder network decodes \(({z}_{1},{z}_{2},{z}_{3},{z}_{4})\) and produces four parallel branches, \({D}_{1}, {D}_{2}, {D}_{3}\) and \({D}_{4}\), which are jointly denoted by \({G}_{D}\), as shown in the red dotted box in Fig. 1. Moreover, the internal decoding branches use dense skip connections to connect to adjacent external decoding branches for feature fusion. Skip connections enhance the transfer of detailed information between different branches, greatly reducing information loss. The final layer of the outermost decoding branch outputs the reconstructed image \({x}_{fake}\) of the generator, as expressed in Formula (2):

$${x}_{fake}={G}_{D}\left(z\right)$$
(2)

We add a residual structure to both the encoder and the decoder to improve the feature expression ability and reduce the risk of overfitting. Through backpropagation, the four-scale nested model can select, on its own, a network depth suited to each dataset.

We add a classification network after the generator as the discriminator of the model. It is the classification network of the DCGAN model and is denoted by \(D\left( \cdot \right)\). For an input image, the discriminator identifies whether it is a normal sample \({x}_{real}\) or an image \({x}_{fake}\) reconstructed by the generator.

The dataset is divided into the training set \({D}_{train}\) and the test set \({D}_{test}\). The training set \({D}_{train}\) is composed only of normal samples, and the test set \({D}_{test}\) is composed of normal samples and abnormal samples. In the training phase, the model uses only normal samples to train the generator and discriminator. In the test phase, the distance between each test image and its reconstruction produced by the generator is calculated to determine whether the image is abnormal.

2.2 Distance Measurement Model

In the test phase, we calculate the anomaly score of the test image to measure whether it is abnormal. Given the test set \({D}_{test}\) and an input \({x}_{test}\), the anomaly score is defined as \(A\left({x}_{test}\right)\). We use two kinds of distances to measure the difference between \({x}_{test}\) and its reconstruction \({x}_{fake}\). First, we calculate the \({L}_{1}\) distance between \({x}_{test}\) and \({x}_{fake}\), denoted by \(R\left({x}_{test}\right)\), which describes the detailed difference between the reconstructed image and the input image. Second, we calculate the \({L}_{2}\) distance between \(f\left({x}_{test}\right)\) and \(f\left({x}_{fake}\right)\), denoted by \(L\left({x}_{test}\right)\), which describes the difference in semantic features; here \(f\left(\cdot\right)\) denotes the bottleneck features extracted by the last convolution layer of the discriminator (see Sect. 2.3). The formulas for \(A\left({x}_{test}\right)\), \(R\left({x}_{test}\right)\), and \(L\left({x}_{test}\right)\) are as follows:

$$A\left({x}_{test}\right)=\lambda R\left({x}_{test}\right)+(1-\lambda )L\left({x}_{test}\right)$$
(3)
$$R\left({x}_{test}\right)={||{x}_{test}-{x}_{fake}||}_{1}$$
(4)
$$L\left({x}_{test}\right)={||f\left({x}_{test}\right)-f\left({x}_{fake}\right)||}_{2}$$
(5)

where \(\lambda \) is the weight to balance the two distances \(R\left({x}_{test}\right)\) and \(L\left({x}_{test}\right)\). In the proposed model, \(\lambda \) is set to 0.9.
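
As a minimal sketch of Formulas (3)-(5), the score could be computed as below. It assumes that images and feature vectors are flattened into plain Python lists and that the distances are unnormalized sums (the text does not say whether they are averaged over elements). The function names `l1_distance`, `l2_distance` and `anomaly_score` are illustrative, not taken from the original implementation.

```python
import math

def l1_distance(a, b):
    """Sum of absolute differences, as in Formula (4)."""
    return sum(abs(p - q) for p, q in zip(a, b))

def l2_distance(a, b):
    """Euclidean norm of the difference, as in Formula (5)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def anomaly_score(x_test, x_fake, f_test, f_fake, lam=0.9):
    """Weighted sum of image-space and feature-space distances, Formula (3)."""
    r = l1_distance(x_test, x_fake)   # R(x_test): detail difference
    l = l2_distance(f_test, f_fake)   # L(x_test): semantic difference
    return lam * r + (1.0 - lam) * l

# Toy values standing in for a flattened image and its bottleneck features.
x_test, x_fake = [0.0, 0.5, 1.0], [0.0, 0.25, 0.5]
f_test, f_fake = [3.0, 0.0], [0.0, 4.0]
score = anomaly_score(x_test, x_fake, f_test, f_fake)
```

With \(\lambda = 0.9\), the image-space term dominates unless the feature distance is roughly an order of magnitude larger than the reconstruction distance.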

In order to better measure whether the input image is abnormal, the anomaly score of each image in the test set \({D}_{test}\), calculated according to Formula (3), needs to be normalized. Let \(A=\{{A}_{i}:{A}_{i}=A\left({x}_{test,i}\right), {x}_{test,i}\in {D}_{test}\}\) be the set of anomaly scores of all images in the test set \({D}_{test}\). The model maps the set of anomaly scores \(A\) to the interval [0, 1] by Formula (6).

$${A}^{\prime}\left({x}_{test}\right)=\frac{A\left({x}_{test}\right)-\mathrm{min}\left(A\right)}{\mathrm{max}\left(A\right)-\mathrm{min}\left(A\right)}$$
(6)

We set a threshold for \({A}^{\prime}\left({x}_{test}\right)\). Samples whose anomaly score exceeds the threshold are judged abnormal; all others are judged normal.
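
The normalization in Formula (6) and the thresholding step can be sketched as follows. The threshold value of 0.5 and the handling of the degenerate case in which all scores are equal are assumptions made for illustration; neither is specified in the text.

```python
def normalize_scores(scores):
    """Min-max scale a list of anomaly scores to [0, 1], as in Formula (6)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case (assumption): all scores equal, map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def classify(scores, threshold):
    """Return True (abnormal) where the normalized score exceeds the threshold."""
    return [s > threshold for s in normalize_scores(scores)]

raw = [2.0, 4.0, 10.0, 3.0]              # hypothetical raw scores A(x_test)
normalized = normalize_scores(raw)       # [0.0, 0.25, 1.0, 0.125]
labels = classify(raw, threshold=0.5)    # hypothetical threshold
```

Because the minimum and maximum are taken over the whole test set, the normalized score of one image depends on every other image scored with it.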

2.3 Training Strategy

The loss function of the model combines three terms: the Adversarial Loss, the Contextual Loss, and the Latent Loss.

In order to maximize the reconstruction ability of the model during the training phase and ensure that the generator reconstructs the normal image \({x}_{real}\) as realistically as possible, the discriminator should distinguish the normal image \({x}_{real}\) from the reconstructed image \({x}_{fake}\) generated by the generator as accurately as possible. We use cross entropy to define the Adversarial Loss, as shown in Formula (7).

$${\mathcal{L}}_{adv}=\mathrm{log}\left(D\left({x}_{real}\right)\right)+\mathrm{log}\left(1-D\left({x}_{fake}\right)\right)$$
(7)
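
To illustrate how Formula (7) behaves, the sketch below evaluates it on hypothetical discriminator outputs in (0, 1). Averaging over the batch and the small constant `eps` that guards against `log(0)` are implementation assumptions, not details given in the text.

```python
import math

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch mean of log D(x_real) + log(1 - D(x_fake)), following Formula (7).

    d_real, d_fake: discriminator outputs in (0, 1) for real and
    reconstructed images. eps is an assumed numerical safeguard.
    """
    terms = [math.log(r + eps) + math.log(1.0 - f + eps)
             for r, f in zip(d_real, d_fake)]
    return sum(terms) / len(terms)

# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
confused = adversarial_loss([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates real from reconstructed images well.
confident = adversarial_loss([0.99, 0.98], [0.01, 0.02])
```

The value is at most 0 and approaches 0 as the discriminator becomes more accurate, while a discriminator that outputs 0.5 everywhere yields about 2 log 0.5.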

In order to make the reconstructed image obey the data distribution of normal images as closely as possible, and to keep the reconstructed image \({x}_{fake}\) contextually consistent with the input image, the model defines the Contextual Loss (a reconstruction loss) as the SmoothL1 Loss [28] between the normal image and the reconstructed image, as shown in Formula (8):

$${\mathcal{L}}_{con}={S}_{L1}\left({x}_{real}-{x}_{fake}\right)$$
(8)

where \({S}_{L1}\) represents the SmoothL1 Loss function, defined as

$${S}_{L1}\left(x\right)=\left\{\begin{array}{ll}0.5{x}^{2}, & \left|x\right|<1\\ \left|x\right|-0.5, & \left|x\right|\ge 1\end{array}\right.$$
(9)
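
A minimal sketch of Formula (9) applied element-wise is given below. Reducing the element-wise values by their mean is an assumption (it matches the default reduction of PyTorch's `SmoothL1Loss`, but the text does not state which reduction is used), and the function names are illustrative.

```python
def smooth_l1(x):
    """Element-wise SmoothL1 from Formula (9): quadratic near 0, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def smooth_l1_loss(real, fake):
    """Mean SmoothL1 over element-wise differences (mean reduction is assumed)."""
    diffs = [r - f for r, f in zip(real, fake)]
    return sum(smooth_l1(d) for d in diffs) / len(diffs)

# Differences are -0.5, 0.0 and 3.0, giving 0.125, 0.0 and 2.5 per element.
loss = smooth_l1_loss([0.0, 1.0, 3.0], [0.5, 1.0, 0.0])
```

The two branches meet at \(|x| = 1\), where both evaluate to 0.5, so large residuals are penalized linearly rather than quadratically.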

In order to pay more attention to the differences between the reconstructed image \({x}_{fake}\) generated by the generator and the normal image \({x}_{real}\) in the latent space, the model uses the last convolution layer of discriminator to extract the bottleneck features \(f\left({x}_{real}\right)\) and \(f\left({x}_{fake}\right)\), and takes the SmoothL1 loss between the two bottleneck features as the Latent Loss. The specific expression is shown in Formula (10).

$${\mathcal{L}}_{lat}={S}_{L1}\left(f\left({x}_{real}\right)-f\left({x}_{fake}\right)\right)$$
(10)

In the training phase, the model is trained with the adversarial mechanism. First, we fix the parameters of the generator and optimize the discriminator by maximizing the Adversarial Loss \({\mathcal{L}}_{adv}\). The objective function is

$${\mathcal{L}}_{D-Net}=\underset{D}{\mathrm{max}}{\mathcal{L}}_{adv}$$
(11)

Then, we fix the parameters of the discriminator and optimize the generator with the objective function

$${\mathcal{L}}_{G-Net}=\underset{G}{\mathrm{min}}({w}_{adv}{\mathcal{L}}_{adv}+{w}_{con}{\mathcal{L}}_{con}+{w}_{lat}{\mathcal{L}}_{lat})$$
(12)

where \({w}_{adv}\), \({w}_{con}\) and \({w}_{lat}\) are the weight parameters of \({\mathcal{L}}_{adv}\), \({\mathcal{L}}_{con}\) and \({\mathcal{L}}_{lat}\).
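
As a small worked example of Formula (12), the weighted sum can be written as below. The default weights are the values reported in Sect. 3.2 (\({w}_{adv}=1\), \({w}_{con}=5\), \({w}_{lat}=1\)); the individual loss values passed in are hypothetical numbers chosen only to show the arithmetic.

```python
def generator_objective(l_adv, l_con, l_lat, w_adv=1.0, w_con=5.0, w_lat=1.0):
    """Weighted sum of the three losses that the generator minimizes, Formula (12)."""
    return w_adv * l_adv + w_con * l_con + w_lat * l_lat

# Hypothetical per-batch loss values.
total = generator_objective(l_adv=-0.5, l_con=0.25, l_lat=0.5)
# -0.5 + 5 * 0.25 + 0.5 = 1.25
```

With these weights, a change in the Contextual Loss moves the objective five times as much as the same change in either of the other two terms.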

3 Experiments

All experiments in this paper are implemented in the PyTorch 1.1.0 framework on an Intel Xeon E5-2664 v4 Gold CPU and an NVIDIA Tesla P100 GPU.

3.1 Dataset

To evaluate the proposed anomaly detection model, we conducted experiments on the CIFAR-10 [25] dataset.

The CIFAR-10 dataset consists of 60,000 color images of size 32 × 32, divided into 10 classes of 6,000 images each. In the anomaly detection experiments on the CIFAR-10 dataset, we regard one class as the abnormal class and the other 9 classes as normal classes. Specifically, we use 45,000 images from the 9 normal classes as normal samples for model training, and the remaining 9,000 normal images together with the 6,000 images of the abnormal class as test samples for model testing.
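
The split sizes above follow from simple arithmetic, sketched here for one choice of abnormal class. The constant and function names are illustrative; only the counts come from the text.

```python
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6000
TRAIN_NORMAL = 45000  # normal images used for training, as stated in the text

def split_sizes(num_classes=NUM_CLASSES, per_class=IMAGES_PER_CLASS,
                train_normal=TRAIN_NORMAL):
    """Return (train, test_normal, test_abnormal) counts for one abnormal class."""
    normal_total = (num_classes - 1) * per_class   # 9 normal classes
    test_normal = normal_total - train_normal      # normal images left for testing
    test_abnormal = per_class                      # the whole abnormal class
    return train_normal, test_normal, test_abnormal

train, test_normal, test_abnormal = split_sizes()
abnormal_fraction = test_abnormal / (test_normal + test_abnormal)
```

Under this split each test set holds 15,000 images, 40% of which are abnormal.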

3.2 Implementation Details

Model Parameters Setting.

The model is set to be trained for 15 epochs and optimized by Adam [29] with the initial learning rate \(lr=0.0002\), with a lambda decay, and momentums \({\beta }_{1}=0.5\), \({\beta }_{2}=0.999\). The weighting parameters of loss function are set to \({w}_{adv}=1\), \({w}_{con}=5\), \({w}_{lat}=1\). The weighting parameter \(\lambda \) of the distance metric is empirically chosen as 0.9.

Metrics.

In this paper, AUROC and AUPRC are used to assess the performance of our method. AUROC is the area under the ROC (Receiver Operating Characteristic) curve, which plots the TPR (true positive rate) against the FPR (false positive rate) as the threshold varies. AUPRC is the area under the PR (Precision-Recall) curve, which plots Precision against Recall as the threshold varies.
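
Both metrics can be computed directly from labels and anomaly scores, as the following sketch shows. It assumes that abnormal samples are the positive class (label 1), uses the pairwise-ranking form of AUROC, and estimates AUPRC by average precision, which is one common step-wise estimator of the area under the PR curve; the text does not say which estimator was used. The pairwise loop is quadratic and intended only for small examples.

```python
def auroc(labels, scores):
    """Probability that an abnormal sample outscores a normal one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the PR curve; assumes no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank   # precision at the rank of each positive
    return ap / total_pos

labels = [1, 1, 0, 1, 0, 0]               # 1 = abnormal (assumed positive class)
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # hypothetical normalized anomaly scores
roc = auroc(labels, scores)               # 8 of 9 pairs ranked correctly
pr = average_precision(labels, scores)    # (1 + 1 + 0.75) / 3
```

Neither metric depends on the particular threshold chosen for \({A}^{\prime}\left({x}_{test}\right)\), since both summarize performance over all thresholds.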

Results and Discussion.

To demonstrate the performance of our method, we compare our method with Skip-GANomaly, GANomaly and f-AnoGAN on the CIFAR-10 dataset. The parameter settings of Skip-GANomaly and GANomaly are consistent with our experimental parameter settings in this paper, and the parameters of f-AnoGAN are the same as the settings in [22].

Table 1 and Fig. 3 show the AUROC results on the CIFAR-10 dataset, and Table 2 and Fig. 4 show the AUPRC results. They show that the proposed method outperforms the other methods on every anomaly class of the CIFAR-10 dataset under both the AUROC and AUPRC indicators. Moreover, the proposed method performs best on three classes of objects, airplane, frog, and ship, with almost 100% accuracy for anomaly detection. For the most challenging abnormal classes in the CIFAR-10 dataset, bird and horse, the best AUROC values of the other methods are 0.658 and 0.672, and the best AUPRC values are 0.558 and 0.501, respectively. The proposed method reaches AUROC values of 0.876 and 0.866 on bird and horse, absolute gains of 21.8 and 19.4 percentage points, and AUPRC values of 0.818 and 0.775, absolute gains of 26.0 and 27.4 percentage points.

Figure 5 shows the histograms of anomaly scores of Skip-GANomaly and the proposed model on the CIFAR-10 dataset when the bird class is treated as abnormal. It can be seen that, compared with Skip-GANomaly, our method separates normal from abnormal samples more clearly and achieves a good anomaly detection effect. With bird as the abnormal class, Fig. 6 illustrates the reconstruction effect of our method on objects of the CIFAR-10 dataset in the test phase.

In conclusion, the anomaly detection performance of the method proposed in this paper on the CIFAR-10 dataset is better than the previous related methods.

Table 1. AUROC results for CIFAR-10 dataset
Table 2. AUPRC results for CIFAR-10 dataset
Fig. 3. Histogram of AUROC results for CIFAR-10 dataset

Fig. 4. Histogram of AUPRC results for CIFAR-10 dataset

Fig. 5. Histograms of anomaly scores for the test data when bird is used as abnormal class.

Fig. 6. The reconstruction effect of our method on objects of the CIFAR-10 dataset in the test phase.

4 Conclusion

In this paper, we introduce a Fully-Nested Encoder-decoder Framework for general anomaly detection within an adversarial training scheme. The generator in the proposed model is a novel fully residual encoder-decoder network, which can select suitable network depths for different datasets on its own through four-scale nested models. The residual structure is added to the generator to reduce the risk of overfitting and improve the feature expression ability. We have conducted multiple comparative experiments on the CIFAR-10 dataset, and the results show that the proposed method greatly improves on previous related work.