Latent feature reconstruction for unsupervised anomaly detection

Anomalies (or outliers) are a minority of data items that differ markedly from the majority (inliers) of a dataset in a certain aspect. Unsupervised anomaly detection (UAD) is an important but not yet extensively studied research topic. Recent deep learning based methods exploit the reconstruction gap between inliers and outliers to discriminate them. However, it is observed that the reconstruction gap often decreases rapidly as training proceeds, and there is no reasonable way to decide when to stop training. To support effective UAD, we propose a new UAD framework by introducing a Latent Feature Reconstruction (LFR) layer that can be applied to recent UAD methods. The LFR layer acts as a regularizer to constrain the latent features in a low-rank subspace from which inliers can be reconstructed well while outliers cannot. We develop two new UAD methods by implementing the proposed framework with an autoencoder architecture and a geometric transformation scheme. Experiments on five benchmarks show that our proposed methods can achieve state-of-the-art performance in most cases.


Introduction
Anomaly detection (AD), sometimes also referred to as outlier detection or novelty detection [1], aims to identify a relatively small number of special data points (outliers) in a noisy dataset that deviate from the majority (inliers) of the dataset.
However, in the UAD setting, it is observed that autoencoders (AEs) and convolutional autoencoders (CAEs) usually reconstruct outliers as well as inliers, and the reconstruction gap between inliers and outliers decreases as training proceeds. To illustrate this phenomenon, we give an example in Fig. 1, which shows the inlier and outlier reconstruction errors of a CAE trained on Fashion-MNIST. When the number of epochs reaches 1000, the two curves coincide, which means that the trained model can no longer discriminate outliers from inliers.
Though some existing works have tried to handle this problem to some extent, they have their own limitations. For example, RSRAE [14] proposes a robust subspace recovery (RSR) layer for AEs to regularize inliers into a low-rank subspace, from which the outliers stay far away. However, RSRAE is designed specifically for AEs, and AEs are ineffective in handling high-dimensional and complex datasets like CIFAR10. To perform semi-supervised anomaly detection (SSAD) over complex datasets, GEOM [9] employs ResNet for powerful feature representation and geometric transformations for data augmentation. E3Outlier extends the transformations to RSRAE for UAD, and it can retard the reduction of the loss gap between inliers and outliers. But both of them are applicable only to images, and the additional transformations incur much computational cost in training and testing.
In this paper, we propose a new and more general framework for UAD by introducing a latent feature reconstruction (LFR) layer as a plug-in module that can be embedded in the two types of existing UAD methods, namely autoencoder based methods (e.g. RSRAE) and geometric transformation based methods (e.g. GEOM and E3Outlier), to effectively handle the above-mentioned problem. In the training phase, the LFR layer linearly maps the latent features into a low-dimensional subspace that keeps the significant information, and from which the latent feature space can be reconstructed so that for inliers the reconstructed features are close to the original features while for outliers they are not. We implement the proposed framework based on both AE and geometric transformations, and consequently develop two new UAD methods, which are called AE-LFR and GT-LFR, respectively. We also propose a novel yet simple anomaly scoring strategy by connecting the LFR layer and the backbone network in testing. We show that this strategy yields a large gap in anomaly scores between inliers and outliers.
Fig. 1 Averaged inlier and outlier reconstruction errors of a CAE trained on Fashion-MNIST. Inliers (green) are sampled from class "T-shirt", and outliers (red) are sampled from the remaining classes. The ratio of outliers to inliers is 0.1. As training goes on, the error gap between inliers and outliers steadily decreases, and the two curves coincide at around the 1000th epoch.
In summary, our contributions include the LFR framework, the two methods AE-LFR and GT-LFR built on it, and the anomaly scoring strategy described above.
The work most related to ours is the RSRAE method [14]. It should be pointed out that our LFR framework differs from the RSRAE method in at least three aspects: (1) Our LFR framework employs different structures for training and testing, and in training the LFR layer is separated from the backbone network, while RSRAE has a similar structure for both training and testing, which is like that in our testing phase. (2) Our LFR framework is more general and can serve as a plug-in component applied to both AE based methods and geometric transformation (GT) based methods, while RSRAE is only a typical AE based method. (3) Our methods clearly outperform RSRAE in most cases.
The rest of this paper is organized as follows: Section 2 reviews the related works. Section 3 presents our methods in detail. Section 4 presents the performance evaluation. Section 5 concludes this paper.

Related work
Most traditional works on anomaly (or novelty) detection assume that the training set consists of only normal data (inliers), so they treat the problem as one-class classification and propose SVM based methods [15], principal component analysis (PCA) based methods [16,17], etc. They can be subsumed under supervised anomaly detection (SAD in short).
Recently, more and more deep neural network based methods have been introduced for anomaly detection by exploiting their powerful representations of high-dimensional data (e.g. images and videos). A detailed review of deep learning for anomaly detection can be found in [18]. The majority of such existing works treat anomaly detection as a semi-supervised learning problem, that is, semi-supervised anomaly detection (SSAD in short). Those SSAD methods mainly fall into four types: reconstruction-based [5,6], GAN-based [7,8], discrimination-based [9,10], and density-based [11] methods.
Unsupervised anomaly detection (UAD in short) is a more challenging problem that has not yet been extensively studied, where the challenge lies in that no inlier or outlier labels are provided in the training data. Up to now, only a few deep learning-based methods have been proposed for UAD, which can be grouped into two categories: autoencoder (AE) based and geometric transformation (GT) based methods. In [18], they are also called reconstruction based and discrimination based methods, respectively.
Among the AE based methods, [5] proposes an autoencoder-based method that identifies the outliers by maximizing the reconstruction loss difference between inliers and outliers with a specifically designed loss function. [6] utilizes robust principal component analysis (RPCA), which decomposes the unlabelled input data matrix into a low-rank part and a sparse part, to separate the inliers and outliers. [19] jointly optimizes an AE and an estimation network in an end-to-end manner, where the estimation network is used to fit a Gaussian mixture model. Inspired by robust subspace recovery (RSR), the RSRAE method [14] introduces an RSR layer within an AE to cope with the situation where a large portion of data points are corrupted, by exploiting the latent low-rank subspace structure of the training data. UniAD [20] revisits the formulations of the fully-connected layer, the convolutional layer, and the attention layer, and confirms the important role of query embedding in distinguishing normal and abnormal samples. It first proposes a layer-wise query decoder to model the normal distribution, and introduces a feature jittering strategy that urges the model to recover the correct message even with noisy input.
Up to now, the only GT based method is E3Outlier [10], which builds on GEOM [9] by changing the original pre-defined self-supervised task in GEOM, extending the regular affine transformations to irregular affine transformations and patch re-arranging. GEOM is an SSAD method, which first applies different geometric transformations to normal training images, and then trains classification models for a pre-defined task (predicting the orientations of rotated images) on the augmented data. At the evaluation phase, the anomaly score of an instance is defined as the average of the softmax classification scores of all the corresponding transformed images.
In addition to the advances in model structures and algorithms, some recent works introduce additional auxiliary information to improve the performance of anomaly detection. FCDD [21] collects anomalous samples from 80 Million Tiny Images and ImageNet, and trains a Fully Convolutional Data Description (FCDD) model, which maps normal samples near the center c of the normal distribution and anomalous samples away from c. Salehi et al. [22] perform distillation on an expert network pretrained on ImageNet, and detect and localize anomalies using the discrepancy between the intermediate activation values of the expert and cloner networks. DRAEM [23] takes auxiliary images as anomaly texture sources to generate anomalous images, then learns a joint representation of an anomalous image and its anomaly-free reconstruction, while learning a decision boundary between normal and anomalous examples. Elite [24] even introduces some labeled examples as the validation set and leverages the gradient of the validation loss to predict whether a training sample is abnormal.

Problem statement
Given an unlabeled data set $X = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^D$ and $N$ is the size of $X$, $X$ implicitly consists of a subset of inliers $X_{in}$ and a subset of outliers $X_{out}$. Data in $X_{in}$ and data in $X_{out}$ are sampled or generated from two completely different distributions (or distribution mixtures). The goal of UAD is to build a detector based on $X$ such that for any data point $x_i \in X$, it can determine whether $x_i$ belongs to $X_{in}$ or to $X_{out}$.
In what follows, we first introduce the LFR framework for UAD, then present two implementations of the LFR framework based on autoencoder and geometric transformation, respectively.These two implementations correspond to two new UAD methods, which are called AE-LFR and GT-LFR.

The LFR framework
Recent deep learning based methods for UAD learn the feature representations of training data points mainly by a generic feature learning method such as an autoencoder or ResNet [25]. They pursue an underlying representation to distinguish anomalies from normal data. The process is formalized in (1), where $\phi(\cdot; \theta)$ is the feature extractor that maps $x_i \in \mathbb{R}^D$ to its latent feature $z_i \in \mathbb{R}^d$, $\psi(\cdot; \omega)$ is a surrogate task that takes $z_i$ as input and learns a critical latent feature space for the input, $L_{ori}(\cdot)$ is a loss function depending on the backbone applied, and $f(\cdot)$ is an anomaly scoring function that measures the degree of abnormality $s_x$. Outliers are usually identified by choosing a proper threshold for $s_x$.
In UAD, the model is optimized for outliers and inliers simultaneously.Though the property "inlier priority" [10] indicates that the model gives priority to reducing the inliers' loss, the loss gap between inliers and outliers will decrease after enough training epochs, as shown in Fig. 1.Usually, the anomaly score is just the loss or a variant of the loss, so the anomaly score gap will decrease as well.To keep the score gap between inliers and outliers large, we introduce a new and general framework for UAD, where the core component is a latent feature reconstruction (LFR) layer embedded in the training and testing phases.We call this new framework LFR, which is illustrated in Fig. 2.
Here, the LFR layer is a plug-in component that can be embedded in existing methods without changing their backbone networks; it regularizes the latent features through back-propagation. The LFR layer takes the latent feature $z_i$ as input and outputs its reconstruction $\tilde{z}_i$, which can also be used as the input of $\psi(\cdot; \omega)$. In the testing phase, we simply embed the LFR layer into the backbone.
In the training phase, we regulate the learning of $z_i$ by the LFR layer. Inspired by RSRAE [14], we introduce the robust subspace recovery (RSR) loss to the LFR layer. Specifically, the LFR layer seeks a low-rank latent feature subspace for inliers. It applies a linear transformation $A \in \mathbb{R}^{k \times d}$ that maps the original latent feature $z_i$ into a $k$-dimensional space, from which we reconstruct it in the original latent feature space by the transpose of $A$. In the RSR loss (2), $A^T$ is the transpose of $A$, $I_d$ denotes the identity matrix, and $\|\cdot\|_F$ denotes the Frobenius norm. As demonstrated in [14], $A^T A$ is close to an orthogonal projector, and the loss will guide the latent features to lie in a low-rank subspace. The total loss of our framework is the sum of the original loss (the backbone loss) and the RSR loss in (2), i.e.,

$$L(\theta, \omega, A) = L_{ori}(\psi(z; \omega)) + L_{RSR}(\theta, A). \quad (3)$$

In (1), $s_x = f(x, z, \phi_{\theta^*}, \psi_{\omega^*})$ is the original anomaly score. Though it can also be used as the scoring function of our framework, it cannot make full use of the learnt LFR layer.
Fig. 2 The framework of LFR. The upper subfigure shows the structure for training, and the lower subfigure shows the structure for testing.
In the testing phase, we embed the LFR layer into the original backbone, as shown in the lower subfigure of Fig. 2, and thus obtain a new scoring function. The intuition behind this function is as follows: with the loss function of (2), the reconstruction $\tilde{z} = A^T A z$ of an inlier is close to its original latent feature $z$, whereas that of an outlier is not. Therefore, we replace $z$ with $\tilde{z}$ in the scoring function, i.e., $s_x = f(x, \tilde{z}, \phi_{\theta^*}, \psi_{\omega^*})$. The anomaly score gap between inliers and outliers is thereby enlarged, which helps discriminate outliers from inliers.
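The projection and the loss described in this section can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: since the exact form of (2) is not shown above, it assumes an RSRAE-style loss $\lambda_1 \sum_i \|z_i - A^T A z_i\|_2 + \lambda_2 \|A A^T - I\|_F^2$, with a $k \times k$ identity because $A$ is $k \times d$. The helper names `lfr_reconstruct` and `rsr_loss` and the toy matrix are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def lfr_reconstruct(A, z):
    """z_tilde = A^T (A z): project z to k dimensions, then map it back to d dimensions."""
    return matvec(transpose(A), matvec(A, z))

def rsr_loss(A, Z, lam1, lam2):
    """Assumed RSRAE-style loss (not confirmed by the text above):
    lam1 * sum_i ||z_i - A^T A z_i||_2 + lam2 * ||A A^T - I_k||_F^2."""
    recon = sum(math.dist(z, lfr_reconstruct(A, z)) for z in Z)
    k = len(A)
    ortho = sum(
        (sum(a * b for a, b in zip(A[i], A[j])) - (1.0 if i == j else 0.0)) ** 2
        for i in range(k)
        for j in range(k)
    )
    return lam1 * recon + lam2 * ortho

# Toy setting: d = 3, k = 1, and the subspace is the first coordinate axis.
A = [[1.0, 0.0, 0.0]]
inlier = [2.0, 0.0, 0.0]    # lies in the subspace, so it is reconstructed exactly
outlier = [0.0, 3.0, 4.0]   # orthogonal to the subspace, so it collapses to zero

inlier_gap = math.dist(inlier, lfr_reconstruct(A, inlier))
outlier_gap = math.dist(outlier, lfr_reconstruct(A, outlier))
loss = rsr_loss(A, [inlier, outlier], lam1=2.0, lam2=0.1)
```

In this toy case the inlier is recovered unchanged while the outlier is far from its reconstruction, which is the gap the test-time scoring function relies on.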

The AE-LFR method
We first implement the LFR framework by applying it to AE based UAD methods, which yields our first new method, AE-LFR. That is, we plug the LFR layer into any AE based UAD method. As an AE is used as the backbone, $\phi_e(\cdot)$ and $\psi_d(\cdot)$ are the encoder and the decoder, respectively, trained with the reconstruction loss $L_{recon}(\theta_e, \omega_d)$. The encoder takes $x_i$ as input and outputs the hidden feature $z_i$, and the decoder then maps $z_i$ to the reconstruction $\hat{x}_i$ of $x_i$. As a plug-in layer, the LFR layer can be directly applied to any encoder-decoder architecture, as illustrated in Fig. 2. In addition to the reconstruction loss of the AE, the RSR loss in (2) is also used as the supervision signal, so the loss of AE-LFR is the sum of the two. Then, by replacing $z$ with $\tilde{z} = A^T A z$ in the scoring function $s_x$ in (5), we obtain the anomaly score function of AE-LFR.
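As a rough illustration of the AE-LFR scoring step, the sketch below compares a plain AE score with a score computed after inserting the LFR layer between encoder and decoder. It assumes the score is the Euclidean reconstruction error, which the text above does not state explicitly. The identity `encoder` and `decoder` stand in for an AE that has overfitted to every sample; they and the function names are hypothetical.

```python
import math

def project_back(A, z):
    """z_tilde = A^T (A z) for a k x d matrix A given as a list of rows."""
    low = [sum(a * x for a, x in zip(row, z)) for row in A]  # A z, length k
    return [sum(A[i][j] * low[i] for i in range(len(A))) for j in range(len(z))]

def ae_score(x, encoder, decoder):
    """Plain AE score (assumed): reconstruction error ||x - decoder(encoder(x))||_2."""
    return math.dist(x, decoder(encoder(x)))

def ae_lfr_score(x, encoder, decoder, A):
    """AE-LFR score (assumed): same error, but the decoder sees A^T A z instead of z."""
    return math.dist(x, decoder(project_back(A, encoder(x))))

def identity(v):
    """Stand-in encoder/decoder: an AE that reconstructs every input perfectly."""
    return list(v)

encoder, decoder = identity, identity
A = [[1.0, 0.0, 0.0]]        # k = 1, d = 3: toy inlier subspace
inlier = [2.0, 0.0, 0.0]
outlier = [0.0, 3.0, 4.0]

plain = (ae_score(inlier, encoder, decoder), ae_score(outlier, encoder, decoder))
with_lfr = (ae_lfr_score(inlier, encoder, decoder, A),
            ae_lfr_score(outlier, encoder, decoder, A))
```

With the overfitted stand-in AE, the plain scores of the inlier and the outlier are identical, whereas the LFR variant still separates them because only the inlier survives the low-rank projection.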

The GT-LFR method
Here, we apply our framework to geometric transformation based methods. Concretely, we take GEOM as an example, and obtain our second new method, GT-LFR. GEOM [9] first applies a set of geometric transformations $\{T_m\}_{m=1}^{M}$, including rotations, reflections, and translations, to the training images. Then, it sets up a self-supervised task that trains a multi-class classification model on the augmented data to predict the transformations applied. In the evaluation phase, the $M$ given transformations are applied to an image, and its anomaly score is the average of all probability outputs of the learned classification model over the $M$ transformed images. Formally, this is expressed in (8), where $\phi_f(T_m(\cdot); \theta_f)$ is a deep classification model such as ResNet [25] or Wide ResNet (WRN) [26], which extracts the latent representations of input images augmented by the pre-defined geometric transformation $T_m$, $\psi_g(\cdot; \theta_g)$ is a multi-class classifier, and $CE$ denotes the cross-entropy loss.
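The GEOM-style scoring described above can be sketched as follows. This is a toy illustration under stated assumptions, not GEOM's actual code: the two list "transformations", the `toy_classifier`, and the convention of returning one minus the mean probability (so that larger means more anomalous) are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def geom_anomaly_score(x, transforms, classifier):
    """1 - mean_m p(m | T_m(x)): small when every applied transformation is recognised."""
    probs = [softmax(classifier(t(x)))[m] for m, t in enumerate(transforms)]
    return 1.0 - sum(probs) / len(probs)

def keep(v):
    return list(v)

def reverse(v):
    return list(v)[::-1]

def toy_classifier(v):
    """Hypothetical stand-in for the trained network: logits for (keep, reverse)."""
    diff = v[-1] - v[0]
    return [diff, -diff]

transforms = [keep, reverse]
inlier_score = geom_anomaly_score([1.0, 5.0], transforms, toy_classifier)
outlier_score = geom_anomaly_score([3.0, 3.0], transforms, toy_classifier)
```

The increasing sequence plays the role of an inlier whose orientation the classifier can tell apart, so its score is close to zero; the flat sequence gives the classifier no cue and receives a score of 0.5.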
Here, the LFR layer also regularizes the latent feature learning of $z_i^{T_m}$. But unlike AE-LFR, there are $M$ distinct subsets of the augmented image set, for which it is hard to find a single low-rank latent feature subspace for the inliers. To tackle this problem, we find a separate feature subspace for each transformation. Thus, we assign a linear matrix $A_{(m)} \in \mathbb{R}^{k \times d}$ to each transformation $T_m$ to accommodate the corresponding feature subspace, and the loss function of GT-LFR is formulated accordingly. By replacing the latent feature of each transformed image $z^{T_m}$ with $A_{(m)}^T A_{(m)} z^{T_m}$ in the scoring function $s_x$ of (8), we obtain the anomaly score function of GT-LFR.

Performance evaluation
For fair comparison, we process the datasets by following the settings of previous UAD methods [5,6,10,14,32]. For example, we follow the settings in [14] to handle Caltech101: taking 11 classes of Caltech101 and randomly choosing 100 images per class. Each training set with anomalies is constructed as follows: examples sampled from a certain class serve as inliers, and samples drawn from each of the other classes are combined as outliers. The ratio $c$ of outliers to inliers is set to each of {0.1, 0.3, 0.5, 0.7, 0.9}. Note that in UAD, all inlier/outlier labels are unknown to the model in the training phase. For a given ratio $c$, we first evaluate the performance of taking each class as inliers, then compute the average of all classes' results as the final performance.
For each class, we run 5 trials with different random seeds and report the averaged result. The Area under the Receiver Operating Characteristic curve (AUROC) and the Area under the Precision-Recall curve (AUPR) are used as performance metrics. We treat the outliers as "positive" in evaluation.
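The protocol above can be sketched end to end in a few lines. This is a minimal sketch under stated assumptions: outliers are drawn as evenly as possible from the other classes (the text only says "some samples from each"), the toy data and the stand-in scorer are invented for illustration, and only AUROC is computed (AUPR is omitted). The names `build_uad_set` and `auroc` are hypothetical.

```python
import random
from statistics import mean

def build_uad_set(samples_by_class, inlier_class, ratio, rng):
    """Mix one class (inliers, label 0) with outliers (label 1) drawn as evenly as
    possible from the other classes. Labels are kept for evaluation only."""
    inliers = list(samples_by_class[inlier_class])
    others = [c for c in samples_by_class if c != inlier_class]
    n_out = round(ratio * len(inliers))
    base, extra = divmod(n_out, len(others))
    data = [(x, 0) for x in inliers]
    for idx, c in enumerate(others):
        take = base + (1 if idx < extra else 0)
        data += [(x, 1) for x in rng.sample(samples_by_class[c], take)]
    rng.shuffle(data)
    return data

def auroc(scores, labels):
    """Probability that a random outlier (positive) scores above a random inlier;
    ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 11 classes x 100 scalar samples whose value is simply the class id.
toy = {c: [float(c)] * 100 for c in range(11)}

results = []
for seed in range(5):                        # 5 trials with different random seeds
    data = build_uad_set(toy, inlier_class=0, ratio=0.1, rng=random.Random(seed))
    scores = [abs(x) for x, _ in data]       # stand-in scorer: distance from class 0
    labels = [y for _, y in data]
    results.append(auroc(scores, labels))
mean_auroc = mean(results)
```

For a full evaluation the loop would be repeated with every class taking the inlier role and for every ratio $c$, and the per-class results would then be averaged.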

Compared methods
We compare our methods with seven existing methods: AE/CAE [33], DRAE [5], RSRAE [14], GEOM [9], E3Outlier [10], LVAD [34], and Elite [24]. AE/CAE, DRAE, and RSRAE are AE-based methods, GEOM and E3Outlier are geometric transformation-based methods, and LVAD is density-based. Elite has two variants: Elite-AE is AE-based, while Elite-SVDD is discrimination-based. Although GEOM was originally proposed for SSAD, it can be extended to UAD. Among these methods, GEOM, E3Outlier, LVAD, and Elite can only handle image data. For RSRAE, LVAD, and Elite, we use the official code and follow the original settings. For the other methods, we utilize existing implementations and adapt them to the settings of the datasets used in our paper.

Implementation detail
We use the same autoencoder structure for the compared AE-based methods and our AE-LFR method. For the image datasets, the encoder in the AE consists of three convolutional layers and a fully connected layer with output channels (32, 64, 128, 256) and kernel sizes (5 × 5, 5 × 5, 3 × 3) in the convolutional layers; the output of the encoder is a 256-dimensional vector. The decoder has an inverse architecture of the encoder, and replaces the convolutional kernels with deconvolutional kernels. For AE-LFR, we set $k = 10$, $\lambda_1 = 2$, $\lambda_2 = 0.1$ in all experiments. The AE-based models are optimized with Adam using a learning rate of 0.00025 and a mini-batch size of 128 for 1000 epochs. The activation function is Tanh. All images are normalized into [−1, 1].
For the GT-based methods, GEOM and E3Outlier are implemented with a wide ResNet (WRN) with a widen factor of 4. Our GT-LFR method follows the settings of GEOM and uses its 72 transformations in self-supervised learning. We set $k = 20$, $\lambda_1 = 0.0002$, $\lambda_2 = 0.00001$ for GT-LFR in all experiments. As GT-based methods use powerful feature extractors and changes in the latent features have a significant impact on the downstream classification task, we reduce the values of $\lambda_1$ and $\lambda_2$.
Our methods are implemented in PyTorch and all experiments are conducted on 8 RTX 2080 Ti GPUs.
We can see that our methods achieve state-of-the-art performance in most cases, while DRAE and AE/CAE perform worse than the other methods because they overfit to outliers after being trained for 1000 epochs. Among the AE-based methods, our AE-LFR method performs best on the two text datasets 20News and Reuters, achieves competitive performance against RSRAE on most datasets, and outperforms RSRAE by 4% in averaged AUROC on FMNIST.
Among the GT-based methods, our GT-LFR method significantly outperforms the others on the two image datasets FMNIST and Caltech101, and is competitive with E3Outlier on CIFAR10. Though GT-LFR is based on GEOM, it performs considerably better than GEOM, which shows the effectiveness of our LFR framework. E3Outlier outperforms GEOM because it uses more geometric transformations. However, for the same reason E3Outlier consumes more computation than the others, while our proposed method needs just an additional matrix $A$, which incurs only a little additional computation cost, so it is much faster than E3Outlier. LVAD is generally better than the AE-based methods and worse than the GT-based methods, because the density estimation method is not robust on data with a complex distribution. Elite introduces some labeled samples as the validation set, which makes its performance insensitive to the proportion of anomalies. However, even though it uses labeled samples, it is inferior to our method when there are fewer anomalies, a setting that is also more consistent with the data distribution of real application scenarios. In summary, our proposed method achieves better or competitive performance at the cost of the additional parameters $A$ and an additional computation cost of $O(kd(k + d))$. As we adopt low-rank reconstruction for the latent features ($k \ll d$), the additional computation cost is approximately $O(kd^2)$.
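A quick arithmetic check makes the approximation in the last sentence concrete. The sketch below plugs in the AE-LFR settings reported earlier ($d = 256$, $k = 10$) purely as an illustration; treating $kd(k + d)$ as a literal operation count is an assumption, since the text only gives it in big-O form, and `lfr_extra_cost` is a hypothetical helper.

```python
def lfr_extra_cost(k, d):
    """Evaluate the k*d*(k + d) expression quoted above for the extra matrix A."""
    return k * d * (k + d)

d, k = 256, 10                   # latent size and subspace size used for AE-LFR
exact = lfr_extra_cost(k, d)     # k*d*(k + d) = k^2*d + k*d^2
approx = k * d * d               # dominant term k*d^2 when k << d
overhead = exact / approx        # equals (k + d) / d
```

Here the exact expression gives 680,960 against 655,360 for the dominant term, a difference of under 4%, which is why the cost is summarized as approximately $O(kd^2)$.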

Ablation study
Here, we consider different combination configurations of the loss function and the anomaly scoring function in our methods AE-LFR and GT-LFR, and get different variants of our methods.We then compare these variants with two baselines AE/CAE and GEOM respectively.
For convenience, we use subscripted notations for the loss functions ($L_A$, $L_B$) and the scoring functions ($S_A$, $S_B$). Note that the loss and scoring functions implicitly represent the model architecture. For example, $L_A$ means that the decoder takes $z$ as input, corresponding to the architecture of the training phase in Fig. 2, while $L_B$ means that the decoder is fed with $\tilde{z} = A^T A z$, corresponding to the architecture of the testing phase in Fig. 2. Thus, we can use $L_i S_j$ to represent a combination configuration of loss and scoring functions in the training and testing phases, where $i, j \in \{A, B\}$. For example, $L_A S_A$ indicates that the model is optimized by the loss function $L_A$ in the training phase and evaluated by the scoring function $S_A$ in the testing phase. So our methods can be denoted as $L_A S_B$. Meanwhile, the backbone model (AE/CAE) can be regarded as a degenerate model trained with only $L_{ori}(\psi(z; \omega))$, i.e., $L_{RSR}(\theta, A)$ is not used.
Table 1 presents the results of the performance comparison between the baseline AE/CAE and AE-LFR ($L_A S_B$) and its variants ($L_A S_A$, $L_B S_A$ and $L_B S_B$). From Table 1, $L_B S_A$ is better than $L_B S_B$. In $L_B S_A$, the input of the decoder is the output of the LFR layer in training. The learning goal of the LFR layer is to perfectly reconstruct the latent features of inliers, instead of outliers. So even if the autoencoder overfits the outliers after training, it is still hard for the autoencoder to recover the outliers when the LFR layer is removed.
Table 2 presents the results of the GT-based variants. As it is difficult for the models to converge when $L_B$ is applied, here we report only $L_A S_A$ and $L_A S_B$. We can see that $L_A S_B$ (GT-LFR) performs better than $L_A S_A$.
Figure 5 shows the inlier/outlier anomaly score histograms of AE/CAE, $L_A S_A$ and $L_A S_B$ (AE-LFR) on class sneaker of FMNIST, class 2 of Caltech101 and class deer of CIFAR10. We can see that with AE-LFR, most inliers have smaller anomaly scores while most outliers have larger ones. In contrast, we see quite different results with AE/CAE. This conforms to our expectation: our LFR layer reconstructs inliers much better than outliers. Figure 6 shows the anomaly score histograms of the GT-based variants, where we can see patterns similar to those of the AE-based variants in Fig. 5.
Figure 7 shows how the anomaly score changes with the number of training epochs in four methods.As expected, our methods can still keep a large gap between the anomaly scores of inliers and outliers as the number of training epochs increases.However, for AE/CAE and GEOM, the anomaly score gap decreases rapidly with the increase of training epochs.This explains the good performance of our methods.

Conclusion
In this paper, we introduce a novel UAD framework with a latent feature reconstruction (LFR) layer and a new anomaly scoring function. The LFR layer is used as a plug-in component to regularize the latent features of samples to a low-rank subspace so that the inliers can be perfectly reconstructed while the outliers cannot. We implement the proposed framework by embedding the LFR layer into two major types of existing UAD methods, AE based methods and GT based methods, thereby deriving two new UAD methods called AE-LFR and GT-LFR. Extensive experiments on five datasets show that the proposed methods outperform the existing methods in most cases. As for future work, on the one hand, we plan to apply our methods to more datasets, especially videos. On the other hand, we plan to extend the proposed methods to SAD and SSAD tasks.

Table 1
Comparison among AE/CAE and our AE-based variants ($L_A S_A$, $L_B S_A$ and $L_B S_B$). $L_B S_A$ is better than $L_B S_B$; in $L_B S_A$, the input of the decoder is the output of the LFR layer in training.

Table 2
Comparison among GEOM and our GT-based variants