LatentOut: an unsupervised deep anomaly detection approach exploiting latent space distribution

Anomaly detection methods exploiting autoencoders (AE) have shown good performance. Unfortunately, deep non-linear architectures are able to perform strong dimensionality reduction while keeping the reconstruction error low, thus worsening the outlier detection performance of AEs. To alleviate this problem, some authors have recently proposed to exploit Variational autoencoders (VAE) and bidirectional Generative Adversarial Networks (GAN), which arise as variants of standard AEs designed for generative purposes, both enforcing an organization of the latent space that guarantees continuity. However, these architectures share with standard AEs the problem that they generalize so well that they can also reconstruct anomalies well. In this work we argue that the approach of selecting the worst reconstructed examples as anomalies is too simplistic if a continuous latent space autoencoder-based architecture is employed. We show that outliers tend to lie in the sparsest regions of the combined latent/error space and propose the VAEOut and LatentOut unsupervised anomaly detection algorithms, which identify outliers by performing density estimation in this augmented feature space. The proposed approach shows considerable improvements in detection performance over the standard approach based on the reconstruction error.


Introduction
Outlier detection is a fundamental and widely applicable discovery problem. Outliers can arise for many reasons, such as mechanical faults, fraudulent behavior, human errors, instrument errors, or simply natural deviations in populations. Generally speaking, the problem of outlier detection consists in isolating samples suspected of not being generated by the same mechanisms as the rest of the data. Approaches to outlier detection can be classified as supervised, semi-supervised, and unsupervised (Chandola et al., 2009; Aggarwal, 2013). Supervised methods take as input data labeled as normal and abnormal and build a classifier. The challenge there is posed by the fact that abnormal data form a rare class. Semi-supervised methods, also called one-class classifiers or domain description techniques, take as input only normal examples and use them to identify anomalies. Unsupervised methods detect outliers in an input dataset by assigning a score or anomaly degree to each object. Several statistical, data mining and machine learning approaches have been proposed to detect outliers, namely, statistical-based (Davies & Gather, 1993; Barnett & Lewis, 1994), distance-based (Knorr et al., 2000; Angiulli & Pizzuti, 2002, 2005; Angiulli et al., 2006; Angiulli & Fassetti, 2009), density-based (Breunig et al., 2000; Jin et al., 2001), reverse nearest neighbor-based (Hautamäki et al., 2004; Radovanović et al., 2015; Angiulli, 2017, 2018, 2020), isolation-based (Liu et al., 2012), angle-based (Kriegel et al., 2008), SVM-based (Schölkopf et al., 2001; Tax & Duin, 2004), deep learning-based (Goodfellow et al., 2016; Chalapathy & Chawla, 2019), and many others (Chandola et al., 2009; Aggarwal, 2013).
Deep learning anomaly detection approaches exploiting autoencoders (AE) have shown good performance (Hawkins et al., 2002; An & Cho, 2015; Chalapathy & Chawla, 2019). Autoencoder-based anomaly detection consists in training an autoencoder to reconstruct a set of examples and then flagging as anomalies those inputs that show a sufficiently large reconstruction error. This approach is justified by the observation that, since the reconstruction process includes a dimensionality reduction step (the encoder) followed by a step mapping representations in the compressed space (also called the latent space) back to examples in the original space (the decoder), regularities should be better compressed and, hopefully, better reconstructed (Hawkins et al., 2002).
Unfortunately, deep non-linear architectures are able to perform strong dimensionality reduction while keeping the reconstruction error low. In principle, a sufficiently expressive architecture could reduce data of arbitrarily large dimensionality to one-dimensional data while performing the reverse transformation with negligible loss. This problem is in part due to the lack of regularity in the latent space. Variational autoencoders (VAE) arise as a variant of standard autoencoders designed for generative purposes (Kingma & Welling, 2013). The key idea of variational autoencoders is to regularize the standard loss function, which consists of the reconstruction error, by including a regularization term constraining the organization of the latent space. Basically, variational autoencoders encode each example as a normal distribution over the latent space, instead of encoding it as a single point, and regularize the loss by maximizing the similarity of these distributions with the standard normal distribution. This encoding is conducive to obtaining a continuous latent space, namely a latent space in which close points lead to close decoded representations, thus avoiding the severe overfitting problem affecting standard autoencoders, for which some points of the latent space give meaningless content once decoded.
As already pointed out, variational autoencoders were initially proposed as a tool for generating novel realistic examples by sampling and then decoding points of the latent space. Due to their similarities to standard autoencoders, some authors also proposed their use to detect anomalies. However, it has been noticed that variational autoencoders share with standard autoencoders the problem that they generalize so well that they can also reconstruct anomalies well (An & Cho, 2015; Kawachi et al., 2018; Sun et al., 2018; Chalapathy & Chawla, 2019).
Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are another tool for generative purposes, aiming at learning an unknown distribution by means of an adversarial process involving a discriminator, able to output the probability for an observation to be generated by the unknown distribution, and a generator, mapping points coming from a standard distribution to points belonging to the unknown one. Moreover, Bidirectional GANs extend the above framework by including in their architecture an encoder learning the inverse transformation of the generator (Donahue et al., 2017). These architectures share with variational autoencoders generative capabilities and the particular organization of the latent space, and have also been employed with success in the anomaly detection task (Akcay et al., 2018; Schlegl et al., 2019; Zenati et al., 2019; Sánchez-Martín et al., 2020).
In general, we refer to architectures that are equipped with an encoder and a decoder and that enforce an organization of the latent space guaranteeing continuity as continuous latent space autoencoder-based neural architectures.
The main contribution of this work can be summarized as follows: we argue that the approach of selecting the worst reconstructed examples as anomalies is too simplistic if a continuous latent space autoencoder architecture is employed and, specifically, we show that the anomaly detection process can greatly benefit from taking into account the continuous latent space distribution together with the associated reconstruction error. Indeed, we show that outliers tend to lie in the sparsest regions of the combined latent and reconstruction error space and propose the novel unsupervised anomaly detection algorithms VAEOut and LatentOut, which identify outliers by performing density estimation in this augmented feature space. The proposed approach shows considerable improvements in detection performance over the standard approach based on the reconstruction error.
The rest of the paper is organized as follows. Section 2 presents preliminary definitions and discusses related work. Section 3 introduces the VAEOut and LatentOut unsupervised anomaly detection algorithms. Section 4 illustrates experimental results. Finally, Section 5 concludes the work.

Preliminaries and related work
An autoencoder (AE) is a deep neural network trained with the aim of outputting a reconstruction $\hat{x}$ of an input sample $x$ as close as possible to $x$ (Kramer, 1991; Hecht-Nielsen, 1995; Goodfellow et al., 2016). An autoencoder consists of two parts, an encoder $f_\phi$ and a decoder $g_\theta$. An encoder $f_\phi$ is a mapping of a sample from the input feature space to a hidden representation in a latent space, and is univocally determined by parameters $\phi$. A decoder $g_\theta$ is a mapping of a hidden representation from the latent space to a reconstruction in the input feature space, and is univocally determined by parameters $\theta$.
Given an autoencoder $\langle f_\phi, g_\theta \rangle$, let $x$ be a sample and let $z = f_\phi(x)$ be the latent variable to which the sample $x$ is mapped by the encoder. The reconstruction $\hat{x}$ of $x$ is given by $\hat{x} = g_\theta(z) = g_\theta(f_\phi(x))$, and the reconstruction error $E(x)$ of the autoencoder is a measure of dissimilarity of $\hat{x}$ with respect to $x$. A common reconstruction error is the mean squared error (MSE), defined as
$$E(x) = \frac{1}{d} \sum_{j=1}^{d} (x_j - \hat{x}_j)^2,$$
where $d$ is the number of features of $x$. The autoencoder tries to minimize the reconstruction error.
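As a minimal sketch of reconstruction-error scoring, the snippet below uses a linear encoder/decoder pair obtained from a truncated SVD as a stand-in for a trained autoencoder. The stand-in, the function names and the toy data are illustrative assumptions and do not correspond to the architecture used in this work.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a linear encoder/decoder via truncated SVD (a stand-in for a trained AE)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T  # shape (d, k)

    def encode(A):
        # f: input space -> latent space
        return (A - mean) @ W

    def decode(Z):
        # g: latent space -> input space
        return Z @ W.T + mean

    return encode, decode

def mse_reconstruction_error(X, encode, decode):
    """Per-sample mean squared error between x and g(f(x))."""
    X_hat = decode(encode(X))
    return np.mean((X - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
# Toy data: inliers lie close to a 2-D subspace of R^10; the last point does not.
basis = rng.normal(size=(2, 10))
inliers = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 10))
off_subspace = 5.0 * rng.normal(size=(1, 10))
X = np.vstack([inliers, off_subspace])

encode, decode = fit_linear_autoencoder(X, k=2)
errors = mse_reconstruction_error(X, encode, decode)
```

On data of this kind the off-subspace point is expected to receive a much larger error than the inliers, which is the behavior the reconstruction-error approach relies on.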
Variational Autoencoders. A variational autoencoder (VAE) is a stochastic generative model aimed at outputting a reconstruction $\hat{x}$ of a given input sample $x$ (Kingma & Welling, 2013). To this aim, VAEs are composed of an encoder $f_\phi$, which outputs the parameters of $q_\phi(z|x)$, that is the posterior distribution of observing the latent variable $z$ given $x$, and a decoder $g_\theta$, which computes the parameters of $p_\theta(x|z)$, that is the likelihood of $x$ given the latent variable $z$. The prior distribution of the latent variable $z$ is denoted by $p(z)$. Thus, the actual values of $z$ are sampled from $q_\phi(z|x)$. Given the latent variable $z$, the reconstruction $\hat{x}$ is obtained as a realization of $p_\theta(x|z)$.
As for the distributions associated with the latent variable $z$, that is $p(z)$ and $q_\phi(z|x)$, the common choice is the isotropic normal. The distribution of the likelihood $p_\theta(x|z)$ depends on the nature of the data: Bernoulli for binary data or multivariate Gaussian for continuous data. In these cases, $g_\theta(z)$ outputs the mean of the distribution and usually the reconstruction $\hat{x}$ is given by $g_\theta(z)$.
Given a variational autoencoder $\langle f_\phi, g_\theta \rangle$ and a sample $x$, the reconstruction error is represented by the cross entropy of the distribution $q_\phi(z|x)$ relative to the distribution $p_\theta(x|z)$:
$$E(x) = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right].$$
For example, given $x$ and its reconstruction $\hat{x}$, the corresponding contribution $e(x, \hat{x})$ to the above error is given by $e(x, \hat{x}) = \|x - \hat{x}\|_2^2$ for continuous data. The reconstruction error can be computed through a Monte Carlo estimation. Thus, by letting $L$ be the number of samples $z^{(1)}, z^{(2)}, \ldots, z^{(L)}$ from $q_\phi(z|x)$,
$$E(x) \approx \frac{1}{L} \sum_{l=1}^{L} e\big(x, \hat{x}^{(l)}\big), \qquad \hat{x}^{(l)} = g_\theta\big(z^{(l)}\big).$$
The loss of the variational autoencoder is given by
$$\mathrm{Loss}(x) = E(x) + \beta \cdot \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big),$$
where the second term represents the KL divergence between the distribution $q_\phi(z|x)$, modelled as a multivariate normal distribution with independent components, and the prior $p(z)$, modelled as a multivariate standard normal distribution, and plays the role of a regularization term forcing the posterior distribution to be similar to the prior distribution. The hyper-parameter $\beta$ can be used to balance the two terms of the loss (Higgins et al., 2017). In such a case, the variational autoencoder is also called a $\beta$-VAE.
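The loss just described can be sketched in a few lines, assuming a diagonal Gaussian posterior with mean `mu` and log-variance `log_var`, a standard normal prior (for which the KL term has a well-known closed form), and the squared error as reconstruction contribution for continuous data. The function names and the `decode` callable are hypothetical placeholders for a trained decoder.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def beta_vae_loss(x, mu, log_var, decode, beta=1.0, n_samples=1, rng=None):
    """Monte Carlo reconstruction error plus beta-weighted KL term for one sample x."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)
    rec = 0.0
    for _ in range(n_samples):
        # Draw z ~ q(z|x) as mu + sigma * eps with eps ~ N(0, I).
        z = mu + sigma * rng.normal(size=mu.shape)
        x_hat = decode(z)
        # Squared-error contribution e(x, x_hat) for continuous data.
        rec += np.sum((x - x_hat) ** 2)
    rec /= n_samples
    return rec + beta * kl_to_standard_normal(mu, log_var)
```

With `beta=1.0` this reduces to the plain VAE loss; larger values weight the regularization term more heavily.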
Reconstruction error-based anomaly detection. The classic use of standard AEs for anomaly detection is based on the idea that, after training, these networks are able to reproduce inlier data better than outlier data and, hence, the loss or the reconstruction error of the network is used as an anomaly score (Hawkins et al., 2002). In An and Cho (2015) this idea is applied to VAEs, by using as anomaly score the reconstruction probability, corresponding to the negative cross entropy $\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$. The experimental results obtained in An and Cho (2015) show that VAE outperforms, in terms of AUC, standard AE and PCA in a semi-supervised anomaly detection setting.
A slightly different approach is pursued in Wiewel and Yang (2019), where the whole negative loss function is considered as anomaly score instead of the reconstruction probability, which is only one term of it. The authors justify this choice with the slightly better results they obtain in their experiments compared to the reconstruction probability.
It has been observed that VAEs sometimes share with standard AEs the problem that they generalize so well that they can also reconstruct anomalies, which leads to viewing some anomalies as normal data. Thus, in Kawachi et al. (2018) the authors try to overcome this problem by modifying the structure of VAEs in order to make them able to support supervised learning and to be trained with both anomalies and normal data. In particular, a prior distribution in the latent space is adopted that encourages the separation between normal and anomalous data, which leads to a non-standard loss function and anomaly score.
Generative Adversarial Networks for anomaly detection. Among recent approaches for detecting anomalies, Generative Adversarial Networks (GANs) have been applied to this problem, and the results obtained are noteworthy.
Roughly speaking, a GAN (Goodfellow et al., 2014) is a generative model which exploits an adversarial process where two models, a discriminator D and a generator G, are trained simultaneously. The aim of the generator G is to capture the distribution of the data and, then, to produce samples as similar to the training samples as possible, while the aim of the discriminator D is to distinguish a sample coming from the training data from a sample produced by G.
Among the many existing variants, Bidirectional GAN (Donahue et al., 2017) extends the standard GAN model by including an encoder that learns the inverse of the generator, so that the mapping from the latent space to the data and the reverse mapping are learnt simultaneously.
AnoGAN uses a standard GAN and trains it only on positive samples. Given an instance x, a point z in the latent space is searched for such that G(z) is as similar to x as possible. Since the generator learns how to generate normal samples, even if x is anomalous, G(z) is expected to be non-anomalous, and then the difference between x and G(z) highlights the anomalies.
AnoGAN has been successively improved. In Sánchez-Martín et al. (2020) a BiGAN-based approach is proposed: it exploits the network architecture of BiGAN to jointly train the mapping from image to latent space and from latent space to image, thus providing a trained model that yields the latent representation of an input sample. GANomaly (Akcay et al., 2018) introduces a generator with three elements: an encoder and a decoder, namely an autoencoder, plus an additional encoder. Thus, given an instance x, the encoder produces a point z in the latent space which is provided as input to the decoder that outputs x′ which, in its turn, feeds the succeeding encoder that produces z′. Thus, the generator learns to encode normal data and learns to generate normal data starting from the encoded representation. Since the generator produces normal data even if the input data is anomalous, its reconstruction will be normal. The difference between z and z′ represents the anomaly level.
Latent space-based anomaly detection. There are autoencoder-based anomaly detection approaches in the literature that address this task relying solely on the embedding space and not on the reconstruction error (Guo et al., 2018; Zhang et al., 2018; Corizzo et al., 2019). Specifically, the framework described in Zhang et al. (2018) is tailored for nonlinear process monitoring, while that described in Corizzo et al. (2019) supports predictive modeling tasks from streaming data coming from multiple geo-referenced sensors.
All three of the above approaches map points to their latent representation and then assign them a score on the basis of the distances from their k-nearest neighbors in the latent space. In particular, in Guo et al. (2018) the score is given by the distance to the k-th nearest neighbor, while in Zhang et al. (2018) and Corizzo et al. (2019) the score is given by the sum of the distances to the k-nearest neighbors. Additionally, Zhang et al. (2018) also take into account what they call the residual space, consisting of the difference between each point and its reconstruction. Thus, a second score is obtained as the sum of the distances between the image of each point in the residual space and its k-nearest neighbors in the residual space. If both of the above scores exceed suitable thresholds then the point is recognized as an anomaly.
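The two latent-space scores mentioned above (distance to the k-th nearest neighbor, and sum of the distances to the k nearest neighbors) can be sketched as follows. This is an illustrative brute-force version with hypothetical function names, assuming the latent representations are already available as the rows of an array `Z`; it is not the implementation of the cited works.

```python
import numpy as np

def knn_distances(Z, k):
    """Sorted distances from each row of Z to its k nearest neighbors (self excluded)."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    return np.sort(D, axis=1)[:, :k]

def kth_neighbor_score(Z, k):
    """Score given by the distance to the k-th nearest neighbor."""
    return knn_distances(Z, k)[:, -1]

def knn_sum_score(Z, k):
    """Score given by the sum of the distances to the k nearest neighbors."""
    return knn_distances(Z, k).sum(axis=1)

# Toy one-dimensional latent space: the last point is isolated.
Z = np.array([[0.0], [1.0], [2.0], [10.0]])
kth = kth_neighbor_score(Z, k=2)
total = knn_sum_score(Z, k=2)
```

Both scores are largest for the isolated point; they differ in how sensitive they are to the spread of the closer neighbors.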
We note that these approaches are very different from the one introduced here, since they do not combine the latent space and the reconstruction error in order to detect points that have suspicious reconstruction errors as compared to their latent neighbors. Although the framework of Zhang et al. (2018) also considers the residual space, this differs from the reconstruction error. Indeed, while the latter is a scalar value, the former is another point of the original feature space. Moreover, the latent space and the residual space are taken into account separately, thus in Zhang et al. (2018) a point is declared an anomaly if both its latent representation and its residual representation are anomalous independently of each other.

The VAEOut and LatentOut algorithms
Let $\mathcal{I}$ denote the input space (usually $\mathcal{I} \subseteq \mathbb{R}^d$), let $\mathcal{L}$ denote the latent space (usually $\mathcal{L} \subseteq \mathbb{R}^k$ with $k \ll d$), and let $\mathcal{E}$ denote the reconstruction error space (usually $\mathcal{E} \subseteq \mathbb{R}$). As pointed out above, the traditional approach pursued to detect anomalies using (variational) autoencoders is to compare the input to its reconstruction by means of the reconstruction error, thus it is based on exploiting only the input and reconstruction error spaces. We argue that the approach of selecting the worst reconstructed examples as anomalies is too simplistic if a variational autoencoder architecture is employed. Specifically, we show that the anomaly detection process can greatly benefit from taking into account the latent space distribution together with the associated reconstruction error.
To illustrate this, we considered the MNIST dataset of handwritten digits and created a training set consisting of the roughly 6000 digits from the class 0 (the inliers) plus 90 randomly picked digits from the classes 1-9 (the outliers). Figure 1(a) reports the two-dimensional latent space of a variational autoencoder trained on the above set of examples (details on the architecture are provided in Section 4). In particular, we reported the means of the distributions associated with training examples (standard deviations are not shown for ease of visualization): inliers are the (blue) dots and outliers are the (red) asterisks.
First of all we note that, since regular examples (the inliers) form the majority of the data, they will be encoded as distributions better complying with the standard normal one. In other words, the associated latent distributions will tend to distribute around the origin of the latent space and, more importantly, their means tend to be closer and their supports will overlap more.
Nonetheless, not all the normal data comply with the above behavior and, thus, a non-negligible fraction of inliers also spreads over more peripheral regions. As for the abnormal examples, typically they spread over a wide portion of the latent space, including both boundary regions and the central region of the space, their location depending on the similarities they share with normal examples. This means that neither the location of the distributions in the latent space nor their degree of overlap alone is sufficient to separate inliers from outliers. Indeed, in Fig. 1(a) the sparsest regions of the latent space contain both normal and abnormal examples.
Consider now Fig. 1(b), where the reconstruction error is associated with each latent distribution. It can be seen that even in this case the reconstruction error alone is not sufficient to guarantee a good separation between inliers and outliers. Indeed, though some clear anomalies can be recognized by means of a very high reconstruction error, most of the outliers have relatively low reconstruction errors. However, Fig. 1(b) also suggests that outliers tend to lie in the sparsest regions of the latent/reconstruction error feature space. This can be understood since outliers have two properties: (1) they are few, and (2) their reconstruction error, even when it is not exceptionally large, is still significantly larger than that of their most similar inliers. All this tends to move the outliers away from the other points in the augmented feature space.

VAEOut algorithm
In light of these observations, the key idea of the proposed approach, called VAEOut, is to simultaneously exploit information from the two aspects highlighted above, namely the latent space distribution and the reconstruction error distribution, by constructing the novel feature space $\mathcal{F} = \mathcal{L} \times \mathcal{E}$, consisting of the juxtaposition of the latent space and of the reconstruction error space, and then by measuring the degree of overlap of the examples in this novel feature space $\mathcal{F}$, namely the density of the distribution of examples.
Outliers will be the points lying in the sparsest regions of the feature space $\mathcal{F}$. Specifically, given a dataset $S = \{x_1, x_2, \ldots, x_n\}$, our goal is to detect the outliers contained in $S$. With this aim we first train a variational autoencoder $\langle f_\phi, g_\theta \rangle$ to reconstruct the examples in $S$. Given an example $x_i$, let $\mathbf{z}_{x_i}$ denote the point
$$\mathbf{z}_{x_i} = \big(z_i, \hat{e}(x_i, \hat{x}_i)\big) \in \mathcal{F},$$
where $z_i \sim q_\phi(z|x_i)$ is a latent space point sampled from the posterior distribution $q_\phi(z|x_i)$ and $\hat{e}(x_i, \hat{x}_i)$ is a measure related to the reconstruction error $e(x_i, \hat{x}_i)$ associated with the reconstruction $\hat{x}_i$ of $x_i$ obtained by means of $z_i$. Specifically, if $e(x_i, \hat{x}_i)$ is a log-likelihood we can take the exponential $\hat{e}(x_i, \hat{x}_i) = \exp e(x_i, \hat{x}_i)$, since all the other features are on a non-log scale; otherwise $\hat{e}(x_i, \hat{x}_i)$ could be equal to $e(x_i, \hat{x}_i)$.
Given a dataset $S = \{x_1, \ldots, x_n\}$, by $\mathbf{z}_S$ we denote the transformed dataset $\mathbf{z}_S = \{\mathbf{z}_{x_1}, \ldots, \mathbf{z}_{x_n}\}$ and by $\tilde{\mathbf{z}}_S = \{\tilde{\mathbf{z}}_{x_1}, \ldots, \tilde{\mathbf{z}}_{x_n}\}$ we denote the standardized version of $\mathbf{z}_S$, that is the dataset obtained by normalizing each feature according to its mean and standard deviation. Standardization is needed here to handle non-homogeneous features.
To measure the density of a point $x_i$ in a set of points $S$ we use nearest neighbor density estimation and specifically the average $k$-nearest neighbor distance of point $x_i$ from the points in $S$, denoted as $k\text{-NN}_S(x_i)$. However, instead of employing the distance defined in the original feature space, we employ as distance $\mathrm{dist}(x_i, x_j)$ between $x_i$ and $x_j$ the distance separating their standardized images $\tilde{\mathbf{z}}_{x_i}$ and $\tilde{\mathbf{z}}_{x_j}$ in the standardized transformed dataset.
Thus, the VAEOut anomaly score of $x_i$ in the dataset $S$ consists of a $k$-nearest neighbor estimate of the density of the standardized image $\tilde{\mathbf{z}}_{x_i}$ in the standardized dataset $\tilde{\mathbf{z}}_S$. To take into account Monte Carlo estimation, $L$ samples $\mathbf{z}^{(l)}_{x_i}$ ($l \in \{1, \ldots, L\}$) can be used for each example $x_i$, and the distance $\mathrm{dist}(x_i, x_j)$ is obtained as the average distance between pairs of samples associated with $x_i$ and $x_j$. Figure 1(c) shows the latent samples and their associated anomaly score. It can be seen that now there is a marked separation between inliers and outliers in terms of the anomaly score. Inliers tend to have low scores, while almost all the outliers are associated with the largest anomaly scores of the population as a consequence of their inherent sparsity. Figure 1(d) compares the ROC curve obtained by our method (VAEOut, the solid red line) with the ROC curve obtained by exploiting the reconstruction error of a variational autoencoder (recprob (An & Cho, 2015), the dashed blue line). Note that the AUC = 0.9063 of the standard VAE increases to AUC = 0.9908 if VAEOut is employed.
Algorithm 1 details the steps of the proposed technique. First of all, a variational autoencoder VAE is trained by exploiting the input examples in $S$. This allows the encoder $f_\phi$ and the decoder $g_\theta$ to output the parameters of $q_\phi$ and $p_\theta$. Next, each example $x_i \in S$ can be mapped to the novel feature space $\mathcal{F} = \mathcal{L} \times \mathcal{E}$. In particular, $L$ mappings of $x_i$ to $\mathcal{F}$ are built. The mappings $z^{(l)}_i$ of $x_i$ to $\mathcal{L}$, with $l \in \{1, \ldots, L\}$, are obtained by sampling values from $q_\phi(z|x_i)$, while the mappings of $x_i$ to $\mathcal{E}$ are obtained by considering the reconstruction $\hat{x}^{(l)}_i = g_\theta(z^{(l)}_i)$ of $x_i$ provided by the decoder and then by computing the measure $\hat{e}(x_i, \hat{x}^{(l)}_i)$ related to the reconstruction error.
Once the $L$ mappings $\mathbf{z}^{(l)}_{x_i}$ of $x_i$ to $\mathcal{F}$ have been generated, they are normalized by standardizing each feature with respect to its mean and standard deviation. Next, the distance between all pairs of examples $x_i$ and $x_j$ can be computed by averaging the Euclidean distances between the mappings of $x_i$ and $x_j$ to $\mathcal{F}$. Finally, the $k$ nearest neighbors of $x_i$ according to the above distance are detected and the outlier score is computed as the mean distance between $x_i$ and such neighbors.
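The steps above can be sketched as follows for the case L = 1, assuming that the latent samples and the reconstruction-error measures have already been produced by a trained VAE. The function name `vaeout_scores`, the toy values, and the brute-force pairwise distance matrix (practical only for small n) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def vaeout_scores(Z, errors, k):
    """Mean k-NN distance in the standardized latent/error space (one sample per example).

    Z      : (n, latent_dim) latent points z_i sampled from q(z | x_i)
    errors : (n,) reconstruction-error measures e_hat(x_i, x_hat_i)
    """
    F = np.column_stack([Z, errors])         # augmented feature space F = L x E
    std = F.std(axis=0)
    std[std == 0] = 1.0                      # assumption: leave constant features unscaled
    F = (F - F.mean(axis=0)) / std           # standardize each feature
    diff = F[:, None, :] - F[None, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))  # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)              # exclude each point from its own neighbors
    return np.sort(D, axis=1)[:, :k].mean(axis=1)

# Toy example: the last point sits in the middle of the latent space
# but has a reconstruction error larger than that of its latent neighbors.
Z = np.array([[0.0], [0.1], [-0.1], [0.05], [0.0]])
errors = np.array([1.0, 1.1, 0.9, 1.0, 5.0])
scores = vaeout_scores(Z, errors, k=2)
```

In this toy case the last point receives the largest score although its latent location alone is unremarkable, which is the situation the augmented space is meant to capture.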

LatentOut algorithm
In this section we present the LatentOut algorithm, which generalizes the strategy of the VAEOut algorithm to any other autoencoder-based neural architecture $A$ and also to other ways of combining the latent space location and the reconstruction error associated with observations in order to improve detection, namely different notions of score. We call $\mathrm{LatentOut}_{A,\mathit{score}}$ the variant of the LatentOut algorithm employing the architecture $A$ and the score $\mathit{score}$. Algorithm 2 reports the pseudo-code of LatentOut.
As for the allowed neural architectures, we consider autoencoder-based ones, namely architectures equipped with an encoder $f$, associated with the posterior distribution $q(z|x)$ of observing the latent variable $z$ given $x$, and a decoder $g$. Note that the above model encompasses all the kinds of architectures described in Section 2, thus VAEs, but also bidirectional GANs and standard AEs.
As for the scores, we distinguish between those that estimate the density in the latent space augmented with the reconstruction error and those that determine neighbors by taking into account only the original latent space. The former scores require the transformed dataset $\mathbf{z}_S$ to be standardized by normalizing each feature according to its mean and standard deviation. Conversely, when scores of the latter family are employed, the transformed dataset $\mathbf{z}_S$ does not need to be standardized ($\mu_h = 0$ and $\sigma_h = 1$ are used to leave the $h$-th feature distribution unchanged).
Before determining the final scores, the algorithm computes the sets $N_k(x_i)$ consisting of the $k$-nearest neighbors of $x_i$ according to the distance $\mathrm{dist}(x_i, x_j)$ calculated on their associated transformed points $\mathbf{z}_{x_i}$ and $\mathbf{z}_{x_j}$.
To perform nearest neighbor density estimation, the kNN-density score can be employed, that is the mean distance between $x_i$ and its $k$-nearest neighbors:
$$\frac{1}{k} \sum_{x_j \in N_k(x_i)} \mathrm{dist}(x_i, x_j).$$
This score requires latent space augmentation and, thus, is related to the density of transformed points in the augmented feature space. Note that LatentOut with the VAE architecture and the kNN-density score, or $\mathrm{LatentOut}_{\mathrm{VAE}}$ for short, is the instance of the LatentOut algorithm corresponding to the VAEOut algorithm already described in Sect. 3.1. Hence, when the score specification is omitted, as in $\mathrm{LatentOut}_{\mathrm{VAE}}$, we assume that the employed score is by default the kNN-density score.
Here we introduce an alternative way of injecting spatial information concerning the latent space into the process of outlier detection, by comparing the reconstruction error of each latent point with that of its neighbors. The reconstruction error Z-score does not require augmentation of the latent space and represents the deviation of the reconstruction error $\hat{e}(x_i, \hat{x}_i)$ from the mean reconstruction error $\mu_{\hat{e}}(N_k(x_i))$ of its $k$-nearest neighbors, expressed in terms of number of standard deviations:
$$\frac{\hat{e}(x_i, \hat{x}_i) - \mu_{\hat{e}}(N_k(x_i))}{\sigma_{\hat{e}}(N_k(x_i))},$$
where $\mu_{\hat{e}}(N_k(x_i))$ and $\sigma_{\hat{e}}(N_k(x_i))$ denote the mean and the standard deviation of the reconstruction errors of the points in $N_k(x_i)$. The idea is that if the reconstruction error of an observation presents large deviations from the reconstruction errors within its neighborhood, this may indicate an anomalous behavior even if the reconstruction error of the observation is not itself suspiciously large. This way of perceiving abnormality clearly has connections with that underlying the kNN-density score, but gives different results. In order to characterize more precisely the behavior of this score with respect to the standard density score, we will compare the two scores in different scenarios.
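A minimal sketch of the reconstruction error Z-score described above is given below. It assumes the latent points and the reconstruction-error measures are already available, takes neighbors in the (non-augmented) latent space, and uses the population standard deviation; the function name, the standard-deviation convention and the small constant guarding against a zero standard deviation are assumptions made for illustration.

```python
import numpy as np

def error_zscore(Z, errors, k):
    """Deviation of each reconstruction error from its latent neighbors' errors, in standard deviations."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(D, axis=1)[:, :k]    # N_k(x_i), found in the latent space only
    neigh_err = errors[neighbors]               # shape (n, k)
    mu = neigh_err.mean(axis=1)
    sigma = neigh_err.std(axis=1)
    sigma = np.where(sigma == 0, 1e-12, sigma)  # assumption: guard against division by zero
    return (errors - mu) / sigma

# Toy example on a one-dimensional latent space.
Z = np.array([[0.0], [1.0], [2.0], [3.0]])
errors = np.array([1.0, 2.0, 3.0, 10.0])
zscores = error_zscore(Z, errors, k=2)
```

Here the last point stands out because its error is far from the errors of its two latent neighbors, not merely because the error is large in absolute terms.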
By $\mathrm{LatentOut}_{\mathrm{VAE},\text{Z-score}}$ we denote the variant of the LatentOut algorithm employing Variational AutoEncoder architectures with the reconstruction error Z-score. Figure 2(a) shows the score computed by $\mathrm{LatentOut}_{\mathrm{VAE},\text{Z-score}}$ (for $k = 100$) on the variant of the MNIST dataset illustrated at the beginning of this section, while Fig. 2(b) reports the AUCs obtained by $\mathrm{LatentOut}_{\mathrm{VAE},\text{Z-score}}$ and recprob. The AUC of $\mathrm{LatentOut}_{\mathrm{VAE},\text{Z-score}}$ is 0.9363, thus smaller than that obtained by $\mathrm{LatentOut}_{\mathrm{VAE}}$, though better than the AUC = 0.9063 of the standard VAE.
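AUC values such as those compared above can be computed directly from anomaly scores and ground-truth outlier labels. The sketch below uses the rank-based (Mann-Whitney) formulation of the AUC; it is an illustrative assumption about how such an evaluation can be carried out, not necessarily the procedure used for the reported figures.

```python
import numpy as np

def roc_auc(scores, is_outlier):
    """AUC as the probability that a random outlier scores higher than a random inlier (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    pos, neg = scores[is_outlier], scores[~is_outlier]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Two inliers followed by two outliers; one outlier is ranked below one inlier.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
```

The pairwise comparison is quadratic in the number of examples, which is acceptable for an illustration but would be replaced by a sort-based computation on large datasets.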
In the sequel we will also consider LatentOut in combination with other autoencoder-based architectures, specifically GAN-based ones such as GANomaly and Fast-AnoGAN, and also with classic autoencoders.
Before concluding the section, we discuss the type of anomalies identified by our method. As for the approach we adopt to isolate anomalies, we can say that it couples approaches based on the reconstruction error with density estimation-based approaches (see also Fig. 5 at page 765 of Ruff et al. (2021)). The general behavior of reconstruction error approaches is to learn the encoder-decoder pair that minimizes the reconstruction error once applied to the data at hand. The other two main families of anomaly detection approaches are the one-class classification-based or distribution-free ones, whose objective is to partition the space into an accepting region containing inliers and a rejecting region containing outliers, and the probabilistic or density estimation ones, which aim at reconstructing the density generating the normal data. We exploit both the first kind of approaches, to associate a reconstruction error and a latent space representation with each example, and the third kind of approaches, to compute an anomaly score.
As for the kind of anomalies, the literature distinguishes between point anomalies and group anomalies, and also between non-contextual and contextual anomalies (e.g., see Fig. 2 at page 760 of Ruff et al. (2021)). We note that the anomalies singled out by our method are better characterized as point anomalies, since the scores we use are designed to be evaluated on single observations. Our score exhibits large values when some features associated with the examined point deviate from the features associated with its neighborhood. This confirms that the anomalies we detect are point anomalies, since if they were immersed in a group of similar observations, i.e. in a group of anomalies, they would probably not be pointed out as anomalous. Moreover, since the score compares the observation with its neighborhood, we believe our approach shares similarities with contextual point anomaly methods. In our case, however, the context is not a homogeneous sub-population containing the point or the spatial neighborhood of the point in the original feature space but, being represented by the spatial neighborhood in the latent space, it can be conceived as the semantic neighborhood of the point.
Summarizing, we can characterize the kind of anomalies detected by our approach as follows: LatentOut couples reconstruction error approaches with density estimation ones in order to detect point anomalies according to the semantic context associated with each data observation.
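As a rough illustration of this coupling, the following Python sketch scores each example by its mean distance to its k nearest neighbors in the space obtained by appending the reconstruction error to the latent representation. The function name knn_density_score, the standardization step, and the use of the mean k-NN distance as an inverse density proxy are assumptions made for illustration only; they are not the scores defined for LatentOut.

```python
import numpy as np

def knn_density_score(latent, rec_error, k):
    """Mean distance to the k nearest neighbors in the [latent, error] space.

    Larger values indicate sparser regions. Hypothetical helper: an
    illustrative inverse-density proxy, not the score used by LatentOut.
    """
    latent = np.asarray(latent, dtype=float)
    rec_error = np.asarray(rec_error, dtype=float).reshape(-1, 1)
    feats = np.hstack([latent, rec_error])
    # Standardize each coordinate so that no single one dominates (assumption).
    std = feats.std(axis=0)
    std[std == 0] = 1.0
    feats = (feats - feats.mean(axis=0)) / std
    # Pairwise Euclidean distances (quadratic memory; fine for a small sketch).
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Synthetic data: 200 points with similar latent codes and errors,
# except point 0, which sits among the inliers but is badly reconstructed.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
err = rng.normal(loc=1.0, scale=0.1, size=200)
z[0] = [0.0, 0.0]
err[0] = 5.0
scores = knn_density_score(z, err, k=5)
```

In this toy setting the badly reconstructed point is close to the inliers in the latent coordinates alone, and it is the error coordinate that isolates it in the combined space.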

Experimental results
We start by describing settings which are common to all the experiments reported in this section.
In order to generate an unsupervised setup, we considered labelled datasets and, for each class label, we created a novel dataset having as inliers all the examples of the considered class and as outliers some randomly picked examples from the other classes. Precisely, we selected s examples (s = 10 or s = 100 have been used) from each of the other classes, so that the total number of outliers is s × (m − 1), where m denotes the number of classes.
In the following we consider the MNIST and Fashion-MNIST datasets. Both datasets consist of 60,000 grayscale 28 × 28 pixel images partitioned into 10 classes: MNIST contains handwritten digits, while Fashion-MNIST contains Zalando's article images. The number of outliers within each dataset is also called its (absolute) contamination c. Since both the above datasets consist of 10 classes, their contamination corresponds to c = 9s.
As for the autoencoder architecture employed on MNIST and Fashion-MNIST, the encoding part is composed of an initial sequence of convolutional layers that reduce the size of the data to 14 × 14, a flattening layer that transforms the data into a vectorial form, and two dense layers that bring the data to the latent space having dimension d. The decoder consists of a layer that reshapes the data into a bi-dimensional form and a sequence of convolutional layers that transform the data back into the original 28 × 28 shape.
As for the parameter L, we verified that it has a limited impact on the accuracy and, hence, in the following we report results for L = 1. All the experimental results are obtained by averaging over ten runs, thus we report both the mean and the standard deviation of performance measures. In the following, tables reporting experimental results highlight the best performance in bold.
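The construction just described can be sketched in a few lines of Python. The function name make_one_vs_rest, its arguments, and the toy data are hypothetical and only serve to make the procedure concrete; the sampling details (seed, sampling without replacement) are assumptions.

```python
import numpy as np

def make_one_vs_rest(X, y, inlier_class, s, seed=0):
    """Build one unsupervised dataset from a labelled one.

    Inliers: all examples of `inlier_class`.
    Outliers: s examples drawn at random (without replacement, an assumption)
    from each of the other classes.
    Returns the data and a boolean mask marking the outliers.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    parts = [X[y == inlier_class]]
    flags = [np.zeros(parts[0].shape[0], dtype=bool)]
    for label in np.unique(y):
        if label == inlier_class:
            continue
        idx = np.flatnonzero(y == label)
        chosen = rng.choice(idx, size=s, replace=False)
        parts.append(X[chosen])
        flags.append(np.ones(s, dtype=bool))
    return np.concatenate(parts), np.concatenate(flags)

# Toy stand-in for a labelled dataset with m = 10 classes, 50 examples each.
y_toy = np.repeat(np.arange(10), 50)
X_toy = np.random.default_rng(1).normal(size=(500, 4))
data, is_outlier = make_one_vs_rest(X_toy, y_toy, inlier_class=3, s=10)
```

With m = 10 classes and s = 10, the resulting dataset contains s × (m − 1) = 90 outliers, matching the count given above.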

Experiments with the VAEOut algorithm
If not otherwise stated, during the experiments described in this section the parameter k is held fixed to 0.25c, thus k = 15 for s = 10 and k = 150 for s = 100. Later, we will study the effect of the parameter k on the accuracy. According to the literature (Higgins et al., 2017), we employ large values for the parameter in order to allow the variational autoencoder to properly organize the latent space, and specifically = 10^4.
VAEOut versus recprob. First of all, we investigated the impact of the proposed strategy on the accuracy of the variational autoencoder-based outlier detection approach, by comparing the Area Under the ROC Curve (AUC) of VAEOut with that of recprob, that is, the standard strategy based on exploiting the VAE reconstruction error. Comparisons are conducted by considering the influence of the latent space dimension on the quality of the detection. Figure 3 reports the AUCs of VAEOut (red circle-marked lines) and recprob (blue square-marked lines) for the latent space dimension d ranging in the interval [2, 32] and s = 10. Due to the lack of space, results for s = 100 are summarized in Table 1.
The results highlight that the proposed strategy is able to improve the accuracy of VAE-based outlier detection. Indeed, in many runs VAEOut improves over recprob, and for almost all the digits the achieved improvement is noticeable. The experiments also show that the accuracy of VAEOut is positively affected by the latent space dimension, while this does not seem to be the case for the standard VAE. We explain this behavior by observing that lower dimensions constrain distributions within the latent space to overlap more, thus worsening the separation induced by the density associated with latent points. From these experiments, we conclude that a good choice for the latent space dimension d is in the order of a few tens, namely d ∈ [16, 32].
Note that the intervals of AUC values reported on the vertical axis of the plots are not identical. As for digit 1, it must be pointed out that the variational autoencoder reconstructs it very well, probably since it is the easiest digit in the set, and this explains why the recprob AUC is very close to 1. VAEOut shows a slightly smaller AUC for low latent dimensions, but reaches a similar AUC for sufficiently large dimensions.
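For reference, the AUC used in these comparisons can be computed directly from anomaly scores and ground-truth outlier labels as the probability that a randomly chosen outlier receives a larger score than a randomly chosen inlier, counting ties as one half. The sketch below is a minimal illustration; the function name auc_from_scores and the toy values are assumptions, not part of the experimental code.

```python
import numpy as np

def auc_from_scores(scores, is_outlier):
    """Area under the ROC curve via pairwise comparisons.

    Equals the fraction of (outlier, inlier) pairs in which the outlier
    has the larger anomaly score; ties contribute 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    pos = scores[is_outlier]   # scores of true outliers
    neg = scores[~is_outlier]  # scores of inliers
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Two inliers and two outliers; three of the four pairs are ordered correctly.
example_auc = auc_from_scores([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
```

The pairwise form uses memory proportional to the number of outliers times the number of inliers, which is manageable when the contamination is small relative to the dataset size.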
Precision. Another measure employed to evaluate outlier detection approaches is the precision. Specifically, since the goal is to isolate the most deviating dataset examples, we used the Prec@n measure, representing the percentage of true outliers among the examples with the top n anomaly scores. Note that although the case s = 10 shows slightly larger AUCs, the Prec@n is higher for the case s = 100. We explain this behavior by noticing that, while the inliers of the two datasets are the same, the outliers for the case s = 100 have increased tenfold; this increases the probability that the largest scores are assigned to outliers, although overall the outliers are ranked slightly worse according to the AUC.
Sensitivity analysis for the parameter k. Experiments reported in Figure 4 are aimed at determining the optimal value for the parameter k, by performing a sensitivity analysis with respect to this parameter. With this aim, we took into account log-spaced values of k in the interval [2, 1024] and determined the AUC of VAEOut on the MNIST dataset for s = 10 and s = 100. In these experiments, the latent space dimension d is held fixed to d = 32.
To help understand the effect of k on the accuracy, on the horizontal axis we reported the value k/c = k/(9s) of k normalized on the absolute contamination c of the dataset, also called normalized neighborhood. Each plot also reports the AUC achieved by recprob. It can be seen that for a wide range of values of the parameter k the AUC of VAEOut is noticeably larger than that of recprob. In most cases this property holds for all the reported values of k.
This experiment shows that, although VAEOut requires an additional parameter with respect to a standard VAE, the selection of the right value for this parameter is not critical, since an improvement is almost always guaranteed. Moreover, the optimal value for the normalized neighborhood appears to be located within the interval [10^−1, 10^0]. Thus, the normalized neighborhood provides a tool for selecting a reasonable value for k. As a rule of thumb, we recommend using k ≈ N/3, where N is the user-specified expected absolute contamination or, vice versa, returning N ∈ [3k, 5k] anomalies when k is user-specified.
Impact on the neural architecture. In this experiment we compare the detection performances of Auto-Encoder based anomaly detection (AE), Variational Auto-Encoder based anomaly detection (VAE), and VAEOut based anomaly detection. The aim of this experiment is not to determine the best configuration for each approach, but instead to compare the performances of these three autoencoder-based approaches when the architecture is held fixed. Thus, all the results refer to equivalent network architectures and to the same common hyper-parameters. Specifically, the AE has the same structure as the VAE, except for employing a deterministic latent space and for the loss consisting only of the reconstruction error, while VAEOut builds on the same VAE architecture described at the beginning of this section. Tables 3 and 4 report the AUC of the three methods on the MNIST and Fashion-MNIST datasets, respectively, with s = 10, d = 32 and k set to 30, that is, to one third of the dataset contamination. While on the MNIST dataset VAE performs better than AE, on the Fashion-MNIST dataset, with the same loss hyper-parameter, VAE performs worse than the corresponding deterministic architecture.
Importantly, VAEOut always shows clear improvements over the corresponding VAE architecture. On MNIST, for some critical classes (see, for example, digit 8), VAEOut is the clear winner. On Fashion-MNIST , despite the sometimes poor
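A minimal Python sketch of the Prec@n measure follows: it returns the fraction of true outliers among the n examples with the largest anomaly scores. The function name precision_at_n and the toy scores are assumptions made for illustration.

```python
import numpy as np

def precision_at_n(scores, is_outlier, n):
    """Fraction of true outliers among the n top-scored examples."""
    scores = np.asarray(scores, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    # Indices of the n largest scores (stable sort keeps ties in input order).
    top = np.argsort(-scores, kind="stable")[:n]
    return is_outlier[top].mean()

# Toy example with two true outliers; n is set to the contamination (n = c = 2).
toy_scores = [0.9, 0.2, 0.8, 0.1, 0.7]
toy_labels = [True, False, False, False, True]
prec = precision_at_n(toy_scores, toy_labels, n=2)
```

Here the two largest scores belong to one outlier and one inlier, so half of the returned examples are true outliers.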
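The normalized neighborhood and the rule of thumb can be made concrete with a short Python sketch. The helper names are hypothetical, and the power-of-two grid is only one possible choice of log-spaced values in [2, 1024]; the exact grid used in the experiments is not stated.

```python
import numpy as np

def normalized_neighborhood(k, s, m=10):
    """k divided by the absolute contamination c = s * (m - 1)."""
    c = s * (m - 1)
    return k / c

def rule_of_thumb_k(expected_contamination):
    """Initial choice k ~ N / 3, with N the expected absolute contamination."""
    return max(1, round(expected_contamination / 3))

# Assumed grid: powers of two from 2 to 1024 (log-spaced).
k_grid = 2 ** np.arange(1, 11)
ratios = [normalized_neighborhood(k, s=10) for k in k_grid]
k_default = rule_of_thumb_k(90)  # N = c = 90 for s = 10
```

For s = 10 the contamination is c = 90, so the rule of thumb yields k = 30, whose normalized neighborhood 1/3 falls inside the interval [10^−1, 10^0] indicated above.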

Experiments with the LatentOut algorithm
In the previous section we experimented with the VAEOut algorithm. In this section we complete the experimental results by considering the general LatentOut algorithm. Since VAEOut can be regarded as an instance of LatentOut, in order to make a clear comparison among the considered instances, in the following we will refer to the former algorithm as LatentOut VAE, .
Applying LatentOut to VAE architectures. We start by experimenting with LatentOut on Variational AutoEncoder architectures. The number of outlying examples s coming from each different class label is set to s = 10. Experimental results are reported in Table 5, showing the AUC obtained by LatentOut on MNIST (table on the top) and Fashion-MNIST (table on the bottom). In each table, the first column reports the class label, the second the AUC of the basic VAE architecture, while the last two columns show the AUC of LatentOut VAE, and LatentOut VAE, , respectively. In these experiments, we varied d in [2, 32] and k in [2, 1000] and reported the optimal AUC scored by each method. While for LatentOut VAE, the optimal AUC value was found in the intervals d ∈ [8, 32] and k ∈ [30, 100] on both datasets, which agrees with the analysis already performed in Sect. 4.1, LatentOut VAE, behaved differently in terms of the optimal values for the parameters. Indeed, LatentOut VAE, seems to perform better for smaller latent space dimensionalities, namely d ∈ [2, 8], and for larger neighborhood parameters, namely k > 200.
As for the algorithm performances, in these experiments LatentOut VAE, always guaranteed the best accuracy. As for LatentOut VAE, , it exhibits improvements over the standard VAE in many cases.
Applying LatentOut to GAN architectures. Here we discuss experiments concerning LatentOut on GAN autoencoder-based architectures. To better exploit the power of GANs, we considered the richer CIFAR-10 dataset, a labeled subset of the 80 million tiny images dataset. This dataset consists of 60,000 32 × 32 colour images partitioned into 10 classes, with 6,000 images per class. We employed the architectures of GANomaly and Fast-AnoGAN described in the respective papers.
We set s = 10, d = 2, and k = 30 for LatentOut GANomaly, and k = 500 for LatentOut GANomaly, . We observed that the accuracy of GANomaly is rather unstable and, to better understand its behavior, we measured the AUC of the methods as a function of the number of training epochs (Figure 5 reports these values for some classes and a specific run; missing classes showed the same behavior). Interestingly, LatentOut shows large improvements on the AUC of GANomaly, even when the latter value is quite poor. Notably, LatentOut GANomaly, is always able to reach very large AUC values in the very first iterations and maintains its accuracy throughout the training procedure. In order to visualize the most difficult examples for each method, we collected the example scoring the top absolute difference between the ranking of GANomaly and the ranking of LatentOut. We verified that in these experiments all the above examples correspond to true anomalies showing a small GANomaly score and a large LatentOut score. Figures 6 (for LatentOut GANomaly, ) and 7 (for LatentOut GANomaly, ) report these most deviating anomalies (on the first row). Under each image there are the relative rankings according to GANomaly (above) and according to LatentOut (below), where 1.0 (0.0, resp.) stands for top ranked (bottom ranked, resp.). The subsequent three rows represent the 1st, 2nd and 3rd nearest neighbors in the latent space of the anomalous example. As expected, in most cases anomalies share similarities with their neighbors in the latent space, but the different reconstruction error allows LatentOut to overturn the ranking for these anomalous examples.
Due to the good performances of LatentOut, we also tested LatentOut Fast-AnoGAN, on CIFAR-10. The AUC values are reported in Table 7 without standard deviations, since we executed a reduced number of runs.
Sensitivity of LatentOut to the parameter k. In order to study the impact of the parameter k on LatentOut, we considered log-spaced values of k and determined the AUC of LatentOut VAE, on MNIST and of LatentOut GANomaly, on CIFAR-10, in both cases for s = 10. Since, according to the previous experiments, we verified that LatentOut behaves better for smaller latent space dimensionalities, we held d fixed to 2.
Figure 8 reports the results of this experiment. The abscissa reports the value of the parameter k, ranging from 50 to 800, an interval including the optimal performances obtained in the other experiments.
The results highlight that LatentOut GANomaly, is practically insensitive to the parameter k, while k has a certain impact on the quality of LatentOut VAE, . In the latter case, some classes benefit from enlarging the value of k. An intermediate value seems good enough in all cases. We can conclude that the −score requires values of k different from the −score to reach its best performances. We can relate k to the contamination by k = 3c (c = 90 in these experiments) and suggest k ≈ 3N as a rule of thumb to select an initial value for k.
Comparison with baseline methods. We compared our method with three baseline methods: k-Nearest Neighbour (KNN), Isolation Forest (IF) and Local Outlier Factor (LOF). In particular, we considered the tabular datasets in Rayana (2016), whose statistics are reported in Table 8, as well as the Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set. The former are a family of binary datasets created specifically for outlier detection. The latter is a multiclass dataset that we treated in the same way as the other multiclass image datasets; it consists of a collection of real attributes obtained from sensor signals with the aim of recognizing 12 different human movements. We chose this dataset because, among the tabular datasets available, it is one with the largest dimension (d = 561) and size (n = 10929), and is therefore more suitable for our analysis. Among all the versions of our method we selected the ones based on standard Autoencoders, i.e. LatentOut AE, and LatentOut AE, , because the other architectures are specific to image datasets. Results are reported in Tables 9 and 10.
These datasets consist of few attributes if compared with image datasets and have a flat nature. In some cases the baseline methods are able to behave better than the more complex neural architecture. However, importantly, LatentOut AE, and LatentOut AE, almost always improve over standard Autoencoders and perform better than the baselines in several cases.
As the dimension and the complexity of the datasets grow, our method can perform far better than the baselines; indeed, we also considered CIFAR-10 as a more complex scenario. As we can see in Table 11, on CIFAR-10 the AUC values obtained by KNN, IF and LOF are always much smaller than the ones obtained by LatentOut GANomaly, .
This set of experiments highlights that our method is very effective on high-dimensional datasets. This is due to the fact that using the feature space F instead of the original space of the data maintains the semantic distribution of inliers and outliers, while the smaller dimension of the feature space avoids issues related to the curse of dimensionality.
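The selection of the most deviating example can be sketched as follows, using the same convention for relative rankings (1.0 for the top-ranked example, 0.0 for the bottom-ranked one). The function names and the toy scores are hypothetical; in particular, the two score vectors are placeholders and do not come from GANomaly or LatentOut.

```python
import numpy as np

def relative_rank(scores):
    """Map scores to [0, 1]: 1.0 for the largest score, 0.0 for the smallest."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(scores.size, dtype=float)
    ranks[order] = np.arange(scores.size)
    return ranks / (scores.size - 1)

def most_discordant(base_scores, new_scores):
    """Index of the example whose relative ranking differs most between
    the two methods, together with the per-example ranking gaps."""
    gap = np.abs(relative_rank(new_scores) - relative_rank(base_scores))
    return int(np.argmax(gap)), gap

base = [0.9, 0.1, 0.5, 0.7]    # placeholder scores of the base method
new = [0.8, 0.95, 0.4, 0.6]    # placeholder scores of the second method
idx, gap = most_discordant(base, new)
```

In this toy case the second example is ranked last by the first method and first by the second one, so it is the one returned.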

Conclusions
The main goal of this work is to show that, within the context of autoencoder neural network architectures, the outlier detection process can greatly benefit from taking into account the latent space distribution together with the associated reconstruction error. Specifically, we observed that outliers tend to lie in the sparsest regions of the combined latent/error space and proposed a novel unsupervised anomaly detection algorithm, called LatentOut, that exploits this property to identify outliers. The novel approach always showed noticeable improvements in terms of detection performances over the basic autoencoder-based architecture to which it is applied, especially as the dimension of the dataset increases. The comparison with baseline methods has shown that it has comparable performances on less complex datasets.

Table 1
AUC for the MNIST datasets (s = 100).

top n anomaly scores. We set n to the absolute contamination, n = c. Table 2 compares the Prec@n achieved by VAEOut and recprob on MNIST (d = 32). The results point out that VAEOut is able to significantly increase the percentage of true anomalies among the examples ranked in the very first positions. Moreover, in several cases the precision is doubled.

Table 2
MNIST dataset Prec@n for n set to the contamination

Table 8
Statistics of the datasets

Table 9
AUC of LatentOut AE on tabular datasets

Table 10
AUC of LatentOut AE on Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set