1 Introduction

Outlier detection is a fundamental and widely applicable discovery problem. Outliers can arise for many reasons, such as mechanical faults, fraudulent behavior, human errors, instrument errors, or simply natural deviations in populations. Generally speaking, the problem of outlier detection consists in isolating samples suspected of not being generated by the same mechanisms as the rest of the data. Approaches to outlier detection can be classified into supervised, semi-supervised, and unsupervised (Chandola et al., 2009; Aggarwal, 2013). Supervised methods take as input data labeled as normal and abnormal and build a classifier; the challenge there is that abnormal data form a rare class. Semi-supervised methods, also called one-class classifiers or domain description techniques, take as input only normal examples and use them to identify anomalies. Unsupervised methods detect outliers in an input dataset by assigning a score or anomaly degree to each object. Several statistical, data mining and machine learning approaches have been proposed to detect outliers, namely, statistical-based (Davies & Gather, 1993; Barnett & Lewis, 1994), distance-based (Knorr et al., 2000; Angiulli & Pizzuti, 2002, 2005; Angiulli et al., 2006; Angiulli & Fassetti, 2009), density-based (Breunig et al., 2000; Jin et al., 2001), reverse nearest neighbor-based (Hautamäki et al., 2004; Radovanović et al., 2015; Angiulli, 2017, 2018, 2020), isolation-based (Liu et al., 2012), angle-based (Kriegel et al., 2008), SVM-based (Schölkopf et al., 2001; Tax & Duin, 2004), deep learning-based (Goodfellow et al., 2016; Chalapathy & Chawla, 2019), and many others (Chandola et al., 2009; Aggarwal, 2013).

Deep learning anomaly detection approaches exploiting autoencoders (AE) have shown good performance (Hawkins et al., 2002; An & Cho, 2015; Chalapathy & Chawla, 2019). Autoencoder-based anomaly detection consists in training an autoencoder to reconstruct a set of examples and then detecting as anomalies those inputs that show a sufficiently large reconstruction error. This approach is justified by the observation that, since the reconstruction process includes a dimensionality reduction step (the encoder) followed by a step mapping representations in the compressed space (also called the latent space) back to examples in the original space (the decoder), regularities should be better compressed and, hopefully, better reconstructed (Hawkins et al., 2002).

Unfortunately, deep non-linear architectures are able to perform severe dimensionality reduction while keeping the reconstruction error low: ideally, a sufficiently expressive architecture could reduce arbitrarily high-dimensional data to a single dimension while performing the reverse transformation with negligible loss. This problem is in part due to the lack of regularity in the latent space. Variational autoencoders (VAE) arise as a variant of standard autoencoders designed for generative purposes (Kingma & Welling, 2013). The key idea of variational autoencoders is to regularize the standard loss function, consisting of the reconstruction error, by including a regularization term constraining the organization of the latent space. Basically, variational autoencoders encode each example as a normal distribution over the latent space, instead of encoding it as a single point, and regularize the loss by maximizing the similarity of these distributions with the standard normal distribution. This encoding is conducive to obtaining a continuous latent space, namely a latent space in which close points lead to close decoded representations, thus avoiding the severe overfitting problem affecting standard autoencoders, for which some points of the latent space yield meaningless content once decoded.

As already pointed out, variational autoencoders were initially proposed as a tool for generating novel realistic examples by sampling and then decoding points of the latent space. Due to their similarities to standard autoencoders, some authors have also proposed their use to detect anomalies. However, it has been noticed that variational autoencoders share with standard autoencoders the problem that they generalize so well that they can also reconstruct anomalies well (An & Cho, 2015; Kawachi et al., 2018; Sun et al., 2018; Chalapathy & Chawla, 2019).

Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are another tool for generative purposes, aiming at learning an unknown distribution by means of an adversarial process involving a discriminator, able to output the probability for an observation to be generated by the unknown distribution, and a generator, mapping points coming from a standard distribution to points belonging to the unknown one. Moreover, Bidirectional GANs extend the above framework by including in their architecture an encoder learning the inverse transformation of the generator (Donahue et al., 2017). These architectures share with variational autoencoders their generative capabilities and the particular organization of the latent space, and have also been employed with success for the anomaly detection task (Akcay et al., 2018; Schlegl et al., 2019; Zenati et al., 2019; Sánchez-Martín et al., 2020).

We generally refer to architectures equipped with an encoder and a decoder, and enforcing an organization of the latent space that guarantees continuity, as continuous latent space autoencoder-based neural architectures.

The main contribution of this work can be summarized as follows: we argue that the approach of selecting the worst reconstructed examples as anomalies is too simplistic if a continuous latent space autoencoder architecture is employed and, specifically, we show that the anomaly detection process can greatly benefit from taking into account the continuous latent space distribution together with the associated reconstruction error. Indeed, we show that outliers tend to lie in the sparsest regions of the combined latent and reconstruction error space and propose the novel unsupervised anomaly detection algorithms \(\mathrm{VAE}Out\) and \({{\mathrm {Latent}}Out}\), which identify outliers by performing density estimation in this augmented feature space. The proposed approach shows considerable improvements in detection performance over the standard approach based on the reconstruction error.

The rest of the paper is organized as follows. Section 2 presents preliminary definitions and discusses related work. Section 3 introduces the \(\mathrm{VAE}Out\) and \({{\mathrm {Latent}}Out}\) unsupervised anomaly detection algorithms. Section 4 illustrates experimental results. Finally, Section 5 concludes the work.

2 Preliminaries and related work

An autoencoder (AE) is a deep neural network trained with the aim of outputting a reconstruction \(\hat{x}\) of an input sample x as close as possible to x (Kramer, 1991; Hecht-Nielsen, 1995; Goodfellow et al., 2016). An autoencoder consists of two parts, an encoder \(f_\phi\) and a decoder \(g_\theta\). An encoder \(f_\phi\) is a mapping of a sample from the input feature space to a hidden representation in a latent space, and is univocally determined by parameters \(\phi\). A decoder \(g_\theta\) is a mapping of a hidden representation from the latent space to a reconstruction in the input feature space, and is univocally determined by parameters \(\theta\).

Given an autoencoder \(\langle f_\phi , g_\theta \rangle\), let x be a sample and let \(z = f_\phi (x)\) be the \(\text {latent}\) variable to which the sample x is mapped by the encoder; the reconstruction \(\hat{x}\) of x is given by \(\hat{x} = g_\theta (z) = g_\theta (f_\phi (x))\) and the reconstruction error E(x) of the autoencoder is a measure of the dissimilarity of x with respect to \(\hat{x}\). A common reconstruction error is the mean squared error (MSE), defined as

$$\begin{aligned} E(x) = \Vert x - g_\theta (f_\phi (x))\Vert _2^2. \end{aligned}$$

The autoencoder tries to minimize the reconstruction error.
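
For concreteness, the following minimal sketch shows how the reconstruction error can be turned into a per-example anomaly score once an encoder and a decoder are available; the `encode` and `decode` functions are illustrative placeholders and not part of any specific library.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Squared L2 reconstruction error E(x) = ||x - x_hat||_2^2 for one example."""
    return np.sum((x - x_hat) ** 2)

def ae_anomaly_scores(X, encode, decode):
    """Score every example of X: the larger the error, the more anomalous."""
    return np.array([reconstruction_error(x, decode(encode(x))) for x in X])
```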

Variational Autoencoders. A variational autoencoder (VAE) is a stochastic generative model aimed at outputting a reconstruction \(\hat{x}\) of a given input sample x (Kingma & Welling, 2013). To this aim, a VAE is composed of an encoder \(f_\phi\), which outputs parameters of \(q_\phi (z|x)\), that is the posterior distribution of observing the latent variable z given x, and a decoder \(g_\theta\), computing parameters of \(p_\theta (x|z)\), that is the likelihood of x given the latent variable z. The prior distribution of the latent variable z is denoted by \(p_\theta (z)\). Thus, the actual values of z are sampled from \(q_\phi (z|x)\). Given the latent variable z, the reconstruction \(\hat{x}\) is obtained as a realization of \(p_\theta (x|z)\).

As for the distributions associated with the latent variable z, that are \(p_\theta (z)\) and \(q_\phi (z|x)\), the common choice is the isotropic normal. The distribution of the likelihood \(p_\theta (x|z)\) depends on the nature of the data: Bernoulli for binary data or multivariate Gaussian for continuous data. In these cases, \(g_\theta (z)\) outputs the mean of the distribution and usually the reconstruction \(\hat{x}\) is given by \(g_\theta (z)\).

Given a variational autoencoder \(\langle f_\phi , g_\theta \rangle\) and a sample x, the \(\textit{reconstruction error}\) is represented by the cross entropy of the distribution \(q_\phi (z|x)\) relative to the distribution \(p_\theta (x|z)\):

$$\begin{aligned} E(x) = -\mathbf {E}_{q_\phi \left( z|x\right) }\left[ \log p_\theta \left( x|z\right) \right] . \end{aligned}$$

For example, given x and its reconstruction \(\hat{x}\), the corresponding contribution \(e(x,\hat{x})\) to the above error is given by \(e(x,\hat{x}) = -\log \hat{x}^x(1-\hat{x})^{(1-x)} = -x\log \hat{x}-(1-x)\log (1-\hat{x})\) for Bernoulli data and \(e(x,\hat{x}) \propto -\log \exp -\Vert x-\hat{x}\Vert ^2_2 = \Vert x-\hat{x}\Vert ^2_2\) for continuous data.

The reconstruction error can be computed through a Monte Carlo estimation. Thus, by letting L be the number of samples \(z^{(1)}, z^{(2)}, \ldots , z^{(L)}\) from \(q_\phi (z|x)\),

$$\begin{aligned} E(x) =-\frac{1}{L}\sum _{l=1}^L \log p_\theta \big (x|z^{(l)}\big ). \end{aligned}$$

The loss of the variational autoencoder is given by

$$\begin{aligned} L_{\phi ,\theta }(x) = -\mathbf {E}_{q_\phi \left( z|x\right) }\left[ \log p_\theta \left( x|z\right) \right] +\beta \cdot D_{KL}\big (q_\phi (z|x)~\Vert ~p_\theta (z)\big ), \end{aligned}$$

where the second term represents the KL divergence between the distribution \(q_\phi (z|x)\), modelled as a multivariate normal distribution with independent components, and the prior \(p_\theta (z)\), modelled as a standard multivariate normal distribution, and plays the role of a regularization term forcing the posterior distribution to be similar to the prior distribution. The hyper-parameter \(\beta\) can be used to balance the two terms of the loss (Higgins et al., 2017). In such a case, the variational autoencoder is also called a \(\beta\)-VAE.
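
As an illustration, the following sketch computes this loss for a single example under a Bernoulli likelihood and a diagonal Gaussian posterior; the function name, the choice of likelihood, and the flattened-array representation are assumptions made for the example.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Sketch of the beta-VAE loss for one example (Bernoulli likelihood).

    x, x_hat: flattened arrays with values in [0, 1];
    mu, logvar: mean and log-variance of the diagonal Gaussian q_phi(z|x)."""
    eps = 1e-7
    # Reconstruction term: negative expected log-likelihood (binary cross entropy).
    recon = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    # KL divergence between N(mu, diag(exp(logvar))) and the standard normal prior.
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl
```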

Reconstruction error-based anomaly detection. The classic use of standard AEs for anomaly detection is based on the idea that, after training, these networks are able to reproduce inlier data in output better than outliers and, hence, the loss or the reconstruction error of the network is used as an anomaly score (Hawkins et al., 2002). In An and Cho (2015) this idea is applied to VAEs, by using as anomaly score the reconstruction probability, corresponding to the negative cross entropy

$$\begin{aligned} score(x) = recprob(x) = \mathbf {E}_{q_\phi (z|x)}[\log p_\theta (x|z)] = \frac{1}{L}\sum _{l=1}^L\log p_\theta (x|z^{(l)}). \end{aligned}$$

The experimental results obtained in An and Cho (2015) show that VAE outperforms, in terms of AUC, standard AE and PCA for a semi-supervised anomaly detection setting.

A slightly different approach is pursued in Wiewel and Yang (2019), where the whole negative loss function

$$\begin{aligned} score(x) = -L_{\phi ,\theta }(x) \end{aligned}$$

is used as the anomaly score instead of the reconstruction probability, which is only one of its terms. The authors justify this choice with the slightly better results they obtain in their experiments compared to the reconstruction probability.

It has been observed that VAEs sometimes share with standard AEs the problem that they generalize so well that they can also reconstruct anomalies, which leads to some anomalies being viewed as normal data. Thus, in Kawachi et al. (2018) the authors try to overcome this problem by modifying the structure of VAEs so that they support supervised learning and can be trained with both anomalies and normal data. In particular, a prior distribution over the latent space is adopted that encourages the separation between normal and anomalous data, which leads to a non-standard loss function and anomaly score.

Generative Adversarial Networks for anomaly detection. Among recent approaches for detecting anomalies, \(\textit{Generative Adversarial Networks}\) (GANs) have been applied to this problem, yielding noticeable results.

Roughly speaking, a GAN (Goodfellow et al., 2014) is a generative model which exploits an adversarial process where two models, a discriminator D and a generator G, are trained simultaneously. The aim of the generator G is to capture the distribution of the data and, consequently, to produce samples as similar to the training samples as possible, while the aim of the discriminator D is to distinguish between samples coming from the training data and samples produced by G.

Among many existing variants, the Bidirectional GAN (Donahue et al., 2017) extends the standard GAN model by including an encoder learning the inverse of the generator, so that the mappings from the latent space to the data space and vice versa are simultaneously learnt.

The first work approaching anomaly detection with GANs is AnoGAN (Schlegl et al., 2017), later followed by its extensions GAN+ (Zenati et al., 2019) and FastAnoGAN (Schlegl et al., 2019). AnoGAN uses a standard GAN and trains it only on positive samples. Given an instance x, a point z in the latent space is searched such that G(z) is as similar to x as possible. Since the generator learns how to generate normal samples, even if x is anomalous, G(z) is expected to be non-anomalous, and thus the difference between x and G(z) highlights the anomalies.

AnoGAN has been successively improved. In Sánchez-Martín et al. (2020) a BiGAN-based approach is proposed: it exploits the network architecture of BiGAN to jointly train the mapping from image to latent space and from latent space to image, thus providing a trained model to obtain the latent representation of an input sample. GANomaly (Akcay et al., 2018) introduces a generator with three elements: an encoder and a decoder, namely an autoencoder, plus an additional encoder. Thus, given an instance x, the first encoder produces a point z in the latent space which is provided as input to the decoder; the decoder outputs \(x'\) which, in turn, feeds the second encoder that produces \(z'\). In this way the generator learns to encode normal data and to generate normal data starting from the encoded representation. Since the generator produces normal data even if the input is anomalous, its reconstruction will be normal, and the difference between z and \(z'\) represents the anomaly level.

Latent space-based anomaly detection. There are autoencoder-based anomaly detection approaches in the literature that address this task by relying solely on the embedding space rather than on the reconstruction error (Guo et al., 2018; Zhang et al., 2018; Corizzo et al., 2019). Specifically, the framework described in Zhang et al. (2018) is tailored for nonlinear process monitoring, while that described in Corizzo et al. (2019) supports predictive modeling tasks on streaming data coming from multiple geo-referenced sensors.

All three of the above approaches map points to their latent representation and then assign them a score on the basis of the distances from their k-nearest neighbors in the latent space. In particular, in Guo et al. (2018) the score is given by the distance to the k-th nearest neighbor, while in Zhang et al. (2018) and Corizzo et al. (2019) the score is given by the sum of the distances to the k-nearest neighbors. Additionally, Zhang et al. (2018) also takes into account the so-called residual space, consisting of the difference between each point and its reconstruction. Thus, a second score is obtained as the sum of the distances between the image of each point in the residual space and its k-nearest neighbors in the residual space. If both of the above scores exceed suitable thresholds, then the point is recognized as an anomaly.
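
A minimal sketch of these two latent-space scores is reported below; it assumes the matrix `Z` of latent representations has already been computed and uses scikit-learn only for the neighbor search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def latent_knn_scores(Z, k):
    """Distance to the k-th nearest neighbor and sum of the distances to the
    k nearest neighbors, both computed in the latent space (illustrative)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z)   # +1 accounts for the point itself
    dist, _ = nn.kneighbors(Z)
    dist = dist[:, 1:]                                # drop the zero self-distance
    return dist[:, -1], dist.sum(axis=1)
```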

We note that these approaches are very different from the one introduced here, since they do not combine the latent space and the reconstruction error in order to detect points that have suspicious reconstruction errors compared to their latent neighbors. Although the framework of Zhang et al. (2018) also considers the residual space, this differs from the reconstruction error. Indeed, while the latter is a scalar value, the former is another point of the original feature space. Moreover, the latent space and the residual space are taken into account separately, thus in Zhang et al. (2018) a point is declared an anomaly if both its latent representation and its residual representation are anomalous independently of each other.

3 The \(\mathrm {VAE}Out\) and \({{\mathrm {Latent}}Out}\) algorithms

Let \({{\mathcal {I}}}\) denote the input space (usually \({{\mathcal {I}}}\subseteq \mathbb {R}^d\)), let \({\mathcal {L}}\) denote the latent space (usually \({{\mathcal {L}}}\subseteq \mathbb {R}^k\) with \(k\ll d\)), and let \(\mathcal E\) denote the reconstruction error space (usually \(\mathcal{E}\subseteq \mathbb {R}\)). As pointed out above, the traditional approach pursued to detect anomalies using (variational) autoencoders is to compare the input to its reconstruction by means of the reconstruction error, thus it exploits only the input and reconstruction error spaces. We argue that the approach of selecting the worst reconstructed examples as anomalies is too simplistic if a variational autoencoder architecture is employed. Specifically, we show that the anomaly detection process can greatly benefit from taking into account the latent space distribution together with the associated reconstruction error.

To illustrate this, we considered the MNIST dataset of handwritten digits and created a training set consisting of the 6000 digits from class 0 (the inliers) plus 90 randomly picked digits from classes 1-9 (the outliers). Figure 1(a) reports the two-dimensional latent space of a variational autoencoder trained on the above set of examples (details on the architecture are provided in Section 4). In particular, we report the means of the distributions associated with the training examples (standard deviations are not shown for ease of visualization): inliers are the (blue) dots and outliers are the (red) asterisks.

Fig. 1: Comparing \(\mathrm{VAE}Out\) and recprob anomaly scores

First of all we note that, since regular examples (the inliers) form the majority of the data, they will be encoded as distributions better complying with the standard normal one. In other words, the associated latent distributions will tend to concentrate around the origin of the latent space and, more importantly, their means will tend to be closer and their supports will overlap more.

Nonetheless, not all the normal data complies with the above behavior and, thus, a non-negligible fraction of inliers spreads also over more peripheral regions. As for the abnormal examples, typically they spread over a wide portion of the latent space, including both boundary regions and the central region of the space, their location depending on the similarities they share with normal examples. This means that neither the location of the distributions in the latent space nor their degree of overlapping alone are sufficient to separate inliers from outliers. Indeed, in Fig. 1(a) the sparsest regions of the latent space contain both normal and abnormal examples.

Consider now Fig. 1(b), where the reconstruction error is associated with each latent distribution. It can be seen that even in this case the reconstruction error alone is not sufficient to guarantee a good separation between inliers and outliers. Indeed, though some clear anomalies can be recognized by means of a very high reconstruction error, most of the outliers have relatively low reconstruction errors. However, Figure 1(b) also suggests that outliers tend to lie in the sparsest regions of the latent/reconstruction error feature space. This can be understood since outliers have two properties: (1) they are few, and (2) their reconstruction error, even when it is not exceptionally large, is still significantly larger than that of their most similar inliers. All this tends to move the outliers away from the other points in the augmented feature space.

3.1 \(\mathrm {VAE}Out\) algorithm

In light of these observations, the key idea of the proposed approach, called \(\mathrm{VAE}Out\), is to simultaneously exploit information from the two above highlighted aspects, namely the latent space distribution and the reconstruction error distribution, by constructing the novel feature space \({{\mathcal {F}}} = {{\mathcal {L}}} \times {{\mathcal {E}}}\), consisting of the juxtaposition of the latent space and of the reconstruction error space, and then by measuring the degree of overlapping of the examples in this novel feature space \({\mathcal {F}}\), namely the density of the distribution of examples. Outliers will be the points lying in the sparsest regions of the feature space \({\mathcal {F}}\).

Specifically, given a dataset \(S=\{x_1,x_2,\ldots ,x_n\}\) our goal is to detect the outliers contained in S. With this aim we first train a variational autoencoder \(\langle f_\phi ,g_\theta \rangle\) to reconstruct examples in S. Given an example \(x_i\), let \(z_{x_i}\) denote the point

$$\begin{aligned} {z}_{x_i} = (z_i,\hat{e}(x_i,\hat{x}_i)) \in {{\mathcal {F}}} \end{aligned}$$

where \(z_i\sim q_\phi (z|x_i)\) is a latent space point sampled from the posterior distribution \(q_\phi (z|x_i)\) and \(\hat{e}(x_i,\hat{x}_i)\) is a measure related to the reconstruction error \(e(x_i,\hat{x}_i)\) associated with the reconstruction \(\hat{x}_i\) of \(x_i\) obtained by means of \(z_i\). Specifically, if \(e(x_i,\hat{x}_i)\) is a log-likelihood we can take the exponential \(\hat{e}(x_i,\hat{x}_i) = \exp e(x_i,\hat{x}_i)\) since all the other features are on a non-log scale, otherwise \(\hat{e}(x_i,\hat{x}_i)\) could be equal to \(e(x_i,\hat{x}_i)\).
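
A minimal sketch of the construction of the augmented point \(z_{x_i}\) follows; it assumes the latent sample and the reconstruction error have already been computed, and the function name is illustrative.

```python
import numpy as np

def augmented_point(z_sample, e, log_scale=True):
    """Build z_x = (z, e_hat) in the augmented space F = L x E.

    z_sample: a latent point sampled from q_phi(z|x);
    e: the reconstruction error of x. When e is on a log scale (e.g., a
    log-likelihood), it is exponentiated so that all the features of F are
    on comparable, non-log scales."""
    e_hat = np.exp(e) if log_scale else e
    return np.append(z_sample, e_hat)
```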

Given a dataset \(S = \{x_1,\ldots ,x_n\}\), by \(z_S\) we denote the transformed dataset \({z}_S = \{{z}_{x_1},\ldots ,{z}_{x_n}\}\) and by \(\overline{z}_S=\{\overline{z}_{x_1},\ldots ,\overline{z}_{x_n}\}\) we denote the standardized versions of \(z_S\), that is the dataset obtained by normalizing each feature according to its mean and standard deviation. Standardization is needed here to handle non-homogeneous features.

To measure the density of a point \(x_i\) in a set of points S we use nearest neighbor density estimation and, specifically, the average k-nearest neighbor distance of point \(x_i\) from points in S, denoted as \(k\text{-NN }_S(x_i)\). However, instead of employing the distance defined in the original feature space, we employ as distance \(\mathrm{dist}(x_i,x_j)\) between \(x_i\) and \(x_j\) the distance separating their images \(\overline{z}_{x_i}\) and \(\overline{z}_{x_j}\) in the transformed dataset.

Thus, the \(\mathrm{VAE}Out\) anomaly score of \(x_i\) in the dataset S consists of a k-nearest neighbor estimate of the density of \(\overline{z}_{x_i}\) in the dataset \(\overline{z}_{S}\). To take into account Monte Carlo estimation, L samples \(z_{x_i}^{(l)}\) (\(l\in \{1,\ldots ,L\}\)) can be used for each example \(x_i\) and the distance \(\mathrm{dist}(x_i,x_j)\) is obtained as the average distance between the pairs of samples \(\overline{z}_{x_i}^{(l)}\) and \(\overline{z}_{x_j}^{(l)}\).

Figure 1(c) shows the latent samples and their associated anomaly score. It can be seen that now there is a marked separation between inliers and outliers in terms of the anomaly score. Inliers tend to have low scores, while almost all the outliers are associated with the largest anomaly scores of the population as a consequence of their inherent sparsity. Figure 1(d) compares the ROC curves obtained by our method (\(\mathrm{VAE}Out\), the solid red line), with the ROC curve obtained by exploiting the reconstruction error of a variational autoencoder (recprob (An & Cho, 2015), the dashed blue line). Note that the \(\mathrm{AUC}=0.9063\) of the standard VAE increases to the value \(\mathrm{AUC}=0.9908\) if \(\mathrm{VAE}Out\) is employed.

Algorithm 1: The \(\mathrm{VAE}Out\) algorithm (pseudo-code)

Algorithm 1 details the steps of the proposed technique. First of all, a variational autoencoder VAE is trained by exploiting the input examples in \(S\). This allows the encoder \(f_\phi\) and the decoder \(g_\theta\) to output parameters of \(q_\phi\) and \(p_\theta\). Next, each example \(x_i \in S\) can be mapped to the novel feature space \(\mathcal {F} = \mathcal {L}\times \mathcal {E}\). In particular, L mappings of \(x_i\) to \(\mathcal {F}\) are built. The mappings \(z^{(l)}_i\) of \(x_i\) to \(\mathcal {L}\), with \(l \in \{1,\dots ,L\}\), are obtained by sampling values from \(q_\phi (z|x_i)\), while the mappings of \(x_i\) to \(\mathcal {E}\) are obtained by considering the reconstructions \(\hat{x}_i^{(l)} = g_\theta (z_i^{(l)})\) of \(x_i\) provided by the decoder and, then, by computing the measure \(\hat{e}(x_i,\hat{x}_i^{(l)})\) related to the reconstruction error.

Once the L mappings \(z_{x_i}^{(l)}\) of \(x_i\) to \(\mathcal {F}\) have been generated, they are normalized by standardizing each feature with respect to its mean and standard deviation. Next, the distance between all pairs of examples \(x_i\) and \(x_j\) can be computed by averaging the Euclidean distances between mappings of \(x_i\) and \(x_j\) to \(\mathcal {F}\). Finally, the k nearest neighbors of \(x_i\) according to the above illustrated distance are detected and the outlier score is computed as the mean distance between \(x_i\) and such neighbors.
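
The scoring phase of Algorithm 1 can be sketched as follows for the case \(L=1\); variable names and the use of scikit-learn for the neighbor search are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def vaeout_scores(Z, E, k):
    """Sketch of the VAEOut score for L = 1 Monte Carlo sample.

    Z: (n, d) matrix of latent samples z_i ~ q_phi(z|x_i);
    E: (n,) vector of the measures e_hat(x_i, x_hat_i).
    Returns the average distance to the k nearest neighbors computed in the
    standardized augmented space F = L x E."""
    F = np.column_stack([Z, E])
    # Standardize each feature of the augmented space (zero mean, unit std).
    F = (F - F.mean(axis=0)) / F.std(axis=0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(F)   # +1 to skip the point itself
    dist, _ = nn.kneighbors(F)
    return dist[:, 1:].mean(axis=1)                   # k-NN density estimate
```

Given the latent samples and the error measures of a dataset, the examples with the largest returned scores are reported as candidate outliers.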

3.2 \({{\mathrm {Latent}}Out}\) algorithm

In this section we present the \({{\mathrm {Latent}}Out}\) algorithm, which generalizes the strategy of the \(\mathrm{VAE}Out\) algorithm to any other autoencoder-based neural architecture \({\mathcal {A}}\) and to other ways of combining the latent space location and the reconstruction error associated with observations in order to improve detection, namely different notions of score. We denote by \({{\mathrm {Latent}}Out}_\mathrm{{{{\mathcal {A}}},score}}\) the variant of the \({{\mathrm {Latent}}Out}\) algorithm employing the architecture \({\mathcal {A}}\) and the score score. Algorithm 2 reports the pseudo-code of \({{\mathrm {Latent}}Out}\).

As far as the allowed neural architectures are concerned, we consider autoencoder-based ones, namely architectures equipped with an encoder \(f_\phi\), associated with the posterior distribution q(z|x) of observing the latent variable z given x, and a decoder \(g_\theta\). Note that the above model encompasses all the kinds of architectures described in Section 2: VAEs, but also bidirectional GANs and standard AEs.

As for the scores, we distinguish between those that estimate the density in the latent space augmented with the reconstruction error and those that determine neighbors by taking into account only the original latent space. The former scores require the transformed dataset \(\overline{z}_S\) to be standardized by normalizing each feature according to its mean and standard deviation. Conversely, when scores of the latter family are employed, the transformed dataset \(\overline{z}_S\) does not need to be standardized (\(\mu _h=0\) and \(\sigma _h=1\) are used to leave the h-th feature distribution unchanged).

Before determining final scores, the algorithm computes the sets \(\mathrm{N}_k(x_i)\) consisting of the k-nearest neighbors of \(x_i\) according to the distance \(\mathrm{dist}(x_i,x_j)\) calculated on their associated transformed points \(\overline{z}_{x_i}\) and \(\overline{z}_{x_j}\).

To perform nearest neighbor density estimation, the kNN-density score can be employed, also referred to as \(\varrho\)-score in the following:

$$\begin{aligned} \varrho {-score}\big (\mathrm{N}_k(x_i)\big ) = \frac{1}{k} \sum _{x_j\in \mathrm{N}_k(x_i)} \mathrm{dist}(x_i,x_j). \end{aligned}$$

This score requires latent space augmentation and, thus, is related to the density of transformed points in the augmented feature space. Note that \({{{\mathrm {Latent}}Out}_\mathrm{{VAE}}}_{\varrho {-score}}\), or \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\varrho }\) for short, is the instance of the \({{\mathrm {Latent}}Out}\) algorithm corresponding to the \(\mathrm {VAE}Out\) algorithm already described in Sect. 3.1. Hence, when the score specification is omitted, as in \({{\mathrm {Latent}}Out}_\mathrm{{VAE}}\), we assume the employed score is by default the \(\varrho {-score}\).

Algorithm 2: The \({{\mathrm {Latent}}Out}\) algorithm (pseudo-code)

Here we introduce an alternative way of injecting spatial information concerning the latent space in the process of outlier detection by comparing the reconstruction error of each latent point with that of its neighbors. The \(\textit{reconstruction error Z-score}\), denoted as \(\zeta {-score}\) in the following, does not require augmentation of the latent space and represents the deviation of the reconstruction error \(\hat{e}(x_i,\hat{x_i})\) from the mean reconstruction error of its k-nearest neighbors \(\mu _{\hat{e}}\big (\mathrm{N}_k(x_i)\big )\), expressed in terms of number of standard deviations \(\sigma _{\hat{e}}\big (\mathrm{N}_k(x_i)\big )\):

$$\begin{aligned} \zeta {-score}\big (\mathrm{N}_k(x_i)\big ) = \frac{\hat{e}(x_i,\hat{x_i}) - \mu _{\hat{e}}\big (\mathrm{N}_k(x_i)\big )}{\sigma _{\hat{e}}\big (\mathrm{N}_k(x_i)\big )}, \end{aligned}$$

where

$$\begin{aligned} \mu _{\hat{e}}\big (\mathrm{N}_k(x_i)\big ) = \frac{1}{k}\sum _{x_j\in \mathrm{N}_k(x_i)}\hat{e}(x_j,\hat{x}_j) \end{aligned}$$

and

$$\begin{aligned} \sigma ^2_{\hat{e}}\big (\mathrm{N}_k(x_i)\big ) = \frac{1}{k}\sum _{x_j\in \mathrm{N}_k(x_i)} \Big ( \hat{e}(x_j,\hat{x}_j) - \mu _{\hat{e}}\big (\mathrm{N}_k(x_i)\big ) \Big )^2. \end{aligned}$$

The idea is that if the reconstruction error of an observation presents large deviations from the reconstruction errors within its neighborhood, this may indicate anomalous behavior even if the reconstruction error of the observation is not itself suspiciously large. This way of perceiving abnormality clearly has connections with the one underlying the \(\varrho {-score}\), but gives different results. In order to more precisely characterize the behavior of this score with respect to the standard density score, we will compare the two scores in different scenarios.
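
A minimal sketch of the \(\zeta {-score}\) computation follows, assuming the latent representations and the reconstruction-error measures are already available; names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def zeta_scores(Z, E, k):
    """Sketch of the reconstruction error Z-score (zeta-score).

    Z: (n, d) latent representations (no augmentation, no standardization);
    E: (n,) reconstruction-error measures e_hat(x_i, x_hat_i).
    Neighbors are found in the latent space; the score is the deviation of a
    point's error from its neighbors' mean error, in standard deviation units."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z)
    _, idx = nn.kneighbors(Z)
    neigh_err = E[idx[:, 1:]]                 # errors of the k nearest neighbors
    mu = neigh_err.mean(axis=1)
    sigma = neigh_err.std(axis=1)
    return (E - mu) / (sigma + 1e-12)         # small constant for numerical safety
```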

By \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) we denote the variant of the \({{\mathrm {Latent}}Out}\) algorithm employing Variational AutoEncoder architectures with the \(\zeta {-score}\). Figure 2(a) shows the score computed by \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) (for \(k=100\)) on the variant of the MNIST dataset illustrated at the beginning of this section, while Fig. 2(b) reports the AUCs obtained by \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) and recprob. The AUC of \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) is 0.9363, thus smaller than that obtained by \({{\mathrm {Latent}}Out}_\mathrm{{VAE}}\), though better than the \(AUC=0.9063\) of the standard \(\mathrm VAE\).

In the sequel we will also consider \({{\mathrm {Latent}}Out}\) in combination with other autoencoder-based architectures, specifically GAN-based ones, such as \(\mathrm {GANomaly}\) and \(\mathrm {Fast\text{-- }AnoGAN}\), and with classic AutoEncoders.

Fig. 2: Comparing \(\zeta {-score}\) and recprob anomaly scores

Before concluding the section, we discuss the type of anomalies identified by our method. In order to characterize the kind of anomalies singled out by \({{\mathrm {Latent}}Out}\), we refer to well-established classifications of anomaly detection approaches and of the types of anomalies of interest, such as those reported in Ruff et al. (2021).

As for the approach we adopt to isolate anomalies, we can say that it couples those based on the reconstruction error with density estimation based approaches (see also Fig. 5 at page 765 of Ruff et al. (2021)). The general behavior of reconstruction error approaches is to learn the encoder-decoder pair that minimizes the reconstruction error once applied to the data at hand. The other two main families of anomaly detection approaches are one-class classification based or distribution-free, whose objective is to partition the space in an accepting region containing inliers and a rejecting region containing outliers, and probabilistic based or density estimation, which aim at reconstructing the density generating the normal data. We exploit both the first kind of approaches, to associate a reconstruction error and a latent space representation with each example, and the third kind of approaches to compute an anomaly score.

As for the kind of anomalies, the literature distinguishes between point anomalies and group anomalies, and also between non-contextual and contextual anomalies (e.g., see Fig. 2 at page 760 of Ruff et al. (2021)). We note that the anomalies singled out by our method are better characterized as point anomalies, since the scores we use are designed to be evaluated on single observations. Our score exhibits large values when some features associated with the examined point deviate from the features associated with its neighborhood. This confirms that the anomalies we detect are point anomalies, since if they were immersed in a group of similar observations, i.e. in a group of anomalies, they would probably not be pointed out as anomalous. Moreover, since the score compares the observation with its neighborhood, we believe our approach shares similarities with contextual point anomaly methods. In our case, however, the context is not a homogeneous sub-population containing the point or the spatial neighborhood of the point in the original feature space but, being represented by the spatial neighborhood in the latent space, it can be conceived as the semantic neighborhood of the point.

Summarizing, we can characterize the kind of anomalies detected by our approach as follows: \({{\mathrm {Latent}}Out}\) couples reconstruction error approaches with density estimation ones in order to detect point anomalies according to the semantic context associated with each data observation.

4 Experimental results

We start by describing settings which are common to all the experiments reported in this section.

In order to generate an unsupervised setup, we considered labelled datasets and, for each class label, we created a novel dataset having as inliers all the examples of the considered class and as outliers some randomly picked examples from the other classes. Precisely, we selected s examples (\(s=10\) or \(s=100\) have been used) from each of the other classes, so that the total number of outliers is \(s\times (m-1)\), where m denotes the number of classes.
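
The construction of each such dataset can be sketched as follows; names and the random-selection details are illustrative assumptions.

```python
import numpy as np

def build_unsupervised_setup(X, y, inlier_class, s, seed=0):
    """Keep all examples of `inlier_class` as inliers and draw s examples at
    random from each of the other classes as outliers (c = s * (m - 1))."""
    rng = np.random.default_rng(seed)
    inliers = X[y == inlier_class]
    outlier_parts = []
    for label in np.unique(y):
        if label == inlier_class:
            continue
        candidates = X[y == label]
        pick = rng.choice(len(candidates), size=s, replace=False)
        outlier_parts.append(candidates[pick])
    outliers = np.concatenate(outlier_parts)
    data = np.concatenate([inliers, outliers])
    labels = np.concatenate([np.zeros(len(inliers)), np.ones(len(outliers))])
    return data, labels   # labels (0 = inlier, 1 = outlier) are used only for evaluation
```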

In the following we consider the MNIST and Fashion-MNIST datasets. Both datasets consist of \(60,\!000\) grayscale \(28\times 28\) pixel images partitioned into 10 classes: MNIST contains handwritten digits, while Fashion-MNIST contains Zalando’s article images. The number of outliers within each dataset is also called its (absolute) contamination c. Since both of the above datasets consist of 10 classes, their contamination corresponds to \(c=9s\).

As for the autoencoder architecture employed on MNIST and \(\textit{Fashion-MNIST}\), the encoding part is composed of an initial sequence of convolutional layers that reduce the size of the data to \(14\times 14\), a flattening layer that transforms the data into vectorial form and two dense layers that bring the data to the latent space of dimension d. The decoder consists of a layer that reshapes the data into bi-dimensional form and a sequence of convolutional layers that transform the data back into the original \(28\times 28\) shape.
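
A minimal PyTorch sketch of such an encoder/decoder pair is reported below; the number of convolutional layers, the filter counts, and the layer sizes are illustrative assumptions and do not reproduce the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Illustrative convolutional VAE for 28x28 grayscale images."""
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: convolution shrinks 28x28 -> 14x14, then two dense layers
        # output the mean and log-variance of q_phi(z|x).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 14 * 14, latent_dim)
        self.fc_logvar = nn.Linear(32 * 14 * 14, latent_dim)
        # Decoder: reshape to a 14x14 feature map, then upsample back to 28x28.
        self.fc_dec = nn.Linear(latent_dim, 32 * 14 * 14)
        self.deconv = nn.Sequential(
            nn.Unflatten(1, (32, 14, 14)),
            nn.ConvTranspose2d(32, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.conv(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def decode(self, z):
        return self.deconv(self.fc_dec(z))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization
        return self.decode(z), mu, logvar
```

For a batch x of shape (B, 1, 28, 28), `ConvVAE(latent_dim=32)(x)` returns the reconstruction together with the posterior parameters needed by the loss and by the scoring phase.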

As for the parameter L, we verified that it has a limited impact on the accuracy and, hence, in the following we report results for \(L=1\). All the experimental results are obtained by averaging over ten runs, thus we report both the mean and the standard deviation of performance measures. In the following, tables reporting experimental results highlight the best performance in bold.

4.1 Experiments with the \(\mathrm {VAE}Out\) algorithm

If not otherwise stated, during experiments described in this section the parameter k is held fixed to 0.25c, thus \(k=15\) for \(s=10\) and \(k=150\) for \(s=100\). Later, we will study the effect of the parameter k on the accuracy. According to the literature (Higgins et al., 2017), we employ large values for the parameter \(\beta\) in order to allow the variational autoencoder to properly organize the latent space, and specifically \(\beta =10^{4}\).

Fig. 3: MNIST dataset (\(s=10\)): AUCs of \(\mathrm{VAE}Out\) and recprob

\(\mathrm{VAE}Out\) versus recprob. First of all, we investigated the impact of the proposed strategy on the accuracy of the variational autoencoder-based outlier detection approach, by comparing the Area Under the ROC Curve (AUC) of \(\mathrm{VAE}Out\) with that of recprob, that is the standard strategy based on exploiting the VAE reconstruction error. Comparisons are conducted by considering the influence of the latent space dimension on the quality of the detection. Figure 3 reports the AUCs of VAEOut (red circle-marked lines) and recprob (blue square-marked lines) for the latent space dimension d ranging in the interval [2, 32] and \(s=10\). Due to the lack of space, results for \(s=100\) are summarized in Table 1.

The results highlight that the proposed strategy is able to improve the accuracy of VAE-based outlier detection. Indeed, in many runs \(\mathrm{VAE}Out\) improves over recprob, and for almost all the digits the achieved improvement is substantial. The experiments also show that the accuracy of \(\mathrm{VAE}Out\) is positively affected by the latent space dimension, while this does not seem to be the case for the standard VAE. We explain this behavior by noting that lower dimensions constrain the distributions within the latent space to overlap more, thus worsening the separation induced by the density associated with latent points. From these experiments, we conclude that a good choice for the latent space dimension d is in the order of a few tens, namely \(d\in [16,32]\).

Table 1 AUC for the MNIST datasets (\(s=100\)).

Note that the intervals of AUC values reported on the vertical axes of the plots are not identical. As for digit 1, it must be pointed out that the variational autoencoder is very good at reconstructing it, probably since it is the easiest digit in the set, and this explains why the recprob AUC is very close to 1. \(\mathrm{VAE}Out\) shows a slightly smaller AUC for low latent dimensions, but reaches a similar AUC for sufficiently large dimensions.

Table 2 MNIST dataset Prec@n for n set to the contamination \(c=9s\)

Precision. Another measure employed to evaluate outlier detection approaches is the Precision. Specifically, since the goal is to isolate the most deviating dataset examples, we used the Prec@n measure, representing the percentage of true outliers among the examples associated with the top n anomaly scores. We set n to the absolute contamination \(n=c\). Table 2 compares the Prec@n achieved by \(\mathrm{VAE}Out\) and recprob on MNIST (\(d=32\)). The results point out that \(\mathrm{VAE}Out\) is able to significantly increase the percentage of true anomalies among the examples ranked in the very first positions. Moreover, in several cases the precision is doubled.

Note that, despite the case \(s=10\) showing slightly larger AUCs, the Prec@n is higher for the case \(s=100\). We explain this behavior by noticing that, while the inliers of the two datasets are the same, the outliers for the case \(s=100\) are ten times as many, and this increases the probability that the largest scores are assigned to outliers, although overall the outliers are ranked slightly worse according to the AUC.
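
For clarity, the Prec@n measure can be sketched as follows (labels equal to 1 mark the true outliers; names are illustrative).

```python
import numpy as np

def precision_at_n(scores, labels, n):
    """Fraction of true outliers among the n examples with the highest scores."""
    top_n = np.argsort(-scores)[:n]
    return labels[top_n].mean()
```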

Fig. 4: MNIST dataset: AUC of \(\mathrm{VAE}Out\) for varying k values

Sensitivity analysis for the parameter k. Experiments reported in Figure 4 are aimed at determining the optimal value for the parameter k, by performing a sensitivity analysis with respect to this parameter. With this aim, we took into account log-spaced values k in the interval [2, 1024] and determined the AUC of \(\mathrm{VAE}Out\) on the MNIST dataset for \(s=10\) and \(s=100\). In these experiments, the latent space dimension d is held fixed to \(d=32\).

To help understand the effect of k on the accuracy, on the horizontal axis we report the value \(k/c = k/(9s)\) of k normalized by the absolute contamination c of the dataset, also called the normalized neighborhood. Each plot also reports the AUC achieved by recprob. It can be seen that for a wide range of values of the parameter k the AUC of \(\mathrm{VAE}Out\) is considerably larger than that of recprob. In most cases this holds for all the reported values of k.

This experiment shows that, although \(\mathrm{VAE}Out\) requires an additional parameter with respect to a standard VAE, the selection of the right value for this parameter is not critical, since an improvement is almost always guaranteed. Moreover, the optimal value for the normalized neighborhood appears to be located within the interval \([10^{-1},10^0]\). Thus, the normalized neighborhood provides a tool for selecting a reasonable value for k. As a rule of thumb, we recommend using \(k\approx N/3\), where N is the user-specified expected absolute contamination or, vice versa, returning \(N\in [3k,5k]\) anomalies when k is user-specified.

Table 3 MNIST (\(s=10\)) AUC for \(d = 32\) (\(k=30\))
Table 4 Fashion-MNIST (\(s=10\)) AUC for \(d = 32\) (\(k=30\))

Impact of the neural architecture. In this experiment we compare the detection performance of Auto-Encoder based anomaly detection (AE), Variational Auto-Encoder based anomaly detection (VAE), and \(\mathrm{VAE}Out\) based anomaly detection. The aim of this experiment is not to determine the best configuration for each approach, but rather to compare the performance of these three autoencoder-based approaches when the architecture is held fixed. Thus, all the results refer to equivalent network architectures with the same common hyper-parameters. Specifically, the AE has the same structure as the VAE, except for employing a deterministic latent space and a loss consisting only of the reconstruction error, while \(\mathrm{VAE}Out\) builds on the same VAE architecture described at the beginning of this section.

Tables 3 and 4 report the AUC of the three methods on the MNIST and Fashion-MNIST datasets with \(s=10\), respectively, for \(d=32\) and k set to 30, that is, to one third of the dataset contamination. While on the MNIST dataset VAE performs better than AE, on the Fashion-MNIST dataset, with the same loss hyper-parameter \(\beta\), VAE performs worse than the corresponding deterministic architecture.

Importantly, \(\mathrm{VAE}Out\) always shows clear improvements over the corresponding VAE architecture. On MNIST, for some critical classes (see for example digit 8), the improvement is striking. On \(\textit{Fashion-MNIST}\), despite the sometimes poor performance of the VAE reconstruction error, by exploiting the latent space information \(\mathrm{VAE}Out\) is able to achieve excellent detection performance, almost always filling the gap between the AE and VAE results and going even further.

4.2 Experiments with the \({{\mathrm {Latent}}Out}\) algorithm

In the previous section we experimented with the \(\mathrm{VAE}Out\) algorithm. In this section we complete the experimental results by considering the general \({{\mathrm {Latent}}Out}\) algorithm. Since \(\mathrm{VAE}Out\) can be regarded as an instance of \({{\mathrm {Latent}}Out}\), in order to make comparisons among the considered instances clear, in the following we refer to the former algorithm as \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\varrho }\).

Applying \({{\mathrm {Latent}}Out}\) to VAE architectures. We start by experimenting with \({{\mathrm {Latent}}Out}\) on Variational AutoEncoder architectures. The number of outlying examples s coming from each different class label is set to \(s=10\). Experimental results are reported in Table 5, showing the AUC obtained by \({{\mathrm {Latent}}Out}\) on MNIST (table on the top) and Fashion-MNIST (table on the bottom). In each table, the first column reports the class label, the second the AUC of the basic VAE architecture, while the last two columns show the AUC of \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\varrho }\) and \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\), respectively.

In these experiments, we varied d in [2, 32] and k in [2, 1000] and report the optimal AUC scored by each method. While for \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\varrho }\) the optimal AUC value was found in the intervals \(d\in [8,32]\) and \(k\in [30,100]\) on both datasets, in agreement with the analysis already performed in Sect. 4.1, \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) behaved differently in terms of the optimal values of the parameters. Indeed, \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) seems to perform better for smaller latent space dimensionalities, namely \(d\in [2,8]\), and for larger neighborhood parameters, namely \(k>200\).

As for the algorithm performance, in these experiments \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\varrho }\) always guaranteed the best accuracy. As for \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\), it exhibits improvements over the standard VAE in many cases.

Table 5 Comparison of \({{\mathrm {Latent}}Out}_\mathrm{{VAE}}\) on MNIST (above) and Fashion-MNIST (below)

Applying \({{\mathrm {Latent}}Out}\) to GAN architectures. Here we discuss experiments concerning \({{\mathrm {Latent}}Out}\) on GAN autoencoder-based architectures. To better exploit the power of GANs, we considered the richer CIFAR-10 dataset, a labeled subset of the 80 million tiny images dataset. This dataset consists of \(60,\!000\) \(32\times 32\) colour images partitioned into 10 classes, with \(6,\!000\) images per class. We employed the architectures of \(\mathrm {GANomaly}\) and \(\mathrm {Fast\text{-- }AnoGAN}\) described in the respective papers.

Fig. 5: AUC of \(\mathrm {GANomaly}\), \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\varrho }\) and \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\) on CIFAR-10 varying the epochs

We set \(s=10\), \(d=2\), and \(k=30\) for \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\varrho }\) and \(k=500\) for \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\). We observed that the accuracy of \(\mathrm {GANomaly}\) is rather unstable and, to better understand its behavior, we measured the AUC of the methods as a function of the number of training epochs (Figure 5 reports these values for some classes and a specific run; the missing classes showed the same behavior). Interestingly, \({{\mathrm {Latent}}Out}\) shows large improvements over the AUC of \(\mathrm {GANomaly}\), even when the latter value is quite poor. Notably, \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\) is always able to reach very large AUC values in the very first iterations and maintains its accuracy throughout the training procedure. Table 6 reports the AUC of the methods after 200 epochs.

Table 6 AUC of \({{\mathrm {Latent}}Out}_\mathrm{{\mathrm {GANomaly}}}\) on CIFAR-10

In order to visualize the most difficult examples for each method, we collected the examples scoring the top absolute difference between the ranking of \(\mathrm {GANomaly}\) and the ranking of \({{\mathrm {Latent}}Out}\). We verified that in these experiments all the above examples correspond to true anomalies showing a small \(\mathrm {GANomaly}\) score and a large \({{\mathrm {Latent}}Out}\) score. Figures 6 (for \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\varrho }\)) and 7 (for \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\)) report these most deviating anomalies (on the first row). Under each image we report the relative ranking according to \(\mathrm {GANomaly}\) (above) and according to \({{\mathrm {Latent}}Out}\) (below), where 1.0 (0.0, resp.) stands for top ranked (bottom ranked, resp.). The subsequent three rows represent the 1st, 2nd and 3rd nearest neighbors in the latent space of the anomalous example. As expected, in most cases anomalies share similarities with their neighbors in the latent space, but the different reconstruction error allows \({{\mathrm {Latent}}Out}\) to subvert the ranking for these anomalous examples.

Fig. 6: Most deviating anomalies recognized by \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\varrho }\)

Fig. 7: Most deviating anomalies recognized by \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\)

Due to the strong performance of \({{\mathrm {Latent}}Out}_\zeta\), we also tested \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {Fast\text{-- }AnoGAN}}},\zeta }\) on CIFAR-10. The AUC values are reported in Table 7 without standard deviations, since we executed a reduced number of runs.

Table 7 AUC of \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {Fast\text{-- }AnoGAN}}},\zeta }\) on CIFAR-10
Fig. 8: AUC of \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) on MNIST and of \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\) on CIFAR-10 for different k values (\(d = 2\))

Sensitivity of \({{\mathrm {Latent}}Out}_\zeta\) to the parameter k. In order to study the impact of the parameter k on \({{\mathrm {Latent}}Out}_\zeta\), we considered log-spaced values of k and determined the AUC of \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\) on MNIST and of \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\) on CIFAR-10, in both cases for \(s=10\). Since, according to the previous experiments, \({{\mathrm {Latent}}Out}_\zeta\) behaves better for smaller latent space dimensionalities, we held d fixed to 2.

Figure 8 reports the results of this experiment. The abscissa reports the value of the parameter k, ranging from 50 to 800, an interval including the optimal performances obtained in the other experiments.

The results highlight that \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\) is practically insensitive to the parameter k, while it has a certain impact on the quality of \({{\mathrm {Latent}}Out}_{\mathrm{{VAE}},\zeta }\). In the latter case, some classes benefit from enlarging the value of k. An intermediate value seems good enough in all cases. We can conclude that the \(\zeta {-score}\) requires values of k different from the \(\varrho {-score}\) to reach its best performances. We can relate k to the contamination by \(k=3c\) (\(c=90\) in these experiments) and suggest \(k\approx 3N\) as a rule of thumb to select an initial value for k.

Comparison with baseline methods. We compared our method with three baseline methods: k-Nearest Neighbour (KNN), Isolation Forest (IF) and Local Outlier Factor (LOF). In particular, we considered the tabular datasets in Rayana (2016), whose statistics are reported in Table 8, as well as the Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set. The former are a family of binary datasets created specifically for outlier detection; the latter is a multiclass dataset that we treated in the same way as the other multiclass image datasets. It consists of a collection of real-valued attributes obtained from sensor signals with the aim of recognizing 12 different human movements; we chose this dataset because, among the available tabular datasets, it is one with the largest dimension (\(d=561\)) and size (\(n=10929\)) and therefore more suitable for our analysis.
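
The three baselines can be scored, for instance, with scikit-learn as sketched below; the hyper-parameters shown are illustrative and not those used in the experiments.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def baseline_scores(X, k=30):
    """Anomaly scores of the three baselines (higher = more anomalous)."""
    # KNN: distance to the k-th nearest neighbor.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    knn_score = dist[:, -1]
    # Isolation Forest: negate score_samples so that larger means more anomalous.
    if_score = -IsolationForest(random_state=0).fit(X).score_samples(X)
    # LOF: negate the negative outlier factor fitted on the same data.
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit_predict(X)
    lof_score = -lof.negative_outlier_factor_
    return knn_score, if_score, lof_score
```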

Among all the versions of our method we selected the ones based on standard Autoencoders, i.e. \({{\mathrm {Latent}}Out}_{\mathrm{{AE}},\varrho }\) and \({{\mathrm {Latent}}Out}_{\mathrm{{AE}},\zeta }\), because the other architectures are specific to image datasets. Results are reported in Tables 9 and 10.

These datasets consist of few attributes compared with image datasets and have a flat nature. In some cases the baseline methods are able to behave better than the more complex neural architectures. However, importantly, \({{\mathrm {Latent}}Out}_{\mathrm{{AE}},\varrho }\) and \({{\mathrm {Latent}}Out}_{\mathrm{{AE}},\zeta }\) almost always improve over standard Autoencoders and perform better than the baselines in several cases.

As the dimension and the complexity of the datasets grow, our method can perform far better than the baselines; indeed, we also considered CIFAR-10 as a more complex scenario. As we can see in Table 11, on CIFAR-10 the AUC values obtained by KNN, IF and LOF are always much smaller than the ones obtained by \({{\mathrm {Latent}}Out}_{\mathrm{{\mathrm {GANomaly}}},\zeta }\).

This set of experiments highlights that our method is very effective on datasets of large dimensionality. This is due to the fact that using the feature space \({{\mathcal {F}}}\) instead of the original data space preserves the semantic distribution of inliers and outliers, while its smaller dimension allows issues related to the curse of dimensionality to be avoided.

Table 8 Statistics of the datasets
Table 9 AUC of \({{\mathrm {Latent}}Out}_\mathrm{{AE}}\) on tabular datasets
Table 10 AUC of \({{\mathrm {Latent}}Out}_\mathrm{{AE}}\) on Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set
Table 11 AUC of \({{\mathrm {Latent}}Out}_\mathrm{{AE}}\) on CIFAR-10

5 Conclusions

The main goal of this work is to show that, within the context of autoencoder neural network architectures, the outlier detection process can greatly benefit from taking into account the latent space distribution together with the associated reconstruction error. Specifically, we observed that outliers tend to lie in the sparsest regions of the combined latent/error space and proposed a novel unsupervised anomaly detection algorithm, called \({{\mathrm {Latent}}Out}\), that exploits this property to identify outliers. The novel approach consistently showed considerable improvements in detection performance over the basic autoencoder-based architecture to which it is applied, especially as the dimension of the dataset increases. The comparison with baseline methods has shown that it achieves comparable performance on less complex datasets.