1 Introduction

The Anomaly Detection task consists in isolating samples in a dataset that are suspected of not being generated by the same distribution as the majority of the data.

Depending on the setting of the dataset, we can distinguish three different families of methods for Anomaly Detection (Chandola et al., 2009; Aggarwal, 2013). Supervised methods consider a dataset whose items are labeled as normal or abnormal and build a classifier; typically the dataset is highly unbalanced and the anomalies form a rare class. Semi-supervised methods, also called one-class classifiers, take as input only examples from the normal class and use them to train the detector. Unsupervised methods assign an anomaly score to each object of the input dataset in order to find the anomalies in it. Several statistical, data mining and machine learning approaches have been proposed for detecting outliers, such as statistical-based (Davies & Gather, 1993; Barnett & Lewis, 1994), distance-based (Knorr et al., 2000; Angiulli & Pizzuti, 2002, 2005; Angiulli et al., 2006; Angiulli & Fassetti, 2009), density-based (Breunig et al., 2000; Jin et al., 2001), reverse nearest neighbor-based (Hautamäki et al., 2004; Radovanović et al., 2015; Angiulli, 2017, 2020), SVM-based (Schölkopf et al., 2001; Tax & Duin, 2004), deep learning-based (Goodfellow et al., 2016; Chalapathy & Chawla, 2019), and many others (Chandola et al., 2009; Aggarwal, 2013).

Among deep learning methods for anomaly detection, the ones based on Autoencoders (AE) and Variational Autoencoders (VAE) have shown good performance (Hawkins et al., 2002; An & Cho, 2015; Chalapathy & Chawla, 2019). The standard application of these architectures to anomaly detection is based on the reconstruction error, that is, a measure of the difference between the input and the reconstructed data. It relies on the assumption that, since the majority of the training data belongs to the normal class, these networks are able to reconstruct the inliers better than the outliers.

In Angiulli et al. (2020, 2022) the authors argue that this approach is too simplistic and highlight that these architectures generalize so well that they can also reconstruct anomalies well (An & Cho, 2015; Kawachi et al., 2018; Sun et al., 2018; Chalapathy & Chawla, 2019). To overcome this issue they introduce a novel approach, called \(\text {Latent}{Out}\), that is based on the joint use of the latent space and the reconstruction error. In particular, they define two different anomaly scores:

  • \(\varrho {-score}\) that is obtained as a k-nearest neighbor estimation on the feature space composed by the latent space combined with the reconstruction error;

  • \(\zeta {-score}\) that consists in the difference between the reconstruction error of a given point and the mean of the reconstruction errors of its k nearest neighbors in the latent space.

Moreover, they extend the application of \(\text {Latent}{Out}\) to other architectures such as \(\text {GANomaly}\) (Akcay et al., 2018) and AnoGAN (Schlegl et al., 2017).

In this work the \(\text {Latent}{Out}\) paradigm is expanded toward three directions:

  • We implement a version of \(\text {Latent}{Out}\) for the semi-supervised scenario: we adapt the scores to this setting and perform experiments to assess the performance of \(\text {Latent}{Out}\). In particular, we test the technique using VAE and GANomaly as base architectures, since they are easily adapted to semi-supervised scenarios.

  • We consider two new architectures, \(\text {MO}-\text {GAAL}\) and \(\text {SO}-\text {GAAL}\) (Liu et al., 2020), and we modify them in order to make \(\text {Latent}{Out}\) applicable. We test both the original scores on them.

  • We show that the feature space induced by \(\text {Latent}{Out}\) has the characteristic to enhance the separation between normal and anomalous data. This is accomplished by generalizing the approach of \(\text {Latent}{Out}\) in order to exploit other definitions of scores. Specifically, we define novel scores by coupling the \(\text {Latent}{Out}\) strategy with some existing data mining outlier detection methods. As an important result, experimental results highlight that these novel variants of \(\text {Latent}{Out}\) are able to improve performances over the corresponding base methods.

The rest of the paper is organized as follows. Section 2 discusses related work; Section 3 describes the instruments at the basis of our work and presents the contributions in three subsections; Section 4 experimentally tests the introduced methods; finally, Section 5 concludes the paper.

2 Related works

Deep Learning models for anomaly detection (Ruff et al., 2021; Pang et al., 2020) can be divided into two families: reconstruction error-based methods employing Autoencoders (AE) and GAN-based methods relying on Generative Adversarial Networks (GAN).

Autoencoders (Kramer, 1991; Hecht-Nielsen, 1995; Goodfellow et al., 2016; Hawkins et al., 2002) are a special type of neural networks that aim at obtaining a reconstruction \(\hat{x}\) as close as possible to the input sample x by minimizing the reconstruction error \(E(x)=\Vert x - \hat{x}\Vert _2^2\) after encoding x into a hidden representation in a latent space.

A variational autoencoder (VAE) is a stochastic generative model that can be seen as a variant of the standard AE (Kingma & Welling, 2013). The main differences are that a VAE encodes each example as a normal distribution over the latent space instead of as a single point, and introduces a regularization term in the loss that maximizes the similarity of these distributions with the standard normal distribution.

The effect of these operations is that the latent space of a VAE is continuous, which means that in this space close points will lead to close decoded representation, thus avoiding the severe overfitting problem affecting standard autoencoders, for which some points of the latent space will give meaningless content once decoded. In the field of anomaly detection VAEs are used, in analogy with standard AE, by defining a reconstruction probability (An & Cho, 2015).

A Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is a generative model composed of two models trained simultaneously: a generator G that aims to capture the distribution of the data in order to produce samples that are as realistic as possible, and a discriminator D that must distinguish the data belonging to the dataset from the ones artificially created by G. AnoGAN (Schlegl et al., 2017), with its extensions GAN+ (Zenati et al., 2019) and FastAnoGAN (Schlegl et al., 2019), and \(\text {GANomaly}\) (Akcay et al., 2018) are the first works in which GANs are used for the task of anomaly detection.

Some recent works have observed that the anomaly detection performance obtained by both reconstruction error-based and GAN-based architectures can be enhanced by taking into account both the reconstruction error and the latent space. In particular, in Angiulli et al. (2020) the authors propose to consider the enlarged feature space \({\mathcal F} = {\mathcal L} \times {\mathcal E}\), where \({\mathcal L}\) represents the latent space and \({\mathcal E}\) is the reconstruction error space (usually \({\mathcal E}\subseteq \mathbb {R}\)), and introduce the first variant of the \(\text {Latent}{Out}\) algorithm, which consists in performing a KNN density estimation in the space \({\mathcal F}\).

Specifically, the \(\varrho {-score}\) is defined as

$$ \varrho {-score}(x_i) = \frac{1}{k} \sum _{x_j\in \textrm{N}^{{\mathcal F}}_k(x_i)} \mathrm{d_{\mathcal F}}(x_i,x_j), $$

where \(\textrm{N}^{{\mathcal F}}_k(x_i)\) is the set of the k-nearest neighbors of the point \(x_i\) according to the distance \(\mathrm{d_{\mathcal F}}\) that corresponds to the euclidean distance calculated between the images of \(x_i\) and \(x_j\) on the feature space \({\mathcal F}\).
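As an illustration of the definition above, the following minimal sketch computes the \(\varrho {-score}\) by brute force. It is not the authors' implementation: the function name `rho_scores`, the toy latent codes and errors, and the choice of excluding a point from its own neighborhood are assumptions made only for this example.

```python
import math

def rho_scores(latent, errors, k):
    """Mean Euclidean distance in F = L x E from each point to its k nearest neighbors."""
    # Image of each point in F: latent coordinates followed by the reconstruction error.
    feats = [tuple(z) + (e,) for z, e in zip(latent, errors)]
    scores = []
    for i, fi in enumerate(feats):
        # Assumption: a point is not counted among its own neighbors.
        dists = sorted(math.dist(fi, fj) for j, fj in enumerate(feats) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy data: three points close together and one isolated point with a high error.
latent = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
errors = [0.1, 0.1, 0.2, 0.9]
scores = rho_scores(latent, errors, k=2)
```

In this toy configuration the isolated point receives the largest score, since its nearest neighbors in \({\mathcal F}\) are far away both in the latent coordinates and in the error coordinate.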

In Angiulli et al. (2022) a variant of \(\text {Latent}{Out}\) considering an additional anomaly score, called \(\zeta {-score}\), is presented. This score is related to the difference between the reconstruction error \(E(x_i)\) of the point \(x_i\) and the mean of the reconstruction errors of its k-nearest neighbors in the latent space, in formula

$$ \zeta {-score}\big (x_i\big ) = \frac{E(x_i) - \mu \big (\textrm{N}^{{\mathcal L}}_k(x_i)\big )}{\sigma \big (\textrm{N}^{{\mathcal L}}_k(x_i)\big )}, $$

where \(\textrm{N}^{{\mathcal L}}_k(x_i)\) is the set of the k nearest neighbors, in the latent space \(\mathcal L\), of the image of \(x_i\) in the same space, and

$$ \mu \big (\textrm{N}^{{\mathcal L}}_k(x_i)\big ) = \frac{1}{k}\sum _{x_j\in \textrm{N}^{{\mathcal L}}_k(x_i)}E(x_j), \quad \sigma ^2\big (\textrm{N}^{{\mathcal L}}_k(x_i)\big ) = \frac{1}{k}\sum _{x_j\in \textrm{N}^{{\mathcal L}}_k(x_i)} \Big ( E(x_j) - \mu \big (\textrm{N}^{{\mathcal L}}_k(x_i)\big ) \Big )^2. $$
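The \(\zeta {-score}\) can be sketched in the same brute-force style. The function name `zeta_scores`, the toy data, the exclusion of the point from its own neighborhood and the small constant `eps` (added to avoid a division by zero when all neighbors share the same error) are assumptions of this example and are not part of the original definition.

```python
import math

def zeta_scores(latent, errors, k, eps=1e-12):
    """Standardized gap between a point's error and the errors of its latent-space neighbors."""
    n = len(latent)
    scores = []
    for i, zi in enumerate(latent):
        # Indices of the k nearest neighbors in the latent space (the point itself is excluded).
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: math.dist(zi, latent[j]))[:k]
        nbr_err = [errors[j] for j in nbrs]
        mu = sum(nbr_err) / k
        sigma = math.sqrt(sum((e - mu) ** 2 for e in nbr_err) / k)
        scores.append((errors[i] - mu) / (sigma + eps))  # eps is an added safeguard
    return scores

# Toy data: the last point lies among the others in latent space but is badly reconstructed.
latent = [(0.0,), (0.1,), (0.2,), (0.15,)]
errors = [0.1, 0.2, 0.3, 0.9]
zeta = zeta_scores(latent, errors, k=2)
```

Here the last point is not isolated in the latent space, so a purely distance-based score would not single it out, while its reconstruction error is far above that of its latent neighbors.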

Next, we present the novel extensions of the \(\text {Latent}{Out}\) method.

3 Methodology

3.1 Extension to GAAL architectures

\(\text {Latent}{Out}\) has already been successfully applied to the above-mentioned GAN-based architectures. Here we apply \(\text {Latent}{Out}\) to Single-Objective Generative Adversarial Active Learning (\(\text {SO}-\text {GAAL}\)) (Liu et al., 2020), a novel adversarial method for anomaly detection based on the mini-max game between a generator that creates potential anomalies and a discriminator that tries to draw a separation boundary between the anomalies and the normal class. We also deal with Multiple-Objective GAAL (\(\text {MO}-\text {GAAL}\)), an extension of \(\text {SO}-\text {GAAL}\) which employs multiple generators with different objectives in order to prevent the generator from falling into the mode collapse problem.

In the standard version of the GAAL architectures, the generator has a decoder structure sampling from a low dimensional latent space \(\mathcal L\) and producing the artificial anomalies. The overall architecture does not contemplate an encoder module able to map the input data point to the generator latent space, which is essential to apply our technique upon it.

Indeed, even if the discriminator includes an encoder, this is designed to solve a different problem, that is to map the data points to a real number expressing their distance to the decision boundary.

Since, in order to be applied, \(\text {Latent}{Out}\) needs an architecture that, besides producing an anomaly score and having a latent space \(\mathcal L\), has a proper encoder, i. e. a mechanism to map data points from their original space into \(\mathcal L\), in this paper we modify the \(\text {SO}-\text {GAAL}\) (respectively \(\text {MO}-\text {GAAL}\)) by adding one (respectively many) encoder submodule to enable the application of \(\text {Latent}{Out}\).

With the aim of solving this issue, we modify the architecture of \(\text {SO}-\text {GAAL}\) by adding an encoder \(f_\phi \) that receives in input the original data \(x_i\) and outputs its latent representation \(z_i\), that in turn is passed to the generator.

The same problem arises for the \(\text {MO}-\text {GAAL}\) architecture; we address it by adding one encoder for each of the M generators of the network, obtaining the encoders \(f_\phi ^{(1)},\dots ,f_\phi ^{(M)}\). In this way, each point \(x_i\) is associated with M latent representations \(z_i^{(1)},\dots ,z_i^{(M)}\), where \(z_i^{(j)}=f_\phi ^{(j)}\left( x_i\right) \) for each \(j=1,\dots ,M\); we therefore define the latent transformation of \(x_i\) as the mean of these points

$$ z_i=\frac{1}{M}\sum _{j=1}^Mz_i^{(j)}. $$
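The averaging of the M latent representations can be illustrated as follows. The three lambda functions are hypothetical stand-ins for trained encoder networks and are used only to make the example runnable; `mean_latent` is likewise a name introduced here.

```python
def mean_latent(encoders, x):
    """Average the latent codes produced by the M encoders for a single input x."""
    codes = [f(x) for f in encoders]
    m = len(codes)
    return [sum(code[d] for code in codes) / m for d in range(len(codes[0]))]

# Hypothetical toy encoders mapping a 2-dimensional input to a 2-dimensional latent code.
encoders = [
    lambda x: [x[0] + x[1], x[0] - x[1]],
    lambda x: [2 * x[0], 0.0],
    lambda x: [0.0, 2 * x[1]],
]
z = mean_latent(encoders, [1.0, 3.0])
```

With these stand-ins the three codes are (4, -2), (2, 0) and (0, 6), so the resulting latent transformation is their coordinate-wise mean.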

Finally, in all the three parts of the GAAL (encoders, generators and discriminator) we add some convolutional layers in order to make them deeper and more suitable for image data.

3.2 Semi-supervised outlier detection with \(\text {Latent}{Out}\)

The semi-supervised setting is characterized by the presence of a training set \(T=\left\{ t_1,\dots ,t_n\right\} \) composed only by normal items and a test set \(X=\left\{ x_1,\dots ,x_m\right\} \) with binary labels \(Y=\left\{ y_1,\dots ,y_m\right\} \), where \(y_i=0\) if \(x_i\) is normal and \(y_i=1\) if it is an anomaly.

The application of \(\text {Latent}{Out}\) to this context, instead of the classical unsupervised setting for which it has been designed, requires dealing with the fact that the models are trained only on normal data. In particular, given a point \(x_i\) in the test set, the semi-supervised versions of the \(\varrho {-score}\) and of the \(\zeta {-score}\) require the computation of the distance between \(x_i\) and each example \(t_j\) of the training set, in the enlarged feature space \(\mathcal F\) and in the latent space \(\mathcal L\), respectively. Thus,

$$ \varrho {-score}(x_i) = \frac{1}{k} \sum _{t_j\in \textrm{N}^{\mathcal F}_k(x_i)} \mathrm{d_{\mathcal F}}(x_i,t_j), \quad \zeta {-score}\big (x_i\big ) = \frac{E(x_i) - \mu _T\big (\textrm{N}^{\mathcal L}_k(x_i)\big )}{\sigma _T\big (\textrm{N}^{\mathcal L}_k(x_i)\big )}, $$

where

$$ \mu _T\big (\textrm{N}^{\mathcal L}_k(x_i)\big ) = \frac{1}{k}\sum _{t_j\in \textrm{N}^{\mathcal L}_k(x_i)}E(t_j), \quad \sigma _T^2\big (\textrm{N}^{\mathcal L}_k(x_i)\big ) = \frac{1}{k}\sum _{t_j\in \textrm{N}^{\mathcal L}_k(x_i)} \Big ( E(t_j) - \mu _T\big (\textrm{N}^{\mathcal L}_k(x_i)\big ) \Big )^2. $$

We note that in this scenario the elements of the neighborhood \(\textrm{N}_k(x_i)\) of \(x_i\in X\) are selected among the objects of the training set T.
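A minimal sketch of the two semi-supervised scores is given below; the only difference with respect to the unsupervised case is that the neighbors of a test point are searched among the training points. The function name, the toy data and the constant `eps` (a safeguard against zero variance) are assumptions of this example.

```python
import math

def semi_supervised_scores(train_latent, train_err, test_latent, test_err, k, eps=1e-12):
    """Return (rho, zeta) for each test point, with neighbors drawn from the training set T."""
    train_feats = [tuple(z) + (e,) for z, e in zip(train_latent, train_err)]
    rho, zeta = [], []
    for z, e in zip(test_latent, test_err):
        # rho-score: mean distance in F to the k nearest training points.
        x_feat = tuple(z) + (e,)
        d_f = sorted(math.dist(x_feat, t) for t in train_feats)[:k]
        rho.append(sum(d_f) / k)
        # zeta-score: errors of the k nearest training points in the latent space L.
        order = sorted(range(len(train_latent)),
                       key=lambda j: math.dist(z, train_latent[j]))[:k]
        nbr_err = [train_err[j] for j in order]
        mu_t = sum(nbr_err) / k
        sigma_t = math.sqrt(sum((v - mu_t) ** 2 for v in nbr_err) / k)
        zeta.append((e - mu_t) / (sigma_t + eps))  # eps is an added safeguard
    return rho, zeta

# Toy data: a training set of normal points, one normal-looking and one anomalous test point.
train_latent = [(0.0,), (0.1,), (0.2,), (0.3,)]
train_err = [0.10, 0.12, 0.11, 0.13]
test_latent = [(0.15,), (2.0,)]
test_err = [0.12, 0.80]
rho, zeta = semi_supervised_scores(train_latent, train_err, test_latent, test_err, k=2)
```

Since a test point never belongs to T, there is no need to exclude it from its own neighborhood, unlike in the unsupervised case.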

3.3 Novel anomaly scores

In this section we generalize the approach of \(\text {Latent}{Out}\) in order to exploit other definitions of scores. Indeed, our goal is to show that the feature space \(\mathcal F\) induced by \(\text {Latent}{Out}\) has the characteristic to enhance the separation between normal and anomalous data. Basically, this implies that any way of perceiving anomalous behaviour will take advantage of replacing the original data with its mapping in the \(\text {Latent}{Out}\) feature space \(\mathcal F\).

Specifically, given a generic anomaly score \(\sigma \), we call \(\sigma \)\(\text {Latent}{Out}\) the variant of \(\text {Latent}{Out}\) which applies the score \(\sigma \) within the feature space \(\mathcal F\); thus, \(\sigma \)\(\text {Latent}{Out}(x)\) coincides with \(\sigma _{\mathcal F}(x)\), that is the value of the score \(\sigma \) associated with the mapping of the instance x in the feature space \(\mathcal F\). Figure 1 reports a scheme of the overall methodology.
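The composition described above can be sketched as a small wrapper that first maps the data into \(\mathcal F\) and then applies an arbitrary score. Everything in this example is hypothetical: the toy `encode` and `decode` functions stand in for a trained network, and `distance_to_centroid` is a deliberately simple placeholder for the score \(\sigma \), not one of the methods considered in the paper.

```python
import math

def to_feature_space(data, encode, decode):
    """Map each point to (latent code, squared reconstruction error)."""
    feats = []
    for x in data:
        z = encode(x)
        x_hat = decode(z)
        err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        feats.append(tuple(z) + (err,))
    return feats

def sigma_latentout(sigma, data, encode, decode):
    """Apply a generic anomaly score sigma to the images of the data in F."""
    return sigma(to_feature_space(data, encode, decode))

def distance_to_centroid(points):
    """Placeholder score: Euclidean distance from the centroid of the points."""
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [math.dist(p, center) for p in points]

# Toy model: the encoder keeps the first coordinate, the decoder reconstructs the line y = x.
encode = lambda x: (x[0],)
decode = lambda z: (z[0], z[0])

data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (2.0, 5.0)]
feats = to_feature_space(data, encode, decode)
scores = sigma_latentout(distance_to_centroid, data, encode, decode)
```

The last point has an unremarkable latent code but a large reconstruction error, so it is separated from the others along the error coordinate of \(\mathcal F\).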

Fig. 1 \(\text {Latent}{Out}\) receives the dataset as input and maps it into \(\mathcal F\). The transformed dataset is then processed by unsupervised anomaly detection methods which provide an anomaly score for each point

To substantiate our claim, in this work we consider 6 standard data mining outlier detection scores and compare their performances in the original feature space with that in the \(\text {Latent}{Out}\) feature space.

The methods considered in our analysis are Concentration Free Outlier Factor (CFOF) (Angiulli, 2017), Gaussian Mixture Models (GMM) (Reynolds et al., 2009), Isolation Forest (IF) (Liu et al., 2012), k-nearest neighbor (k-NN) (Ramaswamy et al., 2000) (whose application on \({\mathcal F}\) coincides with the \(\varrho {-score}\) of \(\text {Latent}{Out}\)), Local Outlier Factor (LOF) (Breunig et al., 2000) and One-Class Support Vector Machine (OC-SVM) (Schölkopf et al., 2001).

In the following we denote by \(z_i\) the image of the point \(x_i\) mapped in the space \({\mathcal F}\). Next, the definitions of the above listed methods are recalled.

Concentration free outlier factor

The Concentration Free Outlier Factor (CFOF) is based on the reverse neighborhood of the data points. For our aims, the neighborhood relationship is defined according to the data representations in the space \({\mathcal F}\); in more detail,

$$\begin{aligned} \textrm{CFOF}_{\mathcal F}(x_i)=\min _{1\le k'\le n}\left\{ \frac{k'}{n}:n_{k'}^{\mathcal F}(x_i)\ge n\rho \right\} , \end{aligned}$$

where \(n_k^{\mathcal F}(x_i)=\left| \left\{ x_j:x_i\in \textrm{N}_k^{\mathcal F}(x_j)\right\} \right| \) is the reverse k nearest neighbor count, that is, the number of objects having \(x_i\) among their k nearest neighbors, \(\textrm{N}_k^{\mathcal F}(x_j)\) is the set of the k nearest neighbors of \(x_j\), and \(\rho \in (0,1)\) is a parameter fixing the fraction of the population that must include \(x_i\) in its neighborhood.
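The CFOF definition can be illustrated with a brute-force sketch. It relies on the observation that \(x_i\in \textrm{N}_{k'}^{\mathcal F}(x_j)\) exactly when the rank of \(x_i\) in the neighbor list of \(x_j\) is at most \(k'\). The function name, the toy data and the convention that each point is the first element of its own neighbor list (so that the minimum always exists) are assumptions of this example.

```python
import math

def cfof_scores(feats, rho):
    """Smallest k'/n such that at least n*rho points have x_i among their k' nearest neighbors."""
    n = len(feats)
    # ranks[i] collects, for every j, the 1-based position of i in the neighbor list of j.
    ranks = [[] for _ in range(n)]
    for j in range(n):
        # Assumption: j itself is included in its own list (distance 0, hence first).
        order = sorted(range(n), key=lambda i: math.dist(feats[j], feats[i]))
        for pos, i in enumerate(order, start=1):
            ranks[i].append(pos)
    need = max(1, math.ceil(n * rho))
    # The reverse-neighbor count reaches `need` once k' equals the need-th smallest rank.
    return [sorted(r)[need - 1] / n for r in ranks]

# Toy data in F: four clustered points and one far away.
feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
cfof = cfof_scores(feats, rho=0.5)
```

This version costs a full sort per point and is meant only to make the definition concrete; it does not reflect the efficient algorithms proposed for CFOF.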

Gaussian mixture models

The goal of Gaussian Mixture Models (GMM) is to reconstruct the unknown density of the data projections in the feature space \(\mathcal F\) as a mixture of k distributions

$$ p\left( z_i|\omega _j,\mu _j,\Sigma _j\right) =\sum _{j=1}^k\omega _jg\left( z_i|\mu _j,\Sigma _j\right) . $$

where each \(g(\cdot |\mu _j,\Sigma _j)\), \(j=1,\ldots ,k\), is a \(d+1\)-dimensional Gaussian distribution in the feature space \(\mathcal F\):

$$ g\left( z_i|\mu _j,\Sigma _j\right) ={\frac{1}{\left( 2\pi \right) ^{(d+1)/2}|\Sigma _j|^{1/2}}}\exp \left( -\frac{1}{2}\left( z_i-\mu _j\right) ^T\Sigma _j^{-1}\left( z_i-\mu _j\right) \right) . $$

The parameters \(\omega _j\in \mathbb {R}\), \(\mu _j\in \mathbb {R}^{d+1}\), and \(\Sigma _j\in \mathbb {R}^{(d+1)\times (d+1)}\) of the mixture are estimated by using the Expectation-Maximization algorithm. Notice that the \(\Sigma _j\) are diagonal matrices, since covariances are assumed to be null.

The anomaly score of \(x_i\) is defined as the value of the density obtained with the parameters \({\omega }_j,{\mu }_j,{\Sigma }_j\) that maximize the expectation, in formula

$$ \textrm{GMM}_{\mathcal F}(x_i)=p\left( z_i|\omega _j,\mu _j,\Sigma _j\right) . $$
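The evaluation of the mixture density for given parameters can be sketched as follows. The example uses the standard normalized Gaussian density (with the factor 1/2 in the exponent) and diagonal covariances; it does not perform the parameter estimation, and the function names and toy parameters are assumptions of this example.

```python
import math

def diag_gaussian_pdf(z, mu, var):
    """Density of a Gaussian with diagonal covariance; var holds the diagonal entries."""
    dim = len(z)
    log_det = sum(math.log(v) for v in var)
    maha = sum((a - m) ** 2 / v for a, m, v in zip(z, mu, var))
    return math.exp(-0.5 * (dim * math.log(2 * math.pi) + log_det + maha))

def gmm_density(z, weights, means, variances):
    """Weighted sum of the component densities evaluated at z."""
    return sum(w * diag_gaussian_pdf(z, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy mixture with two unit-variance components in a 2-dimensional space.
weights = [0.5, 0.5]
means = [(0.0, 0.0), (1.0, 1.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]

p_near = gmm_density((0.0, 0.0), weights, means, variances)
p_far = gmm_density((6.0, 6.0), weights, means, variances)
```

Points lying far from every component receive a low density value, which is the quantity used to rank them.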

Isolation forest

The Isolation Forest technique builds a data-induced tree, also called Isolation Tree (or iTree), by recursively and randomly partitioning instances, until all of them are isolated. The random partitioning produces shorter paths for anomalies.

In our context, the points of the dataset \(\left\{ x_1,\dots ,x_n\right\} \) are partitioned by considering split values on the features of their representation \(\left\{ z_1,\dots ,z_n\right\} \) in the space \(\mathcal F\).

The path length h(x) of a data point x is the number of edges traversed in order to reach the external node containing only x. An iTree is built by recursively expanding non-leaf nodes (initially each data point is associated with a single internal node) by randomly selecting an attribute a and a split value v.

The anomaly score obtained from this process is given by

$$ \textrm{IF}_{\mathcal F}(x_i) = 2^{-\frac{E[h(x_i)]}{c(n)}} $$

where \(E[h(x_i)]\) denotes the average path length of \(x_i\) in the collection of iTrees and c(n) is a normalization constant which depends on the total number of data points.
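The mapping from average path length to anomaly score can be sketched as below. The expression used for c(n), based on an approximation of the harmonic number, follows the definition given by Liu et al. for the Isolation Forest; the function names are assumptions of this example, and the construction of the iTrees is not shown.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Normalization constant: average path length of an unsuccessful BST search over n points."""
    if n < 2:
        raise ValueError("c(n) is used here only for n >= 2")
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def isolation_score(avg_path_length, n):
    """Anomaly score 2^(-E[h]/c(n)); shorter average paths give scores closer to 1."""
    return 2.0 ** (-avg_path_length / c(n))

short_path = isolation_score(2.0, 256)
long_path = isolation_score(12.0, 256)
```

A point whose average path length equals c(n) gets a score of exactly 0.5, while points isolated after few splits get scores closer to 1.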

Local outlier factor

In our application, the concept of reachability-distance (\({\mathrm rd}_k\)) between two data points \(x_i\) and \(x_j\) exploited by the Local Outlier Factor (LOF) is based on the distance \({\mathrm d}_{\mathcal F}\) introduced in Section 2 rather than on the standard euclidean distance in the original space, i.e.

$$ \textrm{rd}_k\left( x_i,x_j\right) =\max \left( \textrm{d}_{{\mathcal F},k}(x_j),\textrm{d}_{\mathcal F}(x_i,x_j)\right) , $$

where \(\textrm{d}_{{\mathcal F},k}(x_j)\) is the \(\textrm{d}_{\mathcal F}\) distance between \(x_j\) and its k-th nearest neighbor. Then, the LOF anomaly score of the point \(x_i\) is defined as usual, specifically

$$ \textrm{LOF}_{\mathcal F}(x_i)=\frac{\sum _{x_j\in N_k^{\mathcal F}(x_i)}\textrm{lrd}_k\left( x_j\right) }{\left| N_k^{\mathcal F}(x_i)\right| \textrm{lrd}_k\left( x_i\right) }, $$

where \(\textrm{lrd}_k\) is the local reachability density

$$ \textrm{lrd}_k\left( x_i\right) =\frac{\left| N_k^{\mathcal F}(x_i)\right| }{\sum _{x_j\in N_k^{\mathcal F}(x_i)}\textrm{rd}_k\left( x_i,x_j\right) }. $$
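A brute-force sketch of LOF on points already mapped into \(\mathcal F\) is given below. It follows the standard definition of Breunig et al. (2000), in which the reachability distance from \(x_i\) to \(x_j\) uses the k-distance of the neighbor \(x_j\). The function name and toy data are assumptions of this example; ties are broken arbitrarily so that each neighborhood has exactly k elements, and duplicated points (which would lead to a division by zero) are not handled.

```python
import math

def lof_scores(feats, k):
    """Local Outlier Factor computed with Euclidean distances among the given points."""
    n = len(feats)
    nbrs, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(feats[i], feats[j]))[:k]
        nbrs.append(order)
        # k-distance: distance between point i and its k-th nearest neighbor.
        kdist.append(math.dist(feats[i], feats[order[-1]]))

    def reach(i, j):
        # Reachability distance of i with respect to its neighbor j.
        return max(kdist[j], math.dist(feats[i], feats[j]))

    # Local reachability density of each point.
    lrd = [k / sum(reach(i, j) for j in nbrs[i]) for i in range(n)]
    # Ratio between the mean density of the neighbors and the density of the point.
    return [sum(lrd[j] for j in nbrs[i]) / (k * lrd[i]) for i in range(n)]

# Toy data: the corners of a unit square and one distant point.
feats = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
lof = lof_scores(feats, k=2)
```

Points inside the square have a score equal to 1, meaning that their density matches that of their neighbors, while the distant point obtains a score well above 1.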

One-class support vector machine

The application of the One-Class Support Vector Machine (OC-SVM) methodology to our paradigm is based on the idea of building a hyperplane that provides an optimal separation between the representations of normal and anomalous points in \({\mathcal F}\).

Specifically, the separation is obtained through the following constrained optimization problem

$$\begin{aligned}&w^*={{\,\textrm{argmin}\,}}_{w\in \mathbb {R}^{\ell +1}}\Vert w\Vert ^2 \nonumber \\&y_i\left\langle z_i,w\right\rangle \ge 1 \quad i=1,\dots ,m. \end{aligned}$$
(1)

The anomaly score of a point x is given by the distance of its mapping \(z\in {\mathcal F}\) from the hyperplane represented by the solution \(w^*\) of the optimization problem in (1)

$$ \mathrm{OC-SVM}_{\mathcal F}(x)=\frac{\left\langle z,w^*\right\rangle }{\Vert w^*\Vert }. $$

To manage problems that are not linearly separable, the soft-SVM algorithm is employed in practice; it allows some of the above constraints to be violated while also minimizing the extent of their violation.
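For reference, the soft-margin relaxation of problem (1) is commonly written as follows. The slack variables \(\xi _i\) and the trade-off constant \(C>0\) are notation introduced here for illustration and do not appear in the formulation above:

$$ \min _{w,\,\xi }\ \Vert w\Vert ^2 + C\sum _{i=1}^m \xi _i \quad \text {s.t.}\quad y_i\left\langle z_i,w\right\rangle \ge 1-\xi _i,\quad \xi _i\ge 0,\quad i=1,\dots ,m. $$

Each \(\xi _i\) measures by how much the i-th constraint of (1) is violated, and larger values of C penalize violations more heavily.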

Moreover, for tackling problems where linear separators achieve poor generalization results, SVMs are equipped with kernel functions applying a non-linear transformation of the data and mapping them into a higher dimensional space in which they can be better separated.

4 Experimental results

In this section we report experiments conducted to study the behavior of the proposed techniques.

In particular, we focus on the following three aspects:

  • the behavior of \(\text {Latent}{Out}\) algorithm in the semi-supervised (one-class) setting in comparison with baseline architectures;

  • the application of all \(\text {Latent}{Out}\) scores to the new architectures \(\text {SO}-\text {GAAL}\) and \(\text {MO}-\text {GAAL}\) and the comparison with the baseline methods;

  • the analysis of the behaviour of standard anomaly detection algorithms on the feature space \(\mathcal {F}\) and the comparison with their standard application on the original data space.

4.1 Experimental settings

In our experiments we employ three standard benchmark datasets, two composed by grayscale images, MNIST (Deng, 2012) and Fashion-MNIST (Xiao et al., 2017), and one composed by three-channels colour images, CIFAR-10 (Krizhevsky et al., 2009). Both the grayscale datasets consist of \(60,\!000\) \(28\times 28\) pixels images divided in 10 classes, CIFAR-10 consists of \(60,\!000\) \(32\times 32\) colour images partitioned in 10 classes. In some experiments, we also consider some tabular datasets belonging to the ODDS repository (Rayana, 2016), namely annthyroid, satellite, satimage-2, thyroid, vertebral, wine.

Some of these datasets are multi-labelled; thus, in order to make them suitable for anomaly detection, we adopt a one-vs-all policy, which means that we consider one class as normal and all the others as anomalous.

In particular, in the unsupervised setting, we consider a dataset composed by all the examples of the normal class in the training set and a quantity \(s=10\) of randomly selected examples from each other class as anomalies. Thus, the resulting dataset meets the rarity and heterogeneity requirements characterizing Anomaly Detection scenarios.
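The construction of the unsupervised one-vs-all dataset can be sketched as follows. The function name, the fixed random seed and the toy labelled collection are assumptions of this example; the sketch only reproduces the selection rule described above (all the examples of the normal class plus s examples from each other class).

```python
import random

def one_vs_all_unsupervised(samples, labels, normal_class, s=10, seed=0):
    """Return (sample, flag) pairs: flag 0 for the normal class, 1 for the sampled anomalies."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # All the examples of the normal class are kept.
    data = [(x, 0) for x in by_class[normal_class]]
    # s randomly selected examples from each other class are added as anomalies.
    for y, items in by_class.items():
        if y != normal_class:
            data += [(x, 1) for x in rng.sample(items, min(s, len(items)))]
    rng.shuffle(data)
    return data

# Toy collection: 1000 integer "samples" spread over 10 classes of 100 elements each.
samples = list(range(1000))
labels = [x % 10 for x in samples]
dataset = one_vs_all_unsupervised(samples, labels, normal_class=3, s=10)
```

With 10 classes and \(s=10\), the resulting dataset contains 90 anomalies drawn from 9 different classes, which is what makes them both rare and heterogeneous.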

On the other hand, in the semi-supervised (one-class) setting the training set is composed only by examples from the normal class, while the test set coincides with the original test sets of the considered datasets, thus it is composed of examples from both the normal and the anomalous classes.

The performances of the various algorithms are measured by means of the Area Under the ROC Curve (which we refer to in the paper as AUC).
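As a reminder of what the AUC measures, the following sketch computes it through its rank interpretation: the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal point, with ties counted as one half. The function name and toy values are assumptions of this example, and the quadratic-time loop is meant for illustration only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by comparing every anomaly/normal pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each pair contributes 1 if the anomaly is ranked higher, 0.5 in case of a tie.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

In this toy example three of the four anomaly/normal pairs are correctly ordered, so the AUC is 0.75.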

Tables reporting experimental results highlight in bold the method scoring the best AUC value within each considered setting.

Table 1 AUC for MNIST, Fashion-MNIST and CIFAR-10 in the one-vs-all semi-supervised setting

4.2 \(\text {Latent}{Out}\) in the semi-supervised scenario

In this section we test \(\text {Latent}{Out}\) in the semi-supervised (one-class) setting by considering the architectures VAE and \(\text {GANomaly}\) as baseline.

The results are reported in Table 1; for each dataset and each architecture, on the left column there is the AUC of the baseline and on the right column there is the best AUC obtained by the two scores of \(\text {Latent}{Out}\). Mean and standard deviations are measured on 10 runs, each considering the same normal instances and a different set of randomly selected anomalies.

We vary the dimension of the latent space in the interval [2, 64]; the best results are obtained in the interval [8, 16] for \(\text {Latent}{Out}_{VAE}\), [16, 32] for \(\text {Latent}{Out}_{\text {GANomaly}}\), [4, 8] for the standard VAE and [16, 64] for the standard \(\text {GANomaly}\).

These results show that \(\text {Latent}{Out}\) outperforms both the considered baselines, and that in many cases the improvement is substantial.

4.3 Performance of \(\text {Latent}{Out}\) on GAAL architectures

In this section we test \(\text {Latent}{Out}\) scores on \(\text {MO}-\text {GAAL}\) and \(\text {SO}-\text {GAAL}\) architectures. Table 2 shows the results of the two \(\text {Latent}{Out}\) scores and the baseline on MNIST and Fashion-MNIST in a one-vs-all unsupervised setting, since the architectures \(\text {MO}-\text {GAAL}\) and \(\text {SO}-\text {GAAL}\) are specific for unsupervised anomaly detection.

In this experiment we fix the value of the parameter k for each score, and in particular we follow the indications given in Angiulli et al. (2022) and set \(k=50\) and \(k=200\), respectively. On the other hand, the value of the dimension of the latent space is variable in the interval [8, 128]. For both architectures the best values are obtained in the interval [32, 64]. Mean and standard deviations are measured on 10 runs, each considering the same normal instances and a different set of randomly selected anomalies.

From these results we can conclude that \(\text {Latent}{Out}\) is very effective also when applied to these architectures, since it always guarantees an improvement over the standard baseline.

In particular, we can observe that \(\varrho {-score}\) is the best score for the majority of the classes, and, in those cases in which this is not true, its performance is almost always very close to the one of the best method.

Table 2 AUC for MNIST and Fashion-MNIST in the one-vs-all unsupervised setting (\(s=10\))

4.4 Analysis of \(\text {Latent}{Out}\) with the novel scores

In this section we analyze the behavior of \(\sigma \)\(\text {Latent}{Out}\), where \(\sigma \) is one of the following six methods: Concentration Free Outlier Factor (CFOF), Gaussian Mixture Models (GMM), Isolation Forest (IF), k-nearest neighbor (k-NN), Local Outlier Factor (LOF) and One-Class Support Vector Machine (OC-SVM).

In Table 3 we report the hyper-parameters and the corresponding set of values considered for each method. As for the hyper-parameters not included in the table, we employed their default values.

For the space \({\mathcal F}\) of \(\text {Latent}{Out}\), we use a Variational Autoencoder and we vary the latent space dimension \(\ell \) in the following set:

$$ \ell \in \left\{ \ell _i=\left\lfloor \frac{d}{4^i}\right\rfloor : \; \forall i\in \mathbb {N}^+ \text { s.t. }\left\lfloor \frac{d}{4^i}\right\rfloor \ge 2\right\} . $$

Table 3 List of the hyperparameters employed for each method

Fig. 2 Comparison between the AUC of \(\sigma \) and \(\sigma \)\(\text {Latent}{Out}\) for different methods \(\sigma \). MNIST, Fashion-MNIST and CIFAR10 datasets

The number of layers composing the architecture of the Variational Autoencoder is inversely proportional to the latent space dimension \(\ell _i\). Specifically, for each \(j<i\) there is one hidden layer of dimension \(\ell _j\) in the encoder and the symmetric one in the decoder.
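The set of latent dimensions and the corresponding layer sizes can be enumerated with a short sketch. The function names are assumptions of this example, and the layout lists only the layer widths of the encoder; details such as the separate mean and variance outputs of a VAE are not represented.

```python
def latent_dims(d):
    """All values floor(d / 4^i), for i = 1, 2, ..., that are at least 2."""
    dims, i = [], 1
    while d // 4 ** i >= 2:
        dims.append(d // 4 ** i)
        i += 1
    return dims

def encoder_layout(d, i):
    """Layer widths of the encoder for the i-th latent dimension (1-based index)."""
    dims = latent_dims(d)
    hidden = dims[: i - 1]  # one hidden layer of width l_j for each j < i
    return [d] + hidden + [dims[i - 1]]

mnist_dims = latent_dims(28 * 28)
cifar_dims = latent_dims(32 * 32 * 3)
layout = encoder_layout(28 * 28, 3)
```

For a flattened \(28\times 28\) image (\(d=784\)) the candidate latent dimensions are 196, 49, 12 and 3, and choosing the third one gives an encoder with layer widths 784, 196, 49 and 12, mirrored in the decoder.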

Let \(\sigma \) denote the generic basic anomaly detection method. Figures 2 and 3 report the comparison between the AUC obtained by \(\sigma \)\(\text {Latent}{Out}\) (on the y-axis) and the AUC obtained by \(\sigma \) (on the x-axis) in the unsupervised scenario. Each point is associated with a specific configuration of the hyper-parameters, namely a specific latent space dimension \(\ell _i\) and a specific basic method hyper-parameter value (see Table 3). Figure 2 shows results on the MNIST, Fashion-MNIST and CIFAR10 image datasets, while Fig. 3 concerns the ODDS shallow datasets.

Fig. 3 Comparison between the AUC of \(\sigma \) and \(\sigma \)\(\text {Latent}{Out}\) for different methods \(\sigma \). ODDS datasets

Table 4 Maximum AUC on Fashion-MNIST

The figures highlight that \(\sigma \)\(\text {Latent}{Out}\) is able to improve the performances of \(\sigma \) very often. This behavior is much more evident on the complex image datasets which are naturally richer in correlations, but also on the shallow datasets the analysis may take benefit of working in the \(\text {Latent}{Out}\) feature space.

As a further detail, Tables 4 and 5 report the maximum AUC of GMM, LOF, and OC-SVM and of their \(\text {Latent}{Out}\) counterparts for each class of the two most difficult image datasets, namely Fashion-MNIST and CIFAR10. We do not report details on CFOF and iForest since they use the default values for their hyper-parameters and have considerably fewer points in the plots, while k-NN corresponds to the \(\varrho {-score}\) already considered in previous experiments.

Table 5 Maximum AUC on CIFAR10
Table 6 Average AUC on MNIST, Fashion-MNIST and CIFAR10

Table 6 summarizes the results of the experiments reported in Figs. 2 and 3 by reporting the mean AUC of the various methods. Importantly, the table highlights that the average performances of existing anomaly detection scores almost always improve when they are applied to the \(\text {Latent}{Out}\) feature space \(\mathcal F\).

Since \(\text {Latent}{Out}\) is able to generate a feature space having a positive impact on the anomaly detection task, we introduce a variant that we call \(\phi \)\(\text {Latent}{Out}\). This approach performs a pre-training of \(\text {Latent}{Out}\) on a representative sample of the population. Then, the whole set of observations to classify is mapped into the learned feature space \(\mathcal F\) and the score \(\sigma \) is evaluated on the mapped instances.

The advantage of this approach is that the execution time is reduced and, moreover, that the mapping associated with \(\phi \)\(\text {Latent}{Out}\) can be stored and employed multiple times on different test sets. The method assumes that each test set is representative at least of the normal data population: if this property cannot be verified, it can still be guaranteed by including the pre-training sample in the test set.

Fig. 4 Comparison between the performances of \(\text {Latent}{Out}\) and \(\phi \)\(\text {Latent}{Out}\) in terms of AUC on MNIST, Fashion-MNIST and CIFAR10

Table 7 Training time (in seconds) for each epoch of \(\text {Latent}{Out}\) and \(\phi \)\(\text {Latent}{Out}\) on MNIST and CIFAR10
Table 8 Computation time \(\sigma \) and \(\sigma _{\mathcal F}\) with different values of hyper-parameters and latent space dimension \(\ell \)

We compare performances of \(\text {Latent}{Out}\) and \(\phi \)\(\text {Latent}{Out}\) in the unsupervised scenario by taking into account the image datasets. Since these datasets contain all the normal class instances (6000 points), the pre-training phase of \(\phi \)\(\text {Latent}{Out}\) is performed on the normal class instances of the corresponding test set (1000 points).

Figure 4 reports the mean and standard deviation of the AUC obtained considering the same combinations of the hyper-parameters discussed above. It can be seen that \(\phi \)\(\text {Latent}{Out}\) is able to maintain a comparable accuracy, but at a reduced computational cost: Table 7 reports the training time per epoch of \(\text {Latent}{Out}\) and \(\phi \)\(\text {Latent}{Out}\). In these experiments the total number of epochs has been set to 200.

Table 9 Computation time \(\sigma \) and \(\sigma _{\mathcal F}\) with different values of hyper-parameters and latent space dimension \(\ell \)

Experiments have been performed on a Linux machine equipped with a 2.9 GHz Intel Core\(^{\text {TM}}\) i7-10700, 32 GB of main memory and a NVIDIA GeForce RTX 2070 Super having 8 GB of dedicated memory.

To conclude the section, we also measure the execution time of the basic method \(\sigma \) when executed in the original feature space and in the \(\text {Latent}{Out}\) feature space \(\mathcal F\) having dimension \(\ell \). Execution times are reported in Table 8 for MNIST and Table 9 for CIFAR10. The execution times of k-NN and LOF are almost independent of k and, hence, we report only the results for an intermediate k value, namely \(k=7\). As expected, by considering the reduced feature space \(\mathcal F\) of \(\text {Latent}{Out}\), we also reduce the time devoted to the computation of the scores.

5 Conclusions

In this work we introduce three extensions of the \(\text {Latent}{Out}\) algorithm: an application to the semi-supervised setting, a novel architecture, and a series of novel scores based on some existing data mining outlier detection methods. The experiments show that in many cases the scores of \(\text {Latent}{Out}\) improve the performance of the considered baseline methods, both in the unsupervised and in the one-class scenarios.

The results obtained in this paper suggest that the idea behind \(\text {Latent}{Out}\) of exploiting both the baseline score and the latent space of neural architectures can be effective in a wide range of different anomaly detection settings. For this reason, our main goal for future work is to deal with supervised scenarios in which some anomalies are known during training.