On the evaluation of outlier detection and one-class classification: a comparative study of algorithms, model selection, and ensembles

It has been shown that unsupervised outlier detection methods can be adapted to the one-class classification problem (Janssens and Postma, in: Proceedings of the 18th annual Belgian-Dutch Conference on Machine Learning, pp 56–64, 2009; Janssens et al. in: Proceedings of the 2009 ICMLA international conference on machine learning and applications, IEEE Computer Society, pp 147–153, 2009. 10.1109/ICMLA.2009.16). In this paper, we focus on the comparison of one-class classification algorithms with such adapted unsupervised outlier detection methods, improving on previous comparison studies in several important aspects. We study a number of one-class classification and unsupervised outlier detection methods in a rigorous experimental setup, comparing them on a large number of datasets with different characteristics, using different performance measures. In contrast to previous comparison studies, where the models (algorithms, parameters) are selected by using examples from both classes (outlier and inlier), here we also study and compare different approaches for model selection in the absence of examples from the outlier class, which is more realistic for practical applications since labeled outliers are rarely available. Our results show that, overall, SVDD and GMM are top performers, regardless of whether the ground truth is used for parameter selection or not. However, in specific application scenarios, other methods exhibited better performance. Combining one-class classifiers into ensembles showed better performance than individual methods in terms of accuracy, as long as the ensemble members are properly selected. Supplementary Information The online version contains supplementary material available at 10.1007/s10618-023-00931-x.

The most widely used parametric model is the Gaussian distribution (Bishop, 2007), where, in the simplest case, a single Gaussian probability density function

p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

is fit to the inlier data, with µ being the mean, Σ the covariance matrix, and d the dimensionality of the data. Using a single Gaussian distribution, however, makes the very strong assumption that the distribution is unimodal, which is often violated. A more flexible model is a Gaussian Mixture Model (GMM) (Bishop, 2007), where the distribution is assumed to be a mixture of several Gaussian components, each with its own mean and covariance matrix, whose parameters and mixing weights are estimated from the data.
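As an illustration, a GMM-based one-class scorer can be sketched with scikit-learn's GaussianMixture. This is a minimal sketch, not the exact setup of this study; the synthetic data, the number of components, and the example points are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Bimodal inlier data: a single Gaussian would violate the unimodality assumption.
inliers = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(200, 2)),
])

# Fit a two-component mixture with full covariance matrices.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(inliers)

# score_samples returns the log-density under the mixture; lower values
# suggest outliers.
scores = gmm.score_samples(np.array([[0.1, -0.2], [2.5, 2.5]]))
print(scores[0] > scores[1])  # the point near a mode gets the higher density
```

In practice, a threshold on the (log-)density would be chosen, e.g., from a quantile of the training scores, to turn the ranking into inlier/outlier labels.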

Parzen Window (PW)
PW is a non-parametric method based on Parzen density estimation (Parzen, 1962) that estimates the density of the data using a mixture of kernels centered on each of the N individual training instances. In our study, we use Gaussian kernels with diagonal covariance matrices Σ_i = hI, where h is a width parameter. The probability of an instance x being an inlier is then computed as

p(x) = (1/N) Σ_{i=1}^{N} (2πh)^(−d/2) exp( −‖x − x_i‖² / (2h) ).

The training of the Parzen density consists of determining the parameter h, which can be optimized using the maximum likelihood solution (Duin, 1976). Alternatively, the user can supply the parameter h; in that case, the computational cost of training is negligible. Testing new instances, however, is computationally expensive: during the testing phase, the distances from the new instance to all training instances have to be computed. This also imposes a storage limitation, since the training instances have to be stored, and typically large training sets are required to produce a good density estimate.
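A direct (naive) implementation of this estimator makes the O(N)-per-query test cost explicit. The data and the width h below are illustrative assumptions:

```python
import numpy as np

def parzen_density(x, train, h):
    """Parzen density estimate with isotropic Gaussian kernels, Σ_i = h·I.

    Averages N Gaussian kernels, one centered on each training instance;
    scoring a single point therefore costs O(N) distance computations.
    """
    train = np.asarray(train, dtype=float)
    n, d = train.shape
    diff = train - np.asarray(x, dtype=float)      # (N, d) differences
    sq = np.sum(diff * diff, axis=1)               # squared distances
    norm = (2.0 * np.pi * h) ** (-d / 2.0)         # Gaussian normalizer
    return norm * np.mean(np.exp(-sq / (2.0 * h)))

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 2))
# A point near the data mass gets a higher density than a faraway one.
print(parzen_density([0.0, 0.0], train, h=0.5) >
      parzen_density([6.0, 6.0], train, h=0.5))
```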

Support Vector Data Description (SVDD)
SVDD (Tax and Duin, 2004) is a boundary-based one-class classification method inspired by Support Vector Machines (SVM) (Vapnik, 1995) used in regular classification problems. The primary difference between SVDD and SVM is that, while an SVM attempts to separate two or more classes with a maximum-margin hyperplane, SVDD encloses the inlier class in a minimum-volume hypersphere by minimizing the error

E(R, a) = R² + C Σ_i ξ_i,

subject to the constraints

‖x_i − a‖² ≤ R² + ξ_i,  ξ_i ≥ 0, ∀i,

where R is the radius of the hypersphere, a is the center of the hypersphere, ξ_i are slack variables allowing training observations x_i to fall outside the SVDD boundary, and C is a penalty (regularization) parameter. Like traditional SVMs, the above formulation can also be extended to nonlinearly transformed spaces using kernel methods. In the case of a kernel with the property K(x_i, x_i) = 1, ∀i, such as the Gaussian kernel that we use in our experiments, OC-SVM (Schölkopf et al, 2001) and SVDD (Tax and Duin, 2004) find the same decision boundary when certain conditions are met (Tax and Duin, 2004; Schölkopf et al, 2001; Tax, 2001).
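Since, with a Gaussian kernel, OC-SVM and SVDD find the same decision boundary, a boundary of this kind can be sketched with scikit-learn's OneClassSVM. The data and the values of gamma and nu are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))

# nu plays a role analogous to the penalty C: it upper-bounds the fraction
# of training points allowed to fall outside the boundary.
oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(inliers)

# +1 = inside the learned boundary (inlier), -1 = outside (outlier).
print(oc.predict([[0.0, 0.0], [8.0, 8.0]]))
```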

Linear Programming (LP)
LP (Pekalska et al, 2002) is a boundary method which, instead of using an explicit feature space, utilizes a dissimilarity measure to compare new instances to the inlier instances in the training set. The basic assumption is that instances belonging to the inlier class are similar to each other, while outliers are dissimilar to inliers. The use of a dissimilarity measure makes the method very convenient for applications where it is difficult to define suitable features for other approaches, e.g., for unstructured data (strings, graphs, or shapes). The dissimilarity measure used must meet the criteria defined by the authors: reflexivity, positivity, and symmetry. LP constructs the boundary by minimizing the volume of a simplex (Bazaraa et al, 2009) whose main vertex is the origin and whose other vertices result from the intersection of the boundary with the axes of the dissimilarity space.
For unbounded dissimilarity measures, instances with large dissimilarity values (due to noise or the existence of outliers in the training data) may affect the decision boundary. In order to make the classifier robust against such values, the dissimilarities are transformed by a sigmoid function with a user-defined slope parameter, which bounds their influence on the boundary.

Deep Support Vector Data Description (Deep SVDD)
Deep SVDD trains a neural network to map the data into a space in which the inliers are enclosed by a small hypersphere, minimizing an objective of the form

min_W (1/N) Σ_{i=1}^{N} ‖ϕ(x_i; W) − c‖² + (λ/2) Σ_l ‖W^l‖²,

where ϕ is the neural network with the corresponding set of weights W, trained to learn a transformation that minimizes the volume of a data-enclosing hypersphere centered on a predetermined point c. The second term is a standard weight-decay regularizer.

Local Outlier Factor (LOF)
LOF (Breunig et al, 2000) is an unsupervised outlier detection method that, similarly to kNN_local, compares the local density of an observation to that of its neighbors. The distances between observations are replaced by reachability distances, defined as:

reach-dist_k(x_i ← x_j) = max{ d(x_j, NN_k(x_j)), d(x_i, x_j) }   (7)

The local reachability density of an observation x_i is then defined as the inverse of the average reachability distance from the set kNN(x_i) of x_i's neighbors that are within the k-nearest-neighbor distance around x_i:

lrd_k(x_i) = |kNN(x_i)| / Σ_{x_j ∈ kNN(x_i)} reach-dist_k(x_i ← x_j)

Finally, the LOF score of an observation is computed by comparing the lrd of the observation with that of its neighbors:

LOF_k(x_i) = ( Σ_{x_j ∈ kNN(x_i)} lrd_k(x_j) / lrd_k(x_i) ) / |kNN(x_i)|

Local Correlation Integral (LOCI)
LOCI (Papadimitriou et al, 2003) is an unsupervised outlier detection method which analyzes the density of an observation at multiple neighborhood radii φr of a given maximum radius r, where φ ∈ (0, 1]. For each observation x_i, a (local) r-neighborhood N(x_i, r) = {x | d(x_i, x) ≤ r} and a (local) r-density n(x_i, r) = |N(x_i, r)| are defined. The average φr-density inside an r-neighborhood around an observation x_i is then defined as:

n̂(x_i, r, φ) = Σ_{x_j ∈ N(x_i, r)} n(x_j, φr) / n(x_i, r)

and the multi-granularity deviation factor (MDEF) is given by:

MDEF(x_i, r, φ) = 1 − n(x_i, φr) / n̂(x_i, r, φ)

An observation x_i is scored using the normalized deviation

σ_MDEF(x_i, r, φ) = σ_n̂(x_i, r, φ) / n̂(x_i, r, φ),

where σ_n̂(x_i, r, φ) is the standard deviation of n(x_j, φr) over x_j ∈ N(x_i, r). With these quantities, the LOCI score is computed as the largest ratio MDEF(x_i, r, φ) / σ_MDEF(x_i, r, φ) over the considered radii.

k-Nearest Neighbor (kNN_global)
The k-nearest-neighbor approach, which we call here kNN_global, was originally introduced as an unsupervised distance-based outlier detection method (Ramaswamy et al, 2000). Its score is the numerator of Equation (5), i.e., the distance of an observation to its k-th nearest neighbor, which makes the score global rather than local.
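A sketch of LOF in the one-class setting, using scikit-learn's LocalOutlierFactor with novelty=True so that the model is fit on inliers only and then applied to unseen points; the data and n_neighbors are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
train = rng.normal(size=(300, 2))

# novelty=True fits on the (inlier) training set only and allows scoring
# unseen points, matching the one-class adaptation discussed above.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)

# +1 = inlier, -1 = outlier with respect to the learned local densities.
print(lof.predict([[0.0, 0.0], [7.0, 7.0]]))
```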

Angle-Based Outlier Detection (ABOD)
ABOD (Kriegel et al, 2008) is a global outlier detection algorithm which uses not only the distances between points but primarily the variance of the angles between points. ABOD computes the variance of the angles between a point x_i and all other pairs of points in the dataset X, weighted by the inverse of the distances to the respective points. This weighting factor is important since the angle to a pair of points naturally varies more at larger distances. The Angle-Based Outlier Factor (ABOF) is defined as the weighted variance, over all pairs of points x_j, x_k ∈ X, of the term

⟨x_j − x_i, x_k − x_i⟩ / (‖x_j − x_i‖² · ‖x_k − x_i‖²),

with weights 1 / (‖x_j − x_i‖ · ‖x_k − x_i‖).

Subspace Outlier Degree (SOD)
SOD (Kriegel et al, 2009) analyzes, for each object, how well it fits into the subspace spanned by a set of reference objects. In order to define the reference set, the authors suggest using Shared Nearest Neighbors (SNN) in the full space. Once this reference set has been defined, the relevant subspace is determined as the set of attributes in which the variance of the objects of the reference set around its center is small compared to the expected variance. The outlier score of SOD is computed as the Euclidean distance of an object to the center of the reference set in the relevant subspace, normalized by the number of relevant attributes.
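The ABOF computation can be sketched directly in NumPy. This is a naive O(N^2)-per-query sketch, and the data are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def abof(x, data):
    """Angle-Based Outlier Factor: weighted variance, over all pairs of other
    points, of <x-y, x-z> / (||x-y||^2 * ||x-z||^2), with weights
    1 / (||x-y|| * ||x-z||).  Cost is O(N^2) per query point."""
    x = np.asarray(x, dtype=float)
    vals, weights = [], []
    for i, j in combinations(range(len(data)), 2):
        ab, ac = data[i] - x, data[j] - x
        nab, nac = np.linalg.norm(ab), np.linalg.norm(ac)
        if nab == 0 or nac == 0:
            continue  # skip pairs involving a duplicate of x itself
        vals.append(ab @ ac / (nab**2 * nac**2))
        weights.append(1.0 / (nab * nac))
    vals, weights = np.array(vals), np.array(weights)
    mean = np.average(vals, weights=weights)
    return np.average((vals - mean) ** 2, weights=weights)

rng = np.random.default_rng(4)
cluster = rng.normal(size=(40, 2))
# A faraway point sees all pairs under similar angles -> small ABOF,
# whereas a point inside the cluster has a large angle variance.
print(abof([10.0, 10.0], cluster) < abof(cluster[0], cluster))
```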

Global-Local Outlier Scores from Hierarchies (GLOSH)
GLOSH (Campello et al, 2015) is an unsupervised outlier detection algorithm based on the hierarchical density estimates provided by the hierarchical clustering algorithm HDBSCAN*. After a density-based clustering hierarchy is computed for the whole dataset, the GLOSH score of each observation x_i can be computed based on the difference between the density around x_i and the highest density inside the cluster closest to x_i (from a density-connectivity perspective) in the HDBSCAN* hierarchy:

GLOSH(x_i) = (λ_max(C_{x_i}) − λ(x_i)) / λ_max(C_{x_i}),   (16)

where λ(x_i) is the density of x_i and λ_max(C_{x_i}) is the highest density of an observation inside the closest cluster C_{x_i}; densities are estimated by a k-nearest-neighbor density estimator. The closest cluster C_{x_i} is the one that x_i belongs to at the density level of x_i. To apply GLOSH in a one-class classification scenario, we can initially construct the HDBSCAN* hierarchy using the training data, and then use this hierarchy as a fixed "model" to compute outlier scores for unseen data. In order to classify a new instance x_i, we must determine its closest cluster C_{x_i} in the fixed hierarchy. This can be achieved by first adding the instance x_i to the Minimum Spanning Tree (MST) which underlies the HDBSCAN* hierarchy; x_i is connected to the training instance x_j with the smallest "distance" in the density space in which the MST is constructed. We can then find the closest cluster C_{x_i} by removing the edges of the MST in decreasing order of weight: the closest cluster C_{x_i} is the one that x_i belongs to just before being detached from the MST.
There are two situations that may occur when connecting the new instance x_i to the training instance x_j. The first situation is that the edge connecting x_i and x_j is removed before x_j is separated from the MST. In this case, the closest cluster to x_i is the cluster that x_j belongs to when x_i is detached. We show an example of this situation in Figure 1.

Fig. 1: MST which underlies the HDBSCAN* hierarchy in the density space with minimum cluster size (m_clSize) = 3. The new instance x_i is connected to the training instance with the smallest distance, x_5 (Figure 1(a)). The edge connecting x_i and x_5 is removed before x_5 is separated from the MST, so the closest cluster to x_i is the one x_5 is assigned to at this density level; in this case, it is the cluster in which all instances remain clustered.

The second situation is that x_j is separated from the MST before the edge between x_i and x_j is removed. In this case, the closest cluster for x_i is the same as the closest cluster for x_j; we show an example of this situation in Figure 2. Once we have identified the closest cluster for x_i, we can compute its GLOSH score using Equation (16).
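Given density estimates λ(x_i) and λ_max(C_{x_i}) from the hierarchy, the GLOSH score itself (Equation (16)) is a single expression. A minimal sketch with hypothetical density values:

```python
def glosh(lam_x, lam_max):
    """GLOSH score (Equation (16)): 1 - λ(x_i) / λ_max(C_{x_i})."""
    return 1.0 - lam_x / lam_max

# A point as dense as its cluster's densest region scores 0 (clear inlier);
# a point far less dense than its cluster's peak scores close to 1.
print(glosh(4.0, 4.0), glosh(0.2, 4.0))
```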

Isolation Forest (iForest)
iForest (Liu et al, 2008, 2012) is an unsupervised outlier detection algorithm based on the concept of isolation, which can be seen as a particular kind of density estimate. Isolation in this context means "separating an instance from the rest of the instances", which is achieved by building an ensemble of binary trees called isolation trees (iTrees). An iTree recursively divides the dataset by randomly selecting an attribute and a split value until either the resulting partitions/nodes contain only one instance or all instances in a partition/node have the same value. Anomalies are more likely to be isolated closer to the root of an iTree, whereas normal instances are more likely to be isolated at deeper levels of an iTree. Therefore, anomalies are those instances that have short average path lengths in the iTrees.

Fig. 2: MST which underlies the HDBSCAN* hierarchy in the density space with minimum cluster size (m_clSize) = 3. The new instance x_i is connected to the training instance with the smallest distance, x_5 (Figure 2(a)). x_5 is separated from the MST before the edge between x_i and x_5 is removed (Figure 2(c)), so the closest cluster to x_i is the closest cluster for x_5; in this case, it is the cluster with the instances x_4, x_5 and x_6.

Time Complexity
The computational complexity of the one-class classification algorithms can be analyzed for two different sub-tasks: the training time O(f_tr(N)) and the testing (or consultation) time O(f_te(M)), as a function of the training set size N and the test set size M, respectively. In general, the algorithms can be divided into two groups, eager learning and lazy learning (Webb, 2011). In regular classification problems, the majority of the computation of eager learning algorithms occurs at training time, while lazy learning algorithms spend more time at testing. In one-class classification, however, if one wants to determine the threshold that separates inliers from outliers when using lazy learning algorithms, the model must also be applied to the training data (for example, as discussed in Section 3.3), which can result in higher complexity at training time than at testing time when N > M. On the other hand, if one is not interested in labeling but only in ranking, the training time complexity is usually null. Another disadvantage of lazy learning algorithms is that the training data has to be stored for consultation during the testing phase, which also imposes a storage restriction.
In Table 1, we provide the complexity of the methods discussed here. For networks such as the Auto-encoder and Deep SVDD, the training time complexity depends on the training algorithm used, but it is usually linear in the training set size N per training iteration. Some algorithms, such as Stochastic Gradient Descent (SGD) methods (Goodfellow et al, 2016), can perform an iteration in time sublinear in N by using only a sample of the training data to compute an unbiased estimate of the gradient. Most deep learning methods, such as Deep SVDD, use SGD to alleviate the computational burden caused by large numbers of layers/neurons. The bottleneck in the training of a neural network is usually the number of weights to train. While for Deep SVDD there is no predetermined number of weights, for the Auto-encoder the number of weights is equal to (t + d + 2dt), where d is the dimensionality of the dataset and t is the user-defined number of neurons. Therefore, the use of the Auto-encoder for high-dimensional datasets can be expensive. The time complexity in the number of weights also depends on the training algorithm used: for example, it is linear for Conjugate Gradient, while it is cubic for Levenberg-Marquardt (LeCun et al, 1998). Levenberg-Marquardt usually converges much faster, but its high time complexity makes it suitable only for small network sizes. Note that, although the time complexity per iteration of Deep SVDD (O(#weights)) is smaller than that of the Auto-encoder (O(N × #weights)) due to SGD, the number of iterations required for convergence is smaller for the Auto-encoder (Goodfellow et al, 2016); also, the number of weights of Deep SVDD is usually much larger than that of the Auto-encoder. The training of GMMs consists of finding the distribution parameters of the Gaussians (covariance matrices and means), which is usually achieved using an Expectation-Maximization (EM) algorithm.
For each iteration of the EM algorithm, the E-step takes O(d^3 N t) and the M-step takes O(d^2 N t), where d is the dimensionality of the dataset and t is the user-defined number of clusters. Note that, due to the inversion of the covariance matrices (O(d^3)), this algorithm is expensive for high-dimensional datasets. For testing, computing the likelihood of each testing point under the clusters takes O(d^3 M t).
For training, iForest uses only a user-defined sub-sample of size t_1 of the dataset to build the iTrees. The time complexity to build each of the t_2 iTrees defined by the user is quadratic in t_1, which gives an overall training complexity of O(t_1^2 t_2). The structure of the iTrees is equivalent to that of Binary Search Trees (BST), which makes the expected time complexity for testing a new instance O(log t_1), with a worst case of O(t_1).
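In scikit-learn's IsolationForest, the sub-sample size t_1 corresponds to max_samples and the number of iTrees t_2 to n_estimators; a minimal sketch (the data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
train = rng.normal(size=(500, 2))

# t_2 = 100 iTrees, each built on a random sub-sample of t_1 = 256 points.
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(train)

# score_samples is higher for normal points; anomalies get lower scores,
# reflecting their short average path lengths in the iTrees.
s = forest.score_samples([[0.0, 0.0], [8.0, 8.0]])
print(s[0] > s[1])
```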
Both LP and SVDD have to solve an optimization problem during the training phase. While LP has to solve a linear programming (LP) problem, SVDD has to solve a quadratic programming (QP) problem. Both problems can be solved in polynomial time (O(N^3)) using, for example, Interior Point Methods (Woodsend, 2009; Campbell and Bennett, 2000). When using a linear kernel, however, some algorithms in the literature can solve the problem efficiently in linear time (Joachims, 2006; Erfani et al, 2015).
For ABOD, the time complexity is dominated by the computation of the angles between each point and all other pairs of points in the dataset, which makes the overall time complexity O(N^3) for training and O(MN^2) for testing. Similarly, the time complexity for LOCI is dominated by the computation of the MDEF for each observation. When considering all possible radii r, it also takes O(N^3) for training and O(MN^2) for testing.
For kNN_global, kNN_local, LOF, and SOD, the most expensive operation is the computation of the k nearest neighbors of all N observations. Without spatial index structures, this computation takes O(N^2). With spatial index structures such as k-d trees and R-trees (Friedman et al, 1977; Roussopoulos et al, 1995), however, the k nearest neighbors of all N observations can often be determined much faster in practice. For testing, we have to compute the k nearest neighbors of the new instances with respect to all instances of the training set (O(MN)). Similarly to k-nearest-neighbor-based classifiers, PW needs to compute the distance to all observations in the training set in order to estimate the densities; therefore, its time complexity is the same as that of the k-nearest-neighbor-based methods.
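A sketch of the kNN_global score backed by a k-d tree index, using scikit-learn's NearestNeighbors (the data and k are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(6)
train = rng.normal(size=(1000, 2))

# A k-d tree index speeds up the k-NN queries that kNN_global, kNN_local,
# LOF, and SOD all rely on, compared to the naive O(N^2) scan.
nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(train)

# kNN_global outlier score: distance to the k-th nearest training neighbor.
dist, _ = nn.kneighbors([[0.0, 0.0], [6.0, 6.0]])
scores = dist[:, -1]
print(scores[1] > scores[0])  # the faraway point gets the larger score
```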
The training phase of GLOSH consists of building the HDBSCAN* hierarchy. The two most expensive operations are the computation of the core distances, which involves computing the k nearest neighbors of all N observations (O(N^2)), and the construction of the MST (O(N^2)). Therefore, the overall time complexity for training is O(N^2). For testing, GLOSH has to compute the core distance of each testing point (O(MN)) and find its closest object in the MST (O(MN)). In total, the testing time complexity is O(MN).

Detailed Results
In the following, we present the datasets and results in detail. Tables 2 and 3 show the characteristics of the datasets used in the experiments (the number of features, the number of objects in each class, and the total number of objects in each dataset).
Tables 4-7, 8-11, and 12-15 display, respectively, ROC AUC, AdjustedPrec@n, and MCC values for the Type I experiments, for each of the algorithms equipped with its best parameter value according to the different model selection methods (Cross-validation, SDS, Uniform Objects, Perturbation). Tables 16-19, 20-23, and 24-27 display the corresponding values for the Type II experiments. Tables 28 and 29 display, respectively, ROC AUC and AdjustedPrec@n values for the different approaches to guide the selection of the base models combined into the ensemble. The highest achieved values for each dataset are shown in bold.

Table 5: Detailed results of the methods with the parameters selected using Self-Adaptive Data Shifting (SDS) over all datasets of the Type I experiments (single source-class inliers) with respect to ROC AUC. We visualize the distribution of these values in Figure 9(a).

Table 6: Detailed results of the methods with the parameters selected using Uniform Objects over all datasets of the Type I experiments (single source-class inliers) with respect to ROC AUC. We visualize the distribution of these values in Figure 9(b).

Table 7: Detailed results of the methods with the parameters selected using Perturbation over all datasets of the Type I experiments (single source-class inliers) with respect to ROC AUC. We visualize the distribution of these values in Figure 9(c).

Table 8: Detailed results of the methods with the parameters selected using Cross-validation over all datasets of the Type I experiments (single source-class inliers) with respect to AdjustedPrec@n. We visualize the distribution of these values in Figure 4(b).

Table 9: Detailed results of the methods with the parameters selected using Self-Adaptive Data Shifting (SDS) over all datasets of the Type I experiments (single source-class inliers) with respect to AdjustedPrec@n. We visualize the distribution of these values in Figure 10(a).

Table 13: Detailed results of the methods with the parameters selected using Self-Adaptive Data Shifting (SDS) over all datasets of the Type I experiments (single source-class inliers) with respect to MCC. We visualize the distribution of these values in Figure 11(a).

Table 14: Detailed results of the methods with the parameters selected using Uniform Objects over all datasets of the Type I experiments (single source-class inliers) with respect to MCC. We visualize the distribution of these values in Figure 11(b).
Table 17: Detailed results of the methods with the parameters selected using Self-Adaptive Data Shifting (SDS) over all datasets of the Type II experiments (multiple source-class inliers) with respect to ROC AUC. We visualize the distribution of these values in Figure 12(a).

Table 18: Detailed results of the methods with the parameters selected using Uniform Objects over all datasets of the Type II experiments (multiple source-class inliers) with respect to ROC AUC. We visualize the distribution of these values in Figure 12(b).

Table 19: Detailed results of the methods with the parameters selected using Perturbation over all datasets of the Type II experiments (multiple source-class inliers) with respect to ROC AUC. We visualize the distribution of these values in Figure 12(c).

Table 20: Detailed results of the methods with the parameters selected using Cross-validation over all datasets of the Type II experiments (multiple source-class inliers) with respect to AdjustedPrec@n. We visualize the distribution of these values in Figure 5(b).

Table 21: Detailed results of the methods with the parameters selected using Self-Adaptive Data Shifting (SDS) over all datasets of the Type II experiments (multiple source-class inliers) with respect to AdjustedPrec@n. We visualize the distribution of these values in Figure 13(a).

Table 22: Detailed results of the methods with the parameters selected using Uniform Objects over all datasets of the Type II experiments (multiple source-class inliers) with respect to AdjustedPrec@n. We visualize the distribution of these values in Figure 13(b).

Table 23: Detailed results of the methods with the parameters selected using Perturbation over all datasets of the Type II experiments (multiple source-class inliers) with respect to AdjustedPrec@n. We visualize the distribution of these values in Figure 13(c).
Table 26: Detailed results of the methods with the parameters selected using Uniform Objects over all datasets of the Type II experiments (multiple source-class inliers) with respect to MCC. We visualize the distribution of these values in Figure 14(b).