Extreme value theory for anomaly detection – the GPD classifier

Classification tasks usually assume that all possible classes are present during the training phase. This is restrictive if the algorithm is used over a long time and possibly encounters samples from unknown new classes. It is therefore fundamental to develop algorithms able to distinguish between normal and abnormal test data. In the last few years, extreme value theory has become an important tool in multivariate statistics and machine learning. The recently introduced extreme value machine, a classifier motivated by extreme value theory, addresses this problem and achieves competitive performance in specific cases. We show that this algorithm has some theoretical and practical drawbacks and can fail even if the recognition task is fairly simple. To overcome these limitations, we propose two new algorithms for anomaly detection relying on approximations from extreme value theory that are more robust in such cases. We exploit the intuition that test points that are extremely far from the training classes are more likely to be abnormal objects. We derive asymptotic results motivated by univariate extreme value theory that make this intuition precise. We show the effectiveness of our classifiers in simulations and on real data sets.


Introduction
Modern classifiers achieve human or super-human performance in a variety of tasks (Christopher 2016), including speech (Graves et al. 2013) and image recognition (He et al. 2016), but they are typically not able to discriminate between normal and abnormal classes and may give high confidence predictions for unrecognizable objects (Nguyen et al. 2015). Here and throughout, we call a class normal if we have examples of it during the training phase, whereas we call a class abnormal if we have no examples of it during the training phase. The ability to distinguish between these two cases is important if there is the possibility that new classes arise in the future or if there have not been any examples of some classes in the training set due to their rarity.
The task of distinguishing between normal and abnormal test data points is called anomaly detection (Chandola et al. 2009). We underline that in this context standard hyperparameter optimization procedures such as cross-validation are usually not available, since the training set contains only normal objects. For this reason, an algorithm designed for anomaly detection should involve as few hyperparameters as possible.
The extreme value machine (EVM) introduced in Rudd et al. (2018) is an algorithm that uses extreme value theory to attack these problems. In the last few years, extreme value theory has become an important tool in multivariate statistics and machine learning. This is due to the fact that the extreme features, rather than the average ones, are the most important for discriminating between different objects (Scheirer 2017). The EVM strongly relies on the distances between the different classes in the training set. In particular, as we show in Section 8 using a simulated data set, this is the reason why the EVM fails to recognize abnormal points that arise relatively close to one class in the training set, even if the recognition task is fairly simple. For this reason, we propose an alternative approach that uses extreme value theory and overcomes this problem. We call our method the GPD classifier (GPDC), after the generalized Pareto distribution approximation from extreme value theory on which it relies. Moreover, we also present a second, more heuristic algorithm, namely the GEV classifier (GEVC), based on the generalized extreme value distribution. We underline that, in the following, we consider only the Euclidean distance for simplicity, but the obtained results are more general and can be applied to any distance.
This paper is organized as follows. In Section 2 we summarize the previous work in this context. Section 3 states some fundamental results from extreme value theory that will be useful in the definition of the GEVC and GPDC. In Section 4 we describe the general framework of anomaly detection. In Section 5 we describe the EVM and explain the main limitations of this approach. In Sections 6 and 7 we describe the GPDC and GEVC and we theoretically justify their main properties. Finally, in Section 8, we evaluate and compare the performance of these anomaly detection algorithms on both simulated and real data.

Related work
Anomaly detection (Pimentel et al. 2014) is a well established field that shares similar though not identical goals with outlier detection (Walfish 2006). While the latter is focused on discovering which examples of the training data are not in agreement with the process that has generated the bulk of them, in anomaly detection we suppose that we have observations from a number of normal classes, and for a new sample we would like to determine whether it belongs to one of those existing classes or a new one. In this sense, anomaly detection is more naturally related to classification. Machine learning techniques have been applied successfully to both tasks, considerably improving their performance (Abe et al. 2006; Shon and Moon 2007; Désir et al. 2013). Many anomaly detection methods first estimate the distribution of the normal data and then mark as abnormal the new points associated with low density regions (Bishop 1994). We follow a similar approach for the two techniques that we propose. A relevant work in this direction is Roberts (1999), who fits a Gaussian mixture model to the training data and uses extreme value theory to decide if a new point is normal or abnormal. Our methods will not rely on parametric assumptions for the multivariate density, which makes them more widely applicable.
Extreme value theory has been recognized in recent years as a powerful tool to increase the performance of anomaly detection (Geng et al. 2018) and, more generally, machine learning techniques (Jalalzai et al. 2018), with particular interest in computer vision (Scheirer 2017). The main reason for this success is the fact that the tails of the distribution of distances between training observations can be modeled effectively using the asymptotic theory provided by extreme value theory (Scheirer et al. 2011). One example of this can be found in Fragoso et al. (2013), where the well-known RANSAC algorithm is improved using extreme value theory. Moreover, extreme value theory has been applied to extend the use of classical anomaly detection algorithms to extreme regions (Goix et al. 2016; Thomas et al. 2017) and to detect anomalies in time series (Siffer et al. 2017).
A feature that could be important for anomaly detection algorithms is the ability to be easily updated with the arrival of new training data, namely incremental learning. A simple approach in this direction is to use the k-nearest neighbor classifier (James et al. 2013), a route that is to some extent also followed by the GPDC and the GEVC. Another method that naturally adapts to new data is the nearest class mean classifier (Mensink et al. 2012), which represents each class as a prototype vector, the mean of all the examples for that class seen so far. More complex methods are based on machine learning approaches such as support vector machines (Ruping 2001), neural networks (Rebuffi et al. 2017) or random forests (Saffari et al. 2009) and are in general computationally more demanding.
Finally, we note that the formulation of GPDC is partially related to the estimation of the right end point of a univariate distribution. As we will argue in the following, in the GPDC this problem takes a simpler form than in the general problem of upper end point estimation in extreme value statistics (Hall 1982).

Extreme value theory
In this section we briefly recall the main results from univariate extreme value theory and introduce the notation that will be used throughout the paper.
Let X_1, X_2, … be a sequence of i.i.d. random variables with distribution function F, and denote by M_n = max(X_1, …, X_n) the maximum of the first n samples.
The Fisher-Tippett-Gnedenko theorem (cf. Coles et al. 2001) states that if there exist sequences a_n ∈ R, b_n > 0 such that

P{(M_n − a_n)/b_n ≤ x} → G(x), as n → ∞,   (1)

then, if G is a non-degenerate distribution function, it belongs to the class of generalized extreme value (GEV) distributions with

G(x) = exp[ −{1 + ξ(x − μ)/σ}_+^{−1/ξ} ],

where ξ ∈ R, μ ∈ R and σ > 0 are the shape, location and scale parameters, respectively, and a_+ = max(a, 0). If the convergence in Eq. 1 holds, we say that X is in the max-domain of attraction of a GEV distribution with shape parameter ξ. A similar result can be formulated for threshold exceedances. In fact, if the convergence in Eq. 1 holds, then, for a threshold u ∈ R that tends to the upper endpoint of the distribution F of X, the distribution function of X − u, conditional on X > u, is approximately

H(x) = 1 − (1 + ξx/σ̃)_+^{−1/ξ},   (2)

and the distribution H is called the generalized Pareto distribution (GPD). The parameters of the two extreme value distributions are closely related: if M_n approximately follows a GEV distribution G with parameters ξ, μ_n and σ_n, then the GPD H has the same shape ξ and scale σ̃ = σ_n + ξ(u − μ_n).
If we are interested in computing the distribution of the maximum of a large number of independent copies of a given random variable, the first result suggests fitting a GEV distribution to various such maxima taken over blocks of the same length. On the other hand, if we are interested in modeling the right tail of a distribution, then the second result suggests fitting a GPD to all values above a high threshold of a sequence of independent replications of a random variable with this distribution. The two approaches are asymptotically equivalent and the parameters in the two distributions are deterministically linked. The following lemma will be useful to describe our algorithms.
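As an aside, this duality is easy to check numerically. The following sketch (our illustration, not part of the paper; it assumes scipy is available) fits both approximations to standard exponential data, for which the true shape parameter is ξ = 0:

```python
# Illustrative sketch only: compare the block-maxima (GEV) and threshold-
# exceedance (GPD) approximations on standard exponential data (true xi = 0).
import numpy as np
from scipy.stats import genextreme, genpareto

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)

# Block maxima: fit a GEV to maxima over blocks of equal length.
maxima = x.reshape(200, 500).max(axis=1)
c_gev, _, _ = genextreme.fit(maxima)
xi_gev = -c_gev                      # scipy's c is -xi in EVT notation

# Threshold exceedances: fit a GPD to excesses over a high threshold.
u = np.quantile(x, 0.99)
c_gpd, _, _ = genpareto.fit(x[x > u] - u, floc=0.0)
xi_gpd = c_gpd                       # scipy's genpareto c equals xi

print(round(xi_gev, 2), round(xi_gpd, 2))   # both estimates should be near 0
```

Both fitted shapes are close to zero, and the two estimates agree up to sampling noise, illustrating the deterministic link between the two families.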
Lemma 1 (Embrechts et al. 2013) Suppose that the distribution function F of X has an upper endpoint denoted by b*. Then it is in the max-domain of attraction of a GEV with ξ < 0 if and only if

1 − F(b* − 1/x) = x^{1/ξ} ℓ(x), x → ∞,

where ℓ is a slowly varying function at infinity. This is also equivalent to (b* − X)^{−1} having a regularly varying tail with index 1/ξ. In this case, we have the representation b* = u − σ̃/ξ in terms of the parameters of the generalized Pareto distribution.

General setting
In this section we formalize the task of anomaly detection in a way suitable for all the algorithms that we describe in the following.
Denote the training data by x_i ∈ R^p, each labeled with a class y_i ∈ {C_1, …, C_J}, i = 1, …, n. Here and throughout, p ∈ N is the dimension of the predictor space and J ∈ N the number of different classes in the training set. In total, then, we have J normal classes and we assume that each class is described by a continuous density function f_{C_j} defined on R^p, j = 1, …, J, where the probability that a point in class C_j falls in the set A ⊂ R^p is

P(A) = ∫_A f_{C_j}(x) dx.

The process that has generated the training data set can be described as a mixture of these density functions, with unconditional density

f(x) = Σ_{j=1}^{J} w_j f_{C_j}(x),

where the weights w_j ≥ 0 sum to one. The value of the function f evaluated at some point is thus large if that point has a high chance of being normal. Note that, in the following, we approximate f directly and therefore do not have to estimate the weights w_j.
Further, suppose that we have a new unlabeled test point x_0 ∈ R^p that we would like to mark as normal or abnormal. More precisely, we define the point x_0 as normal if it was sampled from the training distribution with density f, and as abnormal if it was sampled from another distribution with unknown density f_0. The goal in anomaly detection is to decide if x_0 comes from the distribution with density f of the training data or not. Thus, we need to perform the hypothesis test

H_0: x_0 was sampled from f,  versus  H_1: x_0 was sampled from f_0.

Note that most algorithms for anomaly detection avoid formal assumptions on the distribution with density f_0 and mark a test point x_0 as abnormal if the training density f(x_0) is low. All the methods that we describe in the following are of this type.

The extreme value machine
In this section we first briefly describe the extreme value machine introduced in Rudd et al. (2018) and then underline its limitations and incorrect assumptions.

Algorithm description
The idea of the EVM is to approximate the distribution of the margin distance of each point in each class using extreme value theory. A new point is then classified as normal if it is inside the margin of some point in the training set with high probability. Rudd et al. (2018) introduce the margin distance of a training point x_i as half of the minimum distance between x_i and all the points belonging to a different class in the training data set. More formally, writing D^{(i)}_j = ||x_i − x_j||/2 for the halved distances to the points x_j with y_j ≠ y_i, the margin distance of x_i is

M^{(i)} = min_{j: y_j ≠ y_i} D^{(i)}_j.

The idea is to model the lower tail of the distribution of M^{(i)} by using the D^{(i)}_j as the input data. An equivalent problem, to which we can apply extreme value theory, is to retrieve the upper tail of the distribution of M̃^{(i)} = −M^{(i)}. To this end, the authors propose to use the Fisher-Tippett-Gnedenko theorem (cf. Coles et al. 2001) to fit to the k largest observed −D^{(i)}_j a reverse Weibull distribution

W^{(i)}(x) = exp{−(−x/σ_i)^{α_i}}, x < 0,   (3)

assuming b* = 0, since M̃^{(i)} is bounded above by zero as a negated distance. We discuss issues with the use of this approximation in the next section. Estimating the parameters α_i and σ_i for each training observation x_i separately yields estimates Ŵ^{(i)} of the distribution functions W^{(i)}. Those estimates are used to label a new point x_0 as normal if it is likely enough to fall in at least one of the margins of the n training data points, that is, if

max_{i=1,…,n} Ŵ^{(i)}(−||x_0 − x_i||) ≥ δ,   (4)

and it is labeled as abnormal otherwise. Here, δ is a probability threshold that is chosen by a heuristic formula in Rudd et al. (2018).
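For concreteness, the margin distances M^{(i)} can be computed as follows (a minimal sketch in our notation, not the authors' reference implementation; the toy points are made up):

```python
# Sketch (our notation, not the EVM reference implementation): the margin
# distance M^(i) is half the minimum distance from x_i to any point of
# another class.
import numpy as np

def margin_distances(X, y):
    X, y = np.asarray(X, float), np.asarray(y)
    M = np.empty(len(X))
    for i in range(len(X)):
        # distances from x_i to all points of other classes (= 2 * D^(i)_j)
        d_other = np.linalg.norm(X[y != y[i]] - X[i], axis=1)
        M[i] = 0.5 * d_other.min()
    return M

# Two made-up classes on a line.
X = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
y = ["A", "A", "B", "B"]
print(margin_distances(X, y).tolist())   # [2.0, 1.5, 1.5, 2.0]
```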

Limitations of the EVM
The EVM recasts the anomaly detection problem in a probabilistic framework within extreme value theory, but it has some drawbacks that we describe here and that will be the starting point to construct new methods that improve upon it. First, assuming in Eq. 3 that the upper endpoint is b* = 0 implies that all classes in the training set share the same support and that it is impossible that two classes are perfectly separated. Indeed, if two classes are perfectly separated, then the distance between two points sampled from them is bounded from above by a strictly positive constant and therefore b* < 0. This assumption is rather restrictive and it is hardly verifiable on observed data. Second, estimating the upper tail of M̃^{(i)} based on the k largest observed −D^{(i)}_j for each x_i should rely on the GPD instead of the GEV approximation, since this amounts to a threshold exceedance approach with a GPD as limiting distribution.
A further limitation of the EVM is related to the choice of the threshold δ in Eq. 4. This threshold controls the anomaly detection error, and a bigger δ implies a higher probability of labeling a new object as abnormal. In a statistical setting such thresholds are chosen by fixing the type I error, in this case the probability of classifying a normal object as abnormal. Unfortunately, for the EVM algorithm the authors do not propose any procedure to choose δ that controls the type I error, and thus it is not clear how to tune it properly.
The most important drawback of the EVM is that it gives an unjustified premium to normal classes far from the others. In fact, if there are classes relatively close to each other and one class far from the others, then all the points of the latter class will have a large margin distance. For this reason, if we observe a new point from an abnormal class located closer to this far class than to the other normal classes, we always classify the new point as belonging to the far away class. The main issue here is that we cannot use the distances between the normal classes to do anomaly detection, because they do not convey any information regarding the abnormal ones. Knowing that class A is very far from class B does not necessarily imply that a new object x_0 is normal only because the distance between it and the closest point in class A is much smaller than the distance between classes A and B. It might simply be that the new class is closer to class A than is class B. The EVM implicitly assumes that the distances between the normal classes also reflect the behavior of the abnormal ones, a strong assumption that is again hard to verify. We will illustrate this problem in a simulation study in Section 8.

The GPD classifier
In this section we propose an algorithm for anomaly detection that uses the principles of extreme value theory and that does not rely on the distances between the different classes in the training set. This algorithm, which we call the GPDC, considers the distances between the new point that we want to mark as normal or abnormal, and the points composing the training data set. For simplicity, and since we are primarily interested in anomaly detection, all the normal classes are seen as one big class. If there is high confidence that the lower endpoint of the distribution of these distances is zero, then the process that has generated the training data can also have generated the new point, and we thus mark the new point as normal; otherwise we mark it as abnormal.

Extreme value theory and anomaly detection
In this setting the training data x_i ∈ R^p, i = 1, …, n, can be thought of as being sampled from only one class, the class of the normal data points, with density f. Given a new point x_0 without label, we are interested in deciding whether it is from a normal class and has been generated by f, or whether it belongs to an unseen class and has been generated by an abnormal density f_0.
We denote by D_i, i = 1, …, n, the distances ||x_i − x_0|| between x_0 and the points in the training set. In the following, we are interested in the lower tail of the distribution D of the distance between x_0 and a generic point sampled from f. Intuitively, if the lower tail of D behaves similarly to the corresponding quantity for training samples, then x_0 has a high probability of being a normal point. More mathematically, let B_δ(x_0) denote a ball of radius δ centered at x_0. Then

P(D ≤ δ) = P{B_δ(x_0)} = ∫_{B_δ(x_0)} f(x) dx,

where P is the probability measure that has density f. Moreover, the upper endpoint of −D is equal to zero if and only if there is no δ > 0 such that P{B_δ(x_0)} = 0. Only in this case is there a positive probability that x_0 comes from the process that has generated the training data, i.e., that it is a normal point. Under weak assumptions, we can then specify the exact tail behavior of −D at zero.

Proposition 1
Assume that x_0 is in the interior of the support supp(f) of the normal classes and that f is continuous at x_0. Then the distribution of −D is in the max-domain of attraction of a GEV distribution with shape parameter ξ = −1/p, where p ∈ N is the dimension of the predictor space.
Proof We first note that the survival function of −D behaves at zero as

P(−D > −δ) = P(D ≤ δ) = P{B_δ(x_0)} ≈ f(x_0) V{B_δ(x_0)} = f(x_0) c_p δ^p,

where V{B_δ(x_0)} = c_p δ^p is the volume of the ball B_δ(x_0) in R^p, c_p is the volume of the unit ball, and the approximation holds as δ → 0. From this representation it follows directly that the distribution of −D is in the max-domain of attraction of a GEV distribution with shape parameter ξ = −1/p.
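The expansion in this proof is easy to verify by simulation. The sketch below (our illustration, not from the paper) checks P(D ≤ δ) ≈ f(x_0) c_p δ^p for a uniform density on the unit square, where f(x_0) = 1 and c_2 = π:

```python
# Sketch: Monte Carlo check of P(D <= delta) ~ f(x0) * c_p * delta^p for
# f uniform on the unit square (p = 2, f(x0) = 1, c_2 = pi).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(200_000, 2))       # training samples from f
x0 = np.array([0.5, 0.5])                # interior point of supp(f)
D = np.linalg.norm(X - x0, axis=1)       # distances D_i to x0

delta = 0.05
empirical = np.mean(D <= delta)          # estimate of P(D <= delta)
theory = np.pi * delta**2                # f(x0) * volume of B_delta(x0)
print(round(empirical / theory, 2))      # ratio should be close to 1
```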
Note that f is, in general, a density on a multivariate space, whereas P (−D > ·) is a simple univariate survival function and the above result describes its tail behavior. We can therefore model the right tail of −D using extreme value statistics.
We further observe that the above proposition and Lemma 1 imply that 1/D has a regularly varying survival function,

P(1/D > x) = x^{−p} ℓ(x), x → ∞,

for some function ℓ that is slowly varying at infinity. The upper tail of 1/D is thus in the max-domain of attraction of a GEV distribution with shape parameter ξ̃ = 1/p > 0. Therefore, a consistent estimator of ξ̃ is given by the Hill estimator (Hill 1975). The following theorem uses this fact to obtain a consistent estimator for ξ.

Theorem 1
Assume that x_0 is in the interior of the support supp(f) of the normal classes and that f is continuous at x_0. Denote the distances between x_0 and the points in the training set by D_1, …, D_n. Let further R^{(n)} ≥ R^{(n−1)} ≥ … ≥ R^{(1)} be the order statistics of R_i = −D_i. For k = k(n) → ∞ and k(n)/n → 0, as n → ∞, let

ξ̂_n = (1/k) Σ_{i=1}^{k} log{ R^{(n+1−i)} / R^{(n−k)} }.   (5)

This estimator converges in probability,

ξ̂_n → −1/p, as n → ∞,

where p ∈ N is the dimension of the predictor space.
Proof Rewriting the statistic ξ̂_n slightly gives

ξ̂_n = −(1/k) Σ_{i=1}^{k} log{ (−1/R^{(n+1−i)}) / (−1/R^{(n−k)}) },

and we recognize the negated classical Hill estimator applied to the sequence of independent copies −1/R_1, …, −1/R_n of 1/D. The consistency of the Hill estimator (cf. Theorem 3.2.2 in De Haan and Ferreira (2007)) and the fact that 1/D has shape parameter 1/p yield the desired result.
The above result motivates the use of the statistic ξ̂_n in Eq. 5 in a hypothesis test to decide whether the new point x_0 can come from a normal class, and the last part of the above theorem provides the theoretical background for this test. Indeed, under the hypothesis that x_0 is in the interior of the support of f, the density of the normal classes, the statistic ξ̂_n is approximately the negative reciprocal of the predictor space dimension. Moreover, it is possible to show that the estimator (5) is a special case of the classical Hill estimator (De Haan and Ferreira 2007) and hence asymptotically normal under a second order condition. Using this fact, we could even derive asymptotic confidence intervals for ξ based on it. The assumption on k(n) is common in extreme value statistics and ensures the right trade-off between bias and variance of a tail estimator.
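A minimal implementation of the estimator in Eq. 5 takes only a few lines (our sketch; the data-generating model below is made up for illustration):

```python
# Sketch: the negated-Hill statistic of Eq. 5 computed from the k largest
# negated distances; for x0 in the interior of supp(f) it should be near -1/p.
import numpy as np

def xi_hat(dists, k):
    d = np.sort(np.asarray(dists))          # D_(1) <= ... <= D_(n)
    # R^(n+1-i) = -D_(i), so the log ratios reduce to log(D_(i) / D_(k+1))
    return np.mean(np.log(d[:k] / d[k]))

rng = np.random.default_rng(2)
p = 3
X = rng.uniform(-1, 1, size=(100_000, p))   # one normal class on a cube
D = np.linalg.norm(X, axis=1)               # x0 = origin, an interior point
print(round(xi_hat(D, k=200), 2))           # should be near -1/p = -0.33
```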
Under the alternative, namely that the new point x_0 is abnormal, we can encounter two situations. First, if the support of the new class overlaps with the support supp(f) of the normal classes, then −D might have upper endpoint zero if x_0 falls into the interior of supp(f). Second, if the supports do not overlap or x_0 ∉ supp(f), then the upper endpoint of −D is strictly smaller than zero. In this second case, the statistic ξ̂_n can be shown to converge to zero.
Theorem 2 Let r* be the upper endpoint of −D and suppose that r* < 0. Assume that k = k(n) → ∞ and k(n)/n → 0, as n → ∞. Then the statistic in Eq. 5 converges almost surely to zero, that is, ξ̂_n → 0 a.s. as n → ∞.

Proof Each summand in Eq. 5 is bounded in absolute value by log{R^{(n−k)}/R^{(n)}} ≤ log{R^{(n−k)}/r*}. Since −R^{(n−k)} → −r* almost surely as n → ∞, this proves the assertion.
We can use this fact to perform a first test and mark a new point x_0 for which p ξ̂_n is significantly larger than −1 as an abnormal point. Note that we use the quantity p ξ̂_n instead of ξ̂_n to stabilize the asymptotic variance and to avoid numerical instabilities, as −1/p → 0 for p → ∞. In fact, under a second order condition and by asymptotic normality of the Hill estimator (cf. Theorem 3.2.5 in De Haan and Ferreira (2007)), the quantity √k (p ξ̂_n + 1) is approximately standard normal. With this step we have excluded all the points that cannot belong to one of the classes of the training set.
If, on the other hand, p ξ̂_n is close to −1, we cannot reject the null hypothesis that x_0 is in the interior of supp(f). In this case, we can use the threshold u = R^{(n−k)} to model the upper tail of −D, and since the upper endpoint of −D is 0, Lemma 1 yields that the scale parameter of the approximating GPD satisfies σ̃ = R^{(n−k)} ξ. Using the estimator ξ̂_n from Theorem 1 yields the tail approximation

P(−D > x) ≈ (k/n) [1 − H{x − R^{(n−k)}}],  x > R^{(n−k)},   (6)

where H is the GPD distribution with shape ξ̂_n and scale σ̃ = R^{(n−k)} ξ̂_n. There is thus the possibility that x_0 was generated from one of the normal classes. Obviously, we have to exclude points that have a positive, but very small, probability of being normal. We therefore approximate the size of the ball around x_0 that contains a fixed amount γ > 0 of mass of f with the approximate distribution of −D as given by Eq. 6. More precisely, we let q_γ < 0 be the (1 − γ)-quantile of −D for some small γ < k/n, which we can compute from Eq. 6 as

q_γ = R^{(n−k)} + H^{−1}(1 − γn/k),   (7)

such that P{B_{−q_γ}(x_0)} is approximately equal to γ. Thus, the smaller −q_γ, the higher the training density f(x_0) around x_0. If the −q_γ obtained for the new point x_0 is significantly larger than the corresponding quantity for a generic point in the training set, we mark x_0 as abnormal, since the density f(x_0) around x_0 is too small. Otherwise we mark it as normal. The probability γ should be chosen small enough to obtain a good approximation of the magnitude of the density around x_0, but at the same time large enough to obtain a reliable estimate of q_γ. We choose γ = 1/n, since this is sufficiently small and R^{(n−k)} + H^{−1}(1 − 1/k) is usually a very good approximation of the true quantile. Note that the quantile estimator in Eq. 7 is a special case of the estimator in Weissman (1978).
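The quantile in Eq. 7 has a closed form, since the GPD inverse is H^{−1}(q) = (σ̃/ξ̂_n){(1 − q)^{−ξ̂_n} − 1}. A sketch (ours, with a made-up distance model for which ξ = −1/2 holds exactly):

```python
# Sketch: the Weissman-type quantile of Eq. 7 with the endpoint-zero scale
# sigma = R^(n-k) * xi from Lemma 1.
import numpy as np

def q_gamma(dists, k, xi, gamma=None):
    d = np.sort(np.asarray(dists))
    n = len(d)
    gamma = 1.0 / n if gamma is None else gamma
    u = -d[k]                                # threshold R^(n-k), negative
    # GPD inverse H^{-1}(q) = (sigma/xi)*((1-q)^(-xi) - 1) with sigma = u*xi
    return u + u * ((gamma * n / k) ** (-xi) - 1.0)

# Made-up model: D = U^{1/2} with U uniform, so P(D <= t) = t^2 and xi = -1/2.
rng = np.random.default_rng(3)
D = rng.uniform(size=100_000) ** 0.5
print(q_gamma(D, k=200, xi=-0.5))   # close to the true quantile -1/sqrt(n) ~ -0.0032
```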

The GPDC algorithm
Using the results on the asymptotic behavior of the statistic ξ n , we propose the following algorithm to classify a new point x 0 .
1. Compute the negated distances −D_1, …, −D_n between x_0 and each point in the training set.
2. Estimate ξ̂_n using only the k largest negated distances R^{(n)}, …, R^{(n+1−k)}.
3. If p ξ̂_n is smaller than a given threshold s > −1, mark x_0 as possibly normal and go to the next step; otherwise mark it as abnormal and exit the algorithm.
4. Compute the (1 − 1/n)-quantile of −D as q_γ = R^{(n−k)} + Ĥ^{−1}(1 − 1/k), using the estimated GPD Ĥ and γ = 1/n.
5. A large radius −q_γ corresponds to a low density of f at x_0. Thus, if −q_γ is bigger than a given threshold t > 0, mark x_0 as abnormal; otherwise mark it as normal.
Some of the steps require further explanation. The number of upper order statistics k in the second step that is used to obtain ξ̂_n corresponds to the classical bias-variance trade-off of the Hill estimator. There are widely used diagnostic plots that help to determine the best k (Coles et al. 2001).
Note that step 3. marks x 0 as abnormal if it is not in the support of the training density f . Therefore, it can be seen as a preliminary hypothesis test with null hypothesis "x 0 is in the support of f " and alternative "x 0 is not in the support of f ". If we do not reject this first null hypothesis, step 5. is needed to test the stronger null hypothesis "x 0 is normal" with alternative "x 0 is abnormal". In particular, we mark x 0 as abnormal if it lies in a region where the density f (x 0 ) is low. Thus, steps 4. and 5. are especially necessary if the supports of f and f 0 are not clearly separated.
Steps 3. and 5. require the choice of the thresholds s and t to perform the hypothesis tests. It is common to fix the type I error at some probability α such as 5%, and then to compute the threshold that realizes this error. To this end we require the distribution of the test statistic under the null hypothesis. Instead of relying on asymptotic results, we propose to execute the algorithm using all the points in the training set as unknown, one after the other, in a jackknife fashion, obtaining in this way the statistics ξ̂_n^{(i)} and −q_γ^{(i)}, i = 1, …, n. We then jointly set the thresholds s and t to the empirical (1 − α/2)-quantiles of ξ̂_n^{(1)}, …, ξ̂_n^{(n)} and −q_γ^{(1)}, …, −q_γ^{(n)}, respectively, such that we mark at most a fraction α of the training data as abnormal, following Bonferroni's correction for multiple testing (Shaffer 1995). Note that this requires computing O(nk log n) distances, roughly the same as in the training phase of the EVM. We can carry out this procedure at the beginning of the life of the algorithm and repeat it once in a while when new training data arrive.
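Putting the pieces together, the decision rule can be sketched as below (our illustration; the thresholds s and t here are made-up values for this toy setting rather than jackknife-calibrated ones):

```python
# Sketch of the GPDC decision rule (steps 1-5). The thresholds s and t are
# illustrative; in practice they are calibrated on the training set.
import numpy as np

def gpdc_classify(X_train, x0, k, s, t):
    X_train = np.asarray(X_train, float)
    p = X_train.shape[1]
    d = np.sort(np.linalg.norm(X_train - x0, axis=1))   # D_(1) <= ... <= D_(n)

    # Steps 1-3: negated-Hill statistic; large p*xi suggests x0 is off-support.
    xi = np.mean(np.log(d[:k] / d[k]))
    if p * xi > s:
        return "abnormal"

    # Steps 4-5: Weissman quantile with gamma = 1/n and upper endpoint 0,
    # i.e. q_gamma = R^(n-k) + H^{-1}(1 - 1/k) = -d[k] * (1/k)^(-xi).
    q = -d[k] * (1.0 / k) ** (-xi)
    return "abnormal" if -q > t else "normal"

rng = np.random.default_rng(4)
X = rng.uniform(size=(20_000, 2))                            # one normal class
print(gpdc_classify(X, [0.5, 0.5], k=100, s=-0.5, t=0.02))   # normal
print(gpdc_classify(X, [3.0, 3.0], k=100, s=-0.5, t=0.02))   # abnormal
```

For the interior point, p ξ̂_n ≈ −1 and the estimated radius −q_γ is tiny; for the distant point, the log ratios in ξ̂_n are all close to zero, so the first test already rejects.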
We emphasize that, contrary to what is stated in the EVM paper (Rudd et al. 2018), one should be wary of choosing hyperparameters by cross-validation based on randomly splitting the classes in the training set into normal and abnormal data points. This can be misleading, since the normal classes convey no information about the abnormal ones.
It is important to underline that this algorithm relies on the asymptotic approximation by a GPD and hence is asymptotically exact as the number of training samples tends to infinity. Moreover, it is completely kernel free and its implementation is very efficient using fast nearest-neighbor searching (Arya et al. 1998). In fact, during the evaluation of a new point x_0 we have to compute only O(k log n) distances, whereas with the EVM we have to compute O(n) distances. We stress also that the algorithm supports incremental learning, since it does not require any training procedure once the thresholds for the hypothesis tests are fixed. If the interest is not only in anomaly detection, our algorithm can be complemented with an incremental classifier that performs standard classification only on the points marked as normal.

The GEV classifier
The GPDC in the previous section was based on the approximation of threshold exceedances by the GPD as limiting distribution. In this section we propose another, more heuristic algorithm for anomaly detection. It is based on the approximation by the GEV distribution given by the Fisher-Tippett-Gnedenko theorem (cf. Coles et al. 2001), and we thus refer to it as the GEVC.
For simplicity, as before, all normal classes are collapsed into one big class; we denote the training data by x_i, i = 1, …, n, and by f the probability density describing the distribution of the training data. For each i = 1, …, n, we compute the distance between the training point x_i and the nearest other training point, i.e.,

D_i^min = min_{j ≠ i} ||x_i − x_j||.

This phase, using fast neighbor search, requires computing only O(n log n) distances. Note that the quantities −D_i^min are bounded above by zero. Even though these random variables are not exactly independent, their dependence seems weak enough to still apply methods from extreme value theory. Using this and Lemma 1, we can fit a GEV with negative shape parameter to −D_1^min, …, −D_n^min, and thus estimate the distribution of −D^min, i.e., the distribution of the maximum of the negated distances between a normal point and all the remaining points in the training set. Denote this estimated distribution function by Ŵ. We note again that this approximation is motivated heuristically: conditionally on x_i, −||x_i − x_j|| would be in the max-domain of attraction of a GEV with negative shape parameter. Since the data point x_i is random, this argument is formally no longer true, but in practice the GEV approximation with negative shape parameter still seems suitable.
When a new point x_0 arises and we want to mark it as normal or abnormal, we consider the statistic

d_0^min = min_{i=1,…,n} ||x_i − x_0||

in order to perform the hypothesis test described in Section 4. Note that this operation requires computing only O(log n) distances. Under the null hypothesis, −d_0^min is approximately a sample from the distribution of −D^min. Thus we expect the quantity Ŵ(−d_0^min) not to be too small, whereas under the alternative hypothesis this is not guaranteed. We can then fix a level α > 0 for the type I error to determine whether the new point x_0 is normal or abnormal. Given the estimated distribution Ŵ that approximates the distribution of −D^min, we reject the null hypothesis if Ŵ(−d_0^min) < α. In this case, the smallest distance of the new point to a training sample is too large, and we therefore mark it as abnormal. Otherwise it is marked as normal. Note that, also with the GEVC, the threshold α permits direct control of the type I error. Moreover, the GEVC has no hyperparameters and is fast to update when we add new data to the training set, since revising D_i^min requires computing only the minimum distance between x_i and the added points.
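A compact sketch of the whole GEVC follows (our illustration, assuming scipy is available; the toy data are made up):

```python
# Sketch of the GEVC: fit a GEV to the negated nearest-neighbor distances
# -D_i^min of the training set, then flag x0 whenever W(-d_0^min) < alpha.
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import genextreme

def gevc_fit(X):
    tree = cKDTree(X)
    d_min = tree.query(X, k=2)[0][:, 1]      # k=2: skip the point itself
    W = genextreme(*genextreme.fit(-d_min))  # estimated GEV for -D^min
    return tree, W

def gevc_classify(tree, W, x0, alpha=0.05):
    d0_min = tree.query(np.asarray(x0, float), k=1)[0]
    return "abnormal" if W.cdf(-d0_min) < alpha else "normal"

rng = np.random.default_rng(5)
X = rng.uniform(size=(5_000, 2))             # one normal class
tree, W = gevc_fit(X)
print(gevc_classify(tree, W, X[17] + 1e-4))  # next to a training point: normal
print(gevc_classify(tree, W, [3.0, 3.0]))    # far from the data: abnormal
```

The k-d tree makes both the fitting phase and each test point evaluation fast, in line with the distance counts stated above.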

Application
In this section we compare the EVM introduced in Rudd et al. (2018) with our GPDC and GEVC on simulated and real data. For completeness, when using real data, we also compare the methods with Isolation Forest (Liu et al. 2012) and One-Class SVM (Schölkopf et al. 2000), two state-of-the-art methods for novelty detection. For the latter technique, we use a radial kernel and select its hyperparameters using part of the test data as a validation set, including both normal and abnormal objects, following what is proposed in the original paper, while for the former we follow the original implementation. Note that using part of the test set is possible only in an experimental setting, since in real anomaly detection we do not have access to the abnormal objects during the training phase.

Simulated data
We begin our experimental evaluation with a toy example showing that the EVM may perform poorly when the distances between the different classes in the training set convey misleading information about the abnormal ones. The training set has n = 600 observations and is composed of three classes, each sampled from a different bivariate normal distribution, i.e., the dimension of the predictor space is p = 2. The test set has 800 observations and is composed of examples from these three classes plus an abnormal class, sampled from a fourth bivariate normal distribution. Both the training and the test data are shown in the left panel of Fig. 1. The abnormal objects are well separated from the normal ones.
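A data-generating sketch in the spirit of this toy example is given below; the means and scales are hypothetical choices of ours, not the exact values used for Fig. 1, picked so that the abnormal class is well separated but closest to one of the normal classes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Three normal classes, 200 points each (n = 600), in p = 2 dimensions
means_normal = [(0.0, 4.0), (4.0, 4.0), (2.0, 0.0)]
X_train = np.vstack(
    [rng.normal(loc=m, scale=0.5, size=(200, 2)) for m in means_normal]
)

# Test set: 200 points per normal class plus 200 abnormal points from a
# fourth Gaussian that is closest to the bottom normal class
X_test_normal = np.vstack(
    [rng.normal(loc=m, scale=0.5, size=(200, 2)) for m in means_normal]
)
X_test_abnormal = rng.normal(loc=(2.0, -3.0), scale=0.5, size=(200, 2))
X_test = np.vstack([X_test_normal, X_test_abnormal])
```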
We apply the EVM, the GPDC and the GEVC to this data set. Since the probability threshold of the EVM is not interpretable in the same way as those of the GPDC and the GEVC, we evaluate the performance using the area under the ROC curve (AUC) (Bradley 1997), which does not require fixing a threshold. For both the EVM and the GPDC we use k = 20, but the results are stable for different values of k. The EVM performs poorly in this rather simple scenario and has an AUC of 0.853. This is because the abnormal class is relatively close to the normal class at the bottom compared to the other normal classes, even though it is perfectly separated from it. Conversely, the GPDC and the GEVC do not suffer from this type of issue, since they do not rely on the distances between the normal classes to infer about the abnormal ones; their results in this toy example are close to perfect, with AUCs of 0.997 and 0.999, respectively. To show the effectiveness of the GPDC in determining whether new examples lie in the support of the training distribution, we report the estimated shape parameter $\xi_n$ from the test in step 3 of the algorithm in Section 6.2, for both normal and abnormal new data. The right panel of Fig. 1 shows that for this dataset the $\xi_n$ estimates for normal data are, as expected from Proposition 1, close to −1/2. On the other hand, the estimates corresponding to abnormal data are much closer to 0, as suggested by Theorem 2.
Step 3 of the GPDC algorithm thus already has good power to filter out the abnormal classes for this data set. We note that, strictly speaking, the supports of all classes overlap, but for this finite number of data points they are effectively separated, so the test works well. Figure 2 shows the decision boundaries of the GEVC and the GPDC for α = 0.01 and α = 0.1. Both algorithms produce highly flexible decision boundaries that are capable of following the shape of the training data. In some sense, one can see these boundaries as level sets of the training density: a new point is considered abnormal if it lies outside of these level sets. He and Einmahl (2017) also considered extrapolation of level sets into low-density regions, but under the more restrictive assumption of multivariate regular variation.

OLETTER protocol
In order to evaluate the anomaly detection performance of the GPDC and the GEVC on real data, we compare the two techniques with the EVM, One-Class SVM and Isolation Forest using the OLETTER protocol proposed in Bendale and Boult (2015). This protocol is based on the LETTER data set (Frey and Slate 1991), which contains a total of 20000 observations of 26 different classes corresponding to handwritten letters. The predictor space is composed of p = 16 features extracted from the handwritten letters. The training set has 15000 observations. The protocol consists of randomly selecting 15 classes that are considered normal during training and adding abnormal classes by incrementally including subsets of the remaining 11 classes during testing, thereby varying the amount of openness, i.e., the proportion of abnormal objects, during the test phase. This process is repeated 20 times, in a cross-validation fashion. We evaluate the performance of the different algorithms using the F-measure (Huang and Ling 2005), as done in the original paper (Bendale and Boult 2015). The F-measure is an evaluation metric that combines recall and precision according to the formula
$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where
$$\mathrm{precision} = \frac{\text{number of correctly identified abnormal data points}}{\text{number of data points identified as abnormal}}
\quad \text{and} \quad
\mathrm{recall} = \frac{\text{number of correctly identified abnormal data points}}{\text{number of abnormal data points}}.$$
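This metric can be computed with a few lines of code; the sketch below treats the abnormal class (label 1) as the positive class and returns 0 when there are no true positives.

```python
def f_measure(y_true, y_pred):
    """F-measure with the abnormal class (label 1) as positive class.

    y_true, y_pred: sequences of 0 (normal) / 1 (abnormal) labels.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```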
For all four algorithms we set the probability threshold dynamically, using the heuristic rule proposed in Rudd et al. (2018) that accounts for the amount of openness at each step of the protocol. We set the hyperparameter k = 75 for the EVM, as in the original paper, and k = 22 for the GPDC, roughly corresponding to using the 0.25% largest negated distances.
The results for all the methods are reported in Fig. 3. All three methods based on EVT perform better than Isolation Forest and One-Class SVM, even though they have fewer hyperparameters than the latter methods. Moreover, the hyperparameters of One-Class SVM were tuned using abnormal examples as well, which is not realistic in real-world use of an anomaly detection classifier. The GPDC is the best method in this case, while the GEVC is competitive with the other methods even though it has no hyperparameters.

Diagnostics of thyroid disease
We consider an application of our algorithms for anomaly detection to the diagnostics of hypothyroidism, a type of thyroid disease. We analyse the thyroid dataset (Quinlan et al. 1986; Schiffmann et al. 1992), available at the UCI machine learning repository (Dua and Graff 2017). The original dataset contains raw clinical measurements from patients affected by hypothyroidism and from healthy patients. These raw measurements were pre-processed so that neural networks and other common machine learning techniques can be applied directly, resulting in 21 features for each patient. The final dataset contains 250 sick subjects and 6666 healthy subjects. We consider sick subjects as abnormal objects and healthy subjects as normal objects. We use all 250 sick patients together with 250 randomly selected healthy patients to compose the test set; all the remaining healthy patients compose the training set.
Since in this case the normal objects belong to only one class, the EVM cannot be used on this dataset. For this reason, we compare the GEVC and the GPDC only with One-Class SVM and Isolation Forest. Here too, we use the AUC as evaluation measure. For the GPDC, we train it with five different values of k, corresponding to using the most extreme 0.25%, 1%, 2.5%, 5% and 10% of the distances, respectively. These are common choices in extreme value applications; we finally retain the value of k that gives the best ROC curve. For One-Class SVM with a radial kernel, we consider different combinations of its hyperparameters and retain the one that gives the best results.
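The mapping from these extreme-distance fractions to candidate values of k can be written as a small helper; this is a convention of ours, consistent with k = 22 corresponding to roughly 0.25% of the distances in the OLETTER experiment above.

```python
def candidate_ks(n, fractions=(0.0025, 0.01, 0.025, 0.05, 0.10)):
    """Candidate k values: the number of most extreme negated distances,
    expressed as a fraction of the n training distances (at least 1)."""
    return [max(1, round(f * n)) for f in fractions]
```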
The ROC curves obtained for all the methods are shown in the left panel of Fig. 4. The performances of the GPDC, the GEVC and One-Class SVM are comparable, with AUCs of 0.931, 0.897 and 0.912, respectively, while Isolation Forest performs considerably worse with an AUC of 0.760. We underline this result, as it is particularly favorable to both the GPDC and the GEVC, which reach the same performance as a state-of-the-art kernel-based method like One-Class SVM and better performance than Isolation Forest. These new methods are kernel-free, and the GPDC has only one hyperparameter, which can be chosen based on EVT, whereas the GEVC has no hyperparameters at all.
The right panel of Fig. 4 shows the performance of the GPDC as a function of the number k of most extreme distances used. The GPDC shows good performance for a wide range of thresholds, and the best results are achieved for fairly small k, as suggested by EVT.

Conclusion
We present two new kernel-free algorithms that perform anomaly detection using extreme value theory. These algorithms, the GPDC and the GEVC, are fast to update with the arrival of new data and easy to adapt to an incremental framework. Moreover, they do not use the distances between the classes in the training set to infer about the abnormal ones, and are thus able to overcome certain restrictions of previously proposed methods. Their performance is therefore good even when the abnormal test data are close to one normal class relative to the other normal classes. To show this fact and the effectiveness of the new GPDC and GEVC, we compare them to the EVM (Rudd et al. 2018), another kernel-free technique that uses EVT for anomaly detection, and to the more classical kernel-based One-Class SVM (Schölkopf et al. 2000) and tree-based Isolation Forest (Liu et al. 2012), two state-of-the-art techniques for novelty detection. The results of our methods on a simulated toy example and on real data sets are competitive. A major strength is that they perform well in very different situations. This good performance in general tasks is probably due to the fact that our methods rely on well-established statistical methods and asymptotically motivated approximations from univariate extreme value theory that apply under very mild conditions. We also underline that both the GPDC and the GEVC are computationally faster than the EVM during the evaluation phase.
The GPDC and the GEVC might be further improved by suitable representations of the data, for example using convolutional neural networks to extract features when working with images, and we plan to perform a more detailed evaluation of their performance in this direction on standard datasets such as ImageNet (Deng et al. 2009) or CIFAR-100 (Krizhevsky and Hinton 2009). Furthermore, to be fully incremental, the algorithms need to be capable of storing only a subset of the training data. This may be achieved by developing a suitable sampling technique that reduces the size of the training dataset without affecting the asymptotic correctness of the algorithms.
The result in Theorem 1 on the shape parameter of sample distances may be of independent interest. In particular, we will study how it can be used as a general method for extrapolation of level sets into low-density regions, similarly to Cai et al. (2011), Einmahl et al. (2015) and He and Einmahl (2017).