Outlier detection in non-elliptical data by kernel MRCD

The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed kernel minimum regularized covariance determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel-induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.


Introduction
The minimum covariance determinant (MCD) estimator introduced in [21,22] is a robust estimator of multivariate location and covariance. It forms the basis of robust versions of multivariate techniques such as discriminant analysis, principal component analysis, factor analysis and multivariate regression; see e.g. [16,15] for an overview. The basic MCD method is quite intuitive. Given a data matrix of n rows with p columns, the objective is to find h < n observations whose sample covariance matrix has the lowest determinant. The MCD estimate of location is then the average of those h points, whereas the scatter estimate is a multiple of their covariance matrix. The MCD has good robustness properties. It has a high breakdown value, that is, it can withstand a substantial number of outliers.
The effect of a small number of potentially far outliers is measured by its influence function, which is bounded [5].
Computing the MCD was difficult at first, but became faster with the FastMCD algorithm of [26] and the deterministic algorithm DetMCD [17]. An algorithm for n in the millions was recently constructed [6]. But all algorithms for the original MCD require the dimension p to be lower than h in order to obtain an invertible covariance matrix. In fact, it is recommended that n > 5p in practice [26]. This restriction implies that the original MCD cannot be applied to datasets with more variables than cases, which are commonly found in spectroscopy and in areas where sample acquisition is difficult or costly, e.g. the field of omics data.
A solution to this problem was recently proposed in [3], which introduced the minimum regularized covariance determinant (MRCD) estimator. The scatter matrix of a subset of h observations is now a convex combination of its sample covariance matrix and a target matrix. This makes it possible to use the MRCD estimator when the dimension exceeds the subset size. But the computational complexity of MRCD still contains a term O(p³) from the covariance matrix inversion, which limits its use for high-dimensional data. Another restriction is the assumption that the non-outlying observations roughly follow an elliptical distribution.
To address both issues we propose a generalization of the MRCD which is defined in a kernel-induced feature space F, where the proposed estimator exploits the kernel trick: the p × p covariance matrix is not calculated explicitly but replaced by an n × n centered kernel matrix, resulting in a computational speedup when n ≪ p. Similar ideas can be found in the literature; see e.g. [10,11], which kernelized the minimum volume ellipsoid [21,22]. The results of the KMRCD algorithm with the linear kernel k(x, y) = xᵀy and the radial basis function (RBF) kernel k(x, y) = exp(−‖x − y‖²/(2σ²)) are shown in Figure 1. This example will be described in detail in Section 6.
The paper is organized as follows. Section 2 describes the MCD and MRCD estimators. Section 3 proposes the kernel MRCD method. Section 4 describes the kernel-based initial estimators used, as well as a kernelized refinement procedure, and proves that the optimization in feature space is equivalent to an optimization in terms of kernel matrices. The simulation study in Section 5 confirms the robustness of the method as well as the improved computation speed when using a linear kernel. Section 6 illustrates KMRCD on three datasets, and Section 7 concludes.

The MCD and MRCD methods

The Minimum Covariance Determinant estimator
Assume that we have a p-variate dataset X containing n data points, where the i-th observation x_i = (x_{i1}, x_{i2}, . . ., x_{ip})ᵀ is a p-dimensional column vector. We do not know in advance which of these points are outliers, and they can be located anywhere. The objective of the MCD method is to find a set H containing the indices of |H| = h points whose sample covariance matrix has the lowest possible determinant. The user may specify any value of h with n/2 ⩽ h < n. The remaining n − h observations could potentially be outliers. For each h-subset H the location estimate c_H is the average of these h points,

c_H = (1/h) ∑_{i∈H} x_i ,

whereas the scatter estimate is a multiple of their covariance matrix, namely

Σ̂_H = c_α (1/h) ∑_{i∈H} (x_i − c_H)(x_i − c_H)ᵀ ,

where c_α is a consistency factor [5] that depends on the ratio α = h/n. The MCD aims to minimize the determinant of Σ̂_H among all H ∈ ℋ, where the latter denotes the collection of all possible sets H with |H| = h:

H_MCD = argmin_{H∈ℋ} det( Σ̂_H ) .   (1)

Computing the exact MCD has combinatorial complexity, so it is infeasible for all but tiny datasets. However, the approximate algorithm FastMCD constructed in [26] is feasible.
FastMCD uses so-called concentration steps (C-steps) to minimize (1). Starting from any given Σ̂_H, the C-step constructs a more concentrated approximation by calculating the Mahalanobis distance of every observation based on the location and scatter of the current subset H:

MD(x_i, c_H, Σ̂_H) = sqrt( (x_i − c_H)ᵀ Σ̂_H⁻¹ (x_i − c_H) ) .

These distances are sorted, and the h observations with the lowest MD(x_i, c_H, Σ̂_H) form the new h-subset, which is guaranteed to have an equal or lower determinant [26]. The C-step can be iterated until convergence.
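The C-step is easy to sketch in code. The following is a minimal illustration of the idea, not the FastMCD implementation itself; the function names and the single random initialization are our own.

```python
import numpy as np

def c_step(X, H):
    """One concentration step: estimate location and scatter from the
    current h-subset H, then return the h indices with the smallest
    Mahalanobis distances.  The consistency factor c_alpha is omitted
    here since it does not affect the ranking of the distances."""
    h = len(H)
    sub = X[H]
    c = sub.mean(axis=0)                       # location: mean of the h points
    S_inv = np.linalg.inv(np.cov(sub, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - c, S_inv, X - c)  # squared distances
    return np.argsort(d2)[:h]                  # new h-subset

def fast_mcd_sketch(X, h, n_iter=50, seed=0):
    """Iterate C-steps from one random initial h-subset until the subset
    stabilizes; the determinant can only decrease along the way."""
    rng = np.random.default_rng(seed)
    H = rng.choice(len(X), size=h, replace=False)
    for _ in range(n_iter):
        H_new = c_step(X, H)
        if set(H_new) == set(H):
            break
        H = H_new
    return H
```

The real algorithms start from many initial subsets and keep the one with the lowest determinant; the sketch uses a single random start only to show the mechanics.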

The Minimum Regularized Covariance Determinant estimator
The minimum regularized covariance determinant (MRCD) estimator is a generalization of the MCD estimator to high-dimensional data [3]. The MRCD subset H is defined by minimizing the determinant of the regularized covariance matrix Σ̂_H^reg:

H_MRCD = argmin_{H∈ℋ} det( Σ̂_H^reg ) ,

where the regularized covariance matrix is given by

Σ̂_H^reg = ρ T + (1 − ρ) Σ̂_H ,

with 0 < ρ < 1 and T a predetermined, well-conditioned, symmetric and positive definite target matrix. The parameter ρ is determined in a data-driven way such that Σ̂_MRCD has a condition number of at most κ, for which [3] proposes κ = 50. The MRCD algorithm starts from six robust, well-conditioned initial estimates of location and scatter, taken from the DetMCD algorithm [17]. Each initial estimate is followed by concentration steps, and at the end the subset H with the lowest determinant is kept. Note that approximate algorithms like FastMCD and MRCD are much faster than exhaustive enumeration, but one can no longer formally prove a breakdown value. Fortunately, simulations confirm the high robustness of these methods. Also note that such approximate algorithms are guaranteed to converge, because they iterate C-steps starting from a finite number of initial fits. The algorithm may converge to a local minimum of the objective rather than its global minimum, but simulations have confirmed the accuracy of the result.
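In code, the regularized scatter is one line, and the condition-number rule for ρ can be sketched with a simple grid search. This is an illustration with hypothetical function names, using the identity target and omitting the consistency factor.

```python
import numpy as np

def regularized_scatter(X_H, rho, T=None):
    """Convex combination of the sample covariance of the h-subset and a
    well-conditioned target T (identity by default).  The consistency
    factor c_alpha is omitted in this sketch."""
    p = X_H.shape[1]
    if T is None:
        T = np.eye(p)
    return rho * T + (1 - rho) * np.cov(X_H, rowvar=False)

def smallest_rho(X_H, kappa_max=50.0):
    """Smallest rho on a grid for which the regularized scatter has
    condition number at most kappa_max ([3] uses kappa = 50)."""
    for rho in np.linspace(0.01, 0.99, 99):
        if np.linalg.cond(regularized_scatter(X_H, rho)) <= kappa_max:
            return rho
    return 1.0
```

Even when h < p, the regularized scatter is invertible for any ρ > 0, which is what makes MRCD applicable to high-dimensional data.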

The Kernel MRCD Estimator
We now turn our attention to kernel transformations [28], formally defined as follows.
Definition 1. A function k : X × X → R is called a kernel on X iff there exists a real Hilbert space F and a map φ : X → F such that for all x, y in X

k(x, y) = ⟨φ(x), φ(y)⟩ ,

where φ is called a feature map and F is called a feature space.
We restrict ourselves to positive semidefinite (PSD) kernels. A symmetric function k : X × X → R is called PSD iff ∑_{i=1}^n ∑_{j=1}^n c_i c_j k(x_i, x_j) ⩾ 0 for any x_1, . . ., x_n in X and any c_1, . . ., c_n in R. Given an n × p dataset X, its kernel matrix is defined as K = ΦΦᵀ with Φ = [φ(x_1), . . ., φ(x_n)]ᵀ. The use of kernels makes it possible to operate in a high-dimensional, implicit feature space without computing the coordinates of the data in that space, but rather by replacing inner products by kernel matrix entries. A well-known example is kernel PCA [27], where linear PCA is performed in a kernel-induced feature space F instead of the original space X. Working with kernel functions has the advantage that non-linear kernels enable the construction of non-linear models. Note that the size of the kernel matrix is n × n, whereas the covariance matrix is p × p. The former is an advantage when dealing with datasets for which n ≪ p, for then the memory and computational requirements are considerably lower.
Given an n × p dataset X = {x_1, . . ., x_n} we thus get its image {φ(x_1), . . ., φ(x_n)} in feature space, where it has the average

c_F = (1/n) ∑_{i=1}^n φ(x_i) .

Note that the dimension of the feature space F may be infinite. However, we will restrict ourselves to the subspace spanned by {φ(x_1), . . ., φ(x_n)}. In this subspace the points φ(x_i) − c_F have at most n − 1 coordinates. The covariance matrix in the feature space, given by

Σ̂_F = (1/(n − 1)) ∑_{i=1}^n (φ(x_i) − c_F)(φ(x_i) − c_F)ᵀ ,

is thus a matrix of size at most (n − 1) × (n − 1). Note that the covariance matrix is centered but the original kernel matrix is not. Therefore we construct the centered kernel matrix K̃ by

K̃ = (I_n − 1_nn) K (I_n − 1_nn) ,   (2)

where 1_nn is the n × n matrix with all entries set to 1/n. Note that the centered kernel matrix is equal to K̃ = Φ̃Φ̃ᵀ with Φ̃ = (I_n − 1_nn)Φ the matrix of the centered feature vectors, and is PSD by construction. The following result is due to [27].
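Centering a kernel matrix as in (2) is a small computation; the sketch below also lets one verify the identity K̃ = Φ̃Φ̃ᵀ for the linear kernel, where the feature map is the identity.

```python
import numpy as np

def center_kernel(K):
    """Centered kernel matrix as in (2): (I - 1_nn) K (I - 1_nn), with
    1_nn the n x n matrix whose entries all equal 1/n.  It is the Gram
    matrix of the centered feature vectors phi(x_i) - c_F."""
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    return J @ K @ J
```

For the linear kernel K = XXᵀ, the centered kernel matrix equals X̃X̃ᵀ with X̃ the column-centered data matrix.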
Theorem 1. Given an n × p dataset X, the sorted eigenvalues of the covariance matrix Σ̂_F and those of the centered kernel matrix K̃ satisfy

λ_j( Σ̂_F ) = λ_j( K̃ ) / (n − 1)

for all j = 1, . . ., m, where m = rank( Σ̂_F ).
Proof of Theorem 1. The eigendecomposition of the centered kernel matrix K̃ is

K̃ = V Λ Vᵀ ,

where Λ = diag(λ_1, . . ., λ_n) with λ_1 ⩾ . . . ⩾ λ_n. The eigenvalue λ_j and eigenvector v_j satisfy Φ̃Φ̃ᵀ v_j = λ_j v_j for all j = 1, . . ., m. Multiplying both sides by Φ̃ᵀ/(n − 1) gives

( Φ̃ᵀΦ̃ / (n − 1) ) ( Φ̃ᵀ v_j ) = ( λ_j / (n − 1) ) ( Φ̃ᵀ v_j ) .

Combining the above equations results in

Σ̂_F v_j^{Σ_F} = ( λ_j / (n − 1) ) v_j^{Σ_F}

for all j = 1, . . ., m, where v_j^{Σ_F} = Φ̃ᵀ v_j is the j-th eigenvector of Σ̂_F. The remaining eigenvalues of the covariance matrix, if any, are equal to zero.
The above result can be related to a representer theorem for kernel PCA [2]. It shows that the nonzero eigenvalues of the covariance matrix are proportional to the nonzero eigenvalues of the centered kernel matrix, thus proving that Σ̂_F and K̃ have the same rank.
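Theorem 1 is easy to check numerically with the linear kernel, where the feature map is the identity and both spectra can be computed explicitly:

```python
import numpy as np

# With the linear kernel, the nonzero eigenvalues of the covariance
# matrix must equal those of the centered kernel matrix divided by n - 1.
rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                        # centered data
cov_eigs = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (n - 1)))[::-1]
K_centered = Xc @ Xc.T                         # centered linear kernel matrix
ker_eigs = np.sort(np.linalg.eigvalsh(K_centered))[::-1][:p] / (n - 1)
```

The remaining n − p eigenvalues of the centered kernel matrix are zero, in agreement with the rank statement above.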
What would a kernelized MCD estimator look like? It would have to be equivalent to applying the original MCD in the feature space, so that in case of the linear kernel the original MCD is obtained. The MCD estimate for location in the subspace F is

c_H^F = (1/h) ∑_{i∈H} φ(x_i) ,

whereas the covariance matrix now equals

Σ̂_H^F = (1/(h − 1)) ∑_{i∈H} (φ(x_i) − c_H^F)(φ(x_i) − c_H^F)ᵀ .

Likewise, the robust distance becomes

MD(x, c_H^F, Σ̂_H^F) = sqrt( (φ(x) − c_H^F)ᵀ (Σ̂_H^F)⁻¹ (φ(x) − c_H^F) ) .

In these formulas the mapping function φ may not be known, but that is not necessary since we can apply the kernel trick. More importantly, the covariance matrix may not be invertible, as the φ(x_i) − c_H^F lie in a possibly high-dimensional space F. We therefore propose to apply MRCD in F in order to make the covariance matrix invertible. Let Φ_H be the row-wise stacked matrix

Φ_H = [φ(x_{i(1)}), . . ., φ(x_{i(h)})]ᵀ ,

where i(1), . . ., i(h) are the indices in H. For any 0 < ρ < 1 the regularized covariance matrix is defined as

Σ̂_H^reg = ρ I_m + (1 − ρ) Σ̂_H^F ,

where I_m is the identity matrix in F. The KMRCD method is then defined as

H_KMRCD = argmin_{H∈ℋ} det( Σ̂_H^reg ) ,   (3)

where ℋ is the collection of subsets H of {1, . . ., n} such that |H| = h and Σ̂_H^F is of maximal rank, i.e. rank( Σ̂_H^F ) = dim(span(φ(x_{i(1)}) − c_H^F, . . ., φ(x_{i(h)}) − c_H^F)) = q with q := min(m, h − 1). We can equivalently say that the h-subset H is in general position. The corresponding regularized kernel matrix is

K̃_H^reg = ρ (h − 1) I_h + (1 − ρ) K̃_H ,   (4)

where K̃_H = Φ̃_H Φ̃_Hᵀ denotes the centered kernel matrix of the h rows, that is, (2) with n replaced by h. The MRCD method in feature space F minimizes the determinant in (3). But we would like to carry out an optimization on kernel matrices instead. The following theorem shows that this is possible.

Theorem 2. Minimizing det( Σ̂_H^reg ) over all subsets H in ℋ is equivalent to minimizing det( K̃_H^reg ) over all h-subsets H with rank( K̃_H ) = q.
Proof of Theorem 2. From Theorem 1 it follows that the nonzero eigenvalues of Σ̂_H^F and K̃_H are related by λ_j( Σ̂_H^F ) = λ_j( K̃_H )/(h − 1) for j = 1, . . ., q. If H belongs to ℋ, Σ̂_H^F has exactly q nonzero eigenvalues, so K̃_H also has rank q, and vice versa. The remaining m − q eigenvalues of Σ̂_H^F are zero, as well as the remaining h − q eigenvalues of K̃_H. Now consider the regularized matrices Σ̂_H^reg = ρ I_m + (1 − ρ) Σ̂_H^F and K̃_H^reg = ρ (h − 1) I_h + (1 − ρ) K̃_H. Computing the determinant of both matrices as a product of their eigenvalues yields:

det( Σ̂_H^reg ) = ρ^{m−q} ∏_{j=1}^q ( ρ + (1 − ρ) λ_j( Σ̂_H^F ) ) ,

det( K̃_H^reg ) = ( ρ (h − 1) )^{h−q} ∏_{j=1}^q (h − 1) ( ρ + (1 − ρ) λ_j( Σ̂_H^F ) ) .

Therefore det( K̃_H^reg ) = ρ^{h−m} (h − 1)^h det( Σ̂_H^reg ), in which the proportionality factor is constant, so the optimizations are equivalent.
Following [14] we can also express the robust Mahalanobis distance in terms of the regularized kernel matrix, by

MD²(x) = (1/ρ) ( k̃(x, x) − (1 − ρ) k̃_xᵀ ( K̃_H^reg )⁻¹ k̃_x ) ,   (5)

where k̃(x, x) = ⟨φ(x) − c_H^F, φ(x) − c_H^F⟩ and k̃_x is the h-vector with entries ⟨φ(x_{i(j)}) − c_H^F, φ(x) − c_H^F⟩, which can be computed from the kernel evaluations k(x_{i(j)}, x), in which i(1), . . ., i(h) are the members of H. This allows us to calculate the Mahalanobis distance in feature space from the kernel matrix, and consequently to perform the C-step procedure on it. Note that (5) requires inverting the matrix K̃_H^reg instead of the matrix Σ̂_H^reg. The C-step theorem of the MRCD in [3] shows that when one selects the new h-subset as the indices i for which the Mahalanobis distance relative to the old h-subset is smallest, the regularized covariance determinant of the new h-subset is lower than or equal to that of the old one. In other words, C-steps lower the objective function of MRCD. By Theorem 2, this C-step theorem thus also extends to the kernel MRCD estimator.
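The equivalence between the coordinate-space and kernel-space distance computations can be checked numerically with the linear kernel. The sketch below assumes the ridge forms Σ̂_H^reg = ρI + (1−ρ)Σ̂_H^F and K̃_H^reg = ρ(h−1)I + (1−ρ)K̃_H of (4); by the Woodbury identity the two routes must then give the same distance.

```python
import numpy as np

# Linear-kernel sanity check of the kernelized regularized Mahalanobis
# distance.  For the linear kernel the feature map is the identity, so
# the coordinate-space distance is available for comparison.
rng = np.random.default_rng(0)
h, p, rho = 20, 3, 0.3
XH = rng.normal(size=(h, p))
Z = XH - XH.mean(axis=0)                       # centered h-subset
Sigma_reg = rho * np.eye(p) + (1 - rho) * (Z.T @ Z) / (h - 1)
K_reg = rho * (h - 1) * np.eye(h) + (1 - rho) * (Z @ Z.T)

x = rng.normal(size=p)                         # a new observation
z = x - XH.mean(axis=0)
d2_coord = z @ np.linalg.inv(Sigma_reg) @ z    # invert the p x p matrix
k_x = Z @ z                                    # centered kernel evaluations
d2_kernel = (z @ z - (1 - rho) * k_x @ np.linalg.inv(K_reg) @ k_x) / rho
```

Only the h × h matrix K_reg is inverted on the kernel route, which is the source of the speedup when p is large.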

The Kernel MRCD Algorithm
This section introduces the elements of the kernel MRCD algorithm. If the original data comes in the form of an n × p dataset X, we start by robustly standardizing it. For this we first compute the univariate reweighted MCD estimator of [24] with coverage h = [n/2] + 1 to obtain estimates of the location and scatter of each variable, which are then used to transform X to z-scores. The kernel matrix K is then computed from these z-scores. Note, however, that the data can come in the form of a kernel matrix that was not derived from data points with coordinates. For instance, a so-called string kernel can compute similarities between texts, such as emails, without any variables or measurements. Such a kernel basically compares the occurrence of strings of consecutive letters in each text. Since the KMRCD method does all its computations on the kernel matrix, it can also be applied to such data.

Initial estimates
The MRCD estimator needs initial h-subsets to start C-steps from. In the original FastMCD algorithm of [26] the initial h-subsets were obtained by drawing random (p + 1)-subsets out of the n data points. For each such subset its empirical mean and covariance matrix were computed, as well as the resulting Mahalanobis distances of all points, after which the h-subset of the points with the h smallest distances was obtained. However, this procedure is not possible in situations where p > n, because Mahalanobis distances require the covariance matrix to be invertible. The MRCD method instead starts from a small number of other initial estimators, inherited from the DetMCD algorithm in [17].
For the initial h-subsets in KMRCD we need methods that can be kernelized. We propose to use four such initial estimators, the combination of which has a good chance of being robust against different contamination types. Since initial estimators can be inaccurate, a kernelized refinement step will be applied to each. We will describe these methods in turn.
The first initial method is based on the concept of the spatial median. For data with coordinates, the spatial median is defined as the point m with the lowest total Euclidean distance ∑_i ‖x_i − m‖ to the data points. This notion also makes sense in the kernel context, since Euclidean distances in the feature space can be written in terms of the inner products that make up the kernel matrix. The spatial median in coordinate space is often computed by the Weiszfeld algorithm and its extensions, see e.g. [31]. A kernel algorithm for the spatial median was provided in [7]. It writes the spatial median m_F in feature space as a convex combination of the φ(x_i):

m_F = ∑_{i=1}^n γ_i φ(x_i) ,

in which the coefficients γ_1, . . ., γ_n are unknown. The Euclidean distance of each observation to m_F is computed as the square root of

‖φ(x_i) − m_F‖² = K_ii − 2 ∑_j γ_j K_ij + ∑_j ∑_l γ_j γ_l K_jl ,   (6)

and the coefficients γ_1, . . ., γ_n that minimize ∑_i ‖φ(x_i) − m_F‖ are obtained by an iterative procedure described in Algorithm 2 in Section A.1 of the Supplementary Material. The first initial h-subset H is then given by the objects with the h lowest values of (6). Alternatively, H is described by a weight vector w = (w_1, . . ., w_n) of length n, where

w_i = 1 if i ∈ H and w_i = 0 otherwise.   (7)

The initial location estimate c_F in feature space is then the weighted mean

c_F = ∑_i w_i φ(x_i) / ∑_i w_i .   (8)

The initial covariance estimate Σ̂_F is the weighted covariance matrix

Σ̂_F = ∑_i u_i² (φ(x_i) − c_F)(φ(x_i) − c_F)ᵀ / ∑_i u_i² ,   (9)

given by covariance weights (u_1, . . ., u_n) that in general may differ from the location weights (w_1, . . ., w_n). But for the spatial median initial estimator one simply takes u_i := w_i for all i.
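The iterative procedure is a Weiszfeld-type fixed point that only touches kernel entries. The sketch below is our own minimal version of that idea, not the exact Algorithm 2.

```python
import numpy as np

def kernel_spatial_median(K, n_iter=100, tol=1e-8):
    """Weiszfeld-type iteration for the spatial median in feature space.
    Returns the convex weights gamma with m_F = sum_i gamma_i phi(x_i)
    and the distances ||phi(x_i) - m_F||, all computed from K alone."""
    n = K.shape[0]
    gamma = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # squared distances (6): K_ii - 2 (K gamma)_i + gamma' K gamma
        d2 = np.diag(K) - 2 * K @ gamma + gamma @ K @ gamma
        d = np.sqrt(np.maximum(d2, 1e-12))
        gamma_new = (1.0 / d) / np.sum(1.0 / d)   # inverse-distance weights
        if np.max(np.abs(gamma_new - gamma)) < tol:
            gamma = gamma_new
            break
        gamma = gamma_new
    d2 = np.diag(K) - 2 * K @ gamma + gamma @ K @ gamma
    return gamma, np.sqrt(np.maximum(d2, 0.0))
```

For the linear kernel the recovered point ∑_i γ_i x_i is the ordinary spatial median of the data, which an outlier can pull only very little.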
The second initial estimator is based on the Stahel-Donoho outlyingness (SDO) of [29,12]. In a space with coordinates it involves projecting the data points on many unit-length vectors (directions). We compute the kernelized SDO [9] of all observations and determine an h-subset as the indices of the h points with the lowest outlyingness. This is then converted to weights w_i as in (7), and we put u_i := w_i again. The entire procedure is listed as Algorithm 3 in the Supplementary Material.
The third initial h-subset is based on spatial ranks [8]. The spatial rank of φ(x_i) with respect to the other feature vectors is defined as

R_i = ‖ (1/n) ∑_{j≠i} (φ(x_i) − φ(x_j)) / ‖φ(x_i) − φ(x_j)‖ ‖ ,

where all norms and inner products can be computed from kernel entries. If R_i is large, this indicates that φ(x_i) lies further away from the bulk of the data than most other feature vectors. In this sense, the values R_i represent a different measure of the outlyingness of φ(x_i) in the feature space. We then consider the h lowest spatial ranks, yielding the location weights w_i by (7), and put u_i := w_i. The complete procedure is Algorithm 4 in the Supplementary Material. Note that this algorithm is closely related to the depth computation in [4], which appeared in the same year as [8].
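Since ⟨φ(x_i) − φ(x_j), φ(x_i) − φ(x_l)⟩ = K_ii − K_ij − K_il + K_jl, the spatial ranks can be computed from the kernel matrix alone. A minimal sketch of this computation (our own, not Algorithm 4 itself):

```python
import numpy as np

def kernel_spatial_ranks(K):
    """Spatial rank R_i of each feature vector, from the kernel matrix.
    Large R_i means phi(x_i) lies far from the bulk of the data."""
    n = K.shape[0]
    diag = np.diag(K)
    # pairwise feature-space distances d_jl = ||phi(x_j) - phi(x_l)||
    D = np.sqrt(np.maximum(diag[:, None] - 2 * K + diag[None, :], 0.0))
    R = np.empty(n)
    for i in range(n):
        d = np.where(D[i] > 0, D[i], np.inf)   # excludes j = i (zero distance)
        # inner products of the difference vectors, via kernel entries
        G = K[i, i] - K[i][:, None] - K[i][None, :] + K
        W = np.outer(1.0 / d, 1.0 / d)
        R[i] = np.sqrt(max((G * W).sum(), 0.0)) / n
    return R
```

For points on a line the extreme points get rank close to (n − 1)/n, while a point with equally many neighbours on both sides gets rank zero.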
The last initial estimator is a generalization of the spatial sign covariance matrix (SSCM) [32] to the feature space F. For data with coordinates, one first computes the spatial median m described above. The SSCM then carries out a radial transform which moves all data points to a sphere around m, followed by computing the classical product moment of the transformed data:

SSCM = (1/n) ∑_{i=1}^n (x_i − m)(x_i − m)ᵀ / ‖x_i − m‖² .

The kernel spatial sign covariance matrix [7] is defined in the same way, by replacing x_i by φ(x_i) and m by m_F = ∑_{i=1}^n γ_i φ(x_i). We now have two sets of weights. For location we use the weights w_i = γ_i of the spatial median and apply (8). But for the covariance matrix we compute the weights u_i = 1/‖φ(x_i) − m_F‖, with the denominator given by (6). Next, we apply (9) with these u_i. The entire kernel SSCM procedure is listed as Algorithm 5 in the Supplementary Material. Note that kernel SSCM uses continuous weights instead of zero-one weights.
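In coordinate space the SSCM is a two-liner; its robustness comes from the fact that it is insensitive to how far out a point lies along its ray from the center. A sketch, with the spatial median passed in as m:

```python
import numpy as np

def sscm(X, m):
    """Spatial sign covariance matrix: project every observation radially
    onto the unit sphere around m, then take the classical product moment
    of the transformed points."""
    Z = X - m
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    S = Z / np.where(norms > 0, norms, 1.0)    # spatial signs
    return S.T @ S / len(X)
```

Moving an outlier further out along the same direction leaves the SSCM unchanged, since only the direction of each point survives the radial transform.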

The refinement step
The eigenvalues of initial covariance estimators can be inaccurate. In [18] this was addressed by re-estimating the eigenvalues, and [17] carried out this refinement step for all initial estimates used in that paper. In order to employ a refinement step in KMRCD we need to be able to kernelize it. We will derive the equations for the general case of a location estimator given by a weighted sum (8) and a scatter matrix estimate given by a weighted covariance matrix (9), so the refinement can be applied to all four initial estimates. We proceed in four steps.
1. The first step consists of projecting the uncentered data on the eigenvectors V_F of the initial scatter estimate Σ̂_F:

B = Φ V_F ,

where V_F can be expressed through the normalized eigenvectors V of the weighted centered kernel matrix (the kernel matrix centered and weighted according to the weights above), so that B only requires kernel evaluations.

2. Next, the covariance matrix is re-estimated by

Σ̂* = V_F diag( Q_n(B_{.1})², . . ., Q_n(B_{.q})² ) V_Fᵀ ,

in which Q_n is the scale estimator of Rousseeuw and Croux [23] and B_{.j} is the j-th column of B.
3. The center is also re-estimated, by taking the spatial median of the rows of B. This corresponds to using a modified feature map for the spatial median, or equivalently to running Algorithm 2 with the modified kernel matrix K* = B Bᵀ. Transforming the spatial median back gives us the desired center:

c*_F = ∑_i γ*_i φ(x_i) ,

where γ*_i are the weights of the spatial median for the modified kernel matrix.
4. The kernel Mahalanobis distance is calculated as

d*_F(x) = sqrt( (φ(x) − c*_F)ᵀ ( Σ̂* )⁻¹ (φ(x) − c*_F) ) ,

which can again be evaluated through kernel entries. The h points with the smallest d*_F(x) form the refined h-subset. The entire procedure is Algorithm 6 in the Supplementary Material.

Kernel MRCD algorithm
We now have all the elements to compute the kernel MRCD by Algorithm 1. Given any PSD kernel matrix and subset size h, the algorithm starts by computing the four initial estimators described in Section 4.1. Each initial estimate is then refined according to Section 4.2. Next, kernel MRCD computes the regularization parameter ρ. This is done with a kernelized version of the procedure in [3]. For each initial estimate we choose ρ such that the regularized kernel matrix K̃_H^reg of (4) is well-conditioned. If we denote by λ the vector containing the eigenvalues of the centered kernel matrix K̃_H, the condition number of K̃_H^reg is

κ(ρ) = ( ρ (h − 1) + (1 − ρ) max_j λ_j ) / ( ρ (h − 1) + (1 − ρ) min_j λ_j ) ,

and we choose ρ such that κ(ρ) ⩽ 50. (Section A.3 in the Supplementary Material contains a simulation study supporting this choice.) Finally, kernel C-steps are applied until convergence, where we monitor the objective function of Section 3.
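Given the eigenvalues of the centered kernel matrix and the size of the ridge term, the smallest ρ with condition number at most 50 can be found by bisection, since the condition number is non-increasing in ρ. A sketch with hypothetical names, in which reg_level plays the role of the multiplier of ρ in (4):

```python
import numpy as np

def rho_for_condition(eigvals, reg_level, kappa_max=50.0):
    """Smallest rho in (0,1) for which the regularized spectrum
    (1-rho)*lambda + rho*reg_level has condition number <= kappa_max,
    found by bisection on the monotone condition-number curve."""
    lam_max, lam_min = np.max(eigvals), np.min(eigvals)
    if lam_min > 0 and lam_max / lam_min <= kappa_max:
        return 0.0                             # no regularization needed
    def cond(rho):
        return ((1 - rho) * lam_max + rho * reg_level) / \
               ((1 - rho) * lam_min + rho * reg_level)
    lo, hi = 0.0, 1.0
    for _ in range(60):                        # bisect to ~2**-60 accuracy
        mid = 0.5 * (lo + hi)
        if cond(mid) <= kappa_max:
            hi = mid
        else:
            lo = mid
    return hi
```

A centered kernel matrix always has at least one zero eigenvalue, so some ρ > 0 is needed whenever the largest eigenvalue is positive.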

Algorithm 1 proceeds as follows:

1. Standardize the data and compute the kernel matrix K, or start directly from a given PSD kernel matrix.
2. Compute the weights of the four initial estimates of location and scatter as in Section 4.1.
3. Refine each initial estimate as in Section 4.2, yielding h-subsets H^(1), . . ., H^(4).
4. For each refined initial estimate i, compute the smallest ρ^(i) for which κ(ρ^(i)) ⩽ 50.
5. Determine the final ρ as in [3]: if max_i ρ^(i) ⩽ 0.1, set ρ = max_i ρ^(i); otherwise set ρ = max(0.1, median_i(ρ^(i))).
6. For H = H^(1), . . ., H^(4) perform C-steps as follows:
(a) Compute the regularized kernel matrix K̃_H^reg for the h-subset H from (4).
(b) Calculate the regularized Mahalanobis distance for each observation i by (5).
(c) Let the new H consist of the h observations with the smallest distances, and iterate until convergence.
7. Keep the h-subset with the lowest objective value, and the corresponding estimates.

In the special case where the linear kernel is used, the centered kernel matrix K̃_H immediately yields the regularized covariance matrix Σ̂_H^reg through

Σ̂_H^reg = ρ I_p + ((1 − ρ)/(h − 1)) X̃_Hᵀ X̃_H ,

where X̃_H = X_H − (1/h) ∑_{i∈H} x_i is the centered matrix of the observations in H; its eigenstructure can be obtained from the eigenvalues Λ̃ and normalized eigenvectors Ṽ of K̃_H. (The derivation is given in Section A.2.) So instead of applying MRCD to coordinate data we can also run KMRCD with a linear kernel and transform K̃_H to Σ̂_H^reg afterward. This computation is faster when the data has more dimensions than cases.

Anomaly detection by KMRCD
Mahalanobis distances (MD) relative to robust estimates of location and scatter are very useful to flag outliers, because outlying points i tend to have higher MD_i values. The standard way to detect outliers by means of the MCD in low-dimensional data is to compare the robust distances to a cutoff that is the square root of a quantile of the chi-squared distribution with degrees of freedom equal to the data dimension [26]. However, in high dimensions the distribution of the squared robust distances is no longer approximately chi-squared, which makes it harder to determine a suitable cutoff value. Faced with a similar problem, [25] introduced a different approach, based on the empirical observation that the robust distances of the non-outliers in higher-dimensional data tend to have a distribution that is roughly similar to a lognormal. They first transform the distances MD_i to LD_i = log(0.1 + MD_i), where the term 0.1 prevents numerical problems should a (near-)zero MD_i occur. The location and spread of the non-outlying LD_i are then estimated by μ̂_MCD and σ̂_MCD, the results of applying the univariate MCD to all LD_i using the same h as in the KMRCD method itself. Data point i is then flagged iff LD_i > μ̂_MCD + z(0.995) σ̂_MCD, where z(0.995) is the 0.995 quantile of the standard normal distribution. The cutoff value for the untransformed robust distances is thus

c = exp( μ̂_MCD + z(0.995) σ̂_MCD ) − 0.1 .   (15)

The user may want to try different values of h, to be used both in the KMRCD method itself and in the μ̂_MCD and σ̂_MCD in (15). One typically starts with a rather low value of h, say h = 0.5n when the linear kernel is used and there are up to 10 dimensions, and h = 0.75n in all other situations. This will provide an idea about the number of outliers in the data, after which it is recommended to choose h as high as possible provided n − h exceeds the number of outliers. This will improve the accuracy of the estimates.
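The flagging rule can be sketched as follows; for simplicity we replace the univariate MCD location and scale of the LD_i by the median and the normal-consistent MAD, which is only a simple robust stand-in for the estimator used in the paper.

```python
import numpy as np

def flag_outliers(md, q=2.5758):
    """Flag robust distances via the log transform of [25]:
    LD_i = log(0.1 + MD_i), robustly estimate location and spread of the
    LD_i (median/MAD here instead of the univariate MCD), and flag points
    beyond the 0.995 normal quantile q ~ 2.5758."""
    ld = np.log(0.1 + md)
    mu = np.median(ld)
    sigma = 1.4826 * np.median(np.abs(ld - mu))   # MAD, normal-consistent
    cutoff = np.exp(mu + q * sigma) - 0.1          # back to the MD scale
    return md > cutoff, cutoff
```

Because location and spread are estimated robustly, a handful of very large distances cannot inflate the cutoff and mask themselves.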

Choice of bandwidth
A commonly used kernel function is the radial basis function (RBF) k(x, y) = exp(−‖x − y‖²/(2σ²)), which contains a tuning constant σ that needs to be chosen. When the downstream learning task is classification, σ is commonly selected by cross-validation, where it is assumed that the data has no outliers or that they have already been removed. However, in our unsupervised outlier detection context there is nothing to cross-validate. Therefore, we use the so-called median heuristic [13] given by

σ² = median{ ‖x_i − x_j‖² ; 1 ⩽ i < j ⩽ n } ,   (16)

in which the x_i are the standardized data in the original space. We will use this σ in all our examples.
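Both the median heuristic (16) and the resulting RBF kernel matrix are short computations:

```python
import numpy as np

def median_heuristic(X):
    """Bandwidth from (16): sigma^2 is the median of the squared pairwise
    distances over all pairs i < j of (standardized) observations."""
    d2 = []
    for i in range(len(X)):
        d2.append(np.sum((X[i + 1:] - X[i]) ** 2, axis=1))
    return np.sqrt(np.median(np.concatenate(d2)))

def rbf_kernel(X, sigma):
    """RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] - 2 * X @ X.T + sq[None, :]
    return np.exp(-np.maximum(D2, 0.0) / (2 * sigma ** 2))
```

The resulting kernel matrix has unit diagonal and is positive semidefinite, as required for KMRCD.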

Illustration on toy examples
We illustrate the proposed KMRCD method on the two toy examples in Figure 1. Both datasets consist of n = 1000 bivariate observations. The elliptical dataset in the left panel was generated from a bivariate Gaussian distribution, plus 20% of outliers. The non-elliptical dataset in the panel on the right is frequently used to demonstrate kernel methods [30]. This dataset also contains 20% of outliers, which are shown in red and form the outer curved shape. We first apply the non-kernel MCD method, which does not transform the data, with h = 0.75n. (Not using a kernel is equivalent to using the linear kernel.) The results are in Figure 2. In the panel on the left this works well, because the MCD method was developed for data of which the majority has a roughly elliptical shape. For the same reason it does not work well on the non-elliptical data in the right-hand panel. We now apply the kernel MRCD method to the same datasets. For the elliptical dataset we use the linear kernel, and for the non-elliptical dataset we use the RBF kernel with tuning constant σ given by formula (16). This yields Figure 3. We first focus on the left-hand column. The figure shows three stages of the KMRCD runs. At the top, in Figure 3(a), we see the result for the selected h-subset, after the C-steps have converged. The members of that h-subset are the green points, whereas the points generated as outliers are colored red. Since h is lower than the true number of inlying points, some inliers (shown in black) are not included in the h-subset. In the next step, Figure 3(b) shows the robust Mahalanobis distances, with the horizontal line at the cutoff value given by formula (15). The final output of KMRCD, shown in Figure 3(c), has the flagged outliers in orange and the points considered inliers in blue. As expected, this result is similar to that of the non-kernel MCD in the left panel of Figure 2.
The right-hand column of Figure 3 shows the stages of the KMRCD run on the non-elliptical data. The results for the selected h-subset in Figure 3(a) look much better than in the right-hand panel of Figure 2, because the level curves of the robust distance now follow the shape of the data. In stage (b) we see that the distances of the inliers and the outliers are fairly well separated by the cutoff (15), with a few borderline cases, and stage (c) is the final result. This illustrates that using a nonlinear kernel allows us to fit non-elliptical data.

Simulation study with linear kernel
In this section we compare the KMRCD method proposed in the current paper, run with the linear kernel, to the MRCD estimator of Boudt et al. [3]. Recall that using the linear kernel k(x, y) = xᵀy means that the feature space can be taken identical to the coordinate space, so using the linear kernel is equivalent to not using a kernel at all. Our purpose is twofold. First, we want to verify whether KMRCD performs well in terms of robustness and accuracy, and that its results are consistent with those of MRCD. And secondly, we wish to measure the computational speedup obtained by KMRCD in high dimensions. In order to obtain a fair comparison we run MRCD with the identity matrix as target, which corresponds to the target of KMRCD. All computations are done in MATLAB on a machine with an Intel Core i7-8700K processor at 3.70 GHz and 16 GB of RAM.
For the uncontaminated data, that is, for contamination fraction ε = 0, we generate n cases from a p-variate normal distribution with true covariance matrix Σ. Since the methods under consideration are equivariant under translations and the rescaling of variables, we can assume without loss of generality that the center µ of the distribution is 0 and that the diagonal elements of Σ are all 1. We denote the distribution of the clean data by N(0, Σ). Since the methods are not equivariant to arbitrary nonsingular affine transformations, we cannot simply set Σ equal to the identity matrix. Instead we consider Σ of the ALYZ type, generated as in Section 4 of [1], which yields a different Σ in each replication, but always with condition number 100. The main steps of the construction of Σ in [1] are the generation of a random orthogonal matrix to provide eigenvectors, and the generation of p eigenvalues such that the ratio between the largest and the smallest is 100, followed by iterations that turn the resulting covariance matrix into a correlation matrix while preserving the condition number. (In Section A.3 of the Supplementary Material also Σ matrices with higher condition numbers were generated, with similar results.)
For a contamination fraction ε > 0 we replace a random subset of εn observations by outliers of different types. Shift contamination is generated from N(µ_C, Σ) where µ_C lies in the direction in which the outliers are hardest to detect, which is that of the last eigenvector v of the true covariance matrix Σ. We rescale v so that ‖v‖² = p. The center is taken as µ_C = kv, where we set k = 200. Next, cluster contamination stems from N(µ_C, 0.05² I_p) where I_p is the identity matrix. Finally, point contamination places all outliers in the point µ_C, so they behave like a tight cluster. These settings make the simulation consistent with those in [3,17] and [6]. The deviation of an estimated scatter matrix Σ̂ relative to the true covariance matrix Σ is measured by the Kullback-Leibler (KL) divergence

KL( Σ̂, Σ ) = trace( Σ̂ Σ⁻¹ ) − log( det( Σ̂ Σ⁻¹ ) ) − p .

The speedup factor is measured as speedup = time(MRCD)/time(KMRCD). Different combinations of n and p are generated, ranging from p = n/2 to p = 2n.
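The KL deviation used in the tables is straightforward to compute; it is zero iff Σ̂ = Σ:

```python
import numpy as np

def kl_divergence(S_hat, S_true):
    """Kullback-Leibler deviation between an estimated and the true
    scatter matrix: trace(M) - log det(M) - p with M = S_hat S_true^{-1}."""
    p = S_true.shape[0]
    M = S_hat @ np.linalg.inv(S_true)
    sign, logdet = np.linalg.slogdet(M)        # stable log-determinant
    return np.trace(M) - logdet - p
```

For instance, overestimating the scatter by a factor 2 in every direction gives a deviation of p(1 − log 2).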
Table 1 presents the Kullback-Leibler deviation results. The top panel is for ε = 0, the middle panel for ε = 0.1 and the bottom panel for ε = 0.3. All table entries are averages over 50 replications. First look at the results without contamination. By comparing the three choices for h, namely 0.5n, 0.75n and 0.9n, we see that lowering h in this setting leads to increasingly inaccurate estimates Σ̂. This is the price we pay for being more robust to outliers, since n − h is an upper bound on the number of outliers the methods can handle. When we look at the panels for higher ε we see a similar pattern.
When ε = 0.1 the choice 0.9n is sufficiently robust, and the lower choices of h have higher KL deviation. But when ε = 0.3 only the choice h = 0.5n can detect the outliers; the other choices cause the estimates to break down. These patterns are confirmed by the averaged MSE = ∑_{i=1}^p ∑_{j=1}^p ( Σ̂ − Σ )²_{ij} / p² shown in Table 7 in the Supplementary Material. From these results we conclude that it is important that h be chosen lower than n minus the number of outliers, but not much lower, since that would make the estimates less accurate. A good strategy is to first run with a low h, which reveals the number of outliers, and then to choose a higher h that can still handle the outliers and yields more accurate results as well.
As expected, the KMRCD results are similar to those of MRCD, but not identical, because the two methods differ in their initial estimators. This also leads to differences in the resulting regularization parameter ρ shown in Table 8 in the Supplementary Material.
We now turn our attention to the computational speedup factors in Table 2, which were derived from the same simulation runs as Table 1. Overall KMRCD ran substantially faster than MRCD, with the factor becoming larger when n decreases and/or the dimension p increases. There are two reasons for the speedup. First of all, the MRCD algorithm computes six initial scatter estimates, of which the last one is the most computationally demanding, since it computes a robust bivariate correlation for every pair of variables, requiring p(p − 1)/2 computations whose total time increases fast with p. Part of the speedup stems from the fact that KMRCD does not use this initial estimator, whereas its own four kernelized initial estimates gave equally robust results. This explains most of the speedup in Table 2.
For p > n there is a second reason for the speedup: the use of the kernel trick. In particular, each C-step requires the computation of the Mahalanobis distances of all cases. MRCD does this by inverting the regularized p × p covariance matrix Σ̂_H,reg, whereas KMRCD uses equation (5), which implies that it suffices to invert the n × n kernel matrix K_H,reg, taking O(n³) time instead of O(p³).
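With the linear kernel this primal-dual equivalence is easy to check numerically. The sketch below is an illustration only, not the MRCD implementation: a ridge-style regularization λI stands in for the paper's actual regularization. It verifies that the quadratic forms xᵢ⊤(X⊤X + λI_p)⁻¹xᵢ, computed via the p × p matrix, equal the diagonal of K(K + λI_n)⁻¹ with K = XX⊤, which only needs the n × n inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 200, 0.1          # more variables than cases
X = rng.standard_normal((n, p))

# primal: invert the p x p matrix, O(p^3)
A = np.linalg.inv(X.T @ X + lam * np.eye(p))
primal = np.einsum('ij,jk,ik->i', X, A, X)   # x_i^T A x_i for every case i

# dual: invert the n x n kernel matrix, O(n^3)
K = X @ X.T
dual = np.diag(K @ np.linalg.inv(K + lam * np.eye(n)))

print(np.allclose(primal, dual))  # → True
```

The identity follows from the push-through lemma X(X⊤X + λI)⁻¹X⊤ = K(K + λI)⁻¹, which is what makes the n × n formulation exact rather than an approximation.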

Simulation with nonlinear kernel
In this section we compare the proposed KMRCD estimator to the MRCD estimator of Boudt et al. [3] on two types of non-elliptical datasets. The first type is generated by a copula. We start by considering the t copula [20] with Pearson correlation 0.1 and ν = 1 degree of freedom. The black points in the left panel of Figure 4 were generated from this copula. We then added contamination in the form of uniformly distributed random noise on the unit square, from which points lying close to the regular distribution were removed. The red points in the left panel of Figure 4 are the outliers. Apart from the t copula we also consider the Frank, Clayton, and Gumbel copulas with Kendall rank correlation τ = 0.6. They are visualized in Figure 9 in Section A.6 of the Supplementary Material.
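A bivariate t copula sample can be generated by drawing from a multivariate t distribution and applying the univariate t CDF to each coordinate. A minimal SciPy sketch of this standard construction (it does not reproduce the paper's contamination mechanism):

```python
import numpy as np
from scipy import stats

def t_copula_sample(n, rho, nu, seed=0):
    """Draw n points from a bivariate t copula with correlation rho
    and nu degrees of freedom, via the multivariate-t construction."""
    rng = np.random.default_rng(seed)
    corr = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), corr, size=n)
    g = rng.chisquare(nu, size=n)
    t_sample = z / np.sqrt(g / nu)[:, None]   # multivariate t_nu
    return stats.t.cdf(t_sample, df=nu)       # marginal CDFs map to (0,1)^2

u = t_copula_sample(500, rho=0.1, nu=1)
print(u.shape)  # → (500, 2)
```

With ν = 1 the heavy tails of the t copula concentrate probability mass near the corners of the unit square, which is what makes this setting markedly non-elliptical.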
The proposed estimator is also benchmarked in a second type of setting, where the regular observations are uniformly distributed on the unit circle and the outliers inside the circle are generated from the Gaussian distribution with center 0 and covariance matrix equal to 0.04 times the identity matrix. This setting is illustrated in Figure 5. It is a simple example where the clean data lie near a manifold.
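This setting is easy to reproduce. A sketch with the stated fractions, assuming uniformly distributed angles for the regular points:

```python
import numpy as np

def circle_with_outliers(n=500, eps=0.2, seed=0):
    """Regular points uniformly on the unit circle, plus eps*n outliers
    drawn from N(0, 0.04 * I_2) inside the circle."""
    rng = np.random.default_rng(seed)
    n_out = int(eps * n)
    theta = rng.uniform(0, 2 * np.pi, size=n - n_out)
    regular = np.column_stack([np.cos(theta), np.sin(theta)])
    outliers = rng.multivariate_normal(np.zeros(2), 0.04 * np.eye(2),
                                       size=n_out)
    return np.vstack([regular, outliers])

X = circle_with_outliers()
print(X.shape)  # → (500, 2)
```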
In the simulation we generated 100 datasets of each type, with n = 500 and the outlier fraction ε equal to 0.1 or 0.2, so the number of regular observations is n(1 − ε). With all four copulas the KMRCD estimator used the radial basis function kernel with bandwidth (16). For the circle-based data the polynomial kernel k(x, y) = (x⊤y + 1)² of degree 2 was used. We measure the performance by counting the number of outliers in the h-subset, and among the n(1 − ε) points with the lowest (kernel) Mahalanobis distance. The averaged counts over the 100 replications are shown in Table 3. By comparing the rows of KMRCD and MRCD with the same ε, we see that MRCD retains more true outliers in its h-subset and in its n(1 − ε) set. Indeed, KMRCD outperforms MRCD for both choices of ε and for all three choices of h. The good performance of KMRCD is also seen in the right panel of Figure 4, where the contours of the kernel Mahalanobis distance nicely follow the distribution. The difference between MRCD and KMRCD is most apparent on the circle-based data: in Figure 5 the KMRCD fits the regular data on the circle, whereas the original MRCD method, by its nature, considers the outliers in the center as regular data.
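The success of the degree-2 polynomial kernel on the circle data is no accident: its implicit feature map contains the squared coordinates, so the circle x₁² + x₂² = 1 becomes (part of) a hyperplane in feature space. A quick check of this feature map, a standard kernel identity rather than anything specific to KMRCD:

```python
import numpy as np

def poly_kernel(x, y):
    return (x @ y + 1.0) ** 2

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(poly_kernel(x, y), phi(x) @ phi(y)))  # → True
```

Since the first two feature coordinates sum to 1 for every point on the circle, the regular data lie in an affine subspace of the feature space, where an elliptical fit is again appropriate.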
We conclude that in this nonlinear setting, KMRCD has successfully extended the MRCD to non-elliptical distributions. We add two remarks about this. First, as in all kernel-based methods the choice of the kernel is important, and choosing a different kernel can lead to worse results. Second, just as in the linear setting, h should be lower than n minus the number of outliers, so in practice it is recommended to first run with a low h, look at the results in order to find out how many outliers there are, and possibly run again with a higher h.
Section A.4 of the Supplementary Material contains additional simulation results on the computation time of the four initial estimators in KMRCD and their subsequent C-steps, in different settings with linear and nonlinear kernels.

Food industry example
The dataset is bivariate and contains two color signals measured on organic sultana raisin samples. The goal is to classify these into inliers and outliers, so that during production outliers can be physically removed from the product in real time. There are training data and test data, but the class label 'outlier' is not known beforehand. The scatter plot of the training data in Figure 6 (a) reveals the non-elliptical (and to some extent triangular) structure of the inliers. Three types of outliers are visible. Those with high values of λ₁ and low λ₂ at the bottom right correspond to foreign, cap-stem related material like wood, whereas points with high values of λ₂ represent discolorations. There are also a few points with high values of both λ₁ and λ₂ which correspond to either discolored raisins or objects with clear attachment points of cap-stems. Outliers of any of these three types need to be flagged and removed from the product. From manually analyzing data of this product it is known beforehand that the fraction of outliers is rather low, at most around 2%.
We first run KMRCD on the training data. In its preprocessing step it standardizes both variables. For comparison purposes we use two kernels. In the left-hand column of Figure 6 we apply the linear kernel, and in the right-hand column we use the RBF kernel with tuning constant σ given by (16). Since we know the fraction of outliers is low we can put h = 0.95n. Each figure shows the flagged points in orange and the remaining points in blue, and the contour lines are level curves of the robust distance.
The fit with the linear kernel in Figure 6 (a) has contour lines that do not follow the shape of the data very well, and as a consequence it fails to flag some of the outliers, such as those with high λ₂ and some with relatively high values of both λ₁ and λ₂. The KMRCD fit with the nonlinear kernel in Figure 6 (b) has contour lines that model the data more realistically. This fit does flag all three types of outliers correctly. Both trained models were then used to classify the previously unseen test set. The results are similar to those on the training data. The anomaly detection with the linear kernel in Figure 6 (c) again misses the raisin discolorations, which would keep these impurities in the final consumer product. Fortunately, the method with the nonlinear kernel in panel (d) does flag them.

MNIST digits data
Our last example is high dimensional. The MNIST dataset contains images of handwritten digits from 0 to 9, at the resolution of 28×28 grayscale pixels (so there are 784 dimensions), and was downloaded from http://yann.lecun.com/exdb/mnist. There is a training set and a test set. Both were subsampled to 1000 images. We added noise distributed as N(0, 0.5²) to 20% of the training images, and to all images in the test set. We then applied KMRCD with the RBF kernel with tuning constant σ given by (16) and h = 0.75n to the 1000 training images. Next, we computed the eigenvectors of the robustly estimated covariance matrix.
Our goal is to denoise the images in the test set by projecting them onto the main eigenvectors found in the training data.As we are interested in a reconstruction of the data in the original space rather than in the feature space, we transform the scores back to the original input space by the iterative optimization method of [19].
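For an RBF kernel, pre-image computation of this kind is commonly done with a fixed-point iteration that repeatedly re-weights the training points. The sketch below shows that well-known iteration z ← Σᵢ γᵢ k(xᵢ, z) xᵢ / Σᵢ γᵢ k(xᵢ, z); it may differ in detail from the method of [19], and the coefficient vector `gamma` (the expansion coefficients of the projected feature-space point) is assumed given:

```python
import numpy as np

def rbf(A, z, sigma):
    """RBF kernel values k(a_i, z) for all rows a_i of A."""
    return np.exp(-np.sum((A - z) ** 2, axis=-1) / (2 * sigma ** 2))

def preimage(gamma, X, sigma, n_iter=100):
    """Fixed-point pre-image iteration for the RBF kernel:
    z <- sum_i gamma_i k(x_i, z) x_i / sum_i gamma_i k(x_i, z)."""
    z = X.T @ gamma / gamma.sum()        # start from the weighted mean
    for _ in range(n_iter):
        w = gamma * rbf(X, z, sigma)
        if abs(w.sum()) < 1e-12:         # degenerate weights: stop
            break
        z = X.T @ w / w.sum()
    return z

# toy check: with equal weights on a symmetric cloud the pre-image
# stays at the centroid
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(preimage(np.ones(4), X, sigma=1.0))  # → [0. 0.]
```

Because the iteration only involves kernel evaluations against the training points, it never needs an explicit feature-space representation.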
The top panel of Figure 7 illustrates what happens when applying this computation to the classical covariance matrix in feature space, which corresponds to classical kernel PCA [27]. The bottom panel is based on KMRCD. The first row of each panel displays original test set images, and the second row shows the test images after the noise was added. The first and second rows do not depend on the estimation method, but the remaining rows do. There we see the results of projecting on the first 5, 15, and 30 eigenvectors of each method. In the top panel those images are rather diffuse, which indicates that the classical approach was affected by the training images with added noise and considers the added noise as part of its model. This implies that increasing the number of eigenvectors used will not improve the overall image quality much. The lower panel contains sharper images, because the robust fit of the training data was less affected by the images with added noise that acted as outliers.
We can also compute the mean absolute error ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵖ |xᵢⱼ − x̂ᵢⱼ|/(np) between the original test images (with p = 784 dimensions) and the projected versions of the test images with added noise. Figure 8 shows this deviation as a function of the number of eigenvectors used in the projection. The deviations of the robust method are systematically lower than those of the classical method, confirming the visual impression from Figure 7.

Conclusions
The kernel MRCD method introduced in this paper is a robust method that makes it possible to analyze non-elliptical data when used with a nonlinear kernel. Another advantage is that even with the linear kernel the computation becomes much faster when there are more dimensions than cases, a situation that is quite common nowadays. Due to the built-in regularization the result is always well-conditioned. The algorithm starts from four kernelized initial estimators, and to each it applies a new kernelized refinement step. The remainder of the algorithm is based on a theorem showing that C-steps in feature space are equivalent to a new type of C-steps on centered kernel matrices, so the latter reduce the objective function. The performance of KMRCD in terms of robustness, accuracy and speed is studied in a simulation, and the method is applied to several examples. Potential future applications of the KMRCD method are as a building block for other multivariate techniques such as robust classification.
Research-level MATLAB code and an example script are freely available from the webpage http://wis.kuleuven.be/statdatascience/robust/software.

Figure 1 :
Figure 1: Illustration of kernel MRCD on two datasets of which the non-outlying part is elliptical (left) and non-elliptical (right). Both datasets contain 20% of outlying observations. The generated regular observations are shown in black and the outliers in red. In the panel on the left a linear kernel was used, and in the panel on the right a nonlinear kernel. The curves on the left are contours of the robust Mahalanobis distance in the original bivariate space. The contours on the right are based on the robust distance in the kernel-induced feature space.

(c) Redefine H as the h indices i with smallest distance.
(d) Compute and store the objective. If not converged, go back to (a).
7. Select the h-subset with the overall smallest objective function.
Output: the final h-subset and the robust distances.
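The algorithm fragment above follows the classical concentration scheme. For reference, a non-kernel C-step can be sketched as follows; this is the standard MCD concentration step with a small ridge term `reg` added here for invertibility, not the kernelized version of the paper:

```python
import numpy as np

def c_step(X, H, reg=1e-6):
    """One concentration step: fit mean and covariance on the subset H,
    then return the h indices with smallest Mahalanobis distance and
    the objective log(det(cov))."""
    h = len(H)
    mu = X[H].mean(axis=0)
    cov = np.cov(X[H], rowvar=False) + reg * np.eye(X.shape[1])
    inv = np.linalg.inv(cov)
    # squared Mahalanobis distances of all n cases
    d2 = np.einsum('ij,jk,ik->i', X - mu, inv, X - mu)
    H_new = np.argsort(d2)[:h]
    sign, objective = np.linalg.slogdet(cov)
    return H_new, objective

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
H = rng.choice(100, size=75, replace=False)
H, obj1 = c_step(X, H)
H, obj2 = c_step(X, H)
print(obj2 <= obj1 + 1e-9)  # the C-step does not increase the objective
```

Iterating until the subset no longer changes yields the converged h-subsets referred to in step 7.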

Figure 2 :
Figure 2: Results of the non-kernel MCD method on the toy datasets of Figure 1. The contour lines are level curves of the MCD-based Mahalanobis distance.
h-subset results after C-step convergence. Robust Mahalanobis distances with outlier cutoff.
Final results with the flagged outliers shown in orange.

Figure 3 :
Figure 3: Kernel MRCD results on the toy datasets of Figure 1. In the left column the linear kernel was used, and in the right column the RBF kernel. The three stages (a), (b) and (c) are explained in the text.

Figure 4 :
Figure 4: Illustration of the non-elliptical simulation setting with data generated from the t copula, plus 20% of outlying observations. In the left panel, the regular observations are shown in black and the outliers in red. The results of the MRCD estimator are in the middle panel, and those of the KMRCD estimator in the rightmost panel, each for h = 0.75n. In those panels the points in the h-subset are shown in green, and the other points with the n(1 − ε) lowest (kernel) Mahalanobis distance are depicted in grey. The remaining points are shown in red. The curves are contours of the robust (kernel) Mahalanobis distance.

Figure 5 :
Figure 5: Illustration of the non-elliptical simulation setting with data generated from the circle manifold, plus 20% of outlying observations. The remainder of the description is as in Figure 4.

Figure 6 :
Figure 6: Food industry example: KMRCD results with the linear kernel (left column) and the RBF kernel (right column). The top row contains the training data, and the resulting fits were applied to the test data in the bottom row.

Figure 7 :
Figure 7: MNIST denoising results based on the classical covariance matrix (top panel) and on the KMRCD estimate (bottom panel). The first and second rows of each panel show the same original and noise-added test set images. The remaining rows contain the results of projecting on the first 5, 15, and 30 eigenvectors.

Figure 8 :
Figure 8: The mean absolute error between the denoised images and the original test images, as a function of the number of eigenvectors used in the projection. The top curve is for the classical covariance matrix in feature space, the lower curve for KMRCD.

Table 3 :
Average number of outliers in the h-subset H, and among the n(1 − ε) points with lowest (kernel) Mahalanobis distance.

We now turn our attention to a real dataset from the food industry. In that setting datasets frequently contain outliers, because samples originate from natural products which are often contaminated by insect damage, local discolorations and foreign material. It also happens that the image acquisition signals yield non-elliptical data, in which case a kernel transform can help.