Cauchy robust principal component analysis with applications to high-dimensional data sets

Principal component analysis (PCA) is a standard dimensionality reduction technique used in various research and applied fields. From an algorithmic point of view, classical PCA can be formulated in terms of operations on a multivariate Gaussian likelihood. As a consequence of the implied Gaussian formulation, the principal components are not robust to outliers. In this paper, we propose a modified formulation, based on the use of a multivariate Cauchy likelihood instead of the Gaussian likelihood, which has the effect of robustifying the principal components. We present an algorithm to compute these robustified principal components. We additionally derive the relevant influence function of the first component and examine its theoretical properties. Simulation experiments on high-dimensional datasets demonstrate that the estimated principal components based on the Cauchy likelihood typically outperform, or are on a par with, existing robust PCA techniques. Moreover, the Cauchy PCA algorithm we have used has much lower computational cost in very high dimensional settings than the other public domain robust PCA methods we consider.


Introduction
In the analysis of multivariate data, it is frequently desirable to employ statistical methods which are insensitive to the presence of outliers in the sample.To address the problem of outliers, it is important to develop robust statistical procedures.Most statistical procedures include explicit or implicit prior assumptions about the distribution of the observations, but often without taking into account the effect of outliers.The purpose of this paper is to present a novel robust version of PCA which has some attractive features.
Principal components analysis (PCA) is considered to be one of the most important techniques in statistics.However, the classical version of PCA depends on either a covariance or a correlation matrix, both of which are very sensitive to outliers.We develop an alternative method to classical PCA, which is far more robust, by using a multivariate Cauchy likelihood to construct a robust principal components (PC) procedure.It is an adaptation of the classic method of PCA obtained by replacing the Gaussian log-likelihood function by the Cauchy log-likelihood function, in a sense that will be explained in section 2.2.Although we do not claim that the interpretation of standard PCA in terms of operations on a Gaussian likelihood is new, see Bolton and Krzanowski, this fact does not appear to have been exploited in the development of a robust PCA procedure, as we do in this paper.An important reason for using the multivariate Cauchy likelihood is that this likelihood has only one maximum point, but the single most important motivation is that it leads to a robust procedure.
In the next section we review briefly some of the techniques employed for estimating parameters and for directing a PCA in ways which are robust against the presence of outliers.We also present robustness preliminaries that include some important techniques which are necessary to assess whether the method used is robust or not.In Section 3 we develop the Cauchy-PCA and theoretically explore its robustness properties.Finally, in Section 4 we present the numerical algorithms for creating Cauchy PCs, and also give the results of a number of very high-dimensional real-data and simulated examples.Our approach is seen to be competitive with, and often gives superior results to, that of the projection pursuit algorithm of Croux et al. (2007Croux et al. ( , 2013)).Finally we conclude the paper in Section 5.

Literature review on robust PCA
It is well known that PCA is an important technique for high-dimensional data reduction.PCA is based on the sample covariance matrix Σ and it involves searching for a linear combination y j = u T x j of the x components of the vector that maximize the sample variance of the components of y.
According to Mardia et al. (1979), the solution will be given by the equation Σ = UΛU T , where Λ = diag{λ 1 , . . ., λ p } and its diagonal elements λ i are the sample variances, while U is an orthogonal matrix, i.e.UU T = U T U = I p , whose columns u i are the corresponding eigenvectors which represent the linear combinations.[[The principal components are efficiently estimated in practice via Singular Value Decomposition (SVD) (cite Lanczos for an efficient algorithm).]]Classical PCA, unfortunately, is non-robust, since it based on the sample covariance or sample correlation matrix which are very sensitive to outlying observations; see section 2. However, this problem has been handled by two different methods which result in robust versions of PCA by: i. replacing the standard covariance or correlation matrix with a robust estimator; or ii.maximising (or minimising) a different objective function to obtain a robust PCA.
Many different proposes had been developed to carry out robust PCA, such that using projection pursuit PP, M −estimators and so on.Despite maximum likelihood estimation, perhaps, being considered as the most important statistical inference method, sometimes this approach can lead to improper results when the underlying assumptions are not satisfied, for instance, when data contain outliers or deviate slightly from the supposed model.A generalization of maximum likelihood estimation proposed by Huber (1964) which is called M -estimation, aims to produce a robust statistic by constructing approaches that are resistant to deviations from the underline assumptions.M -estimators were also defined for the multivariate case by Maronna (1976).Campbell (1980) provided a procedure for robust PCA by examining the estimates of means and covariances which are less affected by outlier observations, and by exploring the observations which have a large effect on the estimates.He replaced the sample covariance sample by an M −estimator.Hubert and Verboven (2003) introduced a new approach to create robust PCA.It combines the advantages of two methods, the first one is based on replacing the covariance or correlation matrix by its robust estimator, while the second one is based on maximizing the objective function for this robust estimate.
A robust PCA based on the projection pursuit (PP) method was developed by Li and Chen (1985), using Huber's M -estimator of dispersion as the projection index.The objective of PP is to seek projections, of the high-dimensional data set onto low-dimensional subspaces, that optimise a function of "interestingness".The function that should be optimised is called an index or objective function and its choice depends on a feature that the researcher is concerned about.This property gives the PP technique a flexibility to handle many different statistical problems range from clustering to identifying outliers in a multivariate data set.Bolton and Krzanowski (1999) characterized the PC's for PP in terms of maximum likelihood under the assumption of normality.PCA can be considered as a special case of PP as well as many other methods of multivariate analysis.Li and Chen (1985) used Huber's M -estimator of dispersion as projective index to develop a robust PCA based on the PP approach.The sample median was used as a projective index to develop a robust PCA by Xie et al. (1993).In their simulation studies, Xie et al. (1993) observed a PCA resistant to outliers and deviations from the normal distribution.Croux et al. (2007Croux et al. ( , 2013) also suggested a robust PCA using projection pursuit and we will contrast our methodology against their algorithm.

Preliminaries on standard PCA
PCA is an orthogonal linear transformation that projects the data to a new coordinate system according to the variance of each direction.Given a data matrix X ∈ R n×p with each row correspond to a sample, the first direction u 1 that maximizes the variance is defined through where 1 n is an n-dimensional vector whose elements are all set to 1 while x = 1 n n i=1 x i is the empirical mean.The process is repeated k times and at each iteration the to-be-estimated principal direction has to be orthogonal to all previously-computed principal directions.Thus, the k-th direction which has to be orthogonal to the previous ones is defined by

Non-robustness of standard PCA
We will show that the influence function for the largest eigenvalue of the covariance matrix and the respective eigenvector are unbounded with respect to the norm of an outlier sample.Suppose that Σ is the covariance matrix of a population with distribution function F , i.e., where µ = R p xdF (x) corresponds to the mean vector.Assume that the leading eigenvalue of Σ has multiplicity 1, then we denote it by λ and the leading eigenvector by û (i.e., u 1 = û).
Let T be an arbitrary functional, F a distribution and z ∈ R p an arbitrary point in the relevant sample space.The influence function is defined as where ∆ z is a unit point mass located at z.A robust estimator for T means that the influence function is bounded with respect to the norm of the outlier z.
Proposition 2.1.The influence function for the leading eigenvector of Σ is given by1 (3) Similarly, the IF for the largest eigenvalue of Σ is The detailed calculations are presented in Appendix A1.The following result shows that outliers with unbounded influence function do exist.
Corollary 2.1.Let z = µ + γ û + ηv where v is orthogonal to û and does not belong to the null space of Σ and γ, η = 0 then and similarly for IF λ (z, F ).
Proof.Direct substitution of z into the influence function gives:

Generalizations of standard PCA
Standard PCA can be viewed as a special case of a more general optimization problem.We present two such generalization: the first one leads to projection pursuit algorithms while the second leads to a maximum likelihood formulation.Let u be a unit vector and define the projection values and a function Φ : R n → R acting on the projected values.The first generalization of PCA is defined as the maximization of Φ: As in the standard PCA, the following principal directions are obtained after removing the contribution of the current principal component from the data.When Φ is the sample variance then we recover the standard PCA.
The second generalization interprets the computation of the principal component as a maximum likelihood estimation problem.By letting, be the Gaussian log-likelihood, the first principal direction can be obtained by solving the minimax problem: Indeed, the inner maximization can be solved analytically which leads to the optimal solution Unsurprisingly, the optimal values are the sample mean and the sample variance.Using the above formulas it is straightforward to show that arg min Variations of PCA can be derived by changing the likelihood function and in the next section we analyze the case of Cauchy distribution.

Cauchy PCA
The Cauchy log-likelihood function is given by where µ and σ are the two parameters of the Cauchy distribution.The first Cauchy principal direction is also obtained by solving the minimax optimization problem: In contrast to the Gaussian case, the inner maximization cannot be performed analytically.Therefore an iterative approach needs to be utilized.
Here, we apply the Newton-Raphson method with initial values the median and half the interquartile range for the location and scale parameters, respectively.According to Copas (1975), although the mean of the Cauchy distribution does not exist and it has infinite variance, the Cauchy log-likelihood function l C (µ, σ) has a unique maximum likelihood estimate, (μ, σ).Fixing µ and σ, the outer minimization is also non-analytic and a fixed point iteration is applied to calculate u.The iteration is given by where ûun is the unnormalized direction which is obtained from the differentiation of the Lagrangian function with respect to u and it is given by ûun Once the first principal direction has been computed, its contribution from the dataset X is removed and the same procedure to estimate the next principal direction is repeated.This iterative process is repeated k times.The removal of the contribution makes the principal directions orthogonal to each other.We summarize the estimation of k Cauchy principal components in the following pseudo-code (Algorithm 1).
• Via Newton-Raphson algorithm find • Fix (μ, σ) and using fixed point iteration (i.e., ( 10) & ( 9)) find end while • Set the j-th Cauchy principal direction • Remove the contribution from the dataset X = X(I p − u j u T j ), end for

Robustness of the Leading Cauchy Principal Direction
Let θ = (µ, σ) T be the parameter vector of the Cauchy distribution and consider the infinite-sample normalized Cauchy log-likelihood function where g(c, θ) = log(σ/π) − log(σ 2 + (c − µ) 2 ) and c(u) = x T u.We will estimate the influence function for the leading Cauchy principal direction û = arg min where θ F (u) = arg max θ l(x T u|θ) is the optimal Cauchy parameters for a given direction u.
Since û is restricted to be a unit vector, the standard condition for the minimum, i.e., ∂ ∂u l(u|θ F (u)) u=û = 0 is not valid.The proper condition is defined by where P u is the projection matrix given by P u = I p − uu T .
Remark 3.1.An equivalent condition is to satisfy h T ∂ ∂u l(u|θ F (u)) u=û = 0 for all h such that h T û = 0 and ||h|| 2 = 1.Both derived conditions are essentially a consequence of the Lagrangian formulation of the constraint optimization problem.Indeed, the Lagrange condition implies that at the minimum the direction of the objective function's derivative should be parallel to the direction of the constraint's derivative which translates to ∂ ∂u l(u|θ F (u)) u=û = λû where λ = 0 is the Lagrange multiplier.
Let ḡ(x; u) = g(x T u|θ) θ=θ F (u) be the likelihood function computed at θ = θ F (u) and let denote its partial derivatives as .
Similarly, ḡcc , ḡcθ and ḡθθ denote the second order derivatives.The following proposition establishes the expression for the influence function of the leading Cauchy principal direction, û.
Proposition 3.1.Under the assumption of I F (û) and A being invertible matrices, the influence function of û is where is the expected Fisher information matrix under F for the parameters of the Cauchy distribution computed at û.
Proof.The proof consists of several straightforward series expansions and implicit function calculations.The complete proof is given in Appendix A2.
The following boundedness result for the influence function states the conditions under which Cauchy PCA is robust.
Corollary 3.1.Let the assumptions of the proposition hold.If z ⊥ û or if z ⊥ û = 0 but µ F (û) = 0 then the influence function for û is bounded.
Proof.First, observe that matrix A does not depend on z.It is only b that depends on z and our goal is to prove that b is bounded with respect to z.Second, we have to compute the partial derivatives ḡc (z; û) and ḡθ (z; û).Straightforward calculations lead to Let us now define an arbitrary scaling of the outlier z → αz and prove boundedness of b as we send α → ∞.We consider the first case where z ⊥ û.It holds that lim α→∞ ḡc (αz; û)αz = −(z T û) −1 z, lim α→∞ ḡµ (αz; û) = 0 and lim α→∞ ḡσ (αz; û) = 1 σ F (û) therefore b is bounded with respect to α.For the second case, we have since µ F (û) = 0 by assumption.Thus b is bounded with respect to α for the second case, too.
The only case not covered by the corollary is when z T û = 0 and µ(û) = 0. Our experiments presented in the following section show that outliers that are orthogonal to the Cauchy principal direction do sometimes influence the estimation of the Cauchy principal direction yet not significantly.

Several Cauchy principal components
We briefly mention possibilities for estimating several Cauchy principal components.There are two obvious approaches: one approach, the sequential approach, is to repeat the algorithm described above on the subspace orthogonal to û = û1 to obtain û2 , the second Cauchy principal component, where û1 is the first Cauchy principal component; then repeat the procedure on the subspace orthogonal to û1 and û2 to obtain û3 ; and so on.A second approach, the simultaneous approach, is to decide in advance how many principal components we wish to determine, p say, and then use a p-dimensional multivariate Cauchy likelihood, which has p + p(p + 1)/2 free parameters, to obtain û1 , . . ., ûp .It turns out that these two approaches lead to equivalent results in classical (Gaussian) PCA but when a Cauchy likelihood is used the two approaches produce different sets of principal components.Our current thinking is this: the sequential approach is easier to implement (essentially the same software can be used at each step) and it is faster.However, the simultaneous approach could potentially be preferable if we know in advance how many principal components we wish to estimate.Further investigation is required.

Simulation studies
In this section we will empirically validate our proposed methodology, via simulation studies.We searched for R packages that offer robust PCA in the n << p case and came up with FastHCS (Vakili, 2018), rrcovHD (Todorov, 2016), rpca (Sykulski, 2017) and pcaPP (Filzmoser et al., 2018).Out of them, pcaPP (Projection Pursuit PCA) is the only one which does not require hyper-parameter tuning, e.g.selection of the LASSO penalty λ or choice of the percentage of observations used to estimate a robust covariance matrix.

Setup of the simulations
Initially, we created a p × p (orthonormal) basis B by using QR decomposition on some randomly generated data.We then generated eigenvalues λ i ∼ Exp(0.4),where i = 1, . . ., p and hence we obtained the covariance matrix Σ Σ Σ = BΛ Λ ΛB T , where Λ Λ Λ = diag(λ i ).The first column of B served as the first "clean" eigenvector, and was the benchmark in our comparative evaluations.Following this step, we simulated n random vectors X ∼ N p (0, Σ Σ Σ) and in order to check the robustness of the results to the center of the data, all observations were shifted right by adding 50 everywhere.A number of outliers equal to 2% of the sample size were introduced.These outliers were x + e κ z ∈ R p , where x is the sample mean vector, z are unit vector(s) and e κ a real number denoting their norm, where κ varied from 3 up to 8 increasing with a step size equal to 1 and the angle between the outliers z and the first "clean" eigenvector spanned from 0 • up to 90 • .In all cases, we subtracted the spatial median or the column-wise median2 and scaled them by the mean absolute deviation.
At each case, we computed the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector.The performance metric is the angle (in degrees) between the first robust (based on Cauchy or PP-PCA) eigenvector and the first "clean" eigenvector computed using the classical PCA.All experiments were repeated 100 times and the results were averaged.

Comparative results
Tables 1-3 present the performance of the first Cauchy-PCA eigenvector and of the first PP-PCA eigenvector for a variety of norms of the outlier, with different angles (φ) between the outlier and the leading true eigenvector, for the n < p case.
The case of n < p was selected as statistical inference in this case is more challenging than the p < n case3 .Additionally, this case is also ordinarily met in the field of bioinformatics were the -omics data count tens of thousands of variables (genes, single nucleotide polymorphisms, etc.) but only tens or at most hundreds of observations.
As observed in Tables 1-3, the average angular difference between the Cauchy and the PP PCA ranges from 20 • up to more than 50 • , which is evidently quite substantial, providing evidence that Cauchy PCA has performed in a superior manner to the projection pursuit method of Croux et al. (2007Croux et al. ( , 2013)).In particular, the tables demonstrate that Cauchy PCA is less error prone than its competitor but, as is seen in Table 3, the error decreases for both methods with increasing sample size.Further, the mean angular difference between the two methods increases as the angle φ increases.For instance, in Table 1, when k = 8 and φ = 0 • the difference between the two methods is 20 • , whereas when φ = 90 • the difference increases to 48 • .Further, the error is not highly affected by the angle φ, or the norm of the outliers.It can be seen that in Table 2 and Table 3 in the special case of φ = 90 • , the error increases for the Cauchy PCA by 2 • − 3 • , thus corroborating the result of Corollary 3.1.However, this effect, as in Table 1, is rather small, though noticeable.
Tab. 1: Mean angular difference between the robust eigenvectors computed in the contaminated data and the sample eigenvector computed in the clean data when n = 100 and p = 500.The norm of the outliers is e k and their angle with the true clean eigenvector is denoted by φ.

High dimensional real datasets
Two real gene expression datasets, GSE13159 and GSE311614 , downloaded from the Biodataome platform (Lakiotaki et al., 2018), were used in the experiments.The dimensions of the datasets were equal to 2, 096 × 54, 630 and 1035 × 54, 675, respectively.We randomly selected 5, 000 variables and computed the outliers using the high dimensional Minimum Covariance Determinant (MCD) of Ro et al. (2015).In accordance with the simulations studies, we removed the 2% of the most extreme outliers detected by MCD and computed the first classical PC (benchmark eigenvector), the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector of the "clean" data.We then added those outliers and increased their norm by e k , where k = (0, 3, 4, . . ., 8) and computed computed the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector.In all cases, we subtracted the spatial median or the column-wise median and scaled them by the mean absolute deviation.The performance metric is the angle (in degrees) between the first robust (based on Cauchy or PP-PCA) eigenvector and the first true "clean" eigenvectors and the time required by each method.This procedure was repeated 200 times and the average results are graphically displayed in Figures 1(a)-(d).
Broadly speaking the effect of the PP PCA does not seem to have been affected substantially by the centering method, i.e. subtraction of the spatial or the column-wise median.On the contrary, the Cauchy PCA is affected by the type of median employed to this end.Centering with the spatial median yields high error levels for all norms of the outliers, for both datasets, whereas centering with the column-wise median produces much lower error levels.On average, the difference in the error between Cauchy PCA and PP PCA is about 30 • for the GSE31159 dataset (Figure 1(a)) and about 14 • for the GSE3161 dataset (Figure 1(b)).However, the error of the Cauchy PCA increases and the stabilizes in the GSE31159 dataset whereas the error of the PP PCA is stable regardless of the norm of the outliers.A different conclusion is extracted in the GSE31161 where the error of either method decreases as the norm of the outliers increases, until it reaches a plateau.
With regards to computational efficiency, the PP PCA is not affected by either centering method, whereas Cauchy PCA seems to be affected in the GSE31159 dataset but not in the GSE31161 dataset as seen in Figures 1(c) and 1(d).Cauchy PCA centered with the column-wise median is, on average, 5 times faster than PP PCA.

Conclusion
The starting point for this paper is the observation that classical PCA can be formulated purely in terms of operations on a Gaussian likelihood.Although this observation is not new, the specifics of this formulation of classical PCA do not appear to be as widely known as might be expected.The novel idea underlying this paper is to formulate a version of PCA in which a Cauchy likelihood is used instead of a Gaussian likelihood, leading to what we call Cauchy PCA.Study of the resulting influence functions shows that Cauchy PCA has very good robustness properties.Moreover, we have provided an implementation of Cauchy PCA which runs quickly and reliably.Numerous simulation and real-data examples, mainly in high-dimensional settings, show that Cauchy PCA typically out-performs alternative robust versions of PCA whose implementation is in the public domain.

A2 Proof of Proposition 3.1
Let us first make the symbolism more explicit and denote l F (u|θ) the Cauchy log-likelihood function with respect to the distribution F and ûF the respective leading Cauchy principal direction.Then, our goal is to calculate the limit of 1 (û F ,z − ûF ) as → 0 where ûF ,z is the leading Cauchy principal direction for the distribution F ,z = (1 − )F + ∆ z .The optimality condition for the leading Cauchy principal direction reads and Moreover, ûF ,z is a unit vector which can be represented as where h is a unit vector perpendicular to ûF and ρ is a (small) real number.Under these assumptions, ûF ,z is a unit vector since Obviously, ρ depends on and z (i.e., ρ = ρ( , z)) and lim →0 ρ = 0 but we choose to avoid denoting their explicit relationship because it is not required in our proof.Moreover, a Taylor expansion for the representation leads to

Fig. 1 :
Fig.1: The first row presents the angle between the first Cauchy PC of the "contaminated" data and the 1st leading eigenvector of the "clean" data and the angle between the first Projection Pursuit PC of the "contaminated" data and the 1st leading eigenvector of the "clean" data for increasing norms of the outliers.The second row contains the time in seconds.