1 Introduction

In the analysis of multivariate data, it is frequently desirable to employ statistical methods which are insensitive to the presence of outliers in the sample. Most statistical procedures rely on explicit or implicit assumptions about the distribution of the observations, often without accounting for the effect of outliers, which makes the development of robust procedures important. The purpose of this paper is to present a novel robust version of PCA which has some attractive features.

Principal component analysis (PCA) is considered to be one of the most important techniques in statistics. However, the classical version of PCA depends on either a covariance or a correlation matrix, both of which are very sensitive to outliers. We develop a far more robust alternative to classical PCA by using a multivariate Cauchy likelihood to construct a robust principal components (PC) procedure. It is an adaptation of the classical method of PCA obtained by replacing the Gaussian log-likelihood function by the Cauchy log-likelihood function, in a sense that will be explained in Sect. 2.2. The interpretation of standard PCA in terms of operations on a Gaussian likelihood is not new, see Bolton and Krzanowski (1999), but this fact does not appear to have been exploited in the development of a robust PCA procedure, as we do in this paper. One reason for using the multivariate Cauchy likelihood is that it possesses a unique maximum, but the single most important motivation is that it leads to a robust procedure.

In the next section we briefly review some of the techniques employed for estimating PCA parameters, and for conducting a PCA, in ways which are robust against the presence of outliers. We also present robustness preliminaries, including the main tools needed to assess whether a given method is robust. In Sect. 3 we develop the Cauchy PCA and explore its robustness properties theoretically. In Sect. 4 we present the numerical algorithms for computing the Cauchy PCs, and also give the results of a number of very high-dimensional real-data and simulated examples. Our approach is seen to be competitive with, and often gives superior results to, the projection pursuit (PP) algorithm of Croux et al. (2007), Croux et al. (2013) and other competitors discussed later. Finally, we conclude the paper in Sect. 5.

1.1 Literature review on robust PCA

It is well known that PCA is an important technique for high-dimensional data reduction. PCA is based on the sample covariance matrix \(\hat{{\varvec{\Sigma }}}\) and involves searching for linear combinations \(y_{j}= \textbf{u}^{T}\textbf{x}_{j}\) of the components of \(\textbf{x}\) that maximize the sample variance of the resulting scores \(y_{j}\). According to Mardia et al. (1979), for example, the solution is given by the spectral decomposition

$$\begin{aligned} \hat{\varvec{\Sigma }}=\mathbf{U \Lambda U}^{T}, \end{aligned}$$

where \({\varvec{\Lambda }}= \hbox {diag}\{\lambda _{1}, \ldots , \lambda _{p}\}\) has diagonal elements \(\lambda _{i}\) equal to the sample variances of the PCs, while \(\textbf{U}\) is an orthogonal matrix, i.e. \(\textbf{U}\textbf{U}^{T} =\textbf{U}^{T}\textbf{U}=\textbf{I}_{p}\), whose columns \(\textbf{u}_{i}\) are the corresponding eigenvectors, which define the linear combinations. The PCs are efficiently computed in practice via the Singular Value Decomposition (SVD).
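As a brief illustration (a sketch in base R with simulated placeholder data, not part of the procedures compared later), the PCs obtained from the eigendecomposition of the sample covariance matrix coincide, up to sign, with those obtained from the SVD of the centred data matrix:

```r
# Classical PCA via the eigendecomposition of the sample covariance matrix
# and, equivalently, via the SVD of the centred data matrix.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)                # placeholder data (n samples, p variables)

S <- cov(X)                                    # sample covariance matrix
eig <- eigen(S, symmetric = TRUE)
U_eig <- eig$vectors                           # columns are the eigenvectors u_i
lambda <- eig$values                           # eigenvalues (sample variances of the PCs)

Xc <- scale(X, center = TRUE, scale = FALSE)   # centre the data
sv <- svd(Xc)
U_svd <- sv$v                                  # right singular vectors give the same PCs
lambda_svd <- sv$d^2 / (n - 1)                 # squared singular values give the eigenvalues

max(abs(abs(U_eig) - abs(U_svd)))              # essentially zero: directions agree up to sign
max(abs(lambda - lambda_svd))                  # essentially zero: eigenvalues agree
```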

Classical PCA, unfortunately, is non-robust, since it is based on the sample covariance or sample correlation matrix, both of which are very sensitive to outlying observations; see Sect. 2. This problem has been handled by two different approaches which result in robust versions of PCA:

  (i) replacing the standard covariance or correlation matrix with a robust estimator; or

  (ii) maximising (or minimising) a different objective function to obtain a robust PCA.

Many different proposals have been developed to carry out robust PCA, for example using PP or \(M\)-estimators.

Although maximum likelihood estimation is perhaps the most important method of statistical inference, it can lead to incorrect results when the underlying assumptions are not satisfied, for instance when the data contain outliers or deviate slightly from the assumed model. A generalization of maximum likelihood estimation proposed by Huber (1964), called M-estimation, aims to produce robust statistics by constructing procedures that are resistant to deviations from the underlying assumptions. M-estimators were extended to the multivariate case by Maronna (1976).

Campbell (1980) provided a procedure for robust PCA by examining estimates of means and covariances which are less affected by outlying observations, and by identifying the observations which have a large effect on the estimates. He replaced the sample covariance matrix by an \(M\)-estimator. Hubert and Verboven (2003) introduced a new approach to robust PCA which combines the advantages of two methods: the first is based on replacing the covariance or correlation matrix by a robust estimator, while the second is based on maximizing the objective function for this robust estimator.

A robust PCA based on the PP method was developed by Li and Chen (1985), using Huber’s M-estimator of dispersion as the projection index. The objective of PP is to seek projections of the high-dimensional data set onto low-dimensional subspaces that optimise a function of "interestingness". The function to be optimised is called an index or objective function, and its choice depends on the feature the researcher is concerned about. This property gives the PP technique the flexibility to handle many different statistical problems, ranging from clustering to identifying outliers in a multivariate data set.

Bolton and Krzanowski (1999) characterized the PCs for PP in terms of maximum likelihood under the assumption of normality; PCA can be considered a special case of PP, as can many other methods of multivariate analysis. The sample median was used as the projection index to develop a robust PCA by Xie et al. (1993), who observed in their simulation studies a PCA resistant to outliers and to deviations from the normal distribution.

More recently, there has been further work on developing approaches to robust PCA. Croux et al. (2007) and Croux et al. (2013) also suggested a robust PCA using a version of PP. Candès et al. (2011) proposed a robust penalized PCA, termed Principal Component Pursuit (PCP). Another approach, developed by Zhang et al. (2013), is based on a Robust Regularized Singular Value Decomposition (RRSVD). We will contrast our methodology against the three aforementioned algorithms in numerical examples, focusing especially on the ultra high-dimensional case where \(p \gg n\).

2 Preliminaries on standard PCA

PCA is an orthogonal linear transformation that projects the data onto a new coordinate system according to the variance along each direction. Given a data matrix \(\textbf{X}\in {\mathbb {R}}^{n\times p}\), with each row corresponding to a sample, the first direction \(\textbf{u}_1\) that maximizes the variance is defined through

$$\begin{aligned} \textbf{u}_1 = {{\,\mathrm{\arg \max }\,}}_{||\textbf{u}||_2=1} ||(\textbf{X}- \textbf{1}_n{\bar{\textbf{x}}}^T) \textbf{u}||_2^2, \end{aligned}$$

where \(\textbf{1}_n\) is an n-dimensional vector whose elements are all equal to 1, while \({\bar{\textbf{x}}}=\frac{1}{n}\sum _{i=1}^n \textbf{x}_i\) is the empirical mean. The process is repeated k times, and at each iteration the principal direction to be estimated has to be orthogonal to all previously computed principal directions. Thus, the k-th direction is defined by

$$\begin{aligned} \textbf{u}_k = {{\,\mathrm{\arg \max }\,}}_{||\textbf{u}||_2=1} ||(\textbf{X}- \textbf{1}_n{\bar{\textbf{x}}}^T) \textbf{u}||_2^2 \quad \text {subject to} \quad \textbf{u}\perp \textbf{u}_j, \ j=1,\ldots ,k-1. \end{aligned}$$
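The following sketch (again in base R with simulated placeholder data) illustrates this sequential formulation: at each step the variance of the projected, deflated data is maximized, here by taking the leading right singular vector of the deflated centred matrix, and the resulting directions coincide with the usual SVD solution:

```r
# Sequential (deflation) view of PCA: the k-th direction maximizes
# ||(X - 1 xbar^T) u||^2 subject to orthogonality with the previous directions.
set.seed(1)
n <- 100; p <- 5; k <- 3
X <- matrix(rnorm(n * p), n, p)

Xc <- scale(X, center = TRUE, scale = FALSE)
U <- matrix(0, p, k)
R <- Xc                                    # residual (deflated) data
for (j in 1:k) {
  u <- svd(R, nu = 0, nv = 1)$v[, 1]       # maximizer of ||R u||^2 over unit vectors u
  U[, j] <- u
  R <- R - (R %*% u) %*% t(u)              # remove the contribution of u (deflation)
}
crossprod(U)                               # ~ identity: the directions are orthonormal
max(abs(abs(U) - abs(svd(Xc, nv = k)$v)))  # agrees with the SVD solution up to sign
```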

2.1 Non-robustness of standard PCA

We will show that the influence function for the largest eigenvalue of the covariance matrix and the respective eigenvector are unbounded with respect to the norm of an outlier sample. Suppose that \(\varvec{\Sigma }\) is the covariance matrix of a population with distribution function F, i.e.,

$$\begin{aligned} {\varvec{\Sigma }} = \int _{{\mathbb {R}}^p} (\textbf{x}-\varvec{\mu })(\textbf{x}-\varvec{\mu })^{T} dF(\textbf{x}), \end{aligned}$$
(1)

where \(\varvec{\mu }=\int _{{\mathbb {R}}^p} \textbf{x}dF(\textbf{x})\) is the mean vector. Assume that the leading eigenvalue of \(\varvec{\Sigma }\) has multiplicity 1; we denote it by \(\lambda \) and the leading eigenvector by \({\hat{\textbf{u}}}\) (i.e., \(\textbf{u}_{1}={\hat{\textbf{u}}}\)).

Let T be an arbitrary functional, F a distribution and \(\textbf{z}\in {\mathbb {R}}^p\) an arbitrary point in the relevant sample space. The influence function is defined as

$$\begin{aligned} IF_T(\textbf{z};F) = \lim _{\epsilon \rightarrow 0+} \frac{T((1-\epsilon ) F + \epsilon \Delta _{\textbf{z}}) - T(F)}{\epsilon }, \end{aligned}$$
(2)

where \(\Delta _{\textbf{z}}\) is a unit point mass located at \(\textbf{z}\).

An estimator T is said to be robust if its influence function is bounded with respect to the norm of the outlier \(\textbf{z}\).

Proposition 1

The influence function for the leading eigenvector of \(\varvec{\Sigma }\) is given byFootnote 1

$$\begin{aligned} IF_{{\hat{\textbf{u}}}} (\textbf{z}, F) = - \big ( (\textbf{z}-\varvec{\mu })^{T}{\hat{\textbf{u}}} \big ) (\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} (\textbf{z}-\varvec{\mu }). \end{aligned}$$
(3)

Similarly, the IF for the largest eigenvalue of \({\varvec{\Sigma }}\) is

$$\begin{aligned} IF_\lambda (\textbf{z}, F) = \big ( (\textbf{z}-\varvec{\mu })^{T}{\hat{\textbf{u}}} \big )^2 - \lambda . \end{aligned}$$
(4)

The detailed calculations are presented in Appendix 1. The following result shows that outliers with unbounded influence function do exist.

Corollary 1

Let \(\textbf{z}=\varvec{\mu }+ \gamma {\hat{\textbf{u}}} + \eta \textbf{v}\), where \(\textbf{v}\) is orthogonal to \({\hat{\textbf{u}}}\) and does not belong to the null space of \(\varvec{\Sigma }\), and \(\gamma ,\eta \ne 0\). Then

$$\begin{aligned} \lim _{\textbf{z}: \, ||\textbf{z}||_2 \rightarrow \infty }||IF_{{\hat{\textbf{u}}}} (\textbf{z}, F)||_2 = \infty , \end{aligned}$$

and similarly for \(IF_\lambda (\textbf{z}, F)\).

Proof

Direct substitution of \(\textbf{z}\) into the influence function gives:

$$\begin{aligned} IF_{{\hat{\textbf{u}}}} (\textbf{z}, F)&= -((\gamma {\hat{\textbf{u}}} + \eta \textbf{v})^T {\hat{\textbf{u}}}) (\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} (\gamma {\hat{\textbf{u}}} + \eta \textbf{v}) \\&= - \gamma \eta (\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} \textbf{v}. \end{aligned}$$

Since \(\textbf{v}\) does not belong to the null space of \(\varvec{\Sigma }\), it holds that \((\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} \textbf{v}\ne \textbf{0}\) thus \(||(\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} \textbf{v}||_2=c\ne 0\). Hence,

$$\begin{aligned} ||IF_{{\hat{\textbf{u}}}} (\textbf{z}, F)||_2 = |\gamma | |\eta | c. \end{aligned}$$

Given that \(||\textbf{z}||_2^2 = \gamma ^2+\eta ^2||\textbf{v}||_2^2+||\varvec{\mu }||_2^2+2\gamma \varvec{\mu }^T{\hat{\textbf{u}}}+2\eta \varvec{\mu }^T\textbf{v}\), either sending \(|\gamma |\rightarrow \infty \) or \(|\eta |\rightarrow \infty \) completes the proof.

Similarly,

$$\begin{aligned} IF_\lambda (\textbf{z}, F) = \gamma ^2-\lambda \rightarrow \infty , \end{aligned}$$

as \(|\gamma |\rightarrow \infty \). \(\square \)
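Proposition 1 can also be checked numerically. The sketch below (an illustration only; the covariance matrix, mean vector, contamination point \(\textbf{z}\) and contamination level \(\epsilon \) are arbitrary choices, and MASS::ginv is used for the pseudoinverse) compares the analytic expressions (3) and (4) with a finite-\(\epsilon \) perturbation of the leading eigenpair of the contaminated covariance \((1-\epsilon )\varvec{\Sigma }+\epsilon (1-\epsilon )(\textbf{z}-\varvec{\mu })(\textbf{z}-\varvec{\mu })^T\):

```r
# Numerical check of the influence functions in Proposition 1.
library(MASS)                              # for ginv(), the Moore-Penrose pseudoinverse
set.seed(1)
p <- 4
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A) + diag(p)            # an arbitrary positive definite covariance
mu <- rnorm(p)                             # an arbitrary mean vector
eig <- eigen(Sigma, symmetric = TRUE)
u1 <- eig$vectors[, 1]; lam <- eig$values[1]

z <- rnorm(p, sd = 5)                      # an arbitrary contamination point

# Analytic influence functions, equations (3) and (4)
IF_u <- -drop(crossprod(z - mu, u1)) * ginv(Sigma - lam * diag(p)) %*% (z - mu)
IF_l <- drop(crossprod(z - mu, u1))^2 - lam

# Finite-epsilon approximation via the contaminated covariance functional
eps <- 1e-6
Sigma_eps <- (1 - eps) * Sigma + eps * (1 - eps) * tcrossprod(z - mu)
eig_eps <- eigen(Sigma_eps, symmetric = TRUE)
u1_eps <- eig_eps$vectors[, 1]
u1_eps <- u1_eps * sign(sum(u1_eps * u1))  # fix the sign ambiguity of eigenvectors
cbind(IF_u, (u1_eps - u1) / eps)           # the two columns should nearly coincide
c(IF_l, (eig_eps$values[1] - lam) / eps)   # and likewise for the eigenvalue
```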

2.2 Generalizations of standard PCA

Standard PCA can be viewed as a special case of a more general optimization problem. We present two such generalizations: the first leads to PP algorithms, while the second leads to a maximum likelihood formulation. Let \(\textbf{u}\) be a unit vector and define the projection values

$$\begin{aligned} c_{i}(\textbf{u}) = \textbf{x}^{T}_{i} \textbf{u}, \quad i=1, \ldots , n, \end{aligned}$$

and a function \(\Phi :{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) acting on the projected values. The first generalization of PCA is defined as the maximization of \(\Phi \):

$$\begin{aligned} \textbf{u}_1 = {{\,\mathrm{\arg \max }\,}}_{||\textbf{u}||_2=1} \Phi (c_1(\textbf{u}),...,c_n(\textbf{u})) \ . \end{aligned}$$

As in standard PCA, the subsequent principal directions are obtained after removing the contribution of the current PC from the data. When \(\Phi \) is the sample variance we recover standard PCA.

The second generalization interprets the computation of the principal component as a maximum likelihood estimation problem. Letting

$$\begin{aligned} l_{G}(\mu , \sigma ^{2}| c_{1},\ldots , c_{n})= -\frac{n}{2} \log {\sigma }^{2} - \frac{1}{2{\sigma }^{2}}\sum _{i=1}^{n}(c_{i}-\mu )^{2} \end{aligned}$$
(5)

be the Gaussian log-likelihood (up to an additive constant), the first principal direction can be obtained by solving the minimax problem:

$$\begin{aligned} \min _{||\textbf{u}||_2=1}\max _{\mu , \sigma ^2} \ l_{G}(\mu , \sigma ^{2}| c_{1}(\textbf{u}),\ldots , c_{n}(\textbf{u})). \end{aligned}$$

Indeed, the inner maximization can be solved analytically which leads to the optimal solution

$$\begin{aligned} {\hat{\mu }}(\textbf{u}) = \frac{1}{n} \sum _{i=1}^n c_i(\textbf{u}) =: {\bar{c}}(\textbf{u}) \end{aligned}$$

and

$$\begin{aligned} {{\hat{\sigma }}}^{2}(\textbf{u}) = \frac{1}{n} \sum _{i=1}^{n} (c_{i}(\textbf{u})- {\bar{c}}(\textbf{u}))^{2}. \end{aligned}$$

Unsurprisingly, the optimal values are the sample mean and the sample variance. Using the above formulas it is straightforward to show that

$$\begin{aligned} {{\,\mathrm{\arg \min }\,}}_{||\textbf{u}||_2=1} \ l_{G}\big ({\hat{\mu }}(\textbf{u}), {{\hat{\sigma }}}^{2}({\textbf{u}})| c_{1}({\textbf{u}}), \ldots , c_{n}({\textbf{u}})\big ) = {{\,\mathrm{\arg \max }\,}}_{||\textbf{u}||_2=1} \ {\hat{\sigma }}^{2}({\textbf{u}}) \ . \end{aligned}$$
(6)

Variations of PCA can be derived by changing the likelihood function, and in the next section we analyze the case of the Cauchy distribution.
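Equation (6) is easy to verify numerically: for any unit vector \(\textbf{u}\) the profile Gaussian log-likelihood equals \(-\frac{n}{2}\{\log {\hat{\sigma }}^{2}(\textbf{u})+1\}\), a decreasing function of \({\hat{\sigma }}^{2}(\textbf{u})\). The small sketch below (simulated placeholder data and random candidate directions, an illustration only) confirms that the two criteria select the same direction:

```r
# Check of equation (6): minimizing the profile Gaussian log-likelihood over
# unit vectors u is equivalent to maximizing the sample variance of X u.
set.seed(1)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(1:p))   # variables with unequal variances

profile_loglik <- function(u, X) {
  ci <- drop(X %*% u)
  s2 <- mean((ci - mean(ci))^2)            # hat{sigma}^2(u), the MLE of the variance
  -length(ci) / 2 * (log(s2) + 1)          # l_G at (hat{mu}(u), hat{sigma}^2(u))
}

# Random unit directions: the direction with the smallest profile log-likelihood
# is the one with the largest projected variance.
U <- apply(matrix(rnorm(p * 1000), p, 1000), 2, function(u) u / sqrt(sum(u^2)))
ll <- apply(U, 2, profile_loglik, X = X)
s2 <- apply(U, 2, function(u) mean((drop(X %*% u) - mean(X %*% u))^2))
which.min(ll) == which.max(s2)             # TRUE: the two criteria pick the same direction
```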

3 Cauchy PCA

The Cauchy log-likelihood function is given by

$$\begin{aligned} l_{C}({\mu },{\sigma }| {c}_{1}({\textbf{u}}), \ldots , {c}_{n}({\textbf{u}})) = n \log {\frac{\sigma }{\pi }} - \sum _{i=1}^{n} \log \left\{ {\sigma }^{2}+ (c_{i}(\textbf{u})-{\mu })^{2}\right\} , \end{aligned}$$
(7)

where \(\mu \) and \(\sigma \) are the two parameters (location and scale) of the Cauchy distribution. The first Cauchy principal direction is also obtained by solving the minimax optimization problem:

$$\begin{aligned} \min _{||\textbf{u}||_2=1}\max _{\mu ,\sigma } \ l_{C}(\mu , \sigma | c_{1}(\textbf{u}),\ldots , c_{n}(\textbf{u})). \end{aligned}$$
(8)

In contrast to the Gaussian case, the inner maximization cannot be performed analytically, so an iterative approach is required. Here, we apply the Newton-Raphson method, with the median and half the interquartile range as initial values for the location and scale parameters, respectively. Although the Cauchy distribution has no finite moments, Copas (1975) showed that the Cauchy log-likelihood function \(l_{C}(\mu , \sigma )\) has a unique maximum, so the maximum likelihood estimate \(({\hat{\mu }},{\hat{\sigma }})\) is well defined.
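As an illustration of the inner maximization for a fixed direction, the sketch below computes the Cauchy maximum likelihood estimates from the projections \(c_i = \textbf{x}_i^T\textbf{u}\), started at the median and half the interquartile range. For simplicity it uses optim (quasi-Newton, on a log-scale parametrization) rather than a hand-coded Newton-Raphson; this is an implementation choice of the sketch, not of the method itself:

```r
# Inner maximization of (8) for a fixed direction: maximum likelihood
# estimation of the Cauchy location and scale from the projections c_i = x_i^T u.
cauchy_mle <- function(ci) {
  negloglik <- function(theta) {              # theta = (mu, log sigma)
    mu <- theta[1]; sigma <- exp(theta[2])
    -sum(dcauchy(ci, location = mu, scale = sigma, log = TRUE))
  }
  start <- c(median(ci), log(IQR(ci) / 2))    # median and half the interquartile range
  fit <- optim(start, negloglik, method = "BFGS")
  c(mu = fit$par[1], sigma = exp(fit$par[2]), loglik = -fit$value)
}

set.seed(1)
ci <- rcauchy(200, location = 3, scale = 2)   # simulated projections
cauchy_mle(ci)                                # estimates close to (3, 2)
```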

For fixed \(\mu \) and \(\sigma \), the outer minimization is also not available analytically, and a fixed-point iteration is applied to calculate \(\textbf{u}\). The iteration is given by

$$\begin{aligned} {\hat{\textbf{u}}} = \frac{{\hat{\textbf{u}}}_{un}}{||{\hat{\textbf{u}}}_{un}||_2}, \end{aligned}$$
(9)

where \({\hat{\textbf{u}}}_{un}\) is the unnormalized direction, obtained by differentiating the Lagrangian function with respect to \(\textbf{u}\), and is given by

$$\begin{aligned} {\hat{\textbf{u}}}_{un} = \sum _{i=1}^{n}\frac{({\textbf{x}}_i^T{{\hat{\textbf{u}}}}-{\hat{\mu }}){\textbf{x}}_i}{{\hat{\sigma }}^2 + \left( {\textbf{x}}_i^T{{\hat{\textbf{u}}}}-{\hat{\mu }} \right) ^2} \ . \end{aligned}$$
(10)

Once the first principal direction has been computed, its contribution is removed from the dataset \(\textbf{X}\) and the same procedure is repeated to estimate the next principal direction. This iterative process is repeated k times to estimate the first k PCs. The removal of the contribution makes the principal directions orthogonal to each other. We summarize the estimation of k Cauchy PCs in the following pseudo-code (Algorithm 1).

Algorithm 1 Cauchy PCA
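A minimal R sketch of the procedure described above (our illustration, not the implementation released in the cauchypca package) combines the inner Cauchy MLE, the fixed-point update (9)-(10) and deflation. The starting direction, tolerance, iteration cap, the sign-alignment step and the use of optim instead of Newton-Raphson are choices made only for this sketch:

```r
# A sketch of Algorithm 1 (Cauchy PCA): k principal directions via the
# fixed-point iteration (9)-(10), with deflation between components.
cauchy_pca <- function(X, k = 2, maxit = 100, tol = 1e-6) {
  p <- ncol(X)
  U <- matrix(0, p, k)
  R <- X                                          # deflated data
  for (j in 1:k) {
    u <- svd(R, nu = 0, nv = 1)$v[, 1]            # starting direction (arbitrary choice)
    for (it in 1:maxit) {
      ci <- drop(R %*% u)
      # inner maximization: Cauchy MLE of (mu, sigma) for the projections
      nll <- function(th) -sum(dcauchy(ci, th[1], exp(th[2]), log = TRUE))
      th <- optim(c(median(ci), log(IQR(ci) / 2)), nll, method = "BFGS")$par
      mu <- th[1]; sigma <- exp(th[2])
      # fixed-point update (10), followed by the normalization (9)
      w <- drop(t(R) %*% ((ci - mu) / (sigma^2 + (ci - mu)^2)))
      u_new <- w / sqrt(sum(w^2))
      if (sum(u_new * u) < 0) u_new <- -u_new     # resolve the sign ambiguity
      if (sum(abs(u_new - u)) < tol) { u <- u_new; break }
      u <- u_new
    }
    U[, j] <- u
    R <- R - (R %*% u) %*% t(u)                   # remove the contribution of u
  }
  U
}

set.seed(1)
# In practice X would first be centred (spatial or variable-wise median)
# and scaled (mean absolute deviation), cf. Sect. 4.
X <- matrix(rcauchy(200 * 5), 200, 5)
U <- cauchy_pca(X, k = 2)
crossprod(U)                                      # approximately orthonormal directions
```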

3.1 Robustness of the leading Cauchy principal direction

Let \(\varvec{\theta }= \left( \mu ,\sigma \right) ^T\) be the parameter vector of the Cauchy distribution and consider the infinite-sample normalized Cauchy log-likelihood function

$$\begin{aligned} l(\textbf{u}|\varvec{\theta }) = \int _{\textbf{x}\in {\mathbb {R}}^p} g(c(\textbf{u}),\varvec{\theta })\, dF(\textbf{x}), \end{aligned}$$
(11)

where \(g(c,\varvec{\theta }) = \log (\sigma /\pi ) - \log ( \sigma ^2+(c-\mu )^2)\) and \(c(\textbf{u})=\textbf{x}^T\textbf{u}\). We will estimate the influence function for the leading Cauchy principal direction

$$\begin{aligned} {\hat{\textbf{u}}} = {{\,\mathrm{\arg \min }\,}}_{||\textbf{u}||_2=1} \ l(\textbf{u}|\varvec{\theta }_F(\textbf{u})), \end{aligned}$$
(12)

where

$$\begin{aligned} \varvec{\theta }_F(\textbf{u})=\left( \mu _F(\textbf{u}), \sigma _F(\textbf{u}) \right) ^\top = {{\,\mathrm{\arg \max }\,}}_{\varvec{\theta }} l(\textbf{u}|\varvec{\theta }) \end{aligned}$$

are the optimal Cauchy parameters for a given direction \(\textbf{u}\).

Since \({\hat{\textbf{u}}}\) is restricted to be a unit vector, the standard condition for the minimum, i.e., \(\left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = \textbf{0}\) is not valid. The appropriate condition is defined by

$$\begin{aligned} \textbf{P}_{{\hat{\textbf{u}}}} \left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = \textbf{0}, \end{aligned}$$
(13)

where \(\textbf{P}_{\textbf{u}}\) is the projection matrix given by \(\textbf{P}_{\textbf{u}}=\textbf{I}_p-\textbf{u}\textbf{u}^T\).

Remark 1

An equivalent condition is that \(\textbf{h}^T \left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = 0\) for all \(\textbf{h}\) such that \(\textbf{h}^T{\hat{\textbf{u}}}=0\) and \(||\textbf{h}||_2=1\). Both conditions are consequences of the Lagrangian formulation of the constrained optimization problem. Indeed, the Lagrange condition implies that at the minimum the derivative of the objective function is parallel to the derivative of the constraint, which translates to \(\left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = \lambda {\hat{\textbf{u}}}\), where \(\lambda \ne 0\) is the Lagrange multiplier.

Let \({\bar{g}}(\textbf{x};\textbf{u}) = \left. g(\textbf{x}^T\textbf{u},\varvec{\theta })\right| _{\varvec{\theta }=\varvec{\theta }_F(\textbf{u})}\) denote the likelihood function evaluated at \(\varvec{\theta }=\varvec{\theta }_F(\textbf{u})\), and denote its partial derivatives by

$$\begin{aligned} {\bar{g}}_c(\textbf{x};\textbf{u}) = \left. \frac{\partial }{\partial c} g(\textbf{x}^T\textbf{u},\varvec{\theta })\right| _{\varvec{\theta }=\varvec{\theta }_F(\textbf{u})} \end{aligned}$$

and

$$\begin{aligned} {\bar{g}}_{\varvec{\theta }}(\textbf{x};\textbf{u}) = \left. \frac{\partial }{\partial \varvec{\theta }} g(\textbf{x}^T\textbf{u},\varvec{\theta })\right| _{\varvec{\theta }=\varvec{\theta }_F(\textbf{u})}. \end{aligned}$$

Similarly, \({\bar{g}}_{cc}\), \({\bar{g}}_{c\varvec{\theta }}\) and \({\bar{g}}_{\varvec{\theta }\varvec{\theta }}\) denote the second-order derivatives. The following proposition establishes the expression for the influence function of the leading Cauchy principal direction, \({\hat{\textbf{u}}}\).

Proposition 2

Assume that the matrices \({\textbf{I}}_F({\hat{\textbf{u}}})\) and \(\textbf{A}\) defined below are invertible. Then the influence function of \({\hat{\textbf{u}}}\) is

$$\begin{aligned} IF_{{\hat{\textbf{u}}}} (\textbf{z}, F) = \textbf{A}^{-1} \textbf{b}, \end{aligned}$$
(14)

where

$$\begin{aligned} \textbf{A}&= \textbf{I}_p \int _{{\mathbb {R}}^p} {\bar{g}}_{c\varvec{\theta }}(\textbf{x};{\hat{\textbf{u}}})\, \textbf{x}^T{\hat{\textbf{u}}}\, dF(\textbf{x}) - \textbf{P}_{{\hat{\textbf{u}}}} \int _{{\mathbb {R}}^p} {\bar{g}}_{cc}(\textbf{x};{\hat{\textbf{u}}})\, \textbf{x}^T\textbf{x}\, dF(\textbf{x})\, \textbf{P}_{{\hat{\textbf{u}}}} \\&\quad + \textbf{P}_{{\hat{\textbf{u}}}} \int _{{\mathbb {R}}^p} \textbf{x}\,{\bar{g}}_{c\varvec{\theta }}(\textbf{x};{\hat{\textbf{u}}})\, dF(\textbf{x}) \, {\textbf{I}}_F({\hat{\textbf{u}}})^{-1} \int _{{\mathbb {R}}^p} {\bar{g}}_{\varvec{\theta }c}(\textbf{x};{\hat{\textbf{u}}})\, \textbf{x}^T dF(\textbf{x})\, \textbf{P}_{{\hat{\textbf{u}}}} \end{aligned}$$

and

$$\begin{aligned} \textbf{b}= \textbf{b}(\textbf{z}) = {\bar{g}}_c(\textbf{z};{\hat{\textbf{u}}})\, \textbf{z}+ \int _{{\mathbb {R}}^p} \textbf{x}\,{\bar{g}}_{c\varvec{\theta }}(\textbf{x};{\hat{\textbf{u}}})\, dF(\textbf{x}) \, {\textbf{I}}_F({\hat{\textbf{u}}})^{-1} \, {\bar{g}}_{\varvec{\theta }}(\textbf{z};{\hat{\textbf{u}}}), \end{aligned}$$

while

$$\begin{aligned} {\textbf{I}}_F({\hat{\textbf{u}}}) = \int _{{\mathbb {R}}^p} {\bar{g}}_{\varvec{\theta }\varvec{\theta }}(\textbf{x};{\hat{\textbf{u}}}) dF(\textbf{x}) \end{aligned}$$

is the expected Fisher information matrix under F for the parameters of the Cauchy distribution computed at \({\hat{\textbf{u}}}\).

Proof

The proof consists of several straightforward series expansions and implicit function calculations. The complete proof is given in Appendix 1. \(\square \)

The following boundedness result for the influence function states the conditions under which Cauchy PCA is robust.

Corollary 2

Let the assumptions of Proposition 2 hold. If \(\textbf{z}\not \perp {\hat{\textbf{u}}}\), or if \(\textbf{z}\perp {\hat{\textbf{u}}}\) but \(\mu _F({\hat{\textbf{u}}})=0\), then the influence function for \({\hat{\textbf{u}}}\) is bounded.

Proof

First, observe that the matrix \(\textbf{A}\) does not depend on \(\textbf{z}\); only \(\textbf{b}\) depends on \(\textbf{z}\), and our goal is to prove that \(\textbf{b}\) is bounded with respect to \(\textbf{z}\). Second, we compute the partial derivatives \({\bar{g}}_c(\textbf{z}; {\hat{\textbf{u}}})\) and \({\bar{g}}_{\varvec{\theta }}(\textbf{z}; {\hat{\textbf{u}}})\). Straightforward calculations lead to

$$\begin{aligned} {\bar{g}}_c(\textbf{z}; {\hat{\textbf{u}}}) = - \frac{2(\textbf{z}^T{\hat{\textbf{u}}}-\mu _F({\hat{\textbf{u}}}))}{\sigma _F^2({\hat{\textbf{u}}})+(\textbf{z}^T{\hat{\textbf{u}}}-\mu _F({\hat{\textbf{u}}}))^2} \\ {\bar{g}}_\mu (\textbf{z}; {\hat{\textbf{u}}}) = \frac{2(\textbf{z}^T{\hat{\textbf{u}}}-\mu _F({\hat{\textbf{u}}}))}{\sigma _F^2({\hat{\textbf{u}}})+(\textbf{z}^T{\hat{\textbf{u}}}-\mu _F({\hat{\textbf{u}}}))^2} \end{aligned}$$

and

$$\begin{aligned} {\bar{g}}_\sigma (\textbf{z}; {\hat{\textbf{u}}}) = \frac{1}{\sigma _F({\hat{\textbf{u}}})} - \frac{2\sigma _F({\hat{\textbf{u}}})}{\sigma _F^2({\hat{\textbf{u}}})+(\textbf{z}^T{\hat{\textbf{u}}}-\mu _F({\hat{\textbf{u}}}))^2}. \end{aligned}$$

Let us now consider an arbitrary scaling of the outlier, \(\textbf{z}\rightarrow \alpha \textbf{z}\), and prove the boundedness of \(\textbf{b}\) as \(\alpha \rightarrow \infty \). Consider first the case where \(\textbf{z}\not \perp {\hat{\textbf{u}}}\). It holds that \(\lim _{\alpha \rightarrow \infty } {\bar{g}}_c(\alpha \textbf{z}; {\hat{\textbf{u}}})\alpha \textbf{z}= -2(\textbf{z}^T{\hat{\textbf{u}}})^{-1}\textbf{z}\), \(\lim _{\alpha \rightarrow \infty } {\bar{g}}_\mu (\alpha \textbf{z}; {\hat{\textbf{u}}}) = 0\) and \(\lim _{\alpha \rightarrow \infty } {\bar{g}}_\sigma (\alpha \textbf{z}; {\hat{\textbf{u}}}) = \frac{1}{\sigma _F({\hat{\textbf{u}}})}\), and therefore \(\textbf{b}\) is bounded with respect to \(\alpha \).

For the second case, we have

$$\begin{aligned} \lim _{\alpha \rightarrow \infty } {\bar{g}}_c(\alpha \textbf{z}; {\hat{\textbf{u}}})\alpha \textbf{z}= \lim _{\alpha \rightarrow \infty } \frac{2\mu _F({\hat{\textbf{u}}})}{\sigma _F^2({\hat{\textbf{u}}})+\mu _F({\hat{\textbf{u}}})^2} \alpha \textbf{z}= 0, \\ \lim _{\alpha \rightarrow \infty } {\bar{g}}_\mu (\alpha \textbf{z}; {\hat{\textbf{u}}}) = -\frac{2\mu _F({\hat{\textbf{u}}})}{\sigma _F^2({\hat{\textbf{u}}})+\mu _F({\hat{\textbf{u}}})^2} = 0 \end{aligned}$$

and

$$\begin{aligned} \lim _{\alpha \rightarrow \infty } {\bar{g}}_\sigma (\alpha \textbf{z}; {\hat{\textbf{u}}}) = \frac{1}{\sigma _F({\hat{\textbf{u}}})} - \frac{2\sigma _F({\hat{\textbf{u}}})}{\sigma _F^2({\hat{\textbf{u}}})+\mu _F({\hat{\textbf{u}}})^2} = -\frac{1}{\sigma _F({\hat{\textbf{u}}})} \end{aligned}$$

since \(\mu _F({\hat{\textbf{u}}})=0\) by assumption. Thus \(\textbf{b}\) is bounded with respect to \(\alpha \) for the second case, too. \(\square \)

The only case not covered by the corollary is when \(\textbf{z}^T{\hat{\textbf{u}}}=0\) and \(\mu _F({\hat{\textbf{u}}})\ne 0\). Our experiments presented in the following section show that outliers which are orthogonal to the Cauchy principal direction do sometimes influence its estimation, but not significantly.

3.2 Several Cauchy principal components

We briefly mention possibilities for estimating several Cauchy principal components. There are two obvious approaches. One approach, the sequential approach, is to repeat the algorithm described above on the subspace orthogonal to \({\hat{\textbf{u}}}_1\), the first Cauchy principal component, to obtain \({\hat{\textbf{u}}}_2\), the second Cauchy principal component; then to repeat the procedure on the subspace orthogonal to \({\hat{\textbf{u}}}_1\) and \({\hat{\textbf{u}}}_2\) to obtain \({\hat{\textbf{u}}}_3\); and so on. A second approach, the simultaneous approach, is to decide in advance how many principal components we wish to determine, \(q\) say, and then use a \(q\)-dimensional multivariate Cauchy likelihood, which has \(q+ q(q+1)/2\) free parameters, to obtain \({\hat{\textbf{u}}}_1, \ldots , {\hat{\textbf{u}}}_q\). These two approaches lead to equivalent results in classical (Gaussian) PCA, but when a Cauchy likelihood is used they produce different sets of principal components. Our current thinking is this: the sequential approach is easier to implement (essentially the same computations can be used at each step) and it is faster. However, the simultaneous approach could potentially be preferable when we know in advance how many principal components we wish to estimate. Further investigation is required.

3.3 Cauchy measures of variability explained

Suppose we obtain mutually orthogonal PC unit vectors \(\hat{\textbf{u}}_1, \ldots , \hat{\textbf{u}}_m\) as the robust outputs of a Cauchy PCA. We may wish either to produce a type of scree plot or to assess the "variability explained" by different PCs, but it is not clear at the outset how best to do this. One possibility is to use the connection, in the Gaussian case, between maximising the variance along different choices of \(\textbf{u}\) and minimising the profile likelihood, and to argue by analogy. For example, suppose that \({\hat{\ell }}_{C,1}, \ldots , {\hat{\ell }}_{C,m}\) are the minimised Cauchy log-likelihoods corresponding to the orthogonal PC unit vectors \(\hat{\textbf{u}}_1, \ldots , \hat{\textbf{u}}_m\). We may then choose a suitable function \(g(\cdot )\) and define \({\hat{\lambda }}_j = g({\hat{\ell }}_{C,j})\), \(j=1, \ldots , m\). Then we could proceed as though \({\hat{\lambda }}_1, \ldots , {\hat{\lambda }}_m\) had been obtained in a standard (Gaussian) PCA. How might we choose the function g? One potentially interesting possibility is to choose g to be the inverse of the function \(h({\hat{\sigma }}^2)=-\frac{n}{2}\{\log ({\hat{\sigma }}^2)+1\}\), which maps the maximised variance to the minimised profile log-likelihood in the Gaussian case. Specifically, we define

$$\begin{aligned} {\hat{\lambda }}_j = g({\hat{\ell }}_{C,j})=\exp \left( -2 {\hat{\ell }}_{C,j}/n -1 \right) . \end{aligned}$$

This approach would automatically inherit the robustness properties of the Cauchy PCA, and the resulting \({\hat{\lambda }}_j\) would be on a scale roughly comparable with that of the eigenvalues produced in a classical PCA. It would be of interest to explore this and other possibilities in future work.
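As a small sketch of this proposal (the minimised Cauchy log-likelihoods below are hypothetical placeholder values, not output of any of our experiments):

```r
# "Variability explained" scores from the minimised Cauchy log-likelihoods,
# using the proposed map g (the inverse of h in the Gaussian case).
n <- 200
lC <- c(-512.3, -478.9, -455.1)               # hypothetical minimised Cauchy log-likelihoods
lambda_hat <- exp(-2 * lC / n - 1)            # hat{lambda}_j = g(hat{l}_{C,j})
round(100 * lambda_hat / sum(lambda_hat), 1)  # treated like eigenvalues, e.g. as percentages
```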

Table 1 Mean angular difference between the robust eigenvectors computed with the contaminated data and the sample eigenvector computed with the uncontaminated data when \(n=100\) and \(p=500\)

4 Numerical results

4.1 Simulation studies

In this section we empirically validate our proposed methodology via simulation studies. We searched for R packages that offer robust PCA in the \(n \ll p\) case and found FastHCS (Vakili 2018), rrcovHD (Todorov 2016), rpca (Sykulski 2017), RobRSVD (Zhang and Pan 2013) and pcaPP (Filzmoser et al. 2018). Of these, pcaPP (Projection Pursuit PCA) is the only one that does not require hyper-parameter tuning, e.g. selection of the LASSO penalty \(\lambda \) or choice of the percentage of observations used to estimate a robust covariance matrix. However, we also consider the PCP method of Candès et al. (2011) and the RRSVD method of Zhang et al. (2013), implemented in the packages rpca and RobRSVD, respectively, using the default arguments for the selection of the regularization parameters. Our proposed method has been implemented in the R package cauchypca (Tsagris et al. 2023).

4.1.1 Setup of the simulations

Initially, we created a \(p \times p\) orthonormal basis \(\textbf{B}\) by applying the QR decomposition to some randomly generated data. We then generated eigenvalues \(\lambda _i \sim Exp(0.4)\), \(i=1,\ldots ,p\), and hence obtained the covariance matrix \(\pmb {\Sigma } = \textbf{B}\pmb {\Lambda }\textbf{B}^T\), where \(\pmb {\Lambda } =\text {diag}(\lambda _i)\). The first column of \(\textbf{B}\) served as the first eigenvector, computed using the uncontaminated data, and was the benchmark in our comparative evaluations. Following this step, we simulated n random vectors \(\textbf{X} \sim N_p\left( \textbf{0}, \pmb {\Sigma } \right) \) and, in order to check the robustness of the results to the centre of the data, all observations were shifted by adding 50 to every element. A number of outliers equal to 2\(\%\) of the sample size were introduced. These outliers were of the form \(\bar{\textbf{x}}+e^{\kappa }\textbf{z} \in {{\mathbb {R}}}^{p}\), where \(\bar{\textbf{x}}\) is the sample mean vector, \(\textbf{z}\) is a unit vector and \(e^{\kappa }\) is a real number denoting their norm; \(\kappa \) varied from 3 up to 8 in steps of 1, and the angle between the outlier direction \(\textbf{z}\) and the first “clean” eigenvector ranged from \(0^{\circ }\) up to \(90^{\circ }\). In all cases, we subtracted the spatial median or the variable-wise medianFootnote 2 and scaled the variables by the mean absolute deviation.

In each case, we computed the first eigenvector for the Cauchy PCA and for each of the three competitors (PP, PCP and RRSVD). The performance metric is the angle (in degrees) between the first robust eigenvector and the first eigenvector computed with the uncontaminated data using classical PCA. All experiments were repeated 100 times and the results were averaged.
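A sketch of this data-generating mechanism and of the angular performance metric is given below; the eigenvalues are sorted so that the first column of \(\textbf{B}\) is indeed the leading eigenvector, and the construction of the outlier direction at a given angle \(\phi \), together with the particular values of \(\kappa \) and \(\phi \), reflect our reading of the description above rather than the exact code used to produce the tables:

```r
# Sketch of the simulation set-up: covariance from a random orthonormal basis,
# Gaussian data shifted by 50, 2% outliers at angle phi from the first eigenvector,
# and the angular error of an estimated leading eigenvector.
set.seed(1)
n <- 100; p <- 500
B <- qr.Q(qr(matrix(rnorm(p * p), p, p)))         # random orthonormal basis
lam <- sort(rexp(p, rate = 0.4), decreasing = TRUE)
Sigma <- B %*% diag(lam) %*% t(B)
u_true <- B[, 1]                                  # benchmark leading eigenvector

X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma) + 50   # shifted data

# 2% outliers: xbar + e^kappa * z, with z at angle phi from u_true
kappa <- 6; phi <- 60 * pi / 180                  # example values of kappa and phi
v <- rnorm(p); v <- v - sum(v * u_true) * u_true; v <- v / sqrt(sum(v^2))
z <- cos(phi) * u_true + sin(phi) * v             # unit vector at angle phi from u_true
n_out <- ceiling(0.02 * n)
xbar <- colMeans(X)
X_cont <- rbind(X, matrix(rep(xbar + exp(kappa) * z, n_out), n_out, p, byrow = TRUE))
# X_cont would then be centred (median) and scaled (MAD) before applying each method.

# Performance metric: angle (in degrees) between two leading eigenvectors
angle <- function(a, b) acos(min(1, abs(sum(a * b)) / sqrt(sum(a^2) * sum(b^2)))) * 180 / pi
angle(u_true, u_true)   # 0 by construction; replace the second argument with an estimate
```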

Table 2 Mean angular difference between the robust eigenvectors computed with the contaminated data and the population eigenvector, computed without contamination, when \(n=100\) and \(p=500\)
Table 3 Mean angular difference between the robust eigenvectors computed with the contaminated data and the sample eigenvector computed with the uncontaminated data when \(n=100\) and \(p=1000\)
Table 4 Mean angular difference between the robust eigenvectors computed with the contaminated data and the population eigenvector, computed without any contamination present, when \(n=100\) and \(p=1000\)

4.1.2 Comparative results

Tables 1, 2, 3, 4, 5 and 6 present the performance of the first Cauchy PCA eigenvector and of the first eigenvector of the PCP, RRSVD and PP methods for a variety of norms of the outliers and for different angles (\(\phi \)) between the outliers and the leading true eigenvector, in the \(n<p\) case. The \(n<p\) case was selected because statistical inference is more challenging here than in the \(p<n\) caseFootnote 3. Additionally, this case is routinely met in the field of bioinformatics, where -omics data contain tens of thousands of variables (genes, single nucleotide polymorphisms, etc.) but only tens or at most hundreds of observations.

As observed in Tables 1, 2, 3, 4, 5 and 6, the average angular difference between the Cauchy and the PP PCA ranges from \(20^{\circ }\) up to more than \(50^{\circ }\), which is evidently quite substantial, providing evidence that Cauchy PCA has performed in a superior manner to the projection pursuit method of Croux et al. (2007, 2013). In particular, the tables demonstrate that Cauchy PCA is less error prone than PP PCA, but, as is seen in Tables 5 and 6, the error decreases for both methods with increasing sample size. Further, the mean angular difference between the two methods increases as the angle \(\phi \) increases. For instance, in Tables 1 and 2, when \(\kappa =8\) and \(\phi =0^{\circ }\) the difference between the two methods is \(20^{\circ }\), whereas when \(\phi =90^{\circ }\) the difference increases to \(48^{\circ }\).

With regard to the other two competitors, Cauchy PCA outperforms both of them in the \(n=100\), \(p=500\) case, as observed in Table 1, which presents the departure from the first (sample) eigenvector calculated using the uncontaminated data. Examination of the same case, but measuring the angular departure from the population eigenvector (Table 2), shows that Cauchy PCA performs worse than the other two competitors when the angle \(\phi \) is equal to \(0^{\circ }\) or \(30^{\circ }\). For larger values of \(\phi \), though, Cauchy PCA outperforms them. The same conclusion is drawn for all other cases.

Further, the error is not greatly affected by the angle \(\phi \) or by the norm of the outliers. It can be seen in Tables 3 and 5 that in the special case of \(\phi =90^{\circ }\) the error of the Cauchy PCA increases by between 2 and 3 degrees, thus corroborating the result of Corollary 2. However, this effect, as in Table 1, is rather small, though noticeable.

Table 5 Mean angular difference between the robust eigenvectors computed with the contaminated data and the sample eigenvector computed with uncontaminated data when \(n=500\) and \(p=1000\)
Table 6 Mean angular difference between the robust eigenvectors computed with the contaminated data and the population eigenvector, computed without any contamination present, when \(n=500\) and \(p=1000\)

4.2 High dimensional real datasets

Two real gene expression datasets, GSE13159 and GSE31161Footnote 4, downloaded from the Biodataome platform (Lakiotaki et al. 2018), were used in the experiments. The dimensions of the datasets were \(2096 \times 54,630\) and \(1035 \times 54,675\), respectively. We randomly selected 2000 variables and detected the outliers using the high-dimensional Minimum Diagonal Product (MDP) of Ro et al. (2015). In accordance with the simulation studies, we removed the \(2\%\) most extreme outliers detected by MDP and computed, from the uncontaminated data, the first classical PC (the benchmark eigenvector), the first Cauchy PCA eigenvector and the first PP PCA eigenvector. We then added those outliers back, with their norm increased by a factor of \(e^{\kappa }\), where \(\kappa =0, 3, 4, \ldots , 8\), and computed the first Cauchy PCA eigenvector and the first PP PCA eigenvector. In all cases, we subtracted the spatial median or the variable-wise median and scaled the variables by the mean absolute deviation. The performance metrics are the angle (in degrees) between the first robust (Cauchy or PP PCA) eigenvector and the first benchmark eigenvector, computed with the uncontaminated data, and the time required by each method. This procedure was repeated 200 times and the average results are graphically displayed in Fig. 1a–d.

Broadly speaking, PP PCA does not seem to be affected substantially by the centering method, i.e. subtraction of the spatial or the variable-wise median. On the contrary, Cauchy PCA is affected by the type of median employed. Centering with the spatial median yields high error levels for all norms of the outliers, for both datasets, whereas centering with the variable-wise median produces much lower error levels. On average, the difference in error between Cauchy PCA and PP PCA is about \(30^{\circ }\) for the GSE13159 dataset (Fig. 1a) and about \(14^{\circ }\) for the GSE31161 dataset (Fig. 1b). However, the error of the Cauchy PCA increases and then stabilizes in the GSE13159 dataset, whereas the error of the PP PCA is stable regardless of the norm of the outliers. A different conclusion is drawn for GSE31161, where the error of either method decreases as the norm of the outliers increases, until it reaches a plateau.

With regard to computational efficiency, PP PCA is not affected by either centering method, whereas Cauchy PCA seems to be affected in the GSE13159 dataset but not in the GSE31161 dataset, as seen in Fig. 1c, d. Cauchy PCA centered with the variable-wise median is, on average, 5 times faster than PP PCA.

Fig. 1

The first row presents the angle between the first Cauchy PC of the “contaminated” data and the leading eigenvector computed with the uncontaminated data, and the angle between the first Projection Pursuit PC of the "contaminated" data and the same leading sample eigenvector, computed without any contamination present, for increasing norms of the outliers. The second row contains the computation time in seconds

The reason why PCP and RRSVD were excluded from the examination with the real datasets is their high computational cost. Table 7 contains the time (in seconds) required by each algorithm for one run with an increasing number of variables, from 100 up to 5000, for each dataset. For the GSE13159 dataset with 2000 variables, the PCP method requires more than 3.5 h and RRSVD requires about 1.3 h. The figures were created using 200 replications; if we were to add these two competing methods and perform only 50 replications, for GSE13159 the PCP method would require \(8 \times 3.5 \times 50 = 1400\) h, i.e. about 58 days, and RRSVD would require an extra \(8 \times 1.29 \times 50 = 516\) h, or about 21.5 days.

Table 7 Time in seconds required by each algorithm for one run for various number of variables

5 Conclusion

The starting point for this paper is the observation that classical PCA can be formulated purely in terms of operations on a Gaussian likelihood. Although this observation is not new, the specifics of this formulation of classical PCA do not appear to be as widely known as might be expected. The novel idea underlying this paper is to formulate a version of PCA in which a Cauchy likelihood is used instead of a Gaussian likelihood, leading to what we call Cauchy PCA. Study of the resulting influence function shows that Cauchy PCA has very good robustness properties. Moreover, we have provided an implementation of Cauchy PCA which runs quickly and reliably. Numerous simulation and real-data examples, mainly in high-dimensional settings, show that, in terms of quality of performance, Cauchy PCA typically outperforms, or is on a par with, alternative robust versions of PCA whose implementations are in the public domain; and Cauchy PCA is typically far superior to all of these other methods in terms of computational cost in very high dimensional settings.