1 Introduction

Let \(X \in {\mathbb {R}}^{d \times N}\) be a matrix of data, with N observations in d dimensions. We denote the empirical d-dimensional mean vector and the empirical \(d \times d\) covariance matrix of X by \(\mu \) and \(\varSigma \) respectively, and make no other assumptions about the generation or structure of the data. In this paper we consider transformations of X of the form \(X_{A}=A(X - \mu ),\) where A is a \(d \times d\) matrix, with the aim of whitening the data \(X\). Data whitening is a transformation of the data intended to decorrelate and standardize the variables. Fully decorrelated data possesses a diagonal covariance matrix, and standardized data has unit variance for each variable (Hossain 2016). Applying a whitening transformation both decorrelates and standardizes the data, so that in the case of non-degenerate data, the covariance matrix of the whitened data \(X_{A}\) will be the identity matrix. By removing the simple elliptic structure of the data through such whitening transformations, we can uncover more interesting and complex structures that may previously have been hidden in correlations, such as clusters or outliers (Li and Zhang 1998). Furthermore, the orthogonality of whitened variables can reduce the computational cost and improve the performance of many statistical methods (Koivunen and Kostinski 1999; Huang et al. 2020; Zuber and Strimmer 2009).

Examples of existing whitening methods include the so-called Mahalanobis whitening defined by

$$\begin{aligned} X_{\varSigma ^{-1/2}} = \varSigma ^{-1/2}(X - \mu )\, , \end{aligned}$$

which is popularly used to whiten data before performing many classical methods of multivariate analysis (Kessy et al. 2018). The transformed data \(X_{\varSigma ^{-1/2}}\) has zero mean and covariance matrix equal to the \(d \times d\) identity matrix \(I_{d}\). The success of Mahalanobis whitening depends on the ability to compute \(\varSigma ^{-1/2}\) in a way that is both accurate and stable.

It is common for big, high-dimensional data to be close to degenerate/low-rank (Udell and Townsend 2019), yielding unstable computations of \(\varSigma ^{-1/2}\), with numerous examples of this problem observed in: recommender systems data (Zhou et al. 2008; Li et al. 2016); finance (Bai and Shi 2011); medicine (Schuler et al. 2016); genomics (Wu et al. 2015); and social networks (Liben-Nowell and Kleinberg 2007). These issues also arise in generalized mixture models (Xiao 2020); multiple regression (Healy 1968; Hoang and Baraille 2012); adaptive algorithms (Baktash et al. 2017); and linear discriminant analysis (Ye and Xiong 2006). This is because variables often possess (approximate) linear dependencies, resulting in a covariance matrix \(\varSigma \) that is singular, or very close to singular. The inverse of the covariance matrix therefore either does not exist or is numerically unstable, and it becomes inadvisable or impossible to calculate \(\varSigma ^{-1/2}\). Consequently, Mahalanobis whitening, and many other methods that directly rely on the inverse of the covariance matrix (such as those surveyed in Kessy et al. (2018)), are not recommended.

Nevertheless, it has been demonstrated that applying Mahalanobis whitening prior to clustering or outlier detection (to give just two out of many possible examples) often improves empirical performance. This has been observed in several practical examples (Zafeiriou and Laskaris 2008; Shi et al. 2015). Theoretically, Mahalanobis whitening underpins weighted least squares (Seber and Lee 2012), PCA (Jolliffe 1986; Hyvärinen and Oja 2000), canonical correlation analysis (Härdle and Simar 2007) and most of the array of classic multivariate statistics methods (Li and Zhang 1998; Malsiner-Walli et al. 2016). Crucially, decorrelated and standardized data greatly simplifies both theoretical and practical multivariate data analysis (Agostinelli and Greco 2019; Anaya-Izquierdo et al. 2011; Martens et al. 2003; Thameri et al. 2011; Chen et al. 2015; Yang and Jin 2006).

The need to find stable ways of calculating (or approximating) \(\varSigma ^{-1}\) in settings where X is (close to) degenerate is evidenced in other application domains. In high-dimensional examples one would often like to use the Mahalanobis distance (Mahalanobis 1936) to measure the ‘scatter’ or ‘spread’ of the data (Pronzato et al. 2017, 2018), as a basis for proximity-dependent techniques such as clustering (Zuanetti et al. 2019) and Approximate Bayesian Computation (ABC) (Akeret et al. 2015). But again, for the reasons outlined earlier, the covariance matrix may be singular or ill-conditioned. ABC is used to find estimates of distribution parameters by simulating data over a parameter space (informed by some prior) and finding simulated data closest to the observed data. The ideal measure of closeness is the Mahalanobis distance, but the practical need to use ABC is often driven by degeneracy of the observed data, rendering the construction of a computationally tractable distance measure one of the fundamental problems of applying ABC (Wegmann et al. 2009; Beaumont 2019; Prangle 2017).

Recent literature has shown that whitening can also be used to improve the training of neural networks (Huang et al. 2018). Normalization is often used in such training rather than whitening, due to ill-conditioned problems (Luo 2017) and the great expense of computing the inverse square root of a large covariance matrix (Ioffe and Szegedy 2015), despite whitening being preferable when it is possible (Huang et al. 2020).

Naturally, many methods attempt to circumvent the aforementioned problems by use of the Moore-Penrose pseudo-inverse \(\varSigma ^{-}\) of \(\varSigma \), taking the square root of \(\varSigma ^{-}\) if needed. However, in the case of high-dimensional data, it has been shown that the Moore-Penrose pseudo-inverse is not always suitable (Hoyle 2011; Bodnar et al. 2016; Bickel and Levina 2004), particularly when there are small eigenvalues. This is a problem appearing in several branches of mathematics, and work on this topic can be found in the statistics literature (which tends to use shrinkage-type estimators (Ledoit and Wolf 2004; Fisher and Sun 2011; Ito et al. 2015) and/or assume sparsity of the covariance matrix (Cai et al. 2011, 2016; Janková and van de Geer 2017)) and in the linear algebra/numerical analysis literature. In the latter, algorithms (alternating projections (Higham and Strabić 2016) or Newton-type (Qi and Sun 2011; Higham 2008)) to compute ‘substitute’ covariance matrices are developed. Challenges in the successful use of these algorithms include computational tractability, stability, and the ability to find good starting points. Our work differs from this in that we provide explicit formulae for the construction of our substitute to \(\varSigma ^{-1/2}\), and we are able to quickly generate a family of whitening matrices based on the order of our polynomials.

In this paper, we introduce polynomial whitening, and what we call the minimal-variance polynomial matrix, to be used in place of the square root of the inverse of the empirical covariance matrix. In view of the celebrated Cayley-Hamilton theorem (Cayley 1858; Hamilton 1853), the true inverse of a full-rank \(d \times d\) matrix \(\varSigma \) can be calculated through a \((d-1)\)-degree polynomial in \(\varSigma \). In Gillard et al. (2022), it is shown that an alternative to the inverse of a matrix can be found using low-degree polynomials. Our work follows on from this, as we consider polynomials of low degree in \(\varSigma \) to now provide an alternative to the square root of its inverse. Our method is applicable in cases where the true inverse square root of the covariance matrix does not exist, making it a viable alternative for degenerate and close-to-degenerate datasets. Parameter options also allow for a trade-off between data whitening accuracy and time complexity.

The main practical focus of this paper is in data whitening, but in view of the discussion above, we envisage other settings where our work may be useful. The structure of the paper is as follows. Section 2 introduces the form of our matrix polynomial, and the optimization problem we solve in order to obtain an alternative to the inverse square root of the covariance matrix. The main theorem of this paper which is studied in later examples is given in Sect. 2.3. We address different parameter choices in Sect. 2.4, and in Sect. 2.6 we discuss using our procedure in conjunction with random projection methods, which can be useful when dealing with very high-dimensional data. Examples applying our method to data whitening, outlier detection and dimension reduction are given in Sect. 3, before we conclude the paper in Sect. 4.

2 The minimal-variance polynomial

2.1 Covariance matrix of transformed data

The mean vector \(E(X_{A})\) and covariance matrix \({\mathscr {D}}(X_{A})\) of \(X_{A}=A(X-\mu )\) are respectively:

$$\begin{aligned}E(X_{A}) = 0_{d}\, , \,\,\,\,\, {\mathscr {D}}(X_{A}) = A\varSigma A^{\top }\, , \end{aligned}$$

where \(0_{d}\) is the \(d\)-dimensional vector of zeroes (Mathai and Provost 1992). Data transformed by Mahalanobis whitening, \(X_{\varSigma ^{-1/2}} = \varSigma ^{-1/2}(X-\mu )\), has covariance matrix

$$\begin{aligned} {\mathscr {D}}\left( X_{\varSigma ^{-1/2}}\right) = \varSigma ^{-1/2}\varSigma \varSigma ^{-1/2} = I_{d}\, , \end{aligned}$$

where \(I_{d}\) is the \(d \times d\) identity matrix. The total variation in \(X_{A}\) is given by \(\text {trace}\left( {\mathscr {D}}(X_A)\right) ~=~\text {trace}(A\varSigma A^{\top }),\) and for the Mahalanobis whitening the total variation in \(X_{\varSigma ^{-1/2}}\) is given by \(\text {trace}\left( {\mathscr {D}}\left( X_{\varSigma ^{-1/2}}\right) \right) =\text {trace}(I_{d})=~d\).

2.2 The minimal-variance polynomial alternative to \(\varSigma ^{-1/2}\)

Let \(\theta = \left( \theta _{0}, \theta _{1}, \dots , \theta _{k-1}\right) ^{\top }\) and \(\varSigma _{(k)}~=~\left( \varSigma ^{0}, \varSigma ^{1}, \dots , \varSigma ^{k-1}\right) ^{\top }\). We define \(A_{k}\) to be a \((k - 1)\)-degree matrix polynomial in \(\varSigma \), of the form:

$$\begin{aligned} A_{k} = \sum _{i = 0}^{k-1}\theta _{i}\varSigma ^{i} = \theta ^{\top }\varSigma _{(k)}\, . \end{aligned}$$
(1)

For a chosen integer \(k\) such that \(k - 1 < d\), our objective is to find the \(k\) coefficients of the matrix polynomial, denoted \(\theta =\left( \theta _{0}, \theta _{1}, \dots , \theta _{k-1}\right) ^{\top }\) in (1), so that the total variation of the transformed data \(X_{A_{k}}=A_{k}(X-\mu )\) is minimized, subject to suitable constraints. For further intuition as to why we minimize the total variation, see Gillard et al. (2022).

Let \(s_{j} = \text {trace}(\varSigma ^{j})\), \(S_{(i, k)} = \left( s_{i}, s_{i+1}, \dots , s_{i + k - 1}\right) \), and define the matrix

$$\begin{aligned}M_{(k)} = \left( \begin{matrix} s_{1} & s_{2} & \dots & s_{k} \\ s_{2} & s_{3} & \dots & s_{k + 1} \\ \vdots & \vdots & \ddots & \vdots \\ s_{k} & s_{k+1} & \dots & s_{2k -1} \end{matrix}\right) . \end{aligned}$$

We seek to minimize the total variation of the transformed data \(X_{A_{k}}=A_{k}(X-\mu )\), which is given by:

$$\begin{aligned} \text {trace}\left( {\mathscr {D}}(X_{A})\right)&= \text {trace}\left( A_{k}\varSigma A_{k}^{\top }\right) \\&= \text {trace}\left( \sum _{i = 0}^{k - 1}\theta _{i}\varSigma ^{i}\varSigma \sum _{j = 0}^{k - 1}\theta _{j}\varSigma ^{j}\right) \\&= \theta ^{\top } M_{(k)} \theta . \end{aligned}$$

To ensure non-trivial solutions to the minimization of the total variation, we introduce a constraint. There are a number of options for this constraint; here we consider constraints of the form

$$\begin{aligned} \text {trace}\left( A_{k}\varSigma ^{\alpha }\right) = \text {trace}\left( \varSigma ^{\alpha - 1/2}\right) \end{aligned}$$
(2)

for a scalar value \(\alpha \). This can be written in the above notation as

$$\begin{aligned} \theta ^{\top }S_{(\alpha , k)} = s_{\alpha - 1/2}. \end{aligned}$$

A constraint of this form ensures that the minimal-variance polynomial matrix \(A_{k}\) has similar qualities to \(\varSigma ^{-1/2}\), in the cases where this matrix exists. The constraint (2) will be revisited after the following theorem.

2.3 Constructing the minimal-variance polynomial

Theorem 1

Let \(X \in {\mathbb {R}}^{d \times N}\) be a \(d\)-dimensional dataset with \(N\) observations, having empirical mean \(\mu \) and empirical covariance matrix \(\varSigma \). For \(k - 1 < d\), the matrix polynomial \(A_{k} = \sum _{i=0}^{k-1}\theta _{i}\varSigma ^{i} = \theta ^{\top }\varSigma _{(k)}\) such that \(\text {trace}\left( {\mathscr {D}}(X_{A_{k}})\right) \) is minimized, subject to the constraint \(\theta ^{\top }S_{(\alpha , k)} = s_{\alpha - 1/2}\), has coefficients given by

$$\begin{aligned} {\hat{\theta }} = \frac{s_{\alpha - 1/2}}{S_{(\alpha ,\, k)}^{\top }M^{-1}_{(k)}S_{(\alpha ,\, k)}}M^{-1}_{(k)}S_{(\alpha ,\, k)}\, . \end{aligned}$$
(3)

Proof

We will minimize \(\frac{1}{2}\text {trace}({\mathscr {D}}(X_{A_{k}}))\) subject to the constraint (2), where the constant \(1/2\) is introduced to simplify calculations. The Lagrange function \({\mathscr {L}}(\theta , \omega )\) with Lagrange multiplier \(\omega \) is given by

$$\begin{aligned} {\mathscr {L}}(\theta , \omega ) = \frac{1}{2}\theta ^{\top }M_{(k)}\theta - \omega (\theta ^{\top }S_{(\alpha ,\, k)} - s_{\alpha - 1/2}). \end{aligned}$$

We minimize the Lagrange function by differentiating with respect to \(\theta \) and setting the result equal to 0, which gives:

$$\begin{aligned} M_{(k)}\theta = \omega S_{(\alpha ,\, k)} \end{aligned}$$

and we can therefore rearrange to find \(\theta \):

$$\begin{aligned} {\hat{\theta }} = \omega M^{-1}_{(k)}S_{(\alpha ,\, k)}. \end{aligned}$$
(4)

Write \(\omega = \omega _{k}\) for the value of the multiplier at the optimum. We can find the value of \(\omega _{k}\) by substituting (4) into the constraint \(\theta ^{\top }S_{(\alpha ,\, k)} = s_{\alpha - 1/2}\), giving

$$\begin{aligned} \omega _{k} = \frac{s_{\alpha - 1/2}}{S_{(\alpha ,\, k)}^{\top }M^{-1}_{(k)}S_{(\alpha ,\, k)}}. \end{aligned}$$

Thus, the vector of coefficients of the polynomial (1) which minimizes \(\text {trace}\left( {\mathscr {D}}(X_{A_{k}})\right) \) subject to the constraint (2) is given by (3). We call this polynomial the minimal-variance polynomial. \(\square \)

When evaluating the polynomial, we recommend forming the matrix powers by iteratively multiplying by \(\varSigma \), or using Horner’s method for polynomial evaluation. Both of these methods are outlined in Sect. 4.2 of Higham (2008).
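To make the construction concrete, the following NumPy sketch computes the coefficients (3) and evaluates \(A_{k}\) by Horner's scheme. It is a minimal illustration rather than the authors' implementation: we assume a symmetric positive semi-definite \(\varSigma \) whose eigenvalues can be computed, so that the traces of fractional powers of \(\varSigma \) needed for \(S_{(\alpha ,\, k)}\) when \(\alpha = 1/2\) reduce to sums of powers of eigenvalues; the function name min_var_polynomial is ours.

```python
import numpy as np

def min_var_polynomial(Sigma, k, alpha=0.5):
    """Coefficients (3) of the minimal-variance polynomial and the matrix A_k in (1).

    Sketch only: traces of (fractional) powers of Sigma are obtained from its
    eigenvalues, which is feasible for moderate dimensions d.
    """
    d = Sigma.shape[0]
    lam = np.clip(np.linalg.eigvalsh(Sigma), 0.0, None)      # guard against tiny negative values

    def s(p):                                                 # s_p = trace(Sigma^p)
        return float(d) if p == 0 else float(np.sum(lam[lam > 0] ** p))

    # M_(k) has (i, j) entry s_{i+j+1}; S_(alpha, k) = (s_alpha, ..., s_{alpha + k - 1})
    M = np.array([[s(i + j + 1) for j in range(k)] for i in range(k)])
    S = np.array([s(alpha + i) for i in range(k)])
    MinvS = np.linalg.solve(M, S)
    theta = (s(alpha - 0.5) / (S @ MinvS)) * MinvS            # equation (3)

    # Evaluate A_k = theta_0 I + theta_1 Sigma + ... + theta_{k-1} Sigma^{k-1} by Horner's scheme
    A = theta[k - 1] * np.eye(d)
    for i in range(k - 2, -1, -1):
        A = A @ Sigma + theta[i] * np.eye(d)
    return theta, A
```

The whitened data is then \(A_{k}(X - \mu )\); for larger \(d\) the polynomial can equally be applied to the centred data column by column, without ever forming \(A_{k}\) explicitly.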

2.4 How to choose parameter values in the minimal-variance polynomial

2.4.1 Choice of the parameter \(\alpha \) in the constraint (2)

We studied the outcomes of polynomial whitening with different values of \(\alpha \) in the constraint (2). Theoretically, any value of \(\alpha \) will produce an alternative whitening matrix. Empirical investigations showed that the polynomial with \(\alpha = 1/2\) performed particularly well, in terms of data whitening success, stability and computational cost. Using this value of \(\alpha \) is equivalent to applying the constraint

$$\begin{aligned} \text {trace}\left( A_{k}\varSigma ^{1/2}\right) = \text {trace}\left( I_{d}\right) =d. \end{aligned}$$
(5)

Our analysis found that, when using this constraint, data was approximately whitened using a relatively low value of \(k\) (when compared to the value of the dimension \(d\) of the dataset).

Figure 1a considers a 50-dimensional dataset, with 5 eigenvalues greater than 1, 30 eigenvalues between 0 and 1, and 15 zero eigenvalues. Figure 1b considers a 150-dimensional dataset, with 5 eigenvalues greater than 1, 100 eigenvalues between 0 and 1, and 45 zero eigenvalues. The eigenvalues of these datasets are given in Appendix 1.1. These eigenvalues have been chosen to create a degenerate example which the Moore-Penrose pseudo-inverse would struggle to deal with well. We did so by setting roughly \(d/3\) eigenvalues equal to zero, and letting the nonzero eigenvalues taper towards zero, making the rank of the dataset unclear. We plot in blue dots the nonzero eigenvalues of the dataset on the horizontal axis, and the reciprocal square root of the nonzero eigenvalues on the vertical axis. We then plot the corresponding minimal-variance polynomial with \(\alpha = 1/2\) in degree \(k\) as follows. Find the coefficients \(\theta = (\theta _{0}, \theta _{1}, \dots , \theta _{k-1})^{\top }\) of the minimal-variance polynomial using (3), and write the polynomial as in (1), replacing the matrix \(\varSigma \) with a symbol \(t\):

$$\begin{aligned} p(t) = \theta _{0}t^{0} + \theta _{1}t^{1} + \dots + \theta _{k-1}t^{k-1}. \end{aligned}$$

We can then plot this polynomial \(p(t)\) for different values of \(t\). In Fig. 1, we consider polynomials of degree 3, 4 and 5 (recalling that using the value \(k\) results in a \((k-1)\)-degree polynomial). The degrees of these polynomials are much lower than the dimensionality of the datasets, yet they still provide a good approximate fit to the inverse square root of the given eigenvalues.

The constraint (5) with \(\alpha = 1/2\) works well in the case of non-degenerate data (when \(\varSigma \) is essentially non-singular), but requires some simple tuning for degenerate or nearly-degenerate data, which has been applied here. This tuning will be discussed in Sect. 2.5.

Fig. 1

The minimal-variance polynomial fit to simulated eigenvalues (blue, values given in Appendix 1.1) for datasets with a 50 dimensions and b 150 dimensions. Parameters used are \(\alpha = 1/2\) and \(k = 4\) (red, dotted line), \(k = 5\) (green, dash-dot line) and \(k = 6\) (orange, dashed line)

2.4.2 Choice of the parameter \(k\) to determine the degree of the minimal-variance polynomial

The true inverse square root of a full-rank covariance matrix can be written as a \((d-1)\)-degree polynomial using the characteristic polynomial. The minimal-variance polynomial with parameter \(k\) forms a \((k-1)\)-degree polynomial, as defined in (1). As \(k\) increases, the polynomial can approximate the inverse square root of the covariance matrix more accurately, but is more costly to compute. Therefore, the choice of the parameter \(k\) is often a trade-off between accuracy and cost.

However, as \(k\) increases, so does the opportunity for instability in the polynomial, particularly when working in high dimensions (see Table 3 for an example of this, and Sect. 3.1 for a fuller discussion of this topic). As such, keeping \(k\) relatively low is beneficial not only for cost, but for stability. Furthermore, choosing low values of \(k\) results in a good approximation to the inverse square root of the covariance matrix, as can be seen from the polynomial fit to the eigenvalues in Fig. 1. This will be further demonstrated in the numerical examples in Sect. 3.

In this paper, we often run the same experiments multiple times with different values of \(k\), and use a problem-specific metric to identify the best value of \(k\) for that dataset. For example, in Sect. 3.1, we use the Wasserstein metric to compare the whitened data to the standard normal distribution, as well as a sum-of-squares-based metric, and choose the value of \(k\) which produces the lowest values of these metrics. This is similar to the approach taken with many other parameterized methods, such as using scree plots or silhouette scores to judge the best number of clusters to use in a clustering algorithm. It may often be best to apply minimal-variance whitening to the dataset for multiple values of \(k\), and then inspect the empirical covariance matrix of the transformed data to see which value of \(k\) has performed best.

2.5 Constraint adjustment for rank-deficient data

We provide an adaptation to the method when using \(\alpha = 1/2\) to make it suitable for use when \(\varSigma \) is singular, as hinted at in discussions about the choice of \(\alpha \). When using the constraint (5) with \(\alpha = 1/2\), the polynomial aims to ensure that the trace of \(A_{k}\varSigma ^{1/2}\) is equal to \(d\). We propose that this trace should aim to equal \(r\), the rank of the covariance matrix, in a similar way to the Moore-Penrose pseudo-inverse \(\varSigma ^{-}\) having the property that \(\text {trace}\left( (\varSigma ^{-})^{1/2}\varSigma ^{1/2}\right) =~r\). However, for matrices with many small eigenvalues, \(r\) is hard to calculate (Vidal and Favaro 2014), and often approximations of \(r\) are based on arbitrary eigenvalue thresholding or subjective elbow plots (Kishore Kumar and Schneider 2017).

In cases where \(\varSigma \) is not full-rank, we propose an adjustment to modify our constraint (2) without the need to calculate \(r\) directly. We will illustrate our adjustment using the two examples in Fig. 2, which plot the nonzero eigenvalues and nonzero reciprocal square root eigenvalues in the same way as in Fig. 1. The datasets used in Fig. 2 are the same as those in Fig. 1, details of which are given in Sect. 2.4 and Appendix 1.1.

Fig. 2

The minimal-variance polynomial with \(k = 5\) fit to eigenvalues (blue, the same as in Fig. 1) before (red, dashed line) and after (green, solid line) adjustment for rank-deficiency, as described in Sect. 2.5

We first calculate the minimal-variance polynomial using the constraint (5). This is shown in Figs. 2a and b as the red, dashed line. Although the polynomials take the correct shape, they are clearly placed too high and do not fit the plot of the inverse square root eigenvalues. We adjust the polynomial by multiplying by some constant \(c\) between 0 and 1, which ensures a better fit of the polynomial. Our method of choosing a value of \(c\) is as follows.

Let \(\varLambda = \{\lambda _{1}, \dots , \lambda _{d}\}\) be the set of all eigenvalues of \(\varSigma \), and let \({\tilde{\varLambda }} = \{\lambda _{i} \in \varLambda : \lambda _{i} \ne 0\}\) be the set of all nonzero eigenvalues of \(\varSigma \). In the case of very large dimensions \(d\), computation of the eigenvalues \(\lambda _i\) is certainly out of reach; in this case, as will be discussed in Sect. 2.6, we suggest projecting the data to a low-dimensional space and using the set of eigenvalues of the low-dimensional version of the data. The constant \(c\) can be found in any number of ways that minimize the distance between the polynomial \(p(\lambda )\) and the target values \(1/\lambda ^{1/2}\), for \(\lambda \in {\tilde{\varLambda }}\). We use the value \(c^{*}\) from

$$\begin{aligned} c^{*} = \mathop {\mathrm {arg\,min}}\limits _{c \in (0, 1]} \sum _{\lambda \in {\tilde{\varLambda }}}w(\lambda ) [c\cdot p(\lambda ) - \lambda ^{-0.5}]^{2} \end{aligned}$$

where \(w(\lambda )\) is a suitable weight function. That is, we seek to minimize the weighted sum of squares between the polynomial and the reciprocal square root of the nonzero eigenvalues. The optimal value of the adjustment constant \(c\) is then found to be

$$\begin{aligned} c^{*} = \frac{\sum _{\lambda \in {\tilde{\varLambda }}} w(\lambda ) \lambda ^{-0.5}p(\lambda )}{\sum _{\lambda \in {\tilde{\varLambda }}}w(\lambda )p(\lambda )^{2}}. \end{aligned}$$

In Fig. 2 (and all other examples in this paper), we have used \(w(\lambda ) = \lambda \), and in general we recommend this. However, the choice of \(w\) can be altered to give a different fit for the data given. If the user is more concerned about fitting the polynomial to the larger eigenvalues, they may decide to use \(w(\lambda ) = \lambda ^i\) with \(i > 1\), for example.
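As an illustration, the adjustment constant can be computed with a few lines of NumPy. This is a sketch under the default choice \(w(\lambda ) = \lambda \); the function name and the tolerance used to discard numerically zero eigenvalues are our own.

```python
import numpy as np

def rank_adjustment(theta, lam, w=lambda x: x, tol=1e-10):
    """Weighted least-squares constant c* of Sect. 2.5.

    theta : coefficients (theta_0, ..., theta_{k-1}) of the minimal-variance polynomial
    lam   : eigenvalues of Sigma (or of a low-dimensional surrogate)
    w     : weight function; w(lambda) = lambda by default
    """
    lam = np.asarray(lam, dtype=float)
    lam = lam[lam > tol]                          # keep only the (numerically) nonzero eigenvalues
    p = np.polyval(theta[::-1], lam)              # p(lambda); np.polyval wants highest degree first
    weights = w(lam)
    return np.sum(weights * lam ** -0.5 * p) / np.sum(weights * p ** 2)
```

The adjusted whitening matrix is then \(c^{*}A_{k}\).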

The adjusted polynomials (given by the green solid line) clearly fit the desired points much more successfully than the original polynomials. However, if this adjustment is not performed, the data transformed by the polynomial whitening matrix \(A_{k}\) will still be approximately isotropic, so the adjustment is not necessary if equal variance of the variables is sufficient. This adjustment has been applied to all examples that follow in this paper.

This adjustment to the constraint can also be used to detect the singularity of a matrix. Let us first consider the case with \(d < N\). If \(\varSigma \) is full rank, and \(k\) is chosen appropriately, the value \(c^{*}\) will be equal to (or very close to) 1, as the minimal-variance polynomial is aiming to make \(\text {trace}\left( A_{k}\varSigma ^{1/2}\right) = d\), which is correct in the case of full-rank \(\varSigma \). If the matrix \(\varSigma \) is not full-rank, \(c^{*}\) will be less than 1. To illustrate this, Table 1 gives two \(d < N\) examples. A \(d\)-dimensional dataset with \(N\) observations is generated from a covariance matrix of rank \(R\). Further details on how these datasets were generated are given in Appendix 1.1. The empirical covariance matrix of the dataset has rank \(r = \text {min}(d, N, R)\), and is used to find the minimal-variance polynomial matrix with \(k = 10\); the resulting constraint adjustment \(c^{*}\) is given. In dataset 1, the empirical matrix has full rank \(r = d\), so \(c^{*} = 1\). In dataset 2, the ‘true’ covariance matrix has rank \(R = 50\), \(d = 100\) and \(N = 1000\), therefore the empirical covariance matrix has rank \(r = \text {min}(d, N, R) = 50\). This produces a constraint adjustment value of \(c^{*} = 0.50 < 1\), so we know the empirical covariance matrix \(\varSigma \) is singular.

We now consider cases with \(d \ge N\) through the use of three examples. Dataset 3 in Table 1 has 100 dimensions and only 50 observations. The ‘true’ covariance matrix used to generate this dataset is full-rank \(R = 100\), but the empirical covariance matrix has rank \(r = \text {min}(d, N, R) = 50\). Therefore, using the empirical covariance matrix in the minimal-variance polynomial matrix gives adjustment value \(c^{*} = 0.50\), informing us that this dataset is degenerate. Dataset 4 also has \(d = 100\), \(N = 50\), and the ‘true’ covariance matrix now has rank \(R = 50\). The adjustment value is therefore less than 1: \(c^{*} = 0.50\). The final example we consider again has dimension \(d = 100\) and number of observations \(N = 50\), but the ‘true’ covariance matrix has rank \(R = 30\). The empirical covariance matrix therefore has rank \(r = \text {min}(d, N, R) = 30\), and the adjustment value is \(c^{*} = 0.30\). In all these examples, \(c^{*} < 1\), as the empirical covariance matrix \(\varSigma \) will never be full-rank in \(d \ge N\) examples.

Table 1 The adjustment value \(c^{*}\) for different configurations of the dimension \(d\), number of observations \(N\), rank of true population covariance matrix \(R\) and rank of sample covariance matrix \(r\)

2.6 Applications to extremely high-dimensional data

Given a dataset \(X\) with extremely high dimension \(d\), say \(d = 1,000,000\), finding the minimal-variance polynomial matrix can be too costly and time-intensive. We can instead sample some variables from \(X\) to produce a ‘representative’ dataset \({\tilde{X}}\) in a much smaller dimension \({\tilde{d}}\). This representative dataset can be found through random samples of the variables in \(X\), or projection to a lower-dimensional space (see Bingham and Mannila (2001), Blum et al. (2014)). We can proceed with calculating the covariance matrix \({\tilde{\varSigma }}\) of \({\tilde{X}}\), and use \({\tilde{\varSigma }}\) to produce the minimal-variance polynomial alternative to \({\tilde{\varSigma }}^{-1/2}\):

$$\begin{aligned} \tilde{A_{k}} = \theta _{0}I + \theta _{1}{\tilde{\varSigma }} + \dots + \theta _{k-1}{\tilde{\varSigma }}^{k-1}\, . \end{aligned}$$
(6)

We can then replace the \({\tilde{d}}\)-dimensional matrix \({\tilde{\varSigma }}\) in (6) with the \(d\)-dimensional covariance matrix \(\varSigma \) to obtain the minimal-variance polynomial matrix \(A_{k}\). This can be used to whiten the original large dataset \(X\), and is much cheaper than finding the minimal-variance polynomial matrix directly.
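A sketch of this procedure is given below, reusing the min_var_polynomial routine from the sketch in Sect. 2.3; the subsample size d_tilde and the choice of a random subset of variables (rather than a random projection) are illustrative assumptions. Once the coefficients are fitted on the surrogate, the polynomial is applied to the centred data by Horner's scheme, so \(A_{k}\) itself never has to be stored; forming the full covariance matrix may of course remain the bottleneck in the most extreme dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def whiten_via_subsample(X, k, d_tilde=200, alpha=0.5):
    """Sect. 2.6 sketch: fit theta on a low-dimensional surrogate, then whiten the full data.

    X is d x N (variables in rows). Relies on min_var_polynomial from the earlier sketch.
    """
    d = X.shape[0]
    idx = rng.choice(d, size=min(d_tilde, d), replace=False)   # random subset of variables
    Sigma_tilde = np.cov(X[idx, :])
    theta, _ = min_var_polynomial(Sigma_tilde, k, alpha)       # coefficients from the surrogate

    # Apply A_k = sum_i theta_i Sigma^i to the centred data by Horner's scheme,
    # using the full covariance matrix Sigma as in (6) without forming A_k itself.
    Sigma = np.cov(X)
    Xc = X - X.mean(axis=1, keepdims=True)
    Y = theta[k - 1] * Xc
    for i in range(k - 2, -1, -1):
        Y = Sigma @ Y + theta[i] * Xc
    return Y
```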

For large datasets, it may be that we do not know the eigenvalues exactly, but can approximate the distribution of the eigenvalues. If this is the case, we can sample \({\tilde{d}}\) eigenvalues from this distribution using the inverse cumulative distribution function. We will illustrate this using the Marchenko-Pastur distribution, as this distribution is known to model the eigenvalues of the sample covariance matrix of a random matrix as \(d, N \rightarrow \infty \). Figure 3 considers an example with \(d=10,000\) and \(N = 15,000\), and the probability density function (PDF) of the Marchenko-Pastur distribution with these parameters is shown by the red line. The histogram represents a random sample of 300 eigenvalues, and shows that such a sample models the distribution well.
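The sketch below samples eigenvalues by numerically inverting the Marchenko-Pastur CDF. It assumes unit-variance entries, so that the distribution is parameterized only by the aspect ratio \(q = d/N\); the grid-based inversion is our own simplification, not the procedure used for Fig. 3.

```python
import numpy as np

def marchenko_pastur_pdf(x, q, sigma2=1.0):
    """Marchenko-Pastur density for aspect ratio q = d/N (0 < q <= 1) and entry variance sigma2."""
    a = sigma2 * (1 - np.sqrt(q)) ** 2
    b = sigma2 * (1 + np.sqrt(q)) ** 2
    pdf = np.zeros_like(x)
    inside = (x > a) & (x < b)
    pdf[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) / (2 * np.pi * q * sigma2 * x[inside])
    return pdf

def sample_mp_eigenvalues(n_samples, q, sigma2=1.0, grid_size=10_000, seed=0):
    """Sample eigenvalues by numerical inversion of the Marchenko-Pastur CDF."""
    a = sigma2 * (1 - np.sqrt(q)) ** 2
    b = sigma2 * (1 + np.sqrt(q)) ** 2
    x = np.linspace(a, b, grid_size)
    cdf = np.cumsum(marchenko_pastur_pdf(x, q, sigma2))
    cdf /= cdf[-1]                                     # normalize the numerical CDF
    u = np.random.default_rng(seed).uniform(size=n_samples)
    return np.interp(u, cdf, x)                        # inverse-CDF lookup by interpolation

# Example matching Fig. 3: d = 10,000 and N = 15,000, with 300 sampled eigenvalues
eigenvalue_sample = sample_mp_eigenvalues(300, q=10_000 / 15_000)
```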

Similar methods can be used in the case where \(d \ge N\). We can reduce \(d\) to a value smaller than \(N\) by sampling the real eigenvalues, if they are known or can be calculated. Alternatively, we can sample the eigenvalues from an assumed distribution as described above.

Fig. 3

Sampling of eigenvalues from the Marchenko-Pastur distribution. The red line indicates the Marchenko-Pastur PDF, when \(d = 10,000\) and \(N = 15,000\). The histogram shows the spread of the 300 sampled values from this distribution

3 Numerical examples

3.1 Whitening data using minimal-variance polynomials

3.1.1 Data with \(d < N\)

We begin the numerical examples by whitening several synthetic and real datasets using the minimal-variance polynomial. The details of these datasets are given in Table 2. The four synthetic datasets (D1, D2, D3, D4) are sampled from a Gaussian distribution \({\mathscr {N}}_{d}(0, \varSigma )\) with \(N = 5\times d\) observations, where the covariance matrices \(\varSigma \) are produced as follows. Generate \(d\) eigenvalues \(\varLambda = \{\lambda _{1}, \lambda _{2}, \dots , \lambda _{d}\}\) from the Wishart distribution and produce a random \(d\times d\) orthogonal matrix \(Q\). Let \(L\) be the matrix with the eigenvalues \(\varLambda \) on the diagonal and zeroes elsewhere, then let \(\varSigma = Q^{\top }LQ\).

The real datasets ‘Digits’, ‘Musk’ and ‘HAR’ (Human Activity Recognition) (Anguita et al. 2013) were obtained from the UCI Machine Learning repository (Dua and Graff 2017). The ‘MNIST’ dataset (LeCun et al. 2010) was obtained from the OpenML database (Vanschoren et al. 2013).

Table 2 Datasets used in Sect. 3.1.1, their dimension \(d\) and number of observations \(N\)

In some cases, it can be beneficial to rescale the data so that each variable has zero mean and unit variance before finding the minimal-variance polynomial matrix. If rescaling the data provides less extreme eigenvalues in the covariance matrix, this scaling is likely to improve the performance of the polynomial whitening. The heatmaps in Fig. 4 show the covariance matrices of the datasets, and the distribution of the eigenvalues of these covariance matrices is given in Appendix 1.2. Figure 4 shows many nonzero off-diagonal values in the heatmaps, indicating that these datasets are highly correlated.

Fig. 4

Heatmaps of the covariance matrix of each dataset detailed in Table 2 before minimal-variance polynomial whitening. Datasets corresponding to Figures (a), (e), (f) and (h) are scaled to have unit variance, to improve performance of polynomial whitening. These heatmaps show the covariance matrix after this scaling

We can measure the proximity of the transformed data \(X_{A_{k}} \sim {\mathscr {N}}_{d}(0, {\mathscr {S}})\) to the standard normal distribution \({\mathscr {N}}_{d}(0, I)\) using the Wasserstein metric (Givens and Shortt 1984):

$$\begin{aligned} W(X_{A_{k}}) = \left( d + \text {trace}({\mathscr {S}}) - 2\text {trace}\left( {\mathscr {S}}^{1/2}\right) \right) /d, \end{aligned}$$
(7)

where we divide by \(d\) here to account for the difference in the dimensions of each dataset.
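Since \(\text {trace}({\mathscr {S}})\) and \(\text {trace}({\mathscr {S}}^{1/2})\) are the sums of the eigenvalues of \({\mathscr {S}}\) and of their square roots, (7) can be computed from a single eigendecomposition of the empirical covariance of the whitened data, as in the following sketch (the function name is ours).

```python
import numpy as np

def wasserstein_score(X_white):
    """Scaled Wasserstein distance (7) between N(0, S) and N(0, I_d).

    X_white is the whitened data, d x N (variables in rows).
    """
    S = np.cov(X_white)
    d = S.shape[0]
    lam = np.clip(np.linalg.eigvalsh(S), 0.0, None)     # eigenvalues of the empirical covariance
    return (d + lam.sum() - 2.0 * np.sqrt(lam).sum()) / d
```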

The heatmaps in Fig. 5 show the covariance matrix of the transformed data \(X_{A_{k}} = A_{k}X\) for each dataset, illustrating that the correlations between variables have been approximately removed. The value of \(k\) used in these heatmaps is chosen as the value of \(k\) which gives the lowest Wasserstein score, as given in Table 3. The Wasserstein scores in Table 3 show that, in general, as the value of \(k\) increases, the transformed data is closer to the standard normal distribution, as desired. In some cases, such as the Musk dataset, higher values of \(k\) begin to show an increase in the Wasserstein score, indicating the whitening transformation is less successful than when using lower values of \(k\). This is likely due to numerical instability, as the minimal-variance polynomial aims to fit itself to extremely small eigenvalues, causing erratic behaviour in the polynomial. As such, it is recommended to use lower values of \(k\), which provide a more reliable alternative to the inverse square root of the covariance matrix, or to compute several minimal-variance polynomial matrices for different \(k\) and use the one that best satisfies some metric, such as the Wasserstein score.

Fig. 5

Heatmaps of the covariance matrix of the datasets in Table 2 after minimal-variance polynomial whitening. The value of \(k\) used in constructing the minimal-variance polynomial is given in the caption for each dataset

Table 3 The Wasserstein scores (7), denoted \(W_{A_{k}X}\), which measure the distance between the polynomial-whitened dataset \(A_{k}X\) and the standard normal distribution \({\mathscr {N}}(0, I)\) for each dataset

As indicated by the Wasserstein scores in Table 3 and the heatmaps in Fig. 5, we produce an effective alternative to the inverse square root of the covariance matrix using a polynomial of degree significantly lower than the dimension of the dataset.

The Wasserstein measure concerns itself with the diagonal values of the covariance matrix, as it is calculated using traces. We can consider it as a measure of standardization, rather than whitening. We therefore also need to measure the extent to which the data has been decorrelated. The heatmaps in Fig. 5 show that the off-diagonal entries of the covariance matrix of the transformed data are close to zero, indicating good decorrelation. Another way we can measure this is by considering the sum of squares of the off-diagonal entries of the covariance matrix of the transformed data. In Table 4, let \(SS_{A_{k}X}\) be the sum of squares of the off-diagonal entries of the covariance matrix of the whitened dataset \(A_{k}X\).
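This decorrelation measure is straightforward to compute alongside the Wasserstein score; a sketch with our own function name is given below.

```python
import numpy as np

def off_diagonal_ss(X_white):
    """Sum of squares of the off-diagonal entries of the empirical covariance of X_white (d x N)."""
    S = np.cov(X_white)
    return np.sum(S ** 2) - np.sum(np.diag(S) ** 2)
```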

The sum of squares values in Table 4 decrease as \(k\) increases, until a certain value of \(k\), much like the Wasserstein scores. Given that we would like this value to be as small as possible, we see the value of \(k\) that gives the optimum sum of squares value for each dataset is close to the value of \(k\) that gives the optimum Wasserstein score for that dataset. Therefore, when the data has been successfully standardized, it has also been decorrelated well.

Table 13 in Appendix 3 shows the average time taken to produce the minimal-variance polynomial matrices for each dataset for each value of \(k\) considered, over 100 runs. The time taken increases as the dimensionality \(d\) of the dataset increases, and as the parameter \(k\) increases. However, the procedure to calculate the matrices only takes a matter of seconds, even for 1000-dimensional datasets. This time performance could be improved much further by implementing parallel computing methods.

Table 4 The sum of squares, denoted \(SS_{A_{k}X}\), of the off-diagonal values of the covariance matrix of the polynomial-whitened dataset \(A_{k}X\) for each dataset

3.1.2 Data with \(d \ge N\)

It is increasingly common for data to have higher dimensionality than number of observations in many fields, such as genetic microarrays, medical imaging and chemometrics (Hall et al. 2005). Such data is clearly rank-deficient, with \(r \le N < d\), and thus the sample covariance matrix of such data is always singular, rendering many multivariate data analysis methods unusable, including data whitening. Minimal-variance polynomial whitening is applicable in such cases, as illustrated by the following examples.

We consider four synthetically generated datasets and four real datasets, detailed in Table 5. The first two synthetic datasets, E1 and E2, are sampled from a Gaussian distribution \({\mathscr {N}}_{d}(0, \varSigma )\), where the covariance matrices \(\varSigma \) are produced as follows. As in the \(d < N\) case, we generate \(d\) eigenvalues and produce a random \(d\times d\) orthogonal matrix \(Q\). Let \(L\) be the matrix with the eigenvalues \(\varLambda \) on the diagonal and zeroes elsewhere, then let \(\varSigma = Q^{\top }LQ\). The third synthetic dataset, E3, is generated to replicate the example in Wang and Fan (2017): a multivariate Gaussian whose population covariance matrix is diagonal with entries \(50, 20, 10\) followed by 47 entries equal to 1. This creates a spiked eigenvalue model, which is of interest in high-dimension, low-sample-size (HDLSS) datasets (Aoshima and Yata 2018). The fourth dataset uses a covariance matrix with eigenvalues generated from a random uniform distribution between 0 and 1, to produce a non-sparse set of eigenvalues. The madelon dataset was obtained from the UCI Machine Learning Repository (Dua and Graff 2017). The raw madelon dataset has 4400 observations, greater than the 500 features, so we sampled only the first 250 observations to create a madelon dataset with \(d > N\). The yeast dataset is a real genomic dataset with 2284 features and 17 observations (Tavazoie et al. 1999; Vanschoren et al. 2013). The third real dataset is a genomic dataset on colon cancer data (Alon et al. 1999), used by Yata and Aoshima (2013) as an example of a spiked eigenvalue model. This dataset includes two clusters which represent tumorous and non-tumorous colons; we only consider the former cluster here. The DB-emails dataset is a ‘bag-of-words’ representation of a collection of emails (Filannino 2011). Note that the madelon, yeast and colon datasets have been scaled to have unit variance. The empirical eigenvalues of all datasets are given in Appendix 1.2.

Table 5 Datasets used in Sect. 3.1, their dimension \(d\) and number of observations \(N\)

Successful whitening of these datasets would result in a covariance matrix with \(r\) eigenvalues equal to 1, and \(d - r\) eigenvalues equal to 0. We performed Moore-Penrose whitening on the datasets in Table 5 by pre-multiplying the data by the Moore-Penrose inverse of the square root of the covariance matrix. We also performed minimal-variance polynomial whitening on the datasets as described in Sect. 2.

Figure 6 compares the distribution of the eigenvalues of the covariance matrices after Moore-Penrose whitening and minimal-variance polynomial whitening. The eigenvalues are scaled such that the maximum eigenvalue is equal to 1. The first three synthetic datasets show that using minimal-variance whitening returns a dataset with eigenvalues only equal to 0 and 1, whereas using Moore-Penrose whitening gives a dataset with a spread of eigenvalues between 0 and 1. Figure 6d shows that minimal-variance whitening may not achieve perfect whitening, but that it is still more successful than the Moore-Penrose whitening method.

Figure 6f considers the Yeast dataset, and shows that both Moore-Penrose whitening and minimal-variance whitening return only eigenvalues of value 0 or 1. However, the Moore-Penrose whitening gives one eigenvalue equal to 1, and the rest 0 (or very close to 0). When using the minimal-variance whitening, the whitened dataset has the same rank as the original dataset (\(r = 16\) in this case). The madelon, colon and DB-emails datasets are not whitened perfectly by either method, but the eigenvalues are much more dispersed when using Moore-Penrose whitening than when using minimal-variance polynomial whitening, whereas ideally we seek eigenvalues equal only to 0 and 1.

Fig. 6

Log-scale histograms, showing the eigenvalues of the covariance matrix after the data has been whitened by Moore-Penrose (MP) whitening (orange histogram), and minimal-variance polynomial (MV) whitening with \(k = 9\) (green histogram), for each of the datasets given in Table 5

3.2 Comparison to other whitening methods

Due to rotational freedom, there are infinitely many whitening matrices of the form \(W = Q\varSigma ^{-1/2}\), where \(Q\) is orthogonal and satisfies \(Q^{\top }Q = I_{d}\) (Kessy et al. 2018).

Let us define some decompositions of the covariance matrix \(\varSigma \), beginning with \(\varSigma = V^{1/2}PV^{1/2}\), where \(V\) is the diagonal variance matrix and \(P\) is the correlation matrix. Let \(\varSigma = U\varDelta U^{\top }\) be the eigendecomposition of the covariance matrix, with \(U\) the matrix of eigenvectors and \(\varDelta \) the diagonal matrix of eigenvalues. Analogously, define the eigendecomposition \(P = GOG^{\top }\) of the correlation matrix. We also define the Cholesky decomposition of the inverse covariance matrix \(LL^{\top } = \varSigma ^{-1}\), when \(\varSigma ^{-1}\) exists.

Five whitening procedures are identified by Kessy et al. (2018) to be unique in fulfilling a given objective function. Most of the objective functions used in that paper are based on the cross-covariance matrix \(\varPhi \) and the cross-correlation matrix \(\varPsi \) between the original data \(X\) with covariance \(\varSigma \) and the whitened data \(X_{W}\):

$$\begin{aligned} \varPhi&= \text {cov}(X_{W}, X) = W\varSigma , \\ \varPsi&= \text {corr}(X_{W}, X) = \varPhi V^{-1/2}. \end{aligned}$$

In the following example, we will compare polynomial whitening to the three procedures from Kessy et al. (2018) that share our goal of whitening the data while changing as little else as possible. We do not consider the other methods from that paper, as those methods aim to maximize compression of variance into the first few variables of the whitened data. Although polynomial whitening performed relatively well in these scenarios, this is not the aim of our method. The three types of whitening we will consider alongside polynomial whitening are given below.

Mahalanobis whitening (MW) \(\varvec{W} = \varvec{\varSigma }^{-1/2}\). Mahalanobis whitening is found to be the unique whitening procedure which maximizes \({{\textbf {trace}}(\varPhi )}\), the average cross-covariance between each variable of the original and the newly transformed data. This is equivalent to minimizing the total squared distance between the original data \(X\) and the whitened data \(X_{W}\), ensuring the whitened data is as similar as possible to the original data.

Mahalanobis-cor whitening (MCW) \(\varvec{W} = \varvec{P}^{-1/2}\varvec{V}^{-1/2}\). Mahalanobis whitening can be affected by the differences in the scales of variables. To avoid this issue we may use a scale-invariant version, known as Mahalanobis-correlation whitening. The Mahalanobis-correlation whitening method maximizes the cross-correlation \({{\textbf {trace}}(\varPsi )}\) between each variable of the standardized original data \(V^{-1/2}X\) and the whitened data \(X_{W}\). Doing this is shown to be equivalent to minimizing the squared distance between \(V^{-1/2}X\) and \(X_{W}\).

Cholesky whitening (CW) \(W = L^{\top }\). Cholesky whitening is the only whitening procedure fulfilling the constraint of producing lower-triangular cross-covariance and cross-correlation matrices with positive diagonal entries. It does not result from fulfilling an objective function like the above methods, but rather from satisfying this constraint.

We evaluate the performance of these different whitening procedures by applying them to a dataset and considering the different objective functions in \(\varPhi \) and \(\varPsi \). First, as in Kessy et al. (2018), we apply the whitening methods to the \(4\)-dimensional Iris dataset (Fisher et al. 1936), with results given in Table 6. Given the dataset’s low dimension and well-conditioned covariance matrix, polynomial whitening (PW in the table) with \(k = d = 4\) produces exactly the same results as Mahalanobis whitening. We also perform polynomial-cor whitening (PCW), where the data is standardized and polynomial whitening is performed using the correlation matrix \(P\). This produces the same results as Mahalanobis-cor whitening.

Table 6 A comparison of different whitening methods applied to the Iris dataset, using metrics identified by Kessy et al. (2018)

The polynomial whitening method is more effectively used when applied to higher dimensional datasets with singular or near-singular covariance matrices. As such, we repeat the above exercise with a different dataset. For the purposes of this example, we are unable to use a dataset which has a singular covariance matrix, as the Mahalanobis and Cholesky whitening methods are not usable in this case. We use the Wisconsin Breast Cancer dataset (Wolberg et al. 1992), which we have pre-standardized to give improved results from all methods. This dataset has dimension \(d = 32\) and has a covariance matrix which could be considered ill-conditioned (see Appendix 1.4 for details on the eigenvalues). Table 7 shows that polynomial whitening outperforms Mahalanobis whitening, using both the covariance and correlation matrix.

Table 7 A comparison of different whitening methods applied to the Wisconsin Breast Cancer dataset, using metrics identified by Kessy et al. (2018)

3.3 The effect of different pre-processing methods on outlier detection algorithms

Outlier detection algorithms often require that data is pre-processed before the algorithm can be applied. It has been shown by Campos et al. (2016) that the normalization of datasets will often lead to a better performance of outlier detection algorithms.

Here we replicate a study described in Kandanaarachchi et al. (2020). The authors produced a collection of labelled benchmark datasets to be used for evaluating outlier detection algorithm performance. They evaluated the performance of various algorithms when used after applying different normalization methods to these datasets. Performance of an algorithm was measured using the area under the Receiver Operator Characteristic (ROC) curve, which compares the labels of an observation (‘inlier’ or ‘outlier’) produced by the algorithm to the ‘true’ labels. They found that two types of normalization method performed differently (dependent on data set and outlier detection method):

‘Min–Max’ normalization Each variable \(v\) of a dataset is normalized to only have values in the range [0, 1]:

$$\begin{aligned} \frac{v - \min (v)}{\max (v) -\min (v)}, \end{aligned}$$

where \(\min (v)\) and \(\max (v)\) are the minimum and maximum values of the variable \(v\), respectively.

‘Median-IQR’ normalization Each variable \(v\) is transformed to

$$\begin{aligned}\frac{v - \text {median}(v)}{\text {IQR}(v)}, \end{aligned}$$

where \(\text {median}(v)\) and \(\text {IQR}(v)\) are the median and inter-quartile range of the variable \(v\), respectively.
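For reference, both normalizations are one-liners in NumPy; the sketch below assumes the \(d \times N\) (variables in rows) convention used throughout the paper.

```python
import numpy as np

def min_max(X):
    """'Min-Max' normalization: rescale each variable (row of X) to the range [0, 1]."""
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo)

def median_iqr(X):
    """'Median-IQR' normalization: centre each variable by its median and scale by its IQR."""
    med = np.median(X, axis=1, keepdims=True)
    q75, q25 = np.percentile(X, [75, 25], axis=1, keepdims=True)
    return (X - med) / (q75 - q25)
```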

We consider the following four different outlier detection methods from the Python package PyOD (Zhao et al. 2019):

1. KNN: K-Nearest Neighbours
2. LOF: Local Outlier Factor
3. COF: Connectivity-based Outlier Factor
4. FastABOD: Fast Angle Based Outlier Detection

Further details of each of these algorithms are provided in Campos et al. (2016). All of the above methods require a parameter choice \(K\) (different from the polynomial degree parameter \(k\) referred to throughout this paper) to set the so-called neighbourhood size, and a contamination value \(C\) to indicate how many observations the algorithm should label as outliers. We let \(K = 0.1\times N\), where \(N\) is the number of observations in the dataset. The parameter \(C\) is set equal to the percentage of outliers given by the ‘true’ labels.
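A sketch of this setup is shown below. It assumes the standard PyOD estimator interface (fit followed by the decision_scores_ attribute) and scikit-learn's roc_auc_score; the exact parameter handling in the benchmark study may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.cof import COF
from pyod.models.abod import ABOD

def auc_scores(X, y_true):
    """AUC of each detector on a pre-processed dataset X (N observations x d features).

    y_true holds the 'true' labels (1 = outlier, 0 = inlier).
    """
    N = X.shape[0]
    K = max(1, int(0.1 * N))                      # neighbourhood size K = 0.1 * N
    C = float(np.mean(y_true))                    # contamination from the 'true' labels
    detectors = {
        "KNN": KNN(n_neighbors=K, contamination=C),
        "LOF": LOF(n_neighbors=K, contamination=C),
        "COF": COF(n_neighbors=K, contamination=C),
        "FastABOD": ABOD(n_neighbors=K, contamination=C, method="fast"),
    }
    scores = {}
    for name, detector in detectors.items():
        detector.fit(X)
        scores[name] = roc_auc_score(y_true, detector.decision_scores_)
    return scores
```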

For a dataset \(D\), an outlier detection method \(o\) and a pre-processing method \(z\), we denote the area under the ROC curve as \(AUC(D, o, z)\). For each outlier detection method \(o\) listed above, we say a dataset \(D\) ‘prefers’ a pre-processing method \(z\) if \(AUC(D, o, z) \ge AUC(D, o, y)\) for all other pre-processing methods \(y\). We evaluate the AUC score for transformations \(A_{k}D\) using (3) by taking the maximum AUC score over all k considered.

We tested the outlier detection methods with each pre-processing method on 7667 real datasets, as used in Kandanaarachchi et al. (2020). The datasets ranged from dimension 3 to dimension 359, and the number of observations in a dataset ranged from 44 to 5396. We impose no structural assumptions on the datasets for our method or the other normalization methods.

Table 8 shows the percentage of datasets that prefer each pre-processing method for each of the given outlier detection algorithms. The results in this table indicate that the polynomial whitening method outperforms the two normalization methods.

Table 8 The percentage of datasets that give higher AUC scores for the pre-processing technique (given in the column), by outlier detection method (given in the row)

The scatter graphs in Fig. 7 compare the minimal-variance polynomial whitening to the normalization methods considered individually. Each point represents a dataset, and the diagonal line indicates those datasets where the two methods give equal AUC scores. Points below this line, in red, indicate that the minimal-variance whitening method outperformed the other method considered. A numerical breakdown of these scatter graphs is given in Table 9. Much like Table 8, Table 9 shows the percentage of datasets that prefer each pre-processing method, but shows a pairwise comparison.

Fig. 7

Scatter graphs plotting the AUC scores of outlier detection algorithms when performed using the minimal-variance polynomial whitening ‘Min–Var’ on the horizontal axis, and the AUC scores when using (a–d) ‘Min–Max’ or (e–h) ‘Med-IQR’ normalizations on the vertical axis. Points in red indicate a dataset where using Min–Var produced a better score than the alternative method, and points in blue indicate a dataset where using the alternative method produced a better score

Table 9 The percentage of datasets for which the given pre-processing method (given in the column) produces AUC scores better than the alternative method in the adjacent column, for different outlier detection methods (given in the row). For example, 34.4% of datasets produced higher AUC scores when using Min–Var than when using Min–Max, for the KNN outlier detection method. This differs from Table 8 in that it is a pairwise comparison of the pre-processing methods

Table 10 shows the number of datasets out of the total 7667 (and the percentage) for which each pre-processing method produces strictly better results, for each outlier detection method. It is clear that the minimal-variance method performs as well as (and often better than) the techniques often used to preprocess datasets before applying common outlier detection methods.

Table 10 The number (and percentage) of datasets for which the given pre-processing method (in the column) produces AUC scores strictly better than the other methods, for each outlier detection method (in the row)

3.4 Principal component analysis

Principal component analysis (PCA) is a popular dimension-reduction technique, as it reduces a dataset to a chosen dimension \(p\) while retaining as much of the variance of the original dataset as possible. PCA finds \(p\) linear combinations of the variables of the dataset, giving \(p\) new compressed variables with maximal variance. As such, it is highly sensitive to the variances of the variables in the dataset. If one variable is measured on a much larger scale than the others, this variable will likely have much greater variance, and will therefore be given much more weight in a linear combination than the other variables (Jolliffe and Cadima 2016). To prevent this, it is good practice to standardize the variables to ensure they are all measured on the same scale.

We compare two methods of standardization prior to performing PCA: (Moore-Penrose) Mahalanobis standardization, which is most commonly used before PCA, and minimal-variance standardization. In Mahalanobis standardization, let the variable \(v_{i} \in X\) have mean \(\mu _{i}\) and standard deviation \(\sigma _{i}\). We then consider the dataset made up of the variables

$$\begin{aligned} z_{i} = \frac{(v_{i} - \mu _{i})}{\sigma _{i}} \end{aligned}$$

for \(i \in \{1, \dots , d\}\). If \(\sigma _{i} = 0\), we use Moore-Penrose Mahalanobis (MPM) standardization, in which we find the square root of the Moore-Penrose inverse of the covariance matrix, \((\varSigma ^{-})^{1/2}\), and then use its \(i\)th diagonal entry \(\left( (\varSigma ^{-})^{1/2}\right) _{i, i}\) in place of \(\sigma _{i}\).

In minimal-variance (MV) standardization, we find the minimal-variance polynomial matrix \(A_{k}\), and use the values on the diagonal of \(A_{k}\), denoted \((A_{k})_{i,i}\), in place of \(\sigma _{i}\):

$$\begin{aligned} w_{i} = \frac{(v_{i} - \mu _{i})}{(A_{k})_{i,i}}. \end{aligned}$$

Note that this is different to minimal-variance whitening, in that we only use the diagonal of the minimal-variance polynomial matrix to perform the transformation. We do this to align our method with the Mahalanobis standardization method.
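A sketch of MV standardization following the formula for \(w_{i}\) above, with X stored as a \(d \times N\) matrix and A_k the minimal-variance polynomial matrix; the function name is ours.

```python
import numpy as np

def mv_standardize(X, A_k):
    """Minimal-variance standardization: divide each centred variable (row of X)
    by the corresponding diagonal entry of the minimal-variance polynomial matrix A_k."""
    mu = X.mean(axis=1, keepdims=True)
    return (X - mu) / np.diag(A_k)[:, None]
```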

Our method of comparing the different standardization methods for PCA is as follows. We consider 1000 generated datasets, each with \(K\) clusters. For each dataset, we consider three versions: let \(X\) be the original dataset, \(X_{MPM}\) be the MPM standardized dataset and \(X_{MV}\) be the MV standardized dataset. For each version, we find the data given by the PCA transformation for a given number of principal components, and then perform the \(K\)-means clustering algorithm (Lloyd 1982). This is repeated for 1000 different datasets, and the results are given in the following sections for different types of data.

3.4.1 Data with \(d < N\)

We first consider the impact of the different standardization methods on PCA for data with \(d < N\). For each of the 1000 datasets, we generate 3 clusters from multivariate Gaussian distributions \(X^{(i)},\, i = \{1, 2, 3\}\) with dimension \(d = 100\), where the parameters \(\mu ^{(i)},\, \varSigma ^{(i)}\) and \(N^{(i)}\) denote the mean, covariance matrix and number of observations in cluster \(X^{(i)}\). The details of these parameters are given in Table 11. The eigenvalues of each \(\varSigma ^{(i)}\) taper off towards zero gradually. This creates a degenerate dataset with a rank that is hard to identify, a situation which the Moore-Penrose inverse struggles to deal with well.

Table 11 Details of clusters of datasets used for PCA and \(K\)-means examples in Sect. 3.4.1

In this example, we make parameter choices based on the relative size of the eigenvalues of a dataset compared to the maximum eigenvalue. Let \(\varLambda = \{\lambda _{1}, \dots , \lambda _{d}\}\) be the set of eigenvalues of a dataset, let \(\lambda _{\text {max}}\) be the largest eigenvalue in \(\varLambda \), and let \({\bar{\lambda }}\) be the mean of the eigenvalues in \(\varLambda \).

Let \(p = p(\varLambda )\) be the number of principal components that we wish to reduce a dataset to using PCA. For each dataset, the parameter \(p(\varLambda )\) is chosen to be the number of eigenvalues in \(\varLambda \) greater than the mean eigenvalue \({\bar{\lambda }}\):

$$\begin{aligned} p(\varLambda ) = \sum _{i = 1}^{d} \mathbbm {1}_{\lambda _{i} > {\bar{\lambda }}}, \end{aligned}$$

as commonly used in practice (Abdi and Williams 2010).

The parameter \(k = k(\varLambda )\) for the minimal-variance polynomial will be chosen based on the number of scaled eigenvalues \(\pi _{i} = \lambda _{i}/\lambda _{\text {max}}\) that are bigger than a given threshold \(t\):

$$\begin{aligned} k(\varLambda ) = \sum _{i = 1}^{d} \mathbbm {1}_{\pi _{i} > t}. \end{aligned}$$

In the examples that follow we use \(t = 0.1\).
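Both rules of thumb are simple to implement given the eigenvalues; a sketch (function name ours):

```python
import numpy as np

def choose_p_and_k(lam, t=0.1):
    """Parameter choices of Sect. 3.4.1: p = number of eigenvalues above the mean,
    k = number of eigenvalues above t times the largest eigenvalue."""
    lam = np.asarray(lam, dtype=float)
    p = int(np.sum(lam > lam.mean()))
    k = int(np.sum(lam / lam.max() > t))
    return p, k
```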

The \(K\)-means clustering algorithm aims to assign each point within a dataset to a cluster, based on the distance from each point to the estimated centre of each cluster. For more information on the algorithm, see Jain (2010). We consider how well the \(K\)-means clustering algorithm performs after applying PCA, given the different standardization methods. We use the adjusted Rand (AR) score (Hubert and Arabie 1985; Steinley 2004) of the cluster labels provided by \(K\)-means to judge how well the algorithm has found the correct clusterings. An AR score of 0 indicates random labellings, and an AR score of 1 means the clusters were perfectly labelled by the algorithm.

We also report the silhouette scores (Rousseeuw 1987) obtained under each standardization method. The silhouette score of a clustering indicates how well separated the clusters are. A score of 1 indicates well-distinguished clusters, whereas a score of -1 tells us that clusters have been incorrectly assigned. A higher silhouette score tells us that the standardization method and PCA have retained the cluster structure well.
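The evaluation pipeline described above (standardize, reduce with PCA, cluster with K-means, then score the labels) can be sketched with scikit-learn as follows. The data is assumed to be in the observations-by-features orientation expected by scikit-learn, and the function name is ours.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def cluster_after_pca(X_std, true_labels, p, n_clusters=3, seed=0):
    """Reduce a standardized dataset to p principal components, cluster with K-means,
    and return the adjusted Rand and silhouette scores."""
    Z = PCA(n_components=p).fit_transform(X_std)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    return adjusted_rand_score(true_labels, labels), silhouette_score(Z, labels)
```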

Figure 8 shows that using the MPM standardization gives a slight improvement on using no standardization. However, using minimal-variance standardization before applying PCA and \(K\)-means clustering results in vastly better AR scores, as well as better silhouette scores.

Fig. 8

a Adjusted Rand scores and b Silhouette scores of the labellings made by the \(K\)-means algorithm after PCA, which was applied to 1000 datasets with: no standardization (Original); Moore-Penrose Mahalanobis (MPM) standardization; minimal-variance (MV) standardization

3.4.2 Data with \(d \ge N\)

We modify the above example slightly to help us consider the case where \(d \ge N\). In such circumstances, PCA can sometimes perform poorly due to difficulties in finding the eigenvectors of the covariance matrix correctly (Aoshima et al. 2018). As such, we now compare the standardization methods when applied before 3 different methods of dimension reduction:

1. Classical PCA (Pearson 1901)
2. Cross-Data PCA (CD-PCA) (Yata and Aoshima 2010)
3. Noise-Reduction PCA (NR-PCA) (Yata and Aoshima 2012)

The latter two methods were formulated for performing dimension reduction on HDLSS data, and can avoid the difficulties sometimes faced by PCA in such settings. Examples given in these papers show promising results for eigenvalue estimation and dimension reduction in high dimensions. For more details on the implementation of these methods, see the papers referenced above.

As in Sect. 3.4.1, we consider 1000 different datasets, each with \(d = 1000\) and \(N = 430\). Each dataset is generated as a mixture of four multivariate Gaussian distributions \(X_{i} \sim {\mathscr {N}}_{d}\left( \mu _{i}, \varSigma _{i}\right) \), \(i = \{1, 2, 3, 4\}\). The population parameters of each cluster are given in Table 12.

Table 12 Details of clusters of datasets used for PCA and \(K\)-means examples in Sect. 3.4.2

Figure 9 shows boxplots of the AR scores and silhouette scores of the labels given by \(K\)-means clustering, after applying one of the standardization methods and one of the three dimension reduction methods. Across all types of PCA, we see that MPM standardization gives very similar results to the datasets with no standardization. On the other hand, the MV standardization method provides a large improvement for all three methods of dimension reduction. The clustering results are better when MV standardization has been used, as indicated by the boxplots of AR scores in Fig. 9a, c and e. We see that cluster separation is also better when using MV standardization with dimension reduction, as the silhouette scores are much higher for all three dimension reduction methods.

Across the three dimension reduction techniques considered, the minimal-variance standardization method is clearly very useful in those cases where standardization would improve dimension reduction algorithms (or other multivariate data analysis methods), as it behaves similarly to the Moore-Penrose Mahalanobis standardization method, but does not struggle in cases where the rank is unclear and there are many small eigenvalues.

Fig. 9

Adjusted Rand scores and Silhouette scores of the labellings made by the \(K\)-means algorithm after PCA, CD-PCA and NR-PCA which was applied to 1000 datasets with: no standardization (Original); Moore-Penrose Mahalanobis (MPM) standardization; minimal-variance (MV) standardization

4 Conclusion

We have developed a method of constructing polynomials in the empirical covariance matrix to provide an alternative to the inverse square root of a covariance matrix, particularly suitable for degenerate (or close-to-degenerate) data in high dimensions. The minimal-variance polynomial whitening method aims to minimize the total variation in a transformed dataset, and in doing so it provides a dataset that has been decorrelated and standardized. We have demonstrated the potential applications of these polynomial matrices by considering whitening, outlier detection and principal component analysis for both \(d < N\) and \(d \ge N\) cases. We have discussed and given recommendations for the choice of the parameter \(k\) which dictates the degree of the polynomial, as well as an alternative constraint and adjustments. We also suggested a method to reduce computational time in extremely high-dimensional cases, ensuring that this method can be applied in such scenarios.