Appendix 1: Derivation of predictive influence
Using the chain rule, the gradient of the PRESS for a single latent factor is
$$\begin{aligned} \frac{\partial J^{(1)}}{\partial \varvec{x}_i} = \frac{1}{2} \frac{\partial }{\partial \varvec{x}_i}\left\| \varvec{e}^{(1)}_{-i}\right\| ^2 = \varvec{e}^{(1)}_{-i} \frac{\partial }{\partial \varvec{x}_i}\varvec{e}^{(1)}_{-i}. \end{aligned}$$
For notational convenience, we drop the superscript in what follows. Using the quotient rule, the partial derivative of the \(i\)th leave-one-out error has the following form
$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_{-i} = \frac{\frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_i (1-h_i) + \varvec{e}_i\frac{\partial h_i }{ \partial \varvec{x}_i}}{(1-h_i)^2} \end{aligned}$$
which depends on the partial derivatives of the \(i\)th reconstruction error and of \(h_i\) with respect to the observation \(\varvec{x}_i\). These two partial derivatives are straightforward to compute and are, respectively,
$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_i = \frac{\partial }{\partial \varvec{x}_i} \varvec{x}_i \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) = \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) , \end{aligned}$$
and
$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} h_i = \frac{\partial }{\partial \varvec{x}_i} \varvec{x}_i\varvec{v} D \varvec{v}^{\top }\varvec{x}_i^{\top }= 2\varvec{v} D d_i . \end{aligned}$$
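As a numerical sanity check (not part of the original derivation), the following sketch verifies these two partial derivatives by finite differences; it assumes \(d_i = \varvec{x}_i\varvec{v}\) and treats \(D\) as a fixed scalar, as in the single-component case.

```python
import numpy as np

# Finite-difference check of the two gradients above, holding D fixed.
rng = np.random.default_rng(0)
P = 5
v = rng.normal(size=P)
v /= np.linalg.norm(v)                   # unit-norm loading vector
x = rng.normal(size=P)                   # a single observation (row vector)
D = 0.3                                  # treated as a constant w.r.t. x_i

e = lambda x: x - (x @ v) * v            # e_i = x_i (I_P - v v^T)
h = lambda x: D * (x @ v) ** 2           # h_i = x_i v D v^T x_i^T

grad_e = np.eye(P) - np.outer(v, v)      # analytic d e_i / d x_i
grad_h = 2.0 * v * D * (x @ v)           # analytic d h_i / d x_i = 2 v D d_i

eps = 1e-6
fd_e = np.zeros((P, P))
fd_h = np.zeros(P)
for j in range(P):
    dx = np.zeros(P); dx[j] = eps
    fd_e[j] = (e(x + dx) - e(x - dx)) / (2 * eps)
    fd_h[j] = (h(x + dx) - h(x - dx)) / (2 * eps)

print(np.allclose(fd_e, grad_e, atol=1e-6))   # True
print(np.allclose(fd_h, grad_h, atol=1e-6))   # True
```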
The derivative of the PRESS \(J\) with respect to \(\varvec{x}_i\) is then
$$\begin{aligned} \frac{1}{2}\frac{\partial }{ \partial \varvec{x}_i}\left\| \varvec{e}_{-i}\right\| ^2 = \varvec{e}_{-i} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_{-i}= \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) (1-h_i) + 2\varvec{e}_i \varvec{v} D d_i }{(1-h_i)^2}. \end{aligned}$$
(28)
However, examining the second term in the numerator of Eq. (28), \(\varvec{e}_i \varvec{v} D d_i \), we notice that it vanishes, since \(\varvec{v}^{\top }\varvec{v}=1\):
$$\begin{aligned} \varvec{e}_i\varvec{v}Dd_i = (\varvec{x}_i-\varvec{x}_i\varvec{vv}^{\top })\varvec{v}Dd_i = \varvec{x}_i\varvec{v}Dd_i - \varvec{x}_i\varvec{vv}^{\top }\varvec{v}Dd_i = 0 . \end{aligned}$$
Substituting this result back in Eq. (28), the gradient of the PRESS for a single PCA component with respect to \(\varvec{x}_i\) is given by
$$\begin{aligned} \frac{1}{2}\frac{\partial }{ \partial \varvec{x}_i} \left\| \varvec{e}_{-i}\right\| ^2 = \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) (1-h_i)}{(1-h_i)^2} = \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) }{(1-h_i)} . \end{aligned}$$
In the general case for \(R>1\), the final expression for the predictive influence \(\varvec{\pi }(\varvec{x}_i)\in \mathbb{R }^{P\times 1}\) of a point \(\varvec{x}_i\) under a PCA model then has the following form:
$$\begin{aligned} \varvec{\pi }(\varvec{x}_i;\varvec{V}) = \varvec{e}^{(R)}_{-i} \left( \sum _{r=1}^{R} \frac{ \left( \varvec{I}_P - \varvec{v}^{(r)}{\varvec{v}^{(r)}}^{\top }\right) }{\left( 1-h^{(r)}_i\right) } - (R-1) \right) . \end{aligned}$$
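The sketch below simply assembles this expression for a single point and is illustrative only: the leave-one-out error \(\varvec{e}^{(R)}_{-i}\) and the leverage terms \(h^{(r)}_i\) are assumed to be computed as in the main text, the function name and signature are hypothetical, and the scalar \(-(R-1)\) is applied as \(-(R-1)\varvec{I}_P\).

```python
import numpy as np

def predictive_influence(e_loo_R, V, h_i):
    """Assemble pi(x_i; V) from the expression displayed above.

    e_loo_R : (P,) leave-one-out reconstruction error e^{(R)}_{-i} of x_i.
    V       : (P, R) matrix whose columns are the loadings v^{(1)}, ..., v^{(R)}.
    h_i     : (R,) per-component leverage terms h^{(r)}_i for x_i.
    """
    P, R = V.shape
    S = -(R - 1) * np.eye(P)                          # the -(R-1) correction term
    for r in range(R):
        v_r = V[:, r]
        S += (np.eye(P) - np.outer(v_r, v_r)) / (1.0 - h_i[r])
    return e_loo_R @ S                                # the P-dimensional influence vector
```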
Appendix 2: Proof of Lemma 1
From Appendix 1, for \(R=1\), the predictive influence of a point \(\varvec{\pi }({\varvec{x}_i};\varvec{v})\) is
$$\begin{aligned} \varvec{\pi }(\varvec{x}_i;\varvec{v}) =\frac{\varvec{e}_{i}}{(1-h_i)^2} \end{aligned}$$
(29)
This is simply the \(i\)th leave-one-out error scaled by \((1-h_i)^{-1}\). If we define a diagonal matrix \(\varvec{\varXi }\in \mathbb{R}^{N\times N}\) with diagonal entries \({\varXi }_{i} = (1-h_i)^2\), we can define a matrix \(\varvec{\Pi }\in \mathbb{R}^{N\times P}\) whose rows are the predictive influences, \(\varvec{\Pi }=[\varvec{\pi }(\varvec{x}_1;\varvec{v}) ^{\top },\ldots , \varvec{\pi }(\varvec{x}_N;\varvec{v}) ^{\top }]^{\top }\). This matrix has the form
$$\begin{aligned} \varvec{\Pi } = \varvec{\varXi }^{-1}\left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) . \end{aligned}$$
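As an illustrative check of this matrix form against the pointwise expression from Appendix 1, the sketch below assumes the leverage \(h_i = d_i^2/(\varvec{d}^{\top }\varvec{d})\), i.e. \(D = (\varvec{d}^{\top }\varvec{d})^{-1}\) with scores \(\varvec{d}=\varvec{Xv}\); this standard PRESS-type leverage is an assumption, since the definition of \(h_i\) is not restated in this appendix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 8
X = rng.normal(size=(N, P))
X -= X.mean(axis=0)                               # column-centred data

v = np.linalg.svd(X, full_matrices=False)[2][0]   # leading PCA loading
d = X @ v                                         # scores d_i = x_i v
h = d**2 / (d @ d)                                # assumed leverage h_i

# matrix form: Pi = Xi^{-1} (X - X v v^T), with Xi_i = (1 - h_i)^2
Pi = (X - np.outer(d, v)) / (1.0 - h[:, None])**2

# pointwise form from Appendix 1: pi(x_i) = e_{-i} (I_P - v v^T) / (1 - h_i)
proj = np.eye(P) - np.outer(v, v)
E_loo = (X - np.outer(d, v)) / (1.0 - h[:, None])   # rows are e_{-i}
Pi_pointwise = (E_loo @ proj) / (1.0 - h[:, None])

print(np.allclose(Pi, Pi_pointwise))              # True: the two forms coincide
```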
Now, solving (21) is equivalent to minimising the squared Frobenius norm,
$$\begin{aligned}&\min _{\varvec{v}} \text{ Tr } \left( \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) ^{\top }\varvec{\varXi }^{-2} \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) \right) \nonumber \\&\text{ subject } \text{ to } \left\| \varvec{v} \right\| =1 . \end{aligned}$$
(30)
Expanding the terms within the trace we obtain
$$\begin{aligned} \text{ Tr } \left( \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) ^{\top }\varvec{\varXi }^{-2} \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) \right)&= \text{ Tr } \left( \varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) - 2\text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) \nonumber \\&+ \text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X}\varvec{vv}^{\top }\right) . \end{aligned}$$
By the properties of the trace, the following equalities hold
$$\begin{aligned} \text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) = \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \varvec{v}, \end{aligned}$$
and
$$\begin{aligned} \text{ Tr } \left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X}\varvec{vv}^{\top }\right)&= \text{ Tr }\left( \varvec{\varXi }^{-1}\varvec{X}\varvec{vv}^{\top }\varvec{vv}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-1}\right) \\&= \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \varvec{v}, \end{aligned}$$
since \(\varvec{\varXi }\) is diagonal and \(\varvec{v}^{\top }\varvec{v}=1\). Therefore, (30) is equivalent to
$$\begin{aligned}&\min _{\varvec{v}} \text{ Tr } \varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} - \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv} , \nonumber \\&\text{ subject } \text{ to } ~ \left\| \varvec{v} \right\| =1 . \end{aligned}$$
(31)
It can be seen that, under this constraint, (31) is minimised when \(\varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv}\) is maximised, which, for a fixed \(\varvec{\varXi }\), is achieved when \(\varvec{v}\) is the eigenvector corresponding to the largest eigenvalue of \(\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X}\).
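A minimal numerical illustration of this result is sketched below; the leverage terms are random placeholders held fixed (in the actual algorithm \(\varvec{\varXi }\) depends on \(\varvec{v}\) and is re-estimated iteratively), and the check confirms that the leading eigenvector of \(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\) attains a lower value of (30) than random unit-norm vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 60, 6
X = rng.normal(size=(N, P))
h = rng.uniform(0.0, 0.5, size=N)          # placeholder leverages, held fixed
w = 1.0 / (1.0 - h)**4                     # diagonal of Xi^{-2}, with Xi_i = (1 - h_i)^2

def objective(v):
    R = X - np.outer(X @ v, v)             # X - X v v^T
    return np.sum(w[:, None] * R**2)       # Tr((X - Xvv^T)^T Xi^{-2} (X - Xvv^T))

M = X.T @ (w[:, None] * X)                 # X^T Xi^{-2} X
v_star = np.linalg.eigh(M)[1][:, -1]       # eigenvector of the largest eigenvalue

rand = rng.normal(size=(200, P))
rand /= np.linalg.norm(rand, axis=1, keepdims=True)
print(all(objective(v_star) <= objective(u) + 1e-9 for u in rand))   # True
```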
Appendix 3: Proof of Lemma 2
In this section we provide a proof of Lemma 2. As an additional consequence of this proof, we develop an upper bound for the approximation error which can be shown to depend on the leverage terms. We derive this result for a single cluster, \(\mathcal{C }^{(\tau )}\); however, it holds for all clusters.
We represent the assignment of points \(i=1,\ldots ,N\) to a cluster \(\mathcal{C }^{(\tau )}\) using a binary-valued diagonal matrix \(\varvec{A}\) whose diagonal entries are given by
$$\begin{aligned} A_{i}= \left\{ \begin{array}{ll} 1, &{} \text{ if } i\in \mathcal{C }^{(\tau )} \\ 0,&{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(32)
where \(\text{ Tr }(\varvec{A})=N_k\). We have shown in Lemma 1 that for a given cluster assignment, the parameters which optimise the objective function can be estimated by computing the SVD of the matrix
$$\begin{aligned} \sum _{i\in \mathcal{C }_k^{(\tau )}} \varvec{x}_i^{\top }{\varXi }_{i}^{-2} \varvec{x}_i = \varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{A} \varvec{X} , \end{aligned}$$
(33)
within each cluster, where the \(i\)th diagonal element of \({\varvec{\varXi }}\) is \(\varXi _{i}=(1-h_i)^2\le 1\), so that \(\varXi _{i}^{-2}\ge 1\). We can then represent \({\varvec{\varXi }}^{-2} = \varvec{I}_N + \varvec{\varPhi }\), where \(\varvec{\varPhi }\in \mathbb{R}^{N\times N}\) is a diagonal matrix with entries \(\varPhi _{i}=\phi _i\ge 0\). Now, we can represent Eq. (33) at the next iteration as
$$\begin{aligned} \varvec{M} = \varvec{X}^{\top }\varvec{A}(\varvec{I}_N + \varvec{\varPhi })\varvec{X} . \end{aligned}$$
(34)
We can quantify the difference between the optimal parameter \(\varvec{v}^{*}\), obtained by solving (22) using \(\varvec{M}\), and the new PCA parameter \(\varvec{v}^{(\tau )}\), estimated at iteration \(\tau +1\), as
$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )})= {\varvec{v}^{*}}^{\top }\varvec{M} \varvec{v}^{*} - {\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )}, \end{aligned}$$
where \(\varvec{v}^{(\tau )}\) is obtained through the SVD of \( \varvec{X}^{\top }\varvec{A}\varvec{X} \). We can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau )})\) in terms of the spectral norm of \(\varvec{M}\). Since the spectral norm of a matrix is equal to its largest singular value, we have \({\varvec{v}^{*}}^{\top }\varvec{M}\varvec{v}^{*} =\left\| \varvec{M} \right\| \) and \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )} =\left\| \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \). Since \(\varvec{\varPhi }\) is a diagonal matrix, its spectral norm is \(\left\| \varvec{\varPhi } \right\| = \max (\varvec{\varPhi })\). Similarly, \(\varvec{A}\) is a diagonal matrix with binary-valued entries, so \(\left\| \varvec{A} \right\| = 1\). We then have
$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )})&\le \left\| \varvec{M} - \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \nonumber \\&= \left\| \varvec{X}^{\top }\varvec{A}\varvec{\varPhi }\varvec{X} \right\| \nonumber \\&\le \max (\varvec{\varPhi }) \left\| \varvec{X}^{\top }\varvec{X} \right\| . \end{aligned}$$
(35)
where the triangle inequality and the submultiplicativity of the spectral norm have been used. In a similar way, we now quantify the difference between the optimal parameter and the old PCA parameter \(\varvec{v}^{(\tau -1)}\),
$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)}) = {\varvec{v}^{*}}^{\top }\varvec{M} \varvec{v}^{*} - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)}. \end{aligned}$$
Since \(\varvec{v}^{(\tau )}\) is the principal eigenvector of \(\varvec{X}^{\top }\varvec{A}\varvec{X}\), by definition \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A}\varvec{Xv}^{(\tau )}\) is maximised; therefore, we can represent the difference between the new parameters and the old parameters as
$$\begin{aligned} E(\mathcal{S }^{(\tau )},\mathcal{S }^{(\tau -1)})={\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{Xv}^{(\tau )} - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{Xv}^{(\tau -1)}\ge 0. \end{aligned}$$
Using this quantity, we can bound \(E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})\) as
$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})&\le \left\| \varvec{M} \right\| - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)} \nonumber \\&\le \left\| \varvec{X}^{\top }\varvec{\varPhi } \varvec{A} \varvec{X} \right\| + \left\| \varvec{X}^{\top }\varvec{A} \varvec{X} \right\| - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)} \nonumber \\&\le \max (\varvec{\varPhi } )\left\| \varvec{X}^{\top }\varvec{X} \right\| + E(\mathcal{S }^{(\tau )},\mathcal{S }^{(\tau -1)}). \end{aligned}$$
(36)
From (36) and (35) it is clear that
$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \le E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)}) . \end{aligned}$$
(37)
This proves Lemma 2.
The inequality in (37) implies that computing the SVD of \(\varvec{X}^{\top }\varvec{A} \varvec{X}\) yields PCA parameters which are closer to the optimal values than those obtained at the previous iteration. Therefore, estimating a new PCA model after each cluster re-assignment step never increases the objective function. Furthermore, as the recovered clustering becomes more accurate, by definition there are fewer influential observations within each cluster. This implies that \(\max (\varvec{\varPhi } ) \rightarrow 0\), and so \( E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \rightarrow 0\).
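Both inequalities can be illustrated numerically. In the sketch below, which is purely illustrative, \(\varvec{A}\) and \(\varvec{\varPhi }\) are generated at random and an arbitrary unit vector stands in for the previous estimate \(\varvec{v}^{(\tau -1)}\); both printed checks then evaluate to True.

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 80, 5
X = rng.normal(size=(N, P))

A = np.diag(rng.integers(0, 2, size=N).astype(float))   # binary cluster assignment
phi = rng.uniform(0.0, 0.3, size=N)                     # entries of Phi, all >= 0
M = X.T @ A @ (np.eye(N) + np.diag(phi)) @ X            # Eq. (34)
XAX = X.T @ A @ X

spec = lambda S: np.linalg.norm(S, 2)                   # spectral norm

v_tau = np.linalg.eigh(XAX)[1][:, -1]                   # new parameter v^(tau)
v_old = rng.normal(size=P)                              # stand-in for v^(tau-1)
v_old /= np.linalg.norm(v_old)

E_opt_tau = spec(M) - v_tau @ XAX @ v_tau               # E(S*, S^(tau))
E_opt_old = spec(M) - v_old @ XAX @ v_old               # E(S*, S^(tau-1))

print(E_opt_tau <= phi.max() * spec(X.T @ X) + 1e-9)    # bound (35): True
print(E_opt_tau <= E_opt_old + 1e-9)                    # inequality (37): True
```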