Subspace clustering of high-dimensional data: a predictive approach

Abstract

In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.



Notes

  1. Software available at http://www2.imperial.ac.uk/~gmontana/psc.htm

  2. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi


Acknowledgments

The authors would like to thank the anonymous referees for their helpful comments and the EPSRC (Engineering and Physical Sciences Research Council) for funding this project.

Author information


Corresponding author

Correspondence to Giovanni Montana.

Additional information

Responsible editor: Ian Davidson.

Appendices

Appendix 1: Derivation of predictive influence

Using the chain rule, the gradient of the PRESS for a single latent factor is

$$\begin{aligned} \frac{\partial J^{(1)}}{\partial \varvec{x}_i} = \frac{1}{2} \frac{\partial }{\partial \varvec{x}_i}\left\| \varvec{e}^{(1)}_{-i}\right\| ^2 = \varvec{e}^{(1)}_{-i} \frac{\partial }{\partial \varvec{x}_i}\varvec{e}^{(1)}_{-i}. \end{aligned}$$

For notational convenience we drop the superscript in the following. Using the quotient rule, the partial derivative of the \(i{\text{ th }}\) leave-one-out error has the following form

$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_{-i} = \frac{\frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_i (1-h_i) + \varvec{e}_i\frac{\partial h_i }{ \partial \varvec{x}_i}}{(1-h_i)^2} \end{aligned}$$

which depends on the partial derivatives of the \(i{\text{ th }}\) reconstruction error and of the \(h_i\) quantity with respect to the observation \(\varvec{x}_i\). The computation of these two partial derivatives is straightforward; they are, respectively

$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_i = \frac{\partial }{\partial \varvec{x}_i} \varvec{x}_i \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) = \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) , \end{aligned}$$

and

$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} h_i = \frac{\partial }{\partial \varvec{x}_i} \varvec{x}_i\varvec{v} D \varvec{v}^{\top }\varvec{x}_i^{\top }= 2\varvec{v} D d_i . \end{aligned}$$
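Both derivatives are easy to confirm numerically. The minimal sketch below (an illustration, not part of the original paper) treats \(\varvec{v}\) as a fixed unit-norm loading vector and \(D\) as a fixed scalar, since both are defined in the main text, and compares the analytic expressions against central finite differences.

```python
# Minimal sketch (not the authors' code): finite-difference check of the two
# partial derivatives above.  v is a fixed unit-norm loading vector, D a fixed
# scalar (its definition is given in the main text), and d_i = x_i v.
import numpy as np

rng = np.random.default_rng(0)
P = 6
v = rng.standard_normal(P)
v /= np.linalg.norm(v)                  # unit-norm loading vector
D = 0.3                                 # assumed fixed scalar; value immaterial
x = rng.standard_normal(P)              # a single observation x_i (as a row)

err = lambda z: z - (z @ v) * v         # e_i = x_i (I_P - v v^T)
lev = lambda z: D * (z @ v) ** 2        # h_i = x_i v D v^T x_i^T

eps = 1e-6
J_num = np.empty((P, P))                # numerical Jacobian of e_i
g_num = np.empty(P)                     # numerical gradient of h_i
for k in range(P):
    dz = np.zeros(P); dz[k] = eps
    J_num[:, k] = (err(x + dz) - err(x - dz)) / (2 * eps)
    g_num[k] = (lev(x + dz) - lev(x - dz)) / (2 * eps)

J_ana = np.eye(P) - np.outer(v, v)      # I_P - v v^T
g_ana = 2 * v * D * (x @ v)             # 2 v D d_i
print(np.allclose(J_num, J_ana, atol=1e-5), np.allclose(g_num, g_ana, atol=1e-5))
```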

The derivative of the PRESS, \(J\), with respect to \(\varvec{x}_i\) is then

$$\begin{aligned} \frac{1}{2}\frac{\partial }{ \partial \varvec{x}_i}\left\| \varvec{e}_{-i}\right\| ^2 = \varvec{e}_{-i} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_{-i}= \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) (1-h_i) + 2\varvec{e}_i \varvec{v} D d_i }{(1-h_i)^2}. \end{aligned}$$
(28)

However, examining the second term in the sum, \(\varvec{e}_i \varvec{v} D d_i \), we notice

$$\begin{aligned} \varvec{e}_i\varvec{v}Dd_i = (\varvec{x}_i-\varvec{x}_i\varvec{vv}^{\top })\varvec{v}Dd_i = \varvec{x}_i\varvec{v}Dd_i - \varvec{x}_i\varvec{vv}^{\top }\varvec{v}Dd_i = 0 . \end{aligned}$$

Substituting this result back in Eq. (28), the gradient of the PRESS for a single PCA component with respect to \(\varvec{x}_i\) is given by

$$\begin{aligned} \frac{1}{2}\frac{\partial }{ \partial \varvec{x}_i} \left\| \varvec{e}_{-i}\right\| ^2 = \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) (1-h_i)}{(1-h_i)^2} = \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) }{(1-h_i)} . \end{aligned}$$

In the general case for \(R>1\), the final expression for the predictive influence \(\varvec{\pi }(\varvec{x}_i)\in \mathbb{R }^{P\times 1}\) of a point \(\varvec{x}_i\) under a PCA model then has the following form:

$$\begin{aligned} \varvec{\pi }(\varvec{x}_i;\varvec{V}) = \varvec{e}^{(R)}_{-i} \left( \sum _{r=1}^{R} \frac{ \left( \varvec{I}_P - \varvec{v}^{(r)}{\varvec{v}^{(r)}}^{\top }\right) }{\left( 1-h^{(r)}_i\right) } - (R-1) \right) . \end{aligned}$$
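As a sanity check on the single-component case, the short sketch below (again an illustration, under the assumption that \(h_i = D d_i^2\) with \(d_i = \varvec{x}_i\varvec{v}\) and \(D\) a fixed scalar) verifies that \(\varvec{e}_i\varvec{v}=0\), so the second term in Eq. (28) indeed vanishes, and that \(\varvec{e}_{-i}(\varvec{I}_P - \varvec{v}\varvec{v}^{\top })/(1-h_i)\) coincides with \(\varvec{e}_i/(1-h_i)^2\), the form used as Eq. (29) in Appendix 2.

```python
# Minimal sketch (not the authors' code) checking two facts used above, under the
# assumption h_i = D d_i^2 with d_i = x_i v and D a fixed scalar: (i) e_i v = 0,
# so the term e_i v D d_i in Eq. (28) vanishes; (ii) for R = 1 the gradient
# e_{-i}(I_P - v v^T)/(1 - h_i) reduces to e_i/(1 - h_i)^2, i.e. Eq. (29) below.
import numpy as np

rng = np.random.default_rng(1)
P = 6
v = rng.standard_normal(P)
v /= np.linalg.norm(v)
D = 0.2
x = rng.standard_normal(P)

d = x @ v
e = x - d * v                           # e_i = x_i (I_P - v v^T)
h = D * d ** 2                          # assumed form of h_i
e_loo = e / (1 - h)                     # leave-one-out error e_{-i}

print(np.isclose(e @ v, 0.0))                                   # (i)
lhs = (e_loo @ (np.eye(P) - np.outer(v, v))) / (1 - h)
rhs = e / (1 - h) ** 2
print(np.allclose(lhs, rhs))                                    # (ii)
```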

Appendix 2: Proof of Lemma 1

From Appendix 1, for \(R=1\), the predictive influence of a point \(\varvec{\pi }({\varvec{x}_i};\varvec{v})\) is

$$\begin{aligned} \varvec{\pi }(\varvec{x}_i;\varvec{v}) =\frac{\varvec{e}_{i}}{(1-h_i)^2} \end{aligned}$$
(29)

This is simply the \(i{\text{ th }}\) leave-one-out error further scaled by \(1/(1-h_i)\). If we define a diagonal matrix \(\varvec{\varXi }\in \mathbb{R }^{N\times N}\) with diagonal entries \({\varXi }_{i} = (1-h_i)^2\), we can stack the predictive influences into a matrix \(\varvec{\Pi }\in \mathbb{R }^{N\times P}\), \(\varvec{\Pi }=[\varvec{\pi }(\varvec{x}_1;\varvec{v}) ^{\top },\ldots , \varvec{\pi }(\varvec{x}_N;\varvec{v}) ^{\top }]^{\top }\). This matrix has the form

$$\begin{aligned} \varvec{\Pi } = \varvec{\varXi }^{-1}\left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) . \end{aligned}$$
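This matrix form can be checked directly; the sketch below (assuming leverage-type terms \(h_i = d_i^2/\sum _j d_j^2\) with \(\varvec{d}=\varvec{X}\varvec{v}\), which stand in for the quantities defined in the main text) confirms that the rows of \(\varvec{\varXi }^{-1}(\varvec{X} - \varvec{X}\varvec{vv}^{\top })\) match the per-observation predictive influences of Eq. (29).

```python
# Minimal sketch (not the authors' code): with Xi_ii = (1 - h_i)^2, the rows of
# Pi = Xi^{-1}(X - X v v^T) equal the per-observation predictive influences
# e_i / (1 - h_i)^2 of Eq. (29).  The leverage h_i = d_i^2 / sum_j d_j^2 is an
# assumed stand-in for the quantity defined in the main text.
import numpy as np

rng = np.random.default_rng(2)
N, P = 20, 6
X = rng.standard_normal((N, P))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                               # leading right singular vector of X
d = X @ v                               # component scores
h = d ** 2 / np.sum(d ** 2)             # assumed leverage terms, each h_i < 1
E = X - np.outer(d, v)                  # rows are the reconstruction errors e_i
Xi = np.diag((1 - h) ** 2)

Pi = np.linalg.solve(Xi, E)             # Xi^{-1}(X - X v v^T)
print(np.allclose(Pi, E / ((1 - h) ** 2)[:, None]))
```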

Now, solving (21) is equivalent to minimising the squared Frobenius norm,

$$\begin{aligned}&\min _{\varvec{v}} \text{ Tr } \left( \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) ^{\top }\varvec{\varXi }^{-2} \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) \right) \nonumber \\&\text{ subject } \text{ to } \left\| \varvec{v} \right\| =1 . \end{aligned}$$
(30)

Expanding the terms within the trace we obtain

$$\begin{aligned} \text{ Tr } \left( \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) ^{\top }\varvec{\varXi }^{-2} \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) \right)&= \text{ Tr } \left( \varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) - 2\text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) \nonumber \\&+ \text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X}\varvec{vv}^{\top }\right) . \end{aligned}$$

By the properties of the trace, the following equalities hold

$$\begin{aligned} \text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) = \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \varvec{v}, \end{aligned}$$

and

$$\begin{aligned} \text{ Tr } \left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X}\varvec{vv}^{\top }\right)&= \text{ Tr }\left( \varvec{\varXi }^{-1}\varvec{X}\varvec{vv}^{\top }\varvec{vv}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-1}\right) \\&= \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \varvec{v}, \end{aligned}$$

since \(\varvec{\varXi }\) is diagonal and \(\varvec{v}^{\top }\varvec{v}=1\). Therefore, (30) is equivalent to

$$\begin{aligned}&\min _{\varvec{v}} \text{ Tr }\left( \varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X}\right) - \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv} , \nonumber \\&\text{ subject } \text{ to } ~ \left\| \varvec{v} \right\| =1 . \end{aligned}$$
(31)

It can be seen that under this constraint, (31) is minimised when \(\varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv}\) is maximised, which, for a fixed \(\varvec{\varXi }\), is achieved when \(\varvec{v}\) is the eigenvector corresponding to the largest eigenvalue of \(\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X}\).
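This conclusion is easy to confirm numerically; the sketch below (an illustration only, with an arbitrary fixed diagonal \(\varvec{\varXi }\)) compares the objective in (30) at the leading eigenvector of \(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\) with its value at random unit vectors.

```python
# Minimal numerical check of Lemma 1: for a fixed diagonal Xi, the weighted
# reconstruction objective in (30) over unit-norm v is minimised by the leading
# eigenvector of X^T Xi^{-2} X.  The diagonal of Xi is an arbitrary stand-in.
import numpy as np

rng = np.random.default_rng(3)
N, P = 50, 8
X = rng.standard_normal((N, P))
xi = rng.uniform(0.5, 1.0, size=N)      # stand-in for (1 - h_i)^2, held fixed
W = np.diag(xi ** -2)                   # Xi^{-2}

def objective(v):
    E = X - np.outer(X @ v, v)          # X - X v v^T
    return np.trace(E.T @ W @ E)

evals, evecs = np.linalg.eigh(X.T @ W @ X)
v_star = evecs[:, -1]                   # eigenvector of the largest eigenvalue

rand_v = rng.standard_normal((200, P))
rand_v /= np.linalg.norm(rand_v, axis=1, keepdims=True)
print(all(objective(v_star) <= objective(v) + 1e-9 for v in rand_v))
```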

Appendix 3: Proof of Lemma 2

In this section we provide a proof of Lemma 2. As an additional consequence of this proof, we develop an upper bound for the approximation error which can be shown to depend on the leverage terms. We derive this result for a single cluster, \(\mathcal{C }^{(\tau )}\); however, it holds for all clusters.

We represent the assignment of points \(i=1,\ldots ,N\) to a cluster, \(\mathcal{C }^{(\tau )}\) using a binary valued diagonal matrix \(\varvec{A}\) whose diagonal entries are given by

$$\begin{aligned} A_{i}= \left\{ \begin{array}{ll} 1, &{} \text{ if } i\in \mathcal{C }^{(\tau )} \\ 0,&{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(32)

where \(\text{ Tr }(\varvec{A})=N_k\). We have shown in Lemma 1 that for a given cluster assignment, the parameters which optimise the objective function can be estimated by computing the SVD of the matrix

$$\begin{aligned} \sum _{i\in \mathcal{C }_k^{(\tau )}} \varvec{x}_i^{\top }{\varXi }_{i}^{-2} \varvec{x}_i = \varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{A} \varvec{X} , \end{aligned}$$
(33)

within each cluster, where the \(i{\text{ th }}\) diagonal element of \({\varvec{\varXi }}\) is \(\varXi _{i}=(1-h_i)^2\le 1\), so that \(\varXi _{i}^{-2}\ge 1\). We can then represent \({\varvec{\varXi }}^{-2} = \varvec{I}_N + \varvec{\varPhi }\) where \(\varvec{\varPhi }\in \mathbb{R }^{N\times N}\) is a diagonal matrix with entries \(\varPhi _{i}=\phi _i\ge 0\). Now, we can represent Eq. (33) at the next iteration as

$$\begin{aligned} \varvec{M} = \varvec{X}^{\top }\varvec{A}(\varvec{I}_N + \varvec{\varPhi })\varvec{X} . \end{aligned}$$
(34)
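The construction of \(\varvec{A}\), the identity (33) and the matrix \(\varvec{M}\) of (34) can be illustrated in a few lines of code (a sketch only; the diagonal of \(\varvec{\varXi }\) is an arbitrary stand-in for the leverage-based terms defined above).

```python
# Minimal sketch of Eqs. (32)-(34): A encodes one cluster's membership on its
# diagonal, the weighted sum over that cluster equals X^T Xi^{-2} A X, and
# M = X^T A (I_N + Phi) X with Phi = Xi^{-2} - I_N.  The diagonal of Xi is an
# arbitrary stand-in for the leverage-based terms (1 - h_i)^2.
import numpy as np

rng = np.random.default_rng(4)
N, P = 30, 5
X = rng.standard_normal((N, P))
labels = rng.integers(0, 3, size=N)     # hypothetical cluster assignments
k = 0                                   # the cluster C^(tau) under consideration

A = np.diag((labels == k).astype(float))            # Eq. (32)
xi = rng.uniform(0.5, 1.0, size=N)                  # stand-in for (1 - h_i)^2
Xi_inv2 = np.diag(xi ** -2)                         # Xi^{-2}

lhs = sum(np.outer(X[i], X[i]) / xi[i] ** 2 for i in range(N) if labels[i] == k)
rhs = X.T @ Xi_inv2 @ A @ X                         # Eq. (33)
print(np.allclose(lhs, rhs))

Phi = Xi_inv2 - np.eye(N)                           # diagonal, entries phi_i >= 0
M = X.T @ A @ (np.eye(N) + Phi) @ X                 # Eq. (34)
print(np.allclose(M, rhs))
```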

We can quantify the difference between the optimal parameter, \(\varvec{v}^{*}\), obtained by solving (22) using \(\varvec{M}\), and the new PCA parameter, \(\varvec{v}^{(\tau )}\), estimated at iteration \(\tau +1\), as

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )})= {\varvec{v}^{*}}^{\top }\varvec{M} \varvec{v}^{*} - {\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )}, \end{aligned}$$

where \(\varvec{v}^{(\tau )}\) is obtained through the SVD of \( \varvec{X}^{\top }\varvec{A}\varvec{X} \). We can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau )})\) in terms of the spectral norm of \(\varvec{M}\). Since the spectral norm of a matrix equals its largest singular value, we have \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )} =\left\| \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \). Since \(\varvec{\varPhi }\) is a diagonal matrix, its spectral norm is \(\left\| \varvec{\varPhi } \right\| = \max (\varvec{\varPhi })\). Similarly, \(\varvec{A}\) is a diagonal matrix with binary-valued entries, so \(\left\| \varvec{A} \right\| = 1\). It follows that

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )})&\le \left\| \varvec{M} - \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \nonumber \\&= \left\| \varvec{X}^{\top }\varvec{A}\varvec{\varPhi }\varvec{X} \right\| \nonumber \\&\le \max (\varvec{\varPhi }) \left\| \varvec{X}^{\top }\varvec{X} \right\| . \end{aligned}$$
(35)

where the triangle and Cauchy-Schwarz inequalities have been used. In a similar way, we now quantify the difference between the optimal parameter and the old PCA parameter \(\varvec{v}^{(\tau -1)}\),

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)}) = {\varvec{v}^{*}}^{\top }\varvec{M} \varvec{v}^{*} - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)}. \end{aligned}$$

Since \(\varvec{v}^{(\tau )}\) is the principal eigenvector of \(\varvec{X}^{\top }\varvec{A}\varvec{X}\), by definition \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A}\varvec{Xv}^{(\tau )}\) is maximised; therefore, we can represent the difference between the new parameters and the old parameters as

$$\begin{aligned} E(\mathcal{S }^{(\tau )},\mathcal{S }^{(\tau -1)})={\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{Xv}^{(\tau )} - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{Xv}^{(\tau -1)}\ge 0. \end{aligned}$$

Using this quantity, we can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})\) as

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})&\le \left\| \varvec{M} \right\| - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)} \nonumber \\&\le \left\| \varvec{X}^{\top }\varvec{\varPhi } \varvec{A} \varvec{X} \right\| + \left\| \varvec{X}^{\top }\varvec{A} \varvec{X} \right\| - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)} \nonumber \\&\le \max (\varvec{\varPhi } )\left\| \varvec{X}^{\top }\varvec{X} \right\| + E(\mathcal{S }^{(\tau )},\mathcal{S }^{(\tau -1)}), \end{aligned}$$
(36)

From (36) and (35) it is clear that

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \le E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)}) . \end{aligned}$$
(37)

This proves Lemma 2.

The inequality in (37) implies that computing the SVD of \(\varvec{X}^{\top }\varvec{A} \varvec{X}\) yields PCA parameters which are closer to the optimal values than those obtained at the previous iteration. Therefore, estimating a new PCA model after each cluster re-assignment step never increases the objective function. Furthermore, as the recovered clustering becomes more accurate, by definition there are fewer influential observations within each cluster. This implies that \(\max (\varvec{\varPhi } ) \rightarrow 0\), and so \( E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \rightarrow 0\).
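The two inequalities (35) and (37) can also be verified numerically. The sketch below (an illustration under the same stand-in quantities used in the earlier sketches, with a random unit vector playing the role of \(\varvec{v}^{(\tau -1)}\)) checks both bounds on simulated data.

```python
# Minimal numerical illustration of inequalities (35) and (37): v* is the leading
# eigenvector of M = X^T A (I_N + Phi) X, v^(tau) that of X^T A X, and a random
# unit vector plays the role of v^(tau-1).  A and Phi are simulated stand-ins.
import numpy as np

rng = np.random.default_rng(5)
N, P = 40, 6
X = rng.standard_normal((N, P))
A = np.diag(rng.integers(0, 2, size=N).astype(float))   # binary cluster indicator
phi = rng.uniform(0.0, 0.5, size=N)                      # phi_i >= 0
Phi = np.diag(phi)

M = X.T @ A @ (np.eye(N) + Phi) @ X
S = X.T @ A @ X

top = lambda B: np.linalg.eigh(B)[1][:, -1]              # leading eigenvector
v_star, v_tau = top(M), top(S)
v_prev = rng.standard_normal(P)
v_prev /= np.linalg.norm(v_prev)

E_tau = v_star @ M @ v_star - v_tau @ S @ v_tau          # E(S*, S^(tau))
E_prev = v_star @ M @ v_star - v_prev @ S @ v_prev       # E(S*, S^(tau-1))

bound = np.max(phi) * np.linalg.norm(X.T @ X, 2)         # max(Phi) * ||X^T X||
print(E_tau <= bound + 1e-9)                             # inequality (35)
print(E_tau <= E_prev + 1e-9)                            # inequality (37)
```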


About this article

Cite this article

McWilliams, B., Montana, G. Subspace clustering of high-dimensional data: a predictive approach. Data Min Knowl Disc 28, 736–772 (2014). https://doi.org/10.1007/s10618-013-0317-y


Keywords

  • Subspace clustering
  • PCA
  • PRESS statistics
  • Variable selection
  • Model selection
  • Microarrays