
Subspace clustering of high-dimensional data: a predictive approach

Abstract

In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons with competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, on which PSC often provides state-of-the-art results.


Notes

  1. Software available at http://www2.imperial.ac.uk/~gmontana/psc.htm

  2. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi


Acknowledgments

The authors would like to thank the anonymous referees for their helpful comments and the EPSRC (Engineering and Physical Sciences Research Council) for funding this project.

Author information


Corresponding author

Correspondence to Giovanni Montana.

Additional information

Responsible editor: Ian Davidson.

Appendices

Appendix 1: Derivation of predictive influence

Using the chain rule, the gradient of the PRESS for a single latent factor is

$$\begin{aligned} \frac{\partial J^{(1)}}{\partial \varvec{x}_i} = \frac{1}{2} \frac{\partial }{\partial \varvec{x}_i}\left\| \varvec{e}^{(1)}_{-i}\right\| ^2 = \varvec{e}^{(1)}_{-i} \frac{\partial }{\partial \varvec{x}_i}\varvec{e}^{(1)}_{-i}. \end{aligned}$$

For notational convenience we drop the superscript in the following. Using the quotient rule, the partial derivative of the \(i{\text{ th }}\) leave-one-out error has the following form

$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_{-i} = \frac{\frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_i (1-h_i) + \varvec{e}_i\frac{\partial h_i }{ \partial \varvec{x}_i}}{(1-h_i)^2} \end{aligned}$$

which depends on the partial derivatives of the \(i{\text{ th }}\) reconstruction error and of the \(h_i\) quantities with respect to the observation \(\varvec{x}_i\). These two partial derivatives are straightforward to compute and are, respectively,

$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_i = \frac{\partial }{\partial \varvec{x}_i} \varvec{x}_i \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) = \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) , \end{aligned}$$

and

$$\begin{aligned} \frac{\partial }{ \partial \varvec{x}_i} h_i = \frac{\partial }{\partial \varvec{x}_i} \varvec{x}_i\varvec{v} D \varvec{v}^{\top }\varvec{x}_i^{\top }= 2\varvec{v} D d_i . \end{aligned}$$

The derivative of the PRESS, \(J\) with respect to \(\varvec{x}_i\) is then

$$\begin{aligned} \frac{1}{2}\frac{\partial }{ \partial \varvec{x}_i}\left\| \varvec{e}_{-i}\right\| ^2 = \varvec{e}_{-i} \frac{\partial }{ \partial \varvec{x}_i} \varvec{e}_{-i}= \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) (1-h_i) + 2\varvec{e}_i \varvec{v} D d_i }{(1-h_i)^2}. \end{aligned}$$
(28)

However, examining the second term in the numerator, \(\varvec{e}_i \varvec{v} D d_i \), we notice

$$\begin{aligned} \varvec{e}_i\varvec{v}Dd_i = (\varvec{x}_i-\varvec{x}_i\varvec{vv}^{\top })\varvec{v}Dd_i = \varvec{x}_i\varvec{v}Dd_i - \varvec{x}_i\varvec{vv}^{\top }\varvec{v}Dd_i = 0 . \end{aligned}$$

Substituting this result back in Eq. (28), the gradient of the PRESS for a single PCA component with respect to \(\varvec{x}_i\) is given by

$$\begin{aligned} \frac{1}{2}\frac{\partial }{ \partial \varvec{x}_i} \left\| \varvec{e}_{-i}\right\| ^2 = \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) (1-h_i)}{(1-h_i)^2} = \varvec{e}_{-i} \frac{ \left( \varvec{I}_P - \varvec{v}{\varvec{v}}^{\top }\right) }{(1-h_i)} . \end{aligned}$$

In the general case for \(R>1\), the final expression for the predictive influence \(\varvec{\pi }(\varvec{x}_i)\in \mathbb{R }^{P\times 1}\) of a point \(\varvec{x}_i\) under a PCA model then has the following form:

$$\begin{aligned} \varvec{\pi }(\varvec{x}_i;\varvec{V}) = \varvec{e}^{(R)}_{-i} \left( \sum _{r=1}^{R} \frac{ \left( \varvec{I}_P - \varvec{v}^{(r)}{\varvec{v}^{(r)}}^{\top }\right) }{\left( 1-h^{(r)}_i\right) } - (R-1) \right) . \end{aligned}$$
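The expression above can be checked numerically. The following sketch (an illustration, not the authors' released software) computes the predictive influence for a single-component (\(R=1\)) PCA model; it assumes the leverage takes the usual PRESS form \(h_i = t_i^2/\sum_j t_j^2\) with scores \(t = Xv\), i.e. \(D = (\varvec{v}^{\top }\varvec{X}^{\top }\varvec{X}\varvec{v})^{-1}\) in the notation used here:

```python
import numpy as np

def predictive_influence(X, v):
    """Predictive influence pi(x_i; v) for a single-component (R = 1) PCA model.

    Implements the closed form derived above:
        pi(x_i) = e_{-i} (I - v v^T) / (1 - h_i) = e_i / (1 - h_i)^2,
    where the simplification uses e_i v = 0.
    Assumed leverage: h_i = t_i^2 / sum_j t_j^2 with scores t = X v.
    """
    t = X @ v                            # scores t_i = x_i v (the d_i above)
    h = t**2 / (t @ t)                   # leverage terms h_i
    E = X - np.outer(t, v)               # reconstruction errors e_i (rows)
    return E / (1.0 - h)[:, None]**2     # rows pi(x_i; v) = e_i / (1 - h_i)^2

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
v = np.linalg.svd(X, full_matrices=False)[2][0]   # leading PCA direction
Pi = predictive_influence(X, v)
# e_i v = 0, so every predictive influence is orthogonal to v as well
assert np.allclose(Pi @ v, 0.0)
```

Because \(\varvec{e}_i\varvec{v}=0\), the \((\varvec{I}_P-\varvec{v}\varvec{v}^{\top })\) factor can be dropped for \(R=1\), which is exactly the simplification used at the start of Appendix 2.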

Appendix 2: Proof of Lemma 1

From Appendix 1, for \(R=1\), the predictive influence of a point \(\varvec{\pi }({\varvec{x}_i};\varvec{v})\) is

$$\begin{aligned} \varvec{\pi }(\varvec{x}_i;\varvec{v}) =\frac{\varvec{e}_{i}}{(1-h_i)^2}. \end{aligned}$$
(29)

This is simply the \(i{\text{ th }}\) leave-one-out error scaled by a further factor of \(1/(1-h_i)\). If we define a diagonal matrix \(\varvec{\varXi }\in \mathbb{R }^{N\times N}\) with diagonal entries \({\varXi }_{i} = (1-h_i)^2\), we can collect the predictive influences into a matrix \(\varvec{\Pi }\in \mathbb{R }^{N\times P}\) whose rows are \(\varvec{\Pi }=[\varvec{\pi }(\varvec{x}_1;\varvec{v}) ^{\top },\ldots , \varvec{\pi }(\varvec{x}_N;\varvec{v}) ^{\top }]^{\top }\). This matrix has the form

$$\begin{aligned} \varvec{\Pi } = \varvec{\varXi }^{-1}\left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) . \end{aligned}$$

Now, solving (21) is equivalent to minimising the squared Frobenius norm,

$$\begin{aligned}&\min _{\varvec{v}} \text{ Tr } \left( \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) ^{\top }\varvec{\varXi }^{-2} \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) \right) \nonumber \\&\text{ subject } \text{ to } \left\| \varvec{v} \right\| =1 . \end{aligned}$$
(30)

Expanding the terms within the trace we obtain

$$\begin{aligned} \text{ Tr } \left( \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) ^{\top }\varvec{\varXi }^{-2} \left( \varvec{X} - \varvec{X}\varvec{vv}^{\top }\right) \right)&= \text{ Tr } \left( \varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) - 2\text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) \nonumber \\&+ \text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X}\varvec{vv}^{\top }\right) . \end{aligned}$$

By the properties of the trace, the following equalities hold

$$\begin{aligned} \text{ Tr }\left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X} \right) = \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \varvec{v}, \end{aligned}$$

and

$$\begin{aligned} \text{ Tr } \left( \varvec{vv}^{\top }\varvec{X} ^{\top }\varvec{\varXi }^{-2} \varvec{X}\varvec{vv}^{\top }\right)&= \text{ Tr }\left( \varvec{\varXi }^{-1}\varvec{X}\varvec{vv}^{\top }\varvec{vv}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-1}\right) \\&= \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} \varvec{v}, \end{aligned}$$

since \(\varvec{\varXi }\) is diagonal and \(\varvec{v}^{\top }\varvec{v}=1\). Therefore, (30) is equivalent to

$$\begin{aligned}&\min _{\varvec{v}} \text{ Tr } \varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X} - \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv} , \nonumber \\&\text{ subject } \text{ to } ~ \left\| \varvec{v} \right\| =1 . \end{aligned}$$
(31)

It can be seen that, under this constraint, (31) is minimised when \(\varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv}\) is maximised, which, for a fixed \(\varvec{\varXi }\), is achieved when \(\varvec{v}\) is the eigenvector corresponding to the largest eigenvalue of \(\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X}\).
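The characterisation in Lemma 1 is easy to verify numerically: for a fixed diagonal weight matrix \(\varvec{\varXi }\), the weighted reconstruction objective in (30) is minimised by the top eigenvector of \(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\). A small sketch follows (illustrative only; the weights are simulated rather than computed from leverages):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 100, 6
X = rng.standard_normal((N, P))
# stand-in weights: Xi_i = (1 - h_i)^2 would come from the leverages;
# here we simply draw fixed values in (0, 1]
xi = rng.uniform(0.5, 1.0, size=N)
W = np.diag(xi**-2)                       # the matrix Xi^{-2}

def objective(v):
    """Squared Frobenius norm in (30): Tr((X - X v v^T)^T Xi^{-2} (X - X v v^T))."""
    R = X - X @ np.outer(v, v)
    return np.trace(R.T @ W @ R)

# Lemma 1: the minimiser is the eigenvector of X^T Xi^{-2} X
# associated with its largest eigenvalue
M = X.T @ W @ X
v_star = np.linalg.eigh(M)[1][:, -1]      # eigh sorts eigenvalues ascending

# no random unit vector should achieve a smaller objective value
for _ in range(200):
    u = rng.standard_normal(P)
    u /= np.linalg.norm(u)
    assert objective(v_star) <= objective(u) + 1e-9
```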

Appendix 3: Proof of Lemma 2

In this section we provide a proof of Lemma 2. As an additional consequence of this proof, we develop an upper bound for the approximation error which can be shown to depend on the leverage terms. We derive this result for a single cluster, \(\mathcal{C }^{(\tau )}\); however, it holds for all clusters.

We represent the assignment of points \(i=1,\ldots ,N\) to a cluster \(\mathcal{C }^{(\tau )}\) using a binary-valued diagonal matrix \(\varvec{A}\) whose diagonal entries are given by

$$\begin{aligned} A_{i}= \left\{ \begin{array}{ll} 1, &{} \text{ if } i\in \mathcal{C }^{(\tau )} \\ 0,&{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(32)

where \(\text{ Tr }(\varvec{A})=N_k\). We have shown in Lemma 1 that for a given cluster assignment, the parameters which optimise the objective function can be estimated by computing the SVD of the matrix

$$\begin{aligned} \sum _{i\in \mathcal{C }_k^{(\tau )}} \varvec{x}_i^{\top }{\varXi }_{i}^{-2} \varvec{x}_i = \varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{A} \varvec{X} , \end{aligned}$$
(33)

within each cluster, where the \(i{\text{ th }}\) diagonal element of \({\varvec{\varXi }}\) is \(\varXi _{i}=(1-h_i)^2\le 1\), so that \(\varXi _{i}^{-2}\ge 1\). We can then write \({\varvec{\varXi }}^{-2} = \varvec{I}_N + \varvec{\varPhi }\), where \(\varvec{\varPhi }\in \mathbb{R }^{N\times N}\) is a diagonal matrix with entries \(\varPhi _{i}=\phi _i\ge 0\). Now, we can represent Eq. (33) at the next iteration as

$$\begin{aligned} \varvec{M} = \varvec{X}^{\top }\varvec{A}(\varvec{I}_N + \varvec{\varPhi })\varvec{X} . \end{aligned}$$
(34)

We can quantify the difference between the optimal parameter \(\varvec{v}^{*}\), obtained by solving (22) using \(\varvec{M}\), and the new PCA parameter estimated at the current iteration, \(\varvec{v}^{(\tau )}\), as

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )})= {\varvec{v}^{*}}^{\top }\varvec{M} \varvec{v}^{*} - {\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )}, \end{aligned}$$

where \(\varvec{v}^{(\tau )}\) is obtained through the SVD of \(\varvec{X}^{\top }\varvec{A}\varvec{X}\). We can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau )})\) in terms of the spectral norm of \(\varvec{M}\). Since the spectral norm of a matrix is equal to its largest singular value, we have \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )} =\left\| \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \). Since \(\varvec{\varPhi }\) is a diagonal matrix, its spectral norm is \(\left\| \varvec{\varPhi } \right\| = \max (\varvec{\varPhi })\). Similarly, \(\varvec{A}\) is a diagonal matrix with binary entries, so \(\left\| \varvec{A} \right\| = 1\). It follows that

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )})&\le \left\| \varvec{M} - \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \nonumber \\&= \left\| \varvec{X}^{\top }\varvec{A}\varvec{\varPhi }\varvec{X} \right\| \nonumber \\&\le \max (\varvec{\varPhi }) \left\| \varvec{X}^{\top }\varvec{X} \right\| . \end{aligned}$$
(35)

where the triangle and Cauchy–Schwarz inequalities have been used. In a similar way, we now quantify the difference between the optimal parameter and the old PCA parameter, \(\varvec{v}^{(\tau -1)}\),

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)}) = {\varvec{v}^{*}}^{\top }\varvec{M} \varvec{v}^{*} - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)}. \end{aligned}$$

Since \(\varvec{v}^{(\tau )}\) is the principal eigenvector of \(\varvec{X}^{\top }\varvec{A}\varvec{X}\), by definition \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A}\varvec{Xv}^{(\tau )}\) is maximised; therefore, we can represent the difference between the new and the old parameters as

$$\begin{aligned} E(\mathcal{S }^{(\tau )},\mathcal{S }^{(\tau -1)})={\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{Xv}^{(\tau )} - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{Xv}^{(\tau -1)}\ge 0. \end{aligned}$$

Using this quantity, we can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})\) as

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})&\le \left\| \varvec{M} \right\| - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)} \nonumber \\&\le \left\| \varvec{X}^{\top }\varvec{\varPhi } \varvec{A} \varvec{X} \right\| + \left\| \varvec{X}^{\top }\varvec{A} \varvec{X} \right\| - {\varvec{v}^{(\tau -1)}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau -1)} \nonumber \\&\le \max (\varvec{\varPhi } )\left\| \varvec{X}^{\top }\varvec{X} \right\| + E(\mathcal{S }^{(\tau )},\mathcal{S }^{(\tau -1)}). \end{aligned}$$
(36)

From (35) and (36) it is clear that

$$\begin{aligned} E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \le E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)}) . \end{aligned}$$
(37)

This proves Lemma 2.

The inequality in (37) implies that estimating the SVD using \(\varvec{X}^{\top }\varvec{A} \varvec{X}\) yields PCA parameters that are closer to the optimal values than those obtained at the previous iteration. Therefore, estimating a new PCA model after each cluster re-assignment step never increases the objective function. Furthermore, as the recovered clustering becomes more accurate, by definition there are fewer influential observations within each cluster. This implies that \(\max (\varvec{\varPhi } ) \rightarrow 0\), and so \( E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \rightarrow 0\).
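The monotonicity established above can be illustrated with a simplified alternating procedure. The sketch below is not the full PSC algorithm: it reassigns points by plain reconstruction error rather than by predictive influence, and it drops the \(\varvec{\varXi }\) weights when fitting each subspace, but it exhibits the same never-increasing objective across iterations:

```python
import numpy as np

rng = np.random.default_rng(2)
# two noisy one-dimensional subspaces in R^5
v1, v2 = np.linalg.qr(rng.standard_normal((5, 2)))[0].T
X = np.vstack([np.outer(rng.standard_normal(60), v1),
               np.outer(rng.standard_normal(60), v2)])
X += 0.05 * rng.standard_normal(X.shape)

def top_pc(Y):
    # leading right singular vector = first PCA direction of the cluster
    return np.linalg.svd(Y, full_matrices=False)[2][0]

# start from a deliberately corrupted assignment
labels = np.repeat([0, 1], 60)
labels[::7] = 1 - labels[::7]

history = []
for _ in range(10):
    # (1) re-estimate a one-component PCA basis per cluster (cf. Lemma 1)
    V = [top_pc(X[labels == k]) for k in range(2)]
    # (2) reassign each point to the subspace with the smaller residual
    res = np.stack([np.linalg.norm(X - X @ np.outer(v, v), axis=1) for v in V])
    labels = res.argmin(axis=0)
    history.append(res.min(axis=0).sum())

# mirrors (37): the objective never increases between iterations
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```

Each PCA re-estimation minimises the within-cluster residual for the current assignment, and each reassignment minimises every point's residual over clusters, so the recorded objective is non-increasing, which is the mechanism behind the convergence argument above.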


Cite this article

McWilliams, B., Montana, G. Subspace clustering of high-dimensional data: a predictive approach. Data Min Knowl Disc 28, 736–772 (2014). https://doi.org/10.1007/s10618-013-0317-y


Keywords

  • Subspace clustering
  • PCA
  • PRESS statistics
  • Variable selection
  • Model selection
  • Microarrays