Abstract
In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.
Acknowledgments
The authors would like to thank the anonymous referees for their helpful comments and the EPSRC (Engineering and Physical Sciences Research Council) for funding this project.
Additional information
Responsible editor: Ian Davidson.
Appendices
Appendix 1: Derivation of predictive influence
Using the chain rule, the gradient of the PRESS for a single latent factor is
For notational convenience we drop the superscript in the following. Using the quotient rule, the partial derivative of the \(i\)th leave-one-out error has the following form
which depends on the partial derivatives of the \(i\)th reconstruction error and of the \(h_i\) quantities with respect to the observation \(\varvec{x}_i\). The computation of these two partial derivatives is straightforward; they are, respectively
and
The derivative of the PRESS, \(J\), with respect to \(\varvec{x}_i\) is then
However, examining the second term in the sum, \(\varvec{e}_i \varvec{v} D d_i\), we notice that
Substituting this result back in Eq. (28), the gradient of the PRESS for a single PCA component with respect to \(\varvec{x}_i\) is given by
In the general case for \(R>1\), the final expression for the predictive influence \(\varvec{\pi }(\varvec{x}_i)\in \mathbb{R }^{P\times 1}\) of a point \(\varvec{x}_i\) under a PCA model then has the following form:
Appendix 2: Proof of Lemma 1
From Appendix 1, for \(R=1\), the predictive influence \(\varvec{\pi }({\varvec{x}_i};\varvec{v})\) of a point \(\varvec{x}_i\) is
This is simply the \(i\)th leave-one-out error scaled by \(1-h_i\). If we define a diagonal matrix \(\varvec{\varXi }\in \mathbb{R }^{N\times N}\) with diagonal entries \({\varXi }_{i} = (1-h_i)^2\), we can collect the predictive influences into a matrix \(\varvec{\Pi }\in \mathbb{R }^{N\times P}\) whose rows are the predictive influences, \(\varvec{\Pi }=[\varvec{\pi }(\varvec{x}_1;\varvec{v}) ^{\top },\ldots , \varvec{\pi }(\varvec{x}_N;\varvec{v}) ^{\top }]^{\top }\). This matrix has the form
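One form consistent with the trace expansion used in the remainder of this proof (assumed here, up to a multiplicative constant) is

```latex
\varvec{\Pi } \;=\; \varvec{\varXi }^{-1}\left(\varvec{X}-\varvec{X}\varvec{v}\varvec{v}^{\top }\right),
```

so that \(\Vert \varvec{\Pi }\Vert _F^2 = \text{Tr}\left(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\right) - \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\varvec{v}\), term-by-term consistent with the trace identities invoked below.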
Now, solving (21) is equivalent to minimising the squared Frobenius norm,
Expanding the terms within the trace we obtain
By the properties of the trace, the following equalities hold
and
since \(\varvec{\varXi }\) is diagonal and \(\varvec{v}^{\top }\varvec{v}=1\). Therefore, (30) is equivalent to
It can be seen that, under this constraint, (31) is minimised when \(\varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv}\) is maximised which, for a fixed \(\varvec{\varXi }\), is achieved when \(\varvec{v}\) is the eigenvector corresponding to the largest eigenvalue of \(\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X}\).
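As an illustrative sketch of this fixed-\(\varvec{\varXi }\) update, the snippet below computes \(\varvec{v}\) as the principal eigenvector of \(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\) with \(\varXi _i=(1-h_i)^2\). The leverage values `h` are taken as given inputs, and the function name is ours, not from the paper.

```python
import numpy as np

def leading_direction(X, h):
    """Sketch of the Lemma 1 update: return the eigenvector of
    X^T Xi^{-2} X with the largest eigenvalue, where Xi_i = (1 - h_i)^2."""
    # Xi_i = (1 - h_i)^2, hence the diagonal of Xi^{-2} is (1 - h_i)^{-4}.
    Xi_inv2 = np.diag((1.0 - h) ** -4)
    M = X.T @ Xi_inv2 @ X
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, -1]                 # unit-norm top eigenvector
```

Note that each observation enters the eigenproblem with weight \((1-h_i)^{-4}\ge 1\), reflecting the bound \(\varXi _i^{-2}\ge 1\) used in Appendix 3.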
Appendix 3: Proof of Lemma 2
In this section we provide a proof of Lemma 2. As an additional consequence of this proof, we develop an upper bound for the approximation error which can be shown to depend on the leverage terms. We derive this result for a single cluster, \(\mathcal{C }^{(\tau )}\); however, it holds for all clusters.
We represent the assignment of points \(i=1,\ldots ,N\) to a cluster, \(\mathcal{C }^{(\tau )}\) using a binary valued diagonal matrix \(\varvec{A}\) whose diagonal entries are given by
where \(\text{ Tr }(\varvec{A})=N_k\). We have shown in Lemma 1 that for a given cluster assignment, the parameters which optimise the objective function can be estimated by computing the SVD of the matrix
within each cluster, where the \(i\)th diagonal element of \({\varvec{\varXi }}\) is \(\varXi _{i}=(1-h_i)^2\le 1\), so that \(\varXi _{i}^{-2}\ge 1\). We can then write \({\varvec{\varXi }}^{-2} = \varvec{I}_N + \varvec{\varPhi }\), where \(\varvec{\varPhi }\in \mathbb{R }^{N\times N}\) is a diagonal matrix with entries \(\varPhi _{i}=\phi _i\ge 0\). Now, we can represent Eq. (33) at the next iteration as
We can quantify the difference between the optimal parameter \(\varvec{v}^{*}\), obtained by solving (22) using \(\varvec{M}\), and the new PCA parameter \(\varvec{v}^{(\tau )}\), estimated at iteration \(\tau +1\), as
where \(\varvec{v}^{(\tau )}\) is obtained through the SVD of \( \varvec{X}^{\top }\varvec{A}\varvec{X} \). We can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau )})\) in terms of the spectral norm of \(\varvec{M}\). Since the spectral norm of a matrix is equivalent to its largest singular value, we have \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )} =\left\| \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \). Since \(\varvec{\varPhi }\) is a diagonal matrix, its spectral norm is \(\left\| \varvec{\varPhi } \right\| = \max (\varvec{\varPhi })\). Similarly, \(\varvec{A}\) is a diagonal matrix with binary-valued entries, so \(\left\| \varvec{A} \right\| = 1\).
where the triangle and Cauchy–Schwarz inequalities have been used. In a similar way, we now quantify the difference between the optimal parameter and the old PCA parameter \(\varvec{v}^{(\tau -1)}\),
Since \(\varvec{v}^{(\tau )}\) is the principal eigenvector of \(\varvec{X}^{\top }\varvec{A}\varvec{X}\), by definition, \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A}\varvec{Xv}^{(\tau )}\) is maximised, therefore we can represent the difference between the new parameters and the old parameters as
Using this quantity, we can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})\) as
From (36) and (35) it is clear that
This proves Lemma 2.
The inequality in (37) implies that estimating the SVD using \(\varvec{X}^{\top }\varvec{A} \varvec{X}\) yields PCA parameters which are closer to the optimal values than those obtained at the previous iteration. Therefore, estimating a new PCA model after each cluster re-assignment step never increases the objective function. Furthermore, as the recovered clustering becomes more accurate, by definition there are fewer influential observations within each cluster. This implies that \(\max (\varvec{\varPhi } ) \rightarrow 0\), and so \( E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \rightarrow 0\).
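To make the alternation concrete, the following is a schematic Python sketch of the loop the lemma concerns. It is an illustration under stated assumptions, not the authors' implementation: the leverage form \(h_i=(\varvec{z}_i^{\top }\varvec{v})^2/d_1^2\), the mean-centring, and the assignment score \(\Vert \varvec{e}_i\Vert ^2/(1-h_i)^4\) are simplifications introduced here.

```python
import numpy as np

def psc_sketch(X, K, n_iter=25, seed=0):
    """Schematic alternating scheme: (i) fit a one-component PCA model per
    cluster, (ii) re-assign each point to the cluster whose model gives it
    the smallest squared predictive-influence score (assumed form)."""
    N, _ = X.shape
    labels = np.random.default_rng(seed).integers(0, K, N)
    for _ in range(n_iter):
        scores = np.full((N, K), np.inf)
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) < 2:
                continue  # degenerate cluster: leave its column at inf
            mu = Xk.mean(axis=0)
            _, D, Vt = np.linalg.svd(Xk - mu, full_matrices=False)
            v = Vt[0]                       # first principal direction
            Z = X - mu
            E = Z - np.outer(Z @ v, v)      # reconstruction errors
            h = np.clip((Z @ v) ** 2 / D[0] ** 2, 0.0, 0.99)  # assumed leverages
            scores[:, k] = (E ** 2).sum(axis=1) / (1.0 - h) ** 4
        new_labels = scores.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                           # assignments stable: converged
        labels = new_labels
    return labels
```

The early exit when assignments stop changing mirrors the monotonicity argument above: each re-fit never increases the objective, so stable labels indicate a fixed point.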
McWilliams, B., Montana, G. Subspace clustering of high-dimensional data: a predictive approach. Data Min Knowl Disc 28, 736–772 (2014). https://doi.org/10.1007/s10618-013-0317-y