Abstract
In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.
Acknowledgments
The authors would like to thank the anonymous referees for their helpful comments and the EPSRC (Engineering and Physical Sciences Research Council) for funding this project.
Additional information
Responsible editor: Ian Davidson.
Appendices
Appendix 1: Derivation of predictive influence
Using the chain rule, the gradient of the PRESS for a single latent factor is
For notational convenience we drop the superscript in the following. Using the quotient rule, the partial derivative of the \(i\)th leave-one-out error has the following form
which depends on the partial derivatives of the \(i\)th reconstruction error and of the \(h_i\) quantities with respect to the observation \(\varvec{x}_i\). The computation of these two partial derivatives is straightforward; they are, respectively
and
The derivative of the PRESS, \(J\), with respect to \(\varvec{x}_i\) is then
However, examining the second term in the sum, \(\varvec{e}_i \varvec{v} D d_i\), we notice that
Substituting this result back in Eq. (28), the gradient of the PRESS for a single PCA component with respect to \(\varvec{x}_i\) is given by
In the general case for \(R>1\), the final expression for the predictive influence \(\varvec{\pi }(\varvec{x}_i)\in \mathbb{R }^{P\times 1}\) of a point \(\varvec{x}_i\) under a PCA model then has the following form:
Appendix 2: Proof of Lemma 1
From Appendix 1, for \(R=1\), the predictive influence \(\varvec{\pi }({\varvec{x}_i};\varvec{v})\) of a point \(\varvec{x}_i\) is
This is simply the \(i\)th leave-one-out error scaled by \(1-h_i\). If we define a diagonal matrix \(\varvec{\varXi }\in \mathbb{R }^{N\times N}\) with diagonal entries \({\varXi }_{i} = (1-h_i)^2\), we can collect the predictive influences into a matrix \(\varvec{\Pi }\in \mathbb{R }^{N\times P}\) whose rows are the predictive influences, \(\varvec{\Pi }=[\varvec{\pi }(\varvec{x}_1;\varvec{v}) ^{\top },\ldots , \varvec{\pi }(\varvec{x}_N;\varvec{v}) ^{\top }]^{\top }\). This matrix has the form
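One form consistent with the trace expansion used in the remainder of this proof (assumed here, up to a multiplicative constant) is

```latex
\varvec{\Pi } \;=\; \varvec{\varXi }^{-1}\left(\varvec{X}-\varvec{X}\varvec{v}\varvec{v}^{\top }\right),
```

so that \(\Vert \varvec{\Pi }\Vert _F^2 = \text{Tr}\left(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\right) - \varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\varvec{v}\), term-by-term consistent with the trace identities invoked below.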
Now, solving (21) is equivalent to minimising the squared Frobenius norm,
Expanding the terms within the trace we obtain
By the properties of the trace, the following equalities hold
and
since \(\varvec{\varXi }\) is diagonal and \(\varvec{v}^{\top }\varvec{v}=1\). Therefore, (30) is equivalent to
It can be seen that, under this constraint, (31) is minimised when \(\varvec{v}^{\top }\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{Xv}\) is maximised which, for a fixed \(\varvec{\varXi }\), is achieved when \(\varvec{v}\) is the eigenvector corresponding to the largest eigenvalue of \(\varvec{X}^{\top }\varvec{\varXi }^{-2} \varvec{X}\).
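As an illustrative sketch of this fixed-\(\varvec{\varXi }\) update, the snippet below computes \(\varvec{v}\) as the principal eigenvector of \(\varvec{X}^{\top }\varvec{\varXi }^{-2}\varvec{X}\) with \(\varXi _i=(1-h_i)^2\). The leverage values `h` are taken as given inputs, and the function name is ours, not from the paper.

```python
import numpy as np

def leading_direction(X, h):
    """Sketch of the Lemma 1 update: return the eigenvector of
    X^T Xi^{-2} X with the largest eigenvalue, where Xi_i = (1 - h_i)^2."""
    # Xi_i = (1 - h_i)^2, hence the diagonal of Xi^{-2} is (1 - h_i)^{-4}.
    Xi_inv2 = np.diag((1.0 - h) ** -4)
    M = X.T @ Xi_inv2 @ X
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, -1]                 # unit-norm top eigenvector
```

Note that each observation enters the eigenproblem with weight \((1-h_i)^{-4}\ge 1\), reflecting the bound \(\varXi _i^{-2}\ge 1\) used in Appendix 3.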
Appendix 3: Proof of Lemma 2
In this section we provide a proof of Lemma 2. As an additional consequence of this proof, we develop an upper bound for the approximation error which can be shown to depend on the leverage terms. We derive this result for a single cluster, \(\mathcal{C }^{(\tau )}\); however, it holds for all clusters.
We represent the assignment of points \(i=1,\ldots ,N\) to a cluster, \(\mathcal{C }^{(\tau )}\) using a binary valued diagonal matrix \(\varvec{A}\) whose diagonal entries are given by
where \(\text{ Tr }(\varvec{A})=N_k\). We have shown in Lemma 1 that for a given cluster assignment, the parameters which optimise the objective function can be estimated by computing the SVD of the matrix
within each cluster, where the \(i\)th diagonal element of \({\varvec{\varXi }}\) is \(\varXi _{i}=(1-h_i)^2\le 1\), so that \(\varXi _{i}^{-2}\ge 1\). We can then write \({\varvec{\varXi }}^{-2} = \varvec{I}_N + \varvec{\varPhi }\), where \(\varvec{\varPhi }\in \mathbb{R }^{N\times N}\) is a diagonal matrix with entries \(\varPhi _{i}=\phi _i\ge 0\). Now, we can represent Eq. (33) at the next iteration as
We can quantify the difference between the optimal parameter \(\varvec{v}^{*}\), obtained by solving (22) using \(\varvec{M}\), and the new PCA parameter \(\varvec{v}^{(\tau )}\), estimated at iteration \(\tau +1\), as
where \(\varvec{v}^{(\tau )}\) is obtained through the SVD of \( \varvec{X}^{\top }\varvec{A}\varvec{X} \). We can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau )})\) in terms of the spectral norm of \(\varvec{M}\). Since the spectral norm of a matrix is equivalent to its largest singular value, we have \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A} \varvec{X}\varvec{v}^{(\tau )} =\left\| \varvec{X}^{\top }\varvec{A}\varvec{X} \right\| \). Since \(\varvec{\varPhi }\) is a diagonal matrix, its spectral norm is \(\left\| \varvec{\varPhi } \right\| = \max (\varvec{\varPhi })\). Similarly, \(\varvec{A}\) is a diagonal matrix with binary-valued entries, so \(\left\| \varvec{A} \right\| = 1\).
where the triangle and Cauchy–Schwarz inequalities have been used. In a similar way, we now quantify the difference between the optimal parameter and the old PCA parameter \(\varvec{v}^{(\tau -1)}\),
Since \(\varvec{v}^{(\tau )}\) is the principal eigenvector of \(\varvec{X}^{\top }\varvec{A}\varvec{X}\), by definition, \({\varvec{v}^{(\tau )}}^{\top }\varvec{X}^{\top }\varvec{A}\varvec{Xv}^{(\tau )}\) is maximised, therefore we can represent the difference between the new parameters and the old parameters as
Using this quantity, we can express \(E(\mathcal{S }^*,\mathcal{S }^{(\tau -1)})\) as
From (36) and (35) it is clear that
This proves Lemma 2.
The inequality in (37) implies that estimating the SVD using \(\varvec{X}^{\top }\varvec{A} \varvec{X}\) yields PCA parameters which are closer to the optimal values than those obtained at the previous iteration. Therefore, estimating a new PCA model after each cluster re-assignment step never increases the objective function. Furthermore, as the recovered clustering becomes more accurate, by definition there are fewer influential observations within each cluster. This implies that \(\max (\varvec{\varPhi } ) \rightarrow 0\), and so \( E(\mathcal{S }^*,\mathcal{S }^{(\tau )}) \rightarrow 0\).
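To make the alternation concrete, the following is a schematic Python sketch of the loop the lemma concerns. It is an illustration under stated assumptions, not the authors' implementation: the leverage form \(h_i=(\varvec{z}_i^{\top }\varvec{v})^2/d_1^2\), the mean-centring, and the assignment score \(\Vert \varvec{e}_i\Vert ^2/(1-h_i)^4\) are simplifications introduced here.

```python
import numpy as np

def psc_sketch(X, K, n_iter=25, seed=0):
    """Schematic alternating scheme: (i) fit a one-component PCA model per
    cluster, (ii) re-assign each point to the cluster whose model gives it
    the smallest squared predictive-influence score (assumed form)."""
    N, _ = X.shape
    labels = np.random.default_rng(seed).integers(0, K, N)
    for _ in range(n_iter):
        scores = np.full((N, K), np.inf)
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) < 2:
                continue  # degenerate cluster: leave its column at inf
            mu = Xk.mean(axis=0)
            _, D, Vt = np.linalg.svd(Xk - mu, full_matrices=False)
            v = Vt[0]                       # first principal direction
            Z = X - mu
            E = Z - np.outer(Z @ v, v)      # reconstruction errors
            h = np.clip((Z @ v) ** 2 / D[0] ** 2, 0.0, 0.99)  # assumed leverages
            scores[:, k] = (E ** 2).sum(axis=1) / (1.0 - h) ** 4
        new_labels = scores.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                           # assignments stable: converged
        labels = new_labels
    return labels
```

The early exit when assignments stop changing mirrors the monotonicity argument above: each re-fit never increases the objective, so stable labels indicate a fixed point.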
McWilliams, B., Montana, G. Subspace clustering of high-dimensional data: a predictive approach. Data Min Knowl Disc 28, 736–772 (2014). https://doi.org/10.1007/s10618-013-0317-y