Data Mining and Knowledge Discovery

Volume 28, Issue 3, pp 736–772

Subspace clustering of high-dimensional data: a predictive approach

Abstract

In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.
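
The alternation the abstract describes (fit one PCA model per cluster, then reassign observations) can be illustrated with a minimal sketch. This is not the PSC algorithm itself: the influence-based objective of PSC is not given in this abstract, so the sketch substitutes the plain squared reconstruction error, in the spirit of k-plane or K-subspace clustering. The function names `fit_pca`, `residuals` and `k_subspaces`, and all parameter names, are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return (mean, basis) of the best-fitting affine subspace for the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def residuals(X, mean, basis):
    """Squared distance from each row of X to the affine subspace (mean, basis)."""
    centred = X - mean
    projected = centred @ basis.T @ basis
    return np.sum((centred - projected) ** 2, axis=1)


def k_subspaces(X, n_clusters, n_components, n_iter=50, seed=0):
    """Alternate cluster-wise PCA fits with reassignment by reconstruction error.

    Assumption: plain reconstruction error stands in for the PSC objective.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Balanced random start so every cluster is non-empty (needs n >= n_clusters).
    labels = rng.permutation(n) % n_clusters
    models = [None] * n_clusters
    history = []  # objective value after each reassignment step
    for _ in range(n_iter):
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # an emptied cluster keeps its previous model
                models[k] = fit_pca(members, n_components)
        errors = np.column_stack([residuals(X, m, B) for m, B in models])
        new_labels = errors.argmin(axis=1)
        history.append(float(errors.min(axis=1).sum()))
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the models would not change
        labels = new_labels
    return labels, models, history
```

Calling `k_subspaces(X, n_clusters=2, n_components=1)` on an `(n, p)` array returns the cluster labels, the fitted `(mean, basis)` pairs and the objective history. Because each step can only lower the total reconstruction error, the history is non-increasing, but the result depends on the random start and may be a local minimum.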

Keywords

Subspace clustering, PCA, PRESS statistics, Variable selection, Model selection, Microarrays

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Department of Mathematics, Imperial College London, London, UK
  2. Department of Informatics, ETH, Zürich, Switzerland
