High-dimensional variable selection with the plaid mixture model for clustering
In high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a method for analyzing such data that performs clustering and variable selection simultaneously. The method is inspired by the plaid model and may be seen as a multiplicative mixture model that allows for overlapping clusters. Unlike conventional clustering, this model allows an observation to be explained by several clusters, which makes it especially suitable for gene expression data. Parameters are estimated with the Monte Carlo expectation maximization algorithm combined with importance sampling. Through extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to gene expression data on kidney renal cell carcinoma from The Cancer Genome Atlas validates some previously identified cancer biomarkers.
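As a rough illustration of the two ideas named above (overlapping membership and importance sampling inside an E-step), the sketch below uses a deliberately simplified toy model, not the model or algorithm of the paper. It assumes a single scalar observation `x` drawn from a normal distribution whose mean is a baseline `mu0` plus the effects `mu[k]` of every cluster the observation belongs to, with independent Bernoulli priors on the binary memberships. All names, parameter values, and the uniform proposal are assumptions made for this example. It estimates the posterior membership probabilities by self-normalized importance sampling and compares them with exact enumeration, which is feasible only because the toy has three clusters.

```python
import itertools
import math
import random


def log_joint(x, rho, mu0, mu, sigma, prior):
    """log p(x, rho) for the toy model x ~ N(mu0 + sum_k rho_k * mu_k, sigma^2)."""
    mean = mu0 + sum(r * m for r, m in zip(rho, mu))
    log_lik = -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    log_prior = sum(math.log(p) if r else math.log(1 - p) for r, p in zip(rho, prior))
    return log_lik + log_prior


def normalized_weights(log_w):
    """Exponentiate log-weights stably and normalize them to sum to one."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def is_membership(x, mu0, mu, sigma, prior, n_samples=20000, seed=0):
    """Self-normalized importance sampling estimate of P(rho_k = 1 | x)."""
    rng = random.Random(seed)
    k = len(mu)
    # Proposal (an assumption of this sketch): independent Bernoulli(0.5) memberships.
    samples = [tuple(int(rng.random() < 0.5) for _ in range(k)) for _ in range(n_samples)]
    log_q = k * math.log(0.5)  # same proposal log-density for every configuration
    w = normalized_weights(
        [log_joint(x, rho, mu0, mu, sigma, prior) - log_q for rho in samples]
    )
    return [sum(wi * rho[j] for wi, rho in zip(w, samples)) for j in range(k)]


def exact_membership(x, mu0, mu, sigma, prior):
    """Exact P(rho_k = 1 | x) by enumerating all 2^K membership configurations."""
    k = len(mu)
    configs = list(itertools.product([0, 1], repeat=k))
    p = normalized_weights([log_joint(x, c, mu0, mu, sigma, prior) for c in configs])
    return [sum(pi * c[j] for pi, c in zip(p, configs)) for j in range(k)]


# Hypothetical parameter values chosen only for illustration.
mu0, mu, sigma, prior = 0.0, [2.0, -1.0, 3.0], 1.0, [0.3, 0.3, 0.3]
x = 5.2  # close to mu[0] + mu[2], so two clusters jointly explain it

estimate = is_membership(x, mu0, mu, sigma, prior)
exact = exact_membership(x, mu0, mu, sigma, prior)
print("importance sampling:", [round(v, 3) for v in estimate])
print("exact enumeration:  ", [round(v, 3) for v in exact])
```

In this toy setting the first and third clusters both receive high posterior membership probability for the same observation, which is the kind of overlap a conventional hard partition cannot express. In a Monte Carlo EM scheme, estimates of this kind would stand in for the intractable E-step expectations when enumeration is no longer possible.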
Keywords: Classification, Model selection, Multiplicative mixture model, Monte Carlo EM, Kidney cancer genomic data
The authors are grateful to LeeAnn Chastain at MD Anderson Cancer Center for editing assistance.