Model based clustering for mixed data: clustMD
A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
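The generative idea described above can be illustrated with a short simulation. The following is only a minimal sketch and not the clustMD implementation: the two clusters, their means, the spherical covariance, the ordinal thresholds and the function name `simulate_mixed` are all hypothetical choices made for illustration. In particular, the rule used to map latent dimensions to a nominal level is an assumption of this sketch. Continuous variables are taken to be observed directly, while binary and ordinal variables arise by thresholding a latent Gaussian dimension.

```python
import random
from bisect import bisect_right


def simulate_mixed(n, seed=0):
    """Draw n observations of mixed type from a latent Gaussian mixture.

    All parameter values below are illustrative assumptions, not estimates
    taken from the paper.
    """
    rng = random.Random(seed)
    weights = [0.5, 0.5]                      # mixing proportions (assumed)
    means = [[-1.5, -1.0, -1.0, -1.0, -1.0],  # latent mean, cluster 0
             [1.5, 1.0, 1.0, 1.0, 1.0]]       # latent mean, cluster 1
    sd = 1.0                                  # spherical covariance (assumed)
    ordinal_cuts = [-1.0, 1.0]                # two cut points give 3 levels

    rows, labels = [], []
    for _ in range(n):
        # Pick a cluster, then draw the latent vector from its Gaussian.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        z = [rng.gauss(m, sd) for m in means[g]]

        continuous = z[0]                     # observed as is
        binary = int(z[1] > 0.0)              # threshold at zero
        ordinal = bisect_right(ordinal_cuts, z[2]) + 1  # level 1, 2 or 3

        # Nominal variable with 3 levels from 2 latent dimensions (assumed
        # rule): level 1 if both are negative, else 1 + position of the max.
        nominal_latent = z[3:5]
        if max(nominal_latent) < 0.0:
            nominal = 1
        else:
            nominal = nominal_latent.index(max(nominal_latent)) + 2

        rows.append((continuous, binary, ordinal, nominal))
        labels.append(g)
    return rows, labels


if __name__ == "__main__":
    data, cluster = simulate_mixed(5, seed=1)
    for row, g in zip(data, cluster):
        print(g, row)
```

Fitting the model runs in the opposite direction: the latent values and cluster memberships are unobserved and must be inferred from the mixed-type observations, which is the role of the EM and Monte Carlo EM algorithms mentioned above.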
Keywords: Latent variables, Mixture model, Mixed data, Monte Carlo EM
Mathematics Subject Classification: 62, 62-07, 62FXX, 62HXX, 62H30, 68T10, 91C20, 62P10
The authors wish to thank the coordinating editor and reviewers for their comments, which greatly improved this work. The authors would also like to thank the members of the Working Group in Model Based Clustering and the members of the Working Group in Statistical Learning for helpful discussions. This work is supported by Science Foundation Ireland under the Research Frontiers Programme (09/RFP/MTH2367) and the Insight Research Centre (SFI/12/RC/2289).