Model based clustering for mixed data: clustMD

  • Damien McParland
  • Isobel Claire Gormley
Regular Article

Abstract

A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
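The generative idea in the abstract can be illustrated with a minimal sketch: latent vectors are drawn from a Gaussian mixture and then mapped to observed variables of different types. The link rules used here (keeping one latent dimension as the continuous variable, dichotomising another at zero, and cutting a third at fixed thresholds) are illustrative assumptions, not the specification of clustMD, and the function name `simulate_mixed_data`, the cut points, and the parameter values are hypothetical. Nominal variables are omitted.

```python
import numpy as np

def simulate_mixed_data(n, weights, means, covs, ordinal_cuts, seed=0):
    """Draw latent Gaussian-mixture vectors and map them to mixed-type data.

    Assumed link rules (for illustration only):
      column 0 -> observed directly as a continuous variable
      column 1 -> binary, 1 if the latent value exceeds 0
      column 2 -> ordinal, level given by the interval between `ordinal_cuts`
    """
    rng = np.random.default_rng(seed)
    # Cluster labels are drawn from the mixing proportions.
    labels = rng.choice(len(weights), size=n, p=weights)
    # Each latent vector comes from the Gaussian of its own cluster.
    latent = np.stack([rng.multivariate_normal(means[g], covs[g]) for g in labels])
    continuous = latent[:, 0]
    binary = (latent[:, 1] > 0.0).astype(int)
    # np.digitize gives the index of the interval containing each value;
    # adding 1 makes the ordinal levels start at 1.
    ordinal = np.digitize(latent[:, 2], ordinal_cuts) + 1
    return labels, continuous, binary, ordinal

# Hypothetical two-cluster example with well-separated means.
weights = [0.5, 0.5]
means = [np.array([-2.0, -2.0, -2.0]), np.array([2.0, 2.0, 2.0])]
covs = [np.eye(3), np.eye(3)]
labels, continuous, binary, ordinal = simulate_mixed_data(
    200, weights, means, covs, ordinal_cuts=[-1.0, 1.0]
)
```

Fitting such a model would go in the opposite direction, estimating the mixture parameters from the observed variables alone, which is where the EM algorithm mentioned in the abstract comes in.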

Keywords

Latent variables, Mixture model, Mixed data, Monte Carlo EM

Mathematics Subject Classification

62, 62-07, 62FXX, 62HXX, 62H30, 68T10, 91C20, 62P10

Notes

Acknowledgments

The authors wish to thank the coordinating editor and reviewers for their comments, which greatly improved this work. The authors would also like to thank the members of the Working Group in Model Based Clustering and the members of the Working Group in Statistical Learning for helpful discussions. This work is supported by Science Foundation Ireland under the Research Frontiers Programme (09/RFP/MTH2367) and the Insight Research Centre (SFI/12/RC/2289).

Supplementary material

Supplementary material 1: 11634_2016_238_MOESM1_ESM.pdf (PDF, 684 KB)

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. School of Mathematics and Statistics, University College Dublin, Dublin, Ireland
  2. School of Mathematics and Statistics and INSIGHT: The National Centre for Data Analytics, University College Dublin, Dublin, Ireland
