Model based clustering for mixed data: clustMD

McParland, Damien; Gormley, Isobel Claire

doi:10.1007/s11634-016-0238-x

Regular Article
Published: 12 February 2016

Advances in Data Analysis and Classification, volume 10, pages 155–169 (2016)

Abstract

A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
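The generative idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation or the clustMD package API. It assumes a diagonal covariance within each cluster (only one possible parsimonious structure), treats a continuous variable as an observed latent dimension, and obtains an ordinal variable by cutting a latent dimension at fixed thresholds; a binary variable is the two-level special case. Nominal variables are omitted. The function name `simulate_mixed` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_mixed(n, weights, means, sds, thresholds, seed=0):
    """Simulate one continuous and one ordinal variable from a latent
    two-dimensional Gaussian mixture (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)  # mixing proportions, shape (G,)
    means = np.asarray(means, dtype=float)      # cluster means, shape (G, 2)
    sds = np.asarray(sds, dtype=float)          # cluster std devs, shape (G, 2)

    # Draw a cluster label for each observation.
    labels = rng.choice(len(weights), size=n, p=weights)

    # Draw the latent vector from the Gaussian of the assigned cluster
    # (independent dimensions, i.e. a diagonal covariance matrix).
    latent = rng.normal(means[labels], sds[labels])  # shape (n, 2)

    # Continuous variable: the latent value is observed directly.
    continuous = latent[:, 0]

    # Ordinal variable: the level is the interval between thresholds
    # into which the latent value falls, coded 1..K.
    ordinal = np.digitize(latent[:, 1], thresholds) + 1
    return labels, continuous, ordinal

# Two clusters, one continuous variable and one three-level ordinal variable.
labels, y_cont, y_ord = simulate_mixed(
    n=500,
    weights=[0.6, 0.4],
    means=[[-2.0, -1.5], [2.0, 1.5]],
    sds=[[1.0, 1.0], [1.0, 1.0]],
    thresholds=[-1.0, 1.0],
)
```

Clustering reverses this process: given only `y_cont` and `y_ord`, the mixture parameters and cluster memberships are estimated, which the abstract states is done with an EM algorithm.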

Acknowledgments

The authors wish to thank the coordinating editor and reviewers for their comments, which greatly improved this work. The authors would also like to thank the members of the Working Group in Model Based Clustering and the members of the Working Group in Statistical Learning for helpful discussions. This work is supported by Science Foundation Ireland under the Research Frontiers Programme (09/RFP/MTH2367) and the Insight Research Centre (SFI/12/RC/2289).

Author information

Authors and Affiliations

School of Mathematics and Statistics, University College Dublin, Dublin, Ireland
Damien McParland
School of Mathematics and Statistics and INSIGHT: The National Centre for Data Analytics, University College Dublin, Dublin, Ireland
Isobel Claire Gormley

Corresponding author

Correspondence to Isobel Claire Gormley.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 684 KB)

About this article

Cite this article

McParland, D., Gormley, I.C. Model based clustering for mixed data: clustMD. Adv Data Anal Classif 10, 155–169 (2016). https://doi.org/10.1007/s11634-016-0238-x

Received: 29 May 2014
Revised: 26 January 2016
Accepted: 31 January 2016
Published: 12 February 2016
Issue Date: June 2016
DOI: https://doi.org/10.1007/s11634-016-0238-x
