Statistics and Computing, Volume 26, Issue 5, pp 1079–1099

Model-based clustering using copulas with applications

  • Ioannis Kosmidis
  • Dimitris Karlis

Abstract

The majority of model-based clustering techniques is based on multivariate normal models and their variants. In this paper copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: (i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and (ii) the explicit choice of marginal distributions for the clusters allows the modelling of multivariate data of various modes (either discrete or continuous) in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering applications. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that can fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed, and for continuous, real-valued data parametric rotations in the sample space are introduced, with a parallel discussion on parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.
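As a rough illustration of the kind of model the abstract describes, the following minimal sketch evaluates a two-component copula-based mixture for bivariate continuous data and computes the E-step posterior membership probabilities. Each component density is taken to be a copula density evaluated at the marginal distribution functions, multiplied by the marginal densities. The Gaussian copula, the exponential and normal marginals, the function names, and all parameter values are assumptions chosen for illustration; none of them are taken from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def gaussian_copula_density(u, v, rho):
    """Bivariate Gaussian copula density c(u, v; rho) for u, v strictly in (0, 1)."""
    x, y = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    one_minus_r2 = 1.0 - rho * rho
    exponent = -(rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * one_minus_r2)
    return math.exp(exponent) / math.sqrt(one_minus_r2)


def component_density(y1, y2, comp):
    """One mixture component: copula density times the two marginal densities.

    Illustrative assumption: y1 > 0 has an exponential marginal and y2 a
    normal marginal. Extreme values can push u or v to exactly 0 or 1, in
    which case inv_cdf raises an error; this sketch does not guard against that.
    """
    rate = comp["exp_rate"]
    normal = NormalDist(comp["norm_mean"], comp["norm_sd"])
    u = -math.expm1(-rate * y1)       # exponential CDF, 1 - exp(-rate * y1)
    v = normal.cdf(y2)                # normal CDF
    f1 = rate * math.exp(-rate * y1)  # exponential density
    f2 = normal.pdf(y2)               # normal density
    return gaussian_copula_density(u, v, comp["rho"]) * f1 * f2


def responsibilities(y1, y2, weights, comps):
    """E-step: posterior probability that (y1, y2) belongs to each component."""
    joint = [w * component_density(y1, y2, c) for w, c in zip(weights, comps)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical parameter values, for illustration only.
weights = [0.4, 0.6]
comps = [
    {"exp_rate": 2.0, "norm_mean": 0.0, "norm_sd": 1.0, "rho": 0.6},
    {"exp_rate": 0.2, "norm_mean": 4.0, "norm_sd": 1.0, "rho": -0.3},
]
post = responsibilities(0.3, 0.1, weights, comps)
print(post)
```

In a full EM iteration these posterior probabilities would weight the observations in the M-step, where the mixing proportions, marginal parameters, and copula parameters are updated. Because the marginals are chosen explicitly, swapping in a different marginal family only changes `component_density`.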

Keywords

Mixture models, Dependence modelling, Parametric rotations, Multivariate discrete data, Mixed-domain data

Supplementary material

11222_2015_9590_MOESM1_ESM.pdf (96 kb)
The supplementary material extends Example 4.2 to illustrate that distinct, sensible transformations can lead to different results. R scripts that reproduce the analyses undertaken in this paper are available upon request from the authors.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Statistical Science, University College London, London, UK
  2. Department of Statistics, Athens University of Economics and Business, Athens, Greece
