Unifying data units and models in (co-)clustering

  • Christophe Biernacki
  • Alexandre Lourme
Regular Article


Statisticians are well aware that any task (exploration, prediction) involving a modeling process depends heavily on the measurement units of the data, to the extent that no statistical outcome should be reported without specifying the couple (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or count features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to raise three important points: (i) the couple (unit, model) is not identifiable, so that different unit/model interpretations of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models should offer a cheap, wide and meaningful expansion of the whole family of modeling processes defined by the couple (unit, model); (iii) if necessary, this couple, up to the non-identifiability property, can be selected by any traditional model selection criterion. Some experiments on real data sets illustrate in detail the practical benefits arising from these three points.
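Points (i) and (iii) can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a single Gaussian component rather than a mixture, synthetic positive data, and BIC as the selection criterion, and all function names are hypothetical. It shows that a Gaussian model on the log unit and a log-normal model on the original unit give the same likelihood once the change-of-variables (Jacobian) term is included, and that couples with different units become comparable when expressed in one reference unit.

```python
import math
import random


def gaussian_loglik(values):
    """Maximized Gaussian log-likelihood (ML mean and variance) of a sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def lognormal_loglik(values):
    """Maximized log-normal log-likelihood, written directly in the original unit."""
    n = len(values)
    logs = [math.log(v) for v in values]
    mu = sum(logs) / n
    var = sum((l - mu) ** 2 for l in logs) / n
    return sum(
        -l - 0.5 * math.log(2 * math.pi * var) - (l - mu) ** 2 / (2 * var)
        for l in logs
    )


def bic(loglik, n_params, n):
    """Schwarz criterion in the 'smaller is better' convention."""
    return -2.0 * loglik + n_params * math.log(n)


# Synthetic positive data (an assumption made for illustration only).
random.seed(0)
x = [math.exp(random.gauss(1.0, 0.8)) for _ in range(500)]
n = len(x)

# Couple A: (unit = log x, model = Gaussian). To express its likelihood in the
# original unit, add the log-Jacobian of x -> log x, i.e. sum(-log x_i).
log_jacobian = -sum(math.log(v) for v in x)
ll_log_unit = gaussian_loglik([math.log(v) for v in x]) + log_jacobian

# Couple B: (unit = x, model = log-normal). Same modeling process as couple A.
ll_lognormal = lognormal_loglik(x)

# Couple C: (unit = x, model = Gaussian). A genuinely different process.
ll_identity = gaussian_loglik(x)

# Each couple has two free parameters (location and scale).
bic_log = bic(ll_log_unit, 2, n)
bic_identity = bic(ll_identity, 2, n)

print("A vs B log-likelihood gap:", abs(ll_log_unit - ll_lognormal))
print("BIC, Gaussian on log unit:", bic_log)
print("BIC, Gaussian on original unit:", bic_identity)
```

Couples A and B agree up to floating-point error, which is the non-identifiability in point (i). Couple C differs, and the criterion separates it from A only because both likelihoods are written in the same reference unit.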


Measurement units · Mixed data · Mixture models · Model selection · Non-identifiability

Mathematics Subject Classification




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. University of Lille, Inria and CNRS, Lille, France
  2. University of Bordeaux, Bordeaux, France
