Abstract
Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for the data, to the extent that it should be impossible to provide a statistical outcome without specifying the couple (unit,model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or counting features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to raise three important spots: (i) the couple (unit,model) is not identifiable so that different interpretations unit/model of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models should be an interesting opportunity for a cheap, wide and meaningful expansion of the whole modeling process family designed by the couple (unit,model); (iii) if necessary, this couple, up to the non-identifiability property, could be selected by any traditional model selection criterion. Some experiments on real data sets illustrate in detail practical benefits arising from the previous three spots.
Similar content being viewed by others
Notes
It is obviously possible to weaken these assumptions by cancelling the variable-wise hypothesis defined in the right of (2). In particular, this relaxation would encompass important transformations such that the linear ones involved in principal component analysis (PCA), when all \(x_{ij}\in \mathbb {R}\). Nevertheless, restriction in the right of (2) has advantage to respect the variable definition, transforming only its unit.
Eight is the number of years spent by English pupils in a secondary school.
MixtComp is a clustering software developped by Biernacki C., Iovleff I. and Kubicki V. and freely available on the MASSICCC web platform https://modal-research-dev.lille.inria.fr/#/.
Be careful: The number of clusters in column (14 or 6) at this clustering step is totally unrelated to the number \(L=5\) of clusters involved in the co-clustering step that follows!
References
Andrews DF, Herzberg AM (1985) Data: a collection of problems from many. Fields for the student and research worker. Springer, Berlin
Andrews JL, Mcnicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat Comput 22(5):1021–1029
Atkinson A, Riani M (2007) Exploratory tools for clustering multivariate data. Comput Stat Data Anal 52(1):272–285
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Bertrand F, Droesbeke J-J, Saporta G, Thomas-Agnan C (2017) Model choice and model aggregation. Technip, Paris
Bhatia P, Iovleff S, Govaert G (2015) Blockcluster: an R package for model based co-clustering. J Stat Softw 76:1–24 (in press)
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725
Biernacki C, Jacques J (2013) A generative model for rank data based on insertion sort algorithm. Comput Stat Data Anal 58:162–176
Biernacki C, Jacques J (2016) Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm. Stat Comput 26(5):929–943
Biernacki C, Lourme A (2014) Stable and visualizable Gaussian parsimonious clustering models. Stat Comput 24(6):953–969
Bock H (1981) Statistical testing and evaluation methods in cluster analysis. In: Proceedings of the Indian Statistical Institute golden jubilee international conference on statistics: applications and new directions, Calcutta, pp 116–146
Byar D, Green S (1980) The choice of treatment for cancer patients based on covariate information: application to prostate cancer. Bull Cancer 67:477–490
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2(1):73–92
Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recogn 28(5):781–793
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data (with discussion). J R Stat Soc B 39:1–38
Gallopin M, Rau A, Celeux G, Jaffrézic F (2015) Transformation des données et comparaison de modèles pour la classification des données rna-seq. 47èmes Journées de Statistique de la SFdS
Ghahramani Z, Hinton G (1997) The EM algorithm for factor analyzers. Technical report, University of Toronto
Goodman LA (1974) Exploratory latent structure models using both identifiable and unidentifiable models. Biometrika 61:215–231
Govaert G (2009) Data analysis. ISTE-Wiley, Hoboken
Govaert G, Nadif M (2013) Co-clustering. Wiley, Hoboken
Hilbe JM (2014) Modeling count data. Cambridge University Press, Cambridge
Hunt L, Jorgensen M (1999) Mixture model clustering: a brief introduction to the multimix program. Aust N Z J Stat 41(2):153–171
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31:651–666
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Jorgensen M, Hunt L (1996) Mixture model clustering of data sets with categorical and continuous variables. In: Proceedings of the conference ISIS, pp 375–384
Keribin C, Brault V, Celeux G, Govaert G (2015) Estimation and selection for the latent block model on categorical data. Stat Comput 25(6):1201–1216
Krantz DH, Luce RD, Suppes P, Tversky A (1971) Foundations of measurement (additive and polynomial representations), vol 1. Academic Press, New York
Law MH, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166
Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. J Stat Softw 64:241–270 (in press)
Lee S, McLachlan G (2013) Emmixuskew: fitting unrestricted multivariate skew t mixture models. R package version 0.11-5
Little RJ A, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, Hoboken
Lomet A, Govaert G, Grandvalet Y (2012) Model selection in block clustering by the integrated classification likelihood. In: 20th International conference on computational statistics (COMPSTAT 2012), Lymassol, France, pp 519–530
Luce RD, Krantz DH, Suppes P, Tversky A (1990) Foundations of measurement, vol 3. Academic Press, New York
Manly BF (1976) Exponential data transformations. Statistician 25(1):37–42
Marbac M, Sedki M (2015) Variable selection for model-based clustering using the integrated complete-data likelihood. arXiv:1501.06314
Maugis C, Celeux G, Martin-Magniette M (2009a) Variable selection for clustering with Gaussian mixture models. Biometrics 65(3):701–709
Maugis C, Celeux G, Martin-Magniette M-L (2009b) Variable selection in model-based clustering: a general variable role modeling. Comput Stat Data Anal 53:3872–3882
McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York
McLachlan G, Peel D (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41:379–388
McNicholas P, Murphy T (2010) Model-based clustering of microarray expression data via latent gaussian mixture models. Bioinformatics 21(26):2705–2712
McNicholas PD (2016) Mixture model-based classification. Chapman and Hall, New York
McParland D, Gormley IC (2016) Model based clustering for mixed data: clustMD. Adv Data Anal Classif 10(2):155–169
Melnykov V, Maitra R (2010) Finite mixture models and model-based clustering. Stat Surv 4:80–116
Meynet C (2012) Sélection de variables pour la classification non supervisée en grande dimension. Ph.D. thesis, Université Paris-Sud 11
Meynet C, Maugis-Rabusseau C (2012) A sparse variable selection procedure in model-based clustering. Research report
Moustaki I, Papageorgiou I (2005) Latent class models for mixed variables with applications in archaeometry. Comput Stat Data Anal 48(3):65–675
Pan W, Shen X (2007) Penalized model-based clustering with application to variable selection. J Mach Learn Res 8:1145–1164
Prates MO, Lachos VH, Cabral C (2013) mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions. J Stat Softw 54(12):1–20
Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101(473):168–178
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850
Rao CR, Miller JP, Rao DC (2007) Handbook of statistics: epidemiology and medical statistics, vol 27. Elsevier, New York
Rau A, Maugis-Rabusseau C (2018) Transformation and model choice for RNA-seq co-expression analysis. Brief Bioinform 19(3):425–436
Rau A, Maugis-Rabusseau C, Martin-Magniette M-L, Celeux G (2015) Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics 31(9):1420–1427
Redner R, Walker H (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26(2):195–239
Schlimmer JC (1987) Concept acquisition through representational adjustment. Ph.D. thesis, Department of Information and Computer Science, University of California, Irvine, CA
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Seber GAF, Lee AJ (2012) Linear regression analysis, 2nd edn. Wiley, New Jersey
Sedki M, Celeux G, Maugis-Rabusseau C (2014) SelvarMix: a R package for variable selection in model-based clustering and discriminant analysis with a regularization approach. Research report
Suppes P, Krantz DH, Luce RD, Tversky A (1989) Foundations of measurement, vol 2. Academic Press, New York
Tadesse MG, Sha N, Vannucci M (2005) Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc 100(470):602–617
Thomas I, Frankhauser P, Biernacki C (2008) The morphology of built-up landscapes in Wallonia (Belgium): a classification using fractal indices. Landsc Urban Plan 84:99–115
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Wang K, McLachlan GJ, Ng SK, Peel D (2012) EMMIX-skew: EM Algorithm for Mixture of Multivariate Skew Normal/t Distributions. R code version 1.0.16. http://www.maths.uq.edu.au/~gjm/mix_soft/EMMIX-skew
Wolfe JH (1971) A monte carlo study of the sampling distribution of the likelihood ratio for mixtures of multinormal distributions. Technical Bulletin STB 72-2, US Naval Personnel Research Activity, San Diego, CA
Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987
Zhou H, Pan W, Shen X (2009) Penalized model-based clustering with unconstrained covariance matrices. Electron J Stat 3:1473–1496
Zhu X, Melnykov V (2016) Manly transformation in finite mixture modeling. Comput Stat Data Anal 121:190–208
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Biernacki, C., Lourme, A. Unifying data units and models in (co-)clustering. Adv Data Anal Classif 13, 7–31 (2019). https://doi.org/10.1007/s11634-018-0325-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11634-018-0325-2