Unifying data units and models in (co-)clustering

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

Statisticians are already aware that any task (exploration, prediction) involving a modeling process depends heavily on the measurement units of the data, to the extent that it should be impossible to report a statistical outcome without specifying the couple (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering for possibly mixed data types (continuous and/or categorical and/or count features), and the opportunity is taken to revisit what the related data units are. This formalization raises three important points: (i) the couple (unit, model) is not identifiable, so that different unit/model interpretations of the same overall modeling process are always possible; (ii) combining different “classical” units with different “classical” models is an interesting opportunity for a cheap, wide and meaningful expansion of the family of modeling processes defined by the couple (unit, model); (iii) if necessary, this couple, up to the non-identifiability property, can be selected by any traditional model selection criterion. Experiments on real data sets illustrate in detail the practical benefits arising from these three points.
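Points (i) and (iii) can be illustrated with a minimal sketch that is not taken from the article. It assumes a single positive continuous feature, a one-component Gaussian model, and the BIC convention \(-2\log L + k\log n\) (lower is better). The toy data and all function names are hypothetical. The sketch checks that "Gaussian model on the log unit" and "log-normal model on the raw unit" give the same likelihood once the Jacobian of the unit change is accounted for, and then compares two (unit, model) couples on the same raw-unit scale.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log-density of a Gaussian N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)

def lognormal_logpdf(x, mu, sigma):
    """Closed-form log-density of a log-normal distribution at x > 0."""
    return (-math.log(x * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2))

def fit_gaussian(values):
    """Maximum-likelihood mean and standard deviation of a Gaussian."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(var)

def bic(loglik, n_params, n_obs):
    """BIC with the convention -2 log L + k log n (assumed here; lower is better)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Hypothetical positive, right-skewed toy data (illustration only).
xs = [1.2, 0.8, 2.5, 10.0, 0.5, 3.3, 7.1, 1.9]
n = len(xs)

# Couple A: (raw unit, Gaussian model).
mu_raw, sd_raw = fit_gaussian(xs)
loglik_raw = sum(normal_logpdf(x, mu_raw, sd_raw) for x in xs)

# Couple B: (log unit, Gaussian model). The term -log(x) is the log-Jacobian
# of x -> log(x), which expresses the likelihood as a density of the raw x.
logs = [math.log(x) for x in xs]
mu_log, sd_log = fit_gaussian(logs)
loglik_log_unit = sum(normal_logpdf(math.log(x), mu_log, sd_log) - math.log(x) for x in xs)

# Couple B': (raw unit, log-normal model) with the same parameters.
loglik_lognormal = sum(lognormal_logpdf(x, mu_log, sd_log) for x in xs)

# Point (i): B and B' are two readings of the same modeling process.
print("log unit + Gaussian :", loglik_log_unit)
print("raw unit + log-normal:", loglik_lognormal)

# Point (iii): both couples have 2 parameters and are compared on the raw scale.
bic_raw = bic(loglik_raw, 2, n)
bic_log = bic(loglik_log_unit, 2, n)
print("BIC (raw unit, Gaussian):", bic_raw)
print("BIC (log unit, Gaussian):", bic_log)
```

The Jacobian term is what makes the two criteria comparable: without it, the likelihoods would be densities of different variables and the comparison between units would be meaningless.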

Notes

  1. It is obviously possible to weaken these assumptions by dropping the variable-wise hypothesis stated on the right of (2). In particular, this relaxation would encompass important transformations such as the linear ones involved in principal component analysis (PCA), when all \(x_{ij}\in \mathbb{R}\). Nevertheless, the restriction on the right of (2) has the advantage of respecting the variable definition, transforming only its unit.

  2. Eight is the number of years spent by English pupils in a secondary school.

  3. MixtComp is a clustering software developed by Biernacki C., Iovleff I. and Kubicki V. and freely available on the MASSICCC web platform https://modal-research-dev.lille.inria.fr/#/.

  4. http://bowtie-bio.sourceforge.net/recount/.

  5. https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/.

  6. Be careful: the number of column clusters (14 or 6) at this clustering step is totally unrelated to the number \(L=5\) of clusters involved in the co-clustering step that follows!

  7. http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.

Corresponding author

Correspondence to Christophe Biernacki.

Cite this article

Biernacki, C., Lourme, A. Unifying data units and models in (co-)clustering. Adv Data Anal Classif 13, 7–31 (2019). https://doi.org/10.1007/s11634-018-0325-2

  • DOI: https://doi.org/10.1007/s11634-018-0325-2