Unifying data units and models in (co-)clustering

Biernacki, Christophe; Lourme, Alexandre

doi:10.1007/s11634-018-0325-2

Unifying data units and models in (co-)clustering

Regular Article
Published: 25 May 2018

Volume 13, pages 7–31, (2019)
Cite this article

Advances in Data Analysis and Classification Aims and scope Submit manuscript

271 Accesses
2 Citations
Explore all metrics

Abstract

Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for the data, to the extent that it should be impossible to provide a statistical outcome without specifying the couple (unit,model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or counting features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to raise three important spots: (i) the couple (unit,model) is not identifiable so that different interpretations unit/model of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models should be an interesting opportunity for a cheap, wide and meaningful expansion of the whole modeling process family designed by the couple (unit,model); (iii) if necessary, this couple, up to the non-identifiability property, could be selected by any traditional model selection criterion. Some experiments on real data sets illustrate in detail practical benefits arising from the previous three spots.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Notes

It is obviously possible to weaken these assumptions by cancelling the variable-wise hypothesis defined in the right of (2). In particular, this relaxation would encompass important transformations such that the linear ones involved in principal component analysis (PCA), when all \(x_{ij}\in \mathbb {R}\). Nevertheless, restriction in the right of (2) has advantage to respect the variable definition, transforming only its unit.
Eight is the number of years spent by English pupils in a secondary school.
MixtComp is a clustering software developped by Biernacki C., Iovleff I. and Kubicki V. and freely available on the MASSICCC web platform https://modal-research-dev.lille.inria.fr/#/.
http://bowtie-bio.sourceforge.net/recount/.
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/.
Be careful: The number of clusters in column (14 or 6) at this clustering step is totally unrelated to the number \(L=5\) of clusters involved in the co-clustering step that follows!
http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.

References

Andrews DF, Herzberg AM (1985) Data: a collection of problems from many. Fields for the student and research worker. Springer, Berlin
Book MATH Google Scholar
Andrews JL, Mcnicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat Comput 22(5):1021–1029
Article MathSciNet MATH Google Scholar
Atkinson A, Riani M (2007) Exploratory tools for clustering multivariate data. Comput Stat Data Anal 52(1):272–285
Article MathSciNet MATH Google Scholar
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Article MathSciNet MATH Google Scholar
Bertrand F, Droesbeke J-J, Saporta G, Thomas-Agnan C (2017) Model choice and model aggregation. Technip, Paris
Google Scholar
Bhatia P, Iovleff S, Govaert G (2015) Blockcluster: an R package for model based co-clustering. J Stat Softw 76:1–24 (in press)
Google Scholar
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725
Article Google Scholar
Biernacki C, Jacques J (2013) A generative model for rank data based on insertion sort algorithm. Comput Stat Data Anal 58:162–176
Article MathSciNet MATH Google Scholar
Biernacki C, Jacques J (2016) Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm. Stat Comput 26(5):929–943
Article MathSciNet MATH Google Scholar
Biernacki C, Lourme A (2014) Stable and visualizable Gaussian parsimonious clustering models. Stat Comput 24(6):953–969
Article MathSciNet MATH Google Scholar
Bock H (1981) Statistical testing and evaluation methods in cluster analysis. In: Proceedings of the Indian Statistical Institute golden jubilee international conference on statistics: applications and new directions, Calcutta, pp 116–146
Byar D, Green S (1980) The choice of treatment for cancer patients based on covariate information: application to prostate cancer. Bull Cancer 67:477–490
Google Scholar
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2(1):73–92
Google Scholar
Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recogn 28(5):781–793
Article Google Scholar
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data (with discussion). J R Stat Soc B 39:1–38
MATH Google Scholar
Gallopin M, Rau A, Celeux G, Jaffrézic F (2015) Transformation des données et comparaison de modèles pour la classification des données rna-seq. 47èmes Journées de Statistique de la SFdS
Ghahramani Z, Hinton G (1997) The EM algorithm for factor analyzers. Technical report, University of Toronto
Goodman LA (1974) Exploratory latent structure models using both identifiable and unidentifiable models. Biometrika 61:215–231
Article MathSciNet MATH Google Scholar
Govaert G (2009) Data analysis. ISTE-Wiley, Hoboken
Book MATH Google Scholar
Govaert G, Nadif M (2013) Co-clustering. Wiley, Hoboken
Book MATH Google Scholar
Hilbe JM (2014) Modeling count data. Cambridge University Press, Cambridge
Book Google Scholar
Hunt L, Jorgensen M (1999) Mixture model clustering: a brief introduction to the multimix program. Aust N Z J Stat 41(2):153–171
Article MATH Google Scholar
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31:651–666
Article Google Scholar
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
MATH Google Scholar
Jorgensen M, Hunt L (1996) Mixture model clustering of data sets with categorical and continuous variables. In: Proceedings of the conference ISIS, pp 375–384
Keribin C, Brault V, Celeux G, Govaert G (2015) Estimation and selection for the latent block model on categorical data. Stat Comput 25(6):1201–1216
Article MathSciNet MATH Google Scholar
Krantz DH, Luce RD, Suppes P, Tversky A (1971) Foundations of measurement (additive and polynomial representations), vol 1. Academic Press, New York
MATH Google Scholar
Law MH, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166
Article Google Scholar
Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. J Stat Softw 64:241–270 (in press)
Google Scholar
Lee S, McLachlan G (2013) Emmixuskew: fitting unrestricted multivariate skew t mixture models. R package version 0.11-5
Little RJ A, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, Hoboken
Book MATH Google Scholar
Lomet A, Govaert G, Grandvalet Y (2012) Model selection in block clustering by the integrated classification likelihood. In: 20th International conference on computational statistics (COMPSTAT 2012), Lymassol, France, pp 519–530
Luce RD, Krantz DH, Suppes P, Tversky A (1990) Foundations of measurement, vol 3. Academic Press, New York
MATH Google Scholar
Manly BF (1976) Exponential data transformations. Statistician 25(1):37–42
Article MathSciNet Google Scholar
Marbac M, Sedki M (2015) Variable selection for model-based clustering using the integrated complete-data likelihood. arXiv:1501.06314
Maugis C, Celeux G, Martin-Magniette M (2009a) Variable selection for clustering with Gaussian mixture models. Biometrics 65(3):701–709
Article MathSciNet MATH Google Scholar
Maugis C, Celeux G, Martin-Magniette M-L (2009b) Variable selection in model-based clustering: a general variable role modeling. Comput Stat Data Anal 53:3872–3882
Article MathSciNet MATH Google Scholar
McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York
Book MATH Google Scholar
McLachlan G, Peel D (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41:379–388
Article MathSciNet MATH Google Scholar
McNicholas P, Murphy T (2010) Model-based clustering of microarray expression data via latent gaussian mixture models. Bioinformatics 21(26):2705–2712
Article Google Scholar
McNicholas PD (2016) Mixture model-based classification. Chapman and Hall, New York
Book MATH Google Scholar
McParland D, Gormley IC (2016) Model based clustering for mixed data: clustMD. Adv Data Anal Classif 10(2):155–169
Article MathSciNet Google Scholar
Melnykov V, Maitra R (2010) Finite mixture models and model-based clustering. Stat Surv 4:80–116
Article MathSciNet MATH Google Scholar
Meynet C (2012) Sélection de variables pour la classification non supervisée en grande dimension. Ph.D. thesis, Université Paris-Sud 11
Meynet C, Maugis-Rabusseau C (2012) A sparse variable selection procedure in model-based clustering. Research report
Moustaki I, Papageorgiou I (2005) Latent class models for mixed variables with applications in archaeometry. Comput Stat Data Anal 48(3):65–675
Article MathSciNet MATH Google Scholar
Pan W, Shen X (2007) Penalized model-based clustering with application to variable selection. J Mach Learn Res 8:1145–1164
MATH Google Scholar
Prates MO, Lachos VH, Cabral C (2013) mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions. J Stat Softw 54(12):1–20
Article Google Scholar
Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101(473):168–178
Article MathSciNet MATH Google Scholar
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850
Article Google Scholar
Rao CR, Miller JP, Rao DC (2007) Handbook of statistics: epidemiology and medical statistics, vol 27. Elsevier, New York
MATH Google Scholar
Rau A, Maugis-Rabusseau C (2018) Transformation and model choice for RNA-seq co-expression analysis. Brief Bioinform 19(3):425–436
Google Scholar
Rau A, Maugis-Rabusseau C, Martin-Magniette M-L, Celeux G (2015) Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics 31(9):1420–1427
Article Google Scholar
Redner R, Walker H (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26(2):195–239
Article MathSciNet MATH Google Scholar
Schlimmer JC (1987) Concept acquisition through representational adjustment. Ph.D. thesis, Department of Information and Computer Science, University of California, Irvine, CA
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Article MathSciNet MATH Google Scholar
Seber GAF, Lee AJ (2012) Linear regression analysis, 2nd edn. Wiley, New Jersey
MATH Google Scholar
Sedki M, Celeux G, Maugis-Rabusseau C (2014) SelvarMix: a R package for variable selection in model-based clustering and discriminant analysis with a regularization approach. Research report
Suppes P, Krantz DH, Luce RD, Tversky A (1989) Foundations of measurement, vol 2. Academic Press, New York
MATH Google Scholar
Tadesse MG, Sha N, Vannucci M (2005) Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc 100(470):602–617
Article MathSciNet MATH Google Scholar
Thomas I, Frankhauser P, Biernacki C (2008) The morphology of built-up landscapes in Wallonia (Belgium): a classification using fractal indices. Landsc Urban Plan 84:99–115
Article Google Scholar
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Book MATH Google Scholar
Wang K, McLachlan GJ, Ng SK, Peel D (2012) EMMIX-skew: EM Algorithm for Mixture of Multivariate Skew Normal/t Distributions. R code version 1.0.16. http://www.maths.uq.edu.au/~gjm/mix_soft/EMMIX-skew
Wolfe JH (1971) A monte carlo study of the sampling distribution of the likelihood ratio for mixtures of multinormal distributions. Technical Bulletin STB 72-2, US Naval Personnel Research Activity, San Diego, CA
Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987
Article Google Scholar
Zhou H, Pan W, Shen X (2009) Penalized model-based clustering with unconstrained covariance matrices. Electron J Stat 3:1473–1496
Article MathSciNet MATH Google Scholar
Zhu X, Melnykov V (2016) Manly transformation in finite mixture modeling. Comput Stat Data Anal 121:190–208

Download references

Author information

Authors and Affiliations

University of Lille, Inria and CNRS, Lille, France
Christophe Biernacki
University of Bordeaux, Bordeaux, France
Alexandre Lourme

Authors

Christophe Biernacki
View author publications
You can also search for this author in PubMed Google Scholar
Alexandre Lourme
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christophe Biernacki.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Biernacki, C., Lourme, A. Unifying data units and models in (co-)clustering. Adv Data Anal Classif 13, 7–31 (2019). https://doi.org/10.1007/s11634-018-0325-2

Download citation

Received: 16 January 2017
Revised: 01 March 2018
Accepted: 21 May 2018
Published: 25 May 2018
Issue Date: 08 March 2019
DOI: https://doi.org/10.1007/s11634-018-0325-2

Keywords

Mathematics Subject Classification

62H30

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Unifying data units and models in (co-)clustering

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

Data clustering: application and trends

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Unifying data units and models in (co-)clustering

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

Data clustering: application and trends

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation