Modelling the role of variables in model-based cluster analysis

Galimberti, Giuliano; Manisi, Annamaria; Soffritti, Gabriele

doi:10.1007/s11222-017-9723-0

Published: 12 January 2017

Volume 28, pages 145–169 (2018)

Statistics and Computing

Abstract

In the framework of cluster analysis based on Gaussian mixture models, it is usually assumed that all the variables provide information about the clustering of the sample units. Several variable selection procedures are available to detect the structure of interest for the clustering when this structure is contained in a variable sub-vector. Currently, in these procedures a variable is assumed to play one of (up to) three roles: (1) informative, (2) uninformative and correlated with some informative variables, (3) uninformative and uncorrelated with any informative variable. A more general approach for modelling the role of a variable is proposed by taking into account the possibility that the variable vector provides information about more than one structure of interest for the clustering. This approach is developed by assuming that such information is given by non-overlapping and possibly correlated sub-vectors of variables; it is also assumed that the model for the variable vector is equal to a product of conditionally independent Gaussian mixture models (one for each variable sub-vector). Details about model identifiability, parameter estimation and model selection are provided. The usefulness and effectiveness of the described methodology are illustrated using simulated and real datasets.
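
As a rough illustration of the "product of Gaussian mixture models" idea, the sketch below evaluates the log-density of an observation whose variables are split into disjoint sub-vectors, each with its own mixture. It is a simplified special case and not the authors' procedure: it assumes the sub-vectors are mutually independent (the paper also allows correlated sub-vectors) and that every component has a diagonal covariance matrix. All function names, the partition of the variables, and the parameter values are hypothetical.

```python
import math


def log_normal_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x, weights, means, variances):
    """Log-density of a Gaussian mixture, using a stable log-sum-exp."""
    terms = [
        math.log(w) + log_normal_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_product_of_mixtures(x, blocks):
    """Sum the mixture log-densities of disjoint variable sub-vectors.

    `blocks` is a list of (indices, weights, means, variances) tuples;
    each tuple describes one sub-vector and its own mixture.
    Assumption: sub-vectors are independent of one another.
    """
    total = 0.0
    for indices, weights, means, variances in blocks:
        sub_x = [x[i] for i in indices]
        total += log_mixture(sub_x, weights, means, variances)
    return total


# Hypothetical setup: three variables split into sub-vectors {0, 1} and {2},
# each carrying its own two-component cluster structure.
blocks = [
    ([0, 1], [0.5, 0.5], [[0.0, 0.0], [3.0, 3.0]], [[1.0, 1.0], [1.0, 1.0]]),
    ([2], [0.3, 0.7], [[-2.0], [2.0]], [[1.0], [1.0]]),
]
x = [0.1, -0.2, 1.9]
log_density = log_product_of_mixtures(x, blocks)
```

Because each sub-vector has its own mixture, each one induces its own partition of the sample units, which is how a single variable vector can carry more than one cluster structure under these assumptions.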



Author information

Authors and Affiliations

Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126, Bologna, Italy
Giuliano Galimberti, Annamaria Manisi & Gabriele Soffritti

Corresponding author

Correspondence to Gabriele Soffritti.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 229 KB)

About this article

Cite this article

Galimberti, G., Manisi, A. & Soffritti, G. Modelling the role of variables in model-based cluster analysis. Stat Comput 28, 145–169 (2018). https://doi.org/10.1007/s11222-017-9723-0

Received: 24 March 2016
Accepted: 04 January 2017
Published: 12 January 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s11222-017-9723-0
