Modelling the role of variables in model-based cluster analysis

Statistics and Computing

Abstract

In cluster analysis based on Gaussian mixture models, it is usually assumed that all the variables provide information about the clustering of the sample units. Several variable selection procedures are available to detect the structure of interest for the clustering when this structure is contained in a sub-vector of the variables. In these procedures a variable is currently assumed to play one of (up to) three roles: (1) informative, (2) uninformative and correlated with some informative variables, (3) uninformative and uncorrelated with any informative variable. A more general approach for modelling the role of a variable is proposed, which takes into account the possibility that the variable vector provides information about more than one structure of interest for the clustering. This approach assumes that such information is given by non-overlapping and possibly correlated sub-vectors of variables. It also assumes that the model for the variable vector is a product of conditionally independent Gaussian mixture models, one for each variable sub-vector. Details about model identifiability, parameter estimation and model selection are provided. The usefulness and effectiveness of the described methodology are illustrated using simulated and real datasets.
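
As a rough illustration of the product structure described in the abstract, and not the authors' exact specification, consider the simplified special case in which the variable sub-vectors are mutually independent. All notation below is assumed here for illustration only: $G$ non-overlapping sub-vectors $\mathbf{x}_{S_1}, \ldots, \mathbf{x}_{S_G}$ of the variable vector $\mathbf{x}$, $K_g$ mixture components for the $g$-th sub-vector, mixing weights $\pi_{gk}$, component means $\boldsymbol{\mu}_{gk}$, component covariance matrices $\boldsymbol{\Sigma}_{gk}$, and the multivariate Gaussian density $\phi$.

```latex
f(\mathbf{x}) \;=\; \prod_{g=1}^{G} \, \sum_{k=1}^{K_g} \pi_{gk}\,
  \phi\!\left(\mathbf{x}_{S_g};\, \boldsymbol{\mu}_{gk},\, \boldsymbol{\Sigma}_{gk}\right),
\qquad \pi_{gk} > 0, \quad \sum_{k=1}^{K_g} \pi_{gk} = 1 .
```

In this sketch each factor induces its own partition of the sample units, and a sub-vector with $K_g = 1$ carries no clustering information. The abstract also allows the sub-vectors to be correlated, which this independent special case does not capture.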

Author information

Corresponding author

Correspondence to Gabriele Soffritti.

About this article

Cite this article

Galimberti, G., Manisi, A. & Soffritti, G. Modelling the role of variables in model-based cluster analysis. Stat Comput 28, 145–169 (2018). https://doi.org/10.1007/s11222-017-9723-0

  • DOI: https://doi.org/10.1007/s11222-017-9723-0
