Advertisement

Variable selection in model-based clustering and discriminant analysis with a regularization approach

  • Gilles Celeux
  • Cathy Maugis-Rabusseau
  • Mohammed Sedki
Regular Article

Abstract

Several methods for variable selection have been proposed in model-based clustering and classification. These make use of backward or forward procedures to define the roles of the variables. Unfortunately, such stepwise procedures are slow and the resulting algorithms inefficient when analyzing large data sets with many variables. In this paper, we propose an alternative regularization approach for variable selection in model-based clustering and classification. In our approach the variables are first ranked using a lasso-like procedure in order to avoid slow stepwise algorithms. Thus, the variable selection methodology of Maugis et al. (Comput Stat Data Anal 53:3872–3882, 2000b) can be efficiently applied to high-dimensional data sets.

Keywords

Variable selection Lasso Gaussian mixture Clustering Classification 

Mathematics Subject Classification

62H30 91C20 

Notes

Funding

Funding was provide by Paris- Saclay-DIGITEO and ANR (Grant No. ANR-13-JS01-0001-01).

References

  1. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803–821MathSciNetCrossRefzbMATHGoogle Scholar
  2. Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725CrossRefGoogle Scholar
  3. Bouveyron C, Brunet C (2014) Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Comput Stat 29:489–513MathSciNetCrossRefzbMATHGoogle Scholar
  4. Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28(5):781–793CrossRefGoogle Scholar
  5. Celeux G, Maugis C, Martin-Magniette ML, Raftery AE (2014) Comparing model selection and regularization approaches to variable selection in model-based clustering. J Fr Stat Soc 155:57–71MathSciNetzbMATHGoogle Scholar
  6. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc B 39(1):1–38zbMATHGoogle Scholar
  7. Fraiman R, Justel A, Svarc M (2008) Selection of variables for cluster analysis and classification rules. J Am Stat Assoc 103:1294–1303MathSciNetCrossRefzbMATHGoogle Scholar
  8. Friedman J, Hastie T, Tibshirani R (2007) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3):432–441CrossRefzbMATHGoogle Scholar
  9. Friedman J, Hastie T, Tibshirani R (2014) glasso: graphical lasso—estimation of Gaussian graphical models. https://CRAN.R-project.org/package=glasso. Accessed 22 July 2014
  10. Gagnot S, Tamby JP, Martin-Magniette ML, Bitton F, Taconnat L, Balzergue S, Aubourg S, Renou JP, Lecharny A, Brunaud V (2008) CATdb: a public access to arabidopsis transcriptome data from the URGV-CATMA platform. Nucleic Acids Res 36(suppl 1):D986–D990Google Scholar
  11. Galimberti G, Montanari A, Viroli C (2009) Penalized factor mixture analysis for variable selection in clustered data. Comput Stat Data Anal 53:4301–4310MathSciNetCrossRefzbMATHGoogle Scholar
  12. Kim S, Song DKH, DeSarbo WS (2012) Model-based segmentation featuring simultaneous segment-level variable selection. J Mark Res 49:725–736CrossRefGoogle Scholar
  13. Law MH, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166CrossRefGoogle Scholar
  14. Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. J Stat Softw 67(6):241–270CrossRefGoogle Scholar
  15. Lee H, Li J (2012) Variable selection for clustering by separability based on ridgelines. J Comput Graph Stat 21:315–337MathSciNetCrossRefGoogle Scholar
  16. Maugis C, Celeux G, Martin-Magniette M (2009a) Variable selection for clustering with Gaussian mixture models. Biometrics 65(3):701–709MathSciNetCrossRefzbMATHGoogle Scholar
  17. Maugis C, Celeux G, Martin-Magniette ML (2009b) Variable selection in model-based clustering: a general variable role modeling. Comput Stat Data Anal 53:3872–3882MathSciNetCrossRefzbMATHGoogle Scholar
  18. Maugis C, Celeux G, Martin-Magniette ML (2011) Variable selection in model-based discriminant analysis. J Multivar Anal 102:1374–1387MathSciNetCrossRefzbMATHGoogle Scholar
  19. Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the Lasso. Ann Stat 34(3):1436–1462MathSciNetCrossRefzbMATHGoogle Scholar
  20. Murphy TB, Dean N, Raftery AE (2010) Variable selection and updating in model-based discriminant analysis for high-dimensional data with food authenticity applications. Ann Appl Stat 4:396–421MathSciNetCrossRefzbMATHGoogle Scholar
  21. Nia VP, Davison AC (2012) High-dimensional Bayesian clustering with variable selection: the R package bclust. J Stat Softw 47(5):1–22CrossRefGoogle Scholar
  22. Pan W, Shen X (2007) Penalized model-based clustering with application to variable selection. J Mach Learn Res 8:1145–1164zbMATHGoogle Scholar
  23. Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101(473):168–178MathSciNetCrossRefzbMATHGoogle Scholar
  24. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464MathSciNetCrossRefzbMATHGoogle Scholar
  25. Scrucca L, Raftery AE (2014) clustvarsel: a package implementing variable selection for model-based clustering in R. arXiv:1411.0606
  26. Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289Google Scholar
  27. Sun W, Wang J, Fang Y (2012) Regularized k-means clustering of high dimensional data and its asymptotic consistency. Electron J Stat 6:148–167MathSciNetCrossRefzbMATHGoogle Scholar
  28. Tadesse MG, Sha N, Vannucci M (2005) Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc 100(470):602–617MathSciNetCrossRefzbMATHGoogle Scholar
  29. Wang S, Zhu J (2008) Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64(2):440–448MathSciNetCrossRefzbMATHGoogle Scholar
  30. Xie B, Pan W, Shen X (2008) Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron J Stat 2:168–212MathSciNetCrossRefGoogle Scholar
  31. Zhou H, Pan W, Shen X (2009) Penalized model-based clustering with unconstrained covariance matrices. Electron J Stat 3:1473–1496MathSciNetCrossRefzbMATHGoogle Scholar

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Dept. de mathématiquesInria and Université Paris-SudOrsay CedexFrance
  2. 2.Institut de Mathématiques de Toulouse, UMR 5219Université de Toulouse, INSA de ToulouseToulouse Cedex 4France
  3. 3.Paris-Sud University and INSERM U1181, Bâtiment. 15/16Hôpital Paul BrousseVillejuif CedexFrance

Personalised recommendations