Variable selection in model-based clustering and discriminant analysis with a regularization approach
Several methods for variable selection have been proposed in model-based clustering and classification. These rely on backward or forward stepwise procedures to determine the roles of the variables. Unfortunately, such stepwise procedures are slow, and the resulting algorithms are inefficient when analyzing large data sets with many variables. In this paper, we propose an alternative regularization approach for variable selection in model-based clustering and classification. In our approach, the variables are first ranked using a lasso-like procedure in order to avoid slow stepwise algorithms. Thus, the variable selection methodology of Maugis et al. (Comput Stat Data Anal 53:3872–3882, 2009b) can be efficiently applied to high-dimensional data sets.
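To make the ranking idea concrete, the following is a minimal sketch and not the procedure of the paper itself. It assumes a Gaussian mixture with a shared diagonal covariance and an L1 penalty on the cluster means only, fitted by a penalized EM over a grid of penalty values; a variable is ranked higher the more often at least one of its cluster means stays nonzero. The function names `penalized_em` and `rank_variables`, the penalty grid, the principal-component initialization, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def penalized_em(X, K, lam, n_iter=100):
    """L1-penalized EM for a Gaussian mixture with a shared diagonal covariance.

    X is assumed to be standardized (centered, unit variance per column).
    Returns the (K, p) matrix of soft-thresholded cluster means.
    """
    n, p = X.shape
    # Deterministic start (an assumption of this sketch): cut the first
    # principal component score into K quantile bins.
    pc1 = X @ np.linalg.svd(X, full_matrices=False)[2][0]
    edges = np.quantile(pc1, np.linspace(0, 1, K + 1)[1:-1])
    r = np.eye(K)[np.searchsorted(edges, pc1)]  # one-hot responsibilities
    var = np.ones(p)
    for _ in range(n_iter):
        # M-step: weighted means, then soft-thresholding by lam * var / n_k.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu_tilde = (r.T @ X) / nk[:, None]
        shrink = lam * var / nk[:, None]
        mu = np.sign(mu_tilde) * np.maximum(np.abs(mu_tilde) - shrink, 0.0)
        resid2 = (X[:, None, :] - mu[None, :, :]) ** 2
        var = np.maximum((r[:, :, None] * resid2).sum(axis=(0, 1)) / n, 1e-6)
        # E-step: responsibilities (terms common to all clusters are dropped).
        log_r = np.log(pi) - 0.5 * (resid2 / var).sum(axis=2)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return mu

def rank_variables(X, K, lambdas):
    """Rank variables by how often they keep a nonzero mean along the grid."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    score = np.zeros(X.shape[1])
    for lam in lambdas:
        mu = penalized_em(Xs, K, lam)
        score += np.any(mu != 0.0, axis=0)
    # Stable sort so that ties keep the original variable order.
    return np.argsort(-score, kind="stable"), score

# Synthetic illustration: variables 0 and 1 separate two groups, 2..5 are noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 6))
X[:, :2] += np.where(labels[:, None] == 0, -3.0, 3.0)
order, score = rank_variables(X, K=2, lambdas=[1, 5, 10, 25, 50])
```

Here `order` lists the variables from most to least relevant according to this score. A ranking of this kind is computed once, so a subsequent selection step can examine the variables in a fixed order instead of re-testing every candidate at each stepwise move.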
Keywords: Variable selection, Lasso, Gaussian mixture, Clustering, Classification
Mathematics Subject Classification: 62H30, 91C20
Funding was provided by Paris-Saclay-DIGITEO and ANR (Grant No. ANR-13-JS01-0001-01).