Variable selection for model-based clustering using the integrated complete-data likelihood

Abstract

Variable selection in cluster analysis is important yet challenging. It can be achieved by regularization methods, which realize a trade-off between clustering accuracy and the number of selected variables by using a lasso-type penalty. However, the calibration of the penalty term is open to criticism. Model selection methods are an efficient alternative, yet they require a difficult optimization of an information criterion, which involves combinatorial problems. First, most of these optimization algorithms are based on a suboptimal procedure (e.g., a stepwise method). Second, the algorithms are often computationally expensive because they require multiple calls to EM algorithms. Here we propose a new information criterion based on the integrated complete-data likelihood. It does not require the maximum likelihood estimate, and its maximization is simple and computationally efficient. The original contribution of our approach is that it performs model selection without requiring any parameter estimation; parameter inference is then needed only for the single selected model. This approach is used for variable selection in a Gaussian mixture model under a conditional independence assumption. Numerical experiments on simulated and benchmark datasets show that the proposed method often outperforms two classical approaches to variable selection. The proposed approach is implemented in the R package VarSelLCM, available on CRAN.

Acknowledgments

The authors are grateful to Gilles Celeux, Paul Diver and Jean-Michel Marin for their insightful comments.

Author information

Corresponding author

Correspondence to Matthieu Marbac.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 510 KB)

Appendices

Appendix 1: Consistency of the MICL criterion

This section is devoted to the proof of consistency of our \(\text {MICL}\) criterion with a fixed number of components. The first part deals with non-nested models and requires a bias-entropy compensation assumption. The second part covers nested models, i.e., when the competing model contains the true model. In what follows, we consider the true model \(\varvec{m}^{(0)} = \big (g^{(0)}, \varvec{\omega }^{(0)}\big )\), whose set of relevant variables is \({\varOmega }^{(0)} = \left\{ j : \omega ^{(0)}_j = 1 \right\} \) and whose parameter is \(\varvec{\theta }^{(0)}\).

Case of non-nested model. We need to introduce the entropy notation given by

$$\begin{aligned} \xi \big (\varvec{\theta }; \mathbf {z}, \varvec{m}\big ) = \sum _{i = 1}^n \sum _{k = 1}^g z_{ik} \ln \tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ), \end{aligned}$$

where \(\tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ) = \dfrac{\tau _k \phi \big (\varvec{x}_i \mid \theta _k, \varvec{m}\big )}{\sum _{h=1}^g\tau _h \phi \big (\varvec{x}_i \mid \theta _h, \varvec{m}\big )}.\)
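As a numerical illustration, this entropy term can be computed directly for a diagonal Gaussian mixture. The following sketch is not part of the VarSelLCM implementation; the function and argument names are illustrative:

```python
import numpy as np

def entropy_term(x, z, tau, mu, sigma):
    """xi(theta; z, m) = sum_i sum_k z_ik * ln tau_ik(theta | m) for a
    diagonal Gaussian mixture.
    x: (n, d) data; z: (n, g) one-hot partition; tau: (g,) proportions;
    mu, sigma: (g, d) component means and standard deviations."""
    g = tau.shape[0]
    # ln phi(x_i | theta_k) summed over the d coordinates, shape (n, g)
    log_phi = np.stack([
        (-0.5 * np.log(2.0 * np.pi * sigma[k] ** 2)
         - (x - mu[k]) ** 2 / (2.0 * sigma[k] ** 2)).sum(axis=1)
        for k in range(g)], axis=1)
    log_comp = np.log(tau) + log_phi  # ln [tau_k * prod_j phi(x_ij | ...)]
    # normalize: ln tau_ik = log_comp - ln sum_h tau_h phi(x_i | theta_h)
    log_tau_ik = log_comp - np.logaddexp.reduce(log_comp, axis=1, keepdims=True)
    return float((z * log_tau_ik).sum())
```

Since each \(\ln \tau _{ik} \le 0\), the term is non-positive, and it approaches zero when the components are well separated and the partition matches the MAP assignments.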

Proposition 1

Assume that \(\varvec{m}^{(1)}\) is a model such that \(\varvec{m}^{(0)}\) is not nested within \(\varvec{m}^{(1)}\). Assume that

$$\begin{aligned}&- \mathbb {E}\left[ \ln \dfrac{\sum _{k = 1}^{g^{(0)}}\tau _k \prod _{j = 1}^d\phi \big (x_{1j} \mid \mu ^{(0)}_{kj}, \sigma ^{(0)2}_{kj}\big )\mathbbm {1}_{G^{(0)}_k}\big (\varvec{x}_1\big )}{p\big ( \varvec{x}_1 \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )}\right] \nonumber \\&\quad \le \mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ], \end{aligned}$$
(22)

where \(\mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ]\) is the Kullback-Leibler divergence of \(p\big (\cdot \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )\) from \(p\big (\cdot \mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\big )\) and

$$\begin{aligned} G^{(0)}_k = \left\{ \varvec{x} \in \mathbb R^d : k = \underset{1 \le h \le g^{(0)}}{\text {argmax}}\, \tau _h \prod _{j = 1}^d\phi \big (x_{j} \mid \mu ^{(0)}_{hj}, \sigma ^{(0)2}_{hj}\big )\right\} . \end{aligned}$$

When \(n \rightarrow \infty \), we have

$$\begin{aligned} \mathbb {P}\bigg (\text {MICL}\big (\varvec{m}^{(1)}\big ) > \text {MICL}\big (\varvec{m}^{(0)}\big )\bigg ) \longrightarrow 0. \end{aligned}$$

Proof

For any model \(\varvec{m}\), we have the following inequalities,

$$\begin{aligned} \text {ICL}\big (\varvec{m}\big ) \le \text {MICL}\big (\varvec{m}\big ) \le \ln p\big (\mathbf {x}\mid \varvec{m}\big ). \end{aligned}$$

It follows that

$$\begin{aligned}&\mathbb {P}\bigg \{\text {MICL}\big (\varvec{m}^{(1)}\big ) - \text {MICL}\big (\varvec{m}^{(0)}\big )> 0\bigg \}\\&\quad \le \mathbb {P}\bigg \{\ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) - \text {ICL}\big (\varvec{m}^{(0)}\big ) > 0\bigg \}. \end{aligned}$$

Now set \({\varDelta }\nu = \nu ^{(1)} - \nu ^{(0)}\) where \(\nu ^{(1)}\) and \(\nu ^{(0)}\) are the numbers of free parameters in the models \(\varvec{m}^{(1)}\) and \(\varvec{m}^{(0)}\) respectively. Using Laplace’s approximation, we have

$$\begin{aligned} \text {ICL}\big (\varvec{m}^{(0)}\big )= & {} \ln p\left( \mathbf {x}\mid \widehat{\varvec{\theta }}^{(0)}, \varvec{m}^{(0)}\right) + \xi \left( \widehat{\varvec{\theta }}^{(0)}; \widehat{\mathbf {z}}^{(0)}, \varvec{m}^{(0)}\right) \nonumber \\&- \dfrac{\nu ^{(0)}}{2} \ln n+ \mathcal {O}_p(1), \end{aligned}$$

where \(\widehat{\varvec{\theta }}^{(0)}\) and \(\widehat{\mathbf {z}}^{(0)}\) are respectively the MLE and the partition given by the corresponding MAP rule. In the same way, we have

$$\begin{aligned} \ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) = \ln p\big (\mathbf {x}\mid \widehat{\varvec{\theta }}^{(1)}, \varvec{m}^{(1)}\big ) - \dfrac{\nu ^{(1)}}{2} \ln n + \mathcal {O}_p(1), \end{aligned}$$

where \(\widehat{\varvec{\theta }}^{(1)}\) is the MLE of \(\varvec{\theta }^{(1)}\). Note that

$$\begin{aligned}&\ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) - \text {ICL}\big (\varvec{m}^{(0)}\big )\\&= \dfrac{A_n}{2} + n B_n - \dfrac{{\varDelta }\nu }{2} \ln n +\, \mathcal {O}_p(1), \end{aligned}$$

where

$$\begin{aligned} A_n = 2 \ln \dfrac{p\left( \mathbf {x}\mid \widehat{\varvec{\theta }}^{(1)},\varvec{m}^{(1)}\right) }{ p\left( \mathbf {x}\mid \varvec{\theta }^{(1)}, \varvec{m}^{(1)}\right) } - 2 \ln \dfrac{p\left( \mathbf {x}\mid \widehat{\varvec{\theta }}^{(0)}, \varvec{m}^{(0)}\right) }{p\left( \mathbf {x}\mid \varvec{\theta }^{(0)}, \varvec{m}^{(0)}\right) }, \end{aligned}$$

and

$$\begin{aligned} B_n = \dfrac{1}{n} \ln \dfrac{p\left( \mathbf {x}\mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\right) }{p\left( \mathbf {x}\mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\right) } - \dfrac{1}{n}\xi \left( \widehat{\varvec{\theta }}^{(0)}; \widehat{\mathbf {z}}^{(0)},\varvec{m}^{(0)}\right) . \end{aligned}$$

When \(n \rightarrow \infty \), we have \(A_n \rightarrow \chi ^2_{{\varDelta }\nu }\) in distribution and \(B_n\) tends to

$$\begin{aligned}&-\mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ]\\&- \mathbb {E}\left[ \ln \dfrac{\sum _{k = 1}^{g^{(0)}}\tau _k \prod _{j = 1}^d\phi \big (x_{1j} \mid \mu ^{(0)}_{kj}, \sigma ^{(0)2}_{kj}\big )\mathbbm {1}_{G^{(0)}_k}\big (\varvec{x}_1\big )}{p\big ( \varvec{x}_1 \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )}\right] \end{aligned}$$

in probability. Thus, under assumption (22), \(\text {MICL}\) is consistent, since when \(n \rightarrow \infty \), we have

$$\begin{aligned}&\mathbb {P}\bigg \{\text {MICL}\big (\varvec{m}^{(1)}\big ) - \text {MICL}\big (\varvec{m}^{(0)}\big )> 0 \bigg \}\\&\quad \le \mathbb {P}\bigg [A_n + \mathcal {O}_p(1)> {\varDelta }\nu \ln n\bigg ] + \mathbb {P}\bigg [B_n > 0 \bigg ]\longrightarrow 0. \end{aligned}$$

Case of nested model. Recall that \(\text {MICL}\big (\varvec{m}^{(0)}\big ) = \ln p\big (\mathbf {x}, \mathbf {z}^{(0)}\mid \varvec{m}^{(0)}\big )\), where \(\mathbf {z}^{(0)} = \underset{\mathbf {z}}{\text {argmax}} \ln p\big (\mathbf {x}, \mathbf {z}\mid \varvec{m}^{(0)}\big )\). We have

$$\begin{aligned} \mathbf {z}^{(0)} {=} \underset{\mathbf {z}}{\text {argmax}}\Big \{\ln p\left( \mathbf {z}\mid g^{(0)}\right) {+} \underset{j \in {\varOmega }_0}{\sum }\ln p\big (\mathbf {x}_{\bullet j} \mid \omega _j^{(0)}, g^{(0)}, \mathbf {z}\big ) \Big \}, \end{aligned}$$

where \({\varOmega }_0 = \left\{ j : \omega ^{(0)}_j = 1\right\} \). Let \(\varvec{m}^{(1)} = \left( g^{(0)}, {\varOmega }_1\right) \) where \({\varOmega }_1 = {\varOmega }_0 \cup {\varOmega }_{01}\) and \( {\varOmega }_{01} = \left\{ j : \omega _j^{(1)} = 1, \omega _j^{(0)} = 0 \right\} \). Then, in the same way, we have \(\text {MICL}\big (\varvec{m}^{(1)}\big ) = \ln p\big (\mathbf {x}, \mathbf {z}^{(1)}\!\mid \varvec{m}^{(1)}\big )\), where

$$\begin{aligned} \mathbf {z}^{(1)} {=} \underset{\mathbf {z}}{\text {argmax}}\left[ \ln p(\mathbf {z}\mid g^{(0)}) {+} \underset{j \in {\varOmega }_1}{\sum }\ln p\left( \mathbf {x}_{\bullet j} {\mid } \omega _j^{(1)}, g^{(0)}, \mathbf {z}\right) \right] . \end{aligned}$$

Let \(j \in {\varOmega }_{01}\); Laplace’s approximation gives

$$\begin{aligned} \ln p\left( \mathbf {x}_{\bullet j} \mid \omega _j^{(1)}, g^{(0)}, \mathbf {z}\right)= & {} \sum _{i =1}^n \sum _{k = 1}^g z_{ik}\ln \phi \left( x_{ij} \mid \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\right) \nonumber \\&- g^{(0)} \ln n + \mathcal {O}_p(1), \end{aligned}$$

where

$$\begin{aligned} \left( \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\right) = \underset{\mu ^{(1)}_{kj}, \sigma ^{(1)2}_{kj}}{\text {argmax}} \sum _{i =1}^n z_{ik}\ln \phi \left( x_{ij} \mid \mu ^{(1)}_{kj}, \sigma ^{(1)2}_{kj}\right) . \end{aligned}$$
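These within-class maximizers have the familiar closed form (class-wise sample mean and biased ML variance). A minimal sketch, with illustrative names:

```python
import numpy as np

def classwise_gaussian_mle(x_j, z):
    """Closed-form maximizers (mu_kj, sigma_kj^2) of
    sum_i z_ik * ln phi(x_ij | mu_kj, sigma_kj^2) for one variable j.
    x_j: (n,) observations of variable j; z: (n, g) one-hot partition."""
    n_k = z.sum(axis=0)                           # class sizes
    mu = (z * x_j[:, None]).sum(axis=0) / n_k     # class-wise sample means
    var = (z * (x_j[:, None] - mu) ** 2).sum(axis=0) / n_k  # biased (ML) variances
    return mu, var
```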

Proposition 2

Assume that \(\varvec{m}^{(1)}\) is a model such that \(g^{(1)} = g^{(0)}\) and \({\varOmega }_1 = {\varOmega }_0 \cup {\varOmega }_{01}\) with \({\varOmega }_{01} \ne \emptyset \), i.e., the model \(\varvec{m}^{(0)}\) is nested within the model \(\varvec{m}^{(1)}\) with the same number of components. When \(n \rightarrow \infty \),

$$\begin{aligned} \mathbb {P}\bigg (\text {MICL}\big (\varvec{m}^{(1)}\big ) > \text {MICL}\big (\varvec{m}^{(0)}\big )\bigg ) \longrightarrow 0. \end{aligned}$$

Proof

We have

$$\begin{aligned}&\mathbb {P}\bigg \{\text {MICL}\big (\varvec{m}^{(1)}\big )> \text {MICL}\big (\varvec{m}^{(0)}\big )\bigg \}\\&\quad \le \mathbb {P}\left\{ \sum _{j \in {\varOmega }_{01}} \ln \dfrac{p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(1)}_j, g^{(0)}, \mathbf {z}^{(1)}\big )}{ p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(0)}_j, g^{(0)}, \mathbf {z}^{(0)}\big )} > 0 \right\} , \end{aligned}$$

and for each \(j \in {\varOmega }_{01}\), when \(n \rightarrow \infty \),

$$\begin{aligned} 2 \sum _{i =1}^n \sum _{k =1}^{g^{(0)}} z^{(1)}_{ik}\ln \dfrac{\phi \left( x_{ij} \mid \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\right) }{\phi \left( x_{ij} \mid \mu ^{(0)}_{1j},\sigma ^{(0)2}_{1j}\right) } \longrightarrow \chi ^2_{2g^{(0)}} \quad \text {in distribution}. \end{aligned}$$

We have

$$\begin{aligned}&\mathbb {P}\bigg (\sum _{j \in {\varOmega }_{01}} \ln \dfrac{p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(1)}_j, g^{(0)}, \mathbf {z}^{(1)}\big )}{ p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(0)}_j, g^{(0)}, \mathbf {z}^{(0)}\big )}> 0 \bigg )\\&\quad = \mathbb {P}\Big (\chi ^2_{2(g^{(0)}-1)} - 2 (g^{(0)}-1) \ln n > 0\Big )\\&\qquad \longrightarrow 0 \quad \text { by Chebyshev's inequality}. \end{aligned}$$

Appendix 2: Details on the partition step

At iteration [r], the partition \(\mathbf {z}^{[r]}\) is defined as a partition that increases the value of the integrated complete-data likelihood for the current model \(\varvec{m}^{[r]}\). This partition is obtained by an iterative method initialized with the partition \(\mathbf {z}^{[r-1]}\). Each iteration consists of uniformly sampling an individual, which is then assigned to the class maximizing the integrated complete-data likelihood while the other class memberships are kept unchanged.

At iteration [r] of the global algorithm, the algorithm used to obtain \(\mathbf {z}^{[r]}\) is initialized at the partition \(\mathbf {z}^{(0)}=\mathbf {z}^{[r-1]}\). It performs S iterations, where iteration (s) is defined as follows:

Individual sampling: \(i^{(s)} \sim \mathcal {U}\{1,\ldots ,n\}\).

Partition optimization: define the set of partitions \(\mathcal {Z}^{(s)}=\{\mathbf {z}: \varvec{z}_i=\varvec{z}_i^{(s-1)},\; \forall i\ne i^{(s)}\}\) and compute the optimized partition \(\mathbf {z}^{(s)}\) defined by

$$\begin{aligned} \mathbf {z}^{(s)} = \text {argmax}_{\mathbf {z}\in \mathcal {Z}^{(s)}} \ln p\left( \mathbf {x},\mathbf {z}|\varvec{m}^{[r]}\right) . \end{aligned}$$
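The iteration above can be sketched as follows, with log_icl standing in for \(\ln p(\mathbf {x},\mathbf {z}\mid \varvec{m}^{[r]})\). The function and argument names are illustrative, not the VarSelLCM API:

```python
import numpy as np

def partition_step(log_icl, z, g, S, rng=None):
    """S iterations of the partition step: each one uniformly samples an
    individual i and reassigns it to the class maximizing log_icl, the
    other class memberships being held fixed.
    z: (n,) array of class labels in {0, ..., g-1}."""
    rng = np.random.default_rng(rng)
    z = z.copy()
    n = z.shape[0]
    for _ in range(S):
        i = int(rng.integers(n))
        best_k, best_val = z[i], -np.inf
        for k in range(g):  # the current class is among the candidates,
            z[i] = k        # so the criterion can only stay equal or increase
            val = log_icl(z)
            if val > best_val:
                best_k, best_val = k, val
        z[i] = best_k  # keep the best class for individual i
    return z
```

Because the current membership of individual \(i^{(s)}\) is among the candidates, each of the S iterations keeps or increases the criterion, so the sequence of partitions is monotone in the integrated complete-data likelihood.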

About this article

Cite this article

Marbac, M., Sedki, M. Variable selection for model-based clustering using the integrated complete-data likelihood. Stat Comput 27, 1049–1063 (2017). https://doi.org/10.1007/s11222-016-9670-1

Keywords

  • Gaussian mixture model
  • Information criterion
  • Integrated complete-data likelihood
  • Model-based clustering
  • Variable selection