Model selection for Gaussian latent block clustering with the integrated classification likelihood

Abstract

Block clustering aims to reveal homogeneous block structures in a data table. Among the different approaches to block clustering, we consider here a model-based method: the Gaussian latent block model for continuous data, which extends the Gaussian mixture model of one-way clustering. For a given data table, several candidate models are usually examined, differing for example in the number of clusters, so that model selection becomes a critical issue. To this end, we develop a criterion based on an approximation of the integrated classification likelihood for the Gaussian latent block model, and propose a Bayesian information criterion-like variant following the same pattern. We also propose a non-asymptotic exact criterion, thus circumventing the controversial definition of the asymptotic regime arising from the dual nature of the rows and columns in co-clustering. The experimental results show steady performance of these criteria for medium to large data tables.

References

  1. Banerjee A, Dhillon I, Ghosh J, Merugu S (2007) A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986

  2. Berkhin P (2006) A survey of clustering data mining techniques. Springer, Berlin

  3. Biernacki C, Celeux G, Govaert G (1998) Assessing a mixture model for clustering with the integrated classification likelihood. Tech. rep., INRIA

  4. Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725

  5. Biernacki C, Celeux G, Govaert G (2010) Exact and Monte Carlo calculations of integrated likelihoods for the latent class model. J Stat Plan Infer 140(11):2991–3002

  6. Charrad M, Lechevallier Y, Saporta G, Ben Ahmed M (2010) Détermination du nombre de classes dans les méthodes de bipartitionnement. In: 17ème Rencontres de la Société Francophone de Classification, Saint-Denis de la Réunion, pp 119–122

  7. Daudin JJ, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183

  8. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588

  9. Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis. CRC, Boca Raton

  10. Good IJ (1965) Categorization of classification. In: Mathematics and Computer Science in Biology and Medicine. Her Majesty’s Stationery Office

  11. Govaert G (1977) Algorithme de classification d’un tableau de contingence. In: First international symposium on data analysis and informatics, INRIA, Versailles

  12. Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24(4):437–458

  13. Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recogn 36:463–473

  14. Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129

  15. Hartigan JA (2000) Bloc voting in the United States Senate. J Classif 17(1):29–49

  16. Jagalur M, Pal C, Learned-Miller E, Zoeller RT, Kulp D (2007) Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinformatics 8(Suppl 10):S5

  17. Kemp C, Griffiths TL, Tenenbaum JB (2004) Discovering latent classes in relational data. Tech. rep., MIT Computer Science and Artificial Intelligence Laboratory

  18. Keribin C, Brault V, Celeux G, Govaert G (2012) Model selection for the binary latent block model. In: Colubi A, Fokianos K, Gonzalez-Rodriguez G, Kontoghiorghes EJ (eds) Proceedings of Compstat 2012, 20th international conference on computational statistics. The International Statistical Institute/International Association for Statistical Computing, pp 379–390

  19. Keribin C, Brault V, Celeux G, Govaert G (2013) Estimation and selection for the latent block model on categorical data. Tech. rep., INRIA

  20. Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716

  21. Lomet A, Govaert G, Grandvalet Y (2012a) Design of artificial data tables for co-clustering analysis. Tech. rep., Université de Technologie de Compiègne

  22. Lomet A, Govaert G, Grandvalet Y (2012b) Model selection in block clustering by the integrated classification likelihood. In: Colubi A, Fokianos K, Gonzalez-Rodriguez G, Kontoghiorghes EJ (eds) Proceedings of Compstat 2012, 20th international conference on computational statistics. The International Statistical Institute/International Association for Statistical Computing, pp 519–530

  23. Mariadassou M, Matias C (2012) Convergence of the groups posterior distribution in latent or stochastic block models. Tech. rep., arXiv

  24. McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York

  25. Nadif M, Govaert G (2008) Algorithms for model-based block Gaussian clustering. In: DMIN’08, the 2008 international conference on data mining, Las Vegas, Nevada, USA

  26. Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc Ser B Stat Methodol 59(4):731–792

  27. Robert C (2001) The Bayesian choice. Springer, Berlin

  28. Rocci R, Vichi M (2008) Two-mode multi-partitioning. Comput Stat Data Anal 52(4):1984–2003

  29. Schepers J, Ceulemans E, Van Mechelen I (2008) Selecting among multi-mode partitioning models of different complexities: a comparison of four model selection criteria. J Classif 25(1):67–85

  30. Seldin Y, Tishby N (2010) PAC-Bayesian analysis of co-clustering and beyond. J Mach Learn Res 11:3595–3646

  31. Shan H, Banerjee A (2008) Bayesian co-clustering. In: 8th IEEE international conference on data mining (ICDM’08), pp 530–539

  32. Van Dijk B, Van Rosmalen J, Paap R (2009) A Bayesian approach to two-mode clustering. Tech. Rep. 2009–06, Econometric Institute. http://hdl.handle.net/1765/15112

  33. Wyse J, Friel N (2012) Block clustering with collapsed latent block models. Stat Comput 22(1):415–428

Acknowledgments

We thank the reviewers and the associate editor for their valuable input. This work, carried out in the framework of the Labex MS2T (ANR-11-IDEX-0004-02), was partially funded by the French National Agency for Research under grant ClasSel ANR-08-EMER-002 and by the European ICT FP7 under grant No. 247022-MASH.

Author information

Corresponding author

Correspondence to Aurore Lomet.

Appendices

Appendix A: Derivation of the approximation of \(\textit{ICL}\)

The first term of the expansion (2) (\(\log p(\mathbf {X}|\mathbf {z},\mathbf {w},M)\)) can be approximated in a BIC-like fashion, since the table entries are conditionally independent given the row and column partitions:

$$\begin{aligned} \log p(\mathbf {X}|\mathbf {z},\mathbf {w},M)&\approx \max _{\varvec{\alpha }} \log p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) - \frac{\lambda }{2} \log (nd) , \end{aligned}$$

where \(\lambda \) is the dimension of the parameter vector \(\varvec{\alpha }\) (that is, of the parameter space \(\mathcal {A}\)).
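To fix ideas, here is a minimal numerical sketch of this BIC-like term for the model with free block means and variances, for which \(\lambda = 2gm\). All code sketches in these appendices are ours, written in Python with NumPy/SciPy; the function names and the hard-partition inputs `z`, `w` are illustrative, not the authors' implementation.

```python
import numpy as np

def bic_like_term(X, z, w):
    """BIC-like approximation of log p(X | z, w, M) for the Gaussian latent
    block model with free block means and variances (lambda = 2*g*m).
    z and w are integer label vectors for the n rows and d columns; blocks
    are assumed non-empty with positive within-block variance."""
    n, d = X.shape
    g, m = z.max() + 1, w.max() + 1
    loglik = 0.0
    for k in range(g):
        for l in range(m):
            block = X[np.ix_(z == k, w == l)].ravel()
            var = block.var()  # ML estimate of sigma^2_{kl} (ddof=0)
            # Gaussian log-likelihood at the ML estimates; the quadratic
            # term sums to block.size at the ML variance.
            loglik -= 0.5 * block.size * (np.log(2 * np.pi * var) + 1)
    lam = 2 * g * m  # one mean and one variance per block
    return loglik - 0.5 * lam * np.log(n * d)
```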

The two terms \(\log p(\mathbf {z}|M)\) and \(\log p(\mathbf {w}|M)\) can be computed exactly by placing conjugate prior distributions on \(\varvec{\pi }\) and \(\varvec{\rho }\) when the proportion parameters are free. Indeed, a symmetric Dirichlet distribution \(\mathcal {D}(\delta ,\ldots ,\delta )\) yields:

$$\begin{aligned} p(\mathbf {z}|M)&= \int _{\mathcal {P}} \pi _1^{n_1} \cdots \pi _g^{n_g} \frac{\varGamma (g \delta )}{\varGamma ( \delta )\cdots \varGamma (\delta )} \mathbf {1}_{\sum _k \pi _k=1} d\varvec{\pi }, \\&= \frac{\varGamma ( g\delta )}{\varGamma ( \delta )^{g}} \frac{\varGamma ( \delta + n_1)\cdots \varGamma (\delta + n_g)}{\varGamma (n + g \delta )} \end{aligned}$$

where \(n_k\) is the number of rows in cluster \(k\). Details of the calculation are given by Robert (2001).
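This closed form can be evaluated directly on the log scale with log-Gamma functions; a minimal sketch (our naming), using SciPy's `gammaln`:

```python
import numpy as np
from scipy.special import gammaln

def log_p_labels(labels, delta=0.5):
    """Exact log p(z | M) under a symmetric Dirichlet(delta, ..., delta)
    prior on the proportions; delta = 1/2 gives the Jeffreys prior used
    below. Works identically for column labels w."""
    counts = np.bincount(labels)          # cluster sizes n_1, ..., n_g
    g, n = counts.size, counts.sum()
    return (gammaln(g * delta) - g * gammaln(delta)
            + gammaln(delta + counts).sum() - gammaln(n + g * delta))
```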

Using non-informative Jeffreys prior distributions for the proportion parameters (\(\delta =1/2\)), the log-priors are:

$$\begin{aligned} \log p(\mathbf {z}|M)&= \log \varGamma \left( \frac{g}{2}\right) + \sum _{k=1}^g \log \varGamma (n_k +\frac{1}{2}) - g \log \varGamma \left( \frac{1}{2}\right) - \log \varGamma (n+\frac{g}{2}),\\ \log p(\mathbf {w}|M)&= \log \varGamma \left( \frac{m}{2}\right) + \sum _{\ell =1}^m \log \varGamma (d_\ell +\frac{1}{2}) - m \log \varGamma \left( \frac{1}{2}\right) - \log \varGamma (d+\frac{m}{2}). \end{aligned}$$

Because \((\mathbf {z}, \mathbf {w})\) are unknown, we replace them by the estimates \((\hat{\mathbf {z}}, \hat{\mathbf {w}})\) obtained by the VEM algorithm. When \(\hat{n}_k\) and \(\hat{d}_\ell \) are large enough, the Gamma function can be approximated by Stirling's formula \( \varGamma (t+1) \approx t^{t+1/2} \exp (-t) (2\pi )^{1/2}\). Neglecting terms of order \(O(1)\), the log-prior distributions are then approximated as follows:

$$\begin{aligned} \log p(\hat{\mathbf {z}}|M)&\approx \sum _{k=1}^g \hat{n}_k \log \hat{n}_k - n \log n - \frac{1}{2}(g-1) \log n, \\ \log p(\hat{\mathbf {w}}|M)&\approx \sum _{\ell =1}^m \hat{d}_\ell \log \hat{d}_\ell - d \log d - \frac{1}{2}(m-1) \log d. \end{aligned}$$

In addition, \(\sum _{k=1}^g \hat{n}_k \log \frac{\hat{n}_k}{n}=\max _{\varvec{\pi }} \log p(\hat{\mathbf {z}}| \varvec{\pi }, M)\) and \(\sum _{\ell =1}^m \hat{d}_\ell \log \frac{\hat{d}_\ell }{d}=\max _{\varvec{\rho }} \log p(\hat{\mathbf {w}}| \varvec{\rho }, M)\) (see Robert 2001; Biernacki et al. 2010). Hence, for \(\delta =1/2\), we obtain:

$$\begin{aligned} \log p(\hat{\mathbf {z}}|M)&\approx \max _{\varvec{\pi }}\log p(\hat{\mathbf {z}}| \varvec{\pi }, M) - \frac{g-1}{2} \log n ,\\ \log p(\hat{\mathbf {w}}|M)&\approx \max _{\varvec{\rho }} \log p(\hat{\mathbf {w}}| \varvec{\rho }, M) - \frac{m-1}{2} \log d. \end{aligned}$$
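As a quick numerical sanity check (reusing the hypothetical `log_p_labels` above), the exact value and its Stirling-based approximation agree up to the neglected \(O(1)\) term on a large synthetic partition:

```python
rng = np.random.default_rng(0)
z_hat = rng.integers(0, 3, size=1000)   # n = 1000 rows, g = 3 clusters
n_k = np.bincount(z_hat)
n, g = n_k.sum(), n_k.size

exact = log_p_labels(z_hat)              # exact log p(z | M), delta = 1/2
approx = (n_k * np.log(n_k / n)).sum() - 0.5 * (g - 1) * np.log(n)
print(exact, approx)  # the two values differ only by an O(1) term
```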

Then, the ICL criterion can be approximated by:

$$\begin{aligned} ICL(M)&\approx \max _{\varvec{\alpha }} \log p(\mathbf {X}|\hat{\mathbf {z}},\hat{\mathbf {w}},\varvec{\alpha },M) - \frac{\lambda }{2} \log (nd) + \max _{\varvec{\pi }}\log p(\hat{\mathbf {z}}| \varvec{\pi }, M) \\&\quad - \frac{g-1}{2} \log n + \max _{\varvec{\rho }} \log p(\hat{\mathbf {w}}| \varvec{\rho }, M) - \frac{m-1}{2} \log d \\&\approx \max _{\varvec{\theta }} \log p(\mathbf {X},\hat{\mathbf {z}},\hat{\mathbf {w}}|\varvec{\theta },M) - \frac{\lambda }{2} \log (nd)- \frac{g-1}{2} \log n - \frac{m-1}{2} \log d . \end{aligned}$$
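Assembling the pieces, the resulting approximate criterion could be computed on the partitions \((\hat{\mathbf {z}}, \hat{\mathbf {w}})\) returned by VEM as follows; this is a sketch built from the hypothetical helpers above, not the authors' code:

```python
def icl_bic(X, z_hat, w_hat):
    """Approximate ICL for the Gaussian latent block model with free
    proportions, free block means and free block variances."""
    n, d = X.shape
    n_k, d_l = np.bincount(z_hat), np.bincount(w_hat)
    g, m = n_k.size, d_l.size
    max_log_pi = (n_k * np.log(n_k / n)).sum()    # max over pi
    max_log_rho = (d_l * np.log(d_l / d)).sum()   # max over rho
    return (bic_like_term(X, z_hat, w_hat)        # includes -lambda/2 log(nd)
            + max_log_pi - 0.5 * (g - 1) * np.log(n)
            + max_log_rho - 0.5 * (m - 1) * np.log(d))
```

The candidate model \((g, m)\) maximizing this criterion would then be retained.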

Appendix B: Derivation of exact \(\textit{ICL}\)

The criterion \(\textit{ICL}\) can be broken down into three terms:

$$\begin{aligned} \textit{ICL} (M)&= \log p(\mathbf {X}|\mathbf {z},\mathbf {w},M) + \log p(\mathbf {z}|M) + \log p(\mathbf {w}|M). \end{aligned}$$

The first term of the expansion (2) is then rewritten using the following decomposition:

$$\begin{aligned} p(\mathbf {X}|\mathbf {z},\mathbf {w},M)&= \frac{p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) p(\varvec{\alpha }|M) }{p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M)} , \end{aligned}$$

where \(p(\varvec{\alpha }|M)\) and \(p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M)\) are respectively the prior and posterior distributions of \(\varvec{\alpha }\).

For the latent block model with different variances, given the row and column labels, the entries \(x_{ij}\) of each block are independent and identically distributed. We thus apply the standard results for Gaussian samples (Gelman et al. 2004), where the distributions are defined by:

$$\begin{aligned}&p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) = \prod _{i,j,k,\ell } \left\{ N(x_{ij}; \mu _{k \ell }, \sigma ^2_{k \ell }) \right\} ^{z_{ik}w_{j \ell }}, \\&p(\varvec{\alpha }|M) = \prod _{k,\ell } \left\{ N\left( \mu _{k \ell }; \mu _0,\frac{\sigma ^2_{k \ell }}{\kappa _0}\right) \times \text {Inv-}\chi ^{2} (\sigma ^2_{k \ell }; \nu _0 ,\sigma ^2_0)\right\} , \\&p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M) = \prod _{k,\ell } \left\{ N \left( \mu _{k \ell }; \frac{\kappa _0 \mu _0 + n_k d_{\ell } \bar{x}_{k \ell }}{\kappa _0+n_k d_{\ell }}, \frac{\sigma ^2_{k \ell }}{\kappa _0+n_k d_{\ell }}\right) \right. \\&\quad \times \left. \text {Inv-}\chi ^2 \left( \sigma ^2_{k \ell }; \nu _0+n_k d_{\ell } , \frac{\nu _0 \sigma ^2_0 + (n_k d_{\ell } -1)s^{2\star }_{k \ell } + \frac{\kappa _0 n_k d_{\ell }}{\kappa _0 + n_k d_{\ell }} (\bar{x}_{k \ell }-\mu _0)^2}{\nu _0+n_k d_{\ell }} \right) \right\} . \end{aligned}$$

Using the definitions of these distributions, the first term of the expansion (2),

$$\begin{aligned} \log p(\mathbf {X}|\mathbf {z},\mathbf {w},M) = \log p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) + \log p(\varvec{\alpha }|M) - \log p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M) , \end{aligned}$$

is identified, after some calculations, as (3).
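Concretely, since the blocks are independent given the partitions, \(\log p(\mathbf {X}|\mathbf {z},\mathbf {w},M)\) is a sum over blocks of standard Normal/scaled-inverse-\(\chi ^2\) marginal likelihoods. A sketch reusing the imports above; the hyperparameter defaults \((\mu _0, \kappa _0, \nu _0, \sigma ^2_0)\) are illustrative, not the values used in the paper's experiments:

```python
def log_marglik_block(x, mu0=0.0, kappa0=1.0, nu0=1.0, s0sq=1.0):
    """Exact log marginal likelihood of one block of N entries under the
    conjugate Normal/scaled-inverse-chi^2 prior (standard Gaussian
    conjugacy result; see Gelman et al. 2004)."""
    N, xbar = x.size, x.mean()
    ss = ((x - xbar) ** 2).sum()                  # (n_k d_l - 1) * s^2*
    kappaN, nuN = kappa0 + N, nu0 + N
    nuN_sN2 = nu0 * s0sq + ss + kappa0 * N / kappaN * (xbar - mu0) ** 2
    return (gammaln(nuN / 2) - gammaln(nu0 / 2)
            + 0.5 * (np.log(kappa0) - np.log(kappaN))
            + 0.5 * (nu0 * np.log(nu0 * s0sq) - nuN * np.log(nuN_sN2))
            - 0.5 * N * np.log(np.pi))

def log_p_X_given_zw(X, z, w, **hyper):
    """log p(X | z, w, M) for the different-variance model: the block
    marginal likelihoods simply add up on the log scale."""
    return sum(log_marglik_block(X[np.ix_(z == k, w == l)].ravel(), **hyper)
               for k in range(z.max() + 1) for l in range(w.max() + 1))
```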

For the latent block model with equal variances, the standard results need to be adapted to account for the shared parameter \(\sigma ^2\). The likelihood and prior distributions are now defined as follows:

$$\begin{aligned} p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M)&= \prod _{i,j,k,\ell } \left\{ N(x_{ij}; \mu _{k \ell }, \sigma ^2) \right\} ^{z_{ik}w_{j \ell }}, \\ p(\varvec{\alpha }|M)&= \prod _{k,\ell } \left\{ N\left( \mu _{k \ell }; \mu _0,\frac{\sigma ^2}{\kappa _0}\right) \right\} \times \text {Inv-}\chi ^2 (\sigma ^2; \nu _0 ,\sigma ^2_0). \end{aligned}$$

The posterior distribution is then computed using Bayes' formula, where the conditioning on \(\mathbf {z}\), \(\mathbf {w}\) and \(M\) is dropped from the notation for brevity:

$$\begin{aligned}&p(\varvec{\mu },\sigma ^2 | \mathbf {X}) \propto p(\varvec{\mu },\sigma ^2) p(\mathbf {X}| \varvec{\mu },\sigma ^2) \\&\quad \propto (\sigma ^2)^{-(\frac{\nu _0}{2}+1)} \exp (- \frac{1}{2\sigma ^2}\nu _0 \sigma ^2_0 ) \prod _{k,\ell } \left\{ \sigma ^{-1} \exp \left( -\frac{1}{2\sigma ^2} \kappa _0(\mu _{k\ell }-\mu _0)^2\right) \right\} \\&\qquad \times (\sigma ^2)^{-\frac{nd}{2}} \exp \bigg (-\frac{1}{2\sigma ^2} \underbrace{\sum _{i,j,k,\ell } z_{ik} w_{j\ell } (x_{ij}-\mu _{k\ell })^2}_{\displaystyle (nd-gm) s^{2\star }_{w} + \sum _{k,\ell } n_k d_{\ell } (\mu _{k\ell }-\bar{x}_{k\ell })^2} \bigg ) , \\&\quad = (\sigma ^2)^{-(\frac{\nu _0+nd}{2}+1)} \exp \left( - \frac{1}{2\sigma ^2} (\nu _0 \sigma ^2_0 +(nd-gm) s^{2\star }_{w})\right) \\&\qquad \times \prod _{k,\ell } \left\{ \sigma ^{-1} \exp \left( -\frac{1}{2\sigma ^2} \left( \kappa _0 (\mu _{k\ell } - \mu _0)^2+ n_k d_\ell (\mu _{k\ell }-\bar{x}_{k\ell })^2\right) \right) \right\} \\&\quad = (\sigma ^2)^{-(\frac{\nu _0+nd}{2}+1)} \exp \left( -\frac{\nu _0 \sigma ^2_0 +(nd-gm) s^{2\star }_{w}}{2\sigma ^2} \right) \\&\qquad \times \prod _{k,\ell } \sigma ^{-1} \exp \left( -\frac{\kappa _0+n_k d_\ell }{2\sigma ^2} \left( \mu _{k\ell }-\frac{\kappa _0 \mu _0+ n_k d_\ell \bar{x}_{k\ell }}{\kappa _0+n_k d_\ell } \right) ^2\!-\! \frac{(\kappa _0 n_k d_\ell )(\bar{x}_{k\ell }-\mu _0)^2}{2\sigma ^2(\kappa _0 + n_k d_\ell )} \right) \\&\quad = \prod _{k,\ell } \sigma ^{-1} \exp \left( -\frac{\kappa _0+n_k d_\ell }{2\sigma ^2} \left( \mu _{k\ell }-\frac{\kappa _0 \mu _0+ n_k d_\ell \bar{x}_{k\ell }}{\kappa _0+n_k d_\ell } \right) ^2 \right) \\&\qquad \times (\sigma ^2)^{-(\frac{\nu _0+nd}{2}+1)} \exp \left( - \frac{\nu _0 \sigma ^2_0+(nd-gm) s^{2\star }_{w} + \sum _{k,\ell }\frac{\kappa _0 n_k d_\ell }{\kappa _0 + n_k d_\ell } (\bar{x}_{k\ell }-\mu _0)^2}{2\sigma ^2} \right) \end{aligned}$$

This posterior density can be factorized as:

$$\begin{aligned} p(\varvec{\mu },\sigma ^2| \mathbf {X})=p(\varvec{\mu }| \sigma ^2, \mathbf {X})\, p(\sigma ^2| \mathbf {X}). \end{aligned}$$

Thus, the posterior distribution is identified as follows (using the conditional independence of the \(\mu _{k \ell }\) given \(\sigma ^2\)):

$$\begin{aligned}&p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M) = \prod _{k,\ell } \left\{ N \left( \mu _{k \ell }; \frac{\kappa _0 \mu _0 + n_k d_{\ell } \bar{x}_{k\ell }}{\kappa _0+n_k d_{\ell }}, \frac{\sigma ^2}{\kappa _0+n_k d_{\ell }} \right) \right\} \\&\quad \times \, \text {Inv-}\chi ^2 \left( \sigma ^2; \nu _0+n d , \frac{\nu _0 \sigma ^2_0+(nd-gm)s^{2\star }_{w}+\sum _{k,\ell }\frac{n_k d_{\ell } \kappa _0}{n_k d_{\ell } +\kappa _0}(\bar{x}_{k \ell }-\mu _0)^2}{\nu _0+nd} \right) . \end{aligned}$$
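For completeness, a short sketch of these posterior parameters for the shared \(\sigma ^2\) (same illustrative hyperparameter defaults as above):

```python
def posterior_sigma2_params(X, z, w, mu0=0.0, kappa0=1.0, nu0=1.0, s0sq=1.0):
    """Parameters (nu_n, scale) of the scaled-inverse-chi^2 posterior of the
    shared variance sigma^2 in the equal-variance model."""
    n, d = X.shape
    ss_within = 0.0   # accumulates (nd - gm) * s^2*_w
    shrink = 0.0      # accumulates the (x_bar_{kl} - mu0)^2 terms
    for k in range(z.max() + 1):
        for l in range(w.max() + 1):
            block = X[np.ix_(z == k, w == l)].ravel()
            xbar = block.mean()
            ss_within += ((block - xbar) ** 2).sum()
            # block.size equals n_k * d_l
            shrink += kappa0 * block.size / (kappa0 + block.size) * (xbar - mu0) ** 2
    nu_n = nu0 + n * d
    return nu_n, (nu0 * s0sq + ss_within + shrink) / nu_n
```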

For the terms related to the proportions, when the proportions are free, we assume a symmetric Dirichlet prior distribution with parameters \((\delta _0,\ldots ,\delta _0)\) for the row and column proportion parameters \((\varvec{\pi },\varvec{\rho })\), so that:

$$\begin{aligned} p(\mathbf {z}|M)&= \int _{\mathcal {P}} \pi _1^{n_1} \cdots \pi _g^{n_g} \frac{\varGamma (g \delta _0)}{\varGamma ( \delta _0)\cdots \varGamma (\delta _0)} \mathbf {1}_{\sum _k \pi _k=1} d\varvec{\pi }, \\&= \frac{\varGamma ( g\delta _0)}{\varGamma ( \delta _0)^{g}} \frac{\varGamma ( \delta _0 + n_1)\cdots \varGamma (\delta _0+ n_g)}{\varGamma (n + g \delta _0)}. \end{aligned}$$

The expression of \(p(\mathbf {w}|M)\) follows in the same way. More details are given by Biernacki et al. (1998).

Cite this article

Lomet, A., Govaert, G. & Grandvalet, Y. Model selection for Gaussian latent block clustering with the integrated classification likelihood. Adv Data Anal Classif 12, 489–508 (2018). https://doi.org/10.1007/s11634-013-0161-3

Keywords

  • Co-clustering
  • Latent block model
  • Model selection
  • Continuous data
  • Integrated classification likelihood
  • BIC

Mathematics Subject Classification (2000)

  • 91C20
  • 62H30