
Estimation and selection for the latent block model on categorical data


Abstract

This paper deals with estimation and model selection in the Latent Block Model (LBM) for categorical data. First, after providing sufficient conditions ensuring the identifiability of this model, we generalise estimation procedures and model selection criteria originally derived for binary data. Second, we develop Bayesian inference through Gibbs sampling with a well-calibrated non-informative prior distribution, in order to get the MAP estimator: this is proved to avoid the traps encountered by the LBM with the maximum likelihood methodology. Model selection criteria are then presented. In particular, an exact expression of the integrated completed likelihood criterion requiring no asymptotic approximation is derived. Finally, numerical experiments on both simulated and real data sets highlight the appeal of the proposed estimation and model selection procedures.


Notes

  1. The data set is available from https://www.hds.utc.fr/coclustering/doku.php.

  2. Congressional Voting Records data set is available from http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.

  3. The code of Wyse and Friel is available at http://www.ucd.ie/statdept/jwyse.

References

  • Allman, E., Matias, C., Rhodes, J.: Identifiability of parameters in latent structure models with many observed variables. Ann. Stat. 37, 3099–3132 (2009)

  • Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S., Modha, D.S.: A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. J. Mach. Learn. Res. 8, 1919–1986 (2007)

  • Baudry, J.-P.: Sélection de modèle pour la classification non supervisée. Choix du nombre de classes. PhD thesis, Université Paris Sud (2009)

  • Baudry, J.-P., Raftery, A.E., Celeux, G., Lo, K., Gottardo, R.: Combining mixture components for clustering. J. Comput. Gr. Stat. 19, 332–353 (2010)

  • Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (2000)

  • Carreira-Perpiñán, M., Renals, S.: Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Comput. 12, 141–152 (2000)

  • Celeux, G., Diebolt, J.: Stochastic versions of the EM algorithm. Comput. Stat. Q. 2, 73–82 (1985)

  • Celisse, A., Daudin, J.-J., Latouche, P.: Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6, 1847–1899 (2012)


  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B 39, 1–38 (1977)

  • Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer Series in Statistics. Springer (2006)

  • Frühwirth-Schnatter, S.: Dealing with label switching under model uncertainty. In: Mixtures: Estimation and Applications. Wiley, Chichester (2011)

  • Govaert, G.: Algorithme de classification d’un tableau de contingence. In: Proceedings of the First International Symposium on Data Analysis and Informatics, pp. 487–500. INRIA, Versailles (1977)

  • Govaert, G.: Classification croisée. PhD thesis, Université Paris 6, France (1983)

  • Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: comparison of different approaches. Comput. Stat. Data Anal. 52, 3233–3245 (2008)

  • Govaert, G., Nadif, M.: Latent block model for contingency table. Commun. Stat. Theory Methods 39, 416–425 (2010)


  • Gyllenberg, M., Koski, T., Reilink, E., Verlaan, M.: Non-uniqueness in probabilistic numerical identification of bacteria. J. Appl. Probab. 31, 542–548 (1994)

  • Jagalur, M., Pal, C., Learned-Miller, E., Zoeller, R.T., Kulp, D.: Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinform. 8, S5 (2007)


  • Keribin, C.: Consistent estimation of the order of mixture models. Sankhya Ser. A 62, 49–66 (2000)

  • Keribin, C.: Méthodes bayésiennes variationnelles: concepts et applications en neuroimagerie. Journal de la Société Française de Statistique 151, 107–131 (2010)


  • Keribin, C., Brault, V., Celeux, G., Govaert, G.: Model selection for the binary latent block model. In: Proceedings of COMPSTAT 2012 (2012)

  • Keribin, C., Brault, V., Celeux, G., Govaert, G.: Estimation and selection for the latent block model on categorical data. Research Report RR-8264, INRIA (2013). http://hal.inria.fr/hal-00802764

  • Lomet, A.: Sélection de modèle pour la classification croisée de données continues. PhD thesis, Université de Technologie de Compiègne (2012)

  • Lomet, A., Govaert, G., Grandvalet, Y.: Un protocole de simulation de données pour la classification croisée. In: 44èmes Journées de Statistique, Bruxelles (2012)

  • Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinf. 1, 24–45 (2004)


  • Mariadassou, M., Matias, C.: Convergence of the groups posterior distribution in latent or stochastic block models. arXiv preprint arXiv:1206.7101v2 (2013)

  • McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, New York (2008)

  • McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

  • Meeds, E., Roweis, S.: Nonparametric Bayesian biclustering. Technical Report UTML TR 2007-001, Department of Computer Science, University of Toronto (2007)

  • Rousseau, J., Mengersen, K.: Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. R. Stat. Soc. Ser. B 73, 689–710 (2011)

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)


  • Shan, H., Banerjee, A.: Bayesian co-clustering. In: Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM ’08), pp. 530–539. IEEE Computer Society, Washington, DC (2008)

  • Wyse, J., Friel, N.: Block clustering with collapsed latent block models. Stat. Comput. 22, 415–428 (2012)



Acknowledgments

The authors thank the reviewers and the Associate Editor for their very helpful comments which have greatly improved this paper. C. K. has been supported by LMH (Hadamard Mathematics Labex), backed by the foundation FMJH (Fondation Mathématique Jacques Hadamard).

Author information

Corresponding author

Correspondence to Christine Keribin.

Appendices

Appendix A: Proof of Theorem 1

The identifiability of a parametric model requires that, for any two different values \(\varvec{\theta }\ne \varvec{\theta }'\) in the parameter space, the corresponding probability distributions \({I\!P}_{\varvec{\theta }}\) and \({I\!P}_{\varvec{\theta }'}\) are different. We prove that, under the assumptions of Theorem 1, there exists a unique parameter \(\varvec{\theta }=(\varvec{\pi },\varvec{\rho },\varvec{\alpha })\), up to a permutation of row and column labels, corresponding to the probability distribution \({I\!P}(\mathbf{y})\) of a matrix \(\mathbf{y}\) having at least \(2m-1\) rows and \(2g-1\) columns. The proof is adapted from Celisse et al. (2012), who established a similar result for the Stochastic Block Model. Notice first that \(\varvec{\tau }= A \varvec{\rho }\) is the vector of the probabilities \(\tau _k\) of observing a \(1\) in a cell of a row belonging to row class \(k\):

$$\begin{aligned} \tau _k={I\!P}(y_{ij}=1|z_{ik}=1)=\sum _\ell \rho _\ell \alpha _{k\ell }=(A \varvec{\rho })_k. \end{aligned}$$

As all the coordinates of \(\varvec{\tau }\) are distinct, the \(g\times g\) matrix \(R\) with coefficients \(R_{ik}=(\tau _k)^i\), for \(0 \le i < g\) and \(1 \le k \le g\), is Vandermonde, and hence invertible. Consider now \(u_i\), the probability of observing \(1\) in the first \(i\) cells of the first row of \(\mathbf{y}\):

$$\begin{aligned} u_i&= {I\!P}(y_{11}=1,\ldots ,y_{1i}=1)\\&= \sum _{k, \ell _1,\ldots ,\ell _i}{I\!P}(z_{1k}=1) \prod _{j=1}^{i} {I\!P}(y_{1j}|z_{1k}=1, w_{j\ell _j}=1)\,{I\!P}(w_{j\ell _j}=1)\\&= \sum _{k} {I\!P}(z_{1k}=1) \prod _{j=1}^{i} \sum _{\ell _j} {I\!P}(y_{1j}|z_{1k}=1, w_{j\ell _j}=1)\,{I\!P}(w_{j\ell _j}=1)\\&= \sum _{k} \pi _k (\tau _k)^i. \end{aligned}$$

Given \({I\!P}(\mathbf{y})\), the quantities \(u_1, \ldots ,u_{2g-1}\) are known, and we set \(u_0=1\). Let now \(M\) be the \((g+1)\times g\) matrix defined by \(M_{ij}=u_{i+j-2}\), for all \(1\le i \le g+1\) and \(1\le j\le g\), and define \(M_i\), for \(i=0,\ldots ,g\), as the square matrix obtained by removing row \(i+1\) from \(M\). The coefficients of \(M\) are

$$\begin{aligned} M_{ij}=u_{i+j-2}=\sum _{1\le k\le g} \tau _k^{i-1} \pi _k \tau _k^{j-1}. \end{aligned}$$

We can write, with \(A_{\pi }=\hbox {Diag}(\varvec{\pi })\)

$$\begin{aligned} M_g=RA_{\pi }R'. \end{aligned}$$

Now \(R\), unknown at this stage, can be retrieved by noticing that the coordinates of \(\varvec{\tau }\) are the roots of the following polynomial of degree \(g\) (Celisse et al. 2012)

$$\begin{aligned} B(x)=\sum _{k=0}^g (-1)^{k+g} D_k x^k, \end{aligned}$$

where \(D_k=\hbox {det}\, M_k\) and \(D_g\ne 0\), since \(M_g\) is a product of invertible matrices. Hence \(\varvec{\tau }\) can be determined, and \(R\) is then known. Consequently, \(\varvec{\pi }\) is defined in a unique manner by \(A_{\pi }=R^{-1} M_g R^{'-1}\).
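To make this construction concrete, here is a minimal NumPy sketch (ours, not part of the paper; the toy values of \(\varvec{\pi }\) and \(\varvec{\tau }\) are hypothetical) that recovers \(\varvec{\tau }\) and \(\varvec{\pi }\) from the moments \(u_0,\ldots ,u_{2g-1}\):

```python
import numpy as np

# Toy parameters (hypothetical): g = 2 row classes with distinct tau_k.
g = 2
pi = np.array([0.4, 0.6])
tau = np.array([0.3, 0.7])

# Moments u_i = sum_k pi_k tau_k^i, i = 0, ..., 2g-1, as read off IP(y).
u = np.array([np.sum(pi * tau ** i) for i in range(2 * g)])

# M_{ij} = u_{i+j-2} (0-based indexing here), of size (g+1) x g.
M = np.array([[u[i + j] for j in range(g)] for i in range(g + 1)])

# D_k = det of M with row k+1 removed; the tau_k are the roots of
# B(x) = sum_{k=0}^{g} (-1)^{k+g} D_k x^k.
D = np.array([np.linalg.det(np.delete(M, k, axis=0)) for k in range(g + 1)])
B = np.array([(-1) ** (k + g) * D[k] for k in range(g + 1)])
tau_hat = np.sort(np.roots(B[::-1]).real)     # np.roots wants highest degree first

# With tau known, R is Vandermonde and A_pi = R^{-1} M_g R'^{-1}.
R = np.vander(tau_hat, g, increasing=True).T  # R_{ik} = tau_k^i
M_g = M[:g, :]                                # top g x g block of M
A_pi = np.linalg.inv(R) @ M_g @ np.linalg.inv(R.T)
print(tau_hat, np.diag(A_pi))                 # -> [0.3 0.7] and [0.4 0.6]
```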

In the same way, \(\varvec{\rho }\) is defined in a unique manner by considering the probability \(\sigma _\ell \) of observing \(1\) in a cell of a column belonging to column class \(\ell \) and the probabilities \(v_j\) of observing \(1\) in the first \(j\) cells of the first column of \(\mathbf{y}\):

$$\begin{aligned} \sigma _\ell ={I\!P}(y_{ij}=1|w_{j\ell }=1) \quad \hbox {and} \quad v_j ={I\!P}(y_{11}=1,\ldots ,y_{j1}=1 ). \end{aligned}$$

To determine \(\varvec{\alpha }\), we introduce, for \(1\le i \le g\) and \(1\le j \le m\), the probabilities \(U_{ij}\) of observing \(1\) in the first \(i\) cells of the first row and in the first \(j\) cells of the first column:

$$\begin{aligned} U_{ij}&= {I\!P}(y_{11}=1,\ldots ,y_{1i}=1, y_{21}=1,\ldots ,y_{j1}=1) \\&= \sum _{k,\ell ,\ell _2,\ldots ,\ell _i,k_2,\ldots ,k_j} {I\!P}(y_{11}|z_{1k}=1, w_{1\ell }=1)\,{I\!P}(z_{1k}=1)\,{I\!P}(w_{1\ell }=1)\\&\quad \times \prod _{\lambda =2}^i{{I\!P}(y_{1\lambda }|z_{1k}=1, w_{\lambda \ell _\lambda }=1)\, {I\!P}(w_{\lambda \ell _\lambda }=1)}\\&\quad \times \prod _{\eta =2}^j{{I\!P}(y_{\eta 1}|z_{\eta k_\eta }=1, w_{1\ell }=1)\, {I\!P}(z_{\eta k_\eta }=1)}\\&= \sum _{k,\ell } {I\!P}(y_{11}|z_{1k}=1, w_{1\ell }=1)\,{I\!P}(z_{1k}=1)\,{I\!P}(w_{1\ell }=1)\, \tau _k^{i-1}\, \sigma _\ell ^{j-1}\\&= \sum _{k,\ell } \pi _k \tau _k^{i-1} \alpha _{k\ell } \rho _\ell \sigma _\ell ^{j-1}. \end{aligned}$$

These probabilities are known, and we can write, with \(S_{j\ell }=(\sigma _\ell )^{j-1}\), \(j=1,\ldots ,m\), \(\ell =1,\ldots ,m\),

$$\begin{aligned} U=RA_{\pi } A A_{\rho } S', \end{aligned}$$

which defines \(A=A_{\pi }^{-1}R^{-1} U S^{'-1} A_{\rho }^{-1}\), and hence \(\varvec{\alpha }\), in a unique manner.

The proof extends straightforwardly to the categorical case. The identification of \(\varvec{\pi }\) and \(\varvec{\rho }\) is obtained by considering the probabilities of \(y_{ij}=1\), where \(1\) denotes the first outcome of the multinomial distribution. Then each \(A^h=(\alpha _{k\ell }^h)_{k=1,\ldots ,g;\,\ell =1,\ldots ,m}\) is successively identified by considering the events \(y_{ij}=h\) in place of \(y_{ij}=1\).

Appendix B: VEM algorithm for categorical data

Govaert and Nadif (2008) proposed, for the binary case, a variational approximation of the EM algorithm obtained by imposing that the joint distribution of the labels factorises as \(q^{(c)}_{zw}(\mathbf{z},\mathbf{w})=q^{(c)}_z(\mathbf{z})q^{(c)}_w(\mathbf{w})\). To simplify formulas, we denote

$$\begin{aligned} s_{ik}^{(c)}=q^{(c)}_z(z_{ik}=1) \end{aligned}$$

and

$$\begin{aligned} t_{j\ell }^{(c)}=q^{(c)}_w(w_{j\ell }=1). \end{aligned}$$

Using the variational approximation, the maximisation of the log-likelihood is replaced by the maximisation of the free energy

$$\begin{aligned} {\mathcal {F}}(q_{zw},\varvec{\theta })= {I\!E}_{q_{z}q_{w}}\left[ \log \frac{ p(\mathbf{y},\mathbf{z},\mathbf{w};\varvec{\theta })}{q_{z}(\mathbf{z})q_{w}(\mathbf{w})}\right] \end{aligned}$$

alternately in \(q_{z}\), \(q_w\) and \(\varvec{\theta }\) (see Keribin 2010). The difference between the maximum log-likelihood and the maximum free energy is the Kullback–Leibler divergence \(KL(q_{zw}||p(\mathbf{z},\mathbf{w}|\mathbf{y};\varvec{\theta }) )={I\!E}_{q_{zw}}\left[ \log \frac{q_{zw}(\mathbf{z},\mathbf{w})}{p(\mathbf{z},\mathbf{w}|\mathbf{y};\varvec{\theta })} \right] \). This extends to the categorical LBM and leads to the following Variational EM (VEM) algorithm:

E step It consists of maximising the free energy in \(q_{z}\) and \(q_w\), and it leads to:

  1. Computing \(s_{ik}^{(c+1)}\) with fixed \(t^{(c)}_{j\ell }\) and \(\varvec{\theta }^{(c)}\)

    $$\begin{aligned} s_{ik}^{(c+1)}= \frac{\pi _k^{(c)} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot }^{(c)})}{\sum _{k'} \pi _{k'}^{(c)} \psi _{k'}(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k'\cdot }^{(c)}) }, \quad k=1,\ldots , g \end{aligned}$$

    where \(\mathbf{y}_{i\cdot }\) denotes the row \(i\) of the matrix \(\mathbf{y}\), \(\varvec{\alpha }_{k\cdot }=(\alpha _{k1},\ldots ,\alpha _{km})\), and

    $$\begin{aligned} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot }^{(c)})=\prod _{\ell ,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _j t_{j\ell }^{(c)} y_{ij}^{h}} \end{aligned}$$
  2. Computing \(t_{j\ell }^{(c+1)}\) with fixed \(s^{(c+1)}_{ik}\) and \(\varvec{\theta }^{(c)}\)

    $$\begin{aligned} t_{j\ell }^{(c+1)}= \frac{\rho _{\ell }^{(c)} \phi _{\ell }(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell }^{(c)})}{\sum _{\ell '} \rho _{\ell '}^{(c)} \phi _{\ell '}(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell '}^{(c)}) } , \ell =1,\ldots ,m \end{aligned}$$

    where \(\mathbf{y}_{\cdot j}\) denotes the column \(j\) of the matrix \(\mathbf{y},\, \varvec{\alpha }_{\cdot \ell }=(\alpha _{1\ell },\ldots ,\alpha _{g\ell })\) and

    $$\begin{aligned} \phi _{\ell }(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell }^{(c)})=\prod _{k,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _i s_{ik}^{(c+1)} y_{ij}^{h}}. \end{aligned}$$

M step Updating \(\varvec{\theta }^{(c+1)}\). Denoting \(s_{.k}^{(c+1)}=\sum _i s_{ik}^{(c+1)}\), \(t_{.\ell }^{(c+1)}=\sum _j t_{j\ell }^{(c+1)}\), it leads to

$$\begin{aligned}&\pi _k^{(c+1)}=\frac{ s_{.k}^{(c+1)}}{n},\\&\rho _{\ell }^{(c+1)}=\frac{t_{.\ell }^{(c+1)}}{d},\\&{\alpha _{k\ell }^{h}}^{(c+1)}=\frac{\sum _{i,j} s_{ik}^{(c+1)} t_{j\ell }^{(c+1)} y_{ij}^{h}}{ s_{.k}^{(c+1)} t_{.\ell }^{(c+1)}}. \end{aligned}$$
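For concreteness, the following is a minimal NumPy sketch of one VEM iteration (an illustration under the notation of this appendix, not the authors' code; the one-hot encoding of \(\mathbf{y}\) and all variable names are ours):

```python
import numpy as np

def vem_step(Y, t, pi, rho, alpha):
    """One VEM iteration for the categorical LBM (sketch).
    Y: one-hot data, shape (n, d, r); t: variational column
    probabilities (d, m); pi: (g,); rho: (m,); alpha: (g, m, r)."""
    # E step, rows: s_ik proportional to pi_k psi_k(y_i. ; alpha_k.)
    T = np.einsum('jl,ijh->ilh', t, Y)                 # sum_j t_jl y_ij^h
    log_s = np.log(pi) + np.einsum('ilh,klh->ik', T, np.log(alpha))
    s = np.exp(log_s - log_s.max(axis=1, keepdims=True))
    s /= s.sum(axis=1, keepdims=True)
    # E step, columns: t_jl proportional to rho_l phi_l(y_.j ; alpha_.l)
    S = np.einsum('ik,ijh->jkh', s, Y)                 # sum_i s_ik y_ij^h
    log_t = np.log(rho) + np.einsum('jkh,klh->jl', S, np.log(alpha))
    t = np.exp(log_t - log_t.max(axis=1, keepdims=True))
    t /= t.sum(axis=1, keepdims=True)
    # M step: pi_k = s_.k / n, rho_l = t_.l / d and
    # alpha_kl^h = sum_ij s_ik t_jl y_ij^h / (s_.k t_.l)
    pi, rho = s.mean(axis=0), t.mean(axis=0)
    alpha = np.einsum('ik,jl,ijh->klh', s, t, Y)
    alpha /= s.sum(axis=0)[:, None, None] * t.sum(axis=0)[None, :, None]
    return s, t, pi, rho, alpha
```

The E-step computations are done in the log domain to avoid numerical underflow; iterating this step until the free energy stabilises yields the variational estimate of \(\varvec{\theta }\).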

Appendix C: SEM algorithm for categorical data

E and S step

  1. Computation of \( p(\mathbf{z}|\mathbf{y},\mathbf{w}^{(c)};\varvec{\theta }^{(c)})\), then simulation of \(\mathbf{z}^{(c+1)}\) according to \( p(\mathbf{z}|\mathbf{y},\mathbf{w}^{(c)};\varvec{\theta }^{(c)})\):

    $$\begin{aligned} p(z_i=k|\mathbf{y}_{i\cdot },\mathbf{w}^{(c)};\varvec{\theta }^{(c)})= \frac{\pi ^{(c)}_k \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }^{(c)}_{k\cdot })}{\sum _{k'} \pi ^{(c)}_{k'} \psi _{k'}(\mathbf{y}_{i\cdot };\varvec{\alpha }^{(c)}_{k'\cdot }) }, \end{aligned}$$

    for \(k=1,\ldots , g\) with

    $$\begin{aligned} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot })=\prod _{\ell ,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _j w_{j\ell }^{(c)} y_{ij}^{h}} \end{aligned}$$
  2. Computation of \(p(\mathbf{w}|\mathbf{y},\mathbf{z}^{(c+1)};\varvec{\theta }^{(c)})\), then simulation of \(\mathbf{w}^{(c+1)}\) according to \(p(\mathbf{w}|\mathbf{y},\mathbf{z}^{(c+1)};\varvec{\theta }^{(c)})\).

M step Denoting \(z_{.k}:=\sum _i z_{ik}\) and \(w_{.\ell }:=\sum _j w_{j\ell }\), it leads to

$$\begin{aligned} \pi _k^{(c+1)}=\frac{z_{.k}^{(c+1)}}{n}, \rho _{\ell }^{(c+1)}=\frac{w_{.\ell }^{(c+1)}}{d} \end{aligned}$$

and

$$\begin{aligned} {\alpha _{k\ell }^{h}}^{(c+1)}=\frac{\sum _{i,j} z_{ik}^{(c+1)} w_{j\ell }^{(c+1)} y_{ij}^{h}}{ z_{.k}^{(c+1)} w_{.\ell }^{(c+1)}}. \end{aligned}$$

Note that the formulae of VEM and SEM-Gibbs are essentially the same, except that the probabilities \(s_{ik}\) and \(t_{j\ell }\) are replaced with binary indicator values \(z_{ik}\) and \(w_{j\ell }\).
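A matching NumPy sketch of one SEM-Gibbs iteration follows (again an illustration, not the authors' code; it reuses the one-hot conventions of the VEM sketch in Appendix B, and `rng` is a NumPy random generator):

```python
import numpy as np

def sem_step(Y, w, pi, rho, alpha, rng):
    """One SEM-Gibbs iteration for the categorical LBM (sketch).
    w: one-hot column labels, shape (d, m); other arguments as in
    the VEM sketch of Appendix B; rng: np.random.Generator."""
    g, m = alpha.shape[0], alpha.shape[1]
    # E and S step, rows: draw z_i from p(z_i = k | y, w^(c); theta^(c))
    W = np.einsum('jl,ijh->ilh', w, Y)                 # sum_j w_jl y_ij^h
    log_p = np.log(pi) + np.einsum('ilh,klh->ik', W, np.log(alpha))
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.eye(g)[[rng.choice(g, p=row) for row in p]]
    # E and S step, columns: draw w_j from p(w_j = l | y, z^(c+1); theta^(c))
    Z = np.einsum('ik,ijh->jkh', z, Y)                 # sum_i z_ik y_ij^h
    log_q = np.log(rho) + np.einsum('jkh,klh->jl', Z, np.log(alpha))
    q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    w = np.eye(m)[[rng.choice(m, p=row) for row in q]]
    # M step: same formulae as VEM, with the sampled indicators z, w
    # in place of s, t (a production version should guard empty classes).
    pi, rho = z.mean(axis=0), w.mean(axis=0)
    alpha = np.einsum('ik,jl,ijh->klh', z, w, Y)
    alpha /= z.sum(axis=0)[:, None, None] * w.sum(axis=0)[None, :, None]
    return z, w, pi, rho, alpha
```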

Appendix D: Computing ICL

In this section, the exact ICL expression is derived for categorical data. Using the independence of \(\mathbf{z}\) and \(\mathbf{w}\) conditionally on \((\varvec{\alpha }, \varvec{\pi }, \varvec{\rho })\), the integrated completed likelihood can be written

$$\begin{aligned} p(\mathbf{y},\mathbf{z}, \mathbf{w} )&= \int p(\mathbf{y},\mathbf{z}, \mathbf{w} | \varvec{\alpha }, \varvec{\pi }, \varvec{\rho }) p(\varvec{\alpha }) p(\varvec{\pi }) p(\varvec{\rho }) d\varvec{\alpha }d\varvec{\pi }d\varvec{\rho }\\&= \int p(\mathbf{y}|\mathbf{z}, \mathbf{w}, \varvec{\alpha }, \varvec{\pi }, \varvec{\rho }) p(\mathbf{z}, \mathbf{w} | \varvec{\alpha }, \varvec{\pi }, \varvec{\rho })\\&\quad \times p(\varvec{\alpha }) p(\varvec{\pi }) p(\varvec{\rho })\,d\varvec{\alpha }d\varvec{\pi }d\varvec{\rho }\\&= \int p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha }) p(\varvec{\alpha }) d\varvec{\alpha }\int p(\mathbf{z}| \varvec{\pi }) p(\varvec{\pi })d\varvec{\pi }\\&\quad \times \int p(\mathbf{w}| \varvec{\rho }) p(\varvec{\rho }) d\varvec{\rho }= p(\mathbf{z} )p(\mathbf{w} )p(\mathbf{y}|\mathbf{z}, \mathbf{w} ). \end{aligned}$$

Thus

$$\begin{aligned} \text{ ICL }=\log p(\mathbf{z} )+ \log p(\mathbf{w}) + \log p(\mathbf{y}|\mathbf{z}, \mathbf{w}). \end{aligned}$$

Now, according to the LBM definition,

$$\begin{aligned} p(\mathbf{z} | \varvec{\pi })=\prod _{i,k} \pi _k^{z_{ik}}. \end{aligned}$$

Since the prior distribution of \(\varvec{\pi }\) is the Dirichlet distribution \(\mathcal{D}(a,\ldots ,a)\)

$$\begin{aligned} p(\varvec{\pi })= \frac{\Gamma (ga)}{\Gamma (a)^g} \prod _{k} \pi _k^{a-1}, \end{aligned}$$

we have

$$\begin{aligned} p(\varvec{\pi }|\mathbf{z})=\frac{p(\mathbf{z}|\varvec{\pi })p(\varvec{\pi })}{\int p(\mathbf{z}|\varvec{\pi })p(\varvec{\pi })\,d\varvec{\pi }} \propto \prod _k \pi _k^{z_{.k} +a-1}. \end{aligned}$$

We recognise a non-normalised Dirichlet distribution \(\mathcal{D}(z_{.1}+a,\ldots ,z_{.g}+a)\) with the normalising factor

$$\begin{aligned} \frac{\Gamma (n+ga)}{\prod _k\Gamma (z_{.k}+a)}. \end{aligned}$$

The expression of \(p( \mathbf{z})\) follows directly from Bayes’ theorem

$$\begin{aligned} p( \mathbf{z})=\frac{\Gamma (ag)}{\Gamma (a)^g}\frac{\prod _k\Gamma (z_{.k}+a)}{\Gamma (n+ag)}. \end{aligned}$$
(8)

In the same manner,

$$\begin{aligned} p( \mathbf{w})=\frac{\Gamma (am)}{\Gamma (a)^m}\frac{\prod _\ell \Gamma (w_{.\ell }+a)}{\Gamma (d+am)}. \end{aligned}$$
(9)

We now turn to the computation of \(p(\mathbf{y}|\mathbf{z}, \mathbf{w} )\). Since \(\varvec{\alpha }\) and \((\mathbf{z}, \mathbf{w})\) are independent, we have

$$\begin{aligned} p(\varvec{\alpha }| \mathbf{y},\mathbf{z}, \mathbf{w} )=\frac{ p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha }) p(\varvec{\alpha }) }{\int p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha }) p(\varvec{\alpha }) d\varvec{\alpha }}, \end{aligned}$$

and, using the conditional independence of the \(y_{ij}\) given \(\mathbf{z}\), \(\mathbf{w}\) and \(\varvec{\alpha }\),

$$\begin{aligned} p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha })&= \prod _{i,j,k,\ell } \left( \prod _h (\alpha _{k\ell }^{h})^{ y_{ij}^{h}}\right) ^{z_{ik}w_{j\ell }}\\&= \prod _{k,\ell } \left( \prod _{h} (\alpha _{k\ell }^{h})^{\sum _{i,j}z_{ik}w_{j\ell } y_{ij}^{h}}\right) \\&= \prod _{k,\ell } \left( \prod _{h} (\alpha _{k\ell }^{h})^{N_{k\ell }^h}\right) . \\ \end{aligned}$$

Therefore

$$\begin{aligned} p(\varvec{\alpha }| \mathbf{y},\mathbf{z}, \mathbf{w} )&\propto \prod _{k,\ell } \left( p(\alpha _{k\ell })\prod _{h} (\alpha _{k\ell }^{h})^{N_{k\ell }^h} \right) \\&\propto \prod _{k,\ell } \left[ \left( \prod _{h}(\alpha _{k\ell }^{h})^{b-1}\right) \left( \prod _{h} (\alpha _{k\ell }^{h})^{N_{k\ell }^h} \right) \right] \\&\propto \prod _{k,\ell } \left( \prod _{h} (\alpha _{k\ell }^{h})^{b+N_{k\ell }^h-1} \right) .\\ \end{aligned}$$

because \(p(\alpha _{k\ell })\) is the density of a Dirichlet distribution \(\mathcal{D}(b,\ldots ,b)\). Each term \(k\ell \) is the density of a non-normalised Dirichlet distribution \({\mathcal {D}}(b+N_{k\ell }^1,\ldots ,b+N_{k\ell }^r)\), with normalising factor

$$\begin{aligned} \frac{\Gamma (z_{.k}w_{.\ell }+rb)}{\prod _h \Gamma (N_{k\ell }^h +b)}. \end{aligned}$$

Thus, by Bayes’ formula,

$$\begin{aligned} p(\mathbf{y}|\mathbf{z}, \mathbf{w} )= \prod _{k,\ell } \frac{\Gamma (rb)}{\Gamma (b)^r}\frac{\prod _h \Gamma (N_{k\ell }^h +b)}{\Gamma (z_{.k}w_{.\ell }+rb)}. \end{aligned}$$
(10)

The ICL criterion presented in Section 4.1 is then straightforwardly derived from Equations (3), (8), (9) and (10).
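As a check on the algebra, here is a minimal sketch computing \(\log p(\mathbf{z})+\log p(\mathbf{w})+\log p(\mathbf{y}|\mathbf{z},\mathbf{w})\) from Equations (8), (9) and (10) (ours, not the authors' code; the default hyperparameter values \(a=b=1\) are placeholders, not the calibrated values discussed in the paper):

```python
import numpy as np
from scipy.special import gammaln

def exact_icl(Y, z, w, a=1.0, b=1.0):
    """Exact ICL, Eqs. (8)-(10): log p(z) + log p(w) + log p(y|z,w).
    Y: one-hot data (n, d, r); z: (n, g) and w: (d, m) one-hot label
    matrices; a, b: Dirichlet hyperparameters of the priors."""
    (n, d, r), g, m = Y.shape, z.shape[1], w.shape[1]
    zk, wl = z.sum(axis=0), w.sum(axis=0)              # z_.k and w_.l
    N = np.einsum('ik,jl,ijh->klh', z, w, Y)           # block counts N_kl^h
    log_pz = (gammaln(a * g) - g * gammaln(a)
              + gammaln(zk + a).sum() - gammaln(n + a * g))
    log_pw = (gammaln(a * m) - m * gammaln(a)
              + gammaln(wl + a).sum() - gammaln(d + a * m))
    log_py = (g * m * (gammaln(r * b) - r * gammaln(b))
              + gammaln(N + b).sum()
              - gammaln(np.outer(zk, wl) + r * b).sum())
    return log_pz + log_pw + log_py
```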


Cite this article

Keribin, C., Brault, V., Celeux, G. et al. Estimation and selection for the latent block model on categorical data. Stat Comput 25, 1201–1216 (2015). https://doi.org/10.1007/s11222-014-9472-2

