
Greedy clustering of count data through a mixture of multinomial PCA

  • Original paper
  • Published in Computational Statistics

Abstract

Count data are becoming increasingly ubiquitous in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models directly accounting for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction into clustering can drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data also known in the literature as the probabilistic clustering-projection model. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while being able to assign each observation to a unique cluster. We introduce a greedy clustering algorithm in which inference and clustering are performed jointly, by combining a classification variational expectation maximization algorithm with a branch & bound-like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and the robustness of the method. Finally, we illustrate the qualitative interest of the method in a real-world application, the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.


Notes

  1. https://github.com/nicolasJouvin/MoMPCA.

  2. Available on CRAN.

  3. In-situ cancers are pre-invasive lesions that get their name from the fact that they have not yet started to spread. Invasive cancer tissues can contain both invasive and in-situ lesions in the same slide.

References

  • Aggarwal CC, Zhai C (2012) A survey of text clustering algorithms. In: Mining text data. Springer, New York, pp 77–128

  • Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Selected papers of Hirotugu Akaike. Springer, New York, pp 199–213

  • Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106

  • Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803–821

  • Bergé LR, Bouveyron C, Corneli M, Latouche P (2019) The latent topic block model for the co-clustering of textual interaction data. Comput Stat Data Anal

  • Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725

  • Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–1022

  • Bouveyron C, Celeux G, Murphy TB, Raftery AE (2019) Model-based clustering and classification for data science: with applications in R. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge

  • Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519

  • Bouveyron C, Latouche P, Zreik R (2018) The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat Comput 28(1):11–31

  • Bui QV, Sayadi K, Amor SB, Bui M (2017) Combining latent Dirichlet allocation and k-means for documents clustering: effect of probabilistic based distance measures. In: Asian conference on intelligent information and database systems. Springer, New York, pp 248–257

  • Buntine W (2002) Variational extensions to EM and multinomial PCA. In: European conference on machine learning. Springer, New York, pp 23–34

  • Buntine WL, Perttu S (2003) Is multinomial PCA multi-faceted clustering or dimensionality reduction? In: AISTATS

  • Carel L, Alquier P (2017) Simultaneous dimension reduction and clustering via the NMF-EM algorithm. arXiv preprint arXiv:1709.03346

  • Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332

  • Chien J-T, Lee C-H, Tan Z-H (2017) Latent Dirichlet mixture model. Neurocomputing

  • Chiquet J, Mariadassou M, Robin S et al (2018) Variational inference for probabilistic Poisson PCA. Ann Appl Stat 12(4):2674–2698

  • Cunningham RB, Lindenmayer DB (2005) Modeling count data of rare species: some statistical issues. Ecology 86(5):1135–1142

  • Daudin J-J, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183

  • Defossez G, Le Guyader-Peyrou S, Uhry Z, Grosclaude P, Remontet L, Colonna M, Dantony E, Delafosse P, Molinié F, Woronoff A-S et al (2019) Estimations nationales de l'incidence et de la mortalité par cancer en France métropolitaine entre 1990 et 2018. Résultats préliminaires. Saint-Maurice (Fra): Santé publique France

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc: Ser B (Methodol) 39(1):1–22

  • Ding C, Li T, Peng W (2008) On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput Stat Data Anal 52(8):3913–3927

  • Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218

  • Ellis IO, Elston CW (2006) Histologic grade. In: Breast pathology. Elsevier, Amsterdam, pp 225–233

  • Fordyce JA, Gompert Z, Forister ML, Nice CC (2011) A hierarchical Bayesian approach to ecological count data: a flexible tool for ecologists. PLoS ONE 6(11):e26785

  • Hartigan JA (1975) Clustering algorithms. Wiley, Hoboken

  • Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. Adv Neural Inf Process Syst, pp 856–864

  • Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc, pp 289–296

  • Hornik K, Grün B (2011) topicmodels: an R package for fitting topic models. J Stat Softw 40(13):1–30

  • Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417

  • Lakhani SR (2012) WHO classification of tumours of the breast. International Agency for Research on Cancer

  • Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), vol 2. IEEE, pp 2169–2178

  • Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788

  • Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst, pp 556–562

  • Liu L, Tang L, Dong W, Yao S, Zhou W (2016) An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 5(1):1608

  • Mattei P-A, Bouveyron C, Latouche P (2016) Globally sparse probabilistic PCA. In: Artificial intelligence and statistics, pp 976–984

  • McLachlan G, Peel D (2000) Finite mixture models. Wiley Series in Probability and Statistics

  • Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc: Ser A (Gen) 135(3):370–384

  • Osborne J (2005) Notes on the use of data transformations. Pract Assess Res Eval 9(1):42–50

  • O'Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods Ecol Evol 1(2):118–122

  • R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  • Ramos J et al (2003) Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, vol 242, Piscataway, pp 133–142

  • Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850

  • Rau A, Celeux G, Martin-Magniette M-L, Maugis-Rabusseau C (2011) Clustering high-throughput sequencing data with Poisson mixture models. Research Report RR-7786, INRIA

  • Rigouste L, Cappé O, Yvon F (2007) Inference and evaluation of the multinomial mixture model for text clustering. Inf Process Manag 43(5):1260–1280

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Silvestre C, Cardoso MG, Figueiredo MA (2014) Identifying the number of clusters in discrete mixture models. arXiv preprint arXiv:1409.7419

  • Sorlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S et al (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100(14):8418–8423

  • St-Pierre AP, Shikon V, Schneider DC (2018) Count data in biology: data transformation or model reformation? Ecol Evol 8(6):3077–3085

  • Tipping ME, Bishop CM (1999a) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482

  • Tipping ME, Bishop CM (1999b) Probabilistic principal component analysis. J R Stat Soc: Ser B (Stat Methodol) 61(3):611–622

  • Wallach HM (2008) Structured topic models for language. PhD thesis, University of Cambridge

  • Watanabe K, Akaho S, Omachi S, Okada M (2010) Simultaneous clustering and dimensionality reduction using variational Bayesian mixture model. In: Classification as a tool for research. Springer, New York, pp 81–89

  • Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 30th conference on uncertainty in artificial intelligence

  • Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 267–273

  • Yu S, Yu K, Tresp V, Kriegel H-P (2005) A probabilistic clustering-projection model for discrete data. In: European conference on principles of data mining and knowledge discovery. Springer, New York, pp 417–428

  • Zwiener I, Frisch B, Binder H (2014) Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9(1):e85150


Acknowledgements

This work was supported by a DIM Math Innov grant from Région Ile-de-France. This work has also been supported by the French government through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002. We are thankful for the support from fédération F2PM, CNRS FR 2036, Paris. Finally, we would like to thank the anonymous reviewers for their helpful comments which contributed to improve the paper.

Author information

Corresponding author

Correspondence to Nicolas Jouvin.


Proofs

1.1 Constructing the meta-observations

Proof of Proposition 1

$$\begin{aligned} {{\,\mathrm{p}\,}}(X, \theta \mid Y, \, \beta )&= {{\,\mathrm{p}\,}}(\theta ) \times {{\,\mathrm{p}\,}}(X\mid \theta , Y) ,\\&= \prod _{q^\prime } {{\,\mathrm{p}\,}}(\theta _{q^\prime }) \times \prod _i \prod _q \prod _n {\mathcal {M}}_V(w_{in}, \, 1 , \,\beta \theta _q)^{Y_{iq}} , \\&= \prod _q {{\,\mathrm{p}\,}}(\theta _q) \prod _i \prod _v \prod _n (\beta _{v,\cdot } \theta _q)^{ Y_{iq} w_{inv}} ,\\&= \prod _q {{\,\mathrm{p}\,}}(\theta _q) \prod _v \prod _i (\beta _{v,\cdot } \theta _q)^{ Y_{iq} x_{iv}} ,\\&= \prod _q {{\,\mathrm{p}\,}}(\theta _q) \prod _v (\beta _{v,\cdot } \theta _q)^{\sum _i Y_{iq} x_{iv}} , \end{aligned}$$

since \(x_{iv} = \sum _n w_{inv}\). Then, put

$$\begin{aligned} \tilde{X}_q(Y) = \sum _{i=1}^N Y_{iq} x_{i}\, \end{aligned}$$

and this completes the proof of Proposition 1. \(\square \)
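
To make the aggregation concrete, here is a minimal R sketch of the meta-observation construction (toy dimensions and random inputs; all variable names are ours and purely illustrative, not taken from the MoMPCA package). Each row of \(\tilde{X}(Y)\) sums the count vectors of the documents currently assigned to the corresponding cluster, which amounts to a single matrix product.

  # Toy illustration: aggregate the rows of the document-term matrix X (N x V)
  # into Q meta-observations, one per cluster.
  set.seed(1)
  N <- 6; V <- 4; Q <- 2
  X <- matrix(rpois(N * V, lambda = 3), nrow = N, ncol = V)   # counts x_iv
  Y <- t(rmultinom(N, size = 1, prob = rep(1 / Q, Q)))        # one-hot memberships (N x Q)
  # Meta-observations: X_tilde[q, ] = sum_i Y[i, q] * X[i, ]
  X_tilde <- t(Y) %*% X                                       # Q x V matrix of aggregated counts
  X_tilde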

1.2 Derivation of the lower bound

Lower bound and Proposition 2

The bound of Eq. (14) follows from the standard derivation of the evidence lower bound in variational inference. Since the \(\log \) function is concave, Jensen's inequality yields:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(X, Y \mid \pi , \beta )&= \log \sum _Z \int _{\theta } {{\,\mathrm{p}\,}}(X, Y, \theta , Z \mid \pi , \beta ) \mathrm{d}\theta ,\\&= \log \sum _Z \int _{\theta } \frac{{{\,\mathrm{p}\,}}(X, Y, \theta , Z \mid \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z, \theta ) } {{\,\mathrm{\mathcal {R}}\,}}(Z, \theta ) \mathrm{d}\theta ,\\&= \log \left( {\mathbb {E}}_{{{\,\mathrm{\mathcal {R}}\,}}}\left[ \frac{{{\,\mathrm{p}\,}}(X, Y, Z, \theta \mid \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z,\theta )}\right] \right) \\&\ge {\mathbb {E}}_{{{\,\mathrm{\mathcal {R}}\,}}}\left[ \log \frac{{{\,\mathrm{p}\,}}(X, Y, Z, \theta \mid \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z,\theta )}\right] ,\\&:= {\mathcal {L}}({{\,\mathrm{\mathcal {R}}\,}}(\cdot ); \, \pi , \beta , Y) . \end{aligned}$$

Moreover, the difference between the classification log-likelihood and its bound is exactly the KL divergence between approximate posterior \({{\,\mathrm{\mathcal {R}}\,}}(\cdot )\) and the true one:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(X, Y \mid \pi , \beta ) - {\mathcal {L}}({{\,\mathrm{\mathcal {R}}\,}}(\cdot ); \, \pi , \beta , Y)&= - {\mathbb {E}}_{{{\,\mathrm{\mathcal {R}}\,}}}\left[ \log \frac{{{\,\mathrm{p}\,}}(Z, \theta \mid X, Y, \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z,\theta )}\right] . \end{aligned}$$

Furthermore, the complete expression given in Proposition 2 decomposes the bound into a sum of per-cluster LDA bounds plus a clustering term:

$$\begin{aligned} {\mathcal {L}}({{\,\mathrm{\mathcal {R}}\,}}(\cdot ); \, \pi , \beta , Y) = \sum _{q=1}^{Q} {\mathcal {J}}_{\text {LDA}}^{(q)}( {{\,\mathrm{\mathcal {R}}\,}};\, \beta , \tilde{X}_q(Y)) + \sum _{i=1}^{N} \sum _{q=1}^{Q} Y_{iq} \log (\pi _q) , \end{aligned}$$

where

$$\begin{aligned}&{\mathcal {J}}_{\text {LDA}}^{(q)}( {{\,\mathrm{\mathcal {R}}\,}};\, \beta , \tilde{X}_q(Y))\nonumber \\&\qquad = \log \varGamma (\textstyle \sum _{k=1}^{K} \alpha _k) - \sum _{k=1}^{K}\log \varGamma (\alpha _k) \nonumber \\&\qquad + \sum _{k=1}^{K} (\alpha _k - 1) (\psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql})) \nonumber \\&\qquad + \sum _{i=1}^N Y_{iq} \sum _{k=1}^K \sum _{n=1}^{L_i} \phi _{ink} \left[ \psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql}) + \sum _{v=1}^{V} w_{inv} \log (\beta _{vk})\right] \nonumber \\&\qquad - \log \varGamma (\textstyle \sum _{k=1}^{K} \gamma _{qk}) + \sum _{k=1}^{K}\log \varGamma (\gamma _{qk}) \nonumber \\&\qquad - \sum _{k=1}^{K} (\gamma _{qk} - 1) (\psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql})) \nonumber \\&\qquad - \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \sum _{k=1}^{K} \phi _{ink} \log (\phi _{ink}) . \end{aligned}$$
(17)

\(\square \)
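
For readers who wish to check Eq. (17) numerically, the following R sketch evaluates the per-cluster bound term by term with lgamma and digamma. It is only a sketch under simplifying assumptions: all documents share the same length L, and \(\alpha \), \(\beta \), \(\gamma \), \(\phi \) and Y are filled with random toy values rather than fitted quantities; all names are hypothetical.

  # Toy evaluation of the per-cluster LDA bound of Eq. (17),
  # assuming equal document lengths L to keep the indexing simple.
  set.seed(2)
  N <- 4; L <- 5; V <- 6; K <- 3; Q <- 2
  alpha <- rep(1, K)
  beta  <- matrix(rgamma(V * K, 1), V, K); beta <- sweep(beta, 2, colSums(beta), "/")
  Y     <- t(rmultinom(N, 1, rep(1 / Q, Q)))                   # N x Q memberships
  w     <- array(0, c(N, L, V))                                # one-hot word occurrences
  for (i in 1:N) for (n in 1:L) w[i, n, sample(V, 1)] <- 1
  phi   <- array(rgamma(N * L * K, 1), c(N, L, K))
  phi   <- sweep(phi, c(1, 2), apply(phi, c(1, 2), sum), "/")  # each phi[i, n, ] sums to 1
  gamma <- matrix(rgamma(Q * K, 2), Q, K)

  J_lda <- function(q) {
    dig <- digamma(gamma[q, ]) - digamma(sum(gamma[q, ]))
    # E[log p(theta_q | alpha)]
    out <- lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * dig)
    # E[log p(Z | theta_q)] + E[log p(W | Z, beta)], documents of cluster q only
    for (i in which(Y[, q] == 1)) for (n in 1:L)
      out <- out + sum(phi[i, n, ] * (dig + colSums(w[i, n, ] * log(beta))))
    # - E[log R(theta_q)]
    out <- out - lgamma(sum(gamma[q, ])) + sum(lgamma(gamma[q, ])) -
      sum((gamma[q, ] - 1) * dig)
    # - E[log R(Z)], documents of cluster q only
    for (i in which(Y[, q] == 1)) for (n in 1:L)
      out <- out - sum(phi[i, n, ] * log(phi[i, n, ]))
    out
  }
  sapply(1:Q, J_lda)   # one bound value per cluster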

1.3 Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(Z)\)

Proof of Proposition 3

A classical result of mean-field inference (see Blei et al. 2017) states that, at the optimum and with all the other distributions held fixed:

$$\begin{aligned} \log {{\,\mathrm{\mathcal {R}}\,}}(z_ {in})&= {\mathbb {E}}_{Z^{ \setminus i, n}, \theta } \left[ \log {{\,\mathrm{p}\,}}(X, Z, \theta \mid Y)\right] + {{\,\mathrm{const}\,}}, \end{aligned}$$

where the expectation is taken with respect to all \(Z\) except \(z_{in}\) and to all \(\theta \), assuming \((Z, \theta ) \sim {{\,\mathrm{\mathcal {R}}\,}}\). Expanding this expectation leads to:

$$\begin{aligned} \log {{\,\mathrm{\mathcal {R}}\,}}(z_ {in})&= \sum _{k=1}^{K} z_{ink} \left[ \sum _{v=1}^{V} w_{inv} \log (\beta _{vk}) + \psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql}) \right] + {{\,\mathrm{const}\,}}. \end{aligned}$$
(18)

Equation (18) characterizes the log density of a multinomial:

$$\begin{aligned} {{\,\mathrm{\mathcal {R}}\,}}(z_{in}) = {\mathcal {M}}_K(z_{in}; \, 1, \,\phi _{in} = (\phi _{in1}, \ldots , \phi _{inK})), \end{aligned}$$

where the quantity inside the brackets in Eq. (18) is the logarithm of the multinomial parameter, up to the normalizing constant. Hence,

$$\begin{aligned} \forall k, \quad \phi _{ink} \propto \left( \prod _{v=1}^V \beta _{vk}^{w_{inv}} \right) \, \prod _{q=1}^Q \exp \left\{ \psi (\gamma _{qk}) - \psi \left( \textstyle \sum _{l=1}^K \gamma _{ql}\right) \right\} ^{Y_{iq}} . \end{aligned}$$

\(\square \)
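
A short R sketch of the resulting update (toy values, hypothetical names), computed in log space and then normalized, for a single word occurrence n of a document i assigned to cluster q:

  # Toy update of R(z_in) following Proposition 3, for one word occurrence.
  set.seed(3)
  V <- 6; K <- 3; Q <- 2
  beta  <- matrix(rgamma(V * K, 1), V, K); beta <- sweep(beta, 2, colSums(beta), "/")
  gamma <- matrix(rgamma(Q * K, 2), Q, K)
  w_in  <- as.numeric(rmultinom(1, 1, rep(1 / V, V)))   # one-hot encoding of the word
  q     <- 1                                            # cluster of document i (Y_iq = 1)
  log_phi <- colSums(w_in * log(beta)) +
             digamma(gamma[q, ]) - digamma(sum(gamma[q, ]))
  phi_in  <- exp(log_phi - max(log_phi))                # subtract the max for numerical stability
  phi_in  <- phi_in / sum(phi_in)                       # K probabilities summing to 1
  phi_in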

1.4 Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\)

Proof of Proposition 4

With the same reasoning, the optimal form of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\) is:

$$\begin{aligned} \log {{\,\mathrm{\mathcal {R}}\,}}(\theta )&= {\mathbb {E}}_{Z}\left[ \log {{\,\mathrm{p}\,}}(X, Z, \theta \mid Y) \right] \, + \, {{\,\mathrm{const}\,}}\nonumber , \\&= \sum _{q=1}^{Q} \left[ \sum _{k=1}^{K} (\alpha _k - 1) \log (\theta _{qk}) + \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \sum _{k=1}^{K} \phi _{ink} \log (\theta _{qk}) \right] + \, {{\,\mathrm{const}\,}}, \nonumber \\&= \sum _{q=1}^{Q}\sum _{k=1}^{K} \left[ \alpha _k + \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \phi _{ink} - 1 \right] \log (\theta _{qk}) \, + \, {{\,\mathrm{const}\,}}. \end{aligned}$$
(19)

Once again, a specific functional form appears as the log of a product of Q independent Dirichlet densities. Then,

$$\begin{aligned} {{\,\mathrm{\mathcal {R}}\,}}(\theta ) = \prod _{q=1}^{Q} {{\,\mathrm{\mathcal {D}}\,}}_K\left( \theta _q; \, \gamma _q=(\gamma _{q1}, \ldots , \gamma _{qK})\right) , \end{aligned}$$

with the Dirichlet parameters read off from the brackets of Eq. (19) (the exponent of each \(\log (\theta _{qk})\) being \(\gamma _{qk} - 1\)):

$$\begin{aligned} \forall (q,k), \quad \gamma _{qk} = \alpha _k + \sum _{i=1}^{N} Y_{iq}\sum _{n=1}^{L_i} \phi _{ink} . \end{aligned}$$

\(\square \)
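
Correspondingly, a minimal R sketch of the update of Proposition 4 (toy values, hypothetical names): each variational Dirichlet parameter adds the prior \(\alpha \) to the \(\phi \) weights accumulated over the documents of the cluster.

  # Toy update of the Dirichlet parameters gamma of R(theta):
  # gamma[q, k] = alpha[k] + sum over documents i in cluster q and positions n of phi[i, n, k].
  set.seed(4)
  N <- 5; L <- 4; K <- 3; Q <- 2
  alpha <- rep(0.5, K)
  Y     <- t(rmultinom(N, 1, rep(1 / Q, Q)))            # one-hot memberships (N x Q)
  phi   <- array(rgamma(N * L * K, 1), c(N, L, K))
  phi   <- sweep(phi, c(1, 2), apply(phi, c(1, 2), sum), "/")
  phi_doc <- apply(phi, c(1, 3), sum)                   # N x K: phi summed over positions n
  gamma   <- sweep(t(Y) %*% phi_doc, 2, alpha, "+")     # Q x K matrix of Dirichlet parameters
  gamma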

1.5 Optimization of \(\beta \)

Proof of Proposition 5 (I)

This is a constrained maximization problem with K constraints \(\sum _{v=1}^{V} \beta _{vk} = 1\). Isolating the terms of Eq. (17) that depend on \(\beta \), and denoting the Lagrange multipliers by \((\lambda _k)_k\), the Lagrangian can be written:

$$\begin{aligned} f(\beta , \lambda ) =&\sum _{k=1}^{K} \sum _{q=1}^{Q} \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \sum _{v=1}^{V} \phi _{ink} w_{inv} \log (\beta _{vk}) + \sum _{k=1}^{K} \lambda _k \left( \textstyle \sum _{v=1}^{V} \beta _{vk} - 1\right) , \\ =&\sum _{k=1}^{K} \sum _{i=1}^{N} \sum _{n=1}^{L_i} \sum _{v=1}^{V} \phi _{ink} w_{inv} \log (\beta _{vk}) + \sum _{k=1}^{K} \lambda _k \left( \textstyle \sum _{v=1}^{V} \beta _{vk} - 1\right) , \end{aligned}$$

since \(\sum _{q=1}^{Q} Y_{iq} = 1\) for every document \(i\). Setting the derivative with respect to \(\beta _{vk}\) to 0 and solving for the \(\lambda _k\) leaves:

$$\begin{aligned} \beta _{vk} \propto \sum _{i=1}^{N} \sum _{n=1}^{L_i} \phi _{ink} \, w_{inv} . \end{aligned}$$

\(\square \)
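
The resulting M-step is a weighted word count followed by a column normalization. A small R sketch under toy values (hypothetical names; this is an illustration, not the package implementation):

  # Toy M-step for beta: beta[v, k] is proportional to sum_i sum_n phi[i, n, k] * w[i, n, v],
  # i.e. the expected count of word v under topic k, normalized over v.
  set.seed(5)
  N <- 5; L <- 4; V <- 6; K <- 3
  w   <- array(0, c(N, L, V))
  for (i in 1:N) for (n in 1:L) w[i, n, sample(V, 1)] <- 1
  phi <- array(rgamma(N * L * K, 1), c(N, L, K))
  phi <- sweep(phi, c(1, 2), apply(phi, c(1, 2), sum), "/")
  # Flatten the (i, n) index so that the expected counts reduce to one cross-product.
  W_mat   <- matrix(w,   ncol = V)                      # (N*L) x V
  Phi_mat <- matrix(phi, ncol = K)                      # (N*L) x K
  beta <- t(W_mat) %*% Phi_mat                          # V x K expected counts
  beta <- sweep(beta, 2, colSums(beta), "/")            # each column sums to 1
  colSums(beta)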

1.6 Optimization of \(\pi \)

Proof of Proposition 5 (II)

The bound depends on \(\pi \) only through its clustering term:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(Y \mid \pi ) = \sum _{i=1}^{N}\sum _{q=1}^{Q} Y_{iq} \log (\pi _q) . \end{aligned}$$

Once again, this is a constrained optimization problem and, introducing the Lagrange multiplier \(\lambda \) associated with the constraint \(\textstyle \sum _{q=1}^{Q} \pi _q = 1\), the Lagrangian writes:

$$\begin{aligned} \sum _{q=1}^{Q} \sum _{i=1}^{N} Y_{iq} \log (\pi _q) + \lambda (\textstyle \sum _{q=1}^{Q} \pi _q - 1) . \end{aligned}$$

Setting the derivative with respect to \(\pi _q\) to 0, we get:

$$\begin{aligned} \pi _q = \frac{\sum _{i=1}^{N} Y_{iq}}{N} . \end{aligned}$$

\(\square \)
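
In R, this M-step reduces to the empirical cluster proportions; a tiny sketch with toy values (hypothetical names):

  # Toy M-step for pi: the proportions are the column means of the membership matrix Y.
  set.seed(6)
  N <- 10; Q <- 3
  Y  <- t(rmultinom(N, 1, c(0.5, 0.3, 0.2)))
  pi_hat <- colSums(Y) / N                              # equivalently colMeans(Y)
  pi_hat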

1.7 Model selection

Proof of Proposition 6

Assume that the parameters \((\pi , \beta )\) follow a prior distribution that factorizes as follows:

$$\begin{aligned} {{\,\mathrm{p}\,}}(\pi , \beta \mid Q, K) = {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \, {{\,\mathrm{p}\,}}(\beta \mid K), \end{aligned}$$
(20)

where

$$\begin{aligned} {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) = {\mathcal {D}}_Q(\pi ; \, \eta {\mathbf {1}}_Q) . \end{aligned}$$
(21)

Then, the classification log-likelihood is written:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(X, Y\mid Q, K)= & {} \log \int _{\pi } \int _{\beta }{{\,\mathrm{p}\,}}(X,Y, \beta , \pi \mid Q, K) \, \mathrm{d}\pi \, \mathrm{d}\beta \nonumber \\= & {} \log \int _{\pi } \int _{\beta }{{\,\mathrm{p}\,}}(X,Y \mid \beta , \pi , \, Q, K) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \, {{\,\mathrm{p}\,}}(\beta \mid K) \, \mathrm{d}\pi \, \mathrm{d}\beta \nonumber \\= & {} \log \int _{\pi } {{\,\mathrm{p}\,}}(Y \mid \pi ) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \mathrm{d}\pi \, \int _{\beta }{{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) {{\,\mathrm{p}\,}}(\beta \mid K) \mathrm{d}\beta \nonumber \\= & {} \log \int _{\pi } {{\,\mathrm{p}\,}}(Y \mid \pi ) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \mathrm{d}\pi \nonumber \\&+ \log \int _{\beta }{{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) {{\,\mathrm{p}\,}}(\beta \mid K) \mathrm{d}\beta . \end{aligned}$$
(22)

The first term in Eq. (22) can be computed exactly by Dirichlet-multinomial conjugacy. Setting \(\eta =\frac{1}{2}\) and applying a Stirling approximation to the Gamma function, as in Daudin et al. (2008), leads to:

$$\begin{aligned} \log \int _{\pi } {{\,\mathrm{p}\,}}(Y \mid \pi ) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \mathrm{d}\pi \approx \max \limits _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q) - \frac{Q-1}{2} \log (D) . \end{aligned}$$
(23)

As for the second term, a BIC-like approximation as in Bouveyron et al. (2018) gives:

$$\begin{aligned} \log \int _{\beta }{{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) {{\,\mathrm{p}\,}}(\beta \mid K) \mathrm{d}\beta \approx \max \limits _{\beta } \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) - \frac{K (V-1)}{2} \log (Q). \end{aligned}$$

In practice, \( \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) \) remains intractable; we therefore replace it by its variational approximation after convergence of the VEM algorithm, \({\mathcal {J}}^\star _{\text {LDA}}\), the sum over meta-observations of the individual LDA bounds detailed in Eq. (17) (not to be confused with \({\mathcal {L}}\)). This yields the following criterion:

$$\begin{aligned} {{\,\mathrm{ICL}\,}}(Q, K, Y, X)= & {} {\mathcal {J}}^\star _{\text {LDA}}({{\,\mathrm{\mathcal {R}}\,}}; \, \beta , Y) - \frac{K (V-1)}{2} \log (Q) \nonumber \\&+ \max \limits _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q) - \frac{Q-1}{2} \log (D) . \end{aligned}$$
(24)

Note that:

$$\begin{aligned} \max \limits _{\beta } \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) + \max \limits _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q) \approx {\mathcal {L}}^\star , \end{aligned}$$

i.e. the value of the bound at convergence of Algorithm 1. \(\square \)
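
As an illustration, a small R sketch assembling the criterion of Eq. (24) from its four terms. This is only a sketch: \({\mathcal {J}}^\star _{\text {LDA}}\) is replaced by a placeholder number rather than the value returned by an actual fit, and \(D\) is assumed here to denote the number of observations.

  # Toy assembly of the ICL criterion of Eq. (24).
  set.seed(7)
  N <- 50; Q <- 3; K <- 4; V <- 200
  Y <- t(rmultinom(N, 1, rep(1 / Q, Q)))                  # one-hot memberships
  J_lda_star <- -1234.5                                   # placeholder for the converged bound
  Nq <- colSums(Y)
  max_log_pY <- sum(ifelse(Nq > 0, Nq * log(Nq / N), 0))  # max over pi of log p(Y | pi, Q)
  D <- N                                                  # assumption: D = number of documents
  icl <- J_lda_star - K * (V - 1) / 2 * log(Q) +
         max_log_pY - (Q - 1) / 2 * log(D)
  icl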

About this article


Cite this article

Jouvin, N., Latouche, P., Bouveyron, C. et al. Greedy clustering of count data through a mixture of multinomial PCA. Comput Stat 36, 1–33 (2021). https://doi.org/10.1007/s00180-020-01008-9

