Bayesian mean-parameterized nonnegative binary matrix factorization

Abstract

Binary data matrices can represent many types of data, such as social networks, votes, or gene expression. In some cases, the analysis of binary matrices can be tackled with nonnegative matrix factorization (NMF), where the observed data matrix is approximated by the product of two smaller nonnegative matrices. In this context, probabilistic NMF assumes a generative model in which the data are usually Bernoulli-distributed. Often, a link function is used to map the factorization onto the [0, 1] range, ensuring a valid Bernoulli mean parameter. However, link functions have the potential disadvantage of leading to uninterpretable models. Mean-parameterized NMF, on the contrary, overcomes this problem. We propose a unified framework for Bayesian mean-parameterized nonnegative binary matrix factorization models (NBMF). We analyze three models which correspond to three possible constraints that respect the mean-parameterization without the need for link functions. Furthermore, we derive a novel collapsed Gibbs sampler and a collapsed variational algorithm to infer the posterior distribution of the factors. Next, we extend the proposed models to a nonparametric setting where the number of latent dimensions used is automatically driven by the observed data. We analyze the performance of our NBMF methods on multiple datasets for different tasks such as dictionary learning and prediction of missing data. Experiments show that our methods provide results similar or superior to the state of the art, while automatically detecting the number of relevant components.


Notes

  1.

    Distributions used throughout the article are formally defined in “Appendix A”.

  2.

    https://github.com/alumbreras/NBMF.

  3.

    Some readers may be more accustomed to the alternative notation where the “one-hot” variable \(\varvec{\mathbf {z}}_{fn}\) is replaced by an integer-valued index \(z_{fn} \in \{1,\ldots ,K \}\). In this case, the Bernoulli parameter in Eq. (18) becomes \(h_{z_{fn}n}\).

  4.

    Many thanks to Xi’an (Christian Robert) for giving us the trick via StackExchange.



Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 681839, project FACTORY).

Author information

Corresponding author

Correspondence to Alberto Lumbreras.

Additional information


Responsible editor: Pauli Miettinen.

Appendices

A Probability distributions

A.1 Bernoulli distribution

Distribution over a binary variable \(x \in \{0,1\}\), with mean parameter \(\mu \in [0,1]\):

$$\begin{aligned} \text {Bernoulli}(x | \mu )&= \mu ^x(1-\mu )^{1-x}. \end{aligned}$$
(68)

A.2 Beta distribution

Distribution over a continuous variable \(x \in [0,1]\), with shape parameters \(a >0\), \(b>0\):

$$\begin{aligned} \text {Beta}(x | a,b)&= \frac{\Gamma (a+b)}{\Gamma (a)\Gamma (b)}x^{a-1}(1-x)^{b-1}. \end{aligned}$$
(69)

A.3 Gamma distribution

Distribution for a continuous variable \(x >0\), with shape parameter \(a >0\) and rate parameter \(b>0\):

$$\begin{aligned} \text {Gamma}(x | a,b)&= \frac{b^a}{\Gamma (a)}x^{a-1} e^{-bx}. \end{aligned}$$
(70)

A.4 Dirichlet distribution

Distribution for K continuous variables \(x_{k} \in [0,1]\) such that \(\sum _k x_k = 1\). Governed by K shape parameters \(\alpha _1, \ldots , \alpha _K\) such that \(\alpha _k>0\):

$$\begin{aligned} \text {Dirichlet}(\varvec{\mathbf {x}} | \varvec{\mathbf {\alpha }})&= \frac{\Gamma (\sum _k \alpha _k)}{\prod _k \Gamma (\alpha _k)}\prod _k x_k^{\alpha _k-1}. \end{aligned}$$
(71)

A.5 Discrete distribution

Distribution for the discrete variable \( {\mathbf {x}} \in \{ {\mathbf {e}} _{1}, \ldots , {\mathbf {e}} _{K} \}\), where \( {\mathbf {e}} _{i}\) is the \(i^{th}\) canonical vector. Governed by the discrete probabilities \(\mu _1, \ldots , \mu _K\) such that \(\mu _{k} \in [0,1]\) and \(\sum _k \mu _k = 1\):

$$\begin{aligned} p( {\mathbf {x}} = {\mathbf {e}} _{k}) = \mu _{k} \end{aligned}$$
(72)

The probability mass function can be written as:

$$\begin{aligned} \text {Discrete}(\varvec{\mathbf {x}} | \varvec{\mathbf {\mu }})&= \prod _k \mu _k^{x_k}. \end{aligned}$$
(73)

We may write \(\text {Discrete}(\varvec{\mathbf {x}} | \varvec{\mathbf {\mu }}) = \text {Multinomial}(\varvec{\mathbf {x}} | 1, \varvec{\mathbf {\mu }})\).

A.6 Multinomial distribution

Distribution for an integer-valued vector \(\varvec{\mathbf {x}}=[x_1,\ldots ,x_K]^T \in {\mathbb {N}}^K\). Governed by the total number \(L = \sum _k x_k\) of events assigned to K bins and the probabilities \(\mu _k\) of being assigned to bin k:

$$\begin{aligned} \text {Multinomial}(\varvec{\mathbf {x}} | L, \varvec{\mathbf {\mu }})&= \frac{L!}{x_1!\ldots x_K!}\prod _k \mu _k^{x_k}. \end{aligned}$$
(74)
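The densities above can be checked numerically against their SciPy counterparts; the sketch below is illustrative (the variable names are ours) and only uses standard `scipy.stats` calls.

```python
# Numerical sanity check of the densities in Appendix A against scipy.stats.
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Bernoulli(x | mu) = mu^x (1 - mu)^(1-x), Eq. (68)
mu = 0.3
assert np.isclose(mu**1 * (1 - mu)**0, stats.bernoulli.pmf(1, mu))

# Beta(x | a, b), Eq. (69)
a, b, x = 2.0, 5.0, 0.4
beta_pdf = np.exp(gammaln(a + b) - gammaln(a) - gammaln(b)) * x**(a - 1) * (1 - x)**(b - 1)
assert np.isclose(beta_pdf, stats.beta.pdf(x, a, b))

# Dirichlet(x | alpha) for x on the simplex, Eq. (71)
alpha = np.array([1.0, 2.0, 3.0])
xs = np.array([0.2, 0.3, 0.5])
dir_pdf = np.exp(gammaln(alpha.sum()) - gammaln(alpha).sum()) * np.prod(xs**(alpha - 1))
assert np.isclose(dir_pdf, stats.dirichlet.pdf(xs, alpha))

# Discrete(x | mu) = Multinomial(x | 1, mu) for a one-hot x, Eqs. (72)-(74)
mus = np.array([0.1, 0.6, 0.3])
e2 = np.array([0, 1, 0])
assert np.isclose(np.prod(mus**e2), stats.multinomial.pmf(e2, 1, mus))
```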

B Derivations for the Beta-Dir model

B.1 Marginalizing out \(\varvec{\mathbf {W}}\) and \(\varvec{\mathbf {H}}\) from the joint likelihood

We seek to compute the marginal joint probability introduced in Eq. (22) and given by:

$$\begin{aligned} p(\mathbf {V}, \mathbf {Z}) = \prod _f \overbrace{ \int p(\varvec{\mathbf {w}}_{f}) \prod _{n} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {w}}_f) \,d\mathbf {w}_f }^{p(\underline{\varvec{\mathbf {Z}}}_{f}) } \prod _n \overbrace{ \int \prod _{k} p(h_{kn}) \prod _{f} p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) \, d {\mathbf {h}} _{n} }^{p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n) }. \end{aligned}$$

Using the expression of the normalization constant of the Dirichlet distribution, the first integral can be computed as follows:

$$\begin{aligned} p(\underline{\varvec{\mathbf {Z}}}_{f})&= \int p(\varvec{\mathbf {w}}_{f}) \prod _{n} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {w}}_f) \,d\mathbf {w}_{f}\end{aligned}$$
(75)
$$\begin{aligned}&= \int \frac{\Gamma (\sum _k \gamma _k)}{\prod _k \Gamma (\gamma _k)}\prod _k w_{fk}^{\gamma _k-1}\prod _n w_{fk}^{z_{fkn}} \,d\mathbf {w}_{f}\end{aligned}$$
(76)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \gamma _k)}{\prod _k \Gamma (\gamma _k)} \int \prod _k w_{fk}^{\gamma _k + L_{fk}-1} \,d\mathbf {w}_{f} \end{aligned}$$
(77)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \gamma _k)}{\prod _k \Gamma (\gamma _k)} \frac{\prod _k \Gamma (\gamma _k + L_{fk})}{\Gamma \left( \sum _k (\gamma _k + L_{fk})\right) }. \end{aligned}$$
(78)
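This Dirichlet-multinomial marginal is best evaluated in the log domain with `gammaln`; the helper below is an illustrative sketch (the function name and the count vector `L_f`, with `L_f[k]` \(=L_{fk}\), are ours, not the authors' code).

```python
# Log of the Dirichlet-multinomial marginal p(Z_f) of Eq. (78).
# L_f[k] = L_{fk} is the number of columns n with z_{fn} = e_k; gamma holds gamma_k.
import numpy as np
from scipy.special import gammaln

def log_marginal_row(L_f, gamma):
    return (gammaln(gamma.sum()) - gammaln(gamma).sum()
            + gammaln(gamma + L_f).sum()
            - gammaln((gamma + L_f).sum()))

# With gamma = (1, 1) and counts L_f = (1, 2) over N = 3 columns:
# p(Z_f) = Gamma(2) * Gamma(2)Gamma(3) / Gamma(5) = 2/24 = 1/12.
lp = log_marginal_row(np.array([1.0, 2.0]), np.ones(2))
```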

The second integral in Eq. (22) is computed as follows. In Eq. (80) we use that \(p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) = \text {Bernoulli}(v_{fn} | \prod _k h_{kn}^{z_{fkn}}) = \prod _k \text {Bernoulli}(v_{fn}|h_{kn})^{z_{fkn}}\) (recall that \(\mathbf {z}_{fn}\) is an indicator vector). In Eq. (83), we use the expression of the normalization constant of the Beta distribution.

$$\begin{aligned} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)&= \int \prod _k p(h_{kn}) \prod _{f} p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) \, {d \mathbf {h}_n} \end{aligned}$$
(79)
$$\begin{aligned}&= \int \prod _k \left[ \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} h_{kn}^{\alpha _k-1}(1-h_{kn})^{\beta _k-1} \right] \prod _{fk} \left[ h_{kn}^{v_{fn}}(1-h_{kn})^{1-v_{fn}}\right] ^{z_{fkn}} {d \mathbf {h}_n} \end{aligned}$$
(80)
$$\begin{aligned}&= \prod _k \int \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} h_{kn}^{\alpha _k-1}(1-h_{kn})^{\beta _k-1} \prod _{f} \left[ h_{kn}^{v_{fn}}(1-h_{kn})^{1-v_{fn}}\right] ^{z_{fkn}} dh_{kn} \end{aligned}$$
(81)
$$\begin{aligned}&= \prod _k \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} \int h_{kn}^{\alpha _k + A_{kn} -1}(1-h_{kn})^{\beta _k + B_{kn}-1} dh_{kn} \end{aligned}$$
(82)
$$\begin{aligned}&= \prod _k \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} \frac{\Gamma (\alpha _k + A_{kn})\Gamma (\beta _k + B_{kn})}{\Gamma (\alpha _k + \beta _k + M_{kn})}. \end{aligned}$$
(83)
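Likewise, Eq. (83) can be evaluated stably in the log domain; the sketch below is illustrative (the helper name is ours), with `A[k]` \(=A_{kn}\) the number of ones and `B[k]` \(=B_{kn}\) the number of zeros assigned to component k in column n.

```python
# Log of the closed-form marginal p(v_n | Z_n) of Eq. (83).
# alpha, beta hold the Beta shape parameters alpha_k, beta_k; M_kn = A_kn + B_kn.
import numpy as np
from scipy.special import gammaln

def log_marginal_column(A, B, alpha, beta):
    M = A + B
    return np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                  + gammaln(alpha + A) + gammaln(beta + B)
                  - gammaln(alpha + beta + M))

# With alpha = beta = 1 each factor reduces to A_k! B_k! / (M_k + 1)!:
# here (2! 1! / 4!) * (0! 3! / 4!) = (2/24) * (6/24) = 1/48.
lp = log_marginal_column(np.array([2.0, 0.0]), np.array([1.0, 3.0]),
                         np.ones(2), np.ones(2))
```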

B.2 Conditional prior and posterior distributions of \(\varvec{\mathbf {z}}_{fn}\)

Applying Bayes’ rule, the conditional posterior of \(\varvec{\mathbf {z}}_{fn}\) is given by:

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}}) \propto p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}})p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}). \end{aligned}$$
(84)

The likelihood itself decomposes as \(p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}}) = \prod _{n} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\) and we may ignore the terms that do not depend on \(\varvec{\mathbf {z}}_{fn}\). Using Eq. (24) and the identity \(\Gamma (n + b) = \Gamma (n)n^b\) where b is a binary variable, we may write:

$$\begin{aligned} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)&= \prod _{k} \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} \frac{\Gamma (\alpha _k + A_{kn}) \Gamma (\beta _k + B_{kn}) }{\Gamma (\alpha _k + \beta _k + M_{kn})} \end{aligned}$$
(85)
$$\begin{aligned}&\propto \prod _{k} \frac{\Gamma (\alpha _k + A_{kn}) \Gamma (\beta _k + B_{kn}) }{\Gamma (\alpha _k + \beta _k + M_{kn})} \end{aligned}$$
(86)
$$\begin{aligned}&= \prod _{k} \frac{\Gamma (\alpha _k + A_{kn}^{\lnot fn} + z_{fkn}v_{fn}) \Gamma (\beta _k + B_{kn}^{\lnot fn} + z_{fkn}\bar{v}_{fn}) }{\Gamma (\alpha _k + \beta _k + M_{kn}^{\lnot fn} + z_{fkn})} \end{aligned}$$
(87)
$$\begin{aligned}&\propto \prod _k \frac{ \Gamma (\alpha _k + A_{kn}^{\lnot fn}) (\alpha _k + A_{kn}^{\lnot fn})^{z_{fkn}v_{fn}} \Gamma (\beta _k + B_{kn}^{\lnot fn}) (\beta _k + B_{kn}^{\lnot fn})^{z_{fkn}\bar{v}_{fn}} }{ \Gamma (\alpha _k + \beta _k + M_{kn}^{\lnot fn}) (\alpha _k + \beta _k + M_{kn}^{\lnot fn})^{z_{fkn}} } \end{aligned}$$
(88)
$$\begin{aligned}&\propto \prod _k \left[ \frac{ (\alpha _k + A_{kn}^{\lnot fn})^{v_{fn}} (\beta _k + B_{kn}^{\lnot fn})^{\bar{v}_{fn}} }{ (\alpha _k + \beta _k + M_{kn}^{\lnot fn}) } \right] ^{z_{fkn}}. \end{aligned}$$
(89)

The conditional prior term is given by

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}) = p(\varvec{\mathbf {Z}})/p(\varvec{\mathbf {Z}}_{\lnot fn}). \end{aligned}$$
(90)

Using \(p(\varvec{\mathbf {Z}}) = \prod _{f} p(\underline{\varvec{\mathbf {Z}}}_{f})\) and Eq. (23) we have

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn})&\propto p(\underline{\varvec{\mathbf {Z}}}_f) \end{aligned}$$
(91)
$$\begin{aligned}&\propto \prod _k\Gamma (\gamma _k + L_{fk}^{\lnot fn} + z_{fkn}) \end{aligned}$$
(92)
$$\begin{aligned}&= \prod _k \Gamma (\gamma _k + L_{fk}^{\lnot fn}) (\gamma _k + L_{fk}^{\lnot fn})^{z_{fkn}} \end{aligned}$$
(93)
$$\begin{aligned}&\propto \prod _k (\gamma _k + L_{fk}^{\lnot fn})^{z_{fkn}}. \end{aligned}$$
(94)

Using \(\sum _{k} p(\varvec{\mathbf {z}}_{fn} = {\mathbf {e}} _{k} | \varvec{\mathbf {Z}}_{\lnot fn}) =1\), a simple closed-form expression of \(p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn})\) is obtained as follows:

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} = {\mathbf {e}} _{k} | \varvec{\mathbf {Z}}_{\lnot fn})&= \frac{\gamma _k + L_{fk}^{\lnot fn}}{\sum _{k} (\gamma _k + L_{fk}^{\lnot fn})} \end{aligned}$$
(95)
$$\begin{aligned}&= \frac{\gamma _k + L_{fk}^{\lnot fn}}{\sum _{k} \gamma _k + N-1}. \end{aligned}$$
(96)

Combining Eqs. (84), (89) and (94), we obtain

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn},\varvec{\mathbf {V}}) \propto \prod _k \left[ (\gamma _k + L_{fk}^{\lnot fn} ) \frac{(\alpha _k + A_{kn}^{\lnot fn})^{v_{fn}} (\beta _k + B_{kn}^{\lnot fn})^{\bar{v}_{fn}} }{\alpha _k + \beta _k + M_{kn}^{\lnot fn}} \right] ^{z_{fkn}}. \end{aligned}$$
(97)
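Eq. (97) translates directly into a collapsed Gibbs update. The sketch below is a minimal illustration, not the authors' reference code: it assumes the counts \(L_{fk}\), \(A_{kn}\), \(B_{kn}\), \(M_{kn}\) are maintained incrementally in arrays `L` (F x K) and `A`, `B`, `M` (K x N), stores \(\varvec{\mathbf {z}}_{fn}\) as an integer index as in footnote 3, and takes scalar (or length-K) hyperparameters.

```python
# One collapsed Gibbs update of z_{fn}, following Eq. (97).
import numpy as np

def sample_zfn(rng, v_fn, f, n, z, L, A, B, M, gamma, alpha, beta):
    k_old = z[f, n]
    # Remove the current assignment to obtain the "not fn" statistics.
    L[f, k_old] -= 1
    A[k_old, n] -= v_fn
    B[k_old, n] -= 1 - v_fn
    M[k_old, n] -= 1
    # Unnormalized conditional posterior of Eq. (97).
    if v_fn == 1:
        lik = (alpha + A[:, n]) / (alpha + beta + M[:, n])
    else:
        lik = (beta + B[:, n]) / (alpha + beta + M[:, n])
    p = (gamma + L[f, :]) * lik
    k_new = rng.choice(len(p), p=p / p.sum())
    # Add the new assignment back into the counts.
    L[f, k_new] += 1
    A[k_new, n] += v_fn
    B[k_new, n] += 1 - v_fn
    M[k_new, n] += 1
    z[f, n] = k_new
    return k_new
```

A full sweep simply loops this update over all (f, n); the count arrays stay consistent because each update removes and re-adds exactly one assignment.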

C Alternative Gibbs sampler for the Dir-Dir model

In this appendix, we show how to derive an alternative Gibbs sampler based on a single augmentation, as in the Beta-Dir model. This is a conceptually interesting result, though it does not lead to an efficient implementation. Like the Beta-Dir model, the Dir-Dir model can be augmented using the indicator variables \(\varvec{\mathbf {z}}_{fn}\), as follows:

$$\begin{aligned} \varvec{\mathbf {h}}_{n}&\sim \text {Dirichlet}(\varvec{\mathbf {\eta }}) \end{aligned}$$
(98)
$$\begin{aligned} \varvec{\mathbf {w}}_{f}&\sim \text {Dirichlet}(\varvec{\mathbf {\gamma }}) \end{aligned}$$
(99)
$$\begin{aligned} \varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {w}}_f&\sim \text {Discrete}(\varvec{\mathbf {w}}_f) \end{aligned}$$
(100)
$$\begin{aligned} v_{fn} | \varvec{\mathbf {h}}_{n}, \varvec{\mathbf {z}}_{fn}&\sim \text {Bernoulli}\left( \prod _k h_{kn}^{z_{fkn}}\right) \end{aligned}$$
(101)

Note that, compared to Eqs. (15)–(18), only the prior on \(\varvec{\mathbf {h}}_{n}\) is changed. As in Beta-Dir, we seek to derive a Gibbs sampler from the conditional probabilities \(p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}})\) given by

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}}) \propto p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}})p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}). \end{aligned}$$
(102)

The conditional prior term is identical to that of Beta-Dir and given by

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}) \propto \prod _k (\gamma _k + L_{fk}^{\lnot fn})^{z_{fkn}}. \end{aligned}$$
(103)

As in Beta-Dir, the likelihood term factorizes as \(p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}}) = \prod _{n} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\), and we now derive the expression of \(p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\). Compared to Beta-Dir, a major source of difficulty is that \(p( {\mathbf {h}} _n)\) no longer fully factorizes, because of the Dirichlet assumption (in particular, \(\sum _k h_{kn}=1\)). In the following, we use the binomial theorem to obtain Eq. (107) (see footnote 4), and we use the expression of the normalization constant of the Dirichlet distribution to obtain Eq. (110):

$$\begin{aligned} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)&= \int p(\varvec{\mathbf {h}}_n) \prod _{f} p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) \,d\mathbf {h}_n\end{aligned}$$
(104)
$$\begin{aligned}&=\int \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \prod _k h_{kn}^{\eta _k-1} \prod _{f}\prod _k \left[ h_{kn}^{v_{fn}} (1-h_{kn})^{1-v_{fn}}\right] ^{z_{fkn}} \,d\mathbf {h}_n\end{aligned}$$
(105)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \int \prod _k h_{kn}^{\eta _k + A_{kn}-1} (1-h_{kn})^{B_{kn}} \,d\mathbf {h}_n \end{aligned}$$
(106)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \int \prod _k h_{kn}^{\eta _k + A_{kn}-1} \sum _{j_k=0}^{B_{kn}} \left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) (-h_{kn})^{j_k} \,d\mathbf {h}_n \end{aligned}$$
(107)
$$\begin{aligned}&=\frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \int \sum _{j_1=0}^{B_{1n}} ... \sum _{j_K=0}^{B_{Kn}} \prod _k h_{kn}^{\eta _k + A_{kn}-1} \left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) (-h_{kn})^{j_k} \,d\mathbf {h}_n \end{aligned}$$
(108)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \sum _{j_1=0}^{B_{1n}} ... \sum _{j_K=0}^{B_{Kn}} \prod _k (-1)^{j_k}\left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) \int \prod _k h_{kn}^{\eta _k + A_{kn} + j_k -1} \,d\mathbf {h}_n\end{aligned}$$
(109)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \sum _{j_1=0}^{B_{1n}} ... \sum _{j_K=0}^{B_{Kn}} \frac{\prod _k (-1)^{j_k} \left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) \Gamma (\eta _k + A_{kn} + j_k)}{\Gamma \left( \sum _k (\eta _k + A_{kn} + j_k)\right) }. \end{aligned}$$
(110)

We conclude that, though available in closed form, the expression of \(p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\) (and thus of \(p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}})\)) involves a sum of \(\prod _{k=1}^K (B_{kn}+1)\) terms with binomial coefficients, which is impractical in typical problem dimensions.
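To see the scale of the problem: each index \(j_k\) in the nested sums of Eq. (110) ranges over \(\{0,\ldots ,B_{kn}\}\), so the sum has \(\prod _k (B_{kn}+1)\) summands. The numbers below are purely illustrative.

```python
import math

# Number of summands in Eq. (110): each j_k ranges over {0, ..., B_kn},
# so the nested sums contain prod_k (B_kn + 1) terms.
K, B_kn = 10, 20                       # e.g. 10 components, 20 zeros per component
n_terms = math.prod([B_kn + 1] * K)    # 21**10, roughly 1.7e13 summands
```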


About this article


Cite this article

Lumbreras, A., Filstroff, L. & Févotte, C. Bayesian mean-parameterized nonnegative binary matrix factorization. Data Min Knowl Disc 34, 1898–1935 (2020). https://doi.org/10.1007/s10618-020-00712-w


Keywords

  • Matrix factorization
  • Latent variable models
  • Bayesian inference
  • Binary data