
Bayesian mean-parameterized nonnegative binary matrix factorization


Abstract

Binary data matrices can represent many types of data such as social networks, votes, or gene expression. In some cases, the analysis of binary matrices can be tackled with nonnegative matrix factorization (NMF), where the observed data matrix is approximated by the product of two smaller nonnegative matrices. In this context, probabilistic NMF assumes a generative model where the data is usually Bernoulli-distributed. Often, a link function is used to map the factorization to the [0, 1] range, ensuring a valid Bernoulli mean parameter. However, link functions have the potential disadvantage of leading to uninterpretable models. Mean-parameterized NMF, in contrast, overcomes this problem. We propose a unified framework for Bayesian mean-parameterized nonnegative binary matrix factorization (NBMF) models. We analyze three models, which correspond to three possible constraints that respect the mean-parameterization without the need for link functions. Furthermore, we derive a novel collapsed Gibbs sampler and a collapsed variational algorithm to infer the posterior distribution of the factors. Next, we extend the proposed models to a nonparametric setting where the number of latent dimensions used is automatically driven by the observed data. We analyze the performance of our NBMF methods on multiple datasets for different tasks such as dictionary learning and prediction of missing data. Experiments show that our methods provide results similar or superior to the state of the art, while automatically detecting the number of relevant components.


Notes

  1. Distributions used throughout the article are formally defined in “Appendix A”.

  2. https://github.com/alumbreras/NBMF.

  3. Some readers may be more accustomed to the alternative notation where the “one-hot” variable \(\varvec{\mathbf {z}}_{fn}\) is replaced by an integer-valued index \(z_{fn} \in \{1,\ldots ,K \}\). In this case, the Bernoulli parameter in Eq. (18) becomes \(h_{z_{fn}n}\).

  4. Many thanks to Xi’an (Christian Robert) for giving us the trick via StackExchange.


Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 681839 (project FACTORY).

Author information

Correspondence to Alberto Lumbreras.

Additional information

Responsible editor: Pauli Miettinen.


Appendices

A Probability distributions

A.1 Bernoulli distribution

Distribution over a binary variable \(x \in \{0,1\}\), with mean parameter \(\mu \in [0,1]\):

$$\begin{aligned} \text {Bernoulli}(x | \mu )&= \mu ^x(1-\mu )^{1-x}. \end{aligned}$$
(68)

A.2 Beta distribution

Distribution over a continuous variable \(x \in [0,1]\), with shape parameters \(a >0\), \(b>0\):

$$\begin{aligned} \text {Beta}(x | a,b)&= \frac{\Gamma (a+b)}{\Gamma (a)\Gamma (b)}x^{a-1}(1-x)^{b-1}. \end{aligned}$$
(69)

A.3 Gamma distribution

Distribution for a continuous variable \(x >0\), with shape parameter \(a >0\) and rate parameter \(b>0\):

$$\begin{aligned} \text {Gamma}(x | a,b)&= \frac{b^a}{\Gamma (a)}x^{a-1} e^{-bx}. \end{aligned}$$
(70)

A.4 Dirichlet distribution

Distribution for K continuous variables \(x_{k} \in [0,1]\) such that \(\sum _k x_k = 1\). Governed by K shape parameters \(\alpha _1, \ldots , \alpha _K\) such that \(\alpha _k>0\):

$$\begin{aligned} \text {Dirichlet}(\varvec{\mathbf {x}} | \varvec{\mathbf {\alpha }})&= \frac{\Gamma (\sum _k \alpha _k)}{\prod _k \Gamma (\alpha _k)}\prod _k x_k^{\alpha _k-1}. \end{aligned}$$
(71)

A.5 Discrete distribution

Distribution for the discrete variable \( {\mathbf {x}} \in \{ {\mathbf {e}} _{1}, \ldots , {\mathbf {e}} _{K} \}\), where \( {\mathbf {e}} _{i}\) is the \(i^{th}\) canonical vector. Governed by the discrete probabilities \(\mu _1,...,\mu _K\) such that \(\mu _{k} \in [0,1]\) and \(\sum _k \mu _k = 1\):

$$\begin{aligned} p( {\mathbf {x}} = {\mathbf {e}} _{k}) = \mu _{k} \end{aligned}$$
(72)

The probability mass function can be written as:

$$\begin{aligned} \text {Discrete}(\varvec{\mathbf {x}} | \varvec{\mathbf {\mu }})&= \prod _k \mu _k^{x_k}. \end{aligned}$$
(73)

We may write \(\text {Discrete}(\varvec{\mathbf {x}} | \varvec{\mathbf {\mu }}) = \text {Multinomial}(\varvec{\mathbf {x}} | 1, \varvec{\mathbf {\mu }})\).

A.6 Multinomial distribution

Distribution for an integer-valued vector \(\varvec{\mathbf {x}}=[x_1,...,x_K]^T \in {\mathbb {N}}^K\). Governed by the total number \(L = \sum _k x_k\) of events assigned to K bins and the probabilities \(\mu _k\) of being assigned to bin k:

$$\begin{aligned} \text {Multinomial}(\varvec{\mathbf {x}} | L, \varvec{\mathbf {\mu }})&= \frac{L!}{x_1! \cdots x_K!}\prod _k \mu _k^{x_k}. \end{aligned}$$
(74)
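
For readers who prefer an executable definition, here is a minimal Python sketch (our own illustration with NumPy and arbitrary example parameters; it is not part of the paper) that draws one sample from each distribution above and illustrates the identity \(\text {Discrete}(\varvec{\mathbf {x}} | \varvec{\mathbf {\mu }}) = \text {Multinomial}(\varvec{\mathbf {x}} | 1, \varvec{\mathbf {\mu }})\):

import numpy as np

rng = np.random.default_rng(0)

x_bern  = rng.binomial(1, 0.3)            # Bernoulli(mu = 0.3), Eq. (68)
x_beta  = rng.beta(2.0, 5.0)              # Beta(a = 2, b = 5), Eq. (69)
x_gamma = rng.gamma(2.0, 1.0 / 3.0)       # Gamma(shape a = 2, rate b = 3); NumPy takes scale = 1/b, Eq. (70)
x_dir   = rng.dirichlet([1.0, 2.0, 3.0])  # Dirichlet(alpha), Eq. (71); components sum to 1

# Discrete(mu) is Multinomial with a single draw (L = 1), Eqs. (72)-(74)
mu = np.array([0.2, 0.5, 0.3])
x_disc = rng.multinomial(1, mu)           # one-hot vector, e.g. [0, 1, 0]
assert x_disc.sum() == 1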

B Derivations for the Beta-Dir model

B.1 Marginalizing out \(\varvec{\mathbf {W}}\) and \(\varvec{\mathbf {H}}\) from the joint likelihood

We seek to compute the marginal joint probability introduced in Eq. (22) and given by:

$$\begin{aligned} p(\mathbf {V}, \mathbf {Z}) = \prod _f \overbrace{ \int p(\varvec{\mathbf {w}}_{f}) \prod _{n} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {w}}_f) \,d\mathbf {w}_f }^{p(\underline{\varvec{\mathbf {Z}}}_{f}) } \prod _n \overbrace{ \int \prod _{k} p(h_{kn}) \prod _{f} p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) \, d {\mathbf {h}} _{n} }^{p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n) }. \end{aligned}$$

Using the expression of the normalization constant of the Dirichlet distribution, the first integral can be computed as follows:

$$\begin{aligned} p(\underline{\varvec{\mathbf {Z}}}_{f})&= \int p(\varvec{\mathbf {w}}_{f}) \prod _{n} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {w}}_f) \,d\mathbf {w}_{f}\end{aligned}$$
(75)
$$\begin{aligned}&= \int \frac{\Gamma (\sum _k \gamma _k)}{\prod _k \Gamma (\gamma _k)}\prod _k w_{fk}^{\gamma _k-1}\prod _n w_{fk}^{z_{fkn}} \,d\mathbf {w}_{f}\end{aligned}$$
(76)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \gamma _k)}{\prod _k \Gamma (\gamma _k)} \int \prod _k w_{fk}^{\gamma _k + L_{fk}-1} \,d\mathbf {w}_{f} \end{aligned}$$
(77)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \gamma _k)}{\prod _k \Gamma (\gamma _k)} \frac{\prod _k \Gamma (\gamma _k + L_{fk})}{\Gamma (\sum _k \gamma _k + L_{fk})}. \end{aligned}$$
(78)
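
As a numerical sanity check of Eq. (78), the following sketch (our own, with arbitrary toy values of K, N and \(\varvec{\mathbf {\gamma }}\)) compares the closed form against a Monte Carlo estimate of the Dirichlet integral:

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
K, N = 3, 6
gamma = np.array([0.5, 1.0, 2.0])
z = rng.integers(0, K, size=N)            # component picked by z_{fn}, for one fixed row f
L = np.bincount(z, minlength=K)           # L_{fk} = sum_n z_{fkn}

# Closed form of Eq. (78), computed in log scale
log_closed = (gammaln(gamma.sum()) - gammaln(gamma).sum()
              + gammaln(gamma + L).sum() - gammaln((gamma + L).sum()))

# Monte Carlo estimate of E_w[ prod_k w_k^{L_{fk}} ] with w ~ Dirichlet(gamma)
w = rng.dirichlet(gamma, size=200_000)
mc = np.mean(np.prod(w ** L, axis=1))
print(np.exp(log_closed), mc)             # the two values should agree up to MC error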

The second integral in Eq. (22) is computed as follows. In Eq. (80) we use that \(p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) = \text {Bernoulli}(v_{fn} | \prod _k h_{kn}^{z_{fkn}}) = \prod _k \text {Bernoulli}(v_{fn}|h_{kn})^{z_{fkn}}\) (recall that \(\mathbf {z}_{fn}\) is an indicator vector). In Eq. (83), we use the expression of the normalization constant of the Beta distribution.

$$\begin{aligned} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)&= \int \prod _k p(h_{kn}) \prod _{f} p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) \, {d \mathbf {h}_n} \end{aligned}$$
(79)
$$\begin{aligned}&= \int \prod _k \left[ \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} h_{kn}^{\alpha _k-1}(1-h_{kn})^{\beta _k-1} \right] \prod _{fk} \left[ h_{kn}^{v_{fn}}(1-h_{kn})^{1-v_{fn}}\right] ^{z_{fkn}} {d \mathbf {h}_n} \end{aligned}$$
(80)
$$\begin{aligned}&= \prod _k \int \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} h_{kn}^{\alpha _k-1}(1-h_{kn})^{\beta _k-1} \prod _{f} \left[ h_{kn}^{v_{fn}}(1-h_{kn})^{1-v_{fn}}\right] ^{z_{fkn}} dh_{kn} \end{aligned}$$
(81)
$$\begin{aligned}&= \prod _k \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} \int h_{kn}^{\alpha _k + A_{kn} -1}(1-h_{kn})^{\beta _k + B_{kn}-1} dh_{kn} \end{aligned}$$
(82)
$$\begin{aligned}&= \prod _k \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} \frac{\Gamma (\alpha _k + A_{kn})\Gamma (\beta _k + B_{kn})}{\Gamma (\alpha _k + \beta _k + M_{kn})}. \end{aligned}$$
(83)
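
An analogous check can be run for Eq. (83). In this sketch (again our own, with arbitrary \(\varvec{\mathbf {\alpha }}\), \(\varvec{\mathbf {\beta }}\) and randomly generated \(\varvec{\mathbf {z}}\) and \(\varvec{\mathbf {v}}\)), the Bernoulli likelihood is integrated against the Beta priors by Monte Carlo:

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
F, K = 8, 3
alpha = np.full(K, 1.5)
beta = np.full(K, 0.5)
z = rng.integers(0, K, size=F)                 # z_{fn} for one fixed column n
v = rng.integers(0, 2, size=F)                 # binary observations v_{fn}

A = np.bincount(z, weights=v, minlength=K)     # A_{kn}: ones routed to component k
B = np.bincount(z, weights=1 - v, minlength=K) # B_{kn}: zeros routed to component k
M = A + B                                      # M_{kn} = A_{kn} + B_{kn}

# Closed form of Eq. (83), in log scale
log_closed = np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                    + gammaln(alpha + A) + gammaln(beta + B)
                    - gammaln(alpha + beta + M))

# Monte Carlo estimate with h_{kn} ~ Beta(alpha_k, beta_k)
h = rng.beta(alpha, beta, size=(200_000, K))
lik = np.prod(h[:, z] ** v * (1 - h[:, z]) ** (1 - v), axis=1)
print(np.exp(log_closed), lik.mean())          # should agree up to MC error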

B.2 Conditional prior and posterior distributions of \(\varvec{\mathbf {z}}_{fn}\)

Applying Bayes' rule, the conditional posterior of \(\varvec{\mathbf {z}}_{fn}\) is given by:

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}}) \propto p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}})p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}). \end{aligned}$$
(84)

The likelihood itself decomposes as \(p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}}) = \prod _{n} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\) and we may ignore the terms that do not depend on \(\varvec{\mathbf {z}}_{fn}\). Using Eq. (24) and the identity \(\Gamma (n + b) = \Gamma (n)n^b\) where b is a binary variable, we may write:

$$\begin{aligned} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)&= \prod _{k} \frac{\Gamma (\alpha _k + \beta _k)}{\Gamma (\alpha _k)\Gamma (\beta _k)} \frac{\Gamma (\alpha _k + A_{kn}) \Gamma (\beta _k + B_{kn}) }{\Gamma (\alpha _k + \beta _k + M_{kn})} \end{aligned}$$
(85)
$$\begin{aligned}&\propto \prod _{k} \frac{\Gamma (\alpha _k + A_{kn}) \Gamma (\beta _k + B_{kn}) }{\Gamma (\alpha _k + \beta _k + M_{kn})} \end{aligned}$$
(86)
$$\begin{aligned}&= \prod _{k} \frac{\Gamma (\alpha _k + A_{kn}^{\lnot fn} + z_{fkn}v_{fn}) \Gamma (\beta _k + B_{kn}^{\lnot fn} + z_{fkn}\bar{v}_{fn}) }{\Gamma (\alpha _k + \beta _k + M_{kn}^{\lnot fn} + z_{fkn})} \end{aligned}$$
(87)
$$\begin{aligned}&\propto \prod _k \frac{ \Gamma (\alpha _k + A_{kn}^{\lnot fn}) (\alpha _k + A_{kn}^{\lnot fn})^{z_{fkn}v_{fn}} \Gamma (\beta _k + B_{kn}^{\lnot fn}) (\beta _k + B_{kn}^{\lnot fn})^{z_{fkn}\bar{v}_{fn}} }{ \Gamma (\alpha _k + \beta _k + M_{kn}^{\lnot fn}) (\alpha _k + \beta _k + M_{kn}^{\lnot fn})^{z_{fkn}} } \end{aligned}$$
(88)
$$\begin{aligned}&\propto \prod _k \left[ \frac{ (\alpha _k + A_{kn}^{\lnot fn})^{v_{fn}} (\beta _k + B_{kn}^{\lnot fn})^{\bar{v}_{fn}} }{ (\alpha _k + \beta _k + M_{kn}^{\lnot fn}) } \right] ^{z_{fkn}}. \end{aligned}$$
(89)

The conditional prior term is given by

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}) = p(\varvec{\mathbf {Z}})/p(\varvec{\mathbf {Z}}_{\lnot fn}). \end{aligned}$$
(90)

Using \(p(\varvec{\mathbf {Z}}) = \prod _{f} p(\underline{\varvec{\mathbf {Z}}}_{f})\) and Eq. (23) we have

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn})&\propto p(\underline{\varvec{\mathbf {Z}}}_f) \end{aligned}$$
(91)
$$\begin{aligned}&\propto \prod _k\Gamma (\gamma _k + L_{fk}^{\lnot fn} + z_{fkn}) \end{aligned}$$
(92)
$$\begin{aligned}&= \prod _k \Gamma (\gamma _k + L_{fk}^{\lnot fn}) (\gamma _k + L_{fk}^{\lnot fn})^{z_{fkn}} \end{aligned}$$
(93)
$$\begin{aligned}&\propto \prod _k (\gamma _k + L_{fk}^{\lnot fn})^{z_{fkn}}. \end{aligned}$$
(94)

Using \(\sum _{k} p(\varvec{\mathbf {z}}_{fn} = {\mathbf {e}} _{k} | \varvec{\mathbf {Z}}_{\lnot fn}) =1\), a simple closed-form expression of \(p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn})\) is obtained as follows:

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} = {\mathbf {e}} _{k} | \varvec{\mathbf {Z}}_{\lnot fn})&= \frac{\gamma _k + L_{fk}^{\lnot fn}}{\sum _{k} (\gamma _k + L_{fk}^{\lnot fn})} \end{aligned}$$
(95)
$$\begin{aligned}&= \frac{\gamma _k + L_{fk}^{\lnot fn}}{\sum _{k} \gamma _k + N-1}. \end{aligned}$$
(96)

Combining Eqs. (84), (89) and (94), we obtain

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn},\varvec{\mathbf {V}}) \propto \prod _k \left[ (\gamma _k + L_{fk}^{\lnot fn} ) \frac{(\alpha _k + A_{kn}^{\lnot fn})^{v_{fn}} (\beta _k + B_{kn}^{\lnot fn})^{\bar{v}_{fn}} }{\alpha _k + \beta _k + M_{kn}^{\lnot fn}} \right] ^{z_{fkn}}. \end{aligned}$$
(97)
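
Equation (97) is all that is needed to run the collapsed Gibbs sampler. The sketch below (our own minimal Python implementation under toy dimensions and arbitrary hyperparameters, not the released code of footnote 2) performs full sweeps over \(\varvec{\mathbf {Z}}\), keeping the statistics \(L_{fk}\), \(A_{kn}\) and \(B_{kn}\) up to date as each \(\varvec{\mathbf {z}}_{fn}\) is resampled:

import numpy as np

rng = np.random.default_rng(3)
F, N, K = 20, 30, 4                            # toy dimensions
gamma = np.full(K, 1.0)                        # arbitrary hyperparameters
alpha = np.full(K, 1.0)
beta = np.full(K, 1.0)
V = rng.integers(0, 2, size=(F, N))            # binary data matrix

# Random initialization of the indicators (stored as integers in 0..K-1)
Z = rng.integers(0, K, size=(F, N))
L = np.stack([np.bincount(Z[f], minlength=K) for f in range(F)])           # F x K
A = np.stack([np.bincount(Z[:, n], weights=V[:, n], minlength=K)
              for n in range(N)]).T                                        # K x N
B = np.stack([np.bincount(Z[:, n], weights=1 - V[:, n], minlength=K)
              for n in range(N)]).T                                        # K x N

def gibbs_sweep():
    for f in range(F):
        for n in range(N):
            k_old, v = Z[f, n], V[f, n]
            # Remove z_{fn} from the statistics (the "not fn" counts)
            L[f, k_old] -= 1
            A[k_old, n] -= v
            B[k_old, n] -= 1 - v
            M = A[:, n] + B[:, n]
            # Unnormalized conditional posterior, Eq. (97)
            p = ((gamma + L[f]) * (alpha + A[:, n]) ** v
                 * (beta + B[:, n]) ** (1 - v) / (alpha + beta + M))
            k_new = rng.choice(K, p=p / p.sum())
            # Insert z_{fn} back with its new value
            Z[f, n] = k_new
            L[f, k_new] += 1
            A[k_new, n] += v
            B[k_new, n] += 1 - v

for it in range(10):                           # a few burn-in sweeps
    gibbs_sweep()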

C Alternative Gibbs sampler for the Dir-Dir model

In this appendix, we show how to derive an alternative Gibbs sampler based on a single augmentation, as in the Beta-Dir model. This is a conceptually interesting result, though it does not lead to an efficient implementation. Like the Beta-Dir model, the Dir-Dir model can be augmented using the indicator variables \(\varvec{\mathbf {z}}_{fn}\), as follows:

$$\begin{aligned} \varvec{\mathbf {h}}_{n}&\sim \text {Dirichlet}(\varvec{\mathbf {\eta }}) \end{aligned}$$
(98)
$$\begin{aligned} \varvec{\mathbf {w}}_{f}&\sim \text {Dirichlet}(\varvec{\mathbf {\gamma }}) \end{aligned}$$
(99)
$$\begin{aligned} \varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {w}}_f&\sim \text {Discrete}(\varvec{\mathbf {w}}_f) \end{aligned}$$
(100)
$$\begin{aligned} v_{fn} | \varvec{\mathbf {h}}_{n}, \varvec{\mathbf {z}}_{fn}&\sim \text {Bernoulli}\left( \prod _k h_{kn}^{z_{fkn}}\right) \end{aligned}$$
(101)
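
For concreteness, the augmented Dir-Dir model of Eqs. (98)–(101) can be simulated as follows (a minimal sketch with arbitrary toy dimensions and hyperparameters):

import numpy as np

rng = np.random.default_rng(4)
F, N, K = 20, 30, 4
eta = np.full(K, 1.0)
gamma = np.full(K, 1.0)

H = rng.dirichlet(eta, size=N).T               # K x N; columns h_n ~ Dirichlet(eta), Eq. (98)
W = rng.dirichlet(gamma, size=F)               # F x K; rows w_f ~ Dirichlet(gamma), Eq. (99)
Z = np.stack([rng.multinomial(1, W[f], size=N).argmax(axis=1)
              for f in range(F)])              # z_{fn} | w_f ~ Discrete(w_f), Eq. (100)
V = rng.binomial(1, H[Z, np.arange(N)])        # v_{fn} ~ Bernoulli(h_{z_{fn} n}), Eq. (101)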

Note that, compared to Eqs. (15)–(18), only the prior on \(\varvec{\mathbf {h}}_{n}\) is changed. As in the Beta-Dir model, we seek in this appendix to derive a Gibbs sampler from the conditional probabilities \(p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}})\) given by

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}}) \propto p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}})p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}). \end{aligned}$$
(102)

The conditional prior term is identical to that of Beta-Dir and given by

$$\begin{aligned} p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}) \propto \prod _k (\gamma _k + L_{fk}^{\lnot fn})^{z_{fkn}}. \end{aligned}$$
(103)

As in the Beta-Dir model, the likelihood term factorizes as \(p(\varvec{\mathbf {V}} | \varvec{\mathbf {Z}}) = \prod _{n} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\), and we now derive the expression of \(p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\). Compared to Beta-Dir, a major source of difficulty is that \(p( {\mathbf {h}} _n)\) no longer fully factorizes, because of the Dirichlet assumption (and in particular \(\sum _k h_{kn}=1\)). In the following, we use the multinomial theorem to obtain Eq. (107) (see Note 4) and the expression of the normalization constant of the Dirichlet distribution to obtain Eq. (110):

$$\begin{aligned} p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)&= \int p(\varvec{\mathbf {h}}_n) \prod _{f} p(v_{fn} | \varvec{\mathbf {h}}_n, \varvec{\mathbf {z}}_{fn}) \,d\mathbf {h}_n\end{aligned}$$
(104)
$$\begin{aligned}&=\int \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \prod _k h_{kn}^{\eta _k-1} \prod _{f}\prod _k \left[ h_{kn}^{v_{fn}} (1-h_{kn})^{1-v_{fn}}\right] ^{z_{fkn}} \,d\mathbf {h}_n\end{aligned}$$
(105)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \int \prod _k h_{kn}^{\eta _k + A_{kn}-1} (1-h_{kn})^{B_{kn}} \,d\mathbf {h}_n \end{aligned}$$
(106)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \int \prod _k h_{kn}^{\eta _k + A_{kn}-1} \sum _{j_k=0}^{B_{kn}} \left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) (-h_{kn})^{j_k} \,d\mathbf {h}_n \end{aligned}$$
(107)
$$\begin{aligned}&=\frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \int \sum _{j_1=0}^{B_{1n}} ... \sum _{j_K=0}^{B_{Kn}} \prod _k h_{kn}^{\eta _k + A_{kn}-1} \left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) (-h_{kn})^{j_k} \,d\mathbf {h}_n \end{aligned}$$
(108)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \sum _{j_1=0}^{B_{1n}} ... \sum _{j_K=0}^{B_{Kn}} \prod _k (-1)^{j_k}\left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) \int \prod _k h_{kn}^{\eta _k + A_{kn} + j_k -1} \,d\mathbf {h}_n\end{aligned}$$
(109)
$$\begin{aligned}&= \frac{\Gamma (\sum _k \eta _k)}{\prod _k \Gamma (\eta _k)} \sum _{j_1=0}^{B_{1n}} \ldots \sum _{j_K=0}^{B_{Kn}} \frac{\prod _k (-1)^{j_k} \left( {\begin{array}{c}B_{kn}\\ j_k\end{array}}\right) \Gamma (\eta _k + A_{kn} + j_k)}{\Gamma \left( \sum _k (\eta _k + A_{kn} + j_k)\right) }. \end{aligned}$$
(110)

We conclude that, though available in closed form, the expression of \(p(\varvec{\mathbf {v}}_n | \varvec{\mathbf {Z}}_n)\) (and thus of \(p(\varvec{\mathbf {z}}_{fn} | \varvec{\mathbf {Z}}_{\lnot fn}, \varvec{\mathbf {V}})\)) involves the computation of \(K\prod _{k=1}^K B_{kn}\) terms with binomial coefficients; for instance, with \(K = 8\) and \(B_{kn} = 50\) for all k, this is of the order of \(10^{14}\) terms, which is impractical in typical problem dimensions.
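
For tiny problems, however, the formula can be checked directly. The following sketch (our own toy verification, feasible only because K and the \(B_{kn}\) are very small here) enumerates the alternating sum of Eq. (110) and compares it with a Monte Carlo estimate of the integral in Eq. (104):

import itertools
from math import comb
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(5)
F, K = 6, 2
eta = np.array([1.0, 2.0])
z = rng.integers(0, K, size=F)                       # z_{fn} for one fixed column n
v = rng.integers(0, 2, size=F)                       # binary observations v_{fn}
A = np.bincount(z, weights=v, minlength=K)           # A_{kn}
B = np.bincount(z, weights=1 - v, minlength=K).astype(int)  # B_{kn}

# Closed form, Eq. (110): sum over j_1 = 0..B_{1n}, ..., j_K = 0..B_{Kn}
const = np.exp(gammaln(eta.sum()) - gammaln(eta).sum())
total = 0.0
for j in itertools.product(*(range(b + 1) for b in B)):
    j = np.array(j)
    sign = (-1.0) ** j.sum()
    coefs = np.prod([comb(b, jk) for b, jk in zip(B, j)])
    a = eta + A + j
    total += sign * coefs * np.exp(gammaln(a).sum() - gammaln(a.sum()))
closed = const * total

# Monte Carlo estimate of Eq. (104) with h_n ~ Dirichlet(eta)
h = rng.dirichlet(eta, size=500_000)
lik = np.prod(h[:, z] ** v * (1 - h[:, z]) ** (1 - v), axis=1)
print(closed, lik.mean())                            # should agree up to MC error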

About this article

Cite this article

Lumbreras, A., Filstroff, L. & Févotte, C. Bayesian mean-parameterized nonnegative binary matrix factorization. Data Min Knowl Disc 34, 1898–1935 (2020). https://doi.org/10.1007/s10618-020-00712-w
