Abstract
In Bayes-score-based Bayesian network structure learning (BNSL), one must specify two prior probabilities: one over the structures and one over the parameters. In this paper, we mainly consider the parameter priors, in particular the BDeu (Bayesian Dirichlet equivalent uniform) and Jeffreys' prior. In model selection, given examples, we typically consider how well a model explains the examples and how simple the model is, and choose the best one by these criteria. In this sense, if a model A is better than another model B under both criteria, it is reasonable to choose model A. In this paper, we prove that the BDeu violates this regularity and that we consequently face a fatal situation in BNSL: the BDeu tends to add a variable to the current parent set of a variable X even when the conditional entropy has already reached zero. In general, priors should reflect the learner's belief and should not be rejected out of hand. However, this paper shows that the belief underlying the BDeu contradicts our intuition in some cases, a fact that was not known before this work.
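The anomaly stated above can be reproduced numerically. The following sketch is not from the paper: the function `bd_local` and the constant-data example are our own illustrative construction, assuming binary variables and the standard Bayesian Dirichlet local score. When Y is constant in the sample (so the conditional entropy of Y given any parent set is zero), BDeu with equivalent sample size one still assigns a higher score to the model with a useless parent X, whereas Jeffreys' prior does not.

```python
from math import lgamma

def bd_local(counts, a_j, a_jk):
    """Log marginal likelihood of a child variable given a count table.
    counts: one list of cell counts N_jk per parent configuration j;
    a_j / a_jk: per-configuration / per-cell Dirichlet hyperparameters."""
    s = 0.0
    for njk in counts:
        s += lgamma(a_j) - lgamma(a_j + sum(njk))
        for n in njk:
            s += lgamma(a_jk + n) - lgamma(a_jk)
    return s

# n = 100 samples in which X and Y are both constant (always 0), so
# H(Y) = H(Y | X) = 0 and X carries no information about Y.
n = 100
no_parent = [[n, 0]]            # q = 1 parent configuration, r = 2 states of Y
with_parent = [[n, 0], [0, 0]]  # q = 2 configurations of the candidate parent X

delta = 1.0  # BDeu equivalent sample size
bdeu_no = bd_local(no_parent, delta, delta / 2)
bdeu_X = bd_local(with_parent, delta / 2, delta / 4)

jef_no = bd_local(no_parent, 1.0, 0.5)   # Jeffreys: 1/2 for every cell
jef_X = bd_local(with_parent, 1.0, 0.5)

print(bdeu_X > bdeu_no)   # True: BDeu rewards the uninformative parent
print(jef_X <= jef_no)    # True: Jeffreys does not (the scores tie here)
```

Since the empty configuration contributes nothing under Jeffreys' fixed hyperparameter 1/2, its two scores coincide, while the BDeu hyperparameters shrink with the number of configurations and strictly favor the larger parent set.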
Notes
We denote \(X\perp \!\!\!\perp Y|Z\) if X and Y are conditionally independent given Z.
If \(Y=X^{(i)}\) and X is a parent set of Y in \(Q^n(Y|X)\), the quantities c(x, y) and a(x, y) are written \(N_{ijk}\) and \(N'_{ijk}\), respectively, with \(x=j\) and \(y=k\) in the literature. For ease of understanding, however, we use the current notation in this section.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: 2nd International Symposium on Information Theory, vol 57. Budapest, Hungary
Billingsley P (1995) Probability & measure, 3rd edn. Wiley, New York
Buntine W (1991) Theory refinement on Bayesian networks. In: Uncertainty in Artificial Intelligence. Los Angeles, CA, pp 52–60
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9(4):309–347
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243
Jeffreys H (1939) Theory of probability. Oxford University Press, Oxford
Koller D, Friedman N (2009) Probabilistic graphical models. The MIT Press, USA
Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference (representation and reasoning), 2nd edn. Morgan Kaufmann, USA
Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471
Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, USA
Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Silander T, Kontkanen P, Myllymaki P (2007) On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In: Laskey KB, Mahoney SM, Goldsmith J (eds) Uncertainty in Artificial Intelligence. Morgan Kaufmann, Vancouver, pp 360–367
Silander T, Kontkanen P, Myllymaki P (2008) Factorized normalized maximum likelihood criterion for learning Bayesian network structures. In: Proceedings of the 4th European workshop on probabilistic graphical models (PGM-08), pp 257–272
Steck H, Jaakkola TS (2002) On the Dirichlet prior and Bayesian regularization. In: Becker S, Thrun S, Obermayer K (eds) Advances in Neural Information Processing Systems (NIPS). MIT Press, Cambridge, MA, pp 697–704
Suzuki J (2012) The Bayesian Chow-Liu algorithm. In: The Sixth European Workshop on Probabilistic Graphical Models. Granada, pp 315–322
Suzuki J (2015) Consistency of learning Bayesian network structures with continuous variables: an information theoretic approach. Entropy 17(8):5752–5770
Ueno M (2008) Learning likelihood-equivalence Bayesian networks using an empirical Bayesian approach. Behaviormetrika 35(2):115–135
Additional information
Communicated by Brandon Malone.
Appendices
Appendix A: Proof of Theorem 1
We only prove the simplest case \({\mathbf X}=\{X\}\), \({\mathbf Y}=\{Y\}\), and \({\mathbf Z}=\{\}\). The general case can be obtained in a straightforward manner.
Using Stirling's formula
\(\Gamma (z)=\sqrt{2\pi }\,z^{z-1/2}e^{-z+\epsilon (z)}\)
with \(\displaystyle \frac{1}{12z+1}<\epsilon (z)<\frac{1}{12z}\), we have
where the last three terms are O(1).
For the BD score based on Jeffreys’ prior, we assume \(a(z)=0.5\); thus we have
From
we have
Similarly, we have
and
Thus, we have (11).
For the BDeu score, on the other hand, we assume \(a(z)=\delta /\gamma \); thus we have
From
we have
Similarly, we have
and
Thus, we have (12).
Appendix B: Proof of Theorem 2
We only prove the simplest case \({\mathbf X}=\{X\}\), \({\mathbf Y}=\{Y\}\), and \({\mathbf Z}=\{\}\). The general case can be obtained in a straightforward manner.
For Jeffreys' prior, we show \(Q^n(X)Q^n(Y)\ge Q^n(XY)\) for \(n\ge 1\). Since \(c_X(0)=c_Y(0)=n\) and \(c_{XY}(0,0)=n\) in
and
it is sufficient to show
which means
For \(n=0\), both sides equal 1, so the inequality (15) holds. Then, we find that (15) for n implies (15) for \(n+1\) because we see
where the first inequality follows from the assumption of induction.
For the BDeu with equivalent sample size \(\delta >0\), we show \(Q^n(X)Q^n(Y)\le Q^n(XY)\) for \(n\ge 1\). Then, both sides are replaced by
and
and it is sufficient to show
Then, we find that (16) for n implies (16) for \(n+1\) by induction, using \((n+\delta /\alpha )(n+\delta /\beta )\le (n+\delta )(n+\delta /\alpha \beta )\).
In particular, equality does not hold for \(n=2\), so the inequality is strict for \(n\ge 2\).
This completes the proof.
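The two inequalities proved above can also be checked numerically. The following sketch is illustrative and not part of the proof: `log_q` is our own helper computing the log Dirichlet marginal likelihood, and we evaluate both sides of (15) and (16) on the constant binary sample with \(c_X(0)=c_Y(0)=c_{XY}(0,0)=n\), taking \(\delta =1\).

```python
from math import lgamma

def log_q(counts, alphas):
    """Log marginal likelihood of an i.i.d. multinomial sample under a
    Dirichlet prior: log[ G(sum a)/G(sum a + n) * prod G(a_k + c_k)/G(a_k) ]."""
    n, a = sum(counts), sum(alphas)
    s = lgamma(a) - lgamma(a + n)
    for c, ak in zip(counts, alphas):
        s += lgamma(ak + c) - lgamma(ak)
    return s

delta = 1.0  # BDeu equivalent sample size
for n in range(2, 51):
    # constant sample: c_X(0) = c_Y(0) = c_XY(0,0) = n
    jef_prod = 2 * log_q([n, 0], [0.5, 0.5])      # log Q^n(X) + log Q^n(Y)
    jef_joint = log_q([n, 0, 0, 0], [0.5] * 4)    # log Q^n(XY)
    assert jef_prod > jef_joint   # Jeffreys: Q^n(X)Q^n(Y) >= Q^n(XY), strict here

    bdeu_prod = 2 * log_q([n, 0], [delta / 2] * 2)
    bdeu_joint = log_q([n, 0, 0, 0], [delta / 4] * 4)
    assert bdeu_joint > bdeu_prod  # BDeu: Q^n(XY) >= Q^n(X)Q^n(Y), strict for n >= 2
print("both inequalities hold strictly for n = 2..50")
```

At \(n=1\) both scores coincide on each side, consistent with strictness setting in only from \(n=2\).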
Suzuki, J. A theoretical analysis of the BDeu scores in Bayesian network structure learning. Behaviormetrika 44, 97–116 (2017). https://doi.org/10.1007/s41237-016-0006-4