
A theoretical analysis of the BDeu scores in Bayesian network structure learning


Abstract

In Bayes score-based Bayesian network structure learning (BNSL), we must specify two prior probabilities: one over the structures and one over the parameters. In this paper, we mainly consider the parameter priors, in particular the BDeu (Bayesian Dirichlet equivalent uniform) and Jeffreys' prior. In model selection, given examples, we typically consider how well a model explains the examples and how simple the model is, and we choose the model that is best with respect to these criteria. In this sense, if a model A is better than a model B with respect to both criteria, it is reasonable to choose A. In this paper, we prove that the BDeu violates such regularity, so that we face a fatal situation in BNSL: the BDeu tends to add a variable to the current parent set of a variable X even when the conditional entropy has reached zero. In general, a prior should reflect the learner's belief and should not be rejected from a general point of view. However, this paper suggests that the belief underlying the BDeu contradicts our intuition in some cases, which was not known before this paper.
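As a concrete illustration of the regularity violation (proved as Theorem 2 in Appendix B), the following minimal Python sketch, which is ours and not part of the paper, compares \(Q^n(X)Q^n(Y)\) with \(Q^n(X,Y)\) for binary X and Y when all n samples equal (0, 0), so that the empirical entropy is zero. Jeffreys' prior scores the independent model at least as high, whereas the BDeu (here with equivalent sample size \(\delta =1\)) scores the joint model strictly higher for \(n\ge 2\).

```python
from math import lgamma

def log_marginal(counts, a):
    """Log Dirichlet-multinomial marginal likelihood with the same
    hyperparameter `a` on every cell:
    Gamma(K*a)/Gamma(n+K*a) * prod_k Gamma(c_k+a)/Gamma(a)."""
    n, K = sum(counts), len(counts)
    return (lgamma(K * a) - lgamma(n + K * a)
            + sum(lgamma(c + a) - lgamma(a) for c in counts))

n = 10                               # every sample is (x, y) = (0, 0)
cx, cy, cxy = [n, 0], [n, 0], [n, 0, 0, 0]

# Jeffreys: a = 1/2 everywhere; BDeu: a = delta / (number of cells).
# (For binary marginals and delta = 1 the two marginal scores coincide.)
for name, a_marg, a_joint in [("Jeffreys", 0.5, 0.5), ("BDeu", 0.5, 0.25)]:
    indep = log_marginal(cx, a_marg) + log_marginal(cy, a_marg)
    joint = log_marginal(cxy, a_joint)
    print(f"{name}: log Q(X)Q(Y) = {indep:.4f}, log Q(X,Y) = {joint:.4f}")
```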


Notes

  1. We denote \(X\perp \!\!\!\perp Y|Z\) if X and Y are conditionally independent given Z.

  2. If \(Y=X^{(i)}\) and X is a parent set of Y in \(Q^n(Y|X)\), the quantities \(c(x,y)\) and \(a(x,y)\) are expressed as \(N_{ijk}\) and \(N'_{ijk}\), respectively, with \(x=j\) and \(y=k\) in the literature. For ease of understanding, however, we use the current notation in this section.


Author information


Corresponding author

Correspondence to Joe Suzuki.

Additional information

Communicated by Brandon Malone.

Appendices

Appendix A: Proof of Theorem 1

We only prove the simplest case \({\mathbf X}=\{X\}\), \({\mathbf Y}=\{Y\}\), and \({\mathbf Z}=\{\}\). The general case can be obtained in a straightforward manner.

Using Stirling’s formula,

$$\begin{aligned} \log \Gamma (z)=z\log z-z +\frac{1}{2}\log \frac{2\pi }{z}+\epsilon (z) \end{aligned}$$

with \(\displaystyle \frac{1}{12z+1}<\epsilon (z)<\frac{1}{12z}\), we have

$$\begin{aligned} -\log Q^n(X)&=\log \Gamma \left( \sum _x\{c(x)+a(x)\}\right) -\log \Gamma \left( \sum _x a(x)\right) -\sum _x\log \Gamma (c(x)+a(x))\\&\qquad +\,\sum _x\log \Gamma (a(x))\\&=-\,\sum _x \{c(x)+a(x)\}\log \frac{c(x)+a(x)}{\sum _{x'} \{c(x')+a(x')\}} +\frac{1}{2}\sum _x\log \{c(x)+a(x)\}\\&\qquad -\,\frac{1}{2}\log \sum _x\{c(x)+a(x)\}-\frac{\alpha -1}{2}\log 2\pi +\epsilon \left( \sum _x\{c(x)+a(x)\}\right) \\&\qquad -\,\sum _x\epsilon (c(x)+a(x)), \end{aligned}$$

where the last three terms are O(1); the terms \(-\log \Gamma \left( \sum _x a(x)\right) +\sum _x\log \Gamma (a(x))\), which do not depend on n, are also O(1) and are suppressed in the second expression.
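The Stirling bounds used here are easy to sanity-check numerically; the following small sketch (ours) uses `math.lgamma` as the ground truth for \(\log \Gamma (z)\).

```python
import math

def stirling(z):
    """Stirling's approximation to log Gamma(z) without the eps(z) remainder."""
    return z * math.log(z) - z + 0.5 * math.log(2 * math.pi / z)

for z in [0.5, 1.0, 5.0, 50.0, 500.0]:
    eps = math.lgamma(z) - stirling(z)          # the remainder eps(z)
    ok = 1 / (12 * z + 1) < eps < 1 / (12 * z)  # the stated bounds
    print(f"z = {z:6.1f}  eps = {eps:.8f}  bounds hold: {ok}")
```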

For the BD score based on Jeffreys' prior, we assume \(a(z)=1/2\); thus we have

$$\begin{aligned} -\log Q^n(Z) = \sum _zc(z)\log \frac{n+\gamma /2}{c(z)+1/2}+\frac{\gamma -1}{2}\log {n}+O(1). \end{aligned}$$

From

$$\begin{aligned} -\frac{\gamma }{2}\le \sum _zc(z)\log \frac{1+\frac{\gamma }{2n}}{1+\frac{1}{2c(z)}}\le 0, \end{aligned}$$

we have

$$\begin{aligned} -\log Q^n(Z) = \sum _zc(z)\log \frac{n}{c(z)}+\frac{\gamma -1}{2}\log {n}+O(1). \end{aligned}$$

Similarly, we have

$$\begin{aligned} -\log Q^n(X,Z) = \sum _x\sum _zc(x,z)\log \frac{n}{c(x,z)}+\frac{\alpha \gamma -1}{2}\log {n}+O(1), \end{aligned}$$
$$\begin{aligned} -\log Q^n(Y,Z) = \sum _y\sum _zc(y,z)\log \frac{n}{c(y,z)}+\frac{\beta \gamma -1}{2}\log {n}+O(1), \end{aligned}$$

and

$$\begin{aligned} -\log Q^n(X,Y,Z) = \sum _x\sum _y\sum _zc(x,y,z)\log \frac{n}{c(x,y,z)}+\frac{\alpha \beta \gamma -1}{2} \log {n}+O(1). \end{aligned}$$

Thus, we have (11).
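The expansion (11) can be checked numerically. In the following sketch (ours), the difference between the exact \(-\log Q^n(Z)\) under Jeffreys' prior and \(\sum _z c(z)\log (n/c(z))+\frac{\gamma -1}{2}\log n\) stays bounded as n grows, as the O(1) term predicts.

```python
from math import lgamma, log

def neg_log_Q(counts, a):
    """Exact -log Q^n(Z) with hyperparameter `a` on each of the gamma cells."""
    n, g = sum(counts), len(counts)
    return (lgamma(n + g * a) - lgamma(g * a)
            - sum(lgamma(c + a) - lgamma(a) for c in counts))

def approx(counts):
    """Empirical-entropy term plus the (gamma - 1)/2 * log n penalty."""
    n, g = sum(counts), len(counts)
    ent = sum(c * log(n / c) for c in counts if c > 0)
    return ent + (g - 1) / 2 * log(n)

for n in [10**2, 10**3, 10**4, 10**5]:
    counts = [n // 2, n // 3, n - n // 2 - n // 3]     # gamma = 3 cells
    print(n, neg_log_Q(counts, 0.5) - approx(counts))  # Jeffreys: a(z) = 1/2
```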

For the BDeu score, on the other hand, we assume \(a(z)=\delta /\gamma \); thus we have

$$\begin{aligned} -\log Q^n(Z)& = \sum _zc(z)\log \frac{n+\delta }{c(z)+\frac{\delta }{\gamma }}+\left( \delta -\frac{1}{2}\right) \log (n+\delta )\\&\quad -\left( \frac{\delta }{\gamma }-\frac{1}{2}\right) \sum _z\log \left\{ c(z)+\frac{\delta }{\gamma }\right\} +O(1). \end{aligned}$$

From

$$\begin{aligned} -\frac{\delta }{\gamma }\le \sum _z c(z)\log \frac{1+\frac{\delta }{n}}{1+\frac{\delta }{\gamma c(z)}}\le 0, \end{aligned}$$

we have

$$\begin{aligned} -\log Q^n(Z)=\sum _zc(z)\log \frac{n}{c(z)}+\frac{\gamma -1}{2}\log {n} -\left( \frac{\delta }{\gamma }-\frac{1}{2}\right) \sum _z\log \frac{c(z)+\frac{\delta }{\gamma }}{n+\delta } +O(1). \end{aligned}$$

Similarly, we have

$$\begin{aligned} -\log Q^n(X,Z)& = \sum _x\sum _zc(x,z)\log \frac{n}{c(x,z)}\\&\quad +\,\frac{\alpha \gamma -1}{2}\log {n} -\left( \frac{\delta }{\alpha \gamma } -\frac{1}{2}\right) \sum _x\sum _z\log \frac{c(x,z)+\frac{\delta }{\alpha \gamma }}{n+\delta }+O(1), \end{aligned}$$
$$\begin{aligned} -\log Q^n(Y,Z)& = \sum _y\sum _zc(y,z)\log \frac{n}{c(y,z)}\\&\quad +\,\frac{\beta \gamma -1}{2}\log {n} -\left( \frac{\delta }{\beta \gamma } -\frac{1}{2}\right) \sum _y\sum _z\log \frac{c(y,z)+\frac{\delta }{\beta \gamma }}{n+\delta }+O(1), \end{aligned}$$

and

$$\begin{aligned} -\log Q^n(X,Y,Z)& = \sum _x\sum _y\sum _zc(x,y,z)\log \frac{n}{c(x,y,z)}+\frac{\alpha \beta \gamma -1}{2}\log {n}\\&\quad -\,\left( \frac{\delta }{\alpha \beta \gamma } -\frac{1}{2}\right) \sum _x\sum _y\sum _z\log \frac{c(x,y,z)+\frac{\delta }{\alpha \beta \gamma }}{n+\delta }+O(1). \end{aligned}$$

Thus, we have (12).
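The expansion (12) can be checked the same way (again our sketch). Note that when some count c(z) stays at zero, the corresponding factor in the correction term behaves like \(\log \frac{\delta /\gamma }{n+\delta }\approx -\log n\), so the correction is no longer O(1); this is precisely where the BDeu departs from Jeffreys' prior.

```python
from math import lgamma, log

def neg_log_Q(counts, a):  # as in the previous sketch
    n, g = sum(counts), len(counts)
    return (lgamma(n + g * a) - lgamma(g * a)
            - sum(lgamma(c + a) - lgamma(a) for c in counts))

def approx_bdeu(counts, delta):
    """Right-hand side of (12) for a single variable Z."""
    n, g = sum(counts), len(counts)
    a = delta / g
    ent = sum(c * log(n / c) for c in counts if c > 0)
    corr = (a - 0.5) * sum(log((c + a) / (n + delta)) for c in counts)
    return ent + (g - 1) / 2 * log(n) - corr

delta = 1.0
for n in [10**2, 10**3, 10**4, 10**5]:
    counts = [n // 2, n // 3, n - n // 2 - n // 3]     # gamma = 3 cells
    print(n, neg_log_Q(counts, delta / 3) - approx_bdeu(counts, delta))
```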

Appendix B: Proof of Theorem 2

We only prove the simplest case \({\mathbf X}=\{X\}\), \({\mathbf Y}=\{Y\}\), and \({\mathbf Z}=\{\}\). The general case can be obtained in a straightforward manner.

For Jeffreys’, we show \(Q^n(X)Q^n(Y)\ge Q^n(XY)\) for \(n\ge 1\). Since \(c_X(0)=c_Y(0)=n\) and \(c_{XY}(0.0)=n\) in

$$\begin{aligned} \frac{\Gamma (\alpha /2)}{\Gamma (n+\alpha /2)}\prod _x\frac{\Gamma (c_X(x)+1/2)}{\Gamma (1/2)} \cdot \frac{\Gamma (\beta /2)}{\Gamma (n+\beta /2)}\prod _y\frac{\Gamma (c_Y(y)+1/2)}{\Gamma (1/2)} \end{aligned}$$

and

$$\begin{aligned} \frac{\Gamma (\alpha \beta /2)}{\Gamma (n+\alpha \beta /2)}\prod _x\prod _y\frac{\Gamma (c_{XY}(x,y)+1/2)}{\Gamma (1/2)}, \end{aligned}$$

it is sufficient to show

$$\begin{aligned} \frac{\Gamma (\alpha /2)}{\Gamma (n+\alpha /2)}\frac{\Gamma (n+1/2)}{\Gamma (1/2)} \cdot \frac{\Gamma (\beta /2)}{\Gamma (n+\beta /2)}\frac{\Gamma (n+1/2)}{\Gamma (1/2)}\ge \frac{\Gamma (\alpha \beta /2)}{\Gamma (n+\alpha \beta /2)}\frac{\Gamma (n+1/2)}{\Gamma (1/2)}, \end{aligned}$$

which means

$$\begin{aligned} \frac{\Gamma (n+\alpha \beta /2)\Gamma (n+1/2)}{\Gamma (\alpha \beta /2)\Gamma (1/2)}\ge \frac{\Gamma (n+\alpha /2)\Gamma (n+\beta /2)}{\Gamma (\alpha /2)\Gamma (\beta /2)}. \end{aligned}$$
(15)

For \(n=0\), both sides are 1, and the inequality (15) holds. We then find that (15) for n implies (15) for \(n+1\) because

$$\begin{aligned} \frac{\Gamma (n+1+\alpha /2)\Gamma (n+1+\beta /2)}{\Gamma (\alpha /2)\Gamma (\beta /2)}&= \frac{\Gamma (n+\alpha /2)\Gamma (n+\beta /2)}{\Gamma (\alpha /2)\Gamma (\beta /2)}\cdot (n+\alpha /2)(n+\beta /2)\\&\le \frac{\Gamma (n+\alpha \beta /2)\Gamma (n+1/2)}{\Gamma (\alpha \beta /2)\Gamma (1/2)}\cdot (n+\alpha /2)(n+\beta /2)\\&\le \frac{\Gamma (n+\alpha \beta /2)\Gamma (n+1/2)}{\Gamma (\alpha \beta /2)\Gamma (1/2)}\cdot \{n^2+(\alpha \beta +1)n/2+\alpha \beta /4\}\\&= \frac{\Gamma (n+\alpha \beta /2)\Gamma (n+1/2)}{\Gamma (\alpha \beta /2)\Gamma (1/2)}\cdot (n+\alpha \beta /2)(n+1/2) =\frac{\Gamma (n+1+\alpha \beta /2)\Gamma (n+1+1/2)}{\Gamma (\alpha \beta /2)\Gamma (1/2)}, \end{aligned}$$

where the first inequality follows from the induction hypothesis and the second from \((n+\alpha /2)(n+\beta /2)\le n^2+(\alpha \beta +1)n/2+\alpha \beta /4\), which is equivalent to \((\alpha -1)(\beta -1)\ge 0\).
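Inequality (15) is also easy to confirm numerically via `math.lgamma` (our sketch; both sides are compared on the log scale):

```python
from math import lgamma

def log_lhs(n, a, b):   # log of the left-hand side of (15)
    return lgamma(n + a * b / 2) - lgamma(a * b / 2) + lgamma(n + 0.5) - lgamma(0.5)

def log_rhs(n, a, b):   # log of the right-hand side of (15)
    return lgamma(n + a / 2) - lgamma(a / 2) + lgamma(n + b / 2) - lgamma(b / 2)

for a in [2, 3, 4]:          # alpha: number of states of X
    for b in [2, 3, 4]:      # beta: number of states of Y
        assert all(log_lhs(n, a, b) >= log_rhs(n, a, b) - 1e-9
                   for n in range(51))
print("(15) holds on all tested cases")
```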

For the BDeu with equivalent sample size \(\delta >0\), we show \(Q^n(X)Q^n(Y)\le Q^n(X,Y)\) for \(n\ge 1\). In this case, the two sides become

$$\begin{aligned} \frac{\Gamma (\delta )}{\Gamma (n+\delta )}\prod _x\frac{\Gamma (c_X(x)+\delta /\alpha )}{\Gamma (\delta /\alpha )} \cdot \frac{\Gamma (\delta )}{\Gamma (n+\delta )}\prod _y\frac{\Gamma (c_Y(y)+\delta /\beta )}{\Gamma (\delta /\beta )} \end{aligned}$$

and

$$\begin{aligned} \frac{\Gamma (\delta )}{\Gamma (n+\delta )}\prod _x\prod _y\frac{\Gamma (c_{XY}(x,y)+\delta /\alpha \beta )}{\Gamma (\delta /\alpha \beta )}, \end{aligned}$$

and it is sufficient to show

$$\begin{aligned} \frac{\Gamma (n+\delta /\alpha )}{\Gamma (\delta /\alpha )}\cdot \frac{\Gamma (n+\delta /\beta )}{\Gamma (\delta /\beta )} \le \frac{\Gamma (n+\delta /\alpha \beta )}{\Gamma (\delta /\alpha \beta )}\cdot \frac{\Gamma (n+\delta )}{\Gamma (\delta )}. \end{aligned}$$
(16)

For \(n=0\), both sides are 1. Then, we find that (16) for n implies (16) for \(n+1\) because \((n+\delta /\alpha )(n+\delta /\beta )\le (n+\delta )(n+\delta /\alpha \beta )\), which is again equivalent to \((\alpha -1)(\beta -1)\ge 0\).

In particular, equality fails for \(n=2\), so that a strict inequality holds for \(n\ge 2\).

This completes the proof.
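As with (15), inequality (16), including its strictness for \(n\ge 2\), can be confirmed numerically (our sketch):

```python
from math import lgamma

def log_lhs(n, a, b, d):   # log of the left-hand side of (16)
    return lgamma(n + d / a) - lgamma(d / a) + lgamma(n + d / b) - lgamma(d / b)

def log_rhs(n, a, b, d):   # log of the right-hand side of (16)
    return (lgamma(n + d / (a * b)) - lgamma(d / (a * b))
            + lgamma(n + d) - lgamma(d))

a, b, d = 2, 3, 1.0        # alpha, beta, and the equivalent sample size delta
for n in range(6):
    gap = log_rhs(n, a, b, d) - log_lhs(n, a, b, d)
    print(n, gap)          # gap = 0 for n = 0, 1 and strictly positive after
```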

About this article


Cite this article

Suzuki, J. A theoretical analysis of the BDeu scores in Bayesian network structure learning. Behaviormetrika 44, 97–116 (2017). https://doi.org/10.1007/s41237-016-0006-4

