Abstract
In Bayes-score-based Bayesian network structure learning (BNSL), one must specify two prior probabilities: one over the structures and one over the parameters. In this paper, we mainly consider the parameter priors, in particular the BDeu (Bayesian Dirichlet equivalent uniform) and Jeffreys' prior. In model selection, given examples, we typically consider how well a model explains the examples and how simple the model is, and choose the best one by these criteria. In this sense, if a model A is better than another model B under both criteria, it is reasonable to choose model A. In this paper, we prove that the BDeu violates this regularity and that we consequently face a fatal situation in BNSL: the BDeu tends to add a variable to the current parent set of a variable X even when the conditional entropy has already reached zero. In general, priors should reflect the learner's belief and should not be rejected out of hand. However, this paper shows that the belief underlying the BDeu contradicts our intuition in some cases, a fact that was not known before this work.
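The anomaly stated above can be reproduced numerically. The following sketch is not from the paper: the function `bd_local` and the constant-data example are our own illustrative construction, assuming binary variables and the standard Bayesian Dirichlet local score. When Y is constant in the sample (so the conditional entropy of Y given any parent set is zero), BDeu with equivalent sample size one still assigns a higher score to the model with a useless parent X, whereas Jeffreys' prior does not.

```python
from math import lgamma

def bd_local(counts, a_j, a_jk):
    """Log marginal likelihood of a child variable given a count table.
    counts: one list of cell counts N_jk per parent configuration j;
    a_j / a_jk: per-configuration / per-cell Dirichlet hyperparameters."""
    s = 0.0
    for njk in counts:
        s += lgamma(a_j) - lgamma(a_j + sum(njk))
        for n in njk:
            s += lgamma(a_jk + n) - lgamma(a_jk)
    return s

# n = 100 samples in which X and Y are both constant (always 0), so
# H(Y) = H(Y | X) = 0 and X carries no information about Y.
n = 100
no_parent = [[n, 0]]            # q = 1 parent configuration, r = 2 states of Y
with_parent = [[n, 0], [0, 0]]  # q = 2 configurations of the candidate parent X

delta = 1.0  # BDeu equivalent sample size
bdeu_no = bd_local(no_parent, delta, delta / 2)
bdeu_X = bd_local(with_parent, delta / 2, delta / 4)

jef_no = bd_local(no_parent, 1.0, 0.5)   # Jeffreys: 1/2 for every cell
jef_X = bd_local(with_parent, 1.0, 0.5)

print(bdeu_X > bdeu_no)   # True: BDeu rewards the uninformative parent
print(jef_X <= jef_no)    # True: Jeffreys does not (the scores tie here)
```

Since the empty configuration contributes nothing under Jeffreys' fixed hyperparameter 1/2, its two scores coincide, while the BDeu hyperparameters shrink with the number of configurations and strictly favor the larger parent set.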
Notes
We denote \(X\perp \!\!\!\perp Y|Z\) if X and Y are conditionally independent given Z.
If \(Y=X^{(i)}\) and X is a parent set of Y in \(Q^n(Y|X)\), the quantities c(x, y) and a(x, y) are written \(N_{ijk}\) and \(N'_{ijk}\), respectively, with \(x=j\) and \(y=k\) in the literature. For ease of understanding, however, we use the current notation in this section.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: 2nd International Symposium on Information Theory, vol 57. Budapest, Hungary
Billingsley P (1995) Probability & measure, 3rd edn. Wiley, New York
Buntine W (1991) Theory refinement on Bayesian networks. In: Uncertainty in Artificial Intelligence. Los Angeles, CA, pp 52–60
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9(4):309–347
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243
Jeffreys H (1939) Theory of probability. Oxford University Press, Oxford
Koller D, Friedman N (2009) Probabilistic graphical models. The MIT Press, USA
Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference (representation and reasoning), 2nd edn. Morgan Kaufmann, USA
Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471
Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, USA
Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Silander T, Kontkanen P, Myllymaki P (2007) On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In: Laskey KB, Mahoney SM, Goldsmith J (eds) Uncertainty in Artificial Intelligence. Morgan Kaufmann, Vancouver, pp 360–367
Silander T, Kontkanen P, Myllymaki P (2008) Factorized normalized maximum likelihood criterion for learning Bayesian network structures. In: Proceedings of the 4th European workshop on probabilistic graphical models (PGM-08), pp 257–272
Steck H, Jaakkola TS (2002) On the Dirichlet prior and Bayesian regularization. In: Becker S, Thrun S, Obermayer K (eds) Advances in Neural Information Processing Systems (NIPS). MIT Press, Cambridge, MA, pp 697–704
Suzuki J (2012) The Bayesian Chow-Liu algorithm. In: The Sixth European Workshop on Probabilistic Graphical Models. Granada, pp 315–322
Suzuki J (2015) Consistency of learning Bayesian network structures with continuous variables: an information theoretic approach. Entropy 17(8):5752–5770
Ueno M (2008) Learning likelihood-equivalence Bayesian networks using an empirical Bayesian approach. Behaviormetrika 35(2):115–135
Additional information
Communicated by Brandon Malone.
Appendices
Appendix A: Proof of Theorem 1
We only prove the simplest case \({\mathbf X}=\{X\}\), \({\mathbf Y}=\{Y\}\), and \({\mathbf Z}=\{\}\). The general case can be obtained in a straightforward manner.
Using Stirling's formula
\(\Gamma (z)=\sqrt{2\pi }\,z^{z-1/2}e^{-z+\epsilon (z)}\)
with \(\displaystyle \frac{1}{12z+1}<\epsilon (z)<\frac{1}{12z}\), we have
where the last three terms are O(1).
For the BD score based on Jeffreys’ prior, we assume \(a(z)=0.5\); thus we have
From
we have
Similarly, we have
and
Thus, we have (11).
For the BDeu score, on the other hand, we assume \(a(z)=\delta /\gamma \); thus we have
From
we have
Similarly, we have
and
Thus, we have (12).
Appendix B: Proof of Theorem 2
We only prove the simplest case \({\mathbf X}=\{X\}\), \({\mathbf Y}=\{Y\}\), and \({\mathbf Z}=\{\}\). The general case can be obtained in a straightforward manner.
For Jeffreys' prior, we show \(Q^n(X)Q^n(Y)\ge Q^n(XY)\) for \(n\ge 1\). Since \(c_X(0)=c_Y(0)=n\) and \(c_{XY}(0,0)=n\) in
and
it is sufficient to show
which means
For \(n=0\), both sides equal 1, so the inequality (15) holds. Then, we find that (15) for n implies (15) for \(n+1\) because we see
where the first inequality follows from the assumption of induction.
For the BDeu with equivalent sample size \(\delta >0\), we show \(Q^n(X)Q^n(Y)\le Q^n(XY)\) for \(n\ge 1\). Then, both sides are replaced by
and
and it is sufficient to show
Then, we find that (16) for n implies (16) for \(n+1\) by induction, using \((n+\delta /\alpha )(n+\delta /\beta )\le (n+\delta )(n+\delta /\alpha \beta )\).
In particular, equality does not hold for \(n=2\), so the inequality is strict for \(n\ge 2\).
This completes the proof.
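The two inequalities proved above can also be checked numerically. The following sketch is illustrative and not part of the proof: `log_q` is our own helper computing the log Dirichlet marginal likelihood, and we evaluate both sides of (15) and (16) on the constant binary sample with \(c_X(0)=c_Y(0)=c_{XY}(0,0)=n\), taking \(\delta =1\).

```python
from math import lgamma

def log_q(counts, alphas):
    """Log marginal likelihood of an i.i.d. multinomial sample under a
    Dirichlet prior: log[ G(sum a)/G(sum a + n) * prod G(a_k + c_k)/G(a_k) ]."""
    n, a = sum(counts), sum(alphas)
    s = lgamma(a) - lgamma(a + n)
    for c, ak in zip(counts, alphas):
        s += lgamma(ak + c) - lgamma(ak)
    return s

delta = 1.0  # BDeu equivalent sample size
for n in range(2, 51):
    # constant sample: c_X(0) = c_Y(0) = c_XY(0,0) = n
    jef_prod = 2 * log_q([n, 0], [0.5, 0.5])      # log Q^n(X) + log Q^n(Y)
    jef_joint = log_q([n, 0, 0, 0], [0.5] * 4)    # log Q^n(XY)
    assert jef_prod > jef_joint   # Jeffreys: Q^n(X)Q^n(Y) >= Q^n(XY), strict here

    bdeu_prod = 2 * log_q([n, 0], [delta / 2] * 2)
    bdeu_joint = log_q([n, 0, 0, 0], [delta / 4] * 4)
    assert bdeu_joint > bdeu_prod  # BDeu: Q^n(XY) >= Q^n(X)Q^n(Y), strict for n >= 2
print("both inequalities hold strictly for n = 2..50")
```

At \(n=1\) both scores coincide on each side, consistent with strictness setting in only from \(n=2\).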
Suzuki, J. A theoretical analysis of the BDeu scores in Bayesian network structure learning. Behaviormetrika 44, 97–116 (2017). https://doi.org/10.1007/s41237-016-0006-4