1 Introduction and background

Bayesian networks (BNs; Pearl 1988; Koller and Friedman 2009) are probabilistic graphical models based on a directed acyclic graph (DAG) \({\mathcal {G}}\) whose nodes are associated with a set of random variables \({\mathbf {X}}= \{X_1, \ldots , X_N\}\) following some distribution \({P}({\mathbf {X}})\). (The two are referred to interchangeably.) Formally, \({\mathcal {G}}\) is defined as an independency map of \({P}({\mathbf {X}})\) such that:

$$\begin{aligned} {\mathbf {X}}_A \perp\hspace{-0.21cm}\perp_G {\mathbf {X}}_B |{\mathbf {X}}_C \Longrightarrow {\mathbf {X}}_A \perp\hspace{-0.21cm}\perp _P {\mathbf {X}}_B |{\mathbf {X}}_C, \end{aligned}$$

where \({\mathbf {X}}_A\), \({\mathbf {X}}_B\) and \({\mathbf {X}}_C\) are disjoint subsets of \({\mathbf {X}}\). In other words, graphical separation (denoted \(\perp\hspace{-0.21cm}\perp _G\), and called d-separation in this context) between two nodes in \({\mathcal {G}}\) implies the conditional independence (denoted \(\perp\hspace{-0.21cm}\perp _P\)) of the corresponding variables in \({\mathbf {X}}\). Two nodes linked by an arc cannot be graphically separated; hence the arcs of \({\mathcal {G}}\) represent direct dependencies between the variables they are incident on, as opposed to indirect dependencies that are mediated by one or more nodes in \({\mathcal {G}}\).

A consequence of this definition is the Markov property (Pearl 1988): in the absence of missing data the global distribution of \({\mathbf {X}}\) decomposes into:

$${P}({\mathbf {X}}|{\mathcal {G}}) = \prod _{i=1}^N {P}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}}\right)$$
(1)

where the local distribution of each node \(X_i\) depends only on the values of its parents \(\varPi _{X_i}\) in \({\mathcal {G}}\) (denoted \(\varPi _{X_i}^{{\mathcal {G}}}\)). In this paper we will focus on discrete BNs (Heckerman et al. 1995), in which both \({\mathbf {X}}\) and the \(X_i\) are multinomial random variables; in particular:

$$\begin{aligned} X_i |\varPi _{X_i}^{{\mathcal {G}}}\sim \text{Multinomial}\left( \varTheta _{X_i}|\varPi _{X_i}^{\mathcal {G}}\right) \end{aligned}$$

where each parameter set \(\varTheta _{X_i}\) comprises the conditional probabilities:

$$\begin{aligned} \pi _{ik |j}= {P}\left( X_i = k |\varPi _{X_i}^{{\mathcal {G}}}= j\right) \end{aligned}$$

of each value k of \(X_i\) given each possible configuration of the values of \(\varPi _{X_i}^{{\mathcal {G}}}\). Other possibilities include Gaussian BNs (Geiger and Heckerman 1994) and conditional linear Gaussian BNs (Lauritzen and Wermuth 1989). In Gaussian BNs, \({\mathbf {X}}\) is multivariate normal and the (conditional) dependencies between the \(X_i\) are assumed to be linear, leading to

$$\begin{aligned} X_i |\varPi _{X_i}^{{\mathcal {G}}}\sim N\left( \varTheta _{X_i}|\varPi _{X_i}^{\mathcal {G}}\right) \end{aligned}$$

which can be written as a linear regression model of the form

$$\begin{aligned} X_i = \mu _{X_i} + \varPi _{X_i}^{{\mathcal {G}}}\beta _{X_i} + \varepsilon ,\quad \varepsilon \sim N(0, \sigma ^2_{X_i}) \end{aligned}$$

or using the partial correlations between \(X_i\) and each parent given the rest; the two parameterisations are equivalent (Weatherburn 1961). Conditional linear Gaussian BNs combine multinomial and normal variables using mixtures of normals, with the discrete variables identifying the components of the mixture.

It is important to note that the decomposition in (1) does not uniquely identify a BN; different DAGs can encode the same global distribution, thus grouping BNs into equivalence classes (Chickering 1995) characterised by the skeleton of \({\mathcal {G}}\) (its underlying undirected graph) and its v-structures (patterns of arcs of the type \(X_j \rightarrow X_i \leftarrow X_k\), with no arc between \(X_j\) and \(X_k\)). Intuitively, the direction of arcs that are not part of v-structures can be reversed without changing the global distribution, just factorising it in different ways, as long as the new arc directions do not introduce additional v-structures or cycles in the DAG. As a simple example, consider,

$$\begin{aligned} \underbrace{{P}(X_i){P}(X_j |X_i){P}(X_k |X_j)}_{X_i \rightarrow X_j \rightarrow X_k}&= {} {P}(X_j, X_i){P}(X_k |X_j)\\ &= {} \underbrace{{P}(X_i |X_j){P}(X_j){P}(X_k |X_j)}_{X_i \leftarrow X_j \rightarrow X_k}. \end{aligned}$$
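
As a minimal numerical sketch of this equivalence (with arbitrary, made-up probability tables for three binary variables), the snippet below builds the joint distribution from the factorisation \(X_i \rightarrow X_j \rightarrow X_k\) and checks that re-factorising it as \(X_i \leftarrow X_j \rightarrow X_k\) yields the same distribution.

```python
import numpy as np

# arbitrary (made-up) conditional probability tables for binary X_i, X_j, X_k
p_i = np.array([0.3, 0.7])                    # P(X_i)
p_j_given_i = np.array([[0.8, 0.2],           # P(X_j | X_i), rows indexed by X_i
                        [0.4, 0.6]])
p_k_given_j = np.array([[0.9, 0.1],           # P(X_k | X_j), rows indexed by X_j
                        [0.5, 0.5]])

# joint distribution from the factorisation X_i -> X_j -> X_k
joint = p_i[:, None, None] * p_j_given_i[:, :, None] * p_k_given_j[None, :, :]

# re-factorise as X_i <- X_j -> X_k using P(X_i | X_j) and P(X_j)
p_ij = joint.sum(axis=2)                      # P(X_i, X_j)
p_j = p_ij.sum(axis=0)                        # P(X_j)
p_i_given_j = p_ij / p_j                      # P(X_i | X_j), columns indexed by X_j
joint_alt = p_i_given_j[:, :, None] * p_j[None, :, None] * p_k_given_j[None, :, :]

print(np.allclose(joint, joint_alt))          # True: the two DAGs encode the same P(X)
```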

The task of specifying BNs is called learning and can be performed using data, prior expert knowledge on the phenomenon being modelled, or both. Combining the two has been shown to provide very good results in a variety of applications, and should be preferred when it is feasible to elicit prior information from experts (Castelo and Siebes 2000; Mukherjee and Speed 2008). Learning BNs from data is usually performed in an inherently Bayesian fashion by maximising,

$$\begin{aligned} \underbrace{{P}({\mathcal {B}}|{\mathcal {D}}) = {P}({\mathcal {G}}, \varTheta |{\mathcal {D}})}_{{\mathrm{learning}}}=\underbrace{{P}({\mathcal {G}}|{\mathcal {D}})}_{{\mathrm{structure\, learning}}}\cdot \underbrace{{P}(\varTheta |{\mathcal {G}}, {\mathcal {D}})}_{{\mathrm{parameter\, learning}}}, \end{aligned}$$
(2)

where \({\mathcal {D}}\) is a sample from \({\mathbf {X}}\) and \({\mathcal {B}}= ({\mathcal {G}}, \varTheta )\) is a BN with DAG \({\mathcal {G}}\) and parameter set \(\varTheta = \bigcup _{i=1}^N \varTheta _{X_i}\). Structure learning consists in finding the DAG \({\mathcal {G}}\) that encodes the dependence structure of the data; parameter learning involves the estimation of the parameters \(\varTheta\) given \({\mathcal {G}}\). Expert knowledge can be incorporated in either or both these steps through the use of informative priors for \({\mathcal {G}}\) or \(\varTheta\).

Structure learning can be implemented in several ways, based on many results from probability, information and optimisation theory; algorithms for this task can be broadly grouped into constraint-based, score-based and hybrid.

Constraint-based algorithms (Aliferis et al. 2010a, b) use statistical tests to learn conditional independence relationships from the data and to determine if the corresponding arcs should be included in \({\mathcal {G}}\). In order to do that they assume that \({\mathcal {G}}\) is faithful to \({P}({\mathbf {X}})\), meaning

$$\begin{aligned} {\mathbf {X}}_A \perp\hspace{-0.21cm}\perp _G {\mathbf {X}}_B |{\mathbf {X}}_C \Longleftrightarrow {\mathbf {X}}_A \perp\hspace{-0.21cm}\perp _P {\mathbf {X}}_B |{\mathbf {X}}_C; \end{aligned}$$

this is a strong assumption that does not hold in a number of real-world scenarios, which are reviewed in Koski and Noble (2012). Depending on the nature of the data, conditional independence tests in common use are the mutual information (\(G^2\)) and Pearson’s \(\chi ^2\) tests for contingency tables (for discrete BNs); and Fisher’s Z and the exact t tests for partial correlations (for Gaussian BNs); an overview is provided in Scutari and Denis (2014).

Score-based algorithms are closer to model selection techniques developed in classic statistics and information theory. Each candidate network is assigned a score reflecting its goodness of fit, which is then taken as an objective function to maximise. This is often done using heuristic optimisation algorithms, from local search to genetic algorithms (Russell and Norvig 2009); but the availability of computational resources and advances in learning algorithms have recently made exact learning possible (Cussens 2012). Common choices for the network score include the Bayesian Information Criterion (BIC) and the marginal likelihood \({P}({\mathcal {G}}|{\mathcal {D}})\) itself; for an overview see again Scutari and Denis (2014). We will cover both in more detail for discrete BNs in Sect. 2.

Hybrid algorithms use both statistical tests and score functions, combining the previous two families of algorithms. Their general formulation is described for BNs in Friedman et al. (1999); they have proved to be among the top performers to date (see for instance MMHC in Tsamardinos et al. 2006).

As for parameter learning, the parameters \(\varTheta _{X_i}\) can be estimated independently for each node following (1) since its parents are assumed to be known from structure learning. Both maximum likelihood and Bayesian posterior estimators are in common use, with the latter being preferred due to their smoothness and superior predictive power (Koller and Friedman 2009).

In this paper we focus on score-based structure learning in a Bayesian framework, in which we aim to identify a maximum a posteriori (MAP) DAG \({\mathcal {G}}\) that directly maximises \({P}({\mathcal {G}}|{\mathcal {D}})\). For discrete BNs, this means maximising a Bayesian–Dirichlet (BD) marginal likelihood: the most common choice is the Bayesian–Dirichlet equivalent uniform (BDeu) score from Heckerman et al. (1995). We will show that the uniform prior distribution over each \(\varTheta _{X_i}\) that underlies BDeu can be problematic from a Bayesian perspective, resulting in wildly different Bayes factors (and thus structure learning outcomes) depending on the value of its only hyperparameter, the imaginary sample size. We will further investigate this problem from an information theoretic perspective, on the grounds that Bayesian posterior inference can be framed as a particular case of the maximum relative entropy principle (ME; Shore and Johnson 1980; Skilling 1988; Caticha 2004). We find that BDeu is not a reliable network score when applied to sparse data because it can select overly complex networks over simpler ones given the same information in the prior and in the data; and that in the process it violates the maximum relative entropy principle. That does not appear to be the case for other BD scores, which arise from different priors.

The paper is organised as follows: In Sect. 2 we will review Bayesian score-based structure learning and BD scores. In Sect. 3 we will focus on BDeu, covering its underlying assumptions and issues reported in the literature. In particular, we will show with simple examples that BDeu can produce Bayes factors that are sensitive to the choice of its only hyperparameter, the imaginary sample size. In Sect. 4 we will derive the posterior expected entropy associated with a DAG \({\mathcal {G}}\), which we will further explore in Sect. 5. Finally, in Sect. 6 we will analyse BDeu using ME, and we will compare its behaviour with that of other BD scores.

2 Bayesian–Dirichlet marginal likelihoods

Score-based structure learning, when performed in a Bayesian framework, aims at finding a DAG \({\mathcal {G}}\) that has the highest posterior probability \({P}({\mathcal {G}}|{\mathcal {D}})\). Starting from (2), using Bayes’ theorem we can write,

$$\begin{aligned} {P}({\mathcal {G}}|{\mathcal {D}}) \propto {P}({\mathcal {G}}){P}({\mathcal {D}}|{\mathcal {G}}) = {P}({\mathcal {G}})\int {P}({\mathcal {D}}|{\mathcal {G}}, \varTheta ) {P}(\varTheta |{\mathcal {G}}) \,{\mathrm{d}}\varTheta \end{aligned}$$

where \({P}({\mathcal {G}})\) is the prior distribution over the space of the DAGs spanning the variables in \({\mathbf {X}}\) and \({P}({\mathcal {D}}|{\mathcal {G}})\) is the marginal likelihood of the data given \({\mathcal {G}}\) averaged over all possible parameter sets \(\varTheta\). \({P}({\mathcal {G}})\) is often taken to be uniform so that it simplifies when comparing DAGs; we will do the same in this paper for simplicity while noting that other default priors may lead to more accurate structure learning of sparse DAGs (e.g. Scutari 2016). Using (1) we can then decompose \({P}({\mathcal {D}}|{\mathcal {G}})\) into one component for each node as follows:

$$\begin{aligned} {P}({\mathcal {D}}|{\mathcal {G}}) = \prod _{i=1}^N {P}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}}\right) = \prod _{i=1}^N \left[ \int {P}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}}, \varTheta _{X_i}\right) {P}\left( \varTheta _{X_i}|\varPi _{X_i}^{\mathcal {G}}\right) \,{\mathrm{d}}\varTheta _{X_i}\right] . \end{aligned}$$
(3)

In the case of discrete BNs, we assume \(X_i |\varPi _{X_i}^{{\mathcal {G}}}\sim \text{Multinomial}(\varTheta _{X_i}|\varPi _{X_i}^{\mathcal {G}})\) where the parameters \(\varTheta _{X_i}|\varPi _{X_i}^{\mathcal {G}}\) are the conditional probabilities \(\pi _{ik |j}= {P}(X_i = k |\varPi _{X_i}^{{\mathcal {G}}}= j)\). We then assume a conjugate prior \(\varTheta _{X_i}|\varPi _{X_i}^{\mathcal {G}}\sim \text{Dirichlet}(\alpha _{ijk})\), \(\sum _{jk} \alpha _{ijk}= \alpha _i > 0\) to obtain the closed-form posterior \(\text{Dirichlet}(\alpha _{ijk}+ n_{ijk})\) which we use to estimate the \(\pi _{ik |j}\) from the counts \(n_{ijk}, \sum _{jk} n_{ijk} = n\) observed in \({\mathcal {D}}\). \(\alpha _i\) is known as the imaginary or equivalent sample size and determines how much weight is assigned to the prior in terms of the size of an imaginary sample supporting it.
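
As a minimal sketch of this conjugate update (with made-up counts for a single node and a uniform prior over \(\varTheta _{X_i}\)), the posterior Dirichlet parameters and the corresponding posterior-mean estimates of the \(\pi _{ik |j}\) can be computed as follows.

```python
import numpy as np

# made-up counts n_ijk for a node X_i with r_i = 3 states and q_i = 2 parent
# configurations: rows index the parent configurations j, columns the states k
n_ijk = np.array([[5, 2, 1],
                  [0, 3, 4]])

alpha_i = 6.0                                        # imaginary sample size
q_i, r_i = n_ijk.shape
a_ijk = np.full(n_ijk.shape, alpha_i / (r_i * q_i))  # uniform prior over Theta_Xi

posterior = a_ijk + n_ijk                            # Dirichlet(alpha_ijk + n_ijk)
pi_hat = posterior / posterior.sum(axis=1, keepdims=True)  # posterior means of pi_ik|j
print(pi_hat)
```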

Further assuming positivity (\(\pi _{ik |j}> 0\)), parameter independence (\(\pi _{ik |j}\) for different parent configurations are independent), parameter modularity (\(\pi _{ik |j}\) associated with different nodes are independent) and complete data, Heckerman et al. (1995) derived a closed form expression for (3), known as the Bayesian–Dirichlet (BD) family of scores:

$$\begin{aligned} {\text{BD}}({\mathcal {G}}, {\mathcal {D}}; \varvec{\alpha })&= {} \prod _{i=1}^N {\text{BD}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}}; \alpha _i\right) \\ &= {} \prod _{i=1}^N \prod _{j = 1}^{q_i} \left[ \frac{\varGamma (\alpha _{ij})}{\varGamma (\alpha _{ij} + n_{ij})} \prod _{k=1}^{r_i} \frac{\varGamma (\alpha _{ijk}+ n_{ijk})}{\varGamma (\alpha _{ijk})} \right] \end{aligned}$$
(4)

where:

  • \(r_i\) is the number of states of \(X_i\);

  • \(q_i\) is the number of possible configurations of values of \(\varPi _{X_i}^{{\mathcal {G}}}\), taken to be equal to 1 if \(X_i\) has no parents;

  • \(n_{ij} = \sum _{k = 1}^{r_i} n_{ijk}\);

  • \(\alpha _{ij}= \sum _{k = 1}^{r_i} \alpha _{ijk}\);

  • and \(\varvec{\alpha }= \{\alpha _1, \ldots , \alpha _N\}\), \(\alpha _i = \sum _{j = 1}^{q_i} \alpha _{ij}\) are the imaginary sample sizes associated with each \(X_i\).

Various choices for \(\alpha _{ijk}\) produce different priors and the corresponding scores in the BD family:

  • for \(\alpha _{ijk}= 1\) we obtain the K2 score from Cooper and Herskovits (1991);

  • for \(\alpha _{ijk}= {1}/{2}\) we obtain the BD score with Jeffreys prior (BDJ; Suzuki 2016);

  • for \(\alpha _{ijk}= \alpha / (r_i q_i)\) we obtain the BDeu score from Heckerman et al. (1995), which is the most common choice in the BD family and has \(\alpha _i = \alpha\) for all \(X_i\);

  • for \(\alpha _{ijk}= \alpha / (r_i \tilde{q}_i)\), where \(\tilde{q}_i\) is the number of \(\varPi _{X_i}^{{\mathcal {G}}}\) such that \(n_{ij} > 0\), we obtain the BD sparse (BDs) score recently proposed in Scutari (2016);

  • for the set \(\alpha _{ijk}^s = s / (r_i q_i)\), \(s \in S_L = \{2^{-L}, 2^{-L+1}, \ldots , 2^{L-1}, 2^{L}\}\), \(L \in {\mathbb {N}}\) we obtain the locally averaged BD score (BDla) from Cano et al. (2013).
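
As a concrete sketch (with made-up counts for a single node), the local score in (4) can be computed with the log-gamma function, and the scores listed above differ only in the hyperparameters \(\alpha _{ijk}\) that are plugged in; note that parent configurations with \(n_{ij} = 0\) contribute a factor of 1 to (4) whatever the value of \(\alpha _{ijk}\).

```python
import numpy as np
from scipy.special import gammaln

def log_bd_local(n_ijk, a_ijk):
    """log BD(X_i | Pi_i; alpha_i) as in (4); n_ijk and a_ijk are (q_i, r_i) arrays."""
    a_ij, n_ij = a_ijk.sum(axis=1), n_ijk.sum(axis=1)
    return float(np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
                 + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk)))

# made-up counts: q_i = 4 parent configurations, r_i = 2 states, one configuration
# of the parents is never observed in the data
n_ijk = np.array([[3, 1], [2, 2], [0, 4], [0, 0]])
q_i, r_i = n_ijk.shape
alpha = 1.0
q_tilde = int((n_ijk.sum(axis=1) > 0).sum())            # observed parent configurations

priors = {
    "K2":   np.ones((q_i, r_i)),                           # alpha_ijk = 1
    "BDJ":  np.full((q_i, r_i), 0.5),                      # alpha_ijk = 1/2
    "BDeu": np.full((q_i, r_i), alpha / (r_i * q_i)),      # alpha_ijk = alpha / (r_i q_i)
    "BDs":  np.full((q_i, r_i), alpha / (r_i * q_tilde)),  # alpha_ijk = alpha / (r_i q_tilde)
}
for name, a_ijk in priors.items():
    print(name, log_bd_local(n_ijk, a_ijk))
```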

BDeu is the only score-equivalent BD score (Chickering 1995), that is, it is the only BD score that takes the same value for DAGs in the same equivalence class. This property makes BDeu the preferred score when arcs are given a causal interpretation (Pearl 2009), and their directions have a meaningful interpretation beyond allowing the decomposition of \({P}({\mathbf {X}})\) into the \({P}(X_i |\varPi _{X_i}^{{\mathcal {G}}})\). BDs is only asymptotically score-equivalent because it converges to BDeu when \(n \rightarrow \infty\) and the positivity assumption holds. The BIC score, defined as:

$$\begin{aligned} {\text{BIC}}({\mathcal {G}}, {\mathcal {D}}) = \sum _{i = 1}^N {\text{BIC}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}}\right) = \sum _{i = 1}^N \left[ \log {P}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}}\right) - \frac{\log n}{2}q_i (r_i - 1)\right] \end{aligned}$$
(5)

is also score-equivalent and it converges to \(\log {\text{BDeu}}\) as \(n \rightarrow \infty\). In the case of discrete BNs, maximising BIC corresponds to selecting the BN with the minimum description length (MDL; Rissanen 1978).

3 BDeu and Bayesian model selection

The uniform prior associated with BDeu has been justified by the lack of prior knowledge on the \(\varTheta _{X_i}\), as well as its computational simplicity and score equivalence; and it was widely assumed to be non-informative (e.g. Silander et al. 2007; Heckerman et al. 1995).

However, there is increasing evidence that this prior is far from non-informative and that it has a strong impact on the accuracy of the learned DAGs, making its use on real-world data problematic. Silander et al. (2007) showed via simulation that the MAP DAGs selected using BDeu are highly sensitive to the choice of \(\alpha\). Even for “reasonable” values such as \(\alpha \in [1, 20]\), they obtained DAGs with markedly different number of arcs, and they showed that large values of \(\alpha\) tend to produce DAGs with more arcs. This is counter-intuitive because a larger \(\alpha\) would normally imply stronger regularisation and would be expected to produce sparser DAGs. Steck and Jaakkola (2003) similarly showed that the number of arcs in the MAP DAG is determined by a complex interaction between \(\alpha\) and \({\mathcal {D}}\); in the limits \(\alpha \rightarrow 0\) and \(\alpha \rightarrow \infty\) it is possible to obtain both very sparse and very dense DAGs. (We informally define \({\mathcal {G}}\) to be sparse if \(|A| = O(N)\), typically with \(|A| < 5N\); a dense \({\mathcal {G}}\), on the other hand, has a relatively large |A| compared to N.) In particular, for small values of \(\alpha\) and/or sparse data (that is, discrete data for which we observe a small subset of the possible combinations of the values of the \(X_i\)), \(\alpha _{ijk}\rightarrow 0\) and

$$\begin{aligned} \lim _{\alpha _{ijk}\rightarrow 0} {\text{BDeu}}({\mathcal {G}}, {\mathcal {D}}; \alpha ) - \alpha ^{d_{\mathrm {EP}}^{({\mathcal {G}})}} = 0 \end{aligned}$$
(6)

where \(d_{\mathrm {EP}}^{({\mathcal {G}})}\) is the effective number of parameters of the model, defined as

$$\begin{aligned} d_{\mathrm {EP}}^{({\mathcal {G}})} = \sum _{i = 1}^N d_{\mathrm {EP}}^{(X_i, {\mathcal {G}})} = \sum _{i = 1}^N \left[ \sum _{j = 1}^{q_i} \tilde{r}_{ij} - \tilde{q}_i\right] ; \end{aligned}$$

\(\tilde{r}_{ij}\) is the number of positive counts \(n_{ijk}\) in the jth configuration of \(\varPi _{X_i}^{{\mathcal {G}}}\) and \(\tilde{q}_i\) is the number of configurations in which at least one \(n_{ijk}\) is positive.

This was then used to prove that the Bayes factor

$$\begin{aligned} \frac{{P}({\mathcal {D}}|{\mathcal {G}}^{+})}{{P}({\mathcal {D}}|{\mathcal {G}}^{-})} = \frac{{\text{BDeu}}(X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha )}{{\text{BDeu}}(X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha )} \rightarrow \left\{ \begin{array}{ll} 0 &{}\quad \text{if } d_{\mathrm {EDF}} > 0\\ +\infty &{}\quad \text{if } d_{\mathrm {EDF}} < 0 \end{array} \right. \end{aligned}$$
(7)

for two DAGs \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) that differ only by the inclusion of a single parent for \(X_i\). The effective degrees of freedom \(d_{\mathrm {EDF}}\) are defined as \(d_{\mathrm {EP}}^{({\mathcal {G}}^{+})} - d_{\mathrm {EP}}^{({\mathcal {G}}^{-})}\). The practical implication of this result is that, when we compare two DAGs using their BDeu scores, a large number of zero counts will force \(d_{\mathrm {EDF}}\) to be negative and favour the inclusion of additional arcs (since \({\text{BDeu}}(X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha ) \gg {\text{BDeu}}(X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha )\)). But that in turn makes \(d_{\mathrm {EDF}}\) even more negative, quickly leading to overfitting. Furthermore, Steck and Jaakkola (2003) argued that BDeu can be rather unstable for “medium-sized” data and small \(\alpha\), which is a very common scenario.
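
A small sketch (with made-up counts) of how \(d_{\mathrm {EP}}^{(X_i, {\mathcal {G}})}\) and \(d_{\mathrm {EDF}}\) can be computed from the observed counts follows; here adding a parent spreads the same observations over more configurations, \(d_{\mathrm {EDF}}\) becomes negative, and (7) favours the larger DAG.

```python
import numpy as np

def d_ep(n_ijk):
    """Effective number of parameters of X_i | Pi_i: for each observed parent
    configuration count the positive n_ijk, then subtract the number of observed
    configurations (tilde q_i)."""
    observed = n_ijk.sum(axis=1) > 0
    r_tilde_ij = (n_ijk[observed] > 0).sum(axis=1)
    return int(r_tilde_ij.sum() - observed.sum())

# made-up counts for X_i under the smaller DAG (2 parent configurations) and the
# larger DAG, where an extra parent splits the same data over 4 configurations
n_minus = np.array([[3, 1], [0, 4]])
n_plus  = np.array([[3, 0], [0, 1], [0, 4], [0, 0]])

d_edf = d_ep(n_plus) - d_ep(n_minus)
print(d_ep(n_minus), d_ep(n_plus), d_edf)   # 1, 0, -1: the zero counts favour G+
```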

Steck (2008) approached the problem from a different perspective and derived an analytic approximation for the “optimal” value of \(\alpha\) that maximises predictive accuracy, further suggesting that the interplay between \(\alpha\) and \({\mathcal {D}}\) is controlled by the skewness of the \(\varTheta _{X_i}\) and by the strength of the dependence relationships between the nodes. Skewed \(\varTheta _{X_i}\) result in some \(\pi _{ik |j}\) being smaller than others, which in turn makes sparse data sets more likely; hence the problematic behaviour described in Steck and Jaakkola (2003) and reported above. Most of these results have been analytically confirmed more recently by Ueno (2010, 2011). The key insight provided by the latter paper is that we can decompose the BDeu into a likelihood term that depends on the data and a prior term that does not:

$$\begin{aligned} \log {\text{BDeu}}(X_i |\varPi _{X_i}^{{\mathcal {G}}}; \alpha )= & {} \underbrace{\sum _{j=1}^{q_i} \left( \log \varGamma (\alpha _{ij}) - \sum _{k=1}^{r_i} \log \varGamma (\alpha _{ijk}) \right) }_{\mathrm{prior\,term}} \\ &+\underbrace{\sum _{j=1}^{q_i} \left( \sum _{k=1}^{r_i} \log \varGamma (\alpha _{ijk}+ n_{ijk}) - \log \varGamma (\alpha _{ij}+ n_{ij}) \right) }_{{\mathrm{likelihood\, term}}}. \end{aligned}$$

Then if \(\alpha _{ijk}< 1\) (that is, \(\alpha < r_i q_i\)) the prior term can be approximated by

$$\begin{aligned} \sum _{j=1}^{q_i} \left( \log \varGamma (\alpha _{ij}) - \sum _{k=1}^{r_i} \log \varGamma (\alpha _{ijk}) \right) \approx q_i (r_i - 1) \log \alpha _{ijk}\end{aligned}$$

and quickly dominates the likelihood term, penalising complex BNs as the number of parameters increases, which explains why BDeu selects empty DAGs in the limit of \(\alpha _{ijk}\rightarrow 0\). On the other hand, if all \(\alpha _{ijk}> 1\) then the prior term can be approximated by

$$\begin{aligned} \sum _{j=1}^{q_i} \left( \log \varGamma (\alpha _{ij}) - \sum _{k=1}^{r_i} \log \varGamma (\alpha _{ijk}) \right) \approx \alpha \log r_i + \frac{1}{2} q_i(r_i - 1)\log \frac{\alpha _{ijk}}{2\pi }, \end{aligned}$$

leading BDeu to select a complete DAG when \(\alpha _{ijk}\rightarrow \infty\) (and therefore \(\alpha \rightarrow \infty\)) as previously reported in Silander et al. (2007).
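
These two approximations can be checked numerically; the following sketch (with arbitrary, made-up \(r_i\), \(q_i\) and \(\alpha _{ijk}\)) compares the exact prior term with the expressions quoted above, which agree up to additive terms that remain bounded as \(\alpha _{ijk}\rightarrow 0\) or \(\alpha _{ijk}\rightarrow \infty\).

```python
import numpy as np
from scipy.special import gammaln

def prior_term(a_ijk, r_i, q_i):
    """Exact prior term of log BDeu, with alpha_ij = r_i * alpha_ijk for every j."""
    return q_i * (gammaln(r_i * a_ijk) - r_i * gammaln(a_ijk))

r_i, q_i = 3, 4
for a_ijk in [1e-2, 1e-4, 1e-6]:   # small alpha_ijk: compare with q_i (r_i - 1) log(alpha_ijk)
    print(prior_term(a_ijk, r_i, q_i), q_i * (r_i - 1) * np.log(a_ijk))
for a_ijk in [10.0, 1e3, 1e5]:     # large alpha_ijk: compare with the second approximation
    alpha = a_ijk * r_i * q_i      # total imaginary sample size
    approx = alpha * np.log(r_i) + 0.5 * q_i * (r_i - 1) * np.log(a_ijk / (2 * np.pi))
    print(prior_term(a_ijk, r_i, q_i), approx)
```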

As for the likelihood term, Ueno (2011) notes that if \(\alpha + n\) is sufficiently small, that is, for sparse samples and small imaginary sample sizes,

$$\begin{aligned} \sum _{j=1}^{q_i} \left( \sum _{k=1}^{r_i} \log \varGamma (\alpha _{ijk}+ n_{ijk}) - \log \varGamma (\alpha _{ij}+ n_{ij}) \right) \approx - q_i(r_i - 1)\log \alpha _{ijk}. \end{aligned}$$

Hence if some \(n_{ijk} = 0\), the change of the likelihood term dominates the prior term and BDeu adds extra arcs, leading to dense DAGs. On the other hand, if \(\alpha + n\) is sufficiently large, \(\alpha\) actually acts as an imaginary sample supporting the uniform distribution of the parameters assumed in the prior. This explains the observations in Steck (2008): the optimal \(\alpha\) should be large when the empirical distribution of the \(\varTheta _{X_i}\) is uniform because the prior is correct; and it should be small when the empirical distribution of \(\varTheta _{X_i}\) is skewed so that the prior can be quickly dominated. This is also the source of the sensitivity of BDeu to the choice of \(\alpha\) reported in Steck and Jaakkola (2003).

Finally, Suzuki (2016) studied the asymptotic properties of BDeu by contrasting it with BDJ. He found that BDeu is not regular in the sense that it may learn DAGs in a way that is not consistent with either the MDL principle (through BIC) or the ranking of those DAGs given by their entropy. Whether this happens depends on the values of the underlying \(\pi _{ik |j}\), even if the positivity assumption holds and if n is large. This agrees with the observations in Ueno (2010), who also observed that BDeu is not necessarily consistent for any finite n, but only asymptotically for \(n \rightarrow \infty\).

Around the same time, a possible solution to these problems was proposed by Scutari (2016) in the form of BDs. Scutari (2016) starts from the consideration that if the sample \({\mathcal {D}}\) is sparse, some configurations of the variables will not be observed; it may be that the sample size is small and those configurations have low probability, or it may be that \({\mathbf {X}}\) violates the positivity assumption (\(\pi _{ik |j}= 0\) for some ijk). As a result, we may be unable to observe all the configurations of (say) \(\varPi _{X_i}^{{\mathcal {G}}^{+}}\) in the data. Then the corresponding \(n_{ij} = 0\) and

$$\begin{aligned} \frac{\varGamma (\alpha _{ij})}{\varGamma (\alpha _{ij}+ n_{ij})} \prod _{k=1}^{r_i} \frac{\varGamma (\alpha _{ijk}+ n_{ijk})}{\varGamma (\alpha _{ijk})} = \frac{\varGamma (\alpha _{ij})}{\varGamma (\alpha _{ij})} \prod _{k=1}^{r_i} \frac{\varGamma (\alpha _{ijk})}{\varGamma (\alpha _{ijk})} = 1. \end{aligned}$$

The effective imaginary sample size, defined as the sum of the \(\alpha _{ijk}\) appearing in terms that do not simplify (and thus contribute to the value of BDeu), decreases to \(\sum _{j: n_{ij} > 0} \alpha _{ijk}= \alpha (\tilde{q}_i/ q_i) < \alpha\), where \(\tilde{q}_i< q_i\) is the number of parent configurations that are actually observed in \({\mathcal {D}}\). In other words, BDeu is computed with an imaginary sample size of \(\alpha (\tilde{q}_i/ q_i)\) instead of \(\alpha\), and the remaining \(\alpha (q_i - \tilde{q}_i) / q_i\) is effectively lost. This may lead to comparing DAGs with marginal likelihoods computed from different priors, which is incorrect from a Bayesian perspective. In order to prevent this from happening, Scutari (2016) replaced the prior of BDeu with

$$\begin{aligned} \alpha _{ijk} = \left\{ \begin{array}{ll} \alpha / (r_i \tilde{q}_i)&{} \quad \text{if } n_{ij}> 0 \\ 0 &{}\quad \text{otherwise}, \end{array} \right. \quad \text{where } \tilde{q}_i= \{ \text{number of } \varPi _{X_i}^{{\mathcal {G}}^{+}}\text{ such that } n_{ij} > 0 \}. \end{aligned}$$

thus obtaining

$$\begin{aligned} {\text{BDs}}(X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha ) = \prod _{j : n_{ij} > 0} \left[ \frac{\varGamma (r_i \alpha _{ijk})}{\varGamma (r_i \alpha _{ijk}+ n_{ij})} \prod _{k=1}^{r_i} \frac{\varGamma (\alpha _{ijk}+ n_{ijk})}{\varGamma (\alpha _{ijk})} \right] . \end{aligned}$$
(8)

A large simulation study showed BDs to be more accurate than BDeu in learning BN structures without any loss in predictive power.
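
A minimal sketch (with made-up counts in which half of the candidate parent configurations are unobserved) of the effective imaginary sample size retained by BDeu, and of how the BDs hyperparameters preserve the full \(\alpha\):

```python
import numpy as np

# made-up counts: 8 candidate parent configurations of X_i, only 4 observed
n_ijk = np.vstack([np.tile([1, 2], (4, 1)), np.zeros((4, 2), dtype=int)])
q_i, r_i = n_ijk.shape
alpha = 1.0
observed = n_ijk.sum(axis=1) > 0
q_tilde = int(observed.sum())

# BDeu: alpha_ijk = alpha / (r_i q_i) everywhere, but only the observed
# configurations contribute to (4), so the effective imaginary sample size shrinks
a_bdeu = np.full(n_ijk.shape, alpha / (r_i * q_i))
print(a_bdeu[observed].sum())     # 0.5 = alpha * q_tilde / q_i

# BDs: alpha_ijk = alpha / (r_i q_tilde) on the observed configurations, 0 elsewhere
a_bds = np.zeros_like(n_ijk, dtype=float)
a_bds[observed] = alpha / (r_i * q_tilde)
print(a_bds.sum())                # 1.0: the imaginary sample size is preserved
```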

In addition to all these issues, we also find that BDeu produces Bayes factors that are sensitive to the choice of \(\alpha\). (The fact that BDeu is sensitive to the value of \(\alpha\) does not necessarily imply that the Bayes factor is itself sensitive.) In order to illustrate this instability and the other results presented in this section, we consider the simple examples below.

Fig. 1

DAGs and data sets used in Examples 1 and 2. The DAGs \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) are used in both examples. Example 1 uses data set \({\mathcal {D}}_1\), while Example 2 uses \({\mathcal {D}}_2\); the latter is a modified version of the former, which is originally from Suzuki (2016)

Example 1

Consider the DAGs \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) and the data set \({\mathcal {D}}_1\) in Fig. 1, originally from Suzuki (2016). The relevant sample frequencies are the \(n_{ijk}\) for \(X |\varPi _{X}^{{\mathcal {G}}^{-}} = \{ Z, W \}\) and for \(X |\varPi _{X}^{{\mathcal {G}}^{+}} = X |\varPi _{X}^{{\mathcal {G}}^{-}} \cup \{ Y \}\); each observed parent configuration contains \(n_{ij} = 3\) observations.

The conditional distributions of \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) are both singular, and in \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) we only observe 4 parent configurations out of 8. Furthermore, the observed conditional distributions for those parent configurations are identical to the 4 conditional distributions in \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\), since the \(n_{ijk}\) are the same. We can then argue that \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) does not fit \({\mathcal {D}}_1\) any better than \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\), and it does not capture any additional information from the data.

However, if we take \(\alpha = 1\) in BDeu, then \(\alpha _{ijk}= 1/8\) for \({\mathcal {G}}^{-}\) and \(\alpha _{ijk}= 1/16\) for \({\mathcal {G}}^{+}\), leading to

$$\begin{aligned} {\text{BDeu}}\left( X |\varPi _{X}^{{\mathcal {G}}^{-}}; 1\right)= & {} \left( \frac{\varGamma ({1}/{4})}{\varGamma ({1}/{4}+3)} \left[ \frac{\varGamma ({1}/{8} + 0)}{\varGamma ({1}/{8})} \cdot \frac{\varGamma ({1}/{8} + 3)}{\varGamma ({1}/{8})} \right] \right) ^4 = 0.0326, \\ {\text{BDeu}}\left( X |\varPi _{X}^{{\mathcal {G}}^{+}}; 1\right)= & {} \left( \frac{\varGamma ({1}/{8})}{\varGamma ({1}/{8}+3)} \left[ \frac{\varGamma ({1}/{16} + 0)}{\varGamma ({1}/{16})} \cdot \frac{\varGamma ({1}/{16} + 3)}{\varGamma ({1}/{16})} \right] \right) ^4 = 0.0441. \end{aligned}$$

If we choose the DAG with the highest BDeu score, we prefer \({\mathcal {G}}^{+}\) to \({\mathcal {G}}^{-}\) despite all the considerations we have just made on the data. This is not the case if we use BDs, which does not show a preference for either \({\mathcal {G}}^{-}\) or \({\mathcal {G}}^{+}\) because \(\alpha _{ijk}= 1/8\) for both \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\):

$$\begin{aligned} {\text{BDs}}\left( X |\varPi _{X}^{{\mathcal {G}}^{-}}; 1\right) = {\text{BDs}}\left( X |\varPi _{X}^{{\mathcal {G}}^{+}}; 1\right) = \left( \frac{\varGamma ({1}/{4})}{\varGamma ({1}/{4}+3)} \left[ \frac{\varGamma ({1}/{8} + 0)}{\varGamma ({1}/{8})} \cdot \frac{\varGamma ({1}/{8} + 3)}{\varGamma ({1}/{8})} \right] \right) ^4 = 0.0326. \end{aligned}$$

The same holds for BDJ, and in general for any BD score with a constant \(\alpha _{ijk}\). Comparing the expressions above, it is apparent that the only difference between them is the value of \(\alpha _{ijk}\), which is a consequence of the different number of configurations of \(\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(\varPi _{X}^{{\mathcal {G}}^{+}}\).

The Bayes factors for BDeu and BDs are shown for \(\alpha \in [10^{-4}, 10^{4}]\) in the left panel of Fig. 2. The former converges to 1 for both \(\alpha _{ijk}\rightarrow 0\) and \(\alpha _{ijk}\rightarrow \infty\), but varies between 1 and 2.5 for finite \(\alpha\); whereas the latter is equal to 1 for all values of \(\alpha\), never showing a preference for either \({\mathcal {G}}^{-}\) or \({\mathcal {G}}^{+}\). The Bayes factor for BDeu does not diverge nor converge to zero, which is consistent with (7) from Steck and Jaakkola (2003) as \(d_{\mathrm {EP}}^{({\mathcal {G}}^{+})} - d_{\mathrm {EP}}^{({\mathcal {G}}^{-})} = 0 - 0 = 0\). However, it varies most quickly for \(\alpha \in [1, 10]\), exactly the range of the most common values used in practical applications. This provides further evidence supporting the conclusions of Steck and Jaakkola (2003), Steck (2008) and Silander et al. (2007).

Finally, if we consider which DAG would be preferred according to the MDL principle, we can see that BIC (unlike BDeu, and like BDs) does not express a preference for either DAG:

$$\begin{aligned} {\text{BIC}}\left( X |\varPi _{X}^{{\mathcal {G}}^{-}}\right) = \log {P}\left( X |\varPi _{X}^{{\mathcal {G}}^{-}}\right) - 0 = \log {P}\left( X |\varPi _{X}^{{\mathcal {G}}^{+}}\right) - 0 = {\text{BIC}}\left( X |\varPi _{X}^{{\mathcal {G}}^{+}}\right) \end{aligned}$$

which agrees with Suzuki (2016)’s observation that BDeu violates the MDL principle. \(\square\)

Example 2

Consider another simple example, inspired by Example 1, based on the data set \({\mathcal {D}}_2\) and the DAGs \({\mathcal {G}}^{-}\), \({\mathcal {G}}^{+}\) shown in Fig. 1. The relevant sample frequencies are again the \(n_{ijk}\) for \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and for \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\); each observed parent configuration contains \(n_{ij} = 3\) observations, now split between the two values of \(X\).

As in Example 1, 4 parent configurations out of 8 are not observed in \({\mathcal {G}}^{+}\) and the other 4 have \(n_{ijk}\) that are the same as those arising from \({\mathcal {G}}^{-}\). The resulting conditional distributions, however, are not singular. If we take again \(\alpha = 1\), the BDeu scores for \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) are different but this time \({\mathcal {G}}^{-}\) has the highest score:

$$\begin{aligned} {\text{BDeu}}\left( X |\varPi _{X}^{{\mathcal {G}}^{-}}; 1\right)= & {} \left( \frac{\varGamma ({1}/{4})}{\varGamma ({1}/{4}+3)} \left[ \frac{\varGamma ({1}/{8} + 2)}{\varGamma ({1}/{8})} \cdot \frac{\varGamma ({1}/{8} + 1)}{\varGamma ({1}/{8})} \right] \right) ^4 = 3.906\times 10^{-7}, \\ {\text{BDeu}}\left( X |\varPi _{X}^{{\mathcal {G}}^{+}}; 1\right)= & {} \left( \frac{\varGamma ({1}/{8})}{\varGamma ({1}/{8}+3)} \left[ \frac{\varGamma ({1}/{16} + 2)}{\varGamma ({1}/{16})} \cdot \frac{\varGamma ({1}/{16} + 1)}{\varGamma ({1}/{16})} \right] \right) ^4 = 3.721\times 10^{-8}. \end{aligned}$$

On the other hand, in BDs \(\alpha _{ijk}= 1/8\) for both DAGs, so they have the same score:

$$\begin{aligned} {\text{BDs}}\left( X |\varPi _{X}^{{\mathcal {G}}^{-}}; 1\right) = {\text{BDs}}\left( X |\varPi _{X}^{{\mathcal {G}}^{+}}; 1\right) = \left( \frac{\varGamma ({1}/{4})}{\varGamma ({1}/{4}+3)} \left[ \frac{\varGamma ({1}/{8} + 2)}{\varGamma ({1}/{8})} \cdot \frac{\varGamma ({1}/{8} + 1)}{\varGamma ({1}/{8})} \right] \right) ^4 = 3.906\times 10^{-7}. \end{aligned}$$

BDeu once more assigns different scores to \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) despite the fact that the observed conditional distributions in \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) are the same, while BDs does not.

The Bayes factors for BDeu and BDs are shown in the right panel of Fig. 2. BDeu results in wildly different values depending on the choice of \(\alpha\), with Bayes factors that vary between 0.05 and 1 for small changes of \(\alpha \in [1, 10]\); BDs always gives a Bayes factor of 1. Again \(d_{\mathrm {EP}}^{({\mathcal {G}}^{+})} - d_{\mathrm {EP}}^{({\mathcal {G}}^{-})} = 4 - 4 = 0\), which agrees with the fact that the Bayes factor for BDeu does not diverge or converge to zero; and \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) have the same BIC score, so BDeu (but not BDs) violates the MDL principle in this example as well. \(\square\)
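
The BDeu and BDs values in Examples 1 and 2 can be reproduced numerically; this is a minimal sketch that assumes the counts implied by the expressions above (in both examples each observed parent configuration contains three observations, concentrated on a single value of \(X\) in Example 1 and split 1–2 in Example 2).

```python
import numpy as np
from scipy.special import gammaln

def bd_local(n_ijk, a_ijk):
    """BD(X_i | Pi_i) as in (4); configurations with n_ij = 0 contribute a factor of 1."""
    a_ij, n_ij = a_ijk.sum(axis=1), n_ijk.sum(axis=1)
    return float(np.exp(np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
                        + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk))))

def counts(row, unobserved=0):
    """Four observed parent configurations with identical counts `row`, plus
    `unobserved` additional configurations with no observations."""
    return np.vstack([np.tile(row, (4, 1)), np.zeros((unobserved, 2), dtype=int)])

for label, row in [("Example 1", [0, 3]), ("Example 2", [1, 2])]:
    gm, gp = counts(row), counts(row, unobserved=4)      # X | {Z, W} and X | {Z, W, Y}
    bdeu_m = bd_local(gm, np.full(gm.shape, 1 / 8))      # BDeu, alpha = 1, q_i = 4
    bdeu_p = bd_local(gp, np.full(gp.shape, 1 / 16))     # BDeu, alpha = 1, q_i = 8
    bds_p  = bd_local(gp, np.full(gp.shape, 1 / 8))      # BDs,  alpha = 1, q_tilde = 4
    print(label, bdeu_m, bdeu_p, bds_p)
# Example 1: 0.0326, 0.0441, 0.0326 -- BDeu prefers G+, BDs is indifferent;
# Example 2: 3.906e-07, 3.721e-08, 3.906e-07 -- BDeu now prefers G-.
```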

Fig. 2

The Bayes factors for \({\mathcal {G}}^{-}\) versus \({\mathcal {G}}^{+}\) computed using BDeu and BDs for Example 1 (left panel) and Example 2 (right panel), in orange and dark blue, respectively. The bullet points correspond to the values observed for \(\alpha = 1\)

4 Bayesian structure learning and entropy

Shannon’s classic definition of entropy for a multinomial random variable \(X \sim \text{Multinomial}(\varvec{\pi })\) with a fixed, finite set of states (alphabet) \({\mathcal {A}}\) is:

$$\begin{aligned} {H}(X; \varvec{\pi }) = {E}(-\log {P}(X)) = -\sum _{a \in {\mathcal {A}}} \pi _a \log \pi _a \end{aligned}$$

where the probabilities \(\pi _a\) are typically estimated with the empirical frequency of each a in \({\mathcal {D}}\), leading to the empirical entropy estimator. Its properties are detailed in canonical information theory books such as Mackay (2003) and Rissanen (2007), and it has often been used in BN structure learning (Lam and Bacchus 1994; Suzuki 2015). However, in this paper we will focus on Bayesian entropy estimators, for two reasons. Firstly, they are a natural choice when studying the properties of BD scores since they are Bayesian in nature; and having the same probabilistic assumptions (including the choice of prior distribution) for the BD scores and for the corresponding entropy estimators makes it easy to link their properties. Secondly, Bayesian entropy estimators have better theoretical and empirical properties than the empirical estimator (Hausser and Strimmer 2009; Nemenman et al. 2002).

Starting from (1), for a BN we can write:

$$\begin{aligned} {H}^{{\mathcal {G}}}({\mathbf {X}}; \varTheta ) = \sum _{i = 1}^N {H}^{{\mathcal {G}}}(X_i; \varTheta _{X_i}), \end{aligned}$$

where \({H}^{{\mathcal {G}}}(X_i; \varTheta _{X_i})\) is the entropy of \(X_i\) given its parents \(\varPi _{X_i}^{{\mathcal {G}}}\). The marginal posterior expectation of \({H}^{{\mathcal {G}}}(X_i; \varTheta _{X_i})\) with respect to \(\varTheta _{X_i}\) given the data can then be expressed as:

$$\begin{aligned} {E}({H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}) = \int {H}^{{\mathcal {G}}}(X_i; \varTheta _{X_i}) {P}(\varTheta _{X_i}|{\mathcal {D}}) \,{\mathrm{d}}\varTheta _{X_i}\end{aligned}$$

where we use \({\mathcal {D}}\) to refer specifically to the observed values for \(X_i\) and \(\varPi _{X_i}^{{\mathcal {G}}}\) with a slight abuse of notation. We can then introduce a \(\text{Dirichlet}(\alpha _{ijk})\) prior over \(\varTheta _{X_i}\) with:

$$\begin{aligned} {P}(\varTheta _{X_i}|{\mathcal {D}}) = \int {P}(\varTheta _{X_i}|{\mathcal {D}}, \alpha _{ijk}) {P}(\alpha _{ijk}|{\mathcal {D}}) \,{\mathrm{d}}\alpha _{ijk}, \end{aligned}$$

which leads to

$$\begin{aligned} {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}\right)= & {} \iint {H}^{{\mathcal {G}}}(X_i; \varTheta _{X_i}) {P}(\varTheta _{X_i}|{\mathcal {D}}, \alpha _{ijk}) {P}(\alpha _{ijk}|{\mathcal {D}}) \,{\mathrm{d}}\alpha _{ijk}\,{\mathrm{d}}\varTheta _{X_i} \\ \propto & {} \int {E}({H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}, \alpha _{ijk}) {P}({\mathcal {D}}|\alpha _{ijk}) {P}(\alpha _{ijk}) \,{\mathrm{d}}\alpha _{ijk}, \end{aligned}$$
(9)

where \({P}(\alpha _{ijk})\) is a hyperprior distribution over the space of the Dirichlet priors, identified by their parameter sets \(\{\alpha _{ijk}\}\).

The first term on the right hand-side of (9) is the posterior expectation of:

$$\begin{aligned} {H}^{{\mathcal {G}}}(X_i | {\mathcal {D}}, \alpha _{ijk}) = - \sum _{j = 1}^{q_i} \sum _{k = 1}^{r_i} p_{ik |j}^{(\alpha _{ijk})} \log p_{ik |j}^{(\alpha _{ijk})}\quad \text{with}\ p_{ik |j}^{(\alpha _{ijk})} = \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}} \end{aligned}$$
(10)

and has closed form

$$\begin{aligned}& {E}({H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}, \alpha _{ijk}) \\ &\quad =\sum _{j = 1}^{q_i} \left[ \psi _0(\alpha _{ij}+ n_{ij} + 1) - \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}} \psi _0(\alpha _{ijk}+ n_{ijk} + 1) \right] \end{aligned}$$
(11)

as shown in Nemenman et al. (2002) and Archer et al. (2014), with \(\psi _0(\cdot )\) denoting the digamma function. The second term follows a Dirichlet-multinomial distribution (also known as multivariate Polya; Johnson et al. 1997) with density

$$\begin{aligned} {P}({\mathcal {D}}|\alpha _{ijk}) = \prod _{j = 1}^{q_i} \frac{n_{ij}! \, \varGamma (\alpha _{ij})}{\varGamma (\alpha _{ijk})^{r_i} \varGamma (n_{ij} + \alpha _{ij})} \prod _{k = 1}^{r_i} \frac{\varGamma (n_{ijk} + \alpha _{ijk})}{n_{ijk}!}, \end{aligned}$$
(12)

since,

$$\begin{aligned} {P}({\mathcal {D}}|\alpha _{ijk}) = \int {P}({\mathcal {D}}|\varTheta _{X_i}) {P}(\varTheta _{X_i}|\alpha _{ijk}) \,{\mathrm{d}}\varTheta _{X_i}\end{aligned}$$

where \({P}({\mathcal {D}}|\varTheta _{X_i})\) follows a multinomial distribution and \({P}(\varTheta _{X_i}|\alpha _{ijk})\) is a conjugate Dirichlet prior. Rearranging terms in (12) we find that,

$$\begin{aligned} {P}({\mathcal {D}}|\alpha _{ijk})= & {} \prod _{j = 1}^{q_i} \frac{n_{ij}!}{\prod _{k = 1}^{r_i} n_{ijk}!} \\ &\cdot \,\prod _{j = 1}^{q_i} \frac{\varGamma (\alpha _{ij})}{\varGamma (n_{ij} + \alpha _{ij})} \prod _{k = 1}^{r_i} \frac{\varGamma (n_{ijk} + \alpha _{ijk})}{\varGamma (\alpha _{ijk})} \propto {\text{BD}}\left( X_i |\varPi _{X_i}^{\mathcal {G}}; \alpha \right) \end{aligned}$$
(13)

making the link between BD scores and entropy explicit. Unlike (13), BD has a prequential formulation (Dawid 1984) which focuses on the sequential prediction of future events (Chickering and Heckerman 2000). For this reason it considers observations as coming in a specific temporal order and it does not include a multinomial coefficient, which we will drop in the remainder of the paper. Therefore,

$$\begin{aligned} {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}\right) = \int {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) {\text{BD}}\left( X_i |\varPi _{X_i}^{\mathcal {G}}; \alpha \right) {P}(\alpha _{ijk}) \,{\mathrm{d}}\alpha _{ijk}, \end{aligned}$$
(14)

and is determined by three components: the posterior expected entropy of \(X_i |\varPi _{X_i}^{{\mathcal {G}}}\) under a \(\text{Dirichlet}(\alpha _{ijk})\) prior, the BD score term for \(X_i |\varPi _{X_i}^{{\mathcal {G}}}\), and the hyperprior over the space of the Dirichlet priors.

This definition of the expected entropy associated with the structure \({\mathcal {G}}\) of a BN is very general and encompasses the entropies associated with all the BD scores as special cases. In particular, the entropy associated with K2, BDJ, BDeu and BDs can be obtained by assigning probability \({P}(\alpha _{ijk}) = 1\) to the single set of \(\alpha _{ijk}\) associated with the corresponding prior, leading to,

$$\begin{aligned} {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}\right) = {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) {\text{BD}}\left( X_i |\varPi _{X_i}^{\mathcal {G}}; \alpha \right) ; \end{aligned}$$

and similarly for BDla,

$$\begin{aligned} {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}\right) = \frac{1}{|S_L|} \sum _{s \in S_L} {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}, \alpha _{ijk}^s\right) {\text{BD}}\left( X_i |\varPi _{X_i}^{\mathcal {G}}; s\right) . \end{aligned}$$
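
The fixed-prior case can be computed directly; the sketch below (with made-up counts for a single node) evaluates the posterior expected entropy in (11) with the digamma function, so that multiplying it by the corresponding BD score gives the expected entropy above.

```python
import numpy as np
from scipy.special import digamma

def expected_entropy(n_ijk, a_ijk):
    """Posterior expected entropy of X_i | Pi_i under a Dirichlet(alpha_ijk) prior,
    as in (11); n_ijk and a_ijk are (q_i, r_i) arrays of counts and hyperparameters."""
    post = a_ijk + n_ijk                       # Dirichlet posterior parameters
    post_j = post.sum(axis=1)                  # alpha_ij + n_ij
    return float(np.sum(digamma(post_j + 1)
                        - np.sum(post / post_j[:, None] * digamma(post + 1), axis=1)))

# made-up counts: 4 parent configurations of a binary X_i, 3 observations each
n_ijk = np.tile([0, 3], (4, 1))
print(expected_entropy(n_ijk, np.full(n_ijk.shape, 1 / 8)))   # ~0.393, cf. Example 1 in Sect. 6
```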

5 The posterior marginal entropy

The posterior expectation of the entropy for a given \(\text{Dirichlet}(\alpha _{ijk})\) prior in (11), despite having a form that looks very different from the marginal posterior entropy in (10), can be written in terms of the latter as we show in the following lemma.

Lemma 1

$$\begin{aligned} {E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}; \alpha _{ijk}\right) \approx {H}^{{\mathcal {G}}}(X_i |{\mathcal {D}}, \alpha _{ijk}) - \sum _{j = 1}^{q_i} \frac{r_i - 1}{2(\alpha _{ij}+ n_{ij})}. \end{aligned}$$

Proof of Lemma 1

Combining \(\psi _0(z + 1) = \psi _0(z) + 1/z\) with \(\psi _0(z) = \log (z) - 1/(2z) + o(z^{-2})\) from Anderson and Qiu (1997), we can write \(\psi _0(z + 1) = \log (z) + 1/(2z) + o(z^{-2})\). Dropping the remainder term \(o(z^{-2})\) we approximate \(\psi _0(z + 1) \approx \log (z) + 1/(2z)\), which leads to,

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) \\ &\quad = \sum _{j = 1}^{q_i} \left[ \psi _0(\alpha _{ij}+ n_{ij} + 1) - \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}} \psi _0(\alpha _{ijk}+ n_{ijk} + 1) \right] \\ &\quad \approx \sum _{j = 1}^{q_i} \left[ \log (\alpha _{ij}+ n_{ij}) + \frac{1}{2(\alpha _{ij}+ n_{ij})} \right. \\ &\qquad -\left. \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}} \left( \log (\alpha _{ijk}+ n_{ijk}) + \frac{1}{2(\alpha _{ijk}+ n_{ijk})} \right) \right] \\ &\quad = -\sum _{j = 1}^{q_i} \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}} \log \left( \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}}\right) - \sum _{j = 1}^{q_i} \frac{r_i - 1}{2(\alpha _{ij}+ n_{ij})} \\&\quad = {H}^{{\mathcal {G}}}(X_i |{\mathcal {D}}, \alpha _{ijk}) - \sum _{j = 1}^{q_i} \frac{r_i - 1}{2(\alpha _{ij}+ n_{ij})}. \end{aligned}$$

\(\square\)

Therefore, \({E}({H}^{{\mathcal {G}}}(X_i) |{\mathcal {D}}; \alpha _{ijk})\) is well approximated by the marginal posterior entropy \({H}^{{\mathcal {G}}}(X_i | {\mathcal {D}}, \alpha _{ijk})\) from (10) plus a bias term that depends on the augmented counts \(\alpha _{ij}+ n_{ij}\) for the \(q_i\) configurations of \(\varPi _{X_i}^{{\mathcal {G}}}\). A similar result was derived in Miller (1955) for the empirical entropy estimator and is the basis of the Miller–Madow entropy estimator.
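
The accuracy of this approximation can be checked numerically: it is close when the augmented counts \(\alpha _{ij}+ n_{ij}\) are moderately large, and degrades for very sparse parent configurations, as expected from an expansion in \(1/(\alpha _{ij}+ n_{ij})\). A sketch with made-up counts:

```python
import numpy as np
from scipy.special import digamma

def expected_entropy(n_ijk, a_ijk):
    """Exact posterior expected entropy, as in (11)."""
    post = a_ijk + n_ijk
    post_j = post.sum(axis=1)
    return float(np.sum(digamma(post_j + 1)
                        - np.sum(post / post_j[:, None] * digamma(post + 1), axis=1)))

def lemma_1_rhs(n_ijk, a_ijk):
    """Marginal posterior entropy (10) minus the bias term of Lemma 1."""
    post = a_ijk + n_ijk
    post_j = post.sum(axis=1)
    p = post / post_j[:, None]
    r_i = n_ijk.shape[1]
    return float(-np.sum(p * np.log(p)) - np.sum((r_i - 1) / (2 * post_j)))

n_ijk = np.array([[10, 20], [15, 5], [8, 12], [30, 10]])   # made-up counts
a_ijk = np.full(n_ijk.shape, 1 / 8)
print(expected_entropy(n_ijk, a_ijk), lemma_1_rhs(n_ijk, a_ijk))   # nearly identical
```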

6 BDeu and the principle of maximum entropy

The maximum relative entropy principle (ME; Shore and Johnson 1980; Skilling 1988; Caticha 2004) states that we should choose a model that is consistent with our knowledge of the phenomenon we are modelling and that introduces no unwarranted information. In the context of probabilistic learning this means choosing the model that has the largest possible entropy for the data, which will encode the probability distribution that best reflects our current knowledge of \({\mathbf {X}}\) given by \({\mathcal {D}}\). In the Bayesian setting in which BD scores are defined, we then prefer a DAG \({\mathcal {G}}^{+}\) over a second DAG \({\mathcal {G}}^{-}\) if

$$\begin{aligned} {E}\left( {H}^{{\mathcal {G}}^{-}}({\mathbf {X}}) |{\mathcal {D}}\right) \leqslant {E}\left( {H}^{{\mathcal {G}}^{+}}({\mathbf {X}}) |{\mathcal {D}}\right) \end{aligned}$$
(15)

because these estimates of entropy incorporate all our knowledge including that encoded in the prior and in the hyperprior. The resulting formulation of ME represents a very general approach that includes Bayesian posterior estimation as a particular case (Giffin and Caticha 2007); which is intuitively apparent since the expected posterior entropy in (14) is proportional to BD. Furthermore, ME can also be seen as a particular case of the MDL principle (Feder 1986).

Suzuki (2016) defined as regular those BD scores that prefer simpler BNs, that is, BNs with smaller empirical entropies and fewer arcs:

$$\begin{aligned}&{H}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \pi _{ik |j}\right) \leqslant {H}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \pi _{ik |j}\right) ,\quad \varPi _{X_i}^{{\mathcal {G}}^{-}}\subset \varPi _{X_i}^{{\mathcal {G}}^{+}}\Rightarrow \\ &\quad {\text{BD}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha _i\right) \geqslant {\text{BD}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha _i\right) . \end{aligned}$$

For large sample sizes, the probabilities \(p_{ik |j}^{(\alpha _{ijk})}\) from (10) used in the posterior entropy estimators converge to the empirical frequencies used in the empirical entropy estimator, making the above asymptotically equivalent to,

$$\begin{aligned}&{H}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha _{ijk}\right) \leqslant {H}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha _{ijk}\right) ,\quad \varPi _{X_i}^{{\mathcal {G}}^{-}}\subset \varPi _{X_i}^{{\mathcal {G}}^{+}}\Rightarrow \\ &\quad {\text{BD}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha _i\right) \geqslant {\text{BD}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha _i\right) \end{aligned}$$

and connecting DAGs with the highest BD scores with those that minimise the marginal posterior entropy from (10). However, we prefer to study BDeu and its prior using ME as defined in (15) for two reasons. Firstly, posterior expectations are widely considered to be superior to MAP estimates in the literature (Berger 1985), as has been specifically shown for entropy in Nemenman et al. (2002). Secondly, ME directly incorporates the information encoded in the prior and in the hyperprior, without relying on large samples to link the empirical entropy (which depends on \({\mathbf {X}}, \varTheta\)) with the BD scores (which depend on \({\mathbf {X}}, \varvec{\alpha }\) and integrate \(\varTheta\) out).

Without loss of generality, we consider again the simple case in which \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) differ by a single arc, so that only one local distribution differs between the two DAGs. For BDeu, \(\alpha _{ijk}= \alpha / (r_i q_i)\) and substituting (14) in (15) we get,

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{-}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) {\text{BDeu}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha \right) \\ &\quad \leqslant {E}\left( {H}^{{\mathcal {G}}^{+}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) {\text{BDeu}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha \right) . \end{aligned}$$
(16)

If the sample \({\mathcal {D}}\) is sparse, some configurations of the variables will not be observed and the effective imaginary sample size may be smaller for \({\mathcal {G}}^{+}\) than for \({\mathcal {G}}^{-}\) like in Examples 1 and 2. As a result, when we compare a \({\mathcal {G}}^{-}\) for which we observe all configurations of \(\varPi _{X_i}^{{\mathcal {G}}^{-}}\) with a \({\mathcal {G}}^{+}\) for which we do not observe some configurations of \(\varPi _{X_i}^{{\mathcal {G}}^{+}}\), instead of (16) we are actually using,

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{-}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) {\text{BDeu}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{-}}; \alpha \right) \\ &\quad \leqslant {E}\left( {H}^{{\mathcal {G}}^{+}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) {\text{BDeu}}\left( X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}; \alpha (\tilde{q}_i/ q_i)\right) \end{aligned}$$
(17)

which is different from (15) and thus not consistent with ME. It is not correct from a Bayesian perspective either, because \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) are compared using marginal likelihoods arising from different priors; as expected since we know from Giffin and Caticha (2007) that Bayesian posterior inference is a particular case of ME. From the perspective of ME, those priors carry different information on the \(\varTheta _{X_i}\). They incorrectly express different levels of belief in the uniform prior underlying BDeu as a consequence of the difference in their effective imaginary sample sizes, even though they were meant to express the same level of belief for all possible DAGs. This may bias the entropy (which is, after all, the expected value of the information on \({\mathbf {X}}\) and on the \(\varTheta _{X_i}\)) of \({\mathcal {G}}^{+}\) compared to that of \({\mathcal {G}}^{-}\) in (17) and lead to choosing DAGs which would not be chosen by ME. We would like to stress that this scenario is not uncommon; on the contrary, such a model comparison is almost guaranteed to take place when the data are sparse. As structure learning explores more and more DAGs to identify an optimal one, it will inevitably consider DAGs with unobserved parent configurations, either because they are too dense or because those parent sets are not well supported by the few observed data points.

In particular, if some \(n_{ij} = 0\) then the posterior expected entropy of \(X_i |\varPi _{X_i}^{{\mathcal {G}}^{+}}\) becomes,

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{+}}(X_i) |{\mathcal {D}}, \alpha _{ijk}\right) \\ &\quad = \sum _{j : n_{ij} = 0} \left[ \psi _0(\alpha _{ij}+ 1) - \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}}{\alpha _{ij}} \psi _0(\alpha _{ijk}+ 1) \right] \\ &\qquad + \sum _{j : n_{ij} > 0} \left[ \psi _0(\alpha _{ij}+ n_{ij} + 1) - \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}+ n_{ijk}}{\alpha _{ij}+ n_{ij}} \psi _0(\alpha _{ijk}+ n_{ijk} + 1) \right] \end{aligned}$$

where the first term collects the conditional entropies corresponding to the \(q_i - \tilde{q}_i\) unobserved parent configurations, for which the posterior distribution coincides with the prior:

$$\begin{aligned} \sum _{j : n_{ij} = 0} \left[ \psi _0(\alpha _{ij}+ 1) - \sum _{k = 1}^{r_i} \frac{\alpha _{ijk}}{\alpha _{ij}} \psi _0(\alpha _{ijk}+ 1) \right] . \end{aligned}$$
(18)

By definition, the uniform distribution has the maximum possible entropy; the posteriors we would estimate if we could observe samples for those configurations of the \(\varPi _{X_i}^{{\mathcal {G}}^{+}}\) would almost certainly have a smaller entropy. At the same time, the entropies in the second term are smaller than what they would be if we only considered the \(\tilde{q}_i\) observed parent configurations, because \(\alpha _{ijk}= \alpha / (r_i q_i) < \alpha / (r_i \tilde{q}_i)\) means that posterior densities deviate more from the uniform distribution. These two effects, however, do not necessarily balance each other out; as we can see by revisiting Examples 1 and 2 below it is possible to incorrectly choose \({\mathcal {G}}^{+}\) over \({\mathcal {G}}^{-}\).

Example 1

(Continued) The empirical entropies for \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) are:

$$\begin{aligned} {H}(X |\varPi _{X}^{{\mathcal {G}}^{-}}) = {H}(X |\varPi _{X}^{{\mathcal {G}}^{+}}) = 4 \left[ - 0 \log 0 - 1 \log 1\right] = 0 \end{aligned}$$

by convention, but the posterior entropies differ:

$$\begin{aligned} {H}(X |\varPi _{X}^{{\mathcal {G}}^{-}}; \alpha )= & {} 4 \left[ - \frac{0 + {1}/{8}}{3 + {1}/{4}} \log \frac{0 + {1}/{8}}{3 + {1}/{4}} - \frac{3 + {1}/{8}}{3 + {1}/{4}} \log \frac{3 + {1}/{8}}{3 + {1}/{4}} \right] \\ = & {} 0.652,\\ {H}(X |\varPi _{X}^{{\mathcal {G}}^{+}}; \alpha )= & {} 4 \left[ - \frac{0 + {1}/{16}}{3 + {1}/{8}} \log \frac{0 + {1}/{16}}{3 + {1}/{8}} - \frac{3 + {1}/{16}}{3 + {1}/{8}} \log \frac{3 + {1}/{16}}{3 + {1}/{8}} \right] \\ = & {} 0.392. \end{aligned}$$

The expected posterior entropies for \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) are:

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}, \frac{1}{8}\right) \\ &\quad = 4 \left[ \psi _0({1}/{4} + 3 + 1) - \frac{0 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 0 + 1) - \frac{3 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 3 + 1) \right] \\ &\quad = 0.3931, \\&{E}\left( {H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}, \frac{1}{16}\right) \\ &\quad = 4 \left[ \psi _0({1}/{8} + 3 + 1) - \frac{0 + {1}/{16}}{3 + {1}/{8}} \psi _0({1}/{16} + 0 + 1) - \frac{3 + {1}/{16}}{3 + {1}/{8}} \psi _0({1}/{16} + 3 + 1) \right] \\ &\qquad +\,4 \left[ \psi _0({1}/{8} + 0 + 1) - \frac{0 + {1}/{16}}{0 + {1}/{8}} \psi _0({1}/{16} + 0 + 1) - \frac{0 + {1}/{16}}{0 + {1}/{8}} \psi _0({1}/{16} + 0 + 1) \right] \\ &\quad = 0.5707. \end{aligned}$$

Therefore, substituting these values in (17),

$$\begin{aligned}&{E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}) = 0.3931 \cdot 0.0326 = 0.0128 \\ &\quad <0.0252 = 0.5707 \cdot 0.0441 = {E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}); \end{aligned}$$

and we would choose \({\mathcal {G}}^{+}\) over \({\mathcal {G}}^{-}\) even though we only observe \(\tilde{q}_i= 4\) configurations of \(\varPi _{X}^{{\mathcal {G}}^{+}}\) out of 8, and the sample frequencies are identical for those configurations. The data contribute the same information to the posterior expected entropies; both \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) have empirical entropy equal to zero. The difference must then arise because of the priors: both \(\varTheta _X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(\varTheta _X |\varPi _{X}^{{\mathcal {G}}^{+}}\) follow a uniform Dirichlet prior, but in the former \(\alpha = 1\) and in the latter \(\alpha = {1}/{2}\) because \(\tilde{q}_i= 4 < 8 = q_i\). A consistent model comparison requires that all models are evaluated with the same prior, which clearly is not the case in this example. \(\square\)
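
The quantities in this example can be reproduced with a short computation (a minimal sketch, again assuming three observations concentrated on a single value of \(X\) for each observed parent configuration):

```python
import numpy as np
from scipy.special import digamma, gammaln

def expected_entropy(n_ijk, a_ijk):               # posterior expected entropy, as in (11)
    post = a_ijk + n_ijk
    post_j = post.sum(axis=1)
    return float(np.sum(digamma(post_j + 1)
                        - np.sum(post / post_j[:, None] * digamma(post + 1), axis=1)))

def bd_local(n_ijk, a_ijk):                       # BD(X_i | Pi_i), as in (4)
    a_ij, n_ij = a_ijk.sum(axis=1), n_ijk.sum(axis=1)
    return float(np.exp(np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
                        + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk))))

gm = np.tile([0, 3], (4, 1))                                   # X | {Z, W}
gp = np.vstack([gm, np.zeros((4, 2), dtype=int)])              # X | {Z, W, Y}
am, ap = np.full(gm.shape, 1 / 8), np.full(gp.shape, 1 / 16)   # BDeu with alpha = 1

print(expected_entropy(gm, am) * bd_local(gm, am))   # ~0.0128 for G-
print(expected_entropy(gp, ap) * bd_local(gp, ap))   # ~0.0252 for G+, so G+ is chosen
```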

Example 2

(Continued) The conditional distributions of \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) both have the same (positive) empirical entropy:

$$\begin{aligned} {H}(X |\varPi _{X}^{{\mathcal {G}}^{-}}) = {H}(X |\varPi _{X}^{{\mathcal {G}}^{+}}) = 4 \left[ - \frac{1}{3} \log \frac{1}{3} - \frac{2}{3} \log \frac{2}{3}\right] = 2.546. \end{aligned}$$

However, their posterior entropies are different:

$$\begin{aligned} {H}(X |\varPi _{X}^{{\mathcal {G}}^{-}}; \alpha )= & {} 4 \left[ - \frac{1 + {1}/{8}}{3 + {1}/{4}} \log \frac{1 + {1}/{8}}{3 + {1}/{4}} - \frac{2 + {1}/{8}}{3 + {1}/{4}} \log \frac{2 + {1}/{8}}{3 + {1}/{4}} \right] \\ = & {} 2.580, \\ {H}(X |\varPi _{X}^{{\mathcal {G}}^{+}}; \alpha )= & {} 4 \left[ - \frac{1 + {1}/{16}}{3 + {1}/{8}} \log \frac{1 + {1}/{16}}{3 + {1}/{8}} - \frac{2 + {1}/{16}}{3 + {1}/{8}} \log \frac{2 + {1}/{16}}{3 + {1}/{8}} \right] \\ = & {} 2.564; \end{aligned}$$

and the respective posterior expected entropies are:

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}, \frac{1}{8}\right) \\ &\quad = 4 \left[ \psi _0({1}/{4} + 3 + 1) - \frac{1 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 1 + 1) - \frac{2 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 2 + 1) \right] \\ &\quad = 2.066, \\ &{E}\left( {H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}, \frac{1}{16}\right) \\ &\quad = 4 \left[ \psi _0({1}/{8} + 3 + 1) - \frac{1 + {1}/{16}}{3 + {1}/{8}} \psi _0({1}/{16} + 1 + 1) - \frac{2 + {1}/{16}}{3 + {1}/{8}} \psi _0({1}/{16} + 2 + 1) \right] \\ &\qquad +\,4 \left[ \psi _0({1}/{8} + 0 + 1) - \frac{0 + {1}/{16}}{0 + {1}/{8}} \psi _0({1}/{16} + 0 + 1) - \frac{0 + {1}/{16}}{0 + {1}/{8}} \psi _0({1}/{16} + 0 + 1)\right] \\ &\quad = 4.069. \end{aligned}$$

Therefore, substituting these values in (17) leads to,

$$\begin{aligned}&{E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}) = 2.066 \cdot 3.906\times 10^{-7} = 8.071\times 10^{-7} \\ &\quad >1.514\times 10^{-7} = 4.069 \cdot 3.721\times 10^{-8} = {E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}). \end{aligned}$$

Even though we now choose \({\mathcal {G}}^{-}\) over \({\mathcal {G}}^{+}\), we still express a preference for one of the two DAGs despite the information in the data being the same; this confirms that the difference is attributable to the prior. \(\square\)

Fig. 3

The difference between \({E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}, \alpha _{ijk})\) and \({E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}, \alpha _{ijk})\) computed using BDeu and BDs for Example 1 (left panel) and Example 2 (right panel) in orange and dark blue, respectively. The bullet points correspond to the values observed for \(\alpha = 1\)

Fig. 4

The difference between \({E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}})\) and \({E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}})\) computed using BDeu and BDs for Example 1 (left panel) and Example 2 (right panel) in orange and dark blue, respectively. The bullet points correspond to the values observed for \(\alpha = 1\)

Based on these results and the examples above, we state the following theorem.

Theorem 1

Using BDeu and the associated uniform prior over the parameters of a BN for structure learning violates the maximum relative entropy principle if any candidate parent configuration of any node is not observed in the data.

This is not the case for BDs, because its piecewise uniform prior preserves the imaginary sample size even when \(\tilde{q}_i< q_i\); and because it prevents the posterior entropy from inflating by allowing the \(q_i - \tilde{q}_i\) terms corresponding to the \(n_{ij} = 0\) to simplify. Assuming \(\alpha _{ijk}= 0\) in (18) implies:

$$\begin{aligned} \sum _{j : n_{ij} = 0} \left[ \psi _0(1) - \sum _{k = 1}^{r_i} \frac{1}{r_i} \psi _0(1) \right] = \psi _0(1) - \psi _0(1) = 0. \end{aligned}$$

Example 1

(Continued) If we compare \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\) under the prior assumed by BDs, we have that \(\alpha _{ijk}= {1}/{8}\) for both \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and the \(\tilde{q}_i\) observed parent configurations in \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\). Then their posterior expected entropies are:

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}, \frac{1}{8}\right) = {E}\left( {H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}, \frac{1}{8}\right) \\ &\quad = 4 \left[ \psi _0({1}/{4} + 3 + 1) - \frac{0 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 0 + 1) - \frac{3 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 3 + 1) \right] \\ &\quad = 0.3931 \end{aligned}$$

and substituting these values in (16)

$$\begin{aligned}&{E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}) = 0.3931 \cdot 0.0326 = 0.0128 \\ &\quad =0.0128 = 0.3931 \cdot 0.0326 = {E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}). \end{aligned}$$

ME does not express a preference for either \({\mathcal {G}}^{-}\) or \({\mathcal {G}}^{+}\); since we have observed above that the data contribute exactly the same information for both DAGs, the same must be true for the prior associated with BDs.

A side effect of not violating ME is that the choice between \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) is no longer sensitive to the value of \(\alpha\); we can see from the left panels of Figs. 3 and 4 that both the difference between \({E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}, \frac{1}{8})\) and \({E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}, \frac{1}{8})\) and the difference between \({E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}})\) and \({E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}})\) are equal to zero for all \(\alpha\). \(\square\)

Example 2

(Continued) Again \(\alpha _{ijk}= {1}/{8}\) for both \(X |\varPi _{X}^{{\mathcal {G}}^{-}}\) and the \(\tilde{q}_i\) observed parent configurations in \(X |\varPi _{X}^{{\mathcal {G}}^{+}}\), so

$$\begin{aligned}&{E}\left( {H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}, \frac{1}{8}\right) = {E}\left( {H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}, \frac{1}{8}\right) \\ &\quad = 4 \left[ \psi _0({1}/{4} + 3 + 1) - \frac{1 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 1 + 1) - \frac{2 + {1}/{8}}{3 + {1}/{4}} \psi _0({1}/{8} + 2 + 1) \right] \\ &\quad = 2.066 \end{aligned}$$

which leads to

$$\begin{aligned}&{E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}}) = 2.066 \cdot 3.906\times 10^{-7} = 8.071\times 10^{-7} \\&\quad =8.071\times 10^{-7} = 2.066 \cdot 3.906\times 10^{-7} = {E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}}). \end{aligned}$$

Again we can see from the right panels of Figs. 3 and 4 that the choice between \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) is no longer sensitive to the choice of \(\alpha\); and \({\mathcal {G}}^{+}\) is never preferred to \({\mathcal {G}}^{-}\). This contrasts especially the behaviour of BDeu in Fig. 4, where \({E}({H}^{{\mathcal {G}}^{+}}(X) |{\mathcal {D}})\) can be both larger and smaller than \({E}({H}^{{\mathcal {G}}^{-}}(X) |{\mathcal {D}})\) for different values of \(\alpha\). \(\square\)

It is easy to show that the theorem we just stated does not apply to K2 or BDJ, since under their priors \(\alpha _{ijk}\) is not a function of \(q_i\); but it does apply to BDla since its formulation is essentially a mixture of BDeu scores.

7 Conclusions and discussion

Bayesian network learning follows an inherently Bayesian workflow in which we first learn the structure of the DAG \({\mathcal {G}}\) from a data set \({\mathcal {D}}\), and then we learn the values of the parameters \(\varTheta _{X_i}\) given \({\mathcal {G}}\). In this paper we studied the properties of the Bayesian posterior scores used to estimate \({P}({\mathcal {G}}|{\mathcal {D}})\) and to learn the \({\mathcal {G}}\) that best fits the data. For discrete Bayesian networks, these scores are Bayesian–Dirichlet (BD) marginal likelihoods that assume different Dirichlet priors for the \(\varTheta _{X_i}\) and, in the most general formulation, a hyperprior over the hyperparameters \(\alpha _{ijk}\) of the prior. We focused on the most common BD score, BDeu, which assumes a uniform prior over each \(\varTheta _{X_i}\); and we studied the impact of that prior on structure learning from a Bayesian and an information theoretic perspective. After deriving the form of the posterior expected entropy for \({\mathcal {G}}\) given \({\mathcal {D}}\), we found that BDeu may select models in a way that violates the maximum relative entropy principle. Furthermore, we showed that it produces Bayes factors that are very sensitive to the choice of the imaginary sample size. Both issues are related to the uniform prior assumed by BDeu for the \(\varTheta _{X_i}\), and can lead to the selection of overly dense DAGs when the data are sparse. In contrast, the BDs score proposed in Scutari (2016) does not suffer from these issues, even though it converges to BDeu asymptotically; and neither do other BD scores in the literature. In the simulation study we performed in Scutari (2016), we found that BDs leads to more accurate structure learning; hence we recommend its use over BDeu for sparse data.