
Learning Bayesian Networks with Non-Decomposable Scores

  • Conference paper
Graph Structures for Knowledge Representation and Reasoning (GKR 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9501)


Abstract

Modern approaches for optimally learning Bayesian network structures require decomposable scores. Such approaches, including those based on dynamic programming and heuristic search, operate in a search space called the order graph, which has been investigated extensively in recent years. In this paper, we break from this tradition and show that one can effectively learn structures using non-decomposable scores by exploring a more complex search space that leverages state-of-the-art learning systems based on order graphs. We show how the new search space can be used to learn with priors that are not structure-modular (a particular class of non-decomposable scores). We also show that it can be used to efficiently enumerate the \(k\)-best structures, in time that can be up to three orders of magnitude faster than existing approaches.


Notes

  1. At https://sites.google.com/site/bmmalone/files/urlearning.

  2. Approaches based on ILP can in principle handle non-decomposable scores (and constraints), assuming that they can be expressed using a linear cost function (or as linear constraints) [25]. We remark that order-modular priors, which we consider later, are not easy to encode as ILPs (as we need to compute linear extension counts).

  3. \({ Pr}_0(G_0) = 1\) as there is a unique graph over zero nodes.

  4. For each linear extension \(\pi \) of \(G_i\), there are \((i+1)\) places to insert the \((i+1)\)-th node, then \((i+2)\) places to insert the next, and so on. Thus, there are \((i+1)(i+2)\cdots n = n!/i!\) ways to extend a given ordering over \(i\) variables to an ordering over \(n\) variables.

  5. In particular, every time we expand a node \(G\), we increment the linear extension count of each of its children by \({\mathrm {\#}G}\). Once every parent of a child has been expanded, the child's linear extension count is correct.

  6. There are systems available for (a) finding optimal DAGs using structure-modular priors, (b) for Bayesian model averaging using order-modular priors, and (c) for jointly optimizing over orders and DAGs, using order-modular priors. These tasks are all discussed in [18], which further states that finding optimal DAGs with order-modular priors is a more challenging problem (where we maximize over DAGs, but sum over orders).

  7. We remark, however, that [12] is more specifically concerned with enumerating the k shortest paths. Since we are interested in enumerating the k closest goal nodes, only some, but not all, of their theoretical analysis applies to our problem. In particular, each distinct goal node in the BN graph may be reachable by many paths. Hence, once we obtain one goal node, many more shortest paths may need to be enumerated before we obtain the next closest (and distinct) goal node. Moreover, unlike [12], we do not need to differentiate between two different paths to the same goal node.

  8. We remark on another distinction between finding a single optimal DAG versus enumerating the k-best DAGs. In particular, there are techniques that can guarantee that certain families will not appear in an optimal DAG, which can greatly simplify the learning problem [8, 11, 30, 31]. However, such families may still appear in a k-th best DAG, and hence these techniques may not be directly applicable.

  9. At http://www.cs.iastate.edu/~jtian/Software/UAI-10/KBest.htm.

  10. At http://www.cs.york.ac.uk/aig/sw/gobnilp/.

  11. Note that GOBNILP is known to be more effective in other regimes, for example, where we can constrain the number of parents that a node can have [21, 34]. However, for our experiments here, we consider the more general case, where we do not assume such a constraint.

References

  1. Bache, K., Lichman, M.: UCI Machine Learning Repository (2013). http://archive.ics.uci.edu/ml

  2. Bouckaert, R.R.: Probabilistic network construction using the minimum description length principle. In: Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU), pp. 41–48 (1993)

  3. Brightwell, G., Winkler, P.: Counting linear extensions. Order 8(3), 225–242 (1991)


  4. Buntine, W.: Theory refinement on Bayesian networks. In: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pp. 52–60 (1991)


  5. Chickering, D., Geiger, D., Heckerman, D.: Learning Bayesian networks: search methods and experimental results. In: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics (AISTATS), pp. 112–128 (1995)


  6. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9(4), 309–347 (1992)


  7. Cussens, J.: Bayesian network learning with cutting planes. In: Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 153–160 (2011)


  8. Cussens, J.: An upper bound for BDeu local scores. In: European Conference on Artificial Intelligence Workshop on Algorithmic Issues for Inference in Graphical Models (2012)


  9. Cussens, J., Bartlett, M., Jones, E.M., Sheehan, N.A.: Maximum likelihood pedigree reconstruction using integer linear programming. Genet. Epidemiol. 37(1), 69–83 (2013)


  10. Darwiche, A.: Modeling and Reasoning with Bayesian Networks. Cambridge University Press, New York (2009)


  11. De Campos, C.P., Ji, Q.: Efficient structure learning of Bayesian networks using constraints. J. Mach. Learn. Res. 12, 663–689 (2011)


  12. Dechter, R., Flerova, N., Marinescu, R.: Search algorithms for m best solutions for graphical models. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (2012)


  13. Felner, A., Goldenberg, M., Sharon, G., Stern, R., Beja, T., Sturtevant, N.R., Schaeffer, J., Holte, R.: Partial-expansion A* with selective node generation. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (2012)


  14. Friedman, N., Koller, D.: Being Bayesian about network structure. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 201–210 (2000)


  15. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)


  16. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)


  17. Jaakkola, T., Sontag, D., Globerson, A., Meila, M.: Learning Bayesian network structure using LP relaxations. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 358–365 (2010)

  18. Koivisto, M., Sood, K.: Exact Bayesian structure discovery in Bayesian networks. J. Mach. Learn. Res. 5, 549–573 (2004)


  19. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge (2009)


  20. Lam, W., Bacchus, F.: Learning Bayesian belief networks: an approach based on the MDL principle. Comput. Intell. 10, 269–294 (1994)


  21. Malone, B., Kangas, K., Järvisalo, M., Koivisto, M., Myllymäki, P.: Predicting the hardness of learning Bayesian networks. In: Proceedings of the Twenty-Eighth Conference on Artificial Intelligence (2014)


  22. Malone, B., Yuan, C., Hansen, E.: Memory-efficient dynamic programming for learning optimal Bayesian networks. In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (2011)

  23. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)


  24. Niinimäki, T.M., Koivisto, M.: Annealed importance sampling for structure learning in Bayesian networks. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI) (2013)


  25. Oates, C.J., Smith, J.Q., Mukherjee, S., Cussens, J.: Exact estimation of multiple directed acyclic graphs. Stat. Comput. 1–15 (2015). http://link.springer.com/journal/11222/onlineFirst/page/2

  26. Russell, S.J., Norvig, P.: Artificial Intelligence - A Modern Approach. Pearson Education, London (2010)


  27. Silander, T., Myllymäki, P.: A simple approach for finding the globally optimal Bayesian network structure. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 445–452 (2006)


  28. Singh, A.P., Moore, A.W.: Finding optimal Bayesian networks by dynamic programming. Technical report, CMU-CALD-050106 (2005)


  29. Suzuki, J.: A construction of Bayesian networks from databases based on an MDL principle. In: Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 266–273 (1993)


  30. Teyssier, M., Koller, D.: Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In: Proceedings of the Twenty-first Conference on Uncertainty in Artificial Intelligence, pp. 584–590 (2005)


  31. Tian, J.: A branch-and-bound algorithm for MDL learning Bayesian networks. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 580–588 (2000)

  32. Tian, J., He, R., Ram, L.: Bayesian model averaging using the k-best Bayesian network structures. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 589–597 (2010)

  33. Yoshizumi, T., Miura, T., Ishida, T.: A* with partial expansion for large branching factor problems. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI) (2000)

  34. Yuan, C., Malone, B.: Learning optimal Bayesian networks: a shortest path perspective. J. Artif. Intell. Res. 48, 23–65 (2013)


  35. Yuan, C., Malone, B., Wu, X.: Learning optimal Bayesian networks using A* search. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pp. 2186–2191 (2011)



Acknowledgments

We thank Joseph Barker, Zhaoxing Bu, and Ethan Schreiber for helpful comments and discussions. We also thank James Cussens and Brandon Malone for their comments on an earlier version of this paper. This work was supported in part by ONR grant #N00014-12-1-0423 and NSF grant #IIS-1514253.

Author information


Corresponding author

Correspondence to Eunice Yuh-Jie Chen.


A Proofs for Sect. 4.2

Proof

(Theorem 1). The total cost of a path from the root \(G_0\) to a leaf \(G_n\) is:

$$\begin{aligned}&\sum _{G_i \rightarrow G_j} \mathsf {score}(X_j\mathbf{U}_j \mid \mathcal {D}) - \log \frac{{ Pr}_j(G_j)}{{ Pr}_i(G_i)}\\&=\sum _{G_i \rightarrow G_j} \mathsf {score}(X_j\mathbf{U}_j \mid \mathcal {D}) - \log { Pr}_j(G_j) + \log { Pr}_i(G_i)\\&= \mathsf {score}(X_1 \mathbf{U}_1 \mid \mathcal {D}) - \log { Pr}_1(G_1) + \log { Pr}_0(G_0) {} \\&\quad + \mathsf {score}(X_2\mathbf{U}_2 \mid \mathcal {D}) - \log { Pr}_2(G_2) + \log { Pr}_1(G_1) + \ldots {} \\&\quad + \mathsf {score}(X_n\mathbf{U}_n \mid \mathcal {D}) - \log { Pr}_{n}(G_{n}) + \log { Pr}_{n-1}(G_{n-1})\\&= \mathsf {score}(G \mid \mathcal {D}) - \log { Pr}_n(G_n) \end{aligned}$$

as desired.    \(\square \)
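The telescoping above can also be checked numerically. The following small Python sketch (illustrative only, with made-up scores and priors, and not part of the original paper) builds an arbitrary root-to-leaf path and confirms that the summed edge costs equal \(\mathsf {score}(G \mid \mathcal {D}) - \log { Pr}_n(G_n)\):

import math
import random

# Edge G_j -> G_{j+1} on a root-to-leaf path costs
#   score(X_{j+1} U_{j+1} | D) - log Pr_{j+1}(G_{j+1}) + log Pr_j(G_j),
# so the prior terms telescope and only -log Pr_n(G_n) survives.
n = 5
family_scores = [random.uniform(0.0, 10.0) for _ in range(n)]   # made-up scores
log_priors = [0.0] + [math.log(random.uniform(0.1, 1.0)) for _ in range(n)]
# log_priors[0] = log Pr_0(G_0) = log 1 = 0, since G_0 is the unique empty graph

edge_costs = [family_scores[j] - log_priors[j + 1] + log_priors[j]
              for j in range(n)]

assert math.isclose(sum(edge_costs), sum(family_scores) - log_priors[n])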

Proof

(Theorem 2).

$$\begin{aligned} h(G_i)&= \min _{G_n: G_i \leadsto G_n} \sum _{X\mathbf{U}\in G_n - G_i} \mathsf {score}(X\mathbf{U}\mid \mathcal {D}) + \min _{G_n: G_i \leadsto G_n} - \log \frac{{ Pr}_n(G_n)}{{ Pr}_i(G_i)}\\&\le \min _{G_n: G_i \leadsto G_n} \Big ( \sum _{X\mathbf{U}\in G_n - G_i} \mathsf {score}(X\mathbf{U}\mid \mathcal {D}) - \log \frac{{ Pr}_n(G_n)}{{ Pr}_i(G_i)} \Big )\\&= \min _{G_n: G_i \leadsto G_n} g(G_n) - g(G_i) \end{aligned}$$

Since the sum of these two separate minima lower-bounds the minimum of their sum, the heuristic function \(h\) lower-bounds the true cost of reaching a goal node, and it is therefore admissible.    \(\square \)
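For completeness, the inequality used above is the elementary fact that minimizing two terms separately over the same set of goal DAGs can only give a value no larger than minimizing their sum jointly. Writing \(a(G_n)\) for the score term and \(b(G_n)\) for the prior term,

$$\begin{aligned} \min _{G_n: G_i \leadsto G_n} a(G_n) + \min _{G_n: G_i \leadsto G_n} b(G_n) \le \min _{G_n: G_i \leadsto G_n} \big ( a(G_n) + b(G_n) \big ). \end{aligned}$$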

Below we consider the correctness of Algorithm 1.

Lemma 1

In Algorithm 1:

  1. all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\) is extracted;

  2. when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted from H,

    $$\begin{aligned} f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1}), \end{aligned}$$

    and \(l_{i + 1}\) is the number of linear extensions of \(G_{i + 1}\), i.e., \(\#G_{i+1}\).

where \(h(G_i) = h_1(G_i) - \sum _{j=i+1}^n \log j\).

Proof

Consider a minimum item \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) extracted from H. Below we show by induction that (1) every \(G_i\) that generates \(G_{i + 1}\) is extracted from H before \(G_{i + 1}\), and (2) \(f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1})\), and \(l_{i + 1}\) is the number of linear extensions of \(G_{i + 1}\), which is also the number of paths from \(G_0\) to \(G_{i + 1}\).

For \(i = 0\), (1) and (2) clearly hold. Assume (1) and (2) hold for all \((G_i, f_i, l_i)\). Then when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted, \(l_{i + 1}\) is the number of paths from \(G_0\) to \(G_{i + 1}\) that pass through a \(G_i\) extracted from H before \(G_{i + 1}\); to see this, note that each such \(l_i\) is the number of paths from \(G_0\) to \(G_i\). Hence, while \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is in H,

$$\begin{aligned} f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \sum _{G_i \prec G_{i + 1}} N(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1}) + h (G_{i + 1}), \end{aligned}$$

where \(G_i \prec G_{i + 1}\) denotes that \(G_i\) is extracted before \(G_{i + 1}\), and \(N(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1})\) denotes the number of paths \(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1}\). Note that \(f_{i + 1}\) decreases as more \(G_i\) are extracted.

Consider next the case where \((G_i, f_i, l_i)\) and \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) are both in H. Below we show that every \(G_i\) that generates \(G_{i + 1}\) is extracted from H before \(G_{i + 1}\). Consider

$$\begin{aligned} f_i&= \mathsf {score}(G_i | \mathcal {D}) \\&\quad - \log \sum _{G_{i - 1} \prec G_i} N(G_0 \rightarrow \ldots \rightarrow G_{i - 1} \rightarrow G_i) + h_1(G_i) - \sum _{j = i + 1}^n \log j\\ f_{i + 1}&= \mathsf {score}(G_{i + 1} | \mathcal {D}) \\&\quad - \log \sum _{G^\prime _i \prec G_{i + 1}} N(G_0 \rightarrow \ldots \rightarrow G^\prime _i \rightarrow G_{i + 1})+ h_1 (G_{i + 1}) - \sum _{j = i + 2}^n \log j\\ \end{aligned}$$

We simply need to show \(f_{i + 1} > f_i\). Since \(h_1\) is a consistent heuristic function for learning with \(\mathsf {score}\), \(\mathsf {score}(G_{i + 1} | \mathcal {D}) + h_1(G_{i + 1}) \ge \mathsf {score}(G_i | \mathcal {D}) + h_1 (G_i)\). Then we only need to show

$$\begin{aligned}&(i + 1) \sum _{G_{i - 1} \prec G_i} N(G_0 \rightarrow \ldots \rightarrow G_{i - 1} \rightarrow G_i) \\&> \sum _{G^\prime _i \prec G_{i + 1}} N(G_0 \rightarrow \ldots \rightarrow G^\prime _i \rightarrow G_{i + 1}) \end{aligned}$$

First, for any pair of DAGs \(G_i\) and \(G^\prime _i\) that can generate a DAG \(G_{i+1}\), there exists a unique DAG \(G_{i-1}\) that can generate both \(G_i\) and \(G^\prime _i\). For each \(G^\prime _i\) on the right-hand side, there thus exists a corresponding (and unique) \(G_{i-1}\) on the left-hand side that can generate both \(G^\prime _i\) and \(G_i\). Further, since \(G^\prime _i\) was expanded, \(G_{i-1}\) must also have been expanded (by induction). For each such \(G_{i-1}\), if \(G^\prime _i\) has a linear extension count of \(L\), then \(G_{i-1}\) must have a linear extension count of at least \(L/i\), and hence the corresponding \(N(G_0 \rightarrow \ldots \rightarrow G_{i-1} \rightarrow G_i)\) is at least \(L/i\). The corresponding term on the left-hand side is thus at least \((i+1)\cdot L/i > L\). Since this holds for each element of the summation on the right-hand side, the above inequality holds.

Since every \(G_i\) that generates \(G_{i + 1}\) is extracted from H before \(G_{i + 1}\), it follows that \(f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1})\), and that \(l_{i + 1}\) is the number of all paths from \(G_0\) to \(G_{i + 1}\).    \(\square \)
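To make the bookkeeping that Lemma 1 analyzes concrete, the following Python sketch shows a best-first expansion loop in which every open DAG \(G\) carries a running path count \(l(G)\) and priority \(f(G) = \mathsf {score}(G \mid \mathcal {D}) - \log l(G) + h(G)\); expanding a parent adds its own count to each child, and the child's entry is lazily re-pushed so that stale entries are skipped on extraction. This is a hypothetical illustration only: the callables score, h, children, and is_goal are placeholders rather than the authors' implementation, and the sketch relies on Lemma 1's guarantee that every parent is extracted before its child.

import heapq
import math
from itertools import count

def expand_with_extension_counts(root, children, score, h, is_goal):
    # l[G]: number of root-to-G paths (linear extensions) found so far.
    # best_f[G]: current priority score(G) - log l[G] + h(G).
    tie = count()                                  # heap tie-breaker
    l = {root: 1}
    best_f = {root: score(root) + h(root)}         # -log 1 = 0 at the root
    heap = [(best_f[root], next(tie), root)]
    while heap:
        f, _, G = heapq.heappop(heap)
        if f != best_f.get(G):                     # stale entry: l[G] grew after push
            continue
        if is_goal(G):
            return G, l[G], f                      # l[G] is exact here (Lemma 1)
        for child in children(G):
            l[child] = l.get(child, 0) + l[G]      # parent contributes its count
            best_f[child] = score(child) - math.log(l[child]) + h(child)
            heapq.heappush(heap, (best_f[child], next(tie), child))
        best_f.pop(G, None)                        # close G; later entries are stale
    return None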

Proof

(Theorem 4). To see the correctness of the algorithm, simply note that, by Lemma 1, when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted from H (i.e., the open list), we have \(f_{i + 1} = f(G_{i + 1})\).    \(\square \)

Proof

(Theorem 3). By Lemma 1, Algorithm 1 correctly counts the number of linear extensions.    \(\square \)
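Footnote 5 and Theorem 3 describe counting linear extensions as a by-product of the search. As a standalone illustration of the same counting idea (a small sketch for intuition on toy DAGs, not Algorithm 1 itself), the following Python snippet counts linear extensions by dynamic programming over subsets of already-ordered variables, pushing each completed subset's count to its children exactly as described in footnote 5:

from itertools import combinations

def count_linear_extensions(n, edges):
    # Count linear extensions of a DAG on nodes 0..n-1. A subset S of
    # 'already ordered' nodes plays the role of a node G_i in the search
    # graph; its count is pushed to each child S | {x} once S is complete.
    parents = {v: set() for v in range(n)}
    for u, v in edges:                           # edge u -> v: u must precede v
        parents[v].add(u)

    counts = {frozenset(): 1}                    # the empty prefix has one extension
    for size in range(n):                        # every parent subset is finished
        for S in combinations(range(n), size):   # before any of its children is read
            S = frozenset(S)
            if S not in counts:
                continue
            for x in range(n):
                if x in S or not parents[x] <= S:
                    continue                     # x cannot yet be placed after prefix S
                child = S | {x}
                counts[child] = counts.get(child, 0) + counts[S]
    return counts.get(frozenset(range(n)), 0)

# A chain 0 -> 1 -> 2 has one linear extension; three isolated nodes have 3! = 6.
assert count_linear_extensions(3, [(0, 1), (1, 2)]) == 1
assert count_linear_extensions(3, []) == 6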


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Chen, E.YJ., Choi, A., Darwiche, A. (2015). Learning Bayesian Networks with Non-Decomposable Scores. In: Croitoru, M., Marquis, P., Rudolph, S., Stapleton, G. (eds) Graph Structures for Knowledge Representation and Reasoning. GKR 2015. Lecture Notes in Computer Science(), vol 9501. Springer, Cham. https://doi.org/10.1007/978-3-319-28702-7_4


  • DOI: https://doi.org/10.1007/978-3-319-28702-7_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28701-0

  • Online ISBN: 978-3-319-28702-7

