Abstract
One of the main research topics in machine learning nowadays is the improvement of the inference and learning processes in probabilistic graphical models. Traditionally, inference and learning have been treated separately, but given that the structure of the model conditions the inference complexity, most learning methods will sometimes produce inefficient inference models. In this paper we propose a framework for learning low inference complexity Bayesian networks. For that, we use a representation of the network factorization that allows efficiently evaluating an upper bound in the inference complexity of each model during the learning process. Experimental results show that the proposed methods obtain tractable models that improve the accuracy of the predictions provided by approximate inference in models obtained with a well-known Bayesian network learner.
Additional information
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness through the Cajal Blue Brain (C080020-09; the Spanish partner of the Blue Brain initiative from EPFL) and TIN2013-41592-P projects, by the Regional Government of Madrid through the S2013/ICE-2845-CASI-CAM-CM project, and by the European Union’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 604102 (Human Brain Project). M. Benjumeda is supported by a predoctoral contract for the formation of doctors from the Spanish Ministry of Economy and Competitiveness (BES-2014-068637).
Appendix: Proof of Theorem 1
This work relies heavily on Theorem 1, which ensures that the proposed incremental compilation and optimization methods always produce sound PTs. To demonstrate the soundness of a PT \({\mathcal {P}}\) with respect to a BN \({\mathcal {B}}\), we show that for each node \(X_i\) of \({\mathcal {P}}\), every parent of \(X_i\) in \({\mathcal {B}}\) is a predecessor of \(X_i\) in \({\mathcal {P}}\). In this appendix we provide a proof of Theorem 1.
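For intuition, the soundness condition just described can be sketched as a direct check, assuming the PT is stored as a parent map over \(\{*\} \cup {\mathcal {X}}\) and the BN as a map from each node to its parent set (the names `pt_parent`, `bn_parents`, `predecessors`, and `is_sound` are illustrative, not taken from the paper):

```python
def predecessors(pt_parent, node):
    """Collect all predecessors (ancestors) of `node` in the PT,
    walking up the tree until the root '*' is passed."""
    preds = set()
    while pt_parent[node] is not None:
        node = pt_parent[node]
        preds.add(node)
    return preds

def is_sound(pt_parent, bn_parents):
    """A PT is sound w.r.t. a BN iff, for every node, each of its
    BN parents appears among its PT predecessors."""
    return all(
        bn_parents[x] <= predecessors(pt_parent, x)
        for x in bn_parents
    )

# Example: BN with arcs A -> B and A -> C; PT chain * - A - B - C.
pt_parent = {"*": None, "A": "*", "B": "A", "C": "B"}
bn_parents = {"A": set(), "B": {"A"}, "C": {"A"}}
print(is_sound(pt_parent, bn_parents))  # True
```

If, instead, \(B\) were attached directly under the root (so that \(A \notin \mathbf{pred}(B)\)), the check would fail, since \(B\)'s BN parent \(A\) would no longer precede it in the PT.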
Lemma 1
Let \({\mathcal {P}}\) be a PT over \({\mathcal {X}}_P = \{*\} \cup {\mathcal {X}}\) and \({\mathcal {B}}\) be a Bayesian network over \({\mathcal {X}}\), and suppose that adding arc \(X_{\mathrm{out}}\rightarrow X_{\mathrm{in}}\) to \({\mathcal {B}}\) does not produce a cycle. If \({\mathcal {P}}\) is sound with respect to \({\mathcal {B}}\), then the PT \({\mathcal {P}}'\) obtained after applying addArc\(({\mathcal {B}},{\mathcal {P}},X_{\mathrm{out}},X_{\mathrm{in}})\) is also sound with respect to \({\mathcal {B}}'\), where \({\mathcal {B}}'\) is the result of adding arc \(X_{\mathrm{out}}\rightarrow X_{\mathrm{in}}\) to \({\mathcal {B}}\).
Proof
The structure of \({\mathcal {P}}'\) depends on the precedence relationship between \(X_{\mathrm{out}}\) and \(X_{\mathrm{in}}\) in \({\mathcal {P}}\).
- \(X_{\mathrm{out}}\in \mathbf{pred }_{\mathcal {P}}(X_{\mathrm{in}})\): there are no changes in the structure of \({\mathcal {P}}\). \(\forall X_i \in {\mathcal {X}} {\setminus } \{X_{\mathrm{in}}\}\), \(\mathbf{pa }_{\mathcal {B}'}(X_i) = \mathbf{pa }_{\mathcal {B}}(X_i)\) and \(\mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i)\), so \(X_i\) is sound. \(X_{\mathrm{in}}\) is also sound because \(\mathbf{pa }_{\mathcal {B}'}(X_{\mathrm{in}}) = \mathbf{pa }_{\mathcal {B}}(X_{\mathrm{in}}) \cup \{X_{\mathrm{out}}\}\), \(\mathbf{pred }_{\mathcal {P}'}(X_{\mathrm{in}}) = \mathbf{pred }_{\mathcal {P}}(X_{\mathrm{in}})\) and \(X_{\mathrm{out}}\in \mathbf{pred }_{\mathcal {P}}(X_{\mathrm{in}})\).
- \(X_{\mathrm{in}}\in \mathbf{pred }_{\mathcal {P}}(X_{\mathrm{out}})\): the nodes that are not descendants of \(X_{\mathrm{in}}\) in \({\mathcal {P}}\) do not change. \(\forall X_i \in {\mathcal {X}} {\setminus } (\mathbf{desc }_{\mathcal {P}}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\})\), \(\mathbf{pa }_{\mathcal {B}'}(X_i) = \mathbf{pa }_{\mathcal {B}}(X_i)\) and \(\mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i)\). Thus, \(X_i\) is sound. \(X_{\mathrm{out}}\) and its descendants in \({\mathcal {P}}'\) that are not descendants of \(X_{\mathrm{in}}\) have fewer predecessors in \({\mathcal {P}}'\) than in \({\mathcal {P}}\). \(\forall X_i \in (\mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{out}}) \cup \{X_{\mathrm{out}}\}) {\setminus } (\mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\})\), since \(\mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i) {\setminus } (\mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\})\), \(\mathbf{pa }_{\mathcal {B}'}(X_i) = \mathbf{pa }_{\mathcal {B}}(X_i)\) and \(\mathbf{pa }_{\mathcal {B}'}(X_i) \cap (\mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\}) = \varnothing \), \(X_i\) is sound. Finally, \(X_{\mathrm{in}}\) gains \(X_{\mathrm{out}}\) as a predecessor in \({\mathcal {P}}'\). \(\forall X_i \in \mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\}\), \(\mathbf{pred }_{\mathcal {P}'}(X_i) \supseteq \mathbf{pred }_{\mathcal {P}}(X_i) \cup \{X_{\mathrm{out}}\}\) and \(\mathbf{pa }_{\mathcal {B}'}(X_i) \subseteq \mathbf{pa }_{\mathcal {B}}(X_i) \cup \{X_{\mathrm{out}}\}\), so \(X_i\) is sound.
- \(X_{\mathrm{out}}\notin \mathbf{pred }_{\mathcal {P}}(X_{\mathrm{in}})\) and \(X_{\mathrm{in}}\notin \mathbf{pred }_{\mathcal {P}}(X_{\mathrm{out}})\): \(X_{\mathrm{out}}\) and its predecessors in \({\mathcal {P}}\) are set as predecessors of \(X_{\mathrm{in}}\) in \({\mathcal {P}}'\). \(\forall X_i \notin \mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\}\), \(\mathbf{pa }_{\mathcal {B}'}(X_i) = \mathbf{pa }_{\mathcal {B}}(X_i)\) and \(\mathbf{pred }_{\mathcal {P}'}(X_i) \supseteq \mathbf{pred }_{\mathcal {P}}(X_i)\). Hence \(X_i\) is sound. \(\forall X_i \in \mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{in}}) \cup \{X_{\mathrm{in}}\}\), \(\mathbf{pa }_{\mathcal {B}'}(X_i) \subseteq \mathbf{pa }_{\mathcal {B}}(X_i) \cup \{X_{\mathrm{out}}\}\) and \(\mathbf{pred }_{\mathcal {P}'}(X_i) \supseteq \mathbf{pred }_{\mathcal {P}}(X_i) \cup \{X_{\mathrm{out}}\}\). Therefore \(X_i\) is sound.
Lemma 2
Let \({\mathcal {P}}\) be a PT over \({\mathcal {X}}_P = \{*\} \cup {\mathcal {X}}\) and \({\mathcal {B}}\) be a Bayesian network over \({\mathcal {X}}\). If \({\mathcal {P}}\) is sound with respect to \({\mathcal {B}}\), then the PT \({\mathcal {P}}'\) obtained after applying removeArc\(({\mathcal {B}},{\mathcal {P}},X_{\mathrm{out}},X_{\mathrm{in}})\) is also sound with respect to \({\mathcal {B}}'\), where \({\mathcal {B}}'\) is the result of removing arc \(X_{\mathrm{out}}\rightarrow X_{\mathrm{in}}\) from \({\mathcal {B}}\).
Proof
\(\forall X_i \in {\mathcal {X}}\), \(\mathbf{pa }_{\mathcal {B}'}(X_i) \subseteq \mathbf{pa }_{\mathcal {B}}(X_i)\) and \(\mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i)\), so \(X_i\) is sound.
Lemma 3
Let \({\mathcal {P}}\) be a PT over \({\mathcal {X}}_P = \{*\} \cup {\mathcal {X}}\) and \({\mathcal {B}}\) be a Bayesian network over \({\mathcal {X}}\). If \({\mathcal {P}}\) is sound with respect to \({\mathcal {B}}\), then the PT \({\mathcal {P}}'\) obtained after applying reverseArc\(({\mathcal {B}},{\mathcal {P}},X_{\mathrm{out}},X_{\mathrm{in}})\) is also sound with respect to \({\mathcal {B}}'\), where \({\mathcal {B}}'\) is the result of reversing arc \(X_{\mathrm{out}}\rightarrow X_{\mathrm{in}}\) in \({\mathcal {B}}\), and \(X_{\mathrm{in}}\rightarrow X_{\mathrm{out}}\) does not produce a cycle in \({\mathcal {B}}'\).
Proof
We can describe the reversion of arc \(X_{\mathrm{out}}\rightarrow X_{\mathrm{in}}\) in two steps:
1. \({\mathcal {P}}_1,{\mathcal {B}}_1 \leftarrow \) removeArc\(({\mathcal {B}},{\mathcal {P}},X_{\mathrm{out}},X_{\mathrm{in}})\).
2. \({\mathcal {P}}',{\mathcal {B}}' \leftarrow \) addArc\(({\mathcal {B}}_1,{\mathcal {P}}_1,X_{\mathrm{in}},X_{\mathrm{out}})\).
From Lemma 2 we know that \({\mathcal {P}}_1\) is sound with respect to \({\mathcal {B}}_1\), and from Lemma 1 we know that \({\mathcal {P}}'\) is sound with respect to \({\mathcal {B}}'\).
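The two-step decomposition used in this proof can be mirrored at the BN level. A minimal sketch, assuming the BN is stored as a map from each node to its parent set (the function names `remove_arc`, `add_arc`, and `reverse_arc` are illustrative, not the paper's algorithms):

```python
def remove_arc(bn_parents, x_out, x_in):
    """Delete arc x_out -> x_in from the BN's parent sets."""
    bn_parents[x_in].discard(x_out)

def add_arc(bn_parents, x_out, x_in):
    """Add arc x_out -> x_in, assuming the caller has already
    checked that the addition creates no cycle."""
    bn_parents[x_in].add(x_out)

def reverse_arc(bn_parents, x_out, x_in):
    """Reverse x_out -> x_in as a removal followed by an addition,
    matching steps 1 and 2 of the proof of Lemma 3."""
    remove_arc(bn_parents, x_out, x_in)
    add_arc(bn_parents, x_in, x_out)

# Example: reverse the single arc A -> B.
bn = {"A": set(), "B": {"A"}}
reverse_arc(bn, "A", "B")
print(bn)  # {'A': {'B'}, 'B': set()}
```

The design choice mirrors the proof: because reversal is expressed purely as remove-then-add, its soundness follows from the soundness of the two primitive operations, with no separate argument needed.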
Lemma 4
Let \({\mathcal {P}}\) be a PT over \({\mathcal {X}}_P = \{*\} \cup {\mathcal {X}}\) and \({\mathcal {B}}\) be a Bayesian network over \({\mathcal {X}}\). If \({\mathcal {P}}\) is sound with respect to \({\mathcal {B}}\), then the PT \({\mathcal {P}}'\) obtained after applying pushUpNode\(({\mathcal {B}},{\mathcal {P}},X_{\mathrm{opt}})\) is also sound with respect to \({\mathcal {B}}\).
Proof
Let \({\mathcal {D}}_{\mathrm{opt}}= (\mathbf{desc }_{\mathcal {P}'}(X_{\mathrm{opt}}) \cup \{X_{\mathrm{opt}}\}) {\setminus } (\mathbf{desc }_{\mathcal {P}'}(X_p) \cup \{X_p\})\), and \({\mathcal {D}}_p= \mathbf{desc }_{\mathcal {P}'}(X_p) \cup \{X_p\}\).
\(\forall X_i \in {\mathcal {X}} {\setminus } ({\mathcal {D}}_{\mathrm{opt}}\cup {\mathcal {D}}_p)\), \(\mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i)\). Therefore, \(X_i\) is sound.
\(\forall X_i \in {\mathcal {D}}_{\mathrm{opt}}, \mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i) {\setminus } \{X_p\}\). Given that \(X_i \in {\mathcal {D}}_{\mathrm{opt}}\) only if \(X_i \notin \mathbf{desc }_{\mathcal {B}}(X_p)\), then \(X_i\) is sound.
The nodes in \({\mathcal {D}}_p\) may have \(X_{\mathrm{opt}}\) as a predecessor in \({\mathcal {P}}'\), depending on whether \({\mathcal {D}}_p\) contains descendants of \(X_{\mathrm{opt}}\) in \({\mathcal {B}}\).
If \({\mathcal {D}}_p\cap \mathbf{desc }_{\mathcal {B}}(X_{\mathrm{opt}}) \ne \varnothing \), \(\forall X_i \in {\mathcal {D}}_p\), \(\mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i) \cup \{X_{\mathrm{opt}}\}\), so \(X_i\) is sound. Otherwise, \(\forall X_i \in {\mathcal {D}}_p, \mathbf{pred }_{\mathcal {P}'}(X_i) = \mathbf{pred }_{\mathcal {P}}(X_i) {\setminus } \{X_{\mathrm{opt}}\}\), and given that \(X_{\mathrm{opt}}\notin \mathbf{pred }_{\mathcal {B}}(X_i)\), \(X_i\) is sound.
Theorem 1
Let \({\mathcal {P}}\) be a sound PT with respect to a BN \({\mathcal {B}}\), and let \({\mathcal {A}}\) be an algorithm that receives \({\mathcal {P}}\) and \({\mathcal {B}}\) and obtains a new PT \({\mathcal {P}}'\) and BN \({\mathcal {B}}'\). If every change in \({\mathcal {P}}\) and \({\mathcal {B}}\) made by \({\mathcal {A}}\) corresponds to applying Algorithms 1–4, then \({\mathcal {P}}'\) is sound with respect to \({\mathcal {B}}'\).
Proof
Algorithm \({\mathcal {A}}\) obtains \({\mathcal {P}}'\) and \({\mathcal {B}}'\) from \({\mathcal {P}}\) and \({\mathcal {B}}\) through a sequence of changes, each produced by one of Algorithms 1–4. Since \({\mathcal {P}}\) is sound for \({\mathcal {B}}\), and Lemmas 1–4 ensure that each of Algorithms 1–4 returns a PT \({\mathcal {P}}_1\) and a BN \({\mathcal {B}}_1\) such that \({\mathcal {P}}_1\) is sound for \({\mathcal {B}}_1\), the result of applying the sequence of changes in \({\mathcal {A}}\) is a PT \({\mathcal {P}}'\) and a BN \({\mathcal {B}}'\) where \({\mathcal {P}}'\) is sound for \({\mathcal {B}}'\).
Cite this article
Benjumeda, M., Larrañaga, P. & Bielza, C. Learning Bayesian networks with low inference complexity. Prog Artif Intell 5, 15–26 (2016). https://doi.org/10.1007/s13748-015-0070-0