Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation
Abstract
Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications ("small n, large p") in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called "big data") have become increasingly common; and these data sets are not necessarily high-dimensional, sometimes having only a few tens of variables depending on the application. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed-form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves speed; and we confirm that it improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman (Stat Comput 10: 55–62, 2000). We demonstrate these results on large real-world environmental and epidemiological data; and on reference data sets available from public repositories.
Keywords
Bayesian networks · Structure learning · Big data · Computational complexity
1 Introduction

discrete \(X_i\) are only allowed to have discrete parents (denoted \(\varDelta _{X_i}\)), and are assumed to follow a multinomial distribution parameterised with conditional probability tables;
continuous \(X_i\) are allowed to have both discrete and continuous parents (denoted \(\varGamma _{X_i}\), with \(\varDelta _{X_i}\cup \varGamma _{X_i}= \varPi _{X_i}\)), and their local distributions are$$\begin{aligned} X_i \mid \varPi _{X_i}\sim N \left( \mu _{X_i, \delta _{X_i}} + \varGamma _{X_i}\varvec{\beta }_{X_i, \delta _{X_i}}, \sigma ^2_{X_i, \delta _{X_i}}\right) , \end{aligned}$$which can be written as a mixture of linear regressions$$\begin{aligned} X_i= & {} \mu _{X_i, \delta _{X_i}} + \varGamma _{X_i}\varvec{\beta }_{X_i, \delta _{X_i}} + \varepsilon _{X_i, \delta _{X_i}}, \\&\varepsilon _{X_i, \delta _{X_i}} \sim N \left( 0, \sigma ^2_{X_i, \delta _{X_i}}\right) , \end{aligned}$$against the continuous parents, with one component for each configuration \(\delta _{X_i}\in \mathrm {Val}(\varDelta _{X_i})\) of the discrete parents. If \(X_i\) has no discrete parents, the mixture reverts to a single linear regression.
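As a concrete illustration of the mixture-of-regressions form above, the log-density of a conditional linear Gaussian node can be sketched in a few lines of Python; the function name `clg_logpdf` and the toy parameter values are hypothetical, not taken from the paper:

```python
import math

def clg_logpdf(x, delta, gamma, params):
    """Log-density of a conditional linear Gaussian node X_i: one Gaussian
    linear regression component per discrete-parent configuration `delta`,
    with `gamma` holding the values of the continuous parents."""
    mu, beta, sigma2 = params[delta]
    mean = mu + sum(b * g for b, g in zip(beta, gamma))
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mean) ** 2 / sigma2)

# Hypothetical parameters: one (mu, beta, sigma2) triple per configuration
# of a single binary discrete parent.
params = {
    ("yes",): (1.0, [0.5], 2.0),
    ("no",):  (0.0, [1.5], 1.0),
}
print(clg_logpdf(2.0, ("yes",), [2.0], params))
```

Each discrete-parent configuration selects its own regression coefficients and variance, so the conditional density is a single Gaussian once \(\delta _{X_i}\) is fixed.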
they decompose into one component for each local distribution following (1), say$$\begin{aligned} \mathrm {Score}(\mathcal {G}, \mathcal {D}) = \sum _{i = 1}^{N}\mathrm {Score}(X_i, \varPi _{X_i}, \mathcal {D}), \end{aligned}$$thus allowing local computations (decomposability);

they assign the same score value to DAGs that encode the same probability distributions and can therefore be grouped into equivalence classes (score equivalence; Chickering 1995).^{2}
In addition, we note that it is also possible to perform structure learning using conditional independence tests to learn conditional independence constraints from \(\mathcal {D}\), and thus identify which arcs should be included in \(\mathcal {G}\). The resulting algorithms are called constraint-based algorithms, as opposed to the score-based algorithms we introduced above; for an overview and a comparison of these two approaches see Scutari and Denis (2014). Chickering et al. (2004) proved that constraint-based algorithms are also NP-hard for unrestricted DAGs; and they are in fact equivalent to score-based algorithms given a fixed topological ordering when independence constraints are tested with statistical tests related to cross-entropy (Cowell 2001). For these reasons, in this paper we will focus only on score-based algorithms while recognising that a similar investigation of constraint-based algorithms represents a promising direction for future research.
 1.
to provide general expressions for the (time) computational complexity of the most common class of score-based structure learning algorithms, greedy search, as a function of the number of variables N, of the sample size n, and of the number of parameters \(\varTheta \);
 2.
to use these expressions to identify two simple yet effective optimisations to speed up structure learning in “big data” settings in which \(n \gg N\).
The material is organised as follows. In Sect. 2, we will present in detail how greedy search can be efficiently implemented thanks to the factorisation in (1), and we will derive its computational complexity as a function of N; this result has been mentioned in many places in the literature, but to the best of our knowledge its derivation has not been described in depth. In Sect. 3, we will then argue that the resulting expression does not reflect the actual computational complexity of structure learning, particularly in a "big data" setting where \(n \gg N\); and we will re-derive it in terms of n and \(\varTheta \) for the three classes of BNs described above. In Sect. 4, we will use this new expression to identify two optimisations that can markedly improve the overall speed of learning GBNs and CLGBNs by leveraging the availability of closed-form estimates for the parameters of the local distributions and out-of-sample goodness-of-fit scores. Finally, in Sect. 5 we will demonstrate the improvements in speed produced by the proposed optimisations on simulated and real-world data, as well as their effects on the accuracy of learned structures.
2 Computational complexity of greedy search
A state-of-the-art implementation of greedy search in the context of BN structure learning is shown in Algorithm 1. It consists of an initialisation phase (steps 1 and 2) followed by a hill climbing search (step 3), which is then optionally refined with tabu search (step 4) and random restarts (step 5). Minor variations of this algorithm have been used in large parts of the literature on BN structure learning with score-based methods (some notable examples are Heckerman et al. 1995; Tsamardinos et al. 2006; Friedman 1997).
Hill climbing uses local moves (arc additions, deletions and reversals) to explore the neighbourhood of the current candidate DAG \(\mathcal {G}_{ max }\) in the space of all possible DAGs in order to find the DAG \(\mathcal {G}\) (if any) that increases the score \(\mathrm {Score}(\mathcal {G}, \mathcal {D})\) the most over \(\mathcal {G}_{ max }\). That is, in each iteration hill climbing tries to delete and reverse each arc in the current optimal DAG \(\mathcal {G}_{ max }\); and to add each possible arc that is not already present in \(\mathcal {G}_{ max }\). For all the resulting DAGs \(\mathcal {G}^*\) that are acyclic, hill climbing then computes \(S_{\mathcal {G}^*} = \mathrm {Score}(\mathcal {G}^*, \mathcal {D})\); cyclic graphs are discarded. The \(\mathcal {G}^*\) with the highest \(S_{\mathcal {G}^*}\) becomes the new candidate DAG \(\mathcal {G}\). If that DAG has a score \(S_{\mathcal {G}} > S_{ max }\) then \(\mathcal {G}\) becomes the new \(\mathcal {G}_{ max }\), \(S_{ max }\) will be set to \(S_{\mathcal {G}}\), and hill climbing will move to the next iteration.
This greedy search eventually leads to a DAG \(\mathcal {G}_{ max }\) that has no neighbour with a higher score. Since hill climbing is an optimisation heuristic, there is no theoretical guarantee that \(\mathcal {G}_{ max }\) is a global maximum. In fact, the space of the DAGs grows super-exponentially in N (Harary and Palmer 1973); hence, multiple local maxima are likely present even if the sample size n is large. The problem may be compounded by the existence of score-equivalent DAGs, which by definition have the same \(S_{\mathcal {G}}\) for all the \(\mathcal {G}\) falling in the same equivalence class. However, Gillispie and Perlman (2002) have shown that while the number of equivalence classes is of the same order of magnitude as the space of the DAGs, most contain few DAGs and as many as \(27.4\%\) contain just a single DAG. This suggests that the impact of score equivalence on hill climbing may be limited. Furthermore, greedy search can be easily modified into GES to work directly in the space of equivalence classes by using a different set of local moves, sidestepping this possible issue entirely.
In order to escape from local maxima, greedy search first tries to move away from \(\mathcal {G}_{ max }\) by allowing up to \(t_0\) additional local moves. These moves necessarily produce DAGs \(\mathcal {G}^*\) with \(S_{\mathcal {G}^*} \leqslant S_{ max }\); hence, the new candidate DAGs are chosen to have the highest \(S_\mathcal {G}\) even if \(S_\mathcal {G}< S_{ max }\). Furthermore, DAGs that have been accepted as candidates in the last \(t_1\) iterations are kept in a list (the tabu list) and are not considered again in order to guide the search towards unexplored regions of the space of the DAGs. This approach is called tabu search (step 4) and was originally proposed by Glover and Laguna (1998). If a new DAG with a score larger than \(\mathcal {G}_{ max }\) is found in the process, that DAG is taken as the new \(\mathcal {G}_{ max }\) and greedy search returns to step 3, reverting to hill climbing.
If, on the other hand, no such DAG is found then greedy search tries again to escape the local maximum \(\mathcal {G}_{ max }\) for \(r_0\) times with random non-local moves, that is, by moving to a distant location in the space of the DAGs and starting the greedy search again; hence, the name random restart (step 5). The non-local moves are typically determined by applying a batch of \(r_1\) randomly chosen local moves that substantially alter \(\mathcal {G}_{ max }\). If the DAG that was perturbed was indeed the global maximum, the assumption is that this second search will also identify it as the optimal DAG, in which case the algorithm terminates.
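The hill-climbing core of this procedure can be sketched compactly. The sketch below covers arc additions and deletions only (reversals, tabu lists and random restarts are omitted for brevity), and the toy score at the end is a hypothetical stand-in for BIC or a marginal likelihood:

```python
import itertools

def is_acyclic(nodes, arcs):
    """Kahn's algorithm: a directed graph is acyclic iff every node can be
    removed by repeatedly deleting nodes with no incoming arcs."""
    indeg = {v: 0 for v in nodes}
    for _, v in arcs:
        indeg[v] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for a, b in arcs:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return seen == len(nodes)

def hill_climb(nodes, score, max_iter=100):
    """Greedy hill climbing over DAGs: in each iteration, try every arc
    addition or deletion, discard cyclic candidates, and move to the best
    scoring neighbour until no move improves the current score."""
    arcs, best = set(), score(set())
    for _ in range(max_iter):
        neighbours = []
        for u, v in itertools.permutations(nodes, 2):
            new = arcs - {(u, v)} if (u, v) in arcs else arcs | {(u, v)}
            if is_acyclic(nodes, new):      # cyclic graphs are discarded
                neighbours.append((score(new), new))
        s, g = max(neighbours, key=lambda t: t[0])
        if s <= best:
            break                           # local maximum reached
        best, arcs = s, g
    return arcs, best

# Toy score rewarding the arcs of a hypothetical true DAG.
target = {("A", "B"), ("B", "C")}
arcs, best = hill_climb(["A", "B", "C"],
                        lambda a: sum(1 if e in target else -1 for e in a))
print(arcs == target, best)  # → True 2
```

In a real implementation the score would be decomposable, so each neighbour's score differs from the current one by at most two local terms; the sketch recomputes it from scratch only for clarity.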
 1.
We treat the estimation of each local distribution as an atomic O(1) operation; that is, the (time) complexity of structure learning is measured by the number of estimated local distributions.
 2.
Model comparisons are assumed to always add, delete and reverse arcs correctly with respect to the underlying true model, which happens asymptotically as \(n \rightarrow \infty \) since marginal likelihoods and BIC are globally and locally consistent (Chickering 2002).
 3.
The true DAG \(\mathcal {G}_{\mathrm{REF}}\) is sparse and contains O(cN) arcs, where c is typically assumed to be between 1 and 5.

Adding or removing an arc only alters one parent set; for instance, adding \(X_j \rightarrow X_i\) means that \(\varPi _{X_i}= \varPi _{X_i}\cup X_j\), and therefore \({\text {P}}(X_i \mid \varPi _{X_i})\) should be updated to \({\text {P}}(X_i \mid \varPi _{X_i}\cup X_j)\). All the other local distributions \({\text {P}}(X_k \mid \varPi _{X_k}), X_k \ne X_i\) are unchanged.

Reversing an arc \(X_j \rightarrow X_i\) to \(X_i \rightarrow X_j\) means that \(\varPi _{X_i}= \varPi _{X_i}\setminus X_j\) and \(\varPi _{X_j} = \varPi _{X_j} \cup X_i\), and so both \({\text {P}}(X_i \mid \varPi _{X_i})\) and \({\text {P}}(X_j \mid \varPi _{X_j})\) should be updated.
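The practical consequence of decomposability is that a score cache indexed by (node, parent set) lets each local move trigger at most one or two re-estimations. A minimal sketch, with a hypothetical stand-in for the local score:

```python
def network_score(parents, local_score, cache):
    """Decomposable network score: the sum of per-node local scores. Only
    nodes whose parent set changed since the last call miss the cache, so
    a single arc addition re-estimates a single local distribution."""
    total = 0.0
    for node, pa in parents.items():
        key = (node, frozenset(pa))
        if key not in cache:
            cache[key] = local_score(node, pa)
        total += cache[key]
    return total

calls = []
def local_score(node, pa):
    calls.append(node)            # record each (re-)estimation
    return -float(len(pa))        # toy score penalising larger parent sets

parents = {"A": set(), "B": {"A"}, "C": set()}
cache = {}
network_score(parents, local_score, cache)  # 3 local distributions learned
parents["C"] = {"B"}                        # local move: add arc B -> C
network_score(parents, local_score, cache)  # only X_C is re-estimated
print(calls)
```

After the arc addition only the entry for \(X_C\) is recomputed; the cached scores for \(X_A\) and \(X_B\) are reused.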
3 Revisiting computational complexity

the characteristics of the data themselves (the sample size n, the number of possible values for categorical variables);

the number of parents of \(X_i\) in the DAG, that is, \(|\varPi _{X_i}|\);

the distributional assumptions on \({\text {P}}(X_i \mid \varPi _{X_i})\), which determine the number of parameters \(\varTheta _{X_i}\).
3.1 Computational complexity for local distributions
If n is large, or if \(\varTheta _{X_i}\) is markedly different for different \(X_i\), different local distributions will take different times to learn, violating the O(1) assumption from the previous section. In other words, if we denote the computational complexity of learning the local distribution of \(X_i\) as \(O(f_{\varPi _{X_i}}(X_i))\), we find below that \(O(f_{\varPi _{X_i}}(X_i)) \ne O(1)\).
3.1.1 Nodes in discrete BNs
3.1.2 Nodes in GBNs
3.1.3 Nodes in CLGBNs
3.2 Computational complexity for the whole BN

parents are added sequentially to each of the N nodes;

if a node \(X_i\) has \(d_{X_i}\) parents then greedy search will perform \(d_{X_i}+ 1\) passes over the candidate parents;

for each pass, \(N - 1\) local distributions will need to be relearned as described in Sect. 2.
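Under this counting argument, the total number of estimated local distributions is \(\sum _i (d_{X_i} + 1)(N - 1)\). A quick sketch of the count (the parent counts below are made up for illustration):

```python
def relearned_local_distributions(d):
    """Count the local distributions greedy search estimates under the
    assumptions above: a node ending up with d_i parents triggers
    d_i + 1 passes, and each pass re-learns N - 1 local distributions."""
    N = len(d)
    return sum((d_i + 1) * (N - 1) for d_i in d)

# A sparse DAG with N = 5 nodes and parent counts 0, 1, 1, 2, 1:
print(relearned_local_distributions([0, 1, 1, 2, 1]))  # 10 passes x 4 = 40
```

For a sparse DAG with O(cN) arcs this gives the familiar quadratic-in-N count of local distribution estimates.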
3.2.1 Discrete BNs
3.2.2 GBNs
3.2.3 CLGBNs

\(O(g(N, \mathbf {d}))\) is always linear in the sample size;

unless the number of discrete parents is bounded for both discrete and continuous nodes, \(O(g(N, \mathbf {d}))\) is again more than exponential;

if the proportion of discrete nodes is small, we can assume that \(M \approx N\) and \(O(g(N, \mathbf {d}))\) is always polynomial.
4 Greedy search and big data
In Sect. 3, we have shown that the computational complexity of greedy search scales linearly in n, so greedy search is efficient in the sample size and it is suitable for learning BNs from big data. However, we have also shown that different distributional assumptions on \(\mathbf {X}\) and on the \(d_{X_i}\) lead to different complexity estimates for various types of BNs. We will now build on these results to suggest two possible improvements to speed up greedy search.
4.1 Speeding up low-order regressions in GBNs and CLGBNs
 \(j = 0\) corresponds to trivial linear regressions of the type$$\begin{aligned} X_i = \mu _{X_i} + \varepsilon _{X_i}, \end{aligned}$$in which the only parameters are the mean and the variance of \(X_i\).
 \(j = 1\) corresponds to simple linear regressions of the type$$\begin{aligned} X_i = \mu _{X_i} + X_j \beta _{X_j} + \varepsilon _{X_i}, \end{aligned}$$for which there are the well-known (e.g. Draper and Smith 1998) closed-form estimates$$\begin{aligned} {\hat{\mu }}_{X_i}&= {\bar{x}}_i - {\hat{\beta }}_{X_j}{\bar{x}}_j, \\ {\hat{\beta }}_{X_j}&= \frac{{\text {COV}}(X_i, X_j)}{{\text {VAR}}(X_j)}, \\ {\hat{\sigma }}^2_{X_i}&= \frac{1}{n - 2}(X_i - {\hat{x}}_i)^T (X_i - {\hat{x}}_i), \end{aligned}$$where \({\text {VAR}}(\cdot )\) and \({\text {COV}}(\cdot , \cdot )\) are empirical variances and covariances.
 for \(j = 2\), we can estimate the parameters of$$\begin{aligned} X_i = \mu _{X_i} + X_j \beta _{X_j} + X_k \beta _{X_k} + \varepsilon _{X_i} \end{aligned}$$using their links to partial correlations:$$\begin{aligned} \rho _{X_i X_j \mid X_k}&= \frac{\rho _{X_i X_j} - \rho _{X_i X_k} \rho _{X_j X_k}}{\sqrt{1 - \rho _{X_i X_k}^2}\sqrt{1 - \rho _{X_j X_k}^2}} = \beta _{j} \frac{\sqrt{1 - \rho _{X_j X_k}^2}}{\sqrt{1 - \rho _{X_i X_k}^2}}, \\ \rho _{X_i X_k \mid X_j}&= \beta _{k} \frac{\sqrt{1 - \rho _{X_j X_k}^2}}{\sqrt{1 - \rho _{X_i X_j}^2}}; \end{aligned}$$for further details we refer the reader to Weatherburn (1961). Simplifying these expressions leads to$$\begin{aligned} {\hat{\beta }}_{X_j}&= \frac{1}{d} \big [{\text {VAR}}(X_k){\text {COV}}(X_i, X_j) - {\text {COV}}(X_j, X_k){\text {COV}}(X_i, X_k)\big ], \\ {\hat{\beta }}_{X_k}&= \frac{1}{d} \big [{\text {VAR}}(X_j){\text {COV}}(X_i, X_k) - {\text {COV}}(X_j, X_k){\text {COV}}(X_i, X_j)\big ], \end{aligned}$$with denominator$$\begin{aligned} d = {\text {VAR}}(X_j){\text {VAR}}(X_k) - {\text {COV}}(X_j, X_k)^2. \end{aligned}$$Then, the intercept and the standard error estimates can be computed as$$\begin{aligned} {\hat{\mu }}_{X_i}&= {\bar{x}}_i - {\hat{\beta }}_{X_j}{\bar{x}}_j - {\hat{\beta }}_{X_k}{\bar{x}}_k, \\ {\hat{\sigma }}^2_{X_i}&= \frac{1}{n - 3}(X_i - {\hat{x}}_i)^T (X_i - {\hat{x}}_i). \end{aligned}$$
and it suggests that learning low-order local distributions in this way can be markedly faster, thus driving down the overall computational complexity of greedy search without any change in its behaviour. We also find that issues with singularities and numeric stability, which are one of the reasons to use the QR decomposition to estimate the regression coefficients, are easy to diagnose using the variances and the covariances of \((X_i, \varPi _{X_i})\); and they can be resolved without increasing computational complexity again.
Interestingly, we note that (10) does not depend on \(D_{X_i}\), unlike (5): the computational complexity of learning local distributions with \(|\varGamma _{X_i}| \leqslant 2\) does not become exponential even if the number of discrete parents is not bounded.
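The two-parent closed-form estimates above can be checked numerically. The sketch below computes them from empirical variances and covariances only (the function name and the test data are illustrative; with noiseless data the true coefficients are recovered exactly):

```python
def fit_two_parent_regression(xi, xj, xk):
    """Closed-form estimates for X_i = mu + beta_j X_j + beta_k X_k + eps,
    built from empirical variances and covariances as in the text."""
    n = len(xi)
    def mean(v):
        return sum(v) / n
    def cov(a, b):
        ma, mb = mean(a), mean(b)
        return sum((u - ma) * (w - mb) for u, w in zip(a, b)) / (n - 1)
    d = cov(xj, xj) * cov(xk, xk) - cov(xj, xk) ** 2
    beta_j = (cov(xk, xk) * cov(xi, xj) - cov(xj, xk) * cov(xi, xk)) / d
    beta_k = (cov(xj, xj) * cov(xi, xk) - cov(xj, xk) * cov(xi, xj)) / d
    mu = mean(xi) - beta_j * mean(xj) - beta_k * mean(xk)
    return mu, beta_j, beta_k

# Noiseless data generated from mu = 1, beta_j = 2, beta_k = -0.5 should
# be recovered exactly (the regressors are not collinear).
xj = [0.0, 1.0, 2.0, 3.0, 4.0]
xk = [1.0, 0.0, 2.0, 1.0, 3.0]
xi = [1.0 + 2.0 * a - 0.5 * b for a, b in zip(xj, xk)]
print(fit_two_parent_regression(xi, xj, xk))
```

These expressions are just the 2×2 normal equations solved by Cramer's rule, which is why only five variance/covariance terms are needed instead of a full QR decomposition.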
4.2 Predicting is faster than learning

\(O(N|\mathcal {D}^{test}|)\) for discrete BNs, because we just have to perform an O(1) lookup to collect the relevant conditional probability for each node and observation;

\(O(cN|\mathcal {D}^{test}|)\) for GBNs and CLGBNs, because for each node and observation we need to compute \(\varPi _{X_i}^{(n+1)}{\hat{\beta }}_{X_i}\), where \({\hat{\beta }}_{X_i}\) is a vector of length \(d_{X_i}\).
Hence by learning local distributions only on \(\mathcal {D}^{train}\) we improve the speed of structure learning because the perobservation cost of prediction is lower than that of learning; and \(\mathcal {D}^{train}\) will still be large enough to obtain good estimates of their parameters \(\varTheta _{X_i}\). Clearly, the magnitude of the speedup will be determined by the proportion of \(\mathcal {D}\) used as \(\mathcal {D}^{test}\). Further improvements are possible by using the closedform results from Sect. 4.1 to reduce the complexity of learning local distributions on \(\mathcal {D}^{train}\), combining the effect of all the optimisations proposed in this section.
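As a sketch of this predictive scoring scheme, consider the simplest case of a Gaussian node with no parents: the parameters are estimated on \(\mathcal {D}^{train}\) and the score is the log-likelihood of \(\mathcal {D}^{test}\). The 75%/25% split mirrors the PRED setting used later in the paper; the function name is hypothetical:

```python
import math
import random

def predictive_score(train, test):
    """Fit a trivial Gaussian local distribution (no parents) on the
    training fold, then score the held-out fold by its log-likelihood;
    prediction is O(1) per observation once mu and sigma2 are fixed."""
    n = len(train)
    mu = sum(train) / n
    sigma2 = sum((x - mu) ** 2 for x in train) / (n - 1)
    return sum(-0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)
               for x in test)

random.seed(1)
data = [random.gauss(0, 1) for _ in range(1000)]
split = int(0.75 * len(data))       # 75% train, 25% test
print(predictive_score(data[:split], data[split:]))
```

With parents, the fitted local regression would supply the conditional mean for each test observation, but the per-observation cost of scoring remains a single dot product.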
5 Benchmarking and simulations
We demonstrate the improvements in the speed of structure learning discussed in Sects. 4.1 and 4.2 using the MEHRA data set from Vitolo et al. (2018), which studied 50 million observations to explore the interplay between environmental factors, exposure levels to outdoor air pollutants, and health outcomes in the English regions of the UK between 1981 and 2014. The CLGBN learned in that paper is shown in Fig. 1: It comprises 24 variables describing the concentrations of various air pollutants (O3, PM_{2.5}, PM_{10}, SO_{2}, NO_{2}, CO) measured in 162 monitoring stations, their geographical characteristics (latitude, longitude, altitude, region and zone type), weather (wind speed and direction, temperature, rainfall, solar radiation, boundary layer height), demography and mortality rates.
The original analysis was performed with the bnlearn R package (Scutari 2010), and it was complicated by the fact that many of the variables describing the pollutants had significant amounts of missing data due to the lack of coverage in particular regions and years. Therefore, Vitolo et al. (2018) learned the BN using the Structural EM algorithm (Friedman 1997), which is an application of the expectationmaximisation algorithm (EM; Dempster et al. 1977) to BN structure learning that uses hill climbing to implement the M step.
 1.
we consider sample sizes of 1, 2, 5, 10, 20 and 50 million;
 2.
for each sample size, we generate 5 data sets from the CLGBN;
 3. for each sample, we learn back the structure of the BN with hill climbing, using various optimisations:

QR: estimating all Gaussian and conditional linear Gaussian local distributions using the QR decomposition, and BIC as the score function;

1P: using the closed-form estimates for the local distributions that involve 0 or 1 parents, and BIC as the score function;

2P: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents, and BIC as the score function;

PRED: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents for learning the local distributions on \(75\%\) of the data and estimating (12) on the remaining \(25\%\).

Sums of the SHDs between the network structures learned by BIC and PRED and the structure from Vitolo et al. (2018), for different sample sizes n (in millions)
n  BIC  PRED 

1  11  2 
2  2  1 
5  0  1 
10  0  0 
20  0  0 
50  0  0 
Data sets from the UCI Machine Learning Repository and the JSM Data Exposition session, with their sample size (n), number of multinomial nodes (\(N - M\)) and number of Gaussian/conditional Gaussian nodes (M)
Data  n  \(N - M\)  M  Reference 

AIRLINE  \(53.6 \times 10^6\)  9  19  JSM, the Data Exposition Session (2009) 
GAS  \( 4.2 \times 10^6\)  0  37  UCI ML Repository, Fonollosa et al. (2015) 
HEPMASS  \(10.5 \times 10^6\)  1  28  UCI ML Repository, Baldi et al. (2016) 
HIGGS  \(11.0 \times 10^6\)  1  28  UCI ML Repository, Baldi et al. (2014) 
SUSY  \( 5.0 \times 10^6\)  1  18  UCI ML Repository, Baldi et al. (2014) 
6 Conclusions
Learning the structure of BNs from large data sets is a computationally challenging problem. After deriving the computational complexity of the greedy search algorithm in closed form for discrete, Gaussian and conditional linear Gaussian BNs, we studied the implications of the resulting expressions in a "big data" setting where the sample size is very large, and much larger than the number of nodes in the BN. We found that, contrary to classic characterisations, computational complexity strongly depends on the class of BN being learned in addition to the sparsity of the underlying DAG. Starting from this result, we suggested two possible optimisations for greedy search with the aim of speeding up the most common algorithm used for BN structure learning. Using a large environmental data set and five data sets from the UCI Machine Learning Repository and the JSM Data Exposition, we show that it is possible to reduce the running time of greedy search by \(\approx 60\%\).
Footnotes
 1.
Interestingly, some relaxations of BN structure learning are not NP-hard; see for example Claassen et al. (2013) on learning the structure of causal networks.
 2.
All DAGs in the same equivalence class have the same underlying undirected graph and vstructures (patterns of arcs like \(X_i \rightarrow X_j \leftarrow X_k\), with no arcs between \(X_i\) and \(X_k\)).
References
 Allen, T.V., Greiner, R.: Model selection criteria for learning belief nets: an empirical comparison. In: Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 1047–1054 (2000)
 Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014). https://doi.org/10.1038/ncomms5308
 Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., Whiteson, D.: Parameterized neural networks for high-energy physics. Eur. Phys. J. C 76, 235 (2016)
 Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Directed scale-free graphs. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 132–139 (2003)
 Bøttcher, S.G.: Learning Bayesian networks with mixed variables. In: Proceedings of the 8th International Workshop in Artificial Intelligence and Statistics (2001)
 Campos, L.M.D., Fernández-Luna, J.M., Gámez, J.A., Puerta, J.M.: Ant colony optimization for learning Bayesian networks. Int. J. Approx. Reason. 31(3), 291–311 (2002)
 Chickering, D.M.: A transformational characterization of equivalent Bayesian network structures. In: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 87–98 (1995)
 Chickering, D.M.: Learning Bayesian networks is NP-complete. In: Fisher, D., Lenz, H. (eds.) Learning from Data: Artificial Intelligence and Statistics V, pp. 121–130. Springer, Berlin (1996)
 Chickering, D.M.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2002)
 Chickering, D.M., Heckerman, D.: Learning Bayesian networks is NP-hard. Tech. Rep. MSR-TR-94-17, Microsoft Corporation (1994)
 Chickering, D.M., Heckerman, D.: A comparison of scientific and engineering criteria for Bayesian model selection. Stat. Comput. 10, 55–62 (2000)
 Chickering, D.M., Heckerman, D., Meek, C.: Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res. 5, 1287–1330 (2004)
 Claassen, T., Mooij, J.M., Heskes, T.: Learning sparse causal models is not NP-hard. In: Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pp. 172–181 (2013)
 Cooper, G., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9, 309–347 (1992)
 Cowell, R.: Conditions under which conditional independence and scoring methods lead to identical selection of Bayesian network models. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 91–97 (2001)
 Cussens, J.: Bayesian network learning with cutting planes. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pp. 153–160 (2012)
 Dawid, A.P.: Present position and potential developments: some personal views: statistical theory: the prequential approach. J. R. Stat. Soc. Ser. A 147(2), 278–292 (1984)
 Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
 Dheeru, D., Karra Taniskidou, E.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml
 Draper, N.R., Smith, H.: Applied Regression Analysis, 3rd edn. Wiley, London (1998)
 Elidan, G.: Copula Bayesian networks. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 559–567 (2010)
 Fonollosa, J., Sheik, S., Huerta, R., Marco, S.: Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sens. Actuators B Chem. 215, 618–629 (2015)
 Friedman, N.: Learning belief networks in the presence of missing values and hidden variables. In: Proceedings of the 14th International Conference on Machine Learning (ICML), pp. 125–133 (1997)
 Friedman, N., Koller, D.: Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. 50, 95–125 (2003)
 Geiger, D., Heckerman, D.: Learning Gaussian networks. In: Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pp. 235–243 (1994)
 Gillispie, S., Perlman, M.: The size distribution for Markov equivalence classes of acyclic digraph models. Artif. Intell. 14, 137–155 (2002)
 Glover, F., Laguna, M.: Tabu Search. Springer, Berlin (1998)
 Goldenberg, A., Moore, A.: Tractable learning of large Bayes net structures from sparse data. In: Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 44–52 (2004)
 Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
 Harary, F., Palmer, E.M.: Graphical Enumeration. Academic Press, Edinburgh (1973)
 Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995). Available as Technical Report MSR-TR-94-09
 JSM, the Data Exposition Session: Airline on-time performance (2009). http://statcomputing.org/dataexpo/200/
 Kalisch, M., Bühlmann, P.: Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636 (2007)
 Karan, S., Eichhorn, M., Hurlburt, B., Iraci, G., Zola, J.: Fast counting in machine learning applications. In: Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, pp. 540–549 (2018)
 Larranaga, P., Poza, M., Yurramendi, Y., Murga, R.H., Kuijpers, C.M.H.: Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters. IEEE Trans. Pattern Anal. Mach. Intell. 18(9), 912–926 (1996)
 Lauritzen, S.L., Wermuth, N.: Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Stat. 17(1), 31–57 (1989)
 Moore, A., Lee, M.S.: Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Intell. Res. 8, 67–91 (1998)
 Moral, S., Rumi, R., Salmerón, A.: Mixtures of truncated exponentials in hybrid Bayesian networks. In: Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU), Lecture Notes in Computer Science, vol. 2143, pp. 156–167. Springer (2001)
 Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann (1988)
 Peña, J.M., Björkegren, J., Tegnèr, J.: Learning dynamic Bayesian network models via cross-validation. Pattern Recognit. Lett. 26, 2295–2308 (2005)
 Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall, Englewood Cliffs (2009)
 Scanagatta, M., de Campos, C.P., Corani, G., Zaffalon, M.: Learning Bayesian networks with thousands of variables. Adv. Neural Inf. Process. Syst. 28, 1864–1872 (2015)
 Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
 Scutari, M.: Learning Bayesian networks with the bnlearn R package. J. Stat. Softw. 35(3), 1–22 (2010)
 Scutari, M.: Bayesian network constraint-based structure learning algorithms: parallel and optimised implementations in the bnlearn R package. J. Stat. Softw. 77(2), 1–20 (2017)
 Scutari, M., Denis, J.B.: Bayesian Networks with Examples in R. Chapman & Hall, London (2014)
 Seber, G.A.F.: A Matrix Handbook for Statisticians. Wiley, New York (2008)
 Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press, Cambridge (2001)
 Suzuki, J.: An efficient Bayesian network structure learning strategy. N. Gener. Comput. 35(1), 105–124 (2017)
 Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
 Vitolo, C., Scutari, M., Ghalaieny, M., Tucker, A., Russell, A.: Modelling air pollution, climate and health data using Bayesian networks: a case study of the English regions. Earth and Space Science 5, submitted (2018)
 Weatherburn, C.E.: A First Course in Mathematical Statistics. Cambridge University Press, Cambridge (1961)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.