Learning Bayesian Networks from Big Data with Greedy Search: Computational Complexity and Efficient Implementation

Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications ("small n, large p") in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called "big data") have become increasingly common; and these data sets are not necessarily high-dimensional, sometimes having only a few tens of variables depending on the application. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed-form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves speed; and we confirm that it improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman (2000). We demonstrate these results on large real-world environmental and epidemiological data; and on reference data sets available from public repositories.


Introduction
Bayesian networks (BNs; Pearl, 1988) are a class of graphical models defined over a set of random variables X = {X_1, ..., X_N}, each describing some quantity of interest, that are associated with the nodes of a directed acyclic graph (DAG) G. (Variables and the corresponding nodes are often referred to interchangeably.) Arcs in G express direct dependence relationships between the variables in X, with graphical separation in G implying conditional independence in probability. As a result, G induces the factorisation

P(X | G, Θ) = ∏_{i=1}^{N} P(X_i | Π_{X_i}, Θ_{X_i}),    (1)

in which the joint probability distribution of X (with parameters Θ) decomposes into one local distribution for each X_i (with parameters Θ_{X_i}, ⋃_i Θ_{X_i} = Θ) conditional on its parents Π_{X_i}.
While in principle there are many possible choices for the distribution of X, the literature has focused mostly on three cases. Discrete BNs (Heckerman et al, 1995) assume that both X and the X_i are multinomial random variables. Local distributions take the form

X_i | Π_{X_i} ~ Mul(π_{ik|j}),    π_{ik|j} = P(X_i = k | Π_{X_i} = j);

their parameters are the conditional probabilities of X_i given each configuration of the values of its parents, usually represented as a conditional probability table for each X_i. Gaussian BNs (GBNs; Geiger and Heckerman, 1994) model X with a multivariate normal random variable and assume that the X_i are univariate normals linked by linear dependencies. The parameters of the local distributions can be equivalently written (Weatherburn, 1961) as the partial Pearson correlations ρ_{X_i, X_j | Π_{X_i} \ X_j} between X_i and each parent X_j given the rest of the parents; or as the coefficients β_{X_i} of the linear regression model

X_i = μ_{X_i} + Π_{X_i} β_{X_i} + ε_{X_i},    ε_{X_i} ~ N(0, σ²_{X_i}).

Finally, conditional linear Gaussian BNs (CLGBNs; Lauritzen and Wermuth, 1989) combine discrete and continuous random variables in a mixture model: discrete X_i are only allowed to have discrete parents (denoted ∆_{X_i}) and are assumed to follow a multinomial distribution parameterised with conditional probability tables; continuous X_i are allowed to have both discrete and continuous parents (denoted Γ_{X_i}, ∆_{X_i} ∪ Γ_{X_i} = Π_{X_i}), and their local distributions are

X_i | Π_{X_i} = μ_{X_i, δ} + Γ_{X_i} β_{X_i, δ} + ε_{X_i, δ},    δ ∈ Val(∆_{X_i}),

which can be written as a mixture of linear regressions against the continuous parents, with one component for each configuration δ_{X_i} ∈ Val(∆_{X_i}) of the discrete parents. If X_i has no discrete parents, the mixture reverts to a single linear regression.
Other distributional assumptions, such as mixtures of truncated exponentials (Moral et al, 2001) or copulas (Elidan, 2010), have been proposed in the literature but have seen less widespread adoption due to the lack of exact conditional inference and simple closed-form estimators.
The task of learning a BN from a data set D containing n observations is performed in two steps in an inherently Bayesian fashion:

P(G, Θ | D) = P(G | D) · P(Θ | G, D).
Structure learning consists in finding the DAG G that encodes the dependence structure of the data, thus maximising P(G | D) or some alternative goodness-of-fit measure; parameter learning consists in estimating the parameters Θ given the G obtained from structure learning. If we assume that parameters in different local distributions are independent (Heckerman et al, 1995), we can perform parameter learning independently for each node following (1), because

P(Θ | G, D) = ∏_{i=1}^{N} P(Θ_{X_i} | Π_{X_i}, D).

Furthermore, if G is sufficiently sparse each node will have a small number of parents, and X_i | Π_{X_i} will have a low-dimensional parameter space, making parameter learning computationally efficient.
On the other hand, structure learning is well known to be both NP-hard (Chickering and Heckerman, 1994) and NP-complete (Chickering, 1996), even under unrealistically favourable conditions such as the availability of an independence and inference oracle (Chickering et al, 2004). This is despite the fact that, again following (1), we can decompose the marginal likelihood P(D | G) into one component for each local distribution,

P(D | G) = ∏_{i=1}^{N} P(X_i | Π_{X_i}),

and despite the fact that each component can be written in closed form for discrete BNs (Heckerman et al, 1995), GBNs (Geiger and Heckerman, 1994) and CLGBNs (Bøttcher, 2001). The same is true if we replace P(D | G) with frequentist goodness-of-fit scores such as BIC (Schwarz, 1978), which is commonly used in structure learning because of its simple expression:

BIC(G, D) = Σ_{i=1}^{N} [ log P(X_i | Π_{X_i}) − (|Θ_{X_i}| / 2) log n ].

Compared to marginal likelihoods, BIC has the advantage that it does not depend on any hyperparameter, while converging to log P(D | G) as n → ∞. These score functions, which we will denote with Score(G, D) in the following, have two important properties:

- they decompose into one component for each local distribution following (1), say Score(G, D) = Σ_{i=1}^{N} Score(X_i, Π_{X_i}, D), thus allowing local computations (decomposability);
- they assign the same score value to DAGs that encode the same probability distributions, and can therefore be grouped into equivalence classes (score equivalence; Chickering, 1995).

Structure learning via score maximisation is performed using general-purpose optimisation techniques, typically heuristics, adapted to take advantage of these properties to increase the speed of structure learning. The most common are greedy search strategies that employ local moves designed to affect only a few local distributions, so that new candidate DAGs can be scored without recomputing the full P(D | G). This can be done either in the space of the DAGs with hill-climbing and tabu search (Russell and Norvig, 2009), or in the space of the equivalence classes with Greedy Equivalent Search (GES; Chickering, 2002). Other options that have been explored in the literature are genetic algorithms (Larranaga et al, 1996) and ant colony optimisation (Campos et al, 2002). Exact maximisation of P(D | G) and BIC has also become feasible for small data sets in recent years thanks to increasingly efficient pruning of the space of the DAGs and tight bounds on the scores (Cussens, 2012; Suzuki, 2017; Scanagatta et al, 2015).
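To make decomposability concrete, here is a minimal sketch (our own illustration, not the authors' implementation) of a BIC-style score for a discrete BN, computed as a sum of per-node components. The data layout and helper names are assumptions for the example; the point is that two DAGs differing by one arc share all components except the one for the node whose parent set changed.

```python
import math
from collections import Counter

def local_bic(child, parents, rows, n_levels=2):
    """BIC component for one node: multinomial log-likelihood of `child`
    given its parents, minus the (|Theta|/2) log n penalty."""
    n = len(rows)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    marg = Counter(tuple(r[p] for p in parents) for r in rows)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    n_params = (n_levels - 1) * n_levels ** len(parents)
    return loglik - 0.5 * n_params * math.log(n)

# Toy sample in which B tends to copy A.
rows = [{"A": a, "B": (a + b) % 2} for a in (0, 1) for b in (0, 0, 0, 0, 0, 1)]

# Score(G, D) decomposes node by node, so the two candidate DAGs share
# the component of A and differ only in the component of B.
empty_dag = local_bic("A", (), rows) + local_bic("B", (), rows)
a_to_b = local_bic("A", (), rows) + local_bic("B", ("A",), rows)
```

Here the arc A → B raises the total score, and the score difference between the two DAGs equals the difference of the two B components alone, which is precisely what local-move heuristics exploit.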
In addition, we note that it is also possible to perform structure learning using conditional independence tests to learn conditional independence constraints from D, and thus identify which arcs should be included in G. The resulting algorithms are called constraint-based algorithms, as opposed to the score-based algorithms we introduced above; for an overview and a comparison of these two approaches see Scutari and Denis (2014). Chickering et al (2004) proved that constraint-based algorithms are also NP-hard for unrestricted DAGs; and they are in fact equivalent to score-based algorithms given a fixed topological ordering when independence constraints are tested with statistical tests related to cross-entropy (Cowell, 2001). For these reasons, in this paper we will focus only on score-based algorithms, while recognising that a similar investigation of constraint-based algorithms represents a promising direction for future research.
The contributions of this paper are: 1. to provide general expressions for the (time) computational complexity of the most common class of score-based structure learning algorithms, greedy search, as a function of the number of variables N , of the sample size n, and of the number of parameters |Θ|; 2. to use these expressions to identify two simple yet effective optimisations to speed up structure learning in "big data" settings.
Both are increasingly important when using BNs in modern machine learning applications, as data sets with large numbers of observations (the so-called "big data") are becoming as common as classic high-dimensional data ("small n, large p", or "small n, large N" using the notation introduced above). The vast majority of complexity and scalability results (Tsamardinos et al, 2006; Kalisch and Bühlmann, 2007; Scanagatta et al, 2015) and computational optimisations (Scutari, 2017) in the literature are derived in the latter setting and implicitly assume n ≪ N; they are not relevant in the former setting, in which n ≫ N. The material is organised as follows. In Section 2 we will present in detail how greedy search can be efficiently implemented thanks to the factorisation in (1), and we will derive its computational complexity as a function of N; this result has been mentioned in many places in the literature, but to the best of our knowledge its derivation has not been described in depth. In Section 3 we will then argue that the resulting expression does not reflect the actual computational complexity of structure learning, particularly in a "big data" setting where n ≫ N; and we will re-derive it in terms of n and |Θ| for the three classes of BNs described above. In Section 4 we will use this new expression to identify two optimisations that can markedly reduce the overall computational complexity of learning GBNs and CLGBNs, by leveraging the availability of closed-form estimates for the parameters of the local distributions and out-of-sample goodness-of-fit scores. Finally, in Section 5 we will demonstrate the improvements in speed produced by the proposed optimisations on simulated and real-world data, as well as their effects on the accuracy of learned structures.

Computational Complexity of Greedy Search
A state-of-the-art implementation of greedy search in the context of BN structure learning is shown in Algorithm 1. It consists of an initialisation phase (steps 1 and 2) followed by a hill climbing search (step 3), which is then optionally refined with tabu search (step 4) and random restarts (step 5). Minor variations of this algorithm have been used in large parts of the literature on BN structure learning with score-based methods (some notable examples are Heckerman et al, 1995; Tsamardinos et al, 2006; Friedman, 1997).
Hill climbing uses local moves (arc additions, deletions and reversals) to explore the neighbourhood of the current candidate DAG G_max in the space of all possible DAGs, in order to find the DAG G (if any) that increases the score Score(G, D) the most over G_max. That is, in each iteration hill climbing tries to delete and reverse each arc in the current optimal DAG G_max; and to add each possible arc that is not already present in G_max. For all the resulting DAGs G* that are acyclic, hill climbing then computes S_{G*} = Score(G*, D); cyclic graphs are discarded. The G* with the highest S_{G*} becomes the new candidate DAG G. If that DAG has a score S_G > S_max, then G becomes the new G_max, S_max is set to S_G, and hill climbing moves to the next iteration.
This greedy search eventually leads to a DAG G_max that has no neighbour with a higher score. Since hill climbing is an optimisation heuristic, there is no theoretical guarantee that G_max is a global maximum. In fact, the space of the DAGs grows super-exponentially in N (Harary and Palmer, 1973); hence multiple local maxima are likely to be present even if the sample size n is large. The problem may be compounded by the existence of score-equivalent DAGs, which by definition have the same S_G for all G falling in the same equivalence class. However, Gillispie and Perlman (2002) have shown that while the number of equivalence classes is of the same order of magnitude as the space of the DAGs, most contain few DAGs and as many as 27.4% contain just a single DAG. This suggests that the impact of score equivalence on hill climbing may be limited. Furthermore, greedy search can easily be modified into GES to work directly in the space of equivalence classes by using a different set of local moves, side-stepping this possible issue entirely.
In order to escape from local maxima, greedy search first tries to move away from G_max by allowing up to t_0 additional local moves. These moves necessarily produce DAGs G* with S_{G*} ≤ S_max; hence the new candidate DAGs are chosen to have the highest S_G even if S_G < S_max. Furthermore, DAGs that have been accepted as candidates in the last t_1 iterations are kept in a list (the tabu list) and are not considered again, in order to guide the search towards unexplored regions of the space of the DAGs. This approach is called tabu search (step 4) and was originally proposed by Glover and Laguna (1998). If a new DAG with a score larger than that of G_max is found in the process, that DAG is taken as the new G_max and greedy search returns to step 3, reverting to hill climbing.
If, on the other hand, no such DAG is found, then greedy search tries again to escape the local maximum G_max for r_0 times with random non-local moves, that is, by moving to a distant location in the space of the DAGs and starting the greedy search again; hence the name random restart (step 5). The non-local moves are typically determined by applying a batch of r_1 randomly-chosen local moves that substantially alter G_max. If the DAG that was perturbed was indeed the global maximum, the assumption is that this second search will also identify it as the optimal DAG, in which case the algorithm terminates.
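The hill-climbing core of this procedure (steps 1-3, without tabu list or restarts) can be sketched as follows. The score function here is a deliberately toy stand-in that rewards arcs of a known target DAG, just to exercise the add/delete/reverse moves and the acyclicity check; in practice Score(G, D) would be BIC or a marginal likelihood. All names are our own illustration.

```python
def is_acyclic(arcs, nodes):
    """Kahn's algorithm: True if the arc set admits a topological order."""
    indeg = {v: 0 for v in nodes}
    for _, child in arcs:
        indeg[child] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for parent, child in arcs:
            if parent == v:
                indeg[child] -= 1
                if indeg[child] == 0:
                    queue.append(child)
    return seen == len(nodes)

def neighbours(arcs, nodes):
    """All acyclic graphs reachable by one arc deletion, reversal or addition."""
    for a in list(arcs):
        yield arcs - {a}                       # deletion (always acyclic)
        rev = (arcs - {a}) | {(a[1], a[0])}    # reversal
        if is_acyclic(rev, nodes):
            yield rev
    for u in nodes:                            # additions
        for v in nodes:
            if u != v and (u, v) not in arcs:
                cand = arcs | {(u, v)}
                if is_acyclic(cand, nodes):
                    yield cand

def hill_climb(nodes, score, start=frozenset()):
    """Greedily move to the best-scoring neighbour until no move improves."""
    current, s_max = start, score(start)
    while True:
        best, best_s = None, s_max
        for cand in neighbours(current, nodes):
            s = score(cand)
            if s > best_s:
                best, best_s = frozenset(cand), s
        if best is None:                       # local maximum reached
            return current, s_max
        current, s_max = best, best_s

target = {("A", "B"), ("B", "C")}              # toy "true" DAG
score = lambda arcs: sum(+1 if a in target else -1 for a in arcs)
dag, s = hill_climb(("A", "B", "C"), score)
```

With this toy score every arc of the target improves the total by one, so the search recovers the target DAG and then stops, since every further move lowers the score.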
We will first study the (time) computational complexity of greedy search under the assumptions that are commonly used in the literature (see, for instance, Tsamardinos et al, 2006; Spirtes et al, 2001): that learning each local distribution, and hence evaluating each candidate DAG, takes a constant O(1) amount of time; and that hill climbing requires O(cN) iterations to reach the local maximum, for some small constant c. Each iteration of step 3 evaluates O(N²) candidate DAGs (one for each possible arc addition, deletion or reversal), so hill climbing is O(cN³). Step 4 performs t_0 more iterations, and is therefore O(t_0 N²). Therefore, the combined time complexity of steps 3 and 4 is O(cN³ + t_0 N²). Each of the random restarts involves changing r_1 arcs, and thus we can expect that it will take r_1 iterations of hill climbing to go back to the same maximum, followed by tabu search; and that happens r_0 times. Overall, this adds O(r_0(r_1 N² + t_0 N²)) to the time complexity, resulting in an overall complexity of

g(N) = O(cN³ + t_0 N² + r_0(r_1 N² + t_0 N²)).    (2)

The leading term is O(cN³) for some small constant c, making greedy search cubic in complexity.
Fortunately, the factorisation in (1) makes it possible to recompute only one or two local distributions for each model comparison:

- Adding or removing an arc only alters one parent set; for instance, adding X_j → X_i means that Π_{X_i} becomes Π_{X_i} ∪ X_j, and therefore P(X_i | Π_{X_i}) should be updated to P(X_i | Π_{X_i} ∪ X_j). All the other local distributions P(X_k | Π_{X_k}), X_k ≠ X_i, are unchanged.
- Reversing an arc X_j → X_i to X_i → X_j means that Π_{X_i} becomes Π_{X_i} \ X_j and Π_{X_j} becomes Π_{X_j} ∪ X_i, and so both P(X_i | Π_{X_i}) and P(X_j | Π_{X_j}) should be updated.
Hence it is possible to dramatically reduce the computational complexity of greedy search by keeping a cache of the score values of the N local distributions for the current G_max,

B_i = Score(X_i, Π_{X_i}^{max}, D),

and of the N² − N score differences

∆_{ij} = Score(X_i, Π_{X_i}^{G*}, D) − Score(X_i, Π_{X_i}^{max}, D),    i ≠ j,

where Π_{X_i}^{max} and Π_{X_i}^{G*} are the parents of X_i in G_max and in the G* obtained by removing (if present) or adding (if not) X_j → X_i to G_max. Only N (for arc additions and deletions) or 2N (for arc reversals) elements of ∆ need to be actually computed in each iteration: those corresponding to the variable(s) whose parent sets were changed by the local move that produced the current G_max in the previous iteration. After that, all possible arc additions, deletions and reversals can be evaluated without any further computational cost by adding or subtracting the appropriate ∆_{ij} from the B_i. Arc reversals can be handled as a combination of arc removals and additions (e.g. reversing X_i → X_j is equivalent to removing X_i → X_j and adding X_j → X_i). As a result, the overall computational complexity of greedy search reduces from O(cN³) to O(cN²). Finally, we briefly note that score equivalence may allow further computational saving, because many local moves will produce new G* that are in the same equivalence class as G_max; and for those moves necessarily ∆_{ij} = 0 (for arc reversals) or ∆_{ij} = ∆_{ji} (for adding or removing X_i → X_j and X_j → X_i).
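The caching scheme can be sketched as below; `local_score` stands in for any decomposable Score(X_i, Π_{X_i}, D) (here a made-up lookup table, purely for illustration), and only the entries of the ∆ cache belonging to the node whose parent set just changed are refreshed.

```python
def refresh_cache(node, parents, nodes, local_score, B, delta):
    """After a local move changes `node`'s parent set, recompute its cached
    score B[node] and the N - 1 deltas for toggling each arc j -> node."""
    B[node] = local_score(node, parents[node])
    for j in nodes:
        if j == node:
            continue
        if j in parents[node]:
            toggled = parents[node] - {j}      # removing the arc j -> node
        else:
            toggled = parents[node] | {j}      # adding the arc j -> node
        delta[node, j] = local_score(node, toggled) - B[node]

# Toy decomposable score: each parent set of C has a fixed, made-up value.
values = {("C", frozenset()): -5.0, ("C", frozenset({"A"})): -3.0,
          ("C", frozenset({"B"})): -4.5, ("C", frozenset({"A", "B"})): -3.5}
local_score = lambda node, ps: values[node, frozenset(ps)]

nodes = ("A", "B", "C")
parents = {"C": frozenset({"A"})}              # current G_max contains A -> C
B, delta = {}, {}
refresh_cache("C", parents, nodes, local_score, B, delta)
```

After this refresh, the score change of removing A → C or adding B → C is read off `delta` without touching the data again.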

Revisiting Computational Complexity
In practice, the computational complexity of estimating a local distribution P(X_i | Π_{X_i}) from data depends on three factors: the characteristics of the data themselves (the sample size n, the number of possible values for categorical variables); the number of parents of X_i in the DAG, that is, |Π_{X_i}|; and the distributional assumptions on P(X_i | Π_{X_i}), which determine the number of parameters |Θ_{X_i}|.

Computational Complexity for Local Distributions
If n is large, or if |Θ_{X_i}| is markedly different for different X_i, different local distributions will take different times to learn, violating the O(1) assumption from the previous section. In other words, if we denote the computational complexity of learning the local distribution of X_i as O(f_{Π_{X_i}}(X_i)), in general O(f_{Π_{X_i}}(X_i)) ≠ O(1).

Nodes in Discrete BNs
In the case of discrete BNs, the conditional probabilities π_{ik|j} associated with each X_i | Π_{X_i} are estimated from the corresponding counts n_{ijk} tallied from {X_i, Π_{X_i}}. Tallying the counts takes one pass over the n observations of X_i and its parents, and normalising them touches each cell of the conditional probability table once, so that

O(f_{Π_{X_i}}(X_i)) = O(n(1 + |Π_{X_i}|) + |Θ_{X_i}|).    (3)
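As a sketch (the function and data layout are our own illustration), tallying the counts n_ijk and normalising them into a conditional probability table can be done in a single pass over the data:

```python
from collections import Counter

def fit_cpt(child, parents, rows):
    """One pass tallies the counts n_ijk; the conditional probabilities
    pi_ik|j are then the counts normalised within each parent configuration."""
    joint, marg = Counter(), Counter()
    for r in rows:                      # O(n (1 + |parents|)) overall
        config = tuple(r[p] for p in parents)
        joint[config, r[child]] += 1
        marg[config] += 1
    return {key: count / marg[key[0]] for key, count in joint.items()}

rows = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}]
cpt = fit_cpt("B", ("A",), rows)        # P(B | A) as {((a,), b): probability}
```

For instance, with the four toy rows above the table gives P(B = 0 | A = 0) = 2/3.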

Nodes in GBNs
In the case of GBNs, the regression coefficients for X_i | Π_{X_i} are usually computed by applying a QR decomposition to the augmented data matrix [1 Π_{X_i}]:

[1 Π_{X_i}] = QR,    so that    R β_{X_i} = Qᵀ x_i,

which can be solved efficiently by backward substitution since R is upper-triangular. This approach is the de facto standard for fitting linear regression models because it is numerically stable even in the presence of correlated Π_{X_i} (see Seber, 2008, for details). Afterwards we can compute the fitted values x̂_i = Π_{X_i} β̂_{X_i} and the residuals x_i − x̂_i, from which σ̂²_{X_i} is estimated. The QR decomposition dominates the computational cost, so that

O(f_{Π_{X_i}}(X_i)) = O(n(1 + |Π_{X_i}|)²).    (4)
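A minimal numpy sketch of this estimator, with the backward substitution written out explicitly (the simulated node, its two parents, and the coefficient values are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
parents = rng.normal(size=(n, 2))                  # the two parents of X_i
x_i = 1.5 + parents @ np.array([2.0, -0.5]) + rng.normal(scale=0.1, size=n)

A = np.column_stack([np.ones(n), parents])         # augmented matrix [1  Pi]
Q, R = np.linalg.qr(A)                             # thin QR: A = QR
rhs = Q.T @ x_i

beta = np.zeros(3)                                 # backward substitution,
for k in range(2, -1, -1):                         # R is upper-triangular
    beta[k] = (rhs[k] - R[k, k + 1:] @ beta[k + 1:]) / R[k, k]

fitted = A @ beta
sigma2 = np.sum((x_i - fitted) ** 2) / (n - 3)     # residual variance
```

The recovered coefficients approximate the simulated intercept and slopes, and the residual variance approximates the simulated noise variance.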

Nodes in CLGBNs
As for CLGBNs, the local distributions of discrete nodes are estimated in the same way as they would be in a discrete BN. For Gaussian nodes, a regression of X_i against the continuous parents Γ_{X_i} is fitted from the n_{δ_{X_i}} observations corresponding to each configuration δ_{X_i} of the discrete parents ∆_{X_i}. Hence the overall computational complexity is

O(f_{Π_{X_i}}(X_i)) = O( Σ_{δ ∈ Val(∆_{X_i})} n_δ (1 + |Γ_{X_i}|)² ).    (5)

If X_i has no discrete parents, then (5) simplifies to (4) since |Val(∆_{X_i})| = 1 and n_{δ_{X_i}} = n.
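The mixture-of-regressions structure can be sketched as below, fitting one regression per configuration of a single binary discrete parent (the simulated node, its parents, and the per-configuration slopes are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6000
delta = rng.integers(0, 2, size=n)        # one binary discrete parent
gamma = rng.normal(size=n)                # one continuous parent
slope = np.where(delta == 0, 1.0, -2.0)   # slope changes with configuration
x_i = 0.5 + slope * gamma + rng.normal(scale=0.3, size=n)

params = {}
for d in np.unique(delta):                # one regression per configuration
    mask = delta == d                     # the n_delta matching observations
    A = np.column_stack([np.ones(mask.sum()), gamma[mask]])
    beta, *_ = np.linalg.lstsq(A, x_i[mask], rcond=None)
    params[int(d)] = beta                 # (intercept, slope) for this delta
```

Each component is a plain linear regression on its n_delta observations, which is why the per-configuration costs in (5) sum back to a cost linear in n.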

Computational Complexity for the Whole BN
Let us now assume, without loss of generality, that the dependence structure of X can be represented by a DAG G with in-degree sequence d_{X_1} ≤ d_{X_2} ≤ ... ≤ d_{X_N}. For a sparse graph containing cN arcs, this means Σ_{i=1}^{N} d_{X_i} = cN. Then, if we make the common choice of starting greedy search from the empty DAG, we can rewrite (2) as

g(N) = Σ_{i=1}^{N} Σ_{j=0}^{d_{X_i}} (N − 1) O(1),    (6)

because parents are added sequentially to each of the N nodes: if a node X_i has d_{X_i} parents, then greedy search will perform d_{X_i} + 1 passes over the candidate parents; and for each pass, N − 1 local distributions will need to be relearned as described in Section 2.
The candidate parents in the (d Xi +1)th pass are evaluated but not included in G, since no further parents are accepted for a node after its parent set Π Xi is complete.
If we drop the assumption from Section 2 that each term in the expression above is O(1), and substitute it with the computational complexity expressions we derived above in this section, then we can write

g(N, d) = Σ_{i=1}^{N} Σ_{j=0}^{d_{X_i}} Σ_{k=1}^{N−1} O(f_{jk}(X_i)),

where f_{jk}(X_i) is the cost of learning the local distribution of X_i with the j parents it currently has in G and a new candidate parent X_k.

Discrete BNs
For discrete BNs, f_{jk}(X_i) takes the form shown in (3): if all variables take at most l values, learning X_i with its j current parents and one candidate parent costs O(n(2 + j) + l^{2+j}), and

g(N, d) = Σ_{i=1}^{N} Σ_{j=0}^{d_{X_i}} (N − 1) O(n(2 + j) + l^{2+j}).    (7)

In the summation over j, the second term is an arithmetic progression and the third term is a geometric progression; if all the d_{X_i} are bounded by a constant b, both are bounded as well and the computational complexity is quadratic in N. Note that this is a stronger sparsity assumption than Σ_{i=1}^{N} d_{X_i} = cN, because it bounds the individual d_{X_i} instead of their sum; and it is commonly used to make challenging learning problems feasible (e.g. Cooper and Herskovits, 1992; Friedman and Koller, 2003). If, on the other hand, G is dense and d_{X_i} = O(N), the geometric progression dominates and O(g(N, d)) is more than exponential in N. In between these two extremes, the distribution of the d_{X_i} determines the actual computational complexity of greedy search for specific types of structures. For instance, if G is a scale-free DAG (Bollobás et al, 2003) the in-degree of most nodes will be small, and we can expect a computational complexity closer to quadratic than exponential if the probability of large in-degrees decays quickly enough compared to N.

GBNs
If we consider the leading term of (4), learning X_i with its j current parents and one candidate parent is O(n(2 + j)²), and we obtain

g(N, d) = Σ_{i=1}^{N} Σ_{j=0}^{d_{X_i}} (N − 1) O(n(2 + j)²).

The summation over j grows cubically in d_{X_i}; hence, if all the d_{X_i} are bounded by a constant, the overall computational complexity is again quadratic in N and linear in n. Unlike the discrete case, it never becomes exponential: even for dense graphs it remains polynomial in N.

CLGBNs
Deriving the computational complexity for CLGBNs is more complicated because of the heterogeneous nature of the nodes. If we consider the leading term of (5) for a BN with M < N Gaussian nodes and N − M multinomial nodes, we have

g(N, d) = Σ_{X_i discrete} Σ_{j=0}^{d_{X_i}} Σ_{k=1}^{N−1} O(f_{jk}(X_i)) + Σ_{X_i Gaussian} Σ_{j=0}^{d_{X_i}} Σ_{k=1}^{N−1} O(f_{jk}(X_i)).

The first term can be computed using (7), since discrete nodes can only have discrete parents and thus cluster in a subgraph of N − M nodes whose in-degrees are completely determined by other discrete nodes; the same considerations we made in Section 3.2.1 apply.

As for the second term, we will first assume that all the D_i discrete parents of each Gaussian node are added before any of its G_i continuous parents, so that d_{X_i} = D_i + G_i. We further separate discrete and continuous nodes in the summations over the N − 1 possible candidates for inclusion in or removal from the current parent set, and substitute (5) for the cost of each regression. Combining all terms, it is not possible to concisely describe the resulting behaviour given the number of data-dependent parameters (D_i, G_i, M), but we can observe that:

- O(g(N, d)) is always linear in the sample size;
- unless the number of discrete parents is bounded for both discrete and continuous nodes, O(g(N, d)) is again more than exponential;
- if the proportion of discrete nodes is small, we can assume that M ≈ N, and O(g(N, d)) is always polynomial.

Greedy Search and Big Data
In Section 3 we have shown that the computational complexity of greedy search scales linearly in n, so greedy search is efficient in the sample size and suitable for learning BNs from big data. However, we have also shown that different distributional assumptions on X and on the d_{X_i} lead to different complexity estimates for the various types of BNs. We will now build on these results to suggest two possible improvements to speed up greedy search.

Speeding Up Low-Order Regressions in GBNs and CLGBNs
Firstly, we suggest that estimating local distributions with few parents can be made more efficient; if we assume that G is sparse, those make up the majority of the local distributions learned by greedy search. As we can see from the summations in (6), the overall number of local distributions learned with j parents is

(N − 1) Σ_{i=1}^{N} 1{d_{X_i} ≥ j},    (8)

that is, it is proportional to the complement of the cumulative distribution function of the d_{X_i}. If that cumulative distribution function decays quickly enough, as in the case of scale-free networks or of networks in which all d_{X_i} ≤ b, (8) suggests that local distributions for which j is small will dominate. Furthermore, in our experience BNs will typically have a weakly connected DAG (that is, one with no isolated nodes); in this case local distributions with j = 0, 1 will need to be learned for all nodes, and those with j = 2 for all non-root nodes.
In the case of GBNs, local distributions for j = 0, 1, 2 parents can be estimated in closed form using simple expressions, as follows:

- j = 0 corresponds to trivial linear regressions of the type X_i = μ_{X_i} + ε_{X_i}, in which the only parameters are the mean and the variance of X_i.
- j = 1 corresponds to simple linear regressions of the type X_i = μ_{X_i} + X_j β_j + ε_{X_i}, for which there are the well-known (e.g. Draper and Smith, 1998) closed-form estimates

  β̂_j = COV(X_i, X_j) / VAR(X_j),    μ̂_{X_i} = x̄_i − β̂_j x̄_j,

  where VAR(·) and COV(·, ·) are empirical variances and covariances.
- For j = 2, we can estimate the parameters of X_i = μ_{X_i} + X_j β_j + X_k β_k + ε_{X_i} using their links to the partial correlations between X_i and its parents; for further details we refer the reader to Weatherburn (1961). Simplifying those expressions leads to

  β̂_j = [VAR(X_k) COV(X_i, X_j) − COV(X_j, X_k) COV(X_i, X_k)] / [VAR(X_j) VAR(X_k) − COV(X_j, X_k)²],

  and symmetrically for β̂_k. Then the intercept and the standard error estimates can be computed as μ̂_{X_i} = x̄_i − β̂_j x̄_j − β̂_k x̄_k and σ̂²_{X_i} = Σ (x_i − x̂_i)² / (n − 3).

All these expressions are based on the variances and the covariances of (X_i, Π_{X_i}), which can be computed in a single pass over the data; hence estimating the local distribution takes

O(n(1 + j)(2 + j)/2)    (9)

time. This is lower than the computational complexity from (4) for the same number of parents, and it suggests that learning low-order local distributions in this way can be markedly faster, thus driving down the overall computational complexity of greedy search without any change in its behaviour. We also find that issues with singularities and numeric stability, which are one of the reasons to use the QR decomposition to estimate the regression coefficients, are easy to diagnose using the variances and the covariances of (X_i, Π_{X_i}); and they can be resolved without increasing computational complexity.
As for CLGBNs, similar reductions in complexity are possible for continuous nodes. Firstly, if a continuous X_i has no discrete parents (∆_{X_i} = ∅), then the computational complexity of learning its local distribution using QR is again given by (4), as we noted in Section 3.1.3, and we are in the same setting we just described for GBNs. Secondly, if X_i has discrete parents (D_{X_i} > 0) and j continuous parents (G_{X_i} = j), the closed-form expressions above can be computed for all the configurations of the discrete parents in

O( Σ_{δ ∈ Val(∆_{X_i})} n_δ (1 + j)(2 + j)/2 ) = O(n(1 + j)(2 + j)/2)    (10)

time, which is lower than that required by the QR estimator from (5) for each of j = 0, 1, 2. Interestingly, we note that (10) does not depend on D_{X_i}, unlike (5): the computational complexity of learning local distributions with G_{X_i} ≤ 2 does not become exponential even if the number of discrete parents is not bounded.

Predicting is Faster than Learning
BNs are often implicitly formulated in a prequential setting (Dawid, 1984), in which a data set D is considered a snapshot of a continuous stream of observations, and BNs are learned from that sample with a focus on predicting future observations. Chickering and Heckerman (2000) called this the "engineering criterion" and set

Score(G, D) = log P(X^{(n+1)} | G, D)    (11)

as the score function to select the optimal G_max, effectively maximising the negative cross-entropy between the "correct" posterior distribution of X^{(n+1)} and that determined by the BN with DAG G. They showed that even for finite sample sizes this criterion produces BNs which are at least as good as the BNs learned using the scores in Section 1, which focus on fitting D well. Allen and Greiner (2000) and later Peña et al (2005) confirmed this fact by embedding k-fold cross-validation into greedy search, obtaining better accuracy in both prediction and network reconstruction. In both papers the use of cross-validation was motivated by the need to make the best use of relatively small samples, for which computational complexity was not a crucial issue.
However, in a big data setting it is both faster and accurate to estimate (11) directly by splitting the data into a training set D_train and a test set D_test, learning the local distributions on D_train, and computing the predictive score on D_test:

Score(G, D) = Σ_{i=1}^{N} log P(X_i^{test} | Π_{X_i}^{test}, Θ̂_{X_i}),    (12)

where the Θ̂_{X_i} are estimated from D_train. Computing (12) is cheap: for a Gaussian node, for instance, the fitted values x̂_i = μ̂_{X_i} + Π_{X_i} β̂_{X_i} only require O(|D_test| d_{X_i}) operations, since β̂_{X_i} is a vector of length d_{X_i}.
Hence, by learning the local distributions only on D_train we lower the overall computational complexity of structure learning, because the per-observation cost of prediction is lower than that of learning; and D_train will still be large enough to obtain good estimates of the parameters Θ_{X_i}. Clearly, the reduction in complexity will be determined by the proportion of D used as D_test. Further speed-ups are possible by using the closed-form results from Section 4.1 to reduce the complexity of learning the local distributions on D_train, combining the effects of all the optimisations proposed in this section.
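A minimal sketch of the train/test predictive score for a single Gaussian node follows (the simulated variables, the 75/25 split, and the helper names are assumptions for the example): parameters are fitted on the training fold only, and the score is the Gaussian log-density of the test fold.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x_j = rng.normal(size=n)
x_i = 0.5 + 2.0 * x_j + rng.normal(size=n)

split = int(0.75 * n)                          # 75% train / 25% test
tr, te = slice(0, split), slice(split, n)

def fit(parent, child):
    """Closed-form simple regression of child on parent (training fold only)."""
    b = np.cov(child, parent)[0, 1] / np.var(parent, ddof=1)
    mu = child.mean() - b * parent.mean()
    s2 = np.mean((child - mu - b * parent) ** 2)
    return mu, b, s2

def predictive_score(mu, b, s2, parent, child):
    """Gaussian log-density of the held-out fold under trained parameters."""
    resid = child - mu - b * parent
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - resid ** 2 / (2 * s2))

mu, b, s2 = fit(x_j[tr], x_i[tr])
with_parent = predictive_score(mu, b, s2, x_j[te], x_i[te])

# The same node scored without its parent: a worse model, lower score.
mu0, s20 = x_i[tr].mean(), x_i[tr].var()
no_parent = predictive_score(mu0, 0.0, s20, x_j[te], x_i[te])
```

Scoring the test fold only requires computing fitted values and residuals, which is the cheap per-observation prediction step described above; here the model that includes the true parent obtains the higher predictive score.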

Benchmarking and Simulations
We demonstrate the reductions in computational complexity discussed in Sections 4.1 and 4.2 using the MEHRA data set from Vitolo et al (2018), which used 50 million observations to explore the interplay between environmental factors, exposure levels to outdoor air pollutants, and health outcomes in the English regions of the United Kingdom between 1981 and 2014.
The CLGBN learned in that paper is shown in Figure 1: it comprises 24 variables describing the concentrations of various air pollutants (O3, PM2.5, PM10, SO2, NO2, CO) measured in 162 monitoring stations, their geographical characteristics (latitude, longitude, altitude, region and zone type), weather (wind speed and direction, temperature, rainfall, solar radiation), demography and mortality rates.
The original analysis was performed with the bnlearn R package (Scutari, 2010), and it was complicated by the fact that many of the variables describing the pollutants had significant amounts of missing data due to the lack of coverage in particular regions and years. Therefore, Vitolo et al (2018) learned the BN using the Structural EM algorithm (Friedman, 1997), which is an application of the Expectation-Maximisation algorithm (EM; Dempster et al, 1977) to BN structure learning that uses hill-climbing to implement the M step.
For the purpose of this paper, and to better illustrate the performance improvements arising from the optimisations from Section 4, we will generate large samples from the CLGBN learned by Vitolo et al (2018), so as to control the sample size and to work with plain hill-climbing on complete data. In particular:

1. we consider sample sizes of 1, 2, 5, 10, 20 and 50 million observations;
2. for each sample size, we generate 5 data sets from the CLGBN;
3. for each sample, we learn back the structure of the BN using hill-climbing with various optimisations:
   - QR: estimating all Gaussian and conditional linear Gaussian local distributions using the QR decomposition, and BIC as the score function;
   - 1P: using the closed-form estimates for the local distributions that involve 0 or 1 parents, and BIC as the score function;
   - 2P: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents, and BIC as the score function;
   - PR: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents for learning the local distributions on 75% of the data, and estimating (12) on the remaining 25%.

For each sample and optimisation, we run hill-climbing 5 times and average the resulting running times to reduce the variability of each estimate. All computations are performed with the bnlearn package in R 3.3.3 on a machine with two Intel Xeon CPU E5-2690 processors (16 cores) and 384GB of RAM.
The running times for 1P, 2P and PR, normalised using those for QR as a baseline, are shown in Figure 2. As expected, computational complexity gradually decreases with the level of optimisation: 1P (green) is ≈ 20% faster than QR, 2P (pink) is ≈ 25% faster and PR (red) is ≈ 60% faster, with minor variations at different sample sizes. PR exhibits a larger variability because of the randomness introduced by the subsampling of D_test, and provides smaller speed-ups for the smallest considered sample size (1 million). Furthermore, we confirm the results from Chickering and Heckerman (2000) on network reconstruction accuracy: BIC does not correctly identify 13 arcs over the 30 learned DAGs, compared to 4 for (12). We also confirm that the observed running times increase linearly in the sample size, as we showed in Section 3.
In order to verify that these speed increases extend beyond the MEHRA data set, we considered five other data sets from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou, 2017) and from the repository of the Data Exposition Session of the Joint Statistical Meetings (JSM). These particular data sets have been chosen because of their large sample sizes and because they have similar characteristics to MEHRA (continuous variables, a few discrete variables, 20-40 nodes overall; see Table 1 for details). However, since their underlying "true DAGs" are unknown, we cannot comment on the accuracy of the DAGs we learn from them. For the same reason, we limit the density of the learned DAGs by restricting each node to have at most

(Figure panels: AIRLINE, GAS, HEPMASS, HIGGS, SUSY; data sources include Baldi et al, 2014.)
Table 1 Data sets from the UCI Machine Learning Repository and the JSM Data Exposition session, with their sample size (n), multinomial nodes (N − M) and Gaussian/conditional Gaussian nodes (M).
5 parents; this produces DAGs with 2.5N to 3.5N arcs depending on the data set. The times for 1P, 2P and PR, again normalised by those for QR, are shown in Figure 3. Overall, we confirm that PR is ≈ 60% faster on average than QR. Compared to MEHRA, 1P and 2P are somewhat slower, with average speed-ups of only ≈ 15% and ≈ 22% respectively. However, it is apparent by comparing Figures 2 and 3 that the reductions in computational complexity are consistent over all the data sets considered in this paper, and hold for a wide range of sample sizes and combinations of discrete and continuous variables.

Conclusions
Learning the structure of BNs from large data sets is a computationally challenging problem. After deriving the computational complexity of the greedy search algorithm in closed form for discrete, Gaussian and conditional linear Gaussian BNs, we studied the implications of the resulting expressions in a "big data" setting in which the sample size is very large, and much larger than the number of nodes in the BN. We found that, contrary to classic characterisations, computational complexity strongly depends on the class of BN being learned, in addition to the sparsity of the underlying DAG. Starting from this result, we suggested two possible optimisations to lower the computational complexity of greedy search and thus speed up the most common algorithm used for BN structure learning. Using a large environmental data set and five data sets from the UCI Machine Learning Repository and the JSM Data Exposition, we showed that it is possible to reduce the running time of greedy search by ≈ 60%.

Algorithm 1
Greedy Search
Input: a data set D from X, an initial DAG G (usually the empty DAG), a score function Score(G, D).
Output: the DAG G_max that maximises Score(G, D).
1. Compute the score of G, S_G = Score(G, D).
2. Set S_max = S_G and G_max = G.
3. Hill climbing: repeat as long as S_max increases:
   (a) for every possible arc addition, deletion or reversal in G_max resulting in a DAG:
       i. compute the score of the modified DAG G*, S_G* = Score(G*, D);
       ii. if S_G* > S_max, set G = G* and S_G = S_G*;
   (b) if S_G > S_max, set S_max = S_G and G_max = G and restart the search from step 3.
4. Tabu search: for up to t0 times:
   (a) repeat step 3, but choose the DAG G with the highest S_G that has not been visited in the last t1 steps;
   (b) if S_G > S_max, set S_0 = S_max = S_G and G_0 = G_max = G and restart the search from step 3.
5. Random restart: for up to r times, perturb G_max with multiple arc additions, deletions and reversals to obtain a new DAG G and:
   (a) set S_0 = S_max = S_G and G_0 = G_max = G and restart the search from step 3;
   (b) if the new G_max is the same as the previous G_max, stop and return G_max.
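The hill-climbing core of greedy search (steps 1-3 of Algorithm 1, without the tabu list or random restarts) can be sketched in Python as follows. This is a minimal illustration of our own with multinomial local distributions and the BIC score, not bnlearn's implementation; all function names are ours.

```python
import math
from itertools import product

def bic_local(data, child, parents):
    """BIC of the multinomial local distribution of `child` given `parents`.
    `data` is a list of dicts mapping variable names to discrete values."""
    counts = {}
    for row in data:
        cfg = counts.setdefault(tuple(row[p] for p in parents), {})
        cfg[row[child]] = cfg.get(row[child], 0) + 1
    loglik = 0.0
    for cfg in counts.values():
        n_cfg = sum(cfg.values())
        for c in cfg.values():
            loglik += c * math.log(c / n_cfg)
    n_params = (len({row[child] for row in data}) - 1)
    for p in parents:
        n_params *= len({row[p] for row in data})
    return loglik - 0.5 * math.log(len(data)) * n_params

def is_acyclic(arcs, nodes):
    """Topological-sort check: repeatedly peel off parentless nodes."""
    pa = {v: {a for a, b in arcs if b == v} for v in nodes}
    remaining = set(nodes)
    while remaining:
        roots = [v for v in remaining if not (pa[v] & remaining)]
        if not roots:
            return False
        remaining -= set(roots)
    return True

def score(data, arcs, nodes):
    """Decomposable network score: sum of the local BIC scores."""
    return sum(bic_local(data, v, sorted({a for a, b in arcs if b == v}))
               for v in nodes)

def hill_climb(data, nodes):
    """Steps 1-3: greedily apply the best arc addition/deletion/reversal."""
    arcs = set()                          # G0 is the empty DAG
    s_max = score(data, arcs, nodes)
    improved = True
    while improved:
        improved, best = False, None
        candidates = []
        for a, b in product(nodes, nodes):
            if a == b:
                continue
            if (a, b) in arcs:
                candidates.append(arcs - {(a, b)})             # deletion
                candidates.append(arcs - {(a, b)} | {(b, a)})  # reversal
            elif (b, a) not in arcs:
                candidates.append(arcs | {(a, b)})             # addition
        for cand in candidates:
            if not is_acyclic(cand, nodes):
                continue
            s = score(data, cand, nodes)
            if s > s_max + 1e-9:          # strict improvement only
                s_max, best = s, cand
        if best is not None:
            arcs, improved = best, True
    return arcs, s_max
```

On toy data in which B is a deterministic copy of A and C is independent of both, this sketch recovers a single arc between A and B (in either direction, since the two DAGs are score-equivalent).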
Hence, we can see that O(g(N, d)) increases linearly in the sample size n; it is also cubic in the in-degrees d_Xi. We note, however, that even for dense networks (d_Xi = O(N)) computational complexity remains polynomial, O(g(N, d)) ≈ O(nN² · N³/3), which was not the case for discrete BNs. If, on the other hand, G is uniformly sparse and all the d_Xi are bounded by a constant b (d_Xi ≤ b, c ≤ b), then O(g(N, d)) ≈ O(nN² · b³/3), which is quadratic in N.
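These scaling claims can be checked with back-of-the-envelope arithmetic on the dominant term. The helper below is an illustration of our own (constants dropped, so only ratios between costs are meaningful), not part of the paper's code:

```python
def gbn_cost(n, N, b):
    """Dominant term of the GBN greedy-search cost under the sparse
    bound d_Xi <= b: O(g(N, d)) ~ n * N**2 * b**3 / 3.
    Absolute values are meaningless; ratios illustrate the scaling."""
    return n * N ** 2 * b ** 3 / 3

# Linear in n: doubling the sample size doubles the cost.
ratio_n = gbn_cost(2_000_000, 24, 5) / gbn_cost(1_000_000, 24, 5)
# Quadratic in N: doubling the number of nodes quadruples the cost.
ratio_N = gbn_cost(1_000_000, 48, 5) / gbn_cost(1_000_000, 24, 5)
```

With N = 24 (as in MEHRA) and b = 5 (the in-degree cap used in Section 5), the cost grows strictly linearly in n, matching the observed running times.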

Fig. 1 Conditional linear Gaussian BN from Vitolo et al (2018). Yellow nodes are multinomial, blue nodes are Gaussian, and green nodes are conditional linear Gaussian.

Fig. 2 Running times for the MEHRA data set, normalised using the baseline implementation based on the QR decomposition (blue), for 1P (green), 2P (pink) and PR (red). Bars represent 95% confidence intervals. Average running times are reported for QR.
In steps 1 and 2, greedy search computes all the N local distributions for G_0. In step 3, each iteration tries all possible arc additions, deletions and reversals. Since there are O(N²) possible arcs in a DAG with N nodes, this requires O(N²) model comparisons. If we assume G_0 is the empty DAG (that is, a DAG with no arcs), hill climbing will gradually add all the arcs in G_REF, one in each iteration. Assuming G_REF is sparse, and assuming that arcs are removed or reversed a negligible number of times, the overall computational complexity of hill climbing is then O(cN³) model comparisons.

For discrete BNs, the conditional probabilities associated with each X_i | Π_Xi are computed from the corresponding counts n_ijk tallied from {X_i, Π_Xi}; hence estimating them takes O(n(1 + |Π_Xi|)) time. Computing the marginal counts for each configuration of Π_Xi then takes O(|Θ_Xi|) time; assuming that each discrete variable takes at most l values, |Θ_Xi| ≤ l^(1 + |Π_Xi|).

The predictive score can be computed directly by splitting the data into a training and a test set and computing

Score(G, D) = log P(D_test | G, Θ, D_train);  (12)

that is, we learn the local distributions on D_train and we estimate the probability of D_test. As is the case for many other models (e.g., deep neural networks; Goodfellow et al, 2016), we note that prediction is computationally much cheaper than learning. In the case of BNs, computing (12) is:
- O(N|D_test|) for discrete BNs, because we just have to perform an O(1) look-up to collect the relevant conditional probability for each node and observation;
- O(cN|D_test|) for GBNs and CLGBNs, because for each node and observation we need to compute a linear combination of the values of its (at most c) parents.

Fig. 3 Running times for the data sets in Table 1, normalised using the baseline implementation based on the QR decomposition (blue), for 1P (green), 2P (pink) and PR (red). Bars represent 95% confidence intervals. Average running times are reported for QR.
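The predictive score (12) can be illustrated in its simplest instance, a single Gaussian node with no parents: learn the mean and variance on D_train, then sum the log-likelihood over D_test. The sketch below is a toy of our own (names and the 75%/25% split, as in the PR optimisation, are our choices), not the CLGBN machinery used in the paper:

```python
import math
import random

def gaussian_loglik(x, mu, sigma2):
    """Log-density of N(mu, sigma2) at x."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)

def predictive_score(samples, train_frac=0.75, seed=1):
    """Predictive goodness of fit in the spirit of (12) for one Gaussian
    node with no parents: fit on D_train, evaluate on D_test."""
    rng = random.Random(seed)        # fixed seed: reproducible subsampling
    data = samples[:]
    rng.shuffle(data)
    split = int(train_frac * len(data))
    train, test = data[:split], data[split:]
    mu = sum(train) / len(train)
    sigma2 = sum((x - mu) ** 2 for x in train) / len(train)
    return sum(gaussian_loglik(x, mu, sigma2) for x in test)

score_12 = predictive_score([float(i % 7) for i in range(200)])
```

As the text notes, evaluating (12) only requires one density look-up (or one linear combination of parent values) per node and test observation, which is why it is so much cheaper than refitting the local distributions; the random subsampling of D_test is also the source of the extra variability observed for PR.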