Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation
Abstract
Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications ("small n, large p") in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called "big data") have become increasingly common; and these data sets are not necessarily high-dimensional, sometimes having only a few tens of variables depending on the application. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed-form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves speed; and we confirm that it improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman (Stat Comput 10: 55–62, 2000). We demonstrate these results on large real-world environmental and epidemiological data; and on reference data sets available from public repositories.
Keywords
Bayesian networks · Structure learning · Big data · Computational complexity
1 Introduction

discrete \(X_i\) are only allowed to have discrete parents (denoted \(\varDelta _{X_i}\)), and are assumed to follow a multinomial distribution parameterised with conditional probability tables;
continuous \(X_i\) are allowed to have both discrete and continuous parents (denoted \(\varGamma _{X_i}\), with \(\varDelta _{X_i}\cup \varGamma _{X_i}= \varPi _{X_i}\)), and their local distributions are$$\begin{aligned} X_i \mid \varPi _{X_i}\sim N \left( \mu _{X_i, \delta _{X_i}} + \varGamma _{X_i}\varvec{\beta }_{X_i, \delta _{X_i}}, \sigma ^2_{X_i, \delta _{X_i}}\right) , \end{aligned}$$which can be written as a mixture of linear regressions$$\begin{aligned} X_i= & {} \mu _{X_i, \delta _{X_i}} + \varGamma _{X_i}\varvec{\beta }_{X_i, \delta _{X_i}} + \varepsilon _{X_i, \delta _{X_i}}, \\&\varepsilon _{X_i, \delta _{X_i}} \sim N \left( 0, \sigma ^2_{X_i, \delta _{X_i}}\right) , \end{aligned}$$against the continuous parents, with one component for each configuration \(\delta _{X_i}\in \mathrm {Val}(\varDelta _{X_i})\) of the discrete parents. If \(X_i\) has no discrete parents, the mixture reverts to a single linear regression.
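As a concrete illustration of the mixture-of-regressions form above, the log-density of a conditional linear Gaussian node can be sketched in a few lines of Python; the function name `clg_logpdf` and the toy parameter values are hypothetical, not taken from the paper:

```python
import math

def clg_logpdf(x, delta, gamma, params):
    """Log-density of a conditional linear Gaussian node X_i: one Gaussian
    linear regression component per discrete-parent configuration `delta`,
    with `gamma` holding the values of the continuous parents."""
    mu, beta, sigma2 = params[delta]
    mean = mu + sum(b * g for b, g in zip(beta, gamma))
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mean) ** 2 / sigma2)

# Hypothetical parameters: one (mu, beta, sigma2) triple per configuration
# of a single binary discrete parent.
params = {
    ("yes",): (1.0, [0.5], 2.0),
    ("no",):  (0.0, [1.5], 1.0),
}
print(clg_logpdf(2.0, ("yes",), [2.0], params))
```

Each discrete-parent configuration selects its own regression coefficients and variance, so the conditional density is a single Gaussian once \(\delta _{X_i}\) is fixed.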
they decompose into one component for each local distribution following (1), say$$\begin{aligned} \mathrm {Score}(\mathcal {G}, \mathcal {D}) = \sum _{i = 1}^{N}\mathrm {Score}(X_i, \varPi _{X_i}, \mathcal {D}), \end{aligned}$$thus allowing local computations (decomposability);

they assign the same score value to DAGs that encode the same probability distributions and can therefore be grouped into equivalence classes (score equivalence; Chickering 1995).^{2}
In addition, we note that it is also possible to perform structure learning using conditional independence tests to learn conditional independence constraints from \(\mathcal {D}\), and thus identify which arcs should be included in \(\mathcal {G}\). The resulting algorithms are called constraint-based algorithms, as opposed to the score-based algorithms we introduced above; for an overview and a comparison of these two approaches see Scutari and Denis (2014). Chickering et al. (2004) proved that constraint-based algorithms are also NP-hard for unrestricted DAGs; and they are in fact equivalent to score-based algorithms given a fixed topological ordering when independence constraints are tested with statistical tests related to cross-entropy (Cowell 2001). For these reasons, in this paper we will focus only on score-based algorithms while recognising that a similar investigation of constraint-based algorithms represents a promising direction for future research.
 1.
to provide general expressions for the (time) computational complexity of the most common class of score-based structure learning algorithms, greedy search, as a function of the number of variables N, of the sample size n, and of the number of parameters \(\varTheta \);
 2.
to use these expressions to identify two simple yet effective optimisations to speed up structure learning in “big data” settings in which \(n \gg N\).
The material is organised as follows. In Sect. 2, we will present in detail how greedy search can be efficiently implemented thanks to the factorisation in (1), and we will derive its computational complexity as a function of N; this result has been mentioned in many places in the literature, but to the best of our knowledge its derivation has not been described in depth. In Sect. 3, we will then argue that the resulting expression does not reflect the actual computational complexity of structure learning, particularly in a "big data" setting where \(n \gg N\); and we will re-derive it in terms of n and \(\varTheta \) for the three classes of BNs described above. In Sect. 4, we will use this new expression to identify two optimisations that can markedly improve the overall speed of learning GBNs and CLGBNs by leveraging the availability of closed-form estimates for the parameters of the local distributions and out-of-sample goodness-of-fit scores. Finally, in Sect. 5 we will demonstrate the improvements in speed produced by the proposed optimisations on simulated and real-world data, as well as their effects on the accuracy of learned structures.
2 Computational complexity of greedy search
A state-of-the-art implementation of greedy search in the context of BN structure learning is shown in Algorithm 1. It consists of an initialisation phase (steps 1 and 2) followed by a hill climbing search (step 3), which is then optionally refined with tabu search (step 4) and random restarts (step 5). Minor variations of this algorithm have been used in large parts of the literature on BN structure learning with score-based methods (some notable examples are Heckerman et al. 1995; Tsamardinos et al. 2006; Friedman 1997).
Hill climbing uses local moves (arc additions, deletions and reversals) to explore the neighbourhood of the current candidate DAG \(\mathcal {G}_{ max }\) in the space of all possible DAGs in order to find the DAG \(\mathcal {G}\) (if any) that increases the score \(\mathrm {Score}(\mathcal {G}, \mathcal {D})\) the most over \(\mathcal {G}_{ max }\). That is, in each iteration hill climbing tries to delete and reverse each arc in the current optimal DAG \(\mathcal {G}_{ max }\); and to add each possible arc that is not already present in \(\mathcal {G}_{ max }\). For all the resulting DAGs \(\mathcal {G}^*\) that are acyclic, hill climbing then computes \(S_{\mathcal {G}^*} = \mathrm {Score}(\mathcal {G}^*, \mathcal {D})\); cyclic graphs are discarded. The \(\mathcal {G}^*\) with the highest \(S_{\mathcal {G}^*}\) becomes the new candidate DAG \(\mathcal {G}\). If that DAG has a score \(S_{\mathcal {G}} > S_{ max }\) then \(\mathcal {G}\) becomes the new \(\mathcal {G}_{ max }\), \(S_{ max }\) will be set to \(S_{\mathcal {G}}\), and hill climbing will move to the next iteration.
This greedy search eventually leads to a DAG \(\mathcal {G}_{ max }\) that has no neighbour with a higher score. Since hill climbing is an optimisation heuristic, there is no theoretical guarantee that \(\mathcal {G}_{ max }\) is a global maximum. In fact, the space of the DAGs grows super-exponentially in N (Harary and Palmer 1973); hence, multiple local maxima are likely present even if the sample size n is large. The problem may be compounded by the existence of score-equivalent DAGs, which by definition have the same \(S_{\mathcal {G}}\) for all the \(\mathcal {G}\) falling in the same equivalence class. However, Gillispie and Perlman (2002) have shown that while the number of equivalence classes is of the same order of magnitude as the space of the DAGs, most contain few DAGs and as many as \(27.4\%\) contain just a single DAG. This suggests that the impact of score equivalence on hill climbing may be limited. Furthermore, greedy search can be easily modified into GES to work directly in the space of equivalence classes by using a different set of local moves, sidestepping this possible issue entirely.
In order to escape from local maxima, greedy search first tries to move away from \(\mathcal {G}_{ max }\) by allowing up to \(t_0\) additional local moves. These moves necessarily produce DAGs \(\mathcal {G}^*\) with \(S_{\mathcal {G}^*} \leqslant S_{ max }\); hence, the new candidate DAGs are chosen to have the highest \(S_\mathcal {G}\) even if \(S_\mathcal {G}< S_{ max }\). Furthermore, DAGs that have been accepted as candidates in the last \(t_1\) iterations are kept in a list (the tabu list) and are not considered again in order to guide the search towards unexplored regions of the space of the DAGs. This approach is called tabu search (step 4) and was originally proposed by Glover and Laguna (1998). If a new DAG with a score larger than \(\mathcal {G}_{ max }\) is found in the process, that DAG is taken as the new \(\mathcal {G}_{ max }\) and greedy search returns to step 3, reverting to hill climbing.
If, on the other hand, no such DAG is found then greedy search tries again to escape the local maximum \(\mathcal {G}_{ max }\) for \(r_0\) times with random non-local moves, that is, by moving to a distant location in the space of the DAGs and starting the greedy search again; hence, the name random restart (step 5). The non-local moves are typically determined by applying a batch of \(r_1\) randomly chosen local moves that substantially alter \(\mathcal {G}_{ max }\). If the DAG that was perturbed was indeed the global maximum, the assumption is that this second search will also identify it as the optimal DAG, in which case the algorithm terminates.
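The hill-climbing core of this procedure can be sketched compactly. The sketch below covers arc additions and deletions only (reversals, tabu lists and random restarts are omitted for brevity), and the toy score at the end is a hypothetical stand-in for BIC or a marginal likelihood:

```python
import itertools

def is_acyclic(nodes, arcs):
    """Kahn's algorithm: a directed graph is acyclic iff every node can be
    removed by repeatedly deleting nodes with no incoming arcs."""
    indeg = {v: 0 for v in nodes}
    for _, v in arcs:
        indeg[v] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for a, b in arcs:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return seen == len(nodes)

def hill_climb(nodes, score, max_iter=100):
    """Greedy hill climbing over DAGs: in each iteration, try every arc
    addition or deletion, discard cyclic candidates, and move to the best
    scoring neighbour until no move improves the current score."""
    arcs, best = set(), score(set())
    for _ in range(max_iter):
        neighbours = []
        for u, v in itertools.permutations(nodes, 2):
            new = arcs - {(u, v)} if (u, v) in arcs else arcs | {(u, v)}
            if is_acyclic(nodes, new):      # cyclic graphs are discarded
                neighbours.append((score(new), new))
        s, g = max(neighbours, key=lambda t: t[0])
        if s <= best:
            break                           # local maximum reached
        best, arcs = s, g
    return arcs, best

# Toy score rewarding the arcs of a hypothetical true DAG.
target = {("A", "B"), ("B", "C")}
arcs, best = hill_climb(["A", "B", "C"],
                        lambda a: sum(1 if e in target else -1 for e in a))
print(arcs == target, best)  # → True 2
```

In a real implementation the score would be decomposable, so each neighbour's score differs from the current one by at most two local terms; the sketch recomputes it from scratch only for clarity.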
 1.
We treat the estimation of each local distribution as an atomic O(1) operation; that is, the (time) complexity of structure learning is measured by the number of estimated local distributions.
 2.
Model comparisons are assumed to always add, delete and reverse arcs correctly with respect to the underlying true model, which happens asymptotically as \(n \rightarrow \infty \) since marginal likelihoods and BIC are globally and locally consistent (Chickering 2002).
 3.
The true DAG \(\mathcal {G}_{\mathrm{REF}}\) is sparse and contains O(cN) arcs, where c is typically assumed to be between 1 and 5.

Adding or removing an arc only alters one parent set; for instance, adding \(X_j \rightarrow X_i\) means that \(\varPi _{X_i}= \varPi _{X_i}\cup X_j\), and therefore \({\text {P}}(X_i \mid \varPi _{X_i})\) should be updated to \({\text {P}}(X_i \mid \varPi _{X_i}\cup X_j)\). All the other local distributions \({\text {P}}(X_k \mid \varPi _{X_k}), X_k \ne X_i\) are unchanged.

Reversing an arc \(X_j \rightarrow X_i\) to \(X_i \rightarrow X_j\) means that \(\varPi _{X_i}= \varPi _{X_i}\setminus X_j\) and \(\varPi _{X_j} = \varPi _{X_j} \cup X_i\), and so both \({\text {P}}(X_i \mid \varPi _{X_i})\) and \({\text {P}}(X_j \mid \varPi _{X_j})\) should be updated.
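The practical consequence of decomposability is that a score cache indexed by (node, parent set) lets each local move trigger at most one or two re-estimations. A minimal sketch, with a hypothetical stand-in for the local score:

```python
def network_score(parents, local_score, cache):
    """Decomposable network score: the sum of per-node local scores. Only
    nodes whose parent set changed since the last call miss the cache, so
    a single arc addition re-estimates a single local distribution."""
    total = 0.0
    for node, pa in parents.items():
        key = (node, frozenset(pa))
        if key not in cache:
            cache[key] = local_score(node, pa)
        total += cache[key]
    return total

calls = []
def local_score(node, pa):
    calls.append(node)            # record each (re-)estimation
    return -float(len(pa))        # toy score penalising larger parent sets

parents = {"A": set(), "B": {"A"}, "C": set()}
cache = {}
network_score(parents, local_score, cache)  # 3 local distributions learned
parents["C"] = {"B"}                        # local move: add arc B -> C
network_score(parents, local_score, cache)  # only X_C is re-estimated
print(calls)
```

After the arc addition only the entry for \(X_C\) is recomputed; the cached scores for \(X_A\) and \(X_B\) are reused.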
3 Revisiting computational complexity

the characteristics of the data themselves (the sample size n, the number of possible values for categorical variables);

the number of parents of \(X_i\) in the DAG, that is, \(|\varPi _{X_i}|\);

the distributional assumptions on \({\text {P}}(X_i \mid \varPi _{X_i})\), which determine the number of parameters \(\varTheta _{X_i}\).
3.1 Computational complexity for local distributions
If n is large, or if \(\varTheta _{X_i}\) is markedly different for different \(X_i\), different local distributions will take different times to learn, violating the O(1) assumption from the previous section. In other words, if we denote the computational complexity of learning the local distribution of \(X_i\) as \(O(f_{\varPi _{X_i}}(X_i))\), we find below that \(O(f_{\varPi _{X_i}}(X_i)) \ne O(1)\).
3.1.1 Nodes in discrete BNs
3.1.2 Nodes in GBNs
3.1.3 Nodes in CLGBNs
3.2 Computational complexity for the whole BN

parents are added sequentially to each of the N nodes;

if a node \(X_i\) has \(d_{X_i}\) parents then greedy search will perform \(d_{X_i}+ 1\) passes over the candidate parents;

for each pass, \(N - 1\) local distributions will need to be relearned as described in Sect. 2.
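Under this counting argument, the total number of estimated local distributions is \(\sum _i (d_{X_i} + 1)(N - 1)\). A quick sketch of the count (the parent counts below are made up for illustration):

```python
def relearned_local_distributions(d):
    """Count the local distributions greedy search estimates under the
    assumptions above: a node ending up with d_i parents triggers
    d_i + 1 passes, and each pass re-learns N - 1 local distributions."""
    N = len(d)
    return sum((d_i + 1) * (N - 1) for d_i in d)

# A sparse DAG with N = 5 nodes and parent counts 0, 1, 1, 2, 1:
print(relearned_local_distributions([0, 1, 1, 2, 1]))  # 10 passes x 4 = 40
```

For a sparse DAG with O(cN) arcs this gives the familiar quadratic-in-N count of local distribution estimates.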
3.2.1 Discrete BNs
3.2.2 GBNs
3.2.3 CLGBNs

\(O(g(N, \mathbf {d}))\) is always linear in the sample size;

unless the number of discrete parents is bounded for both discrete and continuous nodes, \(O(g(N, \mathbf {d}))\) is again more than exponential;

if the proportion of discrete nodes is small, we can assume that \(M \approx N\) and \(O(g(N, \mathbf {d}))\) is always polynomial.
4 Greedy search and big data
In Sect. 3, we have shown that the computational complexity of greedy search scales linearly in n, so greedy search is efficient in the sample size and it is suitable for learning BNs from big data. However, we have also shown that different distributional assumptions on \(\mathbf {X}\) and on the \(d_{X_i}\) lead to different complexity estimates for various types of BNs. We will now build on these results to suggest two possible improvements to speed up greedy search.
4.1 Speeding up low-order regressions in GBNs and CLGBNs
 \(j = 0\) corresponds to trivial linear regressions of the type$$\begin{aligned} X_i = \mu _{X_i} + \varepsilon _{X_i}, \end{aligned}$$in which the only parameters are the mean and the variance of \(X_i\).
 \(j = 1\) corresponds to simple linear regressions of the type$$\begin{aligned} X_i = \mu _{X_i} + X_j \beta _{X_j} + \varepsilon _{X_i}, \end{aligned}$$for which there are the well-known (e.g. Draper and Smith 1998) closed-form estimates$$\begin{aligned} {\hat{\mu }}_{X_i}&= {\bar{x}}_i - {\hat{\beta }}_{X_j}{\bar{x}}_j, \\ {\hat{\beta }}_{X_j}&= \frac{{\text {COV}}(X_i, X_j)}{{\text {VAR}}(X_j)}, \\ {\hat{\sigma }}^2_{X_i}&= \frac{1}{n - 2}(X_i - {\hat{x}}_i)^T (X_i - {\hat{x}}_i), \end{aligned}$$where \({\text {VAR}}(\cdot )\) and \({\text {COV}}(\cdot , \cdot )\) are empirical variances and covariances.
 for \(j = 2\), we can estimate the parameters of$$\begin{aligned} X_i = \mu _{X_i} + X_j \beta _{X_j} + X_k \beta _{X_k} + \varepsilon _{X_i} \end{aligned}$$using their links to partial correlations:$$\begin{aligned} \rho _{X_i X_j \mid X_k}&= \frac{\rho _{X_i X_j} - \rho _{X_i X_k} \rho _{X_j X_k}}{\sqrt{1 - \rho _{X_i X_k}^2}\sqrt{1 - \rho _{X_j X_k}^2}} = \beta _{j} \frac{\sqrt{1 - \rho _{X_j X_k}^2}}{\sqrt{1 - \rho _{X_i X_k}^2}}, \\ \rho _{X_i X_k \mid X_j}&= \beta _{k} \frac{\sqrt{1 - \rho _{X_j X_k}^2}}{\sqrt{1 - \rho _{X_i X_j}^2}}; \end{aligned}$$for further details we refer the reader to Weatherburn (1961). Simplifying these expressions leads to$$\begin{aligned} {\hat{\beta }}_{X_j}&= \frac{1}{d} \big [{\text {VAR}}(X_k){\text {COV}}(X_i, X_j) - {\text {COV}}(X_j, X_k){\text {COV}}(X_i, X_k)\big ], \\ {\hat{\beta }}_{X_k}&= \frac{1}{d} \big [{\text {VAR}}(X_j){\text {COV}}(X_i, X_k) - {\text {COV}}(X_j, X_k){\text {COV}}(X_i, X_j)\big ], \end{aligned}$$with denominator$$\begin{aligned} d = {\text {VAR}}(X_j){\text {VAR}}(X_k) - {\text {COV}}(X_j, X_k)^2. \end{aligned}$$Then, the intercept and the standard error estimates can be computed as$$\begin{aligned} {\hat{\mu }}_{X_i}&= {\bar{x}}_i - {\hat{\beta }}_{X_j}{\bar{x}}_j - {\hat{\beta }}_{X_k}{\bar{x}}_k, \\ {\hat{\sigma }}^2_{X_i}&= \frac{1}{n - 3}(X_i - {\hat{x}}_i)^T (X_i - {\hat{x}}_i). \end{aligned}$$
and it suggests that learning low-order local distributions in this way can be markedly faster, thus driving down the overall computational complexity of greedy search without any change in its behaviour. We also find that issues with singularities and numeric stability, which are one of the reasons to use the QR decomposition to estimate the regression coefficients, are easy to diagnose using the variances and the covariances of \((X_i, \varPi _{X_i})\); and they can be resolved without increasing computational complexity again.
Interestingly, we note that (10) does not depend on \(D_{X_i}\), unlike (5): the computational complexity of learning local distributions with \(|\varGamma _{X_i}| \leqslant 2\) does not become exponential even if the number of discrete parents is not bounded.
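The two-parent closed-form estimates above can be checked numerically. The sketch below computes them from empirical variances and covariances only (the function name and the test data are illustrative; with noiseless data the true coefficients are recovered exactly):

```python
def fit_two_parent_regression(xi, xj, xk):
    """Closed-form estimates for X_i = mu + beta_j X_j + beta_k X_k + eps,
    built from empirical variances and covariances as in the text."""
    n = len(xi)
    def mean(v):
        return sum(v) / n
    def cov(a, b):
        ma, mb = mean(a), mean(b)
        return sum((u - ma) * (w - mb) for u, w in zip(a, b)) / (n - 1)
    d = cov(xj, xj) * cov(xk, xk) - cov(xj, xk) ** 2
    beta_j = (cov(xk, xk) * cov(xi, xj) - cov(xj, xk) * cov(xi, xk)) / d
    beta_k = (cov(xj, xj) * cov(xi, xk) - cov(xj, xk) * cov(xi, xj)) / d
    mu = mean(xi) - beta_j * mean(xj) - beta_k * mean(xk)
    return mu, beta_j, beta_k

# Noiseless data generated from mu = 1, beta_j = 2, beta_k = -0.5 should
# be recovered exactly (the regressors are not collinear).
xj = [0.0, 1.0, 2.0, 3.0, 4.0]
xk = [1.0, 0.0, 2.0, 1.0, 3.0]
xi = [1.0 + 2.0 * a - 0.5 * b for a, b in zip(xj, xk)]
print(fit_two_parent_regression(xi, xj, xk))
```

These expressions are just the 2×2 normal equations solved by Cramer's rule, which is why only five variance/covariance terms are needed instead of a full QR decomposition.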
4.2 Predicting is faster than learning

\(O(N|\mathcal {D}^{test}|)\) for discrete BNs, because we just have to perform an O(1) lookup to collect the relevant conditional probability for each node and observation;

\(O(cN|\mathcal {D}^{test}|)\) for GBNs and CLGBNs, because for each node and observation we need to compute \(\varPi _{X_i}^{(n+1)}{\hat{\beta }}_{X_i}\), where \({\hat{\beta }}_{X_i}\) is a vector of length \(d_{X_i}\).
Hence by learning local distributions only on \(\mathcal {D}^{train}\) we improve the speed of structure learning because the perobservation cost of prediction is lower than that of learning; and \(\mathcal {D}^{train}\) will still be large enough to obtain good estimates of their parameters \(\varTheta _{X_i}\). Clearly, the magnitude of the speedup will be determined by the proportion of \(\mathcal {D}\) used as \(\mathcal {D}^{test}\). Further improvements are possible by using the closedform results from Sect. 4.1 to reduce the complexity of learning local distributions on \(\mathcal {D}^{train}\), combining the effect of all the optimisations proposed in this section.
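As a sketch of this predictive scoring scheme, consider the simplest case of a Gaussian node with no parents: the parameters are estimated on \(\mathcal {D}^{train}\) and the score is the log-likelihood of \(\mathcal {D}^{test}\). The 75%/25% split mirrors the PRED setting used later in the paper; the function name is hypothetical:

```python
import math
import random

def predictive_score(train, test):
    """Fit a trivial Gaussian local distribution (no parents) on the
    training fold, then score the held-out fold by its log-likelihood;
    prediction is O(1) per observation once mu and sigma2 are fixed."""
    n = len(train)
    mu = sum(train) / n
    sigma2 = sum((x - mu) ** 2 for x in train) / (n - 1)
    return sum(-0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)
               for x in test)

random.seed(1)
data = [random.gauss(0, 1) for _ in range(1000)]
split = int(0.75 * len(data))       # 75% train, 25% test
print(predictive_score(data[:split], data[split:]))
```

With parents, the fitted local regression would supply the conditional mean for each test observation, but the per-observation cost of scoring remains a single dot product.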
5 Benchmarking and simulations
We demonstrate the improvements in the speed of structure learning discussed in Sects. 4.1 and 4.2 using the MEHRA data set from Vitolo et al. (2018), which studied 50 million observations to explore the interplay between environmental factors, exposure levels to outdoor air pollutants, and health outcomes in the English regions of the UK between 1981 and 2014. The CLGBN learned in that paper is shown in Fig. 1: It comprises 24 variables describing the concentrations of various air pollutants (O3, PM_{2.5}, PM_{10}, SO_{2}, NO_{2}, CO) measured in 162 monitoring stations, their geographical characteristics (latitude, longitude, altitude, region and zone type), weather (wind speed and direction, temperature, rainfall, solar radiation, boundary layer height), demography and mortality rates.
The original analysis was performed with the bnlearn R package (Scutari 2010), and it was complicated by the fact that many of the variables describing the pollutants had significant amounts of missing data due to the lack of coverage in particular regions and years. Therefore, Vitolo et al. (2018) learned the BN using the Structural EM algorithm (Friedman 1997), which is an application of the expectationmaximisation algorithm (EM; Dempster et al. 1977) to BN structure learning that uses hill climbing to implement the M step.
 1.
we consider sample sizes of 1, 2, 5, 10, 20 and 50 million;
 2.
for each sample size, we generate 5 data sets from the CLGBN;
 3. for each sample, we learn back the structure of the BN with hill climbing, using various optimisations:

QR: estimating all Gaussian and conditional linear Gaussian local distributions using the QR decomposition, and BIC as the score function;

1P: using the closed-form estimates for the local distributions that involve 0 or 1 parents, and BIC as the score function;

2P: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents, and BIC as the score function;

PRED: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents for learning the local distributions on \(75\%\) of the data and estimating (12) on the remaining \(25\%\).

Sums of the SHDs between the network structures learned by BIC and PRED and the structure from Vitolo et al. (2018), for different sample sizes n (in millions)
n  BIC  PRED 

1  11  2 
2  2  1 
5  0  1 
10  0  0 
20  0  0 
50  0  0 
Data sets from the UCI Machine Learning Repository and the JSM Data Exposition session, with their sample size (n), number of multinomial nodes (\(N - M\)) and number of Gaussian/conditional Gaussian nodes (M)
Data  n  \(N - M\)  M  Reference 

AIRLINE  \(53.6 \times 10^6\)  9  19  JSM, the Data Exposition Session (2009) 
GAS  \( 4.2 \times 10^6\)  0  37  UCI ML Repository, Fonollosa et al. (2015) 
HEPMASS  \(10.5 \times 10^6\)  1  28  UCI ML Repository, Baldi et al. (2016) 
HIGGS  \(11.0 \times 10^6\)  1  28  UCI ML Repository, Baldi et al. (2014) 
SUSY  \( 5.0 \times 10^6\)  1  18  UCI ML Repository, Baldi et al. (2014) 
6 Conclusions
Learning the structure of BNs from large data sets is a computationally challenging problem. After deriving the computational complexity of the greedy search algorithm in closed form for discrete, Gaussian and conditional linear Gaussian BNs, we studied the implications of the resulting expressions in a "big data" setting where the sample size is very large, and much larger than the number of nodes in the BN. We found that, contrary to classic characterisations, computational complexity strongly depends on the class of BN being learned in addition to the sparsity of the underlying DAG. Starting from this result, we suggested two possible optimisations for greedy search with the aim of speeding up the most common algorithm used for BN structure learning. Using a large environmental data set and five data sets from the UCI Machine Learning Repository and the JSM Data Exposition, we show that it is possible to reduce the running time of greedy search by \(\approx 60\%\).
Footnotes
 1.
Interestingly, some relaxations of BN structure learning are not NP-hard; see for example Claassen et al. (2013) on learning the structure of causal networks.
 2.
All DAGs in the same equivalence class have the same underlying undirected graph and vstructures (patterns of arcs like \(X_i \rightarrow X_j \leftarrow X_k\), with no arcs between \(X_i\) and \(X_k\)).
References
 Allen, T.V., Greiner, R.: Model selection criteria for learning belief nets: an empirical comparison. In: Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 1047–1054 (2000)
 Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014). https://doi.org/10.1038/ncomms5308
 Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., Whiteson, D.: Parameterized neural networks for high-energy physics. Eur. Phys. J. C 76, 235 (2016)
 Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Directed scale-free graphs. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 132–139 (2003)
 Bøttcher, S.G.: Learning Bayesian networks with mixed variables. In: Proceedings of the 8th International Workshop in Artificial Intelligence and Statistics (2001)
 Campos, L.M.D., Fernández-Luna, J.M., Gámez, J.A., Puerta, J.M.: Ant colony optimization for learning Bayesian networks. Int. J. Approx. Reason. 31(3), 291–311 (2002)
 Chickering, D.M.: A transformational characterization of equivalent Bayesian network structures. In: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 87–98 (1995)
 Chickering, D.M.: Learning Bayesian networks is NP-complete. In: Fisher, D., Lenz, H. (eds.) Learning from Data: Artificial Intelligence and Statistics V, pp. 121–130. Springer, Berlin (1996)
 Chickering, D.M.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2002)
 Chickering, D.M., Heckerman, D.: Learning Bayesian networks is NP-hard. Tech. Rep. MSR-TR-94-17, Microsoft Corporation (1994)
 Chickering, D.M., Heckerman, D.: A comparison of scientific and engineering criteria for Bayesian model selection. Stat. Comput. 10, 55–62 (2000)
 Chickering, D.M., Heckerman, D., Meek, C.: Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res. 5, 1287–1330 (2004)
 Claassen, T., Mooij, J.M., Heskes, T.: Learning sparse causal models is not NP-hard. In: Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pp. 172–181 (2013)
 Cooper, G., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9, 309–347 (1992)
 Cowell, R.: Conditions under which conditional independence and scoring methods lead to identical selection of Bayesian network models. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 91–97 (2001)
 Cussens, J.: Bayesian network learning with cutting planes. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pp. 153–160 (2012)
 Dawid, A.P.: Present position and potential developments: some personal views: statistical theory: the prequential approach. J. R. Stat. Soc. Ser. A 147(2), 278–292 (1984)
 Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
 Dheeru, D., Karra Taniskidou, E.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml
 Draper, N.R., Smith, H.: Applied Regression Analysis, 3rd edn. Wiley, London (1998)
 Elidan, G.: Copula Bayesian networks. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 559–567 (2010)
 Fonollosa, J., Sheik, S., Huerta, R., Marco, S.: Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sens. Actuators B Chem. 215, 618–629 (2015)
 Friedman, N.: Learning belief networks in the presence of missing values and hidden variables. In: Proceedings of the 14th International Conference on Machine Learning (ICML), pp. 125–133 (1997)
 Friedman, N., Koller, D.: Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. 50, 95–125 (2003)
 Geiger, D., Heckerman, D.: Learning Gaussian networks. In: Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pp. 235–243 (1994)
 Gillispie, S., Perlman, M.: The size distribution for Markov equivalence classes of acyclic digraph models. Artif. Intell. 14, 137–155 (2002)
 Glover, F., Laguna, M.: Tabu Search. Springer, Berlin (1998)
 Goldenberg, A., Moore, A.: Tractable learning of large Bayes net structures from sparse data. In: Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 44–52 (2004)
 Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
 Harary, F., Palmer, E.M.: Graphical Enumeration. Academic Press, Edinburgh (1973)
 Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995). Available as Technical Report MSR-TR-94-09
 JSM, the Data Exposition Session: Airline on-time performance (2009). http://statcomputing.org/dataexpo/200/
 Kalisch, M., Bühlmann, P.: Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636 (2007)
 Karan, S., Eichhorn, M., Hurlburt, B., Iraci, G., Zola, J.: Fast counting in machine learning applications. In: Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, pp. 540–549 (2018)
 Larranaga, P., Poza, M., Yurramendi, Y., Murga, R.H., Kuijpers, C.M.H.: Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters. IEEE Trans. Pattern Anal. Mach. Intell. 18(9), 912–926 (1996)
 Lauritzen, S.L., Wermuth, N.: Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Stat. 17(1), 31–57 (1989)
 Moore, A., Lee, M.S.: Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Intell. Res. 8, 67–91 (1998)
 Moral, S., Rumi, R., Salmerón, A.: Mixtures of truncated exponentials in hybrid Bayesian networks. In: Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU), Lecture Notes in Computer Science, vol. 2143, pp. 156–167. Springer (2001)
 Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann (1988)
 Peña, J.M., Björkegren, J., Tegnèr, J.: Learning dynamic Bayesian network models via cross-validation. Pattern Recognit. Lett. 26, 2295–2308 (2005)
 Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall, Englewood Cliffs (2009)
 Scanagatta, M., de Campos, C.P., Corani, G., Zaffalon, M.: Learning Bayesian networks with thousands of variables. Adv. Neural Inf. Process. Syst. 28, 1864–1872 (2015)
 Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
 Scutari, M.: Learning Bayesian networks with the bnlearn R package. J. Stat. Softw. 35(3), 1–22 (2010)
 Scutari, M.: Bayesian network constraint-based structure learning algorithms: parallel and optimised implementations in the bnlearn R package. J. Stat. Softw. 77(2), 1–20 (2017)
 Scutari, M., Denis, J.B.: Bayesian Networks with Examples in R. Chapman & Hall, London (2014)
 Seber, G.A.F.: A Matrix Handbook for Statisticians. Wiley, New York (2008)
 Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press, Cambridge (2001)
 Suzuki, J.: An efficient Bayesian network structure learning strategy. N. Gener. Comput. 35(1), 105–124 (2017)
 Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
 Vitolo, C., Scutari, M., Ghalaieny, M., Tucker, A., Russell, A.: Modelling air pollution, climate and health data using Bayesian networks: a case study of the English regions. Earth and Space Science 5, submitted (2018)
 Weatherburn, C.E.: A First Course in Mathematical Statistics. Cambridge University Press, Cambridge (1961)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.