Progress in Artificial Intelligence, Volume 1, Issue 4, pp 329–346

One iteration CHC algorithm for learning Bayesian networks: an effective and efficient algorithm for high dimensional problems

  • José A. Gámez
  • Juan L. Mateo
  • José M. Puerta
Regular Paper


It is well known that learning Bayesian networks from data is an NP-hard problem. For this reason, metaheuristics or approximate algorithms have usually been used to provide a good solution. In particular, the family of hill climbing algorithms has a key role in this scenario because of its good trade-off between computational demand and the quality of the learned models. In addition, these algorithms have several good theoretical properties. In spite of these characteristics of quality and efficiency, when it comes to dealing with high-dimensional datasets, they can be improved upon, and this is the goal of this paper. Recent papers have tackled this problem, usually by dividing the learning task into two or more iterations or phases. The first phase aims to constrain the search space, and, once the space is pruned, the second one consists of a (local) search in this constrained search space. Normally, the first iteration is the one with the highest computational complexity. One such algorithm is constrained hill climbing (CHC), which in its initial iteration not only progressively constrains the search space, but also learns good quality Bayesian networks. A second iteration, or even more, is used in order to improve these networks and also to ensure the good theoretical properties exhibited by the classical hill climbing algorithm. In the CHC algorithm we can see that the first iteration is extremely fast when compared to similar algorithms, but the performance decays over the rest of the iterations with respect to the saved CPU time. In this paper, we present an improvement on this CHC algorithm, in which, to put it briefly, we avoid the last iteration while still obtaining the same theoretical properties. Furthermore, we experimentally test the proposed algorithms over a set of different domains, some of them quite large (more than 1,000 variables), in order to study their behavior in practice.


Bayesian networks, Machine learning, Score-based learning, Local search, Scalability

1 Introduction

The goal of data mining can be understood as compressing the available data into a more compact representation called a model. Later, this model can be used to tackle different descriptive (e.g. identifying dependence relations, clusters, etc.) or predictive (e.g. classification, computing posterior beliefs) tasks. Bayesian Networks [24, 27, 31] have become one of the favorite knowledge representation formalisms for model-based data mining because of their double descriptive/predictive capability and their innate uncertainty management.

Bayesian Networks (BNs) are graphical models that are able to efficiently represent and manipulate \(n\)-dimensional probability distributions [31]. The knowledge base that a BN encodes can be viewed as a double representation model divided into a qualitative part (a directed acyclic graph or DAG) and a quantitative part (a set of locally specified probability distributions). Thus, descriptive tasks are carried out by performing relevance analysis over the graph, while predictive tasks are based on a clever use of the (in)dependences codified in the DAG to allow efficient probabilistic inference. As BNs are such attractive models, and due to the increasing availability of data, it is not surprising that a large number of works in the literature tackle the BN structure learning problem. Broadly speaking, there are two main approaches for learning BNs:
  • Score+search methods. A function \(f\) is used to score a network/DAG with respect to the training data, and a search method is used to look for the network with the best score. Different scoring metrics [11, 25, 30] and search methods, mainly of a heuristic (e.g. [3, 4, 9, 14, 16, 25, 30]) and metaheuristic (e.g. [2, 8, 12, 28]) nature, have been proposed due to the NP-hardness of the BN structure learning problem [7].

  • Constraint-based methods. The idea underlying these methods is to satisfy as many independences present in the data as possible [30, 35]. Statistical hypothesis testing is used to determine the validity of conditional independence sentences. There also exist hybrid algorithms that combine these two approaches, e.g. [1], or even hybrid scoring metrics [11].

Dealing with larger data sets, and therefore designing scalable algorithms, is nowadays one of the major challenges of machine learning [21]. In BN structure learning in the space of DAGs, the local search-based approach (e.g. [9, 13, 14, 16, 17, 25, 30, 36]) stands out when dealing with large datasets (i.e. a very large number of variables), because of its good trade-off between the resources required (e.g. CPU time) and the accuracy of the model obtained.

Our main motivation in this paper is to scale up local search algorithms for learning BNs, such as hill climbing, while maintaining the theoretical properties that this method offers. As the cardinality of this search space is super-exponential [33], a good idea, especially in domains with a large number of variables, is to limit in some way the areas of the search space to be visited. This idea is not new and has been exploited previously in the literature, as we will briefly review in Sect. 2.2.

Our proposal in this paper is to develop a one-stage, single-iteration constrained hill climbing algorithm, with the goal of significantly reducing its running time and so applying it to databases with a larger number of variables. Our experiments confirm that the resulting hill climbing algorithm, called the FastCHC algorithm, is faster than previous state-of-the-art approaches, while keeping the quality of the discovered networks close to that of the networks learnt by this family of algorithms.

This paper is structured as follows. We begin in Sect. 2 by presenting the notation and definitions concerning BNs necessary to build our proposal. In Sect. 3, we describe in detail some local search methods for learning BNs and previous algorithms in the CHC family. Section 4 is devoted to explaining our proposal. In Sect. 5 we describe the experiments carried out to validate our claim about the quality of our new algorithm. Finally, in Sect. 6 we present our conclusions.

2 Preliminaries

2.1 Bayesian networks

Bayesian networks (BNs) are graphical models that can efficiently represent and manipulate \(n\)-dimensional probability distributions [31]. This representation has two components that, respectively, codify qualitative and quantitative knowledge:
  • A graphical structure, or more precisely a DAG, \(\mathcal{{G}}=(\varvec{V}, \mathbf E )\), where the nodes in \(\varvec{V}=\{X_{1},X_{2},\ldots ,X_{n}\}\) represent the random variables\(^{1}\) from the problem we are modeling, and the topology of the graph (the arcs in \(\mathbf E \subseteq \varvec{V} \times \varvec{V}\)) encodes conditional (in)dependence relationships among the variables (by means of the presence or absence of direct connections between pairs of variables).

  • A set of numerical parameters (\(\varvec{\Theta }\)), usually conditional probability tables, drawn from the graph structure: for each variable \(X_{i} \in \varvec{V}\) we have a conditional probability distribution \(P(X_{i} | \mathbf{pa }(X_{i}))\), where \(\mathbf{pa }(X_{i})\) represents any combination of the values of the variables in \(\mathbf{Pa }(X_{i})\), and \(\mathbf{Pa }(X_{i})\) is the parent set\(^{2}\) of \(X_{i}\) in \(\mathcal{{G}}\). From these conditional distributions we can recover the joint probability distribution over \({\varvec{V}}\) thanks to the Markov condition:
    $$\begin{aligned} P(X_{1}, X_{2}, \ldots , X_{n})= \prod _{i=1}^{n} P(X_{i} | \mathbf{Pa }_\mathcal{{G}}(X_{i})) \end{aligned}$$
    This decomposition of the joint distribution gives rise to important savings in storage requirements and also allows the definition of efficient probabilistic inference algorithms by means of local propagation schemes [27].
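As a minimal sketch of this factorization, the following Python fragment evaluates the joint distribution as a product of local conditionals. The three-node network \(A \rightarrow C \leftarrow B\), its binary variables and all probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical three-node network A -> C <- B with binary variables.
parents = {"A": (), "B": (), "C": ("A", "B")}

# Conditional probability tables: cpt[X][parent values][x] = P(X = x | pa(X)).
# All numbers are made up for illustration.
cpt = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(): {0: 0.6, 1: 0.4}},
    "C": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6},
        (1, 1): {0: 0.1, 1: 0.9},
    },
}


def joint_probability(assignment):
    """Return P(assignment) as the product of the local conditionals."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[var][pa_values][assignment[var]]
    return prob


# The factorized joint must sum to one over all configurations.
total = sum(
    joint_probability(dict(zip("ABC", values)))
    for values in product((0, 1), repeat=3)
)
```

Here the network stores 1 + 1 + 4 = 6 free parameters instead of the 7 needed for the full joint table over three binary variables; the saving grows quickly with the number of variables when parent sets stay small.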
We denote that variables in \(\mathbf X \) are conditionally independent (through \(d\)-separation) of variables in \(\mathbf Y \) given the set \(\mathbf Z \), in a DAG \(\mathcal{{G}}\) by \(\langle \mathbf X ,\mathbf Y |\mathbf Z \rangle _\mathcal{{G}}\). The same sentence but in a probability distribution \(p\) is denoted by \(I_{p}(\mathbf X ,\mathbf Y |\mathbf Z )\).

Definition 1

A node or variable \(X\) is a collider in a path \(\pi \) if \(X\) has two incoming edges, i.e. we have the subgraph \(A \rightarrow X \leftarrow B\) (also known as a head-to-head node). If the tail nodes (\(A\) and \(B\)) of a collider node are not adjacent in \(\mathcal G \), this subgraph is called a v-structure in \(X\).


Definition 2

A path from node \(X\) to node \(Y\) is blocked by a set of nodes \(\mathbf Z \) if there is a node \(W\) on the path for which one of the following two conditions holds:
  1. \(W\) is not a collider and \(W \in \mathbf Z \), or

  2. \(W\) is a collider and neither \(W\) nor its descendants are in \(\mathbf Z \).

A path which is not blocked is active or open.


Definition 3

Two nodes \(X\) and \(Y\) are \(d\)-separated by \(\mathbf Z \) in graph \(\mathcal{G }\) if and only if every path from \(X\) to \(Y\) is blocked by \(\mathbf Z \). Two nodes are \(d\)-connected if they are not \(d\)-separated.
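A \(d\)-separation query can be sketched in a few lines of Python. Rather than enumerating paths as in Definitions 2 and 3, this sketch uses the equivalent moralized ancestral graph criterion: \(X\) and \(Y\) are \(d\)-separated by \(\mathbf Z \) exactly when \(\mathbf Z \) disconnects them in the moral graph of the subgraph induced by the ancestors of \(X\), \(Y\) and \(\mathbf Z \). The function names and the example DAG are hypothetical, and the sketch assumes that neither \(X\) nor \(Y\) belongs to \(\mathbf Z \).

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def d_separated(parents, x, y, z):
    """True if x and y are d-separated by the set z in the DAG `parents`.

    `parents` maps every node to the list of its parents.
    """
    keep = ancestors(parents, {x, y} | set(z))
    # Moralize the ancestral subgraph: link each node to its parents and
    # "marry" every pair of parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        pa = list(parents[child])
        for p in pa:
            adj[child].add(p)
            adj[p].add(child)
        for k, p in enumerate(pa):
            for q in pa[k + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # x and y are d-separated iff removing z disconnects them.
    blocked = set(z)
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(adj[node] - blocked)
    return True


# Hypothetical collider A -> C <- B with a descendant C -> D.
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
```

In this example `d_separated(dag, "A", "B", [])` holds, while conditioning on the collider `C` or on its descendant `D` opens the path between `A` and `B`, in agreement with the second condition of Definition 2.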


Definition 4

A DAG \(\mathcal{{G}}\) is an I-map of a probability distribution \(p\) if \(\langle \mathbf X ,\mathbf Y |\mathbf Z \rangle _\mathcal{{G}} \Longrightarrow I_{p}(\mathbf X ,\mathbf Y |\mathbf Z )\). It is minimal if no arc can be removed from \(\mathcal{{G}}\) without violating the I-map condition. \(\mathcal{{G}}\) is a D-map of \(p\) if \(\langle \mathbf X ,\mathbf Y |\mathbf Z \rangle _\mathcal{{G}} \Longleftarrow I_{p}(\mathbf X ,\mathbf Y |\mathbf Z )\).

When a DAG \(\mathcal{{G}}\) is both an I-map and a D-map of \(p\), it is said that \(\mathcal{{G}}\) and \(p\) are isomorphic models (i.e. \(\mathcal{G }\) is a perfect-map of \(p\)) or we will say that \(p\) and \(\mathcal{G }\) are faithful to each other [30, 35].

Furthermore, a distribution \(p\) is faithful if there exists a graph, \(\mathcal{G }\), to which it is faithful. In a faithful BN, \(\langle \mathbf X ,\mathbf Y |\mathbf Z \rangle _\mathcal{{G}} \Leftrightarrow I_{p}(\mathbf X ,\mathbf Y |\mathbf Z )\).

It is always possible to build a minimal I-map of any given probability distribution \(p\), but some distributions do not admit an isomorphic model [31].

In general, when learning Bayesian networks from data our goal is to obtain a DAG that is a minimal I-map of the probability distribution encoded by the dataset.

We will assume faithfulness in the rest of the paper. Under this assumption, the terms \(d\)-separation (in \(\mathcal{G }\)) and conditional independence (in \(p\)) can be used interchangeably.

2.2 Related work

The idea of learning BNs in large domains by constraining the search space is not new and has been exploited previously. For example, the authors of [5] restrict the possible parents of a variable to the \(k\) variables most correlated with it, and then the K2 algorithm [10] is used to learn the BN structure. When no restriction is imposed on the order of the nodes, the Max–Min Hill Climbing (MMHC) algorithm [36] can be used; this is also a two-step algorithm, which in its first stage tries to identify the parents and children of each variable and in the second one uses a local search algorithm to look for the network, with the search restricted to the set of previously found adjacencies (parents and children). In [29], the first step is carried out as in MMHC, but in the second phase substructures are learned using the information gathered in the first phase. In [16], an iterated hill climbing algorithm is proposed that at each (outer) iteration restricts the number of candidate parents for each variable to the \(k\) most promising ones, \(k\) having the same value for every variable.

There also exist global search methods in which the search space is restricted. For example, in [37] an undirected graph or skeleton is first constructed by using zero- and first-order dependence tests, and then a genetic algorithm is employed which is restricted to searching for DAGs belonging to this skeleton. The approach taken in [39] is slightly different, although it also uses a first stage based on low-order conditional independence tests.

As we can observe, the common feature in all the aforementioned algorithms is that all of them use two clearly separated stages: (1) search space restriction; and (2) running of a search algorithm over the restricted search space. In [17], the authors propose a different way of learning BNs, namely by carrying out these two stages simultaneously. Thus, a hill climbing algorithm is launched directly without previously restricting the search space, and then it takes advantage of the computations carried out at each search step to guess which edges should not be considered from then on. In this way the search space is pruned progressively as the search advances. However, in order to maintain the nice theoretical property that under the faithfulness condition the CHC algorithm always returns a minimal I-map, at least one extra iteration should be executed. These last iterations are usually short in local steps but are time-consuming in global terms due to the extra load of computing new statistics from data. This means that the time saving with respect to previous two-stage proposals diminishes, even though the first iteration is really very fast.

Our proposal in this paper is to avoid these extra iterations in order to improve the CHC algorithm, significantly reducing its running time and thus making it applicable to databases with a larger number of variables. To do this, we propose to check the restrictions previously applied to the parent set of each variable in order to guarantee the minimal I-map condition in only one iteration.

3 Learning BNs by local search methods

In the score+search approach, the problem of learning the structure of a Bayesian network can be stated as follows: given a training dataset \(D=\{\mathbf{v}^{1},\ldots , \mathbf{v}^{m}\}\) of instances (configurations of values) of \({\varvec{V}}\), find the DAG \(\mathcal{G }^{*}\) such that
$$\begin{aligned} \mathcal{G }^{*}=\arg \max _{\mathcal{G } \in \mathcal{G }^n} f(\mathcal{G }:D) \end{aligned}$$
where \(f(\mathcal{G }:D)\) is a scoring metric which evaluates the merit of any candidate DAG \(\mathcal{G }\) with respect to the dataset \(D\), and \(\mathcal{G }^{n}\) is the set containing all DAGs with \(n\) nodes.

Local search (specifically hill climbing) methods traverse the search space by starting from an initial solution and performing a finite number of local steps. At each local step, the algorithm only considers minimum or local changes, i.e. neighbor DAGs, and chooses the one resulting in the greatest improvement in \(f\). The algorithm stops when there is no local change yielding an improvement in \(f\). Because of this greedy behavior, the execution stops when the algorithm is trapped at a solution that most times maximizes \(f\) locally rather than globally. Different strategies are used to try to escape from local optima: restarts, randomness, etc.

In BN learning, the usual choices for local changes in the space of DAGs are arc addition, arc deletion and arc reversal. Of course, except in arc deletion, we have to take care not to introduce directed cycles in the graph. Thus, there are \(O(n^2)\) possible changes, \(n\) being the number of variables. With regard to the starting solution, the empty network is usually considered, although random starting points or perturbed local optima are also used, especially in the case of an iterated local search.

Efficient evaluation of neighbors/DAGs is based on an important property of scoring metrics: decomposability in the presence of full data. In the case of BNs, decomposable metrics evaluate a given DAG as the sum of the scores of its node families, i.e. the subgraphs formed by a node and its parents in \(\mathcal{G }\). Formally, if \(f\) is decomposable then:
$$\begin{aligned} f(\mathcal{G }:D) = \sum _{i=1}^n f_{D}(X_{i},\mathbf{Pa }_\mathcal{G }(X_{i})) \end{aligned}$$
$$\begin{aligned} f_D(X_{i},\mathbf{Pa }_\mathcal{G }(X_{i}))= f_D(X_{i},\mathbf{Pa }_\mathcal{G }(X_{i}):N_{x_{i},\mathbf{pa }_\mathcal{G }(X_{i})}) \end{aligned}$$
where \(N_{x_{i},\mathbf{pa }_\mathcal{G }(X_{i})}\) are the statistics of the variables \(X_{i}\) and \(\mathbf{Pa }_\mathcal{G }(X_{i})\) in \(D\), i.e. the number of instances in \(D\) that match each possible instantiation of \(X_{i}\) and \(\mathbf{Pa }(X_{i})\).
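To make the role of the statistics \(N_{x_{i},\mathbf{pa }_\mathcal{G }(X_{i})}\) concrete, the following sketch computes a family score purely from counts and obtains the score of a DAG as the sum of its family scores. It uses BIC as the decomposable metric, which is an assumption made for brevity (any decomposable metric would fit the same template). The function names, the data layout (a list of dictionaries) and the use of observed cardinalities are also assumptions of this sketch.

```python
import math
from collections import Counter


def family_score_bic(data, child, parents):
    """BIC score of one family (child plus parents) from counts in `data`.

    `data` is a list of dicts mapping variable names to discrete values.
    """
    parents = sorted(parents)
    # N_{x_i, pa(X_i)}: counts of each joint configuration of child and parents.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    # Marginal counts of each parent configuration.
    pa_counts = Counter()
    for (pa_values, _), n in joint.items():
        pa_counts[pa_values] += n
    loglik = sum(n * math.log(n / pa_counts[pa]) for (pa, _), n in joint.items())
    # Free parameters: (r_i - 1) per parent configuration, using the
    # cardinalities observed in the data.
    r = len({row[child] for row in data})
    q = 1
    for p in parents:
        q *= len({row[p] for row in data})
    return loglik - 0.5 * math.log(len(data)) * (r - 1) * q


def total_score(data, dag):
    """Decomposable score: the sum of the family scores (`dag`: node -> parents)."""
    return sum(family_score_bic(data, x, pa) for x, pa in dag.items())


# Tiny made-up dataset in which B is a copy of A.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
score_ab = total_score(data, {"A": [], "B": ["A"]})
score_ba = total_score(data, {"B": [], "A": ["B"]})
```

Only the counts of the configurations of a node and its parents enter each term, so changing the parent set of one node leaves every other term of the sum untouched.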
Thus, if a decomposable metric is used, a procedure that changes only one arc at each move can efficiently evaluate the neighbor obtained by this change. This method can reuse the computations carried out in previous stages, and only the statistics corresponding to the variables whose parents have been modified need to be recomputed. If we use an HC algorithm, we will have to measure the following differences when evaluating the improvement obtained by a neighbor DAG:
  1. Addition of \(X_{j} \rightarrow X_{i}\): \(f_{D}(X_{i},\mathbf{Pa }(X_{i}) \cup \{X_{j}\}) - f_{D}(X_{i},\mathbf{Pa }(X_{i}))\).

  2. Deletion of \(X_{j} \rightarrow X_{i}\): \(f_{D}(X_{i},\mathbf{Pa }(X_{i}) \setminus \{X_{j}\}) - f_{D}(X_{i},\mathbf{Pa }(X_{i}))\).

  3. Reversal of \(X_{j} \rightarrow X_{i}\): it is obtained as the sequence deletion(\(X_{j} \rightarrow X_{i}\)) plus addition(\(X_{i} \rightarrow X_{j}\)), so we compute \([f_{D}(X_{i},\mathbf{Pa }(X_{i}) \setminus \{X_{j}\}) -f_{D}(X_{i},\mathbf{Pa }(X_{i}))] + [f_{D}(X_{j},\mathbf{Pa }(X_{j}) \cup \{X_{i}\}) - f_{D}(X_{j},\mathbf{Pa }(X_{j}))]\).

Then, at each step, the algorithm analyzes all the possible (local) operations, and chooses the one with the highest positive difference.
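The three differences can be written down directly once a family score is available. In the sketch below, `fam(child, parents)` stands for \(f_{D}\), and `pa` maps each node to the frozenset of its current parents; both names are hypothetical. The toy score is made up so that the arithmetic can be followed by hand; it is not a real scoring metric and, in particular, it is not score equivalent.

```python
def delta_add(fam, pa, j, i):
    """Score change for adding X_j -> X_i."""
    return fam(i, pa[i] | {j}) - fam(i, pa[i])


def delta_delete(fam, pa, j, i):
    """Score change for deleting X_j -> X_i."""
    return fam(i, pa[i] - {j}) - fam(i, pa[i])


def delta_reverse(fam, pa, j, i):
    """Score change for reversing X_j -> X_i: deletion plus opposite addition."""
    return delta_delete(fam, pa, j, i) + delta_add(fam, pa, i, j)


# Toy family score (made up): each listed arc contributes its reward,
# every other arc costs one unit.
REWARDS = {("A", "B"): 3.0, ("B", "C"): 2.0}


def toy_family_score(child, parents):
    return sum(REWARDS.get((p, child), -1.0) for p in parents)


# Current graph: A -> B, with C isolated.
pa = {"A": frozenset(), "B": frozenset({"A"}), "C": frozenset()}
```

Each difference touches at most two families, which is why a single local move never requires rescoring the whole graph.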

Algorithm 1 outlines the hill climbing algorithm for structural learning of Bayesian networks. Although any DAG (\(\mathcal{G }_0\)) can be used to initialize the search, usually the empty graph (i.e. a graph with no arcs) is used. In the algorithm, we assume that each time a family is scored for the first time, the obtained value \(f_D(\cdot )\) is added to a cache. Subsequent computations can be avoided by just checking that cache.
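The following Python sketch illustrates the general shape of such a hill climbing search with a score cache. It is not a transcription of Algorithm 1: the function names, the representation of the DAG as a map from each node to its parent set, and the toy score used in the usage example are all assumptions made for illustration.

```python
from itertools import permutations


def creates_cycle(pa, src, dst):
    """True if adding src -> dst would close a directed cycle,
    i.e. if dst is already an ancestor of src."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(pa[node])
    return False


def hill_climbing(nodes, family_score):
    pa = {x: frozenset() for x in nodes}      # empty initial graph G_0
    cache = {}                                # family -> score

    def fam(child, parents):
        key = (child, frozenset(parents))
        if key not in cache:                  # score each family only once
            cache[key] = family_score(child, key[1])
        return cache[key]

    while True:
        best_delta, best_pa = 0.0, None
        for j, i in permutations(nodes, 2):
            candidates = []
            if j in pa[i]:
                without = {**pa, i: pa[i] - {j}}
                candidates.append(without)                          # delete j -> i
                if not creates_cycle(without, i, j):
                    candidates.append({**without, j: pa[j] | {i}})  # reverse j -> i
            elif not creates_cycle(pa, j, i):
                candidates.append({**pa, i: pa[i] | {j}})           # add j -> i
            for new_pa in candidates:
                # Only the families of X_i and X_j can have changed.
                delta = sum(fam(x, new_pa[x]) - fam(x, pa[x]) for x in (i, j))
                if delta > best_delta:
                    best_delta, best_pa = delta, new_pa
        if best_pa is None:                   # no improving neighbor: local optimum
            return pa
        pa = best_pa


# Usage with a made-up toy score that rewards A -> B and B -> C.
REWARDS = {("A", "B"): 3.0, ("B", "C"): 2.0}


def toy_family_score(child, parents):
    return sum(REWARDS.get((p, child), -1.0) for p in parents)


learned = hill_climbing(["A", "B", "C"], toy_family_score)
```

With the toy score the search adds \(A \rightarrow B\), then \(B \rightarrow C\), and stops because no remaining operation has a positive difference.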

3.1 Theoretical considerations

In this section, we review some (desirable) properties of scoring metrics, most of which are taken from [9]. These concepts will constitute the theoretical basis of our proposal.

Definition 5

A scoring metric \(f\) is score equivalent if for any pair of equivalent DAGs, \(\mathcal{G }\) and \(\mathcal{G }^{\prime }, f(\mathcal{G }:D) = f(\mathcal{G }^{\prime }:D)\).

Two DAGs are equivalent if they lead to the same essential graph, that is, if they share the same skeleton and the same v-structures [38].
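This characterization translates into a short equivalence check. The sketch below represents a DAG as a map from each node to its parents; the function names and the example graphs are hypothetical.

```python
from itertools import combinations


def skeleton(pa):
    """Undirected edges of the DAG `pa` (node -> iterable of parents)."""
    return {frozenset((p, x)) for x, parents in pa.items() for p in parents}


def v_structures(pa):
    """Colliders a -> x <- b whose tail nodes a and b are not adjacent."""
    skel = skeleton(pa)
    return {
        (frozenset((a, b)), x)
        for x, parents in pa.items()
        for a, b in combinations(sorted(parents), 2)
        if frozenset((a, b)) not in skel
    }


def equivalent(pa1, pa2):
    """Same skeleton and same v-structures."""
    return skeleton(pa1) == skeleton(pa2) and v_structures(pa1) == v_structures(pa2)


chain = {"A": [], "B": ["A"], "C": ["B"]}           # A -> B -> C
reversed_chain = {"C": [], "B": ["C"], "A": ["B"]}  # A <- B <- C
collider = {"A": [], "C": [], "B": ["A", "C"]}      # A -> B <- C
```

The two chains are equivalent, so a score equivalent metric assigns them the same value, whereas the collider belongs to a different equivalence class even though it shares their skeleton.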

Definition 6

(Consistent scoring criterion [9]) Let \(D\) be a dataset containing \(m\) independent and identically distributed (iid) samples from some distribution \(p\). Let \(\mathcal{G }\) and \(\mathcal{H }\) be two DAGs. Then, a scoring metric \(f\) is consistent if in the limit as \(m\) grows large, the following two properties hold:
  1. If \(\mathcal{H }\) contains \(p\) and \(\mathcal{G }\) does not contain \(p\), then \(f(\mathcal{H }:D) > f(\mathcal{G }:D)\).

  2. If \(\mathcal{H }\) and \(\mathcal{G }\) contain \(p\), but \(\mathcal{G }\) is simpler than \(\mathcal{H }\) (has fewer parameters), then \(f(\mathcal{G }:D) > f(\mathcal{H }:D)\).


A probability distribution \(p\) is contained in a DAG \(\mathcal{G }\) if there exists a set of parameter values \(\varvec{\Theta }\) such that the Bayesian network defined by \((\mathcal{G },\varvec{\Theta })\) represents \(p\) exactly. Of course, if two graphs are correct, then the sparser one should receive more merit.

Proposition 1

(From [9, 22, 23]) The Bayesian scoring criterion is score equivalent and consistent.


Definition 7

(Locally Consistent scoring criterion [9]) Let \(D\) be a dataset containing \(m\) iid samples from some distribution \(p\). Let \(\mathcal{G }\) be any DAG, and \(\mathcal{G }^{\prime }\) the DAG obtained by adding edge \(X_{i} \rightarrow X_{j}\) to \(\mathcal{G }\). A scoring metric is locally consistent if in the limit as \(m\) grows large, the following two conditions hold:
  1. If \(\lnot I_p(X_{i},X_{j} | \mathbf{Pa }_\mathcal{G }(X_{j}))\), then \(f(\mathcal{G }:D) < f(\mathcal{G }^{\prime }:D)\).

  2. If \(I_p(X_{i},X_{j} | \mathbf{Pa }_\mathcal{G }(X_{j}))\), then \(f(\mathcal{G }:D) > f(\mathcal{G }^{\prime }:D)\).


This is the main result for the proposal in [17], and the extension/improvement we propose in this paper, because from the concept of local consistency we can (asymptotically) assume that the differences computed by a locally consistent scoring metric \(f\) can be used as conditional independence tests over the dataset \(D\). To do this, we have to suppose that \(D\) constitutes a sample which is isomorphic\(^{3}\) to a graph.
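One way to see why a score difference behaves like a conditional independence test is to write it out for a concrete metric. For BIC (an assumption of this sketch; other locally consistent metrics behave analogously in the limit), the difference for adding \(X_{i} \rightarrow X_{j}\) equals the sample size times the empirical conditional mutual information between \(X_{i}\) and \(X_{j}\) given the current parents of \(X_{j}\), minus a penalty for the extra parameters. The function name and the synthetic counts below are hypothetical.

```python
import math
from collections import Counter


def bic_delta_add(data, i, j, cond):
    """BIC change for adding X_i -> X_j when X_j already has parents `cond`.

    Equals N * I(X_i; X_j | cond) minus the BIC penalty for the extra
    parameters, with the mutual information estimated from counts.
    """
    n = len(data)
    cond = sorted(cond)
    key = lambda row: tuple(row[c] for c in cond)
    n_z = Counter(key(r) for r in data)
    n_iz = Counter((r[i], key(r)) for r in data)
    n_jz = Counter((r[j], key(r)) for r in data)
    n_ijz = Counter((r[i], r[j], key(r)) for r in data)
    # N times the empirical conditional mutual information, in nats.
    gain = sum(
        c * math.log(c * n_z[z] / (n_iz[(xi, z)] * n_jz[(xj, z)]))
        for (xi, xj, z), c in n_ijz.items()
    )
    card = lambda v: len({r[v] for r in data})
    q = math.prod(card(c) for c in cond)       # parent configurations before
    extra_params = (card(j) - 1) * q * (card(i) - 1)
    return gain - 0.5 * math.log(n) * extra_params


# Synthetic counts (made up): X and Y are both noisy copies of Z, so they are
# marginally dependent but exactly independent given Z in these counts.
data = []
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            copies = (8 if x == z else 2) * (8 if y == z else 2)
            data += [{"X": x, "Y": y, "Z": z}] * copies

marginal = bic_delta_add(data, "X", "Y", [])        # positive: dependence
conditional = bic_delta_add(data, "X", "Y", ["Z"])  # negative: independence
```

The sign of the difference therefore plays the role of the test outcome: a positive value signals dependence given the current parent set, and a negative value signals independence.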

Proposition 2

[9] The Bayesian scoring criterion is locally consistent.

Particular Bayesian scores for which Propositions 1 and 2 hold are BDe (Bayesian Dirichlet score with the assumption of likelihood equivalence) [25], and BIC (Bayesian Information Criterion) [34].

3.2 Constrained hill climbing methods

The hill climbing (HC) algorithm with \(\{\)arc-addition, arc-deletion, arc-reversal\(\}\) operations is without any doubt the most frequently used algorithm because of its ease of implementation, efficiency and the quality of the output it offers. That quality is supported by the fact that this algorithm asymptotically guarantees a minimal I-map [17, Proposition 3].

In [17], a constrained hill climbing (CHC) algorithm is proposed. The scheme of this proposal is shown in Algorithm 2, and as can be observed, it is almost identical to HC (see Algorithm 1), the only difference being the inclusion of the forbidden parent sets, \(FP(\cdot )\). For each variable, an \(FP(\cdot )\) set is considered and updated in each step of the greedy search. Thus, the nodes included in \(FP(X)\) are not taken into account as possible parents for \(X\) during the rest of the search. The CHC algorithm, like HC, assures monotonicity, i.e. an improvement is guaranteed in each step, and it stops when there is no neighbor of \(\mathcal{G }\) that improves \(f(\mathcal{G })\); therefore, termination is also guaranteed. However, unlike HC, CHC cannot ensure that \(\hat{\mathcal{G }} = CHC(\mathcal{G }_0)\) is, asymptotically, an I-map of \(p\). To solve this problem, the CHC\(^{*}\) algorithm is proposed in [17, 19], where the output of CHC is used as the input for unconstrained hill climbing. This latter algorithm shares the property of HC that the resulting DAG is a minimal I-map.
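The effect of the forbidden parent sets on the evaluation of additions can be sketched as follows. The update rule used here (a negative difference for adding \(X_{j} \rightarrow X_{i}\) puts \(X_{j}\) in \(FP(X_{i})\) and \(X_{i}\) in \(FP(X_{j})\)) is inferred from the worked example in Sect. 4 rather than copied from Algorithm 2, so it should be read as an assumption. The function name and the toy score are hypothetical, and the acyclicity check is omitted for brevity.

```python
def evaluate_additions(nodes, pa, fam, fp):
    """Score every allowed addition X_j -> X_i and update the FP sets in place.

    Returns a dict mapping (j, i) to the score difference of each tested
    addition that was not forbidden.
    """
    deltas = {}
    for j in nodes:
        for i in nodes:
            if i == j or j in pa[i] or i in pa[j] or j in fp[i]:
                continue  # forbidden parents are never tested again
            delta = fam(i, pa[i] | {j}) - fam(i, pa[i])
            if delta < 0:
                # Independence detected: forbid the link in both directions.
                fp[i].add(j)
                fp[j].add(i)
            else:
                deltas[(j, i)] = delta
    return deltas


# Toy score (made up): one value per unordered pair, negative for the pairs
# that are independent given the empty set.
STRENGTH = {frozenset("AB"): 5.0, frozenset("BC"): 5.0, frozenset("CD"): 5.0,
            frozenset("BD"): 1.0, frozenset("AC"): -1.0, frozenset("AD"): -1.0}


def toy_family_score(child, parents):
    return sum(STRENGTH[frozenset((p, child))] for p in parents)


nodes = ["A", "B", "C", "D"]
pa = {x: set() for x in nodes}
fp = {x: set() for x in nodes}
deltas = evaluate_additions(nodes, pa, toy_family_score, fp)
```

Starting from the empty graph over four variables, ten of the twelve possible additions are tested; the two that are skipped are those whose reverse direction already revealed an independence.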

Two other successful variants of CHC\(^{*}\), called iCHC and 2iCHC, respectively, are also introduced in [17]. The first one consists of the iteration of CHC by taking as input the output obtained in the previous iteration. Because \(FP\) sets are reset at the beginning of each iteration, the algorithm can escape from the current point as an unconstrained HC would do. The algorithm stops when no modification is carried out over the DAG returned by the previous iteration. When this happens, and because \(FP\) sets are cleared at the beginning, we can be sure that unconstrained HC would also stop without making any modification, and so we get a minimal I-map. Algorithm 2iCHC just carries out two iterations, which is proven to be enough to ensure that a minimal I-map is obtained [17, Proposition 4]. Obviously, this second approach is faster than the first one, though sometimes it obtains networks whose score is slightly worse.
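The control flow shared by iCHC and 2iCHC can be sketched as a small wrapper. Here `chc` is assumed to be any function that runs one constrained search from a given DAG with freshly emptied \(FP\) sets; the wrapper name, the `max_iterations` parameter and the stand-in `fake_chc` (which only mimics a search that changes the graph a little on each call) are hypothetical.

```python
def iterated_chc(chc, g0, max_iterations=None):
    """Run `chc` repeatedly, feeding each output back in as the next start.

    With max_iterations=None this mirrors iCHC (stop when the graph no longer
    changes); with max_iterations=2 it mirrors 2iCHC.
    """
    graph, iterations = g0, 0
    while max_iterations is None or iterations < max_iterations:
        new_graph = chc(graph)      # chc is assumed to start with empty FP sets
        iterations += 1
        if new_graph == graph:      # no modification: stop
            break
        graph = new_graph
    return graph, iterations


# Stand-in for one CHC run: adds at most one missing arc per call.
TARGET = [("A", "B"), ("B", "C"), ("C", "D")]


def fake_chc(graph):
    missing = [arc for arc in TARGET if arc not in graph]
    return graph | {missing[0]} if missing else graph


full, n_full = iterated_chc(fake_chc, frozenset())
two, n_two = iterated_chc(fake_chc, frozenset(), max_iterations=2)
```

Note that the final iteration of the unbounded variant performs a complete search only to confirm that nothing changes.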

4 The one iteration CHC Algorithm: FastCHC

The necessity of iterating the CHC algorithm is due to the possibility that the arcs belonging to a v-structure in the true graph are discovered in the wrong direction during the learning process. If this is the case, an I-map can be recovered in two different ways. First, those arcs can be reversed, but in practice this is difficult to do, and there even exist situations in which the reversal operation for those arcs is impossible in the context of a local search over the current graph. The second solution is to add an arc between the tail nodes of the v-structure; however, this action is not easy to carry out in the CHC approach. The reason is that both tail nodes (\(T1\) and \(T2\)) are independent given a subset of nodes \(\mathbf S \) in the original (true) distribution, and therefore, when the algorithm checks the link between \(T1\) and \(T2\) with the subset \(\mathbf S \) as the parent set of \(T1\) or \(T2\), \(T1\) will be included in \(FP(T2)\) and/or vice versa during that stage of the search. This is the reason for the success of iterated versions of CHC: when starting a new iteration, the \(FP()\) sets are empty and so that type of arcs can be reversed.

In this paper our challenge, and main contribution, is to be able to reverse those arcs in a single-iteration CHC algorithm. The first consequence would be a neater algorithm and the second, hopefully, an algorithm more efficient than the iterated versions, given that only one iteration is performed. Of course, as the nodes in the wrongly discovered v-structure would be totally connected, the output of this new algorithm will still be a minimal I-map.

The proposed method, called FastCHC, is described in Algorithm 3, and is basically the same code as in CHC (see Algorithm 2). The difference with respect to the CHC algorithm is in lines 15–20. The purpose of these lines is to allow dependences to be added in order to satisfy the I-map condition when a v-structure is not properly identified. Specifically, when addition of \(X_{j} \rightarrow X_{i}\) is the selected operation, any node adjacent to \(X_{j}\) is removed from \(FP(X_{i})\) and any node adjacent to \(X_{i}\) is removed from \(FP(X_{j})\). By updating the \(FP\) sets in this way, in subsequent operations it is possible to add an arc from any neighbor of \(X_{i}\) to \(X_{j}\) and vice versa. Furthermore, the opposite direction (\(X_{i} \rightarrow X_{j}\)) is also considered, removing \(X_{i}\) from \(FP(X_{a})\) and \(X_{j}\) from \(FP(X_{b})\). The following example illustrates this behaviour.
Fig. 1 Graphs for Example 1: a target graph; b DAG after first step; c DAG after second step; d DAG after third step (result of CHC); e DAG representing a minimal I-map of a

Example 1

Let us consider the DAG \(\mathcal{G }_t\) shown in Fig. 1a as our true or target model, and suppose that we obtain a (very large) dataset \(D\) by sampling from it. Let us also assume that (as usual) we take the empty graph as the starting point for the search. Then, after initializing the forbidden parent sets to be empty (\(FP(A) = FP(B) = FP(C) = FP(D) = \emptyset \)), if we forbid edges as the search progresses then, in the first step, we have to test all the following pairs:
  (1) Add \(A \rightarrow B\): \(diff \gg 0\) because \(A\rightarrow B \in \mathcal{G }_t\), so \(\lnot I(A,B|\emptyset )_{\mathcal{G }_t}\).

  (2) Add \(A \rightarrow C\): \(diff < 0\) because \(I(A,C|\emptyset )_{\mathcal{G }_t}\): \(FP(C) = \{A\}, FP(A) = \{C\}\).

  (3) Add \(A \rightarrow D\): \(diff < 0\) because \(I(A,D|\emptyset )_{\mathcal{G }_t}\): \(FP(D) = \{A\}, FP(A) = \{C,D\}\).

  (4) Add \(B \rightarrow A\): \(diff \gg 0\) because \(A\rightarrow B \in \mathcal{G }_t\), so \(\lnot I(A,B|\emptyset )_{\mathcal{G }_t}\).

  (5) Add \(B \rightarrow C\): \(diff \gg 0\) because \(C\rightarrow B \in \mathcal{G }_t\), so \(\lnot I(C,B|\emptyset )_{\mathcal{G }_t}\).

  (6) Add \(B \rightarrow D\): \(diff > 0\) because \(\lnot I(B,D|\emptyset )_{\mathcal{G }_t}\).

  (–) There is no need to test add \(C \rightarrow A\) because \(C \in FP(A)\). However, if \(C\) had not been added to \(FP(A)\) in step (2), then this step could not be skipped.

  (7) Add \(C \rightarrow B\): \(diff \gg 0\) because \(C\rightarrow B \in \mathcal{G }_t\), so \(\lnot I(C,B|\emptyset )_{\mathcal{G }_t}\).

  (8) Add \(C \rightarrow D\): \(diff \gg 0\) because \(D\rightarrow C \in \mathcal{G }_t\), so \(\lnot I(D,C|\emptyset )_{\mathcal{G }_t}\).

  (–) There is no need to test add \(D \rightarrow A\) because \(D \in FP(A)\). However, if \(D\) had not been added to \(FP(A)\) in step (3), then this step could not be skipped.

  (9) Add \(D \rightarrow B\): \(diff > 0\) because \(\lnot I(B,D|\emptyset )_{\mathcal{G }_t}\).

  (10) Add \(D \rightarrow C\): \(diff \gg 0\) because \(D\rightarrow C \in \mathcal{G }_t\), so \(\lnot I(D,C|\emptyset )_{\mathcal{G }_t}\).

We can assume that FastCHC chooses to add \(B \rightarrow C\) (Fig. 1b). The process continues, and those scores not affected by the last operation are retrieved from the cache. Thus, the only action to test is:
  (11) Add \(D \rightarrow C\): \(diff > 0\) because \(\lnot I(C,D|\{B\})_{\mathcal{G }_t}\).

At this point we can assume that FastCHC performs the addition operation with \(C\rightarrow D\) (Fig. 1c). Again the scores not affected by the last operation are taken from the cache and not recomputed. The algorithm continues by analyzing the following action:
  (12) Add \(B \rightarrow D\): \(diff <0\) because \(I(B,D|\{C\})_{\mathcal{G }_t}\): \(FP(B) = \{D\}, FP(D) = \{A,B\}\).

Addition \(B\rightarrow A\) is now the operation selected at this step, but because of lines 15–20 in the FastCHC algorithm (Algorithm 3), \(FP\) sets for \(A\) and \(B\) must be updated besides the graph (Fig. 1d). Thus, now \(FP(A) = \{D\}, FP(B) = \{D\}\) and \(FP(C)=\emptyset \).
In the next step, FastCHC needs to study the operation:
  (13) Add \(C \rightarrow A\): \(diff > 0\) because \(\lnot I(A,C| \{B\})_{\mathcal{G }_t}\).

Then, FastCHC will include \(A\rightarrow C\), yielding a minimal I-map (Fig. 1e). We should note that when the arc \(B \rightarrow A\) was added, the sets \(FP()\) were updated accordingly in order to delete the variable \(A\) from \(FP(C)\), this operation being the only difference between the previous algorithms and FastCHC. Without updating those FP sets, \(A \rightarrow C\) cannot be considered because \(A \in FP(C)\).
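The update of the \(FP\) sets when \(B \rightarrow A\) is added in Example 1 can be replayed with a short sketch. The function names are hypothetical, and the reading of \(X_{a}\) as any node adjacent to \(X_{j}\) and of \(X_{b}\) as any node adjacent to \(X_{i}\) is an assumption of this sketch, chosen because it reproduces the \(FP\) sets reported in the example; Algorithm 3 remains the authoritative description.

```python
def adjacent(pa, x):
    """Parents and children of x in the DAG `pa` (node -> set of parents)."""
    return set(pa[x]) | {c for c, parents in pa.items() if x in parents}


def add_arc_and_release(pa, fp, j, i):
    """Add X_j -> X_i, then relax the forbidden-parent sets around the new arc."""
    pa[i].add(j)
    adj_j = adjacent(pa, j) - {i}
    adj_i = adjacent(pa, i) - {j}
    fp[i] -= adj_j              # neighbors of X_j may now become parents of X_i
    fp[j] -= adj_i              # neighbors of X_i may now become parents of X_j
    for a in adj_j:             # opposite direction: X_i as a parent of X_a
        fp[a].discard(i)
    for b in adj_i:             # opposite direction: X_j as a parent of X_b
        fp[b].discard(j)


# State of Example 1 just before B -> A is added (arcs B -> C and C -> D).
pa = {"A": set(), "B": set(), "C": {"B"}, "D": {"C"}}
fp = {"A": {"C", "D"}, "B": {"D"}, "C": {"A"}, "D": {"A", "B"}}
add_arc_and_release(pa, fp, "B", "A")
```

After the call, \(FP(A) = \{D\}\), \(FP(B) = \{D\}\) and \(FP(C)=\emptyset \), so an arc between \(A\) and \(C\) can be considered again in the next step.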


The following proposition establishes the type of output provided by the FastCHC algorithm.

Proposition 3

Let \(D\) be a dataset containing \(d\) independent and identically distributed samples from some distribution \(P\). Let \(\hat{G}\) be the directed acyclic graph obtained by running the FastCHC algorithm with \(G_0\) as the initial solution, i.e., \(\hat{G} = FastCHC(G_0)\). If the score function \(f\) used to evaluate the candidate graphs in FastCHC is consistent and locally consistent, then under the assumption that \(P\) is faithful, \(\hat{G}\) is a minimal I-map of \(P\) in the limit as \(d\) grows large.

Proof in Appendix A.


5 Experimental evaluation

In this section we describe the set of experiments aimed at testing the performance of the algorithm presented in this paper. Given the nature of the proposed algorithm, we are interested not only in the quality of the solution obtained (accuracy) but also in the resources it needs (efficiency). Below we provide details about the implementation of the algorithms, the datasets used in this comparison and the performance indicators chosen to assess each algorithm; we end the section with the results and their analysis.

5.1 Algorithms

The algorithms examined in this section, apart from FastCHC, are the standard hill climbing (HC), the constrained hill climbing followed by unconstrained hill climbing (CHC*), iterated constrained hill climbing (iCHC), and two-iteration constrained hill climbing (2iCHC) algorithms. In all cases an empty network is used as starting graph. We do not consider the constrained hill climbing algorithm (CHC) because it does not guarantee a minimal I-map of the original distribution. In addition, we consider the Max–Min hill climbing algorithm (MMHC) [36] as the best scalable state-of-the-art structural learning algorithm for dealing with a high number of variables (though it does not guarantee an I-map is obtained).

An implementation of MMHC is available in Matlab from the authors' web page. However, access to the source code is not free. As the functionality provided by that implementation is limited, and furthermore the second stage of the algorithm is based on a tabu search rather than on canonical hill climbing, a custom implementation is used here. This implementation is based on the implementation of the MMPC algorithm available from [32], written in C, with the efficiency improvements described in [36]. For the second stage of the algorithm, we employ the same hill climbing implementation that is used in this comparison.

The score metric used here is the Bayesian Dirichlet equivalent uniform (BDeu) [25] with the same parameters as in [36], i.e. equivalent sample size equal to 10.
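As an illustration of how such a score is evaluated for a single family (a variable together with its parents), the following is a minimal sketch of the standard BDeu formula with an equivalent sample size of 10. The function name and the layout of the count table are assumptions made for this example and are not taken from the implementation used in the experiments.

```python
from math import lgamma

def bdeu_family_score(counts, ess=10.0):
    """Log BDeu score of one family (a variable and its parents).

    counts[j][k] is the number of instances with the parents in their j-th
    configuration and the child in its k-th state. Every parent configuration
    needs a row, including unobserved ones (all zeros).
    """
    q = len(counts)        # number of parent configurations
    r = len(counts[0])     # number of child states
    a_j = ess / q          # prior mass per parent configuration
    a_jk = ess / (q * r)   # prior mass per (configuration, state) cell
    score = 0.0
    for row in counts:
        n_j = sum(row)
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for n_jk in row:
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score
```

Because the score decomposes over families, the score of a whole network is the sum of this quantity over all variables, which is what makes local search operators cheap to evaluate.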

The implementation was coded in Java and interacts with the ProGraMo library for dataset and graph structures management [20].

In order to speed up all the algorithms we use an internal cache in which we save the result of every score computation so that it can be reused later in the execution. This cache is implemented with a hash table\(^{4}\) that uses the name of the probability family as key and stores the score metric (a real value) for that family as value. Therefore, updating and querying the cache take constant time on average. To give the reader an idea of the benefits obtained by using this structure, the ratio \(\frac{\#\text{ of statistics required}}{\#\text{ of actual calls to } f()}\) is 20, 25 and 75 for the hill climbing algorithm when dealing with datasets/networks having 25, 30 and 100 variables, respectively (data from [18]).
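The caching idea can be sketched as follows. This is only an illustrative Python version of the scheme described above: the class name, the key format and the `score_fn` callable are hypothetical and do not reproduce the Java code used in the experiments.

```python
class ScoreCache:
    """Memoises family scores so each (child, parents) pair is scored once."""

    def __init__(self, score_fn):
        self._score_fn = score_fn   # hypothetical callable: (child, parents) -> float
        self._table = {}            # hash table: family key -> score
        self.requests = 0           # statistics requested by the search
        self.calls = 0              # actual evaluations of the score function

    def score(self, child, parents):
        key = (child, frozenset(parents))   # parent order must not matter
        self.requests += 1
        if key not in self._table:
            self.calls += 1
            self._table[key] = self._score_fn(child, parents)
        return self._table[key]

    def ratio(self):
        """Requested statistics divided by actual score computations."""
        return self.requests / self.calls if self.calls else 0.0
```

The `ratio` method corresponds to the quotient reported in the text: the larger it is, the more score computations the cache has saved.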

All the runs of these algorithms were carried out on a dedicated server with a Pentium Xeon 3.0 GHz processor, 64-bit architecture and 32 GB of RAM, running Linux. The Java Runtime Environment used is the one provided by Sun Microsystems, version 1.5.

5.2 Networks

For this experimental comparison, we have selected the set of networks available in the Bayesian Network Repository. These are Alarm, Barley, HailFinder, Insurance, Link, Mildew, Munin (versions 1, 2, 3 and 4), Pigs, and PathFinder (versions 1 and 2.3). The Diabetes network is not used because of its dynamic nature, and Carpo is not used because it contains non-standard probability functions. More detailed information about these networks is given in Table 1.
Table 1

Bayesian networks used in our experimental evaluation

Column headings: States per node, Max states, Parents per node, Min–max parents, PC per node, Max PC

Network descriptions:
  • Monitoring of emergency care patients
  • Model of barley crop yields
  • A model for insulin dose adjustment (DBN)
  • Predicting hail in northern Colorado
  • Evaluating insurance applications
  • Pedigree for linkage analysis
  • A model for deciding on the amount of fungicides to be used against attack of mildew in wheat
  • An expert electromyography assistant
  • Analysis of lymph cell pathologies
  • Pedigree of breeding pigs
  • A model of the biological processes of a water purification plant
  • A model for printer troubleshooting in Microsoft Windows 95

For each network we indicate a description, the number of variables, edges, average number of states, minimum and maximum number of states, average number of parents per variable, maximum number of parents, average number of parents and children (PC) per variable and maximum number of parents and children.

For all these networks, we obtained different datasets by sampling with 500, 1,000 and 5,000 instances. Each dataset was given the same name as its corresponding network. In fact, six datasets are sampled for each network and size: one is used for learning and the other five for validation (the average is reported).
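The text does not state how the instances were drawn. A common choice, assumed here, is forward (ancestral) sampling, in which each variable is sampled after its parents. The sketch below uses a hypothetical two-node network and made-up probabilities; only the split into one learning set and five validation sets follows the description above.

```python
import random

def forward_sample(order, parents, cpts, n, seed=0):
    """Draw n instances by sampling each variable after its parents.

    order   -- variables in a topological order of the DAG
    parents -- dict: variable -> tuple of parent variables
    cpts    -- dict: variable -> {parent state tuple: {state: probability}}
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        row = {}
        for var in order:
            pa_states = tuple(row[p] for p in parents[var])
            dist = cpts[var][pa_states]
            states = list(dist)
            row[var] = rng.choices(states, weights=[dist[s] for s in states])[0]
        data.append(row)
    return data

# Toy network A -> B with hypothetical probabilities
order = ["A", "B"]
parents = {"A": (), "B": ("A",)}
cpts = {
    "A": {(): {0: 0.3, 1: 0.7}},
    "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}
train = forward_sample(order, parents, cpts, n=500, seed=1)
validation = [forward_sample(order, parents, cpts, n=500, seed=s) for s in range(2, 7)]
```

Using a separate seed for each dataset keeps the learning and validation samples independent while leaving every run reproducible.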

5.3 Performance indicators

Two kinds of factors are used as performance indicators, one being the quality of the network obtained by the algorithm, and the other the complexity of each algorithm.

The first group contains two different statistics that, respectively, compare the learned network regarding a dataset and the learned network regarding the golden model, i.e., the network used to generate the datasets:
  • Loglikelihood. Given a dataset \(D=\{\mathbf{v}^1,\ldots , \mathbf{v}^m\}\) of instances and a network \(\mathcal{G}\) defined over a set of variables \(V=\{X_1,\ldots ,X_n\}\), the log-likelihood of \(\mathcal{G}\) with respect to \(D\) is computed as:
    $$\begin{aligned} LL(\mathcal{G}:D) = \sum _{i=1}^{m} \log ( P_\mathcal{G}(\mathbf{v}^{i}|D)). \end{aligned}$$
  • Structural Hamming Distance (SHD). Given two models \(\mathcal{G}_1\) and \(\mathcal{G}_2\) (the golden one and the learned one in our case), the SHD is computed over their essential graphs \(eg(\mathcal{G}_1)\) and \(eg(\mathcal{G}_2)\) so as not to penalize structural differences that cannot be statistically distinguished. The algorithm described in [6] is used to transform a given DAG into an essential graph, or partially directed acyclic graph (PDAG). Then,
    $$\begin{aligned} SHD(eg(\mathcal{G}_1):eg(\mathcal{G}_2)) &= |eg(\mathcal{G}_2) - eg(\mathcal{G}_1)| + |eg(\mathcal{G}_1) - eg(\mathcal{G}_2)| \\ &\quad + |diff(\mathcal{G}_1,\mathcal{G}_2)| \end{aligned}$$
    that is, SHD is computed as the number of edges missing in the first graph with respect to the second, plus the number of extra edges in the first graph with respect to the second, plus the number of edges present in both graphs but with different directions. This last set of edges includes the reversed edges and also the edges that are undirected in one graph and directed in the other. See [36] for details.
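Both quality indicators can be sketched in a few lines. The code below is an illustration under stated assumptions, not the evaluation code used in the paper: the network parameters are assumed to have been estimated already (with no zero probabilities), the inputs to `shd` are assumed to be PDAGs already (the DAG-to-essential-graph conversion of [6] is not shown), and the data layout and helper names are hypothetical.

```python
from math import log

def log_likelihood(data, parents, cpts):
    """Sum over instances of log P_G(v), with P_G factorised by families."""
    total = 0.0
    for row in data:
        for var, pa in parents.items():
            pa_states = tuple(row[p] for p in pa)
            total += log(cpts[var][pa_states][row[var]])
    return total

def pdag(directed=(), undirected=()):
    """Map each adjacent pair to its orientation: (x, y) for x -> y, None if undirected."""
    edges = {frozenset((x, y)): (x, y) for x, y in directed}
    edges.update({frozenset((x, y)): None for x, y in undirected})
    return edges

def shd(pdag1, pdag2):
    """Structural Hamming distance between two PDAGs built with pdag()."""
    missing = sum(1 for pair in pdag2 if pair not in pdag1)   # in the second graph only
    extra = sum(1 for pair in pdag1 if pair not in pdag2)     # in the first graph only
    differ = sum(1 for pair in pdag1
                 if pair in pdag2 and pdag1[pair] != pdag2[pair])  # orientation mismatch
    return missing + extra + differ

# Hypothetical golden and learned essential graphs
gold = pdag(directed=[("A", "C"), ("B", "C")], undirected=[("C", "D")])
learned = pdag(directed=[("C", "A")], undirected=[("B", "C"), ("D", "E")])
distance = shd(gold, learned)   # 1 missing + 1 extra + 2 with different orientation
```

Storing one entry per adjacent pair makes the three terms of the definition directly visible: two set differences on the adjacencies plus a comparison of orientations on the shared ones.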
With regard to the efficiency factor, the number of computations carried out by each algorithm, i.e. statistical tests or score function calls, is collected, as this has a direct correspondence with the CPU time required. It is important to notice that given the differences in programming language, i.e. C++ and Java, the run time is not a fair indicator.

5.4 Results

In the following tables, we present a summary of the results we obtained after running the 6 algorithms over the 16 domains (see Table 1). The summary consists of the average rank of each algorithm in each dataset, from 1 to 6, 1 being the best result. The complete list of results can be seen in Appendix B. The MMHC algorithm could not handle many of the big datasets, especially those with 5,000 instances. The reason is that the algorithm needs more memory than is available in the computer, or the process does not finish before the set limit of 1 week.\(^{5}\) In these cases, we consider that the algorithm gives a worse result than the others in order to penalize the fact that it cannot produce a result under the given conditions.

For each one of the indicators, we performed a Friedman test [15] in order to discover whether we can assume that all algorithms deliver the same performance; when we cannot, we use Holm's post hoc test [26] to see which algorithms are not significantly worse than the best one. In all the tests, the significance level is set to 0.05.
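The ranking and testing procedure can be sketched with the standard library alone. This is a minimal illustration, assuming the usual textbook forms of the Friedman statistic and of Holm's step-down procedure with two-sided normal p-values against the best-ranked algorithm; the function names are hypothetical and the exact variant used in the experiments may differ.

```python
from math import erfc, sqrt

def average_ranks(results, higher_is_better=True):
    """results[d][a] is the value of algorithm a on dataset d; ties share the mean rank."""
    k = len(results[0])
    totals = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda a: row[a], reverse=higher_is_better)
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1          # ranks are 1-based
            for a in order[i:j + 1]:
                totals[a] += mean_rank
            i = j + 1
    return [t / len(results) for t in totals]

def friedman_statistic(ranks, n):
    """Friedman chi-square statistic (k - 1 degrees of freedom) from average ranks."""
    k = len(ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)

def holm_vs_best(ranks, n, alpha=0.05):
    """Indices of algorithms that Holm's procedure flags as worse than the best-ranked one."""
    k = len(ranks)
    best = min(range(k), key=lambda a: ranks[a])
    se = sqrt(k * (k + 1) / (6 * n))
    pvals = sorted((erfc(abs(ranks[a] - ranks[best]) / se / sqrt(2)), a)
                   for a in range(k) if a != best)
    rejected = []
    for i, (p, a) in enumerate(pvals):
        if p > alpha / (len(pvals) - i):   # step-down: stop at the first non-rejection
            break
        rejected.append(a)
    return rejected
```

The Friedman statistic is compared against a chi-square distribution with \(k-1\) degrees of freedom; algorithms not returned by `holm_vs_best` are the ones that would be marked as not significantly worse than the best.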

In Table 2, we can see the results for the likelihood of the obtained network, averaged over the five datasets sampled for validation. The Friedman test rejects the null hypothesis in all cases. Regardless of the size of the dataset, the best algorithm is always hill climbing, with CHC* not being significantly worse. MMHC is the worst in all cases. It is important to notice that FastCHC improves its rank as the dataset size increases, from 4.63 to 3.60, being the only algorithm that improves consistently as the size of the dataset increases.
Table 2

Average rank of the algorithms based on likelihood over all the validation datasets

The best results are highlighted in bold

\(^\mathrm{a}\) Those results not significantly worse than the best one

Table 3 shows the second quality factor, SHD. Here the Friedman test rejects the null hypothesis in all cases as well. As in the previous case, hill climbing is the best algorithm, but now CHC* is not the only algorithm that produces comparable results: FastCHC is also comparable to hill climbing for all sizes, and iCHC and MMHC are comparable for the smaller ones.
Table 3

Average rank of the algorithms based on SHD over all the datasets

The best results are highlighted in bold

\(^\mathrm{a}\) Those results not significantly worse than the best one

Looking at these two result tables together we can draw the conclusion that the HC algorithm probably obtains such good results in likelihood because it overfits the network model to the data by adding many additional arcs that in turn deteriorate its performance in terms of SHD.

Finally, we study the efficiency indicator, namely the number of statistics (scores) computed, as this number is proportional to CPU time. In Table 4, we can see that the most efficient algorithm is FastCHC. The hill climbing algorithm is the least efficient in general. In this case, the Friedman test suggests that the null hypothesis has to be rejected, and the post hoc test indicates that only 2iCHC is not significantly worse than FastCHC.
Table 4

Average rank of the algorithms based on the statistics computed over all the datasets

The best results are highlighted in bold

\(^\mathrm{a}\) Those results not significantly worse than the best one

In summary, the above tables seem to indicate that the FastCHC algorithm is the best in terms of scalability and efficiency, while HC is the best algorithm in terms of the quality of the model.

However, even though the output obtained by HC is sometimes better than that obtained by FastCHC, the cost associated with this improvement may not be worth it. To support this statement, in Table 5 we show the combined result of the resulting model’s likelihood and the effort required to obtain it. This is the trade-off between quality and efficiency, the ratio between the difference in likelihood for each learned model and the empty network and the number of computations each algorithm needs to obtain such a gain. It can be formulated as \(\frac{LL(M_{i,j})-LL(Empty_{j})}{\#computations_{i,j}}\), where \(M_{i,j}\) is the model obtained by the algorithm \(i\) on the dataset \(j, Empty_{j}\) is the empty model estimated for the dataset \(j\), and \(\#computations_{i,j}\) is the number of statistics computed by the algorithm \(i\) on the dataset \(j\). This factor indicates how much benefit per computation is produced by each algorithm.
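A small numerical sketch makes the ratio concrete. All figures below are hypothetical and chosen only to show how a slightly worse likelihood can still yield a much better gain per computation; they are not results from the experiments.

```python
def gain_per_computation(ll_model, ll_empty, n_computations):
    """Likelihood gained over the empty network per statistic computed."""
    return (ll_model - ll_empty) / n_computations

# Hypothetical figures for two algorithms on the same dataset
ll_empty = -12000.0
slow = gain_per_computation(-9500.0, ll_empty, 50000)   # better fit, many statistics
fast = gain_per_computation(-9700.0, ll_empty, 10000)   # slightly worse fit, few statistics
```

With these made-up numbers the second algorithm obtains a smaller total gain but a larger gain per statistic, which is the situation the combined indicator is designed to expose.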
Table 5

Average rank of the algorithms based on the combined score of likelihood and statistics computed over all the datasets

The best results are highlighted in bold

\(^\mathrm{a}\) Those results not significantly worse than the best one

To conclude the analysis of the results, and as a sort of graphical summary, we represent in Fig. 2 a comparison of the results obtained by the six algorithms over three dimensions: (log)likelihood, structural Hamming distance (SHD) and number of computed statistics. To allow for a fast and visual comparison, we average the result over all the datasets for each algorithm, and take the inverse when necessary, in such a way that in all cases the greater the value, the worse the behaviour of the algorithm. As we can observe from Fig. 2 (and also from the full results shown in Appendix B), the differences in two of the three axes (loglikelihood and SHD) are small, while the big difference appears in the third axis (number of statistics), which graphically reinforces the analysis presented above.
Fig. 2

Graphical comparison of the tested algorithms over three dimensions: (log) likelihood, structural Hamming distance and running time (#statistics). Three different sample sizes are studied (500, 1,000 and 5,000)


6 Conclusions

In this paper, we have proposed an improved version of the algorithms in the CHC family, which in turn are improved versions of the standard hill climbing approach for learning Bayesian networks. This new algorithm avoids the need to iterate the search algorithm by updating the constrained search space after applying the local action/operation selected at each step. This behaviour prevents marginally independent variables from being treated as conditionally independent when, in the actual (underlying) probability distribution, they are not.

Obviously, as the FastCHC algorithm requires only one iteration, it is much more efficient than previous CHC algorithms. This is clear in our experimental evaluation, in which the superiority of FastCHC over MMHC is also made evident.

Therefore, FastCHC constitutes a good candidate for learning BNs in high dimensional problems, in which scalable algorithms are required. Furthermore, because of the speed-up achieved with respect to competing algorithms, we think that there is room to improve the quality of the discovered network by slightly sacrificing the speed of the search. That is, in future work we plan to search more in order to search better. Thus, the idea is to use FastCHC as the basic building block for local search-based metaheuristics such as tabu search or simulated annealing, which have proven able to escape from local optima.


\(^{1}\) We use standard notation, i.e., bold font to denote sets and \(n\)-dimensional configurations, calligraphic font to denote mathematical structures, upper case for variables or sets of random variables, and lower case to denote states of variables or configurations of states (vectors).

\(^{2}\) In case of working with different graphs we will use \(\mathbf{Pa }_\mathcal{{G}}(X_{i})\) to clarify the notation.

\(^{3}\) In fact, [9] proves that the isomorphic condition can be relaxed.

\(^{4}\) Concretely, the Java HashMap class is used.

\(^{5}\) Notice that we do not set a maximum number of parents during the search.



This work has been partially funded by FEDER funds and the Spanish Government (MICINN/MINECO) through project TIN2010-20900-C04-03.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • José A. Gámez (1)
  • Juan L. Mateo (2)
  • José M. Puerta (1)
  1. Department of Computing Systems, Intelligent Systems and Data Mining Group-i3A, University of Castilla-La Mancha, Albacete, Spain
  2. Centre for Organismal Studies, Im Neuenheimer, Heidelberg, Germany
