One iteration CHC algorithm for learning Bayesian networks: an effective and efficient algorithm for high dimensional problems
Abstract
It is well known that learning Bayesian networks from data is an NP-hard problem. For this reason, metaheuristics or approximate algorithms are usually used to provide a good solution. In particular, the family of hill climbing algorithms plays a key role in this scenario because of its good trade-off between computational demand and the quality of the learned models. In addition, these algorithms have several good theoretical properties. In spite of these characteristics of quality and efficiency, when it comes to dealing with high-dimensional datasets, they can be improved upon, and this is the goal of this paper. Recent papers have tackled this problem, usually by dividing the learning task into two or more iterations or phases. The first phase aims to constrain the search space, and, once the space is pruned, the second one consists of a (local) search in this constrained search space. Normally, the first iteration is the one with the highest computational complexity. One such algorithm is constrained hill climbing (CHC), which in its initial iteration not only progressively constrains the search space, but also learns good quality Bayesian networks. A second iteration, or even more, is used in order to improve these networks and also to ensure the good theoretical properties exhibited by the classical hill climbing algorithm. In this latter algorithm we can see that the first iteration is extremely fast when compared to similar algorithms, but the performance decays over the rest of the iterations with respect to the saved CPU time. In this paper, we present an improvement on this CHC algorithm, in which, to put it briefly, we avoid the last iteration while still obtaining the same theoretical properties. Furthermore, we experimentally test the proposed algorithms over a set of different domains, some of them quite large (more than 1,000 variables), in order to study their behavior in practice.
Keywords
Bayesian networks, Machine learning, Score-based learning, Local search, Scalability
1 Introduction
The goal of data mining can be understood as compressing the available data into a more compact representation called a model. Later, this model can be used to tackle different descriptive (e.g. identifying dependence relations, clusters, etc.) or predictive (e.g. classification, computing posterior beliefs) tasks. Bayesian Networks [24, 27, 31] have become one of the favorite knowledge representation formalisms for model-based data mining because of their double descriptive/predictive capability and their innate uncertainty management.

Score+search methods: A function \(f\) is used to score a network/DAG with respect to the training data, and a search method is used to look for the network with the best score. Different scoring metrics [11, 25, 30] and search methods, mainly of a heuristic (e.g. [3, 4, 9, 14, 16, 25, 30]) and metaheuristic (e.g. [2, 8, 12, 28]) nature, have been proposed due to the NP-hardness of the BN structure learning problem [7].

Constraint-based methods: The idea underlying these methods is to satisfy as many independences present in the data as possible [30, 35]. Statistical hypothesis testing is used to determine the validity of conditional independence sentences. There also exist hybrid algorithms that combine these two approaches, e.g. [1], or even hybrid scoring metrics [11].
Our main motivation in this paper is to scale up local search algorithms for learning BNs, such as hill climbing, while maintaining the theoretical properties that this method offers. As the cardinality of this search space is super-exponential [33], a good idea, especially in domains with a large number of variables, is to limit in some way the areas of the search space to be visited. This idea is not new and has been exploited previously in the literature, as we will briefly review in Sect. 2.2.
Our proposal in this paper is to develop a one-stage, single-iteration constrained hill climbing algorithm, with the goal of significantly reducing its running time and so applying it to databases with a larger number of variables. Our experiments confirm that the resulting hill climbing algorithm, called the FastCHC algorithm, is faster than previous state-of-the-art approaches, while maintaining the quality of the discovered network close to that of the networks learnt by this family of algorithms.
This paper is structured as follows: We begin in Sect. 2 by presenting the notation and definitions concerning BNs necessary to build our proposal. In Sect. 3, we describe in detail some local search methods for learning BNs and previous algorithms in the CHC family. Section 4 is devoted to explaining our proposal. In Sect. 5 we describe the experiments carried out to validate our claim about the quality of our new algorithm. Finally, in Sect. 6 we present our conclusions.
2 Preliminaries
2.1 Bayesian networks

A graphical structure, or more precisely a DAG, \(\mathcal{{G}}=(\varvec{V}, \mathbf E )\), where the nodes in \(\varvec{V}=\{X_{1},X_{2},\ldots ,X_{n}\}\) represent the random variables^{1} from the problem we are modeling, and the topology of the graph (the arcs in \(\mathbf E \subseteq \mathbf V \times \mathbf V \)) encodes conditional (in)dependence relationships among the variables (by means of the presence or absence of direct connections between pairs of variables).
 A set of numerical parameters (\(\varvec{\Theta }\)), usually conditional probability tables, drawn from the graph structure: For each variable \(X_{i} \in \varvec{V}\) we have a conditional probability distribution \(P(X_{i} \mid \mathbf{pa}(X_{i}))\), where \(\mathbf{pa}(X_{i})\) represents any combination of the values of the variables in \(\mathbf{Pa}(X_{i})\), and \(\mathbf{Pa}(X_{i})\) is the parent set^{2} of \(X_{i}\) in \(\mathcal{G}\). From these conditional distributions we can recover the joint probability distribution over \(\varvec{V}\) thanks to the Markov Condition:
$$\begin{aligned} P(X_{1}, X_{2}, \ldots , X_{n})= \prod _{i=1}^{n} P(X_{i} \mid \mathbf{Pa}_{\mathcal{G}}(X_{i})) \end{aligned}$$ (1)
This decomposition of the joint distribution gives rise to important savings in storage requirements and also allows the definition of efficient probabilistic inference algorithms by means of local propagation schemes [27].
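As an illustration of the factorization in Eq. (1), the following minimal sketch evaluates the joint probability of a full assignment by multiplying one conditional probability per variable. The three-variable network, its probability values and the names `parents`, `cpt` and `joint_probability` are assumptions made up for this example; they do not come from the paper.

```python
from math import prod  # available from Python 3.8

# Hypothetical toy network A -> C <- B with binary variables.
# All probabilities below are invented for illustration only.
parents = {"A": (), "B": (), "C": ("A", "B")}
cpt = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(): {0: 0.6, 1: 0.4}},
    "C": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6},
        (1, 1): {0: 0.2, 1: 0.8},
    },
}

def joint_probability(assignment):
    """Evaluate Eq. (1): the product of P(X_i | pa(X_i)) over all variables."""
    factors = []
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)  # configuration of Pa(X_i)
        factors.append(cpt[var][pa_values][assignment[var]])
    return prod(factors)
```

Only 1 + 1 + 4 = 6 free parameters are stored here instead of the 7 needed for the full joint table over three binary variables; the saving grows quickly with the number of variables when parent sets stay small.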
Definition 1
A node or variable \(X\) is a collider in a path \(\pi \) if \(X\) has two incoming edges, i.e. we have the subgraph \(A \rightarrow X \leftarrow B\) (also known as a head-to-head node). If the tail nodes (\(A\) and \(B\)) of a collider node are not adjacent in \(\mathcal{G}\), this subgraph is called a v-structure in \(X\).
Definition 2
 1. \(W\) is not a collider and \(W \in \mathbf{Z}\), or
 2. \(W\) is a collider and neither \(W\) nor its descendants are in \(\mathbf{Z}\).
Definition 3
Two nodes \(X\) and \(Y\) are d-separated by \(\mathbf{Z}\) in graph \(G\) if and only if every path from \(X\) to \(Y\) is blocked by \(\mathbf{Z}\). Two nodes are d-connected if they are not d-separated.
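A d-separation query can be answered without enumerating paths. The sketch below uses the classical equivalent criterion based on the moralized ancestral graph rather than the path-blocking definition itself: keep only the queried nodes and their ancestors, connect every node to its parents and every pair of co-parents, and then check whether \(\mathbf{Z}\) cuts all undirected routes between \(X\) and \(Y\). The function name and the dictionary-based graph representation are assumptions for this example, and \(X\) and \(Y\) are assumed not to belong to \(\mathbf{Z}\).

```python
def d_separated(parents, x, y, z):
    """Return True if x and y are d-separated by the node set z.

    `parents` maps every node of the DAG to an iterable of its parents.
    Assumes x and y are distinct and neither belongs to z.
    """
    z = set(z)
    # 1. Ancestral set: x, y, z and all of their ancestors.
    relevant, stack = set(), [x, y, *z]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])
    # 2. Moralize: link each node to its parents and marry co-parents.
    adj = {n: set() for n in relevant}
    for child in relevant:
        pa = list(parents[child])
        for i, p in enumerate(pa):
            adj[child].add(p)
            adj[p].add(child)
            for q in pa[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # 3. x and y are d-separated iff z blocks every undirected route.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for nb in adj[node]:
            if nb == y:
                return False
            if nb not in seen and nb not in z:
                seen.add(nb)
                stack.append(nb)
    return True
```

For the collider \(A \rightarrow C \leftarrow B\), this returns True for an empty \(\mathbf{Z}\) and False once \(C\) (or one of its descendants) is placed in \(\mathbf{Z}\), which is the behaviour the definitions above describe.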
Definition 4
A DAG \(\mathcal{G}\) is an I-map of a probability distribution \(p\) if \(\langle \mathbf{X},\mathbf{Y} \mid \mathbf{Z} \rangle _{\mathcal{G}} \Longrightarrow I_{p}(\mathbf{X},\mathbf{Y} \mid \mathbf{Z})\). It is minimal if no arc can be removed from \(\mathcal{G}\) without violating the I-map condition. \(\mathcal{G}\) is a D-map of \(p\) if \(\langle \mathbf{X},\mathbf{Y} \mid \mathbf{Z} \rangle _{\mathcal{G}} \Longleftarrow I_{p}(\mathbf{X},\mathbf{Y} \mid \mathbf{Z})\).
When a DAG \(\mathcal{G}\) is both an I-map and a D-map of \(p\), it is said that \(\mathcal{G}\) and \(p\) are isomorphic models (i.e. \(\mathcal{G}\) is a perfect map of \(p\)), or we will say that \(p\) and \(\mathcal{G}\) are faithful to each other [30, 35].
Furthermore, a distribution \(p\) is faithful if there exists a graph, \(\mathcal{G}\), to which it is faithful. In a faithful BN, \(\langle \mathbf{X},\mathbf{Y} \mid \mathbf{Z} \rangle _{\mathcal{G}} \Leftrightarrow I_{p}(\mathbf{X},\mathbf{Y} \mid \mathbf{Z})\).
It is always possible to build a minimal I-map of any given probability distribution \(p\), but some distributions do not admit an isomorphic model [31].
In general, when learning Bayesian networks from data our goal is to obtain a DAG that is a minimal I-map of the probability distribution encoded by the dataset.
We will assume faithfulness in the rest of the paper. Under this assumption, \(d\)-separation in \(\mathcal{G}\) and conditional independence in \(p\) can be used interchangeably.
2.2 Related work
The idea of learning BNs in large domains by constraining the search space is not new and has been exploited previously. For example, in [5] the possible parents of a variable are restricted to the \(k\) variables most correlated with it, and then the K2 algorithm [10] is used to learn the BN structure. Without imposing any restriction on the order of the nodes, we can use the Max–Min Hill Climbing (MMHC) algorithm [36], which is also a two-step algorithm: in its first stage it tries to identify the parents and children of each variable, and in the second one it uses a local search algorithm to look for the network, but with the search restricted to the set of previously found adjacencies (parents and children). In [29], the first step is carried out as in MMHC, but in the second phase substructures are learned using the information gathered in the first phase. In [16], an iterated hill climbing algorithm is proposed that at each (outer) iteration restricts the number of candidate parents for each variable to the \(k\) most promising ones, \(k\) having the same value for every variable.
There also exist global search methods in which the search space is restricted. For example, in [37] an undirected graph or skeleton is first constructed by using zero- and first-order dependence tests, and then a genetic algorithm is employed which is restricted to searching for DAGs belonging to this skeleton. The approach taken in [39] is slightly different, but it also uses a first stage based on low-order conditional independence tests.
As we can observe, the common feature in all the aforementioned algorithms is that all of them use two clearly separated stages: (1) search space restriction; and (2) running of a search algorithm over the restricted search space. In [17], the authors propose a different way of learning BNs, namely by carrying out these two stages simultaneously. Thus, a hill climbing algorithm is launched directly without previously restricting the search space, and then it takes advantage of the computations carried out at each search step to guess which edges should not be considered from then on. In this way the search space is pruned progressively as the search advances. However, in order to maintain the nice theoretical property that under the faithfulness condition the CHC algorithm always returns a minimal I-map, at least one extra iteration should be executed. These last iterations are usually short in local steps but are time-consuming in global terms due to the extra load of computing new statistics from data. This means that the time saving with respect to previous two-stage proposals diminishes, even though the first iteration is really very fast.
Our proposal in this paper is to avoid these extra iterations in order to improve the CHC algorithm, significantly reducing its running time and so making it applicable to databases with a larger number of variables. To do this, we propose to check the restrictions applied previously for each variable's parent set in order to guarantee the minimal I-map condition in only one iteration.
3 Learning BNs by local search methods
Local search (specifically hill climbing) methods traverse the search space by starting from an initial solution and performing a finite number of local steps. At each local step, the algorithm only considers minimum or local changes, i.e. neighbor DAGs, and chooses the one resulting in the greatest improvement in \(f\). The algorithm stops when there is no local change yielding an improvement in \(f\). Because of this greedy behavior, the execution stops when the algorithm is trapped at a solution that most times maximizes \(f\) locally rather than globally. Different strategies are used to try to escape from local optima: restarts, randomness, etc.
In BN learning, the usual choices for local changes in the space of DAGs are arc addition, arc deletion and arc reversal. Of course, except in arc deletion we have to take care not to introduce directed cycles in the graph. Thus, there are \(O(n^2)\) possible changes, \(n\) being the number of variables. With regard to the starting solution, the empty network is usually considered, although random starting points or perturbed local optima are also used, especially in the case of an iterated local search.
 1. Addition of \(X_{j} \rightarrow X_{i}\): \(f_{D}(X_{i},\mathbf{Pa}(X_{i}) \cup \{X_{j}\}) - f_{D}(X_{i},\mathbf{Pa}(X_{i}))\).
 2. Deletion of \(X_{j} \rightarrow X_{i}\): \(f_{D}(X_{i},\mathbf{Pa}(X_{i}) \setminus \{X_{j}\}) - f_{D}(X_{i},\mathbf{Pa}(X_{i}))\).
 3. Reversal of \(X_{j} \rightarrow X_{i}\): It is obtained as the sequence deletion(\(X_{j} \rightarrow X_{i}\)) plus addition(\(X_{i} \rightarrow X_{j}\)), so we compute \([f_{D}(X_{i},\mathbf{Pa}(X_{i}) \setminus \{X_{j}\}) - f_{D}(X_{i},\mathbf{Pa}(X_{i}))] + [f_{D}(X_{j},\mathbf{Pa}(X_{j}) \cup \{X_{i}\}) - f_{D}(X_{j},\mathbf{Pa}(X_{j}))]\).
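The three differences listed above translate almost literally into code. In this minimal sketch, `score(x, parent_set)` stands for the family score \(f_{D}\) and `pa` maps each variable to a frozenset with its current parents; both names, and the function names, are assumptions for illustration.

```python
def delta_add(score, pa, child, parent):
    """Score difference for adding the arc parent -> child."""
    return score(child, pa[child] | {parent}) - score(child, pa[child])

def delta_delete(score, pa, child, parent):
    """Score difference for deleting the arc parent -> child."""
    return score(child, pa[child] - {parent}) - score(child, pa[child])

def delta_reverse(score, pa, child, parent):
    """Score difference for reversing parent -> child into child -> parent.

    Computed as a deletion in the family of `child` plus an addition
    in the family of `parent`, exactly as in item 3 above.
    """
    return (delta_delete(score, pa, child, parent)
            + delta_add(score, pa, parent, child))
```

Each operation touches at most two families, so only those terms of the score have to be recomputed after a local change.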
Algorithm 1 outlines the hill climbing algorithm for structural learning of Bayesian networks. Although any DAG (\(\mathcal{G }_0\)) can be used to initialize the search, usually the empty graph (i.e. a graph with no arcs) is used. In the algorithm, we assume that each time a family is scored for the first time, the obtained value \(f_D(\cdot )\) is added to a cache. Subsequent computations can be avoided by just checking that cache.
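Since the listing of Algorithm 1 is not reproduced here, the following is only a minimal sketch of a greedy hill climber of the kind described, under stated assumptions: it starts from the empty graph, `family_score(x, parent_set)` is a hypothetical decomposable score supplied by the caller, the best strictly improving neighbour is taken at each step, and scores are memoized in a dictionary that plays the role of the cache.

```python
from itertools import permutations

def creates_cycle(pa, child, parent):
    """True if adding parent -> child would close a directed cycle,
    i.e. if `child` is already an ancestor of `parent`."""
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(pa[node])
    return False

def hill_climb(variables, family_score):
    pa = {v: frozenset() for v in variables}  # empty initial graph
    cache = {}                                # (variable, parent set) -> score

    def f(x, parent_set):
        key = (x, parent_set)
        if key not in cache:                  # score each family only once
            cache[key] = family_score(x, parent_set)
        return cache[key]

    while True:
        best_gain, best_pa = 0, None
        for xj, xi in permutations(variables, 2):  # candidate arc xj -> xi
            candidates = []
            if xj in pa[xi]:
                removed = {**pa, xi: pa[xi] - {xj}}
                candidates.append(removed)                      # deletion
                if not creates_cycle(removed, xj, xi):
                    candidates.append({**removed, xj: pa[xj] | {xi}})  # reversal
            elif not creates_cycle(pa, xi, xj):
                candidates.append({**pa, xi: pa[xi] | {xj}})    # addition
            for new_pa in candidates:
                # Only the families of xi and xj can change.
                gain = sum(f(v, new_pa[v]) - f(v, pa[v]) for v in (xi, xj))
                if gain > best_gain:
                    best_gain, best_pa = gain, new_pa
        if best_pa is None:                   # no improving neighbour: stop
            return pa
        pa = best_pa
```

The function returns the parent sets of a local optimum. Tie-breaking and the order in which neighbours are visited are arbitrary choices of this sketch.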
3.1 Theoretical considerations
In this section, we review some (desirable) properties of scoring metrics, most of which are taken from [9]. These concepts will constitute the theoretical basis of our proposal.
Definition 5
A scoring metric \(f\) is score equivalent if for any pair of equivalent DAGs, \(\mathcal{G }\) and \(\mathcal{G }^{\prime }, f(\mathcal{G }:D) = f(\mathcal{G }^{\prime }:D)\).
Two DAGs are equivalent if they lead to the same essential graph, that is, if they share the same skeleton and the same v-structures [38].
Definition 6
 1. If \(\mathcal{H}\) contains \(p\) and \(\mathcal{G}\) does not contain \(p\), then \(f(\mathcal{H}:D) > f(\mathcal{G}:D)\).
 2. If \(\mathcal{H}\) and \(\mathcal{G}\) contain \(p\), but \(\mathcal{G}\) is simpler than \(\mathcal{H}\) (has fewer parameters), then \(f(\mathcal{G}:D) > f(\mathcal{H}:D)\).
A probability distribution \(p\) is contained in a DAG \(\mathcal{G }\) if there exists a set of parameter values \(\varvec{\Theta }\) such that the Bayesian network defined by \((\mathcal{G },\varvec{\Theta })\) represents \(p\) exactly. Of course, if two graphs are correct, then the sparser one should receive more merit.
Definition 7
 1. If \(\lnot I_p(X_{i},X_{j} \mid \mathbf{Pa}_{\mathcal{G}}(X_{j}))\), then \(f(\mathcal{G}:D) < f(\mathcal{G}^{\prime }:D)\).
 2. If \(I_p(X_{i},X_{j} \mid \mathbf{Pa}_{\mathcal{G}}(X_{j}))\), then \(f(\mathcal{G}:D) > f(\mathcal{G}^{\prime }:D)\).
This is the main result for the proposal in [17], and the extension/improvement we propose in this paper, because from the concept of local consistency we can (asymptotically) assume that the differences computed by a locally consistent scoring metric \(f\) can be used as conditional independence tests over the dataset \(D\). To do this, we have to suppose that \(D\) constitutes a sample which is isomorphic^{3} to a graph.
Proposition 2
[9] The Bayesian scoring criterion is locally consistent.
Particular Bayesian scores for which Propositions 1 and 2 hold are BDe (Bayesian Dirichlet score with the assumption of likelihood equivalence) [25], and BIC (Bayesian Information Criterion) [34].
3.2 Constrained hill climbing methods
The hill climbing (HC) algorithm with \(\{\)arc-addition, arc-deletion, arc-reversal\(\}\) operations is without any doubt the most frequently used algorithm because of its ease of implementation, efficiency and the quality of the output it offers. That quality is supported by the fact that this algorithm asymptotically guarantees a minimal I-map [17, Proposition 3].
In [17], a constrained hill climbing (CHC) algorithm is proposed. The scheme of this proposal is shown in Algorithm 2 and, as can be observed, it is almost identical to HC (see Algorithm 1), the only difference being the inclusion of the forbidden parent sets, \(FP(.)\). For each variable, an \(FP()\) set is considered and updated in each step of the greedy search. Thus, the nodes included in \(FP(X)\) are not taken into account as possible parents for \(X\) during the rest of the search. The CHC algorithm, like HC, assures monotonicity, i.e., an improvement is guaranteed in each step, and it stops when there is no neighbour of \(\mathcal{G}\) which improves \(f(\mathcal{G})\); therefore, termination is also guaranteed. However, unlike HC, CHC cannot ensure that \(\hat{\mathcal{G}} = CHC(\mathcal{G}_0)\) is, asymptotically, an I-map of \(p\). To solve this problem, the CHC\(^{*}\) algorithm is proposed in [17, 19], where the output of CHC is used as the input for unconstrained hill climbing. This latter algorithm shares the property of HC that the resulting DAG is a minimal I-map.
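The role of the forbidden parent sets can be sketched for the arc-addition operator alone. This is a simplified illustration, not the listing of Algorithm 2: it assumes that a negative score difference for adding \(X_j \rightarrow X_i\) is read as an independence and that both variables are then forbidden as parents of each other. Cycle checks, deletions and reversals are deliberately omitted, and the names `scan_additions`, `pa`, `fp` and `f` are hypothetical.

```python
def scan_additions(variables, pa, fp, f):
    """Score every allowed arc addition once and update the FP sets in place.

    pa[x] : set with the current parents of x
    fp[x] : set with the forbidden parents of x
    f     : family score, f(x, parent_set)
    Returns {(parent, child): score difference} for non-forbidden additions.
    """
    gains = {}
    for xj in variables:            # candidate parent
        for xi in variables:        # child
            if xi == xj or xj in pa[xi] or xj in fp[xi]:
                continue            # forbidden pairs are never scored again
            diff = f(xi, pa[xi] | {xj}) - f(xi, pa[xi])
            if diff < 0:
                # Negative difference: prune the pair in both directions.
                fp[xi].add(xj)
                fp[xj].add(xi)
            else:
                gains[(xj, xi)] = diff
    return gains
```

Because a pair is pruned in both directions as soon as one direction scores negatively, the symmetric addition is skipped later in the same scan, which is where the savings in score computations come from.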
Two other successful variants of CHC\(^{*}\), called iCHC and 2iCHC, respectively, are also introduced in [17]. The first one consists of the iteration of CHC by taking as input the output obtained in the previous iteration. Because \(FP\) sets are reset at the beginning of each iteration, the algorithm can escape from the current point as an unconstrained HC would do. The algorithm stops when no modification is carried out over the DAG returned by the previous iteration. When this happens, and because \(FP\) sets are cleared at the beginning, we can be sure that CHC will also stop without making any modification, and so we get a minimal I-map. Algorithm 2iCHC just carries out two iterations, which is proven to be enough to ensure that a minimal I-map is obtained [17, Proposition 4]. Obviously, this second approach is faster than the first one, though sometimes it obtains networks whose score is slightly worse.
4 The one iteration CHC Algorithm: FastCHC
The necessity of iterating the CHC algorithm is due to the possibility that the arcs belonging to a v-structure in the true graph are discovered in the wrong direction during the learning process. If this is the case, an I-map can be recovered in two different ways. First, those arcs can be reversed, but in practice this is difficult to do, and there even exist situations in which the reversal operation for those arcs is impossible in the context of a local search over the current graph. The second solution is to add an arc between the tail nodes of the v-structure; however, this action is not easy to do in the CHC approach. The reason is that both tail nodes (\(T1\) and \(T2\)) are marginally independent given a subset of nodes \(\mathbf{S}\) in the original (true) distribution, and therefore, when the algorithm checks the link between \(T1\) and \(T2\) with the subset \(\mathbf{S}\) as the parent set of \(T1\) or \(T2\), \(T1\) will be included in \(FP(T2)\) and/or vice versa during that stage of the search. This is the reason for the success of the iterated versions of CHC: when starting a new iteration, the \(FP()\) sets are empty and so that type of arc can be reversed.
In this paper our challenge, and main contribution, is to be able to reverse those arcs but in a single-iteration CHC algorithm. The first consequence would be a neater algorithm and the second, hopefully, a more efficient one than the iterated versions, given that only one iteration is performed. Of course, as the nodes in the wrongly discovered v-structure would be totally connected, the output of this new algorithm will still be a minimal I-map.
Example 1
 (1) Add \(A \rightarrow B\): \(diff \gg 0\) because \(A\rightarrow B \in \mathcal{G}_t\), so \(\lnot I(A,B \mid \emptyset )_{\mathcal{G}_t}\).
 (2) Add \(A \rightarrow C\): \(diff < 0\) because \(I(A,C \mid \emptyset )_{\mathcal{G}_t}\): \(FP(C) = \{A\}, FP(A) = \{C\}\).
 (3) Add \(A \rightarrow D\): \(diff < 0\) because \(I(A,D \mid \emptyset )_{\mathcal{G}_t}\): \(FP(D) = \{A\}, FP(A) = \{C,D\}\).
 (4) Add \(B \rightarrow A\): \(diff \gg 0\) because \(A\rightarrow B \in \mathcal{G}_t\), so \(\lnot I(A,B \mid \emptyset )_{\mathcal{G}_t}\).
 (5) Add \(B \rightarrow C\): \(diff \gg 0\) because \(C\rightarrow B \in \mathcal{G}_t\), so \(\lnot I(C,B \mid \emptyset )_{\mathcal{G}_t}\).
 (6) Add \(B \rightarrow D\): \(diff > 0\) because \(\lnot I(B,D \mid \emptyset )_{\mathcal{G}_t}\).
 (–) There is no need to test add \(C \rightarrow A\) because \(C \in FP(A)\). However, if \(C\) had not been added to \(FP(A)\) in step (2), then this step could not be skipped.
 (7) Add \(C \rightarrow B\): \(diff \gg 0\) because \(C\rightarrow B \in \mathcal{G}_t\), so \(\lnot I(C,B \mid \emptyset )_{\mathcal{G}_t}\).
 (8) Add \(C \rightarrow D\): \(diff \gg 0\) because \(D\rightarrow C \in \mathcal{G}_t\), so \(\lnot I(D,C \mid \emptyset )_{\mathcal{G}_t}\).
 (–) There is no need to test add \(D \rightarrow A\) because \(D \in FP(A)\). However, if \(D\) had not been added to \(FP(A)\) in step (3), then this step could not be skipped.
 (9) Add \(D \rightarrow B\): \(diff > 0\) because \(\lnot I(B,D \mid \emptyset )_{\mathcal{G}_t}\).
 (10) Add \(D \rightarrow C\): \(diff \gg 0\) because \(D\rightarrow C \in \mathcal{G}_t\), so \(\lnot I(D,C \mid \emptyset )_{\mathcal{G}_t}\).
 (11) Add \(D \rightarrow C\): \(diff > 0\) because \(\lnot I(C,D \mid \{B\})_{\mathcal{G}_t}\).
 (12) Add \(B \rightarrow D\): \(diff < 0\) because \(I(B,D \mid \{C\})_{\mathcal{G}_t}\): \(FP(B) = \{D\}, FP(D) = \{A,B\}\).
 (13) Add \(C \rightarrow A\): \(diff > 0\) because \(\lnot I(A,C \mid \{B\})_{\mathcal{G}_t}\).
The following proposition proves the type of output provided by the FastCHC algorithm.
Proposition 3
Let \(D\) be a dataset containing \(d\) independent and identically distributed samples from some distribution \(P\). Let \(\hat{G}\) be the directed acyclic graph obtained by running the FastCHC algorithm taking \(G_0\) as the initial solution, i.e., \(\hat{G} = FastCHC(G_0)\). If the score function \(f\) used to evaluate the candidate graphs in FastCHC is consistent and locally consistent, then under the assumption that \(P\) is faithful, \(\hat{G}\) is a minimal I-map of \(P\) in the limit as \(d\) grows large.
Proof in Appendix A.
5 Experimental evaluation
In this section we describe the set of experiments aimed at testing the performance of the algorithm presented in this paper. Given the nature of the proposed algorithm, we were not only interested in the quality of the solution obtained (accuracy) but also in the resources it needs (efficiency). Below we provide details about the implementation of the algorithms, the datasets used in this comparison, the performance indicators we have chosen to argue about the goodness of each algorithm, and to end the section we give the results and their analysis.
5.1 Algorithms
The algorithms examined in this section, apart from FastCHC, are the standard hill climbing (HC), the constrained hill climbing followed by unconstrained hill climbing (CHC*), iterated constrained hill climbing (iCHC), and two-iteration constrained hill climbing (2iCHC) algorithms. In all cases an empty network is used as the starting graph. We do not consider the constrained hill climbing algorithm (CHC) because it does not guarantee a minimal I-map of the original distribution. In addition, we consider the Max–Min hill climbing algorithm (MMHC) [36] as the best scalable state-of-the-art structural learning algorithm for dealing with a high number of variables (though it does not guarantee that an I-map is obtained).
An implementation of MMHC is available in Matlab from the authors' web page. However, access to the source code is not free. As the functionality provided by that implementation is limited, and furthermore the second stage of the algorithm is based on a tabu search rather than on canonical hill climbing, a custom implementation is used here. This implementation is based on the implementation of the MMPC algorithm available from [32], written in C, with the efficiency improvements described in [36]. For the second stage of the algorithm, the implementation of hill climbing also used in this comparison is employed.
The score metric used here is the Bayesian Dirichlet equivalent uniform (BDeu) [25] with the same parameters as in [36], i.e. equivalent sample size equal to 10.
The implementation was coded in Java and interacts with the ProGraMo library for dataset and graph structures management [20].
In order to speed up all the algorithms we use an internal cache where we save the result of every score computation in order to reuse it later in the execution. This cache is implemented with a hash table^{4} by using the names of the probability family as key and storing the score metric (real value) for that family as value. Therefore, updating and querying operations over the cache are linear in time. To give the reader an idea of the benefits obtained by using this structure, the ratio \(\frac{\# \text{ of} \text{ statistics} \text{ required}}{\# \text{ of} \text{ actual} \text{ calls} \text{ to}\ f()}\) is 20, 25 and 75 for hill climbing algorithm when dealing with datasets/networks having 25, 30 and 100 variables (data from [18]).
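A minimal sketch of such a score cache is given below. The paper's implementation is in Java, so this Python version is only an illustration under assumptions: the class name, the counters and the choice of key (the child's name plus its sorted parent names) are hypothetical.

```python
class ScoreCache:
    """Memoizes family scores in a hash table keyed by the family's names."""

    def __init__(self, family_score):
        self._family_score = family_score  # the (expensive) scoring function
        self._table = {}
        self.requests = 0   # scores requested by the search
        self.computed = 0   # scores actually computed from data

    def __call__(self, child, parent_set):
        self.requests += 1
        # Sorting makes the key independent of the order of the parents.
        key = (child, tuple(sorted(parent_set)))
        if key not in self._table:
            self.computed += 1
            self._table[key] = self._family_score(child, parent_set)
        return self._table[key]
```

With these counters, `requests / computed` gives a ratio of the same kind as the one reported above, that is, how many score requests are served per actual call to \(f()\).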
All the runs of these algorithms were carried out on a dedicated server with Pentium Xeon 3.0 GHz, 64 bit architecture, 32 GB RAM memory and under Linux. The Java Runtime Environment used is the one provided by SUN Microsystems, version 1.5.
5.2 Networks
Table 1 Bayesian networks used in our experimental evaluation
| Network | Description | Nodes | Edges | States per node | Min–max states | Parents per node | Max parents | PC per node | Max PC |
|---|---|---|---|---|---|---|---|---|---|
| Alarm | Monitoring of emergency care patients | 37 | 46 | 2.84 | 2–4 | 1.24 | 4 | 2.49 | 6 |
| Barley | Model of barley crop yields | 48 | 84 | 8.77 | 2–67 | 1.75 | 4 | 3.50 | 8 |
| Diabetes | A model for insulin dose adjustment (DBN) | 413 | 602 | 11.34 | 3–21 | 1.46 | 2 | 2.92 | 24 |
| Hailfinder | Predicting hail in northern Colorado | 56 | 66 | 3.98 | 2–11 | 1.18 | 4 | 2.36 | 17 |
| Insurance | Evaluating insurance applications | 27 | 52 | 3.30 | 2–5 | 1.93 | 3 | 3.85 | 9 |
| Link | Pedigree for linkage analysis | 724 | 1,125 | 2.53 | 2–4 | 1.55 | 3 | 3.11 | 17 |
| Mildew | A model for deciding on the amount of fungicides to be used against attack of mildew in wheat | 35 | 46 | 17.6 | 3–100 | 1.31 | 3 | 2.63 | 5 |
| Munin1 | An expert electromyography assistant | 189 | 282 | 5.26 | 1–21 | 1.49 | 3 | 2.98 | 15 |
| Munin2 | | 1,003 | 1,244 | 5.36 | 2–21 | 1.24 | 3 | 2.48 | 30 |
| Munin3 | | 1,044 | 1,315 | 5.37 | 1–21 | 1.26 | 3 | 2.52 | 69 |
| Munin4 | | 1,041 | 1,397 | 5.43 | 1–21 | 1.34 | 3 | 2.68 | 69 |
| Pf1 | Analysis of lymph cell pathologies | 109 | 195 | 4.11 | 2–63 | 1.79 | 5 | 3.58 | 106 |
| Pf23 | | 135 | 200 | 3.85 | 2–76 | 1.48 | 4 | 2.96 | 130 |
| Pigs | Pedigree of breeding pigs | 441 | 592 | 3.00 | 3–3 | 1.34 | 2 | 2.68 | 41 |
| Water | A model of the biological processes of a water purification plant | 32 | 66 | 3.63 | 3–4 | 2.06 | 5 | 4.13 | 8 |
| Win95pts | A model for printer troubleshooting in Microsoft Windows 95 | 76 | 112 | 2.00 | 2–2 | 1.47 | 7 | 2.95 | 10 |
For all these networks, we obtained different datasets by sampling with 500, 1,000 and 5,000 instances. Each dataset was given the same name as its corresponding network. In fact, six datasets are sampled for each network and size: one is used for learning and the other five for validation (the average is reported).
5.3 Performance indicators
Two kinds of factors are used as performance indicators, one being the quality of the network obtained by the algorithm, and the other the complexity of each algorithm.
 Log-likelihood: Given a dataset \(D=\{\mathbf{v}^1,\ldots , \mathbf{v}^m\}\) of instances and a network \(\mathcal{G}\) defined over a set of variables \(V=\{X_1,\ldots ,X_n\}\), the log-likelihood of \(\mathcal{G}\) with respect to \(D\) is computed as:
$$\begin{aligned} LL(\mathcal{G}:D) = \sum _{i=1}^{m} \log ( P_{\mathcal{G}}(\mathbf{v}^{i} \mid D)). \end{aligned}$$
 Structural Hamming Distance (SHD): Given two models \(\mathcal{G}_1\) and \(\mathcal{G}_2\) (the golden one and the learned one in our case), the SHD is computed over their essential graphs \(eg(\mathcal{G}_1)\) and \(eg(\mathcal{G}_2)\) in order not to penalize structural differences that cannot be statistically distinguished. The algorithm described in [6] is used to transform a given DAG into an essential graph or partially oriented graph (PDAG). Then,
$$\begin{aligned} SHD(eg(\mathcal{G}_1):eg(\mathcal{G}_2)) = |eg(\mathcal{G}_2) \setminus eg(\mathcal{G}_1)| + |eg(\mathcal{G}_1) \setminus eg(\mathcal{G}_2)| + diff(\mathcal{G}_1,\mathcal{G}_2) \end{aligned}$$
that is, SHD is computed as the number of edges missing in the first graph with respect to the second, plus the number of extra edges in the first graph with respect to the second, plus the number of edges present in both graphs but with different directions. This last set of edges includes the reversed edges and also the edges that are undirected in one of them and directed in the other. See [36] for details.
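Both quality indicators can be sketched in a few lines. The code below makes several assumptions that are not taken from the paper: the network parameters are already estimated and stored as nested dictionaries, no probability is zero, and each PDAG is already an essential graph encoded as a dictionary from an unordered node pair to either an ordered `(tail, head)` tuple or `None` for an undirected edge. All function and variable names are hypothetical.

```python
from math import log

def log_likelihood(parents, cpt, data):
    """Sum over the instances of log P_G(v), using the factorization of the BN.

    parents[x] : tuple with the parents of x
    cpt[x][pa_values][value] : P(x = value | parents = pa_values), assumed > 0
    data : list of dicts mapping every variable to its observed value
    """
    total = 0.0
    for row in data:
        for var, pa in parents.items():
            pa_values = tuple(row[p] for p in pa)
            total += log(cpt[var][pa_values][row[var]])
    return total

def shd(pdag1, pdag2):
    """Structural Hamming distance between two essential graphs.

    Each PDAG maps frozenset({u, v}) to (tail, head) for a directed edge
    or to None for an undirected edge.
    """
    missing = sum(1 for e in pdag2 if e not in pdag1)  # in 2nd, not in 1st
    extra = sum(1 for e in pdag1 if e not in pdag2)    # in 1st, not in 2nd
    differ = sum(1 for e in pdag1
                 if e in pdag2 and pdag1[e] != pdag2[e])  # orientation differs
    return missing + extra + differ
```

A reversed edge and an edge that is directed in only one of the graphs both count once in the last term, matching the description of the third summand above.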
5.4 Results
In the following tables, we present a summary of the results we obtained after running the 6 algorithms over the 16 domains (see Table 1). The summary consists of the average rank of each algorithm in each dataset, from 1 to 6, 1 being the best result. SHD is measured with respect to the golden model (the one used to sample the training and test sets) as in [36], that is, by converting the output model of each algorithm into a PDAG and comparing the essential graphs. The complete list of results can be seen in Appendix B. The MMHC algorithm could not handle many of the big datasets, especially those with 5,000 instances. The reason is that the algorithm needs more memory than that available in the computer, or the process does not finish before the set limit of 1 week.^{5} In these cases, we consider that the algorithm gives a worse result than the others in order to penalize the fact that it cannot produce a result under the given conditions.
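The average rank used as a summary can be computed as in the sketch below. Two points are assumptions of this example rather than details stated in the text: tied algorithms share the mean of the positions they occupy (the usual convention for the Friedman test), and every algorithm has a value for every dataset. The function name and the layout of `results` are hypothetical.

```python
def average_ranks(results, higher_is_better=True):
    """results: {dataset: {algorithm: value}} -> {algorithm: mean rank}.

    Rank 1 is the best result in each dataset; ties share the mean rank.
    """
    totals = {}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of algorithms tied with position i.
            while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
                j += 1
            rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
            for alg in ordered[i:j + 1]:
                totals[alg] = totals.get(alg, 0.0) + rank
            i = j + 1
    return {alg: total / len(results) for alg, total in totals.items()}
```

For indicators where lower values are better, such as SHD or the number of computed statistics, `higher_is_better=False` would be passed.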
For each one of the indicators, we performed a Friedman test [15] in order to discover whether we can assume that all algorithms deliver the same performance and, in the case that we cannot assume so, we use Holm's post hoc test [26] to see which algorithms can be taken as not statistically worse than the best one. In all the tests, the significance level is set to 0.05.
Average rank of the algorithms based on likelihood over all the validation datasets
| Algorithm | 500 | 1,000 | 5,000 |
|---|---|---|---|
| CHC* | 1.94\(^\mathrm{a}\) | 1.84\(^\mathrm{a}\) | 2.16\(^\mathrm{a}\) |
| FastCHC | 4.63 | 3.78 | 3.60 |
| HC | 1.13 | 1.31 | 1.72 |
| iCHC | 3.84 | 4.16 | 3.94 |
| MMHC | 5.38 | 5.50 | 5.47 |
| 2iCHC | 4.09 | 4.41 | 4.13 |
Average rank of the algorithms based on SHD over all the datasets
| Algorithm | 500 | 1,000 | 5,000 |
|---|---|---|---|
| CHC* | 3.19\(^\mathrm{a}\) | 2.91\(^\mathrm{a}\) | 2.78\(^\mathrm{a}\) |
| FastCHC | 3.31\(^\mathrm{a}\) | 3.31\(^\mathrm{a}\) | 3.09\(^\mathrm{a}\) |
| HC | 2.38 | 2.50 | 2.09 |
| iCHC | 4.53 | 4.50 | 4.28 |
| MMHC | 3.44\(^\mathrm{a}\) | 3.41\(^\mathrm{a}\) | 4.66 |
| 2iCHC | 4.16 | 4.38 | 4.09 |
Looking at these two result tables together, we can conclude that the HC algorithm probably obtains such good likelihood results because it overfits the network model to the data, adding many extra arcs that in turn degrade its performance in terms of SHD.
Average rank of the algorithms based on the statistics computed over all the datasets
Algorithm  500 instances        1,000 instances      5,000 instances
CHC*       4.63                 4.44                 4.19
FastCHC    1.19                 1.19                 1.00
HC         5.75                 5.69                 5.31
iCHC       3.16                 3.09                 2.78
MMHC       3.94                 4.19                 5.50
2iCHC      2.34\(^\mathrm{a}\)  2.41\(^\mathrm{a}\)  2.22\(^\mathrm{a}\)
In summary, the above tables seem to indicate that the FastCHC algorithm is the best in terms of scalability and efficiency, while HC is the best algorithm in terms of the quality of the model.
Average rank of the algorithms based on the combined score of likelihood and statistics computed over all the datasets
Algorithm  500 instances        1,000 instances      5,000 instances
CHC*       4.44                 4.31                 4.19
FastCHC    1.19                 1.06                 1.00
HC         5.69                 5.63                 5.25
iCHC       3.16                 3.09                 2.78
MMHC       4.19                 4.46                 5.56
2iCHC      2.34\(^\mathrm{a}\)  2.34\(^\mathrm{a}\)  2.22\(^\mathrm{a}\)
6 Conclusions
In this paper, we have proposed an improved version of the algorithms in the CHC family, which in turn are improved versions of the standard hill climbing approach for learning Bayesian networks. The new algorithm avoids the need to iterate the search by updating the constrained search space after applying the local operation selected at each step. This behaviour prevents marginally independent variables from being treated as conditionally independent when, in the actual (underlying) probability distribution, they are not.
Obviously, as the FastCHC algorithm requires only one iteration, it is much more efficient than previous CHC algorithms. This is clear in our experimental evaluation, in which the superiority of FastCHC over MMHC is also made evident.
Therefore, FastCHC constitutes a good candidate for learning BNs in high-dimensional problems, in which scalable algorithms are required. Furthermore, because of the speed-up achieved with respect to competing algorithms, we think that there is room to improve the quality of the discovered network by slightly sacrificing the speed of the search. That is, for the future we plan to search more in order to search better. The idea is to use FastCHC as the basic building block for local search-based metaheuristics such as tabu search or simulated annealing, which have proven able to escape from local optima.
Footnotes
1. We use standard notation, i.e., bold font to denote sets and \(n\)-dimensional configurations, calligraphic font to denote mathematical structures, upper case for variables or sets of random variables, and lower case to denote states of variables or configurations of states (vectors).
2. When working with different graphs, we will use \(\mathbf{Pa}_{\mathcal{G}}(X_{i})\) to clarify the notation.
3. In fact, [9] proves that the isomorphic condition can be relaxed.
4. Concretely, the Java HashMap class is used (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/HashMap.html).
5. Notice that we do not set a maximum number of parents during the search.
Notes
Acknowledgments
This work has been partially funded by FEDER funds and the Spanish Government (MICINN/MINECO) through project TIN2010-20900-C04-03.
References
1. Acid, S., de Campos, L.M.: A hybrid methodology for learning belief networks: Benedict. Int. J. Approx. Reason. 27(3), 235–262 (2001)
2. Acid, S., de Campos, L.M.: Searching for Bayesian network structures in the space of restricted acyclic partially directed graphs. J. Artif. Intell. Res. 18, 445–490 (2003)
3. Buntine, W.L.: Theory refinement on Bayesian networks. In: Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence, pp. 52–60 (1991)
4. Buntine, W.L.: A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowl. Data Eng. 8(2), 195–210 (1996)
5. Cano, R., Sordo, C., Gutiérrez, J.M.: Applications of Bayesian networks in meteorology. In: Gámez, J.A., Moral, S., Salmerón, A. (eds.) Advances in Bayesian Networks, pp. 309–327. Springer, Berlin (2004)
6. Chickering, D.M.: A transformational characterization of equivalent Bayesian network structures. In: UAI '95: Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, pp. 87–98. Morgan Kaufmann, San Francisco (1995)
7. Chickering, D.M.: Learning Bayesian networks is NP-Complete. In: Fisher, D., Lenz, H. (eds.) Learning from Data: Artificial Intelligence and Statistics, vol. V, pp. 121–130. Springer, Berlin (1996)
8. Chickering, D.M., Geiger, D., Heckerman, D.: Learning Bayesian networks: search methods and experimental results. In: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, pp. 112–128 (1995)
9. Chickering, D.M.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2002)
10. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9, 309–347 (1992)
11. de Campos, L.M.: A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 7, 2149–2187 (2006)
12. de Campos, L.M., Fernández-Luna, J.M., Gámez, J.A., Puerta, J.M.: Ant colony optimization for learning Bayesian networks. Int. J. Approx. Reason. 31(3), 291–311 (2002)
13. de Campos, L.M., Fernández-Luna, J.M., Puerta, J.M.: Local search methods for learning Bayesian networks using a modified neighborhood in the space of DAGs. In: Proceedings of IBERAMIA 2002. LNCS, vol. 2527, pp. 182–192 (2002)
14. de Campos, L.M., Puerta, J.M.: Stochastic local algorithms for learning belief networks: searching in the space of the orderings. In: 6th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU'01), pp. 228–239 (2001)
15. Friedman, M.: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 32(200), 675–701 (1937)
16. Friedman, N., Nachman, I., Pe'er, D.: Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99), pp. 206–215 (1999)
17. Gámez, J.A., Mateo, J.L., Puerta, J.M.: Learning Bayesian networks by hill climbing: efficient methods based on progressive restriction of the neighborhood. Data Mining Knowl. Discov. 22(1–2), 106–148 (2011)
18. Gámez, J.A., Puerta, J.M.: Constrained score+(local)search methods for learning Bayesian networks. In: 8th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU'05), pp. 161–173 (2005)
19. Gámez, J.A., Puerta, J.M.: Constrained score+(local)search methods for learning Bayesian networks. In: 8th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU'05). LNCS, vol. 3571, pp. 161–173 (2005)
20. Gámez, J.A., Salmerón, A., Cano, A.: Design of new algorithms for probabilistic graphical models. Implementation in Elvira. Programo research project (TIN2007-67418-C03). In: Jornada de Seguimiento de Proyectos, 2010. Programa Nacional de Tecnologías Informáticas (2010)
21. García-Pedrajas, N., de Haro-García, A.: Scaling up data mining algorithms: review and taxonomy. Prog. Artif. Intell. 1, 71–87 (2012)
22. Geiger, D., Heckerman, D., King, H., Meek, C.: Stratified exponential families: graphical models and model selection. Ann. Stat. 29(2), 505–529 (2001)
23. Haughton, D.M.A.: On the choice of a model to fit data from an exponential family. Ann. Stat. 16(1), 342–355 (1988)
24. Heckerman, D.: Bayesian networks for data mining. Data Mining Knowl. Discov. 1, 79–119 (1997)
25. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)
26. Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979)
27. Jensen, F.V., Nielsen, T.D.: Bayesian Networks and Decision Graphs, 2nd edn. Springer, Berlin (2007)
28. Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R.H., Kuijpers, C.M.H.: Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters. IEEE Trans. Pattern Anal. Mach. Intell. 18(9), 912–926 (1996)
29. Nägele, A., Dejori, M., Stetter, M.: Bayesian substructure learning - approximate learning of very large network structures. In: Proceedings of the 18th European Conference on Machine Learning (ECML '07), pp. 238–249 (2007)
30. Neapolitan, R.: Learning Bayesian Networks. Prentice Hall, New Jersey (2003)
31. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco (1988)
32. Peña, J.M., Nilsson, R., Björkegren, J., Tegnér, J.: Towards scalable and data efficient learning of Markov boundaries. Int. J. Approx. Reason. 45(2), 211–232 (2006)
33. Robinson, R.W.: Counting unlabeled acyclic digraphs. In: Combinatorial Mathematics, vol. 622, pp. 28–43. Springer, Berlin (1977)
34. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
35. Spirtes, P., Glymour, C., Scheines, R.: Causation, prediction and search. In: Lecture Notes in Statistics, vol. 81. Springer, Berlin (1993)
36. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max–min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
37. van Dijk, S., van der Gaag, L.C., Thierens, D.: A skeleton-based approach to learning Bayesian networks from data. In: Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'03), pp. 132–143 (2003)
38. Verma, T., Pearl, J.: Equivalence and synthesis of causal models. In: Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI'90), pp. 255–270. Elsevier, Amsterdam (1991)
39. Wong, M.L., Leung, K.S.: An efficient data mining method for learning Bayesian networks using an evolutionary algorithm-based hybrid approach. IEEE Trans. Evol. Comput. 8(4), 378–404 (2004)