Empirical hardness of finding optimal Bayesian network structures: algorithm selection and runtime prediction
Abstract
Various algorithms have been proposed for finding a Bayesian network structure that is guaranteed to maximize a given scoring function. Implementations of state-of-the-art algorithms, solvers, for this Bayesian network structure learning problem rely on adaptive search strategies, such as branch-and-bound and integer linear programming techniques. Thus, the time requirements of the solvers are not well characterized by simple functions of the instance size. Furthermore, no single solver dominates the others in speed. Given a problem instance, it is thus a priori unclear which solver will perform best and how fast it will solve the instance. We show that for a given solver the hardness of a problem instance can be efficiently predicted based on a collection of nontrivial features which go beyond the basic parameters of instance size. Specifically, we train and test statistical models on empirical data, based on the largest evaluation of state-of-the-art exact solvers to date. We demonstrate that we can predict the runtimes to a reasonable degree of accuracy. These predictions enable effective selection of solvers that perform well in terms of runtimes on a particular instance. Thus, this work contributes a highly efficient portfolio solver that makes use of several individual solvers.
Keywords
Bayesian networks · Structure learning · Algorithm selection · Hyperparameter optimization · Empirical hardness · Algorithm portfolio · Runtime prediction

1 Introduction
Since the formalization and popularization of Bayesian networks (Pearl 1988) for modeling and reasoning with multiple variables, much research has been devoted to learning them from data (Heckerman et al. 1995). One of the main challenges has been to learn the model structure, represented by a directed acyclic graph (DAG) on the variables. Cast as a problem of finding a DAG that is a global optimum of a score function for given data, the Bayesian network structure learning problem (BNSL) is notoriously NP-hard; the hardness is chiefly due to the acyclicity constraint imposed on the DAG to be learned (Chickering 1996). To cope with the computational hardness, early work on structure learning resorted to local search algorithms. While local search algorithms oftentimes perform well, they are unfortunately unable to guarantee global optimality. The uncertainty about the quality of the found network hampers the use of the network (Malone et al. 2015) in probabilistic inference and causal discovery.
The last decade has raised hopes of solving larger problem instances to optimality. The first algorithms guaranteed to find an optimum adopted a dynamic programming approach to avoid exhaustive search in the space of DAGs (Ott et al. 2004; Koivisto and Sood 2004; Singh and Moore 2005; Silander and Myllymäki 2006). Later algorithms have expedited the dynamic programming approaches using the A\(^{*}\) search algorithm with various admissible heuristics (Yuan and Malone 2013), or have employed quite different approaches, such as branch and bound in the space of (cyclic) directed graphs (de Campos and Ji 2011), integer linear programming (ILP) (Jaakkola et al. 2010; Cussens 2011, 2013), and constraint programming (CP) (van Beek and Hoffmann 2015). In this work, we focus on such complete solvers for BNSL, which we call simply solvers. Our interest is in unsupervised learning of a joint structure over the variables, only noting in passing that alternative methods have been developed for supervised learning of the relationship between a designated response variable and the other predictor variables [see, e.g., a recent survey (Bielza and Larrañaga 2014) and references therein].
Figure 1 suggests that, to improve over the existing solvers, an alternative to developing yet another solver is to design algorithm portfolios which select a solver to run on a per-instance basis, ideally combining the best-case performance of the different solvers. Indeed, in this work we do not focus on developing or improving an individual algorithmic approach. Instead, we aim to characterize how the performance of different algorithmic approaches depends on the problem instance, which is the key to the design of efficient algorithm portfolios. The underlying motivation for developing such techniques is the aim of improving the efficiency of the state of the art in complete solvers in solving hard BNSL instances.
In this quest, it is vital to discover a collection of features that are efficient to compute and yet informative about the hardness of an instance for a solver. Prior work has identified two simple features, namely the number of variables and the number of so-called candidate parent sets, denoted by n and m, respectively. To explain the observed orthogonal performance characteristics shown in Fig. 1, it has been suggested, roughly, that typical instances can be solved to optimality by A\(^{*}\), if n is at most 40 (no matter how large m), and by ILP if m is moderate, say, at most some tens of thousands (no matter how large n) (Cussens 2013; Yuan and Malone 2013); for the more recent CP approach, we are not aware of any comparable description. Beyond this rough characterization, the practical time complexity of the best-performing solvers is currently poorly understood. This stems from the sophisticated search heuristics employed by the solvers, which tend to be sensitive to small variations in the instances, thus resulting in somewhat chaotic-looking behavior of runtimes. Furthermore, the gap between the analytic worst-case and best-case runtime bounds, in terms of n and m, is huge, and typical instances fall somewhere in between the two extremes.
Specifically, in this work we study the following two questions.

Q1: For determining the fastest of the available solvers on a given instance, do the simple features, the number of variables and the number of candidate parent sets, suffice?

Q2: For predicting the runtime of a solver on a given instance, can the accuracy be significantly improved by including additional efficiently computable features?
Besides the aforementioned contributions, the empirical work associated with this paper also provides the most elaborate evaluation of state-of-the-art solvers to date, significant in its own right.
The present work substantially extends and revises our preliminary study reported at the AAAI-14 conference (Malone et al. 2014). Here we have thoroughly revised the methodology and analysis presented throughout the paper. We have expanded the portfolio itself to include the very recent CP-based solver (van Beek and Hoffmann 2015). At the same time, we have updated the runtime results to the most recent versions of the A\(^{*}\)-based and ILP-based solvers. Furthermore, we provide a more fine-grained analysis by categorizing datasets based on their origin. Our results show that the origin of the dataset significantly affects the relative runtime performance of solvers. To this end, we have also increased the number of synthetic datasets considerably, from a few dozen to several hundred. Finally, we provide a more extensive discussion of the characteristics of the learned models, such as preprocessing strategies.
1.1 Related work
Due to the wide range of potential applications, the general research area of algorithm selection, with tight connections to machine learning and algorithm portfolio design, is very diverse. Instead of aiming at a full review of the relevant literature, here we aim at a brief overview of the research area by providing references to some of the key early works on the topic and some of the more recent works most closely related to ours. For an expanded discussion of the literature on algorithm selection and runtime prediction, we refer the reader to two recent surveys on the topic with further pointers to related work (Hutter et al. 2014; Kotthoff 2014).
Research on algorithm selection for various types of important computational problems has its roots in Rice (1976), where the algorithm selection problem was introduced, and feature-based modeling was proposed to facilitate the selection of the best-performing algorithm for a given problem instance, considering various example problems. Later works, including Carbonell et al. (1991), Fink (1998) and Lobjois and Lemaître (1998), demonstrated the efficacy of applying machine learning techniques, such as Bayesian approaches (Horvitz et al. 2001), to learn models from empirical performance data.
More recently, empirical hardness models (Leyton-Brown et al. 2002, 2009) have been applied in the construction of solver portfolios (Gomes and Selman 2001) for various NP-hard search problems (Kotthoff et al. 2012), including Boolean satisfiability (SAT) [e.g., in Xu et al. (2008)], constraint programming [e.g., in Gebruers et al. (2005) and Hurley et al. (2014)], quantified Boolean formula satisfiability [e.g., in Pulina and Tacchella (2008)], answer set programming [e.g., in Hoos et al. (2014)], as well as for the traveling salesperson problem [e.g., in Kotthoff et al. (2015)]. To the best of our knowledge, for the important problem of Bayesian network structure learning, the present work is the first to adopt the approach.
In terms of terminology, we investigate algorithm selection in the context of learning Bayesian networks, which is an unsupervised learning task. Nevertheless, this work is well-situated in the context of meta-learning (Giraud-Carrier et al. 2004), which most often considers supervised settings. The BNSL features we propose in Sect. 3.1 are exactly a set of meta-features for this particular domain. The regression models we learn (Sect. 3.2) capture meta-knowledge about the state-of-the-art BNSL solvers.
Previous work (Lee and Giraud-Carrier 2008; Leyton-Brown et al. 2014) has suggested that in many cases, a small set of features can lead to accurate predictions; indeed, in Sect. 5.2 we show that a very small number of features leads to near-optimal algorithm selection performance. Furthermore, while that work relied on qualitative visual analysis, in Sect. 6.4 we quantify the utility of each feature using the Gini importance (Breiman 2001).
Recently, a simple “Best in Sample” approach (Rijn et al. 2015) was shown to be very effective for algorithm (classifier) selection in the supervised setting. Briefly, this approach trains each classifier in the portfolio using a very small subset of the data; it then selects the classifier to use based on performance on the subset. The “probing” features we apply in the context of BNSL (see Sect. 3.1), a central form of features in, for example, SAT portfolios (Xu et al. 2008), are similar in spirit to this approach, though adapted to the unsupervised learning setting. In terms of evaluation, our virtual best solver comparisons in Sect. 5 are quite similar to Loss Curves (Leite et al. 2012), which have previously been used in the context of meta-learning.
1.2 Organization
The remainder of this paper is organized as follows. We begin in Sect. 2 by describing the problem of structure learning in Bayesian networks and by giving an overview of the algorithmic techniques underlying the state-of-the-art solvers. Section 3 presents the building blocks of our empirical hardness model: we introduce several BNSL features; we choose an appropriate statistical learning framework; and we describe the methods we use for training and evaluating the models. In Sect. 4, we present the experimental setting, namely technical details of the investigated solvers and characteristics of the collected problem instances. Results on learning solver portfolios and on predicting runtimes of individual solvers are reported in Sects. 5 and 6, respectively. Finally, we discuss some questions that are left open and directions for future research in Sect. 7.
2 Learning Bayesian networks
A Bayesian network (G, P) consists of a directed acyclic graph (DAG) G on a set of random variables \(X_1, \ldots , X_n\) and a joint distribution P of the variables such that P factorizes into a product of the conditional distributions \(P(X_i \mid G_i)\). Here \(G_i\) denotes the set of parents of \(X_i\) in G; we call a variable \(X_j\) a parent of \(X_i\), and \(X_i\) a child of \(X_j\), if G contains an arc from \(X_j\) to \(X_i\).
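Spelled out, the factorization condition reads:

```latex
P(X_1,\ldots,X_n) \;=\; \prod_{i=1}^{n} P(X_i \mid G_i).
```

For example, for a chain \(X_1 \rightarrow X_2 \rightarrow X_3\) the joint distribution factorizes as \(P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2)\).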
2.1 The structure learning problem
In its simplest form, structure learning in Bayesian networks concerns finding a DAG that best fits some observed data on the variables.^{1} Throughout this work, we only deal with this optimization formulation, here only mentioning that there are also other popular formulations based on frequentist (multiple) hypothesis testing (Spirtes et al. 1993; Cheng et al. 2002) and Bayesian model averaging (Madigan and York 1995; Friedman and Koller 2003; Koivisto and Sood 2004).
The goodness of fit is typically measured by a real-valued scoring function s, which associates a DAG G with a real-valued score s(G).^{2} Frequently used scoring functions are based on (penalized) maximum likelihood, minimum description length, or Bayesian principles (e.g., BDeu and other forms of marginal likelihood). Additionally, they decompose (Heckerman et al. 1995) into a sum of local scores \(s_i(G_i)\) for each variable \(X_i\) and its set of parents \(G_i\). In principle, for each i the local scores are defined for all the \(2^{n-1}\) possible parent sets. However, in practice this number is greatly reduced by enforcing a small upper bound for the size of the parent sets \(G_i\) or by pruning, as preprocessing, parent sets that provably cannot belong to an optimal DAG (Teyssier and Koller 2005; de Campos and Ji 2011). Applying one or both of these reductions results in a collection of candidate parent sets, which we will denote by \(\mathcal{G}_i\).
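As a concrete illustration of decomposability, the following sketch evaluates the score of a DAG as the sum of precomputed local scores. The `local_scores` data layout and the toy numbers are ours, for illustration only, and not taken from any particular solver.

```python
# Sketch: evaluating a decomposable score s(G) = sum_i s_i(G_i).
# `local_scores` maps each variable to a dict from a frozenset of
# parents to the precomputed local score of that choice.

def total_score(parents, local_scores):
    """Sum the local scores of the parent sets chosen by the DAG."""
    return sum(local_scores[v][frozenset(ps)] for v, ps in parents.items())

# Toy three-variable instance; the scores are made up (higher is better).
local_scores = {
    "A": {frozenset(): -10.0},
    "B": {frozenset(): -12.0, frozenset({"A"}): -8.0},
    "C": {frozenset(): -11.0, frozenset({"A", "B"}): -7.5},
}
dag = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
print(total_score(dag, local_scores))  # -25.5
```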
2.2 Overview of complete solvers for BNSL
We call an algorithm that is guaranteed to find a global optimum and prove its optimality for the BNSL problem a complete solver for BNSL, or simply a solver. In the next paragraphs we review some state-of-the-art solvers that fit the scope of our study. We omit algorithms that assume significant additional constraints given as input (Perrier et al. 2008) or massive parallel processing (Tamada et al. 2011; Parviainen and Koivisto 2013).
Several works (Ott et al. 2004; Koivisto and Sood 2004; Silander and Myllymäki 2006) have proposed dynamic programming algorithms to solve BNSL. The solvers are based on the early observation (Buntine 1991; Cooper and Herskovits 1992) that for any fixed ordering of the n variables, the decomposability of the score enables efficient optimization over all DAGs compatible with the ordering. The algorithms proceed by adding one variable at a time, only tabulating partial solutions for the explored subsets of the variables. Thus the runtime scales roughly as \(2^n\).
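The subset dynamic programming can be sketched as follows, assuming local scores precomputed and keyed by frozensets of parents (an illustrative representation, not a solver's actual API). Here f(S) is the best score of a DAG over the variable subset S; partial solutions are extended one sink variable at a time.

```python
from itertools import combinations

def best_local(v, allowed, local_scores):
    """Best local score for v with parents drawn only from `allowed`."""
    return max(s for ps, s in local_scores[v].items() if ps <= allowed)

def optimal_score(variables, local_scores):
    """DP over variable subsets: f(S) extends f(S \ {v}) by making v a
    sink of S, so the runtime scales roughly as 2^n."""
    f = {frozenset(): 0.0}
    for k in range(1, len(variables) + 1):
        for subset in combinations(variables, k):
            S = frozenset(subset)
            f[S] = max(f[S - {v}] + best_local(v, S - {v}, local_scores)
                       for v in S)
    return f[frozenset(variables)]

# Toy three-variable instance; scores are illustrative (higher is better).
local_scores = {
    "X1": {frozenset(): -10.0},
    "X2": {frozenset(): -12.0, frozenset({"X1"}): -8.0},
    "X3": {frozenset(): -11.0, frozenset({"X1", "X2"}): -7.5},
}
print(optimal_score(["X1", "X2", "X3"], local_scores))  # -25.5
```

The optimum here is attained by the ordering X1, X2, X3, with X2 choosing parent {X1} and X3 choosing {X1, X2}.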
Yuan and Malone (2013) formulated BNSL as a state-space search through the dynamic programming lattice and applied the A\(^{*}\) search algorithm. Unlike the other sophisticated solvers, A\(^{*}\) maintains the meaningful worst-case time bound of dynamic programming. To this end, they developed several admissible heuristics which relax the acyclicity constraint; these allow the algorithm to prune suboptimal paths during search, thus typically avoiding visiting all the variable subsets.
The branch-and-bound style algorithm by de Campos and Ji (2011) searches in a relaxed space of directed graphs that may contain cycles. It begins by allowing all variables to choose their optimal parents, which typically results in some number of cycles. Then, any found cyclic solutions are iteratively ruled out: it finds a cycle and breaks it by removing one arc in it, branching over the possible choices of the arc. It examines graphs in a best-first order, so the first acyclic graph it finds is an optimal DAG. In this way, the algorithm ignores many cyclic graphs.
Integer linear programming (ILP) algorithms by Jaakkola et al. (2010) and by Cussens (2011, 2013) search in a geometric space, in which DAGs appear as vertices of an embedded polytope, corresponding to integral solutions to a linear program (LP). A series of LP relaxations are solved, and the solution to each relaxation is checked for integrality; an integral solution corresponds to an optimal DAG. The search space is effectively pruned by employing domainspecific cutting planes.
A very recent development in solvers for BNSL is the constraint programming (CP) based approach by van Beek and Hoffmann (2015), constituting a constraint-based depth-first branch-and-bound approach to BNSL. As a key ingredient, the approach uses an improved constraint model with problem-specific dominance, symmetry breaking, and acyclicity constraints and propagators. It also employs cost-based pruning rules applied during search, together with domain-specific search heuristics. The approach combines some of the ideas applied in A\(^{*}\), specifically pattern databases, for obtaining bounds on the scoring function.
3 Empirical hardness models
In this work, we focus on the hardness of a BNSL instance, relative to a particular solver. We define the hardness of instance I for solver S simply as the runtime \(T_S(I)\) of the solver S on the instance I.^{3} Due to the sophisticated heuristics underlying the state-of-the-art BNSL solvers, evaluating the empirical hardness is presumably (that is, under standard complexity-theoretic assumptions) computationally intractable; indeed, the fastest method we are aware of for evaluating \(T_S(I)\) is actually running S on I.
We next introduce several categories of efficiently computable features of BNSL instances. Most of these features have not previously been used for characterizing the hardness of BNSL. We then explain our training and testing strategies.
3.1 Features for BNSL
Table 1: BNSL features

Basic
  1. Number of variables
  2. Mean number of CPSs (candidate parent sets)
Basic extended
  3–5. Number of CPSs: max, sum, sd (standard deviation)
  6–8. CPS cardinalities: max, mean, sd
Upper bounding
  Simple UB
    9–11. Node in-degree: max, mean, sd
    12–14. Node out-degree: max, mean, sd
    15–17. Node degree: max, mean, sd
    18. Number of root nodes (no parents)
    19. Number of leaf nodes (no children)
    20. Number of nontrivial SCCs (strongly connected components)
    21–23. Size of nontrivial SCCs: max, mean, sd
  Pattern database UB
    24–38. The same features as for Simple UB, but calculated on the graph derived from the pattern databases
Probing
  Greedy probing
    39–41. Node in-degree: max, mean, sd
    42–44. Node out-degree: max, mean, sd
    45–47. Node degree: max, mean, sd
    48. Number of root nodes
    49. Number of leaf nodes
    50. Error bound, derived from the score of the graph and the pattern database upper bound
  A\(^{*}\) probing
    51–62. The same features as for Greedy probing, but calculated on the graph learned with A\(^{*}\) probing
  ILP probing
    63–74. The same features as for Greedy probing, but calculated on the graph learned with ILP probing
  CP probing
    75–86. The same features as for Greedy probing, but calculated on the graph learned with CP probing
The Basic features are the number of variables n and the mean number of candidate parent sets per variable, m / n, which can be viewed as a natural measure of the “density” of an instance. The features in Basic extended are other simple features that summarize the size distribution of the collections \(\mathcal{G}_i\) and the candidate parent sets \(G_i\) in each \(\mathcal{G}_i\). During training, we take the logarithm of the features related to the number of candidate parent sets (Features 2–5).
In the Upper bounding set, the features are characteristics of a directed graph that is an optimal solution to a relaxation of the original BNSL problem. Notice here especially the features based on strongly connected components (SCCs), which can be seen as a proxy for cyclicity.^{4} In the Simple UB subset, a graph is obtained by letting each variable select its best parent set according to the scores. The resulting graph may contain cycles, and the associated score is a guaranteed upper bound on the score of an optimal DAG. Many of the reviewed state-of-the-art solvers either implicitly or explicitly use this upper bounding technique; however, they do not use this information to estimate the difficulty of a given instance. The features summarize structural properties of the graph: the in- and out-degree distribution over the variables, and the number and size of nontrivial strongly connected components. In the Pattern database UB subset, the features are the same but the graph is obtained by solving a more sophisticated relaxation of the BNSL problem using the pattern databases technique (Yuan and Malone 2012). Briefly, this strategy optimally breaks cycles among some subsets of variables but allows cycles among larger groups; it is a strictly tighter relaxation than the Simple UB. Both A\(^{*}\) and CP explicitly make use of the pattern database relaxation.
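To make the Simple UB features concrete, the following sketch builds the relaxed graph by letting every variable pick its highest-scoring candidate parent set, and then summarizes its cyclicity by computing nontrivial SCCs with Tarjan's algorithm. The data layout and the toy scores are illustrative assumptions, not the paper's actual implementation.

```python
def simple_ub_graph(local_scores):
    """Each variable selects its highest-scoring candidate parent set;
    the resulting directed graph may contain cycles."""
    return {v: max(cands, key=cands.get) for v, cands in local_scores.items()}

def nontrivial_sccs(parents):
    """Strongly connected components of size > 1 (Tarjan's algorithm),
    a proxy for how 'cyclic' the relaxed solution is."""
    adj = {v: [] for v in parents}
    for child, ps in parents.items():
        for p in ps:
            adj[p].append(child)  # arcs go parent -> child
    index, low, on_stack, stack, comps = {}, {}, set(), [], []
    def strong(v):
        index[v] = low[v] = len(index)
        stack.append(v); on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strong(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop(); on_stack.discard(w); comp.add(w)
                if w == v:
                    break
            comps.append(comp)
    for v in adj:
        if v not in index:
            strong(v)
    return [c for c in comps if len(c) > 1]

# Toy instance in which A and B prefer each other as parents.
local_scores = {
    "A": {frozenset(): -10.0, frozenset({"B"}): -7.0},
    "B": {frozenset(): -9.0, frozenset({"A"}): -6.0},
    "C": {frozenset(): -5.0, frozenset({"A", "B"}): -8.0},
}
graph = simple_ub_graph(local_scores)
print(nontrivial_sccs(graph))  # one nontrivial SCC, containing A and B
```

The in-/out-degree and root/leaf counts of `graph` give the remaining Simple UB features directly.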
Probing refers to running a solver for a fixed number of seconds and collecting statistics about its behavior during the run. Probing has previously been shown to be an important form of features, for example, in the context of Boolean satisfiability within the SATzilla portfolio approach (Xu et al. 2008). Hutter et al. (2014) survey the use of probing features in other domains. Here in the context of BNSL we consider four probing strategies: greedy hill climbing with a TABU list and random restarts, an anytime variant of A\(^{*}\) (Malone and Yuan 2013), and the default versions of ILP (Cussens 2013) and CP (van Beek and Hoffmann 2015). All of these algorithms have anytime characteristics, so they can be stopped at any time and output the best DAG found so far. Furthermore, the A\(^{*}\), ILP, and CP implementations give guaranteed error bounds on the quality of the found DAGs in terms of the BNSL objective function; an error bound can also be calculated for the DAG found using greedy hill climbing by using the upper bounding techniques discussed above. Probing is implemented in practice by running each algorithm for 5 s and then collecting several features, including in- and out-degree statistics and the error bound. We refer to these feature subsets of Probing as Greedy probing, A\(^{*}\) probing, ILP probing, and CP probing, respectively.
3.2 Model training and evaluation
In this work, we use the auto-sklearn system (Feurer et al. 2015) to learn an explicit empirical hardness model \(\hat{T}_S\) for each solver S. Briefly, auto-sklearn uses a Bayesian optimization strategy to select good model classes and hyperparameters for those model classes for a given training set; additionally, preprocessing strategies, such as polynomial expansion or feature selection, and their associated hyperparameters are included in this optimization. Importantly, this approach avoids the difficult step of manually choosing hyperparameters in an ad hoc fashion. We refer the reader to the original publication (Feurer et al. 2015) for more details.
In total, auto-sklearn selects from among eleven preprocessing strategies, including higher-dimensional projection techniques like polynomial expansion and feature selection strategies based on, for example, mutual information. The default learning strategy for auto-sklearn includes twelve model classes for regression and selects an ensemble of up to 50 regressors with optimized hyperparameters. In order to learn interpretable models and avoid potential overfitting, we restricted auto-sklearn to learning the hyperparameters for a single preprocessor and a random forest.^{5} As described in detail in Sect. 4.2, this study includes three types of BNSL instances: Real, Sampled and Synthetic. For model training, we used all three types of datasets.
The portfolios and prediction accuracy are evaluated using an “outer” ten-fold cross-validation scheme. In other words, the data is partitioned into 10 non-overlapping subsets. For each fold, nine of the subsets are used to train the model. As a first step in training, we normalize each feature so that it has zero mean and unit variance; the same mean and variance are later used to scale the test data. We then use auto-sklearn to learn the respective models. Internally, auto-sklearn further splits the training data in an “inner” cross-validation approach to avoid overfitting. We allow 5 h of training time for each fold. The remaining subset is used for testing, which only takes a few seconds; each subset is used as the testing set once. Importantly, the subset used for testing is never seen by auto-sklearn during training.
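The scaling step can be sketched in isolation (pure Python, without auto-sklearn): statistics are estimated on the training fold only and then reused verbatim on the test fold, so no information leaks from test to train. The toy feature matrix below is illustrative.

```python
import statistics

def fit_scaler(train_rows):
    """Per-feature mean and standard deviation, estimated on training data only."""
    cols = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant features
    return means, stds

def transform(rows, means, stds):
    """Apply the training-fold statistics to any (train or test) rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [3.0, 10.0]]  # two instances, two features
means, stds = fit_scaler(train)
print(transform(train, means, stds))          # [[-1.0, 0.0], [1.0, 0.0]]
print(transform([[2.0, 10.0]], means, stds))  # [[0.0, 0.0]]
```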
For testing, we predict the runtime of each testing instance using the appropriate model for each solver. For the algorithm selection analysis in Sect. 5.2, we then select the solver with the lowest predicted runtime. In order to accurately reflect the entire cost of algorithm selection, we report the runtime of a portfolio on a given instance as the sum of the runtimes of (i) feature computation for all feature sets used in the respective models and (ii) the selected solver.
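The selection rule itself is simple. The sketch below, with made-up solver names and runtimes, picks the solver with the lowest predicted runtime and charges the feature-computation time to the portfolio, as in item (i) above.

```python
def portfolio_runtime(predicted, actual, feature_time):
    """Run the solver with the lowest predicted runtime; the portfolio's
    cost is the feature computation plus the chosen solver's runtime."""
    chosen = min(predicted, key=predicted.get)
    return chosen, feature_time + actual[chosen]

predicted = {"ilp": 40.0, "astar": 12.0, "cp": 90.0}  # model outputs (s)
actual = {"ilp": 35.0, "astar": 20.0, "cp": 75.0}     # true runtimes (s)
print(portfolio_runtime(predicted, actual, feature_time=16.0))
# ('astar', 36.0)
```

Note that a misprediction (here the model underestimates astar's runtime) still yields a valid portfolio run; it simply may not be the fastest choice.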
4 Experimental setup
We continue with a detailed description of our experimental setup, including descriptions of the solver parameterizations used, the data sets used in the experiments, as well as the computing infrastructure used.
4.1 Solvers
We begin by describing the exact parameterizations of complete BNSL solvers used in the experiments. Specifically, we evaluate three complete approaches: Integer Linear Programming (ILP), A\(^{*}\)-based state-space search (A\(^{*}\)), and a constraint programming based approach (CP). Importantly, these approaches constitute the current state-of-the-art solvers for BNSL.^{6}
ILP: We use the GOBNILP solver (Cussens 2013) as a state-of-the-art representative of the ILP-based approaches to BNSL. GOBNILP uses the SCIP framework (Achterberg 2009) and an external linear program solver; we chose the open-source SoPlex solver (Wunderling 1996) bundled with the SCIP Optimization Suite. We consider the most recent version, GOBNILP 1.6.2, which uses SCIP 3.2.0 with SoPlex 2.2.0, as well as GOBNILP 1.4.1 (SCIP 3.0.1, SoPlex 1.7.1). For both versions we consider two parameterizations: the default configuration, which searches for BNSL-specific cutting planes using graph-based cycle-finding algorithms, and a second configuration, “nc” (“no cycle-finding”), which only uses nested integer programs. We call these parameterizations ilp141, ilp141nc, ilp162, and ilp162nc, respectively, for short.
A\(^{*}\): We use the URLearning solver (Yuan and Malone 2013) as a state-of-the-art representative approach to BNSL based on the A\(^{*}\) search method. We consider three parameterizations: A\(^{*}\) ed3, which uses dynamic pattern databases; A\(^{*}\) ec, which uses a combination of dynamic and static pattern databases; and A\(^{*}\) comp, which uses a strongly connected component-based decomposition (Fan et al. 2014).
CP: We use the CPBayes solver (van Beek and Hoffmann 2015) as the most recent state-of-the-art representative approach to BNSL based on branch-and-bound style constraint programming search with problem-specific filtering (search-space pruning) techniques. This solver does not expose any parameters to control its behavior, so we apply the solver in our experiments in its default configuration, cpbayes.
The non-default parameterizations of the solvers were suggested to us by the solver developers. While we use both an “up-to-date” version (1.6.2) and an older version (1.4.1) of GOBNILP, it is important to note that, generally, the choice of parameters and the solver version can at times have a noticeable effect on the per-instance runtimes of the resulting solver—so much so that one could consider the solvers different.^{7}
4.2 Training data
Real: Real-world datasets obtained from machine learning repositories: the UCI repository (Bache and Lichman 2013), the MLData repository (http://mldata.org/), and the Weka distribution (Hall et al. 2009). We searched primarily for datasets of fully or mostly categorical data and a reasonable number of variables (16–64) to produce instances that are feasible but nontrivial to solve. Every dataset found and matching these criteria was included. While some of the datasets have originally been designed for supervised learning, they have been regularly included also in studies of unsupervised learning. These datasets are summarized in more detail in Table 9 of the Appendix.
Sampled: Datasets sampled from benchmark Bayesian networks, obtained from http://www.cs.york.ac.uk/aig/sw/gobnilp/. These datasets are widely used for evaluating the performance of individual solvers, for example, recently in the context of optimal BNSL (Bartlett and Cussens 2013; van Beek and Hoffmann 2015; Berg et al. 2014; Cussens 2013; Fan et al. 2014; Fan and Yuan 2015; Malone et al. 2014, 2015; Saikko et al. 2015). These datasets are summarized in Table 10 of the Appendix.
Synthetic: Datasets sampled from synthetic Bayesian networks. We generated random networks with varying numbers of binary variables (20–60) and maximum in-degrees (2–8). For each network, one dataset was produced by sampling a random number (100–10,000) of records.
We preprocessed each dataset by removing unique identifiers (to avoid overfitting) and trivial variables that only take on one value. Continuous variables as well as other variables with very large domains were either removed or discretized using a normalized maximum likelihood approach (Kontkanen and Myllymäki 2007) when possible. The maximum number of records per dataset was limited to 60,000 to make the evaluation of scoring functions feasible.
We considered five different scoring functions^{9}: the BDeu score with the Equivalent Sample Size parameter selected from \(\{0.1, 1, 10, 100\}\) and the BIC score. For each dataset in the Real and Sampled categories we produced multiple instances by considering all scoring functions and varying upper bounds on the size of each candidate parent set, ranging from 2 to 6, as well as the unbounded case. For each dataset in Synthetic we produced one instance, choosing both the scoring function and the parent limit at random. For larger datasets, evaluating the scores was feasible only up to lower values of the maximum parent set size. The total number of datasets and BNSL instances produced is summarized in Table 2.
For running all solvers on these instances we used a cluster of Dell PowerEdge M610 computing nodes, each equipped with two 2.53-GHz Intel Xeon E5540 CPUs and 32 GB of RAM. For each individual run we used one CPU core, with a timeout of 2 h and a 30-GB memory limit. We treat the runtime of an instance as 2 h if a solver exceeds either the time or the memory limit.
Table 2: Number of source datasets, instances generated from the source datasets, and instances used in training and testing the models

Category   Datasets  All instances  Training and testing
Real       39        637            486
Sampled    19        317            283
Synthetic  477       477            410
Table 3: The runtime of feature computation for each feature category in seconds, shown as the average, median, minimum, and maximum runtime over all training instances

Feature set     Average  Median  Min  Max
Basic           0.00     0       0    0
Basic extended  0.00     0       0    0
Upper bounding  0.00     0       0    0
Greedy probing  2.53     2       0    6
A* probing      4.61     5       0    7
ILP probing     3.94     5       0    10
CP probing      4.49     6       0    10
All             15.57    18      0    26
4.3 Feature computation
In order to train the models, we computed the features detailed in Sect. 3.1 for all training instances. Table 3 summarizes the time spent computing these features, separately for each feature category. We observe that the computation takes around 16 s per instance on average and about 26 s in the worst case. Further, most of the time is spent on probing, while the features of all other categories are computed in less than 1 s. In other words, a time limit needs to be enforced only for computing the probing features. As witnessed by the maximum feature computation times, probing occasionally runs longer than its 5-s limit; this can be caused by overhead resulting, for example, from memory deallocation operations. On those specific instances we gave probing an additional 5 s to finish; if the probing solver had still not completed within this time, it was terminated.
All in all, the overhead from computing the features is negligible from a portfolio perspective, as our main interest is in choosing the fastest solver for harder instances that take several minutes or even hours to solve. The easiest instances, by contrast, are often already solved in the probing phase.^{11}
4.4 Availability of experiment data
The runtime and feature data are also available for further benchmarking purposes as a scenario in the ASlib Algorithm Selection Library (Bischl et al. 2016).
5 Portfolios for BNSL
5.1 Solver performance
As the basis of this work, we ran all the solvers on all the BNSL instances, as described in Sect. 4. A comparison of solver performance is shown in Fig. 3, in terms of the number of instances for which a particular solver was empirically faster than all other solvers on the considered benchmarks. Tables 4 and 5 show an alternative comparison in terms of the total number of instances that were successfully solved within the given computational resources as well as the total CPU time required to either solve an instance or run out of time or memory. The results are given in comparison to the Virtual Best Solver (VBS), which is the theoretically optimal portfolio that always selects the best solver, constructed by selecting a posteriori the fastest solver for each input instance. Essentially, a theoretical lower bound on the runtime of any portfolio approach using a fixed set of k solvers is the runtime of the VBS. Furthermore, by interleaving the executions of the solvers until the best solver for a specific instance terminates, a theoretical upper bound of k times the runtime of the VBS is obtained.
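The VBS and the interleaving bound described above can be computed directly from a per-instance runtime matrix. The sketch below uses made-up solver names and runtimes purely for illustration; the values are not from the evaluation.

```python
def vbs_runtime(runtimes):
    """Per-instance runtime of the Virtual Best Solver (VBS).

    `runtimes[s][i]` is the (capped) runtime of solver s on instance i.
    The VBS picks, a posteriori, the fastest solver for each instance;
    interleaving all k solvers round-robin costs at most k times this.
    """
    solvers = list(runtimes)
    n = len(runtimes[solvers[0]])
    return [min(runtimes[s][i] for s in solvers) for i in range(n)]

# Illustrative (made-up) runtimes for three solvers on three instances.
times = {"ilp": [10.0, 7200.0, 3.0],
         "astar": [500.0, 60.0, 7200.0],
         "cp": [90.0, 120.0, 40.0]}
vbs = vbs_runtime(times)
# cumulative VBS runtime, and the k-times interleaving upper bound
cumulative, upper = sum(vbs), len(times) * sum(vbs)
```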
The performance of all solvers as well as the Virtual Best Solver (VBS) and four portfolios on all training instances, measured as the number of instances solved and the overall runtime
Solver  Instances solved  (%)  Runtime (s)  

Cumulative  Average  Median  
VBS  1179  100  259,440  220  7.33 
VBS without CP  1164  98  368,690  313  9.40 
VBS without A\(^{*}\)  1157  98  475,032  403  8.96 
VBS without ILP  937  79  2,022,296  1715  33.35 
portfoliobasic  1141  96  540,384  458  12.30 
autofoliobasic  1146  97  548,030  465  18.34 
portfolioall  1152  97  488,093  414  27.70 
autofolioall  1152  97  501,146  425  23.84 
ilp141  1036  87  1,364,855  1158  36.39 
ilp141nc  1034  87  1,384,022  1174  41.83 
ilp162  1029  87  1,453,932  1233  29.56 
ilp162nc  1026  87  1,494,879  1268  32.18 
cpbayes  896  75  2,423,547  2056  85.83 
A \(^{*}\) comp  768  65  3,152,809  2674  185.79 
A \(^{*}\) ec  519  44  4,866,797  4128  7200.00 
A \(^{*}\) ed3  478  40  5,163,876  4380  7200.00 
The performance of all solvers and portfolios within each instance category
Solver  Solved  (%)  Runtime (s)  Category  

Cumulative  Average  Median  
VBS  486  100  92,165  78  2.69  Real 
VBS without CP  480  98  141,833  120  5.13  
VBS without A\(^{*}\)  469  96  244,625  207  5.60  
VBS without ILP  448  92  370,212  314  8.48  
portfoliobasic  470  96  209,490  178  4.60  
autofoliobasic  469  96  232,889  198  9.78  
portfolioall  475  97  175,555  149  16.66  
autofolioall  474  97  197,599  168  16.00  
ilp141  396  81  800,432  679  55.68  
ilp141nc  396  81  799,734  678  56.78  
ilp162  382  78  882,431  748  44.24  
ilp162nc  382  78  887,222  753  48.25  
cpbayes  427  87  549,230  466  14.85  
A \(^{*}\) comp  382  78  860,025  729  65.98  
A \(^{*}\) ec  311  63  1,300,350  1103  156.30  
A \(^{*}\) ed3  281  57  1,523,034  1292  523.43  
VBS  283  100  62,010  53  5.62  Sampled 
VBS without CP  278  98  92,511  78  6.31  
VBS without A\(^{*}\)  280  98  97,027  82  5.95  
VBS without ILP  227  80  453,422  385  33.07  
portfoliobasic  274  96  131,034  111  9.02  
autofoliobasic  277  97  123,468  105  17.15  
portfolioall  278  98  115,254  98  23.97  
autofolioall  280  98  97,502  83  19.08  
ilp141  256  90  253,298  215  9.54  
ilp141nc  254  89  266,871  226  13.91  
ilp162  257  90  280,990  238  13.57  
ilp162nc  252  89  309,674  263  15.07  
cpbayes  212  74  603,795  512  91.45  
A \(^{*}\) comp  182  64  749,656  636  145.95  
A \(^{*}\) ec  81  28  1,488,628  1263  7200.00  
A \(^{*}\) ed3  71  25  1,558,424  1322  7200.00  
VBS  410  100  105,264  89  14.98  Synthetic 
VBS without CP  406  99  134,346  114  16.15  
VBS without A\(^{*}\)  408  99  133,380  113  15.90  
VBS without ILP  262  63  1,198,662  1017  357.74  
portfoliobasic  397  96  199,860  170  25.44  
autofoliobasic  400  97  191,674  163  26.44  
portfolioall  399  97  197,284  167  38.68  
autofolioall  398  97  206,045  175  36.77  
ilp141  384  93  311,125  264  45.16  
ilp141nc  384  93  317,417  269  50.39  
ilp162  390  95  290,512  246  30.32  
ilp162nc  392  95  297,984  253  29.48  
cpbayes  257  62  1,270,522  1078  758.21  
A \(^{*}\) comp  204  49  1,543,127  1309  7200.00  
A \(^{*}\) ec  127  30  2,077,819  1762  7200.00  
A \(^{*}\) ed3  126  30  2,082,419  1766  7200.00 
In terms of the relative performance of the solvers, Fig. 4 shows the pairwise correlations between the solvers on all instances. Unsurprisingly, different parameterizations within the same solver family correlate strongly with each other. Within the A\(^{*}\) family, the strongest correlation is between A \(^{*}\) ec and A \(^{*}\) ed3, while all ILP parameterizations are strongly correlated, though mildly less so between different versions of the solver. Between solver families, A\(^{*}\) and ILP correlate with each other the least, while CP exhibits mild correlation with ILP and moderate correlation with A\(^{*}\). Interestingly, A \(^{*}\) comp correlates more with CP than with the other A\(^{*}\) parameterizations.
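Such pairwise comparisons can be computed from two solvers' per-instance runtime vectors. The sketch below is a generic Spearman rank correlation, an assumption for illustration, since the exact correlation measure behind Fig. 4 is not specified in this excerpt; ranks make the measure robust to the heavy-tailed, capped runtimes.

```python
import math

def rank(xs):
    """Average ranks (ties shared) for a sequence of runtimes."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two runtime vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    vy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (vx * vy)
```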
While the ILP approach appears to be the best-performing as measured by total runtime and the number of instances solved on the set of benchmarks considered, the results suggest that the performance of ILP on a per-instance basis is quite orthogonal to that of both CP and A\(^{*}\) (recall Fig. 1). We will now show that a BNSL solver portfolio can closely capture the best-case performance of all eight of the considered solver parameterizations in terms of empirical runtimes.
5.2 Portfolios for BNSL
As a main observation reported on in this section, we found that using only the Basic features (the number of variables, n, and the mean number of candidate parent sets, m / n) is enough to construct an efficient BNSL solver portfolio. We emphasize that, while on an intuitive level the importance of these two features may be to some extent unsurprising, such intuition does not directly translate into an actual predictor that close-to-optimally predicts the best-performing solver.
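For concreteness, the two Basic features can be extracted as follows. This is a minimal sketch assuming an instance is represented as a mapping from each variable to its collection of scored candidate parent sets; that representation is an assumption chosen for illustration.

```python
def basic_features(parent_sets):
    """Compute the two Basic features of a BNSL instance.

    `parent_sets[v]` is the collection of candidate parent sets for
    variable v (a hypothetical representation). n is the number of
    variables and m the total number of candidate parent sets, so
    m / n is the mean number of candidate parent sets per variable.
    """
    n = len(parent_sets)
    m = sum(len(cands) for cands in parent_sets.values())
    return {"n": n, "mean_parent_sets": m / n}
```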
The contribution of each solver to the VBS and the two portfolios measured as the Shapley value in terms of the average number of additional instances solved after adding the indicated solver to the portfolio
Solver  VBS  portfolioall  portfoliobasic 

ilp162  184.53  181.82  178.75 
ilp141  184.12  179.78  181.86 
ilp141nc  182.48  179.62  178.79 
ilp162nc  181.50  178.96  177.44 
cpbayes  160.42  152.18  149.37 
A \(^{*}\) comp  136.24  131.60  127.62 
A \(^{*}\) ec  78.28  77.72  77.10 
A \(^{*}\) ed3  71.43  70.34  70.08 
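The Shapley values in the table can be read as a solver's average marginal contribution over all orders in which solvers could join the portfolio (cf. Fréchette et al. 2016). A minimal exact computation is sketched below, assuming we know only which instances each solver solves within the limits; the coalition value is the number of instances solved by at least one member.

```python
from itertools import combinations
from math import factorial

def shapley_solved(solved):
    """Shapley value of each solver w.r.t. instances solved.

    `solved[s]` is the set of instances solver s solves within the
    resource limits. v(S) = |union of solved sets over S|, and a
    solver's Shapley value averages its marginal contribution
    v(S + {s}) - v(S) over all coalitions S, with the standard
    permutation weights |S|! (k - |S| - 1)! / k!.
    """
    solvers = list(solved)
    k = len(solvers)

    def v(coalition):
        covered = set()
        for s in coalition:
            covered |= solved[s]
        return len(covered)

    phi = {}
    for s in solvers:
        rest = [t for t in solvers if t != s]
        total = 0.0
        for r in range(k):
            w = factorial(r) * factorial(k - r - 1) / factorial(k)
            for S in combinations(rest, r):
                total += w * (v(S + (s,)) - v(S))
        phi[s] = total
    return phi
```

This exact enumeration is feasible here because the portfolio contains only eight solvers.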
Given the good runtime performance of the portfolios obtained using runtime predictions from random forests as the underlying algorithm selection strategy, it is interesting to investigate to what extent the choice of algorithm selection strategy impacts portfolio performance using the same set of BNSL features. For comparison, we consider AutoFolio (Lindauer et al. 2015), a state-of-the-art algorithm selection system,^{12} for constructing the portfolios autofoliobasic (using AutoFolio on the Basic feature set) and autofolioall (using AutoFolio on the full feature set).^{13}
AutoFolio (Lindauer et al. 2015) trains a binary classifier for each pair of solvers which selects the better-performing of the two for a given instance; the training instances are weighted by the difference in performance between the two solvers. Further, AutoFolio selects among the feature sets to use during testing so as to minimize the overall solution time. A Bayesian optimization strategy is used to optimize the classifier hyperparameters, the feature set, and the preprocessing choices.^{14} For an unseen instance, each of the trained classifiers votes for a solver, and the solver with the most votes is used for that instance. The training and testing splits were the same for both AutoFolio and autosklearn. For AutoFolio, we also used an “outer” tenfold cross-validation scheme to ensure it does not use testing instances during training.
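The voting step can be sketched as follows. This is a simplification of AutoFolio's actual machinery: the trained pairwise classifiers are abstracted as callables, the performance-difference weighting affects only their training (not this selection step), and tie-breaking by solver order is an assumption made here.

```python
from collections import Counter
from itertools import combinations

def pairwise_vote(features, predictors, solvers):
    """Select a solver by pairwise voting, in the style of AutoFolio.

    `predictors[(a, b)]` stands in for a trained binary classifier: a
    callable on the instance's feature vector returning whichever of
    a and b it predicts to be faster. The solver collecting the most
    votes wins; ties are broken by order in `solvers`.
    """
    votes = Counter()
    for a, b in combinations(solvers, 2):
        votes[predictors[(a, b)](features)] += 1
    return max(solvers, key=lambda s: votes[s])
```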
The two portfolios produced by AutoFolio perform very similarly on the benchmark set to those based on predicting runtimes with random forests. In more detail, autofolioall solves more instances than portfolioall within the first 30 s for all instance types; this is because AutoFolio does not always use all of the feature sets, and so spends less time computing features at test time. After this initial phase, the number of instances solved under a given per-instance timeout was very similar for portfolioall and autofolioall. As Table 4 shows, though, in total portfolioall has a slightly lower cumulative runtime than autofolioall; the detailed breakdown in Table 5 clarifies that this is largely due to the better performance of portfolioall on the Real instances.
5.3 Basic features and solver performance
As the Basic features yield efficient BNSL portfolios, we look more closely at the effect of the perinstance Basic feature values on solver performance. Figure 9 reinforces the orthogonal strengths of different solver families in the space spanned by these two features. Specifically, we observe that ILP parameterizations can fairly reliably solve instances up to around 1000 candidate parent sets per variable, regardless of the number of variables. In comparison, the A\(^{*}\) family consistently solves benchmark instances up to 30 variables, and many up to 40, even with tens of thousands of candidate parent sets per variable. Our results show that CP takes a middle ground between the two, solving many instances at the high end of either of the Basic features, albeit less consistently than either A\(^{*}\) or ILP.
In particular, Fig. 9 (top left) demonstrates why the Basic features result in strong portfolio behavior; namely, the instances which are optimally solved by the different solver families are nearly linearly separable in this space. The figure also supports the rough characterization (recall Sect. 1) of the computational limitations of state-of-the-art solvers: none of the state-of-the-art solvers are able to solve the benchmark instances where both of the Basic features are very large.
6 Predicting runtimes
As shown in Sect. 5, the Basic features can effectively distinguish between solvers to use on a particular instance of BNSL. We will now address question Q2, that is, whether the use of additional features (cf. Sect. 3.1) improves the accuracy of the runtimes predicted by the random forests learned with autosklearn.
6.1 Predictions with added features
Figure 11 depicts the actual runtimes of solvers compared to the runtimes predicted by the random forests learned with autosklearn. We again use A \(^{*}\) comp, cpbayes, and ilp162 as representatives of their solver families (recall Sect. 5.2; similar conclusions hold for all solvers within the respective families). On the left we see this comparison for models trained using the Basic features only. Even though these predictions allow for good portfolio behavior, the considerable amount of prediction error makes them less useful for obtaining accurate estimates of the runtime. The right side, on the other hand, shows the same comparison when using All, where the predictions are more concentrated near the diagonal. In other words, the larger, more sophisticated feature set results in more accurate runtime predictions. Table 7 presents a numerical measure of the improvement in terms of change in the approximation factor, defined as \(\rho = \max \{\frac{a}{p}, \frac{p}{a}\}\), where a and p are the actual and predicted runtimes, respectively. In particular, smaller approximation factors are better.
The percentage of instances with an approximation factor within the given ranges of \(\rho \), when predicting runtimes based on either Basic or All features
Range of \(\rho \)  A \(^{*}\) comp  cpbayes  ilp162  

Basic (%)  All (%)  Basic (%)  All (%)  Basic (%)  All (%)  
\({<}\,2\)  48  60  45  67  59  71 
[2, 5)  22  22  27  20  29  22 
[5, 10)  14  7  13  7  7  4 
\({>}\,10\)  17  11  15  6  4  3 
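The approximation factor and the bucketing in Table 7 are straightforward to compute. A minimal sketch follows; since the table's ranges \({<}2\), [2, 5), [5, 10), \({>}10\) leave the boundary value 10 unassigned, the remaining bucket is written here as \(\ge 10\), an assumption about the intended convention.

```python
def approx_factor(actual, predicted):
    """rho = max(a / p, p / a); 1.0 means a perfect runtime prediction."""
    return max(actual / predicted, predicted / actual)

def bucket(rho):
    """Assign rho to the ranges of Table 7 (remainder taken as >= 10)."""
    for hi, label in ((2, "<2"), (5, "[2,5)"), (10, "[5,10)")):
        if rho < hi:
            return label
    return ">=10"
```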
The coefficient of determination (\(R^2\)) for the actual runtime given the predicted runtime
Solver  A \(^{*}\) ec  A \(^{*}\) ed3  A \(^{*}\) comp  cpbayes  ilp141  ilp141nc  ilp162  ilp162nc 

Basic  0.71  0.79  0.57  0.51  0.67  0.69  0.73  0.72 
All  0.86  0.89  0.66  0.65  0.76  0.78  0.81  0.79 
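The coefficient of determination can be computed with the textbook definition, sketched below. This simplified version scores the predictions directly against the actual runtimes rather than fitting a regression of actual on predicted; whether the paper's \(R^2\) values were obtained exactly this way is an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination R^2 of actual values given predictions.

    1.0 means all variance in `actual` is explained by `predicted`;
    0.0 means the predictions do no better than the mean of `actual`.
    """
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```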
6.2 Preprocessing characteristics
We now turn to a more qualitative analysis based on the preprocessor and the single random forest with optimized hyperparameters learned by autosklearn.
First, we examine preprocessor choices. As shown in Fig. 15, the choice of preprocessor often reflects the amount of information inherently available in the feature sets. Furthermore, Fig. 15 includes a clustering of the solvers and feature sets based on the choice of preprocessor. In the clustering, we see that the families of solvers tend to cluster together.
The Basic feature set (dark tan) almost always results in a preprocessor which increases the dimensionality, either the polynomial expansion or the random forest embedding technique; we interpret this to mean that the features alone do not provide sufficient information for accurate prediction, so autosklearn attempts to increase the information with preprocessing. Likewise, many of the “mildly informative” feature sets, such as Simple UB (dark teal), almost exclusively result in polynomial expansion for preprocessing the input features. Interestingly, the Basic extended feature set (light tan) results in polynomial expansion, a dimensionality expansion strategy, and feature agglomeration, a dimensionality reduction strategy, in roughly equal proportions for all solvers.
On the other hand, for the A\(^{*}\) algorithms with the larger feature sets like All (light brown), autosklearn has “too much” information, so it uses feature agglomeration, as well as model-based and percentile-based feature selection, to combine or remove uninformative features; these choices are typically statistically significant. Preprocessing is usually not used for predicting most of the ILP runtimes using “informative” feature sets, such as All and ILP Probing (light teal); again, almost all of these choices are statistically significant.
This analysis demonstrates that the choice of preprocessing strategy by autosklearn largely agrees with intuition. For small, relatively uninformative feature sets, feature expansion strategies like polynomial expansion are often used; when more informative features are available, they are used relatively unchanged. Finally, when “too much” information is present, sophisticated feature selection strategies are used to retain useful features while removing noise.
6.3 Model complexity
We additionally analyzed the complexity of the learned random forests, in terms of the mean size of the regression trees composing them. As expected, Fig. 16a shows that the trees learned using the Basic features are the smallest. Other simpler feature sets, such as Basic extended and Simple UB also resulted in small trees for all solvers.
Somewhat surprisingly, though, the regression trees for the various ILP solvers are much larger than those for cpbayes and the A\(^{*}\) family of solvers for the A \(^{*}\) probing, Pattern database UB, All, and CP probing feature sets. As shown in Fig. 15, autosklearn often forwent preprocessing in these cases for ILP. On the other hand, it used sophisticated preprocessing, like the model-based approach, for A\(^{*}\) and cpbayes a significant amount of the time. Thus, these results suggest an implicit tradeoff in autosklearn between resources used for preprocessing and for the model itself.
Also unexpectedly, for the ILP probing feature set, the trees for ILP without the graph-based cutting plane routines (the “nc” parameterizations) are much larger than those for the parameterizations using them. We hypothesize this is due to differences between the ILP implementation used for probing and the “nc” solvers; namely, the ILP implementation used in probing does use the graph-based cutting plane routines. autosklearn uses preprocessing only sparingly in all of these cases, so it again appears that a more complex model is used to handle the noise in the features.
6.4 Important features
Figure 16b shows important features for the different solvers. Several of the importances are unsurprising; the number of variables in the dataset determines the size of the search space for A\(^{*}\), and it was indeed the most important feature for all A\(^{*}\) parameterizations. Similarly, the size of the linear program solved by ILP is directly determined by the number of candidate parent sets, and its most important features describe these sets. Likewise, the respective probing error bound features were typically somewhat important for ILP and CP. This is sensible because these features indicate when a solver can quickly converge to a nearly optimal solution; however, as could be seen from Fig. 14, the overall improvement to RMSE is modest with the addition of the probing features.
Figure 16b shows that the CP and A\(^{*}\) family models share many important features. For example, CP uses the pattern database relaxation which also guides the A\(^{*}\) search, and pattern database node degree features are indeed important for both CP and A\(^{*}\) models.
In contrast to ILP and CP, A \(^{*}\) comp is the only A\(^{*}\) parameterization for which probing was an important feature. Coupled with the minimal improvement to RMSE shown in Fig. 14 when using probing, this suggests that the runtime characteristics of the anytime variant of A\(^{*}\) are different enough from the A\(^{*}\) family of solvers included in the portfolio that it adds significant noise to learning.
Another somewhat unexpected result concerning A\(^{*}\) is that many Simple UB features are quite important. Previous experimental results (Yuan and Malone 2013) show that the pattern database bounding approach is much more informative during the A\(^{*}\) search. However, the solvers construct their pattern databases differently than those used for extracting features, so the structural properties, such as the number of nontrivial SCCs, of the constructed graphs may not reflect the difficulty of the problem for the solver.
In general, the results presented in Fig. 16b reveal that a small number of features were consistently important for any particular solver; this is in line with previous work (Lee and Giraud-Carrier 2008; Leyton-Brown et al. 2014). Qualitatively, this implies that most of the trees were based on the same small set of features.
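Feature-importance analyses of this kind need not rely on the forest internals. As a model-agnostic illustration of the same idea, not the procedure behind Fig. 16b, the sketch below implements permutation importance: a feature is important if shuffling its column degrades the model's predictions.

```python
import random

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in squared error when one feature column is shuffled.

    `model` is any callable mapping a feature row to a predicted runtime.
    A feature the model ignores gets importance exactly 0.0; the more a
    prediction depends on a feature, the larger its importance.
    """
    rng = random.Random(seed)

    def mse(rows):
        preds = [model(row) for row in rows]
        return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

    base = mse(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature-target association
            shuffled = [row[:j] + [c] + row[j + 1:] for row, c in zip(X, col)]
            drops.append(mse(shuffled) - base)
        importances.append(sum(drops) / n_repeats)
    return importances
```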
7 Conclusions
We have investigated the empirical hardness of BNSL, the Bayesian network structure learning problem, in relation to several state-of-the-art complete solvers based on A\(^{*}\) search, integer linear programming, and constraint programming. While each of these solvers always finds an optimal Bayesian network structure (with respect to a given scoring function), the runtimes of the solvers can vary greatly even within instances of the same size. Moreover, on a given instance, some solvers may run very fast, whereas others require considerably more time, sometimes by several orders of magnitude. We validated this general view, which has emerged from a series of recent studies, by conducting the most elaborate evaluation of state-of-the-art solvers to date. We have made the rich evaluation data publicly available^{16} in order to facilitate further analyses that go beyond the scope of the present work.
As the second contribution, we applied machine learning methods to construct empirical hardness models from the data obtained by the solver evaluations. Instantiating the general methodology of empirical hardness models (Rice 1976; Leyton-Brown et al. 2009), we proposed several features, that is, real-valued functions of BNSL instances, which are potentially informative about solver runtimes and which go beyond the basic parameters of instance size.
We used two approaches, autosklearn and AutoFolio, for building BNSL portfolio solvers that directly address the algorithm selection problem. Additionally, we studied in more detail the runtime prediction accuracy of the models learned with autosklearn. Both of these state-of-the-art systems use Bayesian optimization to optimize the model class, the preprocessing, and the relevant hyperparameters of the respective models.
The learned models allowed us to answer two basic questions concerning prediction of the solvers’ relative and absolute performance without actually running the solvers. The first question (Q1) asked whether the basic parameters of input size suffice for reliably predicting which of the solvers is the fastest on a given problem instance. We answered this question in the affirmative by showing that whenever a solver is significantly slower than the fastest solver on a given instance, the slower one is very rarely predicted to be the fastest. We compared the performance of portfolios based on models learned by both AutoFolio and autosklearn, and observed that the two approaches yielded very similar portfolio runtime performance. Across varying distributions of instances, our portfolio solver using a very basic set of BNSL features was the fastest solver overall, exhibiting cumulative runtimes within two times that of the Virtual Best Solver (VBS). In contrast, the cumulative runtime of the best individual solver is over five times that of the VBS. As a result, the proposed solver portfolio is currently the fastest algorithm for solving BNSL when averaged over a large heterogeneous set of instances.
Our answer was affirmative also to the second question (Q2) of whether the runtimes of each of the solvers can be predicted more accurately by extending the set of features. We observed that, in general, the higher-quality the features, the more accurate the predictions. For algorithm selection, however, the more accurate runtime predictions translated into only a small improvement. This was somewhat expected, since the selections based on the Basic features already achieved very good performance.
Via the extensive empirical evaluation presented as part of this work, we managed to answer some of the key basic questions about the empirical hardness of BNSL. This first study opens several avenues for future research. First, we believe the proposed collection of features is not complete: presumably, there are even more informative, albeit possibly slower-to-compute, features yet to be discovered. For example, while not considered here, one straightforward possibility would be to use summary statistics for the BNSL features that are less susceptible to outliers, for example, medians. The question of how to efficiently trade informativeness for computational efficiency is relevant also more generally for the algorithm selection methodology; probing features (Hutter et al. 2014), as applied in this work in the context of BNSL, provide just one, rather generic, technique. Second, the empirical hardness model and its evaluated performance obviously depend on the distribution of the training and test instances. While this dependency is unavoidable, it is an intriguing question to what extent it can be weakened by considering appropriate distributions and sufficiently large samples of instances.
Finally, we note that while in this work we focused on the runtime behavior of complete BNSL solvers, that is, exact algorithms that provide provably optimal solutions to given BNSL instances, the techniques studied and developed in this paper could also be extended to cover inexact local-search, greedy, and approximate algorithmic approaches to BNSL. While such approaches typically exhibit better scalability than the exact approaches studied here, the fact that inexact approaches cannot give guarantees of optimality on the produced solutions brings new challenges in terms of portfolio construction and prediction, specifically in understanding the interplay between solution quality and runtimes. Another potentially interesting direction for further study (although a somewhat secondary aspect compared to runtime behavior) would be to understand and predict the memory usage of exact approaches. Furthermore, it would be interesting to expand the study in the future by including additional datasets, for example, from OpenML (Vanschoren et al. 2013).
Footnotes
 1.
Strictly speaking, the data are assumed to consist of N independent and identically distributed tuples \((X^t_1,\ldots ,X^t_n)\), \(t = 1,\ldots ,N\), so the dimension of the data is \(N \times n\).
 2.
The score does not depend on the parameters of the unspecified distribution P, which are treated as nuisance parameters and absorbed by the scoring function (e.g., estimated or integrated away).
 3.
While, in principle, the function \(T_S\) also depends on external factors such as the specific hardware on which the solver is run, we do not consider those factors in this work.
 4.
Note that counting the number of cycles in a given graph is, in terms of computational complexity, presumably highly intractable, whereas SCC computation is achieved fast with wellknown polynomialtime algorithms.
 5.
The choice of preprocessor was not restricted.
 6.
In a preliminary version of this work (Malone et al. 2014), we also considered an earlier proposed branch-and-bound approach (de Campos and Ji 2011), which we found to be always dominated by ILP; therefore, we dropped it from consideration. Furthermore, the earlier proposed dynamic programming approach (Koivisto and Sood 2004) is clearly dominated by A\(^{*}\). We have also discarded some parameterizations of both ILP and A\(^{*}\)-based solvers that were found to be uncompetitive.
 7.
For corroborating evidence on this, see, e.g., empirical data provided (Cussens 2013) for different parameterizations and versions of GOBNILP.
 8.
The main motivations for including both more real and, on the other hand, synthetic datasets in the study are twofold: (i) we aimed at a notably heterogeneous set of benchmarks for the study, yielding insights into the prediction task on a wide range of datasets with different properties; and (ii) the three-way categorization has analogies in the benchmark categorization used in the SAT domain (Järvisalo et al. 2012).
 9.
In our experiments, the results were not very sensitive to the scoring function, except through its effect on the number of candidate parent sets and other features, so our results can generalize to other decomposable scores as well.
 10.
This is in line with related work on portfolio construction in other domains such as SAT (Hutter et al. 2014) as well as the SAT Competitions where a similar criterion is used to filter out “too easy” instances from the competition benchmark sets (Balint et al. 2015). Solver selection for very easy instances is trivial, as any choice of a solver is essentially a good one.
 11.
The benchmark set used was not filtered based on probing results.
 12.
In particular, we use an updated version recommended by the author, https://github.com/mlindauer/AutoFolio.
 13.
We thank an anonymous reviewer for proposing this comparison with AutoFolio.
 14.
The AutoFolio implementation includes a presolving component (Hoos et al. 2015). We disabled that feature for the purposes of this comparison in order to strictly consider how well the models capture solver behavior; however, a similar strategy could be used to include a presolver in the autosklearn-based approach as well.
 15.
\(R^2\) ranges from 0 to 1, where 0 indicates that the feature is completely uninformative about runtime, and 1 indicates that all of the variance in runtime is explained by the respective feature.
 16.
Notes
Acknowledgements
The authors thank James Cussens for discussions on GOBNILP and the anonymous reviewers for valuable suggestions that helped improve the manuscript. This work is supported by Academy of Finland, Grants #125637, #251170 (COIN Centre of Excellence in Computational Inference Research), #255675, #276412, and #284591; Finnish Funding Agency for Technology and Innovation (Project D2I); and Research Funds of the University of Helsinki.
References
 Achterberg, T. (2009). SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1(1), 1–41.
 Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml
 Balint, A., Belov, A., Järvisalo, M., & Sinz, C. (2015). Overview and analysis of the SAT Challenge 2012 solver competition. Artificial Intelligence, 223, 120–155.
 Bartlett, M., & Cussens, J. (2015). Integer linear programming for the Bayesian network structure learning problem. Artificial Intelligence, 244, 258–271.
 Berg, J., Järvisalo, M., & Malone, B. (2014). Learning optimal bounded treewidth Bayesian networks via maximum satisfiability. In Proceedings of the 17th international conference on artificial intelligence and statistics (AISTATS 2014), JMLR workshop and conference proceedings (Vol. 33, pp. 86–95). JMLR.
 Bielza, C., & Larrañaga, P. (2014). Discrete Bayesian network classifiers: A survey. ACM Computing Surveys, 47(1), 5:1–5:43.
 Bischl, B., Kerschke, P., Kotthoff, L., Lindauer, M. T., Malitsky, Y., Fréchette, A., et al. (2016). ASlib: A benchmark library for algorithm selection. Artificial Intelligence, 237, 41–58. https://doi.org/10.1016/j.artint.2016.04.003.
 Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
 Buntine, W. (1991). Theory refinement on Bayesian networks. In Proceedings of the 7th conference on uncertainty in artificial intelligence (UAI 1991) (pp. 52–60). Morgan Kaufmann Publishers Inc.
 Carbonell, J., Etzioni, O., Gil, Y., Joseph, R., Knoblock, C., Minton, S., et al. (1991). Prodigy: An integrated architecture for planning and learning. SIGART Bulletin, 2, 51–55.
 Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1–2), 43–90.
 Chickering, D. (1996). Learning Bayesian networks is NP-complete. In D. Fisher & H.-J. Lenz (Eds.), Learning from data: Artificial intelligence and statistics (Vol. V, pp. 121–130). New York: Springer.
 Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
 Cussens, J. (2011). Bayesian network learning with cutting planes. In Proceedings of the 27th conference on uncertainty in artificial intelligence (UAI 2011) (pp. 153–160). AUAI Press.
 Cussens, J. (2013). Advances in Bayesian network learning using integer programming. In Proceedings of the 29th conference on uncertainty in artificial intelligence (UAI 2013) (pp. 182–191). AUAI Press.
 de Campos, C., & Ji, Q. (2011). Efficient learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12, 663–689.
 Fan, X., Malone, B., & Yuan, C. (2014). Finding optimal Bayesian network structures with constraints learned from data. In Proceedings of the 30th conference on uncertainty in artificial intelligence (UAI 2014) (pp. 200–209). AUAI Press.
 Fan, X., & Yuan, C. (2015). An improved lower bound for Bayesian network structure learning. In Proceedings of the 29th AAAI conference on artificial intelligence (AAAI 2015) (pp. 3526–3532). AAAI Press.
 Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 28, pp. 2962–2970). Curran Associates, Inc.
 Fink, E. (1998). How to solve it automatically: Selection among problem-solving methods. In Proceedings of the 4th international conference on artificial intelligence planning systems (AIPS 1998) (pp. 126–136). AAAI Press.
 Fréchette, A., Kotthoff, L., Michalak, T. P., Rahwan, T., Hoos, H. H., & Leyton-Brown, K. (2016). Using the Shapley value to analyze algorithm portfolios. In D. Schuurmans & M. P. Wellman (Eds.), Proceedings of the 30th AAAI conference on artificial intelligence (pp. 3397–3403). AAAI Press.
 Friedman, N., & Koller, D. (2003). Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50, 95–125.
 Gebruers, C., Hnich, B., Bridge, D. G., & Freuder, E. C. (2005). Using CBR to select solution strategies in constraint programming. In Proceedings of the 6th international conference on case-based reasoning (ICCBR 2005), lecture notes in computer science (Vol. 3620, pp. 222–236). Springer.
 Giraud-Carrier, C., Vilalta, R., & Brazdil, P. (2004). Introduction to the special issue on meta-learning. Machine Learning, 54(3), 187–193.
 Gomes, C. P., & Selman, B. (2001). Algorithm portfolios. Artificial Intelligence, 126(1–2), 43–62.
 Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
 Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.zbMATHGoogle Scholar
 Hoos, H., Kaminski, R., Lindauer, M., & Schaub, T. (2015). aspeed: Solver scheduling via answer set programming. Theory and Practice of Logic Programming, 15(1), 117–142.CrossRefzbMATHGoogle Scholar
 Hoos, H., Lindauer, M. T., & Schaub, T. (2014). claspfolio 2: Advances in algorithm selection for answer set programming. Theory and Practice of Logic Programming, 14(4–5), 569–585.CrossRefzbMATHGoogle Scholar
 Horvitz, E., Ruan, Y., Gomes, C. P., Kautz, H. A., Selman, B., & Chickering, D. M. (2001). A Bayesian approach to tackling hard computational problems. In Proceedings of the 17th conference on uncertainty in artificial intelligence (UAI 2001) (pp. 235–244). Morgan Kaufmann.Google Scholar
 Hurley, B., Kotthoff, L., Malitsky, Y., & O’Sullivan, B. (2014) Proteus: A hierarchical portfolio of solvers and transformations. In Proceedings of the 11th international conference on integration of AI and OR techniques in constraint programming (CPAIOR 2014), lecture notes in computer science (Vol. 8451, pp. 301–317). Springer.Google Scholar
 Hutter, F., Hoos, H. H., & LeytonBrown, K. (2011). Sequential modelbased optimization for general algorithm configuration. In Selected papers of the 5th international conference on learning and intelligent optimization (LION 5), lecture notes in computer science (Vol. 6683, pp. 507–523). Springer.Google Scholar
 Hutter, F., Xu, L., Hoos, H. H., & LeytonBrown, K. (2014). Algorithm runtime prediction: Methods and evaluation. Artificial Intelligence, 206, 79–111.MathSciNetCrossRefzbMATHGoogle Scholar
 Jaakkola, T. S., Sontag, D., Globerson, A., & Meila, M. (2010). Learning Bayesian network structure using LP relaxations. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (AISTATS 2010), JMLR proceedings (Vol. 9, pp. 358–365). JMLR.org.Google Scholar
 Järvisalo, M., Le Berre, D., Roussel, O., & Simon, L. (2012). The international SAT solver competitions. AI Magazine, 33(1), 89–92.CrossRefGoogle Scholar
 Koivisto, M., & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5, 549–573.MathSciNetzbMATHGoogle Scholar
 Kontkanen, P., & Myllymäki, P. (2007). MDL histogram density estimation. In Proceedings of the eleventh international conference on artificial intelligence and statistics (AISTATS 2007), JMLR proceedings (Vol. 2, pp. 219–226). JMLR.org.Google Scholar
 Kotthoff, L. (2014). Algorithm selection for combinatorial search problems: A survey. AI Magazine, 35(3), 48–60.
 Kotthoff, L., Gent, I. P., & Miguel, I. (2012). An evaluation of machine learning in algorithm selection for search problems. AI Communications, 25(3), 257–270.
 Kotthoff, L., Kerschke, P., Hoos, H., & Trautmann, H. (2015). Improving the state of the art in inexact TSP solving using per-instance algorithm selection. In Revised selected papers of the 9th international conference on learning and intelligent optimization (LION 9), lecture notes in computer science (Vol. 8994, pp. 202–217). Springer.
 Lee, J. W., & Giraud-Carrier, C. G. (2008). Predicting algorithm accuracy with a small set of effective meta-features. In Proceedings of the 7th international conference on machine learning and applications (IEEE ICMLA 2008) (pp. 808–812). IEEE Computer Society.
 Leite, R., Brazdil, P., & Vanschoren, J. (2012). Selecting classification algorithms with active testing. In Proceedings of the 8th international conference on machine learning and data mining in pattern recognition (MLDM 2012), lecture notes in computer science (Vol. 7376, pp. 117–131). Springer.
 Leyton-Brown, K., Hoos, H. H., Hutter, F., & Xu, L. (2014). Understanding the empirical hardness of NP-complete problems. Communications of the ACM, 57(5), 98–107.
 Leyton-Brown, K., Nudelman, E., & Shoham, Y. (2002). Learning the empirical hardness of optimization problems: The case of combinatorial auctions. In Proceedings of the 8th international conference on principles and practice of constraint programming (CP 2002), lecture notes in computer science (Vol. 2470, pp. 556–572). Springer.
 Leyton-Brown, K., Nudelman, E., & Shoham, Y. (2009). Empirical hardness models: Methodology and a case study on combinatorial auctions. Journal of the ACM. https://doi.org/10.1145/1538902.1538906.
 Lindauer, M. T., Hoos, H. H., Hutter, F., & Schaub, T. (2015). AutoFolio: An automatically configured algorithm selector. Journal of Artificial Intelligence Research, 53, 745–778.
 Lobjois, L., & Lemaître, M. (1998). Branch and bound algorithm selection by performance prediction. In Proceedings of the 15th national conference on artificial intelligence (AAAI 1998) (pp. 353–358). AAAI Press.
 Madigan, D., & York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63, 215–232.
 Malone, B., Järvisalo, M., & Myllymäki, P. (2015). Impact of learning strategies on the quality of Bayesian networks: An empirical evaluation. In Proceedings of the 31st conference on uncertainty in artificial intelligence (UAI 2015) (pp. 362–371). AUAI Press.
 Malone, B., Kangas, K., Järvisalo, M., Koivisto, M., & Myllymäki, P. (2014). Predicting the hardness of learning Bayesian networks. In Proceedings of the 28th AAAI conference on artificial intelligence (AAAI 2014) (pp. 2460–2466). AAAI Press.
 Malone, B. M., & Yuan, C. (2013). Evaluating anytime algorithms for learning optimal Bayesian networks. In Proceedings of the 29th conference on uncertainty in artificial intelligence (UAI 2013). AUAI Press.
 Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for small gene networks. In Proceedings of the Pacific symposium on biocomputing 2004 (pp. 557–567). World Scientific.
 Parviainen, P., & Koivisto, M. (2013). Finding optimal Bayesian networks using precedence constraints. Journal of Machine Learning Research, 14, 1387–1415.
 Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Burlington: Morgan Kaufmann.
 Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, 9, 2251–2286.
 Pulina, L., & Tacchella, A. (2008). Treewidth: A useful marker of empirical hardness in quantified Boolean logic encodings. In Proceedings of the 15th international conference on logic for programming, artificial intelligence, and reasoning (LPAR 2008), lecture notes in computer science (Vol. 5330, pp. 528–542). Springer.
 Rice, J. (1976). The algorithm selection problem. Advances in Computers, 15, 65–118.
 van Rijn, J. N., Abdulrahman, S. M., Brazdil, P., & Vanschoren, J. (2015). Fast algorithm selection using learning curves. In Proceedings of the 14th international symposium on advances in intelligent data analysis (IDA 2015), lecture notes in computer science (Vol. 9385, pp. 298–309). Springer.
 Saikko, P., Malone, B., & Järvisalo, M. (2015). MaxSAT-based cutting planes for learning graphical models. In Proceedings of the 12th international conference on integration of artificial intelligence and operations research techniques in constraint programming (CPAIOR 2015), lecture notes in computer science (Vol. 9075, pp. 345–354). Springer.
 Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2, 307–317.
 Silander, T., & Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd conference on uncertainty in artificial intelligence (UAI 2006) (pp. 445–452). AUAI Press.
 Singh, A., & Moore, A. (2005). Finding optimal Bayesian networks by dynamic programming. Technical report, Carnegie Mellon University.
 Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. The University of Kansas Science Bulletin, 38(2), 1409–1438.
 Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. New York: Springer.
 Tamada, Y., Imoto, S., & Miyano, S. (2011). Parallel algorithm for learning optimal Bayesian network structure. Journal of Machine Learning Research, 12, 2437–2459.
 Teyssier, M., & Koller, D. (2005). Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st conference on uncertainty in artificial intelligence (UAI 2005) (pp. 584–590). AUAI Press.
 van Beek, P., & Hoffmann, H. (2015). Machine learning of Bayesian networks using constraint programming. In Proceedings of the 21st international conference on principles and practice of constraint programming (CP 2015), lecture notes in computer science (Vol. 9255, pp. 429–445). Springer.
 Vanschoren, J., van Rijn, J. N., Bischl, B., & Torgo, L. (2013). OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2), 49–60.
 Wunderling, R. (1996). Paralleler und objektorientierter Simplexalgorithmus [Parallel and object-oriented simplex algorithm]. Ph.D. thesis, Technische Universität Berlin.
 Xu, L., Hutter, F., Hoos, H., & Leyton-Brown, K. (2008). SATzilla: Portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research, 32, 565–606.
 Yuan, C., & Malone, B. (2012). An improved admissible heuristic for finding optimal Bayesian networks. In Proceedings of the 28th conference on uncertainty in artificial intelligence (UAI 2012) (pp. 924–933). AUAI Press.
 Yuan, C., & Malone, B. (2013). Learning optimal Bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48, 23–65.