Abstract
Benchmarking is a crucial phase when developing algorithms. This also applies to solvers for the SAT (propositional satisfiability) problem. Benchmark selection is about choosing representative problem instances that reliably discriminate solvers based on their runtime. In this paper, we present a dynamic benchmark selection approach based on active learning. Our approach predicts the rank of a new solver among its competitors with minimum runtime and maximum rank prediction accuracy. We evaluated this approach on the Anniversary Track dataset from the 2022 SAT Competition. Our selection approach can predict the rank of a new solver after about 10 % of the time it would take to run the solver on all instances of this dataset, with a prediction accuracy of about 92 %. We also discuss the importance of instance families in the selection process. Overall, our tool provides a reliable way for solver engineers to determine a new solver’s performance efficiently.
Keywords
- Propositional satisfiability
- Benchmarking
- Active learning
1 Introduction
One of the main phases of algorithm engineering is benchmarking. This also applies to propositional satisfiability (SAT), the archetypal \(\mathcal{N}\mathcal{P}\)-complete problem. Benchmarking is, however, quite expensive in terms of experiment runtime. While benchmarking a single SAT solver might still be feasible, developing new, competitive SAT solvers requires extensive experimentation with a variety of ideas [2, 8]. In particular, a new solver idea is rarely best on the first try. Thus, it is highly desirable to reduce benchmarking time and discard unpromising ideas early, which allows testing more approaches or spending more time on promising ones. The field of SAT solver benchmarking is well established, but traditional benchmark selection approaches do not optimize benchmark runtime. Instead, they focus on selecting a representative set of instances for scoring solvers [10, 15]. For the latter, SAT Competitions typically employ the PAR-2 score, i.e., the average runtime with a penalty of \(2 \tau \) for timeouts with time-limit \(\tau \) [8].
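To make the scoring concrete, the following is a minimal Python sketch of the PAR-2 score as just described; the function name, the list-based input, and the example values are illustrative choices, not part of the competition tooling.

```python
def par2(runtimes, tau):
    """PAR-2 score: mean runtime, with every timeout counted as 2 * tau.

    `runtimes` holds one measured runtime per instance; timeouts are
    recorded as tau (or anything >= tau).
    """
    return sum(2 * tau if t >= tau else t for t in runtimes) / len(runtimes)

# Example: par2([100.0, 2500.0, 5000.0], tau=5000.0) treats the last run as a
# timeout and yields (100 + 2500 + 10000) / 3 = 4200.
```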
In this paper, we present a novel benchmark selection approach based on active learning. Our approach can predict the rank of a new solver with high accuracy in only a fraction of the time needed to evaluate the complete benchmark. Definition 1 specifies the problem we address.
Definition 1 (New-Solver Problem)
Given solvers \(\mathcal {A}\), instances \(\mathcal {I}\), runtimes \(r\!: \mathcal {A} \times \mathcal {I} \rightarrow \left[ 0, \tau \right] \) with time-limit \(\tau \), and a new solver \(\hat{a} \notin \mathcal {A}\), incrementally select benchmark instances from \(\mathcal {I}\) to maximize the confidence in predicting the rank of \(\hat{a}\) while minimizing the total benchmark runtime.
Note that our scenario assumes knowing the runtimes of all solvers, except the new one, on all instances. One could also imagine a collaborative filtering scenario, where runtimes are only partially known [23, 25].
Our approach satisfies several desirable criteria for benchmarking: Rather than outputting a binary classification, i.e., whether the new solver is worse than an existing solver or not, we provide a scoring function that shows by which margin a solver is worse and how similar it is to existing solvers. In particular, our approach enables ranking the new solver amidst a set of existing solvers. For this ranking, we do not even need to predict exact solver runtimes, which is trickier. Further, we optimize the runtime that our strategy needs to arrive at its conclusion. We use instance and runtime features. Moreover, we select instances non-randomly and incrementally. In particular, we consider runtime information from already done experiments when choosing the next. By doing so, we can control the properties of the benchmarking approach, such as its required runtime. Our approach is scalable in that it ranks a new solver \(\hat{a}\) among any number of known solvers \(\mathcal {A}\). In particular, we only subsample the benchmark once instead of comparing pairwise against each other solver [21].
We evaluate our approach with the SAT Competition 2022 Anniversary Track dataset [2], consisting of 5355 instances and runtimes of 28 solvers. We perform cross-validation by treating each solver once as the new solver and learning to predict the PAR-2 rank of that solver. On average, our predictions reach about 92 % accuracy with only about 10 % of the runtime required to evaluate these solvers on the complete set of instances.
Our entire source code and experimental data are available on GitHub.
2 Related Work
Benchmarking is not only of high interest in many fields but also an active research area on its own. Recent studies show that benchmark selection is challenging for multiple reasons. Biased benchmarks can easily lead to fallacious interpretations [7]. Benchmarking also has many interchangeable parts, such as the performance measures used, how measurement points are aggregated, and how missing values are handled. Questionable research practices could alter these elements a-posteriori to meet expectations, thereby skewing the results [27]. In the following, we discuss related work from the areas of static benchmark selection, algorithm configuration, incremental benchmark selection, and active learning. Table 1 compares the most relevant approaches, which all pursue slightly different goals. Thus, our approach is not a general improvement over the others but the only one fully aligned with Definition 1.
Static Benchmark Selection. Benchmark selection is essential for competitions, e.g., the SAT Competition. In such competitions, the organizers define the rules for composing the benchmarks. These selection strategies are primarily static, i.e., they do not depend on particular solvers to distinguish. Balint et al. provide an overview of benchmark-selection criteria in different solver competitions [1]. Froleyks et al. describe benchmark selection in recent SAT competitions [8]. Manthey and Möhle find that competition benchmarks might contain redundant instances and propose a feature-based approach to remove redundancy [20]. Mısır presents a feature-based approach to reduce benchmarks by matrix factorization and clustering [24].
Hoos et al. [15] discuss which properties are most desirable when selecting SAT benchmark instances. The selection criteria are instance variety to avoid over-fitting, adapted instance hardness (not too easy but also not too hard), and avoiding duplicate instances. To filter too similar instances, they use a distance-based approach with the SATzilla features [37, 38]. The approach does, however, not optimize for benchmark runtime and selects instances randomly, apart from constraints on the instance hardness and feature distance.
Algorithm Configuration. Further related work can be found within the field of algorithm configuration [14, 32], e.g., the configuration system SMAC [16]. There, the goal is to tune SAT solvers for a given sub-domain of problem instances. Although this task is different from our goal, e.g., we do not need to navigate the configuration space, there are similarities to our approach as well. For example, SMAC also employs an iterative, model-based selection procedure, though for configurations rather than instances. An algorithm configurator, however, cannot be used to rank or score a new solver since algorithm configuration solely seeks to find the best-performing configuration. Also, while SMAC uses a model-based strategy to sample configurations, it selects instances randomly, i.e., without building a model over instances.
Incremental Benchmark Selection. Matricon et al. present an incremental benchmark selection approach [21]. Their per-set efficient algorithm selection problem (PSEAS) is similar to our New-Solver Problem (cf. Definition 1). Given a pair of SAT solvers, they iteratively select a subset of instances until the desired confidence level is reached to decide which of the two solvers is better. The selection of instances depends on the choice of the solvers to distinguish. They calculate a scoring metric for all unselected instances, run the experiment for the instance with the highest score, and update the confidence. Their approach ticks off most of our desired features in Table 1. However, the approach only compares solvers binarily rather than providing a scoring. Thus, it is unclear how similar two given solvers are or on which instances they behave similarly. Moreover, a significant shortcoming is its lack of scalability in the number of solvers: since only pairs of solvers are compared, evaluating a new solver requires sampling a separate benchmark for each existing solver. In contrast, our approach allows comparing a new solver against a set of existing solvers by sampling only one benchmark.
Active Learning. Prediction models in passive machine learning are trained on datasets with given instance labels (cf. Fig. 1a). In contrast, active learning (AL) starts with no or little labeled data. It repeatedly selects interesting problem instances for which to acquire labels, aiming to gradually improve the prediction model (cf. Fig. 1b). AL methods are especially beneficial if acquiring labels is computationally expensive, like obtaining solver runtimes. Without AL methods, it is not obvious which instances to label and which not. On the one hand, we want to maximize the utility an instance provides to our model, i.e., rank prediction accuracy, and on the other hand, minimize the cost, i.e., predicted runtime, associated with the instance’s acquisition. Thus, we strive for an accurate prediction model without having to label every data point.
Fig. 1. Types of machine learning (depiction inspired by Rubens et al. [29]).
Rubens et al. [29] survey active-learning advances. While synthesis-based AL methods [5, 9, 34] generate instances for labeling, pool-based methods [11, 13, 19] rely on a fixed set of unlabeled instances to sample from. Recent synthesis-based methods within the field of SAT solving show how to generate problem instances with desired properties [5, 9]. This goal is, however, orthogonal to ours. While those approaches want to generate instances on which a solver is good or bad, we want to predict whether a solver is good or bad on an existing benchmark. Volpato and Song use pool-based AL to learn an instance-specific algorithm selector [35]. Rather than benchmarking a solver’s overall performance, their goal is to recommend the best solver out of a set of solvers for each SAT instance.
3 Active Learning for SAT Solver Benchmarking
Algorithm 1 outlines our benchmarking framework. Given a set of solvers \(\mathcal {A}\), instances \(\mathcal {I}\) and runtimes r, we first initialize a prediction model \(\mathcal {M}\) for the new solver \(\hat{a} \not \in \mathcal {A}\) (Line 1). The prediction model \(\mathcal {M}\) is used to repeatedly select an instance (Line 4) for benchmarking \(\hat{a}\) (Line 5). The acquired result is subsequently used to update the prediction model \(\mathcal {M}\) (Line 7). When the stopping criterion is met (Line 3), we quit the benchmarking loop and predict the final score of \(\hat{a}\) (Line 8). Algorithm 1 returns the predicted score of \(\hat{a}\) as well as the acquired instances and runtime measurements (Line 9).
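For illustration, the following Python sketch mirrors the structure of Algorithm 1 as described above; the helper names (select_instance, should_stop, run_solver, predict_score) are placeholders for the sub-routines discussed in Sections 3.1 to 3.3, not the identifiers used in our implementation.

```python
def benchmark_new_solver(new_solver, instances, known_runtimes,
                         model, select_instance, should_stop, run_solver):
    """Active-learning benchmarking loop (sketch of Algorithm 1).

    known_runtimes: runtimes r(a, e) of the existing solvers.
    run_solver(new_solver, e): executes the expensive runtime experiment.
    """
    sampled = {}                                   # acquired runtimes of the new solver
    model.fit(known_runtimes, sampled)             # Line 1: initialize prediction model
    while not should_stop(model, sampled):         # Line 3: stopping criterion
        e = select_instance(model, instances, sampled)   # Line 4: pick the next instance
        sampled[e] = run_solver(new_solver, e)           # Line 5: run the experiment
        model.fit(known_runtimes, sampled)         # Line 7: update the prediction model
    score = model.predict_score(new_solver)        # Line 8: predicted score of the new solver
    return score, sampled                          # Line 9: score plus acquired measurements
```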
Section 3.1 describes the underlying prediction model \(\mathcal {M}\) and specifies how we may derive a solver ranking from it. We discuss criteria for selecting instances in Section 3.2. Section 3.3 concludes with possible stopping conditions.

3.1 Solver Model
The model \(\mathcal {M}\) provides a runtime-label prediction function \(f : \mathcal {\hat{A}} \times \mathcal {I} \rightarrow \mathbb {R}\) for all solvers \(\mathcal {\hat{A}} := \mathcal {A} \cup \lbrace \hat{a} \rbrace \). This prediction function powers instance selection as described in Section 3.2. During model updates (Algorithm 1, Line 7), f is trained to predict a transformed version of the acquired runtimes \(\mathcal {R}\). We describe the runtime transformation in the subsequent section. The features described in Section 4.2 serve as the input to the model. Further, note that we build a new prediction model in each iteration since running experiments (Line 5) dominates the runtime of model training by orders of magnitude. Finally, we predict the score of the new solver \(\hat{a}\) with the prediction function f (Line 8).
Runtime Transformation. For the prediction model \(\mathcal {M}\), we transform the real-valued runtimes into discrete runtime labels on a per-instance basis. For each instance \(e \in \mathcal {I}\), we use a clustering algorithm to assign the runtimes in \(\bigl \{ r(a, e) \mid a \in \mathcal {A} \bigr \}\) to one of k clusters \(C_1, \dots , C_k\) such that the fastest runtimes for the instance e are in cluster \(C_1\) and the slowest non-timeout runtimes are in cluster \(C_{k-1}\). Timeouts \(\tau \) always form a separate cluster \(C_{k}\). The runtime transformation function \(\gamma _k : {\mathcal {A} \times \mathcal {I}} \rightarrow \left\{ 1, \dots , k \right\} \) is then specified as follows:
\(\gamma _k(a, e) = j \quad \text {if } r(a, e) \in C_j\)
Given an instance \(e \in \mathcal {I}\), a solver \(a \in \mathcal {A}\) belongs to the \(\gamma _k(a, e)\)-fastest solvers on instance e. In preliminary experiments, we achieved higher accuracy for predicting such discrete runtime labels than for predicting raw runtimes. Research on portfolio solvers has also shown that discretization works well in practice [4, 26].
Ranking Solvers. To determine solver ranks, we use the transformed runtimes \(\gamma _k(a, e)\) in the adapted scoring function \(s_k : \mathcal {A} \rightarrow [1, 2 \cdot k]\) as follows:
\(s_k(a) = \frac{1}{|\mathcal {I}|} \sum _{e \in \mathcal {I}} \begin{cases} 2 \cdot k & \text {if } \gamma _k(a, e) = k \text { (timeout)} \\ \gamma _k(a, e) & \text {otherwise} \end{cases}\)
I.e., we apply PAR-2 scoring, which is commonly used in SAT competitions [8], on the discrete labels. The scoring function \(s_k\) induces a ranking among solvers.
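As a small worked example, the following sketch computes \(s_k\) over a list of discrete labels for one solver; the function name and the plain-list input are illustrative.

```python
def s_k(labels, k):
    """Adapted PAR-2-style score over discrete runtime labels gamma_k(a, e).

    `labels` holds one label in {1, ..., k} per instance; label k marks a
    timeout and is penalized with 2 * k, analogous to PAR-2 scoring.
    """
    return sum(2 * k if label == k else label for label in labels) / len(labels)

# Example: with k = 3, s_k([1, 2, 3, 1], 3) = (1 + 2 + 6 + 1) / 4 = 2.5.
# Ranking: sort solvers by their s_k value (lower is better).
```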
3.2 Instance Selection
Selecting an instance based on the model is a core functionality of our framework (cf. Algorithm 1, Line 4). In this section, we introduce two instance sampling strategies, one that minimizes uncertainty and one that maximizes information gain. Both strategies use the model’s label-prediction function f and are inspired by existing work within the realms of active learning [30]. These methods require the model’s predictions to include probabilities for the k discrete runtime labels. Let \(f' : \mathcal {\hat{A}} \times \mathcal {I} \rightarrow \left[ 0, 1\right] ^k\) denote this modified prediction function. In the following, the set \(\tilde{\mathcal {I}} \subseteq \mathcal {I}\) denotes the instances that have already been sampled.
Uncertainty Sampling. The uncertainty sampling strategy selects the instance closest to the model’s decision boundary, i.e., we select the instance \(e \in \mathcal {I} \setminus \tilde{\mathcal {I}}\) that minimizes U(e), which is specified as follows:
Information-Gain Sampling. The information-gain sampling strategy selects the instance with the highest expected entropy reduction regarding the runtime labels of the instance. To be more specific, we select the instance \(e \in \mathcal {I} \setminus \tilde{\mathcal {I}}\) that maximizes IG(e), which is specified as follows:
Here, \(\textrm{H}(e)\) denotes the entropy of the runtime labels \(\gamma (a, e)\) over all \(a \in \mathcal {A}\) and \(\textrm{H}(e, n)\) denotes the entropy of these labels plus n as the runtime label for \(\hat{a}\). The term \(\textrm{H}(e, n)\) is computed for every possible runtime label \(n \in \{1, \dots , k\}\). By maximizing information gain, we select instances that identify solvers with similar behavior.
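Since both criteria build on standard active-learning quantities, the following sketch shows one plausible instantiation consistent with the description above (a margin-based uncertainty measure and the expected entropy reduction); the helper names are illustrative, and the exact definitions in our implementation may differ in detail.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a multiset of runtime labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def uncertainty(probs):
    """Margin between the two most probable labels of the new solver.

    Smaller values indicate an instance closer to the decision boundary, so
    uncertainty sampling picks the instance minimizing this margin.
    """
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def information_gain(existing_labels, probs):
    """Expected entropy reduction when the new solver's label is revealed.

    existing_labels: list of labels gamma(a, e) of the known solvers on e.
    probs: predicted label distribution f'(a_hat, e) over the k labels.
    """
    h_before = entropy(existing_labels)
    h_after = sum(p * entropy(existing_labels + [n + 1])
                  for n, p in enumerate(probs))
    return h_before - h_after
```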
3.3 Stopping Criteria
In this section, we present the two dynamic stopping criteria in our experiments, the Wilcoxon and the ranking stopping criterion (cf. Algorithm 1, Line 3).
Wilcoxon Stopping Criterion. The Wilcoxon stopping criterion stops the active-learning process when we are confident enough that the predicted runtime labels of the new solver are sufficiently different from existing solvers. This criterion is loosely inspired by Matricon et al. [21]. We use the average p-value \(W_{\hat{a}}\) of a Wilcoxon signed-rank test \(\textrm{w}(S,P)\) of the two runtime label distributions \(S=\{ \gamma (a, e) \mid e \in \mathcal {I} \}\) for an existing solver a and \(P=\{ f(\hat{a}, e) \mid e \in \mathcal {I} \}\) for the new solver \(\hat{a}\):
\(W_{\hat{a}} = \frac{1}{|\mathcal {A}|} \sum _{a \in \mathcal {A}} \textrm{w}\bigl ( \{ \gamma (a, e) \mid e \in \mathcal {I} \},\ \{ f(\hat{a}, e) \mid e \in \mathcal {I} \} \bigr )\)
To improve the stability of this criterion, we use an exponential moving average to smooth out outliers and stop as soon as \(W^{(i)}_{\exp }\) drops below a fixed threshold:
\(W^{(i)}_{\exp } = \beta \cdot W^{(i)}_{\hat{a}} + (1 - \beta ) \cdot W^{(i-1)}_{\exp }\)
Ranking Stopping Criterion. The ranking stopping criterion is less sophisticated in comparison. It stops the active-learning process if the ranking induced by the model’s predictions (Equation 1) remains unchanged within the last l iterations. However, the concrete values of the predicted score \(s_{\hat{a}}\) might still change; we are solely interested in the induced ranking in this case.
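A compact sketch of both stopping criteria follows, assuming SciPy's wilcoxon for the signed-rank test; variable names and the bookkeeping around the moving average are illustrative, and identical label vectors would need special handling in practice.

```python
from scipy.stats import wilcoxon

def wilcoxon_criterion(pred_labels_new, labels_existing, w_prev,
                       beta=0.1, threshold=0.05):
    """Wilcoxon stopping criterion (sketch).

    pred_labels_new: predicted labels f(a_hat, e) for all instances e.
    labels_existing: dict solver -> labels gamma(a, e), in the same order.
    w_prev: previous smoothed value (None in the first iteration).
    Returns (stop, w_smoothed).
    """
    p_values = [wilcoxon(labels, pred_labels_new).pvalue
                for labels in labels_existing.values()]
    w = sum(p_values) / len(p_values)              # average p-value W_a_hat
    w_smoothed = w if w_prev is None else beta * w + (1 - beta) * w_prev
    return w_smoothed < threshold, w_smoothed

def ranking_criterion(ranking_history, l):
    """Stop once the predicted ranking has not changed in the last l iterations."""
    recent = ranking_history[-(l + 1):]
    return len(recent) == l + 1 and all(r == recent[0] for r in recent)
```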
4 Experimental Design
Given all the previously presented instantiations for Algorithm 1, this section outlines our experimental design, including our evaluation framework, used data sets, hyper-parameter choices, and implementation details.
4.1 Evaluation Framework

As stated in the Introduction, this work addresses the New-Solver Problem (cf. Definition 1). As described in Section 3.1, a prediction model \(\mathcal {M}\) provides us with an estimated scoring \(s_{\hat{a}}\) for the new solver \(\hat{a}\).
To evaluate a concrete instantiation of Algorithm 1, i.e., a concrete choice for all the sub-routines, we perform cross-validation on our set of solvers. Algorithm 2 shows this. That means each solver plays the role of the new solver \(\hat{a}\) once (Line 2). Note that the new solver in each iteration is excluded from the set of solvers \(\mathcal {A}\) to avoid data leakage (Line 3). After running our active-learning framework for solver \(\hat{a}\) (Line 4), we compute the value of both our optimization goals, i.e., ranking accuracy and runtime. We define the ranking accuracy \(O_{\textrm{acc}} \in \left[ 0, 1\right] \) (higher is better) by the fraction of pairs \(\left( \hat{a}, a\right) \) for all \(a \in \mathcal {A}\) that are decided correctly regarding the ground-truth scoring \(\textrm{par}_{2}\) (Lines 5-8). The fraction of runtime that the algorithm needs to arrive at its conclusion is denoted by \(O_{\textrm{rt}} \in \left[ 0, 1\right] \) (lower is better). This metric puts the runtime summed over the sampled instances in relation to the runtime summed over all instances in the dataset (Lines 9-13). Finally, we compute averages of the output metrics in Line 15 after we have collected all cross-validation results in Line 14. Overall, we want to find an approach that maximizes
\(O_{\delta } := \delta \cdot O_{\textrm{acc}} + (1 - \delta ) \cdot (1 - O_{\textrm{rt}})\)
where \(\delta \in \left[ 0, 1\right] \) allows for linear weighting between the two optimization goals \(O_{\textrm{acc}}\) and \(O_{\textrm{rt}}\). Plotting the approaches that maximize \(O_\delta \) for all \(\delta \in \left[ 0, 1\right] \) on an \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\)-diagram provides us with a Pareto front of the best approaches for different optimization-goal weightings.
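The following sketch outlines this cross-validation loop in Python; run_framework, par2, and the data structures are placeholders for the respective parts of Algorithm 1 and the ground-truth scoring, not the identifiers from our implementation.

```python
def evaluate_configuration(solvers, instances, runtimes, run_framework, par2):
    """Leave-one-solver-out cross-validation (sketch of Algorithm 2).

    run_framework(new, known): runs Algorithm 1 and returns the predicted score
    of `new`, the scores of the known solvers, and the sampled instances.
    par2(a): ground-truth PAR-2 score of solver a over all instances.
    """
    accs, rts = [], []
    for new in solvers:                             # each solver is "new" once
        known = [a for a in solvers if a != new]    # exclude it to avoid data leakage
        pred_new, pred_known, sampled = run_framework(new, known)
        # O_acc: fraction of pairs (new, a) ordered as in the PAR-2 ground truth
        correct = sum((pred_new < pred_known[a]) == (par2(new) < par2(a))
                      for a in known)
        accs.append(correct / len(known))
        # O_rt: runtime of the sampled instances relative to the full benchmark
        rts.append(sum(runtimes[(new, e)] for e in sampled)
                   / sum(runtimes[(new, e)] for e in instances))
    return sum(accs) / len(accs), sum(rts) / len(rts)  # averages over all folds
```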
4.2 Data
In our experiments, we work with the dataset of the SAT Competition 2022 Anniversary Track [2]. The dataset consists of 5355 instances with respective runtime data of 28 sequential SAT solvers. We also use a database of 56 instance features from the Global Benchmark Database (GBD) by Iser et al. [17]. They comprise instance size features and node distribution statistics for several graph representations of SAT instances, among others, and are primarily inspired by the SATzilla 2012 features described in [38]. All features are numeric and free of missing values. We drop 10 out of 56 features because of zero variance. Overall, prediction models have access to 46 instance features and 27 runtime features, i.e., excluding the current new solver \(\hat{a}\).
Additionally, we retrieve instance-family information to evaluate the composition of our sampled benchmarks. Instance families comprise instances from the same application domain, e.g., planning, cryptography, etc., and are a valuable tool for analyzing solver performance.
For hyper-parameter tuning, we randomly sample 10 % of the complete set of 5355 instances with stratification regarding the instances’ family. All instance families that are too small, i.e., where 10 % of the family corresponds to less than one instance, are put into one meta-family for stratification. This tuning dataset allows for a more extensive exploration of the hyper-parameter space.
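One way to realize this stratified subsample with scikit-learn is sketched below; the variable names and the meta-family label are illustrative.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Assumed inputs: `instances` and `families` are aligned lists.
family_counts = Counter(families)
# Families where 10 % would be less than one instance go into one meta-family.
strata = [f if family_counts[f] >= 10 else "meta" for f in families]
tuning_instances, _ = train_test_split(instances, train_size=0.1,
                                        stratify=strata, random_state=0)
```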
4.3 Hyper-parameters
Given Algorithm 1, there are several possible instantiations for the three sub-routines, i.e., ranking, selection, and stopping. Also, there are different choices for the runtime-label prediction model and runtime discretization. We describe these experimental configurations in the following.
Ranking. Regarding ranking (cf. Section 3.1), we experiment with the following approaches and hyper-parameter values:
- Observed PAR-2 ranking of already sampled instances
- Predicted runtime-label ranking
  - History size: Consider the latest 1, 10, 20, 30, or 40 predictions within a voting approach for stability. The latest x predictions for each instance vote on the instance’s winning label.
  - Fallback threshold: If the difference of scores between the new solver \(\hat{a}\) and another solver drops below 0.01, 0.05, or 0.1, use the partially observed PAR-2 ranking as a tie-breaker.
Selection. For selection (cf. Section 3.2), we experiment with the following methods and hyper-parameter values. Since the potential runtime of experiments is orders of magnitude larger than the model’s update time, we only consider incrementing our benchmark by one instance at a time rather than using batches, as is also proposed in recent active-learning work [31, 34]. A drawback of this is that runtime experiments cannot be executed in parallel.
- Random sampling
- Uncertainty sampling
  - Fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.
  - Runtime scaling: Whether to normalize uncertainty scores per instance by the average runtime of solvers on it or use the absolute values.
- Information-gain sampling
  - Fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.
  - Runtime scaling: Whether to normalize information-gain scores per instance by the average runtime of solvers on it or use the absolute values.
Stopping. For stopping decisions (cf. Section 3.3), we experiment with the following criteria and hyper-parameter values:
- Subset-size stopping criterion, using 10 % or 20 % of instances
- Ranking stopping criterion
  - Minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.
  - Convergence duration: Stop if the predicted ranking stays the same for a number of sampled instances equal to 1 % or 2 % of all instances.
- Wilcoxon stopping criterion
  - Minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.
  - Average of p-values to drop below: 5 %.
  - Exponential moving average: Incorporate previous significance values by using an EMA with \(\beta = 0.1\) or \(\beta = 0.7\).
Prediction model. Our experiments only use one model configuration for runtime-label prediction since an exhaustive grid search would be infeasible. In preliminary experiments, we compared various model types from scikit-learn [28]. In particular, we conducted nested cross-validation, including hyper-parameter tuning, and used Matthews Correlation Coefficient [12, 22] to assess the performance for predicting runtime labels. Our final choice is a stacking ensemble [36] of two prediction models, a quadratic-discriminant analysis [33] and a random forest [3]. Both these models can learn non-linear relationships between the instance features and the runtime labels. Stacking means that another prediction model, in our case a simple decision tree, decides which of the two ensemble members makes the prediction on which instance.
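A minimal sketch of such an ensemble with scikit-learn's StackingClassifier is shown below; the hyper-parameter values are library defaults here, not the tuned settings from our experiments.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

# Stacking ensemble for runtime-label prediction: QDA and a random forest as
# base learners, with a simple decision tree as the combining (meta) model.
model = StackingClassifier(
    estimators=[
        ("qda", QuadraticDiscriminantAnalysis()),
        ("rf", RandomForestClassifier()),
    ],
    final_estimator=DecisionTreeClassifier(),
)
# model.fit(X, y) expects instance/runtime features X and discrete runtime
# labels y; model.predict_proba(X) yields the label probabilities f' used
# for instance selection.
```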
Runtime discretization. To define prediction targets, i.e., discrete runtime labels, we use hierarchical clustering with \(k = 3\) and a log-single-link criterion, which produced the most useful labels in preliminary experiments. We denote this adapted solver scoring function with \(s_3\). In our chosen hierarchical procedure, each non-timeout runtime starts in a separate interval. We then gradually merge intervals whose single-link logarithmic distance is the smallest until the desired number of partitions is reached. Other clustering approaches that we tried include hierarchical clustering with mean-, median-, and complete-link criterion, as well as k-means and spectral clustering.
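The following sketch illustrates this discretization for a single instance, assuming the label convention from Section 3.1 (non-timeout runtimes fill labels 1 to k-1, timeouts receive label k); the epsilon guard and the function name are illustrative, and the actual implementation may differ in detail.

```python
import math

def discretize_runtimes(runtimes, tau, k=3):
    """Assign labels 1..k to one instance's runtimes (sketch).

    Non-timeout runtimes are merged bottom-up in log-space (single link)
    until k - 1 intervals remain; timeouts always receive label k.
    Returns a dict mapping each runtime value to its label.
    """
    log = lambda t: math.log(max(t, 1e-6))           # guard against zero runtimes
    solved = sorted({t for t in runtimes if t < tau})
    clusters = [[t] for t in solved]                 # every runtime starts alone
    while len(clusters) > max(k - 1, 1):
        # single-link distance between neighboring intervals in log-space
        gaps = [log(clusters[i + 1][0]) - log(clusters[i][-1])
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    labels = {t: i + 1 for i, cluster in enumerate(clusters) for t in cluster}
    labels.update({t: k for t in runtimes if t >= tau})   # timeouts -> label k
    return labels
```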
To obtain useful labels, we need to ensure that discretized labels still discriminate solvers and align with the actual PAR-2 ranking. We analyzed the ranking induced by \(s_3\) in preliminary experiments with the SAT Competition 2022 Anniversary Track [2]. According to a Wilcoxon signed-rank test with \(\alpha = 0.05\), 87.83 % of solver pairs have significantly different scores after discretization, only a slight drop compared to 89.95 % before discretization. Further, our ranking approach correctly decides for almost all (about 97.45 %; \(\sigma = 3.68\,\%\)) solver pairs which solver is faster. In particular, the Spearman correlation of \(s_3\) and PAR-2 ranking is about 0.988, which is very close to the optimal value of 1 [6]. All these results show that discretized runtimes are suitable for our framework.
4.4 Implementation Details
For reproducibility, our source code and data are available on GitHub (cf. footnotes in Section 1). Our code is implemented in Python using scikit-learn [28] for making predictions and gbd-tools [17] for SAT-instance retrieval.
5 Evaluation
In this section, we evaluate our active-learning framework. First, we analyze and tune the different sub-routines of our framework on the tuning dataset. Next, we evaluate the best configurations with the full dataset. Finally, we analyze the importance of different instance families to our framework.
5.1 Hyper-Parameter Analysis
Fig. 2. \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\)-diagrams comparing different hyper-parameter instantiations of our active-learning framework on the hyper-parameter-tuning dataset. The x-axis shows the ratio of total solver runtime on the sampled instances relative to all instances. The y-axis shows the ranking accuracy (cf. Section 4.1). Each line entails the front of Pareto-optimal configurations for the respective hyper-parameter instantiation.
Our experiments follow the evaluation framework introduced in Section 4.1. Fig. 2 shows the performance of the approaches from Section 4.3 on \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\)-diagrams for the hyper-parameter-tuning dataset. Evaluating a particular configuration with Algorithm 2 returns a point \(\left( O_{\textrm{rt}},\, O_{\textrm{acc}}\right) \). We do not show intermediate results of the active-learning procedure but only the final results after stopping. The plotted lines represent the best-performing configurations per ranking approach (Fig. 2a), selection approach (Fig. 2b), and stopping criterion (Fig. 2c). In particular, each line is a Pareto front: among all configurations that share a particular value of the plotted hyper-parameter, we take the maximum ranking accuracy over all remaining hyper-parameters not displayed in the corresponding plot.
Regarding ranking approaches (Fig. 2a), using the predicted \(s_3\)-induced runtime-label ranking consistently outperforms the partially observed PAR-2 ranking for each possible value of the trade-off parameter \(\delta \). This outcome is expected since selection decisions are not random. For example, we might sample more instances of one family if it benefits discrimination of solvers. While the partially observed PAR-2 score is skewed, the prediction model can account for this.
Regarding the selection approaches (Fig. 2b), uncertainty sampling performs best in most cases. However, information-gain sampling is beneficial if runtime is strongly favored (small \(\delta \); runtime fraction less than 5 %). This result aligns with our expectations: Information-gain sampling selects instances that maximize the expected reduction in entropy. This means we sample instances revealing similarities between solvers rather than differences, which helps to build a confident model quickly. However, the method cannot select helpful instances for distinguishing solvers later. Random sampling performs reasonably well but is outperformed by uncertainty sampling in all cases, showing the benefit of actively selecting instances based on a prediction model.
Regarding the stopping criteria (Fig. 2c), the ranking stopping criterion performs most consistently well. If accuracy is strongly favored (very high \(\delta \)), the Wilcoxon stopping criterion performs better. The subset-size stopping criterion performs reasonably well but does not improve beyond a certain accuracy because of sampling a fixed subset of instances.
Fig. 3. Scatter plots comparing different instantiations of trade-off parameter \(\delta \) for our active-learning framework on the hyper-parameter-tuning dataset. The x-axis shows the fraction of runtime \(O_{\textrm{rt}}\) of the sample, while the y-axes show the fraction of instances sampled and ranking accuracy, respectively. The color indicates the weighting between different optimization goals \(\delta \in \left[ 0, 1\right] \). The larger \(\delta \), the more we favor accuracy over runtime.
Fig. 3a shows an interesting consequence of weighting our optimization goals: If we, on the one hand, desire to get a rough estimate of a solver’s performance fast (low \(\delta \)), approaches favor selecting many easy instances. In particular, the fraction of sampled instances is larger than the fraction of runtime. By having many observations, it is easier to build a model. If we, on the other hand, desire to get a good estimate of a solver’s performance in a moderate amount of time (high \(\delta \)), approaches favor selecting few, difficult instances. In particular, the fraction of instances is smaller than the fraction of runtime.
Furthermore, Fig. 3b reveals which values make the most sense for \(\delta \). The range \(\delta \in \left[ 0.2, 0.8\right] \) corresponds to the points with a runtime fraction between 0.03 and 0.22. We consider this region to be most promising, analogous to the elbow method in cluster analysis [18].
5.2 Full-Dataset Evaluation
Having selected the most promising hyper-parameters, we run our active-learning experiments on the complete Anniversary Track dataset (5355 instances). The aforementioned range \(\delta \in \left[ 0.2, 0.8\right] \) only results in two distinct configurations. The best-performing approach for \(\delta \in \left[ 0.2, 0.7\right] \) uses the predicted runtime-label ranking, information-gain sampling, and ranking stopping criterion. It can predict a new solver’s PAR-2 ranking with 90.48 % accuracy (\(O_{\textrm{acc}}\)) in only 5.41 % of the full evaluation time (\(O_{\textrm{rt}}\)). The best-performing approach for \(\delta \in (0.7, 0.8]\) uses the predicted runtime-label ranking, uncertainty sampling, and ranking stopping criterion. It can predict a new solver’s PAR-2 ranking with 92.33 % accuracy (\(O_{\textrm{acc}}\)) in only 10.35 % of the full evaluation time (\(O_{\textrm{rt}}\)).
Table 2 shows how both active-learning approaches (column AL) compare against two static baselines: The Random baseline samples instances until it reaches roughly the same fraction of runtime as the AL benchmark sets; we repeat this sampling 1000 times and report average results. The Most Freq. baseline uses a static benchmark set consisting of those instances most frequently sampled by our active-learning approach. In particular, we consider the average sampling frequency over all solvers and Pareto-optimal active-learning approaches.
Both our AL approaches perform better than random sampling. However, the performance differences are not significant regarding a Wilcoxon signed-rank test with \(\alpha = 0.05\) and also depend on the fraction of sampled runtime (cf. Fig. 2b). A clear advantage of our approach is, though, that it indicates when to stop adding further instances, depending on the trade-off parameter \(\delta \). While the active-learning results are less strong on the full dataset than on the smaller tuning dataset, they still show the benefit of making benchmark selection dependent on the solvers to distinguish.
A static benchmark using the most frequently AL-sampled instances performs poorly, though, compared to active learning and random sampling. This outcome is somewhat expected since the static benchmark does not reflect the right balance of instance families: Families whose instances are uniform-randomly selected by AL, e.g., for different solvers, appear less often in this benchmark than families where some instances are sampled more often than others.
5.3 Instance-Family Importance
Fig. 4. Scatter plot showing the importance of different instance families to our framework on the full dataset. The x-axis shows the frequency of instance families in the dataset. The y-axis shows the average frequency of instance families in the samples selected by active learning. The dashed line represents families that occur with the same frequency in the dataset and samples.
Selection decisions of our approach also reveal the importance of different instance families to our framework. Fig. 4 shows the occurrence of instance families within the dataset and the benchmarks created by active learning. We use the best-performing configurations for all \(\delta \in \left[ 0, 1\right] \) and examine the selection decisions by the active-learning approach on the SAT Competition 2022 Anniversary Track dataset [2]. While most families appear with the same fraction in the dataset and the sampled benchmarks, a few outliers need further discussion. Problem instances of the families fpga, quasigroup-completion, and planning are especially helpful to our framework in distinguishing solvers. Instances of these families are selected disproportionately often compared to their share of the full dataset. In contrast, instances of the largest family, i.e., hardware-verification, roughly appear with the same fraction in the dataset and the sampled benchmarks. Finally, instances of the family cryptography are less important in distinguishing solvers than their vast weight in the dataset suggests. A possible explanation is that these instances are very similar, such that a small fraction of them is sufficient to estimate a solver’s performance on all of them.
6 Conclusions and Future Work
In this work, we have addressed the New-Solver Problem: Given a new solver, we want to find its ranking amidst competitors. Our approach provides accurate ranking predictions while needing significantly less runtime than a complete evaluation on a given benchmark set. On data from the SAT Competition 2022 Anniversary Track, we can determine a new solver’s PAR-2 ranking with about 92 % accuracy while only needing 10 % of the full-evaluation time. We have evaluated several ranking algorithms, instance-selection approaches, and stopping criteria within our sequential active-learning framework. We also took a brief look at which instance families are the most prevalent in selection decisions.
Future work may compare further sub-routines for ranking, instance selection, and stopping. Additionally, one can apply our evaluation framework to arbitrary computation-intensive problems, e.g., other \(\mathcal{N}\mathcal{P}\)-complete problems than SAT, as all discussed active-learning methods are problem-agnostic. Such problems share most of the relevant properties of SAT solving, i.e., there are established instance features, a complete benchmark is expensive, and traditional benchmark selection requires expert knowledge.
From the technical perspective, one could formulate runtime discretization as an optimization problem rather than addressing it empirically. Further, a major shortcoming of our current approach is the lack of parallelization, selecting instances one at a time. Benchmarking on a computing cluster with n cores benefits from having batches of n instances. However, bigger batch sizes n impede active learning. Also, it is unclear how to synchronize instance selection and updates of the prediction model without wasting too much runtime.
References
Balint, A., Belov, A., Järvisalo, M., Sinz, C.: Overview and analysis of the SAT Challenge 2012 solver competition. Artif. Intell. 223, 120–155 (2015). https://doi.org/10.1016/j.artint.2015.01.002
Balyo, T., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.): Proceedings of SAT Competition 2022: Solver and Benchmark Descriptions. Department of Computer Science, University of Helsinki (2022), http://hdl.handle.net/10138/347211
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Collautti, M., Malitsky, Y., Mehta, D., O’Sullivan, B.: SNNAP: solver-based nearest neighbor for algorithm portfolios. In: Proc. ECML PKDD. pp. 435–450 (2013). https://doi.org/10.1007/978-3-642-40994-3_28
Dang, N., Akgün, Ö., Espasa, J., Miguel, I., Nightingale, P.: A framework for generating informative benchmark instances. In: Proc. CP. pp. 18:1–18:18 (2022). https://doi.org/10.4230/LIPIcs.CP.2022.18
De Winter, J.C.F., Gosling, S.D., Potter, J.: Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychol. Methods 21(3), 273–290 (2016). https://doi.org/10.1037/met0000079
Dehghani, M., Tay, Y., Gritsenko, A.A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D., Vinyals, O.: The benchmark lottery. arXiv:2107.07002 [cs.LG] (2021), https://arxiv.org/abs/2107.07002
Froleyks, N., Heule, M., Iser, M., Järvisalo, M., Suda, M.: SAT Competition 2020. Artif. Intell. 301 (2021). https://doi.org/10.1016/j.artint.2021.103572
Garzón, I., Mesejo, P., Giráldez-Cru, J.: On the performance of deep generative models of realistic SAT instances. In: Proc. SAT. pp. 3:1–3:19 (2022). https://doi.org/10.4230/LIPIcs.SAT.2022.3
Gelder, A.V.: Careful ranking of multiple solvers with timeouts and ties. In: Proc. SAT. pp. 317–328 (2011). https://doi.org/10.1007/978-3-642-21581-0_25
Golbandi, N., Koren, Y., Lempel, R.: Adaptive bootstrapping of recommender systems using decision trees. In: Proc. WSDM. pp. 595–604 (2011). https://doi.org/10.1145/1935826.1935910
Gorodkin, J.: Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem. 28(5–6), 367–374 (2004). https://doi.org/10.1016/j.compbiolchem.2004.09.006
Harpale, A., Yang, Y.: Personalized active learning for collaborative filtering. In: Proc. SIGIR. pp. 91–98 (2008). https://doi.org/10.1145/1390334.1390352
Hoos, H.H., Hutter, F., Leyton-Brown, K.: Automated configuration and selection of SAT solvers. In: Handbook of Satisfiability, chap. 12, pp. 481–507. IOS Press, 2 edn. (2021). https://doi.org/10.3233/FAIA200995
Hoos, H.H., Kaufmann, B., Schaub, T., Schneider, M.: Robust benchmark set selection for boolean constraint solvers. In: Proc. LION. pp. 138–152 (2013). https://doi.org/10.1007/978-3-642-44973-4_16
Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Proc. LION. pp. 507–523 (2011). https://doi.org/10.1007/978-3-642-25566-3_40
Iser, M., Sinz, C.: A problem meta-data library for research in SAT. In: Proc. PoS. pp. 144–152 (2018). https://doi.org/10.29007/gdbb
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. Int. J. Adv. Res. Comput. Sci. Manage. Stud. 1(6), 90–95 (2013), http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf
Koren, Y., Bell, R.M., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009). https://doi.org/10.1109/MC.2009.263
Manthey, N., Möhle, S.: Better evaluations by analyzing benchmark structure. In: Proc. PoS (2016), http://www.pragmaticsofsat.org/2016/reg/POS-16_paper_4.pdf
Matricon, T., Anastacio, M., Fijalkow, N., Simon, L., Hoos, H.H.: Statistical comparison of algorithm performance through instance selection. In: Proc. CP. pp. 43:1–43:21 (2021). https://doi.org/10.4230/LIPIcs.CP.2021.43
Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta - Protein Struct. 405(2), 442–451 (1975). https://doi.org/10.1016/0005-2795(75)90109-9
Mısır, M.: Data sampling through collaborative filtering for algorithm selection. In: Proc. IEEE CEC. pp. 2494–2501 (2017). https://doi.org/10.1109/CEC.2017.7969608
Mısır, M.: Benchmark set reduction for cheap empirical algorithmic studies. In: Proc. IEEE CEC. pp. 871–877 (2021). https://doi.org/10.1109/CEC45853.2021.9505012
Mısır, M., Sebag, M.: ALORS: An algorithm recommender system. Artif. Intell. 244, 291–314 (2017). https://doi.org/10.1016/j.artint.2016.12.001
Ngoko, Y., Cérin, C., Trystram, D.: Solving SAT in a distributed cloud: A portfolio approach. Int. J. Appl. Math. Comput. Sci. 29(2), 261–274 (2019). https://doi.org/10.2478/amcs-2019-0019
Nießl, C., Herrmann, M., Wiedemann, C., Casalicchio, G., Boulesteix, A.: Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results. WIREs Data Min. Knowl. Discov. 12(2) (2022). https://doi.org/10.1002/widm.1441
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011), http://jmlr.org/papers/v12/pedregosa11a.html
Rubens, N., Elahi, M., Sugiyama, M., Kaplan, D.: Active learning in recommender systems. In: Recommender Systems Handbook, chap. 24, pp. 809–846. Springer, 2 edn. (2015). https://doi.org/10.1007/978-1-4899-7637-6_24
Settles, B.: Active learning literature survey. Tech. rep., University of Wisconsin-Madison, Department of Computer Sciences (2009), http://digital.library.wisc.edu/1793/60660
Sinha, S., Ebrahimi, S., Darrell, T.: Variational adversarial active learning. In: Proc. ICCV. pp. 5971–5980 (2019). https://doi.org/10.1109/ICCV.2019.00607
Stützle, T., López-Ibáñez, M., Pérez-Cáceres, L.: Automated algorithm configuration and design. In: Proc. GECCO. pp. 997–1019 (2022). https://doi.org/10.1145/3520304.3533663
Tharwat, A.: Linear vs. quadratic discriminant analysis classifier: a tutorial. Int. J. Appl. Pattern Recognit. 3(2), 145–180 (2016). https://doi.org/10.1504/IJAPR.2016.079050
Tran, T., Do, T., Reid, I.D., Carneiro, G.: Bayesian generative active deep learning. In: Proc. ICML. pp. 6295–6304 (2019), http://proceedings.mlr.press/v97/tran19a.html
Volpato, R., Song, G.: Active learning to optimise time-expensive algorithm selection. arXiv:1909.03261 [cs.LG] (2019), https://arxiv.org/abs/1909.03261
Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1
Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: Portfolio-based algorithm selection for SAT. J. Artif. Intell. Res. 32, 565–606 (2008). https://doi.org/10.1613/jair.2490
Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Features for SAT. Tech. rep., University of British Columbia (2012), https://www.cs.ubc.ca/labs/beta/Projects/SATzilla/Report_SAT_features.pdf
Acknowledgments
This work was supported by the Ministry of Science, Research and the Arts Baden-Württemberg, project Algorithm Engineering for the Scalability Challenge (AESC).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)