Abstract
Benchmarking is a crucial phase when developing algorithms. This also applies to solvers for the SAT (propositional satisfiability) problem. Benchmark selection is about choosing representative problem instances that reliably discriminate solvers based on their runtime. In this paper, we present a dynamic benchmark selection approach based on active learning. Our approach predicts the rank of a new solver among its competitors with minimum runtime and maximum rank prediction accuracy. We evaluated this approach on the Anniversary Track dataset from the 2022 SAT Competition. Our selection approach can predict the rank of a new solver after about 10 % of the time it would take to run the solver on all instances of this dataset, with a prediction accuracy of about 92 %. We also discuss the importance of instance families in the selection process. Overall, our tool provides a reliable way for solver engineers to determine a new solver’s performance efficiently.
Keywords
 Propositional satisfiability
 Benchmarking
 Active learning
1 Introduction
One of the main phases of algorithm engineering is benchmarking. This also applies to propositional satisfiability (SAT), the archetypal \(\mathcal{N}\mathcal{P}\)-complete problem. Benchmarking is, however, quite expensive regarding the runtime of experiments. While benchmarking a single SAT solver might still be feasible, developing new, competitive SAT solvers requires extensive experimentation with a variety of ideas [2, 8]. In particular, a new solver idea is rarely best on the first try. Thus, it is highly desirable to reduce benchmarking time and discard unpromising ideas early, which allows testing more approaches or spending more time on promising ones. The field of SAT solver benchmarking is well established, but traditional benchmark selection approaches do not optimize benchmark runtime. Instead, they focus on selecting a representative set of instances for scoring solvers [10, 15]. For the latter, SAT Competitions typically employ the PAR2 score, i.e., the average runtime with a penalty of \(2 \tau \) for timeouts with time limit \(\tau \) [8].
In this paper, we present a novel benchmark selection approach based on active learning. Our approach can predict the rank of a new solver with high accuracy in only a fraction of the time needed to evaluate the complete benchmark. Definition 1 specifies the problem we address.
Definition 1 (New-Solver Problem)
Given solvers \(\mathcal {A}\), instances \(\mathcal {I}\), runtimes \(r\!: \mathcal {A} \times \mathcal {I} \rightarrow \left[ 0, \tau \right] \) with time limit \(\tau \), and a new solver \(\hat{a} \notin \mathcal {A}\), incrementally select benchmark instances from \(\mathcal {I}\) to maximize the confidence in predicting the rank of \(\hat{a}\) while minimizing the total benchmark runtime.
Note that our scenario assumes knowing the runtimes of all solvers, except the new one, on all instances. One could also imagine a collaborative filtering scenario, where runtimes are only partially known [23, 25].
Our approach satisfies several desirable criteria for benchmarking: Rather than outputting a binary classification, i.e., whether the new solver is worse than an existing solver or not, we provide a scoring function that shows by which margin a solver is worse and how similar it is to existing solvers. In particular, our approach enables ranking the new solver amidst a set of existing solvers. For this ranking, we do not even need to predict exact solver runtimes, which is a harder task. Further, we optimize the runtime that our strategy needs to arrive at its conclusion. We use instance and runtime features. Moreover, we select instances non-randomly and incrementally. In particular, we consider runtime information from already completed experiments when choosing the next one. By doing so, we can control the properties of the benchmarking approach, such as its required runtime. Our approach is scalable in that it ranks a new solver \(\hat{a}\) among any number of known solvers \(\mathcal {A}\). In particular, we only subsample the benchmark once instead of comparing pairwise against every other solver [21].
We evaluate our approach with the SAT Competition 2022 Anniversary Track dataset [2], consisting of 5355 instances and runtimes of 28 solvers. We perform cross-validation by treating each solver once as the new solver and learning to predict the PAR2 rank of that solver. On average, our predictions reach about 92 % accuracy with only about 10 % of the runtime required to evaluate these solvers on the complete set of instances.
Our entire source code^{Footnote 1} and experimental data^{Footnote 2} are available on GitHub.
2 Related Work
Benchmarking is not only of high interest in many fields but also an active research area on its own. Recent studies show that benchmark selection is challenging for multiple reasons. Biased benchmarks can easily lead to fallacious interpretations [7]. Benchmarking also has many interchangeable parts, such as the performance measures used, how measurement points are aggregated, and how missing values are handled. Questionable research practices could alter these elements a posteriori to meet expectations, thereby skewing the results [27]. In the following, we discuss related work from the areas of static benchmark selection, algorithm configuration, incremental benchmark selection, and active learning. Table 1 compares the most relevant approaches, which all pursue slightly different goals. Thus, our approach is not a general improvement over the others but the only one fully aligned with Definition 1.
Static Benchmark Selection. Benchmark selection is essential for competitions, e.g., the SAT Competition. In such competitions, the organizers define the rules for composing the benchmarks. These selection strategies are primarily static, i.e., they do not depend on particular solvers to distinguish. Balint et al. provide an overview of benchmark-selection criteria in different solver competitions [1]. Froleyks et al. describe benchmark selection in recent SAT competitions [8]. Manthey and Möhle find that competition benchmarks might contain redundant instances and propose a feature-based approach to remove redundancy [20]. Mısır presents a feature-based approach to reduce benchmarks by matrix factorization and clustering [24].
Hoos et al. [15] discuss which properties are most desirable when selecting SAT benchmark instances. The selection criteria are instance variety to avoid overfitting, adapted instance hardness (not too easy but also not too hard), and avoiding duplicate instances. To filter out instances that are too similar, they use a distance-based approach with the SATzilla features [37, 38]. However, the approach does not optimize for benchmark runtime and selects instances randomly, apart from constraints on the instance hardness and feature distance.
Algorithm Configuration. Further related work can be found within the field of algorithm configuration [14, 32], e.g., the configuration system SMAC [16]. There, the goal is to tune SAT solvers for a given subdomain of problem instances. Although this task is different from our goal, e.g., we do not need to navigate the configuration space, there are similarities to our approach as well. For example, SMAC also employs an iterative, model-based selection procedure, though for configurations rather than instances. An algorithm configurator, however, cannot be used to rank or score a new solver since algorithm configuration solely seeks to find the best-performing configuration. Also, while SMAC uses a model-based selection strategy to sample configurations, it selects instances randomly, i.e., without building a model over instances.
Incremental Benchmark Selection. Matricon et al. present an incremental benchmark selection approach [21]. Their per-set efficient algorithm selection problem (PSEAS) is similar to our New-Solver Problem (cf. Definition 1). Given a pair of SAT solvers, they iteratively select a subset of instances until the desired confidence level is reached to decide which of the two solvers is better. The selection of instances depends on the choice of the solvers to distinguish. They calculate a scoring metric for all unselected instances, run the experiment with the highest score, and update the confidence. Their approach ticks off most of our desired features in Table 1. However, the approach only yields a binary comparison of two solvers rather than a scoring. Thus, it is unclear how similar two given solvers are or on which instances they behave similarly. Moreover, a significant shortcoming is the lack of scalability with the number of solvers. Since the approach compares only pairs of solvers, evaluating a new solver requires sampling a separate benchmark for each existing solver. In contrast, our approach allows comparing a new solver against a set of existing solvers by sampling only one benchmark.
Active Learning. Prediction models in passive machine learning are trained on datasets with given instance labels (cf. Fig. 1a). In contrast, active learning (AL) starts with no or little labeled data. It repeatedly selects interesting problem instances for which to acquire labels, aiming to gradually improve the prediction model (cf. Fig. 1b). AL methods are especially beneficial if acquiring labels is computationally expensive, like obtaining solver runtimes. Without AL methods, it is not obvious which instances to label and which not. On the one hand, we want to maximize the utility an instance provides to our model, i.e., rank prediction accuracy, and on the other hand, minimize the cost, i.e., predicted runtime, associated with the instance’s acquisition. Thus, we strive for an accurate prediction model without having to label every data point.
Rubens et al. [29] survey active-learning advances. While synthesis-based AL methods [5, 9, 34] generate instances for labeling, pool-based methods [11, 13, 19] rely on a fixed set of unlabeled instances to sample from. Recent synthesis-based methods within the field of SAT solving show how to generate problem instances with desired properties [5, 9]. This goal is, however, orthogonal to ours. While those approaches want to generate instances on which a solver is good or bad, we want to predict whether a solver is good or bad on an existing benchmark. Volpato and Guangyan use pool-based AL to learn an instance-specific algorithm selector [35]. Rather than benchmarking a solver’s overall performance, their goal is to recommend the best solver out of a set of solvers for each SAT instance.
3 Active Learning for SAT Solver Benchmarking
Algorithm 1 outlines our benchmarking framework. Given a set of solvers \(\mathcal {A}\), instances \(\mathcal {I}\) and runtimes r, we first initialize a prediction model \(\mathcal {M}\) for the new solver \(\hat{a} \not \in \mathcal {A}\) (Line 1). The prediction model \(\mathcal {M}\) is used to repeatedly select an instance (Line 4) for benchmarking \(\hat{a}\) (Line 5). The acquired result is subsequently used to update the prediction model \(\mathcal {M}\) (Line 7). When the stopping criterion is met (Line 3), we quit the benchmarking loop and predict the final score of \(\hat{a}\) (Line 8). Algorithm 1 returns the predicted score of \(\hat{a}\) as well as the acquired instances and runtime measurements (Line 9).
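The loop described above can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions: the function name `benchmark_new_solver` and all callables passed to it (`run_solver`, `fit_model`, `select_instance`, `should_stop`, `predict_score`) are hypothetical placeholders for the subroutines of Algorithm 1, not the names used in the published source code.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple


def benchmark_new_solver(
    instances: Iterable[Hashable],
    run_solver: Callable[[Hashable], float],
    fit_model: Callable[[Dict[Hashable, float]], object],
    select_instance: Callable[[object, set], Hashable],
    should_stop: Callable[[object, Dict[Hashable, float]], bool],
    predict_score: Callable[[object], float],
) -> Tuple[float, Dict[Hashable, float]]:
    """Hypothetical skeleton of the active-learning benchmarking loop."""
    remaining = set(instances)
    observed: Dict[Hashable, float] = {}  # acquired runtimes of the new solver
    model = fit_model(observed)  # initialize the prediction model
    while remaining and not should_stop(model, observed):
        e = select_instance(model, remaining)  # pick the next instance
        observed[e] = run_solver(e)  # run the (expensive) experiment
        remaining.discard(e)
        model = fit_model(observed)  # rebuild the model from scratch
    return predict_score(model), observed


# Toy usage with trivial stand-ins for all subroutines.
toy_runtimes = {"a": 1.0, "b": 2.0, "c": 3.0}
score, acquired = benchmark_new_solver(
    instances=toy_runtimes,
    run_solver=lambda e: toy_runtimes[e],
    fit_model=lambda obs: dict(obs),
    select_instance=lambda model, rem: min(rem),
    should_stop=lambda model, obs: len(obs) >= 2,
    predict_score=lambda model: sum(model.values()),
)
```

Rebuilding the model in every iteration mirrors the design choice that experiment runtime, not model training, dominates the overall cost.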
Section 3.1 describes the underlying prediction model \(\mathcal {M}\) and specifies how we may derive a solver ranking from it. We discuss criteria for selecting instances in Section 3.2. Section 3.3 concludes with possible stopping conditions.
3.1 Solver Model
The model \(\mathcal {M}\) provides a runtime-label prediction function \(f : \mathcal {\hat{A}} \times \mathcal {I} \rightarrow \mathbb {R}\) for all solvers \(\mathcal {\hat{A}} := \mathcal {A} \cup \lbrace \hat{a} \rbrace \). This prediction function powers instance selection as described in Section 3.2. During model updates (Algorithm 1, Line 7), f is trained to predict a transformed version of the acquired runtimes \(\mathcal {R}\). We describe the runtime transformation in the subsequent paragraph. The features described in Section 4.2 serve as the input to the model. Further, note that we build a new prediction model in each iteration since running experiments (Line 5) exceeds the runtime of model training by orders of magnitude. Finally, we predict the score of the new solver \(\hat{a}\) with the prediction function f (Line 8).
Runtime Transformation. For the prediction model \(\mathcal {M}\), we transform the real-valued runtimes into discrete runtime labels on a per-instance basis. For each instance \(e \in \mathcal {I}\), we use a clustering algorithm to assign the runtimes in \(\bigl \{ r(a, e) \mid a \in \mathcal {A} \bigr \}\) to one of k clusters \(C_1, \dots , C_k\) such that the fastest runtimes for the instance e are in cluster \(C_1\) and the slowest non-timeout runtimes are in cluster \(C_{k-1}\). Timeouts \(\tau \) always form a separate cluster \(C_{k}\). The runtime transformation function \(\gamma _k : {\mathcal {A} \times \mathcal {I}} \rightarrow \left\{ 1, \dots , k \right\} \) is then specified as follows:
\[ \gamma _k(a, e) = j \;\Leftrightarrow \; r(a, e) \in C_j \]
Given an instance \(e \in \mathcal {I}\), a solver \(a \in \mathcal {A}\) belongs to the \(\gamma _k(a, e)\)-fastest solvers on instance e. In preliminary experiments, we achieved higher accuracy for predicting such discrete runtime labels than for predicting raw runtimes. Research on portfolio solvers has also shown that discretization works well in practice [4, 26].
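Ranking Solvers. To determine solver ranks, we use the transformed runtimes \(\gamma _k(a, e)\) in the adapted scoring function \(s_k : \mathcal {A} \rightarrow [1, 2 \cdot k]\) as follows:
\[ s_k(a) := \frac{1}{|\mathcal {I}|} \sum _{e \in \mathcal {I}} \gamma '_k(a, e), \qquad \gamma '_k(a, e) := {\left\{ \begin{array}{ll} 2 \cdot k &{} \text {if } \gamma _k(a, e) = k \\ \gamma _k(a, e) &{} \text {otherwise} \end{array}\right. } \qquad (1) \]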
I.e., we apply PAR2 scoring, which is commonly used in SAT competitions [8], on the discrete labels. The scoring function \(s_k\) induces a ranking among solvers.
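As an illustration of PAR2-style scoring on discrete labels, the following sketch averages the labels of a solver over all instances and, mirroring the PAR2 penalty, counts the timeout label k as \(2 \cdot k\). The penalty form is an assumption derived from the codomain \([1, 2 \cdot k]\); the function names `label_score` and `rank_solvers` are hypothetical.

```python
from typing import Dict, Hashable, List, Mapping


def label_score(labels: Mapping[Hashable, int], k: int) -> float:
    """Average discrete runtime label; the timeout label k counts as 2 * k."""
    if not labels:
        raise ValueError("need at least one instance label")
    penalized = [2 * k if lab == k else lab for lab in labels.values()]
    return sum(penalized) / len(penalized)


def rank_solvers(all_labels: Mapping[str, Mapping[Hashable, int]], k: int) -> List[str]:
    """Order solvers from best (lowest score) to worst (highest score)."""
    scores: Dict[str, float] = {a: label_score(l, k) for a, l in all_labels.items()}
    return sorted(scores, key=scores.get)


toy_labels = {
    "s1": {"e1": 1, "e2": 3},  # fastest on e1, timeout on e2
    "s2": {"e1": 2, "e2": 2},  # mid-range on both instances
}
ranking = rank_solvers(toy_labels, k=3)
```

In this toy example, the timeout of `s1` is penalized so strongly that `s2` ranks first despite never being the fastest solver.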
3.2 Instance Selection
Selecting an instance based on the model is a core functionality of our framework (cf. Algorithm 1, Line 4). In this section, we introduce two instance sampling strategies, one that minimizes uncertainty and one that maximizes information gain. Both strategies use the model’s label-prediction function f and are inspired by existing work within the realms of active learning [30]. These methods require the model’s predictions to include probabilities for the k discrete runtime labels. Let \(f' : \mathcal {\hat{A}} \times \mathcal {I} \rightarrow \left[ 0, 1\right] ^k\) denote this modified prediction function. In the following, the set \(\tilde{\mathcal {I}} \subseteq \mathcal {I}\) denotes the instances that have already been sampled.
Uncertainty Sampling. The uncertainty sampling strategy selects the instance closest to the model’s decision boundary, i.e., we select the instance \(e \in \mathcal {I} \setminus \tilde{\mathcal {I}}\) that minimizes U(e), which is specified as follows:
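A possible reading of this strategy, used here purely as an assumption, is \(U(e) = \bigl | \tfrac{1}{k} - \max _{n} f'(\hat{a}, e)_n \bigr |\): an instance is close to the decision boundary if even its most probable label is barely more likely than a uniform guess. The sketch below selects the unsampled instance minimizing this quantity; the function names are hypothetical, and `predicted` stands in for the output of \(f'\) for the new solver.

```python
from typing import Dict, Hashable, Sequence, Set


def uncertainty(probs: Sequence[float]) -> float:
    """Assumed form of U(e): distance of the top probability from 1/k."""
    k = len(probs)
    return abs(1.0 / k - max(probs))


def select_most_uncertain(
    predicted: Dict[Hashable, Sequence[float]], sampled: Set[Hashable]
) -> Hashable:
    """Return the unsampled instance with the smallest U(e)."""
    candidates = {e: p for e, p in predicted.items() if e not in sampled}
    if not candidates:
        raise ValueError("all instances have already been sampled")
    return min(candidates, key=lambda e: uncertainty(candidates[e]))


toy_predictions = {
    "e1": [0.90, 0.05, 0.05],  # confident prediction
    "e2": [0.40, 0.30, 0.30],  # fairly uncertain
    "e3": [0.34, 0.33, 0.33],  # almost uniform
}
first_pick = select_most_uncertain(toy_predictions, sampled=set())
second_pick = select_most_uncertain(toy_predictions, sampled={"e3"})
```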
Information-Gain Sampling. The information-gain sampling strategy selects the instance with the highest expected entropy reduction regarding the runtime labels of the instance. To be more specific, we select the instance \(e \in \mathcal {I} \setminus \tilde{\mathcal {I}}\) that maximizes IG(e), which is specified as follows:
\[ \textrm{IG}(e) := \textrm{H}(e) - \sum _{n=1}^{k} f'(\hat{a}, e)_n \cdot \hat{\textrm{H}}_n(e) \]
Here, \(\textrm{H}(e)\) denotes the entropy of the runtime labels \(\gamma _k(a, e)\) over all \(a \in \mathcal {A}\), and \(\hat{\textrm{H}}_n(e)\) denotes the entropy of these labels plus n as the runtime label for \(\hat{a}\). The term \(\hat{\textrm{H}}_n(e)\) is computed for every possible runtime label \(n \in \{1, \dots , k\}\) and weighted by the predicted probability of that label. By maximizing information gain, we select instances that identify solvers with similar behavior.
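The expected entropy reduction can be sketched as follows, assuming that the entropy with label n added is weighted by the predicted probability of n for the new solver. The function names are hypothetical, and Shannon entropy with base 2 is an arbitrary choice for this illustration.

```python
from collections import Counter
from math import log2
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a list of discrete labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(known_labels: Sequence[int], probs: Sequence[float]) -> float:
    """Entropy of the known labels minus the expected entropy after adding
    the new solver's label; probs[n - 1] is the predicted probability of label n."""
    base = entropy(known_labels)
    expected = sum(
        p * entropy(list(known_labels) + [n]) for n, p in enumerate(probs, start=1)
    )
    return base - expected


# Four known solvers with labels split evenly between 1 and 2.
ig_split = information_gain([1, 1, 2, 2], probs=[0.5, 0.5, 0.0])
# All known solvers agree, and the new solver is predicted to agree as well.
ig_agree = information_gain([1, 1, 1], probs=[1.0, 0.0, 0.0])
```

Under this reading, the value can become negative whenever the new label is likely to increase the entropy, so maximizing it favors instances on which the new solver is expected to behave like existing solvers.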
3.3 Stopping Criteria
In this section, we present the two dynamic stopping criteria in our experiments, the Wilcoxon and the ranking stopping criterion (cf. Algorithm 1, Line 3).
Wilcoxon Stopping Criterion. The Wilcoxon stopping criterion stops the active-learning process when we are confident enough that the predicted runtime labels of the new solver are sufficiently different from existing solvers. This criterion is loosely inspired by Matricon et al. [21]. We use the average p-value \(W_{\hat{a}}\) of a Wilcoxon signed-rank test \(\textrm{w}(S,P)\) of the two runtime label distributions \(S=\{ \gamma _k(a, e) \mid e \in \mathcal {I} \}\) for an existing solver a and \(P=\{ f(\hat{a}, e) \mid e \in \mathcal {I} \}\) for the new solver \(\hat{a}\):
\[ W_{\hat{a}} := \frac{1}{|\mathcal {A}|} \sum _{a \in \mathcal {A}} \textrm{w}(S, P) \]
To improve the stability of this criterion, we use an exponential moving average to smooth out outliers and stop as soon as \(W^{(i)}_{\exp }\) drops below a fixed threshold:
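The smoothing step can be illustrated with a standard exponential moving average. The sketch assumes the recurrence \(W^{(i)}_{\exp } = \beta \cdot W_{\hat{a}} + (1 - \beta ) \cdot W^{(i-1)}_{\exp }\) with \(W^{(0)}_{\exp } = 1\); both the initial value and the convention that \(\beta \) weights the newest value are assumptions of this example. The function name `ema_stop_iteration` is hypothetical, and the averaged p-values are taken as given inputs rather than computed here.

```python
from typing import Optional, Sequence


def ema_stop_iteration(
    avg_p_values: Sequence[float],
    beta: float,
    threshold: float = 0.05,
    initial: float = 1.0,
) -> Optional[int]:
    """Return the first iteration (1-based) at which the smoothed average
    p-value drops below the threshold, or None if it never does."""
    w = initial
    for i, p in enumerate(avg_p_values, start=1):
        w = beta * p + (1 - beta) * w  # assumed EMA recurrence
        if w < threshold:
            return i
    return None


# With beta = 0.7, a single low p-value is not enough to trigger stopping.
stop_smoothed = ema_stop_iteration([0.01, 0.01, 0.01], beta=0.7)
# With beta = 1.0, there is no smoothing at all.
stop_unsmoothed = ema_stop_iteration([0.01, 0.01, 0.01], beta=1.0)
```

A smaller \(\beta \) makes the criterion more conservative since a longer run of low p-values is needed before the smoothed value falls below the threshold.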
Ranking Stopping Criterion. The ranking stopping criterion is less sophisticated in comparison. It stops the active-learning process if the ranking induced by the model’s predictions (Equation 1) remained unchanged within the last l iterations. However, the concrete values of the predicted score \(s_{\hat{a}}\) might still change. In this case, we are solely interested in the induced ranking.
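A minimal sketch of this criterion keeps a history of predicted rankings and checks whether the most recent ones are identical. Interpreting "unchanged within the last l iterations" as l + 1 identical consecutive rankings is an assumption of this example, and the function name is hypothetical.

```python
from typing import List, Sequence


def ranking_converged(rank_history: Sequence[List[str]], l: int) -> bool:
    """True if the predicted ranking did not change during the last l iterations,
    i.e., if the last l + 1 recorded rankings are identical."""
    if l < 1 or len(rank_history) < l + 1:
        return False
    recent = rank_history[-(l + 1):]
    return all(r == recent[0] for r in recent)


toy_history = [["a", "b"], ["b", "a"], ["b", "a"], ["b", "a"]]
stable_for_two = ranking_converged(toy_history, l=2)
stable_for_three = ranking_converged(toy_history, l=3)
```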
4 Experimental Design
Given all the previously presented instantiations for Algorithm 1, this section outlines our experimental design, including our evaluation framework, used data sets, hyperparameter choices, and implementation details.
4.1 Evaluation Framework
As stated in the Introduction, this work addresses the New-Solver Problem (cf. Definition 1). As described in Section 3.1, a prediction model \(\mathcal {M}\) provides us with an estimated scoring \(s_{\hat{a}}\) for the new solver \(\hat{a}\).
To evaluate a concrete instantiation of Algorithm 1, i.e., a concrete choice for all the subroutines, we perform cross-validation on our set of solvers. Algorithm 2 shows this. That means each solver plays the role of the new solver \(\hat{a}\) once (Line 2). Note that the new solver in each iteration is excluded from the set of solvers \(\mathcal {A}\) to avoid data leakage (Line 3). After running our active-learning framework for solver \(\hat{a}\) (Line 4), we compute the value of both our optimization goals, i.e., ranking accuracy and runtime. We define the ranking accuracy \(O_{\textrm{acc}} \in \left[ 0, 1\right] \) (higher is better) by the fraction of pairs \(\left( \hat{a}, a\right) \) for all \(a \in \mathcal {A}\) that are decided correctly regarding the ground-truth scoring \(\textrm{par}_{2}\) (Lines 5-8). The fraction of runtime that the algorithm needs to arrive at its conclusion is denoted by \(O_{\textrm{rt}} \in \left[ 0, 1\right] \) (lower is better). This metric puts the runtime summed over the sampled instances in relation to the runtime summed over all instances in the dataset (Lines 9-13). Finally, we compute averages of the output metrics in Line 15 after we have collected all cross-validation results in Line 14. Overall, we want to find an approach that maximizes
\[ O_{\delta } := \delta \cdot O_{\textrm{acc}} + (1 - \delta ) \cdot \left( 1 - O_{\textrm{rt}} \right) \]
whereby \(\delta \in \left[ 0, 1\right] \) allows for linear weighting between the two optimization goals \(O_{\textrm{acc}}\) and \(O_{\textrm{rt}}\). Plotting the approaches that maximize \(O_\delta \) for all \(\delta \in \left[ 0, 1\right] \) on an \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\) diagram provides us with a Pareto front of the best approaches for different optimization-goal weightings.
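The two optimization goals and their weighting can be sketched as follows. The linear combination \(\delta \cdot O_{\textrm{acc}} + (1 - \delta ) \cdot (1 - O_{\textrm{rt}})\) is an assumption consistent with the description above (accuracy is maximized, runtime fraction minimized); all function and variable names are hypothetical.

```python
from typing import Hashable, Iterable, Mapping


def ranking_accuracy(
    true_scores: Mapping[str, float],
    true_new: float,
    pred_scores: Mapping[str, float],
    pred_new: float,
) -> float:
    """Fraction of existing solvers whose order relative to the new solver is
    predicted correctly (lower score means better solver)."""
    correct = sum(
        (true_new < true_scores[a]) == (pred_new < pred_scores[a])
        for a in true_scores
    )
    return correct / len(true_scores)


def runtime_fraction(
    runtimes_new: Mapping[Hashable, float], sampled: Iterable[Hashable]
) -> float:
    """Runtime of the new solver on sampled instances relative to all instances."""
    return sum(runtimes_new[e] for e in sampled) / sum(runtimes_new.values())


def combined_objective(o_acc: float, o_rt: float, delta: float) -> float:
    """Assumed linear trade-off between accuracy and saved runtime."""
    return delta * o_acc + (1 - delta) * (1 - o_rt)


o_acc = ranking_accuracy(
    true_scores={"a": 100.0, "b": 300.0}, true_new=200.0,  # ground-truth PAR2
    pred_scores={"a": 1.5, "b": 4.0}, pred_new=5.0,        # predicted label scores
)
o_rt = runtime_fraction({"e1": 10.0, "e2": 30.0, "e3": 60.0}, sampled=["e1", "e2"])
o_delta = combined_objective(o_acc, o_rt, delta=0.5)
```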
4.2 Data
In our experiments, we work with the dataset of the SAT Competition 2022 Anniversary Track [2]. The dataset consists of 5355 instances with respective runtime data of 28 sequential SAT solvers. We also use a database of 56 instance features^{Footnote 3} from the Global Benchmark Database (GBD) by Iser et al. [17]. They comprise instance size features and node distribution statistics for several graph representations of SAT instances, among others, and are primarily inspired by the SATzilla 2012 features described in [38]. All features are numeric and free of missing values. We drop 10 out of 56 features because of zero variance. Overall, prediction models have access to 46 instance features and 27 runtime features, i.e., excluding the current new solver \(\hat{a}\).
Additionally, we retrieve instance-family information^{Footnote 4} to evaluate the composition of our sampled benchmarks. Instance families comprise instances from the same application domain, e.g., planning, cryptography, etc., and are a valuable tool for analyzing solver performance.
For hyperparameter tuning, we randomly sample 10 % of the complete set of 5355 instances with stratification regarding the instances’ family. All instance families that are too small, i.e., 10 % of them corresponds to less than one instance, are put into one meta-family for stratification. This tuning dataset allows for a more extensive exploration of the hyperparameter space.
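The stratified sampling with a meta-family for small families can be sketched as follows. This is an illustration only: the function name `stratified_sample`, the meta-family key, the rounding of stratum sizes, and the fixed seed are assumptions of this example rather than details of the published pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    families: Dict[str, str], fraction: float = 0.1, seed: int = 0
) -> List[str]:
    """Sample a fraction of instances per family; families too small to yield
    one instance are merged into a single meta-family first."""
    rng = random.Random(seed)
    groups: Dict[str, List[str]] = defaultdict(list)
    for instance, family in families.items():
        groups[family].append(instance)
    strata: Dict[str, List[str]] = defaultdict(list)
    for family, members in groups.items():
        key = family if len(members) * fraction >= 1 else "__meta__"
        strata[key].extend(members)
    sample: List[str] = []
    for members in strata.values():
        n = round(len(members) * fraction)  # assumed rounding rule
        sample.extend(rng.sample(sorted(members), n))
    return sample


toy_families = {f"big{i:02d}": "planning" for i in range(20)}
toy_families.update({f"s{i}": "fpga" for i in range(3)})
toy_families.update({f"t{i}": "bmc" for i in range(7)})
tuning_set = stratified_sample(toy_families, fraction=0.1)
```

In this toy example, the two small families together contribute one instance via the meta-family, while the large family contributes two.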
4.3 Hyperparameters
Given Algorithm 1, there are several possible instantiations for the three subroutines, i.e., ranking, selection, and stopping. Also, there are different choices for the runtime-label prediction model and runtime discretization. We describe these experimental configurations in the following.
Ranking. Regarding ranking (cf. Section 3.1), we experiment with the following approaches and hyperparameter values:

Observed PAR2 ranking of already sampled instances

Predicted runtime-label ranking

History size: Consider the latest 1, 10, 20, 30, or 40 predictions within a voting approach for stability. The latest x predictions for each instance vote on the instance’s winning label.

Fallback threshold: If the difference of scores between the new solver \(\hat{a}\) and another solver drops below 0.01, 0.05, or 0.1, use the partially observed PAR2 ranking as a tiebreaker.

Selection. For selection (cf. Section 3.2), we experiment with the following methods and hyperparameter values. Since the potential runtime of experiments is orders of magnitude larger than the model’s update time, we only consider incrementing our benchmark by one instance at a time rather than using batches, as proposed in recent active-learning work [31, 34]. A drawback of this choice is that runtime experiments cannot be executed in parallel.

Random sampling

Uncertainty sampling

Fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.

Runtime scaling: Whether to normalize uncertainty scores per instance by the average runtime of solvers on it or use the absolute values.


Information-gain sampling

Fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.

Runtime scaling: Whether to normalize information-gain scores per instance by the average runtime of solvers on it or use the absolute values.

Stopping. For stopping decisions (cf. Section 3.3), we experiment with the following criteria and hyperparameter values:

Subset-size stopping criterion, using 10 % or 20 % of instances

Ranking stopping criterion

Minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.

Convergence duration: Stop if the predicted ranking stays the same for a number of sampled instances equal to 1 % or 2 % of all instances.


Wilcoxon stopping criterion

Minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.

Average of p-values to drop below: 5 %.

Exponential-moving average: Incorporate previous significance values by using an EMA with \(\beta = 0.1\) or \(\beta = 0.7\).

Prediction model. Our experiments only use one model configuration for runtime-label prediction since an exhaustive grid search would be infeasible. In preliminary experiments, we compared various model types from scikit-learn [28]. In particular, we conducted nested cross-validation, including hyperparameter tuning, and used Matthews Correlation Coefficient [12, 22] to assess the performance for predicting runtime labels. Our final choice is a stacking ensemble [36] of two prediction models, a quadratic-discriminant analysis [33] and a random forest [3]. Both these models can learn non-linear relationships between the instance features and the runtime labels. Stacking means that another prediction model, in our case a simple decision tree, decides which of the two ensemble members makes the prediction on which instance.
Runtime discretization. To define prediction targets, i.e., discrete runtime labels, we use hierarchical clustering with \(k = 3\) and a log-single-link criterion, which produced the most useful labels in preliminary experiments. We denote this adapted solver scoring function with \(s_3\). In our chosen hierarchical procedure, each non-timeout runtime starts in a separate interval. We then gradually merge intervals whose single-link logarithmic distance is the smallest until the desired number of partitions is reached. Other clustering approaches that we tried include hierarchical clustering with mean, median, and complete-link criterion, as well as k-means and spectral clustering.
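For one-dimensional data, repeatedly merging the two closest neighboring intervals under a single-link criterion is equivalent to cutting the sorted values at the largest gaps. The following sketch uses this shortcut on logarithmic runtimes. It is an illustration under assumptions: the function name, the lower bound `eps` for runtimes of zero, and the tie handling are not taken from the published implementation.

```python
from math import log
from typing import Dict


def discretize_runtimes(
    runtimes: Dict[str, float], timeout: float, k: int = 3, eps: float = 1e-3
) -> Dict[str, int]:
    """Map each solver's runtime on one instance to a label in 1..k.
    Timeouts get label k; other runtimes are split into at most k - 1
    clusters by cutting at the largest gaps between sorted log runtimes."""
    if k < 2:
        raise ValueError("k must be at least 2")
    labels = {a: k for a, t in runtimes.items() if t >= timeout}
    finished = sorted((log(max(t, eps)), a) for a, t in runtimes.items() if t < timeout)
    # Gap sizes between neighboring log runtimes, with the index of the left value.
    gaps = [(finished[i + 1][0] - finished[i][0], i) for i in range(len(finished) - 1)]
    n_cuts = min(k - 2, len(gaps))
    cut_after = {i for _, i in sorted(gaps, reverse=True)[:n_cuts]}
    label = 1
    for i, (_, solver) in enumerate(finished):
        labels[solver] = label
        if i in cut_after:
            label += 1  # the next solver starts a new, slower cluster
    return labels


toy_labels_per_solver = discretize_runtimes(
    {"a": 1.0, "b": 1.2, "c": 100.0, "d": 5000.0}, timeout=5000.0, k=3
)
```

Working on logarithmic distances means that the gap between 1 s and 1.2 s counts as small, while the gap between 1.2 s and 100 s separates the two non-timeout clusters.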
To obtain useful labels, we need to ensure that discretized labels still discriminate solvers and align with the actual PAR2 ranking. We analyzed the ranking induced by \(s_3\) in preliminary experiments with the SAT Competition 2022 Anniversary Track [2]. According to a Wilcoxon signed-rank test with \(\alpha = 0.05\), 87.83 % of solver pairs have significantly different scores after discretization, only a slight drop compared to 89.95 % before discretization. Further, our ranking approach correctly decides for almost all (about 97.45 %; \(\sigma = {3.68}\,{\%}\)) solver pairs which solver is faster. In particular, the Spearman correlation of \(s_3\) and PAR2 ranking is about 0.988, which is very close to the optimal value of 1 [6]. All these results show that discretized runtimes are suitable for our framework.
4.4 Implementation Details
For reproducibility, our source code and data are available on GitHub (cf. footnotes in Section 1). Our code is implemented in Python using scikit-learn [28] for making predictions and gbd-tools [17] for SAT-instance retrieval.
5 Evaluation
In this section, we evaluate our active-learning framework. First, we analyze and tune the different subroutines of our framework on the tuning dataset. Next, we evaluate the best configurations with the full dataset. Finally, we analyze the importance of different instance families to our framework.
5.1 Hyper-Parameter Analysis
Our experiments follow the evaluation framework introduced in Section 4.1. Fig. 2 shows the performance of the approaches from Section 4.3 on \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\) diagrams for the hyperparameter-tuning dataset. Evaluating a particular configuration with Algorithm 2 returns a point \(\left( O_{\textrm{rt}},\, O_{\textrm{acc}}\right) \). We do not show intermediate results of the active-learning procedure but only the final results after stopping. The plotted lines represent the best-performing configurations per ranking approach (Fig. 2a), selection approach (Fig. 2b), and stopping criterion (Fig. 2c). In particular, we show the Pareto front, i.e., of all configurations that share a particular value of the plotted hyperparameter, we take the maximum ranking accuracy over all remaining hyperparameters not displayed in the corresponding plot.
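Extracting such a front from a set of evaluated configurations can be sketched as follows. The code keeps all non-dominated points, i.e., points for which no other point has both a lower runtime fraction and a higher accuracy; configurations that maximize the weighted objective for some \(\delta \in (0, 1)\) are among these points. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def pareto_front(points: Sequence[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (o_rt, o_acc) points, sorted by runtime fraction.
    Lower o_rt and higher o_acc are better."""
    front: List[Tuple[float, float]] = []
    best_acc = float("-inf")
    # Visit points by increasing runtime; at equal runtime, best accuracy first.
    for o_rt, o_acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if o_acc > best_acc:  # strictly better than every cheaper point
            front.append((o_rt, o_acc))
            best_acc = o_acc
    return front


toy_points = [(0.10, 0.80), (0.05, 0.70), (0.20, 0.75), (0.30, 0.90), (0.10, 0.60)]
front = pareto_front(toy_points)
```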
Regarding ranking approaches (Fig. 2a), using the predicted \(s_3\)-induced runtime-label ranking consistently outperforms the partially observed PAR2 ranking for each possible value of the trade-off parameter \(\delta \). This outcome is expected since selection decisions are not random. For example, we might sample more instances of one family if it benefits discrimination of solvers. While the partially observed PAR2 score is skewed, the prediction model can account for this.
Regarding the selection approaches (Fig. 2b), uncertainty sampling performs best in most cases. However, information-gain sampling is beneficial if runtime is strongly favored (small \(\delta \); runtime fraction less than 5 %). This result aligns with our expectations: Information-gain sampling selects instances that maximize the expected reduction in entropy. This means we sample instances revealing similarities between solvers rather than differences, which helps to build a confident model quickly. However, the method cannot select helpful instances for distinguishing solvers later. Random sampling performs reasonably well but is outperformed by uncertainty sampling in all cases, showing the benefit of actively selecting instances based on a prediction model.
Regarding the stopping criteria (Fig. 2c), the ranking stopping criterion performs most consistently well. If accuracy is strongly favored (very high \(\delta \)), the Wilcoxon stopping criterion performs better. The subset-size stopping criterion performs reasonably well but does not improve beyond a certain accuracy because of sampling a fixed subset of instances.
Fig. 3a shows an interesting consequence of weighting our optimization goals: If we, on the one hand, desire to get a rough estimate of a solver’s performance fast (low \(\delta \)), approaches favor selecting many easy instances. In particular, the fraction of sampled instances is larger than the fraction of runtime. By having many observations, it is easier to build a model. If we, on the other hand, desire to get a good estimate of a solver’s performance in a moderate amount of time (high \(\delta \)), approaches favor selecting few, difficult instances. In particular, the fraction of instances is smaller than the fraction of runtime.
Furthermore, Fig. 3b reveals which values make the most sense for \(\delta \). The range \(\delta \in \left[ 0.2, 0.8\right] \) corresponds to the points with a runtime fraction between 0.03 and 0.22. We consider this region to be most promising, analogous to the elbow method in cluster analysis [18].
5.2 Full-Dataset Evaluation
Having selected the most promising hyperparameters, we run our active-learning experiments on the complete Anniversary Track dataset (5355 instances). The aforementioned range \(\delta \in \left[ 0.2, 0.8\right] \) only results in two distinct configurations. The best-performing approach for \(\delta \in \left[ 0.2, 0.7\right] \) uses the predicted runtime-label ranking, information-gain sampling, and ranking stopping criterion. It can predict a new solver’s PAR2 ranking with 90.48 % accuracy (\(O_{\textrm{acc}}\)) in only 5.41 % of the full evaluation time (\(O_{\textrm{rt}}\)). The best-performing approach for \(\delta \in (0.7, 0.8]\) uses the predicted runtime-label ranking, uncertainty sampling, and ranking stopping criterion. It can predict a new solver’s PAR2 ranking with 92.33 % accuracy (\(O_{\textrm{acc}}\)) in only 10.35 % of the full evaluation time (\(O_{\textrm{rt}}\)).
Table 2 shows how both active-learning approaches (column AL) compare against two static baselines. The baseline Random samples instances until it reaches roughly the same fraction of runtime as the AL benchmark sets. We repeat sampling 1000 times and report average results. The baseline Most Freq. uses a static benchmark set consisting of those instances most frequently sampled by our active-learning approach. In particular, we consider the average sampling frequency over all solvers and Pareto-optimal active-learning approaches.
Both our AL approaches perform better than random sampling. However, the performance differences are not significant according to a Wilcoxon signed-rank test with \(\alpha = 0.05\) and also depend on the fraction of sampled runtime (cf. Fig. 2b). A clear advantage of our approach, though, is that it indicates when to stop adding further instances, depending on the trade-off parameter \(\delta \). While the active-learning results are less strong on the full dataset than on the smaller tuning dataset, they still show the benefit of making benchmark selection dependent on the solvers to distinguish.
A static benchmark using the most frequently ALsampled instances performs poorly, though, compared to active learning and random sampling. This outcome is somewhat expected since the static benchmark does not reflect the right balance of instance families: Families whose instances are uniformrandomly selected by AL, e.g., for different solvers, appear less often in this benchmark than families where some instances are sampled more often than others.
5.3 Instance-Family Importance
Selection decisions of our approach also reveal the importance of different instance families to our framework. Fig. 4 shows the occurrence of instance families within the dataset and the benchmarks created by active learning. We use the best-performing configurations for all \(\delta \in \left[ 0, 1\right] \) and examine the selection decisions by the active-learning approach on the SAT Competition 2022 Anniversary Track dataset [2]. While most families appear with the same fraction in the dataset and the sampled benchmarks, a few outliers need further discussion. Problem instances of the families fpga, quasigroup-completion, and planning are especially helpful to our framework in distinguishing solvers. Instances of these families are selected over-proportionally compared to the full dataset. In contrast, instances of the largest family, i.e., hardware-verification, appear with roughly the same fraction in the dataset and the sampled benchmarks. Finally, instances of the family cryptography are less important in distinguishing solvers than their large weight in the dataset suggests. A possible explanation is that these instances are very similar, such that a small fraction of them is sufficient to estimate a solver’s performance on all of them.
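One way to quantify such over- or under-representation is the ratio between a family's share in the sampled benchmark and its share in the dataset. The sketch below uses a hypothetical function name, and the counts are invented for illustration; they are not the values behind Fig. 4.

```python
from collections import Counter

def representation_ratio(dataset_families, benchmark_families):
    """Share of each family in the benchmark divided by its share in the
    dataset. Values above 1 mean the family is selected over-proportionally."""
    data = Counter(dataset_families)
    bench = Counter(benchmark_families)
    n_data, n_bench = len(dataset_families), len(benchmark_families)
    # Multiply before dividing to keep the arithmetic exact for small counts.
    return {fam: (bench[fam] * n_data) / (data[fam] * n_bench) for fam in data}

if __name__ == "__main__":
    # Invented counts: one family label per instance.
    dataset = ["fpga"] * 10 + ["cryptography"] * 40 + ["hardware-verification"] * 50
    benchmark = ["fpga"] * 3 + ["cryptography"] * 2 + ["hardware-verification"] * 5
    for family, ratio in representation_ratio(dataset, benchmark).items():
        print(f"{family}: {ratio:.2f}")
```

In this toy setting, a ratio near 1 corresponds to a family that appears with the same fraction in both sets, while a ratio well below 1 corresponds to a family whose instances are largely redundant for distinguishing solvers.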
6 Conclusions and Future Work
In this work, we have addressed the New-Solver Problem: Given a new solver, we want to find its ranking amidst competitors. Our approach provides accurate ranking predictions while needing significantly less runtime than a complete evaluation on a given benchmark set. On data from the SAT Competition 2022 Anniversary Track, we can determine a new solver’s PAR2 ranking with about 92 % accuracy while only needing 10 % of the full-evaluation time. We have evaluated several ranking algorithms, instance-selection approaches, and stopping criteria within our sequential active-learning framework. We also took a brief look at which instance families are the most prevalent in selection decisions.
Future work may compare further subroutines for ranking, instance selection, and stopping. Additionally, one can apply our evaluation framework to arbitrary computation-intensive problems, e.g., \(\mathcal{N}\mathcal{P}\)-complete problems other than SAT, as all discussed active-learning methods are problem-agnostic. Such problems share most of the relevant properties of SAT solving, i.e., there are established instance features, a complete benchmark is expensive, and traditional benchmark selection requires expert knowledge.
From the technical perspective, one could formulate runtime discretization as an optimization problem rather than addressing it empirically. Further, a major shortcoming of our current approach is the lack of parallelization, as it selects instances one at a time. Benchmarking on a computing cluster with \(n\) cores benefits from having batches of \(n\) instances. However, bigger batch sizes \(n\) impede active learning. Also, it is unclear how to synchronize instance selection and updates of the prediction model without wasting too much runtime.
References
Balint, A., Belov, A., Järvisalo, M., Sinz, C.: Overview and analysis of the SAT Challenge 2012 solver competition. Artif. Intell. 223, 120–155 (2015). https://doi.org/10.1016/j.artint.2015.01.002
Balyo, T., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.): Proceedings of SAT Competition 2022: Solver and Benchmark Descriptions. Department of Computer Science, University of Helsinki (2022), http://hdl.handle.net/10138/347211
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Collautti, M., Malitsky, Y., Mehta, D., O’Sullivan, B.: SNNAP: solver-based nearest neighbor for algorithm portfolios. In: Proc. ECML PKDD. pp. 435–450 (2013). https://doi.org/10.1007/978-3-642-40994-3_28
Dang, N., Akgün, Ö., Espasa, J., Miguel, I., Nightingale, P.: A framework for generating informative benchmark instances. In: Proc. CP. pp. 18:1–18:18 (2022). https://doi.org/10.4230/LIPIcs.CP.2022.18
De Winter, J.C.F., Gosling, S.D., Potter, J.: Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychol. Methods 21(3), 273–290 (2016). https://doi.org/10.1037/met0000079
Dehghani, M., Tay, Y., Gritsenko, A.A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D., Vinyals, O.: The benchmark lottery. arXiv:2107.07002 [cs.LG] (2021), https://arxiv.org/abs/2107.07002
Froleyks, N., Heule, M., Iser, M., Järvisalo, M., Suda, M.: SAT Competition 2020. Artif. Intell. 301 (2021). https://doi.org/10.1016/j.artint.2021.103572
Garzón, I., Mesejo, P., Giráldez-Cru, J.: On the performance of deep generative models of realistic SAT instances. In: Proc. SAT. pp. 3:1–3:19 (2022). https://doi.org/10.4230/LIPIcs.SAT.2022.3
Gelder, A.V.: Careful ranking of multiple solvers with timeouts and ties. In: Proc. SAT. pp. 317–328 (2011). https://doi.org/10.1007/978-3-642-21581-0_25
Golbandi, N., Koren, Y., Lempel, R.: Adaptive bootstrapping of recommender systems using decision trees. In: Proc. WSDM. pp. 595–604 (2011). https://doi.org/10.1145/1935826.1935910
Gorodkin, J.: Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem. 28(5–6), 367–374 (2004). https://doi.org/10.1016/j.compbiolchem.2004.09.006
Harpale, A., Yang, Y.: Personalized active learning for collaborative filtering. In: Proc. SIGIR. pp. 91–98 (2008). https://doi.org/10.1145/1390334.1390352
Hoos, H.H., Hutter, F., Leyton-Brown, K.: Automated configuration and selection of SAT solvers. In: Handbook of Satisfiability, chap. 12, pp. 481–507. IOS Press, 2 edn. (2021). https://doi.org/10.3233/FAIA200995
Hoos, H.H., Kaufmann, B., Schaub, T., Schneider, M.: Robust benchmark set selection for Boolean constraint solvers. In: Proc. LION. pp. 138–152 (2013). https://doi.org/10.1007/978-3-642-44973-4_16
Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Proc. LION. pp. 507–523 (2011). https://doi.org/10.1007/978-3-642-25566-3_40
Iser, M., Sinz, C.: A problem metadata library for research in SAT. In: Proc. PoS. pp. 144–152 (2018). https://doi.org/10.29007/gdbb
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. Int. J. Adv. Res. Comput. Sci. Manage. Stud. 1(6), 90–95 (2013), http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I60015.pdf
Koren, Y., Bell, R.M., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009). https://doi.org/10.1109/MC.2009.263
Manthey, N., Möhle, S.: Better evaluations by analyzing benchmark structure. In: Proc. PoS (2016), http://www.pragmaticsofsat.org/2016/reg/POS16_paper_4.pdf
Matricon, T., Anastacio, M., Fijalkow, N., Simon, L., Hoos, H.H.: Statistical comparison of algorithm performance through instance selection. In: Proc. CP. pp. 43:1–43:21 (2021). https://doi.org/10.4230/LIPIcs.CP.2021.43
Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta - Protein Struct. 405(2), 442–451 (1975). https://doi.org/10.1016/0005-2795(75)90109-9
Mısır, M.: Data sampling through collaborative filtering for algorithm selection. In: Proc. IEEE CEC. pp. 2494–2501 (2017). https://doi.org/10.1109/CEC.2017.7969608
Mısır, M.: Benchmark set reduction for cheap empirical algorithmic studies. In: Proc. IEEE CEC. pp. 871–877 (2021). https://doi.org/10.1109/CEC45853.2021.9505012
Mısır, M., Sebag, M.: ALORS: An algorithm recommender system. Artif. Intell. 244, 291–314 (2017). https://doi.org/10.1016/j.artint.2016.12.001
Ngoko, Y., Cérin, C., Trystram, D.: Solving SAT in a distributed cloud: A portfolio approach. Int. J. Appl. Math. Comput. Sci. 29(2), 261–274 (2019). https://doi.org/10.2478/amcs-2019-0019
Nießl, C., Herrmann, M., Wiedemann, C., Casalicchio, G., Boulesteix, A.: Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results. WIREs Data Min. Knowl. Discov. 12(2) (2022). https://doi.org/10.1002/widm.1441
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011), http://jmlr.org/papers/v12/pedregosa11a.html
Rubens, N., Elahi, M., Sugiyama, M., Kaplan, D.: Active learning in recommender systems. In: Recommender Systems Handbook, chap. 24, pp. 809–846. Springer, 2 edn. (2015). https://doi.org/10.1007/978-1-4899-7637-6_24
Settles, B.: Active learning literature survey. Tech. rep., University of Wisconsin-Madison, Department of Computer Sciences (2009), http://digital.library.wisc.edu/1793/60660
Sinha, S., Ebrahimi, S., Darrell, T.: Variational adversarial active learning. In: Proc. ICCV. pp. 5971–5980 (2019). https://doi.org/10.1109/ICCV.2019.00607
Stützle, T., López-Ibáñez, M., Pérez-Cáceres, L.: Automated algorithm configuration and design. In: Proc. GECCO. pp. 997–1019 (2022). https://doi.org/10.1145/3520304.3533663
Tharwat, A.: Linear vs. quadratic discriminant analysis classifier: a tutorial. Int. J. Appl. Pattern Recognit. 3(2), 145–180 (2016). https://doi.org/10.1504/IJAPR.2016.079050
Tran, T., Do, T., Reid, I.D., Carneiro, G.: Bayesian generative active deep learning. In: Proc. ICML. pp. 6295–6304 (2019), http://proceedings.mlr.press/v97/tran19a.html
Volpato, R., Song, G.: Active learning to optimise time-expensive algorithm selection. arXiv:1909.03261 [cs.LG] (2019), https://arxiv.org/abs/1909.03261
Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1
Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: Portfolio-based algorithm selection for SAT. J. Artif. Intell. Res. 32, 565–606 (2008). https://doi.org/10.1613/jair.2490
Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Features for SAT. Tech. rep., University of British Columbia (2012), https://www.cs.ubc.ca/labs/beta/Projects/SATzilla/Report_SAT_features.pdf
Acknowledgments
This work was supported by the Ministry of Science, Research and the Arts Baden-Württemberg, project Algorithm Engineering for the Scalability Challenge (AESC).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Fuchs, T., Bach, J., Iser, M. (2023). Active Learning for SAT Solver Benchmarking. In: Sankaranarayanan, S., Sharygina, N. (eds) Tools and Algorithms for the Construction and Analysis of Systems. TACAS 2023. Lecture Notes in Computer Science, vol 13993. Springer, Cham. https://doi.org/10.1007/978-3-031-30823-9_21
DOI: https://doi.org/10.1007/978-3-031-30823-9_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30822-2
Online ISBN: 978-3-031-30823-9
eBook Packages: Computer Science (R0)