1 Introduction

One of the main phases of algorithm engineering is benchmarking. This also applies to propositional satisfiability (SAT), the archetypal \(\mathcal{N}\mathcal{P}\)-complete problem. Benchmarking is, however, quite expensive in terms of experiment runtime. While benchmarking a single SAT solver might still be feasible, developing new, competitive SAT solvers requires extensive experimentation with a variety of ideas [2, 8]. In particular, a new solver idea is rarely best on the first try. Thus, it is highly desirable to reduce benchmarking time and discard unpromising ideas early, allowing us to test more approaches or spend more time on promising ones. The field of SAT solver benchmarking is well established, but traditional benchmark selection approaches do not optimize benchmark runtime. Instead, they focus on selecting a representative set of instances for scoring solvers [10, 15]. For the latter, SAT Competitions typically employ the PAR-2 score, i.e., the average runtime with a penalty of \(2 \tau \) for timeouts with time-limit \(\tau \) [8].
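For reference, the following minimal Python sketch computes the PAR-2 score as just described; the function name and data layout are illustrative and not taken from the competition tooling.

```python
def par2_score(runtimes, tau):
    """PAR-2: mean runtime with each timeout counted as 2 * tau.

    `runtimes` holds one solver's measured runtimes; values >= tau are
    treated as timeouts. Name and data layout are illustrative.
    """
    return sum(2 * tau if t >= tau else t for t in runtimes) / len(runtimes)
```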

In this paper, we present a novel benchmark selection approach based on active learning. Our approach can predict the rank of a new solver with high accuracy in only a fraction of the time needed to evaluate the complete benchmark. Definition 1 specifies the problem we address.

Definition 1 (New-Solver Problem)

Given solvers \(\mathcal {A}\), instances \(\mathcal {I}\), runtimes \(r\!: \mathcal {A} \times \mathcal {I} \rightarrow \left[ 0, \tau \right] \) with time-limit \(\tau \), and a new solver \(\hat{a} \notin \mathcal {A}\), incrementally select benchmark instances from \(\mathcal {I}\) to maximize the confidence in predicting the rank of \(\hat{a}\) while minimizing the total benchmark runtime.

Note that our scenario assumes knowing the runtimes of all solvers, except the new one, on all instances. One could also imagine a collaborative filtering scenario, where runtimes are only partially known [23, 25].

Our approach satisfies several desirable criteria for benchmarking: Rather than outputting a binary classification, i.e., whether the new solver is worse than an existing solver or not, we provide a scoring function that shows by which margin a solver is worse and how similar it is to existing solvers. In particular, our approach enables ranking the new solver amidst a set of existing solvers. For this ranking, we do not even need to predict exact solver runtimes, which is a harder task. Further, we optimize the runtime that our strategy needs to arrive at its conclusion. We use instance and runtime features. Moreover, we select instances non-randomly and incrementally. In particular, we consider runtime information from experiments that have already been run when choosing the next one. By doing so, we can control the properties of the benchmarking approach, such as its required runtime. Our approach is scalable in that it ranks a new solver \(\hat{a}\) among any number of known solvers \(\mathcal {A}\). In particular, we only subsample the benchmark once instead of comparing pairwise against each other solver [21].

We evaluate our approach with the SAT Competition 2022 Anniversary Track dataset [2], consisting of 5355 instances and runtimes of 28 solvers. We perform cross-validation by treating each solver once as the new solver and learning to predict the PAR-2 rank of that solver. On average, our predictions reach about 92 % accuracy with only about 10 % of the runtime required to evaluate these solvers on the complete set of instances.

Our entire source code\(^{1}\) and experimental data\(^{2}\) are available on GitHub.

2 Related Work

Benchmarking is not only of high interest in many fields but also an active research area in its own right. Recent studies show that benchmark selection is challenging for multiple reasons. Biased benchmarks can easily lead to fallacious interpretations [7]. Benchmarking also has many interchangeable parts, such as the performance measures used, how measurement points are aggregated, and how missing values are handled. Questionable research practices could alter these elements a posteriori to meet expectations, thereby skewing the results [27]. In the following, we discuss related work from the areas of static benchmark selection, algorithm configuration, incremental benchmark selection, and active learning. Table 1 compares the most relevant approaches, which all pursue slightly different goals. Thus, our approach is not a general improvement over the others but the only one fully aligned with Definition 1.

Static Benchmark Selection. Benchmark selection is essential for competitions, e.g., the SAT Competition. In such competitions, the organizers define the rules for composing the benchmarks. These selection strategies are primarily static, i.e., they do not depend on particular solvers to distinguish. Balint et al. provide an overview of benchmark-selection criteria in different solver competitions [1]. Froleyks et al. describe benchmark selection in recent SAT competitions [8]. Manthey and Möhle find that competition benchmarks might contain redundant instances and propose a feature-based approach to remove redundancy [20]. Mısır presents a feature-based approach to reduce benchmarks by matrix factorization and clustering [24].

Hoos et al. [15] discuss which properties are most desirable when selecting SAT benchmark instances. The selection criteria are instance variety to avoid over-fitting, adapted instance hardness (not too easy but also not too hard), and avoiding duplicate instances. To filter out overly similar instances, they use a distance-based approach with the SATzilla features [37, 38]. However, the approach does not optimize for benchmark runtime and selects instances randomly, apart from constraints on instance hardness and feature distance.

Table 1. Comparison of features of our benchmark-selection approach, the static benchmark-selection approach by Hoos et al. [15], the algorithm configuration system SMAC [16], and the active-learning approaches by Matricon et al. [21].

Algorithm Configuration. Further related work can be found within the field of algorithm configuration [14, 32], e.g., the configuration system SMAC [16]. There, the goal is to tune SAT solvers for a given sub-domain of problem instances. Although this task differs from our goal, e.g., we do not need to navigate the configuration space, there are similarities to our approach as well. For example, SMAC also employs an iterative, model-based selection procedure, though for configurations rather than instances. An algorithm configurator, however, cannot be used to rank or score a new solver since algorithm configuration solely seeks to find the best-performing configuration. Also, while SMAC uses a model-based strategy to sample configurations, it selects instances randomly, i.e., without building a model over instances.

Incremental Benchmark Selection. Matricon et al. present an incremental benchmark selection approach [21]. Their per-set efficient algorithm selection problem (PSEAS) is similar to our New-Solver Problem (cf. Definition 1). Given a pair of SAT solvers, they iteratively select a subset of instances until the desired confidence level is reached to decide which of the two solvers is better. The selection of instances depends on the choice of the solvers to distinguish. They calculate a scoring metric for all unselected instances, run the experiment with the highest score, and update the confidence. Their approach ticks off most of our desired features in Table 1. However, it only compares solvers binarily rather than providing a scoring. Thus, it is unclear how similar two given solvers are or on which instances they behave similarly. Moreover, a significant shortcoming is its lack of scalability with the number of solvers. Since only pairs of solvers are compared, evaluating a new solver requires sampling a separate benchmark for each existing solver. In contrast, our approach allows comparing a new solver against a set of existing solvers by sampling only one benchmark.

Active Learning. Prediction models in passive machine learning are trained on datasets with given instance labels (cf. Fig. 1a). In contrast, active learning (AL) starts with no or little labeled data. It repeatedly selects interesting problem instances for which to acquire labels, aiming to gradually improve the prediction model (cf. Fig. 1b). AL methods are especially beneficial if acquiring labels is computationally expensive, like obtaining solver runtimes. Without AL methods, it is not obvious which instances to label and which not. On the one hand, we want to maximize the utility an instance provides to our model, i.e., rank prediction accuracy, and on the other hand, minimize the cost, i.e., predicted runtime, associated with the instance’s acquisition. Thus, we strive for an accurate prediction model without having to label every data point.

Fig. 1. Types of machine learning (depiction inspired by Rubens et al. [29]).

Rubens et al. [29] survey active-learning advances. While synthesis-based AL methods [5, 9, 34] generate instances for labeling, pool-based methods [11, 13, 19] rely on a fixed set of unlabeled instances to sample from. Recent synthesis-based methods within the field of SAT solving show how to generate problem instances with desired properties [5, 9]. This goal is, however, orthogonal to ours. While those approaches want to generate instances on which a solver is good or bad, we want to predict whether a solver is good or bad on an existing benchmark. Volpato and Guangyan use pool-based AL to learn an instance-specific algorithm selector [35]. Rather than benchmarking a solver’s overall performance, their goal is to recommend the best solver out of a set of solvers for each SAT instance.

3 Active Learning for SAT Solver Benchmarking

Algorithm 1 outlines our benchmarking framework. Given a set of solvers \(\mathcal {A}\), instances \(\mathcal {I}\), and runtimes r, we first initialize a prediction model \(\mathcal {M}\) for the new solver \(\hat{a} \not \in \mathcal {A}\) (Line 1). The prediction model \(\mathcal {M}\) is used to repeatedly select an instance (Line 4) for benchmarking \(\hat{a}\) (Line 5). The acquired result is subsequently used to update the prediction model \(\mathcal {M}\) (Line 7). When the stopping criterion is met (Line 3), we quit the benchmarking loop and predict the final score of \(\hat{a}\) (Line 8). Algorithm 1 returns the predicted score of \(\hat{a}\) as well as the acquired instances and runtime measurements (Line 9).

Section 3.1 describes the underlying prediction model \(\mathcal {M}\) and specifies how we may derive a solver ranking from it. We discuss criteria for selecting instances in Section 3.2. Section 3.3 concludes with possible stopping conditions.

Algorithm 1 (pseudocode; see the description above and the sketch below).
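Since the rendered pseudocode of Algorithm 1 is not reproduced here, the following Python sketch mirrors the loop just described; every callable is a placeholder for one of the sub-routines of Sections 3.1 to 3.3, and the comments map to the line numbers referenced above.

```python
def benchmark_new_solver(instances, init_model, run_solver, select_instance,
                         update_model, should_stop, predict_score):
    """Sketch of Algorithm 1; all callables stand in for the sub-routines
    described in Sections 3.1 to 3.3 (placeholders, not the exact code)."""
    model = init_model()                                # Line 1: init model M
    sampled, runtimes = [], {}                          # acquired benchmark data
    while not should_stop(model, sampled):              # Line 3: stopping criterion
        e = select_instance(model, instances, sampled)  # Line 4: pick next instance
        runtimes[e] = run_solver(e)                     # Line 5: run new solver (expensive)
        sampled.append(e)
        model = update_model(model, runtimes)           # Line 7: retrain on new label
    return predict_score(model), sampled, runtimes      # Lines 8-9: score + acquired data
```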

3.1 Solver Model

The model \(\mathcal {M}\) provides a runtime-label prediction function \(f : \mathcal {\hat{A}} \times \mathcal {I} \rightarrow \mathbb {R}\) for all solvers \(\mathcal {\hat{A}} := \mathcal {A} \cup \lbrace \hat{a} \rbrace \). This prediction function powers instance selection as described in Section 3.2. During model updates (Algorithm 1, Line 7), f is trained to predict a transformed version of the acquired runtimes \(\mathcal {R}\). We describe the runtime transformation in the following. The features described in Section 4.2 serve as the input to the model. Further, note that we build a new prediction model in each iteration since running experiments (Line 5) dominates the runtime of model training by orders of magnitude. Finally, we predict the score of the new solver \(\hat{a}\) with the prediction function f (Line 8).

Runtime Transformation. For the prediction model \(\mathcal {M}\), we transform the real-valued runtimes into discrete runtime labels on a per-instance basis. For each instance \(e \in \mathcal {I}\), we use a clustering algorithm to assign the runtimes in \(\bigl \{ r(a, e) \mid a \in \mathcal {A} \bigr \}\) to one of k clusters \(C_1, \dots , C_k\) such that the fastest runtimes for the instance e are in cluster \(C_1\) and the slowest non-timeout runtimes are in cluster \(C_{k-1}\). Timeouts \(\tau \) always form a separate cluster \(C_{k}\). The runtime transformation function \(\gamma _k : {\mathcal {A} \times \mathcal {I}} \rightarrow \left\{ 1, \dots , k \right\} \) is then specified as follows:

$$\gamma _k(a, e) = j ~\Leftrightarrow ~ r(a, e) \in C_j$$

Given an instance \(e \in \mathcal {I}\), a solver \(a \in \mathcal {A}\) with \(\gamma _k(a, e) = j\) thus belongs to the j-th fastest group of solvers on instance e. In preliminary experiments, we achieved higher accuracy for predicting such discrete runtime labels than for predicting raw runtimes. Research on portfolio solvers has also shown that discretization works well in practice [4, 26].

Ranking Solvers. To determine solver ranks, we use the transformed runtimes \(\gamma _k(a, e)\) in the adapted scoring function \(s_k : \mathcal {A} \rightarrow [1, 2 \cdot k]\) as follows:

$$\begin{aligned} s_k(a) := \frac{1}{|\mathcal {I}|} \sum _{e \in \mathcal {I}} \gamma '_k(a, e) \qquad \text {where} \qquad \gamma '_k(a, e) := {\left\{ \begin{array}{ll} 2 \cdot \gamma _k(a, e) & \text {if } \gamma _k(a, e) = k\\ \gamma _k(a, e) & \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

I.e., we apply PAR-2 scoring, which is commonly used in SAT competitions [8], to the discrete labels. The scoring function \(s_k\) induces a ranking among solvers.
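A minimal sketch of this adapted scoring on discrete labels is given below; the data layout is an illustrative assumption.

```python
import numpy as np

def adapted_score(labels, k):
    """Adapted scoring s_k of Eq. (1): mean discrete runtime label of one
    solver over all instances, with the timeout label k counted twice.
    `labels` holds gamma_k(a, e) for all instances e (illustrative layout).
    """
    labels = np.asarray(labels)
    return float(np.where(labels == k, 2 * labels, labels).mean())
```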

3.2 Instance Selection

Selecting an instance based on the model is a core functionality of our framework (cf. Algorithm 1, Line 4). In this section, we introduce two instance sampling strategies, one that minimizes uncertainty and one that maximizes information gain. Both strategies use the model’s label-prediction function f and are inspired by existing work within the realm of active learning [30]. These methods require the model’s predictions to include probabilities for the k discrete runtime labels. Let \(f' : \mathcal {\hat{A}} \times \mathcal {I} \rightarrow \left[ 0, 1\right] ^k\) denote this modified prediction function. In the following, the set \(\tilde{\mathcal {I}} \subseteq \mathcal {I}\) denotes the instances that have already been sampled.

Uncertainty Sampling. The uncertainty sampling strategy selects the instance closest to the model’s decision boundary, i.e., we select the instance \(e \in \mathcal {I} \setminus \tilde{\mathcal {I}}\) that minimizes U(e), which is specified as follows:

$$\begin{aligned} \textrm{U}(e) := \left|\frac{1}{k} - \max _{n \in \left\{ 1, \dots , k \right\} } f'\!\left( \hat{a}, e\right) _{n} \right|\end{aligned}$$
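As a sketch, uncertainty sampling can be implemented directly on the predicted label distributions; the dictionary-based data layout below is an illustrative assumption.

```python
import numpy as np

def select_by_uncertainty(proba, remaining):
    """Uncertainty sampling: return the unsampled instance minimizing U(e).

    `proba[e]` is the predicted label distribution f'(a_hat, e) of length k;
    `remaining` contains the instances not yet sampled. Illustrative sketch.
    """
    def u(e):
        p = np.asarray(proba[e])
        return abs(1.0 / len(p) - p.max())
    return min(remaining, key=u)
```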

Information-Gain Sampling. The information-gain sampling strategy selects the instance with the highest expected entropy reduction regarding the runtime labels of the instance. To be more specific, we select the instance \(e \in \mathcal {I} \setminus \tilde{\mathcal {I}}\) that maximizes IG(e), which is specified as follows:

$$\begin{aligned} \textrm{IG}(e) := \textrm{H}(e) - \sum _{n = 1}^{k} f'(\hat{a}, e)_{n} \cdot \hat{\textrm{H}}_n(e) \end{aligned}$$

Here, \(\textrm{H}(e)\) denotes the entropy of the runtime labels \(\gamma _k(a, e)\) over all \(a \in \mathcal {A}\), and \(\hat{\textrm{H}}_n(e)\) denotes the entropy of these labels extended by n as the runtime label for \(\hat{a}\). The term \(\hat{\textrm{H}}_n(e)\) is computed for every possible runtime label \(n \in \{1, \dots , k\}\). By maximizing information gain, we select instances that identify solvers with similar behavior.
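The following sketch computes IG(e) for one instance; the entropy base (here base 2) and the data layout are illustrative choices, and the instance maximizing this value among the unsampled ones would be selected.

```python
import numpy as np

def label_entropy(labels):
    """Shannon entropy (base 2) of a collection of discrete runtime labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels_e, proba_e):
    """IG(e): expected entropy reduction on instance e (illustrative sketch).

    `labels_e` holds the known labels gamma_k(a, e) of all existing solvers,
    `proba_e` is the predicted distribution f'(a_hat, e) over labels 1..k.
    """
    h = label_entropy(labels_e)
    expected = sum(p_n * label_entropy(list(labels_e) + [n + 1])
                   for n, p_n in enumerate(proba_e))
    return h - expected
```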

3.3 Stopping Criteria

In this section, we present the two dynamic stopping criteria in our experiments, the Wilcoxon and the ranking stopping criterion (cf. Algorithm 1, Line 3).

Wilcoxon Stopping Criterion. The Wilcoxon stopping criterion stops the active-learning process when we are confident enough that the predicted runtime labels of the new solver are sufficiently different from those of the existing solvers. This criterion is loosely inspired by Matricon et al. [21]. We use the average p-value \(W_{\hat{a}}\) of a Wilcoxon signed-rank test \(\textrm{w}(S_a,P)\) of the two runtime-label distributions \(S_a=\{ \gamma _k(a, e) \mid e \in \mathcal {I} \}\) for an existing solver a and \(P=\{ f(\hat{a}, e) \mid e \in \mathcal {I} \}\) for the new solver \(\hat{a}\):

$$\begin{aligned} W_{\hat{a}} := \frac{1}{|\mathcal {A} |} \sum _{a \in \mathcal {A}} \textrm{w}(S_a, P) \end{aligned}$$

To improve the stability of this criterion, we use an exponential moving average to smooth out outliers and stop as soon as \(W^{(i)}_{\exp }\) drops below a fixed threshold:

$$\begin{aligned} W_{\exp }^{\left( 0\right) }&:= 1\\ W_{\exp }^{\left( i\right) }&:= \beta W_{\hat{a}} + \left( 1 - \beta \right) W_{\exp }^{\left( i - 1\right) } \end{aligned}$$
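A compact sketch of one iteration of this criterion is shown below, using scipy's Wilcoxon signed-rank test; the default values follow the hyper-parameters listed in Section 4.3, and all names and the data layout are illustrative.

```python
from scipy.stats import wilcoxon

def wilcoxon_criterion_step(existing_labels, predicted_labels, w_exp_prev,
                            beta=0.1, threshold=0.05):
    """One iteration of the Wilcoxon stopping criterion (illustrative sketch).

    `existing_labels[a]` holds gamma_k(a, e) for solver a over all instances,
    `predicted_labels` holds f(a_hat, e) for the new solver on the same
    instances, `w_exp_prev` is the previous smoothed value W_exp^(i-1).
    Returns the new smoothed value and whether to stop.
    """
    p_values = [wilcoxon(labels, predicted_labels).pvalue
                for labels in existing_labels.values()]
    w_hat = sum(p_values) / len(p_values)            # average p-value W_a_hat
    w_exp = beta * w_hat + (1 - beta) * w_exp_prev   # exponential moving average
    return w_exp, w_exp < threshold
```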

Ranking Stopping Criterion. The ranking stopping criterion is less sophisticated in comparison. It stops the active-learning process if the ranking induced by the model’s predictions (Equation 1) has remained unchanged over the last l iterations. However, the concrete values of the predicted score \(s_{\hat{a}}\) might still change; we are solely interested in the induced ranking in this case.
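A minimal sketch of this convergence check follows; how rankings are represented is an illustrative assumption.

```python
def ranking_converged(ranking_history, window):
    """Ranking stopping criterion: stop once the predicted ranking has stayed
    identical over the last `window` iterations. `ranking_history` is a list
    of rankings, e.g., tuples of solvers ordered by predicted score.
    Illustrative sketch.
    """
    if len(ranking_history) < window:
        return False
    recent = ranking_history[-window:]
    return all(r == recent[0] for r in recent)
```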

4 Experimental Design

Given all the previously presented instantiations for Algorithm 1, this section outlines our experimental design, including our evaluation framework, used data sets, hyper-parameter choices, and implementation details.

4.1 Evaluation Framework

Algorithm 2 (pseudocode; the cross-validation procedure is described below).

As stated in the Introduction, this work addresses the New-Solver Problem (cf. Definition 1). As described in Section 3.1, a prediction model \(\mathcal {M}\) provides us with an estimated scoring \(s_{\hat{a}}\) for the new solver \(\hat{a}\).

To evaluate a concrete instantiation of Algorithm 1, i.e., a concrete choice for all the sub-routines, we perform cross-validation on our set of solvers. Algorithm 2 shows this procedure: each solver plays the role of the new solver \(\hat{a}\) once (Line 2). Note that the new solver in each iteration is excluded from the set of solvers \(\mathcal {A}\) to avoid data leakage (Line 3). After running our active-learning framework for solver \(\hat{a}\) (Line 4), we compute the value of both our optimization goals, i.e., ranking accuracy and runtime. We define the ranking accuracy \(O_{\textrm{acc}} \in \left[ 0, 1\right] \) (higher is better) as the fraction of pairs \(\left( \hat{a}, a\right) \) for all \(a \in \mathcal {A}\) that are decided correctly regarding the ground-truth scoring \(\textrm{par}_{2}\) (Lines 5-8). The fraction of runtime that the algorithm needs to arrive at its conclusion is denoted by \(O_{\textrm{rt}} \in \left[ 0, 1\right] \) (lower is better). This metric puts the runtime summed over the sampled instances in relation to the runtime summed over all instances in the dataset (Lines 9-13). Finally, we compute averages of the output metrics in Line 15 after we have collected all cross-validation results in Line 14. Overall, we want to find an approach that maximizes

$$\begin{aligned} O_\delta := \delta O_{\textrm{acc}} + \left( 1 - \delta \right) \left( 1 - O_{\textrm{rt}}\right) \text {,} \end{aligned}$$
(2)

where \(\delta \in \left[ 0, 1\right] \) allows for a linear weighting between the two optimization goals \(O_{\textrm{acc}}\) and \(O_{\textrm{rt}}\). Plotting the approaches that maximize \(O_\delta \) for all \(\delta \in \left[ 0, 1\right] \) in an \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\)-diagram provides us with a Pareto front of the best approaches for different optimization-goal weightings.
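To make the objectives concrete, the following sketch computes \(O_{\textrm{acc}}\), \(O_{\textrm{rt}}\), and \(O_\delta\) for a single cross-validation run; the data layout and names are illustrative assumptions, and ties between scores are ignored for brevity.

```python
def ranking_accuracy(pred_score_new, pred_scores, par2_new, par2_scores):
    """O_acc: fraction of pairs (a_hat, a) whose predicted order matches the
    ground-truth PAR-2 order. The dicts map each existing solver to its
    predicted / true score (illustrative layout)."""
    correct = sum((pred_score_new < pred_scores[a]) == (par2_new < par2_scores[a])
                  for a in pred_scores)
    return correct / len(pred_scores)

def runtime_fraction(sampled_runtime, total_runtime):
    """O_rt: runtime of the sampled instances relative to the full benchmark."""
    return sampled_runtime / total_runtime

def combined_objective(o_acc, o_rt, delta):
    """O_delta of Eq. (2): linear trade-off between accuracy and saved runtime."""
    return delta * o_acc + (1 - delta) * (1 - o_rt)
```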

4.2 Data

In our experiments, we work with the dataset of the SAT Competition 2022 Anniversary Track [2]. The dataset consists of 5355 instances with respective runtime data of 28 sequential SAT solvers. We also use a database of 56 instance features\(^{3}\) from the Global Benchmark Database (GBD) by Iser et al. [17]. They comprise instance size features and node distribution statistics for several graph representations of SAT instances, among others, and are primarily inspired by the SATzilla 2012 features described in [38]. All features are numeric and free of missing values. We drop 10 out of 56 features because of zero variance. Overall, prediction models have access to 46 instance features and 27 runtime features, i.e., excluding the current new solver \(\hat{a}\).

Additionally, we retrieve instance-family information\(^{4}\) to evaluate the composition of our sampled benchmarks. Instance families comprise instances from the same application domain, e.g., planning or cryptography, and are a valuable tool for analyzing solver performance.

For hyper-parameter tuning, we randomly sample 10 % of the complete set of 5355 instances with stratification regarding the instances’ family. All instance families that are too small, i.e., where 10 % corresponds to less than one instance, are put into one meta-family for stratification. This smaller tuning dataset allows for a more extensive exploration of the hyper-parameter space.

4.3 Hyper-parameters

Given Algorithm 1, there are several possible instantiations for the three sub-routines, i.e., ranking, selection, and stopping. Also, there are different choices for the runtime-label prediction model and runtime discretization. We describe these experimental configurations in the following.

Ranking. Regarding ranking (cf. Section 3.1), we experiment with the following approaches and hyper-parameter values:

  • Observed PAR-2 ranking of already sampled instances

  • Predicted runtime-label ranking

    • History size: Consider the latest 1, 10, 20, 30, or 40 predictions within a voting approach for stability. The latest x predictions for each instance vote on the instance’s winning label.

    • Fallback threshold: If the difference of scores between the new solver \(\hat{a}\) and another solver drops below 0.01, 0.05, or 0.1, use the partially observed PAR-2 ranking as a tie-breaker.

Selection. For selection (cf. Section 3.2), we experiment with the following methods and hyper-parameter values. Since the potential runtime of experiments is orders of magnitude larger than the model’s update time, we only consider incrementing our benchmark by one instance at a time rather than using batches, in line with recent active-learning work [31, 34]. A drawback of this choice is that runtime experiments cannot be executed in parallel.

  • Random sampling

  • Uncertainty sampling

    • Fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.

    • Runtime scaling: Whether to normalize uncertainty scores per instance by the average runtime of solvers on it or use the absolute values.

  • Information-gain sampling

    • Fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.

    • Runtime scaling: Whether to normalize information-gain scores per instance by the average runtime of solvers on it or use the absolute values.

Stopping. For stopping decisions (cf. Section 3.3), we experiment with the following criteria and hyper-parameter values:

  • Subset-size stopping criterion, using 10 % or 20 % of instances

  • Ranking stopping criterion

    • Minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.

    • Convergence duration: Stop if the predicted ranking stays the same for a number of sampled instances equal to 1 % or 2 % of all instances.

  • Wilcoxon stopping criterion

    • Minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.

    • Average of p-values to drop below: 5 %.

    • Exponential moving average: Incorporate previous significance values by using an EMA with \(\beta = 0.1\) or \(\beta = 0.7\).

Prediction model. Our experiments only use one model configuration for runtime-label prediction since an exhaustive grid search would be infeasible. In preliminary experiments, we compared various model types from scikit-learn [28]. In particular, we conducted nested cross-validation, including hyper-parameter tuning, and used Matthews Correlation Coefficient [12, 22] to assess the performance for predicting runtime labels. Our final choice is a stacking ensemble [36] of two prediction models, a quadratic-discriminant analysis [33] and a random forest [3]. Both these models can learn non-linear relationships between the instance features and the runtime labels. Stacking means that another prediction model, in our case a simple decision tree, decides which of the two ensemble members makes the prediction on which instance.
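For illustration, such a stacking ensemble could be set up with scikit-learn as follows; hyper-parameters are omitted, so this is a sketch rather than our exact tuned configuration, and the training-data names are placeholders.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier

# Stacking ensemble for runtime-label prediction: QDA and a random forest as
# base learners, a decision tree as the meta-learner combining their outputs.
model = StackingClassifier(
    estimators=[
        ("qda", QuadraticDiscriminantAnalysis()),
        ("rf", RandomForestClassifier()),
    ],
    final_estimator=DecisionTreeClassifier(),
)
# model.fit(features, runtime_labels) would train on the instance and runtime
# features of Section 4.2 (placeholder names for illustration only).
```

In this setup, the decision tree receives the base models' predictions as input and learns when to rely on which ensemble member.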

Runtime discretization. To define prediction targets, i.e., discrete runtime labels, we use hierarchical clustering with \(k = 3\) and a log-single-link criterion, which produced the most useful labels in preliminary experiments. We denote this adapted solver scoring function with \(s_3\). In our chosen hierarchical procedure, each non-timeout runtime starts in a separate interval. We then gradually merge intervals whose single-link logarithmic distance is the smallest until the desired number of partitions is reached. Other clustering approaches that we tried include hierarchical clustering with mean-, median-, and complete-link criterion, as well as k-means and spectral clustering.
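As an illustration, the sketch below approximates this discretization with scikit-learn's single-linkage agglomerative clustering on log-runtimes; this approximation and all names are illustrative assumptions rather than our exact log-single-link routine, and positive, sufficiently many non-timeout runtimes are assumed.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def discretize_runtimes(runtimes, tau, k=3):
    """Map one instance's runtimes to labels 1..k, where k is the timeout label.

    Non-timeout runtimes are clustered by single linkage on log-runtimes into
    k-1 groups, ordered so that label 1 contains the fastest runtimes.
    Sketch; assumes positive runtimes and at least k-1 non-timeout values.
    """
    runtimes = np.asarray(runtimes, dtype=float)
    labels = np.full(len(runtimes), k)                  # timeouts -> cluster C_k
    non_timeout = runtimes < tau
    log_rt = np.log(runtimes[non_timeout]).reshape(-1, 1)
    raw = AgglomerativeClustering(n_clusters=k - 1,
                                  linkage="single").fit_predict(log_rt)
    # Re-number clusters so that the cluster with the fastest runtimes gets label 1.
    order = np.argsort([log_rt[raw == c].mean() for c in range(k - 1)])
    rank = {c: i + 1 for i, c in enumerate(order)}
    labels[non_timeout] = [rank[c] for c in raw]
    return labels
```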

To obtain useful labels, we need to ensure that discretized labels still discriminate solvers and align with the actual PAR-2 ranking. We analyzed the ranking induced by \(s_3\) in preliminary experiments with the SAT Competition 2022 Anniversary Track [2]. According to a Wilcoxon signed-rank test with \(\alpha = 0.05\), 87.83 % of solver pairs have significantly different scores after discretization, only a slight drop compared to 89.95 % before discretization. Further, our ranking approach correctly decides for almost all (about 97.45 %; \(\sigma = {3.68}\,{\%}\)) solver pairs which solver is faster. In particular, the Spearman correlation between the \(s_3\) and PAR-2 rankings is about 0.988, which is very close to the optimal value of 1 [6]. All these results show that discretized runtimes are suitable for our framework.

4.4 Implementation Details

For reproducibility, our source code and data are available on GitHub (cf. footnotes in Section 1). Our code is implemented in Python using scikit-learn [28] for making predictions and gbd-tools [17] for SAT-instance retrieval.

5 Evaluation

In this section, we evaluate our active-learning framework. First, we analyze and tune the different sub-routines of our framework on the tuning dataset. Next, we evaluate the best configurations with the full dataset. Finally, we analyze the importance of different instance families to our framework.

5.1 Hyper-Parameter Analysis

Fig. 2. \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\)-diagrams comparing different hyper-parameter instantiations of our active-learning framework on the hyper-parameter-tuning dataset. The x-axis shows the ratio of total solver runtime on the sampled instances relative to all instances. The y-axis shows the ranking accuracy (cf. Section 4.1). Each line represents the front of Pareto-optimal configurations for the respective hyper-parameter instantiation.

Our experiments follow the evaluation framework introduced in Section 4.1. Fig. 2 shows the performance of the approaches from Section 4.3 in \(O_{\textrm{rt}}\)-\(O_{\textrm{acc}}\)-diagrams for the hyper-parameter-tuning dataset. Evaluating a particular configuration with Algorithm 2 yields a point \(\left( O_{\textrm{rt}},\, O_{\textrm{acc}}\right) \). We do not show intermediate results of the active-learning procedure but only the final results after stopping. The plotted lines represent the best-performing configurations per ranking approach (Fig. 2a), selection approach (Fig. 2b), and stopping criterion (Fig. 2c). In particular, we show the Pareto front: among all configurations that share a particular value of the plotted hyper-parameter, we take the maximum ranking accuracy over all remaining hyper-parameters not displayed in the corresponding plot.

Regarding ranking approaches (Fig. 2a), using the predicted \(s_3\)-induced runtime-label ranking consistently outperforms the partially observed PAR-2 ranking for each possible value of the trade-off parameter \(\delta \). This outcome is expected since selection decisions are not random. For example, we might sample more instances of one family if it benefits discrimination of solvers. While the partially observed PAR-2 score is skewed, the prediction model can account for this.

Regarding the selection approaches (Fig. 2b), uncertainty sampling performs best in most cases. However, information-gain sampling is beneficial if runtime is strongly favored (small \(\delta \); runtime fraction less than 5 %). This result aligns with our expectations: Information-gain sampling selects instances that maximize the expected reduction in entropy. This means we sample instances revealing similarities between solvers rather than differences, which helps to build a confident model quickly. However, the method cannot select helpful instances for distinguishing solvers later. Random sampling performs reasonably well but is outperformed by uncertainty sampling in all cases, showing the benefit of actively selecting instances based on a prediction model.

Regarding the stopping criteria (Fig. 2c), the ranking stopping criterion performs most consistently well. If accuracy is strongly favored (very high \(\delta \)), the Wilcoxon stopping criterion performs better. The subset-size stopping criterion performs reasonably well but does not improve beyond a certain accuracy because of sampling a fixed subset of instances.

Fig. 3. Scatter plot comparing different instantiations of trade-off parameter \(\delta \) for our active-learning framework on the hyper-parameter-tuning dataset. The x-axis shows the fraction of runtime \(O_{\textrm{rt}}\) of the sample, while the y-axes show the fraction of instances sampled and ranking accuracy, respectively. The color indicates the weighting between different optimization goals \(\delta \in \left[ 0, 1\right] \). The larger \(\delta \), the more we favor accuracy over runtime.

Fig. 3a shows an interesting consequence of weighting our optimization goals: If we want a rough estimate of a solver’s performance quickly (low \(\delta \)), approaches favor selecting many easy instances. In particular, the fraction of sampled instances is larger than the fraction of runtime. With many observations, it is easier to build a model. If we instead want a good estimate of a solver’s performance in a moderate amount of time (high \(\delta \)), approaches favor selecting few, difficult instances. In particular, the fraction of instances is smaller than the fraction of runtime.

Furthermore, Fig. 3b reveals which values of \(\delta \) make the most sense. The range \(\delta \in \left[ 0.2, 0.8\right] \) corresponds to the points with a runtime fraction between 0.03 and 0.22. We consider this region to be most promising, analogous to the elbow method in cluster analysis [18].

5.2 Full-Dataset Evaluation

Having selected the most promising hyper-parameters, we run our active-learning experiments on the complete Anniversary Track dataset (5355 instances). The aforementioned range \(\delta \in \left[ 0.2, 0.8\right] \) only results in two distinct configurations. The best-performing approach for \(\delta \in \left[ 0.2, 0.7\right] \) uses the predicted runtime-label ranking, information-gain sampling, and ranking stopping criterion. It can predict a new solver’s PAR-2 ranking with 90.48 % accuracy (\(O_{\textrm{acc}}\)) in only 5.41 % of the full evaluation time (\(O_{\textrm{rt}}\)). The best-performing approach for \(\delta \in (0.7, 0.8]\) uses the predicted runtime-label ranking, uncertainty sampling, and ranking stopping criterion. It can predict a new solver’s PAR-2 ranking with 92.33 % accuracy (\(O_{\textrm{acc}}\)) in only 10.35 % of the full evaluation time (\(O_{\textrm{rt}}\)).

Table 2. Performance comparison (on the full dataset) of the best-performing active-learning approaches (AL), random sampling of the same runtime fraction with 1000 repetitions (Random), and statically selecting the instances most frequently sampled by active-learning approaches (Most Freq.)

Table 2 shows how both active-learning approaches (column AL) compare against two static baselines: Random samples instances until it reaches roughly the same fraction of runtime as the AL benchmark sets. We repeat sampling 1000 times and report average results. Most Freq. uses a static benchmark set consisting of those instances most frequently sampled by our active learning approach. In particular, we consider the average sampling frequency over all solvers and Pareto-optimal active-learning approaches.

Both our AL approaches perform better than random sampling. However, the performance differences are not significant according to a Wilcoxon signed-rank test with \(\alpha = 0.05\) and also depend on the fraction of sampled runtime (cf. Fig. 2b). A clear advantage of our approach is, though, that it indicates when to stop adding further instances, depending on the trade-off parameter \(\delta \). While the active-learning results are less strong on the full dataset than on the smaller tuning dataset, they still show the benefit of making benchmark selection dependent on the solvers to distinguish.

A static benchmark using the most frequently AL-sampled instances performs poorly, though, compared to active learning and random sampling. This outcome is somewhat expected since the static benchmark does not reflect the right balance of instance families: Families whose instances are selected roughly uniformly at random by AL, e.g., varying across solvers, appear less often in this benchmark than families where some instances are sampled much more often than others.

5.3 Instance-Family Importance

Fig. 4. Scatter plot showing the importance of different instance families to our framework on the full dataset. The x-axis shows the frequency of instance families in the dataset. The y-axis shows the average frequency of instance families in the samples selected by active learning. The dashed line represents families that occur with the same frequency in the dataset and samples.

Selection decisions of our approach also reveal the importance of different instance families to our framework. Fig. 4 shows the occurrence of instance families within the dataset and the benchmarks created by active learning. We use the best-performing configurations for all \(\delta \in \left[ 0, 1\right] \) and examine the selection decisions of the active-learning approach on the SAT Competition 2022 Anniversary Track dataset [2]. While most families appear with the same fraction in the dataset and the sampled benchmarks, a few outliers need further discussion. Problem instances of the families fpga, quasigroup-completion, and planning are especially helpful to our framework in distinguishing solvers: instances of these families are selected over-proportionally compared to their share of the full dataset. In contrast, instances of the largest family, i.e., hardware-verification, appear with roughly the same fraction in the dataset and the sampled benchmarks. Finally, instances of the family cryptography are less important for distinguishing solvers than their large weight in the dataset suggests. A possible explanation is that these instances are very similar, such that a small fraction of them is sufficient to estimate a solver’s performance on all of them.

6 Conclusions and Future Work

In this work, we have addressed the New-Solver Problem: Given a new solver, we want to find its ranking amidst competitors. Our approach provides accurate ranking predictions while needing significantly less runtime than a complete evaluation on a given benchmark set. On data from the SAT Competition 2022 Anniversary Track, we can determine a new solver’s PAR-2 ranking with about 92 % accuracy while only needing 10 % of the full-evaluation time. We have evaluated several ranking algorithms, instance-selection approaches, and stopping criteria within our sequential active-learning framework. We also took a brief look at which instance families are the most prevalent in selection decisions.

Future work may compare further sub-routines for ranking, instance selection, and stopping. Additionally, one can apply our evaluation framework to arbitrary computation-intensive problems, e.g., \(\mathcal{N}\mathcal{P}\)-complete problems other than SAT, as all discussed active-learning methods are problem-agnostic. Such problems share most of the relevant properties of SAT solving, i.e., there are established instance features, a complete benchmark evaluation is expensive, and traditional benchmark selection requires expert knowledge.

From the technical perspective, one could formulate runtime discretization as an optimization problem rather than addressing it empirically. Further, a major shortcoming of our current approach is the lack of parallelization, as it selects instances one at a time. Benchmarking on a computing cluster with n cores benefits from having batches of n instances. However, bigger batch sizes n impede active learning. Also, it is unclear how to synchronize instance selection and updates of the prediction model without wasting too much runtime.