Learning to select SAT encodings for pseudo-Boolean and linear integer constraints

Many constraint satisfaction and optimisation problems can be solved effectively by encoding them as instances of the Boolean Satisfiability problem (SAT). However, even the simplest types of constraints have many encodings in the literature with widely varying performance, and the problem of selecting suitable encodings for a given problem instance is not trivial. We explore the problem of selecting encodings for pseudo-Boolean and linear constraints using a supervised machine learning approach. We show that it is possible to select encodings effectively using a standard set of features for constraint problems; however we obtain better performance with a new set of features specifically designed for the pseudo-Boolean and linear constraints. In fact, we achieve good results when selecting encodings for unseen problem classes. Our results compare favourably to AutoFolio when using the same feature set. We discuss the relative importance of instance features to the task of selecting the best encodings, and compare several variations of the machine learning method.


Introduction
Many constraint satisfaction and optimisation problems can be solved effectively by encoding them as instances of the Boolean Satisfiability problem (SAT).Modern SAT solvers are remarkably effective even with large formulas, and have proven to be competitive with (and often faster than) CP solvers (including those with conflict learning).However, even the simplest types of constraints have many encodings in the literature with widely varying performance, and the problem of predicting suitable encodings is not trivial.
We explore the problem of selecting encodings for constraints of the form n i=1 q i e i k where ∈ {<, ≤, =, =, ≥, >}, q 1 . . .q n are integer coefficients, k is an integer constant and e i are decision variables or simple expressions containing a single decision variable (such as negation).We separate these constraints into two classes: pseudo-Boolean (PB) when all e i are Boolean; and linear integer (LI) when there exists an integer expression e i .We treat these two classes separately, selecting one encoding for each class when encoding an instance.
We select from a set of state-of-the-art encodings, including all eight encodings of Bofill et al [1,2,3] which are extensions of the Generalized Totalizer [4], Binary Decision Diagram [5], Global Polynomial Watchdog [6], Local Polynomial Watchdog [6], Sequential Weight Counter [7], and n-Level Modulo Totalizer [8].All eight of these encodings are for pseudo-Boolean constraints combined with at-most-one (AMO) sets of terms (where at most one of the corresponding e i Boolean expressions are true in a solution).The AMO sets come from an integer variable or are detected automatically [9] as described in Section 2.1.We also use an encoding named Tree which is described in this paper.

Preliminaries
A constraint satisfaction problem (CSP) is defined as a set of variables X, a function that maps each variable to its domain, D : X → 2 Z where each domain is a finite set, and a set of constraints C. A constraint c ∈ C is a relation over a subset of the variables X.The scope of a constraint c, named scope(c), is the set of variables that c constrains.A constraint optimisation problem (COP) also minimises or maximises the value of one variable.A solution is an assignment to all variables that satisfies all constraints c ∈ C.
Boolean Satisfiability (SAT) is a subset of CSP with only Boolean variables and only constraints (clauses) of the form (l 1 ∨ • • • ∨ l k ) where each l i is a literal x j or ¬x j .A SAT encoding of a CSP variable x is a set of SAT variables and set of clauses with exactly one solution for each value in D(x).A SAT encoding of a constraint c is a set of clauses and additional Boolean variables A, where the clauses contain only literals of A and of the encodings of variables in scope(c).An encoding of c has at least one solution corresponding to each solution of c.Also, an encoding of c has no solutions corresponding to a non-solution of c.
Generalised arc consistency (GAC) for a constraint c means that for a given partial assignment, all values are removed from the domain of each variable in scope(c) if they cannot appear in any extended assignment satisfying c.A SAT encoding of c has the property GAC iff unit propagation of the SAT encoding of c results in the following correspondence: for each variable x i ∈ scope(c), the set of remaining solutions of the encoding of x i corresponds to the set of values in D(x i ) after GAC has been enforced on c.
A SAT encoding of c has the Consistency Checker (CC) property [3] iff unit propagation of the SAT encoding of c will derive false when the SAT partial assignment corresponds to a CSP partial assignment that cannot be extended to a full assignment that satisfies c.

Learning to Choose SAT Encodings
First we describe the palette of encodings for PB and LI constraints, then our approach to selecting encodings using instance features and machine learning.

SAT Encodings
Recall that we are considering constraints of the following form where q 1 . . .q n are integer coefficients, k is an integer constant and e i are decision variables or simple expressions containing one variable.n i=1 q i e i k where ∈ {<, ≤, =, =, ≥, >} An expression e i may be: an integer or Boolean variable x i ; a negated Boolean variable ¬x i ; or a comparison (x i #k i ) where # ∈ {<, ≤, =, =, ≥, >}, x i is an integer variable or Boolean literal, and k i is a constant.We refer to the set of values that e i can take as D(e i ), extending the notation D(x i ).We distinguish between top-level constraints that must be satisfied in all solutions, and nested constraints that are contained in a logic operator such as ∨ or →.
Initial normalisation steps are applied to all constraints of this form, regardless of the choice of encoding.All < constraints are converted to ≤, and > constraints to ≥ by adjusting k.If the coefficients q have a greatest common divisor (GCD) greater than 1 then all coefficients are divided by the GCD.When the comparator is ≤ then k ← k/GCD , and when the comparator is ≥ then k ← k/GCD .When ∈ {=, =}, k ← k/GCD unless k/GCD is non-integer in which case the constraint is evaluated to true or false as appropriate.
If the comparator is = then a new auxiliary variable a is created whose domain is the set of all values the sum may take.A new top-level LI equality constraint is introduced, as follows.
The original constraint is replaced with a = k, and if it is top-level then k will be removed from the domain of a.
Any constraints that are not top-level (i.e. are nested in another expression such as a disjunction) are always encoded with the Tree encoding in SAVILE ROW, described in Section 2.1.3.For top-level constraints, we have nine encodings available and each can be applied to either PB or LI constraints, producing 81 configurations.

SAT Encoding Preliminaries
First we describe the encoding of CSP variables in SAVILE ROW.Boolean variables and integer variables that have two values are encoded with a single SAT variable.For other integer variables, SAVILE ROW generates the direct [16] or order [17] encoding (or both) as required to encode the constraints on that variable.We provide these details for completeness and without claiming novelty.
The direct encoding of a variable x has one SAT variable representing each value in D(x).It provides a SAT literal, or a constant true or false, for a proposition (x = a) or (x = a) for any integer a, with the literal denoted x = a or x = a .We use the 2-product encoding [18] to ensure x is assigned at most one (AMO) value, and a single clause to ensure x is assigned at least one (ALO) value.
The order encoding of x introduces one SAT variable for each of the following propositions.
The order encoding provides a SAT literal (or constant true or false) for a proposition (x ≤ a) or its negation (x > a) for any integer a, with the literal denoted x ≤ a or x > a .For each pair of values a clause is generated to ensure coherence of the encoding: x When the direct and order encodings are both required they are channelled together with the following set of clauses.
The AMO constraint of the direct encoding is omitted in this case.
There are two equivalences that can be used to slightly improve the direct-order encoding: x ≤ a ↔ x = a for minimum value a, and x > b − 1 ↔ x = b for maximum value b.By using these two equivalences we can remove two SAT variables so 2|D(x)| − 3 variables are generated. PRE-PRINT

PB(AMO) Encodings
The first eight are encodings of PB(AMO) constraints [1,2,3], which are pseudo-Boolean constraints with nonintersecting at-most-one (AMO) groups of terms (where at most one of the corresponding e i expressions are true in any solution).Encodings of PB(AMO) constraints can be substantially smaller and more efficient to solve than the corresponding PB constraints [1,2,9,3].
For the eight PB(AMO) encodings the constraints must be placed in a normal form where all coefficients are positive, only ≤ is allowed, and each e i must be a Boolean expression.The first normalisation step is to decompose equality into two inequalities ≤ and ≥.Second, any ≥ constraints are converted to ≤ by negating all q i coefficients and k.At this stage, all constraints are in ≤ form.
Non-trivial integer terms q i e i (where e i is not Boolean and D(e i ) = {0, 1}) are dealt with as follows.If q i < 0 then q i ← −q i and e i ← −e i .The integer variable contained in e i is required to have a direct SAT encoding.Finally, q i e i is replaced with an AMO group of |D(e i )| − 1 terms representing each value of q i e i except the smallest (which is cancelled out by adjusting k).For example, if q i = 2 and e i has values {1, 2, 3} then k ← k − 2 and q i e i would be replaced with the following AMO group of two terms.
2(e i = 2) + 4(e i = 3) At this point all terms q i e i in the constraint must be Boolean or have domain D(e i ) = {0, 1}.If q i < 0, the term is replaced with −q i (e i = 0) or −q i (¬e i ) (as appropriate for the type of e i ) and k ← k − q i .Otherwise, the term is replaced with q i (e i = 1) or remains unchanged (as appropriate for the type).
Automatic AMO detection [9] (which applies constraint propagation to find AMO groups among the Boolean terms of the original constraint) is enabled in our experiments.Automatic AMO detection has been shown to substantially improve solving time in some cases [9].It partitions the Boolean terms of the constraint into AMO groups, and (as described above) each integer term becomes an AMO group also.Finally, we exactly follow the PB(AMO) normalisation rules of Bofill et al [3] prior to encoding.
The PB(AMO) encodings are as follows: MDD The Multi-valued Decision Diagram encoding [1] (a generalisation of the BDD encoding for PB constraints [5]) uses an MDD to encode the PB(AMO) constraint.Each layer of the MDD corresponds to one AMO group.BDDs and MDDs are a popular choice for encoding sums to SAT since they can compress equivalent states in each layer.
GGPW The Generalized Global Polynomial Watchdog encoding [2] (generalising GPW [6]) is based on bit arithmetic and is polynomial in size.It has the CC property but not GAC.
GLPW The Generalized Local Polynomial Watchdog encoding [3] (generalising LPW [6]) is similar to GGPW but has the GAC property.However, while being polynomial in size, it is often too large to be practical.
GGT The Generalized Generalized Totalizer [2] encodes the PB(AMO) constraint with a binary tree, where the leaves represent the AMO groups and each internal node represents the sum of all leaves beneath it.GGT extends the Generalized Totalizer [4].The binary tree is constructed using the minRatio heuristic [3] that aims to minimise the numbers of values of internal nodes.
GGTd is identical to GGT except that a balanced binary tree is used.
RGGT The Reduced Generalized Generalized Totalizer [3] attempts to improve on GGT by compressing equivalent states at its internal nodes.The minRatio heuristic is used to construct the tree.
GSWC The Generalized Sequential Weight Counter [2] (based on the Sequential Weight Counter [7]) encodes the sum of each prefix sub-sequence of the AMO groups.
GMTO The Generalized n-Level Modulo Totalizer [3] (based on the n-Level Modulo Totalizer [8]) is an extension of the Generalized Totalizer that represents values of the internal nodes in a mixed-radix base.GMTO can be substantially more compact than other PB(AMO) encodings.
Where the original constraint c contains no integer terms and the comparator of the original constraint is neither = nor = then these six encodings will enforce GAC on c. GGPW has the CC property.Where the original constraint c contains no integer terms and the comparator of the original constraint is not = or = then GGPW will have the CC property for c.The Tree encoding is related to the Generalized Totalizer (GT) [4] encoding of PB constraints.However, Tree is able to natively encode equalities (in addition to inequalities), and it natively supports integer variables and negative coefficients.Tree does not support arbitrary AMO groups of terms (i.e. it is not a PB(AMO) encoding) so it does not benefit from automatic AMO detection [9].
Given a constraint c : n i=1 q i e i k where ∈ {≤, =, ≥}, the first step is to normalise each term to have a lower bound of 0. Each term q i e i is shifted such that its smallest value becomes 0, and k is adjusted accordingly.Shifting a term q i e i is implemented with a linear view that introduces no SAT variables or clauses when c is encoded.
The encoding process has two main stages.In the first stage, a tree is constructed from c, with each term (integer or Boolean) represented by a leaf.The tree is binary with the exception of the root node which may have three children.Each internal non-root node has a corresponding auxiliary integer variable that represents the sum of its two child nodes.A constraint is generated to connect each internal non-root node to its two children.The process to construct the tree maintains a set S of terms, initially containing all terms in the original sum.While S contains more than three terms, two terms w i y i and w j y j are removed from S and replaced with one new term a, where a is a new auxiliary variable.A constraint w i y i + w j y j − a 0 is generated.Finally S contains at most three terms and the last constraint is generated: S k.
For each new variable a (introduced for the sum of w i y i and w j y j ), the domain of a is the set of values obtainable by adding any value of w i y i to any value of w j y j .If ∈ {≤, =}, values greater than k are removed from the domain of a.The shape of the tree and domain sizes of the internal variables are determined by the choice of terms to remove from S at each step.Tree uses a simple heuristic that chooses a term with the smallest number of possible values, breaking ties in favour of the smallest range from minimum to maximum value.The aim of the heuristic is to minimise the number of values of the auxiliary variables.To complete the first stage, if is = then each equality constraint is broken down into one ≤ and one ≥ constraint.
As an example, consider the following constraint (where each variable x i is Boolean): The tree generated for this constraint is shown in Figure 1.In this case the terms all have two possible values and the heuristic generates a balanced tree.Four auxiliary variables are generated as shown in Figure 1, and five constraints are generated as follows.
The second stage is to encode the variables and constraints to SAT.The order encoding is required for integer leaf nodes and is used for all a i variables on the internal nodes.Constraints are encoded using an improved version of the original order encoding of Tamura et al [19].The original order encoding builds a set of clauses from the intervals {min(D(q i e i )) − 1 . . .max(D(q i e i ))} (for all i).Tamura, Banbara, and Soh [20] presented an improved order encoding that uses the sets D(q i e i ) and is more compact than the original version.The version presented here also uses the sets D(q i e i ), and by taking the entire set into account it avoids generating redundant clauses compared to the original version.First, terms are sorted by increasing number of possible values.For a constraint of arity r, the PRE-PRINT set of clauses is as follows: where the set of tuples B is defined as follows.
For example, we encode the first constraint as follows (where false literals and one trivially true clause have been removed).
The original order encoding [19] would generate two additional (redundant) clauses, as follows.
When applied to a PB ≤ constraint, Tree is similar to GT in most respects.The main difference occurs at the root node, which for Tree has 3 children and no auxiliary SAT variables, while for GT the root node has 2 children and an auxiliary SAT variable for each possible value of the sum that lies between 0 and k + 1.On the example constraint, Tree generates 10 SAT variables (in addition to the original x 1 . . .x 7 ) and 30 clauses.GT with the minRatio heuristic generates 20 SAT variables and 36 clauses, while GT with a balanced binary tree produces 21 SAT variables and 54 clauses.
When applied to an equality constraint, Tree generates one set of auxiliary variables at each internal (non-root) node and these are used to encode both ≤ and ≥ parts of the constraint.In contrast, when using any PB(AMO) encoding, the constraint is decomposed into two inequalities which are then encoded entirely separately.As a result Tree can be more compact in terms of SAT variables.Replacing ≤ with = in Equation ( 1), the Tree encoding introduces 10 variables and 53 clauses, whereas GT with minRatio generates 48 variables and 82 clauses.
Tree has the GAC property when the comparator of the original constraint is neither = nor =.

Discussion
The set of 9 encodings is diverse but not exhaustive.Abío et al proposed a BDD-based encoding for linear constraints [21], however it has been directly related to the MDD encoding [22].In addition to MDD-based encodings, Abío et al propose two further encodings for linear constraints [23]: one based on sorting networks (SN), which is related to the GPW encoding, and another log-based encoding BDD-Dec.Other log encodings such as the one used by Picat-SAT [24] may also be more effective in some cases.
For our experiments we use an extended version of SAVILE ROW 1.9.1 [25].All constraints other than PB and LI use the default encoding as described in the SAVILE ROW manual.

Instance Features
Our task of selecting the best SAT encodings relies on extracting features of constraint problems in order to predict a performant encoding configuration.We initially use existing generic CSP instance features, and then go on to define features which relate directly to the PB and LI constraints in a given CSP instance.The resulting featuresets are described below.
f2f We use the fzn2feat tool [26] to extract 95 static instance features relating to the number and types of variables and their domains, the types and sizes of constraints and features of the objective in optimisation problems.
The full list of features can be found at https://github.com/CP-Unibo/mzn2feat;some features were PRE-PRINT not applicable, e.g.there are no float variables in Essence Prime and SAVILE ROW does not produce all the same annotations.The fzn2feat tool requires FlatZinc models as input -we generate these using SAVILE ROW's standard FlatZinc back-end.
f2fsr We also re-implement the f2f features as closely as possible within SAVILE ROW, applied to the model directly before encoding to SAT.
lipb We introduce a new set of 45 features describing the PB constraints in a problem instance.We also extract these for LI constraints, giving 90 new features in total.These features are listed in Table 1.
combi We combine the f2fsr and lipb features.

Problem Corpus
We begin with the 65 constraint models with a total of 757 instances from a recent paper [11] in order to work with a wide variety of problem classes.An added advantage is that the models are written in Essence Prime, SAVILE ROW's input language.Unfortunately this collection has a very skewed distribution of instances per problem class, ranging from just 1 to 100.We mitigate this in two ways: firstly, we limit the number of instances per class to 50 by taking a random sample where more instances are available; secondly, we add instances to existing classes where it is easy, such as when instance parameters are just a few integers.We also add two problem classes from recent XCSP3 competitions [27]: the Balanced Academic Curriculum Problem (BACP) and the Hamiltonian Cycle Problem (HCP).Some of the problems are filtered out during the data cleaning phase; we give details of this process and the resulting corpus in Section 3.1.2and Table 2.

Training
We evaluated several classifier models from the scikit-learn library [28], including decision trees and forests of extremely randomised trees, as well as the XGBoost classifier [29].We also investigated training various regressors to predict runtime.We found that a random forest classifier performs best for our purposes.The scikit-learn implementation is based on Breiman's random forests [30], but uses an average of predicted probabilities from its decision trees rather than a simple vote.
We follow the method of Probst et al. [31] who investigated hyperparameter tuning for random forests and concluded that the number of estimators should be set sufficiently high (we use 200) and that it is worth tuning the number of features, maximum tree depth, and sample size.We use randomised search with 50 iterations and 5-fold crossvalidation to tune the hyperparameters.We experimented with more tuning iterations but it did not lead to improved prediction quality.If a classifier makes a poor prediction, the consequences vary.It is possible that the chosen encodings lead to a running time which is very close to that of the ideal choice; the opposite is also true and misclassification can be very expensive.To address this issue, we follow a similar approach to the pairwise classification used in AUTOFOLIO [14]: we train a random forest model for each of the possible pairs of encoding configurations.When making predictions, each model chooses between its two candidates.The configuration with most votes is chosen; if two or more configurations have equal votes, we select the one which produced the shortest total running time over the training set.This approach effectively creates a predicted ranking of configurations from the features and often leads to better prediction performance than using a single random forest classifier.
To facilitate the pairwise training and prediction approach, we reduce our selection of encoding combinations from 81 (9 PB encodings × 9 LI encodings) to a portfolio of 6, thus needing to train just 15 models (rather than 3240 if we had used all 81 choices).We seek to retain performance complementarity as described in [32] from a much reduced portfolio size.The portfolio is built from the timings in the training set using a greedy approach as follows.We begin with a single encoding configuration in the portfolio and then successively add the remaining configuration which would lower the virtual best PAR10 time by the biggest margin (PAR10 is defined in Section 3.1.2).We do this until we have a portfolio of 6.We repeat the process using each of the 81 configurations as the starting element and finally select the best-performing portfolio from these 81. Figure 2 shows that this portfolio reduction has a small impact on the virtual best performance across our corpus -the virtual best time for a portfolio of size 6 is within 16% of the time achievable with all configurations.
In addition to the pairwise voting scheme, we implement two further customisations when training the classifiers: Sample Weights Firstly we aim to give more importance to instances which are harder (with a longer virtual best runtime) and where the encoding choice makes a bigger difference.Each instance is given a positive integer weight w according to the formula where t V B and t V W are the very best and very worst runtimes respectively for the instance.Custom Loss Secondly, we provide a custom loss function for the cross-validation used during hyperparameter tuning.The custom loss function simply returns the difference in time between the runtime of the chosen encoding configuration and the virtual best for that instance.
To conduct a more complete comparison we also implement two additional alternative setups.
Single Classifier We try using a single random forest classifier with the same portfolio of 6 configurations (combining PB and LI encodings).In terms of hyperparameter tuning, we try it both with a generous allowance of 15 PRE-PRINT times the iterations allocated to each classifier in the pairwise setup, and with a more restricted budget of 60 iterations.
Separate LI/PB Choice Secondly we modify the pairwise setup to make a separate prediction for LI and PB constraints, choosing a portfolio of 6 encodings for each encoding type.This approach has its difficulties because when labelling the dataset with the best encoding for one type of constraint, the encoding of the other constraint type must be chosen somehow.We address this by setting the other constraint type to the single best for the training set.This setup is more expensive in terms of training time, effectively repeating the entire process for each constraint type under consideration, rather than taking advantage of a complementary portfolio across two (or more) encodings.

Method
We provide an overview of our experimental process in Figure 3.Briefly, the method consists of: • Running SAVILE ROW with different encoding choices in order to collect runtime information and to extract features.
• Cleaning the resulting dataset.
• Carrying out 50 split, train, predict cycles with each of our machine learning setups, using the same train/test splits in order to allow fair comparison across the setups.
• Using the predicted encoding choices to identify the resulting runtimes.
• Aggregating the "predicted" runtimes and calculating reference times for comparison.
The experimental design is described in more detail below.

Solving Problem Instances and Extracting Features
We run SAVILE ROW on each instance in the corpus with each of the 81 encoding configurations.The CNF clause limit is set to 5 million and the SAVILE ROW time-out to 1 hour.We switch on automatic detection of At-Most-One constraints [9].We choose Kissat as our SAT solver as it formed the basis of the top three performers in the 2021 SAT competition [33].We use the latest release available at the time, sc2021-sweep [34], with default settings and separate time limit of 1 hour.The experiment is run on the Viking research cluster with Intel Xeon 6138 20-core 2.0 GHz processors; we set the memory limit for each job to 8 GB.We carry out 5 runs (with distinct random seeds) for each configuration to average out stochastic behaviour of the solver.
To extract the features we run each problem instance once with the SAVILE ROW feature extractor and once to generate standard FlatZinc (using the -flatzinc flag) followed by fzn2feat [26].We record the time taken to extract the features.

Cleaning the Dataset
We calculate the median runtime over 5 runs for each instance and encoding configuration.We mark a result as timed out if the total runtime (SAVILE ROW + Kissat) exceeds 1 hour.To decide what penalty to apply to runs which time out, we consider all instances for whom every configuration finishes within the allocated time.The mean worst/best ratio is 13.06 and the median ratio is 4.91.When we consider only those problem instances which are not solved with any encoding configuration in less than 1 minute, then the worst/best mean ratio is 9.18 so we believe it fair to penalise a time-out by a factor of 10.We therefore choose to use PAR10, i.e. assigning 10 hours to any result which takes longer than our 1 hour time-out limit.This is the same penalty applied in other related literature [12,14,13] which addresses the problem of selecting SAT encodings for CSPs.
Having applied PAR10, we filter the corpus as follows.We drop instances if they contain no PB or LI constraints.We also exclude any instances which end up requiring no SAT solving -SAVILE ROW can sometimes solve a problem in pre-processing through its automatic re-formulation and domain filtering.Finally we exclude instances for which all configurations time out.At this point, 614 instances of 49 problem classes remain in the corpus;

Splitting the Corpus, Training and Predicting
For each of our classifier setups and our four featuresets, we run a split, train, predict cycle 50 times.We use seeds 1 to 50 to co-ordinate the splits so that we compare the prediction power of the different feature sets and setups using the same training and test sets.
For each cycle, we aim for an 80:20 train to test split using two approaches.The split-by-instance approach simply selects instances at random with uniform probability -with this approach, instances of any given problem class are usually found in both the training and test sets.The split-by-class approach also splits problems randomly but ensures that all instances of a problem class end up either in the training or the test set, meaning that predictions are being made on unseen problem classes.This second method can lead to the test set being slightly larger than 20%.
Prior to training the classifiers, the portfolio of available configurations is built based on the runtimes of the training set.Then the training instances are labelled for each pairwise classifier with the configuration that has the shorter runtime.For each pairwise classifier, we search the hyperparameter space and fit the model to the training set.Finally, we make predictions using the test set ready for evaluation.

Evaluating the Performance of Predicted Encodings
To evaluate the impact of using the learnt encoding choices, we calculate two benchmarks commonly used in algorithm selection [32]: the Virtual Best (VB) time is the total time taken to solve the instances in a test set if we always made the best possible choice from all 81 configurations; and the Single Best (SB) time is the total time taken to solve the PRE-PRINT instances in a test set when using the single configuration that minimises the total solving time on the training set.In addition we refer to: the time taken using SAVILE ROW's default (Def) configuration, which is the Tree encoding for both PB and LI constraints, and finally the Virtual Worst (VW) time to indicate the overall variation in performance of the encoding configurations in the portfolio.

Results and Discussion
In Table 3 we report the total PAR10 runtime across all 50 test sets for the predicted encoding configurations from each of the six classifier setups, four feature sets and two splitting methods.The predicted runtimes include the time taken to extract the features. 1For ease of comparison, we report the runtime as a multiple of the virtual best time.For example, a figure of 2.00 in Table 3 means that the predictions across the 50 test sets led to a total runtime which was twice as long as the runtime achieved if we always chose the best available configuration.We remind the reader that the two splitting strategies (by class vs. by instance) yield different test sets for the 50 seeds, as explained in Section 3.1.3.

On Performance and Features
We found that the machine learning predictors work well, clearly outperforming the SB and Def configurations.These performance improvements can be achieved with predictions based on the generic CSP feature sets f2f and f2fsr, but are even better when using the new specialised features (lipb) for the majority of ML setups we implement and especially so when predicting encodings for unseen problem classes.Sometimes the best results are obtained by the combined featureset combi, again more often for unseen problem classes.
We argue that the split-by-class approach is both a more difficult challenge and closer to a real-world deployment, where a new instance to solve may belong to an unseen problem class.However, both approaches are realistic, so we choose the Pairwise-Voting Combined-Encoding Classifier with Sample Weighting and Custom Loss as our preferred setup for the rest of this paper because it is the best in the split-by-class task and performs competitively in the splitby-instance setting.In a recent survey, Kerschke et al. state that "State-of-the-art per-instance algorithm selectors for combinatorial problems have demonstrated to close between 25% and 96% of the VBS-SBS gap" [32].In these terms, our preferred setup using lipb features closes 59% of the VB-SB gap for unseen classes and 73% for seen classes.

# of timeouts
Because the distribution of runtimes in the split-by-class trials is skewed, we use a non-parametric statistical test to report on the significance of the improvement achieved by using our classifier.We apply the Wilcoxon Signed-Rank test for paired samples on the mean times from the SB selector choices and our preferred selector using the lipb features.We obtain a p value of 6.5 × 10 −5 (well below even a 1% significance level) and an effect size of −0.62 using the rank-biserial correlation method which would usually be interpreted as a medium to large effect.
Figure 4 summarises the performance for our preferred setup, showing the distribution of mean predicted PAR10 times per test set.The mean values are marked with diamonds and correspond to the numbers reported in Table 3, albeit not scaled.We note that when splitting by instance, the performance across test sets is fairly symmetrical, with very similar means and medians for all selectors.However, when it comes to splitting by class, the distributions show positive skew -this likely comes from test sets where there are many instances from an unseen problem class for which the classifier struggles to make the best choices.
We present an additional visualisation of selector performance in Figure 5, showing the number of instances solved as time is increased.With both splitting strategies we observe that the featuresets emerge in the order lipb, combi, f2fsr, f2f with the lipb-predicted encodings enabling most instances to be solved within the timeout.When splitting by instance, the four featuresets follow a very similar trajectory; however when splitting by class a clear advantage is shown for the specialised featuresets.The single best (SB) performs very differently in the two settings.When splitting by class, there are occasions when SB (derived from the training set) performs considerably worse than the Tree Tree default configuration, leading to the default configuration outperforming SB.
All featuresets lead to similar performance in the split-by-instance setting.However, when predicting for unseen problem classes, the mean runtimes are clearly better when using the specialised featuresets.This is also reflected in the number of timeouts, where the lipb featureset gives the most robust predictions.It is interesting that when splitting by class the generic featureset f2f leads to a low median runtime and low median number of timeouts but is strongly outperformed by the specialised lipb features in terms of means, suggesting that the specialised features help to avoid more costly misclassifications; i.e. the generic features help to "win" on easier problems, but don't do so well on harder ones.
A further insight is provided by Figure 6 which shows the accuracy of predictions across the 50 training and test sets -in this figure we see how often the pairwise classifier ends up making exactly the "right" decision.In the splitby-instance scenario the prediction accuracy is fairly consistent across feature sets; however, for unseen classes we observe something unexpected.The f2f features lead to the most accurate predictions, but, as discussed above, the overall performance in terms of the resulting runtimes is considerably worse.This means that the specialised features enable the pairwise classifier to produce a safer prediction, so that even when the prediction made is not the absolute best encoding choice, the selected encoding provides performance closer to the best on average.

On Encoding Choices
In Figure 7 we show the frequency with which different encoding configurations are predicted.Recall that although we use a portfolio of 6 encodings, the portfolio is generated from the training set; consequently the portfolios are different across the 50 sets.The VB column shows a smooth distribution of ideal encoding choices from the full range of encodings available.In both splitting scenarios we observe five configurations preferred by the classifiers: RGGT Tree, GGPW GGPW, Tree MDD, GGPW GGTd and GGT MDD.Additionally, in the split-by-class task GGPW Tree is also used very often.The GGPW encoding for LI constraints features heavily in these choices and can therefore be considered a very good single choice in many settings; remember that when illustrating the building of a portfolio using the whole corpus, the single-choice winning configuration in Figure 2 was GGPW Tree.In the distribution of predictions made, there is more variety in the PB encoding selected, with five different choices featuring in the six top configurations mentioned.
Of all the encoding distributions in the split-by-class task, it is clear that the predictions made using the lipb features are closest to the VB in the sense of making the broadest variety of choices.Of course in every case the predictions can only be made from the portfolio determined at training time, so in each of the 50 cycles the classifier has to choose between 6 encoding configurations.It is evident that using the specialised features allows our machine learning setup to remain safe (i.e.not choose encodings which lead to very bad performance) while taking enough "risks" to leverage the complementarity of the portfolios.

Comparison with AUTOFOLIO
To further assess the value of our approach, we compare with AUTOFOLIO [14], a sophisticated algorithm selection approach which automatically configures algorithm selectors and "can be applied out-of-the-box to previously unseen algorithm selection scenarios."We use the latest version of AUTOFOLIO (the 2020-03-12 commit which adds a CSV API to the 2.1.2release) with its default settings.We use the algorithm selection component of AUTOFOLIO to make a single prediction per instance, turning on the hyperparameter tuning option; we do not use its pre-solving schedule generation.
To compare as fairly as possible, we train AUTOFOLIO on exactly the same training data, and test on the same test sets as our method.Our system takes less than 5 minutes to train using 8 cores on the cluster, so we allow AUTOFOLIO  1 hour on one core to tune and train.We also run it with a more generous budget of 2 hours and 4 hours to see if its performance improves.The runtime performance based on AUTOFOLIO's predictions is included in our main table of results (Table 3), in the final three rows of each main section.
Our system's predictions lead to better runtimes than AUTOFOLIO's.AUTOFOLIO is designed to be a good general algorithm selection and configuration system able to make good predictions when choosing between different solvers.It is likely that AUTOFOLIO's sophisticated decision-making is better suited to problems that run much longer or to algorithms for which the likelihood of timeouts or non-termination is more of an issue.It is interesting to note that AUTOFOLIO performs better with the lipb features than the generic instance features.Allowing AUTOFOLIO more time for tuning leads to marginal improvement with some feature sets, but in some cases actually results in worse performance, for example with split-by-instance and the combi features.

Feature Importance
We investigate the relative importance of instance features by computing the permutation feature importance.Breiman [30] calculates "variable importance" in random forests by recording the percentage increase in misclassification when each variable (feature) has its values randomly permuted compared to when all features are used.
Permuting the values means that the distribution is preserved but the feature effectively becomes noise.This method is applied at prediction time to the test set, unlike the Gini (entropy) feature importance measure which is calculated during training.We implement this analysis but record the mean increase in PAR10 time when each feature is permuted, effectively giving us the extra runtime cost when the feature is lost.Each feature is randomly permuted 5 times and the mean time increase recorded.The distribution of feature importance thus calculated is shown in Figure 8.We report on the lipb features and on the combi feature set which additionally contains the generic features from f2fsr.We only show the top 20 features ordered by mean importance.
We can see in Figure 8 that for both feature sets the median feature importance in the majority of cases is close to zero, but the mean importance varies considerably.This suggests that there are no features which are dominant on their own -most of the time a missing feature incurs no loss of prediction performance.Indeed sometimes removing a feature can improve performance, as shown some negative in most plots.However, the means the show there cases where each of the top features shown is able to prevent a costly wrong choice.The full extent of the variation of the mean feature importance is shown in Figure 9.
In the top 20 combi features we find a roughly equal mix of generic features and features specific to PB/LI constraints (the names of these features have prefixes pb and li ), when it comes to predicting for known problem classes (i.e.split by instance).This is in keeping with the similar performance of the f2fsr and libp featuresets as shown previously in Table 3.We suspect that when splitting by instance the system is, to a large extent, recognising problem classes rather than picking out traits of PB/LI constraints.
When we predict for unseen problem classes, the proportion of PB/LI to generic features in the top 20 rises to 14:6, supporting the hypothesis that making choices about which encodings to choose for certain constraint types is better served by using features relating to those constraints in the problem instance.
Let us consider the importance of features related to PBs as distinct to LIs.Considering the split-by-instance case in Figure 8 we note that most of the top features relate to LIs (17 to 3); this is almost entirely reversed when splitting by class (16 to 4 in favour of PB features).If the classifiers are just recognising problem classes in the split-by-instance case (hence predicting an encoding which has worked well for other instances in that class), then the LI features are just as suitable as the PB features.In fact we saw that the generic features performed competitively in this setting.However, when it comes to predicting for unseen problem classes, we know that the LI/PB related features are more discriminating as shown in the performance results (Table 3).We have also seen that there is more variation in the best PB encoding than the best LI encoding (with GGPW often winning for LI) and therefore the PB-related features are more prominent in this harder setting.As a final discussion point relating to feature importance, we look in more detail at the top 20 features from the lipb featureset when used on unseen problem classes.We have commented already on the disparity between the mean and median of the permutation feature importance.In Table 4 we list the top 20 features by mean and by median.A positive median value tells us that a feature is more often than not valuable in making good predictions.The mean indicates the overall contribution in a different way, i.e. how much time is lost on average per prediction batch across the 50 cycles.The features highlighted in bold type appear in both top-20 lists and so can help to explain what kinds of features are most helpful.We note that these mostly pertain to size, e.g.pb n sum is the total number of terms across all PBs, pb k max is the highest upper bound k, pb amogs size mn mn is the mean of the mean size of the AMO groups in PBs, pb count is the number of PB constraints.
There are limitations to how much we can read into the permutation feature importance (PFI), in particular because of two factors: PFI considers features in isolation and we have quite a large number of features.A feature α may be discriminating but could be masked by another feature β with which it is highly correlated, so when we permute α's values, there may not be a great loss in prediction performance while the information from β remains.Thus we could wrongly conclude that α is not very valuable.We have also shown that the features in libp and f2fsr can give comparable prediction performance (especially when splitting by instance) even though they consider different aspects of a CSP.
It is very difficult to draw strong conclusions about which features are the most significant.Many algorithms exist to aid feature selection before applying machine learning methods.Although further work in reducing the featuresets could be of value, we have shown that better predictions are achievable when using only the constraint-specific features in the split-by-class setting.

Analysis of the Configuration Space
We end our account of the empirical investigation with a brief analysis of the configuration space in which we are making encoding choices.We have argued already that the task of selecting suitable SAT encodings is not just a simple classification task.We hope the observations in this section shed some light on important considerations when selecting encodings for a set of problems.The LI encoding is represented by the line colours and the PB encoding by the marker shape.We show the encoding configurations which appear in the top 20 both in terms of their contribution to VB and the number of times they are the best (see Table 5).

PRE-PRINT
In Table 5 we list the 20 best performing encoding configurations across the entire cleaned corpus2 using two different criteria.Firstly, according to how often an encoding configuration is the best available; secondly, calculating the proportion of the total VB runtime allocated to a configuration.For instance, we see that Tree Tree is the clear winner in the former league table, with more than twice as many wins as the next entry (73 vs. 31).However, the instances on which it wins have a mean runtime of 46 seconds.We can calculate its contribution to the VB as roughly 73 × 46 ≈ 3400s, approximate because the mean is rounded.On the other hand, RGGT Tree wins fewer times but the relevant instances are almost three times harder with a mean runtime of 127 seconds, making a contribution to the VB of approximately 31 × 127 ≈ 3900s.
Figure 10 shows the performance profile of the encoding configurations which appear in both top 20 lists, and highlighted in bold in Table 5.As before, we see that GGPW is a great choice for LI constraints, appearing in 4 of the top 8 combined performers, whereas there is greater variety in the PB encodings in the best configurations.Figure 10 also demonstrates that these "top" encoding choices are excellent all-rounders -the top 2 are each able to solve over 600 of the 614 instances in the cleaned corpus (in which every instance is solvable within the timeout by at least one encoding configuration from the full set of 81).
In terms of prediction accuracy, choosing Tree Tree most often makes sense, and indeed this is the default encoding provided by SAVILE ROW.This is an excellent choice if the task is to solve very many problems, each of which are individually relatively easy, i.e. would solve in under a minute on our hardware.If, instead, the task is to solve "harder" problems, then, at least according to our corpus of problems, GGPW is a good choice for both LI and PB constraints.
Recall that all encodings except Tree take advantage of AMO groups to reduce the size of the SAT encoding -another important consideration when selecting an encoding.

Related Work
In recent work, new or improved SAT encodings of linear constraints [23] and pseudo-Boolean constraints (combined with AMO constraints) [2,3] have been devised and their performance compared on several benchmark problems.The scaling properties of encodings are studied, and it is suggested that smaller encodings should be used when coefficients or values of integer variables are large.However, to the best of our knowledge the problem of selecting an encoding (particularly for a previously-unseen problem class) has not been systematically addressed for LI or PB constraints.We use the full set of encodings from one recent paper [3] combined with automatic AMO detection [9].
MeSAT [13] and Proteus [12] both select SAT encodings using machine learning.MeSAT has two encodings of LI constraints: the order encoding [19]; and an encoding based on enumeration of allowed tuples of values (which uses a direct encoding of the CSP variables).It is not clear whether high-arity sums are broken up before encoding.MeSAT selects from three configurations using a k-nearest neighbour classifier using 70 CSP instance features.They report high accuracy (within 4% of the virtual best configuration), however the single best configuration is only 18% slower than the virtual best.Proteus makes a sequence of decisions: whether to use CSP or SAT; the SAT encoding; and the SAT solver to use.The portfolio contains three SAT encodings: direct, support, and a hybrid direct-order, however the encoding of LI constraints is not specified [12].Proteus generates each candidate SAT encoding and extracts features of the SAT formula to inform its selection -scaling this approach would be difficult when several constraint types are involved, each with many encoding choices.Results show that the choice of encoding (combined with the choice of SAT solver) is important and that machine learning methods can be effective in their context.
Soh et al [35] have proposed a hybrid encoding of CSP to SAT in which each variable may have the log or order encoding (but not both).A hand-crafted heuristic is used to automatically select one of the two encodings for each variable.A new encoding is defined for linear inequalities that contain a mix of log and order encoded integer variables.The hybrid encoding is shown to outperform both log and order encodings, demonstrating the potential of selecting encodings for individual variables or constraints rather than for an instance.

Conclusions and Future Work
We have shown that it is possible to close much of the performance gap between the single best and virtual best SAT encodings by using machine learning to select encoding configurations based on instance features.We have studied the problem of selecting encodings for instances of previously-unseen classes, a problem that is more challenging and arguably more realistic than the usual setting where training and test instances are drawn from the same set of problem

Figure 3 :
Figure 3: An overview of the steps involved in our experimental investigation.The white boxes with solid borders represent data; the grey boxes represent processes.

Figure 4 :
Figure4: Prediction performance using different featuresets against reference times.We show mean runtime (left) and number of timeouts (right) per test set across the 50 cycles, when using our preferred setup (pairwise combined, sample weights, custom loss).Outliers are indicated with crosses and represent values more that 1.5× IQR outside the quartiles.

Figure 6 :
Figure 6: Distributions of prediction accuracy across the 50 split, train, predict cycles using our preferred setup.

Figure 7 :
Figure7: Frequency of each configuration (li pb) selected across the 50 test sets when using each feature set with our preferred setup.We also show the virtual best (VB) configuration distribution for comparison.The colour indicates the LI encoding and the fill pattern shows the PB encoding.Only the top 12 most used configs (of 81 in total) are shown in the legend.

Figure 9 :
Figure 9: The mean permutation feature importance across the 50 cycles for every feature in the lipb and combi featuresets, from most to least important.

Figure 10 :
Figure10: Performance profile of selected encodings on the entire corpus, showing how many problems can be individually solved within a given time, up to the timeout of 1 hour.The LI encoding is represented by the line colours and the PB encoding by the marker shape.We show the encoding configurations which appear in the top 20 both in terms of their contribution to VB and the number of times they are the best (see Table5).

Table 1 :
New features for pseudo-Boolean and linear integer constraints.For each aspect of a constraint listed in the left column, we calculate the aggregates in the right column.In the aggregation functions, IQR means inter-quartile range, skew refers to the non-parametric skew and ent is Shannon's entropy.The identifier for each aspect is given in brackets.
mean, min, max, ent Distinct coefficient values (sep) mean, max Ratio of distinct coeff.values to num. of coeffs.(sepr) mean, max Number of At-Most-One groups (AMOGs) (amogs) mean Mean size of AMO group (asize mn) mean Mean AMOG size ÷ number of terms (asize r2n) mean Mean maximum coeff.size in AMOGs (amaxw mn) mean Skew of maximum coeff. in AMOGs (amaxw skew) mean, ent Upper limit (k) (k) mean, median, max, IQR, ent, skew k × number of AMOGs (k amo prod) mean, IQR, ent Table 2 shows the number of instances for each problem class and the mean number of PB and LI constraints per instance.

Table 2 :
Number of instances (#), mean number of PB constraints (PBs) and mean number of LI constaints (LIs) per instance for each problem class in the eventual corpus.

Table 3 :
Total PAR10 times over the 50 test sets as a multiple of the virtual best configuration time.The best time for each combination of setup, features and splitting method is shown in bold.The predicted runtimes include feature extraction time.In the setup details, Co/Sep shows whether LI and PB encodings were selected separately or as a combined choice; SW means sample weighting is used; CL indicates custom loss used in cross-validation; Tuning refers to the number of cycles of hyperparameter tuning, or the time budget in the case of AUTOFOLIO.

Table 4 :
The 20 most important features in our lipb feature set by their mean and median permutation feature importance (PFI).Features which appear in both top-20 lists are highlighted in bold.These PFI values were obtained in the split-by-class task and are averaged over the 50 split, train, predict cycles.
Figure8: Permutation feature importance: increase in PAR10 time over 50 split, train, predict cycles when each feature's data is permuted.We show the top 20 features according to mean importance.We omit outliers, which are defined as being beyond 1.5 × IQR away from the box.The mean importance over the 50 cycles is shown by a diamond.Features beginning li or pb refer to our LI/PB features as listed in Table1; the other feature names refer to the generic instance features from the combi feature set.

Table 5 :
Summary of the 20 best encoding configurations across the problem corpus by two different criteria.Left: the encodings which are best most frequently, showing the number of "wins" and the mean runtime for the instances on which it wins; the mean is rounded to the nearest second.Right: the encodings whose allocated instances in the virtual best selection have the highest total runtimes, and their contribution as a percentage of the total VB runtime.Encodings appearing in both top 20 lists are highlighted in bold type.