Generating Extended Resolution Proofs with a BDD-Based SAT Solver

In 2006, Biere, Jussila, and Sinz made the key observation that the underlying logic behind algorithms for constructing Reduced, Ordered Binary Decision Diagrams (BDDs) can be encoded as steps in a proof in the extended resolution logical framework. Through this, a BDD-based Boolean satisfiability (SAT) solver can generate a checkable proof of unsatisfiability. Such proofs indicate that the formula is truly unsatisfiable without requiring the user to trust the BDD package or the SAT solver built on top of it. We extend their work to enable arbitrary existential quantification of the formula variables, a critical capability for BDD-based SAT solvers. We demonstrate the utility of this approach by applying a prototype solver to obtain polynomially sized proofs on benchmarks for the mutilated chessboard and pigeonhole problems—ones that are very challenging for search-based SAT solvers.


Introduction
When a Boolean satisfiability (SAT) solver returns a purported solution to a Boolean formula, its validity can easily be checked by making sure that the solution indeed satisfies the formula. When the formula is unsatisfiable, on the other hand, having the solver simply declare this to be the case requires the user to have faith in the solver, a complex piece of software that could well be flawed. Indeed, modern solvers employ a number of sophisticated techniques to reduce the search space.
If one of those techniques is invalid or incorrectly implemented, the solver may overlook actual solutions and label a formula as unsatisfiable, even when it is not. With SAT solvers providing the foundation for a number of different real-world tasks, this "false negative" outcome could have unacceptable consequences. For example, when used as part of a formal verification system, the usual strategy is to encode some undesired property of the system as a formula. The SAT solver is then used to determine whether some operation of the system could lead to this undesirable property. Having the solver declare the formula to be unsatisfiable is an indication that the undesirable behavior cannot occur, but only if the formula is truly unsatisfiable.
Rather than requiring users to place their trust in a complex software system, a proof-generating solver constructs a proof that the formula is unsatisfiable. The proof has a form that can readily be checked by a simple proof checker. Initial work on checking unsatisfiability results was based on resolution proofs, but modern checkers are based on stronger proof systems [50,27]. The checker provides an independent validation that the formula is indeed unsatisfiable. The checker can even be simple enough to be formally verified [32,18]. Such a capability has become an essential feature for modern SAT solvers.
In their 2006 papers [30,43], Jussila, Sinz, and Biere made the key observation that the underlying logic behind algorithms for constructing Reduced, Ordered Binary Decision Diagrams (BDDs) [6] can be encoded as steps in a proof in the extended resolution (ER) logical framework [44]. Through this, a BDD-based Boolean satisfiability solver can generate checkable proofs of unsatisfiability. Such proofs indicate that the formula is truly unsatisfiable without requiring the user to trust the BDD package or the SAT solver built on top of it.
In this paper, we refine these ideas to enable a full-featured, BDD-based SAT solver. Chief among these refinements is the ability to perform existential quantification on arbitrary variables. (Jussila, Sinz, and Biere [30] extended their original work [43] to allow existential quantification, but only for the root variable of a BDD.) In addition, we allow greater flexibility in the choice of variable ordering and the order in which conjunction and quantification operations are performed. This combination allows a wide range of strategies for creating a sequence of BDD operations that, starting with a set of input clauses, yield the BDD representation of the constant function 0, indicating that the formula is unsatisfiable. Using the extended-resolution proof framework, these operations can generate a proof showing that the original set of clauses logically implies the empty clause, providing a checkable proof that the formula is unsatisfiable.
We evaluated the performance of both our SAT solver TBSAT and KISSAT, a state-of-the-art solver based on conflict-driven clause learning (CDCL) [5,35]. Our results demonstrate that a proof-generating BDD-based SAT solver has very different performance characteristics from the more mainstream CDCL solvers. It does not do especially well as a general-purpose solver, but it can achieve far better scaling for several classic challenge problems [1,16,26,46]. We find that several of these problems can be efficiently solved using the bucket elimination strategy [21] employed by Jussila, Sinz, and Biere [30], but others require a novel approach inspired by symbolic model checking [14].
This paper assumes the reader has some background on BDDs and their algorithms. This background can be obtained from a variety of tutorial presentations [2,7,8]. The paper is largely self-contained regarding proof generation. The paper is structured as follows. First, it provides a brief introduction to the resolution and extended resolution logical frameworks and to BDDs. Then we show how a BDD-based SAT solver can generate proofs by augmenting algorithms for computing the conjunction of two functions represented as BDDs, and for checking that one function logically implies another. We then describe our implementation and evaluate its performance on several classic problems. We conclude with some general observations and suggestions for further work.
This paper is an extended version of an earlier conference paper [12]. Here we present more background material and more details about the proof-generation algorithms, as well as updated benchmark results with a new implementation and on additional challenge problems.

Preliminaries
Given a Boolean formula over a set of variables {x_1, x_2, ..., x_n}, a SAT solver attempts to find an assignment to these variables that will satisfy the formula, or it declares the formula to be unsatisfiable. As is standard practice, a literal ℓ can be either a variable or its complement. Most SAT solvers use Boolean formulas expressed in conjunctive normal form, where the formula consists of a set of clauses, each consisting of a set of literals. Each clause is a disjunction: if an assignment sets any of its literals to true, the clause is considered to be satisfied. The overall formula is a conjunction: a satisfying assignment must satisfy all of the clauses.
We write ⊤ to denote both tautology and logical truth. It arises when a clause contains both a variable and its complement. We write ⊥ to denote logical falsehood. It is represented by an empty clause.
We make use of the if-then-else operation, written ITE, defined as

$$\mathrm{ITE}(u, v, w) = (u \land v) \lor (\lnot u \land w)$$

When writing clauses, we omit disjunction symbols and use overlines to denote negation, writing ¬u ∨ v ∨ ¬w as ū v w̄.

Resolution Proofs
Robinson [40] observed that the resolution inference rule formulated by Davis and Putnam [20] could form the basis for a refutation theorem-proving technique for first-order logic. Here, we consider its specialization to propositional logic. For clauses of the form C ∨ x and x̄ ∨ D, the resolution rule derives the new clause C ∨ D. This inference is written with a notation showing the required conditions above a horizontal line, and the resulting inference (known as the resolvent) below:

$$\frac{C \lor x \qquad \overline{x} \lor D}{C \lor D}$$

Intuitively, the resolution rule is based on the property that implication is transitive. To see this, let proposition p denote ¬C, and proposition q denote D. Then C ∨ x is equivalent to p → x, x̄ ∨ D is equivalent to x → q, and C ∨ D is equivalent to p → q. In other words, the resolution rule encodes the property that if p → x and x → q, then p → q. As a special case, when C contains a literal ℓ and D contains its complement ℓ̄, then the resolvent of C ∨ x and x̄ ∨ D will be a tautology.
Resolution provides a mechanism for proving that a set of clauses is unsatisfiable. Suppose the input consists of m clauses. A resolution proof is given as a trace consisting of a series of steps S, where each step s_i consists of a clause C_i and a (possibly empty) list of antecedents A_i, where each antecedent is the index of one of the previous steps. The first set of steps, denoted S_m, consists of the input clauses without any antecedents. Each successive step then consists of a clause and a set of antecedents, such that the clause can be derived from the clauses in the antecedents by one or more resolution steps. It follows by transitivity that for each step s_i, with i > m, clause C_i is logically implied by the input clauses, written S_m |= C_i. If, through a series of steps, we can reach a step s_t where C_t is the empty clause, then the trace provides a proof that S_m |= ⊥, i.e., the set of input clauses is not satisfiable.
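The resolution rule can be illustrated with a minimal sketch. The representation below is an assumption made for illustration only: clauses are frozen sets of signed integers (v for a variable, -v for its complement), and the names `resolve` and `is_tautology` are hypothetical helpers, not part of any solver discussed here.

```python
from typing import FrozenSet, Optional

Clause = FrozenSet[int]  # signed integers: v is a variable, -v its complement


def is_tautology(clause: Clause) -> bool:
    """A clause is a tautology when it contains a literal and its complement."""
    return any(-lit in clause for lit in clause)


def resolve(c1: Clause, c2: Clause, var: int) -> Optional[Clause]:
    """Resolve c1 (containing var) with c2 (containing -var) on the pivot var.

    Returns the resolvent, or None when the resolvent is a tautology.
    """
    if var not in c1 or -var not in c2:
        raise ValueError("pivot must occur positively in c1 and negatively in c2")
    resolvent = (c1 - {var}) | (c2 - {-var})
    return None if is_tautology(resolvent) else resolvent


# C ∨ x with C = {1}, and x̄ ∨ D with D = {2}; the pivot x is variable 3.
example = resolve(frozenset({1, 3}), frozenset({-3, 2}), 3)
```

Resolving the unit clauses x and x̄ yields the empty set, which plays the role of the empty clause ⊥.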
A typical resolution proof contains many applications of the resolution rule. These enable deriving sequences of implications that combine by transitivity. For example, consider the following implications, shown both as formulas and as clauses: We can derive the final clause from the first three using two resolution steps:

Reverse Unit Propagation (RUP)
Reverse unit propagation (RUP) provides an easily checkable way to express a linear sequence of resolution operations as a single proof step [24,47]. It is the core rule supported by standard proof checkers [28,49] for propositional logic. Let C = ℓ_1 ℓ_2 ··· ℓ_p be a clause to be proved and let D_1, D_2, ..., D_k be a sequence of supporting antecedent clauses occurring earlier in the proof. A RUP step proves that ⋀_{1≤i≤k} D_i → C by showing that the combination of the antecedents plus the negation of C leads to a contradiction. The negation of C is the formula ℓ̄_1 ∧ ℓ̄_2 ∧ ··· ∧ ℓ̄_p, having a CNF representation consisting of p unit clauses of the form ℓ̄_i for 1 ≤ i ≤ p. A RUP check processes the clauses of the antecedent in sequence, inferring additional unit clauses. In processing clause D_i, if all but one of the literals in the clause is the negation of one of the accumulated unit clauses, then we can add this literal to the accumulated set. That is, all but this literal have been falsified, and so it must be set to true for the clause to be satisfied. The final step with clause D_k must cause a contradiction, i.e., all of its literals are falsified by the accumulated unit clauses.
As an example, consider a RUP step to derive x → (a → d) from the three clauses shown in the earlier example. A RUP proof would take the following form. Here, the target and antecedent clauses are listed along the top, while the resulting unit clauses are shown on the bottom, along with the final contradiction.
RUP is an alternative formulation of resolution. For target clause C, it can be seen that applying resolution operations to the antecedent clauses from right to left will derive a clause C′ such that C′ ⊆ C. By subsumption [39], we then have C′ → C. Compared to listing each resolution operation as a separate step, using RUP as the basic proof step makes the proofs more compact.
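The RUP check described above can be sketched in a few lines. This is an illustrative sketch rather than the behavior of any particular checker: the function name `rup_check` is hypothetical, and the three antecedent clauses are a hypothetical chain of implications chosen so that they combine by transitivity into the target x → (a → d).

```python
from typing import Iterable, Sequence

Clause = Sequence[int]  # signed integers: v for a variable, -v for its complement


def rup_check(target: Clause, antecedents: Iterable[Clause]) -> bool:
    """Return True if the antecedents, processed in order, justify target by RUP."""
    units = {-lit for lit in target}      # unit clauses from the negated target
    for clause in antecedents:
        remaining = [lit for lit in clause if -lit not in units]
        if not remaining:
            return True                   # every literal falsified: contradiction
        if len(remaining) > 1:
            return False                  # clause does not propagate at this point
        units.add(remaining[0])           # the single surviving literal must hold
    return False                          # ran out of antecedents without a conflict


# Hypothetical encoding: x=1, a=2, b=3, c=4, d=5.
D1 = [-1, -2, 3]      # x -> (a -> b)
D2 = [-1, -3, 4]      # x -> (b -> c)
D3 = [-1, -4, 5]      # x -> (c -> d)
TARGET = [-1, -2, 5]  # x -> (a -> d)

ok = rup_check(TARGET, [D1, D2, D3])
```

Because the check processes the antecedents strictly in sequence, the order in which they are listed matters: presenting D2 before D1 leaves D2 with two unfalsified literals, and this simple check rejects the step.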

Extended Resolution
Grigori S. Tseitin [44] introduced the extended-resolution proof framework in a presentation at the Leningrad Seminar on Mathematical Logic in 1966. The key idea is to allow the addition of new extension variables to a resolution proof in a manner that preserves the soundness of the proof. In particular, in introducing variable e, there must be an accompanying set of clauses that encode e ↔ F, where F is a formula over variables (both original and extension) that were introduced earlier. These are referred to as the defining clauses for extension variable e. Variable e then provides a shorthand notation by which F can be referenced multiple times. Doing so can reduce the size of a clausal representation of a problem by an exponential factor. An extension variable e is introduced into the proof by including its defining clauses in the list of clauses being generated. The proof checker must then ensure that the defining clauses obey the requirements for extension variables, as is discussed below. Thereafter, other clauses can include the extension variable or its complement, and they can list the defining clauses as antecedents.
Tseitin transformations are commonly used to encode a logic circuit or formula as a set of clauses without requiring the formulas to be "flattened" into a conjunctive normal form over the circuit inputs or formula variables. These introduced variables are called Tseitin variables and are considered to be part of the input formula. An extended resolution proof takes this concept one step further by introducing additional variables as part of the proof. The proof checker must ensure that these extension variables are used in a way that does not result in an unsound proof. Some problems for which the minimum resolution proof must be of exponential size can be expressed with polynomial-sized proofs in extended resolution [17].

Clausal Proofs
We use a clausal proof system to validate our proofs based on the DRAT proof framework [28]. This framework supports both extended resolution and resolution operations based on a proof rule that generalizes reverse unit propagation. There are a number of fast and formally verified checkers for these proofs [19,49,32]. The checker ensures that all extension variables are used properly and that each new clause can be derived via RUP from its antecedent clauses.
As in a resolution proof, a clausal proof is given as a trace, where each step s_i consists of a clause C_i and a list of antecedents A_i, where the initial m clauses are the input clauses. Let S_m denote the set of input clauses, and for i > m, define S_i inductively as S_i = S_{i−1} ∪ {C_i}. The proof steps s_{m+1}, ..., s_t represent a derivation from S_m to S_t. A clausal proof is a refutation if S_t contains the empty clause. Step s_i in a proof is valid if the equisatisfiability of S_{i−1} and S_i can be checked using a polynomially decidable redundancy property. For the case where C_i was obtained via RUP, we can simply perform a RUP check using C_i and the antecedents. In case C_i is one of the defining clauses for some extension variable e, the checker must ensure that the clause is blocked [31]. That is, all possible resolvents of C_i with clauses in S_{i−1} that contain e must be tautologies. The blocked clause proof system is a generalization of extended resolution and allows the addition of blocked clauses that are blocked on non-extension variables. However, we do not use such capabilities in our proofs.
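The blocked-clause condition can be made concrete with a short sketch. The code below is illustrative only: `is_blocked` is a hypothetical helper, clauses are lists of signed integers, and the example introduces a hypothetical extension variable e = 3 with definition e ↔ (a ∧ b) for a = 1 and b = 2.

```python
from typing import Iterable, List, Sequence

Clause = Sequence[int]


def is_blocked(clause: Clause, lit: int, earlier: Iterable[Clause]) -> bool:
    """Check that every resolvent of clause on lit with an earlier clause is a tautology."""
    if lit not in clause:
        raise ValueError("the blocking literal must occur in the clause")
    rest = set(clause) - {lit}
    for other in earlier:
        if -lit not in other:
            continue                      # no resolvent on this literal
        resolvent = rest | (set(other) - {-lit})
        if not any(-l in resolvent for l in resolvent):
            return False                  # found a non-tautological resolvent
    return True


# Defining clauses for e <-> (a AND b), added one at a time after an input clause.
proof: List[Clause] = [[1, 2]]            # input clause over a and b only
steps = [([-3, 1], -3), ([-3, 2], -3), ([3, -1, -2], 3)]
results = []
for clause, lit in steps:
    results.append(is_blocked(clause, lit, proof))
    proof.append(clause)
```

The first two defining clauses are blocked trivially, since no earlier clause mentions e. The third is blocked because resolving it with either of the first two produces a clause containing both a literal and its complement.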
Clausal proofs also allow the removal of clauses. A proof can indicate that clause C_j can be removed after step s_i if it will not be used as an antecedent in any step s_k with k > i. With this restriction, clause deletion does not affect the integrity of the proof. As the experimental results of Section 5 demonstrate, deleting clauses that are no longer needed can substantially reduce the number of clauses the checker must track while processing a proof.

Figure 1: General structure of the Apply algorithm. The operation for a specific logical operation OP is determined by its terminal cases and its recursive structure.

Binary Decision Diagrams
Reduced, Ordered Binary Decision Diagrams (which we refer to as simply "BDDs") provide a canonical form for representing Boolean functions, and an associated set of algorithms for constructing them and testing their properties [6]. With BDDs, functions are defined over a set of variables X = {x_1, x_2, ..., x_n}. We let T_0 and T_1 denote the two leaf nodes, representing the constant functions 0 and 1, respectively.
Each nonterminal node u has an associated variable Var(u) and children Hi(u), indicating the case where the node variable has value 1, and Lo(u), indicating the case where the node variable has value 0.
Two lookup tables, the unique table and the operation cache, are critical for guaranteeing the canonicity of the BDDs and for ensuring polynomial performance of the BDD construction algorithms.
A node u is stored in a unique table, indexed by a key of the form ⟨Var(u), Hi(u), Lo(u)⟩, so that isomorphic nodes are never created. The nodes are shared across all of the BDDs [37]. In presenting algorithms, we assume a function GETNODE(Var(u), Hi(u), Lo(u)) that checks the unique table and either returns the node stored there, or it creates a new node and enters it into the table. With this table, we can guarantee that the subgraphs with root nodes u and v represent the same Boolean function if and only if u = v. We can therefore uniquely identify Boolean functions with their BDD root nodes.
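A unique table along these lines can be sketched as follows. This is a minimal illustration, not the data structure of any particular BDD package: nodes are identified by integer indices, the class name `UniqueTable` is hypothetical, and the check for equal children is the standard reduction rule for reduced BDDs, which is assumed here rather than taken from the description above.

```python
from typing import Dict, List, Tuple

T0, T1 = 0, 1  # indices of the leaf nodes for the constant functions 0 and 1


class UniqueTable:
    """Stores each (var, hi, lo) triple once, so equal functions share one node."""

    def __init__(self) -> None:
        # nodes[i] holds (var, hi, lo); entries 0 and 1 are placeholders for leaves.
        self.nodes: List[Tuple[int, int, int]] = [(-1, 0, 0), (-1, 1, 1)]
        self.table: Dict[Tuple[int, int, int], int] = {}

    def get_node(self, var: int, hi: int, lo: int) -> int:
        if hi == lo:
            return hi                     # assumed reduction rule: skip redundant tests
        key = (var, hi, lo)
        node = self.table.get(key)
        if node is None:                  # not seen before: create and record it
            node = len(self.nodes)
            self.nodes.append(key)
            self.table[key] = node
        return node


ut = UniqueTable()
a = ut.get_node(1, T1, T0)                # the function x1
b = ut.get_node(1, T1, T0)                # requested again: same node is returned
```

Since `a` and `b` are the same index, testing whether two functions are equal reduces to comparing two integers.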
BDD packages support multiple operations for constructing and testing the properties of Boolean functions represented by BDDs. A number of these are based on the Apply algorithm [6]. Given a set of BDD roots u_1, u_2, ..., u_k representing functions f_1, f_2, ..., f_k, respectively, and a Boolean operation OP, the algorithm generates the BDD representation w of the operation applied to those functions. For example, with k = 2 and OP = AND, APPLY(AND, u_1, u_2) returns the root node for the BDD representation of f_1 ∧ f_2.
Figure 1 shows pseudo-code describing the overall structure of the Apply algorithm. The details for a specific operation are embodied in the functions ISTERMINAL, TERMINALVALUE, and APPLYRECUR. The first two of these detect terminal cases and what value to return when a terminal case is encountered. The third describes how to handle the general case, where the arguments must be expanded recursively. The algorithm makes use of memoizing, where previously computed results are stored in an operation cache, indexed by a key consisting of the operands [36]. Whenever possible, results are retrieved from this cache, avoiding the need to perform redundant calls to APPLYRECUR. With this cache, the worst-case number of recursive steps required by the algorithm is bounded by the product of the sizes (in nodes) of the arguments.
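The structure of the Apply algorithm for the AND operation can be sketched as below. This is a simplified illustration under stated assumptions, not a rendering of Figure 1: the class `BDD` and its methods are hypothetical, nodes are integer indices, smaller variable indices are assumed to appear closer to the root, and the terminal test, terminal value, and recursive step are folded into a single method rather than separated into ISTERMINAL, TERMINALVALUE, and APPLYRECUR.

```python
from typing import Dict, List, Tuple

T0, T1 = 0, 1
LEAF_LEVEL = 10**9  # leaves are ordered after every variable


class BDD:
    def __init__(self) -> None:
        self.nodes: List[Tuple[int, int, int]] = [(LEAF_LEVEL, 0, 0), (LEAF_LEVEL, 1, 1)]
        self.unique: Dict[Tuple[int, int, int], int] = {}
        self.cache: Dict[Tuple[int, int], int] = {}   # operation cache for AND

    def get_node(self, var: int, hi: int, lo: int) -> int:
        if hi == lo:
            return hi
        key = (var, hi, lo)
        if key not in self.unique:
            self.unique[key] = len(self.nodes)
            self.nodes.append(key)
        return self.unique[key]

    def cofactors(self, u: int, x: int) -> Tuple[int, int]:
        """Children of u with respect to x; u itself if u does not test x."""
        var, hi, lo = self.nodes[u]
        return (hi, lo) if var == x else (u, u)

    def apply_and(self, u: int, v: int) -> int:
        # Terminal cases
        if u == T0 or v == T0:
            return T0
        if u == T1:
            return v
        if v == T1 or u == v:
            return u
        key = (min(u, v), max(u, v))      # AND is commutative
        if key in self.cache:             # previously computed result
            return self.cache[key]
        x = min(self.nodes[u][0], self.nodes[v][0])
        u1, u0 = self.cofactors(u, x)
        v1, v0 = self.cofactors(v, x)
        w = self.get_node(x, self.apply_and(u1, v1), self.apply_and(u0, v0))
        self.cache[key] = w
        return w


bdd = BDD()
a = bdd.get_node(1, T1, T0)               # x1
b = bdd.get_node(2, T1, T0)               # x2
ab = bdd.apply_and(a, b)                  # x1 AND x2
```

Each distinct pair of argument nodes is expanded at most once, which is what bounds the number of recursive steps by the product of the argument sizes.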

Proof Generation During BDD Construction
In our formulation, every newly created BDD node u is assigned an extension variable u. (As notation, nodes are denoted by boldface characters, possibly with subscripts, e.g., u, v, and v_1, while their corresponding extension variables are denoted with a normal face, e.g., u, v, and v_1.) We then extend the Apply algorithm to generate proofs based on the recursive structure of the BDD operations.
Let S_m denote the set of input clauses. Our goal is to generate a proof that S_m |= ⊥, i.e., there is no satisfying assignment for these clauses. Our BDD-based approach generates a sequence of BDDs with root nodes u_1, u_2, ..., u_t, where u_t = T_0, based on a combination of the following operations. (The exact sequencing of operations is determined by the evaluation mechanism, as is described in Section 5.)
1. For input clause C_i, generate its BDD representation u_i using a series of Apply operations to perform the disjunctions.
2. For roots u_j and u_k, generate the BDD representation of their conjunction u_l = u_j ∧ u_k using the Apply operation to perform conjunction.
3. For root u_j and some set of variables Y ⊆ X, perform existential quantification: u_k = ∃Y u_j.
Although the existential quantification operation is not mandatory for a BDD-based SAT solver, it can greatly improve its performance [22]. It is the BDD counterpart to Davis-Putnam variable elimination on clauses [20]. As the notation indicates, there are often multiple variables that can be eliminated simultaneously. Although the operation can cause a BDD to increase in size, it generally causes a reduction. Our experimental results demonstrate the importance of this operation.
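Existential quantification over a set of variables Y can be sketched as a single recursive pass over the argument BDD. The code below is an illustrative sketch, not the implementation evaluated in this paper: the class `BDD` and its methods are hypothetical, nodes are integer indices, and smaller variable indices are assumed to appear closer to the root. At a node whose variable is in Y, the two quantified children are combined by disjunction; otherwise the node is rebuilt.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

T0, T1 = 0, 1
LEAF_LEVEL = 10**9  # leaves are ordered after every variable


class BDD:
    def __init__(self) -> None:
        self.nodes: List[Tuple[int, int, int]] = [(LEAF_LEVEL, 0, 0), (LEAF_LEVEL, 1, 1)]
        self.unique: Dict[Tuple[int, int, int], int] = {}
        self.or_cache: Dict[Tuple[int, int], int] = {}

    def get_node(self, var: int, hi: int, lo: int) -> int:
        if hi == lo:
            return hi
        key = (var, hi, lo)
        if key not in self.unique:
            self.unique[key] = len(self.nodes)
            self.nodes.append(key)
        return self.unique[key]

    def cofactors(self, u: int, x: int) -> Tuple[int, int]:
        var, hi, lo = self.nodes[u]
        return (hi, lo) if var == x else (u, u)

    def apply_or(self, u: int, v: int) -> int:
        if u == T1 or v == T1:
            return T1
        if u == T0:
            return v
        if v == T0 or u == v:
            return u
        key = (min(u, v), max(u, v))
        if key not in self.or_cache:
            x = min(self.nodes[u][0], self.nodes[v][0])
            u1, u0 = self.cofactors(u, x)
            v1, v0 = self.cofactors(v, x)
            self.or_cache[key] = self.get_node(
                x, self.apply_or(u1, v1), self.apply_or(u0, v0))
        return self.or_cache[key]

    def exists(self, u: int, ys: FrozenSet[int],
               cache: Optional[Dict[int, int]] = None) -> int:
        """Quantify all variables in ys out of u in one pass over its nodes."""
        if cache is None:
            cache = {}
        if u in (T0, T1):
            return u
        if u not in cache:
            var, hi, lo = self.nodes[u]
            h = self.exists(hi, ys, cache)
            l = self.exists(lo, ys, cache)
            cache[u] = self.apply_or(h, l) if var in ys else self.get_node(var, h, l)
        return cache[u]


bdd = BDD()
x2 = bdd.get_node(2, T1, T0)
x3 = bdd.get_node(3, T1, T0)
f = bdd.get_node(1, x2, x3)               # ITE(x1, x2, x3)
g = bdd.exists(f, frozenset({1}))         # x2 OR x3
```

Quantifying x1 out of ITE(x1, x2, x3) yields x2 ∨ x3, which the original function implies. That implication is the only property a proof of unsatisfiability needs from the operation.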
As these operations proceed, we simultaneously generate a set of proof steps. The details of each step are given later in the presentation. For each BDD generated, we maintain the proof invariant that the extension variable u_j associated with root node u_j satisfies S_m |= u_j.
1. Following the generation of the BDD u_i for input clause C_i, we also generate a proof that C_i |= u_i. This is described in Section 3.1.
2. Justifying the results of conjunctions requires two parts:
(a) Using a modified version of the Apply algorithm for conjunction, we follow the structure of its recursive calls to generate a proof that the algorithm preserves implication: u_j ∧ u_k → u_l. This is described in Section 3.2.
(b) This implication can be combined with the earlier proofs that S_m |= u_j and S_m |= u_k to prove S_m |= u_l.
3. Justifying the quantification also requires two parts:
(a) Following the generation of u_k via existential quantification, we perform a separate check that their associated extension variables satisfy u_j → u_k. This check uses a proof-generating version of the Apply algorithm for implication checking. This is described in Section 3.3.
(b) This implication can be combined with the earlier proof that S_m |= u_j to prove S_m |= u_k.
Compared to the prior work by Sinz and Biere [43], our key refinement is to handle arbitrary existential quantification operations. (When implementing a SAT solver, these quantifications must be applied in restricted ways [48], but since proofs of unsatisfiability only require proving implication, we need not be concerned with the details of these restrictions.) Rather than attempting to track the detailed logic underlying the quantification operation, we run a separate check that implication is preserved. As is the case with many BDD packages, our implementation can perform existential quantification of an arbitrary set of variables in a single pass over the argument BDD. We only need to perform a single implication check for the entire quantification.
Sinz and Biere's construction assumed there were special extension variables n_1 and n_0 to represent the BDD leaves T_1 and T_0. Their proofs then include unit clauses n_1 and n̄_0 to force these variables to be set to 1 and 0, respectively. We have found that these special variables are not required and instead directly associate leaves T_1 and T_0 with ⊤ and ⊥, respectively.
The n variables in the input clauses all have associated BDD variables. The proof then introduces an extension variable u every time a new BDD node u is created. In the actual implementation, the extension variable (an integer) is stored as one of the fields in the node representation.
When creating a new node, the GETNODE function adds (up to) four defining clauses for the associated extension variable. For node u with variable Var(u) = x, Hi(u) = u_1, and Lo(u) = u_0, the clauses are:

Notation   Formula             Clause
HD(u)      x → (u → u_1)       x̄ ū u_1
LD(u)      ¬x → (u → u_0)      x ū u_0
HU(u)      x → (u_1 → u)       x̄ ū_1 u
LU(u)      ¬x → (u_0 → u)      x ū_0 u

The names for these clauses combine an indication of whether they correspond to variable x being 1 (H) or 0 (L) and whether they form an implication from the node down to its child (D) or from the child up to its parent (U). When one of the child nodes u_0 or u_1 is a leaf, some of these defining clauses will degenerate into tautologies, and some will reduce to just two literals. Tautologies are not included in the proof. These defining clauses encode the assertion

$$u \leftrightarrow \mathrm{ITE}(x, u_1, u_0)$$

satisfying Tseitin's restriction on the use of extension variables. Each clause is numbered according to its step number in the trace.
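The generation of defining clauses, including the degenerate cases for leaf children, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `defining_clauses` is hypothetical, variables and extension variables are positive integers, literals are signed integers, and the leaves are represented by the placeholder strings "T1" and "T0" because they carry no extension variable.

```python
from typing import Dict, List, Union

T1, T0 = "T1", "T0"            # leaf nodes, which have no extension variable
TOP, BOT = "TOP", "BOT"        # truth constants arising from leaf children
Child = Union[int, str]


def _pos(child: Child):
    """Literal asserting the child; a truth constant when the child is a leaf."""
    return TOP if child == T1 else BOT if child == T0 else child


def _neg(child: Child):
    """Literal negating the child; a truth constant when the child is a leaf."""
    return BOT if child == T1 else TOP if child == T0 else -child


def defining_clauses(u: int, x: int, u1: Child, u0: Child) -> Dict[str, List[int]]:
    """Defining clauses for u <-> ITE(x, u1, u0), omitting tautologies."""
    raw = {
        "HD": [-x, -u, _pos(u1)],   # x -> (u -> u1)
        "LD": [x, -u, _pos(u0)],    # not x -> (u -> u0)
        "HU": [-x, _neg(u1), u],    # x -> (u1 -> u)
        "LU": [x, _neg(u0), u],     # not x -> (u0 -> u)
    }
    clauses: Dict[str, List[int]] = {}
    for name, lits in raw.items():
        if TOP in lits:
            continue                # clause is a tautology: not added to the proof
        clauses[name] = [lit for lit in lits if lit != BOT]
    return clauses


general = defining_clauses(u=10, x=1, u1=11, u0=12)
leafy = defining_clauses(u=10, x=1, u1=T1, u0=T0)
```

For a node whose children are both leaves, only two clauses of two literals each survive, encoding u ↔ x.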

Generating BDD Representations of Clauses
The BDD representation for a clause C has a simple, linear structure. For root node u, it is easy to prove that C |= u using one RUP step. The general algorithm is described by Sinz and Biere [43]. Here we illustrate the idea via an example.
Figure 2 shows the BDD representation of clause C = a b c. As can be seen, the BDD for a clause has a very specific structure. For each literal in the clause, there is a node labeled by the variable, with one child being leaf T_1 and the other being either the node for the next literal in the variable ordering or leaf T_0. The lower part of the figure shows a RUP justification of C |= u_a, where u_a is the root node of the BDD. The proof uses the antecedents HU(u) and LU(u) for each node u in the BDD (except for the tautological case representing the final edge to T_0), with the final antecedent being the clause itself. The RUP steps introduce the complements of the clause variables as unit clauses, causing a contradiction with the input clause. The order in which the two defining clauses for a node are listed in the antecedent depends on whether the variable is positive or negative in the clause. As this example demonstrates, we can generate a single proof step for C_i |= u_i for each input clause C_i.

Performing Conjunctions
The key idea in generating proofs for the conjunction operation is to follow the recursive structure of the Apply algorithm. We do this by integrating proof generation into the Apply procedures, as is shown in Figure 3. This follows the standard form of the Apply algorithm (Figure 1), with the novel feature that each result includes both a BDD node w and a proof step number s. For arguments u and v, step s lists clause ū v̄ w along with antecedents defining a RUP proof of the implication u ∧ v → w.
As the table of terminal cases shows, these cases all correspond to tautologies. For example, the case of u = T_1, giving w = v, is justified by the tautology ⊤ ∧ v → v. Failing a terminal or previously computed case, the function must recurse, branching on the variable x that is the minimum of the two root variables. The procedure accumulates a set of proof steps J to be used in the implication proof. These include the two steps (possibly tautologies) from the two recursive calls. At the end, it invokes a function JUSTIFYAND to generate the required proof. In returning the pair (w, s), this value will be stored in the operation cache and returned as the result.

Figure 4 (supporting clauses for the standard case of conjunction):

Label   Formula              Clause
UHD     HD(u)                x̄ ū u_1
VHD     HD(v)                x̄ v̄ v_1
WHU     HU(w)                x̄ w̄_1 w
ANDH    u_1 ∧ v_1 → w_1      ū_1 v̄_1 w_1
ULD     LD(u)                x ū u_0
VLD     LD(v)                x v̄ v_0
WLU     LU(w)                x w̄_0 w
ANDL    u_0 ∧ v_0 → w_0      ū_0 v̄_0 w_0

Proof Generation for the Standard Case
A proof generated by APPLY with operation AND inducts on the structure of the argument and result BDDs. That is, it assumes that the result nodes w_1 and w_0 of the recursive calls to arguments u_1 and v_1 and to u_0 and v_0 satisfy the implications u_1 ∧ v_1 → w_1 and u_0 ∧ v_0 → w_0, and that these calls generated proof steps s_1 and s_0 justifying these implications. For the standard case, where none of the equalities hold and the recursive calls do not yield tautologies, the supporting clauses for the proof are shown in Figure 4. That is, the set J contains references to eight clauses, which we identify by labels. Six of these are defining clauses: the downward clauses for the argument nodes (labeled UHD, VHD, ULD, and VLD) and the upward clauses for the result (labeled WHU and WLU). The other two are implications for the two recursive calls (labeled ANDH and ANDL). We partition these supporting clauses into two sets: A_H = {UHD, VHD, WHU, ANDH} and A_L = {ULD, VLD, WLU, ANDL}. These supporting clauses are used to derive the target clause u ∧ v → w using the two RUP steps shown in Figure 5.

Proof Generation for Special Cases
The proof structure shown in Figure 5 only holds for the standard form of the recursion. However, there are many special cases, such as when a recursive call yields a tautologous result, when some of the child nodes are equal, and when the two recursive calls return the same node. Fortunately, a general approach can handle the many special cases that arise. The examples shown in Figure 6 illustrate a range of possibilities. Based on these and the standard case of Figure 5, we will show how to handle all of the cases with a simple algorithm.
Figure 6A illustrates the case where some of the nodes in the recursive calls are equal. In particular, when Var(u) > Var(v), the recursion will split with u_1 = u_0 = u. This will cause supporting clauses UHD and ULD to be tautologies. This example also has w_1 = w_0 = w, as will occur when the two recursive calls return an identical result. This will cause supporting clauses WHU and WLU to be tautologies. The two sets of equalities will cause supporting clause ANDH to be ū v̄_1 w and supporting clause ANDL to be ū v̄_0 w. As can be seen, the resulting proof will consist of the same two steps as the standard form, but with fewer supporting clauses.
Figure 6B illustrates the case where u_1 = T_1, and therefore the first recursive call generates a tautologous result. This case will cause w_1 = v_1, and therefore supporting clause WHU will be x̄ v̄_1 w. In addition, supporting clauses UHD and ANDH will be tautologies. Despite these changes, the proof will still have the same two-step structure as the standard case.
Finally, Figure 6C illustrates the case where u_1 = T_0, and therefore the first recursive call again generates a tautologous result. This case will cause w_1 = T_0, and only two clauses among those in A_H will not be tautologies: UHD will be x̄ ū, and VHD will be x̄ v̄ v_1. As can be seen, the proof for this case consists of a single RUP step. Furthermore, it does not make use of supporting clause VHD.
These three examples illustrate the following general properties:
• When neither ANDH nor ANDL is a tautology, the proof requires two steps. Some of the supporting clauses may be tautologies, but the proof can follow the standard form shown in Figure 5.
• When either ANDH or ANDL is a tautology, it may be possible to generate a single-step proof. Otherwise, it can follow the standard, two-step form.
Given these possibilities, our implementation of JUSTIFYAND uses the following strategy:
1. If supporting clause ANDH is a tautology, then attempt a single-step proof, using the non-tautologous clauses in A_H followed by those in A_L. If this fails, then perform a two-step proof.
2. Similarly, if supporting clause ANDL is a tautology, then attempt a single-step proof, using the non-tautologous clauses in A_L followed by those in A_H. If this fails, then perform a two-step proof.
3. A two-step proof proceeds by first proving the weaker clause x̄ ū v̄ w using the non-tautologous clauses in A_H. It then uses this result, plus the clauses in A_L, to justify target clause ū v̄ w.
In all cases, the antecedent is generated by stepping through the clauses in their specified order, adding only those that cause unit propagation or conflict.
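The strategy above can be sketched with a small amount of code. This is a simplified illustration, not the JUSTIFYAND implementation itself: the function names, the labeled-clause representation, and the integer encoding (x=1, u=2, v=3, w=4, u_1=5, v_1=6, w_1=7, u_0=8, v_0=9, w_0=10) are all assumptions made for the example. The helper makes one pass over the candidate clauses, keeping only those that propagate a unit or cause the conflict.

```python
from typing import List, Optional, Sequence, Tuple

Labeled = Tuple[str, List[int]]


def rup_antecedents(target: Sequence[int],
                    candidates: Sequence[Labeled]) -> Optional[List[str]]:
    """Step through candidates in order; return labels used, or None if no conflict."""
    true_lits = {-lit for lit in target}
    used: List[str] = []
    for label, clause in candidates:
        if any(lit in true_lits for lit in clause):
            continue                          # already satisfied: not needed
        open_lits = [lit for lit in clause if -lit not in true_lits]
        if not open_lits:
            used.append(label)
            return used                       # conflict reached
        if len(open_lits) == 1:
            true_lits.add(open_lits[0])       # unit propagation
            used.append(label)
    return None


def justify_and(x: int, u: int, v: int, w: int,
                a_h: List[Labeled], a_l: List[Labeled],
                andh_taut: bool = False, andl_taut: bool = False):
    """Return a list of (clause, antecedent labels) steps proving u AND v -> w."""
    target = [-u, -v, w]
    if andh_taut or andl_taut:                # try a single-step proof first
        order = a_h + a_l if andh_taut else a_l + a_h
        used = rup_antecedents(target, order)
        if used is not None:
            return [(target, used)]
    weak = [-x, -u, -v, w]                    # two-step proof
    used1 = rup_antecedents(weak, a_h)
    used2 = rup_antecedents(target, [("WEAK", weak)] + a_l)
    if used1 is None or used2 is None:
        raise ValueError("could not justify conjunction")
    return [(weak, used1), (target, used2)]


A_L = [("ULD", [1, -2, 8]), ("VLD", [1, -3, 9]),
       ("WLU", [1, -10, 4]), ("ANDL", [-8, -9, 10])]
A_H = [("UHD", [-1, -2, 5]), ("VHD", [-1, -3, 6]),
       ("WHU", [-1, -7, 4]), ("ANDH", [-5, -6, 7])]
standard = justify_and(1, 2, 3, 4, A_H, A_L)

# Case u_1 = T_0: UHD shrinks to two literals; WHU and ANDH are tautologies.
A_H_special = [("UHD", [-1, -2]), ("VHD", [-1, -3, 6])]
special = justify_and(1, 2, 3, 4, A_H_special, A_L, andh_taut=True)
```

In the special case, the single-step attempt succeeds and the pruning pass leaves VHD out of the antecedent, since that clause is already satisfied once x̄ has been derived.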

Checking Implication
As described in Section 3, we need not track the detailed logic of the algorithm that performs existential quantification. Instead, when the quantification operation applied to node u generates node v, we generate a proof of implication afterwards, using the Apply algorithm adapted for implication checking, as shown in Figure 7. A failure of this implication check would indicate an error in the BDD package, and so its only purpose is to generate a proof that the implication holds, signaling a fatal error if the implication does not hold. This particular operation does not generate any new nodes, and so the returned result is simply a proof step number. The (successful) terminal cases correspond to the tautological cases u → u, ⊥ → v, and u → ⊤.

Figure 9: RUP proof steps for standard recursive check of implication checking
Each recursive step accumulates up to six proof steps as the set J to be used in the implication proof. Figure 8 shows the structure of these clauses for the standard case where neither equality holds, and neither recursive call returns ⊤. The clauses consist of the two downward defining clauses for argument u, labeled UHD and ULD, the two upward defining clauses for argument v, labeled VHU and VLU, and the clauses returned by the recursive calls, labeled IMH and IML.
Figure 9 shows the two RUP steps required to prove the standard case. The first step proves the weaker target $x \rightarrow (u \rightarrow v)$, having clausal representation $\overline{x} \lor \overline{u} \lor v$, using the three supporting clauses for the high branch (UHD, VHU, and IMH). The second proves the full target, having clausal representation $\overline{u} \lor v$, using the weaker result plus the three supporting clauses for the low branch (ULD, VLU, and IML).
As with the conjunction operation, there can be many special cases, but they can be handled with the same general strategy. If either recursive result IMH or IML is a tautology, a one-step proof is attempted. If that fails, or if neither recursive result is a tautology, a two-step proof is generated.
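The recursion underlying this check can be illustrated apart from proof generation. The following is a minimal sketch, not TBUDDY's code: it assumes a toy representation in which the leaves are the integers 0 and 1 and an internal node is a tuple (level, high, low), and the helper names `node`, `top`, `cofactors`, and `implies` are hypothetical. It applies the three successful terminal cases, reports failure when a nonzero u meets a v that cannot cover it, and otherwise recurses on both cofactors of the topmost variable. A proof-generating version would additionally emit the RUP steps at each recursive step.

```python
from functools import lru_cache

def node(level, high, low):
    """Create a reduced node; a node with identical children is skipped."""
    return high if high == low else (level, high, low)

def top(u):
    """Level of the topmost variable; leaves sort below every variable."""
    return u[0] if isinstance(u, tuple) else float("inf")

def cofactors(u, level):
    """Return (high, low) cofactors of u with respect to the given level."""
    return (u[1], u[2]) if top(u) == level else (u, u)

@lru_cache(maxsize=None)
def implies(u, v):
    """Check u -> v for two reduced, ordered BDDs over the same ordering."""
    # Successful terminal cases: u -> u, 0 -> v, u -> 1
    if u == v or u == 0 or v == 1:
        return True
    # Failing terminal cases: 1 -> v with v != 1, or u -> 0 with u != 0
    if u == 1 or v == 0:
        return False
    x = min(top(u), top(v))
    u1, u0 = cofactors(u, x)
    v1, v0 = cofactors(v, x)
    return implies(u1, v1) and implies(u0, v0)
```

For example, with variable a at level 1 and b at level 2, the BDD for a ∧ b implies the BDD for a (the result of existentially quantifying b), but a ∨ b does not imply a ∧ b.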

Implementation
We implemented the TBUDDY proof-generating BDD package by modifying the widely used BUDDY BDD package, developed by Jørn Lind-Nielsen in the 1990s [9]. This involved adding several additional fields to the BDD node and cache entry data structures, yielding a total memory overhead of 1.35×. TBUDDY generates proofs in the LRAT proof format [18]. We then implemented TBSAT, a proof-generating SAT solver based on TBUDDY.
TBSAT supports three different evaluation mechanisms:
• Linear: Form the conjunction of the clauses. No quantification is performed. This mode matches the operation described for the original version of EBDDRES [43]. When forming the conjunction of a set of terms, the program makes use of a first-in, first-out queue, removing two elements from the front of the queue, computing their conjunction, and placing the result at the end of the queue. This has the effect of forming a binary tree of conjunctions.
• Bucket Elimination: Place the BDDs representing the clauses into buckets according to the levels of their topmost variables. Then process the buckets from lowest to highest. While a bucket has more than one element, repeatedly remove two elements, form their conjunction, and place the result in the bucket designated by its topmost variable. Once the bucket has a single element, existentially quantify the topmost variable and place the result in the appropriate bucket [21]. This matches the operation described for the revised version of EBDDRES [30].
• Scheduled: Perform operations as specified by a scheduling file, as described below.
The scheduling file contains a sequence of lines, each providing a command in a simple, stack-based notation:
• c $c_1, \ldots, c_k$: Push the BDD representations of the specified clauses onto the stack
• a $m$: Replace the top $m$ elements on the stack with their conjunction
• q $v_1, \ldots, v_k$: Replace the top stack element with its quantification by the specified variables
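The semantics of the three commands can be sketched with a small interpreter. This is an illustration under stated assumptions, not TBSAT's implementation: each Boolean function is represented as an explicit set of satisfying assignments rather than as a BDD, no proof is generated, clauses are lists of signed integers numbered from 1, and the function name `run_schedule` is hypothetical. The exact token syntax of the scheduling file is also an assumption; the parser below accepts arguments separated by spaces or commas.

```python
from itertools import product

def run_schedule(clauses, n_vars, schedule):
    """Execute c/a/q commands on a stack of Boolean functions.

    A function is a frozenset of assignments; an assignment is the
    frozenset of variables set to true (variables are 1..n_vars).
    """
    universe = [frozenset(v for v in range(1, n_vars + 1) if bits[v - 1])
                for bits in product([0, 1], repeat=n_vars)]

    def clause_function(clause):
        # An assignment satisfies the clause if some literal is satisfied.
        return frozenset(a for a in universe
                         if any((abs(lit) in a) == (lit > 0) for lit in clause))

    stack = []
    for line in schedule.splitlines():
        tokens = line.replace(",", " ").split()
        if not tokens:
            continue
        cmd, args = tokens[0], [int(t) for t in tokens[1:]]
        if cmd == "c":    # push the specified clauses
            stack.extend(clause_function(clauses[i - 1]) for i in args)
        elif cmd == "a":  # conjunction of the top m elements
            m = args[0]
            if m < 1:
                raise ValueError("conjunction needs at least one operand")
            operands = stack[-m:]
            del stack[-m:]
            result = operands[0]
            for f in operands[1:]:
                result = result & f
            stack.append(result)
        elif cmd == "q":  # existential quantification of the top element
            f = stack.pop()
            for v in args:
                f = frozenset(a | {v} for a in f) | frozenset(a - {v} for a in f)
            stack.append(f)
        else:
            raise ValueError(f"unknown command: {cmd}")
    return stack
```

In this representation the empty set plays the role of the BDD leaf for false, so a schedule establishes unsatisfiability when it leaves an empty function on the stack.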

Experimental Results
In our preliminary experiments, we found that the capabilities of TBSAT differ greatly from those of the more mainstream CDCL solvers, and it therefore must be evaluated by a different set of standards. In particular, CDCL solvers are most commonly evaluated according to their performance on collections of benchmark problems in a series of annual solver competitions. Over the years, the benchmark problems have been updated to provide new challenges and to better distinguish the performance of the different solvers. This competition has stimulated major improvements in the solvers through improved algorithms and implementation techniques. One unintended consequence, however, has been that the benchmarks have evolved to be only at, or slightly beyond, the capabilities of CDCL solvers.
As an example, as is discussed in Section 5.2, Simon [15] and Li [33] contributed multiple benchmark formulas for the 2002 SAT competition [41] based on a class of unsatisfiable formulas devised by Urquhart [46]. The formulas scale quadratically with a size parameter m, both in terms of the number of variables and the number of clauses. Simon's largest benchmark had m = 5, while Li's had m = 4. No solver at the time could complete these formulas, even though Li's formula for m = 4 has only 288 variables and 768 clauses. The 2022 SAT competition featured a special "Anniversary track" using the 5355 formulas that have been used across all prior SAT competitions. In all, 32 solvers participated in the competition with a 5000-second time limit for each problem. Even after years of improvements in the solvers and with vastly better hardware, none of the solvers completed these 20-year-old benchmark problems. There has been no attempt to evaluate solvers running on Urquhart formulas for larger values of m, because these were clearly beyond the reach of the competing solvers.
By contrast, TBSAT can easily handle the Urquhart formulas. Generating proofs of unsatisfiability for Simon's benchmark with m = 5 and Li's benchmark with m = 4 requires 0.23 and 0.13 seconds, respectively. We show experimental results with m = 38 for Li's version and m = 60 for Simon's. In a more recent effort [9], we augmented TBSAT to use Gaussian elimination for reasoning about parity constraints, allowing us to generate an unsatisfiability proof for Li's version with m = 316, a formula with over two million variables and five million clauses. This example demonstrates that measuring performance on benchmarks designed to evaluate CDCL solvers cannot capture the full capabilities of a BDD-based SAT solver.
In the following experiments, we explore the capability of TBSAT on four scalable benchmark problems that pose major challenges for CDCL solvers. We do so not to show that TBSAT is uniformly superior, but rather that it performs very well on some classes of problems for which CDCL is especially weak. A long-term research direction is to combine the capabilities of CDCL and BDDs to build on the strengths of each.
All experiments were performed on a 3.2 GHz Apple M1 Max processor with 64 GB of memory and running the OS X operating system. The runtime for each experiment was limited to 1000 seconds. We compare the performance of TBSAT to that of KISSAT, the winner of several recent SAT solver competitions [5]. KISSAT represents the state of the art in CDCL solvers. The proofs were checked using DRAT-TRIM for the proofs generated by KISSAT and LRAT-CHECK for those generated by TBSAT. We report both the elapsed time of the solver and the total number of clauses in the proof of unsatisfiability. For KISSAT, the proof clauses indicate the conflicts the solver encountered during its search. For TBSAT, these are the defining clauses for the extension variables (up to four per BDD node generated) and the derived clauses (one per input clause and up to two per result inserted into the operation cache).

Reordered Parity Formulas
Chew and Heule [16] introduced a benchmark problem based on computing the parity of a set of Boolean values $x_1, \ldots, x_n$ using two different orderings of the inputs, and with one of the variables negated in the second computation:
$$\mathrm{ParityA} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$
$$\mathrm{ParityB} = [x_{\pi(1)} \oplus p_1] \oplus [x_{\pi(2)} \oplus p_2] \oplus \cdots \oplus [x_{\pi(n)} \oplus p_n]$$
where $\pi$ is a random permutation, and each $p_i$ is either 0 or 1, with the restriction that $p_i = 1$ for only one value of $i$. The two sums associate from left to right. The formula ParityA ∧ ParityB is therefore unsatisfiable, but the permutation makes this difficult for CDCL solvers to determine. The CNF has a total of 3n − 2 variables: n values of $x_i$, plus the auxiliary variables encoding the intermediate terms in the two expressions.
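The reason the conjunction is unsatisfiable can be checked by brute force for small n. The sketch below is only a semantic illustration: it evaluates the two parity expressions directly rather than building the CNF with its auxiliary variables, and the function names and the use of a seeded random generator are assumptions made for the example. Because exactly one input is negated in the second expression, the two parities always differ, so they can never both be 1.

```python
import random
from functools import reduce
from itertools import product
from operator import xor

def parity_a(x):
    """Parity of the inputs in their original order."""
    return reduce(xor, x)

def parity_b(x, perm, p):
    """Parity of the permuted inputs, each XORed with its polarity bit."""
    return reduce(xor, (x[perm[i]] ^ p[i] for i in range(len(x))))

def conjunction_satisfiable(n, seed=0):
    """Brute-force test of whether ParityA and ParityB can both be 1."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)              # random permutation of the inputs
    p = [0] * n
    p[rng.randrange(n)] = 1        # exactly one polarity bit is set
    return any(parity_a(x) and parity_b(x, perm, p)
               for x in product([0, 1], repeat=n))
```

The enumeration takes time exponential in n, so it only confirms the property on toy instances; it says nothing about how hard the CNF is for a solver.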
Chew and Heule experimented with the CDCL solver CADICAL [3] and found it could not handle cases with n greater than 50. They devised a specialized method for directly generating proofs in the DRAT proof system, obtaining proofs that scale as O(n log n), and gave results for up to n = 4,000. They also tried EBDDRES, but only in its default mode, where it performs only linear evaluation without any quantification.
Figure 10 shows the result of applying both TBSAT and KISSAT to this problem. In this and other figures, the top graph shows how the runtime scales with the problem size, while the bottom graph shows how the number of proof clauses scales. Both graphs are log-log plots, and so the values are highly compressed along both dimensions. Linear evaluation performs poorly, only handling up to n = 24 within the 1000-second time limit, generating a proof with over 312 million clauses. Using KISSAT, we found the results were very sensitive to the choice of random permutation, and so we show results using three different random seeds for each value of n. We were able to generate proofs for instances with n up to 46 within the time limit but also started having timeouts with n = 42. We can see that KISSAT does better than linear evaluation with TBSAT, but both appear to scale exponentially.
Bucket elimination, on the other hand, displays much better scaling, as is shown in the log-log plot of Figure 10. We found the best performance was achieved by randomly permuting the variables, although this strategy yields only a constant-factor improvement. As the graphs show, we were able to handle cases with n up to 9,750 within the time limit. This generated a proof with over 419 million clauses, but the LRAT checker was able to verify this proof in 256 seconds. Although TBSAT could generate proofs for larger values of n, these exceeded the capacity of the LRAT checker.
Included in the second graph are results for running Chew and Heule's proof generator on this problem. As can be seen, the proof sizes generated by TBSAT are comparable to theirs up to around n = 100. From there on, however, the benefit of their O(n log n) algorithm becomes apparent. Even for n = 10,000, their proof contains fewer than 11 million clauses. Of course, their construction relies on particular properties of the underlying problem, while ours was generated by a general-purpose SAT solver.

Urquhart Formulas
Urquhart [46] introduced a family of formulas that require resolution proofs of exponential size. Over the years, two families of SAT benchmarks have been labeled as "Urquhart Problems": one developed by Simon [15], and the other by Li [33]. These are considered to be difficult challenge problems for SAT solvers. Here we define their general form, describe the differences between the two families, and evaluate the performance of both KISSAT and TBSAT on both classes.
Urquhart's construction is based on a class of bipartite graphs with special properties. Define $G_k$ as the set of undirected graphs with each graph satisfying the following properties:
• It is bipartite: The set of vertices can be partitioned into sets L and R, such that the edges E satisfy E ⊆ L × R.
• It is balanced: |L| = |R|.
• It has bounded degree: No vertex has more than k incident edges.
Furthermore, the graphs must be expanders, defined as follows [29]. For a subset of vertices U ⊆ L, define R(U) to be those vertices in R adjacent to the vertices in U. A graph in $G_k$ is an expander if there is some constant d > 0, such that for any U ⊆ L with |U| ≤ |L|/2, we have |R(U)| ≥ (1 + d)|U|.
To transform such a graph into a formula, each edge (i, j) ∈ E has an associated variable $x_{\{i,j\}}$. (We use this notation to emphasize that the order of the indices does not matter.) Each vertex is assigned a polarity $p_i \in \{0, 1\}$, such that the sum of the polarities is odd. The clauses then encode, for each vertex i, the parity constraint:
$$\sum_{j \,:\, \{i,j\} \in E} x_{\{i,j\}} \equiv p_i \pmod{2}$$
The conjunction of these constraints is false, of course, since each edge gets counted twice in the sum over all vertices, and the sum of the polarities is odd.
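The double-counting argument can be confirmed by exhaustive search on a toy graph. The sketch below is an illustration only: it uses a small complete bipartite graph rather than an expander, it checks the parity constraints directly instead of generating clauses, and the function name and data layout are hypothetical.

```python
from itertools import product

def parity_constraints_satisfiable(edges, polarity):
    """Search for edge values meeting every vertex parity constraint.

    edges:    list of (i, j) vertex pairs, one Boolean variable per edge
    polarity: dict mapping each vertex to its polarity bit (0 or 1)
    """
    for bits in product([0, 1], repeat=len(edges)):
        # A vertex is satisfied when the values of its incident edges
        # sum to its polarity modulo 2.
        if all(sum(b for e, b in zip(edges, bits) if v in e) % 2 == p
               for v, p in polarity.items()):
            return True
    return False

# Balanced bipartite graph with L = {0, 1} and R = {2, 3}
EDGES = [(0, 2), (0, 3), (1, 2), (1, 3)]
```

With an odd total polarity the search finds no solution, while an even total polarity admits one, which matches the counting argument.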
The two families of benchmarks differ in how the graphs are constructed. Li's benchmarks are based on the explicit construction of expander graphs due to Margulis [23,34] that is cited by Urquhart. Thus, his graphs are fully defined by the size parameter m. Simon's benchmarks are based on randomly generated graphs, and thus they are characterized by both the size parameter m and the initial random seed s. Although random graphs satisfy the expander condition with high probability [29], it is unlikely that the particular instances generated by Simon's benchmark generator are truly expander graphs. The widely used SAT benchmarks with names of the form UrqM_S.cnf were generated by Simon's program for size parameter m = M and initial seed S. For Simon's benchmarks, we used five different seeds for each value of m.
Figure 11 shows data for running KISSAT, as well as TBSAT using bucket elimination. The data for KISSAT demonstrates how difficult these benchmark problems are for CDCL solvers. With a time limit of 1000 seconds, we found that KISSAT could handle all five instances of Simon's benchmarks with m = 3, but none for larger values of m. For Li's benchmarks it failed for even the minimum case of m = 3. Running TBSAT with bucket elimination with a random ordering of the variables fares much better. For Li's benchmarks, it successfully handled instances up to m = 38, yielding a proof with around 373 million clauses. For Simon's benchmarks, bucket elimination handled benchmarks for all five seeds up to m = 60. We can also see that Simon's benchmarks are decidedly easier than Li's, requiring up to an order of magnitude fewer clauses in the proofs.
Jussila, Sinz, and Biere [30] showed benchmark results for what appear to be Simon's Urquhart formulas up to m = 8 with performance (in terms of proof size) comparable to ours. Indeed, in using bucket elimination, we are replicating their approach. We know of no prior proof-generating SAT solver that could handle Urquhart formulas of this scale.

Mutilated Chessboard
The mutilated chessboard problem considers an n × n chessboard, with the corners on the upper left and the lower right removed. It attempts to tile the board with dominos, with each domino covering two squares. Since the two removed squares have the same color, and each domino covers one white and one black square, no tiling is possible. This problem has been well studied in the context of resolution proofs, for which it can be shown that any proof must be of exponential size [1].
A standard CNF encoding involves defining Boolean variables to represent the boundaries between adjacent squares, set to 1 when a domino spans the two squares, and set to 0 otherwise. The clauses then encode an Exactly1 constraint for each square, requiring each square to share a domino with exactly one of its neighbors. We label the variables representing a horizontal boundary between a square and the one below as $y_{i,j}$, with 1 ≤ i < n and 1 ≤ j ≤ n. The variables representing the vertical boundaries are labeled $x_{i,j}$, with 1 ≤ i ≤ n and 1 ≤ j < n. With a mutilated chessboard, we have $x_{1,1} = y_{1,1} = 0$ and $x_{n,n-1} = y_{n-1,n} = 0$.
As the plots of Figure 12 show, a straightforward application of linear conjunctions or bucket elimination by TBSAT displays exponential scaling. Indeed, TBSAT fares no better than KISSAT when operating in either of these modes, with all limited to n ≤ 20 within the 1000-second time limit. On the other hand, another approach, inspired by symbolic model checking [14], demonstrates far better scaling, reaching n = 340. It is based on the following observation: when processing the columns from left to right, the only information required to place dominos in column j is the identity of those rows i for which a domino crosses horizontally from j − 1 to j. This information is encoded in the values of $x_{i,j-1}$ for 1 ≤ i ≤ n.
In particular, group the variables into columns, with $X_j$ denoting variables $x_{1,j}, \ldots, x_{n,j}$, and $Y_j$ denoting variables $y_{1,j}, \ldots, y_{n-1,j}$. Scanning the board from left to right, consider $X_j$ to encode the "state" of processing after completing column j. As the scanning process reaches column j, there is a characteristic function $\sigma_{j-1}(X_{j-1})$ describing the set of allowed crossings of horizontally oriented dominos from column j − 1 into column j. No other information about the configuration of the board to the left is required. The characteristic function after column j can then be computed as:
$$\sigma_j(X_j) = \exists X_{j-1} \bigl[ \sigma_{j-1}(X_{j-1}) \land \exists Y_j \, T_j(X_{j-1}, Y_j, X_j) \bigr] \qquad (3)$$
where $T_j(X_{j-1}, Y_j, X_j)$ is a "transition relation" consisting of the conjunction of the Exactly1 constraints for column j. From this, we can existentially quantify the variables $Y_j$ to obtain a BDD encoding all compatible combinations of the variables $X_{j-1}$ and $X_j$. By conjoining this with the characteristic function for column j − 1 and existentially quantifying the variables $X_{j-1}$, we obtain the characteristic function for column j. With a mutilated chessboard, we generate leaf node $L_0$ in attempting the final conjunction. Note that Equation 3 does not represent a reformulation of the mutilated chessboard problem. It simply defines a way to schedule the conjunction and quantification operations over the input clauses.
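The column-by-column computation can be mimicked with explicit sets of states in place of BDDs. The sketch below is an illustration under simplifying assumptions, not the TBSAT schedule: a state is the set of rows in which a domino crosses into the next column, rows and columns are numbered from 0, and the function names are hypothetical. Enumerating the vertical placements within a column plays the role of quantifying the vertical-boundary variables, and taking the union of successor states over all current states plays the role of quantifying the previous state variables. Explicit enumeration is exponential in n, so this only demonstrates the semantics on small boards.

```python
def successors(n, present, incoming, is_last):
    """Crossing sets that can leave a column, given those entering it.

    present:  frozenset of rows whose square exists in this column
    incoming: frozenset of rows already covered by a domino from the left
    """
    if not incoming <= present:
        return set()                 # a domino would land on a removed square
    results = set()

    def fill(row, outgoing):
        if row == n:
            results.add(outgoing)
            return
        if row not in present or row in incoming:
            fill(row + 1, outgoing)  # nothing to cover in this row
            return
        below = row + 1
        if below < n and below in present and below not in incoming:
            fill(row + 2, outgoing)  # vertical domino on this row and the next
        if not is_last:
            fill(row + 1, outgoing | {row})  # horizontal domino to the right

    fill(0, frozenset())
    return results

def tileable(n, mutilated):
    """Scan the columns left to right, tracking reachable crossing sets."""
    states = {frozenset()}           # nothing crosses into the first column
    for col in range(n):
        present = set(range(n))
        if mutilated and col == 0:
            present.discard(0)       # upper-left corner removed
        if mutilated and col == n - 1:
            present.discard(n - 1)   # lower-right corner removed
        states = set().union(*(successors(n, frozenset(present), s, col == n - 1)
                               for s in states))
    return frozenset() in states
```

An empty final state set corresponds to reaching the false leaf in the final conjunction.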
One important rule-of-thumb in symbolic model checking is that the successive values of the next-state variables must be adjacent in the variable ordering. Furthermore, the vertical variables in $Y_j$ must be close to their counterparts in $X_{j-1}$ and $X_j$. Both objectives can be achieved by ordering the variables row-wise, interleaving the variables $x_{i,j}$ and $y_{i,j}$, ordering first by row index i and then by column index j. This requires the quantification operations of Equation 3 to be performed on non-root variables.
In our experiments, we found that this scanning reaches a fixed point after processing n/2 columns. That is, from that column onward, the characteristic functions become identical, except for a renaming of variables. This indicates that the set of all possible horizontal configurations stabilizes halfway across the board. Moreover, the BDD representations of the states grow as $O(n^2)$. For n = 340, the largest has just 29,239 nodes. The problem size for the mutilated chessboard scales as $n^2$, the number of squares in the board. Thus, an instance with n = 340 is 289 times larger than an instance with n = 20, in terms of the number of input variables and clauses. Column scanning yields a major benefit in the solver performance.
The plot labeled "No Quantification" demonstrates the importance of including existential quantification in solving this problem. These data were generated by using the same schedule as with column scanning, but with all quantification operations omitted. As can be seen, this approach could not scale beyond n = 10.
It is interesting to reflect on how our column-scanning approach relates to SAT-based bounded model checking (BMC) [4]. This approach to verification encodes the operation of a state transition system for k steps, for some fixed value of k, by replicating the transition relation k times. It then uses a SAT solver to detect whether some condition can arise within k steps of operation. By contrast, we effectively compress the mutilated chessboard problem into a state machine that adds tiles to successive columns of the board and then perform a BDD-based reachability computation for this system, much as would a symbolic model checker [14]. Just as BDD-based model checking can outperform SAT-based BMC for some problems, we have demonstrated that a BDD-based SAT solver can sometimes outperform a search-based SAT solver.

Pigeonhole Problem
The pigeonhole problem is one of the most studied problems in propositional reasoning. Given a set of n holes and a set of n + 1 pigeons, it asks whether there is an assignment of pigeons to holes such that 1) every pigeon is in some hole, and 2) every hole contains at most one pigeon. The answer is no, of course, but any resolution proof for this must be of exponential length [26]. Groote and Zantema have shown that any BDD-based proof of the principle that only uses the Apply algorithm must be of exponential size [25]. On the other hand, Cook constructed an extended resolution proof of size $O(n^4)$, in part to demonstrate the expressive power of extended resolution [17].
We used a representation of the problem that scales as $O(n^2)$, using an encoding of the at-most-one constraints due to Sinz [42]. It starts with a set of variables $p_{i,j}$ for 1 ≤ i ≤ n and 1 ≤ j ≤ n + 1, with the interpretation that pigeon j is assigned to hole i. The property that each pigeon j is assigned to some hole can be expressed with a single clause:
$$\mathit{Pigeon}_j = \bigvee_{i=1}^{n} p_{i,j}$$
Sinz's method of encoding the property that each hole i contains at most one pigeon introduces auxiliary variables to effectively track which holes are occupied, starting with pigeon 1 and working upward. These variables are labeled $s_{i,j}$ for 1 ≤ i ≤ n and 1 ≤ j ≤ n. Informally, variables $s_{i,1}, s_{i,2}, \ldots, s_{i,n}$ serve as a signal chain that indicates the point at which a pigeon has been assigned to hole i. For each hole i, there is a total of 3n − 1 clauses:
• Generate: $\overline{p}_{i,j} \lor s_{i,j}$ for 1 ≤ j ≤ n
• Propagate: $\overline{s}_{i,j-1} \lor s_{i,j}$ for 2 ≤ j ≤ n
• Suppress: $\overline{s}_{i,j-1} \lor \overline{p}_{i,j}$ for 2 ≤ j ≤ n + 1
Each of these clauses serves either to define how the next value in the chain is to be computed, or to describe the effect of the signal on the allowed assignments of pigeons to the hole. That is, for hole i, the signal is generated at position j if pigeon j is assigned to that hole. Once set, the signal continues to propagate across higher values of j. Once the signal is set, it suppresses further assignments of pigeons to the hole. This encoding requires 3n − 1 clauses and n auxiliary variables per hole.
Figure 13 shows the results of running the two solvers on this problem. Once again we see TBSAT with either linear or bucket evaluation having exponential scaling, as does KISSAT. None can go beyond n = 13 within the 1000-second time limit.
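The signal-chain idea can be made concrete with a small clause generator for a single hole. The sketch below is an assumption-laden illustration: the three clause groups are one standard reading of a Sinz-style at-most-one encoding that matches the stated count of 3n − 1 clauses and n auxiliary variables, and the function names, the 0-based indexing, and the signed-integer literal convention are choices made for the example. The helper `extendable` checks by brute force whether some setting of the auxiliary variables satisfies all clauses for a given assignment of pigeons.

```python
from itertools import product

def sinz_at_most_one(p, s):
    """Clauses stating that at most one variable in p is true.

    p: list of n + 1 variable numbers (pigeons for one hole)
    s: list of n auxiliary variable numbers (the signal chain)
    """
    n = len(s)
    assert len(p) == n + 1
    clauses = []
    for j in range(n):                    # generate: p[j] -> s[j]
        clauses.append([-p[j], s[j]])
    for j in range(1, n):                 # propagate: s[j-1] -> s[j]
        clauses.append([-s[j - 1], s[j]])
    for j in range(1, n + 1):             # suppress: s[j-1] -> not p[j]
        clauses.append([-s[j - 1], -p[j]])
    return clauses

def extendable(clauses, p, s, p_values):
    """True if some auxiliary assignment satisfies every clause."""
    for s_values in product([False, True], repeat=len(s)):
        value = dict(zip(p, p_values))
        value.update(zip(s, s_values))
        if all(any(value[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False
```

For n = 3 this yields 8 clauses, and an assignment to the four pigeon variables can be extended to the auxiliary variables exactly when at most one of them is true.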
On the other hand, the column scanning approach used for the mutilated chessboard can also be applied to the pigeonhole problem when the Sinz encoding is used. Consider an array with hole i represented by row i and pigeon j represented by column j. Let $S_j$ represent the auxiliary variables $s_{i,j}$ for 1 ≤ i ≤ n. The "state" is then encoded in these auxiliary variables. In processing pigeon j, we can assume that the possible combinations of values of auxiliary variables $S_{j-1}$ are encoded by a characteristic function $\sigma_{j-1}(S_{j-1})$. In addition, we incorporate into this characteristic function the requirement that each pigeon k, for 1 ≤ k ≤ j − 1, is assigned to some hole. Letting $P_j$ denote the variables $p_{i,j}$ for 1 ≤ i ≤ n, the characteristic function at column j can then be expressed as:
$$\sigma_j(S_j) = \exists S_{j-1} \bigl[ \sigma_{j-1}(S_{j-1}) \land \exists P_j \, T_j(S_{j-1}, P_j, S_j) \bigr]$$
where the "transition relation" $T_j$ consists of the clauses associated with the auxiliary variables, plus the clause encoding constraint $\mathit{Pigeon}_j$.
As with the mutilated chessboard, having a proper variable ordering is critical to the success of a column scanning approach. We interleave the ordering of the variables $p_{i,j}$ and $s_{i,j}$, ordering them first by i (holes) and then by j (pigeons).
Figure 13 demonstrates the effectiveness of the column-scanning approach. We were able to handle instances up to n = 210. Unlike with the mutilated chessboard, the scanning does not reach a fixed point. Instead, the BDDs start very small, because they must encode the locations of only a small number of occupied holes. They reach their maximum size at pigeon n/2, as the number of combinations for occupied and unoccupied holes reaches its maximum of $C(n, n/2)$. The BDD sizes then drop off, symmetrically to the first n/2 pigeons, as the encoding needs to track the positions of a decreasing number of unoccupied holes. Fortunately, all of these BDDs scale quadratically with n, reaching a maximum of 11,130 nodes for n = 210.
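The rise and fall in the number of state combinations can be seen in a simplified explicit-state version of the scan. This sketch is an abstraction, not the BDD computation: it assumes the state after each pigeon is exactly the set of occupied holes, and the function name is hypothetical. It lists explicitly the state sets that the BDDs represent compactly.

```python
def scan_pigeons(n_holes, n_pigeons):
    """Return the number of reachable states after placing each pigeon.

    A state is the frozenset of occupied holes; each pigeon must go
    into a hole that is not yet occupied.
    """
    states = {frozenset()}
    counts = []
    for _ in range(n_pigeons):
        states = {s | {h} for s in states
                  for h in range(n_holes) if h not in s}
        counts.append(len(states))
    return counts
```

For 4 holes and 5 pigeons the counts are 4, 6, 4, 1, 0: they peak at C(4, 2) = 6 after pigeon n/2, and the final count of 0 corresponds to the unsatisfiability of the formula. The explicit counts grow as binomial coefficients, whereas the text above reports that the BDD sizes grow only quadratically.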
We also ran experiments using a direct encoding of the at-most-one constraints, having a clause $\overline{p}_{i,j} \lor \overline{p}_{i,k}$ for each hole i and for 1 ≤ j < k ≤ n + 1. This encoding scales as $\Theta(n^3)$. With this encoding, we were unable to find any method that avoided exponential scaling using either TBSAT or KISSAT.

Evaluation
Overall, our results demonstrate the potential for generating small proofs of unsatisfiability using BDDs. We were able to greatly outperform traditional CDCL solvers on four well-known challenge problems.
The success for the first two benchmark problems relies on the ability of BDDs to handle exclusive-or operations efficiently. Generally, the exclusive-or of k variables can be expressed as a BDD with 2k + 1 nodes, including the leaves. These representations are also independent of the variable ordering. As we saw, however, it is critical to quantify variables whenever possible, to avoid requiring the BDD to encode the parity relationships among many overlapping subsets of the variables. We found that bucket elimination works well on these problems, and that randomness in the problem structure and the variable ordering did not adversely affect performance. This strategy was outlined by Jussila, Sinz, and Biere; our experimental results serve as a demonstration of the utility of their work.
The success of column scanning for the final two benchmark problems relies on finding a way to scan in one dimension, encoding the "state" of the scan in a compact form. This strategy only works when the problem is encoded in a way that it can be partitioned along two dimensions. This approach draws its inspiration from symbolic model checking, and it requires the more general capability to handle quantification that we have presented. One strength of modern SAT solvers is that they generally succeed without any special guidance from the user. It remains an open question whether column scanning can be made more general and whether a suitable schedule and variable ordering can be generated automatically. Without these capabilities, our results for column scanning show promise, but they require too much guidance from the user.
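The node count of 2k + 1 for exclusive-or can be checked with a small construction. This is a minimal sketch using a hypothetical unique table, not code from any BDD package: nodes are created level by level for the two possible running parities, identical nodes are shared, and the two leaves are added to the total. It assumes k ≥ 1.

```python
def xor_bdd_size(k):
    """Count the nodes, including both leaves, of the reduced BDD for
    the exclusive-or of k variables (k >= 1)."""
    unique = {}   # (level, low, high) -> node id; ids 0 and 1 are leaves
    memo = {}     # (level, parity so far) -> node id

    def build(level, parity):
        if level == k:
            return parity                # leaf equals the accumulated parity
        key = (level, parity)
        if key not in memo:
            low = build(level + 1, parity)        # variable set to 0
            high = build(level + 1, parity ^ 1)   # variable set to 1
            if low == high:
                memo[key] = low                   # redundant test is skipped
            else:
                memo[key] = unique.setdefault((level, low, high),
                                              len(unique) + 2)
        return memo[key]

    build(0, 0)
    return len(unique) + 2
```

The top level contributes one node, each of the remaining k − 1 levels contributes two (one per running parity), and the two leaves bring the total to 2k + 1.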
Other studies have compared BDDs to CDCL solvers on a variety of benchmark problems. Several of these observed exponential performance for BDD-based solvers for problems for which we have obtained more promising results. Uribe and Stickel [45] ran experiments with the mutilated chessboard problem, but they did not do any variable quantification. Pan and Vardi [38] applied a variety of scheduling and variable ordering strategies for the mutilated chessboard and pigeonhole problems. Although they found that they could get better performance than with a CDCL solver, their performance still scaled exponentially. Obtaining scalability requires devising more problem-specific approaches than the ones they considered. Our experiments with KISSAT confirm that a BDD-based SAT solver requires careful attention to the problem encoding, the variable ordering, and the use of quantification in order to outperform a state-of-the-art CDCL solver.
Tables 1-2 provide some performance data for the largest instances solved for each of the four benchmark problems. A first observation is that these problems are very large, with tens of thousands of input variables and clauses.
Looking at the BDD data, the total number of BDD nodes indicates the total number generated by the function GETNODE, and for which extension variables are created. These are numbered in the millions, and far exceed the number of input variables. The entries for "Maximum live clauses" show the peak number of clauses that had been added but not yet deleted across the entire proof. As can be seen, these can vary from 7% to nearly 40% of the total clauses. The peak number of live clauses proved to be a limiting factor for the LRAT proof checker.
Figure 14 provides more insight into the nature of the proofs generated by the CDCL solver KISSAT and the BDD-based solver TBSAT. Each point indicates one benchmark run, with the value on the Y axis indicating the runtime of the solver divided by the number of clauses generated, scaled by $10^6$, while the X value is the proof size. In other words, the Y values show the average time, in microseconds, for each proof clause to be generated. In all, 340 points are shown, with 75 for KISSAT, and the rest for TBSAT in its various operating modes.
These data quantify a fundamental difference between how proofs are generated with a CDCL solver and with a BDD-based solver. A CDCL solver emits a clause each time it encounters a conflict during the search. This may come after many steps involving selecting a decision variable and performing Boolean constraint propagation. Thus, there can be considerable, and highly variable, amounts of processing between successive clause emissions. We see average times ranging between 4 and 60 microseconds for the KISSAT runs, and even these averages mask the considerable variations that can occur within a single run.
With a BDD-based solver, on the other hand, the proof has the form of a log describing the recursive steps taken by the BDD algorithm, expressed within a standard proof framework. There is very little variability from one run to the next, and the different evaluation modes have minimal impact. The only trend of note is a general increase of the average time per clause as the proofs get longer. The short runs require less than 1.0 microsecond per clause, while the longer ones require over 2.0. This increase can be attributed to the complexity of managing long BDD computations, requiring garbage collection, table resizing, and other overhead operations.
Figure 15 shows a similar plot, but with the Y axis indicating the average time for the proof checkers to check each clause. Again, we see two important characteristics. The proof steps generated by KISSAT do not include lists of antecedent clauses (hints). Instead, the checking program DRAT-TRIM scans the set of clauses and constructs each hint sequence. This takes significant effort and can vary greatly across benchmarks. The proofs generated by TBSAT, on the other hand, contain full hints and can therefore be readily checked at an average of around 0.7 microseconds per proof clause, regardless of the proof size or solution method.

Conclusion
The pioneering work by Biere, Sinz, and Jussila [30,43] did not lead to as much follow-on work as it deserved. Here, over fifteen years later, we found that small modifications to their approach enable a powerful, BDD-based SAT solver to generate proofs of unsatisfiability. The key to its success is the ability to perform arbitrary existential quantification. As the experimental results demonstrate, such a capability is critical to obtaining reasonable performance.
More advanced BDD-based SAT solvers employ additional techniques to improve their performance. Extending our methods to handle these techniques would be required to have them generate proofs of unsatisfiability. Some of these would be straightforward. For example, Weaver, Franco, and Schlipf [48] derive a very general set of conditions under which existential quantification can be applied while preserving satisfiability. For generating proofs of unsatisfiability, our ability to prove that existential quantification preserves implication would be sufficient for all of these cases. On the other hand, more advanced solvers, such as SBSAT [22], employ a variety of techniques to prune the intermediate BDDs based on the structure of other BDDs that remain to be conjoined. This pruning generally reduces the set of satisfying assignments to the BDD, and so implication does not hold.
In more recent work, we have been able to show that BDD-based methods can use solution methods that view a Boolean formula as encoding linear equations over integers or modular integers [10]. Proof-generating BDD operations can be used to justify the individual steps taken while solving systems of equations by several different methods. That has allowed us to scale the benchmark problems considered in Section 5 even further, and to avoid the need for problem-specific solution methods. We have also demonstrated that proof-generating BDDs can be integrated into a conventional CDCL solver to allow it to use Gauss-Jordan elimination on the parity constraints encoded in the formula [13]. Overall, we believe that BDD-based methods can augment other SAT solving methods to provide new capabilities.
The ability to generate correctness proofs in a BDD-based SAT solver invites us to also consider generating proofs for other tasks to which BDDs are applied. We have already done so for quantified Boolean formulas, demonstrating the ability to generate proofs for both true and false formulas in a unified framework [11]. Other problems of interest include model checking and model counting. Perhaps a proof of unsatisfiability could provide a useful building block for constructing correctness proofs for these other tasks.
We would like to thank Chu-Min Li and Laurent Simon for sharing their programs for generating the two classes of Urquhart benchmarks evaluated in Section 5.2.

Figure 2: BDD representation of clause C = a b c and the justification of root unit clause u a with one RUP step.

Figure 3: Terminal cases and recursive step of the Apply operation for conjunction, modified for proof generation. Each call returns both a node and a proof step.

Figure 4: Supporting clauses for standard step of the Apply algorithm for conjunction operations

Figure 5: RUP proof steps for standard recursive step of the conjunction operation

Figure 6: RUP proof steps for conjunction for illustrative special cases

Figure 7: Terminal cases and recursive step of the Apply algorithm for implication checking

Figure 8: Clause structure for standard step of implication checking

Figure 11: Generating unsatisfiability proofs for Urquhart formulas with size parameter m. KISSAT timed out for even the minimum-sized version of Li's benchmark (m = 3).

Figure 13: Generating unsatisfiability proofs for assigning n + 1 pigeons to n holes using Sinz's encoding

Table 1: Summary data for the largest parity and Urquhart formulas solved

Table 2: Summary data for the largest chess and pigeonhole problems solved