1 Introduction

Given a formula of propositional logic, the task of a SAT solver is to decide whether there exists an assignment that satisfies the formula. Such a satisfying assignment, if found by a SAT solver, is easily verifiable by independent checkers and so one does not need to trust the inner workings of the solver. The situation with unsatisfiable formulas, i.e., where no satisfying assignment exists, is not as straightforward. Here, SAT solvers must produce an unsatisfiability proof (also called a refutation) for the input formula. Ideally, the proof system and corresponding proof format for such proofs should be sufficiently expressive, allowing SAT solvers to efficiently produce proofs that correspond to the SAT solving techniques they use at runtime. At the same time, the resulting proofs ought to be efficiently checkable by independent and trustworthy tools.

The de facto standard proof system for propositional unsatisfiability proofs is Resolution Asymmetric Tautology (\(\textsf{RAT}\)) [31]. The associated DRAT format [55] combines clause addition based on \(\textsf{RAT}\) steps and clause deletion. Independent checking tools can validate proofs in the DRAT format; they have been used to check the results of the SAT competitions since 2014 [55] and in industry [20]. Enriching DRAT proofs with hints is the main technique for developing efficient verified proof checkers, e.g., existing verified checkers use the enriched proof formats LRAT [10] and GRAT [39].

A recently proposed proof system, called Propagation Redundancy (\(\textsf{PR}\)) [28], generalizes \(\textsf{RAT}\). There exist short \(\textsf{PR}\) proofs without new variables for many problems that are hard for resolution, such as pigeonhole formulas, Tseitin problems, and mutilated chessboard problems [26]. Due to the absence of new variables it is easier to find \(\textsf{PR}\) proofs automatically [27], and it is considered unlikely that there exist short \(\textsf{RAT}\) proofs for these problems that do not introduce new variables nor reuse eliminated variables [28]. Such \(\textsf{PR}\) proofs can be checked directly [28], or they can first be transformed into DRAT proofs or even Extended Resolution proofs by introducing new variables [25, 34]. In theory, the blowup is small, i.e., polynomial-sized. However, in practice, the transformed proofs can be significantly more expensive to validate compared to the original \(\textsf{PR}\) proofs [28].

Table 1 A comparison of SAT proof checkers that have been verified in various proof assistants [10, 20, 39]. Green background (cells with \({+}\)) indicates, in our view, desirable properties, e.g., LPR is based on a stronger proof system than LRAT and GRAT, while red backgrounds (cells with \(\times \)) indicate less desirable properties. Yellow backgrounds (cells with \({-}\)) are also undesirable but to a lesser extent.

A natural question arises: why should proof checkers be trusted to correctly check proofs if we do not likewise trust SAT solvers to correctly determine satisfiability? One answer is that proof checkers are much easier to implement so their code can be carefully audited. Another answer is that the algorithms underlying proof checkers have been formally verified in a proof assistant [10, 20, 39]. However, to get executable code for these verified checkers, some additional unverified steps are still required. Although unlikely, each of these steps can introduce bugs in the resulting executable: (a) the algorithms are extracted by unverified code generation tools into source code for a programming language; (b) unverified parsing, file I/O, and command-line interface code is added; (c) the combined code is compiled by unverified compilers to executable machine code.

The contributions of this article are: (i) a new Linear PR (henceforth LPR) proof format that enriches \(\textsf{PR}\) proofs with hints and is backwards compatible with the LRAT format, (ii) an extension of the existing DPR-trim tool [28] to efficiently convert \(\textsf{PR}\) proofs into LPR format, and (iii) cake_lpr, an efficient verified LPR proof checker with correctness guarantees, including for steps (a)–(c) enumerated above. The cake_lpr tool was used to validate the unsatisfiability proofs for the 2020 SAT Competition because of its strong trust story combined with easy compilation and usage. Moreover, the stronger \(\textsf{PR}\) proof system could be supported in future competitions. The tool is publicly available at: https://github.com/tanyongkiam/cake_lpr

This article extends our conference version [53] with: (iv) a new compositional proof format consisting of a top-level summary proof whose proof steps can be separately justified by respective underlying proofs and (v) verified extensions in cake_lpr to support the compositional proof format. The new correctness result for cake_lpr allows users to exploit verified compositional proof checking by running parallel instances of the tool to check very large unsatisfiability proofs, such as those typically found in SAT-solver aided proofs of mathematical results [23, 29, 35]. In particular, we explain how compositional proofs can be conveniently generated from the cube-and-conquer [22] SAT solving technique that is naturally parallelizable. Together with our verified checker, this enables a fully parallel pipeline for SAT solving, proof generation, and verified proof checking. To the best of our knowledge, this is the first verification result for a proof checker that formally accounts for multiple, separate executions of the checker.

Section 3 shows how \(\textsf{PR}\) proofs are enriched to obtain LPR proofs and presents the corresponding LPR proof checking algorithm (Contributions (i) & (ii)). Existing LRAT proof checkers can be extended in a clean and minimal way to support LPR proofs. Section 4 introduces the compositional proof format which extends an underlying clausal proof format with support for separate proof checking (Contribution (iv)). Section 5 explains the implementation of our proof checker in CakeML, as well as the correctness guarantees and high-level verification strategy behind the proofs (Contributions (iii) & (v)). Section 6 benchmarks our proof checker against existing implementations. A summary comparison of the new proof checker against existing verified proof checkers is in Table 1.

2 Background

This section provides background on CakeML and its related tools. It also recalls the standard problem format and clausal proof systems used by SAT solvers.

2.1 HOL4 and CakeML

HOL4 is a proof assistant implementing classical higher-order logic [51] and CakeML  [45] is a programming language with syntax and semantics formally defined in HOL4. Tools for developing verified CakeML software are used to fill the verification gaps in the correspondingly enumerated items in Section 1:

  (a)

    Two tools are used to produce (or extract) verified CakeML source code in HOL4:

    • the CakeML proof-producing translator [46] automatically synthesizes verified source code from pure algorithmic specifications;

    • the CakeML characteristic formula (CF) framework [19] provides a separation logic which can be used to manually verify (more efficient) imperative code for performance-critical parts of the proof checker.

  (b)

    CakeML provides a foreign function interface (FFI) and a corresponding formal FFI model [15]. These are used to verify system call interactions, e.g., file I/O and command-line interfaces, under carefully specified assumptions on the system environment.

  (c)

    Most importantly, CakeML has a compiler that is verified [54] to preserve the semantics of source CakeML programs down to their compiled machine-code implementations. Hence, all guarantees obtained from the preceding steps can be carried down to the level of machine code with proofs.

The combination of these tools enables binary code extraction [36] where verified machine code is extracted directly in HOL4. Several CakeML programs have been verified using these tools, including: certificate checkers for floating-point error bounds [6] and vote counting [18], an OpenTheory article checker [1], and the bootstrapped CakeML compiler [54]. Other toolchains can be used to build verified checkers, e.g., Œuf provides a similar binary code extraction toolchain in the Coq proof assistant [44]; the Verified Software Toolchain [9] provides a program logic for a subset of C which can be compiled with the verified compiler CompCert [40]; and the Isabelle Refinement framework can produce verified LLVM implementations [37, 38].

2.2 SAT Problems and Clausal Proofs

Fix a set of boolean variables \(x_1,\dots ,x_n\), where the negation of variable \(x_i\) is denoted \(\overline{x}_i\), and the negation of \(\overline{x}_i\) is identified with \(x_i\). Variables and their negations are called literals and are denoted using \(l\).

The input for propositional SAT solvers is a formula \(F\) in conjunctive normal form (CNF) over the set of variables \(x_1,\dots ,x_n\). Here, CNF means that \(F\) consists of an outer logical conjunction \(F\equiv \bigwedge _{i=1}^m C_i\), where each clause \(C_i\) is a disjunction over some of the literals \(C_i \equiv l_{i1} \vee l_{i2} \vee \dots \vee l_{ik}\); in general, each clause can contain a different number of literals. Formulas in CNF can be represented directly as sets of clauses and clauses as sets of literals. The empty clause is denoted \(\bot \).

An assignment \(\alpha \) assigns boolean values (true or false) to each variable; \(\alpha \) can be partial, i.e., it only assigns values to some of the variables. Like formulas and clauses, a (partial) assignment \(\alpha \) can be represented as a set of literals such that \(l \in \alpha \) iff \(\alpha \) assigns l to true. For consistency, \(\alpha \) may contain at most one of each literal l or its negation. The negation of an assignment, denoted \(\overline{\alpha }\), assigns the negation of all literals in \(\alpha \) to true. An assignment \(\alpha \) satisfies a clause C iff their set intersection is nonempty. Additionally, we define \(C\vert _\alpha = \top \) if \(\alpha \) satisfies C; otherwise, \(C\vert _\alpha \) denotes the result of removing from C all the literals falsified by \(\alpha \), i.e., \(C\vert _\alpha = C \setminus \overline{\alpha }\). For a formula F, we define \(F\vert _\alpha = \{C\vert _\alpha \mid C \in F \text { and } \alpha \text { does not satisfy } C\}\). Intuitively, \(F\vert _\alpha \) contains the remaining simplified clauses in formula F after committing to the partial assignment \(\alpha \).
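For concreteness, the restriction operation can be sketched in Python (an illustrative sketch only; the paper's formalization uses HOL4 sets). Clauses are sets of nonzero integers as in DIMACS, and a marker value stands in for \(\top \):

```python
TOP = None  # marker for a satisfied clause (the paper writes this as ⊤)

def restrict_clause(clause, alpha):
    # C|_alpha: TOP if alpha satisfies C, otherwise C with the
    # literals falsified by alpha removed
    if any(l in alpha for l in clause):
        return TOP
    return frozenset(l for l in clause if -l not in alpha)

def restrict_formula(formula, alpha):
    # F|_alpha: restrict every clause, dropping the satisfied ones
    restricted = (restrict_clause(c, alpha) for c in formula)
    return [c for c in restricted if c is not TOP]

# Example: F = (x1 ∨ x2) ∧ (¬x1 ∨ x3) restricted by alpha = {x1}
F = [frozenset({1, 2}), frozenset({-1, 3})]
print(restrict_formula(F, {1}))  # [frozenset({3})]
```

The first clause is satisfied by \(x_1\) and disappears, while the second loses its falsified literal \(\overline{x}_1\).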

The task of a SAT solver is to determine whether \(F\) is satisfiable, i.e., whether there exists a (possibly partial) assignment \(\alpha \) such that \(F\vert _\alpha \) is empty. Any satisfying assignment can be used as a certificate of satisfiability. Formulas without a satisfying assignment are unsatisfiable. Certifying unsatisfiability is more difficult and typically uses a clausal proof system [28]. The idea behind these proof systems is briefly recalled next, using the key concept of clause redundancy.

Definition 1

A clause \(C\) is redundant with respect to formula \(F\) iff \(F\wedge C\) and \(F\) are both satisfiable or both unsatisfiable, i.e., they are satisfiability equivalent.

A clause \(C\) that is redundant for \(F\) can be added to \(F\) without changing its satisfiability. Clausal proof systems work by successively adding redundant clauses to \(F\) until the empty clause \(\bot \) is added. Such a sequence of additions is illustrated below:

$$\begin{aligned} F \,\Longrightarrow \, F\wedge C_1 \,\Longrightarrow \, F\wedge C_1\wedge C_2 \,\Longrightarrow \, \cdots \,\Longrightarrow \, F\wedge C_1\wedge C_2\wedge \cdots \wedge \bot \end{aligned}$$

Satisfiability is preserved along each \(\Longrightarrow \) step because of clause redundancy, e.g., satisfiability of \(F\) implies satisfiability of \(F\wedge C_1\). Since the final formula containing \(\bot \) is unsatisfiable, the sequence of redundant clause addition steps \(C_1, C_2, \dots , \bot \) corresponds to a proof of unsatisfiability for \(F\). Deciding clause redundancy is as hard as solving the SAT problem itself because \(\bot \) is always redundant for unsatisfiable formulas. The difference between clausal proof systems is how the redundancy of a (proposed) redundant clause \(C\) is efficiently certified at each proof step.

Many syntactic notions of redundancy are based on unit propagation. A unit clause is a clause with only one literal. The result of applying the unit clause rule to a formula F is the formula \(F\vert _{l}\) where (l) is a unit clause in F. The iterated application of the unit clause rule to a formula F until no unit clauses are left is called unit propagation. If unit propagation on F yields the empty clause \(\bot \), denoted by \(F \vdash _1 \bot \), we say that F implies \(\bot \) by unit propagation. The notion of implied by unit propagation is also used for regular clauses as follows: \(F \vdash _1 C\) iff \(F \wedge \lnot C \vdash _1 \bot \) with \(\lnot C = \bigwedge _{l \in C} (\overline{l})\). Observe that \(\lnot {C}\) can be viewed as a partial assignment that assigns the literals \(\overline{l}\), for \(l \in C\), to true. For a formula G, \(F \vdash _1 G\) iff \(F \vdash _1 C\) for all \(C \in G\). The main clausal proof system used in this article is based on propagation redundant clauses, which are defined as follows.
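A straightforward (and deliberately unoptimized) Python sketch of unit propagation and the implied-by-unit-propagation check, again with clauses as sets of nonzero integers:

```python
def unit_propagate(formula):
    # Repeatedly apply the unit clause rule; return the set of
    # propagated literals, or None if the empty clause ⊥ is derived.
    assignment = set()
    changed = True
    while changed:
        changed = False
        for clause in formula:
            if any(l in assignment for l in clause):
                continue  # clause already satisfied
            unassigned = [l for l in clause if -l not in assignment]
            if not unassigned:
                return None  # all literals falsified: ⊥ derived
            if len(unassigned) == 1:
                assignment.add(unassigned[0])  # unit clause rule
                changed = True
    return assignment

def implies_by_up(formula, clause):
    # F ⊢₁ C  iff  F ∧ ¬C ⊢₁ ⊥, with ¬C added as unit clauses
    negated = [frozenset({-l}) for l in clause]
    return unit_propagate(formula + negated) is None
```

For example, \((x_1 \vee x_2) \wedge (\overline{x}_1 \vee x_2) \vdash _1 (x_2)\): assuming \(\overline{x}_2\) propagates \(x_1\) from the first clause and then falsifies the second.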

Definition 2

Let F be a formula, C a nonempty clause, and \(\alpha \) the smallest assignment that falsifies C. Then, C is propagation redundant (\(\textsf{PR}\)) with respect to F if there exists an assignment \(\omega \) which satisfies C and such that \(F\vert _\alpha \vdash _1 F\vert _\omega \).

Intuitively, a \(\textsf{PR}\) clause C is redundant because any satisfying assignment for F that does not already satisfy C can be modified into a satisfying assignment for \(F \wedge C\) by updating its literals assigned to true according to the (partial) witnessing assignment \(\omega \) (see [28, Theorem 1] for more details). Propagation redundancy is efficiently checkable in polynomial time given the witnessing assignment \(\omega \). Moreover, \(\textsf{PR}\) generalizes various other notions of clause redundancy (see [28, Theorem 2]), including the de facto standard Resolution Asymmetric Tautology (\(\textsf{RAT}\)) proof system, which can compactly express all current techniques used in state-of-the-art SAT solvers [31]. There is ongoing research towards integrating \(\textsf{PR}\) in SAT solvers [27, 28]. For example, \(\textsf{PR}\)-based preprocessing has been shown to improve solver performance on SAT competition benchmarks [49].

In practice, clausal proof formats also support clause deletions to speed up proof validation, especially for proof checking steps that need to iterate over the entire formula, e.g., for Def. 2. Hence, unsatisfiability proofs for formula F are modeled as sequences \(I_{1}, \dots , I_n\) of instructions that either add or delete a clause; an addition instruction is a triple \(\langle \textsf{a}, C, \omega \rangle \), where C is a clause and \(\omega \) is a (possibly empty) witnessing assignment; a deletion instruction is a pair \(\langle \textsf{d}, C\rangle \) where C is a clause. The sequence \(I_{1}, \dots , I_n\) gives rise to formulas \(F_1, \dots , F_n\) with \(F_0 = F\), where \(F_j\) is the accumulated formula, i.e., a (multi)set of clauses, up to the j-th instruction computed recursively according to (1).

$$\begin{aligned} F_{j} = {\left\{ \begin{array}{ll} F_{j-1} \cup \{C_j\} &{} \text {if }I_j\text { is of the form } \langle \textsf{a}, {C_j, \omega }\rangle \\ F_{j-1} \setminus \{C_j\} &{} \text {if }I_j\text { is of the form } \langle \textsf{d}, {C_j}\rangle \\ \end{array}\right. } \end{aligned}$$
(1)

A \(\textsf{PR}\) proof of unsatisfiability is valid if the last instruction adds the empty clause \(I_n = \langle \textsf{a}, \bot ,\emptyset \rangle \), and, for all addition instructions \(I_j = \langle \textsf{a}, C_j, \omega _j\rangle \), it holds that \(C_j\) is \(\textsf{PR}\) with respect to \(F_{j-1}\) using witness \(\omega _j\). In case an empty witness is provided for \(I_j\), then \(F_{j-1} \vdash _1 C_j\) should hold.
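The accumulation in (1) can be sketched as a simple replay loop (an illustrative sketch: witnesses are dropped from addition instructions for brevity, and the per-step redundancy checks that a real checker performs are elided here):

```python
def replay(formula, instructions):
    # Accumulate F_j per equation (1); checking redundancy of each
    # added clause is elided in this sketch.
    F = list(formula)
    for tag, clause in instructions:
        if tag == 'a':
            F.append(clause)
        elif tag == 'd':
            F.remove(clause)  # multiset deletion: removes one copy
    return F

# A toy instruction sequence ending with the empty clause
proof = [('a', frozenset({2})),
         ('d', frozenset({1, 2})),
         ('a', frozenset())]
final = replay([frozenset({1, 2})], proof)
print(frozenset() in final)  # True: ⊥ was added, so the proof is complete
```

A valid \(\textsf{PR}\) proof additionally requires every added clause to be \(\textsf{PR}\) with respect to the accumulated formula at that point, which is exactly what the hint-based LPR checking in Section 3 makes efficient.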

3 Linear Propagation Redundancy

This section describes a new clausal proof format called Linear Propagation Redundancy (LPR) which is designed to enable efficient validation of \(\textsf{PR}\) clauses using a (verified) proof checker. It also presents our enhancements to the DPR-trim tool to efficiently add hints to \(\textsf{PR}\) proofs, thereby turning them into LPR proofs. Throughout the section, we emphasize how LPR can be viewed as a clean and minimal extension of the existing LRAT proof format, which thereby enables its straightforward implementation in existing LRAT tools.

The most commonly used proof format for SAT solvers is DRAT, which combines deletion with \(\textsf{RAT}\) redundancy [55]. DRAT proofs are easy for SAT solvers to emit and all top-tier SAT solvers support the format, but it has some disadvantages for verified proof checking. In particular, checking whether a clause is \(\textsf{RAT}\) requires a significant amount of proof search to find the unit clauses necessary for showing the implied-by-unit-propagation \(\vdash _1\) property. This complicates verification of the proof checking algorithm and slows down the resulting verified proof checkers. The idea behind the Linear RAT (LRAT) [10, 20] and GRAT [39] formats is to include these unit clauses as hints so that verified proof checkers can follow the hints directly without the need for proof search. The LPR format lifts this idea to allow fast validation of the \(\textsf{PR}\) property as follows.

Fig. 1

The grammar for the LPR format. Additions compared to the LRAT grammar [10] are highlighted in bold.

An assignment \(\omega \) reduces a clause C if \(\omega \) falsifies at least one literal in C, i.e., \(C\vert _\omega \ne C\). To check the \(\textsf{PR}\) property \(F\vert _\alpha \vdash _1 F\vert _\omega \), it suffices to check, for each clause \(C \in F\) reduced by \(\omega \), that \(F\vert _\alpha \vdash _1 C\vert _\omega \). In practice, a smaller \(\omega \) yields a cheaper \(\textsf{PR}\) check. The LPR format extends the \(\textsf{PR}\) format by adding, for each clause reduced by the witness, a list of all unit clause hints required for showing the implied-by-unit-propagation property. Additionally, in order to point to clauses, the LPR format includes an index for each clause at the beginning of each line. The grammar of the LPR format is shown in Fig. 1.

The DPR-trim tool [28, Sect. 6] is built on top of DRAT-trim and facilitates verification of DPR proofs, a generalization of DRAT proofs with \(\textsf{PR}\) clause addition. If a clause addition step includes a witnessing assignment \(\omega \), then the \(\textsf{PR}\) redundancy check is performed. Otherwise, DPR-trim falls back on the \(\textsf{RAT}\) check from the DRAT-trim code base. Deletion steps are not validated. DPR-trim has similar features compared to DRAT-trim, including backward checking, extraction of unsatisfiable cores, and proof optimization.

Our extension to DPR-trim enriches input \(\textsf{PR}\) proofs by finding and adding all required unit clause hints to produce LPR proofs. Most of the changes generalize the existing code in DRAT-trim for producing LRAT proofs. The main \(\textsf{PR}\)-specific optimization shrinks the witness \(\omega \) where possible: every literal in \(\omega \cap \alpha \) is removed, as is any literal in \(\omega \) that is implied by unit propagation from \(F\vert _\alpha \). The shrinking was shown to be correct [28], but had not been implemented so far. We observed that the witnesses in the \(\textsf{PR}\) proofs produced by SaDiCaL [27] can be substantially compressed using this method.

Fig. 2 (left) shows an example formula in the standard DIMACS problem format. The DIMACS format includes a header line starting with “p cnf ” followed by the number of variables and the number of clauses. The non-comment lines (not starting with “c ”) represent clauses, and they end with “0”. Positive integers denote positive literals, while negative integers denote negative literals. By convention (following LRAT), the clauses are implicitly indexed according to their order of appearance in the file, starting from index 1. Fig. 2 (right) shows a corresponding proof in LPR format. Deletion lines in LPR are formatted identically to LRAT [10] (not shown here). For clause addition lines, the LPR format only differs from LRAT in case the clause to be added has \(\textsf{PR}\) but not \(\textsf{RAT}\) redundancy. A clause addition line in LPR format consists of three parts. The first part is the first integer on the line, which denotes the index of the new clause. The second part exactly matches the \(\textsf{PR}\) proof format [28]: it consists of the redundant clause and its witness, where the first group of literals is the clause and the (potentially empty) witness starts from the second occurrence of the clause's first literal and extends until the first 0 that separates it from the unit clause hints. The third part (after the first 0) consists of the unit clause hints and exactly matches the LRAT format [10].
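As an illustration of this line layout, the following Python sketch splits an LPR clause-addition line into its three parts, locating the witness via the second occurrence of the clause's first literal. This is illustrative only (cake_lpr itself is written in CakeML) and ignores deletion lines and LRAT-style negative hint groups:

```python
def parse_lpr_addition(line):
    # Split an LPR clause-addition line into (index, clause, witness, hints).
    nums = [int(tok) for tok in line.split()]
    idx, rest = nums[0], nums[1:]
    zero = rest.index(0)  # first 0 separates clause/witness from hints
    clause_and_witness, hints = rest[:zero], rest[zero + 1:]
    if hints and hints[-1] == 0:
        hints = hints[:-1]  # drop the trailing terminator
    clause, witness = clause_and_witness, []
    if clause_and_witness:
        first = clause_and_witness[0]
        try:
            # witness starts at the second occurrence of the first literal
            split = clause_and_witness.index(first, 1)
            clause = clause_and_witness[:split]
            witness = clause_and_witness[split:]
        except ValueError:
            pass  # no witness: plain LRAT-style addition line
    return idx, clause, witness, hints

# Clause (x1 ∨ x2) with witness {x1, ¬x3} and unit clause hints 4, 5
print(parse_lpr_addition("7 1 2 1 -3 0 4 5 0"))
```

Note that a line without a repeated first literal parses exactly as an LRAT addition line, reflecting the backwards compatibility of LPR with LRAT.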

Fig. 2

(Left) Clauses of an unsatisfiable pigeonhole formula (4 pigeons, 3 holes) in the DIMACS format used by SAT solvers. The first 4 clauses encode that each pigeon belongs to at least one hole, e.g., the variable 1 (resp. 4, 7, 10) is set to true iff pigeon A (resp. B, C, D) is in the first hole; the latter 18 clauses encode that no two pigeons share the same hole, e.g., the clause -1 -4 encodes that pigeons A and B do not share the first hole. (Right) The LPR refutation consisting of clause-witness pairs and unit clause hints. The first bold integer in each line is the clause index while other bold integers are the unit clause hints. Dropping the bold integers yields a proof in the \(\textsf{PR}\) format. Redundant spaces have been added to improve readability.

Fig. 3

Algorithm to check a single clause addition step in the LPR format. The bold parts show the additions compared to LRAT proof checking [10].

The checking algorithm for LPR, shown in Fig. 3, overlaps significantly with that for LRAT (see [10, Algorithm 1]). The only differences are in Steps 4 and 5.1. In Step 4, the witness is used (if present) instead of always using the first literal in \(C_j\). In Step 5.1, clauses are skipped if they are satisfied by the witness. Notice that a clause can only be both reduced and satisfied by a witness if the witness consists of at least two literals, while in the LRAT format witnesses always consist of exactly one literal. Note also that the algorithm does not check whether \(\omega \) satisfies \(C_j\), which is a requirement for \(\textsf{PR}\). This omission is allowed because the first literal in \(\omega \) in the LPR (and \(\textsf{PR}\)) format is syntactically the same as the first literal in \(C_j\), so the witness satisfies the clause by construction.
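The hint-following core of this algorithm (the unit-propagation part shared with LRAT checking) can be sketched in Python as follows; this is an illustrative sketch only, omitting the \(\textsf{PR}\)-specific witness handling of Steps 4 and 5.1, and assuming clause ids are 1-indexed positions in the formula:

```python
def check_up_with_hints(formula, assignment, hints):
    # Follow unit clause hints (1-indexed clause ids): each hinted
    # clause must become unit (extending the assignment) or empty
    # (a conflict, i.e., ⊥ is derived). No proof search is needed.
    assignment = set(assignment)
    for hid in hints:
        clause = formula[hid - 1]
        if any(l in assignment for l in clause):
            return False  # hinted clause already satisfied: bad hint
        unassigned = [l for l in clause if -l not in assignment]
        if not unassigned:
            return True   # ⊥ reached: formula ∧ assignment ⊢₁ ⊥
        if len(unassigned) != 1:
            return False  # hinted clause is not unit: invalid hint
        assignment.add(unassigned[0])
    return False  # hints exhausted without deriving ⊥
```

To justify adding a clause \(C_j\), a checker starts from the assignment \(\lnot C_j\) (extended by the witness where applicable) and expects the hints to lead to a conflict; because the hints name the unit clauses directly, the check is linear in the hint list.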

4 Compositional Proofs

This section presents a new compositional proof format for unsatisfiability proof checking, motivated by the need to check very large unsatisfiability proofs behind SAT-solver aided proofs of mathematical results [23, 29, 35]. The rules of compositional proofs in propositional logic have been discussed in earlier work [24]; here, we define a format for compositional proofs and present a proof checking framework for the entire toolchain. Intuitively, the format consists of a top-level summary proof and a set of underlying proofs which are used to justify the top-level proof steps, see Fig. 4 for an illustration (explained below). The underlying proofs can be checked separately and in parallel to speed up proof checking. Section 4.1 presents the proof format and its proof checking algorithm. Section 4.2 explains our technique for generating compositional proofs from the cube-and-conquer SAT solving technique [22].

Fig. 4

Illustration of a compositional proof: the top-level proof is a sequence of instructions \(I_1, \dots , I_n\) that can add or delete clauses; the indexes \(i_0, \dots , i_k\) divide the top-level proof steps into ranges; and the \(\boxed {\text {boxes}}\) indicate underlying clausal proofs which are separately checked (indicated by \(\rightsquigarrow \)) using instances of appropriate proof checkers, e.g., cake_lpr.

4.1 Compositional Proof Checking

Compositional proofs are modeled using instruction sequences \(I_{1}, \dots , I_n\) similar to clausal proofs (Section 2.2), except that addition instructions are of the form \(\langle \textsf{a}, C\rangle \), i.e., they do not carry witnesses. The accumulated formulas \(F_j\) are defined according to Equation (1).

The key idea behind compositional proof checking is to justify a range of instructions simultaneously using an underlying clausal proof, see Fig. 4. More precisely, given a pair of indices (i, j), with \(i \le j\), the algorithm checks that satisfiability of the accumulated formula \(F_i\) implies satisfiability of the accumulated formula \(F_j\). By transitivity of satisfiability implication, this means that a proof with n instructions can be checked by separately checking k ranges \((i_0,i_1)\), \((i_1,i_2)\), \(\dots \), \((i_{k-1},i_k)\) for some choice of indices \(0=i_0 \le i_1 \le i_2 \le \dots \le i_k = n\). If all k checks succeed, then satisfiability of F implies satisfiability of \(F_n\) and, in particular, if \(F_n\) also contains \(\bot \) then the input formula F is unsatisfiable. Satisfiability implication for each pair \(F_i\), \(F_j\) is checked using an underlying clausal proof and its corresponding proof checker. A formal description of the algorithm for checking range (i, j) is in Fig. 5. By design, there is significant flexibility in how the underlying proofs can be checked:

  • The indexes \(i_0,i_1,\dots ,i_k\) can be chosen arbitrarily by tools, as long as they cover all steps of the top-level proof with \(i_0=0\), \(i_k=n\).

  • Different clausal proof formats and corresponding proof checkers can be used for each underlying proof.

  • The underlying proofs can be checked separately, in any order, and in parallel by running multiple instances of proof checkers on several machines.
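A minimal sketch of this range-based checking in Python, with the underlying checker abstracted as a callback (`check_range` is a hypothetical stand-in for invoking, e.g., cake_lpr on one underlying proof; the jobs are independent, so in practice they run as separate checker instances, possibly on separate machines):

```python
def accumulate(formula, instructions, upto):
    # F_j per equation (1), after the first `upto` instructions
    F = list(formula)
    for tag, clause in instructions[:upto]:
        if tag == 'a':
            F.append(clause)
        else:
            F.remove(clause)
    return F

def check_compositional(formula, instructions, cuts, check_range):
    # cuts = [i_0 = 0, i_1, ..., i_k = n]; each adjacent pair is one
    # independent job. check_range(F_i, steps) must certify that
    # satisfiability of F_i implies satisfiability of F_j.
    jobs = list(zip(cuts, cuts[1:]))
    return all(check_range(accumulate(formula, instructions, i),
                           instructions[i:j])
               for i, j in jobs)
```

Transitivity is what makes the split sound: if every range check succeeds, satisfiability implications chain from \(F_0 = F\) all the way to \(F_n\).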

Fig. 5

Compositional proof checking algorithm for range (i, j) based on an underlying clausal proof format.

Fig. 6

The grammar for the DRAT format [55], re-used for compositional proofs. Each \(\langle line\rangle \) represents either addition (no prefix) or deletion (prefixed by “d”) of a clause.

To enable re-use of existing parsing tools, compositional proofs are syntactically represented using the DRAT format [55], recalled in Fig. 6. Clauses are implicitly numbered according to their order of addition, continuing from the last (implicit) index of the corresponding DIMACS CNF. For example, starting from a DIMACS file with 5 clauses (indices 1–5), the first added clause in the compositional proof has index 6, the second has index 7, and so on. Deletion steps are ignored for the purposes of indexing.

Fig. 7

Compositional proof used for the cube-and-conquer parallel SAT solving strategy. The \(\boxed {\text {boxes}}\) correspond to underlying proofs generated by DRAT-trim for each cube which can be checked using cake_lpr or other LRAT tools [10]. In practice, the cubes \(G_i\) used in the top-level proof steps can be simplified to core subsets \(H_i \subseteq G_i\) obtained from incremental SAT solving.

4.2 Compositional Proof Generation

Compositional proofs can be generated particularly conveniently from cube-and-conquer SAT proofs [22]. Given an input formula F, the basic idea behind cube-and-conquer is to partition F into a set of subformulas \(F \wedge G_1, F \wedge G_2, \dots , F \wedge G_n\), and to solve each subproblem separately. Here, each \(G_i = l_1 \wedge l_2 \wedge \dots \wedge l_j\) is a cube, i.e., a conjunction of literals, such that the disjunction of cubes \(\bigvee _{i=1}^n G_i\) is a tautology. Thus, if every subproblem \(F \wedge G_i\) is unsatisfiable, then F is unsatisfiable.
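For intuition, the tautology requirement on the cubes can be sanity-checked by brute force on small examples (illustrative only; real toolchains certify this condition via the final underlying proof instead of enumeration):

```python
from itertools import product

def cubes_cover_all_assignments(cubes, nvars):
    # ⋁ G_i is a tautology iff every total assignment satisfies some
    # cube. Brute force over all 2^nvars assignments: fine as a sanity
    # check, infeasible at real problem sizes.
    for bits in product([False, True], repeat=nvars):
        alpha = {(i + 1) if b else -(i + 1) for i, b in enumerate(bits)}
        if not any(all(l in alpha for l in cube) for cube in cubes):
            return False  # this assignment escapes every cube
    return True

# Splitting on x1: the two cubes (x1) and (¬x1) cover every assignment
print(cubes_cover_all_assignments([[1], [-1]], nvars=2))  # True
```

When the cover condition holds, unsatisfiability of every \(F \wedge G_i\) immediately yields unsatisfiability of F, which is exactly what the top-level compositional proof records.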

The compositional unsatisfiability proof for F under cube-and-conquer is shown in Fig. 7, analogous to Fig. 4. Here, the top-level proof is the sequence of n additions of negated cubes \(\lnot {G_i}\) (which are clauses), followed by addition of the empty clause. The first n underlying proofs justify unsatisfiability of F under each cube, while the last proof shows that the cube-and-conquer strategy correctly partitions the space of assignments.

The top-level proof steps can be simplified while generating the underlying clausal proofs for each step. In practice, each subproblem \(F \wedge G_i\) is tackled using incremental SAT solvers that support solving F under assumptions (or partial assignments) \(G_i\). Such assumptions are typically provided either via iCNF files [56] or via the solver API; an example formula with three cubes expressed in iCNF (DIMACS files extended with assumptions) is shown in Fig. 8. If the formula F is found to be unsatisfiable under assumptions \(G_i\), incremental solvers can further compute a subset of the cube \(H_i \subseteq G_i\) that was involved in determining unsatisfiability [13]. The negation \(\lnot {H_i}\) can be used in the compositional proof in place of \(\lnot {G_i}\), e.g., the topmost cube in Fig. 8 has a proof with \(H_i \subset G_i\). Since existing solvers with incremental support can only log proofs in the DRAT format, we modified DRAT-trim to convert such partial proofs into LRAT proofs that end with the addition of clause \(\lnot {H_i}\). In addition, when generating the LRAT proof for \(\lnot {H_i}\), the compositional proof checking algorithm in Fig. 5 allows us to assume all previously added clauses, i.e., starting with \(F \wedge \lnot {H_1} \wedge \dots \wedge \lnot {H_{i-1}}\), which may help to simplify the LRAT proof.

Fig. 8

(Left) Incremental CNF files for the pigeonhole formula from Fig. 2 split using three cubes (assumptions are specified using a ... 0 lines). (Middle) Generated DRAT proofs for each incremental CNF; the last line in each proof is \(\lnot {H_i}\). (Right) The top-level compositional proof for the pigeonhole formula in Fig. 2 built from \(\lnot {H_i}\).

The final line in compositional unsatisfiability proofs is the addition of the empty clause. In most use cases, there exists a short justification of the empty clause using the added clauses in the compositional proof, e.g., the conjunction of preceding clauses in the compositional proof is typically unsatisfiable (in our case, by construction). Thus, a SAT solver can be used on that formula to produce the justification of the empty clause in the appropriate clausal proof format. Alternatively, if the compositional proof was constructed using more complicated cube-and-conquer strategies [22], then the underlying tree structure used to generate the cubes can be used to guide the resolution steps for obtaining the empty clause.

Fig. 9

The three-step refinement used in the development of cake_lpr.

5 CakeML Proof Checking

This section explains the implementation and verification of cake_lpr, our verified CakeML LPR proof checker. Section 5.1 focuses on the high-level verification strategy which we used to reduce the verification task to mostly routine low-level proofs (details of the latter are omitted). Section 5.2 explains the main verified theorems for the proof checker. Section 5.3 highlights some of the verified performance optimizations used in the proof checker.

5.1 Implementation and Verification Strategy

The development of cake_lpr proceeds in three refinement steps, where each step progressively produces a more concrete and performant implementation of the proof checker. These refinements are visualized in the three columns of Fig. 9 for LPR proof checking. A similar verification process was used to add support for compositional proof checking.

Step 1 formalizes the definition of CNF formulas and their unsatisfiability, as well as the \(\textsf{PR}\) proof system described in Section 2.2. The inputs and outputs to the proof system are abstract and not tied to any concrete representation at this step. For example, input variables are drawn from an arbitrary type \(\alpha \), while clauses and CNFs are represented using sets. The correctness of the \(\textsf{PR}\) proof system is proved in this step, i.e., we show that a valid \(\textsf{PR}\) proof implies unsatisfiability of the input CNF. The proof essentially follows [28, Theorem 1].

Step 2 implements a purely functional version of the LPR proof checking algorithm from Fig. 3. Here, the inputs and outputs are given concrete representations with computable datatypes, e.g., literals are integers (similar to DIMACS), clauses are lists of integers, and CNFs are lists of clauses. These concrete representations lift naturally to the abstract, set-based representation from Step 1. The output is a YES or NO answer according to the algorithm from Fig. 3. The correctness theorem for Step 2 shows that LPR proof checking correctly refines the \(\textsf{PR}\) proof system, i.e., if it outputs YES, then there exists a valid \(\textsf{PR}\) proof for the input (lifted) CNF; by Step 1, this implies that the CNF is unsatisfiable. If the output is NO, the input CNF could still be unsatisfiable, but the input LPR proof is not valid according to the algorithm in Fig. 3.
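The core of the Step 2 functional checker is hint-driven reverse unit propagation. The following Python sketch (hypothetical names; the actual checker is a purely functional CakeML/HOL4 implementation, not this code) illustrates how a single clause addition step is validated against a list of hint clause indices:

```python
def check_rup(formula, clause, hints):
    """Check that `clause` is implied by `formula` via reverse unit
    propagation, following the hint clause indices (LRAT-style).
    `formula` maps indices to clauses; clauses are lists of nonzero ints.
    Returns True (YES) or False (NO). Illustrative sketch only."""
    # Assume the negation of the candidate clause.
    assignment = {-lit for lit in clause}
    for idx in hints:
        hint = formula[idx]
        # Drop literals falsified by the current assignment.
        unassigned = [l for l in hint if -l not in assignment]
        if not unassigned:
            return True          # conflict reached: clause is implied
        if len(unassigned) == 1:
            assignment.add(unassigned[0])  # unit clause: extend assignment
        else:
            return False         # hint is not unit: invalid proof step
    return False                 # hints exhausted without a conflict

# Example: the formula (x1) AND (-x1 OR x2) implies (x2).
fml = {1: [1], 2: [-1, 2]}
print(check_rup(fml, [2], [1, 2]))  # True
```

A NO answer here does not mean the formula is satisfiable, only that the given hints do not justify the step, mirroring the NO case described above.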

Step 3 uses imperative features in the CakeML source language, e.g., (byte) arrays and exceptions, to improve code performance; these optimizations are detailed further in Section 5.3. This step also adds user interface features like parsing and file I/O so that the input CNF formula is read (and parsed) from a file, and the results are printed on the standard output and error streams. The verification of this step uses CakeML’s proof-producing translator [46] and characteristic formula framework [19] to prove the correctness of the source code implementation of cake_lpr; this code is subsequently compiled with the verified CakeML compiler [54]. Composing the correctness theorem for source cake_lpr with CakeML’s compiler correctness theorem yields the corresponding correctness theorem for the cake_lpr binary.

At the time of writing, the verified cake_lpr binary can be invoked from the command line in five ways, each with associated soundness proofs (Section 5.2):

  1.

    cake_lpr <DIMACS>

    Parses the input file in DIMACS format and prints the parsed formula.

  2.

    cake_lpr <DIMACS> <LPR>

    Runs LPR proof checking on the parsed formula.

  3.

    cake_lpr <DIMACS1> <LPR> <DIMACS2>

    Runs LPR transformation checking on the parsed formulas, i.e., DIMACS1 is checked to imply DIMACS2 (satisfiability-wise) using the LPR proof [10].

  4.

    cake_lpr <DIMACS> <COMP> i-j <LPR>

    Runs compositional proof checking on the parsed formula and compositional proof COMP for the range i-j with the underlying LPR proof LPR.

  5.

    cake_lpr <DIMACS> <COMP> -check <OUTPUT>

    Checks the combined result of several compositional proof checker executions (see below).

Recall from Section 4.1 that cake_lpr needs to be executed (with option 4) for a set of ranges covering the lines of the compositional (top-level) proof COMP, and we expect users to exploit this compositionality by running separate instances of cake_lpr in parallel on several machines. Accordingly, an important correctness caveat for compositional proof checking is that users must correctly set up the separate executions on their respective systems.

Option 5 is designed to add a simple layer of protection against user error when setting up separate (or parallel) executions. In particular, cake_lpr outputs the following string for each successful run of option 4, where md5 takes the MD5 hash of the input files:

s VERIFIED RANGE md5(DIMACS) md5(COMP) i-j

The MD5 hash is chosen for convenience because it is available on most machines, so users can, e.g., manually compare MD5 hashes of their input DIMACS and proof files to check that the correct files were used on all machines. By concatenating these outputs into an output file OUTPUT and executing cake_lpr with option 5, users can check that the output file contains the correct hashes and that the specified ranges appropriately cover the entire compositional proof. Our implementation checks range coverage (e.g., that the ranges 0–2, 4–8, 2–4, 8–12 can be strung together to cover the range 0–12) by reusing a verified reachability checking function originally developed for a verified compiler optimization in the CakeML compiler [32].
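The coverage check can be viewed as a reachability question: treating each verified range i–j as an edge from i to j, the ranges cover the whole proof exactly when the final line is reachable from line 0. The following Python sketch illustrates the idea (illustrative only; the actual implementation reuses a verified CakeML reachability function):

```python
def covers(ranges, start, end):
    """Check that the given (i, j) ranges can be chained to cover
    start..end, treating each range as an edge i -> j and testing
    whether `end` is reachable from `start`."""
    succ = {}
    for i, j in ranges:
        succ.setdefault(i, set()).add(j)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == end:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(succ.get(node, ()))
    return False

# The example from the text: 0-2, 4-8, 2-4, 8-12 together cover 0-12.
print(covers([(0, 2), (4, 8), (2, 4), (8, 12)], 0, 12))  # True
print(covers([(0, 2), (4, 8), (8, 12)], 0, 12))          # False (gap at 2-4)
```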

Fig. 10

The end-to-end correctness theorem for the CakeML LPR proof checker. The cases for command-line options 1 and 2 are shown here (other cases are elided with \(\dots \) for brevity).

5.2 Correctness Theorems

The main correctness theorem for cake_lpr in HOL4 is shown in Fig. 10. The first line (2) summarizes routine assumptions for compiled CakeML programs that use its basis library. Briefly, it assumes that the command-line \( cl \) and file system \( fs \) models are well-formed and that the compiled code is placed in (and executed from) the code memory of a machine state \( ms \) according to CakeML’s x64 machine model \( mc \).

The first guarantee (3) is that the machine-code implementation always terminates normally according to CakeML’s x64 machine-code semantics. Notably, this means the binary never crashes and may emit some I/O events when run; however, it can terminate with an out-of-memory error if the CakeML runtime runs out of stack or heap space [54]. The second guarantee (4) states that the only observable changes to the filesystem after executing cake_lpr are some strings printed on standard output \( out \) and standard error \( err \). To minimize user confusion, cake_lpr is designed to print all error messages to standard error and only success messages on standard output.

Finally, lines (5) (in black) list the output behaviors of cake_lpr for each command-line option.

  1.

    When cake_lpr is given one command-line argument (\(\textsf {\small length}\; cl \;\textsf {\small =}\;2\)), it attempts to read and parse the file before printing (if successful) the parsed formula to standard output. The DIMACS parser (parse_dimacs) is proved to be a left inverse of the DIMACS printer (print_dimacs) as follows:

    figure f

    This says that for any well-formed formula \( fml \), printing that formula into DIMACS format then parsing it yields the formula back. All parsed formulas are proved to be well-formed (not shown here).

  2.

    If two arguments are given (\(\textsf {\small length}\; cl \;\textsf {\small =}\;3\)) and the string “s VERIFIED UNSAT\n” is printed on standard output, then cake_lpr was provided with a file (as its first argument) that parses in DIMACS format to a formula \( fml \) whose lifted representation (\(\textsf {\small interp}\; fml \)) is unsatisfiable.

  3–5.

    The specifications for the remaining cases have a similar flavor and are omitted here for brevity.

Fig. 11

The correctness theorem for compositional proof checking.

The correctness theorem for cake_lpr’s compositional proof checking is shown in Fig. 11. The definition check_successful_par characterizes a successful set of runs of cake_lpr using command-line options 4 and 5 on input strings \( fmlstr \) and \( pfstr \). The lines in (6) say that there is a list of output strings \( outs \) such that every entry of this list is produced from the standard output of an execution of cake_lpr with command-line option 4 (with appropriate setup of each machine’s filesystem and command-line arguments). The lines in (7) say that \( outs \) is successfully checked by an execution of cake_lpr with command-line option 5 and that the corresponding success string (concat [\(\dots \)]) is printed to standard output. Using this definition, the correctness theorem (8) says that on a successful set of runs, satisfiability of the formula \( fml \) parsed from \( fmlstr \) implies satisfiability of the final formula \(\textsf {\small run\_proof}\; fml \; pf \) obtained by running all lines of the parsed proof \( pf \) on \( fml \), i.e., if \( fml \) is \(F_0\) and \( pf \) has n instructions, then \(\textsf {\small run\_proof}\; fml \; pf \) computes the accumulated formula \(F_n\) as defined in Section 4.1.

The theorems in Figures 10 and 11 have a trusted computing base (TCB) that does not include the CakeML compiler itself. The TCB does, however, still include CakeML’s model of the foreign function interface and the assumed semantics of the targeted hardware platform. CakeML’s TCB is discussed in detail in prior publications [36, 42].

5.3 Verified Optimizations

To minimize verification effort, CakeML’s imperative features are only used for the most performance-critical steps of cake_lpr. Our design decisions are based on empirical observations about the LPR proof checking algorithm. These are explained below with reference to specific steps in the algorithm outlined in Fig. 3.

5.3.1 Array-based representations

In practice, many LPR proof steps do not require the full strength of a \(\textsf{PR}\) (or \(\textsf{RAT}\)) clause. Hence, a large part of proof checking time is spent in the Step 3 loop of the algorithm (reverse unit propagation), and it is important to compute its main bottleneck, Step 3.1, as efficiently as possible. Here, CakeML’s native byte arrays are used to maintain a compact bitset-like representation of the assignment \(\alpha \), so that Step 3.1 can be computed in one pass over \(C_i\) with a constant-time bitset lookup for each literal in \(C_i\).
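A bitset-backed assignment of this kind can be sketched as follows in Python (the bit layout below is an assumption for illustration; the actual CakeML byte-array representation may differ):

```python
class Assignment:
    """Bitset over literals backed by a bytearray: literal l (nonzero int)
    maps to bit 2*|l| + (1 if l < 0 else 0). Illustrative layout only."""
    def __init__(self, max_var):
        self.bits = bytearray(2 * (max_var + 1) // 8 + 1)

    def _pos(self, lit):
        b = 2 * abs(lit) + (1 if lit < 0 else 0)
        return b // 8, 1 << (b % 8)

    def set(self, lit):
        byte, mask = self._pos(lit)
        self.bits[byte] |= mask

    def has(self, lit):
        byte, mask = self._pos(lit)
        return self.bits[byte] & mask != 0

def falsified(clause, alpha):
    # One pass over the clause with a constant-time bitset lookup per literal.
    return all(alpha.has(-lit) for lit in clause)

alpha = Assignment(10)
alpha.set(1); alpha.set(-2)
print(falsified([-1, 2], alpha))  # True: both literals are falsified
print(falsified([-1, 3], alpha))  # False: 3 is unassigned
```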

For proof steps requiring the full strength of \(\textsf{PR}\) clauses, Step 5 loops over all undeleted clauses in the formula. Formulas are represented as an array of clauses together with a lazily updated list that tracks all indices of the array containing undeleted clauses. This enables both constant-time lookup of clauses throughout the algorithm and fast iteration over the undeleted clauses for Step 5. Deletion in the index list is done in (amortized) constant time by removing a deleted index only when that index is looked up in Step 5.1.
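The lazy deletion scheme can be sketched as follows (a minimal Python illustration, not the verified CakeML data structure): deleting a clause only clears its array slot, and stale indices are dropped from the index list the next time it is traversed.

```python
class ClauseDB:
    """Clauses in an array plus a lazily maintained list of live indices.
    Deletion marks the slot None; stale indices are dropped on iteration."""
    def __init__(self):
        self.clauses = []   # index -> clause or None (deleted)
        self.live = []      # indices that may still be live

    def add(self, clause):
        self.clauses.append(clause)
        self.live.append(len(self.clauses) - 1)

    def delete(self, idx):
        self.clauses[idx] = None   # O(1); index list untouched

    def undeleted(self):
        kept = []
        for idx in self.live:
            if self.clauses[idx] is not None:   # drop stale entries here
                kept.append(idx)
                yield idx
        self.live = kept   # amortized: each stale index is removed once

db = ClauseDB()
db.add([1, 2]); db.add([-1]); db.add([2, 3])
db.delete(1)
print([db.clauses[i] for i in db.undeleted()])  # [[1, 2], [2, 3]]
```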

Additionally, for each literal, the smallest clause index where that literal occurs (if any) is lazily tracked in a lookup array; for a given witness \(\omega \), all clauses occurring at indices below the index of any literal in \(\overline{\omega }\) can be skipped in Step 5.1.

5.3.2 Proof checking exceptions

There are several steps in the proof checking algorithm that can fail (report NO) if the input proof is invalid, e.g., Step 3.3. In a purely functional implementation, results are represented with an option type: None indicates a failure and Some \( res \) indicates success with result \( res \). While conceptually simple, this means that common-case (successful) intermediate results are always boxed within a Some constructor and then immediately unboxed with pattern matching before they can be used again. In cake_lpr, failures instead raise exceptions which are handled directly at the top level. Thus, successful results can be passed directly, i.e., as \( res \), without any boxing. Support for verifying the use of exceptions is a unique feature of CakeML’s CF framework [19].
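The control-flow pattern can be sketched as follows (Python illustration with hypothetical names; `resolve` is a stand-in for a real propagation step). The success path returns results directly, while the failure path raises an exception caught once at the top level:

```python
class CheckFailure(Exception):
    """Raised on an invalid proof step; handled once at the top level."""

def resolve(clause, assignment):
    # Stand-in for one propagation step (illustrative only): the real
    # checker performs reverse unit propagation here.
    return assignment | set(clause)

def propagate_step(formula, idx, assignment):
    # Failure path raises instead of returning None, so the success path
    # returns its result directly -- no Some/None boxing and unboxing.
    if idx not in formula:
        raise CheckFailure(f"missing clause index {idx}")
    return resolve(formula[idx], assignment)

def check_proof(formula, steps):
    try:
        assignment = set()
        for idx in steps:
            assignment = propagate_step(formula, idx, assignment)
        return "YES"
    except CheckFailure:
        return "NO"

print(check_proof({1: [1], 2: [-1, 2]}, [1, 2]))  # YES
print(check_proof({1: [1]}, [3]))                 # NO
```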

5.3.3 Hashtables

For compositional proof checking (Fig. 5), it is important to check the clause inclusion \(F_j \subseteq G\) efficiently, since both formulas can contain a large number of clauses. To achieve this, the formula G (in the array-based representation) is loaded into a hashtable from CakeML’s verified hashtable library, using a simple rolling hash for clauses. This allows every clause in \(F_j\) to be checked against the hashtable for G in near-constant time.
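The idea can be sketched in Python as follows (the hash function below is an assumed illustration; the exact rolling hash used in cake_lpr may differ). Clauses are hashed up to literal order, and a literal-level comparison rules out hash collisions:

```python
def clause_hash(clause, mod=2**61 - 1, base=31):
    # Order-insensitive sketch: rolling hash over the sorted literals,
    # so permuted copies of the same clause hash identically.
    h = 0
    for lit in sorted(clause):
        h = (h * base + lit) % mod
    return h

def subset_of(F, G):
    """Check that every clause of F occurs in G via a hash table:
    near-constant time per clause, plus an exact comparison per bucket."""
    table = {}
    for clause in G:
        table.setdefault(clause_hash(clause), []).append(sorted(clause))
    return all(sorted(c) in table.get(clause_hash(c), []) for c in F)

G = [[1, 2], [-1, 3], [2, -3]]
print(subset_of([[2, 1], [-3, 2]], G))  # True (clauses match up to order)
print(subset_of([[1, 3]], G))           # False
```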

5.3.4 Buffered I/O streams

Proof files generated by SAT solvers can be large, e.g., ranging from 300 MB to 4 GB for the second benchmark suite in Section 6. These files are streamed into memory line by line because each proof step depends only on information contained in its corresponding line of the file. This streaming interaction is optimized using CakeML’s verified buffered I/O library [41], which maintains an internal buffer of yet-to-be-read bytes from the read-only proof file, batching reads to minimize the number of expensive filesystem I/O calls.
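The access pattern is the standard one sketched below in Python (`handle_step` is a hypothetical per-line callback; this is not the CakeML library): a fixed-size read buffer batches filesystem calls while memory use stays independent of the proof size.

```python
def check_stream(proof_path, handle_step, bufsize=64 * 1024):
    """Stream a proof file line by line through a fixed-size read buffer,
    so memory use stays constant even for multi-gigabyte proofs.
    Each proof step depends only on its own line, so nothing else
    needs to be kept in memory."""
    with open(proof_path, "rb", buffering=bufsize) as f:
        for line in f:              # reads are batched by the buffer
            if line.strip():        # skip blank lines
                handle_step(line)
```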

5.3.5 Producing MD5 hashes

As part of this work, we verified a CakeML library for computing the MD5 hash of an input stream; it is connected with the buffered I/O library to efficiently compute hashes for the compositional proof checking command-line options 4 and 5 from Section 5.1. The verification shows that our imperative implementation of MD5 hashing matches its functional specification. Since the MD5 hash is provided solely for user convenience, we do not prove any formal cryptographic properties of the hash function.
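Streaming MD5 over a buffered input can be sketched as follows in Python (using the standard `hashlib`; `verified_range_line` is a hypothetical helper whose output format follows the success string shown in Section 5.1):

```python
import hashlib

def md5_of_file(path, chunk_size=64 * 1024):
    """Compute the MD5 hash of a file by streaming fixed-size chunks,
    mirroring how a buffered I/O layer feeds the hash function."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verified_range_line(dimacs, comp, i, j):
    # Format assumed from the success string in Section 5.1.
    return f"s VERIFIED RANGE {md5_of_file(dimacs)} {md5_of_file(comp)} {i}-{j}"
```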

6 Benchmarks

This section compares the CakeML LPR proof checker against other verified checkers on two benchmark suites (Sections 6.1 and 6.2) and a \(\textsf{RAT}\) microbenchmark (Section 6.3). It also evaluates the compositional proof format on unsatisfiability proofs of Erdős discrepancy properties [35] (Section 6.4). The first suite is a collection of problems with \(\textsf{PR}\) proofs generated by the satisfaction-driven clause learning (SDCL) solver SaDiCaL [27]; the second suite consists of unsatisfiable problems from the SAT Race 2019 competition. The \(\textsf{RAT}\) microbenchmark consists of proofs for large mutilated chessboards generated by a BDD-based SAT solver [8]. The unsatisfiability proofs for the Erdős discrepancy properties are generated by cube-and-conquer (see Section 4.2). Raw timing data used to produce the figures and tables in this section is available from the cake_lpr repository.

The CakeML checker is labeled cake_lpr (default 4GB heap and stack space), while the other proof checkers are labeled acl2-lrat (verified in ACL2 [20]), coq-lrat (verified in Coq [10]), and GRATchk (verified in Isabelle/HOL [39]). All experiments were run on identical nodes with Intel Xeon E5-2695 v3 CPUs (35M cache, 2.30GHz) and 128GB RAM; more specific configuration options for each benchmark suite are reported below.

6.1 SaDiCaL \(\textsf{PR}\) Benchmarks

The SaDiCaL solver produces \(\textsf{PR}\) proofs for hard SAT problems in its benchmark suite [27] and is experimentally much faster than a plain DRAT-based CDCL solver on those problems [27, Section 7]. The \(\textsf{PR}\) proofs are directly checked by cake_lpr after conversion into LPR format with DPR-trim. For all other checkers, the \(\textsf{PR}\) proofs were first converted to DRAT format using pr2drat (as in an earlier approach [27]), and then into LRAT and GRAT formats using the DRAT-trim and GRATgen tools respectively. All tools were run with a timeout of 10000 seconds and all timings are reported in seconds (to one d.p.). Results are summarized in Tables 2 and 3.

All benchmarks were successfully solved by SaDiCaL except mchess19, which exceeded the time limit. For the remaining benchmarks, generating and checking LPR proofs required a comparable (1–2.5x) amount of time to solving the problems, except for the mchess family, where LPR generation and checking is much faster than solving the problems (Table 2). Unsurprisingly, direct checking of LPR proofs is much faster than the circuitous route of converting into DRAT and then into either LRAT or GRAT (Table 3). Unlike LPR, checking \(\textsf{PR}\) proofs via the LRAT route is 5–60x slower than solving those problems; this is a significant drawback to using that route in practice for certifying solver results. Compared to an unverified C implementation of LPR proof checking, cake_lpr is approximately an order of magnitude slower on these benchmarks; a detailed comparison against unverified proof checking is given in Section 6.3.

Table 2 Timings for \(\textsf{PR}\) benchmarks with conversion into LPR format. The “Total (LPR)” column sums the generation and checking times. The timing for mchess19 is omitted because SaDiCaL timed out; timings for the Urquhart U.-s3-* benchmarks are omitted because they took a negligible amount of time (\(< 1.0\)s total).
Table 3 Timings for \(\textsf{PR}\) benchmarks, first converted to DRAT and subsequently converted into LRAT and GRAT formats. The “Total (LRAT)” and “Total (GRAT)” columns sum the fastest generation and checking times for the LRAT and GRAT formats respectively. The “Total (LPR)” column (in bold, fastest total time) is reproduced from Table 2 for ease of comparison. Fail(T) indicates a timeout. Timings for the mchess19 and U.-s3-* benchmarks are omitted as in Table 2.

The backwards compatibility of cake_lpr is also shown in Table 3, where it is used to check the generated LRAT proofs. Among the LRAT checkers, acl2-lrat is fastest, followed by cake_lpr (LRAT checking), and coq-lrat. Although cake_lpr (LRAT checking) is on average 1.3x slower than acl2-lrat, it scales better on the mchess problems and is actually much faster than acl2-lrat on mchess18. We also observed that the GRAT toolchain (summing SaDiCaL, pr2drat, GRATgen and GRATchk times) is much slower than the LRAT toolchains (summing SaDiCaL, pr2drat, DRAT-trim and fastest LRAT checking times). This is in contrast to the SAT Race 2019 benchmarks below (see Fig. 12), where we observed the opposite relationship. We believe that the difference in checking speed is due to the various checkers having different optimizations for checking the expensive \(\textsf{RAT}\) proof steps produced by conversion from \(\textsf{PR}\) proofs.

6.2 SAT Race 2019 Benchmarks

Table 4 A summary of the SAT Race 2019 benchmark results on 102 unsatisfiability proofs generated by CaDiCaL. The N/A row counts problems that timed out or failed in an earlier step of the respective toolchains.
Fig. 12

(Left) SAT Race 2019 proofs checked within a given (per instance) time limit for the LRAT proof checkers. (Right) SAT Race 2019 proofs generated and checked within a given (per instance) time limit for the LRAT and GRAT toolchains.

We further benchmarked the verified checkers on a suite of 102 unsatisfiable problems from the SAT Race 2019 competition. For all problems, DRAT proofs were generated using the state-of-the-art SAT solver CaDiCaL before conversion into the LRAT or GRAT formats. Proofs generated by CaDiCaL on this suite rarely require \(\textsf{RAT}\) (or \(\textsf{PR}\)) steps, so the checkers are stress-tested on their implementations of file I/O, parsing, and reverse unit propagation based on the annotated hints (Step 3.1 from Fig. 3); cake_lpr is the only tool with a formally verified implementation of the first two steps. All tools were run with the SAT competition solver timeout of 5000 seconds.

A summary of the results is given in Table 4; all proofs generated by CaDiCaL were checked by at least one verified checker. The acl2-lrat checker fails with a parse error on one problem even though none of the other checkers reported such an error; GRATgen aborted on two problems for an unknown reason. Plots comparing LRAT proof checking time and overall proof generation and checking time (LRAT and GRAT) are shown in Fig. 12. From Fig. 12 (left), the relative order of LRAT checking speeds remains the same as in Table 3, where cake_lpr is on average 1.2x slower than acl2-lrat, although cake_lpr is faster on 28 benchmarks. From Fig. 12 (right), both LRAT toolchains are slower than the GRAT toolchain (on average 3.5x slower for cake_lpr and 3.4x for acl2-lrat). Part of this speedup for the GRAT toolchain comes from GRATgen, which can be run in parallel (with 8 threads). This suggests that adding native support for GRAT-based input to cake_lpr or adding LRAT support to GRATgen could be worthwhile future extensions.

6.3 Mutilated Chessboard \(\textsf{RAT}\) Microbenchmarks

The following microbenchmark suite tests the LRAT checkers on large mutilated chessboard problems (up to 100 by 100) solved by pgbdd, a BDD-based SAT solver [8]. Unlike the previous benchmark suites, LRAT proofs are emitted directly by the solver, so no additional DRAT-trim conversion is needed. All tools were run with a timeout of 10000 seconds and all timings are reported in seconds (to one d.p.). For additional scaling comparison, we also report results for lrat-check, an unverified LRAT proof checker implemented in C.

The results in Table 5 show the impact of cake_lpr’s \(\textsf{RAT}\) optimizations (Section 5.3). Notably, cake_lpr scales essentially linearly in the size of the proofs (up to \(\approx 10\) million proof steps). As a result, cake_lpr is significantly faster than acl2-lrat and coq-lrat on these \(\textsf{RAT}\)-heavy proofs and comes within a 5x factor of the unverified lrat-check tool.

Table 5 Timings for the \(\textsf{RAT}\) microbenchmark. The number of proof steps and file size of the proofs (in MB) are shown in the last two columns. Fail(T) indicates a timeout.

6.4 Erdős Discrepancy Properties

The Erdős Discrepancy Problem (EDP) asks whether, for every \(C > 0\) and every infinite \(\pm 1\) sequence \((x_1,x_2,x_3,\dots )\), there exist \(k, d\) such that \(| \sum _{i=1}^k x_{i \cdot d} | > C\). Konev and Lisitsa [35] provided a SAT-solver-aided proof of the EDP in the case \(C=2\). We demonstrate the parallel scalability of compositional proof checking (Section 4) using cake_lpr’s compositional proof checking option on a cube-and-conquer proof generated by iGlucose (a version of Glucose 3.0 with support for iCNF files) following the technique reported in Section 4.2. This yields a 5226-line compositional top-level proof where each line is justified by an underlying LRAT proof, see Fig. 8. The underlying proofs consist of 20 million clause addition lines in total and vary widely in size, ranging from 88 bytes to 110 MB.

Proof steps are allocated evenly to parallel threads by their index. For example, with two threads, the first thread would check all odd-numbered proof steps by sequentially running instances of cake_lpr to check the underlying proofs for ranges \((i_0,i_1), (i_2,i_3), \dots \), while the second thread checks all even-numbered proof steps for ranges \((i_1,i_2), (i_3,i_4), \dots \), and similarly for more parallel threads. Results are summarized in Table 6 with wall-clock execution times reported in seconds and relative speedup against a single thread. All values are rounded to one decimal place.
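The round-robin allocation described above can be sketched as follows (a Python illustration; the even/odd split from the text is the two-thread case):

```python
def allocate_ranges(cut_indices, num_threads):
    """Split consecutive proof ranges (i_0,i_1), (i_1,i_2), ... round-robin
    across threads: thread t checks ranges t, t+num_threads, ..."""
    ranges = list(zip(cut_indices, cut_indices[1:]))
    return [ranges[t::num_threads] for t in range(num_threads)]

cuts = [0, 3, 7, 12, 20, 26]
work = allocate_ranges(cuts, 2)
print(work[0])  # [(0, 3), (7, 12), (20, 26)]
print(work[1])  # [(3, 7), (12, 20)]
```

Each thread then runs cake_lpr with option 4 sequentially over its assigned ranges, and the concatenated success lines are finally checked with option 5.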

Table 6 Timings (wall-clock) for the EDP benchmark. Speedup is calculated relative to 1 thread.

The speedup is nearly linear for lower numbers of threads (1–32) but drops off at 64 and 128 threads. This is likely due to the unbalanced distribution of proof sizes, where a few large LRAT proofs dominate the overall proof checking time at high levels of parallelism. A more advanced parallelization scheme, e.g., with a scheduler, could further improve proof checking performance.

7 Related Work

Verified Proof Checking. There are several \(\textsf{RAT}\)-based verified proof checkers, in ACL2  [20], Coq  [10], and Isabelle/HOL  [39]; these verified checkers all use an unverified preprocessing tool to add proof hints to DRAT proofs, either DRAT-trim or GRATgen. Alternative preprocessing tools are available, e.g., based on the recently proposed FRAT format [4]. The DRAT format is itself an extension of the DRUP format [21]; the Coq checker is based on a predecessor verified checker for the GRIT [11] format. The ACL2 checker can be efficiently and directly executed (without extraction) using imperative primitives native to the ACL2 kernel [20]. However, the implementation of these features in ACL2 itself must be trusted to trust the proof checking results [50], hence the yellow background in Table 1. SMTCoq [2, 14] is another certificate-based checker for SAT and SMT problems in Coq. Its resolution-based proof certificates can be checked natively using native computation extensions of the Coq kernel.

Verified checkers are available for other logics, such as the Verified TESC Verifier for first-order logic [3], Pastèque for the practical algebraic calculus [33], and various checkers for higher-order logics (and type theories) underlying proof assistants [1, 47, 52].

Applications. SAT solving is a key technology underlying many software and hardware verification domains [7, 30]. Certifying SAT results adds a layer of trust and is clearly a worthwhile endeavor. Solver-aided proofs of mathematical results [23, 29, 35] are particularly interesting and challenging to certify because these often feature complicated SAT encodings, custom (hand-crafted) proof steps, and enormous resulting proofs [29]. Our cake_lpr checker is designed to handle the latter two challenges effectively. For the first challenge, the SAT encoding of mathematical problems can also be verified within proof assistants. This was done for the Boolean Pythagorean Triples problem building on the Coq proof checker [12].

Verified SAT Solving. An alternative to proof checking is to verify the SAT solvers themselves [16, 17, 43, 48]. This is a significant undertaking, but it would allow the pipeline of generating and checking proofs to be bypassed entirely. Furthermore, such verification efforts can yield new insights about key invariants underlying SAT solving techniques compared to prior pen-and-paper presentations, e.g., the 2WL invariant [17]. However, the performance of verified SAT solvers is not yet competitive with modern (unverified) SAT solving technology [16, 17].

8 Conclusion

This work presents the LPR proof format for verified checking of \(\textsf{PR}\) proofs and a compositional proof format for separate (parallel) proof checking. It also demonstrates the feasibility of using binary code extraction to verify cake_lpr, a performant proof checker supporting both formats down to its machine-code implementation. Given the strength of the \(\textsf{PR}\) proof system, there is ongoing research into the design of satisfaction-driven clause learning techniques [27, 28, 49] for SAT solvers based on \(\textsf{PR}\) clauses. Our proof checker opens up the possibility of using a verified checker to help check and debug the implementation of these new techniques. It also gives future SAT competitions the option of providing \(\textsf{PR}\) as the default (verified) proof system for participating solvers. Another interesting direction is to add cake_lpr support for other proof formats [5, 39]. In particular, this would allow users to build compositional proofs that utilize different underlying proof formats for separate parts of the proof.