Verified Propagation Redundancy and Compositional UNSAT Checking in CakeML

Modern SAT solvers can emit independently-checkable proof certificates to validate their results. The state-of-the-art proof system that allows for compact proof certificates is propagation redundancy (PR). However, the only existing method to validate proofs in this system with a formally verified tool requires a transformation to a weaker proof system, which can result in a significant blowup in the size of the proof and increased proof validation time. This article describes the first approach to formally verify PR proofs on a succinct representation. We present (i) a new Linear PR (LPR) proof format, (ii) an extension of the DPR-trim tool to efficiently convert PR proofs into LPR format, and (iii) cake_lpr, a verified LPR proof checker developed in CakeML. We also enhance these tools with (iv) a new compositional proof format designed to enable separate (parallel) proof checking.
The LPR format is backwards compatible with the existing LRAT format, but extends LRAT with support for the addition of PR clauses. Moreover, cake_lpr is verified using CakeML's binary code extraction toolchain, which yields correctness guarantees for its machine code (binary) implementation. This further distinguishes our clausal proof checker from existing checkers because unverified extraction and compilation tools are removed from its trusted computing base. We experimentally show that: LPR provides efficiency gains over existing proof formats; cake_lpr's strong correctness guarantees are obtained without significant sacrifice in its performance; and the compositional proof format enables scalable parallel proof checking for large proofs.


Introduction
Given a formula of propositional logic, the task of a SAT solver is to decide whether there exists an assignment that satisfies the formula. Such a satisfying assignment, if found by a SAT solver, is easily verifiable by independent checkers and so one does not need to trust the inner workings of the solver. The situation with unsatisfiable formulas, i.e., where no satisfying assignment exists, is not as straightforward.
Here, SAT solvers must produce an unsatisfiability proof (also called a refutation) for the input formula. Ideally, the proof system and corresponding proof format for such proofs should be sufficiently expressive, allowing SAT solvers to efficiently produce proofs that correspond to the SAT solving techniques they use at runtime. At the same time, the resulting proofs ought to be efficiently checkable by independent and trustworthy tools.
The de facto standard proof system for propositional unsatisfiability proofs is Resolution Asymmetric Tautology (RAT) [31]. The associated DRAT format [55] combines clause addition based on RAT steps and clause deletion. Independent checking tools can validate proofs in the DRAT format; they have been used to check the results of the SAT competitions since 2014 [55] and in industry [20]. Enriching DRAT proofs with hints is the main technique for developing efficient verified proof checkers, e.g., existing verified checkers use the enriched proof formats LRAT [10] and GRAT [39].
A recently proposed proof system, called Propagation Redundancy (PR) [28], generalizes RAT. There exist short PR proofs without new variables for many problems that are hard for resolution, such as pigeonhole formulas, Tseitin problems, and mutilated chessboard problems [26]. Due to the absence of new variables it is easier to find PR proofs automatically [27], and it is considered unlikely that there exist short RAT proofs for these problems that do not introduce new variables nor reuse eliminated variables [28]. Such PR proofs can be checked directly [28], or they can first be transformed into DRAT proofs or even Extended Resolution proofs by introducing new variables [25,34]. In theory, the blowup is small, i.e., polynomial-sized. However, in practice, the transformed proofs can be significantly more expensive to validate compared to the original PR proofs [28].
A natural question arises: why should proof checkers be trusted to correctly check proofs if we do not likewise trust SAT solvers to correctly determine satisfiability? One answer is that proof checkers are much easier to implement so their code can be carefully audited. Another answer is that the algorithms underlying proof checkers have been formally verified in a proof assistant [10,20,39]. However, to get executable code for these verified checkers, some additional unverified steps are still required. Although unlikely, each of these steps can introduce bugs in the resulting executable: (a) the algorithms are extracted by unverified code generation tools into source code for a programming language; (b) unverified parsing, file I/O, and command-line interface code is added; (c) the combined code is compiled by unverified compilers to executable machine code.
The contributions of this article are: (i) a new Linear PR (henceforth LPR) proof format that enriches PR proofs with hints and is backwards compatible with the LRAT format, (ii) an extension of the existing DPR-trim tool [28] to efficiently convert PR proofs into LPR format, and (iii) cake_lpr, an efficient verified LPR proof checker with correctness guarantees, including for steps (a)-(c) enumerated above. The cake_lpr tool was used to validate the unsatisfiability proofs for the 2020 SAT Competition because of its strong trust story combined with easy compilation and usage. Moreover, the stronger PR proof system could be supported in future competitions. The tool is publicly available at: https://github.com/tanyongkiam/cake_lpr
This article extends our conference version [53] with: (iv) a new compositional proof format consisting of a top-level summary proof whose proof steps can be separately justified by respective underlying proofs and (v) verified extensions in cake_lpr to support the compositional proof format. The new correctness result for cake_lpr allows users to exploit verified compositional proof checking by running parallel instances of the tool to check very large unsatisfiability proofs, such as those typically found in SAT-solver aided proofs of mathematical results [23,29,35]. In particular, we explain how compositional proofs can be conveniently generated from the cube-and-conquer [22] SAT solving technique that is naturally parallelizable. Together with our verified checker, this enables a fully parallel pipeline for SAT solving, proof generation, and verified proof checking. To the best of our knowledge, this is the first verification result for a proof checker that formally accounts for multiple, separate executions of the checker.
Section 3 shows how PR proofs are enriched to obtain LPR proofs and presents the corresponding LPR proof checking algorithm (Contributions (i) & (ii)). Existing LRAT proof checkers can be extended in a clean and minimal way to support LPR proofs. Section 4 introduces the compositional proof format which extends an underlying clausal proof format with support for separate proof checking (Contribution (iv)). Section 5 explains the implementation of our proof checker in CakeML, as well as the correctness guarantees and high-level verification strategy behind the proofs (Contributions (iii) & (v)). Section 6 benchmarks our proof checker against existing implementations. A summary comparison of the new proof checker against existing verified proof checkers is in Table 1.

Background
This section provides background on CakeML and its related tools. It also recalls the standard problem format and clausal proof systems used by SAT solvers.

HOL4 and CakeML
HOL4 is a proof assistant implementing classical higher-order logic [51] and CakeML [45] is a programming language with syntax and semantics formally defined in HOL4. Tools for developing verified CakeML software are used to fill the verification gaps in the correspondingly enumerated items in Section 1:
(a) Two tools are used to produce (or extract) verified CakeML source code in HOL4: the CakeML proof-producing translator [46] automatically synthesizes verified source code from pure algorithmic specifications; the CakeML characteristic formula (CF) framework [19] provides a separation logic which can be used to manually verify (more efficient) imperative code for performance-critical parts of the proof checker.
(b) CakeML provides a foreign function interface (FFI) and a corresponding formal FFI model [15]. These are used to verify system call interactions, e.g., file I/O and command-line interfaces, under carefully specified assumptions on the system environment.
(c) Most importantly, CakeML has a compiler that is verified [54] to preserve the semantics of source CakeML programs down to their compiled machine-code implementations. Hence, all guarantees obtained from the preceding steps can be carried down to the level of machine code with proofs.

Table 1 Comparison of cake_lpr against existing verified proof checkers [10,20,39]. Green background (cells with +) indicates, in our view, desirable properties, e.g., LPR is based on a stronger proof system than LRAT and GRAT, while red backgrounds (cells with ×) indicate less desirable properties. Yellow backgrounds (cells with −) are also undesirable but to a lesser extent.
Columns: Property | ACL2 checker [20] | Coq checker [10] | GRATchk [39] | cake_lpr
The combination of these tools enables binary code extraction [36] where verified machine code is extracted directly in HOL4. Several CakeML programs have been verified using these tools, including: certificate checkers for floating-point error bounds [6] and vote counting [18], an OpenTheory article checker [1], and the bootstrapped CakeML compiler [54]. Other toolchains can be used to build verified checkers, e.g., OEuf provides a similar binary code extraction toolchain in the Coq proof assistant [44]; the Verified Software Toolchain [9] provides a program logic for a subset of C which can be compiled with the verified compiler CompCert [40]; and the Isabelle Refinement framework can produce verified LLVM implementations [37,38].

SAT Problems and Clausal Proofs
Fix a set of boolean variables $x_1, \dots, x_n$, where the negation of variable $x_i$ is denoted $\overline{x}_i$, and the negation of $\overline{x}_i$ is identified with $x_i$. Variables and their negations are called literals and are denoted using $l$.
The input for propositional SAT solvers is a formula $F$ in conjunctive normal form (CNF) over the set of variables $x_1, \dots, x_n$. Here, CNF means that $F$ consists of an outer logical conjunction $F \equiv \bigwedge_{i=1}^{m} C_i$, where each clause $C_i$ is a disjunction over some of the literals $C_i \equiv l_{i1} \lor l_{i2} \lor \dots \lor l_{ik}$; in general, each clause can contain a different number of literals. Formulas in CNF can be represented directly as sets of clauses and clauses as sets of literals. The empty clause is denoted $\bot$.
An assignment $\alpha$ assigns boolean values (true or false) to each variable; $\alpha$ can be partial, i.e., it only assigns values to some of the variables. Like formulas and clauses, a (partial) assignment $\alpha$ can be represented as a set of literals such that $l \in \alpha$ iff $\alpha$ assigns $l$ to true. For consistency, $\alpha$ may contain at most one of each literal or its negation. The negation of an assignment, denoted $\overline{\alpha}$, assigns the negation of all literals in $\alpha$ to true. An assignment $\alpha$ satisfies a clause $C$ iff their set intersection is nonempty. Additionally, we define $C|_\alpha = \top$ if $\alpha$ satisfies $C$; otherwise, $C|_\alpha$ denotes the result of removing from $C$ all the literals falsified by $\alpha$, i.e., $C|_\alpha = C \setminus \overline{\alpha}$. For a formula $F$, we define $F|_\alpha = \{ C|_\alpha \mid C \in F \text{ and } C|_\alpha \neq \top \}$. Intuitively, $F|_\alpha$ contains the remaining simplified clauses in formula $F$ after committing to the partial assignment $\alpha$.
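The restriction operations defined above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not code from cake_lpr: literals are nonzero integers as in DIMACS, clauses and assignments are sets of literals, `None` stands for $\top$, and the function names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Optional

Clause = FrozenSet[int]


def restrict_clause(clause: Iterable[int], alpha: Iterable[int]) -> Optional[Clause]:
    """Return C|alpha, or None (standing for "top") when alpha satisfies C."""
    c, a = frozenset(clause), frozenset(alpha)
    if c & a:
        return None  # some literal of C is assigned true by alpha
    # Drop the literals of C that alpha falsifies, i.e. C minus negated alpha.
    return frozenset(l for l in c if -l not in a)


def restrict_formula(formula: Iterable[Iterable[int]], alpha: Iterable[int]) -> List[Clause]:
    """Return F|alpha: restricted clauses, omitting the satisfied ones."""
    a = frozenset(alpha)
    restricted = (restrict_clause(c, a) for c in formula)
    return [c for c in restricted if c is not None]


def satisfies(formula: Iterable[Iterable[int]], alpha: Iterable[int]) -> bool:
    """alpha satisfies F iff F|alpha is empty."""
    return not restrict_formula(formula, alpha)
```

For instance, restricting the clauses `[1, 2]`, `[-1, 3]`, `[-2, -3]` under the assignment `{1}` drops the first clause, shortens the second to `{3}`, and leaves the third unchanged.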
The task of a SAT solver is to determine whether $F$ is satisfiable, i.e., whether there exists a (possibly partial) assignment $\alpha$ such that $F|_\alpha$ is empty. Any satisfying assignment can be used as a certificate of satisfiability. Formulas without a satisfying assignment are unsatisfiable. Certifying unsatisfiability is more difficult and typically uses a clausal proof system [28]. The idea behind these proof systems is briefly recalled next, using the key concept of clause redundancy.
Definition 1 A clause $C$ is redundant with respect to formula $F$ iff $F \land C$ and $F$ are both satisfiable or both unsatisfiable, i.e., they are satisfiability equivalent.
A clause $C$ that is redundant for $F$ can be added to $F$ without changing its satisfiability. Clausal proof systems work by successively adding redundant clauses to $F$ until the empty clause $\bot$ is added. Such a sequence of additions is illustrated below:
$$F \Longrightarrow F \land C_1 \Longrightarrow F \land C_1 \land C_2 \Longrightarrow \cdots \Longrightarrow F \land C_1 \land C_2 \land \cdots \land \bot$$
Satisfiability is preserved along each $\Longrightarrow$ step because of clause redundancy, e.g., satisfiability of $F$ implies satisfiability of $F \land C_1$. Since the final formula containing $\bot$ is unsatisfiable, the sequence of redundant clause addition steps $C_1, C_2, \dots, \bot$ corresponds to a proof of unsatisfiability for $F$. Deciding clause redundancy is as hard as solving the SAT problem itself because $\bot$ is always redundant for unsatisfiable formulas. The difference between clausal proof systems is how the redundancy of a (proposed) redundant clause is efficiently certified at each proof step.
Many syntactic notions of redundancy are based on unit propagation. A unit clause is a clause with only one literal. The result of applying the unit clause rule to a formula $F$ is the formula $F|_l$ where $(l)$ is a unit clause in $F$. The iterated application of the unit clause rule to a formula until no unit clauses are left is called unit propagation. If unit propagation on $F$ yields the empty clause $\bot$, denoted by $F \vdash_1 \bot$, we say that $F$ implies $\bot$ by unit propagation. The notion of implied by unit propagation is also used for regular clauses as follows: $F \vdash_1 C$ iff $F \land \lnot C \vdash_1 \bot$. Observe that $\lnot C$ can be viewed as a partial assignment that assigns the literals $\overline{l}$, for $l \in C$, to true. For a formula $G$, $F \vdash_1 G$ iff $F \vdash_1 C$ for all $C \in G$. The main clausal proof system used in this article is based on propagation redundant clauses, which are defined as follows.
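To make unit propagation concrete, the following Python sketch implements a naive (quadratic) version of it together with the implied-by-unit-propagation test for a single clause. It is an illustration under our own assumptions (integer literals, clauses as iterables of integers, hypothetical function names) and is not how efficient checkers implement propagation.

```python
from typing import Iterable, List, Optional, Set


def unit_propagate(clauses: List[Iterable[int]]) -> Optional[Set[int]]:
    """Run unit propagation to a fixpoint.

    Returns None if the empty clause is derived (a conflict), otherwise
    the set of literals that were forced to true.
    """
    assigned: Set[int] = set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(l in assigned for l in clause):
                continue  # clause already satisfied
            remaining = {l for l in clause if -l not in assigned}
            if not remaining:
                return None  # every literal is falsified: conflict
            if len(remaining) == 1:
                assigned |= remaining  # unit clause rule
                changed = True
    return assigned


def implied_by_up(formula: List[Iterable[int]], clause: Iterable[int]) -> bool:
    """Check F |-1 C by propagating on F together with the units of not-C."""
    negated_units = [[-l] for l in clause]
    return unit_propagate(list(formula) + negated_units) is None
```

With the formula `[[1, 2], [-1, 2]]`, the unit clause `[2]` is implied by unit propagation, whereas `[1]` is not.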
Definition 2 Let $F$ be a formula, $C$ a nonempty clause, and $\alpha$ the smallest assignment that falsifies $C$. Then, $C$ is propagation redundant (PR) with respect to $F$ if there exists an assignment $\omega$ which satisfies $C$ and such that $F|_\alpha \vdash_1 F|_\omega$.
Intuitively, a PR clause $C$ is redundant because any satisfying assignment for $F$ that does not already satisfy $C$ can be modified to a satisfying assignment for $F \land C$ by updating its literals assigned to true according to the (partial) witnessing assignment $\omega$ (see [28, Theorem 1] for more details). Propagation redundancy is efficiently checkable in polynomial time using the witnessing assignment $\omega$, and PR generalizes various other notions of clause redundancy, including the de facto standard Resolution Asymmetric Tautology (RAT) proof system (see [28, Theorem 2]) that is able to compactly express all current techniques used in state-of-the-art SAT solvers [31]. There is ongoing research towards integrating PR in SAT solvers [27,28]. For example, PR-based preprocessing has been shown to improve solver performance on SAT competition benchmarks [49].
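The PR condition of Definition 2 can be checked directly from a given witness. The sketch below is a naive, hint-free illustration under our own assumptions (integer literals, hypothetical helper names); it is not the LPR checking algorithm and makes no attempt at efficiency.

```python
from typing import Iterable, List, Set


def restrict(formula: Iterable[Iterable[int]], assignment: Iterable[int]) -> List[Set[int]]:
    """F|assignment: drop satisfied clauses, remove falsified literals."""
    a = set(assignment)
    out = []
    for clause in formula:
        c = set(clause)
        if c & a:
            continue
        out.append({l for l in c if -l not in a})
    return out


def up_refutes(clauses: List[Set[int]]) -> bool:
    """True iff unit propagation on the clauses derives the empty clause."""
    assigned: Set[int] = set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(l in assigned for l in clause):
                continue
            remaining = {l for l in clause if -l not in assigned}
            if not remaining:
                return True
            if len(remaining) == 1:
                assigned |= remaining
                changed = True
    return False


def is_pr(formula: Iterable[Iterable[int]], clause: Iterable[int], witness: Iterable[int]) -> bool:
    """Check that `clause` is PR w.r.t. `formula` using `witness` (omega)."""
    formula = [set(c) for c in formula]
    c, omega = set(clause), set(witness)
    if not c or not (c & omega):
        return False  # C must be nonempty and omega must satisfy C
    alpha = {-l for l in c}  # smallest assignment falsifying C
    f_alpha = restrict(formula, alpha)
    # F|alpha |-1 F|omega: every clause D of F|omega must be implied.
    for d in restrict(formula, omega):
        if not up_refutes(f_alpha + [{-l} for l in d]):
            return False
    return True
```

As a small example, with `F = [[1, 2], [-1, 2], [-1, 3]]` the unit clause `[1]` is PR with witness `[1, 3]`, but the check fails for the smaller witness `[1]`, because the clause `[3]` in the restricted formula is then not implied by unit propagation.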
In practice, clausal proof formats also support clause deletions to speed up proof validation, especially for proof checking steps that need to iterate over the entire formula, e.g., the check $F|_\alpha \vdash_1 F|_\omega$ in Def. 2. Hence, unsatisfiability proofs for formula $F$ are modeled as sequences $I_1, \dots, I_n$ of instructions that either add or delete a clause; an addition instruction is a triple $\langle a, C, \omega \rangle$, where $C$ is a clause and $\omega$ is a (possibly empty) witnessing assignment; a deletion instruction is a pair $\langle d, C \rangle$, where $C$ is a clause. The sequence $I_1, \dots, I_n$ gives rise to formulas $F_1, \dots, F_n$ with $F_0 = F$, where $F_i$ is the accumulated formula, i.e., a (multi)set of clauses, up to the $i$-th instruction computed recursively according to (1).
$$F_i = \begin{cases} F_{i-1} \cup \{C\} & \text{if } I_i = \langle a, C, \omega \rangle \\ F_{i-1} \setminus \{C\} & \text{if } I_i = \langle d, C \rangle \end{cases} \tag{1}$$
A PR proof of unsatisfiability is valid if the last instruction adds the empty clause $I_n = \langle a, \bot, \emptyset \rangle$, and, for all addition instructions $I_i = \langle a, C, \omega \rangle$, it holds that $C$ is PR with respect to $F_{i-1}$ using witness $\omega$. In case an empty witness is provided for $I_i$, then $F_{i-1} \vdash_1 C$ should hold.
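The recursive computation of the accumulated formulas can be sketched as follows. The encoding of instructions as Python tuples `("a", clause, witness)` and `("d", clause)`, the use of `collections.Counter` as a multiset, and the choice to silently ignore deletion of an absent clause are all assumptions made for this illustration.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Clause = FrozenSet[int]


def accumulated_formulas(
    formula: Iterable[Iterable[int]],
    instructions: Sequence[Tuple],
) -> List[Counter]:
    """Return [F_0, F_1, ..., F_n] as multisets of clauses."""
    current: Counter = Counter(frozenset(c) for c in formula)
    history = [current.copy()]
    for instruction in instructions:
        kind, clause = instruction[0], frozenset(instruction[1])
        if kind == "a":
            current[clause] += 1  # F_i = F_{i-1} plus the clause
        elif kind == "d":
            if current[clause] > 0:
                current[clause] -= 1  # F_i = F_{i-1} minus the clause
        else:
            raise ValueError(f"unknown instruction kind: {kind!r}")
        history.append(+current)  # unary plus drops zero-count entries
    return history


def ends_with_empty_clause(instructions: Sequence[Tuple]) -> bool:
    """The last instruction must add the empty clause with an empty witness."""
    if not instructions:
        return False
    last = instructions[-1]
    return last[0] == "a" and len(last[1]) == 0 and len(last[2]) == 0
```

A full validity check would additionally test each addition instruction for PR with respect to the preceding accumulated formula.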

Linear Propagation Redundancy
This section describes a new clausal proof format called Linear Propagation Redundancy (LPR) which is designed to enable efficient validation of PR clauses using a (verified) proof checker. It also presents our enhancements to the DPR-trim tool to efficiently add hints to PR proofs, thereby turning them into LPR proofs. Throughout the section, we emphasize how LPR can be viewed as a clean and minimal extension of the existing LRAT proof format, which thereby enables its straightforward implementation in existing LRAT tools.
The most commonly used proof format for SAT solvers is DRAT, which combines deletion with RAT redundancy [55]. DRAT proofs are easy for SAT solvers to emit and top-tier SAT solvers support it, but they have some disadvantages for verified proof checking. In particular, checking whether a clause is RAT requires a significant amount of proof search to find the unit clauses necessary for showing the implied-by-unit-propagation ($\vdash_1$) property. This complicates verification of the proof checking algorithm and slows down the resulting verified proof checkers. The idea behind the Linear RAT (LRAT) [10,20] and GRAT [39] formats is to include these unit clauses as hints so that verified proof checkers can follow the hints directly without the need for proof search. The LPR format lifts this idea to allow fast validation of the PR property as follows.
An assignment $\omega$ reduces a clause $D$ if $D|_\omega \subset D$ and $D|_\omega \neq \top$. To check the PR property $F|_\alpha \vdash_1 F|_\omega$, it suffices to check, for each clause $D \in F$ reduced by $\omega$, that $F|_\alpha \vdash_1 D|_\omega$. In practice, a smaller $\omega$ yields a cheaper PR check. The LPR format extends the PR format by adding, for each clause reduced by the witness, a list of all unit clause hints required for showing the implied-by-unit-propagation property. Additionally, in order to point to clauses, the LPR format includes an index for each clause at the beginning of each line. The grammar of the LPR format is shown in Fig. 1.
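The notion of a witness reducing a clause determines which clauses need hints at all. A minimal Python sketch, assuming integer literals, clauses indexed from 1 in order of appearance, and hypothetical function names:

```python
from typing import Iterable, List, Sequence, Tuple


def reduces(omega: Iterable[int], clause: Iterable[int]) -> bool:
    """True iff omega falsifies some literal of the clause without satisfying it."""
    w, c = set(omega), set(clause)
    if c & w:
        return False  # clause restricted by omega is "top"
    return any(-l in w for l in c)  # restriction is a strict subset


def clauses_needing_hints(
    formula: Sequence[Sequence[int]], omega: Iterable[int]
) -> List[Tuple[int, Sequence[int]]]:
    """Indices (starting at 1) and clauses that the witness reduces."""
    w = set(omega)
    return [(i, c) for i, c in enumerate(formula, start=1) if reduces(w, c)]
```

Clauses that the witness satisfies or does not touch are skipped, which is why a smaller witness tends to make the PR check cheaper.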
The DPR-trim tool [28, Sect. 6] is built on top of DRAT-trim and facilitates verification of DPR proofs, a generalization of DRAT proofs with PR clause addition. If a clause addition step includes a witnessing assignment $\omega$, then the PR redundancy check is performed. Otherwise DPR-trim falls back on the RAT check from the DRAT-trim code base. Deletion steps are not validated. DPR-trim has similar features compared to DRAT-trim, including backward checking, extraction of unsatisfiable cores, and proof optimization.
Our extension to DPR-trim enriches input PR proofs by finding and adding all required unit clause hints to produce LPR proofs. Most of the changes generalize the code to produce LRAT proofs in DRAT-trim. The main PR-specific optimization shrinks the witness where possible: every literal in $\omega \cap \alpha$ is removed as well as any literal in $\omega$ that is implied by unit propagation from $F|_\alpha$. The shrinking was shown to be correct [28], but has not been implemented so far. We observed that the witnesses in the PR proofs produced by SaDiCaL [27] can be substantially compressed using this method.
Fig. 2 (left) shows an example formula in the standard DIMACS problem format. The DIMACS format includes a header line starting with "p cnf " followed by the number of variables and the number of clauses. The non-comment lines (not starting with "c ") represent clauses, and they end with "0". Positive integers denote positive literals, while negative integers denote negative literals. By convention (following LRAT), the clauses are implicitly indexed according to their order of appearance in the file, starting from index 1.
Fig. 2 (right) shows a corresponding proof in LPR format. Deletion lines in LPR are formatted identically to LRAT [10] (not shown here). For clause addition lines, the LPR format only differs from LRAT in case the clause to be added has PR but not RAT redundancy. A clause addition line in LPR format consists of three parts. The first part is the first integer on the line, which denotes the index of the new clause. The second part exactly matches the PR proof format [28]. It consists of the redundant clause and its witness; the first group of literals is the clause while the (potentially empty) witness starts from the second occurrence of the first literal of the clause until the first 0 that separates the unit clause hints. The third part (after the first 0) consists of the unit clause hints, which exactly match the LRAT format [10].
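The three-part layout of a clause addition line can be illustrated with a small splitting routine. This is a sketch under our own assumptions: it handles addition lines only (not deletion lines), returns the hints as an uninterpreted list of integers, and the function name and the sample lines in the usage note are hypothetical rather than taken from Fig. 2.

```python
from typing import List, Tuple


def split_lpr_addition(line: str) -> Tuple[int, List[int], List[int], List[int]]:
    """Split an LPR addition line into (index, clause, witness, hints)."""
    tokens = [int(t) for t in line.split()]
    index = tokens[0]
    # The first 0 after the index separates clause/witness from the hints.
    zero = tokens.index(0, 1)
    body = tokens[1:zero]
    hints = tokens[zero + 1:]
    if hints and hints[-1] == 0:
        hints = hints[:-1]  # drop the terminating 0
    clause, witness = body, []
    if body and body[0] in body[1:]:
        # The witness starts at the second occurrence of the first literal.
        k = body.index(body[0], 1)
        clause, witness = body[:k], body[k:]
    return index, clause, witness, hints
```

For example, the made-up line `23 -1 -5 -1 5 10 0 2 4 0` splits into index 23, clause `[-1, -5]`, witness `[-1, 5, 10]`, and hints `[2, 4]`, while `24 3 0 1 2 0` has an empty witness.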
The checking algorithm for LPR, shown in Fig. 3, overlaps significantly with that for LRAT (see [10, Algorithm 1]). The only differences are in Steps 4 and 5.1. In Step 4, the witness $\omega$ is used (if present) instead of always using the first literal in $C$. In Step 5.1, clauses are skipped if they are satisfied by the witness. Notice that a clause can only both have a literal falsified by a witness and be satisfied by it if the witness consists of at least two literals, while in the LRAT format witnesses always consist of exactly one literal. Note also that the algorithm does not check whether $C|_\omega = \top$, which is a requirement for PR. This omission is allowed because the first literal in $\omega$ in the LPR (and PR) format is syntactically the same as the first literal in $C$.

Fig. 2 (Left) Clauses of an unsatisfiable pigeonhole formula (4 pigeons, 3 holes) in the DIMACS format used by SAT solvers. The first 4 clauses encode that each pigeon belongs to at least one hole, e.g., the variable 1 (resp. 4, 7, 10) is set to true iff pigeon A (resp. B, C, D) is in the first hole; the latter 18 clauses encode that no two pigeons share the same hole, e.g., the clause -1 -4 encodes that pigeons A and B do not share the first hole. (Right) The LPR refutation consisting of clause-witness pairs and unit clause hints. The first bold integer in each line is the clause index while other bold integers are the unit clause hints. Dropping the bold integers yields a proof in the PR format. Redundant spaces have been added to improve readability.

Compositional Proofs
This section presents a new compositional proof format for unsatisfiability proof checking, motivated by the need to check very large unsatisfiability proofs behind SAT-solver aided proofs of mathematical results [23,29,35]. The rules of compositional proofs in propositional logic have been discussed in earlier work [24]; here, we define a format for compositional proofs and present a proof checking framework for the entire toolchain. Intuitively, the format consists of a top-level summary proof and a set of underlying proofs which are used to justify the top-level proof steps, see Fig. 4 for an illustration (explained below). The underlying proofs can be checked separately and in parallel to speed up proof checking. Section 4.1 presents the proof format and its proof checking algorithm. Section 4.2 explains our technique for generating compositional proofs from the cube-and-conquer SAT solving technique [22].

Compositional Proof Checking
Compositional proofs are modeled using instruction sequences $I_1, \dots, I_n$ similar to clausal proofs (Section 2.2) except addition instructions are of the form $\langle a, C \rangle$, i.e., they do not carry witnesses. The accumulated formulas $F_i$ are defined according to (1).
The key idea behind compositional proof checking is to justify a range of instructions simultaneously using an underlying clausal proof, see Fig. 4. More precisely, given a pair of indices $(i, j)$, with $i \le j$, the algorithm checks that satisfiability of the accumulated formula $F_i$ implies satisfiability of the accumulated formula $F_j$. By transitivity of satisfiability implication, this means that a proof with $n$ instructions can be checked by separately checking ranges $(0, i_1), (i_1, i_2), \dots, (i_k, n)$. If all checks succeed, then satisfiability of $F_0$ implies satisfiability of $F_n$ and, in particular, if $F_n$ also contains $\bot$ then the input formula is unsatisfiable. Satisfiability implication for each pair $(i, j)$ is checked using an underlying clausal proof and its …

Fig. 6 The grammar for the DRAT format [55], re-used for compositional proofs. Each line represents either addition (no prefix) or deletion (prefixed by "d") of a clause.
To enable re-use of existing parsing tools, compositional proofs are syntactically represented using the DRAT format [55], recalled in Fig. 6. Clauses are implicitly numbered according to their order of addition, continuing from the last (implicit) index of the corresponding DIMACS CNF. For example, starting from a DIMACS file with 5 clauses (indices 1-5), the first added clause in the compositional proof has index 6, the second has index 7, and so on. Deletion steps are ignored for the purposes of indexing.
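The implicit indexing convention can be illustrated with a short parser for DRAT-style lines. The sketch assumes a textual proof in which every line ends with `0` and deletions start with `d`; the function name and the tuple representation of steps are hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[int], List[int]]


def index_compositional_proof(num_cnf_clauses: int, proof_text: str) -> List[Step]:
    """Parse DRAT-style lines and number the added clauses.

    Added clauses get consecutive indices continuing after the last
    clause of the CNF; deletion steps receive no index.
    """
    steps: List[Step] = []
    next_index = num_cnf_clauses + 1
    for raw in proof_text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue  # skip blank lines
        is_delete = tokens[0] == "d"
        literals = [int(t) for t in (tokens[1:] if is_delete else tokens)]
        if not literals or literals[-1] != 0:
            raise ValueError(f"line is not terminated by 0: {raw!r}")
        clause = literals[:-1]
        if is_delete:
            steps.append(("d", None, clause))
        else:
            steps.append(("a", next_index, clause))
            next_index += 1
    return steps
```

Starting from a CNF with 5 clauses, the proof text `"1 2 0\nd 1 2 0\n-3 0\n0\n"` yields added clauses with indices 6, 7, and 8, with the deletion step left unnumbered.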

Compositional Proof Generation
Compositional proofs can be generated particularly conveniently from cube-and-conquer SAT proofs [22]. Given an input formula $F$, the basic idea behind cube-and-conquer is to partition $F$ into a set of subformulas $F \land G_1, F \land G_2, \dots, F \land G_n$, and to solve each subproblem separately. Here, each $G_i = l_1 \land l_2 \land \dots \land l_k$ is a cube, i.e., a conjunction of literals, such that the disjunction of cubes $\bigvee_{i=1}^{n} G_i$ is a tautology. Thus, if every subproblem $F \land G_i$ is unsatisfiable, then $F$ is unsatisfiable.

Fig. 7 Compositional proof used for the cube-and-conquer parallel SAT solving strategy. The boxes correspond to underlying proofs generated by DRAT-trim for each cube which can be checked using cake_lpr or other LRAT tools [10]. In practice, the cubes $G_i$ used in the top-level proof steps can be simplified to core subsets $G'_i \subseteq G_i$ obtained from incremental SAT solving.
The compositional unsatisfiability proof for $F$ under cube-and-conquer is shown in Fig. 7, analogous to Fig. 4. Here, the top-level proof is the sequence of additions of negated cubes $\lnot G_i$ (which are clauses), followed by addition of the empty clause. The first $n$ underlying proofs justify unsatisfiability of $F$ under each cube, while the last proof shows that the cube-and-conquer strategy correctly partitions the space of assignments.
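The shape of such a top-level proof can be sketched in Python. The code below emits one DRAT-style line per negated cube followed by the empty clause, and includes a brute-force check that the cubes cover all assignments. Both functions are hypothetical illustrations: the exhaustive check is exponential in the number of cube variables and only suitable for tiny examples, whereas the approach described here justifies the final step with a clausal proof instead.

```python
from itertools import product
from typing import List, Sequence


def top_level_proof(cubes: Sequence[Sequence[int]]) -> str:
    """One line per negated cube (a clause), then the empty clause."""
    lines = [" ".join(str(-l) for l in cube) + " 0" for cube in cubes]
    lines.append("0")
    return "\n".join(lines) + "\n"


def cubes_cover_all_assignments(cubes: Sequence[Sequence[int]]) -> bool:
    """Brute-force test that the disjunction of the cubes is a tautology."""
    variables: List[int] = sorted({abs(l) for cube in cubes for l in cube})
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        # A cube holds if every one of its literals is true.
        if not any(all(value[abs(l)] == (l > 0) for l in cube) for cube in cubes):
            return False
    return True
```

For the three cubes `[1, 2]`, `[1, -2]`, `[-1]`, the generated proof consists of the clauses `-1 -2`, `-1 2`, `1`, and the empty clause, and the coverage check succeeds.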
The top-level proof steps can be simplified while generating the underlying clausal proofs for each step. In practice, each subproblem $F \land G_i$ is tackled using incremental SAT solvers that support solving under assumptions (or partial assignments) $G_i$. Such assumptions are typically provided either via iCNF files [56] or via the solver API; an example formula with three cubes expressed in iCNF (DIMACS files extended with assumptions) is shown in Fig. 8. If the formula $F$ is found to be unsatisfiable under assumptions $G_i$, incremental solvers can further compute a subset of the cube $G'_i \subseteq G_i$ that was involved in determining unsatisfiability [13]. The negation $\lnot G'_i$ can be used in the compositional proof in place of $\lnot G_i$, e.g., the topmost cube in Fig. 8 has a proof with $G'_i \subset G_i$. Since existing solvers with incremental support can only log proofs in the DRAT format, we modified DRAT-trim to convert such partial proofs into LRAT proofs that end with the addition of clause $\lnot G'_i$. In addition, when generating the LRAT proof for $\lnot G'_i$, the compositional proof checking algorithm (Fig. 5) allows us to assume all previously added clauses, i.e., starting with $F \land \lnot G'_1 \land \dots \land \lnot G'_{i-1}$, which may help to simplify the LRAT proof.
The final line in compositional unsatisfiability proofs is the addition of the empty clause. In most use cases, there exists a short justification of the empty clause using the added clauses in the compositional proof, e.g., the conjunction of preceding clauses in the compositional proof is typically unsatisfiable (in our case, by construction). Thus, a SAT solver can be used on that formula to produce the justification of the empty clause in the appropriate clausal proof format. Alternatively, if the compositional proof was constructed using more complicated cube-and-conquer strategies [22], then the underlying tree structure used to generate the cubes can be used to guide the resolution steps for obtaining the empty clause.

CakeML Proof Checking
This section explains the implementation and verification of cake_lpr, our verified CakeML LPR proof checker. Section 5.1 focuses on the high-level verification strategy which we used to reduce the verification task to mostly routine low-level proofs (details of the latter are omitted). Section 5.2 explains the main verified theorems for the proof checker. Section 5.3 highlights some of the verified performance optimizations used in the proof checker.

Implementation and Verification Strategy
The development of cake_lpr proceeds in three refinement steps, where each step progressively produces a more concrete and performant implementation of the proof checker. These refinements are visualized in the three columns of Fig. 9 for LPR proof checking. A similar verification process was used to add support for compositional proof checking.
Step 1 formalizes the definition of CNF formulas and their unsatisfiability, as well as the PR proof system described in Section 2.2. The inputs and outputs to the proof system are abstract and not tied to any concrete representation at this step. For example, input variables are drawn from an arbitrary type, while clauses and CNFs are represented using sets. The correctness of the PR proof system is proved in this step, i.e., we show that a valid PR proof implies unsatisfiability of the input CNF. The proof essentially follows [28, Theorem 1].
Step 2 implements a purely functional version of the LPR proof checking algorithm from Fig. 3. Here, the inputs and outputs are given concrete representations with computable datatypes, e.g., literals are integers (similar to DIMACS), clauses are lists of integers, and CNFs are lists of clauses. These concrete representations lift naturally to the abstract, set-based representation from Step 1. The output is a YES or NO answer according to the algorithm from Fig. 3. The correctness theorem for Step 2 shows that LPR proof checking correctly refines the PR proof system, i.e., if it outputs YES, then there exists a valid PR proof for the input (lifted) CNF; by Step 1, this implies that the CNF is unsatisfiable. If the output is NO, the input CNF could still be unsatisfiable, but the input LPR proof is not valid according to the algorithm in Fig. 3.
Step 3 uses imperative features in the CakeML source language, e.g., (byte) arrays and exceptions, to improve code performance; these optimizations are detailed further in Section 5.3. This step also adds user interface features like parsing and file I/O so that the input CNF formula is read (and parsed) from a file, and the results are printed on the standard output and error streams. The verification of this step uses CakeML's proof-producing translator [46] and characteristic formula framework [19] to prove the correctness of the source code implementation of cake_lpr; this code is subsequently compiled with the verified CakeML compiler [54]. Composing the correctness theorem for source cake_lpr with CakeML's compiler correctness theorem yields the corresponding correctness theorem for the cake_lpr binary.
At the point of writing, the verified cake_lpr binary can be invoked from the command-line in five ways, each with associated soundness proofs (Section 5.2):

cake_lpr <DIMACS>
Parses the input file in DIMACS format and prints the parsed formula.

cake_lpr <DIMACS> <COMP> i-j <LPR>
Runs compositional proof checking on the parsed formula and compositional proof COMP for the range i-j and underlying LPR proof LPR.
Recall from Section 4.1 that cake_lpr needs to be executed (with option 4) for a set of ranges covering the lines of the compositional (top-level) proof COMP and we expect users to exploit this compositionality by running separate instances of cake_lpr in parallel on several machines. Accordingly, an important correctness caveat for compositional proof checking is that users correctly set up separate executions for their respective systems.
Option 5 is designed to add a simple layer of protection against user error when setting up separate (or parallel) executions. In particular, cake_lpr outputs the following string for each successful run of option 4, where md5 takes the MD5 hash of the input files:
s VERIFIED RANGE md5(DIMACS) md5(COMP) i-j
The MD5 hash is chosen for convenience because it is available on most machines, so users can, e.g., manually compare MD5 hashes of their input DIMACS and proof files to check that the correct files were used on all machines. By concatenating these outputs into an output file OUTPUT and executing cake_lpr with option 5, cake_lpr can be used to check that the output file contains the correct hashes and that the specified ranges appropriately cover the entire compositional proof. Our implementation checks range coverage (e.g., that ranges 0-2, 4-8, 2-4, 8-12 can be strung together to check range 0-12) by reusing a verified reachability checking function originally developed for a verified compiler optimization in the CakeML compiler [32].
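The range coverage check can be pictured as graph reachability: each verified range i-j is an edge from i to j, and the ranges cover 0-12 if 12 is reachable from 0. The following Python sketch illustrates the idea only; the function name covers is hypothetical and this is not the verified CakeML reachability function.

```python
def covers(ranges, start, end):
    """Return True if the (i, j) ranges can be chained from start to end.

    Each range is treated as a directed edge i -> j; coverage then reduces
    to reachability of `end` from `start`.
    """
    successors = {}
    for i, j in ranges:
        successors.setdefault(i, set()).add(j)
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return end in seen

# The example from the text: ranges given out of order still cover 0-12.
example_ok = covers([(0, 2), (4, 8), (2, 4), (8, 12)], 0, 12)
# Dropping range 2-4 leaves a gap, so coverage fails.
example_gap = covers([(0, 2), (4, 8), (8, 12)], 0, 12)
```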

Correctness Theorems
The main correctness theorem for cake_lpr in HOL4 is shown in Fig. 10. The first line (2) (in red) summarizes routine assumptions for compiled CakeML programs that use its basis library. Briefly, it assumes that the command-line cl and file system fs models are well-formed and the compiled code is placed in (and executed from) code memory of a machine state ms according to CakeML's x64 machine model mc.
The first guarantee (3) (in blue) is that the machine-code implementation always terminates normally according to CakeML's x64 machine-code semantics. Notably, this means the binary never crashes and it may emit some I/O events when run; however, it may terminate with an out-of-memory error (extend_with_resource_limit) if the CakeML runtime runs out of stack or heap space [54]. The second guarantee (4) (in green) states that the only observable changes to the filesystem after executing cake_lpr are some strings printed on standard output out and standard error err. To minimize user confusion, cake_lpr is designed to print all error messages to standard error and only success messages on standard output. Finally, lines (5) (in black) list the output behaviors of cake_lpr for each command-line option.3
1. When cake_lpr is given one command-line argument (length cl = 2)4, it attempts to read and parse the file before printing (if successful) the parsed formula to standard output. The DIMACS parser (parse_dimacs) is proved to be left inverse to the DIMACS printer (print_dimacs) as follows:
every wf_clause fml ⇒ parse_dimacs (print_dimacs fml) = Some fml
This says that for any well-formed formula fml, printing that formula into DIMACS format then parsing it yields the formula back. All parsed formulas are proved to be well-formed (not shown here).
2. If two arguments are given (length cl = 3), then if the string "s VERIFIED UNSAT\n" is printed onto standard output, cake_lpr was provided with a file (in its first argument), and the file parses in DIMACS format to a formula fml whose lifted representation (interp fml) is unsatisfiable.
3-5. The specifications for the remaining cases have a similar flavor and are omitted here for brevity.
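The left-inverse property of the parser can be illustrated with a simplified Python analogue. The functions below reuse the names parse_dimacs and print_dimacs for readability, but they are hypothetical re-implementations of a small DIMACS subset (one clause per line, no comments emitted), and well-formedness is assumed to mean that no clause contains the literal 0; None plays the role of a failed parse.

```python
def print_dimacs(fml):
    """Print a list-of-lists CNF as DIMACS lines (simplified sketch)."""
    max_var = max((abs(l) for c in fml for l in c), default=0)
    lines = ["p cnf %d %d" % (max_var, len(fml))]
    lines += [" ".join(str(l) for l in list(c) + [0]) for c in fml]
    return lines

def parse_dimacs(lines):
    """Parse DIMACS lines into a CNF, or return None on any error."""
    header = None
    fml = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "c":
            continue  # skip blank lines and comments
        try:
            if tokens[0] == "p":
                if header is not None or len(tokens) != 4 or tokens[1] != "cnf":
                    return None
                header = (int(tokens[2]), int(tokens[3]))
                continue
            lits = [int(t) for t in tokens]
        except ValueError:
            return None
        # Clauses must follow the header and end with a single terminating 0.
        if header is None or lits[-1] != 0 or 0 in lits[:-1]:
            return None
        clause = lits[:-1]
        if any(abs(l) > header[0] for l in clause):
            return None
        fml.append(clause)
    if header is None or len(fml) != header[1]:
        return None
    return fml

sample = [[1, -2], [2, 3], []]
round_trip = parse_dimacs(print_dimacs(sample))
```

In the verified development the corresponding statement is a theorem over all well-formed formulas; the sketch only exercises it on examples.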
The correctness theorem for cake_lpr's compositional proof checking is shown in Fig. 11. The definition check_successful_par characterizes a successful set of runs of cake_lpr using command line options 4 and 5 on input strings fmlstr and pfstr. The lines in red (6) say that there is a list of output strings outs such that every entry of this list is produced from the standard output of an execution of cake_lpr with command-line option 4 (with appropriate setup of each machine's filesystem and command-line arguments).5 The lines in blue (7) say that outs is successfully checked by an execution of cake_lpr with command-line option 5 and the corresponding success string (concat [. . . ]) is printed to standard output. Using this definition, the correctness theorem (8) says that on a successful set of runs, satisfiability of the formula fml parsed from fmlstr implies satisfiability of the final formula run_proof fml pf obtained by running all lines of the parsed proof pf on fml, i.e., if fml is 0 and pf has instructions, then run_proof fml pf computes the accumulated formula as defined in Section 4.1.
3 These lines feature HOL4 definitions for interacting with the filesystem fs, e.g., inFS_fname fs s says the string s is a valid filename in fs, all_lines fs s returns the file contents as a list of strings (per line), and file_content fs s returns the file content as a raw string.
4 By convention, the default (zeroth) command-line argument is always the name of the executable.
The theorems in Figures 10 and 11 have a trusted computing base (TCB) where the CakeML compiler itself is not present. The TCB does, however, still include CakeML's model of the foreign function interface and assumed semantics for the targeted hardware platform. CakeML's TCB is discussed in detail in prior publications [36,42].

Verified Optimizations
To minimize verification effort, CakeML's imperative features are only used for the most performance-critical steps of cake_lpr. Our design decisions are based on empirical observations about the LPR proof checking algorithm. These are explained below with reference to specific steps in the algorithm outlined in Fig. 3.

Array-based representations
In practice, many LPR proof steps do not require the full strength of a PR (or RAT) clause. Hence, a large part of proof checking time is spent in the Step 3 loop of the algorithm (reverse unit propagation) and it is important to compute the main loop bottleneck, | in Step 3.1, as efficiently as possible. Here, CakeML's native byte arrays are used to maintain a compact bitset-like representation of the assignment , so that | can be computed in one pass over with constant time bitset lookup for each literal in .
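As an illustration of the bitset-like assignment, the following Python sketch checks a clause by reverse unit propagation using annotated hints, in the style of LRAT checking. It is a sketch under stated assumptions rather than the cake_lpr code: the names lit_index and check_rup, the two-slots-per-variable layout, and the exact hint-processing rules are all hypothetical.

```python
def lit_index(lit):
    """Map a nonzero literal to a slot: two slots per variable."""
    return 2 * abs(lit) + (1 if lit < 0 else 0)

def check_rup(clauses, clause, hints, num_vars):
    """Check `clause` by unit propagation over the hinted clause indices.

    `clauses` maps clause indices to lists of literals. A slot of the byte
    array is 1 when the corresponding literal is assigned true.
    """
    assigned = bytearray(2 * num_vars + 2)
    # Assume the negation of the clause being checked.
    for lit in clause:
        assigned[lit_index(-lit)] = 1
    for h in hints:
        # One pass over the hinted clause, constant-time lookup per literal:
        # keep only literals that are not falsified by the assignment.
        remaining = [l for l in clauses[h] if not assigned[lit_index(-l)]]
        if not remaining:
            return True   # all literals falsified: conflict reached
        if len(remaining) > 1:
            return False  # hint is not unit under the assignment
        assigned[lit_index(remaining[0])] = 1
    return False          # hints exhausted without reaching a conflict

formula = {1: [1, 2], 2: [-1, 2], 3: [1, -2], 4: [-1, -2]}
good = check_rup(formula, [2], [1, 2], num_vars=2)
bad = check_rup(formula, [2], [1], num_vars=2)
```

Storing one byte per literal trades some memory for simple indexing; a packed bitset would be the natural refinement.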
For proof steps requiring the full strength of PR clauses, Step 5 loops over all undeleted clauses in the formula. Formulas are represented as an array of clauses6 together with a lazily updated list that tracks all indices of the array containing undeleted clauses. This enables both constant-time lookup of clauses throughout the algorithm and fast iteration over the undeleted clauses for Step 5. Deletion in the index list is done in (amortized) constant time by removing a deleted index only when the index is looked up in Step 5.1.
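The interplay between the clause array and the lazily updated index list can be sketched in Python as follows. The class and method names are hypothetical, and the sketch assumes that clause indices are fresh when added, as is usual for LRAT-style proofs.

```python
class Formula:
    """Array of clauses plus a lazily maintained list of live indices."""

    def __init__(self):
        self.clauses = []  # slot i holds clause i, or None if absent/deleted
        self.indices = []  # indices ever added; may contain stale entries

    def add(self, idx, clause):
        # Grow the array on demand, then record the (assumed fresh) index.
        if idx >= len(self.clauses):
            self.clauses.extend([None] * (idx + 1 - len(self.clauses)))
        self.clauses[idx] = clause
        self.indices.append(idx)

    def delete(self, idx):
        # Constant time: the index list is left untouched here.
        if idx < len(self.clauses):
            self.clauses[idx] = None

    def live(self):
        # Stale indices are dropped only when the list is traversed, so each
        # deleted index is removed once (amortized constant time).
        kept = [i for i in self.indices if self.clauses[i] is not None]
        self.indices = kept
        return [(i, self.clauses[i]) for i in kept]

f = Formula()
f.add(1, [1, 2])
f.add(2, [-1])
f.add(3, [2])
f.delete(2)
stale_before = list(f.indices)
live_clauses = f.live()
```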
Additionally, for each literal, the smallest clause index where that literal occurs (if any) is lazily tracked in a lookup array; for a given witness , all clauses occurring at indices below the index of any literal in can be skipped in Step 5.1.

Proof checking exceptions
There are several steps in the proof checking algorithm that can fail (report NO) if the input proof is invalid, e.g., in Step 3.3. In a purely functional implementation, results are represented with an option: None indicating a failure and Some res indicating success with result res. While conceptually simple, this means that common case (successful) intermediate results are always boxed within a Some constructor and then immediately unboxed with pattern matching to be used again. In cake_lpr, failures instead raise exceptions which are directly handled at the top level. Thus, successful results can be passed directly, i.e., as res, without any boxing. Support for verifying the use of exceptions is a unique feature of CakeML's CF framework [19].
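The difference between the two styles can be seen in a toy Python analogue, where None stands in for the option type's failure case. The step function is a hypothetical placeholder (it fails on negative inputs) and has nothing to do with actual proof steps; only the control-flow pattern is of interest.

```python
class ProofCheckFailure(Exception):
    """Raised when a (toy) proof step is invalid."""

def step_option(state, step):
    # Option style: failure is a distinguished return value.
    if step < 0:
        return None
    return state + step

def run_option(state, steps):
    for s in steps:
        result = step_option(state, s)
        if result is None:  # every intermediate result must be inspected
            return None
        state = result
    return state

def step_exn(state, step):
    # Exception style: successful results are returned directly.
    if step < 0:
        raise ProofCheckFailure("invalid step")
    return state + step

def run_exn(state, steps):
    try:
        for s in steps:
            state = step_exn(state, s)
    except ProofCheckFailure:  # handled once, at the top level
        return None
    return state
```

Both runners agree on every input; the exception-based one simply avoids wrapping and unwrapping each intermediate result.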

Hashtables
For compositional proof checking (Fig. 5), it is important to check the inclusion of clauses ⊆ efficiently since both formulas can contain a large number of clauses. To achieve this, the formula (in the array-based representation) is converted into CakeML's verified hashtable library using a simple rolling hash for clauses. This allows every clause in to be checked against the hashtable for in near-constant time.
6 Deleted clauses are no longer referenced by the array and are automatically freed by CakeML's garbage collector.
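A hash-based inclusion check of this kind can be sketched in Python as follows. The polynomial rolling hash, its constants, and the canonicalization of clauses by sorting and deduplicating literals are assumptions made for illustration; the text does not specify which hash or normalization cake_lpr uses.

```python
def rolling_hash(clause, base=31, modulus=1000003):
    """Simple polynomial rolling hash over a sequence of integer literals."""
    h = 0
    for lit in clause:
        h = (h * base + lit) % modulus
    return h

def canonical(clause):
    # Assumed normalization: clauses are compared as sets of literals.
    return tuple(sorted(set(clause)))

def build_table(fml):
    """Bucket every clause of fml by the hash of its canonical form."""
    table = {}
    for clause in fml:
        key = canonical(clause)
        table.setdefault(rolling_hash(key), []).append(key)
    return table

def included(sub, table):
    """Check that every clause of `sub` occurs in the hashed formula."""
    for clause in sub:
        key = canonical(clause)
        if key not in table.get(rolling_hash(key), ()):
            return False
    return True

table = build_table([[1, -2], [3], [2, 3, -1]])
```

Building the table takes one pass over the larger formula, after which each lookup only inspects a single bucket.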

Buffered I/O streams
Proof files generated by SAT solvers can be large, e.g., ranging from 300 MB to 4 GB for the second benchmark suite in Section 6. These files are streamed into memory line by line because each proof step depends only on information contained in its corresponding line in the file. This streaming interaction is optimized using CakeML's verified buffered I/O library [41] which maintains an internal buffer of yet-to-be-read bytes from the read-only proof file to batch and minimize the number of expensive filesystem I/O calls.

Producing MD5 hashes
As part of this work, we verified a CakeML library for computing the MD5 hash of an input stream which is connected with the buffered I/O library to efficiently compute hashes for the compositional proof checking command-line options 4 and 5 in Section 5.1. The verification shows that our imperative implementation of MD5 hashing matches its functional specification. In particular, since the MD5 hash is provided solely for user convenience, we do not prove any formal cryptographic properties of the hashing function.
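The idea of hashing an input stream incrementally, rather than loading the whole file, can be illustrated with Python's standard hashlib module. This is only an analogue of the verified CakeML library: the function name md5_of_stream and the chunk size are hypothetical choices.

```python
import hashlib
import io

def md5_of_stream(stream, chunk_size=4096):
    """Compute the MD5 hex digest of a binary stream chunk by chunk."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read signals end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()

# A tiny DIMACS file held in memory, read in deliberately small chunks.
data = b"p cnf 1 1\n1 0\n"
streamed = md5_of_stream(io.BytesIO(data), chunk_size=4)
```

The streamed digest agrees with hashing the full contents at once, which is the kind of equivalence between an imperative implementation and its functional specification that the verification establishes for the CakeML library.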

Benchmarks
This section compares the CakeML LPR proof checker against other verified checkers on two benchmark suites (Sections 6.1 and 6.2) and a RAT microbenchmark (Section 6.3). It also evaluates the compositional proof format on unsatisfiability proofs of Erdős discrepancy properties [35] (Section 6.4). The first suite is a collection of problems with PR proofs generated by the satisfaction-driven clause learning (SDCL) solver SaDiCaL [27]; the second suite consists of unsatisfiable problems from the SAT Race 2019 competition.7 The RAT microbenchmark consists of proofs for large mutilated chessboards generated by a BDD-based SAT solver [8]. The unsatisfiability proofs for the Erdős discrepancy properties are generated by cube-and-conquer (see Section 4.2). Raw timing data used to produce the figures and tables in this section is available from the cake_lpr repository.

SaDiCaL PR Benchmarks
The SaDiCaL solver produces PR proofs for hard SAT problems in its benchmark suite [27] and it is experimentally much faster than a plain DRAT-based CDCL solver on those problems [27, Section 7]. The PR proofs are directly checked by cake_lpr after conversion into LPR format with DPR-trim. For all other checkers, the PR proofs were first converted to DRAT format using pr2drat (as in an earlier approach [27]), and then into LRAT and GRAT formats using the DRAT-trim and GRATgen8 tools respectively. All tools were run with a timeout of 10000 seconds and all timings are reported in seconds (to one d.p.). Results are summarized in Tables 2 and 3.
All benchmarks were successfully solved by SaDiCaL except mchess19 which exceeded the time limit. For the remaining benchmarks, generating and checking LPR proofs required a comparable (1-2.5x) amount of time to solving the problems, except mchess, for which LPR generation and checking is much faster than solving the problems (Table 2). Unsurprisingly, direct checking of LPR proofs is much faster than the circuitous route of converting into DRAT and then into either LRAT or GRAT (Table 3). Unlike LPR, checking PR proofs via the LRAT route is 5-60x slower than solving those problems; this is a significant drawback to using the route in practice for certifying solver results. Compared to an unverified C implementation of LPR proof checking, cake_lpr is approximately an order of magnitude slower on these benchmarks; a detailed comparison against unverified proof checking is in Section 6.3.
The backwards compatibility of cake_lpr is also shown in Table 3, where it is used to check the generated LRAT proofs. Among the LRAT checkers, acl2-lrat is fastest, followed by cake_lpr (LRAT checking), and coq-lrat. Although cake_lpr (LRAT checking) is on average 1.3x slower than acl2-lrat, it scales better on the mchess problems and is actually much faster than acl2-lrat on mchess18. We also observed that the GRAT toolchain (summing SaDiCaL, pr2drat, GRATgen and GRATchk times) is much slower than the LRAT toolchains (summing SaDiCaL, pr2drat, DRAT-trim and fastest LRAT checking times). This is in contrast to the SAT Race 2019 benchmarks below (see Fig. 12), where we observed the opposite relationship. We believe that the difference in checking speed is due to the various checkers having different optimizations for checking the expensive RAT proof steps produced by conversion from PR proofs.

SAT Race 2019 Benchmarks
We further benchmarked the verified checkers on a suite of 102 unsatisfiable problems from the SAT Race 2019 competition.9 For all problems, DRAT proofs were generated using the state-of-the-art SAT solver CaDiCaL before conversion into the LRAT or GRAT formats. Proofs generated by CaDiCaL on this suite rarely require RAT (or PR) steps, so the checkers are stress-tested on their implementation of file I/O, parsing, and reverse unit propagation based on the annotated hints (Step 3.1 from Fig. 3); cake_lpr is the only tool with a formally verified implementation of the former two steps. All tools were run with the SAT competition solver timeout of 5000 seconds.
A summary of the results is given in Table 4, where all proofs generated by CaDiCaL were checked by at least one verified checker. The acl2-lrat checker fails with a parse error on one problem even though none of the other checkers reported such an error; GRATgen aborted on two problems for an unknown reason. Plots comparing LRAT proof checking time and overall proof generation and checking time (LRAT and GRAT) are shown in Fig. 12. From Fig. 12 (top), the relative order of LRAT checking speeds remains the same as Table 3, where cake_lpr is on average 1.2x slower than acl2-lrat, although cake_lpr is faster on 28 benchmarks. From Fig. 12 (bottom), both LRAT toolchains are slower than the GRAT toolchain (average 3.5x slower for cake_lpr and 3.4x for acl2-lrat). Part of this speedup for the GRAT toolchain comes from GRATgen which can be run in parallel (with 8 threads). This suggests that adding native support for GRAT-based input to cake_lpr or adding LRAT support to GRATgen could be worthwhile future extensions.

Table 2 Timings for PR benchmarks with conversion into LPR format. The "Total (LPR)" column sums the generation and checking times. The timing for mchess19 is omitted because SaDiCaL timed out; timings for the Urquhart U.-s3-* benchmarks are omitted because they took a negligible amount of time (< 1.0s total).

Table 3 Timings for PR benchmarks, first converted to DRAT and subsequently converted into LRAT and GRAT formats. The "Total (LRAT)" and "Total (GRAT)" columns sum the fastest generation and checking times for the LRAT and GRAT formats respectively. The "Total (LPR)" column (in bold, fastest total time) is reproduced from Table 2 for ease of comparison. Fail(T) indicates a timeout. Timings for the mchess19 and U.-s3-* benchmarks are omitted as in Table 2.

Mutilated Chessboard RAT Microbenchmarks
The following microbenchmark suite tests the LRAT checkers on large mutilated chessboard problems (up to 100 by 100) solved by pgbdd, a BDD-based SAT solver [8]. Unlike the previous benchmark suites, LRAT proofs are emitted directly by the solver so additional DRAT-trim conversion is not needed. All tools were run with a timeout of 10000 seconds and all timings are reported in seconds (to one d.p.). For additional scaling comparison, we also report results for lrat-check, an unverified LRAT proof checker implemented in C. The results in Table 5 show the impact of cake_lpr's RAT optimizations (Section 5.3). Notably, cake_lpr scales essentially linearly in the size of the proofs (up to ≈ 10 million proof steps). As a result, cake_lpr is significantly faster than acl2-lrat and coq-lrat on these RAT-heavy proofs and it comes within a 5x factor of the unverified lrat-check tool.

Erdős Discrepancy Properties
The Erdős Discrepancy Problem (EDP) asks if, for every $C > 0$ and infinite ±1 sequence $(x_1, x_2, x_3, \ldots)$, there exist $d, k$ such that $\left|\sum_{i=1}^{k} x_{i \cdot d}\right| > C$. Konev and Lisitsa [35] provided a SAT-solver aided proof of the EDP in the case $C = 2$. We demonstrate the parallel scalability of compositional proof checking (Section 4) using cake_lpr's compositional proof checking option on a cube-and-conquer proof generated by iGlucose, a version of Glucose 3.0 with support for iCNF files, and the technique reported in Section 4.2. The technique yields a 5226 line compositional top-level proof where each line is justified by an underlying LRAT proof, see Fig. 8. The underlying proofs consist of 20 million clause addition lines in total and they vary widely in size, ranging from 88 bytes to 110 MB. Proof steps are allocated evenly to parallel threads by their index. For example, with two threads, the first thread would check all odd-numbered proof steps by sequentially running instances of cake_lpr to check the underlying proofs for ranges (0, 1), (2, 3), . . . , while the second thread checks all even-numbered proof steps for ranges (1, 2), (3, 4), . . . , and similarly for more parallel threads. Results are summarized in Table 6 with wall-clock execution times reported in seconds and relative speedup against a single thread. All values are rounded to one decimal place.
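The index-based allocation of proof steps to threads can be sketched in Python as below. The function name allocate is hypothetical, and the sketch assumes that proof step k of the top-level proof corresponds to the range (k-1, k) and that steps are dealt out round-robin; it does not launch any cake_lpr processes.

```python
def allocate(num_steps, num_threads):
    """Assign proof steps 1..num_steps to threads in round-robin order.

    Returns one list of (i, j) ranges per thread; step k is assumed to be
    checked as the range (k - 1, k).
    """
    return [
        [(k - 1, k) for k in range(1 + t, num_steps + 1, num_threads)]
        for t in range(num_threads)
    ]

# With two threads, the first gets the odd-numbered steps and the second
# gets the even-numbered steps.
two_threads = allocate(5, 2)

# For the 5226-line top-level proof, every step is assigned exactly once.
many = allocate(5226, 128)
all_ranges = sorted(r for thread in many for r in thread)
```

Because the allocation ignores the size of each underlying proof, a few large proofs can dominate one thread's workload, which is the kind of imbalance a scheduler-based scheme would address.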
The speedup is nearly linear for lower numbers of threads (1-32) but drops off at 64 and 128 threads. This is likely due to the unbalanced nature of proof sizes, where large LRAT proofs dominate the overall proof checking time at high levels of parallelism. A more advanced parallelization scheme, e.g., with a scheduler, could further improve proof checking performance.

Related Work
Verified Proof Checking. There are several RAT-based verified proof checkers, in ACL2 [20], Coq [10], and Isabelle/HOL [39]; these verified checkers all use an unverified preprocessing tool to add proof hints to DRAT proofs, either DRAT-trim or GRATgen. Alternative preprocessing tools are available, e.g., based on the recently proposed FRAT format [4]. The DRAT format is itself an extension of the DRUP format [21]; the Coq checker is based on a predecessor verified checker for the GRIT [11] format. The ACL2 checker can be efficiently and directly executed (without extraction) using imperative primitives native to the ACL2 kernel [20]. However, the implementation of these features in ACL2 itself must be trusted to trust the proof checking results [50], hence the yellow background in Table 1. SMT-Coq [2,14] is another certificate-based checker for SAT and SMT problems in Coq. Its resolution-based proof certificates can be checked natively using native computation extensions of the Coq kernel. Verified checkers are available for other logics, such as the Verified TESC Verifier for first-order logic [3], Pastèque for the practical algebraic calculus [33], and various checkers for higher-order logics (and type theories) underlying proof assistants [1,47,52].
Applications. SAT solving is a key technology underlying many software and hardware verification domains [7,30]. Certifying SAT results adds a layer of trust and is clearly a worthwhile endeavor. Solver-aided proofs of mathematical results [23,29,35] are particularly interesting and challenging to certify because these often feature complicated SAT encodings, custom (hand-crafted) proof steps, and enormous resulting proofs [29]. Our cake_lpr checker is designed to handle the latter two challenges effectively. For the first challenge, the SAT encoding of mathematical problems can also be verified within proof assistants. This was done for the Boolean Pythagorean Triples problem building on the Coq proof checker [12].
Verified SAT Solving. An alternative to proof checking is to verify the SAT solvers [16,17,43,48]. This is a significant undertaking but it would allow the pipeline of generating and checking proofs to be entirely bypassed. Furthermore, such verification efforts can yield new insights about key invariants underlying SAT solving techniques compared to prior pen-and-paper presentations, e.g., the 2WL invariant [17]. However, the performance of verified SAT solvers is not yet competitive with modern (unverified) SAT solving technology [16,17].

Conclusion
This work presents the LPR proof format for verified checking of PR proofs and a compositional proof format for separate (parallel) proof checking. It also demonstrates the feasibility of using binary code extraction to verify cake_lpr, a performant proof checker supporting both formats down to its machine-code implementation. Given the strength of the PR proof system, there is ongoing research into the design of satisfaction-driven clause learning techniques [27,28,49] for SAT solvers based on PR clauses.
Our proof checker opens up the possibility of using a verified checker to help check and debug the implementation of these new techniques. It also gives future SAT competitions the option of providing PR as the default (verified) proof system for participating solvers. Another interesting direction is to add cake_lpr support for other proof formats [5,39]. In particular, this would allow users to build compositional proofs that utilize different underlying proof formats for separate parts of the proof.