Certiﬁed SAT solving with GPU accelerated inprocessing

Since 2013, the leading SAT solvers in SAT competitions all use inprocessing, which, unlike preprocessing, interleaves search with simpliﬁcations. However, inprocessing is typically a performance bottleneck, in particular for hard or large formulas. In this work, we introduce the ﬁrst attempt to parallelize inprocessing on GPU architectures. As one of the main challenges in GPU programming is memory locality, we present new compact data structures and devise a data-parallel garbage collector. It runs in parallel on the GPU to reduce memory consumption and improve memory locality. Our new parallel variable elimination algorithm is roughly twice as fast as previous work. Moreover, we augment the variable elimination with the ﬁrst parallel algorithm for functional dependency extraction in an attempt to ﬁnd more logical gates to eliminate that cannot be found with syntactic approaches. We present a novel algorithm to generate clausal proofs in parallel to validate all simpliﬁcations running on the GPU besides the CDCL search, giving high credibility to our solver and its use in critical applications such as model checkers. In experiments, our new solver ParaFROST solves numerous benchmarks faster on the GPU than its sequential counterparts. With functional dependency extraction, inprocessing in ParaFROST was more effective in reducing the solving time. Last but not least, all proofs generated by ParaFROST were successfully veriﬁed.

Since 2013, simplification techniques [9,18,21,25,58] are also used periodically during SAT solving, which is known as inprocessing [4,5,8,27]. Applying inprocessing iteratively to large problems can be a performance bottleneck in SAT solving procedure, or even increase the size of the formula in case of bounded variable elimination [9,18,58], negatively impacting the solving time.
Graphics processors (GPUs) have become more and more attractive for general-purpose computing with the availability of programming models such as Open Computing Language (OpenCL) [38] and the Compute Unified Device Architecture (CUDA) [39]. On the one hand, OpenCL is portable to both CPUs and GPUs manufactured by different vendors such as AMD and NVIDIA. On the other hand, CUDA is designed by NVIDIA to make the optimum use of their hardware capabilities in accelerating applications that are computationally intensive w.r.t. data processing. For instance, we have applied CUDA to accelerate explicitstate model checking [12,[65][66][67], bisimilarity checking [64], wind turbine emulation [37], term rewriting [61,62], metaheuristic SAT solving [68,69], SAT-based test generation [43] and bounded model checking [42,47]. Recently, we introduced SIGmA [44,45] as the first SAT simplification preprocessor to utilize GPUs.
In [48], we introduced the first SAT solver (ParaFROST) with GPU accelerated inprocessing which supports various simplification rules to rewrite a SAT formula into a compact equisatisfiable one with fewer variables and/or clauses. Preprocessing is done only once before the solving starts, while in inprocessing, this is done periodically during the solving. A new dynamically expanded data structure was introduced for clauses supporting both 32bit [19] and 64-bit references with a minimum of 16 bytes per clause. In addition, a new parallel garbage collector was presented, tailored for GPU inprocessing.
In our latest work [47], we have worked on compacting the data structure used in ParaFROST as much as possible, while still allowing for the application of effective solving optimisations. Furthermore, we introduced the memory-aware Bounded Variable Elimination (BVE), to avoid running out of memory due to adding too many new clauses. In practice, we experienced this problem when applying the original procedure of [48] for Bounded Model Checking (BMC).
Contributions The current article is an extension of our earlier papers [47,48], to which we have added the following original contributions: The BVE in ParaFROST can reduce the number of added clauses by detecting in parallel logical gates that are syntactically translated to Conjunctive Normal Form (CNF) using Tseitin encoding [59]. Such definition of x is written as x ↔ f (v 1 , . . . , v n ). The simplest example is the AND gate x ↔ v 1 ∧v 2 . However, this approach fails to recognize irregular gates not encoded as simple gate types such as AND, XOR or If-Then-Else gates. In this work, we provide a semantic way using functional dependency extraction to detect any gate definition in parallel that could not be found by the syntactic approach. We use function tables and binary magic numbers to compress the variables to bit-vectors [29]. Moreover, the number of added clauses is further reduced by finding and resolving a Minimal Unsatisfiable Set (MUS) of clauses representing the gates [34].
While the impact of accelerating these procedures has been demonstrated in [47,48], their correctness in refuting a formula has not yet been addressed, especially if used in critical applications such as BMC [47]. If the solver claims that a formula is satisfiable, the generated solution (model) can be checked linearly in the size of the formula. However, if a solver declares a formula is unsatisfiable (i.e. has no solutions), there is no guarantee that the GPU code is sound and correct due to ill logic or data hazards that could be introduced at the implementation level. Certifying SAT solvers is becoming crucial to validate the results in tools such as theorem provers and model checkers. For this reason, we propose an efficient parallel approach to generate clausal proofs for the GPU-accelerated simplifications in DRAT (Deletion Resolution Asymmetric Tautology) format [24,63] with two goals: the proof should be compact and not to pose an overhead to the GPU solver.
We provide a comprehensive evaluation of the proposed algorithms with and without clausal proof generation. Furthermore, ParaFROST with the new extensions is compared to the state-of-the-art sequential solvers developed by the third author.

Preliminaries
All SAT formulas in this article are in CNF format. A CNF formula is a conjunction of clauses m i=1 C i where each clause C i is a disjunction of literals r j=1 j such that (m ≥ 1) and (r ≥ 1). A literal is a Boolean variable x or its negation ¬x. For a literal , var( ) denotes the referenced variable, i.e., var(x) = x and var(¬x) = x. The domain of all literals is L. The domain of all variables is var(L). With L(S), we refer to all literals in S. We interpret a clause C as a set of literals { 1 , . . . , r } representing the clause 1 ∨ . . . ∨ r , and a SAT formula S as a set of clauses {C 1 , . . . , C m } representing the formula C 1 ∧ . . . ∧ C m . Further, we denote the set of all clauses of S in which occurs by S = {C ∈ S | ∈ C}. The set of clauses E x = S x ∪ S ¬x is called the environment of x. The set of clauses S | ¬ is called the set of co-factors of S and is defined as S | ¬ = {C\{ } | C ∈ S }.
In this article, we interpret constants and data structures with all-capital letters in the format CONSTANT or STRUCT. All arrays/lists and structure members are named in the format array or member. Function and solver names are written as function or solver. The variables defined within the algorithms have the font shape variable.
The GPU-accelerated inprocessing is integrated with the CDCL [46,54] search algorithm. One important feature of CDCL is to learn from previous assignments to prune the search space and make better decisions in the future. This learning process involves the periodic adding of new learnt clauses to the input formula while CDCL is running. We consider clauses to be either LEARNT or ORIGINAL (redundant and irredundant in [27] and in the SAT solver CaDiCaL [8]). A LEARNT clause is added to the formula by the CDCL clause learning process, and an ORIGINAL clause is part of the formula from the very beginning. Moreover, each assignment is associated with a decision level that acts as a time stamp, to monitor the order in which assignments are performed. The first assignment is made at decision level one. The number of distinct levels in a clause at which literals are assigned is called the literal block distance (LBD) or glucose level of that clause [2].

Bounded variable elimination
BVE can remove variables completely from the CNF formula by trivially eliminating pure literals [17], applying the resolution rule [17,30,58] or gate-equivalence reasoning [18,32,49]. Definition 1 (Pure literal) For a formula S, a literal is called pure iff S ¬ = ∅. A pure literal can be eliminated from S, resulting in the new formula S = S\S . Definition 2 (Resolution rule) Let us consider two clauses C 1 and C 2 such that for some variable x, we have x ∈ C 1 and ¬x ∈ C 2 . We represent the application of the rule w.r.t. some variable x using a resolving operator ⊗ x on C 1 and C 2 . Given that x ∈ C 1 and ¬x ∈ C 2 , the operator ⊗ x is defined as The result of applying the rule is called the resolvent [58]. Moreover, The ⊗ x operator can be extended to resolve sets of clauses w.r.t. variable x.
Definition 3 (Resolvents set) For a formula S, let L ⊂ S be the set of learnt clauses when we apply the resolution rule during the solving procedure. The set of new resolvents is then defined as Notice that learnt clauses can be ignored [27] (i.e., in practice, it is not effective to apply resolution on learnt clauses). The last condition ensures that a resolvent is not a tautology.
In gate-equivalence reasoning, we substitute eliminated variables with deduced logical equivalent expressions. Combining gate equivalence reasoning with the resolution rule tends to result in smaller formulas compared to only applying the resolution rule [18,27]. Let G (S) be the gate clauses having as the gate output and H (S) the non-gate clauses, i.e. clauses not contributing to the gate itself. For regular gates (e.g. AND), substitution can be performed by resolving non-gate with gate clauses as follows: In this article, we focus on finding definitions for irregular gates by checking the unsatisfiability of the co-factors formula {S x | ¬x ∪ S ¬x | x }. In [5,9], a BDD-based approach is used to solve the co-factors. In this work, we replace the BDD structure with a function table (bit-vector) encoding clausal core of the co-factors. Checking the bit-vector of the latter can be done effectively on the GPU. The clausal core is mapped back to the original gate clauses G x and G ¬x by adding back x and ¬x, respectively. Then, the set of resolvents

Subsumption elimination
Suppose that we have two clauses C 1 , C 2 and C 2 ⊂ C 1 . In subsumption elimination, C 1 is said to be subsumed by C 2 or C 2 subsumes C 1 . The subsumed clause C 1 is redundant and can be removed [18]. If C 2 is a LEARNT clause, it must be considered as ORIGINAL in the future, to prevent deleting it during learnt clause reduction [8].
Definition 5 (Self-subsuming resolution) The self-subsuming resolution is a special case of subsumption. The former can be applied on clauses C 1 , C 2 iff for some variable x, we have In that case, x can be removed from C 1 .
In this work, with SUB, we refer to the application of self-subsuming resolution followed by subsumption elimination until a heuristic fixpoint is reached.

Example 3
Consider the formula {{a, b, c}, {¬a, b}, {b, c, d}}. The first clause is selfsubsumed by the second clause w.r.t. variable a and can be strengthened to {b, c} which in turn subsumes the last clause {b, c, d}. The latter clause is then removed from S and the simplified formula becomes {{b, c}, {¬a, b}}.

Eager redundancy elimination
ERE is a new elimination technique that we proposed in [48], which repeats the following until a fixpoint has been reached: for a given formula S and clauses C 1 ∈ S, C 2 ∈ S with x ∈ C 1 and ¬x ∈ C 2 for some variable x, if there exists a clause C ∈ S for which C ≡ C 1 ⊗ x C 2 , then let S := S\{C}. In this method, we restrict removing C to the condition That is, if C is LEARNT or the resolved clauses (C 1 , C 2 ) and C are ORIGINAL, then C is called a redundancy and can be removed.
Note that this method is entirely different from Asymmetric Tautology Elimination in [25]. The latter requires adding so-called hidden literals to all clauses to check which is a hidden tautology. ERE can operate on learnt clauses and does not require literals addition, making it more effective and adequate to data parallelism. The effectiveness of ERE relies on the fact that resolution is unit propagation. Consider the previous example, if the solver sets c to true or false, then either a or ¬b can be directly assigned as implications, forcing the solver to propagate the fourth clause {¬b, a} in which the same implications are disjoined. Hence, removing the latter from the formula, saves unnecessary overhead during unit propagation.

DRAT clausal proof
DRAT is a clausal proof format that can be constructed from smaller lemmas. Each line of the proof stores a lemma which is either a sequence of literals terminated by 0 or a deletion instruction (a prefix represented by the character d). The Lemmas are emitted to an output file in DIMACS format [28] and are validated using both the Resolution Asymmetric Tautology (RAT) [24] and Reverse Unit Propagation (RUP) [23] checks via the external proof checker drat-trim [63]. The mathematical notions behind RAT and RUP are out of this article scope. Both learnt clauses and resolvents are considered lemma additions. Clause deletions are not essential for the proof to succeed; however, they help reduce the computation time in verifying the proof. A proof terminating with an empty clause (i.e. a line containing only a zero), declares the input formula is UNSAT.

GPU architecture
Since 2007, NVIDIA has been developing a parallel computing platform called CUDA [39] that allows developers to use GPU resources for general purpose processing. A GPU contains multiple streaming multiprocessors (SMs), each SM consisting of an array of streaming processors (SPs) or cores. Every SM can execute multiple threads grouped together in 32thread scheduling units called warps.
GPU Kernel A GPU computation can be launched in a program by the host (CPU side of a program) by calling a GPU function called a kernel, which is executed by the device (GPU side of a program). When a kernel is called, it is specified how many threads need to execute it. These threads are partitioned into thread blocks of up to 1,024 threads (or 32 warps). Each block is assigned to an SM. All threads together form a grid. Threads and blocks can be indexed by a one-dimensional, two-dimensional, or three-dimensional unique identifier (ID) accessible within the kernel. By using this ID, we can achieve that different threads in the same block work on different data. A hardware warp scheduler evenly distributes the launched blocks to the available SMs.
We express a thread dimension with a bold italic font dimension. For example, threads or blocks can be launched in the x or y or z dimension. Additionally, in our developed kernels, we use two conventions. First of all, with t x , we refer to the block-local ID of the working thread in x. Second of all, we use so-called grid-stride loops to process data elements in parallel. The statement for all tid ∈ 0, N do in parallel expresses that all natural numbers in the range [0, N ) must be considered in the loop, and that this is done in parallel by having each executing thread start with element t x , i.e., tid = t x , and before starting each additional iteration through the loop, the thread adds to tid the total number of threads on the GPU.
If the updated tid is smaller than N , the next iteration is performed with this updated tid. Otherwise, the thread exits the loop. A grid-stride loop ensures that when the range of numbers to consider is larger than the number of threads, all numbers are still processed.
Memory Hierarchy Concerning the memory hierarchy, a GPU has multiple types of memory: • Global memory with high bandwidth but also high latency is accessible by both GPU threads and CPU threads and thus acts as interface between CPU and GPU. • Constant memory is read-only for all GPU threads. It has a lower latency than global memory, and can be used to store any pre-defined constants. • Shared memory is on-chip memory shared by the threads in a block. Each SM has its own shared memory. It is much smaller in size than global and constant memory (in the order of tens of kilobytes), but has a much lower latency. It can be used to efficiently communicate data between threads in a block. • Registers are used for on-chip storage of thread-local data. They are very small, but provide the fastest memory and the possibility for threads in a warp to exchange register data.
Regarding atomicity, a GPU is capable of executing atomic operations on both global and shared memory. A GPU atomic function typically performs a read-modify-write memory operation on one 32-bit or 64-bit word.
Optimisations In this work, we use unified memory [39] to store the main data structures that need to be regularly accessed by both the CPU (host) and the GPU (device). Unified memory creates a pool of managed memory that is virtually shared between the host and the device. This pool is accessible to both sides using the same addresses.
To hide the latency of global memory, ensuring that the threads perform coalesced accesses is one of the best practices. When the threads in a warp try to access a consecutive block of 32bit words, their accesses are combined into a single (coalesced) memory access. Uncoalesced memory accesses can, for instance, be caused by data sparsity or misalignment.
To maximise the bandwidth of memory transfers from device and host arrays allocated via dedicated memory (non-unified), we use page-locked (or pinned) memory. Memory allocations on host are pageable by default and the GPU cannot access data directly from pageable host memory. Therefore, when a data transfer from device to host pageable memory (and vice versa) is invoked, the CUDA driver must first allocate a temporary pinned buffer, and copy the data to the buffer first before it reaches its destination. We can avoid this extra transfer by directly allocating a host array in the pinned memory. However, pinned-memory allocations should be avoided for large data structures (a SAT formula, for instance) as they may reduce the physical memory available for the operating system.

Data structures
To efficiently implement inprocessing techniques (i.e. Variable-Clause Eliminations (VCE)) for GPU architectures with the ability to generate clausal proof, we adopted the clause data structure described in our latest work [47]. The new structure requires only 12 bytes of bookkeeping, compared to 16 bytes consumed by our initial design in [48] excluding literals. Figure 1a, b shows the proposed structures to store a clause (denoted by SCLAUSE) and the SAT formula represented in CNF form (denoted by CNF), respectively. The following information is stored for each clause: • The state field (2 bits) stores if the state is ORIGINAL, LEARNT or DELETED. can still be used before it gets deleted during database reduction. As a heuristic, LEARNT clauses are used at most twice [8,55]. • The added field (1 bit) is used to mark a clause as resolvent.
• The flag field (1 bit) marks the clause when it contributes to a gate (when applying substitution). • The literal block distance (lbd) (26 bits) stores the number of decision levels contributing to a conflict, if there is one [2]. A maximum value of 2 26 turns out to be sufficient. This field is updated when the clause is altered. Both used and lbd can be altered via clause strengthening [8] in SUB. • The size (32 bits) of the clause, i.e., the number of literals.
• A signature sig (32 bits) is a clause hash, for fast clause comparison [18].
In addition, a list of literals is stored, each literal taking 32 bits (1 bit to indicate whether it is negated or not, and 31 bits to identify the variable). In total, a clause requires 12 + 4t bytes, with t the number of literals in the clause. For comparison, MiniSat only requires 4 + 4t bytes, but it does not involve the used, lbd and sig fields, thereby not supporting the associated heuristics. CaDiCaL [8] uses 28 + 4t bytes, since it applies solving and VCE via the same clause structure. In ParaFROST, the GPU is only used for VCE; therefore, heuristic information for probing [35] and vivification [51], for instance, is irrelevant.
As implemented in MiniSat, we use the clauses field in CNF (Fig. 1b) to store the raw bytes of SCLAUSE instances with any extra literals in 4-byte buckets with 64-bit reference support. The cap variable indicates the total memory capacity available for the storage of clauses, and size reflects the current size of the list of clauses. We always have size ≤ cap. The references field is used to directly access the clauses by saving for each clause a reference to their first bucket. The mechanism for storing references works in the same way as for clauses.
In a similar way, an occurrence table structure (Fig. 1c), denoted by OT, is created to record the 64-bit clause references for each literal in the formula. The references to all clauses in the formula are stored in a single container called occurrences in OT. The lists array in OT is created of type occurrence list (OL) to facilitate direct access to the occurrences memory by saving for each literal a pointer to its first occurrence. The creation of an OL instance is done in parallel on the GPU for each literal using atomic instructions. For each clause C, a thread is launched to insert the occurrences of C's literals in the associated lists.
Initially, we pre-allocate unified memory for clauses and references which is in size twice as large as the input formula, to guarantee enough space for the original and learnt clauses. During variable elimination, every GPU thread checks first before adding new clauses if there is enough memory allocated using the cap variable in CNF. Variables exceeding memory boundaries are skipped from elimination. Hence, BVE is guaranteed to terminate safely no matter how much memory is allocated [47]. More on this is in Sect. 8. The OT memory is reallocated dynamically if needed after each variable elimination. Further, we check the amount of free GPU memory before allocation. If no memory is available, the inprocessing step is skipped and the solving continues on the CPU.

Parallel garbage collection
Modern sequential SAT solvers implement a garbage collection (GC) algorithm to reduce memory consumption and maintain data locality [2,8,19].
Since GPU global memory is a scarce resource and coalesced accesses are essential to hide the latency of global memory (see Sect. 3.1), we decided to develop an efficient and parallel GC algorithm for the GPU without adding overhead to the GPU computations. Figure 2 visualises the proposed approach for a simple SAT formula shows, in addition, how the references and clauses lists in Fig. 1b are updated for the given formula. The reference for each clause C is calculated based on the sum of the sizes (in buckets) of all clauses preceding C in the list of clauses. For example, the first clause C 1 requires 12 + 4t = 24 bytes or CB + t buckets, where a bucket consists of four bytes, and the constant CB is the number of buckets needed to store SCLAUSE, in our case 12 bytes / 4 bytes. Given the number of buckets needed for C 1 is 6, the next clause (C 2 ) must be stored starting from position 6 in the list of clauses. This position plus the size of C 2 determines in a similar way the starting position for C 3 , and so on.
The first step towards compacting the CNF instance when C 2 is to be deleted is to compute a stencil and a list of corresponding clause sizes in terms of numbers of buckets. In this step, each clause C i is inspected by a different thread that writes a '0' at position tid of a list named stencil if the clause must be deleted, and a '1' otherwise. The size of stencil is equal to the number of clauses. In a list of the same size called buckets, the thread writes at position tid '0' if the clause will be deleted, and otherwise the size of the clause in terms of the number of buckets.
At step 2, a parallel exclusive-segmented scan operation is applied on the buckets array to compute the new references. In this scan, the value stored at buckets[tid], masked by the corresponding stencil, is the sum of the values stored at positions 0 up to, but not including, tid. An optimised GPU implementation of this operation is available via the CUDA CUB library, 1 which transforms a list of size n in log(n) iterations. In the example, this results in C 3 being assigned reference 6, thereby replacing C 2 .
At step 3, the stencil list is used to update references in parallel. The DeviceSelect::Flagged standard function of the CUB library can be deployed for this, keeping clause references in consecutive positions via stream compaction [11]. Finally, the actual clauses are copied to their new locations in clauses. Algorithm 1 describes in detail the GPU implementation of the parallel GC. As input, Algorithm 1 requires a SAT formula S in as an instance of CNF. The constant CB is kept in GPU constant memory for fast access. The highlighted lines in light green are executed on the GPU. To begin GC, we count the number of clauses and literals in the S in formula after simplification has been applied (line 1). The counting is done via the parallel reduction kernel countSurvived, listed at lines 7-33.
The values rCls and rLits at line 8 will hold the current number of clauses and literals, respectively, counted by the executing thread. The vcalue b is used as a loop counter and initially holds the current block size. These variables are stored in the thread-local register memory. Within the loop at lines 9-14, the counters rCls, rLits are updated incrementally if the clause at position tid in clauses is not deleted. Once a thread has checked all its assigned clauses, it stores the counter values in the block-local shared memory arrays (shCls, shLits) at line 16.
A non-participating thread simply writes zeros (line 18). Next, all threads in the block are synchronized by the syncThreads call. The loop at lines 21-27 performs the actual parallel reduction to accumulate the number of non-deleted clauses and literals in shared memory within thread blocks. In each iteration, the counter b is divided by 2 until it is equal to 32 (note that blocks always consist of a power of two number of threads). The last 32 threads assembling a full warp reduce the data in the shared memory via warp shuffle reduction (line 28). This operation allows all-reduce direct communication between the threads in a single warp without the need for synchronization.
The total number of clauses and literals per block is in the end stored by thread 0 in the shared memory, and this thread adds those numbers using atomic instructions to the globally stored counters numRefs and numLits at lines 30-31, resulting in the final output. In the procedure described here, we prevent having each thread perform atomic instructions on the global memory, by which we avoid a potential performance bottleneck. The computed numbers are used to allocate enough memory for the output formula at line 2 on the CPU side.
The kernel computeStencil, called at line 3, is responsible for checking clause states and computing the number of buckets for each clause. The computeStencil kernel is given at lines 34-43. If a clause C is set to DELETED (line 37), the corresponding entries in stencil and cindex are cleared at line 38, otherwise the stencil entry is set to 1 and the cindex entry is updated with the number of clause buckets.
The exclusiveScan routine at line 4 calculates the new references to store the remaining clauses based on the collected buckets. For that, we use the exclusive scan method offered by the CUB library. The compactRefs routine called at line 5 groups the valid references, i.e., those flagged by stencil, into consecutive values and stores them in references(S out ), which refers to the references field of the output formula S out . Finally, copying clause contents (literals, state, etc.) is done in the copyClauses kernel, called at line 6. This kernel is described at lines 44-51. If a clause in S in is flagged by stencil via thread tid, then a new SCLAUSE reference is created in clauses(S out ), which refers to the clauses field in S out , offset by cindex[tid]. The & symbol at line 47 denotes a memory reference. At line 48, the actual data in S in [tid] is copied to the new destination C dest .
The GC mechanism described above resulted from experimenting with several less efficient mechanisms first. In the first attempt, two atomic additions per thread were performed for each clause, one to move the non-deleted clause buckets and the other for moving the corresponding reference. However, the excessive use of atomics resulted in a performance bottleneck and produced a different simplified formula on each run, that is, the order in which the new clauses were stored depended on the outcome of the atomic instructions. The second attempt was to maintain stability by moving the GC to the host side. However, accessing unified memory on the host side results in a performance penalty, as it implicitly results in copying data to the host side. The current GPU approach is faster and always results in the same output formula because both segmented scan and stream compaction preserve the original data order.

Proof memory management
In this work, we adopt the variable-length encoding in generating the binary DRAT proof. This format saves significant amount of memory, particularly on the GPU side (ideally, by a factor of 3 as reported in [63]). Let l and −l be the positive and negatives integers to represent the literals resp. ¬ in DIMACS format. To encode l in the binary form, it has to be mapped first to the unsigned literal l + : The mapped value can then be compressed into a variable-byte sequence of 7-bit words (w i ): The 8th bit of a byte in this sequence indicates whether there are still more bytes to follow. Moreover, every sequence has two additional bytes. The first byte acts as a prefix to express whether a lemma is added (character 'a' or 61 in hexadecimal) or deleted (character 'd' or 64 in hexadecimal). The last byte is zero to mark the end of the lemma. Fig. 3. Eliminating the variables 2 and 4 yields the resolvents {1, 3}, {1, -3}, and {-1, 5}, {-1, -5}, respectively. Further, eliminating the variables 3 and 5 produces two unit clauses {1} and {-1}, respectively. Thus, the formula is declared unsatisfiable due to the contradicting units. In the middle, the DRAT proof is provided by the ParaFROST solver, revealing all resolvents added after each resolution step. A binary-equivalent DRAT format is also shown on the rightmost side.

Example 5 Consider the CNF formula in
Before applying certified simplifications on the GPU, an upper-bound of memory space required to store the binary p proof is calculated for all literals. The idea is to compute, per unsigned literal l + , the minimum number of bytes as indicated by words(l + ). To do that efficiently on the GPU, one can count the number of leading zeros (Z ) in the bit string of the integer value using the intrinsic function clz. By subtracting Z from 32, we get the position of the most-significant high bit M (i.e. minimum number of bits to represent l + ). Dividing the latter by 7 (remember binary DRAT uses 7-bit wording), gives the lower bound of the number of words W . To get the upper bound, W needs to be rounded up using the integer  Figure 4 gives a working example of the parallel computation of the upper-bound for the literals 67,713, −63, 64, and −67713. Initially, the literals are mapped to the unsigned integers 135,426, 127, 128, and 135,427, respectively. Next, each thread calculates the minimum number of bytes per literal as described above and the results are rounded up using the lookup table LUP stored in the constant memory. For example, 135,427 would occupy a minimum of 3 bytes to store. Finally, parallel reduction is applied on the pbytes array to sum up the contents and obtain the upper bound (9 in this example). Ideally, we need a memory space equal to the literals upper bound plus 2 times the number of clauses in a CNF formula (recall that DRAT requires two additional bytes per clause). However, in practice, 1.5 times this bound is needed to guarantee enough space for emitting the proof on the GPU side.
Algorithm 2 lists in detail the GPU proof memory allocator. It takes as input the formula S and the lookup table LUP. First, all clauses are flattened into consecutive unsigned literals and stored in the array literals. At line 2, the kernel countProof is launched (given at lines 4-31) to create both the pbytes array and the memory bound proofbound. The former is needed as reference to emit the proof later in VCE using atomic instructions.
The variable rBytes at line 5 will hold the current number of bytes counted by the executing thread. Again, the value b initially holds the current block size (used later as a loop counter). Within the loop at lines 6-14, the counter rBytes is updated incrementally if the value pbytes[l + ] is zero, i.e., the number of bytes has not been computed before for the literal l + . Notice that we subtract the bit position clz(l + ) at line 9 from 30 rather than 32 as the table is indexed from 0 to 30 (see Fig. 4). Having a non-zero value at pbytes[l + ] means the current literal is a duplicate and its variadic size is already computed before. Accordingly, in this case, we rely on data racing and thread divergence, which contradicts the convention of parallel programming. The following example explains this phenomenon.

Example 6
Suppose we have a set of unsigned literals = {3, 3, 5, 6} and four threads t 0 , . . . , t 3 where t i represents the tid of thread i. In Algorithm 2, when t 0 and t 1 , both inspect literal 3, the following two scenarios are possible: 1. Either one of the threads t 0 or t 1 is faster than the other and takes the control path at lines 8-10, updating pbytes[3] to 1. The other thread has seen the new updated value pbytes[3] = 1; thus, taking the control path at lines 11-12. In that case, the more threads executing that path, the better. 2. Both threads t 0 and t 1 check the condition at line 8 at the exact same time. Thus, they both do the counting and update pbytes [3] simultaneously. This is not problematic, as they both write the same value.
Once a thread has checked its literal, it stores the counter value in the block-local shared memory array shBytes at line 16. The values in shBytes are then reduced in parallel to the global variable proofbound. With this value, memory is allocated to the proof stream P in bytes at line 3.

Variable scheduling
In our GPU-accelerated inprocessing, each simplification method is applied on multiple variables simultaneously. Doing so may lead to data hazards, due to the disjunction between literals in all clauses (i.e. data dependency). Two variables x and y are dependent iff there exists a clause C with (x ∈ C ∨¬x ∈ C)∧(y ∈ C ∨¬y ∈ C). If the two dependent variables x and y were to be processed for simplification, two threads might manipulate C at the same time. To guarantee soundness of the parallel simplifications, we apply our least constrained variable elections algorithm (LCVE) [44] prior to simplification. It is responsible for electing a set of mutually independent variables (candidates) from a set of authorised candidates. The remaining variables relying on the elected ones are frozen.
Moreover, we map no more than 12 frozen variables to local variables named q 1 to q 12 . Any variable beyond this range is set to 0 (i.e. a unique stamp to identify out-of-range variables). The mapped variables are used later in BVE to build the function tables (denoted by funTab) with size 2 12 = 4096 bits.
The authorised, elected and frozen candidates are defined as follows: Definition 6 (Authorised candidates) Given a formula S, we call A the set of authorised candidates:

is a histogram array (h[x] is the number of occurrences of x in S).
• μ denotes a given maximum number of occurrences allowed for both x and its complement, representing the cut-off point for the LCVE algorithm.

Definition 7 (Candidate Dependency Relation)
We call a relation D: A × A a candidate dependency relation iff ∀x, y ∈ A, x D y implies that ∃C ∈ S.(x ∈ C ∨ ¬x ∈ C) ∧ (y ∈ C ∨ ¬y ∈ C) Definition 8 (Elected candidates) Given a set of authorised candidates A, we call a set Φ ⊆ A a set of elected candidates iff ∀x, y ∈ Φ. ¬(x D y)

Definition 9 (Frozen variables) Given the sets A and Φ, the set of frozen variables
Before LCVE is executed, a sorted list of the variables in S needs to be created, ordered by the number of occurrences in that formula, in ascending order (following the same rule as in [18]). From this list, the authorised candidates A can be straightforwardly derived, using μ as a cut-off point. Construction of this list can be done efficiently on a GPU using Algorithm 3. As input, it requires a SAT formula S and a cut-off point μ. At line 1, a histogram array h, providing for each literal the number of occurrences in S, is constructed. This histogram can be constructed on the GPU using the histogram method offered by the Thrust library. 2 Once assignScores kernel execution has terminated, at line 2, the candidates in A are sorted on the GPU based on their scores in scores while μ is used to prune candidates with too many occurrences. We used the radix-sort algorithm as provided in Thrust.
In assignScores, at line 6, the thread index is used as a variable index (variable indices start at 1). At lines 7-11, a score is computed for the currently considered variable x. This score should be indicative of the number of resolvents produced when eliminating x, which depends on the number of occurrences of both x and ¬x, and can be approximated by the formula h[x] × h [¬x]. To avoid score zero in case exactly one of the two literals does not occur in S, we consider that case separately. LCVE Algorithm Next, Algorithm 4 is executed on the host, given S, A, h and an instance of OT named T . This algorithm accesses 2 · |A| number of OL instances and parts of S. The use of unified memory significantly improves the rates of the resulting transfers and avoids explicitly copying entire data structures to the host side. Initially, all elements in F map are set to 0 (line 1). From this point forward, all violet routines or notations suggest a new contribution compared to our previous work in [47,48]. Afterwards, the algorithm considers all variables x in A (line 2). If x has not yet been frozen (line 3), it adds x to Φ (line 4). Next, the algorithm needs to identify all variables that depend on x. For this, it iterates over all clauses containing either x or ¬x (line 5), and each literal in those clauses is compared to x (lines 6-8). If refers to a different variable v, then v must be frozen. In addition, we map v to a value q in the range 1 ≤ q ≤ 12 and store it in F map (lines 10-12).
A top-level description of GPU parallel inprocessing is shown in Algorithm 5. As input, it takes the current formula S h from the solver (executed on the host) and copies it to the device global memory as S d (line 1). Initially, before simplification, we compute the clause signatures and sort clause literals via stream 0 at line 2 (prepareFormula procedure). Concurrently, via stream 1, variables are ordered at line 3. A stream is a sequence of instructions that are executed in issue-order on the GPU [39]. The use of concurrent streams allows the running of multiple GPU kernels concurrently, if there are enough resources. The orderVariables routine produces an ordered array of authorised candidates A following Definition 6.
At line 4, Algorithm 2 is executed via the proofAllocator routine on the same stream as prepareFormula. The space allocated for P d resides in global memory; whilst P h gets a pinned memory space (see Sect. 3.1) on the host side with the same size as P d .
The for loop at lines 5-21 applies SUB and BVE, for a configured number of iterations (indicated by phases), with increasingly large values of the threshold μ. Increasing μ exponentially allows LCVE to elect additional variables in the next elimination phase since after a phase is executed on the GPU, many elected variables are eliminated. In addition, mapping a new set of frozen variables is essential for the effectiveness of funTab in finding new gate definitions. The ERE method is computationally expensive. Therefore, it is only executed once in the final iteration, at line 11. At line 6, syncAll is called to synchronize all streams being executed. At line 7, the occurrence table T is created. Next, the lcve routine produces the set Φ (see Definition 8) as explained earlier in Algorithm 4.
The parallel creation of the occurrence lists in T results in the order of these lists being chosen non-deterministically. Directly applying the eliminate procedure called at line 15, which performs the parallel simplifications, would produce results non-deterministically as well. To remedy this effect, the lists in T are sorted according to a unique key in ascending order before eliminate is called. Besides the benefit of stability, this allows SUB to abort early when performing subsumption checks.
The sorting key is given as the device function listKey at lines 22-29. It takes two references a, b and fetches the corresponding clauses C a , C b from S d (line 23). First, clause sizes are tested at line 24. If they are equal, the first and the last literal in each clause are checked, respectively, at lines 25-26. If the literals are equal, clause signatures are tested at line 27. Otherwise, clause references are compared at line 28. These references are always distinct; thus, they guarantee sorting stability. However, they should be tested as a last resort. Experiments have shown that using only clause reference in addition to its size has a negative impact overall on the CDCL search. CaDiCaL implements a similar function, but only considers clause sizes [8]. The sortOT routine launches a kernel to sort the lists pointed to by the variables in Φ in parallel. Each thread runs an insertion sort to in-place swap clause references using listKey.
The eliminate procedure at line 15 calls SUB to remove any subsumed clauses or strengthen clauses if possible, after which BVE is applied. The SUB and BVE methods call kernels that scan the occurrence lists of all variables in Φ in parallel. More information on this is in Sects. 8 and 9. Both the BVE and SUB methods emit the proof to P d and may add new unit clauses atomically to a separate array U d . The propagation of these units cannot be done immediately on the GPU due to possible data races, as multiple variables in a clause may occur in unit clauses. For instance, if we have unit clauses {a} and {b}, and these would be processed by different threads, then a clause {ā,b, c} could be updated by both threads simultaneously. Therefore, this propagation is delayed until the next iteration, and performed by the host at line 8. Note that T must be recreated first to consider all resolvents added by BVE during the previous phase. The ERE method at line 11 is executed only once at the last phase (phases) before the loop is terminated. Section 10 explains in detail how ERE can be effective in simplifying both ORIGINAL and LEARNT clauses in parallel. Again, clausal proof of ERE correctness is emitted to P d .
At line 16-17, the proof stream and new units are copied from the device to the host arrays P h and U h , respectively. The data transfers are done asynchronously via stream1 and stream2. Similar to P h , the U h array is allocated in pinned host memory. It should be noted that asynchronous data transfers to the host are only permitted if the host memory is pagelocked [39]. The collect procedure does the GC as described by Algorithm 1 via stream3. At line 19, we synchronise the proof data transfer performed by stream1 and write the byte stream to the proof output file. Other active streams are synchronised at line 6.

Three-phase parallel variable elimination
The BVIPE algorithm in our previous work [44] had a main shortcoming due to the heavy use of atomic operations in adding new resolvents. Per eliminated variable, two atomic instructions were performed, one for adding new clauses and the other for adding new literals. Besides performance degradation, this also resulted in the order of added clauses being chosen non-deterministically, which impacted reproducibility (even though the produced formula would always at least be logically the same).
The approach to avoiding the excessive use of atomic instructions when adding new resolvents is to perform parallel BVE in three phases. The first phase scans the constructed list Φ to identify the elimination type (e.g., resolution or gate substitution) of each variable and to calculate the number of resolvents and their corresponding buckets.
The second phase computes an exclusive scan to determine the new references for adding resolvents, as is done in our GC mechanism (Sect. 4). At the last phase, we store the actual resolvents in their new locations in the simplified formula. For solution reconstruction, we use an atomic addition to count the resolved literals. The order in which they are resolved is irrelevant. The same is done for emitting the proof and adding units. For the latter, experiments show that the number of added units and proof bytes are relatively small compared to the eliminated variables, 3 hence the penalty of using atomic instructions is almost negligible. It would be an overkill to use a segmented scan for adding proof bytes or units.
At line 1 of Algorithm 6, phase 1 is executed by the VceScan kernel (given at lines [5][6][7][8][9][10][11][12][13][14][15][16][17][18]. Every thread scans the clause set of its designated literals x and ¬x (line 7). References to these clauses are stored at T x and T ¬x . Moreover, register variables t, rCls, rLits are created to hold the current elimination type (i.e. NONE, RES, SUBST or CORE), number of added clauses, and number of added literals of x, respectively (line 8). If x is pure at line 10, then there are no resolvents to add and the clause sets of x and ¬x are directly marked as DELETED by the routine toblivion. Moreover, this routine adds the marked literals atomically to litstack. Note that these clauses are not emitted to the proof. At line 12, we check first if x contributes to a regular logical gate using the routine gateReasoning, and save the corresponding rCls and rLits. If this is the case, the type t is set to SUBST, otherwise we try funTab reasoning at line 13 or resolution at line 14.
The funReasoning procedure at lines 37-46 is responsible for finding irregular gate definitions as explained in Sect. 2.1. At line 38, two Boolean function tables f p , f n are created in threads' local memory. 4 Both are bit-vectors of length 4096 entries initialized to ones. Each table represents a 12-variables Boolean function stored in a vector of 4096 bits. The bit at index i (where i is written as q 12 , q 11 , . . . , q 1 in binary) gives the value of the Boolean function for q 12 , q 11 , . . . , q 1 . Note that the maximum number of variables we support is 12. In real implementation, only enough frozen variables in F are mapped to values in the domain [1,12]. At line 39, we encode the clause sets in S d [T x ] and S d [T ¬x ] into their truth tables f p resp. f n via buildFunTab. More on how buildFunTab is implemented is coming later. If all literals are successfully mapped to the above range, then withinRange is set to true. A gate is found, in case both tables are built and their bit-wise and is all-zeros (i.e. unsatisfiable). The clauses set G can be reduced by finding a shorter clausal core (not necessarily minimal, though). The loop at lines 42-44 removes a clause at a time from G and tests for all-zero bit string via isFalseFun. If G \C is unsatisfiable, C is marked as a non-gate clause. After the loop terminates, all non-flagged clauses in G together form a clausal core.
The loops at lines 48-58 in the buildFunTab function, transform only the frozen literals (i.e. is skipped) in each clause C ∈ S d [T ] to bit-vectors using binary magic numbers. A binary magic number is a unique constant in which a sequence of bits is repeating itself multiple times. These numbers can be used to extract and pack integer values into a single bit string (e.g. the Boolean function table). At line 52, the frozen variable var( ) is mapped to q. If it has the value 0 (line 53), then we bail out immediately with a false. At line 54, magicNum(q) fetches the corresponding magic number from a 64-bit constant array using q's value as an index. With these constants, the truth table of each variable q can be stored in memory. Afterwards, the elementary truth tables are combined using the bitwise operators (or, and) to build the truth table of the formula S d . If the resulting truth table f contains only zeros, the formula is unsatisfiable.
The condition rCls ≤ (|T x | + |T ¬x |) is always tested implicitly by all above routines to limit the number of resolvents per x. The varinfo, rindex, and cindex arrays are updated at line 15. The total number of buckets needed to store all added clauses is calculated by the formula (CB × rCls + rLits) and stored in cindex [tid].
Finally, in phase 3, we use the calculated indices in rindex and cindex to guide the new resolvents to their locations in S d . The kernel is described at lines 19-36. Each thread either calls the procedure resolve or substitute or coreSubstitute, based on the type stored for the designated variables. However, a condition for applying an elimination is that (t = NONE) and there is enough memory space, which is checked using line 25. Any produced units are saved into U d atomically. The cidx and ridx variables indicate where resolvents should be stored in S d per variable x. Similarly, these resolvents are saved into P d as stream bytes using the transformation in words(l + ). Recall that the number of bytes per literal is already stored in pbytes and is not required to be computed again.
The sequential running time of Algorithm 6 is O(m ·|Φ|), where m is the maximum length of a resolved clause in S. In practice, a limit over a resolvent length is set to a small constant value (≤ 100, for instance). Hence, the worst case is linear w.r.t. |Φ|. Consequently, the parallel complexity is O(|Φ|/ p), where p is the number of threads. Since a GPU is capable of launching thousands of threads, that is, p ≈ |Φ|, the parallel complexity is an amortised constant O(1).

Parallel subsumption elimination
Parallel SUB through Algorithm 7 is executed on elected variables before variable elimination. (Self)-subsumption elimination tends to reduce the number of occurrences of these variables as it usually removes many literals and clauses. The parallelism in Algorithm 7 is achieved on the variable level. In other words, each thread is assigned to a variable x and performs subsumption checks on all clauses in E x . At line 5, a new clause is loaded, referenced by T [x], into shared memory C s for faster access.
The shared clause is compared in the loop at lines 6-12 to all clauses referenced by T [¬x] to check whether x is a self-subsuming literal. If so, both the original clause C, which resides in the global memory, and C s must be strengthened (via the strengthen function). If C s is a unit clause, it is added atomically to U d (line 9). At line 10, we write this clause to the proof as an added lemma. Subsequently, the strengthened C s is used for subsumption checking in the loop at lines 13-21. In case the subsuming clause C is LEARNT and C is ORIGINAL, then C must be turned to ORIGINAL (see Sect. 2.2). This time, the subsumed clause is written to the proof as a deleted lemma.
Regarding the complexity of Algorithm 7, the worst-case is that a variable x occurs in all clauses of S. However, in practice, the number of occurrences of x is bounded by the threshold value μ (see Definition 6). The same applies for its complement. In a worst case scenario, a variable and its negation both occur μ times. As SUB considers all variables in Φ and worst case has to traverse each loop μ times, its sequential complexity is O(|Φ| · μ 2 ) and its parallel complexity is O(μ 2 ).

Eager redundancy elimination
Algorithm 8 describes a two-dimensional kernel, in which from each thread ID, an x and y coordinate is derived. This allows us to use two nested grid-stride loops. In the loops, we specify which of the two coordinates should be used to initialise tid in the first iteration.
Based on the kernel's y ID (line 2), each thread merges where possible two clauses of its designated variable x and its complement ¬x (lines 3-6), and writes the result in shared memory as C m . This new clause is produced by the routine resolve at line 6. At lines 7-11, we check if one of the resolved clauses is LEARNT, and if so, the state st of C m is set to LEARNT as well, otherwise it is set to ORIGINAL. This state of C m will guide the forwardEquality routine called at line 12 to search for redundant clauses of the same type. In this function (lines [18][19][20][21][22][23][24][25][26][27], the thread's x ID is used to search the clauses referenced by the minimum occurrence list T min , which is produced by findMinList at line 19. It has the minimum size among the lists of all literals in C m . If a clause C is found that is equal to C m and is either LEARNT or has a state equal to the one of C m , it is set to DELETED (lines 23). Finally, at line 24, the deleted clause is emitted to the binary proof P d .
The worst-case parallel complexity of this algorithm is where p y and p x are the number of launched threads in y resp. x dimensions. Note that |T min | is usually very small compared to the upper bound μ. Hence, the former length can be neglected w.r.t. p x , and the parallel complexity in such case will be O(μ 2 |C m |), that is, the parallel running time is a quadratic function of the upper-bound μ.

Kernel automated tuning
A GPU kernel needs to be configured prior to launch. The kernel configuration sets up the number of threads per block (blockDim) and the total number of blocks per grid (gridDim). These parameters are calculated intuitively based on the data size. As a rule of thumb, the total number of threads (blockDim×gridDim) should cover the whole data given to the kernel to process and not to exceed the limit (p) supported by the GPU. If the data size is larger than p, a good practice is to use the grid-stride loops as discussed in Sect. 3.1. However, the GPU occupancy is crucial to achieve a near-optimal balance of the kernel workload across the GPU resources. The occupancy defines how many SMs are busy (occupied) by the thread blocks. That is, a good occupancy means a fair distribution of the launched blocks across the available SMs. Algorithm 9 illustrates a way to automate the tuning of the kernel configuration for maximum occupancy. It takes as input: the data size N, maximum supported threads p, and the initial block size initBlockDim. The minimum block size minBlockDim and the minimum occupancy minOccupancy are user-defined lower boundaries. Initially, the blockDim value is set to initBlockDim at line 1. Next, gridDim is calculated based on the latter and the data size. This step gives the initial configuration without tuning. At line 3, the maximum number of blocks that can be launched at once is computed based on the initial block size. Given this value and the minimum occupancy desired, minBlocks is obtained at line 4. The loop at lines 5-8 is triggered iff blockDim has not gone lower than the boundary minBlockDim and the gridDim has not reached the limit minBlocks. The goal of this loop is to keep cutting down the blockDim by 2 till a maximum value of gridDim (within bounds) is reached. There is still a possibility that gridDim grows beyond minBlocks; therefore, always the minimum of the most recent value of gridDim and maxBlocks is targeted (line 9). In ParaFROST, we have set the minBlockDim and minOccupancy to 4 resp. 0.5.
We observed that the number of scheduled variables |Φ| in the preceding kernels VceScan, VceApply, SUB, and ERE usually go down as the solver progresses due to the elimination of many variables in the preceding call of the inprocessing procedure. Therefore, we applied the tuner in Algorithm 9 on the previous kernels to maximally increase the number of blocks that are scheduled for execution on the available SMs. This led to an overall reduction in the running time of the inprocessing procedure by 2%.

Setup
We implemented the proposed algorithms in our solver ParaFROST 5 with CUDA C++ version 11.0 [39]. Besides the implementations of our new GPU algorithms, we involved a CPU-only version of ParaFROST (called ParaFROST (noGPU)) for the solving of problems. Additionally, we compare to CaDiCaL and Kissat [8] solvers developed by the third author.
To find benchmarks with potential for simplifications on the GPU (i.e. having sufficient amount of redundant variables and clauses), we have selected all formulas that are larger than 5 MB from the 2013-2021 SAT competitions, excluding redundancies (repeated formulas across competitions). That is, a total of 641 distinct formulas were selected which encode around 80+ different real-world applications, with various logical properties. For reproducibility, the benchmark suite can be downloaded from [40].
We evaluated all GPU experiments on the compute nodes of the Lisa GPU cluster. 6 Each problem was analysed in isolation on a separate computing node, with a time-out of 3600 s. Each computing node has an Intel Xeon Gold 6130 CPU running at a base clock speed of 2.1 GHz with 96 GB of system memory, is equipped with an NVIDIA Titan RTX GPU, and runs on Debian Linux operating system. This GPU has 72 SMs (64 cores each), 24 GB global memory and 48 KB shared memory. It operates at a base clock of 1.3 GHz.
The sequential solvers are executed on the compute nodes of a different cluster called DAS-5 [3] to dedicate our computing hours on Lisa cluster only to the GPU experiments. Each node of DAS-5 had an Intel Xeon E5-2630 CPU (2.4 GHz) with 64 GB of memory. The proofs generated by the GPU solver are verified separately by the drat-trim tool [63] on DAS-5, with a time-out of 20,000 s. It is worth mentioning that the CPU time of a single-threaded task is around 10% faster on DAS-5 compared to Lisa.
With this information, we adhere to four out of five principles laid out in the SAT manifesto (version 1) [10]: 1. Benchmarks should be available for research purposes. 2. Solvers should be available in binary form for research purposes. 3. A recent generic benchmark set (e.g. competition benchmarks) should be chosen among those of the last 3 years. 4. Experimental results should include a comparison with the state of the art. 5. Details on the experimental conditions should be provided (e.g. hardware, OS).
As for the third principle, we refrained from using a single benchmarks set from a particular year, as most of the included benchmarks are very small in size for the GPU to work with (i.e. only few variables and clauses can be removed).

SAT-simplification speedup
The first part of our experiments discusses the speedup obtained by the GPU algorithms for applying GC, BVE, funTab, and proof generation compared to their previous implementations in SIGmA [44,45] or sequential counterparts in ParaFROST (noGPU). For these experiments, we set μ and phases initially to 32 resp. 5. Preprocessing is only enabled to measure the speedup. Figure 5 gives the speedup of running parallel GC against a sequential version on the host. For almost all cases, Algorithm 1 achieved a high acceleration when executed on the device with a maximum speed up of 72.6× and an average of 35×. Figure 6 reveals how fast the 3-phase parallel BVE is compared to a version using more atomic instructions. On average, the new algorithm is 1.5× faster than the old BVIPE algorithm [44]. In addition, we get reproducible results. Figure 7 evaluates the funTab method in Algorithm 6 against the sequential counterpart. Note the logarithmic scale on the y-axis. Cases with zero runtime are ignored. Clearly, the GPU achieved a remarkable speedup in finding general gate definitions compared to the CPU with a maximum of 342× and an average of 11.33×. Likewise, as shown by Fig. 8, the acceleration of proof generation on the GPU is significant, with a maximum speedup of 186× and 12× on average.

SAT solving performance
The second part of experiments provides a thorough assessment of our CPU/GPU solver, the CPU-only version, CaDiCaL, and Kissat on SAT solving with inprocessing turned on. To gain advantage of the GPU resources, preprocessing in ParaFROST is enabled by default while disabled in CaDiCaL and Kissat solvers. We could enable preprocessing in other solvers but we preferred to leave their default options untouched. The timeout is set to 3,600 s for all experiments.  Figure 9 demonstrates the runtime results for all solvers with(out) proof emission over the benchmark suite. The keyword certified means the proof generation is enabled in the solver instance. Data are sorted w.r.t. the x-axis. The simplification time accounts data transfers in ParaFROST. Overall, ParaFROST (even with proof enabled) dominates over ParaFROST (noGPU) and CaDiCaL. Keep in mind that ParaFROST and ParaFROST (noGPU) has the same CDCL engine. Thus, ParaFROST is faster due to the GPU accelerated inprocessing. Figure 10 compares ParaFROST to Kissat with(out) irregular gate reasoning. As indicated by the green and the blue lines, finding such gates has a noticeable impact on both ParaFROST and Kissat more than ParaFROST (noGPU). Recall that the parallel funTab had a considerable speedup compared to its sequential counterpart (see Fig. 7), which explains why funTab is not as competitive in ParaFROST (noGPU) as for ParaFROST. On the other hand, Kissat uses a very different method than funTab to find general definitions. It calls a simple built-in solver called Kitten which is responsible for solving and extracting  the clausal core from variable environments scheduled for elimination. Further, as expected, Kissat is more efficient than both CaDiCaL and ParaFROST. The reason for the solving discrepancies is that ParaFROST CDCL heuristics (run on the host) is based on CaDiCaL which is not up-to-date as Kissat. We expect ParaFROST to compete with Kissat if the same heuristics implemented in the latter is used. Figures 11 and 12 show simplification time and its percentage of the total run time, respectively. Clearly, the ParaFROST solver outperforms the sequential solvers due to the parallel acceleration. Figure 12 tells us that ParaFROST keeps the workload in the majority of cases in the region between 0 and 20% as the elimination methods are scheduled on a bulk of mutually independent variables in parallel. In CaDiCaL and Kissat, variables and clauses are simplified sequentially, which takes more time. Figure 13 reflects the overall efficiency of parallel inprocessing on variables and clauses with(out) funTab on solved formulas with successful clause reductions. Data are sorted in Fig. 13 Reduction efficiency with(out) irregular-gate reasoning descending order. Reductions can remove up to 97% and 80% of the variables and clauses, respectively. Figure 14 gives the heat-map distribution of the time spent on verification and the memory consumed by the proofs generated by ParaFROST [41] and CaDiCaL. All proofs are successfully verified via drat-trim tool. The verification times are reperesented by the colormap and are sorted in descending order w.r.t. the unsatisfiable instances solved. Figure 14a reveals that drat-trim takes more time to verify ParaFROST proofs which appears by the hotspot on the left side of x-axis (ranges between 1.5 × 10 4 and 1.75 × 10 4 ). That is foreseen as the deleted lemmas in Algorithm 6 are not saved to the proof in order to avoid GPU memory exhaustion. On the other hand, CaDiCaL proofs as demonstrated by Plot 14b take less time to verify (e.g. 0.75 × 10 4 to 1 × 10 4 ) due to the saving of all deleted lemmas in BVE which helps drat-trim to cut down the resolution steps. Also, proofs generated by ParaFROST for the formulas 30-40 consume slightly more memory than CaDiCaL owning to the extra resolvents and deleted lemmas produced by the funTab method and ERE, respectively. Those methods are not implemented in CaDiCaL.
Tables 1 zooms into the impact of applying the funTab method in BVE on solving a sample of 30 formulas using ParaFROST and ParaFROST (noFun), respectively. The letters V and C refer to Variables and Clauses, respectively. The keywords org and rem denote original and removed, respectively. Removed clauses include both ORIGINAL and LEARNT types. Bold entries in the V's and C's columns indicate that more variables and clauses are removed by enabling funTab in ParaFROST. For example, funTab allowed ParaFROST to remove 4,682,382 variables in the formula T96.1.1 compared to 4,672,738 by the configuration without funTab. That is 9,644 extra variables are eliminated as a side effect of funTab. Additionally, ParaFROST solved many cases faster than p (noFun) within the time limit (3,600 s). For instance, the formula HCP-446-105 was solved by ParaFROST with funTab in just 907.21 s, while it took 1,020.07 s to solve for ParaFROST without funTab.

Related work
A simple GC monitor for GPU term rewriting has been proposed by [62]. The monitor tracks deleted terms and stores their indices in a list. New terms can be added at those indices. The first author extended the former work by a stream GC similar to the one described in this article but much more simpler [61]. The authors in [1,36] investigated the challenges for offloading garbage collectors to an Accelerated Processing Unit (APU) [56] introduced a promising alternative for stream compaction [11] via parallel defragmentation on GPUs. Our GC, on the other hand, is tailored to SAT solving, which allows it to be simple yet efficient.
Regarding inprocessing, [27] introduced certain rules to determine how and when inprocessing techniques can be applied. [5] presented Lingeling, the first solver with the ability of finding general gate definitions using a BDD-based approach. The works in [7,8,20] introduced a SAT-based variant of mining gate definitions by calling an embedded SAT solver named Kitten as independent component of Kissat. The GPU-based variant presented in this article was inspired by [5,7].
Acceleration of the DPLL SAT solving algorithm on a GPU has been done in [50], where some parts of the search were performed on a GPU and the remainder is handled by the CPU. Incomplete approaches are more amenable to be executed entirely on a GPU, e.g., an approach using metaheuristic algorithms [68]. Recently, [52] used the GPUs to determine the usefulness of a learnt clause for parallel Portfolio-based solvers. Nonetheless, we are the first to work on certified CDCL solvers with GPU accelerated inprocessing.

Conclusion
We have presented compact data structures tailored for SAT inprocessing and various ways to do GPU memory management. Our solver ParaFROST achieved substantial gains through GPU-accelerated inprocessing compared to its sequential version and the state-of-the-art solver CaDiCaL. With the improvements made to the BVE procedure, the usage of atomic operations has been considerably reduced which lead to an average speedup of 1.5× compared to the atomic version. Owing to funTab reasoning, more logical gates can be detected and removed with an average speedup of 11.33× compared to the sequential counterpart.
We proposed the first parallel GC and proof generation on the GPU for SAT applications with average accelerations of 35× and 11×, respectively. The garbage collector helped reduce the GPU memory consumption while stimulating coalesced memory access. The proof generator allowed ParaFROST to validate all the SAT simplifications running on the GPU besides the CDCL search, giving high credibility to our solver and its applicability in critical tools such as model checkers.
Regarding future work, we aim to adopt the inprocessing techniques and the memory management concerning the former to a multi-GPU setup with robust load balancing. Another direction is to use ParaFROST in a Portfolio-based parallel SAT solving and exploit the GPU capabilities in managing the shared clause database as recently introduced by [52]. regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.