Partial Order Reduction for Deep Bug Finding in Synchronous Hardware

Symbolic model checking has become an important part of the verification flow in industrial hardware design. However, its use is still limited due to scaling issues. One way to address this is to exploit the large amounts of symmetry present in many real world designs. In this paper, we adapt partial order reduction for bounded model checking of synchronous hardware and introduce a novel technique that makes partial order reduction practical in this new domain. These approaches are largely automatic, requiring only minimal manual effort. We evaluate our technique on open-source and commercial packet mover circuits – designs containing FIFOs and arbiters.


Introduction
Modern society relies increasingly on electronic systems, powered by hardware components that continue to grow in complexity and variety. Ensuring the functional correctness of these components is essential, as bugs and errors can have consequences ranging from undermining a company's reputation to jeopardizing human safety [1,22,25,32,33]. Most electronic designs must therefore include a significant verification effort, and this effort often consumes more time and resources than all other aspects of the design process [17,34].
Formal methods such as symbolic model checking have become a crucial part of the verification effort because of their strong guarantees and automation [24]. However, due to the state space explosion problem [14], model checking typically only works well for small- to medium-sized circuits with primarily control logic, limiting its potential for addressing industry verification challenges.
One approach for combating the state space explosion problem is partial order reduction [14]. While symbolic partial order reduction has been successfully applied for the verification of asynchronous systems [37], its use in synchronous systems has been limited. In this paper, we introduce a novel approach for adapting symbolic partial order reduction to model checking of synchronous hardware and demonstrate dramatic reductions in the time to reach deep bugs on certain classes of synchronous circuits. Moreover, the technique requires only an interface-level annotation of the circuit, and when fully automated approaches fail, can be guided by the user. The paper makes the following contributions:
1. We adapt partial order reduction for synchronous hardware verification.
2. We introduce a novel technique for reducing the possible inputs to a circuit at a single time step, which is crucial for practical application of partial order reduction to synchronous hardware.
3. We provide a set of sufficient conditions, which, if proven, guarantee that the proposed techniques maintain the reachable states.
4. We introduce conservative proof techniques for verifying these conditions, which empirically work well on packet movers.
5. We evaluate our techniques on a set of open-source and commercial packet mover circuits, demonstrating dramatic speed-ups with minimal manual effort.
The rest of the paper is organized as follows. We first provide a motivating example, below. Then, in Section 2, we cover relevant background material and notation. We explain our partial order reduction in Section 3 and our interface simplification technique in Section 4. We provide an experimental evaluation in Section 5. Section 6 covers related work, and Section 7 concludes.

Motivating Example
Throughout this paper we use the running example shown in Code Snippet 1. We chose this example because: i) it is easy to understand; ii) it resembles real-world packet mover circuits; and iii) it contains a difficult-to-reach bug.
The system has a synchronizing clock and takes two 1-bit inputs: inc_x and inc_y. The 6-bit registers (state elements) x and y index the valid vector and are initialized to 0. The 64-bit registers valid and data start at 0 and 1, respectively. The 64x64-bit memory is uninitialized. If inc_x and en_x are true, the system increments the value of x. When inc_y is true, the system increments y, sets the valid bit at index y, writes data to the memory at location y, and rotates the data vector to the left. Notice that the en_x signal ensures that x never surpasses y (until all bits in valid are set). This incrementing pointer logic is similar to that found in a circular pointer FIFO. To ensure the asserted property, the code attempts to maintain the invariant: data = 1 << y.
At first, it appears that the asserted property should hold based on this invariant, but it does not. There is a bug that can first occur at cycle 65: the overflow check in the data update uses integers, which are assumed to be 32 bits wide. Since y is zero-extended to 32 bits, y+1 can never be equal to 0. Thus, when y has the value 63 and is incremented, data, which is supposed to be one-hot, is set to 0.
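The failure mode can be reproduced in a few lines of Python. This is a minimal sketch, not the paper's Code Snippet 1: the function names and the exact form of the data update (reset to 1 when a 32-bit `y + 1` equals 0, otherwise shift left) are assumptions reconstructed from the description above, and the memory is omitted because it does not affect the invariant data = 1 << y.

```python
# Illustrative model of the running example (assumed update logic, no memory).
N = 64                 # number of entries; x and y are 6-bit pointers
MASK = (1 << N) - 1    # valid and data are 64-bit registers


def step(state, inc_x, inc_y):
    """Apply one clock cycle to state = (x, y, valid, data)."""
    x, y, valid, data = state
    en_x = (valid >> x) & 1          # inc_x only takes effect when valid[x] is set
    nx, ny, nvalid, ndata = x, y, valid, data
    if inc_x and en_x:
        nx = (x + 1) % N
    if inc_y:
        ny = (y + 1) % N
        nvalid = valid | (1 << y)
        # Assumed form of the bug: the wrap-around test is done on a 32-bit
        # zero-extension of y, so y + 1 is 64 (never 0) when y == 63.
        wide = (y + 1) & 0xFFFFFFFF
        ndata = 1 if wide == 0 else (data << 1) & MASK
    return (nx, ny, nvalid, ndata)


def first_violation(max_steps=200):
    """Drive inc_y every cycle until data != 1 << y; return (steps, state)."""
    state = (0, 0, 0, 1)
    for steps in range(1, max_steps + 1):
        state = step(state, False, True)
        _, y, _, data = state
        if data != 1 << y:
            return steps, state
    return None
```

Under these assumptions, `first_violation()` reports that the invariant first breaks after 64 consecutive inc_y steps, when y wraps to 0 and data becomes 0. A model checker has to find this one deep path among all interleavings of the two inputs.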
Although the system is small, this is a surprisingly difficult bug to reach using model checking: all but two of the model checker configurations we tried timed out at 2 hours before reaching the bug. We believe this is due in part to non-determinism. Since bounded model checking (BMC) is one of the best approaches for bug-finding, we focus on improvements to BMC that help reach this bug. We introduce automated, best-effort techniques that reduce the time to hit this bug from over 1000 seconds to 46 seconds by safely adding temporal symmetry breaking constraints to the system.

Background
Before explaining our algorithm, we adapt the standard notion of synchronous transition systems and review fundamental model checking concepts below. For a more thorough introduction to model checking, we refer the reader to [14,15].
Definition 1. A Synchronous Transition System (STS) is a tuple $\langle S, Init, A, En, D, T \rangle$:
- $S$: a set of states
- $Init \subseteq S$: a set of initial states
- $A$: a finite set of atomic actions, the logically distinct operations of the system
- $En = \{en_a \mid a \in A\}$: where $en_a : S \to \mathbb{B}$ is a state predicate that holds iff action $a$ is enabled in a given state
- $D$: a set of data inputs to the system
- $T \subseteq S \times (\mathcal{P}(A) \times D) \times S$: the state transition relation, where $\mathcal{P}$ denotes power set

For our purposes, an STS instruction can perform multiple atomic actions simultaneously. We define the system's instruction set (i.e. the set of actions that the system can perform in one transition) as $I := \mathcal{P}(A)$. We then define the set of inputs of an STS as $Input := I \times D$. Thus, the transition relation $T$ is a subset of $S \times Input \times S$.
We denote the cardinality of an instruction $i$ as $|i|$. For $s, s' \in S$ and $in \in Input$, $T(s, in, s')$ holds iff it is possible to reach $s'$ from $s$ by applying input $in$. It is often convenient to reason about sequences using vector notation. Let $\mathbf{in} \in Input^n$ and $\mathbf{s} \in S^{n+1}$, with $n > 0$. We use subscripts to name individual elements of vectors, e.g. $\mathbf{s} := \langle s_0, s_1, \ldots \rangle$. We use the notation $T(\mathbf{in}, \mathbf{s})$ to denote $\bigwedge_{0 \le i < n} T(s_i, in_i, s_{i+1})$. The length of a vector is given by $|\cdot|$, e.g. $|\mathbf{s}| = n + 1$, and prepending is represented as $\cdot : \cdot$, e.g. $\mathbf{s} = s_0 : \mathbf{s}'$ for some $\mathbf{s}' \in S^n$. With some abuse of notation, we allow prepending both sequences and single elements. For $k > 0$, we say that $\mathbf{s} \in S^k$ is reachable if $\exists n \in \mathbb{N},\ \mathbf{s}' \in S^{n+1},\ \mathbf{in} \in Input^{n+k}.\ Init(s'_0) \wedge T(\mathbf{in}, \mathbf{s}' : \mathbf{s})$.
The set of enabledness predicates $En$ constrains the valid states in which an action can occur. For an instruction $i \in I$ and $s \in S$, let $en_i(s) := \bigwedge_{a \in i} en_a(s)$. In the remainder of the paper, we only consider transition relations $T$ that respect the enabledness conditions. That is, we assume $\forall s, i.\ (en_i(s) \leftrightarrow \exists s', d.\ T(s, \langle i, d \rangle, s'))$. Depending on the context, this can be checked with a model checker or added as an environmental assumption. We also assume that the existence of a transition does not depend on the data input, that is, $\forall s, i.\ (\exists d, s'.\ T(s, \langle i, d \rangle, s') \implies \forall d.\ \exists s'.\ T(s, \langle i, d \rangle, s'))$.
Example 1. We can define an STS for the motivating example. Let $BV_k$ denote the set of all bitvectors of width $k$. Because there is only a single clock with no negative edge behavior, we model the system without the clock, where every transition corresponds to a clock cycle. Define an STS $\langle S, Init, A, En, D, T \rangle$, where:
- $S$ is the set of all valuations of the state variables x, y, valid, data, and mem, with x, y $\in BV_6$, valid, data $\in BV_{64}$, and mem holding 64 words of $BV_{64}$
- $Init$ is the set containing all states where x = 0, y = 0, valid = 0 and data = 1
- $A = \{inc\_x, inc\_y\}$
- $En = \{en_{inc\_x} := (valid[x] = 1),\ en_{inc\_y} := true\}$
- $D = \{nil\}$ (here, nil is just a dummy placeholder used to ensure that $T$ is not empty)
- $T$ is the relation describing the next state updates in Code Snippet 1.
Model Checking. Given an STS $\mathcal{S}$, let a safety property $P \subseteq S$ be a set containing acceptable states. The model checking problem is to determine whether the system stays within this acceptable set for all possible execution traces. Formally, we want to check whether the following holds:

$$\forall n,\ \mathbf{in} \in Input^n,\ \mathbf{s} \in S^{n+1}.\ Init(s_0) \wedge T(\mathbf{in}, \mathbf{s}) \implies P(s_n) \tag{1}$$

When equation (1) holds, we say that $P$ is an invariant of $\mathcal{S}$. A number of techniques exist for solving this problem, including Binary Decision Diagram (BDD)-based [12] approaches, interpolant-based [27] approaches, and IC3/PDR (property directed reachability) techniques [10,16]. We refer the interested reader to [15] for a more complete survey of model checking algorithms.
In this paper, we will focus on bounded model checking (BMC). In BMC, instead of proving (1) for all n, we prove it for all n less than some finite bound k. Though it typically cannot be used to prove properties, BMC can be quite effective at finding bugs [6] and is especially useful when full model checking is infeasible.
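The idea of checking a property only up to a finite bound can be illustrated with an explicit-state search. The sketch below is only an analogue: a real BMC tool unrolls the transition relation symbolically and hands it to a SAT or SMT solver, whereas this code enumerates states. The function `bounded_check` and the 3-bit counter are hypothetical names introduced for illustration.

```python
def bounded_check(init_states, inputs, step, prop, bound):
    """Return the smallest n < bound at which some state reachable in exactly
    n steps violates prop, or None if there is no violation within the bound."""
    frontier = set(init_states)
    for n in range(bound):
        if any(not prop(s) for s in frontier):
            return n
        # Advance every frontier state by every possible input.
        frontier = {step(s, i) for s in frontier for i in inputs}
    return None


# Hypothetical toy system: a 3-bit counter with a 1-bit increment input.
def counter_step(s, inc):
    return (s + inc) % 8


def never_five(s):
    return s != 5
```

With bound 5 the search finds nothing, since the value 5 needs five increments; with bound 6 it returns depth 5. As in BMC, a result of `None` says nothing about depths at or beyond the bound.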
Symmetry. Early on in the development of model checking, researchers recognized the importance of symmetry reduction to combat the state explosion problem [13]. Existing approaches in the hardware domain perform data symmetry reduction and data type reduction through the use of bit-width reduction preprocessing passes or syntactic restrictions such as scalarsets [8,20,28]. There have also been abstraction-refinement loop algorithms proposed to handle memory symmetries [9]. All of these approaches are focused on symmetries present in the transition system description, such as the presence of large data types. We refer to these types of symmetries as data symmetries. Most of these techniques are intended to speed up proofs of true properties rather than accelerate bug-finding.
Model checking of asynchronous systems such as concurrent programs faces an orthogonal issue due to the many possible redundant interleavings of independent processes. Throughout this paper, we refer to this as path symmetry. Path symmetry is a temporal symmetry: it relates to executions of a system rather than just its size. Path symmetries occur when there are many distinct ways of reaching the same state in a system execution. Exploring all such paths can result in exponential case splitting.
This paper provides evidence that path symmetry can also severely hurt model checking performance in synchronous systems. One of the first techniques proposed to handle path symmetry was partial order reduction.
Partial Order Reduction. Partial order reduction was first developed in the explicit-state model checking context but was later extended to symbolic model checking [37]. The approach is named "partial order reduction" for historical reasons, but Clarke noted in [14] that "model checking using representatives" [30,31] may have been a more appropriate name. In particular, partial order reduction attempts to develop equivalence classes of behaviors so that only one representative from each class needs to be considered during model checking. Note that partial order reductions are sound only for checking state invariants. If the property of interest is temporal, the reduction could disallow input sequences that trigger the property. This can be avoided by first instantiating a monitor [15] and, if necessary, converting liveness properties to safety [5].
Partial order reduction is less natural in the synchronous setting, because synchronous transition systems do not have easily expressible independent actions. Nevertheless, these systems can still benefit from partial order reduction. Consider our motivating example: despite the huge number of system execution paths to consider, many of them are redundant. Observe that if both inputs are zero, then the state does not change. Furthermore, there is a temporal symmetry in the system execution: from any state where en_x is true, driving only inc_x followed by only inc_y results in the same state as driving them in the opposite order. Thus, this system has a large number of redundant interleavings, much like a multi-threaded program. To address this problem, we introduce a partial order reduction for synchronous hardware. Our goal is to remove redundant interleavings by adding constraints to the system. To maintain soundness, we provide a set of conditions which must pass before we can add constraints.

Synchronous Partial Order Reduction
In order to apply partial order reduction to a synchronous transition system, we are interested in identifying pairs of instructions that can be reordered without affecting the resulting state. More generally, we also want to be able to find pairs that can only be reordered under certain conditions. To formalize these notions, we adapt the notation and representation of guarded independence relations from [37].

Definition 2. Given an STS $\langle S, Init, A, En, D, T \rangle$ with instruction set $I$, let $G := \mathcal{P}(S)$ be the set of predicates over the states. Let $\langle i_0, i_1, g \rangle \in I \times I \times G$ be a guarded independence tuple iff for all $d_0, d_1 \in D$ and reachable $\mathbf{s} \in S^3$, the following condition holds:

$$en_{i_0}(s_0) \wedge g(s_0) \wedge T(\langle \langle i_1, d_1 \rangle, \langle i_0, d_0 \rangle \rangle, \mathbf{s}) \implies \exists s'.\ T(\langle \langle i_0, d_0 \rangle, \langle i_1, d_1 \rangle \rangle, \langle s_0, s', s_2 \rangle)$$

According to this definition, if we can prove that $\langle i_0, i_1, g \rangle$ is a guarded independence tuple, then we can reorder $\langle i_1, i_0 \rangle$ instruction sequences as long as i) $i_0$ is enabled in the first state; ii) $g$ holds in the first state; and iii) we also reorder the corresponding data inputs. We check only the enabledness of $i_0$ because $\langle i_0, i_1 \rangle$ is the representative order, and we only need to be able to reorder to the representative, not from it. The guard allows us to consider partial order reductions that only hold for a subset of the reachable states. To avoid trivially overconstraining the system with conflicting reorderings, we will only consider one ordering for each pair of instructions.
The condition in Definition 2 is difficult to check automatically because of the existential quantifier. We instead check two slightly weaker conditions [14,37]. The first condition states that instruction $i_0$ cannot disable $i_1$ under guard $g$:

$$\forall d \in D,\ \mathbf{s} \in S^2.\ g(s_0) \wedge en_{i_1}(s_0) \wedge T(s_0, \langle i_0, d \rangle, s_1) \implies en_{i_1}(s_1) \tag{2}$$

Intuitively, this condition ensures that we do not remove reachable states by disabling instructions. The second condition is that executing the instructions in either order leads to the same final state:

$$\forall d_0, d_1 \in D,\ \mathbf{s}, \mathbf{s}' \in S^3.\ s_0 = s'_0 \wedge g(s_0) \wedge T(\langle \langle i_1, d_1 \rangle, \langle i_0, d_0 \rangle \rangle, \mathbf{s}) \wedge T(\langle \langle i_0, d_0 \rangle, \langle i_1, d_1 \rangle \rangle, \mathbf{s}') \implies s_2 = s'_2 \tag{3}$$

When applying partial order reduction to concurrent programs, the standard approach is to check conservative syntactic properties which guarantee conditions (2) and (3). Synchronous systems do not typically have these syntactic properties, because there is no notion of distinct processes. Instead, we must check these conditions directly. In real circuits, it is unlikely that (2) will hold over arbitrary states. However, it is sufficient to prove that it holds for all reachable states. This can be done with a model checker.
To prove (3), we could encode it as an LTL property or build a monitor automaton and use a model checker. Alternatively, we have found that we can often use a straightforward commuting-diagram approach starting from a symbolic initial state, depicted in Fig. 1. We duplicate the system, unroll it twice, then start both copies in the same symbolic state and check that applying the instructions in either order results in the same final state. This simple approach has the disadvantage that a symbolic initial state ignores reachability, which could lead to spurious counterexamples. However, notice that the initial state is constrained by enabledness assumptions. To apply an instruction it must be enabled, so both instructions must be enabled in the initial state. We have found that these enabledness assumptions often constrain the initial state enough to rule out spurious counterexamples.
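The commuting-diagram check can be mimicked by brute force on a scaled-down version of the running example. This is a sketch under stated assumptions: the pointer range is reduced from 64 to 4 so that every state can be enumerated (standing in for the symbolic initial state), the memory is dropped, data is rotated correctly, and all function names are hypothetical. The flow described in this paper performs the check symbolically instead.

```python
from itertools import product

N = 4                  # scaled down from 64 so all states can be enumerated
MASK = (1 << N) - 1


def en_inc_x(state):
    x, _, valid, _ = state
    return (valid >> x) & 1 == 1


def inc_x(state):
    x, y, valid, data = state
    return ((x + 1) % N, y, valid, data)


def inc_y(state):
    x, y, valid, data = state
    rotated = ((data << 1) | (data >> (N - 1))) & MASK
    return (x, (y + 1) % N, valid | (1 << y), rotated)


def check_pair():
    """Return (no_disable, commutes) for inc_x and inc_y with guard true."""
    no_disable = commutes = True
    for s in product(range(N), range(N), range(1 << N), range(1 << N)):
        if not en_inc_x(s):
            continue      # both instructions must be enabled; inc_y always is
        # Enabledness: running inc_y first must leave inc_x enabled.
        no_disable &= en_inc_x(inc_y(s))
        # Commuting diagram: both orders end in the same state.
        commutes &= inc_y(inc_x(s)) == inc_x(inc_y(s))
    return no_disable, commutes
```

Both checks pass on this toy model even though unreachable states are included, which illustrates how the enabledness assumption alone can be enough to avoid spurious counterexamples.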
If both conditions pass, then we can choose a representative order and disallow the opposite ordering for that pair of instructions. If the proof of condition (3) fails, it provides a counterexample which should either convince the user that partial order reduction does not apply for that pair of instructions (a real counterexample), or serve as a guide for the user to write guards that would remove the spurious counterexample. Other invariants of the system, either obtained automatically or manually guessed by the user, could also remove spurious counterexamples. We can now state the first theorem of synchronous partial order reduction: that these conditions guarantee guarded independence over all reachable states.

Theorem 1. If conditions (2) and (3) hold for instructions $i_0, i_1 \in I$ and guard $g \in \mathcal{P}(S)$, then $\langle i_0, i_1, g \rangle$ is a guarded independence tuple.
Proof. Assume conditions (2) and (3) and that for some $d_0, d_1 \in D$ and reachable $\mathbf{s} \in S^3$, we have:

$$en_{i_0}(s_0) \wedge g(s_0) \wedge T(\langle \langle i_1, d_1 \rangle, \langle i_0, d_0 \rangle \rangle, \mathbf{s})$$

Because $en_{i_0}(s_0)$ holds, we have $\exists s', d'.\ T(s_0, \langle i_0, d' \rangle, s')$ by our enabledness assumption. Furthermore, by the data-input independence property of transition relations, it follows that for some $s'_1$, $T(s_0, \langle i_0, d_0 \rangle, s'_1)$. Now, because one of our assumptions is a transition from $s_0$ using $i_1$, $en_{i_1}(s_0)$ must be true. Condition (2) implies that $en_{i_1}(s'_1)$, thus $\exists s', d'.\ T(\langle \langle i_0, d_0 \rangle, \langle i_1, d' \rangle \rangle, \langle s_0, s'_1, s' \rangle)$. As before, this implies that for some $s'_2$, we also have that $T(\langle \langle i_0, d_0 \rangle, \langle i_1, d_1 \rangle \rangle, \langle s_0, s'_1, s'_2 \rangle)$. It then follows from (3) that $s_2 = s'_2$, and thus, $\langle i_0, i_1, g \rangle$ satisfies the condition from Definition 2.
Let a guarded independence relation, $R \subseteq I \times I \times G$, be a set of guarded independence tuples. We now describe how to apply partial order reductions, given some $R$. For each $\langle i_0, i_1, g \rangle \in R$, and for every $\mathbf{s} \in S^2$, $d \in D$, whenever $T(s_0, \langle i_1, d \rangle, s_1) \wedge en_{i_0}(s_0) \wedge g(s_0)$ holds, we remove from $T$ every transition of the form $\langle s_1, \langle i_0, d' \rangle, s' \rangle$ (for any $d'$ and $s'$). Let $T_R$ be the result. To apply this reduction in practice, we add a constraint to the BMC encoding: $(g(s_0) \wedge en_{i_0}(s_0) \wedge i_1) \implies \neg next(i_0)$. This makes it impossible for the STS to ever execute an instruction $i_0$ after an instruction $i_1$ when starting from a state where $i_0$ is enabled and $g$ holds. This effectively gives preference to $i_0$ as long as it is enabled. The effect of partial order reduction on a pair of instructions in a synchronous system is depicted in Fig. 2. Red X's show removed transitions, and for simplicity, we assume a trivial guard of true. Notice that all states are still reachable via some path from the initial state in the bottom left corner.
Theorem 2. Given $\mathcal{S} := \langle S, Init, A, En, D, T \rangle$, let $R$ be a guarded independence relation and let $\mathcal{S}_R$ be the reduced STS obtained by replacing $T$ with $T_R$ in $\mathcal{S}$. Then, if a property $P$ is an invariant for $\mathcal{S}_R$, it is also an invariant for $\mathcal{S}$.

Proof. It suffices to show that $\mathcal{S}_R$ can reach all the same states as $\mathcal{S}$. We prove this by contradiction. Assume there is some $\mathbf{in}, \mathbf{s}$ such that $Init(s_0) \wedge T(\mathbf{in}, \mathbf{s})$ and $0 \le j \le |\mathbf{s}| - 1$ such that $s_j$ is the first state that is unreachable in $\mathcal{S}_R$. The value of $j$ cannot be 0 or 1, because $\mathcal{S}$ and $\mathcal{S}_R$ have the same initial states and $T_R$ only excludes sequences of length 2. Then, by the definition of $T_R$, $\langle in_{j-2}, in_{j-1} \rangle$ must be a sequence excluded by $T_R$. Conditions (2) and (3) guarantee that permuting $in_{j-2}$ and $in_{j-1}$ results in an enabled sequence that ends in the same state, $s_j$, which contradicts the assumption. Thus, there cannot be a state which is reachable in $\mathcal{S}$ but not $\mathcal{S}_R$.
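The claim that the reduction preserves reachable states can be checked explicitly on a small model. The sketch below is illustrative only: it uses a 4-entry version of the running example without memory, fixes the guard to true, picks inc_x before inc_y as the representative order, and tracks the constraint $(en_{i_0}(s_0) \wedge i_1) \implies \neg next(i_0)$ with one extra "blocked" bit per search node. All names are hypothetical.

```python
from itertools import combinations

N = 4
MASK = (1 << N) - 1
ACTIONS = ("inc_x", "inc_y")
# Instruction set I = P(A): every subset of the atomic actions.
INSTRUCTIONS = [frozenset(c) for r in range(len(ACTIONS) + 1)
                for c in combinations(ACTIONS, r)]
I0, I1 = frozenset({"inc_x"}), frozenset({"inc_y"})  # representative order: I0, I1


def enabled(state, instr):
    x, _, valid, _ = state
    return "inc_x" not in instr or (valid >> x) & 1 == 1


def step(state, instr):
    x, y, valid, data = state
    nx = (x + 1) % N if "inc_x" in instr else x
    if "inc_y" in instr:
        rotated = ((data << 1) | (data >> (N - 1))) & MASK
        return (nx, (y + 1) % N, valid | (1 << y), rotated)
    return (nx, y, valid, data)


def reachable(init, instructions, por):
    """States reachable from init. With por=True, I0 may not run directly
    after I1 if I0 was enabled in the state where I1 was issued."""
    start = (init, False)            # (state, is I0 blocked in this step?)
    seen, stack = {start}, [start]
    while stack:
        s, blocked = stack.pop()
        for instr in instructions:
            if not enabled(s, instr) or (blocked and instr == I0):
                continue
            block_next = por and instr == I1 and enabled(s, I0)
            nxt = (step(s, instr), block_next)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return {s for s, _ in seen}
```

On this toy model the set of reachable states is unchanged by the constraint, both with the full instruction set and when only the two single-action instructions are allowed. The second case is the one where the constraint actually prunes interleavings, since the empty instruction would otherwise reset the blocking condition.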

Reduced Instruction Sets
Now that we can apply partial order reduction to synchronous systems, our main goal is to identify a maximal guarded independence relation, R. Recall that we defined instructions as sets of atomic actions. We call an instruction containing at most one action atomic (this includes the instruction with no actions). Non-atomic instructions are complex. Instructions thus reflect the parallelism of synchronous hardware, and lead to natural candidates for R: pairs of atomic instructions.
Furthermore, notice that the number of instructions is exponential in the number of actions. Thus, it could be prohibitively expensive to check every pair of instructions for guarded independence. In contrast, the number of atomic instructions is equal to the number of actions (plus one). Moreover, it is likely that many complex instruction pairs will not have a guarded independence relationship because they contain common actions. Our goal in this section is to disallow as many complex instructions as possible without losing any reachable states, thereby reducing the number of pairs of instructions we need to check while also making it more likely for the checks to succeed. Note that, in isolation, removing instructions might be problematic, because it could extend the bound needed to reach a property violation. However, as we will demonstrate in the experimental section, this disadvantage is more than compensated for when it is applied in combination with partial order reduction.
Given an STS with instruction set $I$, we seek a reduced instruction set, $I_r \subseteq I$, which preserves the reachable states of the system. Let $Input_r$ be the set of inputs which only use instructions from $I_r$. Given an input $in \in Input$, our goal is to prove the existence of a witness $w(in) \in Input_r^n$ (for some $n > 0$) that simulates the behavior of $in$ using only reduced instructions. Formally, the witness function $w$ should satisfy:

$$\forall in \in Input,\ \mathbf{s} \in S^2.\ T(s_0, in, s_1) \implies \exists \mathbf{s}' \in S^{n+1}.\ s'_0 = s_0 \wedge s'_n = s_1 \wedge T(w(in), \mathbf{s}') \tag{4}$$

In other words, we need to show that for every instruction in the original instruction set, there exists a sequence of inputs, using only instructions from the reduced instruction set (RIS), that results in the same final state. Notice that a witness function that also depended on the state would be more general, but for our purposes, it is sufficient for the witness function to depend only on the input.

Atomic instruction sets
The condition in (4) is quite general and does not provide any intuition on how to choose w. Here, we focus on a specific case where w is easy to construct: we choose I r to be an atomic instruction set, defined as an instruction set containing only atomic instructions. We then must prove that the set of reachable states is not affected by restricting the instructions to those in I r .
It is sufficient to prove that for each complex instruction, we can remove one of its actions and perform that action in the next step, with the same result. For some complex instruction $i$ containing $a$ and some data input $d$, let $w_a(\langle i, d \rangle)$ be $\langle \langle i - \{a\}, d \rangle, \langle \{a\}, d \rangle \rangle$. We must show that for each input $in$ containing a complex instruction, there exists some $a$ where $w_a(in)$ has the equivalent effect on the system as $in$. Formally, the requirement is:

$$\forall i \in I \setminus I_r,\ d \in D.\ \exists a \in i.\ \forall \mathbf{s} \in S^2.\ T(s_0, \langle i, d \rangle, s_1) \implies \exists s'.\ T(w_a(\langle i, d \rangle), \langle s_0, s', s_1 \rangle) \tag{5}$$

Condition (5) is still difficult to prove because of the existential quantifier. One conservative approach is to replace the existential quantifier with a universal quantifier and attempt to prove that stronger condition. For real systems, this is unlikely to hold. Instead, we propose a counterexample blocking procedure which, if it succeeds, guarantees (5). We introduce symbolic values for $i$, $d$, and $a$ and then iteratively add constraints over them until the proof succeeds or we have enumerated all possibilities. This algorithm is a specialized ∀∃ decision procedure that exploits the structure of (5) and additional domain knowledge about the proof goal. We use a constraint solver as an oracle.

     8:   while check_sat(|i| = c ∧ s_1 ≠ s'_2) do
     9:     µ ← get_model()
    10:     i_µ ← assignment(µ, i)
    11:     a_µ ← assignment(µ, a)
    12:     add_constraint(i_µ ⊆ i ⟹ a ≠ a_µ)
    13:     if ¬check_sat(i = i_µ) then
    14:       return false // exhausted all possible decompositions for this instruction
    15:     end if
    16:   end while
    17: end for
    18: return true // every instruction can be decomposed

Algorithm 1 takes an STS, $\mathcal{S} := \langle S, Init, A, En, D, T \rangle$, and returns true if the instruction set can be decomposed into an atomic instruction set by delaying a single action from each instruction. For simplicity, the algorithm assumes (and we check this assumption separately) that if a complex instruction $i$ is enabled, then for each $a \in i$, executing $i - \{a\}$ results in a state where $a$ is enabled.

Formally:

$$\forall i \in I \setminus I_r,\ d \in D,\ \mathbf{s} \in S^2,\ a \in i.\ en_i(s_0) \wedge T(s_0, \langle i - \{a\}, d \rangle, s_1) \implies en_a(s_1) \tag{6}$$

Note that this is only a slight generalization of the property that atomic instructions do not disable each other, a condition that we will need anyway in order to apply partial order reduction to the atomic instruction set (see condition (2)).
The algorithm first creates an identical copy of the STS in line 1. Lines 2-4 set up symbolic variables for the instructions, data, and states of each system. Line 5 adds constraints to the solver enforcing that both systems start in the same state, use the same data, and that $i'$ is $i$ but with symbolic action $a$ dropped. Line 6 adds the transition relation constraint for each STS. The initial symbolic setup is depicted in Fig. 3.
The outer loop at line 7 iterates over all possible complex instruction cardinalities. The inner loop starting at line 8 attempts to show that for each cardinality $c$, instructions of that cardinality can be decomposed by delaying one action (symbolically represented by $a$). If all instructions of cardinality $c$ have been decomposed, then the while loop condition is false and the outer loop continues. Otherwise, it gets variable assignments from the constraint solver in lines 9-11 and learns a constraint at line 12 that prevents this particular action, $a_\mu$, from being chosen for decomposition again. To ensure that we have not blocked all possible actions, there is an additional check at line 13, which returns false in the case that no action can be delayed for the current instruction.
Importantly, the algorithm assumes that if the delay of action $a_\mu$ does not create a valid witness sequence for a given complex instruction $i_\mu$, then the same is true whenever the instruction $i$ includes $i_\mu$. We call this a monotonicity assumption, and it typically holds when actions are somewhat independent. The monotonicity assumption motivated the current structure of the algorithm and can significantly reduce the number of iterations in the algorithm. We can remove this assumption by changing $i_\mu \subseteq i$ to $i_\mu = i$ in the antecedent in line 12. Note that the monotonicity assumption does not make the algorithm unsound: if it returns true, then (as we prove below) condition (5) holds. However, if the algorithm returns false, then it may be that the version without the assumption would return true. For each of our experiments, we were able to get a true result with the monotonicity assumption.
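The decomposition property that the algorithm establishes can be illustrated by exhaustive enumeration on a small model. The sketch below is not Algorithm 1: it replaces the solver-driven counterexample blocking loop with a direct search over all states of a 4-entry, memory-free version of the running example, and the function names are hypothetical. For each action of a complex instruction it checks that delaying the action leaves it enabled and ends in the same state.

```python
from itertools import product

N = 4
MASK = (1 << N) - 1


def enabled(state, instr):
    x, _, valid, _ = state
    return "inc_x" not in instr or (valid >> x) & 1 == 1


def step(state, instr):
    x, y, valid, data = state
    nx = (x + 1) % N if "inc_x" in instr else x
    if "inc_y" in instr:
        rotated = ((data << 1) | (data >> (N - 1))) & MASK
        return (nx, (y + 1) % N, valid | (1 << y), rotated)
    return (nx, y, valid, data)


def delayable_actions(instr):
    """Actions a such that, from every state where instr is enabled, running
    instr - {a} and then {a} ends in the same state as instr itself."""
    states = [s for s in product(range(N), range(N),
                                 range(1 << N), range(1 << N))
              if enabled(s, instr)]
    good = set()
    for a in instr:
        rest, last = instr - {a}, frozenset({a})
        if all(enabled(step(s, rest), last)                     # a stays enabled
               and step(step(s, rest), last) == step(s, instr)  # same final state
               for s in states):
            good.add(a)
    return good
```

For the only complex instruction of the toy model, {inc_x, inc_y}, either action can be delayed, so the instruction can be dropped in favor of two atomic steps. A solver-based procedure reaches the same conclusion without enumerating states, which is what makes the approach feasible for full-width designs.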
Because the algorithm does not consider state reachability and looks for a witness function that only depends on inputs, it can still return false when an equivalent sequence might exist for reachable states. In such cases, users can examine the constraint solver models and attempt to remove some of them by proving other invariants.

If Algorithm 1 returns true, we replace $T$ with $T_r$, where $T_r$ is the result of removing from $T$ all transitions $\langle s, \langle i, d \rangle, s' \rangle$ where $|i| > 1$. Practically, this is achieved by adding a disjunctive constraint over the possible atomic actions. We can now state the main results for reduced instruction sets.
Theorem 3. Let $\mathcal{S}$ be an STS such that condition (6) holds. If ProveRIS($\mathcal{S}$) returns true, then condition (5) holds.

Proof. We maintain the loop invariant at line 8 that for every instruction $i'$, there is some action $a'$ such that check_sat($|i| = c \wedge i = i' \wedge a = a'$) is true. It is true initially for each $c$ by condition (6). Afterwards, the check on line 13 ensures that it is maintained. Furthermore, the check on line 8 ensures that when the while loop is exited, any satisfying assignment for check_sat($|i| = c$) is such that $s_1 = s'_2$. Together, these conditions guarantee that (5) holds.
Theorem 4. Let $\mathcal{S} := \langle S, Init, A, En, D, T \rangle$ be an STS such that condition (6) holds and ProveRIS($\mathcal{S}$) returns true, and let $T_r$ be the transition relation for the reduced instruction set. Let $\mathcal{S}_r$ be the reduced STS obtained by replacing $T$ with $T_r$ in $\mathcal{S}$. Then, safety property $P \subseteq S$ is an invariant for $\mathcal{S}_r$ if and only if it is also an invariant for $\mathcal{S}$.
Proof. It suffices to show that the reachable states of $\mathcal{S}$ and $\mathcal{S}_r$ are identical. $Init$ does not change, so the initial states cannot be different. Furthermore, because $T_r$ is obtained by removing transitions from $T$, we know that $\mathcal{S}_r$ cannot add any reachable states. To show that it also does not remove any reachable states, consider an arbitrary trace $Init(s_0) \wedge T(\mathbf{in}, \mathbf{s})$ with $|\mathbf{s}| = n$. We must show $\exists \mathbf{in}', m,\ \mathbf{s}' \in S^m.\ Init(s'_0) \wedge T_r(\mathbf{in}', \mathbf{s}') \wedge s_{n-1} = s'_{m-1}$. We prove this by induction, showing that it holds whenever $\mathbf{in}$ contains instructions of cardinality at most $c$.
In the base case, $c = 1$, so all instructions are of size one or less. All of these are already atomic and thus we can take $\mathbf{in}' = \mathbf{in}$ and $\mathbf{s}' = \mathbf{s}$ by the definition of $T_r$.
For the inductive step, suppose that it holds for cardinalities up to $c - 1$, and assume $Init(s_0) \wedge T(\mathbf{in}, \mathbf{s})$ with $|\mathbf{s}| = n$. Let $in_j = \langle i, d \rangle$ be an input containing an instruction of size at most $c$. If $|i| < c$, there is nothing to be done. Thus we only consider the case where $|i| = c$. We know that $T(s_j, in_j, s_{j+1})$ holds. By Theorem 3 and condition (5), it follows that $T(\langle \langle i - \{a\}, d \rangle, \langle \{a\}, d \rangle \rangle, \langle s_j, s', s_{j+1} \rangle)$ holds for some $a$ and $s'$. We can thus replace $in_j$ in $\mathbf{in}$ by $\langle i - \{a\}, d \rangle$ followed by $\langle \{a\}, d \rangle$ to obtain an input sequence $\mathbf{in}^c$ and insert $s'$ between $s_j$ and $s_{j+1}$ in $\mathbf{s}$ to obtain $\mathbf{s}^c$ with final state $s_{n-1}$ such that $Init(s_0) \wedge T(\mathbf{in}^c, \mathbf{s}^c)$. Repeating this process for each input containing an instruction of size $c$ yields a final $\mathbf{in}^c$ such that the maximum cardinality of any instruction is $c - 1$. The property then holds by the inductive hypothesis.
Note that if there is some instruction $i \in I$ which cannot be decomposed into atomic instructions, we could always keep this instruction in $I_r$ and still benefit from removing other complex instructions. In many cases, we can also remove the empty instruction, $i_e = \emptyset$. If applying $i_e$ cannot change the state of the system, regardless of the data input, then it is considered a stutter step [14]. It is straightforward to check whether $i_e$ can be removed by comparing the state before and after applying $i_e$.

Experimental Results
We developed a prototype flow for proving the POR and RIS conditions and applying the necessary constraints. We use the IC3/PDR implementation in ABC [11], pdr, to prove condition (6) (which implies condition (2)). This requires manually writing a Verilog property for each atomic instruction. We implemented the ProveRIS algorithm in our SMT-based model checker, CoSA [26], configured with Boolector [29] on the smtcomp19 branch, using CaDiCaL [4] as the underlying SAT solver. We check the commuting diagram for condition (3) in CoSA as well. It tries the trivial guard true by default, and allows the user to provide additional candidate guards if necessary. The setup for proofs in CoSA is automated based on user-provided annotations for the actions and enable conditions. We show our best results, which used an encoding leveraging the SMT theory of arrays to represent memories for proving conditions, and a pure bitvector encoding for bounded model checking.
Our flow applies the following steps: i) read in a system description in Verilog using Yosys [38] and generate AIGER [7] for ABC (or BTOR2 [29] for other tools); ii) check condition (6) for each atomic instruction; iii) run the ProveRIS algorithm, and if it returns true, add constraints to rule out all but atomic instructions; and iv) check POR condition (3) for each pair of atomic instructions and add constraints for each passing pair of instructions with the associated guard. Each step depends on the previous step passing successfully. In each of our experiments described below, we successfully completed every step of this flow, though in some cases guards were required in step (iv). For POR and RIS runtimes, we always include the time to check the conditions. We tried running with POR alone, but it resulted in negligible improvements in runtime and thus we omit these results. This demonstrates the importance of RIS. We ran all experiments on a 3.5GHz Intel Xeon CPU with 16GB of RAM.

Motivating Example
First, we return to our motivating example. We compare the time to reach the bug using the SAT-based ABC [11] engines pdr and bmc, and SMT-based bounded model checking using btormc [29] and CoSA. We ran the SMT-based model checkers both with and without the SMT theory of arrays for the encoding of the memory. Without the array encoding, btormc and CoSA reached the bug in 1230s and 1437s, respectively; all other approaches timed out at two hours. In particular, pdr times out at two hours on the property, but can prove condition (6) for every atomic instruction in less than a second. Intuitively, this makes sense because the enabledness conditions do not involve data or mem. Thus, none of the datapath falls in the cone of influence, leaving only control logic for IC3 to reason about. The remaining conditions, (3) and (5), are proven in less than three seconds. Since all the conditions pass, we apply the POR and RIS constraints, which reduces the time to hit the bug from 1437s to 46s in CoSA, including the time to check the conditions.

Packet Movers
We now evaluate our approach on data integrity properties for a variety of packet-mover circuits. Data integrity is a safety property that ensures no packets are dropped or corrupted. In practice, data integrity is often checked by instantiating a monitor, called a scoreboard. It provides the necessary infrastructure for formal verification. In our case, it non-deterministically tags a magic packet and checks that this packet exits the system when it should. Crucially, the scoreboard is a reusable module which can check data integrity of arbitrary packet movers.
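To make the scoreboard idea concrete, the following Python sketch models one possible in-order scoreboard: it tags a single magic packet, counts the packets that must exit before it, and checks the output when the magic packet is due. This is an assumed simplification for FIFO-ordered designs, not the monitor used in our experiments; the class Scoreboard, the helper check, the corrupt_at bug injection and the trace format are all hypothetical. In the formal setting the tag input is left unconstrained, whereas here the trace fixes it.

```python
from collections import deque


class Scoreboard:
    """Hypothetical in-order scoreboard; an illustration, not the paper's monitor."""

    def __init__(self):
        self.ahead = 0          # packets that must exit before the magic packet
        self.magic = None       # data of the tagged packet, once chosen
        self.tracking = False   # True while the magic packet is in flight

    def step(self, push, pop, data_in, data_out, tag):
        """Observe one cycle; return False if data integrity is violated."""
        ok = True
        if pop:
            if self.tracking and self.ahead == 0:
                ok = data_out == self.magic   # the magic packet must exit now
                self.tracking = False
            elif self.ahead > 0:
                self.ahead -= 1
        if push:
            if tag and self.magic is None:
                self.magic, self.tracking = data_in, True
            elif not self.tracking:
                self.ahead += 1
        return ok


def check(trace, depth=4, corrupt_at=None):
    """Drive a toy FIFO; corrupt_at injects a bug that stores 0 at that occupancy."""
    fifo, sb = deque(), Scoreboard()
    for push, pop, data, tag in trace:
        pop = pop and len(fifo) > 0
        out = fifo[0] if pop else None
        if pop:
            fifo.popleft()
        push = push and len(fifo) < depth
        if not sb.step(push, pop, data, out, tag):
            return False
        if push:
            fifo.append(0 if len(fifo) == corrupt_at else data)
    return True


# Each entry is (push, pop, data, tag); the second packet is the magic one.
TRACE = [(True, False, 10, False), (True, False, 11, True),
         (True, False, 12, False), (False, True, None, False),
         (False, True, None, False)]
```

On this trace, check(TRACE) passes for the correct toy FIFO, while check(TRACE, corrupt_at=1) reports a violation because the magic packet is stored incorrectly.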
Notice that existing symmetry reduction techniques will not be very effective for this scoreboard setup. For example, consider a circular pointer FIFO which maintains two incrementing pointers that index a memory for reading and writing, respectively. We cannot use scalarsets to break symmetries in the memory addresses because the pointers index the memory and are involved in arithmetic, breaking the syntactic requirements for scalarsets [28]. Furthermore, sequential memory abstraction [9] could reduce the size of the memory, but does not address the path symmetry. In addition, both these symmetry reduction techniques are focused on proofs, not bug-finding.
We evaluate our approach on two commercial library components from a major hardware company. We also implemented simpler, open-source versions of these designs. Our open-source benchmarks include: i) a circular pointer FIFO which assumes power-of-two depth but is instantiated with a non-power-of-two depth (one greater than the provided parameter); ii) a shift register FIFO which does not properly add data to the last register in the pipeline; and iii) 2-5 correct circular pointer FIFOs in parallel with a non-deterministic arbiter and credit counters for managing data flow. The reset state of the credit counter has one too many credits, so data can be pushed to a full FIFO. The single FIFOs have two actions each: one for pushing data, and one for popping data. For the arbitrated circuits, there is a separate action for pushing data onto each FIFO as well as a single request action which is enabled whenever any FIFO is non-empty.
Fig. 4: Runtime Comparison
There is an inherent symmetry in all of these designs. Consider any of the FIFOs. There are two main actions: pushing data (which is enabled if the FIFO is not full); and popping data (which is enabled if the FIFO is not empty). In a state where both are enabled, pushing data followed by popping results in the same state as popping and then pushing the same data. Furthermore, the actions can be performed simultaneously, but requiring that they are performed separately should not change the reachable states (depending on the implementation), so RIS is applicable.
Our experiments vary both the parameterizable data width and depth of the packet movers, by sweeping all powers of two between 2 and 128. All benchmarks contain injected bugs and reach the bug at a deep bound relative to the depth. We used a timeout of 4 hours. We use our prototype flow for checking the conditions and CoSA for bounded model checking. 6 For condition (3), we had to write one guard which is true whenever the scoreboard counter is greater than zero to handle an edge case. This same guard was used for every design, but an appropriate invariant relating the scoreboard counter to the internal state of the system being verified would also have worked. The open-source shift register FIFO required one more guard about the number of stored elements. We obtained both guards by observing counterexamples.
Table 1 compares the number of solved instances (49 total per row) within the timeout and the average runtime of commonly solved instances in seconds. Columns marked "PR" used the POR and RIS constraints. We additionally use the following abbreviations: "com" for commercial, "cp" for circular pointer, "sr" for shift register and "arb" for arbitrated. In Fig. 4 we plot the actual runtime on a log-scale for all the benchmarks with and without POR and RIS. The dotted lines show 10x and 100x improvements.
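The push/pop symmetry of a single FIFO can be checked exhaustively on a toy model. The Python sketch below is an illustration under stated assumptions: a FIFO of hypothetical depth 3 over 1-bit data, modeled as a tuple, with the invented helpers apply, commutes and reachable. It checks that push and pop commute wherever both are enabled, and that restricting to the atomic instructions preserves the reachable states. Our actual proofs are carried out symbolically on the hardware designs, and this toy has no scoreboard, so no guard is needed here.

```python
from itertools import product

DEPTH = 3        # hypothetical toy FIFO depth
DATA = (0, 1)    # hypothetical 1-bit data domain


def push_enabled(s):
    return len(s) < DEPTH


def pop_enabled(s):
    return len(s) > 0


def apply(s, push, pop, d):
    """Apply one instruction (a set of simultaneous actions); None if disabled."""
    if (push and not push_enabled(s)) or (pop and not pop_enabled(s)):
        return None
    if pop:
        s = s[1:]
    if push:
        s = s + (d,)
    return s


def commutes():
    """Push-then-pop equals pop-then-push wherever both actions are enabled."""
    for n in range(DEPTH + 1):
        for s in product(DATA, repeat=n):
            if push_enabled(s) and pop_enabled(s):
                for d in DATA:
                    a = apply(apply(s, True, False, d), False, True, d)
                    b = apply(apply(s, False, True, d), True, False, d)
                    if a != b:
                        return False
    return True


def reachable(instructions):
    """States reachable from the empty FIFO using only the given instructions."""
    seen, frontier = {()}, [()]
    while frontier:
        s = frontier.pop()
        for push, pop in instructions:
            for d in DATA:
                t = apply(s, push, pop, d)
                if t is not None and t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen


ALL = [(False, False), (True, False), (False, True), (True, True)]
ATOMIC = [(True, False), (False, True)]   # reduced instruction set
```

Here commutes() holds and reachable(ALL) equals reachable(ATOMIC), mirroring the intuition that simultaneous push and pop can be replaced by the two atomic actions in sequence.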
Analysis. There is a cluster of points in the bottom left of Fig. 4 which are solved extremely quickly by both approaches, but slightly faster without POR and RIS. These are results on benchmarks with very small parameter values, where the bug occurs at a low depth, and so the POR and RIS results are dominated by the time taken to check the conditions. However, as the parameter sizes, and runtimes, increase, it is clear that POR and RIS can result in exponential speed-ups.
Recall that one concern is that RIS could extend the bound needed to reach the bug. In the shift register and arbitrated FIFO systems, it extended the bound by a few steps. However, for the bug in the open-source circular pointer FIFO, it doubled the bound needed to reach the bug. Regardless, this was more than compensated for by the symmetry-breaking of POR, as evidenced by the faster times to reach the bug. The deepest bound was 260, which occurred at FIFO depth 129. It is interesting to note that encoding the transition systems to SMT using the theory of arrays was always slower for bounded model checking, but was noticeably faster for checking RIS and POR conditions. Perhaps this is because the state comparison is easier for the solver to reason about using array extensionality [23].
6 Note: CoSA's bounded model checking performance is comparable to commercial model checkers on these benchmarks.
We have demonstrated that these techniques work well for packet movers. In part, this is because packet movers are often well-constrained by their environmental assumptions, and their behavior is largely independent of incoming data values. Furthermore, we typically expect the POR and RIS conditions to hold for a correct packet-mover implementation, so a failure in a condition could identify a bug.

Related Work
Various techniques have been employed to accelerate bounded model checking. The authors of [19] use BDDs to accelerate BMC, and the techniques introduced in [35,36] exploit the structure of BMC queries to help the SAT solver. The authors of [18] take advantage of structural information with an SMT framework tailored for BMC. Our technique is similar in that we speed up bounded model checking by adding constraints to the transition system, but we obtain constraints using partial order reduction analysis.
Wang et al. [37] pioneered partial order reduction for symbolic software model checking, guaranteeing optimal reduction for two threads. Their follow-up paper, [21], extended this framework to find the optimal reduction for any number of threads. We adapted their symbolic POR technique for synchronous hardware model checking, and developed reduced instruction sets to improve the efficacy of POR in this new domain. Bhattacharya et al. used a SAT solver to directly check guarded independence conditions (as opposed to checking syntactic properties) for asynchronous rule-based languages [3]. We also check conditions directly, but in a synchronous setting.
The techniques developed by McMillan, temporal case splitting and path splitting [28], provide a framework for splitting on possible values at a given timestep. These approaches deal with system executions, but still rely on breaking data symmetries for performance. In contrast, our techniques focus on mitigating path symmetries.
The work of Bengtsson et al. [2] extended POR to timed automata using a local-time desynchronization of clocks, followed by resynchronization with an added global clock. Similarly, our techniques adapt POR by modifying the system. However, our approach targets a different domain, and only modifies the original system by adding constraints.

Conclusion
We have presented a set of conservative conditions over transition systems and automated techniques for proving these conditions. If the conditions can be proved, then constraints can be added to the system that break path symmetries. We evaluated our approach on parameterized open-source and commercial packet-mover circuits and demonstrated significant improvements in bounded model checking performance.
Some potential future work includes improvements to the ProveRIS procedure [...] packet movers, developing more targeted condition proofs by associating actions with particular data inputs, and building an interactive tool which helps the user identify and manage reduced instruction sets and partial order reductions.

Data Availability Statement
The experimental results and the necessary software for reproducing results in a standard Ubuntu 18.04 installation are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.11874687.