# Automatic Generation of Inductive Invariants from High-Level Microarchitectural Models of Communication Fabrics

Satrajit Chatterjee and Michael Kishinevsky

Intel Corporation, Hillsboro OR 97124, USA satrajit.chatterjee@intel.com

Abstract. Abstract microarchitectural models of communication fabrics present a challenge for verification. Due to the presence of deep pipelining, a large number of queues and distributed control, the state space of such models is usually too large for enumeration by protocol verification tools such as Murphi. On the other hand, we find that stateof-the-art RTL model checkers such as ABC have poor performance on these models since there is very little opportunity for localization and most of the recent capacity advances in RTL model checking have come from better ways of discarding the irrelevant parts of the model. In this work we explore a new approach for verifying these models where we capture a model at a high level of abstraction by requiring that it be described using a small set of well-defined microarchitectural primitives. We exploit the high level structure present in this description, to automatically strengthen some classes of properties, in order to make them 1-step inductive, and then use an RTL model checker to prove them. In some cases, even if we cannot make the property inductive, we can dramatically reduce the number and complexity of lemmas that are needed to make the property inductive.

#### 1 Introduction

Consider the microarchitectural model shown in Figure 1. It consists of a source that non-deterministically generates packets that contain the 6-bit value 0. The source feeds into a pair of serially connected FIFOs each of size k, the second of which feeds into a sink that consumes a packet non-deterministically. The communication between the source, the FIFOs and the sink is by means of a simple handshake. We present a formal semantics for these microarchitectural primitives in Section 3, but we hope that for now this intuitive description suffices.

Consider the problem of verifying that any packet seen at the output of the second FIFO contains the value 0. If we generate Verilog from this description and use a state-of-the-art RTL model checking engine such as ABC [3] (winner of the 2008 CAV Hardware Model Checking contest), we find that this apparently trivial problem is surprisingly hard even for small values of k. For instance, even for k=4, ABC takes about 10 minutes to solve this problem on an Intel 3 GHz Xeon processor resorting to interpolation to prove it. Our experience



**Fig. 1.** A simple microarchitectural model with a source that generates the 6-bit value 0, two queues that can store k elements each and a sink. The components are connected by channels x, y and z.

with other industrial tools is similar. And this is for a system with only two queues and a simple topology. In our work on modeling the microarchitecture of communication fabrics we routinely encounter systems where a packet may traverse tens of queues in its lifetime (due to pipelining, path splitting and reconvergence, etc.) and there is complex control logic for resource management. Therefore, even if each queue is sized minimally and packets are represented abstractly, there is still a lot of state. RTL model checkers – though useful for bounded model checking – are unsuccessful in producing proofs for all but the simplest examples even when run for days or weeks. On the other hand, explicit state model checkers such as Murphi run out of memory since there are many interleavings due to non-determinism and deep pipelining.

If we go back to our example, it is obvious to a human designer that the property should hold. It is obvious since we are able to use our knowledge of queues in order to reason about the system. However, when we throw this problem to ABC or Murphi, this high-level information is lost. ABC sees a sea of gates, and Murphi a sea of rules. The traditional approach to handling such verification problems is to resort to theorem proving, or its cousin manual invariant strengthening. In manual invariant strengthening, a verification engineer adds additional invariants (called lemmas) to the model so that the entire set of invariants becomes inductive. Adding these additional invariants is a black art often requiring expertise both in formal verification and the system being verified [10].

In this work, we seek a less labor-intensive way of exploiting the high-level structure of our models than theorem proving or invariant strengthening. The key idea is to require that the microarchitectural models be described in terms of a small set of primitives such as queues, arbiters, forks and joins. Using our knowledge of these primitives, we can *automatically* add a number of lemmas so that the whole set of invariants becomes (1-step) inductive. Most of these lemmas are not local primitive-specific invariants, but are obtained by global analysis of the model. The experimental results are very encouraging: with no or little human effort and little CPU time, we can now prove a number of properties on real models which could not be proved before. In our example above, all necessary lemmas are added automatically, and ABC discharges the resulting problem in almost no time.

The requirement that the model be expressed in terms of specific primitives could be a difficult one to satisfy in general. However, the set of primitives we use in this work originated in a project aimed at reducing the effort required to write microarchitectural models of communication fabrics [4]. Using this modeling methodology we have been able to capture the microarchitecture of a number of real designs and to validate them using simulation and bounded model checking. The goal of this work is to extend verification to obtain full proofs of correctness for some important types of properties.

The use of "high-level structure" for more efficient model checking is a holy grail of hardware verification. We believe this work makes a contribution in that direction by presenting a concrete proposal for describing hardware at a level of abstraction higher than RTL along with a couple of analysis techniques that illustrate how such structure could be exploited for efficient verification. The properties we consider are simpler than those verified in previously published manual efforts (e.g. see [9,10] and references therein) but we seek more automation. On the other hand, the use of high-level structure allows us to infer invariants which would be very difficult for existing automatic RTL-based methods (e.g. see [1] and references therein) to discover. What we present is only a beginning, and we hope that these techniques can be extended to an even larger class of properties in future work.

# 2 Methodology

Our microarchitectural models are described by instantiating components from a library of primitives and connecting them. We refer to these models as xMAS networks (xMAS stands for eXecutable MicroArchitectural Specification). The properties to be verified are specified on these networks. For verification, an xMAS network is compiled down into a synchronous model (single clock, edgetriggered Verilog to be precise) which is then verified. We refer to this model as the synchronous model.

Although the techniques presented in this paper could be used to directly verify xmas models instead of the synchronous models, in this work, we simply use the high-level structure in the xmas models to discover new invariants which are then used in the verification of the synchronous model. We choose this approach partly for engineering convenience (we use a conventional model checker as the trusted engine and view the analysis described in this paper as providing verification hints) and partly because the methods described in this paper cannot be used to prove all properties of interest (in particular liveness). A nice side effect of this approach is that the invariants we add get checked by the model checker rather than being assumed as given.

#### 3 XMAS Models

xMAS models are constructed by instantiating components from a library of microarchitectural primitives and connecting them with *channels*. Channels are typed. In the synchronous model, a channel x with type  $\alpha$  has two boolean signals x.irdy (for initiator ready) and x.trdy (for target ready) for control and one signal x.data that has type  $\alpha$  for the data.

**Fig. 2.** A key showing the symbols for the various primitives used to model microarchitectural blocks. Section 3 describes these components in detail. The italicized letters (k, f, e, g, h and s) indicate parameters. Whenever we use these primitives in a diagram we need to specify values for these parameters. Often, to avoid clutter we do not show these values explicitly trusting that they are clear from the context. In contrast, the gray letters (i, o, a, and b) in this figure only indicate port names and are only shown to help you understand the formal definitions in Section 3. Observe that for some components such as the fork, we place the parameter close to the "corresponding" port in the diagram.

A channel is connected to exactly two components: one component called the *initiator* that "writes" to the channel (via an *output* port) and another component called the *target* that "reads" from the channel (via an *input* port). In the synchronous model, the initiator drives irdy and data signals (and reads trdy) whereas the target drives trdy (and reads irdy and data). Intuitively, a data element (or a packet) is transferred across a channel in those cycles when both irdy and trdy are true. Note that a channel is just a three wires and stores no state. A channel is represented in our diagrams by a line.

An xmas network may be viewed as a directed graph with the components as nodes and channels as edges. Edges are directed from initiator to target.

Example. In Figure 1, there are three channels x, y and z. For channel x, the initiator is the source and the target is the first queue. Thus x is connected to the output port of the source and to the input port of the first queue. The output port of the first queue is connected to channel y.

Figure 2 shows the library of kernel primitives. We formally specify each primitive by providing the synchronous equations that are generated for it. We present this in some detail because the exact definitions are important to understand the invariants that we generate later. These definitions may be skimmed on a first reading.

Queue. In our models, storage is implemented by queues. In terms of interface, a queue is one of the simplest primitives. It is parameterized by a type  $\alpha$  of the elements stored in the queue and a non-negative integer k that indicates the capacity of the queue. It has one input port i which is connected to the target end of a channel that is used to write data into the queue. Clearly, this channel must have type  $\alpha$ , and for convenience we say that port i also has type  $\alpha$ , denoted by i:  $\alpha$ . Likewise, the output port o:  $\alpha$  is connected to the

<sup>&</sup>lt;sup>1</sup> Our queues are always FIFO i.e. first-in-first-out.

initiating end of the channel that reads data out of the queue. The equations for a queue are:

```
o.irdy := (\mathbf{pre}(\text{num}) \neq 0) i.trdy := (\mathbf{pre}(\text{num}) \neq k) enq := i.irdy and i.trdy deq := o.irdy and o.trdy
```

where enq and deq are combinational signals defined for convenience, and num is the current occupancy of the queue given by:

```
num := pre(num) + 1 if enq and not deq

pre(num) - 1 if deq and not enq

pre(num) otherwise
```

where **pre** is the standard synchronous operator that returns the value of its argument in the previous cycle and the value 0 in the first cycle [2]. The elements in the queue are stored in an array called mem of size k of signals of type  $\alpha$ . These are indexed by head and tail pointers used for reading and writing, correspondingly.

```
head := if deq then inc_k(pre(head)) else pre(head)

tail := if enq then inc_k(pre(tail)) else pre(tail)

where inc_k(x) \equiv if \ x = k - 1 then 0 else x + 1. For j \in \{0, k - 1\} we have mem_j := if \text{ enq and } j = pre(tail) then i.data else pre(mem_j)

and,

o.data := pre(mem_0) if pre(head) = 0

pre(mem_1) if pre(head) = 1

\vdots

pre(mem_{k-1}) if pre(head) = k - 1
```

Among our set of primitives a queue is the only one that can store data. It is also the only delay element: even if the queue is empty, an input packet is visible at the output only after 1 cycle.

Source. A source is a primitive which is parameterized by a constant expression  $e:\alpha.^2$  Each cycle, it non-deterministically attempts to send a packet e through its output port. A source has a single output port  $o:\alpha$  and is governed by the following equations:<sup>3</sup>

```
o.irdy := oracle or pre(o.irdy and not o.trdy) o.data := e
```

where oracle is an unconstrained primary input that is used to model the nondeterminism of the source in the synchronous model. Each source has its own oracle. We define o.irdy in this specific manner to keep it persistent regardless of the oracle behavior: i.e. once a source makes a value available on the channel, it preserves that value until a transfer. Also note that one can imagine more complex sources which emit arbitrary values from a given set. However, for ease of exposition we stick to the simpler definition above.

<sup>&</sup>lt;sup>2</sup> Henceforth we only mention the value parameters of a component and leave the type parameters implicit.

When o.irdy is false, o.data is a don't care. But for brevity in the equations, we always assign to o.data rather than only when o.irdy is asserted.

Sink. Dually, a sink is a component which non-deterministically consumes a packet. It has one input port  $i:\alpha$  and is characterized by the following equation:

```
i.trdy := oracle or pre(i.trdy and not i.irdy)
```

Function. A function primitive is used to model transformations on the data. It is parameterized by a function  $f: \alpha \to \beta$ . It has an input port  $i: \alpha$  and an output port  $o: \beta$  and is fully characterized by the following equations:

```
o.irdy := i.irdy o.data := f(i.data) i.trdy := o.trdy
```

Note that f is a combinational function that is applied to the input data to generate the output data.

Fork. A fork is a primitive with one input port  $i:\alpha$  and two outputs ports  $a:\beta$  and  $b:\gamma$  parameterized by two functions  $f:\alpha\to\beta$  and  $g:\alpha\to\gamma$ . Intuitively, a fork takes an input packet and creates a packet at each output. It coordinates the input and outputs so that a transfer only takes place when the input is ready to send and both the outputs are ready to receive. Formally,

```
\begin{array}{ll} \text{a.irdy} := \text{i.irdy} \ \mathbf{and} \ \text{b.trdy} & \text{a.data} := f(\text{i.data}) \\ \text{b.irdy} := \text{i.irdy} \ \mathbf{and} \ \text{a.trdy} & \text{b.data} := g(\text{i.data}) \\ \text{i.trdy} := \text{a.trdy} \ \mathbf{and} \ \text{b.trdy} \end{array}
```

Join. A join is the dual of a fork. It has two input ports  $a:\alpha$  and  $b:\beta$  and one output port  $o:\gamma$ . It is parameterized by a single function  $h:\alpha\times\beta\to\gamma$ . Intuitively, a join takes two input packets (one at each input) and produces a single output packet. It coordinates the inputs and output so that a transfer only takes place when the inputs are ready to send and the output is ready to receive. Formally,

```
a.trdy := o.trdy and b.irdy b.trdy := o.trdy and a.irdy o.irdy := a.irdy and b.irdy o.data := h(a.data, b.data)
```

Switch. A switch is a primitive to route packets in the network. It has an input port i and two output ports a and b, all of type  $\alpha$ . It is parameterized by a switching function  $s: \alpha \to \text{Bool}$ . Informally, the switch applies s to a packet x at its input, and if s(x) is true, it routes the packet to port a, and otherwise it routes it to port b. Formally,

```
\begin{array}{ll} \text{a.irdy} := \text{i.irdy and } s(\text{i.data}) & \text{a.data} := \text{i.data} \\ \text{b.irdy} := \text{i.irdy and not } s(\text{i.data}) & \text{b.data} := \text{i.data} \\ \text{i.trdy} := (\text{a.irdy and a.trdy}) \text{ or } (\text{b.irdy and b.trdy}) \end{array}
```

Merge. Arbitration is modeled by a merge primitive that selects one packet among multiple competing packets. A merge has multiple input ports and one output port. Requests for a shared resource are modeled by sending packets to a merge, and a grant is modeled by the selected packet. For simplicity we present here a complete definition of a two-input merge that has two input ports  $a:\alpha$  and  $b:\alpha$  and one output  $o:\alpha$ .

where u is a local Boolean state variable to ensure fairness. We could choose a specific fairness algorithm such as

```
u := 1 if a.irdy and not b.irdy
0 if not a.irdy and b.irdy
not pre(u) if pre(o.irdy and o.trdy)
pre(u) otherwise
```

Example. Figure 3 shows two agents P and Q communicating via a router. Packets are modeled by triples (t,s,d), where  $t\in\{\text{req},\text{rsp}\}$  is the type of the packet,  $s\in\{P,Q\}$  is the source and  $d\in\{P,Q\}$  is the destination. Each agent creates new requests for the other agent or for itself. When an agent receives a request (from the other agent or from itself) it produces a response by changing the type of the message and swapping the source and the destination. The response is produced after a non-deterministic delay. The response is sent back to the requester where it is sunk after a non-deterministic delay. The router routes messages according to their destinations i.e. d. (In practice this simplified microarchitecture would not be used since it deadlocks. Deadlocks can be avoided by using virtual channels as we discuss later.)



**Fig. 3.** Example showing a pair of agents communicating over a simple fabric (see text for details). The nd-delay box models non-deterministic delay (the functions of the fork are identity). Since each symbol has a precise formal semantics (see Section 3) this figure is a precise executable description.

# 4 Analysis for Channel Properties

A very common verification problem on xMAS networks is to check that all values flowing through a channel satisfy some property. For instance, at the input of an agent, we may wish to check that all packets that arrive have the agent as the destination. Invariants of this kind are called channel properties, and in this section we see how such invariants may be strengthened.

## 4.1 Channel Properties

If x is a channel that has type  $\alpha$ , a channel property is a function  $p: \alpha \to \{0, 1\}$ . Intuitively, if a property p is asserted on a channel x, it means that whenever a valid value is seen on the channel (i.e. x.irdy is asserted), the data on the channel must satisfy p. Formally, a channel property p on a channel x corresponds to the LTL invariant  $\mathbf{G}(x.\text{irdy} \implies p(x.\text{data}))$  in the synchronous model. For brevity, we sometimes simply say property instead of channel property.

Example. The verification problem in the introduction corresponds to verifying the channel property  $v \mapsto (v = 0)$  on channel z.<sup>4</sup> This corresponds to the LTL property  $\mathbf{G}(z.\mathrm{irdy} \implies (z.\mathrm{data} = 0))$  in the synchronous model.

### 4.2 Propagating Channel Properties

Given a channel property p, we can derive properties on other channels that are "implied" by p using a set of rules. These rules are similar in spirit to Hoare rules [8] used in program verification and are derived syntactically (i.e. no reasoning is involved). The goal is to strengthen the LTL invariant corresponding to p in the synchronous model with the additional invariants obtained from the new channel properties. The soundness of these rules may be verified from the definitions given in Section 3.

Rule for Queue. Since a queue does not modify the data it holds, a property holds on the output of a queue iff it holds on the input.

Example. In our running example (Figure 1), the property  $v \mapsto (v = 0)$  holds at z iff it holds at y. Similarly the property holds at y iff it holds at x. It turns out that adding the LTL properties corresponding to the channel properties for x and y, does not make the resulting verification problem on the synchronous model inductive. We need further strengthening, and we return to this topic shortly.

Rule for Function. Given an instance of a function primitive with the parameter  $f: \alpha \to \beta$ , a channel property p holds at the output iff the property  $p' = p \circ f$  holds at the input.

Rule for Switch. Consider an instance of a switch whose switching function is  $s: \alpha \to \beta$ . The channel property p holds at the output a iff the property

By " $v \mapsto (v = 0)$ " we mean the function that is 1 iff the input is equal to 0, i.e. the function  $\lambda v.(v = 0)$  using  $\lambda$  notation.

 $v \mapsto (s(v) \implies p(v))$  holds at the input. Likewise, a property p holds at output b iff the property  $v \mapsto ((\neg s(v)) \implies p(v))$  holds at the input.

Rule for Mux. A channel property holds on the output iff it holds on each input.

Rule for Fork. A channel property p holds on the output a of a fork iff  $p' = p \circ f$  holds on the input. Similarly, p holds on the output b of a fork iff  $p' = p \circ g$  holds on the input.

Rule for Restricted Join. Propagating a property across a join is tricky since the output of a join in general could be functionally dependent on both inputs. However, in our examples drawn from the domain of communication fabrics, joins are only used to control access to resources (e.g. see examples of credit logic and virtual channels in Section 5). Therefore, the join function depends only on at most one input of the join (called the functional input) i.e. it is of the form  $h: \alpha \to \gamma$  (instead of  $h: \alpha \times \beta \to \gamma$ ). In such cases the other input carries tokens (i.e. values having the unit type). It is easy to detect such joins automatically since the join function h syntactically depends only on one of the inputs. If h is constant, then either input may be taken as the functional input. Given such a join with the restricted function  $h: \alpha \to \gamma$ , a property p holds at the output iff  $p' = p \circ h$  holds at the functional input of the join. Extending propagation to general joins appears to be a hard problem since it involves reasoning about multiple channels.<sup>5</sup>

# 4.3 Queue Invariants

If we have a channel property p at the output of a queue, using the rule for queues presented above, we also have the property p at the input of the queue. However, simply adding the invariants from these properties to the LTL model does not make the synchronous problem (1-step) inductive. It is easy to see why: Suppose a queue is in a state where it has more than 2 elements. Even if these properties hold at the output and input of the queue, at best they guarantee that only the oldest and youngest element in the queue satify p. They say nothing about the other elements in the queue.

Therefore we need additional invariants to ensure that *every* element stored in the queue satisfies p. For  $j \in [0, k)$ , where k is the size of the queue, we add the LTL invariant (recall the state variables of a queue from Section 3)

$$\mathbf{G}(\mathrm{used}_j \implies p(\mathrm{mem}_j))$$

where used<sub>j</sub> is a predicate over the state that indicates if the jth storage element in the queue is used or not. It is defined as follows:

$$used_j := (head < tail$$
 **and**  $(head \le j$  **and**  $j < tail))$  **or**  $(head > tail)$  **and**  $(head \le j$  **or**  $j < tail))$  **or**  $(num = k)$ 

<sup>&</sup>lt;sup>5</sup> Even with general joins there is an easy case. If p is a property such that  $p' = p \circ h$  depends on only one variable, then it suffices to propagate p' along the corresponding input.

Along with this, we add the LTL assertions  $\mathbf{G}(\text{num} \leq k)$ ,  $\mathbf{G}(\text{head} < k)$  and  $\mathbf{G}(\text{tail} < k)$  to ensure that these state variables are within bounds. Finally, we need to add the following invariants to establish the correct relationship between these 3 state variables:

```
G(\text{head} < \text{tail} \implies \text{head} + \text{num} = \text{tail})

G(\text{head} > \text{tail} \implies \text{head} + \text{num} = \text{tail} + k)

G(\text{head} = \text{tail} \implies \text{num} = 0 \text{ or num} = k)
```

These assertions are used to ensure that the head and tail pointers behave as expected and provide a local over-approximation of the state-space.

#### 4.4 A Note on Local Invariants

The queue invariants added above block off portions of the unreachable state space that would otherwise lead to false counter examples in induction. However since these invariants are local, any correlation between different queues is not captured. However, this is exactly how a human designer thinks about the system: for example seldom would the correctness of a design depend on say two head pointers in two different queues taking on the same value in all portions of the reachable state space.

Indeed, if the correct operation of a design relies on the correlation between different components, typically this is enforced in the design by some explicit communication structure between them. A common case in our models for this case is when the occupancy of multiple queues are correlated in the reachable state space. We study this problem in the next section where we follow this communication trail to infer the appropriate invariants.

The queue is our main state holding element. Among all the primitives, the only other interesting state holding element is the merge which maintains state for fairness. If the merge has multiple inputs, then the appropriate local invariants for the fairness logic need to be added. (For the particular 2-input merge presented in Section 3, we do not need to add constraints for the u variable since it can take on both 0 and 1 values in the reachable space.)

## 4.5 Propagation Algorithm

Given a property p on a channel, we try to maximally propagate it backwards using the obvious iterative algorithm. This is done by looking at the initiator of the channel, and applying the corresponding rule from Section 4.2. This creates new properties at the inputs of the initiator. This process is repeated for each newly added property. If the initiator is a source, then the property is not (cannot be!) propagated further. If the xmas network has a directed cycle, the above process will not terminate. We handle this by recording the "parent" and stopping when a cycle is encountered.

After all properties have been propagated in this manner, for each queue in the system, we add the local invariant according to the scheme described in Section 4.3 for each property at the output of the queue. **Theorem 1 (Partial Completeness).** Given an acyclic xmas network  $\mathcal{N}$  where all joins are restricted, and a property p on a channel in  $\mathcal{N}$  that holds, the above algorithm adds sufficiently many invariants to make the synchronous problem 1-step inductive.

The propagation algorithm often leads to the creation of a large number of properties. However, many properties can be discharged locally i.e. during the propagation process, they become tautologies i.e. the constant 1 function. Therefore we use a reasoning engine to detect tautological properties and do not propagate them further. This is an important optimization in practice.

Example. In the example of Figure 3, let l be the property  $(t, s, d) \mapsto (d = P)$  at the ingress queue of agent P. If we propagate l backwards through the queues and switches in the router using the above algorithm, we find that the properties that are obtained from l at each input of the router are of the form  $(t, s, d) \mapsto ((d = P)) \implies (d = P)$  which is a tautology. These tautological properties need not be propagated further.

Remark on cycles. Most properties become tautologies during propagation, so cycles in the xMAS network are not a problem (as in the example above). However, for those that do not become tautologies, it may be necessary to add additional channel invariants on loops to "break" the cycle. Furthermore, in many cases in communication fabrics we find that packets loop at most k times, where k is small (i.e. 1 or 2). For example in Figure 3, k = 2 (for a self-request). We could handle such cases automatically by unrolling loops k + 1 times.

#### 5 Invariants from Flows

As remarked in Section 4.4, if the correct operation of a design relies on correlation between state variables in different components then in a real design there is usually an explicit communication mechanism between them for coordination. In this section we present an algorithm to analyze a commonly-occuring communication of this form that leads to correlation among the occupancies of different queues in the system. The invariants added by this analysis allow us to prove an important class of safety properties that check that the queues in a system are sized correctly. Such safety properties are necessary for reasoning about liveness.

### 5.1 The Basic Idea

Example (credits). Consider the xMAS network shown in Figure 4 which shows a master agent M communicating with a target T. The credit logic portion of T issues at most k outstanding credits to M at any given time. Credits are modeled as values of the unit type called *tokens*. M has to wait for a credit before it can send a request to T. The purpose of this mechanism is to ensure that there is always room in T's ingress queue for requests from M i.e. nothing



**Fig. 4.** Credits introduce correlation between the occupancies of different queues. Both joins are restricted in the sense of Section 4.2 since at least one input is a token.

gets stuck on channel r. Thus r is non-blocking i.e. satisfies the LTL property:  $\mathbf{G}$  (r.irdy  $\implies r$ .trdy). Credits are freed up when data is read from the ingress queue of T.

The non-blocking property on r is not inductive. However, by adding the invariant

$$\mathbf{G}(\text{num}_c + \text{num}_i = \text{num}_o)$$

to the synchronous model, the problem becomes inductive.<sup>6</sup> Here,  $\operatorname{num}_c$  is the num variable of the credit queue in M,  $\operatorname{num}_i$  the same for the ingress queue in T and  $\operatorname{num}_o$  for the outstanding credits queue in T.

The question now is how can we detect such global assertions automatically? If x is a channel, let  $\lambda_x$  denote the number of packets that have been transferred on x upto a given point in time (i.e.  $\lambda_x$  is the count of the number of cycles so far in which x.irdy and x.trdy were both asserted). Now, from the equations of a join in Section 3 it is easy to see that either a transfer happens on both inputs and the output of a join or there is no transfer at any input or the output. Thus for the two joins in Figure 4 we have,

$$\lambda_e = \lambda_f = \lambda_r$$
 and  $\lambda_s = \lambda_w = \lambda_z$ .

Similarly, for a fork it can be verified that either a transfer happens on both output and the input or there is no transfer at all. Thus for the two forks in the system we have the equations

$$\lambda_u = \lambda_t = \lambda_v$$
 and  $\lambda_p = \lambda_n = \lambda_s$ .

A queue is more interesting. Any packet that enters a queue is either still in the queue or has exited through the output channel. Thus from the three queues in Figure 4 we get the following three equations:

$$\lambda_r = \text{num}_i + \lambda_p$$
  $\lambda_t = \text{num}_c + \lambda_e$   $\lambda_v = \text{num}_o + \lambda_w$ 

<sup>&</sup>lt;sup>6</sup> Assuming that the (local) assertions  $\mathbf{G}(\text{num} \leq k)$  for each queue have already been added.



**Fig. 5.** Example of a shared communication path that requires more precise flow analysis. The credit logic bubbles encapsulate the logic shown in the credit logic box of Figure 4. The switch in the target routes A packets to  $l_1$  and B packets to  $l_2$ . The joins in M are restricted and have the identity function.

From these 7 equations, we can eliminate the  $\lambda$  variables to get the desired relationship between the num variables. This can be done automatically in the following manner. First we create a matrix from the equations where all  $\lambda$  variables are to the left and all num variables are to the right. Then this matrix is converted to Reduced Row Echelon (RRE) form by Gaussian elimination (over the rationals). Finally, we select the equations from the RRE form which involve only the num variables (i.e. the coefficients of all  $\lambda$  variables are 0).

Note that the  $\lambda$  variables are unbounded and by this elimination process, we are only left with relations in only the num variables (which are bounded by the size of the queues). Hence these relations can be added as invariants to the synchronous model.

The technique described in this section resembles generation of place invariants in Petri nets [5]. However, rather than modeling the communication fabric with Petri nets (which leads to an overhead of using explicit back-pressure arcs and complexity in modeling the data-path) we derive those invariants directly from more compact and natural xMAS specifications.

#### 5.2 Shared Communication

In the presence of shared communication channels the approach presented above needs to be refined.

Example (virtual channels). Virtual channels lead to sharing. They are commonly used in communication fabrics to multiplex multiple logical streams onto a single physical link with the guarantee that even if one stream is blocked at the receiver, the other streams still make progress [7].

Figure 5 shows a simple example of virtual channels. A master agent M sends two types of messages A and B (think of these as perhaps requests and responses) to a target T over a single channel r. The ingress switch in T routes A and B packets to their respective ingress queues. The credit pattern of Figure 4 is used

to ensure that whenever a packet is presented to the egress arbiter of M, there is guaranteed to be room in the corresponding ingress queue in T. Thus channel r is non-blocking.

Once again, the non-blocking property on r is not inductive. However, if we add the invariants  $\mathbf{G}(\operatorname{num}_{c_A} + \operatorname{num}_{i_A} = \operatorname{num}_{o_A})$  and  $\mathbf{G}(\operatorname{num}_{c_B} + \operatorname{num}_{i_B} = \operatorname{num}_{o_B})$  for each credit loop, then the problem becomes inductive. Here,  $\operatorname{num}_{c_A}$  refers to the num variable of the credit queue in M associated with the A packets, and so on. However, if we try the approach from the previous example (with suitable extensions for muxes and switches) we find that we can only derive the weak invariant:  $\mathbf{G}(\operatorname{num}_{c_A} + \operatorname{num}_{i_A} + \operatorname{num}_{c_B} + \operatorname{num}_{i_B} = \operatorname{num}_{o_A} + \operatorname{num}_{o_B})$  which is not enough to prove the property.

We can improve the precision of the analysis by defining a  $\lambda$  variable per flow through a channel. For example we know that two types of values flow through channel r. Therefore we introduce two variables  $\lambda_r^A$  and  $\lambda_r^B$  for r where  $\lambda_r^A$  is a count of the number of cycles when r.irdy and r.trdy have been asserted and r.data was equal to A. Similarly,  $\lambda_r^B$  for B. Since there are two flows through r, we assume that two flows are possible through  $g_1$  and  $g_2$  and through  $f_1$  and  $f_2$  and associate two  $\lambda$  variables from each of these channels: one for A and one for B. For all the other channels, we associate only a single  $\lambda$  variable since there are only single flows through them. (We will see later how to automatically figure out the number of flow variables needed.)

Since the ingress switch in T routes A to channel  $l_1$  and B to channel  $l_2$ , we have

$$\lambda_r^A = \lambda_{l_1}$$
 and  $\lambda_r^B = \lambda_{l_2}$ 

For a merge, a packet at the output must come from one or the other input. Therefore, we have the following equations for the egress arbiter in M:

$$\lambda_{g_1}^A + \lambda_{g_2}^A = \lambda_r^A$$
 and  $\lambda_{g_1}^B + \lambda_{g_2}^B = \lambda_r^B$ 

Observe that one input of each join in M is a token input i.e. the joins are restricted. We have the following relations between the outputs and the functional inputs:

$$\lambda_{f_1}^A = \lambda_{g_1}^A \qquad \quad \lambda_{f_1}^B = \lambda_{g_1}^B \qquad \quad \lambda_{f_2}^A = \lambda_{g_2}^A \qquad \quad \lambda_{f_2}^B = \lambda_{g_2}^B$$

and the following relations between the token inputs and the outputs:

$$\lambda_{e_1} = \lambda_{q_1}^A + \lambda_{q_1}^B \qquad \lambda_{e_2} = \lambda_{q_2}^A + \lambda_{q_2}^B$$

Each source however generates only one type of packet. Therefore we can set the other  $\lambda$  variable on the output channel to zero i.e.  $\lambda_{f_1}^B=0$  and  $\lambda_{f_2}^A=0$ .

All the other components only interface with channels carrying single flows, and we add equations as in the credit example. Finally, as before, by eliminating the  $\lambda$  variables using Gaussian elimination, we obtain the desired relations among the num variables.

### 5.3 Algorithm for Discovering Flow Invariants

Formally, if x is a channel that has type  $\alpha$ , a flow on x is a function  $p : \alpha \to \{0, 1\}$ . (Note the similarity with channel properties.) Our goal is to compute the set of flows for each channel and the equations relating the  $\lambda$  variables for these flows.

- Step 1. Sort the xmas graph in reverse "topological" order starting from the sinks using the textbook depth-first-search (DFS) based topological sort algorithm [6, §22.4]. If the xmas network is cyclic, this has the effect of topologically sorting the DAG obtained by deleting the backedges in the DFS.
- **Step 2.** Assign the constant 1 function as the flow on the inputs to the sinks and on the backedge channels. Now we process each component in the network in the reverse "topological" order computed above by applying the following rules to propagate flows (we use the same parameter names as in Figure 2 and use port names to refer to the corresponding channels):

Queue. For each flow p on the output channel o, we create a new flow p on the input channel i. We also add a new state variable called num<sup>p</sup> to the queue that tracks how many elements satisfying p are currently in the queue. We also add an assertion that equates num<sup>p</sup> to the number of elements that satisfy (used<sub>j</sub>  $\implies p(\text{mem}_j)$ ) in the queue. Finally, we add the equation  $\lambda_i^p = \text{num}^p + \lambda_o^p$ .

Function. For each flow p on the output channel o, we create a new flow  $p' = p \circ f$  on the input channel i and add the equation  $\lambda_i^{p'} = \lambda_o^p$ .

Switch. For each flow p on the output channel a, we create a flows  $p' = v \mapsto (s(v) \operatorname{and} p(v))$  on the input channel i and add the equation  $\lambda_i^{p'} = \lambda_a^p$ . Similarly for flows on output b.

Merge. For each flow p on the output channel o, we create a flow p on input a and another flow p on input b and add the equation  $\lambda_a^p + \lambda_b^p = \lambda_o^p$ .

Fork. For each pair (p,q) where p is a flow on output a and q on output b, we create a new flow  $r = v \mapsto (p(f(v)) \operatorname{and} q(g(v)))$  on input. For each flow p on output a, we add the equation  $\lambda_a^p = \Sigma_r \lambda_i^r$  where r ranges over flows that were added to i due to p. Similar equations are added for each flow on b.

Join. Once again, we limit our attention to restricted joins. For each flow p on the output channel o, we add a flow  $p'=p\circ h$  to the functional input (suppose it is the input a). We add the constant 1 flow to the other input (i.e. b) and the equations  $\lambda_a^{p'}=\lambda_o^p$  and  $\lambda_b^1=\Sigma_p\lambda_o^p$  where p ranges over all the flows on o.

Source. For each flow p in the output o, we check if p(e) is true or not. If p(e) is false, then we add the equation  $\lambda_o^p = 0$  and mark p as dead.

During the above process, each time a new flow is created, we record its parent(s). Furthermore, if a new flow is unsatisfiable i.e. the constant 0 function we mark it dead and do not propagate it further.

**Step 3.** If the xMAS DAG is cyclic, for each channel x that is a backedge, the above propagation process adds new flows. These need to be related to the constant 1 flow which was assigned before starting propagation. Therefore we add  $\lambda_x^1 = \Sigma_p \lambda_x^p$  where p ranges over all the flows added to x during propagation.

**Step 4.** If all children of a flow p at a channel x are dead, we mark x as dead as well and add the equation  $\lambda_x^p = 0$ . We repeat this process until no new flows can be marked dead.

**Theorem 2** (Inductivity). The set of equations obtained by this process is an inductive invariant of the synchronous model.

**Step 5.** Finally, the  $\lambda$  variables are eliminated as explained before to obtain relations between the num variables. Note that it is possible that there are no relations among the num variables (e.g. Figure 1).

Remark. Since the  $\lambda$  variables correspond to channels which hold no state, we conjecture that eliminating them does not destroy inductivity. This has been confirmed by our experiments.

# 6 Experimental Results

#### 6.1 Micro-benchmarks

Since the state-of-the-art model checking algorithms are unable to converge on any of our real examples, we present a comparison on the small examples from this paper. Table 1 shows the results of running ABC (version 91206p) on several examples (parameterized on k) without the addition of invariants as described in this paper.

The first example is from Figure 1 where each queue is of size k. In the second example we have a series of k queues (similar to Figure 1). In the third we check the property in the example of Section 4.5, but to make the example more realistic we set the the source and destination to be 2 bits wide in the packet. Fourth and fifth are self-explanatory. In the last example we add k queues on the channel r in Figure 5.

Column i in the table is the number of primary inputs (oracles); r and n are number of registers and AIG nodes (after synthesis). A depth of (m, n) means

**Table 1.** Comparison with interpolation on micro-benchmarks. See Section 6.1 for details. Most rows have two data points corresponding to different values of the parameter k of the corresponding example.

| Description                         | k | i | r  | n   | depth   | $_{\rm time}$ | k  | i | r  | n   | depth  | time |
|-------------------------------------|---|---|----|-----|---------|---------------|----|---|----|-----|--------|------|
| Two queues of size $k$              | 3 | 2 | 49 | 348 | (5, 12) | 23            | 4  | 2 | 63 | 381 | BMC 24 | _    |
| k queues of size 2                  | 3 | 2 | 49 | 302 | (9, 12) | 76            | 4  | 2 | 64 | 396 | BMC 20 | _    |
| Figure 3 with all queues sized to 2 |   |   |    |     |         |               | -  | 8 | 99 | 659 | BMC 11 | _    |
| Figure 4                            | 8 | 4 | 14 | 104 | (13, 9) | 38            | 12 | 4 | 14 | 121 | BMC 27 | _    |
| Figure 5 with all queues sized to 2 |   |   |    |     |         |               | -  | 4 | 17 | 95  | (7, 4) | 40   |
| above with $k$ queues on $r$        | 1 | 4 | 24 | 135 | (6, 7)  | 112           | 2  | 4 | 29 | 151 | BMC 13 | _    |

interpolation converged in n iterations when starting from a BMC of depth 1+m. The time is in seconds (on a 3GHz Intel Xeon CPU) with a timeout of 300 secs indicated by a dash (and we show the final BMC depth in the previous column). Note that interpolation times out on many examples.

The first three rows correspond to examples for channel propagation. In all cases when we add the invariants as described in Section 4, ABC is able to solve the problem in no time. Even if we set k = 100, the first example is solved in 7 seconds, the second in 1 second and the third in 40 seconds.

The remaining rows correspond to examples for flow invariants. Again without the flow invariants, interpolation has a hard time. However, in these examples we found that BDD-based reachability could solve these quickly. In all cases in our experience, the algorithm for discovering flow invariants finds exactly those invariants that are interesting. For example, in the fifth example, there are initially 43 variables and 32 equations. After elimination, we are left with two invariants (with 6 terms) corresponding to the two credit loops as expected. The time needed for both property propagation and for detecting flow invariants is neglible.

### 6.2 Experience on Real Examples

We have applied the techniques described in this paper to verify a number of abstract models used to validate the microarchitecture of future designs. These are drawn from the domain of communication fabrics and are characterized by deeply pipelined logic for multi-phase transactions, presence of ordering logic and several virtual channels, and peer-to-peer traffic. Even in minimal configurations, there are tens of simultaneous transactions in flight.

As a data point, previously on one of our simpler examples we were able to obtain a proof of a critical non-blocking property,<sup>7</sup> only by severely limiting the state space by reducing the number of simultaneous outstanding transactions an agent can issue. The proof was obtained with an explicit state model checker with maximal reachability depth of 159 in 12 hours using 17GB of memory. In contrast, using the flow analysis from Section 5 on the *original* model, 16 flow relations are discovered (from an initial set of 176 equations on 220 variables) and ABC solves the resulting problem in 4.5 sec.

A big advantage of this technique is its robustness and scalability. Rather than be limited to minimal configurations (and consequently reduced concurrency), we can now verify more realistic models. Channel property verification is robust and most properties are discharged automatically. For a few properties we need to add additional channel properties to break loops. However, these invariants are natural and easy to add since they only talk about data and do not involve control at all. Finally, although it may appear that flow invariants could lead to scalability problems, so far we have not encountered any problems, even on our larger examples with dozens of flow invariants, many with tens of terms. For such problems, an inductive engine that assumes all invariants in one cycle and then checks each invariant separately in the following cycle appears to be scalable.

<sup>&</sup>lt;sup>7</sup> To check for adequate buffering to avoid deadlocks.

#### 7 Conclusion and Future Work

The concrete proposals for capturing and exploiting high-level information in hardware models presented in this work have proved very useful in practice allowing us to prove with little computational effort many hard sequential properties on real microarchitectural models which could not be proved before. The benefit seems to be in separating control from data and exploiting knowledge of the control to reduce the problem to a combinational one on the data.

The invariants we add may be seen as providing a *bag* abstraction for queues. Although the bag abstraction has proven adequate to handle our current examples (systems with restricted joins), it appears insufficient to handle systems with general joins. It would be interesting to extend the methods of this paper to reason about such systems.

Finally, a lot of the computational overhead of verifying the synchronous model may be eliminated by switching to an axiomatic semantics for xmas models for a more direct verification. This may also be an interesting direction for building bridges to the RTL implementation.

# Acknowledgments

We thank Amit Goel, Alexander Gotmanov, Chandramouli Kashyap, Umit Ogras and Sayak Ray for many helpful discussions on this work and Jesse Bingham for reviewing an early draft.

#### References

- 1. Baumgartner, J., et al.: Scalable conditional equivalence checking: An automated invariant-generation based approach. In: FMCAD 2009, pp. 120–127 (2009)
- Benveniste, A., et al.: The synchronous language twelve years later. Proc. of the IEEE 91(1), 64–83 (2003)
- 3. Berkeley Logic Synthesis Group, http://www.eecs.berkeley.edu/~alanmi/abc/
- 4. Chatterjee, S., Kishinevsky, M., Ogras, U.Y.: Quick formal modeling of communication fabrics to enable verification. In: HLDVT 2010 (to appear, 2010)
- 5. Colom, J.M., Silva, M.: Convex geometry and semiflows in P/T nets. In: Proc. of Appl. and Theory of Petri Nets, 79–112 (1991)
- Corman, T.H., et al.: Introduction to Algorithms, 2nd edn. MIT Press, Cambridge (1990)
- 7. Dally, W.J., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann, San Francisco (2004)
- 8. Hoare, C.A.R.: An axiomatic basis for computer programming. Comm. of the ACM 12(10), 576580–576583 (1969)
- Jhala, R., McMillan, K.L.: Microarchitecture Verification by Compositional Model Checking. In: Berry, G., Comon, H., Finkel, A. (eds.) CAV 2001. LNCS, vol. 2102, pp. 396–410. Springer, Heidelberg (2001)
- Kaivola, R., et al.: Replacing Testing with Formal Verification in Intel Core i7 Processor Execution Engine Validation. In: Bouajjani, A., Maler, O. (eds.) CAV 2009. LNCS, vol. 5643, pp. 414–429. Springer, Heidelberg (2009)