1 Introduction

Model checking is an automatic technique for verifying finite-state concurrent systems: a finite state machine describes the system under consideration and temporal logic states the properties that the system must satisfy. This method has been used successfully in practice to verify complex software and hardware systems [1, 2]. However, efficient verification of parameterized cache coherence protocols remains one of the most challenging problems in the verification domain today. First, parameterized systems are composed of an arbitrary number of cooperating concurrent processes (the number of processes is called the system parameter). The behavior of one process is determined not only by its current state, but also by changes in the environment in which it lives. Second, parameterized systems are by nature unbounded. The system parameter may be arbitrarily large, and the ultimate goal is to validate the properties for every possible number of processes. In such cases, the number of global states can be enormous, resulting in state space explosion. Formal verification of parameterized systems is known to be undecidable in general and thus cannot be fully automated. Third, symbolic methods based on BDDs or SAT, which enable scalable formal verification elsewhere, can be ineffective for cache coherence protocols because most of the state variables are relevant to the protocol properties being verified. As faster and larger systems are designed, the complexity of cache protocols will continue to increase.

Fong Pong [3] presented a comprehensive survey of approaches to the verification of cache coherence protocols based on state enumeration, model checking, and symbolic state models. He pointed out that no framework had been proposed so far to deal with the memory consistency model in the context of formal verification based on state expansion. Monolithic formal verification methods that treat the protocol as a whole have been used fairly routinely to verify cache coherence protocols since the early 1990s [4, 5]. However, these monolithic techniques cannot handle the very large state space of parameterized protocols. While techniques such as indexed predicates [6], counter abstraction [7], environment abstraction [8, 9], and cutoff-based approaches [10] have been proposed for parameterized protocol verification in recent years, none of them scales well to large protocols, and those that do scale require an inordinate amount of manual effort to succeed [11]. We are not aware of any published work that reports formal verification of a parameterized cache coherence protocol of realistic complexity.

All successful applications of model checking thus far have made use of domain-specific abstraction techniques. Continuing this trend and drawing inspiration from recent work such as environment abstraction [8, 9], we exploit domain knowledge about parameterized systems to devise an appropriate abstraction method. We propose a novel generic approach called two-dimensional abstraction (TDA), which effectively reduces the state space of parameterized systems. In our work, the size of the state transition graph of each process is first reduced independently; the whole system composed of the reduced processes is then abstracted based on the design principles of parameterized systems, thus avoiding the construction of the complete state space, which might be too large to fit into memory.

TDA has a number of advantages over other approaches. First, TDA abstracts away redundant information from a concrete system via decomposition–abstraction–composition–reabstraction, thus effectively alleviating the state explosion problem in the verification of parameterized systems. Second, TDA can be used for parallel systems in the usual fashion because it places no restriction on the communication mode among processes. Third, TDA can be used with any model checker; the freedom to choose model checkers is important in practice. Fourth, TDA is sound and complete, and we give complete soundness and completeness proofs for our method. Finally, a constant number of heterogeneous processes and infinite state systems are allowed, which makes TDA suitable for large-scale heterogeneous systems. We demonstrate the power of our method by applying it to various well-known classes of protocols.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 gives some background information. In Sect. 4, we propose a model with true concurrency semantics for parameterized systems. In Sect. 5, we present the concepts of a TDA model and the method for constructing one. A cache coherence protocol based on MESI is used in Sect. 6 to illustrate how TDA yields a much smaller state space. Experimental results on various well-known protocols and an industrial application are presented in Sect. 7. Section 8 presents concluding remarks.

2 Related work

The development of effective techniques for checking parameterized systems is one of the most challenging problems in verification today. Prior research in the area of coherence protocol verification has ranged from simulation to formal methods. These techniques have had varying degrees of success, but few of them have been applied to a large industrial-strength protocol like FLASH.

Simulation with random or directed stimulus has been shown to be effective at finding most protocol errors [12]. However, simulation tends not to be effective at uncovering subtle bugs, especially those related to the consistency model. Subtle consistency bugs often occur only under unusual combinations of circumstances, and it is unlikely that simulation will drive the protocol to these situations.

For verification of high level specifications, modern industrial practice consists of modeling small instances of the protocols in guard/action languages such as Murphi [13] or TLA+ [14], and exploring the reachable states through explicit state enumeration.

The idea of using non-interference lemmas for parameterized model checking is attributed to McMillan [15], Chou [16], and Li [17]; it is also called the CMP method. The CMP approach to parameterized verification is a combination of data type reduction and compositional reasoning. In this approach, a model checker is used as a proof assistant and the user guides the proof by supplying invariants or non-interference lemmas. Similar reasoning has been applied by Chen to verify non-parameterized hierarchical protocols [18]. The compositional method of McMillan handles infinite state systems, including directory-based protocols, through compositional reasoning. This technique, which requires user intervention at various stages, has been applied to verify safety and liveness properties of the FLASH protocol. Chou [16] presented a method along similar lines that was used to verify safety of the FLASH and GERMAN protocols. Krstic [19] gave a formalization of the method. The CMP method scales well; as far as we are aware, it is one of the few methods that handle the full complexity of the FLASH protocol. Intel used CMP to verify an industrial-strength cache protocol several orders of magnitude larger than even the FLASH protocol [20]. Talupur and Tuttle showed how to derive high-quality invariants from message flows and how to use these invariants to accelerate the CMP method [21, 22]. A message flow is a sequence of messages sent among processors during the execution of a protocol. The hardest part of using CMP is finding a set of protocol invariants that enable it to work: the user bears the burden of coming up with non-interference lemmas, which can be non-trivial and requires deep understanding of the protocol under verification.

Another effective method for parameterized verification is the abstraction approach [6–9, 11, 23–25]. Predicate abstraction, first proposed by Graf [11] as a special case of the general framework of abstract interpretation, has been used in the verification of parameterized protocols. In predicate abstraction, a finite set of predicates is defined over the concrete set of states. These predicates are used to construct a finite state abstraction of a concrete system. The automation in generating the finite abstract model makes this scheme attractive for combining deductive and algorithmic approaches to infinite state verification. Lahiri [26] proposed the use of a symbolic decision procedure and its application to predicate abstraction. One of the main problems with predicate abstraction is that it typically requires a large number of theorem prover calls when computing the abstract transition relation or the abstract state space. Pnueli [23] presented the method of invisible invariants, which combines a small-model theorem with heuristics to generate proofs of correctness of parameterized systems. Wang [24] used monotonic abstraction to provide an over-approximation of the transition system induced by a parameterized system; the over-approximation yields a transition system that is monotonic with respect to a well quasi-ordering on the set of configurations. Timm [25] presented an approach combining symmetry arguments with spotlight abstraction: the technique determines (the size of) a particular instantiation of the parameterized system from the given temporal logic formula and feeds it into an abstracting model checker. Environment abstraction [8, 9] exploits the replicated structure of a parameterized system to ease its verification, converting the unbounded system into a bounded one via a finite state description. In real cache coherence protocols, however, the internal state of each cache can be quite complex, and thus environment abstraction might fail. Another approach is divide-and-conquer: abstraction for each process is performed independently before the model of the whole system is constructed [27]. Unfortunately, the many constraints imposed on the systems under consideration make this approach impractical.

Other related work includes that of Pandav [28], who proposed a set of heuristics to aid in constructing invariants for cache protocols. Delzanno [29] used arithmetic constraints to model possibly infinite sets of global states of a multi-processor system with many identical caches; general purpose symbolic model checkers for infinite-state systems working over arithmetical domains were used. Delzanno and Bultan [30, 31] described a constraint-based verification method for handling safety and liveness properties of the GERMAN protocol, but their method cannot verify single-indexed liveness properties. Emerson and Kahlon [32] verified GERMAN by first reducing it to a snoopy bus protocol and then invoking a theorem asserting that if a snoopy bus protocol of a certain form is correct for 7 nodes then it is correct for any number of nodes. Pnueli proposed an elegant cutoff method that can verify the DIR protocol [10], but it is sound without being complete and works only for safety properties. A broad technique was proposed for the verification of WSIS systems that can handle the DIR protocol as an example [33]; again, the resulting technique is sound but not complete.

3 Preliminaries

This section contains basic material about Kripke structures, temporal logic, and equivalence relations on Kripke structures [34].

Definition 1

(Kripke structure)

Let AP be a set of atomic propositions. A Kripke structure M over AP is a five-tuple M=(AP,S,I,R,L) where

  1. S is a finite set of states.

  2. I ⊆ S is the set of initial states.

  3. R ⊆ S × S is a transition relation that must be total, that is, for every state s ∈ S there is a state s′ ∈ S such that R(s,s′).

  4. L : S → 2^AP is a function that labels each state with the set of atomic propositions true in that state.
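To fix intuition, the following Python sketch encodes a small Kripke structure along the lines of Definition 1; the class and field names are our own (hypothetical) choices and are not part of the formal development.

```python
# Minimal sketch of Definition 1, assuming states are hashable values and
# labels are frozensets of atomic-proposition names (hypothetical encoding).
from dataclasses import dataclass

@dataclass
class Kripke:
    AP: frozenset   # atomic propositions
    S: frozenset    # finite set of states
    I: frozenset    # initial states, I ⊆ S
    R: frozenset    # transition relation, R ⊆ S × S, required to be total
    L: dict         # labeling: state -> set of propositions true in that state

    def check_well_formed(self):
        assert self.I <= self.S
        assert all(s in self.S and t in self.S for (s, t) in self.R)
        assert all(any(u == s for (u, _) in self.R) for s in self.S)  # totality
        assert all(self.L[s] <= self.AP for s in self.S)

# Example: a flag that may be raised and then stays raised.
M = Kripke(AP=frozenset({"flag"}),
           S=frozenset({"off", "on"}),
           I=frozenset({"off"}),
           R=frozenset({("off", "off"), ("off", "on"), ("on", "on")}),
           L={"off": frozenset(), "on": frozenset({"flag"})})
M.check_well_formed()
```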

Temporal logic is used to specify properties of Kripke structures. CTL*, a powerful logic, describes properties of computation trees. A tree is formed by designating a state in a Kripke structure as the initial state and then unwinding the structure into an infinite tree with the designated state at the root. In CTL*, formulas are composed of path quantifiers and temporal operators. The path quantifiers are used to describe the branching structure in the computation tree. There are two such quantifiers: A (for all computation paths) and E (for some computation path). The temporal operators X (next time), F (in the future), G (always), U (until), and R (release) describe properties of a path through the tree.

There are two types of formulas in CTL*: state formulas, which are true in a specific state, and path formulas, which are true along a specific path. Let AP be the set of atomic propositions; the syntax of CTL* is given by the following rules:

  1. If p ∈ AP, then p is a state formula.

  2. If f and g are state formulas, then ¬f, f ∧ g, and f ∨ g are state formulas.

  3. If f is a path formula, then E f and A f are state formulas.

  4. If f is a state formula, then f is also a path formula.

  5. If f and g are path formulas, then ¬f, f ∧ g, f ∨ g, X f, F f, G f, f U g, and f R g are path formulas.

Let M be a Kripke structure over AP. A path in M from a state s is an infinite sequence of states π = s_0 s_1 s_2 ⋯ such that s_0 = s and R(s_i, s_{i+1}) holds for all i ≥ 0. We use π^i to denote the suffix of π starting at s_i.

The restriction of CTL* to the universal path quantifier A is called ACTL*.

Simulation restricts the logic and relaxes the requirement that the two structures satisfy exactly the same formulas, which allows a much greater reduction.

Definition 2

(Simulation relation)

Given two structures M and M′ with AP′ ⊆ AP, a relation H ⊆ S × S′ is a simulation relation between M and M′ if and only if for all s and s′, if H(s,s′) then the following conditions hold:

  1. L(s) ∩ AP′ = L′(s′).

  2. For every state s_1 such that R(s,s_1), there is a state \(s_{1}^{\prime}\) with the property that \(R^{\prime}(s^{\prime},s_{1}^{\prime})\) and \(H(s_{1},s_{1}^{\prime})\).

If there exists a simulation relation H such that for every initial state s_0 in M there is an initial state \(s_{0}^{\prime}\) in M′ for which \(H(s_{0},s_{0}^{\prime})\), we say that M′ simulates M (denoted by M ≼ M′).
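The check required by Definition 2 is mechanical; the sketch below (our own illustration, with structures encoded as successor dictionaries and label dictionaries) tests whether a candidate relation H is a simulation.

```python
# Hypothetical check of Definition 2: R1/L1 describe M, R2/L2 describe M',
# AP2 is AP', and H is a set of (s, s') pairs.
def is_simulation(H, R1, L1, R2, L2, AP2):
    for (s, s_prime) in H:
        # Condition 1: labels agree on the smaller proposition set AP'.
        if L1[s] & AP2 != L2[s_prime]:
            return False
        # Condition 2: every move of M can be matched by M'.
        for s1 in R1.get(s, set()):
            if not any((s1, t) in H for t in R2.get(s_prime, set())):
                return False
    return True

# Tiny example: a concrete four-state counter simulated by its parity.
R1 = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
L1 = {0: {"even"}, 1: {"odd"}, 2: {"even"}, 3: {"odd"}}
R2 = {"e": {"o"}, "o": {"e"}}
L2 = {"e": {"even"}, "o": {"odd"}}
H = {(0, "e"), (2, "e"), (1, "o"), (3, "o")}
assert is_simulation(H, R1, L1, R2, L2, AP2={"even", "odd"})
```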

4 Modeling parameterized systems

States of each process in a parameterized system are considered as interpretations over a finite variable set V. A subset V^e ⊆ V, called the external variable set, is used by the process to communicate with the environment consisting of the other processes. The set V^i = V ∖ V^e is the internal variable set. Obviously, the environment may update only external variables, whereas the process may update all the variables. Such processes are modeled by Kripke structures, which describe a class of finite state systems with first-order logic propositions. A complex parameterized system is modeled as a composition of such smaller processes when the following conditions are met.

Definition 3

(Compatible structure)

Let M_1=(AP_1,S_1,I_1,R_1,L_1) and M_2=(AP_2,S_2,I_2,R_2,L_2) be two Kripke structures with state variable sets V_1 and V_2, respectively. If \(V_{1}^{i} \cap V_{2}^{i}= \varnothing\) and \(V_{1}^{e}=V_{2}^{e}\) hold, then M_1 and M_2 are compatible structures. The former condition states that each internal variable is owned by only one process; the latter requires the external variables to be shared by both processes.

Definition 4

(Compatible state)

Let M_1=(AP_1,S_1,I_1,R_1,L_1) and M_2=(AP_2,S_2,I_2,R_2,L_2) be two compatible structures. If L_1(s_1) ∩ AP_2 = L_2(s_2) ∩ AP_1 holds, then s_1 ∈ S_1 and s_2 ∈ S_2 are compatible. Compatible states agree on the external variables as well as on the common atomic propositions.

Processes communicate with each other in synchronous or asynchronous mode. In the synchronous execution mode, all processes execute their transitions at the same time, whereas in the asynchronous execution mode, process state transitions are independent of each other: the system evolves by interleaving the evolution of its processes, and at each execution cycle only one process is chosen to perform a transition. However, parameterized systems in which different processes may change their states at the same time are very common in reality. There is no order between such transitions, which preserves the true meaning of concurrency. We call this communication mode asynchronous composition with true concurrency semantics. From the viewpoint of computer science, it is more interesting to investigate asynchronous products of Kripke structures with true concurrency semantics. We propose a formal model with true concurrency semantics for parameterized systems, which is more suitable for describing concurrent systems in the usual fashion.

Definition 5

(Asynchronous composition with true concurrency semantics)

Let M_k=(AP_k,S_k,I_k,R_k,L_k) be the kth (1 ≤ k ≤ n) Kripke structure among n pairwise compatible structures. Their asynchronous composition with true concurrency semantics,

$$M={\overset{n}{\underset{k=1}{\sideset{}{_a}\prod}}}M_k=(\mathit{AP},S,I,R,L)$$

is defined to be:

  1. \(\mathit{AP}={{\bigcup}^{n}_{k=1} \mathit{AP}_{k}}\).

  2. \(S=\{\langle s_{1},s_{2},\ldots,s_{n}\rangle \mid s_{k}\in S_{k}\ (1 \le k \le n)\mbox{\textit{~are~compatible~states}}\} \subseteq{\prod}^{n}_{k=1}{S_{k}}\).

  3. \(I=\{\langle s_{1},s_{2},\ldots,s_{n}\rangle \mid {\bigwedge^{n}_{k=1}s_{k}}\in I_{k}\}\subseteq S\).

  4. \(R=\{(\langle s_{1,i},s_{2,i},\ldots,s_{n,i}\rangle, \langle s_{1,i+1},s_{2,i+1},\ldots,s_{n,i+1}\rangle) \mid \exists j,\ 1\le j\le n,\ (s_{j,i},s_{j,i+1})\in R_{j}\}\).

  5. \(L(\langle s_{1},s_{2},\ldots,s_{n}\rangle)={\bigcup}^{n}_{k=1}L_{k}(s_{k})\).
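A compact sketch of Definition 5 is given below: the composed state set is the compatibility-filtered Cartesian product, and a composed transition is added whenever at least one component takes one of its own transitions. The tuple encoding of the components is an assumption of the sketch.

```python
from itertools import product

def compose(components):
    """components: list of (S_k, I_k, R_k, L_k, AP_k) tuples (hypothetical encoding)."""
    APs = [ap for (_, _, _, _, ap) in components]

    def compatible(vec):
        # Definition 4: pairwise agreement on the common atomic propositions.
        for i in range(len(vec)):
            for j in range(i + 1, len(vec)):
                Li = components[i][3][vec[i]]
                Lj = components[j][3][vec[j]]
                if Li & APs[j] != Lj & APs[i]:
                    return False
        return True

    S = [vec for vec in product(*(c[0] for c in components)) if compatible(vec)]
    I = [vec for vec in S if all(vec[k] in components[k][1] for k in range(len(vec)))]
    # Definition 5, item 4: at least one component takes one of its transitions.
    R = [(u, v) for u in S for v in S
         if any((u[k], v[k]) in components[k][2] for k in range(len(u)))]
    L = {vec: frozenset().union(*(components[k][3][vec[k]] for k in range(len(vec))))
         for vec in S}
    return S, I, R, L
```

Applied to three copies of the y-abstract MESI cache of Sect. 6, a construction of this kind keeps only the compatible state vectors, which is how the 14 valid states of Fig. 4 arise.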

Theorem 1

The asynchronous composition operator with true concurrency semantics, ∏ a , is commutative and associative.

Proof

By Definition 5, the set of atomic propositions of the composition is the union of the component sets of atomic propositions, and so is the labeling. States of the composition are vectors of compatible component states, and they are elements of the Cartesian product of the component state sets. Each transition of the composition involves a transition of at least one of the n components. Because union and product of sets are commutative and associative, the asynchronous composition operator with true concurrency semantics is also commutative and associative. □

5 Two-dimensional abstraction

We now use the two-dimensional graph shown in Fig. 1 to describe the state space of parameterized systems, where the x axis denotes the system parameter n and the y axis denotes the state space size m of each process. To simplify the presentation, we suppose that all processes are identical. Since the full cross-product of the process states must be considered in the global system at each step, the result of the asynchronous composition with true concurrency semantics is very large, in the worst case m^n. Too many reachable states impede automatic verification in many practical cases. The two-dimensional abstraction technique proposed in this paper is specifically tailored for parameterized systems with true concurrency semantics and helps avoid the problem of state explosion.

Fig. 1 State space of parameterized systems

Definition 6

(Two-dimensional abstraction) For asynchronous concurrent parameterized systems with true concurrency semantics, two-dimensional abstraction is the process of constructing an abstract model by first reducing the state space of each process independently along the y axis in order to reduce m, and then hiding the system parameter n along the x axis based on the design principles of parameterized systems. The former step is called y-abstraction and the latter x-abstraction. The corresponding reduced results are called the y-abstract model and the TDA model, respectively.

The selection of an equivalence relation between a TDA model and a concrete system is of prime importance for the successful application of TDA in practice. The simulation relation [35] yields a greater reduction of the number of states by restricting the logic and relaxing the requirement that the two structures satisfy exactly the same set of formulas. Given two Kripke structures M_1=(AP_1,S_1,I_1,R_1,L_1) and M_2=(AP_2,S_2,I_2,R_2,L_2) with AP_2 ⊆ AP_1, if there exists a simulation relation H such that for every initial state s_{10} ∈ I_1 in M_1 there is an initial state s_{20} ∈ I_2 in M_2 for which H(s_{10},s_{20}), we say that M_2 simulates M_1 and denote it by M_1 ≼ M_2. Intuitively, for every transition in M_1 there is a corresponding transition in M_2.

In the following sections, PS^c(n) refers to the concrete model of asynchronous concurrent parameterized systems with true concurrency semantics consisting of n concrete processes. PS^y(n) is the y-abstract model of PS^c(n) and PS^t(n) is its TDA model.

5.1 y-Abstraction

The y-abstraction deals with each concrete process independently in order to abstract away the information that is irrelevant to the system properties. Any property-preserving abstraction method can be used. We construct a finite predicate set Φ={φ_1,φ_2,…,φ_r} from the properties and the system description, and build the y-abstract model using basic predicate abstraction.

The predicate set Φ defines an equivalence relation on \(S_{k}^{c}\), the set of states of \(M_{k}^{c}=(\mathit{AP}_{k}^{c},S_{k}^{c},I_{k}^{c},R_{k}^{c},L_{k}^{c})\ (1 \le k \le n)\), and each equivalence class is denoted by an abstract state. Each concrete state is labeled with the predicate formula satisfied in that state; in other words, the labeling function \(L_{k}^{c}\) maps a concrete state into a set of predicates. The set of states \(S_{k}^{y}\) of the y-abstract model \(M_{k}^{y}\) is a set of boolean expressions over b_1, b_2, …, b_r, where b_j (1 ≤ j ≤ r) corresponds to predicate φ_j; a y-abstract state is a truth assignment to these r boolean variables. The labeling function \(L_{k}^{y}\) maps a y-abstract state into a boolean expression. The abstraction operator \(H_{k}^{cy}\) determines the relationship between concrete states and abstract states. The transition relation \(R_{k}^{y}\) of the y-abstract model \(M_{k}^{y}\) is built from the concrete transition relation \(R_{k}^{c}\) in the same way as introduced by Graf and Saidi [11]. From the above definitions, we can conclude that \(H_{k}^{cy} \subseteq S_{k}^{c} \times S_{k}^{y}\) is a simulation relation between \(M_{k}^{c}\) and \(M_{k}^{y}\), so the following theorem holds.

Theorem 2

\(M_{k}^{c} \preccurlyeq M_{k}^{y}\ (1 \le k \le n)\).

Proof

The proof is given in [11]. □
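As an informal illustration of the y-abstraction of a single process, the following sketch groups concrete states by the truth assignment of the predicates in Φ and lifts the transition relation existentially, in the spirit of [11]; the encoding and the toy example are our own assumptions.

```python
def y_abstract(states, transitions, predicates):
    """predicates: list of boolean functions over concrete states (hypothetical)."""
    # Abstraction operator H^cy: a concrete state maps to the bit vector of
    # predicate truth values, which is its y-abstract state.
    def alpha(s):
        return tuple(bool(p(s)) for p in predicates)

    abs_states = {alpha(s) for s in states}
    # Existential lift of the transition relation.
    abs_transitions = {(alpha(s), alpha(t)) for (s, t) in transitions}
    return abs_states, abs_transitions

# Example: a counter 0..3; the only predicate of interest is "value is zero".
states = [0, 1, 2, 3]
transitions = [(0, 1), (1, 2), (2, 3), (3, 0)]
abs_S, abs_R = y_abstract(states, transitions, predicates=[lambda s: s == 0])
# abs_S == {(True,), (False,)}; invisible transitions such as (1, 2) become
# the self-loop ((False,), (False,)).
```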

In the following, we will demonstrate how the y-abstraction affects the parameterized concurrent systems.

Definition 7

(Visible transitions set and invisible transitions set)

Given a Kripke structure M=(AP,S,I,R,L), let AP_f be the set of atomic propositions involved in the temporal formula f. The set of visible transitions of M w.r.t. AP_f consists of the transitions that affect the truth of atomic propositions in AP_f, denoted by VTS(M,AP_f)={(s,t) | (s,t) ∈ R ∧ (L(s) ∩ AP_f ≠ L(t) ∩ AP_f)}. The set IVTS(M,AP_f)=R ∖ VTS(M,AP_f) is called the set of invisible transitions of M w.r.t. AP_f.

It is obvious that VTS(M,AP_f) and IVTS(M,AP_f) relate to the system property. Both of them satisfy VTS(M,AP_f) ∩ IVTS(M,AP_f) = ∅ and VTS(M,AP_f) ∪ IVTS(M,AP_f) = R.
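Under the same dictionary encoding as in the earlier sketches, Definition 7 can be transcribed directly:

```python
def split_transitions(R, L, AP_f):
    """Partition R into visible and invisible transitions w.r.t. AP_f."""
    vts = {(s, t) for (s, t) in R if L[s] & AP_f != L[t] & AP_f}
    ivts = set(R) - vts
    return vts, ivts
```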

For each transition \(R_{k}^{c}(s_{k}^{c},t_{k}^{c})\) of the kth process in a concrete model PS^c(n), if \(R_{k}^{c}(s_{k}^{c},t_{k}^{c}) \in \mathit{IVTS}(M_{k}^{c},\mathit{AP}_{f})\), the corresponding y-abstract transition \(R_{k}^{y}(s_{k}^{y},t_{k}^{y})\) is a self-loop in the state graph and \(M_{k}^{y}\) does not change its current state. If \(R_{k}^{c}(s_{k}^{c},t_{k}^{c}) \in \mathit{VTS}(M_{k}^{c},\mathit{AP}_{f})\), then \(R_{k}^{y}(s_{k}^{y},t_{k}^{y})\) connects two different y-abstract states in \(M_{k}^{y}\), that is, \(M_{k}^{y}\) performs a transition. Hence, all transitions in \(M_{k}^{c}\) are preserved. Figure 2 illustrates the two kinds of concrete transitions and their y-abstract counterparts. The y-abstract model PS^y(n) is therefore the asynchronous composition of \(M_{1}^{y},M_{2}^{y},\dots,M_{n}^{y}\).

Fig. 2 How y-abstraction affects transitions in asynchronous concurrent parameterized systems

Theorem 3

The asynchronous composition operator with true concurrency semantics, ∏_a, is monotonic w.r.t. ≼, that is, \(M_{k}^{c}\preccurlyeq M_{k}^{y}\ (1\le k\le n) \Rightarrow \mathit{PS}^{c}(n)\preccurlyeq \mathit{PS}^{y}(n)\).

Proof

Let \(\mathit{PS}^{c}(n)=(\mathit{AP}^{c},S^{c},I^{c},R^{c},L^{c})=\prod_{a\ k=1}^{n} M_{k}^{c}\) be an asynchronous composition with true concurrency semantics, where \(M_{k}^{c}=(\mathit{AP}_{k}^{c},S_{k}^{c},I_{k}^{c},R_{k}^{c},L_{k}^{c})\). Its y-abstract model is denoted by \(\mathit{PS}^{y}(n)=(\mathit{AP}^{y},S^{y},I^{y},R^{y},L^{y})=\prod^{n}_{a\ k=1} M_{k}^{y}\), where \(M_{k}^{y}=(\mathit{AP}_{k}^{y},S_{k}^{y},I_{k}^{y},R_{k}^{y},L_{k}^{y})\).

First of all, from Theorem 2, we have

$$ M_k^c \preccurlyeq M_k^y.$$
(1)

Therefore,

$$ \mathit{AP}_k^y \subseteq \mathit{AP}_k^c.$$
(2)

By Definition 5, it is easy to see that

$$\mathit{AP}^y=\overset{n}{\underset{k=1}{\bigcup}}\mathit{AP}_k^y \subseteq\overset {n}{\underset{k=1}{\bigcup}}\mathit{AP}_k^c = \mathit{AP}^c. $$
(3)

Note that the abstraction function \(H_{k}^{cy}\), described in Sect. 5.1, is a simulation relation between \(M_{k}^{c}\) and \(M_{k}^{y}\); hence, for every s^y in PS^y(n), the following identity holds:

$$ s^y = H^{cy}\bigl(s^c\bigr) = \bigl\langle H_1^{cy}\bigl(s_1^c\bigr), H_2^{cy}\bigl(s_2^c\bigr), \dots, H_n^{cy}\bigl(s_n^c\bigr)\bigr\rangle.$$
(4)

That is to say, a y-abstract state is obtained by applying \(H_{k}^{cy}\ (1 \le k \le n)\) to the kth element of the concrete state s^c.

Now we will show that H^{cy} ⊆ S^c × S^y is a simulation relation between PS^c(n) and PS^y(n). For every \(s^{c}=\langle s_{1a}^{c},s_{2b}^{c},\dots,s_{kl}^{c},\dots,s_{ng}^{c}\rangle \in S^{c}\), suppose that \(s^{y}=\langle s_{1a}^{y},s_{2b}^{y},\dots,s_{kl}^{y},\dots,s_{ng}^{y} \rangle \in S^{y}\) is its y-abstract state, namely, H^{cy}(s^c)=s^y; then, by Definition 2, both of the following conditions must hold:

  1. L^c(s^c) ∩ AP^y = L^y(s^y).

  2. ∀t^c (t^c ∈ S^c ∧ R^c(s^c,t^c)) ⇒ ∃t^y (t^y ∈ S^y ∧ R^y(s^y,t^y) ∧ H^{cy}(t^c,t^y)).

Proof of condition (1): L^c(s^c) ∩ AP^y = L^y(s^y).

By Definition 5, observe that

$$ L^c\bigl(s^c\bigr) \cap \mathit{AP}^y = \Biggl({\overset{n}{\underset{k=1}{\bigcup}}}L_k^c\bigl(s_{kl}^c\bigr)\Biggr) \cap \mathit{AP}^y.$$
(5)

Because

$$ \mathit{AP}^y={\overset{n}{\underset{k=1}{\bigcup}}}\mathit{AP}_k^y,$$
(6)

if we replace AP y in (5) with the right-hand side of (6), we obtain

$$ L^c\bigl(s^c\bigr) \cap \mathit{AP}^y = {\overset{n}{\underset{k=1}{\bigcup }}} {\Biggl(L_k^c\bigl(s_{kl}^c\bigr)\cap{\overset{n}{\underset{k=1}{\bigcup}}}\mathit{AP}_k^y\Biggr).}$$
(7)

Note that \(L_{k}^{c}(s_{kl}^{c})\) is a set of atomic propositions true in \(s_{kl}^{c}\), so it is only relative to \(\mathit{AP}_{k}^{c}\) and independent of \(\mathit{AP}_{j}^{c}\ (1 \le j \le n,j \neq k)\). Furthermore, \(\mathit{AP}_{k}^{y} \subseteq \mathit{AP}_{k}^{c}\). Therefore,

$$ L_k^c\bigl(s_{kl}^c\bigr)\cap{\overset{n}{\underset{j=1}{\bigcup}}}\mathit{AP}_j^y = L_k^c\bigl(s_{kl}^c\bigr)\cap \mathit{AP}_k^y = L_k^y\bigl(s_{kl}^y\bigr).$$
(8)

Substitute this item into (7) to obtain

$$L^c\bigl(s^c\bigr) \cap \mathit{AP}^y={\overset{n}{\underset{k=1}{\bigcup}}}L_k^y\bigl(s_{kl}^y\bigr)=L^y\bigl(s^y\bigr).$$
(9)

Hence, condition (1) is true.

Proof of condition (2): ∀t^c (t^c ∈ S^c ∧ R^c(s^c,t^c)) ⇒ ∃t^y (t^y ∈ S^y ∧ R^y(s^y,t^y) ∧ H^{cy}(t^c,t^y)).

For each \(t^{c}=\langle t_{1a^{\prime}}^{c},t_{2b^{\prime}}^{c},\dots,t_{kl^{\prime}}^{c},\dots,t_{ng^{\prime}}^{c}\rangle \in S^{c}\), R^c(s^c,t^c) implies that at least one component of the concrete model makes a transition. Suppose that the first k (1 ≤ k ≤ n) components make transitions while the remaining n−k components do not. There are several cases to consider.

Case 1: \(t^{c} \neq s^{c} \wedge R_{k}^{c}(s_{kl}^{c},t_{kl^{\prime}}^{c}) \in \mathit{IVTS}(M_{k}^{c},\mathit{AP}_{f})\), as represented in the middle of Fig. 2.

Because

$$M_k^c \preceq M_k^y,$$
(10)

one gets

$$ R_k^c\bigl(s_{kl}^c,t_{kl^\prime}^c\bigr) \Rightarrow\exists t_{ke^\prime}^y \in S_k^y~R_k^y\bigl(s_{ke}^y,t_{ke^\prime}^y\bigr) \wedge H_k^{cy}\bigl(t_{kl^\prime }^c,t_{ke^\prime}^y\bigr).$$
(11)

Now we construct t^y by Definition 5 as follows:

$$ t^y=\bigl\langle H_1^{cy}\bigl(t_{1a^\prime}^c\bigr),\dots,H_k^{cy}\bigl(t_{kl^\prime}^c\bigr),H_{k+1}^{cy}\bigl(s_{(k+1)r}^c\bigr),\dots,H_n^{cy}\bigl(s_{ng}^c\bigr)\bigr\rangle.$$
(12)

As the latter nk components in the concrete model do not make transitions, we obtain

$$s_{(k+1)r}^c=t_{(k+1)r^\prime}^c,\dots,s_{ng}^c=t_{ng^\prime}^c.$$
(13)

Substitute them into (12) to obtain

$$t^y=\bigl\langle H_1^{cy}\bigl(t_{1a^\prime}^c\bigr),\dots,H_k^{cy}\bigl(t_{kl^\prime }^c\bigr),H_{k+1}^{cy}\bigl(t_{(k+1)r^\prime}^c\bigr),\dots,H_n^{cy}\bigl(t_{ng^\prime}^c\bigr)\bigr \rangle.$$
(14)

This expression indicates that applying \(H_{k}^{cy}\) to the kth element of t^c yields its y-abstract state; thus, (t^c,t^y) ∈ H^{cy}.

From (11), there is at least one component of s^y and t^y that satisfies \(R_{k}^{y}(s_{ke}^{y},t_{ke^{\prime}}^{y})\), so (s^y,t^y) ∈ R^y.

The other two cases, \(t^{c} \neq s^{c} \wedge R_{k}^{c}(s_{kl}^{c},t_{kl^{\prime}}^{c})\in \mathit{VTS}(M_{k}^{c},\mathit{AP}_{f})\) and t^c = s^c, can be discussed in a similar way.

So far, both conditions (1) and (2) have been shown to hold. We conclude that H^{cy} ⊆ S^c × S^y is a simulation relation between PS^c(n) and PS^y(n). By Definition 2, for every initial state \(s_{0}^{c} \in I^{c}\) of PS^c(n) there is an initial state \(s_{0}^{y} \in I^{y}\) of PS^y(n) such that \(H^{cy}(s_{0}^{c},s_{0}^{y})\); as a consequence, the theorem is proved. □

Theorem 3 implies that ACTL* properties are weakly preserved by the y-abstract model. Applying this theorem to each kind of ACTL* formula, we get the following conclusion.

Theorem 4

For each ACTL* formula f with AP_f ⊆ AP^y, PS^y(n) ⊨ f ⇒ PS^c(n) ⊨ f.

Proof

From Theorem 3, we obtain

$$\mathit{PS}^c(n) \preceq \mathit{PS}^y(n).$$

Hence, PS^y(n) ⊨ f ⇒ PS^c(n) ⊨ f holds, as proved in [34]. □

Intuitively, this theorem is true because a formula in ACTL* describes properties that are quantified over all possible behaviors of a system. Because every behavior of PS^c(n) is matched by a behavior of PS^y(n), every ACTL* formula that is true in PS^y(n) must also be true in PS^c(n). Theorem 4 is very useful for large-scale system verification since it allows the verification to be accelerated by an exhaustive search of a smaller state space.

5.2 x-Abstraction

During the construction of parameterized systems, designers reason about correctness by focusing on the execution of one process (called the hub) and considering its interaction with the other processes (called rims; all the rims constitute the hub's environment) [8]. The x-abstraction follows this idea and produces a much smaller state space.

As described in the earlier sections, PS^y(n) is an asynchronous concurrent system with true concurrency semantics. Without loss of generality, assume that PS^y(n) contains n−1 (n>1) rims (numbered from 1 to n−1) and one hub (numbered n). We get the following identity by expanding L^y, the labeling function of PS^y(n):

$$L^y\bigl(s^y\bigr)={\overset{n}{\underset{k=1}{\bigcup}}}L_k^y\bigl(s_k^y\bigr).$$

Each \(L_{k}^{y}(s_{k}^{y})\) (1 ≤ k ≤ n) on the right-hand side of the identity is the set of labels of the corresponding rim (or of the hub), i.e., the atomic propositions that process k satisfies in the current state. These atomic propositions reflect process properties. Consequently, the object of x-abstraction is the whole parameterized system, whose properties relate to either one process or many processes.

Definition 8

(Process property)

The first-order predicate prop(k), 1 ≤ k ≤ n, indicating that the kth process has property prop, is called a process property. We use PROP(k)={prop(k)} to denote the set of all properties the kth process holds.

Given a process d, a d-label is an instance prop(d) of prop(k), meaning that process d satisfies the property prop. PROP(d)={prop(d)} is the set of all d-labels. For every s^y ∈ S^y and process d (1 ≤ d ≤ n), we have either s^y ⊨ prop(d) or s^y ⊭ prop(d). If s^y ⊨ prop(d) holds, the y-abstract state s^y has the label prop(d).

The global state label of the y-abstract model can be simplified as follows, by Definition 8:

$$ L^y=L(1)\cup\cdots \cup L(k) \cup\cdots\cup L(n)={\overset {n}{\underset{k=1}{\bigcup}}}L(k)=\bigl\{l(d),s^y \models l(d),1 \le d \le n\bigr\}.$$
(15)

It is interesting to note that the global label of the y-abstract state s^y consists of all the process properties it satisfies. Next we introduce a new notation to describe the parameterized system.

Definition 9

The first-order predicate snps(k)=prop(k) ∧ (⋀_{j≠k} prop(j)) describes not only the kth process but also its environment (comprising the jth processes, j ≠ k). snps(k) is a quite detailed picture of the global system, and the set of all snapshots is denoted SNPS={snps(k)}.

A snapshot snps(k) gives the necessary condition for an equivalence partition of PS^y(n): if there exists a process d satisfying s^y ⊨ snps(d), then snps(k) is one of the abstract states of s^y. All the y-abstract states that satisfy this condition compose an equivalence class. If snps(k) is of the form ±prop_1(k) ∧ ±prop_2(k) ∧ ⋯ ∧ ±prop_r(k), r > 1, where prop_1(k),…,prop_r(k) are r process properties and ±prop_i(k) (1 ≤ i ≤ r) indicates that prop_i(k) appears positively or negatively, then snps(k) can be expressed by a tuple 〈b_1,b_2,…,b_r〉, where b_i = 1 ⇔ snps(k) ⇒ prop_i(k). That is, the value of each bit b_i reflects the polarity of the corresponding predicate prop_i(k) in snps(k). Labeling the y-abstract states with such atomic formulas results in a much smaller state space.

In order to construct a TDA model, PROP and SNPS must meet two conditions: coverage and congruence. Coverage means that every y-abstract state is reflected by some snapshot, and congruence implies that snps(k) contains enough information about a process to conclude whether a label holds for this process or not. That is to say, for each snps(k) ∈ SNPS and each prop(k) ∈ PROP, either snps(k) → prop(k) or snps(k) → ¬prop(k) holds.

Suppose that PROP and SNPS of PS^y(n) satisfy the above conditions; then the TDA model is a Kripke structure PS^t=〈AP^t,S^t,I^t,R^t,L^t〉:

  1. AP^t is the set of atomic propositions involved in the process properties prop(k), and AP^t=AP^y according to Definition 8;

  2. S^t=SNPS is the set of abstract states: the abstraction operator α_n(s^y)={snps(k) ∈ SNPS | s^y ⊨ snps(n)} maps every y-abstract state s^y in which the hub meets the condition of snps(k) into the TDA abstract state snps(k);

  3. I^t is the set of initial abstract states: snps(k) ∈ I^t if there exist a parameterized system PS^y(n) and a y-abstract state s^y ∈ I^y such that snps(k) ∈ α_n(s^y);

  4. L^t is the labeling function: for each snps(k) ∈ S^t, L^t(snps(k))={prop(k) : snps(k) ⇒ prop(n)};

  5. R^t is the set of abstract transitions: for each snps_1(k) ∈ S^t, snps_2(k) ∈ S^t, if there exist a parameterized system PS^y(n) and two y-abstract states s^y ∈ S^y, t^y ∈ S^y that meet the condition snps_1(k) ∈ α_n(s^y) ∧ snps_2(k) ∈ α_n(t^y) ∧ (s^y,t^y) ∈ R^y, then (snps_1(k),snps_2(k)) ∈ R^t.
Each TDA abstract state is labeled with the prop(k) that process k satisfies, and after y-abstraction the number of such labels is finite; therefore, S^t is finite, too. From the theoretical perspective, TDA reduces the state space by (|S|−|S^t|)/|S|, where S is the set of asynchronous composition states defined in Definition 5. At this point, our goal of reducing the state space for parameterized verification has been achieved.
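The following sketch (with hypothetical names and encodings) summarizes the x-abstraction step: every y-abstract global state is mapped to the snapshot bit vector observed from the hub, and the transition relation is lifted existentially.

```python
def x_abstract(S_y, I_y, R_y, process_props, hub):
    """process_props: list of predicates prop_i(global_state, k) -> bool
    (hypothetical encoding); hub is the index of the hub process."""
    def alpha(s_y):
        # Snapshot seen from the hub: bit vector <b_1, ..., b_r> (cf. Definition 9).
        return tuple(int(p(s_y, hub)) for p in process_props)

    S_t = {alpha(s) for s in S_y}
    I_t = {alpha(s) for s in I_y}
    # Existential lift of the y-abstract transition relation.
    R_t = {(alpha(s), alpha(t)) for (s, t) in R_y}
    # Label each abstract state with the indices of the properties that hold.
    L_t = {a: {i for i, b in enumerate(a) if b} for a in S_t}
    return S_t, I_t, R_t, L_t
```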

Theorem 5

For a single-indexed ACTL* specification ∀x φ(x) where the atomic formulas involved in φ(x) are labels in L^t, the following holds: PS^t ⊨ ∀x φ(x) ⇒ ∀n, PS^y(n) ⊨ ∀x φ(x).

Proof

The proof is given in [36]. □

The correctness of TDA means that the TDA model weakly preserves single-indexed ACTL* specifications, which is guaranteed by Theorems 3, 4, and 5. In addition, Theorem 5 implies that TDA is sound, namely, any single-indexed ACTL* specification that holds in a TDA model also holds in a concrete model with an arbitrary number of processes. The soundness and completeness of our approach provide a solid theoretical foundation for optimizing the state space of parameterized systems.

6 An example

We show how TDA runs on a parameterized MESI protocol. The MESI protocol is a four-state write-invalidate cache coherence protocol in which every memory block can be in one of the following states: Modified, Exclusive, Shared, and Invalid [37]. Invalid means that a memory block is not present in the cache, and to load it the processor has to send a request (LD) to the main memory. Modified identifies cache lines that have been written by the corresponding processor (ST). The current version of a modified block resides in the cache and is not visible to the rest of the system at that time. The processor can perform LD, ST, and Eviction on this data. Shared is the only state that allows other valid copies of the same memory block to be stored in other caches. A processor can load from a Shared memory block or evict it without notifying other processors or the memory. Exclusive means that the processor is the one that owns the right to modify the block and the main memory is current with the contents of the cache. If one cache holds a block in the Exclusive or Modified state, all matching lines in other caches are marked Invalid.

Let PS^c(3) be a distributed shared-memory multi-processor system with three processors that ensures data consistency through a directory-based MESI protocol, considering a single memory block and a single cache line. The directory itself is a data structure whose entries record, for every block of memory, the state (i.e., the cache access permission, dirstate) and the identities of the processors that have cached that block (sharedset). Each cache tag residing in a processor includes at least three fields: memaddr, cachestate, and cachedata. From the viewpoint of each cache controller, a particular memory block can be in one of four states: MODF, EXCL, SHRD, or INVD. From the system-wide view, the state of a cache line is determined by the corresponding dirstate and cachestate. Regardless of dirstate, if the range of cachedata is restricted to [0,1], there are as many as 32 transitions in the state machine of a single processor for a single memory block, even though only 7 states are valid (shown on the left-hand side of Fig. 3). It would be very difficult to draw the state machine graph if cachedata and memaddr were allowed to take any values from their domains.

Fig. 3 MESI state machine of a single processor for a single memory block (two data values) and its y-abstract model

Now we want to verify that PS^c(3) satisfies the property that there exists a processor without a copy of a memory block when the block is shared by another processor. The first step is to simplify the MESI protocol for a single processor through y-abstraction according to Definition 6. Because the above property relates only to the state of the cache line and does not depend on its value, cachedata is redundant. The Kripke structure of the MESI protocol reduced by y-abstraction is shown on the right-hand side of Fig. 3, where states are labeled with the predicates satisfied in the current state; for example, 'M' means cachestate=MODF.

According to Definition 5, there are only 14 valid states out of a possible 4×4×4 = 64 states in PS^y(3) (shown in Fig. 4); each of them is labeled with a predicate vector of length three, whose three positions represent the predicates the current memory block satisfies in processors 1, 2, and 3, respectively. For example, 〈EII〉 implies that processor 1 owns the right to modify the memory block and the memory data is not present in the caches of processors 2 and 3; to load the memory data, both of them must issue a request to the main memory. The other states are excluded by the compatibility constraints. Take 〈MMM〉 as an example. For the particular cache line in processors 1, 2 and 3, cachestate is an internal variable, whereas dirstate and sharedset are external variables. The labels for an M state in each processor are {dirstate=M, sharedset=P1, cachestate=M}, {dirstate=M, sharedset=P2, cachestate=M}, and {dirstate=M, sharedset=P3, cachestate=M}, respectively. These M states do not agree on the external variable sharedset, so they are not compatible.

Fig. 4 y-abstract model of 3 MESI-based processors
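The compatibility pruning can be reproduced mechanically. The sketch below enumerates the 4×4×4 raw combinations and keeps those satisfying a simplified exclusivity test, which is our own shorthand for the pairwise check of Definition 4 via the directory variables.

```python
from itertools import product

# Simplified compatibility check for three y-abstract MESI caches sharing one
# block: the directory (dirstate, sharedset) forces M and E to be exclusive.
def compatible(vec):
    owners = [i for i, c in enumerate(vec) if c in ("M", "E")]
    if len(owners) > 1:
        return False                       # at most one owner
    if owners and any(c != "I" for i, c in enumerate(vec) if i != owners[0]):
        return False                       # an owner excludes all other copies
    return True

valid = [vec for vec in product("MESI", repeat=3) if compatible(vec)]
print(len(valid))   # 14, matching the valid states of Fig. 4
```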

In the second step, we use the following process property to state that the block of memory is shared by the kth processor while some other processor has no copy of it:

$$ \delta(k)=\bigl(\mathit{cachestate}[k]=S\bigr) \wedge\biggl(\underset{j \neq k}{\bigvee}\mathit{cachestate}[j]=I\biggr).$$
(16)

We define prop_1(k) and prop_2(k) by

$$ \mathit{prop}_1(k)=\bigl(\mathit{cachestate}[k]=S\bigr),$$
(17)

$$ \mathit{prop}_2(k)=\underset{j \neq k}{\bigvee}\bigl(\mathit{cachestate}[j]=I\bigr).$$
(18)

Thus

$$ \delta(k)=\mathit{prop}_1(k) \wedge \mathit{prop}_2(k).$$
(19)

Table 1 shows the partition of the state space of PS^y(3) induced by snps(2). The first column lists the equivalence classes, the second column gives the label of each equivalence class, and the last column shows its bit-vector expression. From the table we note that there are only 4 states in the TDA model, reducing the state space by 71.4 % compared with the y-abstract model. The state 〈11〉 of the resulting model means that processor 2 has a shared copy of the memory block and the memory data is not present in the cache of processor 1 and/or processor 3. Therefore, the TDA model is precise enough to prove the above system property, namely, TDA is correct.

Table 1 PS y(3) state space partition using snps(2)
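Continuing the sketch above, the partition of Table 1 can be reproduced by evaluating the snapshot of processor 2 on each of the 14 y-abstract states; prop_1 and prop_2 follow the decomposition of δ(k) in (16) under our hypothetical encoding.

```python
from itertools import product

def compatible(vec):
    owners = [i for i, c in enumerate(vec) if c in ("M", "E")]
    return len(owners) <= 1 and (not owners or
           all(c == "I" for i, c in enumerate(vec) if i != owners[0]))

valid = [vec for vec in product("MESI", repeat=3) if compatible(vec)]

hub = 1                                             # processor 2 (0-based index)
def prop_1(s, k): return s[k] == "S"                # cachestate[k] = SHRD
def prop_2(s, k): return any(s[j] == "I" for j in range(len(s)) if j != k)

partition = {}
for s in valid:
    bits = (int(prop_1(s, hub)), int(prop_2(s, hub)))
    partition.setdefault(bits, []).append(s)

print(sorted(partition))   # four snapshot bit vectors, as in Table 1
```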

Because the system parameter n is existentially quantified, a group of parameterized systems with different system parameters can be mapped to the same TDA model. To confirm this, we applied our method to several other concrete systems. As expected, at least 3 concrete systems have the same TDA model as PS^c(3). Figure 5 shows one such system.

Fig. 5 Another concrete system with 4 MESI-based processors that has the same TDA model as PS^c(3)

7 Case studies

To validate our approach, we have implemented TDA and applied it to verify several classical cache coherence protocols described in [38] as well as a hierarchical cache protocol in the FT-1000 CPU.

7.1 Protocols and properties to be verified

The classical protocols and the properties they should satisfy are introduced briefly here.

Synapse N+1

Synapse N+1 is a write-allocation protocol developed by Synapse for the N+1 computer. A cache line can be in one of three possible states: invalid (the cache has no valid data), valid (the cache has a potentially shared copy of the data), and dirty (the cache has a modified copy of the data). dirty is an exclusive state; only one cache can have a dirty line. The state changes according to write and read commands issued by the corresponding processor (for example, R_m, W) or coming from the system bus (such as \(\overline{R_{m}}\) and \(\overline{W}\)), as shown in Fig. 6, where R_h is an internal action that denotes a read hit, R_m denotes a read miss, and W denotes a write.

Fig. 6 The Synapse N+1 protocol from the perspective of cache C_i

There are two possible sources of data inconsistency for Synapse:

UNS1: a dirty cache co-exists with one or more caches in state valid;

UNS2: more than one cache is in state dirty.
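For a fixed number of caches, UNS1 and UNS2 can be checked by explicit state enumeration in a few lines. The global rules in the sketch below are a simplified reading of Fig. 6 and should be taken as assumptions of this illustration rather than a verbatim transcription of the protocol.

```python
# Explicit-state reachability check of UNS1/UNS2 for a fixed number of caches.
def successors(state):
    n = len(state)
    for i in range(n):
        # R_m: cache i misses on a read; a dirty copy elsewhere is written back
        # and invalidated (assumed rule).
        if state[i] == "invalid":
            yield tuple("valid" if j == i else
                        ("invalid" if c == "dirty" else c)
                        for j, c in enumerate(state))
        # W: cache i writes; every other copy is invalidated (assumed rule).
        # A write hit on a dirty line changes nothing and is omitted.
        if state[i] in ("invalid", "valid"):
            yield tuple("dirty" if j == i else "invalid" for j in range(n))

def reachable(n):
    init = tuple("invalid" for _ in range(n))
    seen, frontier = {init}, [init]
    while frontier:
        s = frontier.pop()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

states = reachable(4)
uns1 = any("dirty" in s and "valid" in s for s in states)
uns2 = any(sum(c == "dirty" for c in s) > 1 for s in states)
print(len(states), uns1, uns2)   # 20 states for n = 4 under these rules; both flags False
```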

Illinois

The University of Illinois protocol is a snoopy-cache, write-invalidate, write-in coherence policy. Its special feature is that caches can hold exclusive copies of data. Bus invalidation signals are sent only for writes to shared data. The memory copy is updated using a write-back policy (replacement). In addition to invalid, a cache can be in one of the following states: valid-exclusive (the cache has an exclusive copy of the data consistent with the memory, so that a modification of its content requires no bus invalidation signal), shared (the cache has a copy of the data consistent with the memory and other caches may have copies of the data), and dirty (the cache has a modified copy of the data, i.e., the data in main memory is obsolete and the content of the other caches is not valid). The transitions are given in Fig. 7; the behavior of one cache comprises the actions R_h (read hit), R_m (read miss), W_e (write in exclusive state), W_d (write in dirty state), WI (write and invalidate), and Rep (replacement with a new memory line). In this figure, P is defined as Number(dirty)=0 ∧ Number(shared)=0 ∧ Number(valid-exclusive)=0, where Number(q) denotes the number of caches in state q in the current global state.

Fig. 7 The Illinois protocol from the perspective of cache C_i

The possible sources of data inconsistency are:

UNS1: a dirty cache co-exists with caches either in state shared or valid-exclusive;

UNS2: there is more than one dirty cache.

The other possible violations of the exclusivity of state valid-exclusive are:

UNS3: there is more than one valid-exclusive cache;

UNS4: a shared cache co-exists with a cache in state valid-exclusive.

Berkeley

The Berkeley protocol is a variation of MESI with write-allocation and with a shared modified state, named owned non-exclusively. In this state, the main memory is not coherent with the possibly multiple cached copies of the owned data. The other three states are invalid, unowned (similar to the MESI Shared state), and owned exclusively (similar to the MESI Modified state). Figure 8 shows how one cache changes its state according to the different commands.

Fig. 8 The Berkeley protocol from the perspective of cache C_i

In the Berkeley protocol, we have the following sources of data inconsistency:

UNS1: an owned exclusively cache co-exists with one or more caches either in state owned non-exclusively, or unowned;

UNS2: there is more than one owned exclusively cache.

Dragon

Dragon is a write-allocation protocol that uses a signal to indicate snoop hits on the bus. The protocol has four states: shared clean (multiple clean copies may coexist), shared dirty (multiple dirty copies may coexist), valid exclusive (the cache has an exclusive clean copy), and dirty (the cache has an exclusive dirty copy). The possible transitions from the perspective of cache C_i are shown in Fig. 9, where P, Q, S, T are defined as follows:

Fig. 9 The Dragon protocol from the perspective of cache C_i

P ≡ Number(valid-exclusive)=0 ∧ Number(dirty)=0 ∧ Number(shared-dirty)=0 ∧ Number(shared-clean)=0,

Q ≡ Number(shared-dirty)+Number(shared-clean) ≥ 2,

S ≡ Number(shared-dirty)=0 ∧ Number(shared-clean)=1,

T ≡ Number(shared-dirty)=1 ∧ Number(shared-clean)=0.
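Since P, Q, S, and T depend only on how many caches are in each state, they can be evaluated by a simple tally of the global state; the sketch below is purely illustrative.

```python
from collections import Counter

def guards(global_state):
    """global_state: tuple of per-cache Dragon states (hypothetical encoding)."""
    n = Counter(global_state)
    P = (n["valid-exclusive"] == 0 and n["dirty"] == 0 and
         n["shared-dirty"] == 0 and n["shared-clean"] == 0)
    Q = n["shared-dirty"] + n["shared-clean"] >= 2
    S = n["shared-dirty"] == 0 and n["shared-clean"] == 1
    T = n["shared-dirty"] == 1 and n["shared-clean"] == 0
    return P, Q, S, T

print(guards(("invalid", "shared-clean", "shared-dirty")))  # (False, True, False, False)
```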

In the Dragon protocol, there are several possible sources of data inconsistency:

UNS1: a dirty cache co-exists with one or more caches either in state shared dirty, shared clean or valid exclusive;

UNS2: a valid exclusive cache co-exists with one or more caches either in state shared clean or shared dirty;

UNS3: there is more than one dirty cache;

UNS4: there is more than one valid exclusive cache.

7.2 Experimental results

Figures 10 and 11 present some results of these experiments.

Fig. 10 Asynchronous composition state number for different numbers of processors

Fig. 11 TDA state number against properties; UNS1–UNS4 correspond to the properties to be verified for each protocol, and AHG denotes the abstract history graph

The asynchronous composition of an n-processor system that ensures data consistency through one of these protocols is a concrete system. Figure 10 shows the number of concrete states of each protocol for different system parameters according to Definition 5. Although in the worst case the number of states in the asynchronous composition could be as large as \({\prod}^{n}_{k=1}|{S_{k}}|\), in practice it typically turns out to be much smaller, because some states, such as 〈dirty,dirty〉 in the Illinois protocol and 〈owned-exclusively,owned-exclusively〉 in the Berkeley protocol, are prohibited. As seen from the figure, as the number of processors increases (especially beyond 13 for Berkeley and Dragon, and 20 for Synapse N+1 and Illinois), the state number grows rapidly. Therefore, the largest asynchronous composition we could build comprises only 24 processors (Synapse N+1).

In Fig. 11, we plot the number of states in the TDA model of each protocol. Because the process properties used in TDA are built from predicates taken from the properties to be verified, different properties for the same protocol yield different TDA models. Two predicates, cachestate(i)=dirty/shared and Number(dirty/valid-exclusive), are enough to express these properties formally, resulting in at most 4 TDA abstract states. AHG denotes the number of reachable states in the abstract history graph described in [39], which is greater than the number of states in TDA. It is also important to notice that the number of states in the TDA model does not change with the system parameter, which is consistent with the conclusion in Sect. 6. All experiments were conducted on a PC with a 3.3 GHz Intel Core processor and 8 GB of main memory, running Red Hat Linux (6.1) and GCC (4.4.5).

7.3 Application for FT-1000 CPU

The FT-1000 CPU is a key component of the TH-1A supercomputer system [40]. It adopts a parallel system-on-chip multi-core architecture. Eight multi-threaded cores, each with a private cache hierarchy (L1 Cache), are integrated on the chip. The eight cores share a large-capacity multi-bank L2 Cache, and communication between cores is achieved through the Cache Crossbar. The Cache Ordering Unit (COU) is responsible for cache coherence and memory ordering. The L2 Cache can access off-chip high-speed DDR3 DRAM via the memory controller units (MCU). The inter-chip direct connect interface supports cache coherence packets and large-block data transfer packets, and can be used to connect 2–4 processors directly to build large-scale tightly-coupled shared-memory systems. The chip provides efficient I/O access through an integrated PCIE 2.0 standard interface. Figure 12 illustrates the architecture of the FT-1000 CPU.

Fig. 12 Architecture of FT-1000 CPU

In FT-1000 based SMP systems, a two-level hierarchical coherence protocol provides a coherent view of shared data items for programmers. The first level is the chip-level protocol used to keep the multiple copies of data among the eight L1 caches consistent. The second level is the inter-chip protocol, used to maintain coherence of the L2 caches across chips. Both levels are based on a standard three-state (unowned, shared, exclusive) invalidation-based, directory-based cache coherence protocol with some extensions. This hierarchical protocol is more complicated, with more corner cases and a bigger state space than non-hierarchical protocols: it has eight instances of the chip-level protocol and up to four instances of the inter-chip protocol running concurrently. Such hierarchical protocols therefore cannot be checked directly by current model checkers, e.g., Murphi or NuSMV. During the development of the FT-1000 CPU, we applied TDA to reduce the state space of the chip-level protocol and checked several safety properties using NuSMV. The FT-1000 CPU was then regarded as a single-core processor, which simplifies the verification of the inter-chip protocol. We claim the correctness of the original protocol by verifying the second-level protocol. Some chip-level experimental results are given in Table 2, where UNS1 and UNS2 are the same as those of Synapse N+1.

Table 2 Experimental results of FT-1000 chip-level protocol

8 Conclusions

The verification of cache coherence is in general known to be NP-hard. In the age of exascale computing, scalability is emerging as one of the key concerns in parallel computing [41], and scalable multi-core multi-processor architectures are inevitable. Increasingly complex processes and an unbounded system parameter lead to state explosion during the verification of parameterized cache coherence protocols. This paper has put forward a generic abstraction method for parameterized systems, two-dimensional abstraction (TDA). The novelty of our approach lies in analyzing in depth the intrinsic factors that determine the size of the state space and reducing the state space in two dimensions, thus producing a much smaller abstract model. Compared with traditional approaches, our approach can effectively reduce the verification complexity and greatly extend verification capability. We give complete soundness and completeness proofs for our method, and we have demonstrated its benefits on several coherence protocols with realistic features.

Our future work is to integrate TDA with model-checking tools and to check the hierarchically organized advanced cache coherence protocol of a next-generation supercomputer. We also plan to investigate combining TDA with the CMP method.