Verifying Safety of Synchronous Fault-Tolerant Algorithms by Bounded Model Checking
Abstract
Many fault-tolerant distributed algorithms are designed for synchronous or round-based semantics. In this paper, we introduce the synchronous variant of threshold automata, and study their applicability and limitations for the verification of synchronous distributed algorithms. We show that in general, the reachability problem is undecidable for synchronous threshold automata. Still, we show that many synchronous fault-tolerant distributed algorithms have a bounded diameter, although the algorithms are parameterized by the number of processes. Hence, we use bounded model checking for verifying these algorithms.
The existence of bounded diameters is the main conceptual insight in this paper. We compute the diameter of several algorithms and check their safety properties, using SMT queries that contain quantifiers for dealing with the parameters symbolically. Surprisingly, performance of the SMT solvers on these queries is very good, reflecting the recent progress in dealing with quantified queries. We found that the diameter bounds of synchronous algorithms in the literature are tiny (from 1 to 4), which makes our approach applicable in practice. For a specific class of algorithms we also establish a theoretical result on the existence of a diameter, providing a first explanation for our experimental results. The encodings of our benchmarks and instructions on how to run the experiments are available at: [33].
1 Introduction
Fault-tolerant distributed algorithms are hard to design and verify. Recently, threshold automata were introduced to model, verify and synthesize asynchronous fault-tolerant distributed algorithms [19, 21, 24]. Owing to the well-known impossibility result [18], many distributed computing problems, including consensus, are not solvable in purely asynchronous systems. Thus, synchronous distributed algorithms have been extensively studied [5, 26]. In this paper, we introduce synchronous threshold automata, and investigate their applicability and limitations for verification of synchronous fault-tolerant distributed algorithms.
Table 1. A long execution of reliable broadcast and the short representative.
Contributions. We start by introducing synchronous threshold automata (STA) and the counter systems they define.
1. We show that parameterized reachability checking of STA is undecidable.
2. We introduce an SMT-based procedure for finding the diameter of the counter system associated with an STA, i.e., the number of steps in which every configuration of the counter system is reachable. By knowing the diameter, we use bounded model checking as a complete verification method [11, 14, 22].
3. For a class of STA that captures several algorithms such as the broadcast algorithm in Fig. 1, we prove that the diameter is always bounded. The diameter is a function of the number of guard expressions and the longest path in the automaton, that is, it is independent of the parameters.
4. We implemented our technique, by running Z3 [29] and CVC4 [7] as backend SMT solvers, and evaluated it by applying it to several distributed algorithms from the literature: benchmarks that tolerate Byzantine faults from [8, 9, 10, 32], benchmarks that tolerate crashes from [13, 26, 30], and benchmarks that tolerate send omissions from [10, 30].
5. We are the first to automatically verify the Byzantine and send omission benchmarks. For the crash benchmarks, our method performs significantly better than the abstraction-based method in [3]. By tweaking the constraints on the parameters n, t, f, we introduce configurations with more faults than expected, for which our technique automatically finds a counterexample.
2 Overview of Our Approach
Bounded Diameter. Consider Fig. 1: the processes execute the send, receive, and local computation steps in lock-step. One iteration of the loop is expressed as an STA edge that connects the locations before and after an iteration (i.e., the STA models the loop body of the pseudo code). The location \(\textsc {se}\) encodes that v \(=1\) and accept is false. That is, \(\textsc {se}\) is the location in which processes send \(\texttt {{<}ECHO{>}}\) in every round. If a process sets accept to true, it goes to location \(\textsc {ac}\). The location where v is 1 is encoded by \(\textsc {v1}\), and the one where v is 0 by \(\textsc {v0}\).
An example execution is depicted in Table 1 on the left. We run \(n-f\) copies of the STA in Fig. 1. Observe that the guards of the rules \(r_1\) and \(r_2\) are both enabled in the configuration \(\sigma _0\). One STA uses \(r_2\) to go to \(\textsc {se}\) while the others use the self-loop \(r_1\) to stay in \(\textsc {v0}\). As both rules remain enabled, in every round one more automaton can go to \(\textsc {se}\). Hence, configuration \(\sigma _{t+1}\) has \(t+1\) correct STA in location \(\textsc {se}\) and rule \(r_1\) becomes disabled. Then, all remaining STA go to \(\textsc {se}\) and then finally to \(\textsc {ac}\). This execution depends on the parameter t, which implies that the length of this execution is unbounded for increasing values of the parameter t. (We note that we can obtain longer executions, if some STA use rule \(r_4\)). On the right, we see an execution where all STA take \(r_2\) immediately. That is, while configuration \(\sigma _{t+3}\) is reached by a long execution on the left, it is reached in just two steps on the right (observe \(\sigma '_2=\sigma _{t+3}\)).
We are interested in whether there is a natural number k (which does not depend on the parameters n, t and f) such that we can always shorten executions to executions of length \(\le k\). (By length, we mean the number of transitions in an execution.) In such a case we say that the STA has bounded diameter. In Sect. 5.1 we introduce an SMT-based procedure that enumerates candidates for the diameter bound and checks if the candidate is indeed the diameter; if it finds such a bound, it terminates. For the STA in Fig. 1, this procedure computes the diameter 2.
Threshold Automata with Traps. In Sect. 5.2, we define a fragment of STA for which we theoretically guarantee a bounded diameter. For example, the STA in Fig. 1 falls in this fragment, and we obtain a guaranteed diameter of \({\le } 8\). The fragment is defined by two conditions: (i) The STA has a structure that implies monotonicity of the guards: the set of locations that are used in the guards (e.g., \(\{\textsc {v1},\textsc {se},\textsc {ac}\}\)) is closed under the rules, i.e., from each location within the set, the STA can reach only a location in the set. We call guards that have this property trapped. (ii) The STA has no cycles, except possibly self-loops.
Threshold Automata with Untrapped Guards. The FloodMin algorithm in Fig. 2 solves the k-set agreement problem. This algorithm is run by n replicated processes, up to t of which may fail by crashing. For simplicity of presentation, we consider the case when \(k=1\), which turns k-set agreement into consensus. In Fig. 2, on the right, we have the STA that captures the loop body. The locations \(\textsc {c0}\) and \(\textsc {c1}\) correspond to the case when a process is crashing in the current round and may manage to send the value 0 or 1, respectively; the process remains in the crashed location and does not send any messages starting with the next round. We observe that the guard \(\#\{\textsc {v0}, \textsc {c0}\} > 0\) is not trapped, and our result about trapped guards does not apply. Nevertheless, our SMT-based procedure can find a diameter of 2. In the same way, we automatically found a bound on the diameter for several benchmarks from the literature. It is remarkable that the diameter for the transition relation of the loop body (without the loop condition) is bounded by a constant, independent of the parameters.
Bounded Model Checking of Algorithms with Clean Rounds. The number of loop iterations \(\lfloor t/k \rfloor +1\) of the FloodMin algorithm has been designed such that it ensures (together with the environment assumption of at most t crashes) that there is at least one clean round in which at most \(k-1\) processes crashed. The correctness of the FloodMin algorithm relies on the occurrence of such a clean round. We make use of the existence of clean rounds by employing the following two-step methodology for the verification of safety properties: (i) we find all reachable clean-round configurations, and (ii) check if a bad configuration is reachable from those configurations. A detailed description of this methodology can be found in Sect. 6. Our method requires the encoding of a clean round as input (e.g., for Fig. 2 that no STA are in \(\textsc {c0}\) and \(\textsc {c1}\)). We leave detecting and encoding clean rounds automatically from the fault environment for future work.
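The counting argument behind the clean round can be spelled out as a short derivation. Here \(c_i\) is notation introduced only for this sketch: the number of processes that crash in round i. Suppose that none of the \(\lfloor t/k \rfloor + 1\) rounds is clean, that is, \(c_i \ge k\) in every round. Then

\[
\sum_{i=1}^{\lfloor t/k \rfloor + 1} c_i \;\ge\; k \left( \lfloor t/k \rfloor + 1 \right) \;>\; k \cdot \frac{t}{k} \;=\; t ,
\]

which contradicts the assumption of at most t crashes. Hence at least one round has at most \(k-1\) crashes.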
3 Synchronous Threshold Automata
We introduce the syntax of synchronous threshold automata and give some intuition of the semantics, which we will formalize as counter systems below.
A synchronous threshold automaton is the tuple \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\), where \(\mathcal {L}\) is a finite set of locations, \(\mathcal {I}\subseteq \mathcal {L}\) is a non-empty set of initial locations, \(\varPi \) is a finite set of parameters, \(\mathcal {R}\) is a finite set of rules, RC is a resilience condition, and \(\chi \) is a counter invariant, defined in the following. We assume that the set \(\varPi \) of parameters contains at least the parameter n, denoting the number of processes. We call the vector \(\varvec{\pi }= \langle \pi _1, \dots , \pi _{|\varPi |}\rangle \) the parameter vector, and a vector \(\mathbf {p}= \langle p_1, \dots , p_{|\varPi |} \rangle \) is an instance of \(\varvec{\pi }\), where \(\pi _i \in \varPi \) is a parameter, and \(p_i \in \mathbb {N}\) is a natural number, for \(1 \le i \le |\varPi |\), such that \(\mathbf {p}[\pi _i] = p_i\) is the value assigned to the parameter \(\pi _i\) in the instance \(\mathbf {p}\) of \(\varvec{\pi }\). The set of admissible instances of \(\varvec{\pi }\) is defined as \(P_{RC}= \{\mathbf {p}\in \mathbb {N}^{|\varPi |} \mid \mathbf {p} \text{ is } \text{ an } \text{ instance } \text{ of } \varvec{\pi } \text{ and } \mathbf {p} \text{ satisfies } RC \}\). The mapping \(N : P_{RC}\rightarrow \mathbb {N}\) maps an admissible instance \(\mathbf {p}\in P_{RC}\) to the number \(N(\mathbf {p})\) of processes that participate in the algorithm, such that \(N(\mathbf {p})\) is a linear combination of the parameter values in \(\mathbf {p}\).
For example, for the STA in Fig. 1, \(RC \equiv n > 3t \wedge t \ge f\), hence a vector \(\mathbf {p}\in \mathbb {N}^{|\varPi |}\) is an admissible instance of the parameter vector \(\varvec{\pi }= \langle n, t, f \rangle \), if \(\mathbf {p}[n] > 3\mathbf {p}[t] \wedge \mathbf {p}[t] \ge \mathbf {p}[f]\). Furthermore, for this STA, \(N(\mathbf {p}) = \mathbf {p}[n] - \mathbf {p}[f]\). For the STA in Fig. 2, \(RC \equiv n > t \wedge t \ge f\), hence the admissible instances satisfy \(\mathbf {p}[n] > \mathbf {p}[t] \wedge \mathbf {p}[t] \ge \mathbf {p}[f]\), and we have \(N(\mathbf {p}) = \mathbf {p}[n]\).
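As a small illustration of admissible instances, the two resilience conditions and the mappings N can be written as plain predicates. This is only a sketch: the function names and the dictionary representation of an instance are assumptions made for this example and are not taken from the authors' implementation.

```python
from typing import Dict

Params = Dict[str, int]  # an instance p of the parameter vector <n, t, f>


def rc_fig1(p: Params) -> bool:
    """Resilience condition given for the STA in Fig. 1: n > 3t and t >= f."""
    return p["n"] > 3 * p["t"] and p["t"] >= p["f"]


def rc_fig2(p: Params) -> bool:
    """Resilience condition given for the STA in Fig. 2: n > t and t >= f."""
    return p["n"] > p["t"] and p["t"] >= p["f"]


def num_processes_fig1(p: Params) -> int:
    """N(p) = p[n] - p[f], as stated for the STA in Fig. 1."""
    return p["n"] - p["f"]


def num_processes_fig2(p: Params) -> int:
    """N(p) = p[n], as stated for the STA in Fig. 2."""
    return p["n"]


p = {"n": 4, "t": 1, "f": 1}
print(rc_fig1(p), num_processes_fig1(p))  # True 3

q = {"n": 3, "t": 1, "f": 1}
print(rc_fig1(q), rc_fig2(q))             # False True
```

The instance q shows that admissibility depends on the automaton: it violates \(n > 3t\) but satisfies \(n > t\).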
We introduce counter atoms of the form \(\psi \equiv \#L \ge \varvec{a}\cdot \varvec{\pi }+ b\), where \(L \subseteq \mathcal {L}\) is a set of locations, \(\#L\) denotes the total number of processes currently in the locations \(\ell \in L\), \(\varvec{a}\in \mathbb {Z}^{|\varPi |}\) is a vector of coefficients, \(\varvec{\pi }\) is the parameter vector, and \(b \in \mathbb {Z}\). We will use the counter atoms for expressing guards and predicates in the verification problem. In the following, we will use two abbreviations: \(\#L = \varvec{a}\cdot \varvec{\pi }+ b\) for the formula \((\#L \ge \varvec{a}\cdot \varvec{\pi }+ b) \wedge \lnot (\#L \ge \varvec{a}\cdot \varvec{\pi }+ b+ 1)\), and \(\#L > \varvec{a}\cdot \varvec{\pi }+ b\) for the formula \(\#L \ge \varvec{a}\cdot \varvec{\pi }+ b+ 1\).
A rule \(r \in \mathcal {R}\) is the tuple \(( from , to , \varphi )\), where \( from , to \in \mathcal {L}\) are locations, and \(\varphi \) is a guard whose truth value determines if the rule r is executed. The guard \(\varphi \) is a Boolean combination of counter atoms. We denote by \(\varPsi \) the set of counter atoms occurring in the guards of the rules \(r \in \mathcal {R}\).
The counter invariant \(\chi \) is a Boolean combination of counter atoms \(\#L \ge \varvec{a}\cdot \varvec{\pi }+ b\), where each atom occurring in \(\chi \) restricts the number of processes allowed to populate the locations in \(L \subseteq \mathcal {L}\).
Counter Systems. The counter atoms are evaluated over tuples \((\mathbf {\kappa }, \mathbf {p})\), where \(\mathbf {\kappa }\in \mathbb {N}^{|\mathcal {L}|}\) is a vector of counters, and \(\mathbf {p}\in P_{RC}\) is an admissible instance of \(\varvec{\pi }\). For a location \(\ell \in \mathcal {L}\), the counter \(\mathbf {\kappa }[\ell ]\) denotes the number of processes that are currently in the location \(\ell \). A counter atom \(\psi \equiv \#L \ge \varvec{a}\cdot \varvec{\pi }+ b\) is satisfied in the tuple \((\mathbf {\kappa }, \mathbf {p})\), that is \((\mathbf {\kappa }, \mathbf {p}) \models \psi \), iff \(\sum _{\ell \in L} \mathbf {\kappa }[\ell ] \ge \varvec{a}\cdot \mathbf {p}+ b\). The semantics of the Boolean connectives is standard.
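The satisfaction relation for counter atoms, together with the two abbreviations for \(=\) and \(>\), can be sketched as follows. The function names, the dictionary encoding of \(\mathbf {\kappa }\), \(\varvec{a}\) and \(\mathbf {p}\), and the example atom are assumptions made for illustration; the atom is not claimed to be a guard of Fig. 1.

```python
from typing import Dict, Iterable


def atom_ge(L: Iterable[str], a: Dict[str, int], b: int,
            kappa: Dict[str, int], p: Dict[str, int]) -> bool:
    """(kappa, p) |= #L >= a.pi + b; absent locations or coefficients count as 0."""
    lhs = sum(kappa.get(loc, 0) for loc in L)
    rhs = sum(a.get(name, 0) * value for name, value in p.items()) + b
    return lhs >= rhs


def atom_gt(L, a, b, kappa, p) -> bool:
    """Abbreviation: #L > a.pi + b stands for #L >= a.pi + b + 1."""
    return atom_ge(L, a, b + 1, kappa, p)


def atom_eq(L, a, b, kappa, p) -> bool:
    """Abbreviation: #L = a.pi + b stands for (>= b) and not (>= b + 1)."""
    return atom_ge(L, a, b, kappa, p) and not atom_ge(L, a, b + 1, kappa, p)


p = {"n": 4, "t": 1, "f": 1}
kappa = {"v0": 1, "v1": 1, "se": 1, "ac": 0}  # three processes in total

# Hypothetical atom #{v1, se, ac} >= t + 1, i.e. a = <0, 1, 0> and b = 1.
print(atom_ge({"v1", "se", "ac"}, {"t": 1}, 1, kappa, p))  # True  (2 >= 2)
print(atom_gt({"v1", "se", "ac"}, {"t": 1}, 1, kappa, p))  # False (2 >= 3 fails)
print(atom_eq({"ac"}, {}, 0, kappa, p))                    # True  (#{ac} = 0)
```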
A transition is a function \(t: \mathcal {R} \rightarrow \mathbb {N}\) that maps a rule \(r \in \mathcal {R}\) to a factor \(t(r) \in \mathbb {N}\), denoting the number of processes that act upon this rule. Given an instance \(\mathbf {p}\) of \(\varvec{\pi }\), we denote by \(T(\mathbf {p})\) the set \(\{t\mid \sum _{r \in \mathcal {R}} t(r) = N(\mathbf {p})\}\) of transitions whose rule factors sum up to \(N(\mathbf {p})\).
A transition \(t\in T(\mathbf {p})\) is enabled in a tuple \((\mathbf {\kappa }, \mathbf {p})\) iff:
1. for every \(r \in \mathcal {R}\), such that \(t(r) > 0\), it holds that \((\mathbf {\kappa }, \mathbf {p}) \models r.\varphi \), and
2. for every \(\ell \in \mathcal {L}\), it holds that \(\mathbf {\kappa }[\ell ] = \sum _{r \in \mathcal {R}\wedge r. from = \ell } t(r)\).
The first condition ensures that processes only use rules whose guards are satisfied, and the second that every process moves in an enabled transition.
Observe that each transition \(t\in T(\mathbf {p})\) defines a unique tuple \((\mathbf {\kappa }, \mathbf {p})\) in which it is enabled. We call the origin of a transition \(t\in T(\mathbf {p})\) the tuple \(o(t) = (\mathbf {\kappa }, \mathbf {p})\), such that for every \(\ell \in \mathcal {L}\), we have \(o(t).\mathbf {\kappa }[\ell ] = \sum _{r \in \mathcal {R}\wedge r. from = \ell }t(r)\). Similarly, each transition defines a unique tuple \((\mathbf {\kappa }, \mathbf {p})\) that is the result of applying the transition in its origin. We call the goal of a transition \(t\in T(\mathbf {p})\) the tuple \(g(t) = (\mathbf {\kappa }, \mathbf {p})\), such that for every \(\ell \in \mathcal {L}\), we have \(g(t).\mathbf {\kappa }[\ell ] = \sum _{r \in \mathcal {R}\wedge r. to = \ell } t(r)\).
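The origin and goal of a transition, and the check that its guards hold in the origin, can be sketched in a few lines. The encoding is an assumption made for this example: a rule is a triple (from, to, guard) with the guard given as a Python callable, a transition maps rule indices to factors, and the three rules are a hypothetical toy automaton rather than one of the benchmarks.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]                      # kappa; absent locations hold 0 processes
Guard = Callable[[Config, Dict[str, int]], bool]
Rule = Tuple[str, str, Guard]                # (from, to, guard)
Transition = Dict[int, int]                  # rule index -> factor t(r)


def origin(rules: List[Rule], t: Transition) -> Config:
    """o(t): each location holds the sum of factors of its outgoing rules."""
    kappa: Config = {}
    for idx, factor in t.items():
        if factor > 0:
            src = rules[idx][0]
            kappa[src] = kappa.get(src, 0) + factor
    return kappa


def goal(rules: List[Rule], t: Transition) -> Config:
    """g(t): each location holds the sum of factors of its incoming rules."""
    kappa: Config = {}
    for idx, factor in t.items():
        if factor > 0:
            dst = rules[idx][1]
            kappa[dst] = kappa.get(dst, 0) + factor
    return kappa


def enabled_in_origin(rules: List[Rule], t: Transition, p: Dict[str, int]) -> bool:
    """Condition 1: every rule with a positive factor has a true guard in o(t).
    Condition 2 holds by construction, because o(t) is computed from t."""
    kappa = origin(rules, t)
    return all(rules[i][2](kappa, p) for i, factor in t.items() if factor > 0)


# Hypothetical toy rules, not taken from the paper's figures.
rules: List[Rule] = [
    ("v0", "v0", lambda k, p: True),
    ("v0", "v1", lambda k, p: True),
    ("v1", "v1", lambda k, p: k.get("v1", 0) >= 1),
]
t = {0: 2, 1: 1, 2: 1}                       # factors sum to N(p) = 4
print(origin(rules, t))                      # {'v0': 3, 'v1': 1}
print(goal(rules, t))                        # {'v0': 2, 'v1': 2}
print(enabled_in_origin(rules, t, {"n": 4})) # True
```

A schedule is then feasible when each transition is enabled in its origin and the goal of each transition equals the origin of the next one.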
We now define a counter system, for a given \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\), and an admissible instance \(\mathbf {p}\in P_{RC}\) of the parameter vector \(\varvec{\pi }\).
Definition 1
(Counter system). The counter system induced by \( STA \) and an admissible instance \(\mathbf {p}\in P_{RC}\) is the tuple \(\mathsf {CS}( STA , \mathbf {p}) = (\varSigma (\mathbf {p}), I(\mathbf {p}), R(\mathbf {p}))\), where:
\(\varSigma (\mathbf {p})= \{\sigma = (\mathbf {\kappa }, \mathbf {p}) \mid \sum _{\ell \in \mathcal {L}} \sigma .\mathbf {\kappa }[\ell ] = N(\mathbf {p})\, \text{ and }\; \sigma \models \chi \}\) are the configurations;
\(I(\mathbf {p})= \{\sigma \in \varSigma (\mathbf {p})\mid \sum _{\ell \in \mathcal {I}} \sigma .\mathbf {\kappa }[\ell ] = N(\mathbf {p}) \}\) are the initial configurations;
\(R(\mathbf {p})\subseteq \varSigma (\mathbf {p})\times T(\mathbf {p})\times \varSigma (\mathbf {p})\) is the transition relation, with \(\langle \sigma , t, \sigma ' \rangle \in R(\mathbf {p})\), if \(\sigma \) is the origin and \(\sigma '\) the goal of \(t\). We write \(\sigma \xrightarrow {t} \sigma '\), if \(\langle \sigma , t, \sigma '\rangle \in R(\mathbf {p})\).
We restrict ourselves to deadlock-free counter systems, i.e., counter systems where the transition relation is total (every configuration has a successor). A sufficient condition for deadlock-freedom is that for every location \(\ell \in \mathcal {L}\), it holds that \(\chi \rightarrow \bigvee _{r \in \mathcal {R}\wedge r. from = \ell } r.\varphi \). This ensures that it is always possible to move out of every location, as there is at least one outgoing rule per location whose guard is satisfied.
To simplify the notation, in the following we write \(\sigma [\ell ]\) to denote \(\sigma .\mathbf {\kappa }[\ell ]\).
Paths and Schedules in a Counter System. We now define paths and schedules of a counter system, as sequences of configurations and transitions, respectively.
Definition 2
A path in the counter system \(\mathsf {CS}( STA , \mathbf {p}) = (\varSigma (\mathbf {p}), I(\mathbf {p}), R(\mathbf {p}))\) is a finite sequence \(\{\sigma _i\}_{i = 0}^{k}\) of configurations, such that for every two consecutive configurations \(\sigma _{i-1}, \sigma _{i}\), for \(0 < i \le k\), there exists a transition \(t_i \in T(\mathbf {p})\) such that \(\sigma _{i-1} \xrightarrow {t_i} \sigma _{i}\). A path \(\{\sigma _i\}_{i = 0}^{k}\) is called an execution if \(\sigma _0 \in I(\mathbf {p})\).
Definition 3
A schedule is a finite sequence \(\tau = \{t_i\}_{i = 1}^{k}\) of transitions \(t_i \in T(\mathbf {p})\), for \(0 < i \le k\). We denote by \(|\tau | = k\) the length of the schedule \(\tau \).
A schedule \(\tau = \{t_i\}_{i=1}^{k}\) is feasible if there is a path \(\{\sigma _i\}_{i=0}^{k}\) such that \(\sigma _{i-1} \xrightarrow {t_i} \sigma _{i}\), for \(0 < i \le k\). We call \(\sigma _0\) the origin, and \(\sigma _k\) the goal of \(\tau \), and write \(\sigma _0 \xrightarrow {\tau } \sigma _k\).
4 Parameterized Reachability and Its Undecidability
We show that the following problem is undecidable in general, by reduction from the halting problem of a two-counter machine (2CM) [28]. Such reductions are common in parameterized verification, e.g., see [12].
Definition 4
(Parameterized Reachability). Given a formula \(\varphi \), that is, a Boolean combination of counter atoms, and \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\), the parameterized reachability problem is to decide whether there exists an admissible instance \(\mathbf {p}\in P_{RC}\), such that in the counter system \(\mathsf {CS}( STA , \mathbf {p})\), there is an initial configuration \(\sigma \in I(\mathbf {p})\), and a feasible schedule \(\tau \), with \(\sigma \xrightarrow {\tau } \sigma '\) and \(\sigma ' \models \varphi \).
To prove undecidability, we construct a synchronous threshold automaton \( STA _{\mathcal {M}}\), such that every counter system induced by it simulates the steps of a 2CM executing a program \(P\). The STA has a single parameter, the number n of processes, and the invariant \(\chi = true \). The idea is that each process plays one of two roles: either it is used to encode the control flow of the program \(P\) (controller role), or to encode the values of the registers in unary, as in [17] (storage role). Thus, \( STA _{\mathcal {M}}\) consists of two parts, one per role.
Our construction allows multiple processes to act as controllers. Since we assume that the 2CM is deterministic, all the controllers behave the same. For each instruction of the program \(P\), in the controller part of \( STA _{\mathcal {M}}\), there is a single location (for ‘jump if zero’ and ‘halt’) or a pair of locations (for ‘increment’ and ‘decrement’), and a special stuck location. In the storage part of \( STA _{\mathcal {M}}\), there is a location for each register, a store location, and auxiliary locations. The number of processes in a register location encodes the value of the register in the 2CM.
An increment (resp. decrement) of a register is modeled by moving one process from (resp. to) the store location to (resp. from) the register location. The guards on the rules in the controller part check if the storage processes made a transition that truly models a step of 2CM; in this case, the controllers move on to the next location, otherwise they move to the stuck location. For example, to model a ‘jump if zero’ for register A, the controllers check if \(\#\{\ell _{A}\} = 0\), where \(\ell _A\) is the storage location corresponding to register A. The main invariant which ensures correctness is that every transition in every counter system induced by \( STA _{\mathcal {M}}\) either faithfully simulates a step of the 2CM, or moves all of the controllers to the stuck location.
Let \(\ell _{ halt }\) be the halting location in the controller part of \( STA _{\mathcal {M}}\). The formula \(\varphi \equiv \lnot (\#\{\ell _{ halt }\} = 0)\) states that the controllers have reached the halting location. Thus, the answer to the parameterized reachability question given the formula \(\varphi \) and \( STA _{\mathcal {M}}\) is positive iff 2CM halts, which gives us undecidability.
5 Bounded Diameter Oracle
5.1 Computing the Diameter Using SMT
Given an STA, the diameter is the maximal number of transitions needed to reach all possible configurations in every counter system induced by the STA, and an admissible instance \(\mathbf {p}\in P_{RC}\). We adapt the definition of diameter from [11].
Definition 5
(Diameter). Given an \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\), the diameter is the smallest number d such that for every \(\mathbf {p}\in P_{RC}\) and every path \(\{\sigma _i\}_{i=0}^{d+1}\) of length \(d+1\) in \(\mathsf {CS}( STA , \mathbf {p})\), there exists a path \(\{\sigma '_j\}_{j=0}^{e}\) of length \(e \le d\) in \(\mathsf {CS}( STA , \mathbf {p})\), such that \(\sigma _0 = \sigma '_0\) and \(\sigma _{d+1} = \sigma '_{e}\).
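Formula (1) expresses the condition of Definition 5 for a candidate diameter d. The procedure for computing the diameter is as follows:
1. initialize the candidate diameter d to 1;
2. check if the negation of the formula (1) is unsatisfiable;
3. if yes, then output d and terminate;
4. if not, then increment d and jump to step 2.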
If the procedure terminates, it outputs the diameter, which can be used as completeness threshold for bounded model checking. We implemented this procedure, and used a backend SMT solver to automate the test in step 2.
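To make the candidate loop concrete, the following sketch runs it by explicit enumeration for one fixed number of processes. This is not the paper's method: the SMT procedure treats the parameters symbolically and covers all admissible instances at once, whereas this brute-force version only inspects a single instance and therefore says nothing about other parameter values. The rule encoding and the toy automaton (a chain of three locations with self-loops and trivially true guards) are assumptions made for this example.

```python
from itertools import product


def compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 0:
        if total == 0:
            yield ()
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def successors(config, locs, rules, p):
    """All configurations reachable in one synchronous step:
    every process takes some outgoing rule whose guard holds in `config`."""
    kappa = dict(zip(locs, config))
    per_loc = []
    for loc in locs:
        out = [r for r in rules if r[0] == loc and r[2](kappa, p)]
        per_loc.append([(out, split) for split in compositions(kappa[loc], len(out))])
    result = set()
    for choice in product(*per_loc):
        nxt = dict.fromkeys(locs, 0)
        for out, split in choice:
            for (_, to, _), count in zip(out, split):
                nxt[to] += count
        result.add(tuple(nxt[loc] for loc in locs))
    return result


def diameter(locs, rules, p, n, max_d=10):
    """Smallest d >= 1 such that every configuration reachable by a path of
    length d + 1 is also reachable by a path of length <= d (fixed n only)."""
    configs = list(compositions(n, len(locs)))
    succ = {c: successors(c, locs, rules, p) for c in configs}
    exact = {c: {c} for c in configs}    # reachable in exactly d steps
    within = {c: {c} for c in configs}   # reachable in at most d steps
    for d in range(max_d + 1):
        nxt = {c: set().union(*(succ[x] for x in exact[c])) for c in configs}
        if d >= 1 and all(nxt[c] <= within[c] for c in configs):
            return d
        exact = nxt
        within = {c: within[c] | nxt[c] for c in configs}
    return None                          # no bound found up to max_d


# Hypothetical toy automaton: v0 -> v1 -> v2, self-loops everywhere, guards true.
always = lambda kappa, p: True
locs = ["v0", "v1", "v2"]
rules = [("v0", "v0", always), ("v0", "v1", always), ("v1", "v1", always),
         ("v1", "v2", always), ("v2", "v2", always)]
print(diameter(locs, rules, {}, 2))      # 2
```

For this toy automaton the candidate d = 1 is rejected because moving both processes from v0 to v2 needs two steps, while every configuration reached in three steps is already reached in two.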
5.2 Bounded Diameter for a Fragment of STA
In this section, we show that for a specific fragment of STA, we are able to give a theoretical bound on the diameter, similar to the asynchronous case [20, 21].
The STA that fall in this fragment are monotonic and 1-cyclic. An STA is monotonic iff every counter atom changes its truth value at most once in every path of a counter system induced by the STA and an admissible instance \(\mathbf {p}\in P_{RC}\). This implies that every schedule can be partitioned into finitely many sub-schedules that satisfy a property we call steadiness. We call a schedule steady if the set of rules whose guards are satisfied does not change in all of its transitions. We also give a sufficient condition for monotonicity, using trapped counter atoms, defined below. In a 1-cyclic STA, the only cycles that can be formed by its rules are self-loops. Under these two conditions, we guarantee that for every steady schedule, there exists a steady schedule of bounded length that has the same origin and goal. We show that this bound depends on the counter atoms \(\varPsi \) occurring in the guards of the STA, and the length of the longest path in the STA, denoted by \(c\). The main result of this section is stated by the theorem:
Theorem 1
For every feasible schedule \(\tau \) in a counter system \(\mathsf {CS}( STA , \mathbf {p})\), where \( STA \) is monotonic and 1-cyclic, and \(\mathbf {p}\in P_{RC}\), there exists a feasible schedule \(\tau '\) of length \(O(|\varPsi | c)\), such that \(\tau \) and \(\tau '\) have the same origin and goal.
To prove Theorem 1, we start by defining monotonic STA.
Definition 6
(Monotonic STA). An automaton \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\) is monotonic iff for every path \(\{\sigma _i\}_{i=0}^k\) in the counter system \(\mathsf {CS}( STA , \mathbf {p})\), for \(\mathbf {p}\in P_{RC}\), and every counter atom \(\psi \in \varPsi \), we have \(\sigma _i \models \psi \) implies \(\sigma _j \models \psi \), for \(0 \le i< j < k\).
To show that we can partition a schedule into finitely many sub-schedules, we need the notion of a context. A context of a transition \(t\in T(\mathbf {p})\) is the set \(\mathcal {C}_{t} = \{\psi \in \varPsi \mid o(t) \models \psi \}\) of counter atoms \(\psi \) satisfied in the origin \(o(t)\) of the transition \(t\). Given a feasible schedule \(\tau \), the point i is a context switch, if \(\mathcal {C}_{t_{i-1}} \ne \mathcal {C}_{t_i}\), for \(1 < i \le |\tau |\).
Lemma 1
Every feasible schedule \(\tau \) in a counter system induced by a monotonic STA has at most \(|\varPsi |\) context switches.
Proof
Let \(\tau = \{t_i\}_{i = 1}^k\) be a feasible schedule and \(\varPsi \) the set of counter atoms appearing on the rules of the monotonic STA. For every \(\psi \in \varPsi \), there is at most one context switch i, for \(0 < i \le k\), such that \(\psi \not \in \mathcal {C}_{t_{i-1}}\) and \(\psi \in \mathcal {C}_{t_i}\). \(\square \)
Sufficient Condition for Monotonicity. We introduce trapped counter atoms.
Definition 7
A set \(L \subseteq \mathcal {L}\) of locations is called a trap, iff for every \(\ell \in L\) and every \(r \in \mathcal {R}\) such that \(\ell = r.from\), it holds that \(r.to \in L\).
A counter atom \(\psi \equiv \#L \ge \varvec{a}\cdot \varvec{\pi }+ b\) is trapped iff the set L is a trap.
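Definition 7 translates directly into a closure check on the rule set. In this sketch, rules are reduced to (from, to) pairs because guards play no role in the definition; the function name and the concrete rule list are assumptions made for illustration, with location names borrowed from the running example but not claimed to reproduce the rules of Fig. 1.

```python
from typing import Iterable, Set, Tuple


def is_trap(L: Iterable[str], rules: Iterable[Tuple[str, str]]) -> bool:
    """L is a trap iff every rule that starts in L also ends in L."""
    inside: Set[str] = set(L)
    return all(to in inside for frm, to in rules if frm in inside)


# Hypothetical rule set, given as (from, to) pairs.
rules = [("v0", "v0"), ("v0", "se"), ("v1", "se"),
         ("se", "se"), ("se", "ac"), ("ac", "ac")]

print(is_trap({"v1", "se", "ac"}, rules))  # True: no rule leaves the set
print(is_trap({"v0", "v1"}, rules))        # False: v0 -> se leaves the set
```

A counter atom over a trap can only gain processes, which is the intuition behind the monotonicity result that follows.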
Lemma 2
Let \(\psi \equiv \#L \ge \varvec{a}\cdot \varvec{\pi }+ b\) be a trapped counter atom, \(\sigma \) a configuration such that \(\sigma \models \psi \), and t a transition enabled in \(\sigma \). If \(\sigma \xrightarrow {t} \sigma '\), then \(\sigma '\models \psi \).
Corollary 1
Let \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\) be an automaton such that all its counter atoms are trapped. Then STA is monotonic.
Steady Schedules. We define the notion of steadiness, similarly to [20].
Definition 8
A schedule \(\tau = \{t_i\}_{i=1}^{k}\) is steady, if \(\mathcal {C}_{t_i} = \mathcal {C}_{t_j}\), for \(0< i < j \le k\).
We now focus on shortening steady schedules. That is, given a steady schedule, we construct a schedule of bounded length with the same origin and goal.
Observe that \( STA _{} = (\mathcal {L}_{}, \mathcal {I}_{}, \varPi _{}, \mathcal {R}_{}, RC_{}, \chi _{})\) can be seen as a directed graph \(G_{ STA }\), with vertices corresponding to the locations \(\ell \in \mathcal {L}\), and edges corresponding to the rules \(r\in \mathcal {R}\). We denote by \(c\) the length of the longest path between two nodes in the graph \(G_{ STA }\), and call it the longest chain of \( STA \). If \(G_{ STA }\) contains only cycles of length one, then \( STA \) is called 1-cyclic.
To shorten steady schedules, in addition to monotonicity, we require that the STA are also 1-cyclic. In the following, we assume that the schedules we shorten come from counter systems induced by monotonic and 1-cyclic STA. Intuitively, if a given schedule is longer than the longest chain of the STA, then in some transition of the schedule some processes followed a rule which is a self-loop. As processes may follow self-loops at different transitions, we cannot shorten the given schedule by eliminating transitions as a whole. Instead, we deconstruct the original schedule into sequences of process steps, which we call runs, shorten the runs, and reconstruct a new shorter schedule from the shortened runs. The main challenge is to show that the newly obtained schedule is feasible and steady.
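Both graph properties can be checked with a depth-first search once self-loops are set aside. The sketch below assumes that the length of a path is counted in edges and that self-loops do not contribute to the longest chain; the function names and the example edges are likewise assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def _successors(locations: Iterable[str], edges: Iterable[Edge]) -> Dict[str, List[str]]:
    """Adjacency lists of the graph with self-loops removed."""
    succ: Dict[str, List[str]] = {loc: [] for loc in locations}
    for u, v in set(edges):
        if u != v:
            succ[u].append(v)
    return succ


def is_one_cyclic(locations: Iterable[str], edges: Iterable[Edge]) -> bool:
    """True iff the only cycles are self-loops, i.e. the rest of the graph is acyclic."""
    succ = _successors(locations, edges)
    state = {loc: 0 for loc in succ}      # 0 = unvisited, 1 = on stack, 2 = done

    def has_cycle(u: str) -> bool:
        state[u] = 1
        for v in succ[u]:
            if state[v] == 1 or (state[v] == 0 and has_cycle(v)):
                return True
        state[u] = 2
        return False

    return not any(state[loc] == 0 and has_cycle(loc) for loc in succ)


def longest_chain(locations: Iterable[str], edges: Iterable[Edge]) -> int:
    """Number of edges on a longest path, ignoring self-loops (1-cyclic graphs only)."""
    locations = list(locations)
    edges = list(edges)
    if not is_one_cyclic(locations, edges):
        raise ValueError("graph has a cycle that is not a self-loop")
    succ = _successors(locations, edges)
    memo: Dict[str, int] = {}

    def depth(u: str) -> int:
        if u not in memo:
            memo[u] = max((1 + depth(v) for v in succ[u]), default=0)
        return memo[u]

    return max((depth(loc) for loc in locations), default=0)


# Hypothetical rule graph, given as (from, to) edges.
locs = ["v0", "v1", "se", "ac"]
edges = [("v0", "v0"), ("v0", "se"), ("v1", "se"),
         ("se", "se"), ("se", "ac"), ("ac", "ac")]
print(is_one_cyclic(locs, edges))                    # True
print(longest_chain(locs, edges))                    # 2  (v0 -> se -> ac)
print(is_one_cyclic(locs, edges + [("ac", "v0")]))   # False
```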
Schedules as Multisets of Runs. We proceed by defining runs and showing that each schedule can be represented by a multiset of runs.
We call a run the sequence \(\varrho = \{r_i\}_{i=1}^{k}\) of rules, for \(r_i \in \mathcal {R}\), such that \(r_i. to = r_{i+1}. from \), for \(0< i < k\). We denote by \(\varrho [i] = r_i\) the i-th rule in the run \(\varrho \), and by \(|\varrho |\) the length of the run. The following lemma shows that a feasible schedule can be deconstructed into a multiset of runs.
Lemma 3
For every feasible schedule \(\tau = \{t_i\}_{i=1}^{k}\), there exists a multiset \((\mathcal {P}, m)\), where:
1. \(\mathcal {P}\) is a set of runs \(\varrho \) of length k, and
2. \(m: \mathcal {P}\rightarrow \mathbb {N}\) is a multiplicity function, such that for every location \(\ell \in \mathcal {L}\), it holds that \(\sum _{r. from = \ell } t_i(r) = \sum _{\varrho [i]. from = \ell } m(\varrho )\), for \(0 < i \le k\).
A multiset \((\mathcal {P}, m)\) of runs of length k defines a schedule \(\tau = \{t_i\}_{i = 1}^k\) of length k, and we have \(t_i(r) = \sum _{\varrho [i] = r} m(\varrho )\), for every rule \(r \in \mathcal {R}\) and \(0 < i \le k\).
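The passage from a multiset of runs to a schedule is a simple aggregation: at each position i, the factor of a rule is the total multiplicity of the runs that use this rule at position i. In the sketch below, runs are tuples of rule names and the multiset is a dictionary from runs to multiplicities; this encoding, the function name and the rule names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Run = Tuple[str, ...]   # a run as a tuple of rule names


def schedule_from_runs(runs: Dict[Run, int]) -> List[Counter]:
    """Build t_1, ..., t_k with t_i(r) = sum of m(run) over runs with run[i] = r."""
    lengths = {len(run) for run in runs}
    if len(lengths) > 1:
        raise ValueError("all runs must have the same length k")
    k = lengths.pop() if lengths else 0
    schedule = [Counter() for _ in range(k)]
    for run, multiplicity in runs.items():
        for i, rule in enumerate(run):
            schedule[i][rule] += multiplicity
    return schedule


# Hypothetical rules: r1 = v0 -> v0, r2 = v0 -> se, r3 = se -> se.
runs = {
    ("r1", "r2"): 2,    # two processes wait one round, then move to se
    ("r2", "r3"): 1,    # one process moves to se at once and stays there
}
for i, t in enumerate(schedule_from_runs(runs), start=1):
    print(i, dict(t))
# 1 {'r1': 2, 'r2': 1}
# 2 {'r2': 2, 'r3': 1}
```

Each transition's factors sum to the total multiplicity (three processes here), matching the requirement that every process takes exactly one rule per step.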
For the counter systems of STA that are both monotonic and 1-cyclic, we show that their steady schedules can be shortened, so that their length does not exceed the longest chain \(c\) (that is, the length of the longest path in the STA).
Lemma 4
Let \(\tau \) be a steady feasible schedule in a counter system induced by a monotonic and 1-cyclic STA. If \(|\tau | > c+ 1\), then there exists a steady feasible schedule \(\tau '\) such that \(|\tau '| = |\tau | - 1\), and \(\tau , \tau '\) have the same origin and goal.
Proof
(Sketch). If \(\tau = \{t_i\}_{i=1}^{k+1}\), with \(|\tau | = k+1 > c+1\), is a steady schedule, then \(\mathcal {C}_{t_1} = \mathcal {C}_{t_k}\), and its prefix \(\theta = \{t_i\}_{i=1}^k\) is a steady and feasible schedule, with \(k > c\). By Lemma 3, there is a multiset \((\mathcal {P}, m)\) of runs of length k describing \(\theta \). Since \(k > c\), and c is the longest chain in the STA, which is 1-cyclic, it must be the case that every run in \(\mathcal {P}\) contains at least one self-loop. Construct a new multiset \((\mathcal {P}', m')\) of runs of length \(k-1\), such that each \(\varrho ' \in \mathcal {P}'\) is obtained from some \(\varrho \in \mathcal {P}\) by removing one occurrence of a self-loop rule. The multiset \((\mathcal {P}', m')\) defines the schedule \(\theta ' = \{t'_i\}_{i=1}^{k-1}\). Because of the monotonicity and steadiness of \(\theta \), and because we only remove self-loops (which go from and to the same location) when we build \(\theta '\) from \(\theta \), the feasibility is preserved, that is, it holds that \(g(t'_{i-1}) = o(t'_i)\), for \(1< i < k\), and that no guards false in \(\theta \) become true in \(\theta '\). Furthermore, it is easy to check that \(\theta '\) has the same origin and goal as \(\theta \). As the goal of \(\theta '\) is the origin of \(t_{k+1}\), construct a schedule \(\tau ' = \{t'_{i}\}_{i=1}^k\), where \(t'_k = t_{k+1}\). As \(\tau \) is steady, the transitions \(t_1\) and \(t_{k+1}\) have the same contexts. From \(o(t_1) = o(t'_1)\) and \(o(t_{k+1}) = o(t'_k)\), we get that \(t'_1\) and \(t'_k\) have the same contexts, which, together with the monotonicity, implies that \(\tau '\) is steady. \(\square \)
As a consequence of Lemmas 1 and 4, we obtain Theorem 1, which tells us that for any feasible schedule, there exists a feasible schedule of length \(O(|\varPsi | c)\). This bound does not depend on the parameters, but on the number of context switches and the longest chain \(c\), which are properties of the STA.
6 Bounded Model Checking of Safety Properties
Once we obtain the diameter bound d (either using the procedure from Sect. 5.1, or by Theorem 1), we use it as a completeness threshold for bounded model checking. For the algorithms that we verify, we express the violations of their safety properties as reachability queries on bounded executions. The length of the bounded executions depends on d, and on whether the algorithm was designed such that it is assumed that there is a clean round in every execution.
7 Experimental Evaluation
The algorithms that we model using STA and verify by bounded model checking are designed for different fault models, which in our case are crashes, send omissions or Byzantine faults. We now proceed by introducing our benchmarks. Their encodings, together with the implementations of the procedures for finding the diameter and applying bounded model checking are available at [1].
Table 2. Results for our benchmarks, available at [1]: \(|\mathcal {L}|, |\mathcal {R}|, |\varPsi |\), RC are the number of locations, rules, atomic guards, and the resilience condition in each STA; d is the diameter computed using SMT; \(c\) is the longest chain of the algorithms whose STA are monotonic and 1-cyclic; \(\tau \) is the time (in seconds) to compute the diameter using SMT; T, SMT is the time to check reachability using the diameter computed using the SMT procedure from Sect. 5.1; T, Theorem 1 is the time to check reachability using the bound obtained by Theorem 1. For the cases where Theorem 1 is not applicable, we write (–). The experiments were run on a machine with an Intel(R) Core(TM) i5-4210U CPU and 4GB of RAM, using z3-4.8.1 and cvc4-1.6.
Computing the Diameter. We implemented the procedure from Sect. 5.1 in Python. The implementation uses a backend SMT solver (currently, z3 and cvc4). Our tool computed diameter bounds for all of our benchmarks, even for those for which we do not have a theoretical guarantee. Our experiments reveal extremely low values for the diameter, that range between 1 and 4. The values for the diameter and the time needed to compute them are presented in Table 2.
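The idea behind the diameter computation can be sketched on an explicit finite transition relation. The sketch below is only an illustration: it assumes that the diameter is the smallest d such that every pair of configurations connected by a path of length d+1 is also connected by a path of length at most d, and it enumerates pairs of states explicitly, whereas our tool poses the corresponding question to an SMT solver with the parameters treated symbolically. The function names and the toy relation are hypothetical.

```python
def compose(rel_a, rel_b):
    """Relational composition: pairs (s, u) with s -rel_a-> t -rel_b-> u."""
    succ = {}
    for t, u in rel_b:
        succ.setdefault(t, set()).add(u)
    return {(s, u) for s, t in rel_a for u in succ.get(t, ())}


def diameter(states, step):
    """Smallest d such that paths of length d+1 add no new connected pairs."""
    within = {(s, s) for s in states}  # pairs connected by paths of length <= d
    exact = set(within)                # pairs connected by paths of length == d
    d = 0
    while True:
        longer = compose(exact, step)  # pairs connected by paths of length d+1
        if longer <= within:
            return d
        within |= longer
        exact = longer
        d += 1


# Toy relation: a chain 0 -> 1 -> 2 -> 3 with a self-loop on the last state.
step = {(0, 1), (1, 2), (2, 3), (3, 3)}
toy_diameter = diameter(range(4), step)
```

On a finite state space the loop terminates, since `within` grows strictly in every iteration that does not return.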
Checking the Algorithms. We have implemented another Python function which encodes violations of the safety properties as reachability properties on paths of bounded length, as described in Sect. 6, and uses a backend SMT solver to check their satisfiability. Table 2 contains the results that we obtained by checking reachability for our benchmarks, using the diameter bound computed using the procedure from Sect. 5.1, and the diameter bound from Theorem 1, for algorithms whose STA are monotonic and 1-cyclic.
To our knowledge, we are the first to verify the listed algorithms that work with send omission, Byzantine and hybrid faults. For the algorithms with crash faults, our approach is a significant improvement over the results obtained using the abstraction-based method from [3].
8 Discussion and Related Work
Parameterized verification of synchronous and partially synchronous distributed algorithms has recently gained attention. Both models have in common that distributed computations are organized in rounds and processes (conceptually) move in lock-step. For partially synchronous consensus algorithms, the authors of [15] introduced a consensus logic and (semi-)decision procedures. Later, the authors of [27] introduced a language for partially synchronous consensus algorithms, and proved cutoff theorems specialized to the properties of consensus: agreement, validity, and termination. Concerning synchronous algorithms, the authors of [3] introduced an abstraction-based model checking technique for crash-tolerant synchronous algorithms with existential guards. In contrast to their work, we allow more general guards that contain linear expressions over the parameters, e.g., \(n-t\). Our method offers more automation, and our experimental evaluation shows that our technique is faster than the technique of [3].
We introduce a synchronous variant of threshold automata, which were proposed in [21] for asynchronous algorithms. Several extensions of this model were recently studied in [23], but the synchronous case was not considered. STA extend the guarded protocols by [16], in which a process can check only if a sum of counters is different from 0 or n. Generalizing the results from [16] to STA is not straightforward. In [2], safety of finite-state transition systems over infinite data domains was reduced to backwards reachability checking using a fixpoint computation, as long as the transition systems are well-structured. It would be interesting to put our results in this context. A decidability result for liveness properties of parameterized timed networks was obtained in [4], employing linear programming for the analysis of vector addition systems with a parametric initial state. We plan to investigate the use of similar ideas for analyzing liveness properties of STA.
The 1-cyclicity condition is reminiscent of flat counter automata [25]. In Fig. 3, we show a possible translation of an STA to a counter automaton (similar to the translation for asynchronous threshold automata from [23]). We note that the counter automaton is not flat, due to the presence of the outer loop, which models a transition to the next round. By knowing a bound d on the diameter (e.g., by Theorem 1), one can flatten the counter automaton by unfolding the outer loop d times. We also experimented with FAST [6] on two of our benchmarks: rb and floodmin for \(k = 1\), depicted in Figs. 1 and 2 respectively. FAST terminated on rb, but took significantly longer than our tool on the same machine (i.e., hours rather than seconds). FAST ran out of memory when checking floodmin.
Our experiments show that STA that are neither monotonic nor 1-cyclic may still have bounded diameters. Finding other classes of STA for which one could derive the diameter bounds is a subject of future work. Although we considered only reachability properties in this work, which happened to be challenging, we are going to investigate completeness thresholds for liveness in the future.
References
1. Experiments. https://github.com/istoilkovska/syncTA
2. Abdulla, P.A., Cerans, K., Jonsson, B., Tsay, Y.: General decidability theorems for infinite-state systems. In: LICS, pp. 313–321 (1996)
3. Aminof, B., Rubin, S., Stoilkovska, I., Widder, J., Zuleger, F.: Parameterized model checking of synchronous distributed algorithms by abstraction. In: Dillig, I., Palsberg, J. (eds.) Verification, Model Checking, and Abstract Interpretation. LNCS, vol. 10747, pp. 1–24. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73721-8_1
4. Aminof, B., Rubin, S., Zuleger, F., Spegni, F.: Liveness of parameterized timed networks. In: Halldórsson, M.M., Iwama, K., Kobayashi, N., Speckmann, B. (eds.) ICALP 2015. LNCS, vol. 9135, pp. 375–387. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-47666-6_30
5. Attiya, H., Welch, J.: Distributed Computing, 2nd edn. Wiley, Hoboken (2004)
6. Bardin, S., Leroux, J., Point, G.: FAST extended release. In: Ball, T., Jones, R.B. (eds.) CAV 2006. LNCS, vol. 4144, pp. 63–66. Springer, Heidelberg (2006). https://doi.org/10.1007/11817963_9
7. Barrett, C., et al.: CVC4. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 171–177. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1_14
8. Berman, P., Garay, J.A., Perry, K.J.: Asymptotically optimal distributed consensus. Technical report, Bell Labs (1989). http://plan9.belllabs.co/who/garay/asopt.ps
9. Berman, P., Garay, J.A., Perry, K.J.: Towards optimal distributed consensus (extended abstract). In: FOCS, pp. 410–415 (1989)
10. Biely, M., Schmid, U., Weiss, B.: Synchronous consensus under hybrid process and link failures. Theor. Comput. Sci. 412(40), 5602–5630 (2011)
11. Biere, A., Cimatti, A., Clarke, E.M., Zhu, Y.: Symbolic model checking without BDDs. In: Cleaveland, W.R. (ed.) TACAS 1999. LNCS, vol. 1579, pp. 193–207. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-49059-0_14
12. Bloem, R., et al.: Decidability of Parameterized Verification. Synthesis Lectures on Distributed Computing Theory. Morgan & Claypool Publishers (2015)
13. Chaudhuri, S., Herlihy, M., Lynch, N.A., Tuttle, M.R.: Tight bounds for \(k\)-set agreement. J. ACM 47(5), 912–943 (2000)
14. Clarke, E.M., Kroening, D., Ouaknine, J., Strichman, O.: Completeness and complexity of bounded model checking. In: Steffen, B., Levi, G. (eds.) VMCAI 2004. LNCS, vol. 2937, pp. 85–96. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24622-0_9
15. Drăgoi, C., Henzinger, T.A., Veith, H., Widder, J., Zufferey, D.: A logic-based framework for verifying consensus algorithms. In: McMillan, K.L., Rival, X. (eds.) VMCAI 2014. LNCS, vol. 8318, pp. 161–181. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54013-4_10
16. Emerson, E.A., Namjoshi, K.S.: Automatic verification of parameterized synchronous systems. In: Alur, R., Henzinger, T.A. (eds.) CAV 1996. LNCS, vol. 1102, pp. 87–98. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61474-5_60
17. Emerson, E.A., Namjoshi, K.S.: On reasoning about rings. Int. J. Found. Comput. Sci. 14(4), 527–550 (2003)
18. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985)
19. Konnov, I., Lazić, M., Veith, H., Widder, J.: A short counterexample property for safety and liveness verification of fault-tolerant distributed algorithms. In: POPL, pp. 719–734 (2017)
20. Konnov, I., Veith, H., Widder, J.: SMT and POR beat counter abstraction: parameterized model checking of threshold-based distributed algorithms. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 85–102. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_6
21. Konnov, I.V., Veith, H., Widder, J.: On the completeness of bounded model checking for threshold-based distributed algorithms: reachability. Inf. Comput. 252, 95–109 (2017)
22. Kroening, D., Strichman, O.: Efficient computation of recurrence diameters. In: Zuck, L.D., Attie, P.C., Cortesi, A., Mukhopadhyay, S. (eds.) VMCAI 2003. LNCS, vol. 2575, pp. 298–309. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36384-X_24
23. Kukovec, J., Konnov, I., Widder, J.: Reachability in parameterized systems: all flavors of threshold automata. In: CONCUR, pp. 19:1–19:17 (2018)
24. Lazić, M., Konnov, I., Widder, J., Bloem, R.: Synthesis of distributed algorithms with parameterized threshold guards. In: OPODIS. LIPIcs, vol. 95, pp. 32:1–32:20 (2017)
25. Leroux, J., Sutre, G.: Flat counter automata almost everywhere! In: Peled, D.A., Tsay, Y.-K. (eds.) ATVA 2005. LNCS, vol. 3707, pp. 489–503. Springer, Heidelberg (2005). https://doi.org/10.1007/11562948_36
26. Lynch, N.: Distributed Algorithms. Morgan Kaufman, Burlington (1996)
27. Marić, O., Sprenger, C., Basin, D.A.: Cutoff bounds for consensus algorithms. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10427, pp. 217–237. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63390-9_12
28. Minsky, M.L.: Computation: Finite and Infinite Machines. Prentice-Hall Inc., Upper Saddle River (1967)
29. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_24
30. Raynal, M.: Fault-Tolerant Agreement in Synchronous Message-Passing Systems. Synthesis Lectures on Distributed Computing Theory. Morgan & Claypool Publishers (2010)
31. Srikanth, T.K., Toueg, S.: Optimal clock synchronization. J. ACM 34(3), 626–645 (1987)
32. Srikanth, T., Toueg, S.: Simulating authenticated broadcasts to derive simple fault-tolerant algorithms. Distrib. Comput. 2, 80–94 (1987)
33. Stoilkovska, I., Konnov, I., Widder, J., Zuleger, F.: Artifact and instructions to generate experimental results for TACAS 2019 paper: Verifying Safety of Synchronous Fault-Tolerant Algorithms by Bounded Model Checking (artifact). Figshare (2019). https://doi.org/10.6084/m9.figshare.7824929.v1
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.