Abstract
Threshold guards are a basic primitive of many faulttolerant algorithms that solve classical problems of distributed computing, such as reliable broadcast, twophase commit, and consensus. Moreover, threshold guards can be found in recent blockchain algorithms such as Tendermint consensus. In this tutorial, we give an overview of the techniques implemented in Byzantine Model Checker (ByMC). ByMC implements several techniques for automatic verification of thresholdguarded distributed algorithms. These algorithms have the following features: (1) up to t of processes may crash or behave Byzantine; (2) the correct processes count messages and make progress when they receive sufficiently many messages, e.g., at least \(t+1\); (3) the number n of processes in the system is a parameter, as well as t; (4) and the parameters are restricted by a resilience condition, e.g., \(n > 3t\). Traditionally, these algorithms were implemented in distributed systems with up to ten participating processes. Nowadays, they are implemented in distributed systems that involve hundreds or thousands of processes. To make sure that these algorithms are still correct for that scale, it is imperative to verify them for all possible values of the parameters.
Supported by Interchain Foundation (Switzerland) and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 787367 (PaVeS). Partially supported by the Austrian Science Fund (FWF) via the Doctoral College LogiCS W1255.
Download conference paper PDF
1 Introduction
The recent advent of blockchain technologies [2, 20, 23, 30, 68, 81] has brought faulttolerant distributed algorithms to the spotlight of computer science and software engineering. In particular, due to the huge amount of funds managed by blockchains, it is crucial that their software is free of bugs. At the same time, these systems are characterized by a large number of participants. Thus, automated verification methods face the wellknown state space explosion problem. Furthermore, the wellknown undecidability results for the verification of parameterized systems [4, 15, 37, 38, 79] apply in this setting. One way to circumvent these problems is to develop domain specific methods that work for a specific subclass of systems.
In this tutorial, we consider verification techniques for faulttolerant distributed algorithms. As an example, consider a blockchain system, where a blockchain algorithm ensures coordination of the processes participating in the system. We observe that to do so, the processes need to solve a coordination problem called atomic (or, total order) broadcast [43], that is, every process delivers the same transactions in the same order. To achieve that, we typically need a resilience condition that restricts the fraction of processes that may be faulty [70]. The techniques we survey in this tutorial deal with the concepts of broadcast and atomic broadcast under resilience conditions.
While Bitcoin [68] was a new approach to consensus, several Blockchain systems like Tendermint [20] and HotStuff [81] are modern implementations that are built on these classic Byzantine fault tolerance concepts. While the techniques we describe here address in part the challenges for the verification of such systems. We discuss open challenges in Sect. 5.
In addition to practical importance, the reasons for the longstanding interest [39, 58, 61, 70] in distributed systems is that distributed consensus is nontrivial in two aspects:

1.
Most coordination problems are impossible to solve without imposing constraints on the environment, e.g., an upper bound on the fraction of faulty processes, assumptions on the behavior of faulty processes, or bounds on message delays and processing speeds (i.e., restricting interleavings) [33, 39, 70].

2.
Designing correct solutions is hard, owing to the huge state and execution space, and the complex interplay of assumptions mentioned in Point 1. Thus, even published protocols may contain bugs, as reported, e.g., by [62, 64].
Due to the impossibility of asynchronous faulttolerant consensus [39], much of the research focuses one what kinds of problems are solvable in asynchronous systems (e.g., some forms of reliable broadcast) or what kinds of systems allow to solve consensus. In Sect. 2 we will survey some of the most fundamental system assumptions that allow to solve problems in the presence of faults and example algorithms, and in Sect. 3 we will discuss how these algorithms can be modeled and how they can be automatically verified.
2 ThresholdGuarded Distributed Algorithms
In a classic survey, Schneider [73] explains replicated state machines by the following notion of replica coordination that consists of two properties:

Agreement. “Every nonfaulty state machine replica receives every request.”

Order. “Every nonfaulty state machine replica processes the requests it receives in the same relative order.”
In Schneider’s approach [73], the specification of Agreement can be solved using an algorithm for reliable broadcast [43]. The processes can use a consensus algorithm [25, 27, 35] to establish the Order property. For instance, the atomic broadcast algorithm from [25] contains these two subalgorithms.
The simplest canonical system model that allows one to solve consensus is the synchronous one, and we discuss it in Sect. 2.1. A second elegant way to circumvent the impossibility of [39] is by replacing liveness with almost sure termination, that is, a probabilistic guarantee. We review this approach in Sect. 2.3. In fact, reliable broadcast can be solved with an asynchronous distributed algorithm. We discuss their characteristics in Sect. 2.2.
2.1 Synchronous Algorithms
A classic example of a faulttolerant distributed algorithm is the broadcasting algorithm by Srikanth & Toueg [76]. The description of its code is given in Fig. 1. As is typical for distributed algorithms, the semantics are not visible from the pseudo code. In fact, we use the same pseudocode to describe its asynchronous variant later in Sect. 2.2.
The algorithm satisfies the Agreement property mentioned above. In a distributed system comprising reliable servers, which do not fail and do not lose messages, this property is easy to achieve. If a server receives a requests it sends the request to all other servers. As messages are delivered reliably, every request will eventually be received by every server. The problems comes with faults. Srikanth and Toueg studied Byzantine failures, where faulty servers may send messages only to a subset of the servers (or even send conflicting data). Then two servers may receive different requests. The algorithm in Fig. 1 addresses this problem, by forwarding message content received from other servers and only accepting a message content when it was received from a quorum of servers. For each message content m, one instance of this algorithm is executed. Initially the variable v captures whether a process has received m, it is 1 if this is the case. Then a process sends ECHO to all. In an implementation, the message would be of the form (ECHO,m), that is, it would be tagged with ECHO, and carry the content m to distinguish different instances running in parallel; also it would suffice to send the message once instead of sending it in each iteration. Then if the second guard in line 6 evaluates to true at a server p, then p has received \(t+1\) ECHO messages, which means that at least one correct process has forwarded the message, so it also forwards it. If a server receives \(nt\) ECHO messages, it finally accepts the request stored in m due to line 8. The reason this algorithm works is that the combination of \(nt\), \(t+1\), and \(n>3t\) ensures that if one correct processes has \(nt\) ECHO messages, every other correct process will eventually received at least \(t+1\) (there are \(t+1\) correct processes among any \(nt\) processes) so that every correct process will forward, and since there are at least \(nt\) correct processes, every one will accept. However, this arithmetics over parameters is subtle and errorprone. To this end, our verification techniques focus on threshold expressions and resilience conditions.
In the above discussion, we were imprecise about the code semantics. In this section we consider the synchronous semantics: All correct processes execute the code linebyline in lockstep. One loop iteration is called one round. A message sent by a correct process to a correct process is received within the same round. Then after sending and receiving messages in lockstep, all correct processes continue by evaluating the guards, before they all proceed to the next round. Because this semantics ensures that all processes move together, and all messages are received within the next rounds, no additional fairness needed to ensure liveness. In practice, this approach is often considered slow and expensive, as it has to be implemented with timeouts that are aligned to worst case message delays (which can be very slow in real networks). However, synchronous semantics offers a highlevel abstraction that allows one to design algorithms easier.
Figure 2 shows an example of another synchronous algorithm. This algorithm is run by n replicated processes, up to t of which may fail by crashing, that is, by prematurely halting. It solves the kset agreement problem, that is, out of the n initial values each process decides on one value, so that the number of different decision values is at most k. By setting \(k=1\), we obtain that there can be exactly one decision value, which coincides with the definition of consensus. In contrast to the reliable broadcast above, it runs for a finite number of rounds. The number of loop iterations \(\lfloor t/k \rfloor + 1\) of the FloodMin algorithm has been designed such that it ensures that there is at least one clean round in which at most \(k1\) processes crash. When we consider consensus, this means there is a round in which no process crashes, so that all processes receive the same values \(b_1, \dots b_\ell \). As a result, during that round all processes set best to the same value.
2.2 Asynchronous Algorithms
We now discuss the asynchronous semantics of the code in Fig. 1: at each time, exactly one processes performs a step. That is, the steps of the processes are interleaved. In the example one may interpret this as one code line being an atomic unit of executions at a process. In the “receive” statement, a process takes some messages out of the incoming message buffer: possibly no message, and not necessarily all messages that are in the buffer. The “send to all” then places one message in the message buffers of all the other processes. Often asynchronous semantics is considered more coarsegrained, e.g., a step consists of receiving, updating the state, and sending one or more messages.
As we do not restrict which messages are taken out of the buffer during a step, we cannot bound the time needed for message transmission. Moreover, we do not restrict the order, in which processes have to take steps, so we cannot bound the time between two steps of a single process. Typically, we are interested in verifying safety (nothing bad ever happens) under these conditions.
However, for liveness this is problematic. We need messages to be delivered eventually, and correct processes to take steps from time to time. So liveness is typically preconditioned by fairness guarantees: every correct processes takes infinitely many steps and every message sent from a correct process to a correct process is eventually received. For broadcast these constraints are sufficient, while for consensus they are not.
2.3 Randomized Algorithms
A prominent example is BenOr’s faulttolerant binary consensus [7] algorithm in Fig. 3. It circumvents the impossibility of asynchronous consensus [39] by relaxing the termination requirement to almostsure termination, i.e., termination with probability 1. Here processes execute an infinite sequence of asynchronous rounds. While the algorithm is executed under asynchronous semantics, the processes have a local variable r that stores the round number. Processes essages that they send in round r with the round number. Observe that the algorithm only operates on messages from the current round (the guards only count messages tagged with r). Asynchronous algorithms with this feature are called communication closed [29, 36]. Each round consists of two stages where the processes first exchange messages tagged with R, wait until the number of received messages reaches a certain threshold (the expression over parameters in line 5) and then exchange messages tagged with P. As in the previous examples, n is the number of processes, among which at most t may crash or be Byzantine. The thresholds \(nt\), \((n+t)/2\) and \(t+1\) in combination with the resilience condition \(n > 5t\) ensure that no two correct processes ever decide on different values. If there is no “strong majority” for a value in line 13, a process chooses a new value by tossing a coin in line 16.
3 Parameterized Verification
3.1 Synchronous Algorithms
In [78], we introduced the synchronous variant of threshold automata, and studied their applicability and limitations for verification of synchronous faulttolerant distributed algorithms. We showed that the parameterized reachability problem for synchronous threshold automata is undecidable. Nevertheless, we observed that counter systems of many synchronous faulttolerant distributed algorithms have bounded diameters, even though the algorithms are parameterized by the number of processes. Hence, bounded model checking can be used for verifying these algorithms. We briefly discuss these results in the following.
Synchronous Threshold Automata. In a synchronous algorithm, the processes execute the send, receive, and local computation steps in lockstep. Consider the synchronous reliable broadcast algorithm from [77], whose pseudocode is given in Fig. 1 (left). A synchronous threshold automaton (STA) that encodes the pseudocode of this algorithm is given in Fig. 1 (right). The STA models the loop body of the pseudo code: one iteration of the loop is expressed as an STA edge that connects the locations before and after a loop iteration.
The semantics of the synchronous threshold automaton is defined in terms of a counter system. For each location \(\ell _i \in \{\textsc {v0}, \textsc {v1}, \textsc {se}, \textsc {ac}\}\) (a node in the graph), we have a counter \(\kappa _i\) that stores the number of processes located in \(\ell _i\). The counter system is parameterized in two ways: (i) in the number of processes n, the number of faults f, and the upper bound on the number of faults t, (ii) the expressions in the guards contain n, t, and f. Every system transition moves all processes simultaneously; potentially using a different rule for each process (depicted by an edge in the figure), provided that the rule guards evaluate to true. The guards compare a sum of counters to a linear combination of parameters. Processes send messages based on their current locations. Hence, we use the number of processes in given locations to test how many messages of a certain type have been sent in the previous round. However, the pseudo code in Fig. 1 is predicated by received messages rather than by sent messages. This algorithm is designed to tolerate Byzantinefaulty processes, which may send corrupt messages to some correct processes. Thus, the number of received messages may deviate from the number of correct processes that sent a message. For example, if the guard in line 6 evaluates to true, the \(t+1\) received messages may contain up to f messages from the faulty processes. If i correct processes send \(\texttt {<ECHO>}\), for \(1 \le i \le t\), the faulty processes may “help” some correct processes to pass over the \(t + 1\) threshold. That is, the effect of the f faulty processes on the correct processes is captured by the “\(f\)” component in the guards. As a result, we run only the correct processes, so that a system consists of \(nf\) copies of the STA.
For example, in the STA in Fig. 1, processes send a message if they are in a location \(\textsc {v1}\), \(\textsc {se}\), or \(\textsc {ac}\). Thus, the guards compare the number of processes in a location \(\textsc {v1}\), \(\textsc {se}\), or \(\textsc {ac}\), which we denote by \(\#\{\textsc {v1},\textsc {se},\textsc {ac}\}\), to some linear expression over the parameters, called a threshold. The assignment v:=1 in line 6 is modeled by the rule \(r_2\), guarded with \(\phi _1\; \equiv \; \#\{\textsc {v1},\textsc {se},\textsc {ac}\} \ge t+1f\). This guard evaluates to true if he number of processes in location \(\textsc {v1}\), \(\textsc {se}\), or \(\textsc {ac}\) is greater than or equal to \(t+1f\). The implicit “else” branch between lines 6 and 8 is modeled by the rule \(r_1\), guarded with \(\phi _3 \equiv \; \#\{\textsc {v1},\textsc {se},\textsc {ac}\} < t+1\). The effect of the faulty processes is captured by both the rules \(r_1\) and \(r_2\) being enabled. Similarly, the rules \(r_5, r_7, r_8\) are guarded with the guard \(\phi _2\; \equiv \; \#\{\textsc {v1},\textsc {se},\textsc {ac}\} \ge ntf\), which is true when the number of process in one of \(\textsc {v1}\), \(\textsc {se}\), or \(\textsc {ac}\) is greater or equal to \(n  t f\), while the rules \(r_3, r_4\) are guarded with \(\phi _4 \; \equiv \; \#\{\textsc {v1},\textsc {se},\textsc {ac}\} < n  t\). The rule \(r_6\) is unguarded, i.e., its guard is \(\top \).
Bounded Diameter. An example execution of the synchronous reliable broadcast algorithm is depicted in Table 1 on the left. Observe that the guards of the rules \(r_1\) and \(r_2\) are both enabled in the configuration \(\sigma _0\). One STA uses \(r_2\) to go to \(\textsc {se}\) while the others use the selfloop \(r_1\) to stay in \(\textsc {v0}\). As both rules remain enabled, in every round one copy of STA can go to \(\textsc {se}\). Hence, the configuration \(\sigma _{t+1}\) has \(t+1\) correct STA in location \(\textsc {se}\) and the rule \(r_1\) becomes disabled. Then, all remaining STA go to \(\textsc {se}\) and then finally to \(\textsc {ac}\). This execution depends on the parameter t, which implies that the length of this execution grows with t and is thus unbounded. (We note that we can obtain longer executions, if some STA use the rule \(r_4\)). On the right, we see an execution where all copies of STA immediately move to \(\textsc {se}\) via rule \(r_2\). That is, while the configuration \(\sigma _{t+3}\) is reached by a long execution on the left, it is reached in just two steps on the right (observe that \(\sigma '_2=\sigma _{t+3}\)). We are interested in whether there is a natural number k (independent of n, t and f) such that we can always shorten executions to executions of length \(\le k\). (By length, we mean the number of transitions in an execution.) In such a case, we say that the STA has bounded diameter. We adapt the definition of diameter from [14], and introduce an SMTbased procedure for computing the diameter of the counter system. The procedure enumerates candidates for the diameter bound, and checks (by calling an SMT solver) if the number is indeed the diameter; if it finds such a bound, it terminates.
Bounded Model Checking. The existence of a bounded diameter motivates the use of bounded model checking, as safety verification can be reduced to checking the violation of a safety property in executions with length up to the diameter. Crucially, this approach is complete: if an execution reaches a bad configuration, this bad configuration is already reached by an execution of bounded length. Thus, once the diameter is found, we encode the violation of a safety property using a formula in Presburger arithmetic, and use an SMT to check for violations.
The SMT queries that are used for computing the diameter and encoding the violation of the safety properties contain quantifiers for dealing with the parameters symbolically. Surprisingly, performance of the SMT solvers on these queries is very good, reflecting the recent progress in dealing with quantified queries. We found that the diameter bounds of synchronous algorithms in the literature are tiny (from 1 to 8), which makes our approach applicable in practice. The verified algorithms are given in Sect. 4.
Undecidability. In [78], we showed that the parameterized reachability problem is in general undecidable for STA. In particular, this implies that some STA have unbounded diameters. We identified a class of STA which in theory have bounded diameters. For some STA outside of this class, our SMTbased procedure still can automatically find the diameter. Remarkably, the SMTbased procedure gives us the diameters that are independent of the parameters.
3.2 Asynchronous Algorithms
Asynchronous Threshold Automata. Similarly as in STAs, nodes in asynchronous threshold automata (TAs) represent locations of processes, and edges represent local transitions. What makes a difference between an STA and a TA are shared variables and labels on edges that have a form \(\gamma \mapsto \textsf {act}\). A process moves along an edge labelled by \(\gamma \mapsto \textsf {act}\) and performs an action \(\textsf {act}\), only if the condition \(\gamma \), called a threshold guard, evaluates to true.
We model reliable broadcast [76] using the same threshold automaton from Fig. 1 but with different edge labels in comparison to the STA. We use a shared variable x to capture the number of \(\texttt {<ECHO>}\) messages sent by correct processes. We have two threshold guards: \(\gamma _1 :x \ge (t+1)  f\) and \(\gamma _2 :x \ge (nt)  f\). Depending on the initial value of a correct process, 0 or 1, the process is initially either in location \(\textsc {v0}\) or in \(\textsc {v1}\). If its value is 1 a process broadcasts \(\texttt {<ECHO>}\), and executes the rule \(r_3 :\text {true}\mapsto x\small {\texttt {++}}\). This is modelled by a process moving from \(\textsc {v1}\) to \(\textsc {se}\) and increasing the value of x. If its value is 0, it has to wait to receive enough messages, i.e., it waits for \(\gamma _1\) to become true, and then it broadcasts the \(\texttt {<ECHO>}\) message and moves to location \(\textsc {se}\). Thus, \(r_2\) is labelled by \(\gamma _1 \mapsto x\small {\texttt {++}}\). Finally, once a process has \(\gamma _2\)enough \(\texttt {<ECHO>}\) messages, it sets accept to true and moves to \(\textsc {ac}\). Thus, \(r_5\) is labelled by \(\gamma _2\), whereas \(r_7\) and \(r_8\) by \(\gamma _2\mapsto x\small {\texttt {++}}\).
Counter Systems. The semantics of threshold automata is captured by counter systems. Instead of storing the location of each process, we count the number of processes in each location, as all processes are identical. Therefore, a configuration comprises (i) values of the counters for each location, (ii) values of the shared variables, and (iii) parameter values. A configuration is initial if all processes are in initial locations, here \(\textsc {v0}\) or \(\textsc {v1}\), and all shared variables have value 0 (here \(x=0\)). A transition of a process along an edge from location \(\ell \) to location \(\ell '\) — labelled by \(\gamma \mapsto \textsf {act}\) — is modelled by the configuration update as follows: (i) the counter of \(\ell \) is decreased by 1, and the counter of \(\ell '\) is increased by 1, (ii) shared variables are updated according to the action \(\textsf {act}\), and (iii) parameter values are unchanged. The key ingredient of our technique is acceleration of transitions, that is, many processes may move along the same edge simultaneously. In the resulting configuration, counters and shared variables are changed depending on the number of processes that participate in the transition. It is important to notice that any accelerated transition can be encoded in SMT.
Reachability. In [49], we determine a finite set of execution “patterns”, and then analyse each pattern separately. These patterns restrict the order in which threshold guards become true (if ever). Namely, we observe how the set of guards that evaluate to true changes along each execution. In our example, there are two (nontrivial) guards, \(\gamma _1\) and \(\gamma _2\). Initially, both are false as \(x=0\). During an execution, none, one, or both of them become true, but note that once they become true, they never return to false, as the number of sent messages x cannot decrease. Thus, there is a finite set of execution patterns.
For instance, a pattern \(\{\}\dots \{\gamma _1\}\dots \{\gamma _1,\gamma _2\}\) captures all finite executions \(\tau \) that can be represented as \(\tau = \tau _1 \cdot t_1 \cdot \tau _2 \cdot t_2 \cdot \tau _3\), where \(\tau _1,\tau _2,\tau _3\) are subexecutions of \(\tau \), and \(t_1\) and \(t_2\) are transitions. No threshold guard is enabled in a configuration visited by \(\tau _1\), and only \(\gamma _1\) is enabled in all configurations visited by \(\tau _2\). Both guards are enabled in configurations visited by \(\tau _3\), and \(t_1\) and \(t_2\) change the evaluation of the guards. Another pattern \(\{\}\dots \{\gamma _2\}\dots \{\gamma _1,\gamma _2\}\) enables \(\gamma _2\) before \(\gamma _1\). Third pattern \(\{\}\dots \{\gamma _1\}\) never enables \(\gamma _2\).
To perform verification, we have to analyse all execution patterns. For each pattern we construct a socalled schema: A sequence of accelerated transitions that have as free variables: the number of processes that execute the transitions and the parameter values. In Fig. 1 transitions are modelled by edges denoted with \(r_i\), \(i \in 1..8\). For instance, the pattern \(\{\}\dots \{\gamma _1\}\) produces the schema:
There are two segments \(\tau _1\) and \(\tau _2\) corresponding to \(\{\}\) and \(\{\gamma _1\}\), respectively. In each of them we list all the rules that can be executed according to the true guards, in a fixed natural order: only \(r_1\) and \(r_3\) can be executed if no guard is enabled, and \(r_1, r_2, r_3, r_4\) if only \(\gamma _1\) holds true. Additionally, we have to list all the candidate rules for \(t_1\) that can change the evaluation of the guards. In our example only \(r_3\) can enable the guard \(\gamma _1\).
We say that an execution follows the schema \(\mathcal {S}\) if its transitions appear in the same order as in \(\mathcal {S}\), but they are accelerated (every transition is executed by a number of processes, possibly zero). For example, if (r, k) denotes that k processes execute the rule r simultaneously, then the execution \(\rho = (r_1,2)(r_3,3)(r_2,2)(r_4,1)\) follows \(\mathcal {S}\), where the transitions of the form (r, 0) are omitted. In this case, we prove that for each execution \(\tau \) of pattern \(\{\}\dots \{\gamma _1\}\), there is an execution \(\tau '\) that follows the schema \(\mathcal {S}\), and \(\tau \) and \(\tau '\) reach the same configuration (when executed from the same initial configuration). This is achieved by mover analysis: inside any segment in which the set of enabled guards is fixed, we can swap adjacent transitions (that are not in a natural order); in this way we gather all transitions of the same rule next to each other, and transform them into a single accelerated transition. For example, \(\tau = (r_3,2)(r_1,1)(r_3,1)(r_1,1)(r_2,1)(r_4,1)(r_2,1)\) can be transformed into \(\tau '= \rho \) from above, and they reach the same configurations. Therefore, instead of checking reachability for all executions of the pattern \(\{\}\dots \{\gamma _1\}\), it is sufficient to analyse reachability only for the executions that follow the schema \(\mathcal {S}\).
Every schema is encoded as an SMT query over linear integer arithmetic with free variables for acceleration factors, parameters, and counters. An SMT model gives us an execution of the counter system, which typically disproves safety.
For example, consider the following reachability problem: Can the system reach a configuration with at least one process in \(\ell _3\)? For each SMT query, we add the constraint that the counter of \(\ell _3\) is nonzero in the final configuration. If the query is satisfiable, then there is an execution where at least one process reaches \(\ell _3\). Otherwise, there is no such execution following the particular schema, where a process reaches \(\ell _3\). That is why we have to check all schemas.
Safety and Liveness. In [50] we introduced a fragment of Linear Temporal Logic called \({\textsf {ELTL}}_{\textsf {FT}}\). Its atomic propositions test location counters for zero. Moreover, this fragment only uses only two temporal operators: (eventually) and (globally). Our goal is to check whether there exists a counterexample to a temporal property, and thus formulas in this fragment represent negations of safety and liveness properties.
Our technique for verification of safety and liveness properties uses the reachability method as its basis. As before, we want to construct schemas that we can translate to SMT queries and check their satisfiability. Note that violations of liveness properties are infinite executions of a lasso shape, that is, \(\tau \cdot \rho ^\omega \), where \(\tau \) and \(\rho \) are finite executions. Hence, we have to enumerate the patterns of lassos. These shapes depend not only on the values of thresholds, but also on the evaluations of atomic propositions that appear in temporal properties. We single out configurations in which atomic propositions evaluate to true, and call them cut points, as they “cut” an execution into finitely many segments (see Fig. 4).
We combine these cut points with those “cuts” in which threshold guards become enabled (as in the reachability analysis). All the possible orderings in which thresholds and formulas become true, give us a finite set of lasso patterns. We construct a schema for each shape by first defining schemas for each of the segments between two adjacent cut points. On one hand, for reachability it is sufficient to execute all enabled rules of that segment exactly once in the natural order. Thus, each subexecution \(\tau _i\) can be transformed into \(\tau _i'\) that follows the segment’s schema, so that \(\tau _i\) and \(\tau _i'\) reach the same final configuration. On the other hand, safety and liveness properties reason about atomic propositions inside executions. To this end, we introduced a property specific mover analysis that allows us to construct schemas by executing all enabled rules a fixed number of times in a specific order. The number of rule repetitions depends on a temporal property; it is typically two or three.
For each lasso pattern we encode its schema in SMT and check its satisfiability. As \({\textsf {ELTL}}_{\textsf {FT}}\) formulas are negations of specifications, an SMT model gives us a counterexample. If no schema is satisfiable, the temporal property holds true.
3.3 Asynchronous Randomized Multiround Algorithms
Probabilistic Threshold Automata. Randomized algorithms typically have an unbounded number of asynchronous rounds and randomized choices. Probabilistic threshold automata (PTAs) are extensions of asynchronous threshold automata that allow formalizing these features. A PTA modelling BenOr’s algorithm from Fig. 3 is shown in Fig. 5. The behaviour of a process in a single round is modelled by the solid edges. Note that in this case threshold guards should be evaluated according to the values of shared variables, e.g., \(x_0\) and \(x_1\), in the observed round. The dashed edges model round switches: once a process reaches a final location in a round, it moves to an initial location of the next round. The coin toss is modelled by the branching rule \(r_{10}\): a process in location SP by moving along this fork can reach either \(CT_0\) or \(CT_1\), both with probability 1/2.
Unboundedly Many Rounds. In order to overcome the issue of unboundedly many rounds, we prove that we can verify PTAs by analysing a oneround automaton that fits in the framework of Sect. 3.2. In [11], we prove that one can reorder transitions of any fair execution such that their round numbers are in a nondecreasing order. The obtained ordered execution is stutter equivalent to the original one. Thus, the both execution satisfy the same \({\textsf {LTL}}_{\textsf {{X}}}\) properties over the atomic propositions of one round. In other words, the distributed system can be transformed to a sequential composition of oneround systems.
The main problem with isolating a oneround system is that consensus specifications often talk about at least two different rounds. In this case we need to use round invariants that imply the specifications. For example, if we want to verify agreement, we have to check that no two processes decide different values, possibly in different rounds. We do this in two steps: (i) we check the round invariant that no process changes its decision from round to round, and (ii) we check that within a round no two processes disagree.
Probabilistic Properties. The semantics of a probabilistic threshold automaton is an infinitestate Markov decision process (MDP), where the nondeterminism is traditionally resolved by an adversary. In [11], we restrict our attention to socalled roundrigid adversaries, that is, fair adversaries that generate executions in which a process enters round \(r+1\) only after all processes finished round r.
Verifying almostsure termination under roundrigid adversaries calls for distinct arguments. Our methodology follows the lines of the manual proof of Ben Or’s consensus algorithm by Aguilera and Toueg [3]. However, our arguments are not specific to Ben Or’s algorithm, and apply to other randomized distributed algorithms (see Table 2). Compared to their paperandpencil proof, the threshold automata framework required us to provide a more formal setting and a more informative proof, also pinpointing the needed hypotheses. The crucial parts of our proof are automatically checked by the model checker ByMC.
4 ByMC: Byzantine Model Checker
Overview of the Techniques Implemented in ByMC. Table 2 shows coverage of various asynchronous algorithms with the techniques that are implemented in ByMC. In the following, we give a brief description of these techniques.
We started development of ByMC in 2012. We extended the classic \(\{0,1,\infty \}\)counter abstraction to thresholdguarded algorithms [41, 46, 47]. Instead of using the predefined intervals [0, 1) and \([1, \infty )\), the tool was computing parametric intervals by simple static analysis, for instance, the intervals [0, 1), \([1, t+1)\), \([t+1, nt)\), and \([nt, \infty )\). ByMC was automatically constructing the finitestate counter abstraction from protocol specifications in Parameterized Promela. This finite abstraction was automatically checked with Spin [45]. As this abstraction was typically too coarse for liveness checking, we have implemented a simple counterexampleguided abstraction refinement loop for parameterized systems. This technique is called CA+SPIN in Table 2.
Spin scaled only to two broadcast algorithms. Thus, we extended ByMC with the abstraction/checking loop that used nuXmv [24] instead of Spin. This technique is called CA+BDD in Table 2. Although this extension scaled better than CA+SPIN, we could only check two more benchmarks with it. Detailed discussions of the techniques CA+SPIN and CA+BDD can be found in [41, 53].
By running the abstraction/checking loop in nuXmv, we found that the bounded model checking algorithms of nuXmv could check long executions of our benchmarks. However, bounded model checking in general does not have completeness guarantees. In [51, 55], we have shown that the counter systems of (asynchronous) threshold automata have computable bounded diameters, which gave us a way to use bounded model checking as a complete verification approach for reachability properties. This technique is called CA+SAT in Table 2. Still, the computed upper bounds were too high for achieving complete verification.
The SMTbased techniques of Sect. 3.2 are called SMTS (for safety) and SMTL (for liveness) in Table 2. These techniques accept either threshold automata or Parametric Promela on their input. As one can see, these techniques are the most efficient techniques that are implemented in ByMC. More details on the experiments can be found in the tool paper [54].
Finally, the technique for multiround randomized algorithms is called SMTMR in Table 2. This technique is explained in Sect. 3.3.
Model Checking Synchronous Threshold Automata. The bounded model checking approach for STA introduced in Sect. 3.1 is not yet integrated into ByMC. It is implemented as a standalone tool, available at [1]. In [78], we encoded multiple synchronous algorithms from the literature, such as consensus [9, 10, 13, 63, 72], kset agreement (from [63], whose pseudocode is given in Fig. 2 and [72]), and reliable broadcast (from [13, 77]) algorithms. We use Z3 [67] and CVC4 [6] as backend SMT solvers. Table 3 gives an overview of the verified synchronous algorithms. For further details on the experimental results, see [78].
5 Towards Verification of Tendermint Consensus
Tendermint consensus is a faulttolerant distributed algorithm for proofofstake blockchains [22]. Tendermint can handle Byzantine faults under the assumption of partial synchrony. It is running in the Cosmos network, where currently over 100 validator nodes are committing transactions and are managing the ATOM cryptocurrency [21]. Tendermint consensus heavily relies on threshold guards, as can be seen from its pseudocode in [22] [Algorithm 1]. For instance, one of the Tendermint rules has the following precondition:
The rule 1 requires two kinds of messages: (1) a single message of type PROPOSAL carrying a proposal v from the process \(proposer(h_p, round_p)\) that is identified by the current round \(round_p\) and consensus instance \(h_p\), and (2) messages of type PREVOTE from several nodes. Here the term \(2f+1\) (taken from the original paper) in fact does not refer to a number of processes. Rather each process has a voting power (an integer that expresses how many votes a process has), and \(2f+1\) (in combination with \(n=3t+1\)) expresses that nodes that have sent PREVOTE have more than twothirds of voting power. Although this rule bears similarity with the rules of threshold automata, Tendermint consensus has the following features that cannot be directly modelled with threshold automata:

1.
In every consensus instance \(h_p\) and round \(round_p\), a single proposer sends a value that the nodes vote on. The identity of the proposer can be accessed with the function \(proposer(h_p, round_p)\). This feature breaks symmetry among individual nodes, which is required by our modelling with counter systems. Moreover, the proposer function should be fairly distributed among the nodes, e.g., it can be implemented with round robin.

2.
Whereas the classical example algorithms in this paper count messages, Tendermint evaluates the voting power of the nodes from which messages where received. This adds an additional layer of complexity.

3.
Liveness of Tendermint requires the distributed system to reach a global stabilization period, when every message could be delivered not later than after a bounded delay. This model of partial synchrony lies between synchronous and asynchronous computations and requires novel techniques for parameterized verification.
As a first step towards parameterized verification, we are specifying Tendermint consensus in TLA^{+} [59] and check its properties with the symbolic model checker Apalache [48]. As Apalache currently supports only nonparameterized verification —the specification parameters must be fixed —we are planning to use automatic abstractions to build a bridge between Apalache and ByMC.
6 Conclusions
Computeraided verification of distributed algorithms and systems is a lively research area. Approaches range from mechanized verification [44, 74, 80] over deductive verification [8, 29, 31, 34, 69] to automated techniques [5, 16, 40, 56]. In our work, we follow the idea of identifying fragments of automata and logic that are sufficiently expressive for capturing interesting algorithms and specifications, while these fragments are amenable for completely automated verification. We introduced threshold automata for that and implemented our verification techniques in the open source tool ByMC [54]. By doing so, we verified several challenging distributed algorithms; most of them were verified for the first time.
The threshold automata framework has proved to be both of practical relevance as well as of theoretical interest. There are several ongoing projects that consider automatic generation of threshold automata from code, complexity theoretic analysis of verification problems, and more refined probabilistic reasoning.
References
Bounded Model Checking of STA. https://github.com/istoilkovska/syncTA
Abraham, I., Malkhi, D., Nayak, K., Ren, L., Spiegelman, A.: Solidus: an incentivecompatible cryptocurrency based on permissionless Byzantine consensus. CoRR abs/1612.02916 (2016). http://arxiv.org/abs/1612.02916
Aguilera, M., Toueg, S.: The correctness proof of BenOr’s randomized consensus algorithm. Distributed Computing pp. 1–11 (2012)
Apt, K., Kozen, D.: Limits for automatic verification of finitestate concurrent systems. IPL 15, 307–309 (1986)
Bakst, A., von Gleissenthall, K., Kici, R.G., Jhala, R.: Verifying distributed programs via canonical sequentialization. PACMPL 1(OOPSLA), 110:1–110:27 (2017)
Barrett, C., et al.: CVC4. In: CAV, pp. 171–177 (2011)
BenOr, M.: Another advantage of free choice: Completely asynchronous agreement protocols (extended abstract), In: PODC, pp. 27–30 (1983)
Berkovits, I., Lazić, M., Losa, G., Padon, O., Shoham, S.: Verification of thresholdbased distributed algorithms by decomposition to decidable logics. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11562, pp. 245–266. Springer, Cham (2019). https://doi.org/10.1007/9783030255435_15
Berman, P., Garay, J.A., Perry, K.J.: Asymptotically optimal distributed consensus. Technical report, Bell Labs (1989). http://plan9.belllabs.co/who/garay/asopt.ps
Berman, P., Garay, J.A., Perry, K.J.: Towards optimal distributed consensus (Extended Abstract). In: FOCS, pp. 410–415 (1989)
Bertrand, N., Konnov, I., Lazic, M., Widder, J.: Verification of randomized consensus algorithms under roundrigid adversaries. In: CONCUR 2019, LIPIcs, vol. 140, pp. 33:1–33:15 (2019)
Bertrand, N., Konnov, I., Lazić, M., Widder, J.: Verification of randomized consensus algorithms under roundrigid adversaries. In: CONCUR, pp. 33:1–33:15 (2019)
Biely, M., Schmid, U., Weiss, B.: Synchronous consensus under hybrid process and link failures. Theor. Comput. Sci. 412(40), 5602–5630 (2011)
Biere, A., Cimatti, A., Clarke, E.M., Zhu, Y.: Symbolic Model Checking without BDDs. TACAS. LNCS 1579, 193–207 (1999)
Bloem, R., et al..: Decidability of Parameterized Verification. Morgan & Claypool, Synthesis Lectures on Distributed Computing Theory (2015)
Bouajjani, A., Enea, C., Ji, K., Qadeer, S.: On the completeness of verifying message passing programs under bounded asynchrony. In: CAV. pp. 372–391 (2018)
Bracha, G.: Asynchronous Byzantine agreement protocols. Inf. Comput. 75(2), 130–143 (1987)
Bracha, G., Toueg, S.: Asynchronous consensus and broadcast protocols. J. ACM 32(4), 824–840 (1985)
Brasileiro, F.V., Greve, F., Mostéfaoui, A., Raynal, M.: Consensus in one communication step PaCT. LNCS 2127, 42–50 (2001)
Buchman, E.: Tendermint: Byzantine Fault Tolerance in the Age of Blockchains. Master’s thesis, University of Guelph (2016). http://hdl.handle.net/10214/9769
Buchman, E., Kwon, J.: Cosmos whitepaper: a network of distributed ledgers (2018). https://cosmos.network/resources/whitepaper
Buchman, E., Kwon, J., Milosevic, Z.: The latest gossip on BFT consensus. arXiv preprint arXiv:1807.04938 (2018). https://arxiv.org/abs/1807.04938
Buterin, V.: A nextgeneration smart contract and decentralized application platform (2014)
Cavada, R., et al.: The NUXMV symbolic model checker, In: CAV. pp. 334–342 (2014)
Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. J. ACM 43(2), 225–267 (1996)
Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. JACM 43(2), 225–267 (1996)
CharronBost, B., Schiper, A.: The heardof model: computing in distributed systems with benign faults. Distrib. Comput. 22(1), 49–71 (2009)
Chaudhuri, S., Herlihy, M., Lynch, N.A., Tuttle, M.R.: Tight Bounds for kset Agreement. J. ACM 47(5), 912–943 (2000)
Damian, A., Drăgoi, C., Militaru, A., Widder, J.: Communicationclosed asynchronous protocols, In: CAV. pp. 344–363 (2019)
Decker, C., Seidel, J., Wattenhofer, R.: Bitcoin meets strong consistency, In: ICDCN. pp. 13:1–13:10 (2016). https://doi.org/10.1145/2833312.2833321
Desai, A., Garg, P., Madhusudan, P.: Natural proofs for asynchronous programs using almostsynchronous reductions, In: OOPSLA, pp. 709–725 (2014)
Dobre, D., Suri, N.: Onestep consensus with zerodegradation, In: DSN. pp. 137–146 (2006)
Dolev, D., Dwork, C., Stockmeyer, L.: On the minimal synchronism needed for distributed consensus. J. ACM 34, 77–97 (1987)
Drăgoi, C., Henzinger, T.A., Veith, H., Widder, J., Zufferey, D.: A logicbased framework for verifying consensus algorithms VMCAI. LNCS 8318, 161–181 (2014)
Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35(2), 288–323 (1988)
Elrad, T., Francez, N.: Decomposition of distributed programs into communicationclosed layers. Sci. Comput. Program. 2(3), 155–173 (1982)
Emerson, E., Namjoshi, K.: Reasoning about rings, In: POPL, pp. 85–94 (1995)
Esparza, J.: Decidability of model checking for infinitestate concurrent systems. Acta Informatica 34(2), 85–107 (1997)
Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985)
Gleissenthall, K.V., Gökhan Kici, R., Bakst, A., Stefan, D., Jhala, R.: Pretend synchrony. In: POPL (2019), (to appear)
Gmeiner, A., Konnov, I., Schmid, U., Veith, H., Widder, J.: Tutorial on parameterized model checking of faulttolerant distributed algorithms. In: Bernardo, M., Damiani, F., Hähnle, R., Johnsen, E.B., Schaefer, I. (eds.) SFM 2014. LNCS, vol. 8483, pp. 122–171. Springer, Cham (2014). https://doi.org/10.1007/9783319073170_4
Guerraoui, R.: Nonblocking atomic commit in asynchronous distributed systems with failure detectors. Distrib. Comput. 15(1), 17–25 (2002)
Hadzilacos, V., Toueg, S.: Faulttolerant broadcasts and related problems. In: Mullender, S. (ed.) Distributed systems (2nd Ed.) pp. 97–145 (1993)
Hawblitzel, C., et al.: Ironfleet: proving safety and liveness of practical distributed systems. Commun. ACM 60(7), 83–92 (2017)
Holzmann, G.: The SPIN Model Checker. AddisonWesley, Boston (2003)
John, A., Konnov, I., Schmid, U., Veith, H., Widder, J.: Counter attack on byzantine generals: parameterized model checking of faulttolerant distributed algorithms, October 2012. http://arxiv.org/abs/1210.3846
John, A., Konnov, I., Schmid, U., Veith, H., Widder, J.: Parameterized model checking of faulttolerant distributed algorithms by abstraction, In: FMCAD. pp. 201–209 (2013)
Konnov, I., Kukovec, J., Tran, T.: TLA+ model checking made symbolic. PACMPL 3(OOPSLA), 123:1–123:30 (2019)
Konnov, I., Lazić, M., Veith, H., Widder, J.: Para\(^2\): Parameterized path reduction, acceleration, and SMT for reachability in thresholdguarded distributed algorithms. Formal Methods Syst. Des. 51(2), 270–307 (2017)
Konnov, I., Lazić, M., Veith, H., Widder, J.: A short counterexample property for safety and liveness verification of faulttolerant distributed algorithms. In: POPL, pp. 719–734 (2017)
Konnov, I., Veith, H., Widder, J.: On the completeness of bounded model checking for thresholdbased distributed algorithms: reachability. CONCUR. LNCS 8704, 125–140 (2014)
Konnov, I., Veith, H., Widder, J.: SMT and POR beat counter abstraction: parameterized model checking of thresholdbased distributed algorithms. In: CAV (Part I). LNCS, vol. 9206, pp. 85–102 (2015)
Konnov, I., Veith, H., Widder, J.: What you always wanted to know about model checking of faulttolerant distributed algorithms. In: Mazzara, M., Voronkov, A. (eds.) PSI 2015. LNCS, vol. 9609, pp. 6–21. Springer, Cham (2016). https://doi.org/10.1007/9783319415796_2
Margaria, T., Steffen, B. (eds.): ISoLA 2018. LNCS, vol. 11246. Springer, Cham (2018). https://doi.org/10.1007/9783030034245
Konnov, I.V., Veith, H., Widder, J.: On the completeness of bounded model checking for thresholdbased distributed algorithms: reachability. Inf. Comput. 252, 95–109 (2017)
Kragl, B., Qadeer, S., Henzinger, T.A.: Synchronizing the asynchronous. In: CONCUR. pp. 21:1–21:17 (2018)
Kukovec, J., Konnov, I., Widder, J.: Reachability in parameterized systems: all flavors of threshold automata. In: CONCUR. LIPIcs, vol. 118, pp. 19:1–19:17 (2018)
Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7), 558–565 (1978)
Lamport, L.: Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. AddisonWesley, Boston (2002)
Lazić, M., Konnov, I., Widder, J., Bloem, R.: Synthesis of distributed algorithms with parameterized threshold guards. In: OPODIS. LIPIcs, vol. 95, pp. 32:1–32:20 (2017). https://doi.org/10.4230/LIPIcs.OPODIS.2017.32
Le Lann, G.: Distributed systems  towards a formal approach. In: IFIP Congress, pp. 155–160 (1977). http://wwwroc.inria.fr/novaltis/publications/IFIP%20Congress%201977.pdf
Lincoln, P., Rushby, J.: A formally verified algorithm for interactive consistency under a hybrid fault model. In: FTCS, pp. 402–411 (1993)
Lynch, N.: Distributed Algorithms. Morgan Kaufman, San Francisco (1996)
Malekpour, M.R., Siminiceanu, R.: Comments on the “Byzantine selfstabilizing pulse synchronization”. protocol: Counterexamples. Tech. rep., NASA, February 2006. http://shemesh.larc.nasa.gov/fm/papers/Malekpour2006tm213951.pdf
Mostéfaoui, A., Moumen, H., Raynal, M.: Randomized kset agreement in crashprone and Byzantine asynchronous systems. Theor. Comput. Sci. 709, 80–97 (2018)
Mostéfaoui, A., Mourgaya, E., Parvédy, P.R., Raynal, M.: Evaluating the conditionbased approach to solve consensus. In: DSN, pp. 541–550 (2003)
de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: TACAS, pp. 337–340 (2008)
Nakamoto, S.: Bitcoin: a peertopeer electronic cash system (2008). https://bitcoin.org/bitcoin.pdf
Padon, O., McMillan, K.L., Panda, A., Sagiv, M., Shoham, S.: Ivy: safety verification by interactive generalization. In: PLDI, pp. 614–630 (2016)
Pease, M.C., Shostak, R.E., Lamport, L.: Reaching agreement in the presence of faults. J. ACM 27(2), 228–234 (1980)
Raynal, M.: A case study of agreement problems in distributed systems: Nonblocking atomic commitment. In: HASE, pp. 209–214 (1997)
Raynal, M.: Faulttolerant agreement in synchronous messagepassing systems. Morgan & Claypool Publishers, Synthesis Lectures on Distributed Computing Theory (2010)
Schneider, F.B.: Implementing faulttolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22(4), 299–319 (1990)
Sergey, I., Wilcox, J.R., Tatlock, Z.: Programming and proving with distributed protocols. PACMPL 2(POPL), 281–2830 (2018)
Song, Y.J., van Renesse, R.: Bosco: onestep Byzantine asynchronous consensus. DISC. LNCS 5218, 438–450 (2008)
Srikanth, T.K., Toueg, S.: Optimal clock synchronization. J. ACM 34(3), 626–645 (1987)
Srikanth, T., Toueg, S.: Simulating authenticated broadcasts to derive simple faulttolerant algorithms. Dist. Comp. 2, 80–94 (1987)
Stoilkovska, I., Konnov, I., Widder, J., Zuleger, F.: Verifying safety of synchronous faulttolerant algorithms by bounded model checking. In: Vojnar, T., Zhang, L. (eds.) TACAS 2019. LNCS, vol. 11428, pp. 357–374. Springer, Cham (2019). https://doi.org/10.1007/9783030174651_20
Suzuki, I.: Proving properties of a ring of finitestate machines. Inf. Process. Lett. 28(4), 213–214 (1988)
Wilcox, J.R., et al.: Verdi: a framework for implementing and formally verifying distributed systems. In: PLDI, pp. 357–368 (2015)
Yin, M., Malkhi, D., Reiter, M.K., GolanGueta, G., Abraham, I.: Hotstuff: BFT consensus with linearity and responsiveness. In: PODC, pp. 347–356 (2019)
Acknowledgments
This survey is based on the results of a longlasting research agenda [12, 47, 49, 50, 57, 60, 78]. We are grateful to our past and present collaborators Nathalie Bertrand, Roderick Bloem, Annu Gmeiner, Jure Kukovec, Ulrich Schmid, Helmut Veith, and Florian Zuleger, who contributed to many of the described ideas that are now implemented in ByMC.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 IFIP International Federation for Information Processing
About this paper
Cite this paper
Konnov, I., Lazić, M., Stoilkovska, I., Widder, J. (2020). Tutorial: Parameterized Verification with Byzantine Model Checker. In: Gotsman, A., Sokolova, A. (eds) Formal Techniques for Distributed Objects, Components, and Systems. FORTE 2020. Lecture Notes in Computer Science(), vol 12136. Springer, Cham. https://doi.org/10.1007/9783030500863_11
Download citation
DOI: https://doi.org/10.1007/9783030500863_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 9783030500856
Online ISBN: 9783030500863
eBook Packages: Computer ScienceComputer Science (R0)