1 Introduction

To enhance reliability, robustness and performance, many modern systems use a distributed architecture, composed of multiple nodes communicating with each other. Examples range from coordinated control of multi-robot systems such as swarms of mobile and aerial robots, to load-balancing among servers answering many queries per second. A fully decentralized system, where decisions are made collectively by the nodes rather than by one master node, greatly improves reliability by ensuring there is no single point of failure in the system. A distributed architecture also provides greater performance (depending on the context, in terms of load capacity, reduced latency, smaller communication overhead, etc.) than any single node could ever achieve. Distributed architectures are supported by distributed algorithms, which particularly focus on carefully handling situations where some nodes become faulty, stop responding, or become malicious.

One central aspect of distributed algorithms is the ability to achieve consensus. Consensus is said to be achieved in a network if all normal (correct) nodes agree on a certain value, where a node is normal if it is not faulty [34]. The value agreed upon by all nodes can be a reference point for the next position of a swarm, or the sequence of commands executed by a set of replicas in State Machine Replication [44]. Consensus has been studied extensively in different communities. In the distributed computer systems communities, some prominent algorithms achieving consensus are Paxos [29], MultiPaxos [47], Raft [36], and Practical Byzantine Fault Tolerance (PBFT) [6]. However, these algorithms deal with the problem of exact consensus. There are many scenarios where exact consensus is not achievable, ranging from the design of human-controlled systems to the analysis of natural systems like bird flocking. These problems have to be solved under harsh environmental restrictions such as restricted communication abilities and the presence of communication uncertainty. Therefore, these problems warrant the study of asymptotic consensus problems, which, unlike exact consensus, do not require strong assumptions on the underlying network [16].

This paper presents the first formal proof of an asymptotic consensus algorithm, by formalizing the Weighted-Mean Subsequence Reduced (W-MSR) algorithm [30, 50]. The problem of asymptotic consensus is of much importance to the distributed robotics and controls community, which has studied algorithms like the Mean Subsequence Reduced (MSR) algorithm [27] and its recent extension W-MSR. These algorithms are designed to achieve asymptotic consensus in partially connected groups of nodes, but have not been formally verified. In recent years, the distributed systems community has embraced formal methods to provide mechanically checked proofs of its consensus protocols and their implementations, using a wide range of techniques from interactive and automated theorem proving [5, 8, 9, 18, 25, 31, 48] to automatic generation of inductive invariants [20, 21, 33, 49]. The distributed robotics and controls community lags behind in this direction: researchers usually prove their consensus protocols with paper proofs, using mathematical analysis based on Lyapunov theory and its extensions, without computer-checked formalizations. As we show in this paper, our formalization of asymptotic consensus for the W-MSR algorithm [30] reveals imprecisions in the placement of quantifiers in the main theorem and several missing pieces in the proof, thereby highlighting the importance of machine-checked proofs. Thus a significant contribution of our work is providing the first mechanically checked formalization of asymptotic consensus and its application to the W-MSR algorithm. We have chosen to formalize this algorithm since it is widely used for resilient consensus in the controls community [41, 42, 46]. From the perspective of practical applications, enabling resilient consensus in the presence of misbehaving or faulty nodes is desirable for many applications in autonomous systems and robotics, e.g., for coordinated control of multi-robot systems.

The MSR and W-MSR algorithms are very different from exact consensus algorithms such as MultiPaxos, Raft or PBFT. As such, our formal verification of the correctness of W-MSR uses different techniques from previous proofs of exact consensus algorithms. The first major difference is that MSR and W-MSR guarantee asymptotic consensus rather than finite-time consensus. A second major difference is that MSR and W-MSR provide consensus in networks that are not fully connected: two normal nodes might not be able to communicate with each other directly, but might have to rely on another (possibly faulty) node to forward their messages to each other. This last property is crucial to model multi-robot systems where complete communication between any two robots may not be feasible at all times. Because of those differences, providing a mechanically checked proof of W-MSR requires the development and use of different techniques from the ones typically used to mechanically check MultiPaxos, Raft or PBFT. In particular, our formalization crucially relies on a formalization of limits and real analysis, because many of the techniques used in model checking or for generating invariants are not well suited to proving asymptotic properties.

Contributions: The original contribution of this work is the formalization in the Coq theorem prover of the convergence results of the W-MSR algorithm [30]. Specifically, we provide a machine-checked concrete counterexample for the proof of necessity, a clean proof of Lemma 1 and the Coq formalization of the main theorem (Theorem 1). We also fill in several missing details and clarify imprecisions in the proof of sufficiency, which can be viewed as an addition to the existing proof [30]. Additionally, this is, to our knowledge, the first mechanical formalization of a consensus algorithm where the consensus is obtained asymptotically, opening the door to more such proofs.

This paper is organized as follows. In Section 2, we discuss the problem setup and define terminologies related to graph topology and the W–MSR algorithm [30]. In Section 3, we discuss the formalization of the necessary and sufficient conditions in Coq, for achieving resilient asymptotic consensus. We also discuss some specific challenges we encountered during the formalization. After reviewing some related work in Section 4, we conclude in Section 5 by discussing key takeaways from our work and generic challenges we encountered during the formalization. We also lay down a few directions that could be addressed in future work.

2 Preliminaries

In this paper we consider the problem of formalizing consensus in a network, and adopt the problem formulation from [30]. While the original paper discusses consensus in a distributed control graph for both malicious and Byzantine threat models and for both time-varying and time-invariant graph structures, we limit our formalization to the case of a time-invariant graph under a malicious threat model with a particular threat scope: F-total, where the total number of malicious nodes in the control graph is bounded. We next briefly discuss what each of these terms means in the context of the following problem.

2.1 Problem formulation

Consider a network that is modeled by a digraph (directed graph), \(\mathcal {D} = (\mathcal {V}, \mathcal {E})\), where \(\mathcal {V} = \{1,\ldots , n\}\) is the node set and \(\mathcal {E} \subset \mathcal {V} \times \mathcal {V}\) is the directed edge set. The node set is partitioned into a set of normal nodes \(\mathcal {N}\), and a set of adversary nodes \(\mathcal {A}\), which are unknown a priori to the normal nodes. Each directed edge \((j,i) \in \mathcal {E}\) models information flow and indicates that node i can be influenced by (or receive information from) node j at time-step t. The set of in-neighbors of node i is defined as \(\mathcal {V}_i = \{j \in \mathcal {V} | (j,i) \in \mathcal {E}\}\). Intuitively, the set of in-neighbors contains all neighboring nodes of i, such that the direction of information flow is from those nodes to i. The cardinality of the set of in-neighbors is called the in-degree, \(d_i = |\mathcal {V}_i|\). Since each node has access to its own value at time-step t, we also consider a set of inclusive neighbors of node i, denoted by \(\mathcal {J}_i = \mathcal {V}_i \cup \{i\}\).
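As a hedged illustration of these definitions (it is not part of the Coq development), the in-neighbor set, the in-degree and the inclusive neighbor set can be sketched in Python. The function names and the example edge set are assumptions made up for this sketch.

```python
# Illustrative sketch only: names and the example digraph are hypothetical.

def in_neighbors(edges, i):
    """Return V_i = {j : (j, i) in E}, the in-neighbors of node i."""
    return {j for (j, k) in edges if k == i}


def inclusive_neighbors(edges, i):
    """Return J_i = V_i united with {i}."""
    return in_neighbors(edges, i) | {i}


# A made-up 4-node digraph; an edge (j, i) means information flows from j to i.
edges = {(1, 2), (3, 2), (2, 3), (4, 1)}

V_2 = in_neighbors(edges, 2)         # in-neighbors of node 2
d_2 = len(V_2)                       # in-degree of node 2
J_2 = inclusive_neighbors(edges, 2)  # inclusive neighbors of node 2
```

In this example node 4 has no incoming edge, so its in-neighbor set is empty while its inclusive neighbor set still contains node 4 itself.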

2.2 Threat Model

As discussed earlier, we formalize a threat model (F-total malicious model [30]) in which every adversary node in the graph is malicious, and there exists an upper bound F on the number of malicious agents in the graph, i.e., the set of adversary nodes is F-totally bounded. In the context of the problem in Section 2.1, the relevant formal definitions pertaining to the threat model are stated as follows:

Definition 1

(Malicious node  [30]). A node \(i \in \mathcal {A}\) is called Malicious if it sends the same value \(x_i(t)\) to all its neighbors at each time step t, but applies a different update function \(f'_i(.)\) at some time step.

Definition 2

(F-total set  [30]). A set \(\mathcal {S} \subset \mathcal {V}\) is F-total if it contains at most F nodes in the network, i.e., \(|\mathcal {S}| \le F\), \(F \in \mathbb {Z}_{\ge 0}\).

Definition 3

(F-totally bounded  [30]). A set of adversary nodes is F-totally bounded if it is an F-total set.

Note that while Definitions 2 and 3 may appear similar, they define different terminologies. Definition 2 defines an F-total set with at most F nodes in a network. Definition 3 specializes this to a set of adversary nodes saying that there are at most F adversarial nodes in a network.

2.3 Robust network topologies

The ability of a set of normal nodes in a control graph to achieve consensus depends on its ability to make local decisions effectively. Le Blanc et al. [30] defined a topological property called network robustness for reasoning about whether purely local algorithms can succeed, which we formalize in Coq. In particular, they define a property called (r, s)-robustness, which is stated as:

Definition 4

((r, s)-robustness  [30]). A digraph \(\mathcal {D} = (\mathcal {V}, \mathcal {E})\) on n nodes \((n \ge 2)\) is (r, s)-robust, for nonnegative integers \(r \in \mathbb {Z}_{\ge 0}\), \( 1 \le s \le n\), if for every pair of nonempty, disjoint subsets \(\mathcal {S}_1\) and \(\mathcal {S}_2\) of \(\mathcal {V}\) at least one of the following holds: \((i)~|\mathcal {X}_{\mathcal {S}_1}^r| = | \mathcal {S}_1|;~(ii)~|\mathcal {X}_{\mathcal {S}_2}^r| = | \mathcal {S}_2|;~(iii)~|\mathcal {X}_{\mathcal {S}_1}^r| + |\mathcal {X}_{\mathcal {S}_2}^r| \ge s\), where \(\mathcal {X}_{\mathcal {S}_k}^r = \{ i \in \mathcal {S}_k: | \mathcal {V}_i \backslash \mathcal {S}_k| \ge r \}\) for \(k \in \{1,2\}\).

Condition (iii) states that there are at least s nodes in total across the sets \(\mathcal {S}_1\) and \(\mathcal {S}_2\), each of which has at least r in-neighbors outside its own set. The idea is that “enough” nodes in every pair of nonempty, disjoint sets \(\mathcal {S}_1, \mathcal {S}_2 \subset \mathcal {V}\) have at least r in-neighbors outside of their respective sets. This ensures that the network is well connected, and that loss of information from a node due to a malicious attack does not affect the whole network. Figure 1 illustrates an example of a network with (2, 2)-robustness.
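To make Definition 4 concrete, the following Python sketch checks the three conditions by brute force over all pairs of nonempty, disjoint subsets. It is only an illustration under stated assumptions: the function names are hypothetical, the enumeration is exponential in the number of nodes, and it is unrelated to how the property is stated in Coq.

```python
from itertools import combinations


def x_count(edges, S, r):
    """|X_S^r|: number of nodes in S with at least r in-neighbors outside S."""
    count = 0
    for i in S:
        outside = {j for (j, k) in edges if k == i and j not in S}
        if len(outside) >= r:
            count += 1
    return count


def nonempty_subsets(items):
    """Yield every nonempty subset of items as a frozenset."""
    items = list(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_rs_robust(nodes, edges, r, s):
    """Brute-force check of Definition 4 (small graphs only)."""
    nodes = frozenset(nodes)
    for S1 in nonempty_subsets(nodes):
        for S2 in nonempty_subsets(nodes - S1):
            x1 = x_count(edges, S1, r)
            x2 = x_count(edges, S2, r)
            # At least one of conditions (i), (ii), (iii) must hold.
            if not (x1 == len(S1) or x2 == len(S2) or x1 + x2 >= s):
                return False
    return True


# Hypothetical example: the complete digraph on 4 nodes.
nodes4 = {1, 2, 3, 4}
complete4 = {(j, i) for j in nodes4 for i in nodes4 if i != j}
```

On the complete digraph with 4 nodes, `is_rs_robust(nodes4, complete4, 2, 2)` holds, whereas the check fails for r = 3, since two disjoint sets of size 2 leave each node with only 2 in-neighbors outside its own set.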

Fig. 1.
figure 1

Illustration for (2, 2) robustness. In the illustration (a), every node of the set \(S_2\) has 2 neighboring nodes outside \(S_2\). Similarly every node in the set \(S_1\) has at least 2 neighboring nodes outside \(S_1\). In the illustration (b), there are 2 nodes in the union \(S_1 \cup S_2\) that have 2 neighbors outside the set. Note that the sets \(S_1\) and \(S_2\) are disjoint.

2.4 Update model for the normal nodes

In this paper, we formalize a consensus algorithm called the W-MSR algorithm [30]. This algorithm provides an update model for the normal nodes in the network. A schematic of the algorithm is illustrated in Figure 2. We denote the value emitted by node i at time t as \(x_i(t)\), and the weight of the directed edge from node j to node i at time t as \(w_{ij}(t)\). The value \(x_i(t)\) could represent a measurement like position or velocity, or it could be an optimization variable. The quantity \(x_j^i(t)\) is the information that the \(j^{th}\) node in the neighboring set of node i sends to node i. Each node also has a time-varying set of neighbors that it ignores, which we denote as \(\mathcal {R}_i(t)\). The set \(\mathcal {R}_i(t)\) changes because nodes are removed depending on their value relative to the value of node i at time t. In this algorithm, the updated value of a normal node i at time \(t+1\) is a convex combination of the values of its neighboring set including itself. Hence, \(x_i(t+1) = \sum _{j\in \mathcal {J}_i \backslash \mathcal {R}_i(t)} w_{ij}(t)x_j^i(t)\), where we assume the existence of a constant \(\alpha \in \mathbb {R}\), such that \(0 < \alpha < 1\), and the weights \(w_{ij}(t)\) satisfy the conditions:

  1. \(w_{ij}(t) = 0\) whenever \(j \notin \mathcal {J}_i\);

  2. \(w_{ij}(t) \ge \alpha , \forall j \in \mathcal {J}_i\); and

  3. \(\sum _{j \in \mathcal {J}_i \backslash \mathcal {R}_i(t)} w_{ij}(t) = 1\)

for all \(i \in \mathcal {N}\), and \( t \in \mathbb {Z}_{\ge 0}\). It is important to note that the third condition depends on the set of removed nodes, which may change over time. In order to satisfy this condition the values of the weights may need to change over time.

The choice of neighboring sets in the W–MSR algorithm is defined as follows:

  1. At each time-step t, each normal node i obtains the values of its neighbors, and forms a sorted list.

  2. If there are fewer than F nodes with values strictly greater than the value of i, then the normal node removes all those nodes. Otherwise, it removes precisely the nodes with the F largest values in the sorted list. Likewise, if there are fewer than F nodes with values strictly less than the value of i, the normal node removes all such nodes. Otherwise, it removes precisely the nodes with the F smallest values in the sorted list.

Fig. 2.
figure 2

Schematic of the W-MSR update. At time t, the node i obtains values from its neighbors and forms a sorted list. The algorithm then removes the largest and the smallest F nodes in the sorted list, or if there are fewer than F nodes with values strictly greater than or less than the value of i, the algorithm removes all those nodes.
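The update described in this section can be sketched for a single normal node as follows. This is a minimal Python illustration, not the Coq definition: the name `wmsr_step` is hypothetical, and uniform weights over the retained values are an assumption that is just one choice compatible with the weight conditions (it needs \(\alpha\) to be at most the reciprocal of the number of retained values).

```python
def wmsr_step(own, neighbor_values, F):
    """One W-MSR update of a normal node (illustrative sketch).

    own: current value x_i(t) of the node.
    neighbor_values: values received from its in-neighbors at time t.
    F: bound on the number of malicious nodes.
    """
    higher = sorted(v for v in neighbor_values if v > own)
    lower = sorted(v for v in neighbor_values if v < own)
    equal = [v for v in neighbor_values if v == own]

    # Drop the F largest values strictly above own, or all of them
    # when there are fewer than F such values.
    kept_higher = higher[:max(len(higher) - F, 0)]
    # Drop the F smallest values strictly below own, or all of them
    # when there are fewer than F such values.
    kept_lower = lower[min(F, len(lower)):]

    # The node's own value is always retained.
    kept = kept_lower + equal + [own] + kept_higher

    # Assumption: uniform weights 1/len(kept), which are positive and sum to 1.
    return sum(kept) / len(kept)


# Hypothetical usage: the outlier 100.0 (and the smallest value 1.0) is ignored.
updated = wmsr_step(5.0, [1.0, 3.0, 7.0, 100.0], F=1)
```

With F = 1 the node retains the values 3.0, 5.0 and 7.0, so the extreme value 100.0 has no influence on the update.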

An important point to note here is that the above update model holds only for the normal nodes, i.e., \(i \in \mathcal {N}\). The update function for adversary nodes, i.e. \(i \in \mathcal {A}\), and their influence on the normal nodes depend on the threat model. We will next discuss the formalization of the W–MSR algorithm in Coq.

3 A formal proof of consensus for the W–MSR algorithm

Theorem 1

 [30] Consider a time-invariant network modeled by a digraph \(\mathcal {D} = (\mathcal {V}, \mathcal {E})\) where each normal node updates its value according to the W–MSR algorithm with parameter F. Under the F-total malicious model, resilient asymptotic consensus is achieved if and only if the network topology is \((F+1, F+1)\)-robust.

The proof of this theorem requires us to prove both a sufficiency and a necessity condition. The original paper proof relies on a safety condition, which provides an invariant condition that must hold at all times in the state update. We will next discuss the proof of the safety condition (Section 3.1), then sufficiency (Section 3.2) and necessity (Section 3.3) conditions individually.

3.1 Proof of the safety condition in W-MSR

Lemma 1

(Safety condition). [30] Suppose each node updates its value according to the W-MSR algorithm with parameter F under the F-total malicious model. Then for each node \(i \in \mathcal {N}\), \(x_i(t+1) \in [ m(t), M(t)]\), regardless of the network topology.

Here, \(m(t) = \min _{i \in \mathcal {N}}~ \{x_i(t) \}\) and \(M(t) = \max _{i \in \mathcal {N}}~\{ x_i(t) \}\). Note that the original paper [30] merely states an outline of the proof of this lemma, which makes a careful check of the proof difficult. Our detailed proof, which we formalize in this paper, is an original contribution: we explicitly enumerate the cases arising from the definition of the W-MSR algorithm.

Proof

We prove Lemma 1 by showing inductively that, at each time t and for every normal node i, there exists a node \(j_1 \in \mathcal {J}_i \cap \mathcal {N}\) such that \(\forall k \in \mathcal {J}_i\setminus {\mathcal {R}_i(t)}, \text { } x_{j_1}(t) \le x_k(t)\), thus:

$$\begin{aligned} x_i(t +1) &= \sum _{j\in \mathcal {J}_i \backslash \mathcal {R}_i(t)} w_{ij}(t)x_j^i(t) \ge \sum _{j\in \mathcal {J}_i \backslash \mathcal {R}_i(t)} w_{ij}(t)x_{j_1}^i(t) = x_{j_1}^i(t) \ge m(t) \end{aligned}$$
(1)

Symmetrically, there exists a \(j_2 \in \mathcal {J}_i \cap \mathcal {N}\) such that \(\forall k \in \mathcal {J}_i\setminus {\mathcal {R}_i(t)}, x_{j_2}(t) \ge x_k(t)\). Thus, the symmetric inequality \(x_i(t + 1) \le M(t)\) holds for the same reason. Since the proofs of the existence of \(j_1\) and \(j_2\) are nearly identical, we only show the proof of the former in Appendix A of the extended version [45].
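For completeness, the symmetric chain can be written out in the same form as inequality (1). It uses only that the weights are nonnegative and sum to one over \(\mathcal {J}_i \backslash \mathcal {R}_i(t)\), and that \(j_2\) is a normal node, so its value is bounded by \(M(t)\):

$$\begin{aligned} x_i(t +1) &= \sum _{j\in \mathcal {J}_i \backslash \mathcal {R}_i(t)} w_{ij}(t)x_j^i(t) \le \sum _{j\in \mathcal {J}_i \backslash \mathcal {R}_i(t)} w_{ij}(t)x_{j_2}^i(t) = x_{j_2}^i(t) \le M(t) \end{aligned}$$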

Formalization in Coq: We formalize Lemma 1 in Coq as:

figure c

The definition of F_total_malicious states that the model is F-total malicious if the set of adversary nodes is F-totally bounded (i.e., there are at most F adversary nodes in the network) and all the adversary nodes are malicious. Here A: D \(\rightarrow \) bool is a tagging function. If A i == true, then i is classified as an Adversary node; otherwise it is classified as a Normal node. mal : nat \(\rightarrow \) D \(\rightarrow \) R is an arbitrary update function for a malicious node. Since we do not know beforehand what this function looks like, we take it as a parameter. The function init : D \(\rightarrow \) R assigns an initial value to each node. We define a malicious node in Coq as a node in the graph for which the normal update model does not hold, i.e., there exists a time t such that \(x_i(t+1) \ne \sum _{j \in \mathcal {J}_i \backslash \mathcal {R}_i(t)}w_{ij}(t)x_j^i(t) \).

figure d

The second hypothesis wts_well_behaved states that the weights satisfy the three conditions discussed in Section 2.4. The assignment of weights depends on whether a node \(j \in \mathcal {J}_i \backslash \mathcal {R}_i(t)\) or not. Here, \(\mathcal {J}_i\) denotes the inclusive set of neighbors of the node i, and \(\mathcal {R}_i(t)\) denotes the set of nodes removed according to the W-MSR algorithm. We define \(\mathcal {R}_i(t)\) in Coq as follows:

figure e

Note that we use the filter function from the MathComp sequence library. This is crucial as it gives us lemmas that allow us to assert that any node in \(\mathcal {J}_i \setminus {\mathcal {R}_i(t)}\) satisfies the conditions of the filter. Additionally, the filter function requires that its first argument has a pred type, D \(\rightarrow \) bool in our case. Therefore, we need our inequality operations to be decidable. Hence, we use the decidable versions of the inequality operations, such as Rle_dec, provided by Coq’s reals library instead of its built-in \(\le \) operation. We then define the set \(\mathcal {J}_i \setminus \mathcal {R}_i(t)\) in Coq as

figure f

Since \(\mathcal {J}_i \backslash \mathcal {R}_i(t)\) is defined based on the value \(x_i(t)\) of node i, which in turn depends on A, mal, and init, the hypothesis wts_well_behaved also depends on A, mal, and init.

The trickiest parts of the proof of Lemma 1 rely on the fact that we desire \(\mathcal {J}_i \setminus {\mathcal {R}_i(t)}\) when treated as a list to be sorted. In order to fulfill this condition we use the formalization for sorting found in the MathComp library. To do this we first define a relation on D as:

figure g

This definition ensures that if \(x_i(t) < x_j(t)\), then i is ordered as less than j with respect to this relationship. In the case of nodes with equivalent values we use an arbitrary mechanism to break ties. Doing so ensures that this relation is total, and satisfies transitivity, anti-symmetry, and reflexivity. This relation lets us use the sorting lemmas in MathComp’s path library [13], and it ensures the weaker condition that we occasionally use in the proof:

figure h

The biggest difficulty with formalizing this proof arises when dealing with the case \(|R_i^{<}(t)| < F\), where \(R_i^{<}(t) := \{j \in \mathcal {J}_i : x_j(t) < x_i(t) \text { and } idx_{\mathcal {J}_i}(x_j(t)) < F \}\) and \(idx_{l}(x_k(t))\) denotes the index of the value \(x_k(t)\) in a given list l of values, or the size of l if \(x_k(t)\) is not present. In particular, the difficulty lies in showing that \(idx_{\mathcal {J}_i \setminus {\mathcal {R}_i(t)}}(j) = 0 \implies n_j(\mathcal {J}_i) = |R_i^{<}(t)|\). This requires proving an extra lemma on the \(\mathcal {J}_i\) list:

figure i

With this lemma, we can reason that the zero-th index of \(\mathcal {J}_i \setminus {\mathcal {R}_i(t)}\), is the \(|R_i^{<}(t)|\)-th index of \(\mathcal {J}_i\). Using this lemma, we can prove the existence of \(j_1\) in the proof of lem_1. Symmetrically, we can show the existence of \(j_2\) such that \(\forall k \in \mathcal {J}_i\setminus {\mathcal {R}_i(t)}\), \(x_{j_2}(t) \ge x_k(t)\). Tying it all together, we complete the proof of the lemma lem_1 in Coq.
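The index relationship discussed here can be checked on a small example. The Python sketch below is only an illustration with hypothetical names and values: it sorts the inclusive neighbor list by value, breaking ties by node identifier (an assumed tie-breaking rule, since the source only says the mechanism is arbitrary), removes the low and high extremes, and compares the position of the first retained node in the sorted list with the number of nodes removed from below.

```python
def sort_by_value(J, x):
    """Sort node ids by value, breaking ties by node id (assumed tie-break)."""
    return sorted(J, key=lambda j: (x[j], j))


def removed_low(J_sorted, x, i, F):
    """Nodes strictly below x[i] that sit among the first F positions."""
    return [j for idx, j in enumerate(J_sorted) if x[j] < x[i] and idx < F]


def removed_high(J_sorted, x, i, F):
    """Nodes strictly above x[i] that sit among the last F positions."""
    n = len(J_sorted)
    return [j for idx, j in enumerate(J_sorted) if x[j] > x[i] and idx >= n - F]


def kept_nodes(J_sorted, x, i, F):
    """Sorted inclusive neighbor list of i with both extremes removed."""
    removed = set(removed_low(J_sorted, x, i, F)) | set(removed_high(J_sorted, x, i, F))
    return [j for j in J_sorted if j not in removed]


# Hypothetical values; nodes 2 and 3 tie and are ordered by id.
x = {1: 0.0, 2: 1.0, 3: 1.0, 4: 5.0, 5: 9.0}
J_sorted = sort_by_value({1, 2, 3, 4, 5}, x)
low = removed_low(J_sorted, x, 3, 1)
kept = kept_nodes(J_sorted, x, 3, 1)
first_kept_index = J_sorted.index(kept[0])
```

In this example the first retained node sits at position `len(low)` of the sorted inclusive list, matching the relationship used to obtain \(j_1\).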

3.2 Proof of Sufficiency

Lemma 2

 [30] Consider a time-invariant network modeled by a digraph \(\mathcal {D} = (\mathcal {V}, \mathcal {E})\) where each normal node updates its value according to the W-MSR algorithm with parameter F. Under the F-total malicious model, if a network is \((F+1, F+1)\)-robust, resilient asymptotic consensus is achieved.

This is an important lemma because we would like to design a network such that the normal nodes reach asymptotic consensus in the presence of malicious nodes. Next we discuss an informal proof of Lemma 2, followed by its formalization in the Coq proof assistant.

Proof

The proof of Lemma 2 is done by contradiction. We start by assuming that the limits \(A_M\) and \(A_m\) of the functions M(t) and m(t), respectively, are different, i.e., \(A_M \ne A_m\). These limits exist because, by Lemma 1, M(t) is monotonically nonincreasing and m(t) is monotonically nondecreasing in t, and both are bounded. Therefore, we know that \(\forall ~t,~A_M \le M(t)~\wedge ~m(t) \le A_m\), as illustrated in Figure 3. We will show that by carefully constructing the sets \(S_1\) and \(S_2\) in the definition of (r, s)-robustness, and unrolling the definition of (r, s)-robustness at every time-step inductively, we eventually arrive at the desired contradiction: \(\exists ~t,~M(t) < A_M~\vee ~A_m < m(t)\). We discuss the details of the proof in Appendix B of the extended version [45].
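The existence of the two limits can be spelled out as a short derivation from Lemma 1. Taking the minimum and the maximum over all normal nodes in \(m(t) \le x_i(t+1) \le M(t)\) gives, for every t,

$$\begin{aligned} m(0) \le m(t) \le m(t+1) \le M(t+1) \le M(t) \le M(0). \end{aligned}$$

Hence M(t) is nonincreasing and bounded below by m(0), while m(t) is nondecreasing and bounded above by M(0). By the monotone convergence theorem both sequences converge, with

$$\begin{aligned} A_M = \inf _{t} M(t) \le M(t), \qquad A_m = \sup _{t} m(t) \ge m(t) \qquad \text {for all } t. \end{aligned}$$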

Fig. 3.
figure 3

Illustration of the tube of convergence bounded above by \(A_M + \epsilon \) and bounded below by \(A_m - \epsilon \). We observe the behavior of functions M(t) and m(t) inside this tube of convergence \(\forall t \ge t_\epsilon \). We prove that M(t) and m(t) are monotonic \(\forall t \ge t_\epsilon \), and they approach the limits \(A_M\) and \(A_m\), respectively. We start by assuming that \(A_M \ne A_m\), but later prove that \(A_M = A_m\) by contradiction, thereby proving asymptotic consensus.

Formalization in Coq: We introduce the following axiom in Coq to support reasoning by contradiction.

figure j

This axiom is a propositional completeness principle that allows us to reason classically and is consistent with the formalization of classical facts in Coq’s standard library. We need it because we prove the sufficiency condition by contradiction. We use classical reasoning because the original paper [30] does not provide a constructive proof: the reasoning used in the paper is classical. This requires us to state the following lemma in Coq:

figure k

The proof of P_not_not_P uses the axiom proposition_degeneracy.

We state the sufficiency condition (Lemma 2) for the network to achieve resilient asymptotic consensus as the following in Coq.

figure l

The sufficiency condition requires that the graph is non-trivial, i.e., there are at least two nodes in the graph, and the number of faulty nodes F in the graph is bounded by the total number of nodes D. We define r_s_robustness in Coq as

figure m

where Xi_S_r S1 r is the set of all nodes in the set S1 that have at least r neighboring nodes outside S1. In Coq, we define Xi_S_r as

figure n

We define Resilient_asymptotic_consensus in Coq as

figure o

Here, is_lim_seq is a predicate in Coquelicot that defines limits of sequences. Rbar is the extended set of reals, which includes \(+ \infty \) and \(- \infty \). To prove that the network achieves resilient asymptotic consensus under the \((F+1, F+1)\)-robustness condition, we need to prove the following two conditions in the definition of Resilient_asymptotic_consensus: \((i) ~\forall t, m(0) \le m(t) \wedge M(t) \le M (0)\), and \((ii)~\exists L, \forall i, i \in \mathcal {N} \rightarrow \lim \limits _{t \rightarrow \infty } x_i(t) = L\). We state the first condition as the lemma interval_bound in Coq. It is a consequence of Lemma 1: we proceed by induction on time t and then apply Lemma 1 to complete the proof.

We prove the second condition by contradiction in Coq. To start the proof by contradiction, we assume that the limits \(A_M\) and \(A_m\) of the maximum and minimum functions M(t) and m(t) are different. We then instantiate the sets \(S_1\) and \(S_2\) in the definition of (r, s)-robustness with \(\mathcal {X}_M(t_\epsilon , \epsilon _o )\) and \(\mathcal {X}_m(t_\epsilon ,\epsilon _o)\) respectively, where \(\mathcal {X}_M(t, \epsilon _l) = \{ i \in \mathcal {V}: x_i(t) > A_M - \epsilon _l \}\) and \(\mathcal {X}_m (t, \epsilon _l) = \{ i \in \mathcal {V}: x_i(t) < A_m +\epsilon _l \}\). In Coq, we define the set \(\mathcal {X}_M\) for any epsilon and t as follows

figure p

where Rlt_dec is the decidability lemma for the strict less-than relation on reals from Coq’s standard library.

We need to prove that the sets \(\mathcal {X}_M\) and \(\mathcal {X}_m\) are disjoint at all times until we reach a point where either \(\mathcal {X}_M\) or \(\mathcal {X}_m\) is empty. This requires us to prove the following lemma in Coq

figure q

Since \(\mathcal {X}_M(t_\epsilon +l, \epsilon _l)\) is the set of all nodes with values greater than \(A_M - \epsilon _l\) and \(\mathcal {X}_m(t_\epsilon +l, \epsilon _l)\) is the set of all nodes with values less than \(A_m+ \epsilon _l\), these two sets are disjoint if \(A_M - \epsilon _l > A_m + \epsilon _l\). For \(l=0\), we have defined \(\epsilon _o\) such that \(A_M - \epsilon _o > A_m + \epsilon _o\). To prove that \(A_M - \epsilon _l > A_m + \epsilon _l, \forall l, 0<l\), we need to show that \(A_M - \epsilon _l > A_M - \epsilon _o\) and \(A_m + \epsilon _o > A_m + \epsilon _l\). This in turn requires us to show that \(\epsilon _l < \epsilon _o, \forall l, 0 < l\). This holds since we defined \(\epsilon _l\) recursively as \(\epsilon _l := \alpha \epsilon _{l-1} + (1- \alpha )\epsilon \).

A crucial aspect of the sufficiency proof is proving that \((F+1, F+1)\)-robustness implies that there exists a node in the union of the sets \(\mathcal {X}_M \cap \mathcal {N}\) and \(\mathcal {X}_m \cap \mathcal {N}\) that has at least \(F+1\) neighbors outside its set. This was particularly challenging because in the original paper [30], the authors do not use all three conditions in the definition of \((F+1, F+1)\)-robustness to informally prove the implication. They use only the third condition \((F+1 \le |\mathcal {X}_{\mathcal {X}_M}^{F+1} | + |\mathcal {X}_{\mathcal {X}_m}^{F+1} |)\) to state the implication, leaving it to the reader to work out the first two conditions. For the implication to hold, each of the three conditions in the definition of \((F+1, F+1)\)-robustness must imply the existence of such a node, since the three conditions are connected by a disjunction. To prove the implication from the first two conditions, we first need to prove the existence of a normal node in the sets \(\mathcal {X}_M\) and \(\mathcal {X}_m\) for all \(l \le N\). This holds since the node i with value \(M(t_\epsilon +l)\) will always be above the threshold \(A_M - \epsilon _l\) because \(M(t) \ge A_M, \forall t\) due to the existence of the limit \(A_M\). Hence, \(0 < |\mathcal {X}_M(t_\epsilon + l, \epsilon _l)|, \forall l \le N\). Since the first condition of \((F+1, F+1)\)-robustness states that \(| \mathcal {X}_{\mathcal {X}_M (t_\epsilon +l, \epsilon _l)}^{F+1}| = | \mathcal {X}_M(t_\epsilon +l, \epsilon _l)|\), we obtain \(0 < | \mathcal {X}_{\mathcal {X}_M (t_\epsilon +l, \epsilon _l)}^{F+1}|\). Hence, by definition of \( \mathcal {X}_{\mathcal {X}_M (t_\epsilon +l, \epsilon _l)}^{F+1}\), there exists a normal node in the set \(\mathcal {X}_M(t_\epsilon +l, \epsilon _l)\) that has at least \(F+1\) neighbors outside \(\mathcal {X}_M(t_\epsilon +l, \epsilon _l)\). We prove this formally in Coq using the following lemma statement

figure r

By symmetry, we prove that \(0 < | \mathcal {X}_{\mathcal {X}_m (t_\epsilon +l, \epsilon _l)}^{F+1}|\). The other part that was not explicit in the proof in the original paper [30] was that the largest value that the node i uses at time step \(t_\epsilon + l \) is \(M(t_\epsilon + l)\), which is stated without proof. This was a challenge during our formalization. To formally prove this, we had to split the neighbor set of i into two parts depending on their position relative to i. It is easy to bound the values of the nodes positioned to the left of i by \(M(t_\epsilon + l)\), since the neighboring list is assumed to be sorted at the time of update and we have established this upper bound for any normal node in Lemma 1. Bounding the values of the nodes positioned to the right of the normal node i, however, was not trivial. We proved this using a case analysis on the cardinality of the set \(R_i^{>}(t)\). We formally prove this in Coq using the lemma x_right_ineq_1. We do not expand on this lemma here for brevity.

Another challenge during the formalization was using the bound \(A_M - \epsilon _l\) on the neighboring nodes of i in the update of the value of i at the next time step. We know that the neighbors outside the set \(\mathcal {J}_i(t_\epsilon +l) \backslash \mathcal {X}_M(t_\epsilon +l,\epsilon _l)\) have value at most \(A_M - \epsilon _l\). But to use these nodes in the update function, we need to show that these neighboring nodes are in the inclusive set of the normal node i minus the extremes, i.e., there exists a node in the intersection of the set \(\mathcal {J}_i(t_\epsilon +l)\) and the set s which contains nodes outside the set \(\mathcal {J}_i(t_\epsilon +l) \backslash \mathcal {X}_M(t_\epsilon +l,\epsilon _l)\). We prove the existence of such a node using the following lemma statement in Coq

figure s

We instantiate the set A with \(\mathcal {J}_i \backslash \mathcal {R}_i(t)\) and the set B with \(\mathcal {R}_i^{<}(t)\). We know that by definition of the W–MSR algorithm, \(|\mathcal {R}_i^{<}(t)| \le F\). To use the lemma exists_in_intersection, we first had to prove that \(s \subset (\mathcal {J}_i \backslash \mathcal {R}_i(t)) \cup \mathcal {R}_i^{<}(t)\). Applying the lemma exists_in_intersection then gives us a node k as a witness which lies in the intersection of the set s and \(\mathcal {J}_i \backslash \mathcal {R}_i(t)\). We use this node to apply the bound \(A_M - \epsilon _l\) in the proof of inequality 1 for \(l \le N\). All other nodes in the neighboring list of the normal node i minus extremes are shown to be bounded by M(t).
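The counting argument behind this step can be illustrated on finite sets. The sketch below is written in Python rather than Coq and is not the statement of exists_in_intersection. It assumes that the relevant hypotheses are \(s \subseteq A \cup B\), \(|B| \le F\) and \(|s| \ge F+1\); the function name and the example sets are hypothetical.

```python
def witness_in_intersection(s, a, b, f):
    """Return some k in s & a, assuming s <= a | b, |b| <= f and |s| >= f + 1.

    Illustrative only: the hypotheses are assumptions, not the Coq lemma.
    """
    if not s <= (a | b):
        raise ValueError("s must be contained in a | b")
    if len(b) > f or len(s) < f + 1:
        raise ValueError("need |b| <= f and |s| >= f + 1")
    # s has more than f elements, but b can account for at most f of them,
    # so s - b is non-empty. Every element of s - b lies in a because
    # s is contained in a | b.
    return min(s - b)


# Hypothetical example: A plays the role of J_i \ R_i(t), B of R_i^<(t).
A = {1, 2, 3}
B = {4, 5}
S = {3, 4, 5}
k = witness_in_intersection(S, A, B, f=2)
```

Here the returned node k belongs to both S and A, which is the role played by the witness in the proof above.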

To show that \(\exists t, M(t) < A_M \vee A_m < m(t)\) holds, we need to prove that for every \(l \le N\), the cardinality of the set \(\mathcal {X}_M\) decreases, or the cardinality of the set \(\mathcal {X}_m\) decreases, or both, under the \((F+1, F+1)\)-robustness condition. This requires proving the following lemma in Coq:

figure t

We instantiate s1 and s2 with \(\mathcal {X}_M(t_\epsilon +l, \epsilon _l)\) and \(\mathcal {X}_m(t_\epsilon +l, \epsilon _l)\), respectively. We use the lemma sj_ind_var to arrive at a contradiction and complete the proof of sufficiency.

3.3 Proof of necessity

Lemma 3

[30] Consider a time-invariant network modeled by a digraph \(\mathcal {D}= (\mathcal {V}, \mathcal {E})\) where each normal node updates its value according to the W–MSR algorithm with parameter F. Under the F-total malicious model, if resilient asymptotic consensus is achieved, then the network is \((F+1, F+1)\)-robust.

Necessity is a secondary, but still significant, lemma. It tells us that no condition weaker than \((F+1, F+1)\)-robustness guarantees that the normal nodes within the network reach asymptotic consensus. We now discuss an informal proof of Lemma 3. Note that the original paper [30] does not provide a complete proof of this lemma. For example, it sketches a proof of Lemma 3 by contraposition, but does not provide a concrete counterexample to discharge it. The paper proof in [30] also does not discuss the construction of the weights, or the proof that these weights are not well-behaved under non-\((r, s)\)-robustness. These issues were non-trivial and posed challenges in Coq, as explained in this section. We also highlight challenges in the construction of this counterexample and in the proof of necessity in Coq, including an issue of mutual recursion. Because we had to develop explicitly the details missing from the original paper proof, the proof in this paper is an original contribution.

Proof

We proceed by proving the contrapositive of necessity, that is: if the network is not \((F+1, F+1)\)-robust, then it does not achieve resilient asymptotic consensus. Assuming that the network is not \((F+1, F+1)\)-robust, we know that there are non-empty sets \(S_1, S_2 \subset \mathcal {V}\) such that \(S_1 \cap S_2 = \emptyset \), \(|\chi _{S_1}^{F+1}| \ne |S_1|\), \(|\chi _{S_2}^{F+1}| \ne |S_2|\), and \(|\chi _{S_1}^{F+1}| + |\chi _{S_2}^{F+1}|< F+1\). It follows that \(|\chi _{S_1}^{F+1}| <F+1\) and \(|\chi _{S_2}^{F+1}| < F+1\). Also recall that \(\chi _{S_1}^{F+1} \subseteq S_1\) and \(\chi _{S_2}^{F+1} \subseteq S_2\). One way of interpreting this condition is that the number of nodes within \(S_1\) and \(S_2\) that can receive a lot of information from outside their respective sets is less than \(F+1\) in total, and less than the number of nodes in each set, respectively. We seek to construct a set of adversaries, initial values, malicious functions, and weights such that resilient asymptotic consensus is not achieved. In particular, we seek to prove that there exist two normal nodes i and j such that \(\lim \limits _{t \rightarrow \infty } x_i(t) \ne \lim \limits _{t \rightarrow \infty } x_j(t)\). We discuss the details of the proof in Appendix D of the extended version [45].
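The negated robustness condition used in this proof can be checked by brute force on small graphs. The following Python sketch is illustrative and is not part of the Coq development. It takes \(\chi _{S}^{r}\) to be the set of nodes in S with at least r in-neighbors outside S, and searches for a pair \((S_1, S_2)\) that violates all three conditions at once. The function names and the two example graphs are hypothetical, and the search is exponential in the number of nodes.

```python
from itertools import combinations


def reachable_subset(in_neighbors, s, r):
    """chi_S^r: nodes of s with at least r in-neighbors outside s."""
    return {i for i in s if len(in_neighbors[i] - s) >= r}


def nonempty_subsets(nodes):
    """Yield every non-empty subset of nodes, smallest subsets first."""
    ordered = sorted(nodes)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield set(combo)


def robustness_counterexample(in_neighbors, r, s):
    """Return disjoint non-empty (S1, S2) failing all three conditions,
    or None if the graph is (r, s)-robust."""
    nodes = set(in_neighbors)
    for s1 in nonempty_subsets(nodes):
        for s2 in nonempty_subsets(nodes - s1):
            x1 = reachable_subset(in_neighbors, s1, r)
            x2 = reachable_subset(in_neighbors, s2, r)
            if (len(x1) != len(s1) and len(x2) != len(s2)
                    and len(x1) + len(x2) < s):
                return s1, s2
    return None


# Hypothetical examples: a complete graph and an undirected path on 4 nodes.
complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

F = 1
complete_witness = robustness_counterexample(complete4, F + 1, F + 1)
path_witness = robustness_counterexample(path4, F + 1, F + 1)
```

For F = 1, the complete graph admits no violating pair, whereas the path does. A pair of this kind is the starting point for constructing the adversaries, initial values, and weights in the counterexample.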

Formalization in Coq: We formalize Lemma 3 in Coq as:

figure u

Formalizing necessity_proof exposed some inconsistencies in the definitions in the original paper [30]. In particular, the paper defines the three conditions on weights, which we discussed in Section 2.4, only for normal nodes. During our formalization, we found this to be too restrictive: those conditions should hold for any node. The conditions must also apply to the weights of adversary nodes because, for a node \(i \in \mathcal {A}\) to be malicious as defined in the paper, there must exist a time t such that \(x_i(t+1) \ne \sum _{j \in \mathcal {J}_i \setminus {\mathcal {R}_i(t)}} w_{ij}(t)x_j^i(t)\). In other words, at some time the value emitted by the node must differ from the value it would emit if it were normal, but the sum is undefined if the weights of an adversary node are undefined. Therefore, we relax the restriction that the set of weights described in the paper exists only for normal nodes. Fortunately, this does not create a problem: adversary nodes can update their values according to any function they wish, so they do not have to use the described set of weights, or any weights at all, and their values remain unconstrained by this condition.

Another aspect that was not explicit in the original paper [30] was the placement of quantifiers. Formalizing the proof of necessity helped us identify the right placement of quantifiers and provide an accurate formal specification for the W–MSR algorithm. At the start of our formalization, it was not clear to us whether the paper meant:

figure v

or:

figure w

In the first formula, the quantified values A, mal, and init are not bound to the definition of resilient asymptotic consensus. Therefore, in the necessity proof, we cannot construct a counterexample by appropriate instantiation of A, mal, and init to discharge the proof by contradiction. In the second formula, the quantified values are bound to the definition of resilient asymptotic consensus, which allows us to construct the counterexample by propagating the negation through the quantified values. Essentially, the difference is between the formulae \((\forall X.~ P(X) \rightarrow Q(X))\) and \(((\forall X.~ P(X)) \rightarrow (\forall X.~ Q(X)))\), where X represents the tuple \((\texttt {A}, \texttt {mal}, \texttt {init})\). The first statement is stronger: it implies the second, but not conversely. Therefore, the former, stronger statement is not necessarily true in the necessity direction, while the latter, weaker statement is.
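The gap between the two quantifier placements can be seen on a tiny finite model. The Python sketch below is illustrative and is not the Coq specification. It uses a two-element domain as a stand-in for the tuples (A, mal, init), enumerates all predicates P and Q on that domain, and compares the two formulae; all names are hypothetical.

```python
from itertools import product

DOMAIN = (0, 1)  # stand-in for the tuples (A, mal, init)


def pointwise(p, q):
    """forall X. (P(X) -> Q(X))"""
    return all((not p[x]) or q[x] for x in DOMAIN)


def distributed(p, q):
    """(forall X. P(X)) -> (forall X. Q(X))"""
    return (not all(p[x] for x in DOMAIN)) or all(q[x] for x in DOMAIN)


# Every predicate on DOMAIN, encoded as a tuple of truth values.
predicates = list(product((False, True), repeat=len(DOMAIN)))

# The pointwise form implies the distributed form for every P and Q.
stronger_implies_weaker = all(
    distributed(p, q)
    for p in predicates
    for q in predicates
    if pointwise(p, q)
)

# The converse fails: these pairs satisfy the distributed form only.
separating = [
    (p, q)
    for p in predicates
    for q in predicates
    if distributed(p, q) and not pointwise(p, q)
]
```

A separating pair arises, for instance, when P holds at only one point of the domain and Q holds nowhere: the premise \(\forall X.~ P(X)\) is then false, so the distributed form holds vacuously, while the pointwise form fails at the point where P holds.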

Another difficulty we encountered was defining the weights in such a way that \(w_{ij}(t) = \frac{1}{|\mathcal {J}_i \setminus {\mathcal {R}_i(t)}|}\). This is a result of Coq's strict termination checking of recursive definitions. The issue arises because defining \(w_{ij}\) at time t requires knowing the value of \(x_i\) at time t, whereas \(x_i\), as we had defined it, takes the set of weights it uses as a parameter. Mathematically there is no issue, since \(x_i(t)\) relies only on the values of \(x_j(t-1)\) and \(w_{ij}(t-1)\). To solve this issue, we defined a function that returns a pair of functions \((x_i, w_{ij})\). To ensure that Coq could guess the parameter being recursed on, we also had to add another parameter \(\texttt {two\_t}\), initialized as \(2\cdot t\), and ensure that the pair \((x_i(t), w_{ij}(t))\) is returned when \(\texttt {two\_t} = 2\cdot t\), and \((x_i(t+1), w_{ij}(t))\) is returned when \(\texttt {two\_t} = (2\cdot t) + 1\).
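The interleaving of values and weights on a single counter can be sketched as follows. This is a Python illustration of the idea, not the Coq definition. It assumes that all nodes are normal, that each inclusive neighbor set contains the node itself, and that the W–MSR step removes up to F values strictly above and up to F values strictly below the node's own value. The helper names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def kept_neighbors(i, values, inclusive_neighbors, f):
    """J_i minus R_i(t): drop up to f larger and up to f smaller values."""
    own = values[i]
    above = sorted((j for j in inclusive_neighbors[i] if values[j] > own),
                   key=lambda j: values[j], reverse=True)
    below = sorted((j for j in inclusive_neighbors[i] if values[j] < own),
                   key=lambda j: values[j])
    removed = set(above[:f]) | set(below[:f])
    return [j for j in sorted(inclusive_neighbors[i]) if j not in removed]


def weights_for(values, inclusive_neighbors, f):
    """Uniform weights w_ij = 1 / |J_i minus R_i(t)| on the kept neighbors."""
    weights = {}
    for i in values:
        kept = kept_neighbors(i, values, inclusive_neighbors, f)
        weights[i] = {j: Fraction(1, len(kept)) for j in kept}
    return weights


def step(two_t, init, inclusive_neighbors, f):
    """Return (x, w), recursing only on the counter two_t.

    two_t = 2t     gives (x(t),     w(t))
    two_t = 2t + 1 gives (x(t + 1), w(t))
    """
    if two_t == 0:
        x0 = dict(init)
        return x0, weights_for(x0, inclusive_neighbors, f)
    x_prev, w_prev = step(two_t - 1, init, inclusive_neighbors, f)
    if two_t % 2 == 1:
        # Odd counter: advance the values using the current weights.
        x_next = {i: sum(w_prev[i][j] * x_prev[j] for j in w_prev[i])
                  for i in x_prev}
        return x_next, w_prev
    # Even counter: recompute the weights from the advanced values.
    return x_prev, weights_for(x_prev, inclusive_neighbors, f)


# Hypothetical example: complete graph on 4 nodes, F = 1.
F = 1
init = {0: 0, 1: 1, 2: 2, 3: 3}
inclusive = {i: {0, 1, 2, 3} for i in range(4)}
x1, w0 = step(1, init, inclusive, F)
```

With these inputs, one update maps the values 0, 1, 2, 3 to 1, 3/2, 3/2, 2. Each recursive call decreases the single argument two_t by one, which is the shape of recursion that the added parameter is meant to expose.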

3.4 Formal proof of the main theorem

We state the main theorem (Theorem 1) in Coq as:

figure x

We close the proof of F_total_consensus by splitting the theorem into sufficiency and necessity sub-proofs and applying the lemmas sufficiency_proof and necessity_proof. The only detail worth noting is that necessity_proof relies on the decidability of r_s_robustness, for which we need the axiom of the excluded middle.

4 Related Work

Recently there has been a growing interest in the formalization of distributed systems and control theory, using both automated and interactive verification approaches.

Some notable works in the area of automated verification use model checking, temporal logic, and reachability techniques. For instance, Cimatti et al. [11] used model checking techniques to formally verify the implementation of part of the safety logic for a railway interlocking system. Schrer et al. [43] extended the JavaPathFinder [24] model checker to support modeling of a real-time scheduler and a physical system that is defined by differential equations. They verify the safety and liveness properties of a control system, and also check for programming errors. Besides model checking, temporal-logic-based techniques have been applied to control synthesis [40], robust model predictive control [14], and automatic verification of sequential control systems [35]. Other approaches for verifying safety use reachability methods such as flow pipe approximations [10], zonotope approximation algorithms [2, 19, 28], and ellipsoidal calculus [4].

There has also been significant work on the formalization of control theory using interactive theorem provers [1, 38, 39]. In the area of stability analysis for control theory, Cyril Cohen and Damien Rouhling formalized LaSalle's principle in Coq [12]. Stability is important for the control of dynamical systems, since it guarantees that the trajectories of dynamical systems such as cars and airplanes are bounded. Chan et al. [7] formalized safety properties of cyber-physical systems, such as Lyapunov stability and exponential stability, in Coq. In [39], Damien Rouhling formalized the soundness of a control function [32] for an inverted pendulum. Some works have also emerged in the area of signal processing for controls. Gallois-Wang et al. [17] formalized some error analysis theorems about digital filters in Coq. Araiza-Illan et al. [3] formally verified high-level properties of control systems, such as stability, feedback gain, or robustness, using the Why3 tool [15]. Rashid et al. [38] formalized transform methods in HOL Light [22]. Transform methods are used in signal processing and controls to switch between the time and frequency domains for the design and analysis of control systems. A few works have emerged on the formalization of feedback control theory to guarantee the robustness of control systems. Jasim and Veres [26] formally proved one of the most fundamental and general results for nonlinear feedback systems, the Small-gain theorem (SGT), using Isabelle/HOL [37]. Hasan et al. [23] formalized the theoretical foundations of feedback control in HOL Light. Another notable work in the formalization of control systems is the formalization of safety properties of robot manipulators by Affeldt et al. [1].

Most of the above works deal with formalizing the theoretical foundations of control theory: stability analysis, transform methods, filtering algorithms for signal processing, and feedback control design. But, to our knowledge, none of these works tackles the problem of consensus in a formal setting. Given that consensus is a property of interest in distributed control applications, our work on the formalization of the W–MSR algorithm is a first step towards formally verified distributed control systems.

5 Conclusion

In this work, we formalize a consensus algorithm [30] for distributed controls in Coq. We formally prove the necessary and sufficient conditions for a set of normal nodes in the network to achieve asymptotic consensus in the presence of a fixed bound on the number of malicious nodes in the network. During the formalization, we discovered several places where the proof in the original paper is imprecise, especially in the lemma statements of sufficiency and necessity. In particular, the order of quantifiers on some variables was unclear, and we had to spend time clarifying it. We also prove a stronger version of the sufficiency condition than the original theorem requires. This is done to ensure that the conditions in both directions of the double implication hold. The definitions and lemmas we formalize in this paper can be used for verifying consensus for other threat models described in the original paper [30]. Overall, our work is the first of its kind to provide formal specifications of a consensus algorithm in distributed controls. The total length of the Coq proofs is about 11 thousand lines of code. The entire formalization took us six person-months.

A possible future direction of work is to verify the implementation of the algorithm. The proof of this algorithm in the original paper [30] and our formalization both assume that all computations are in the real field. However, an actual implementation would need to use finite-precision arithmetic. It would therefore be interesting to study the effect of finite precision on the robustness of this algorithm. It would also be interesting to formalize the algorithm for time-variant networks, in which the edge relation between the nodes can change with time. Possible use cases for such a network model are drone swarms for military and rescue operations, in which each drone in the network could be expected to dynamically change the flow of information from its neighbors.