Consensus Variants

  • Christian Cachin
  • Rachid Guerraoui
  • Luís Rodrigues

Abstract

This chapter describes variants of the consensus abstraction introduced in the previous chapter. These variants are motivated by applications of consensus to areas like fault-tolerant, replicated services, and distributed databases.

In the variants we consider here, just like in consensus, the processes need to make consistent decisions, such as agreeing on one common value. However, most of the abstractions extend or change the interface of consensus, in order to satisfy the specific coordination requirements of an application.

The abstractions we will study here include total-order broadcast, terminating reliable broadcast, fast consensus, (nonblocking) atomic commitment, group membership, and view synchrony. We will mainly focus on fail-stop algorithms for implementing these abstractions. But we also consider Byzantine total-order broadcast and Byzantine fast consensus and give implementations for them in the fail-arbitrary model. Some further variants of the total-order broadcast abstraction are discussed in the exercise section. But other variants of consensus represent unexplored territory, in the sense that determining adequate means to specify and implement these abstractions in other system models is an area of current research.

6.1 Total-Order Broadcast

6.1.1 Overview

Earlier in the book (in Sect. 3.9), we discussed FIFO-order and causal-order (reliable) broadcast abstractions and their implementation. FIFO-order broadcast requires that messages from the same process are delivered in the order that the sender has broadcast them. For messages from different senders, FIFO-order broadcast does not guarantee any particular order of delivery. Causal-order broadcast enforces a global ordering for all messages that causally depend on each other: such messages need to be delivered in the same order and this order must respect causality. But causal-order broadcast does not enforce any ordering among messages that are causally unrelated, or “concurrent” in this sense. In particular, if a process p broadcasts a message m1 and a process q concurrently broadcasts a message m2, then the messages might be delivered in different orders by the processes. For instance, p might deliver first m1 and then m2, whereas q might deliver first m2 and then m1.

A total-order (reliable) broadcast abstraction orders all messages, even those from different senders and those that are not causally related. More precisely, total-order broadcast is a reliable broadcast communication abstraction which ensures that all processes deliver the same messages in a common global order. Whereas reliable broadcast ensures that processes agree on the same set of messages they deliver, total-order broadcast ensures that they agree on the same sequence of messages; the set of delivered messages is now ordered.

The total-order broadcast abstraction is sometimes also called atomic broadcast because the message delivery occurs as if the broadcast were an indivisible “atomic” action: the message is delivered to all or to none of the processes and, if the message is delivered, every other message is ordered either before or after this message. This section considers total-order broadcast among crash-stop process abstractions. Total-order broadcast with Byzantine processes is the subject of Sect. 6.2.

Total-order broadcast is the key abstraction for maintaining consistency among multiple replicas that implement one logical service, whose behavior can be captured by a deterministic state machine. A state machine consists of variables representing its state together with commands that update these variables and may produce some output. Commands consist of deterministic programs, such that the outputs of the state machine are solely determined by the initial state and the sequence of commands previously executed. Most practical services can be modeled like this. Any service implemented by the state machine can be made fault-tolerant by replicating it on different processes. Total-order broadcast ensures that all replicas deliver the commands from different clients in the same order, and hence maintain the same state.

For instance, this paradigm can be applied to implement highly available shared objects of arbitrary types in a distributed system, that is, objects with much more powerful semantics than the read-write (register) objects studied earlier in the book (Chap. 4). According to the state-machine replication paradigm, each process hosts a replica of the object. A client broadcasts every method invocation on the object to all replicas using the total-order broadcast primitive. This will ensure that all replicas keep the same state and that all responses are equal. In short, the use of total-order broadcast ensures that the object is highly available, yet it appears as if it were a single logical entity accessed in a sequential and failure-free manner, which provides operations that act atomically on its state. We will return to this topic in the exercise section.
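
To make the idea concrete, the following Python sketch shows a replica driven by a hypothetical total-order broadcast object tob that exposes a broadcast method and an on_deliver callback registration; these names are illustrative placeholders, not part of Module 6.1. The only property the sketch relies on is that every replica's apply method sees the same sequence of commands.

```python
# Minimal sketch of state-machine replication over total-order broadcast.
# The tob object and its on_deliver registration are hypothetical placeholders.

class ReplicatedBankAccount:
    """A deterministic state machine: same command sequence => same state."""

    def __init__(self, tob):
        self.balance = 0                 # replicated state
        self.tob = tob
        tob.on_deliver(self.apply)       # invoked for every tob-delivered command

    def invoke(self, command):
        # Client-facing call: disseminate the command to all replicas.
        self.tob.broadcast(command)

    def apply(self, sender, command):
        # Executed at every replica in the same global order.
        op, amount = command
        if op == "deposit":
            self.balance += amount
        elif op == "withdraw" and self.balance >= amount:
            self.balance -= amount
        # Same deterministic code + same command order => identical replica states.
```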

6.1.2 Specifications

Many specifications of the total-order broadcast abstraction can be considered. We focus here on two variants that both extend a corresponding reliable broadcast abstraction. The first is a regular variant that ensures total ordering only among the correct processes. The second is a uniform variant that ensures total ordering with respect to all processes, including the faulty processes as well.

The specification of a regular total-order broadcast abstraction is depicted in Module 6.1. The interface is the same as in the (regular) reliable broadcast abstraction (Module 3.2 from Sect. 3.3), and also its first four properties (TOB1–TOB4) are the same as before (properties RB1–RB4). The only difference consists in the added total order property.

The second specification defines uniform total-order broadcast and is depicted in Module 6.2. The interface is the same as in the uniform reliable broadcast abstraction (Module 3.3); its first four properties (UTOB1–UTOB4) map directly to those of uniform reliable broadcast (URB1–URB4) in Sect. 3.4, and the specification extends them with the uniform total order property.

Other combinations of the total order or uniform total order properties with reliable and uniform reliable broadcast properties are possible and lead to slightly different specifications. For conciseness, we omit describing all the corresponding modules and refer to an exercise (at the end of the chapter) for a logged total-order broadcast abstraction.

It is important to note that the total order property is orthogonal to the FIFO-order and causal-order properties discussed in Sect. 3.9. It is possible that a total-order broadcast abstraction does not respect causal order. On the other hand, as we pointed out, FIFO-order broadcast and causal-order broadcast abstractions do not enforce total order: different processes may deliver some messages in different orders.

6.1.3 Fail-Silent Algorithm: Consensus-Based Total-Order Broadcast

In the following, we give a total-order broadcast algorithm that implements the abstraction of Module 6.1 and is called “Consensus-Based Total Order” because it relies on consensus. The pseudo code is shown in Algorithm 6.1. It uses a reliable broadcast abstraction and multiple instances of a (regular) consensus abstraction as underlying building blocks.

The intuition behind Algorithm 6.1 is the following. Messages are first disseminated using a reliable broadcast instance with identifier rb. Recall that reliable broadcast imposes no particular order on delivering the messages, so every process simply stores the delivered messages in a set of unordered messages. At any point in time, it may be that no two processes have the same set of unordered messages. The processes then use the consensus abstraction to decide on one set, order the messages in this set, and finally deliver them.

More precisely, the algorithm implementing a total-order broadcast instance tob works in consecutive rounds. As long as new messages are broadcast, the processes keep on moving sequentially from one round to the next. There is one consensus instance for every round, such that the instance of round r has identifier c.r, for r = 1, 2, … . The processes use the consensus instance of round r to decide on a set of messages to assign to that round number. Every process then tob-delivers all messages in the decided set according to some deterministic order, which is the same at every process. This will ensure the total order property.

The r-th consensus instance, invoked in round r, decides on the messages to deliver in round r. Suppose that every correct process has tob-delivered the same messages up to round r − 1. The messages of round r are delivered according to a deterministic and locally computable order, agreed upon by all processes in advance, such as the lexicographic order on the binary representation of the message content or an order based on low-level message identifiers. Once the processes have decided on a set of messages for round r, they simply apply a deterministic function sort( ⋅ ) to the decided set of messages; the function returns an ordered list of messages, and the processes deliver the messages in the given order. Hence, the algorithm ensures that there is one global sequence of messages that are tob-delivered by the correct processes.

In each instance of consensus, every process starts with a (possibly different) set of messages to be ordered. Each process simply proposes the set of messages it has already rb-delivered (from the reliable broadcast primitive) and not yet tob-delivered (according to the total-order semantics). The properties of consensus ensure that all processes decide the same set of messages for the instance. In addition, Algorithm 6.1 uses a flag wait to ensure that a new round is not started before the previous round has terminated.
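
The round structure just described can be summarized by the following Python sketch. The objects rb and new_consensus (one fresh consensus instance per round) and the blocking propose call are hypothetical placeholders standing in for the event-based interaction of Algorithm 6.1.

```python
# Sketch of the round loop of the "Consensus-Based Total Order" algorithm.
# rb, new_consensus, and deliver_callback are hypothetical placeholders.

class TotalOrderBroadcast:
    def __init__(self, rb, new_consensus, deliver_callback):
        self.unordered = set()      # rb-delivered, not yet tob-delivered
        self.delivered = set()      # already tob-delivered messages
        self.round = 1
        self.wait = False
        self.rb = rb
        self.new_consensus = new_consensus
        self.deliver = deliver_callback

    def broadcast(self, m):
        self.rb.broadcast(m)        # disseminate first, order later

    def on_rb_deliver(self, m):
        if m not in self.delivered:
            self.unordered.add(m)
        self.run_rounds()

    def run_rounds(self):
        while self.unordered and not self.wait:
            self.wait = True
            c = self.new_consensus(self.round)               # instance of round r
            decided = c.propose(frozenset(self.unordered))   # blocks until decided
            for m in sorted(decided):                        # deterministic sort(.)
                if m not in self.delivered:
                    self.deliver(m)                          # tob-deliver in order
                    self.delivered.add(m)
            self.unordered -= set(decided)
            self.round += 1
            self.wait = False
```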

An execution of the algorithm is illustrated in Fig. 6.1. The figure is unfolded into two parallel flows: that of the reliable broadcasts, used to disseminate the messages, and that of the consensus instances, used to order the messages. Messages received from the reliable broadcast module are proposed to the next instance of consensus. For instance, process s proposes message m2 to the first instance of consensus. As the first instance of consensus decides message m1, process s resubmits m2 (along with m3, which it has received in the meantime) to the second instance of consensus.
Fig. 6.1

Sample execution of consensus-based total-order broadcast

Correctness.

The no creation property follows from the no creation property of the reliable broadcast abstraction and from the validity property of the consensus abstraction. The no duplication property follows from the no duplication property of the reliable broadcast abstraction and from the integrity property of the consensus abstraction, combined with the check that no message contained in the variable delivered is added to the set unordered.

Consider the agreement property. Assume that some correct process p tob-delivers some message m. According to the algorithm, p must have decided a set of messages that contains m in some round. Every correct process eventually decides the same set of messages in that round because of the algorithm and the termination property of consensus. Hence, every correct process eventually tob-delivers m.

Consider the validity property of total-order broadcast, and let p be some correct process that tob-broadcasts a message m. Assume by contradiction that p never tob-delivers m. This means that m is never included in a set of decided messages at any correct process. Due to the validity property of reliable broadcast, every correct process eventually rb-delivers m. Therefore, there is some round in which every correct process proposes a set of unordered messages to consensus that contains m. The validity property of consensus now ensures that p decides a batch of messages that includes m and tob-delivers m in that round.

Consider now the total order property. Let p and q be any two correct processes that tob-deliver some message m2. Assume that p tob-delivers some distinct message m1 before m2. If p tob-delivers m1 and m2 in the same round, then due to the agreement property of consensus, q must have decided the same set of messages in that round. Thus, q also tob-delivers m1 before m2, as we assume that the messages decided in one round are tob-delivered in the same order by every process, determined in a fixed way from the set of decided messages. Assume now that m1 is contained in the set of messages decided by p in an earlier round than m2. Because of the agreement property of consensus, q must have decided the same set of messages in that earlier round, which contains m1. Given that processes proceed sequentially from one round to the other, q must also have tob-delivered m1 before m2.

Performance.

To tob-deliver a message when no failures occur, and by merging a fail-stop reliable broadcast algorithm with a fail-stop consensus algorithm as presented in previous chapters, three communication steps and O(N) messages are required.

Variant.

By replacing the regular consensus abstraction with a uniform one, Algorithm 6.1 implements a uniform total-order broadcast abstraction.

6.2 Byzantine Total-Order Broadcast

6.2.1 Overview

Total-order broadcast is an important primitive also in the fail-arbitrary model. Intuitively, it gives the same guarantees as total-order broadcast in a system with crash-stop processes, namely that every correct process delivers the same sequence of messages over time.

Recall the Byzantine broadcast primitives from Chap. 3, in which every instance only delivered one message. Because total-order broadcast concerns multiple messages, its specification does not directly extend these basic one-message primitives, but uses the same overall approach as the total-order broadcast abstraction with crash-stop processes. In particular, every process may repeatedly broadcast a message and every process may deliver many messages.

For implementing total-order broadcast in the fail-arbitrary model, however, one cannot simply take the algorithm from the fail-silent model in the previous section and replace the underlying consensus primitive with Byzantine consensus. We present an algorithm with the same underlying structure, but suitably extended for the fail-arbitrary model. But first we introduce the details of the specification.

6.2.2 Specification

A Byzantine total-order broadcast abstraction lets every process repeatedly broadcast messages by triggering a request event ⟨ ​Broadcast ∣ m​ ⟩. An indication event ⟨ ​Deliver ∣ p, m​ ⟩ delivers a message m with sender p to a process. For an instance btob of Byzantine total-order broadcast, we also say a process btob-broadcasts a message and btob-delivers a message.

The sender identification in the output holds only when process p is correct, because a Byzantine process may behave arbitrarily. The abstraction ensures the same integrity property as the Byzantine broadcast primitives of Chap. 3, in the sense that every message delivered with sender p was actually broadcast by p, if p is correct, and could not have been forged by Byzantine processes. The other properties of Byzantine total-order broadcast are either exactly the same as those of total-order broadcast among crash-stop processes (validity, agreement, and total order) or correspond directly to the previous abstraction (no duplication). The specification is given in Module 6.3.

6.2.3 Fail-Noisy-Arbitrary Algorithm: Rotating Sender Byzantine Broadcast

Implementations of Byzantine broadcast abstractions are more complex than their counterparts with crash-stop processes because there are no useful failure-detector abstractions in the fail-arbitrary model. But an algorithm may rely on the eventual leader detector primitive (Module 2.9) that is usually accessed through an underlying consensus abstraction.

Here we introduce Algorithm 6.2, called “Rotating Sender Byzantine Broadcast,” which relies on a Byzantine consensus primitive, similar to the “Consensus-Based Total-Order Broadcast” algorithm from the previous section. As before, the processes proceed in rounds and invoke one instance of Byzantine consensus in every round. Furthermore, every process disseminates the btob-broadcast messages using a low-level communication primitive.

However, the processes cannot simply reliably broadcast the btob-broadcast messages and propose a set of btob-undelivered messages received from reliable broadcast to consensus. The reason lies in the more demanding validity properties of Byzantine consensus compared to consensus with processes that may only crash. In the (strong) Byzantine consensus primitive, a “useful” decision value that carries some undelivered messages results only if all correct processes propose exactly the same input. But without any further structure on the message dissemination pattern, the correct processes may never propose the same value, especially because further btob-broadcast messages may arrive continuously. Weak Byzantine consensus offers no feasible alternative either, because it ensures validity only if all processes are correct. In the presence of a single Byzantine process, it may never output any useful decision.

The solution realized by Algorithm 6.2 circumvents this problem by imposing more structure on what may be proposed in a consensus instance. The algorithm proceeds in global rounds; in round r, a process proposes only a single message for consensus, namely one that was btob-broadcast by the designated sender for the round, determined from the round number by the function leader( ⋅ ) introduced before.

More precisely, a process first relays every btob-broadcast message to all others in a Data message over an authenticated link primitive al. Every Data message contains a sequence number, assigned by the sender, to provide FIFO order among its Data messages. Every correct process, therefore, al-delivers the same ordered sequence of Data messages from every correct sender. The receiver stores the undelivered btob-broadcast messages from every sender in a queue according to the order of their arrival.

Concurrently, every process runs through rounds of Byzantine consensus instances. In every round, it proposes for consensus the first message in the queue corresponding to the designated sender s of the round. When a process finds no message in the queue of process s, it proposes the symbol □ for consensus. When consensus decides on a message, the process btob-delivers it; if consensus decides □ then no message is btob-delivered in this round. (Recall that Byzantine consensus may decide □ unless all correct processes have proposed the same value.)

The variable unordered is an array of lists, one for every process in the system. It is initialized to an array of empty lists, denoted by ([])N. Lists can be manipulated with the operations introduced before, in Chap. 3: an element x can be appended to a list L by the function append(L, x), and an element x can be removed from L by the function remove(L, x). We also use a new function head(L) here, which returns the first element in L.
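
The following Python sketch captures the per-sender queues and the round structure described above. The authenticated-link delivery callback, the Byzantine consensus factory new_bconsensus, and the function leader are hypothetical placeholders; only the data-structure manipulation mirrors the text.

```python
# Structural sketch of "Rotating Sender Byzantine Broadcast".
# new_bconsensus, leader, and deliver_callback are hypothetical placeholders.

BOX = "[]"  # stands for the special "no message" value (the box symbol)

class RotatingSenderBTOB:
    def __init__(self, processes, new_bconsensus, leader, deliver_callback):
        self.unordered = {p: [] for p in processes}  # one FIFO list per sender
        self.delivered = set()
        self.round = 1
        self.new_bconsensus = new_bconsensus
        self.leader = leader
        self.deliver = deliver_callback

    def on_al_deliver(self, sender, m):
        # Data messages arrive in FIFO order from every correct sender.
        self.unordered[sender].append(m)

    def run_round(self):
        s = self.leader(self.round)              # designated sender of this round
        queue = self.unordered[s]
        proposal = queue[0] if queue else BOX    # head(unordered[s]) or "no message"
        bc = self.new_bconsensus(self.round)
        decided = bc.propose(proposal)           # blocks until Byzantine consensus decides
        if decided != BOX and (s, decided) not in self.delivered:
            self.delivered.add((s, decided))
            self.deliver(s, decided)             # btob-deliver with sender s
        if decided in queue:
            queue.remove(decided)
        self.round += 1
```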

Correctness.

Note that the algorithm preserves FIFO order by design and btob-delivers messages from the same sender in the order they were btob-broadcast. Intuitively, the algorithm maintains N ordered message queues, one for every process, and propagates the messages from every sender through the corresponding queue. These queues are synchronized at all correct processes. Every round of consensus may cut off the first message in one of these queues and deliver it, or may decide not to deliver a message.

It may happen that the consensus instance of a round decides a message sent by sender s, but some process p does not find the message in the queue unordered[s]. It is safe for the process to deliver the message nevertheless, because the queue must be empty; this follows because a correct process s sends its messages in the same order to all processes. Hence, whenever any correct process enters a new round, the queue unordered[s] either contains a unique message at its head or is empty.

The validity property now follows easily because a btob-broadcast message m from a correct sender is eventually contained in the sender’s queue at every correct process. Eventually, there is a round corresponding to the sender of m, in which every correct process proposes m for consensus. According to the strong validity property of Byzantine consensus, message m is btob-delivered at the end of this round.

The no duplication property is evident from the algorithm and the checks involving the delivered variable. The agreement and total order properties follow from the round structure and from the termination and agreement properties of the underlying Byzantine consensus primitive.

Finally, the integrity property holds because a correct process, in round r, only proposes for consensus a message received over an authenticated link from the sender s = leader(r) or the value □. According to the strong validity property of Byzantine consensus, the decided value must have been proposed by a correct process (unless it is □); therefore, the algorithm may only btob-deliver a message from sender s in round r.

Performance.

The algorithm adds one communication step with O(N) messages for every broadcast message to the cost of the underlying Byzantine consensus instances. Every delivered message requires at least one instance of Byzantine consensus, which is the most expensive part of the algorithm.

The algorithm is conceptually simple, but suffers from the problem that it may not be efficient. In particular, depending on the network scheduling, it may invoke an arbitrary number of Byzantine consensus instances until it delivers a particular message m from a correct sender in the totally ordered sequence. This happens when the algorithm btob-delivers messages from Byzantine senders before m or when the consensus primitive decides □.

We discuss a more efficient algorithm, which never wastes a Byzantine consensus instance without atomically delivering some message, in the exercises (at the end of the chapter).

Variant.

Algorithm 6.2, our Byzantine total-order broadcast algorithm, uses Byzantine consensus in a modular way. If we broke up the modular structure and integrated the rounds of Algorithm 6.2 with the round-based and leader-based approach of the “Byzantine Leader-Driven Consensus” algorithm for Byzantine consensus, we would save several steps and obtain a much more efficient algorithm. Even more savings would be possible by integrating the resulting algorithm with the implementation of the Byzantine eventual leader-detection abstraction, which is used underneath the “Byzantine Leader-Driven Consensus” algorithm.

6.3 Terminating Reliable Broadcast

6.3.1 Overview

The goal of the reliable broadcast abstraction introduced earlier in the book (Sect. 3.3) is to ensure that if a message is delivered to a process then it is delivered to all correct processes (in the uniform variant).

As its name indicates, terminating reliable broadcast (TRB) is a form of reliable broadcast with a specific termination property. It is used in situations where a given process s is known to have the obligation of broadcasting some message to all processes in the system. In other words, s is an expected source of information in the system and all processes must perform some specific processing according to some message m to be delivered from the source s. All the remaining processes are thus waiting for a message from s. If s broadcasts m with best-effort guarantees and does not crash, then its message will indeed be delivered by all correct processes.

Consider now the case where process s crashes and some other process p detects that s has crashed without having seen m. Does this mean that m was not broadcast? Not really. It is possible that s crashed while broadcasting m. In fact, some processes might have delivered m whereas others might never do so. This can be problematic for an application. In our example, process p might need to know whether it should keep on waiting for m, or if it can know at some point that m will never be delivered by any process. The same issue may arise when the processes are waiting for a set of messages broadcast by multiple senders, of which some are known to broadcast a message but others might never broadcast a message.

At this point, one may think that the problem of the faulty sender could have been avoided if s had used a uniform reliable broadcast primitive to broadcast m. Unfortunately, this is not the case. Consider process p in the example just given. The use of a uniform reliable broadcast primitive would ensure that, if some other process q delivered m, then p would eventually also deliver m. However, p cannot decide if it should wait for m or not. Process p has no means to distinguish the case where some process q has delivered m (and p can indeed wait for m) from the case where no process will ever deliver m (and p should definitely not keep waiting for m).

The TRB abstraction adds precisely this missing piece of information to reliable broadcast. TRB ensures that every process p either delivers the message m from the sender or some failure indication △, denoting that m will never be delivered (by any process). This indication is given in the form of a specific message △ delivered to the processes, such that △ is a symbol that does not belong to the set of possible messages that processes broadcast. TRB is a variant of consensus because all processes deliver the same message, i.e., either the message m from the sender or the indication △.

TRB is similar to the Byzantine consistent broadcast and Byzantine reliable broadcast abstractions (of Sects. 3.10 and 3.11) in two ways: first, there is only one process s that sends a message and this process is known, and second, the broadcast abstraction delivers at most one message. With respect to termination, TRB differs from the Byzantine broadcast abstractions, however: When the sender s is faulty, a Byzantine broadcast abstraction may not deliver a message, but TRB always delivers an output, regardless of whether process s is correct.

6.3.2 Specification

The properties of TRB are depicted in Module 6.4. Note that the abstraction is defined for a specific sender process s, which is known to all processes in advance. Only the sender process broadcasts a message; all other processes invoke the algorithm and participate in the TRB upon initialization of the instance. According to Module 6.4, the processes may not only deliver a message m but also “deliver” the special symbol  △ , which indicates that the sender has crashed.

We consider here the uniform variant of the problem, where agreement is uniformly required among any pair of processes, be they correct or not.

6.3.3 Fail-Stop Algorithm: Consensus-Based Uniform Terminating Reliable Broadcast

Algorithm 6.3, called “Consensus-Based Uniform Terminating Reliable Broadcast,” implements uniform TRB using three underlying abstractions: a perfect failure detector instance \(\mathcal{P}\), a uniform consensus instance uc, and a best-effort broadcast instance beb.

Algorithm 6.3 works by having the sender process s disseminate a message m to all processes using best-effort broadcast. Every process waits until it either receives the message broadcast by the sender process or detects the crash of the sender. The properties of a perfect failure detector and the validity property of the broadcast ensure that no process waits forever. If the sender crashes, some processes may beb-deliver m and others may not beb-deliver any message.

Then all processes invoke the uniform consensus abstraction to agree on whether to deliver m or the failure notification △. Every process proposes either m or △ in the consensus instance, depending on whether the process has delivered m (from the best-effort broadcast primitive) or has detected the crash of the sender (in the failure detector). The decision of the consensus abstraction is then delivered by the algorithm. Note that, if a process has not beb-delivered any message from s, then it learns m from the output of the consensus primitive.
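
A compact Python sketch of this flow, for one TRB instance with sender s, is shown below. The objects P (perfect failure detector), beb, and uc, as well as the CRASH placeholder for the symbol △, are hypothetical stand-ins for the modules used by Algorithm 6.3.

```python
# Sketch of one instance of consensus-based uniform TRB with sender s.
# P, beb, and uc are hypothetical placeholders for the underlying modules.

CRASH = "FAILURE"   # stands for the failure indication (the triangle symbol)

def utrb(self_id, s, m, P, beb, uc):
    if self_id == s:
        beb.broadcast(m)                 # the sender disseminates its message

    # Wait until we beb-deliver the sender's message or detect its crash.
    proposal = None
    while proposal is None:
        event = beb.try_deliver()        # returns (sender, msg) or None
        if event is not None and event[0] == s:
            proposal = event[1]
        elif P.detected(s):              # perfect failure-detector output
            proposal = CRASH

    # Agree on either m or the failure indication, then utrb-deliver it.
    return uc.propose(proposal)          # blocks until uniform consensus decides
```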

An execution of the algorithm is illustrated in Fig. 6.2. The sender process s crashes while broadcasting m with the best-effort broadcast primitive. Therefore, processes p and q receive m, but process r does not; instead, r detects the crash of s. All remaining processes use the consensus primitive to decide on the value to be delivered. In the example of the figure, the processes decide to deliver m, but it would also be possible that they decide to deliver △ (since s has crashed).
Fig. 6.2

Sample execution of consensus-based uniform terminating reliable broadcast

Correctness.

Consider first the validity property of uniform TRB. Assume that s does not crash and utrb-broadcasts a message m. Due to the strong accuracy property of the perfect failure detector, no process detects a crash of s. Due to the validity property of best-effort broadcast, every correct process beb-delivers m and proposes m for uniform consensus. By the termination and validity properties of uniform consensus, all correct processes, including s, eventually decide m. Thus, process s eventually utrb-delivers m.

To see the termination property, observe how the no duplication property of best-effort broadcast and the integrity property of consensus ensure that no process uc-decides more than once. Therefore, every process also utrb-delivers at most one message. The strong completeness property of the failure detector, the validity property of best-effort broadcast, and the termination property of consensus ensure furthermore that every correct process eventually utrb-delivers a message.

The integrity property of uniform TRB follows directly from the no creation property of best-effort broadcast and from the validity property of consensus: if a process utrb-delivers a message m, then either m = △ or m was utrb-broadcast by process s.

Finally, the uniform agreement property of uniform consensus also implies the uniform agreement property of TRB.

Performance.

The algorithm requires the execution of one underlying uniform consensus instance, invokes a best-effort broadcast primitive to broadcast one message, and accesses a perfect failure detector. The algorithm does not add anything to the cost of these primitives. If no process fails and ignoring the messages sent by the failure detector, the algorithm exchanges O(N) messages and requires one additional communication step for the initial best-effort broadcast, on top of the uniform consensus primitive.

Variant.

Our TRB specification has a uniform agreement property. As for reliable broadcast, we could specify a regular variant of TRB with a regular agreement property that refers only to messages delivered by correct processes. In that case, Algorithm 6.3 can still be used to implement regular TRB when the underlying uniform consensus abstraction is replaced by a regular one.

6.4 Fast Consensus

6.4.1 Overview

The consensus primitive plays a central role in distributed programming, as illustrated by the many variants and extensions of consensus presented in this chapter. Therefore, a consensus algorithm with good performance directly accelerates many implementations of other tasks as well. Many consensus algorithms invoke multiple communication steps with rounds of message exchanges among all processes. But some of these communication steps may appear redundant, especially for situations in which all processes start with the same proposal value. If the processes had a simple way to detect that their proposals are the same, consensus could be reached faster.

This section introduces a variation of the consensus primitive with a requirement to terminate particularly fast under favorable circumstances. A fast consensus abstraction is a specialization of the consensus abstraction from Chap. 5 that must terminate in one round when all processes propose the same value. In other words, the abstraction imposes a performance condition on consensus algorithms for the case of equal input values and requires that every process decides after one communication step. This improvement is not for free, and comes at the price of lower resilience.

With the fast consensus primitive, we introduce a performance criterion into a module specification for the first time in this book. This is a common practice for more elaborate abstractions. To judge whether an implementation satisfies such a property, one has to look inside the algorithm, in contrast to the usual safety properties, whose satisfaction can be verified from the algorithm's behavior at the module interface alone.

6.4.2 Specification

We consider the fast consensus primitive in its uniform variant. The specification of a uniform fast consensus primitive is shown in Module 6.5. Compared to the uniform consensus abstraction (specified in Module 5.2), the interface and three of the four properties (validity, integrity, and uniform agreement) remain the same; only the termination condition changes. The strengthened fast termination property requires that every correct process decides after one communication step in all those executions where the proposal values of all processes are the same.

The number of communication steps used by a primitive is directly determined by the algorithm that implements it. Recall that a communication step of a process occurs when a process sends a message to another process and the latter receives this message. Basic communication steps are typically encapsulated by some underlying modules, such as perfect point-to-point links and best-effort broadcast. Therefore, one also has to consider the implementations of the underlying modules to determine the performance of an algorithm.

6.4.3 Fail-Silent Algorithm: From Uniform Consensus to Uniform Fast Consensus

One can add fast termination to any consensus implementation in a modular way. We present such a transformation, “From Uniform Consensus to Uniform Fast Consensus,” in Algorithm 6.4. It is a fail-silent algorithm and comes at the cost of reduced resilience. Specifically, implementing fast consensus requires that N > 3f instead of only N > 2f. The algorithm additionally uses a uniform reliable broadcast primitive.

The transformation first performs one round of all-to-all message exchanges, in which every process broadcasts its proposal value with best-effort guarantees. When a process receives only messages with the same proposal value v in this round, from N − f distinct processes, it decides v. This step ensures the fast termination property; such a process decides fast. Otherwise, if the messages received in the first round contain multiple distinct values, but still at least N − 2f messages contain the same proposal value w, the process adopts w as its own proposal value. Unless the process has already decided, it then invokes an underlying uniform consensus primitive with its proposal and lets it agree on a decision.

In order to ensure that the algorithm terminates under all circumstances, even if some process, say p, has decided fast and does not invoke the underlying consensus module, the algorithm additionally asks p to reliably broadcast its decision value with a uniform agreement guarantee. Hence, if p is correct, every correct process may eventually decide after receiving the decision value from the fast-deciding process.

The heart of the transformation is the condition under which a process adopts another proposal value after the first round. Namely, it may occur that some processes decide fast, say, some value v, but others resort to the uniform consensus primitive. In this case, the algorithm must ensure that v remains the only possible decision value. The condition achieves that because a process that decides fast has received v from N − f processes. As at most f of these processes may fail and because N > 3f, every other correct process still receives v from at least N − 2f processes. Hence, all processes propose v in the underlying consensus primitive.
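
The first round of the transformation can be sketched as follows in Python; beb, urb, and uc are hypothetical placeholders for the underlying modules, and the handler that decides upon urb-delivering a Decided message from a fast-deciding process is omitted.

```python
# Sketch of the first round of the fast-consensus transformation (N > 3f).
# beb, urb, and uc are hypothetical placeholders for the underlying modules.

def fast_propose(v, N, f, beb, urb, uc):
    beb.broadcast(("Proposal", v))
    proposals = beb.collect(count=N - f)         # values from N-f distinct senders

    if all(x == proposals[0] for x in proposals):
        decision = proposals[0]                  # decide fast: one step
        urb.broadcast(("Decided", decision))     # let slower processes catch up
        return decision

    # Adopt a value seen at least N-2f times, if any (unique because N > 3f),
    # then fall back to the underlying uniform consensus instance.
    for candidate in set(proposals):
        if proposals.count(candidate) >= N - 2 * f:
            v = candidate
            break
    return uc.propose(v)
```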

Correctness.

For the fast termination property, observe first that if all processes propose the same value v, then every process may indeed decide after one communication step, that is, after receiving N − f Proposal messages containing v. Otherwise, the algorithm terminates under the combination of the N > 3f condition with the assumptions made about the underlying consensus implementation, because every correct process either decides fast and urb-broadcasts its decision or invokes the uniform consensus instance uc. If no correct process reliably broadcasts a decision, then all of them invoke uniform consensus and its termination property ensures termination.

The validity property is straightforward to verify from the algorithm. The role of the variable decision ensures that no process decides twice and establishes the integrity property.

Given this discussion, we now consider the uniform agreement property. There are three cases to consider: First, suppose two processes decide some value v after the first round. As each of them has beb-delivered N − f Proposal messages containing v, but there are a total of N processes only and N > 3f, a message from some sender must have been beb-delivered by both processes. Hence, they decide the same value.

For the second case, assume no process urb-broadcasts a Decided message. Then every process ufc-decides after uc-deciding, and agreement follows from the agreement property of uniform consensus.

In the third case, some process has decided fast and received N − f messages with the same proposal v in the first round. Therefore, every other process that receives N − f messages in the first round finds at least N − 2f among them containing v. Hence, every process may only uc-propose v in the underlying uniform consensus primitive. According to its validity property, it uc-decides v.

Performance.

If the initial proposal values are not the same for all processes, then the transformation adds at most one communication step to the underlying uniform consensus primitive.

6.5 Fast Byzantine Consensus

6.5.1 Overview

One can also require that a Byzantine consensus primitive decides fast and takes only one round in executions where all proposed values are the same. This section introduces a fast Byzantine consensus abstraction with this feature.

Compared to fast consensus with crash-stop process abstractions, however, one cannot require that an algorithm always decides in one round whenever all correct processes propose the same value. The reason is that Byzantine processes might propose arbitrary values, and a correct process cannot distinguish such a value from a value proposed by a correct process. As illustrated by the different validity properties for Byzantine consensus introduced in the previous chapter, ensuring a particular consensus decision value in the fail-arbitrary model may be problematic. Our fast Byzantine consensus abstraction, therefore, adopts the approach that was already taken for weak Byzantine consensus; it requires a fast decision only in executions where all processes are correct.

Compared to the previous abstractions of Byzantine consensus, deciding fast in executions with unanimous proposals requires lowering the resilience. Specifically, the algorithm presented here assumes that N > 5f; one can show that this is optimal.

6.5.2 Specification

Our notion of fast Byzantine consensus is specified by Module 6.6. It has the same request and indication events as all consensus abstractions. The primitive corresponds to a (strong) Byzantine consensus primitive with the strengthened fast termination property (that is, properties FBC2–FBC4 are the same as properties BC2–BC4 of Module 5.11).

The fast termination property requires that any fast Byzantine consensus algorithm terminates after one communication step if all correct processes propose the same value, but only in failure-free executions, that is, in executions without Byzantine processes.

A variant of fast Byzantine consensus with a stronger fast termination property is explored in an exercise (at the end of the chapter). It does not restrict fast decisions to executions with only correct processes.

6.5.3 Fail-Arbitrary Algorithm: From Byzantine Consensus to Fast Byzantine Consensus

As was the case for fast consensus in the fail-stop model, fast Byzantine consensus can be realized in a modular way from a Byzantine consensus abstraction. We describe a transformation, “From Byzantine Consensus to Fast Byzantine Consensus,” in Algorithm 6.5. The transformation is similar to Algorithm 6.4, but requires some changes. These concern the lowered resilience of the algorithm (it requires N > 5f) and the larger numbers of equal values that are needed to decide fast. One can show that this condition is necessary.

As a minor difference to the previous algorithm, every process always invokes the underlying (strong) Byzantine consensus primitive here, even after deciding fast. Recall that in Algorithm 6.4, only processes that did not decide fast proposed a value to the uniform consensus primitive. This change simplifies the algorithm and avoids complications arising when broadcasting a decision value with Byzantine processes.
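
A sketch of the resulting structure is shown below in Python, assuming N > 5f; beb and bc are hypothetical placeholders for the underlying modules. The adoption threshold of at least N − 3f equal values is the one that the correctness argument below relies on and should be read as one plausible choice, not as the exact rule of Algorithm 6.5.

```python
# Sketch of the fast Byzantine consensus transformation (N > 5f).
# beb and bc are hypothetical placeholders for the underlying modules.

def fast_byzantine_propose(v, N, f, beb, bc):
    beb.broadcast(("Proposal", v))
    proposals = beb.collect(count=N - f)         # values from N-f distinct senders

    decision = None
    if all(x == proposals[0] for x in proposals):
        decision = proposals[0]                  # decide fast after one step

    # Adopt a value seen at least N-3f times, if any (unique because N > 5f).
    for candidate in set(proposals):
        if proposals.count(candidate) >= N - 3 * f:
            v = candidate
            break

    # Unlike the fail-silent transformation, every process still runs the
    # underlying Byzantine consensus instance, even after deciding fast.
    bc_decision = bc.propose(v)
    return decision if decision is not None else bc_decision
```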

Correctness.

We argue why Algorithm 6.5 provides fast Byzantine consensus if N > 5f. The fast termination property follows directly from the algorithm because of the assumption that all processes must be correct and propose the same value in order to decide in one round. Hence, no Byzantine process could interfere by sending a different value. Furthermore, a process may either decide fast, i.e., after receiving only equal Proposal messages, or after deciding in the underlying Byzantine consensus instance bc. Because every correct process bc-proposes a value, it also bc-decides a value by the properties of Byzantine consensus.

The strong validity property holds because the underlying (strong) Byzantine consensus primitive satisfies the same strong validity property, and the algorithm directly maps proposal values to the primitive and decision values from the primitive to fast consensus.

The variable decision protects a process from deciding more than once, and this establishes the integrity property.

For the agreement property, consider first the case where two correct processes p and q decide fast. Note that among the Proposal messages received by a correct process, at least N − 2f were sent by correct processes. Because N > 5f, the two sets of N − 2f Proposal messages from correct processes collected by p and q overlap (actually, in more than f messages). Therefore, the same value v is contained in the sets of p and q, and both processes decide v fast.

Next, suppose that some correct process p decides a value v fast and another correct process q fbc-decides after bc-deciding. The fact that p decided fast implies that it received at least N − f Proposal messages containing v. As there are only N processes in the system overall, at most f further correct processes may have proposed a value different from v. Hence, every other correct process receives v in at least N − 3f Proposal messages, accounting for the potentially different proposals from the up to f further correct processes and for the Proposal messages from the f Byzantine processes. As
$$N - 3f > \frac{N - f} {2}$$
under the assumption N > 5f made for the algorithm (the inequality is equivalent to 2N − 6f > N − f, i.e., N > 5f), it follows that every correct process bc-proposes v in the underlying consensus primitive. Hence, every correct process bc-decides v, and therefore process q also fbc-decides v.

Finally, if no process decides fast, then the agreement property directly follows from the underlying Byzantine consensus primitive.

Performance.

The transformation adds one round of message exchanges among all processes and one communication step to the underlying Byzantine consensus primitive.

6.6 Nonblocking Atomic Commit

6.6.1 Overview

The unit of data processing in a distributed information system is a transaction. Among other applications, transactions are a central concept for the design of database management systems. A transaction corresponds to a portion of a program that is delimited by a begin statement and an end statement. A transaction typically satisfies atomic semantics in two senses:

  1. Concurrency atomicity:

    All transactions appear to execute one after the other, i.e., they are serializable; serializability is usually guaranteed through some distributed locking scheme or with some optimistic concurrency control mechanism.

     
  2. Failure atomicity:

    Every transaction appears to execute either completely, and thereby commits, or not at all, in which case it is said to abort.

     

Ensuring these two forms of atomicity in a distributed environment is not trivial because the transaction might be accessing information on different processes, called data managers, which maintain the relevant data items. The data managers may have different local state and different opinions on whether the transaction should commit or not. For example, some data managers might observe conflicting concurrent data accesses, whereas others might not. Similarly, some data managers might detect logical or physical problems that prevent a transaction from committing. For instance, there may not be enough money to make a bank transfer, there may be concurrency-control problems, such as the risk of violating serializability in a database system, or there could be a storage issue, such as when the disk is full and a data manager has no way to guarantee the durability of the transaction’s updates.

Despite differences between their opinions, all data managers need to make sure that they either all discard the new updates, in case the transaction aborts, or make them visible, in case the transaction commits. In other words, all data managers need to agree on the same outcome for the transaction.

The nonblocking atomic commit (NBAC) abstraction is used precisely to solve this problem in a reliable way. The processes, each representing a data manager, agree on the outcome of a transaction, which is either to commit or to abort the transaction. Every process initially proposes a value for this decision, which is either a Commit value or an Abort value, depending on its local state and opinion about the transaction.

By proposing Commit for a transaction, a process expresses that it is willing and able to commit the transaction. Typically, a process witnesses the absence of any problem during the execution of the transaction. Furthermore, the process promises to make the update of the transaction permanent. This, in particular, means that the process has stored the temporary update of the transaction in stable storage: should it crash and recover, it can install a consistent state including all updates of the committed transaction.

By proposing Abort, a data manager process vetoes the commitment of the transaction and states that it cannot commit the transaction. This may occur for many reasons, as we pointed out earlier.

6.6.2 Specification

The nonblocking atomic commit abstraction is defined by ⟨ Propose ∣ v ⟩ and ⟨ Decide ∣ v ⟩ events, which are similar to those in the interface of the consensus abstraction, but require that v is either Commit or Abort. The abstraction satisfies the properties listed in Module 6.7. At first glance, the problem looks like binary consensus: the processes propose one of two values and need to decide on a common final value. There is, however, a fundamental difference: in consensus, any proposed value can be decided. In the atomic commit abstraction, a value of Commit cannot be decided if any of the processes has proposed Abort (this would mean that some data managers can indeed commit the transaction and ensure its durability whereas others cannot). When a process expresses its veto to a transaction by proposing Abort, the NBAC abstraction must honor this. As another difference to consensus, nonblocking atomic commit may decide Abort also if some process crashes, even though all processes have proposed Commit.

6.6.3 Fail-Stop Algorithm: Consensus-Based Nonblocking Atomic Commit

Algorithm 6.6 implements nonblocking atomic commit using three underlying abstractions: a perfect failure detector \(\mathcal{P}\), a uniform consensus instance uc, and a best-effort broadcast abstraction beb. In order to distinguish the value proposed to the NBAC abstraction from the value proposed to the underlying consensus abstraction, we call the first a vote and the second a proposal.

The algorithm works as follows. Every process p broadcasts its initial vote (Abort or Commit) to all other processes using best-effort broadcast. Then it waits to hear something from every process q in the system: either to beb-deliver the vote of q or to detect the crash of q. If p detects the crash of any process or receives a vote Abort from any process, then p directly (without waiting for more messages) invokes the consensus abstraction with Abort as its proposal. If p receives the vote Commit from all processes, then p invokes consensus with Commit as its proposal. Once the consensus abstraction uc-decides, every process nbac-decides according to the outcome of consensus.
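
The following Python sketch follows this description; P, beb, and uc are hypothetical placeholders for the perfect failure detector, best-effort broadcast, and uniform consensus modules.

```python
# Sketch of consensus-based nonblocking atomic commit.
# P, beb, and uc are hypothetical placeholders for the underlying modules.

COMMIT, ABORT = "Commit", "Abort"

def nbac_propose(vote, processes, P, beb, uc):
    beb.broadcast(("Vote", vote))
    heard_from = set()
    proposal = None

    while proposal is None:
        event = beb.try_deliver()            # returns (sender, ("Vote", v)) or None
        if event is not None:
            sender, (_, v) = event
            if v == ABORT:
                proposal = ABORT             # any Abort vote forces Abort
                continue
            heard_from.add(sender)
        if any(P.detected(q) for q in processes):
            proposal = ABORT                 # a detected crash also forces Abort
        elif heard_from == set(processes):
            proposal = COMMIT                # everyone voted Commit

    return uc.propose(proposal)              # nbac-decide the consensus outcome
```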

Correctness.

The termination property of nonblocking atomic commit follows from the validity property of best-effort broadcast, from the termination property of consensus, and from the strong completeness property of the perfect failure detector. The uniform agreement property of NBAC directly follows from that of the uniform consensus abstraction. Furthermore, the integrity property of NBAC holds because the no duplication property of best-effort broadcast and the integrity property of uniform consensus ensure that no process nbac-decides twice.

Consider now the two validity properties of NBAC. The commit-validity property requires that Commit is nbac-decided only if all processes nbac-propose Commit. Assume by contradiction that some process p nbac-proposes Abort, whereas some process q nbac-decides Commit. According to the algorithm, for q to nbac-decide Commit, it must also have uc-decided Commit in the consensus primitive. Because of the validity property of consensus, some process r must have proposed Commit to the consensus abstraction. Given the validity property of the best-effort broadcast primitive, one can distinguish two cases: either process p (which votes Abort) crashes and process r does not beb-deliver the vote from p, or r beb-delivers the vote Abort from p. In both cases, according to the algorithm, process r proposes Abort to uniform consensus: a contradiction.

Consider now the abort-validity property of NBAC. It requires that Abort is nbac-decided only if some process nbac-proposes Abort or some process crashes. Assume by contradiction that all processes nbac-propose a vote of Commit and no process crashes, whereas some process p nbac-decides Abort. For p to nbac-decide Abort, due to the validity property of uniform consensus, some process q must uc-propose Abort. According to the algorithm and the strong accuracy property of the failure detector, though, q only uc-proposes Abort if some process nbac-proposes Abort or \(\mathcal{P}\) detects a process crash: a contradiction.

Performance.

The algorithm requires one execution of the consensus abstraction. In addition to the cost of consensus and the messages communicated by the perfect failure detector, the algorithm exchanges O(N²) messages and requires one communication step for the initial best-effort broadcast.

Variant.

One could define a nonuniform (regular) variant of nonblocking atomic commit by requiring only agreement (for any two correct processes) and not uniform agreement (for any two processes). However, this abstraction would not be useful in a practical setting to coordinate the termination of a transaction in a distributed database system. Indeed, the very fact that some process has decided to commit a transaction might trigger an external action: say, the process has delivered some cash through a bank machine. Even if that process has crashed, its decision is important, and other processes should reach the same outcome.

6.7 Group Membership

6.7.1 Overview

Some of our algorithms from the previous sections were required to make decisions based on information about which processes were operational, crashed, or otherwise faulty. At any point during the computation, every process maintains information about some other processes in the system, whether they are up and running, whether one specific process can be a trusted leader, and so on. In the algorithms we considered, this information is provided by a failure detector module available at each process. According to the properties of a failure detector, this information reflects the actual status of failures in the system more or less accurately. In any case, the outputs of the failure detector modules at different processes are not always the same. In particular, different processes may get notifications about process failures in different orders and, in this way, obtain a different perspective of the system’s evolution. If there was a way to provide better coordinated failure notifications, faster and simpler algorithms might become possible.

A group membership (GM) abstraction provides consistent and accurate information about which processes have crashed and which processes are correct. The output of a group membership primitive is better coordinated and at a higher abstraction level than the outputs of failure detectors and leader election modules.

In a second role, a membership abstraction enables dynamic changes in the group of processes that constitute the system. Throughout this book, we have assumed a static set Π of N processes in our system model. No new process could join the set of processes, be included in the system, and participate in the computation. Likewise, a process in Π could not voluntarily leave the system or become excluded by an administrator. And after a process had crashed, our algorithms would eventually ignore it, but the process was still part of the system.

A group membership primitive also coordinates such join and leave operations and provides a dynamic set of processes in the system. As with failure notifications, it is desirable that group-membership information is provided to the processes in a consistent way.

To simplify the presentation of the group membership concept, however, we will only describe how it implements its first role above, i.e., to give consistent information about process crashes in a system with otherwise static membership. We assume that the initial membership of the group is the complete set of processes, and subsequent membership changes are solely caused by crashes. We do not consider explicit join and leave operations, although they are important for practical systems. Group membership introduces a new notion that defines the current set of active processes in the system, which is also the basis for modeling join and leave operations. One can easily extend our basic abstraction with such operations. Reference pointers to the literature and to practical systems that support such operations are given in the notes at the end of this chapter.

6.7.2 Specification

A group is the set of processes that participate in the computation. At any point in time, the current membership of the group is called the group view, or simply the view. More precisely, a view V = (id, M) is a tuple that contains a unique numeric view identifier id and a set M of view member processes. Over time, the system may evolve through multiple views. The initial group view is the entire system, denoted by V0 = (0, Π), which contains view identifier 0 and includes the complete set of processes Π in the system. The group membership abstraction provides information about a new view V through an indication event ⟨ ​View ∣ V ​ ⟩. When a process outputs a view V, it is said to install the new view V, after going through a view change. Group membership offers no request events to the layer above.

We consider a monotone group membership abstraction, where all correct processes install multiple new views in a sequence with monotonically increasing view identifiers. Furthermore, when two different processes install a view with the same identifier, the view memberships must be the same. Compared to the outputs of our failure-detector abstractions, views therefore offer much better coordination among the processes.

The group membership abstraction is characterized by the properties listed in Module 6.8. Its uniform agreement and monotonicity properties require that every process installs a sequence of views with increasing identifiers and shrinking membership, as mentioned earlier. The completeness and accuracy properties are similar to those of the perfect failure detector abstraction and dictate the conditions under which a process can be excluded from a group.

6.7.3 Fail-Stop Algorithm: Consensus-Based Group Membership

Algorithm 6.7, which is called “Consensus-Based Group Membership,” implements the group membership abstraction assuming a uniform consensus abstraction and a perfect failure-detector abstraction. At initialization, each process installs a view including all the processes in the system. From that point on, the algorithm remains idle until some process detects that another process has crashed. As different processes may detect crashes in different orders, a new view cannot be output immediately after detecting a failure and in a unilateral way; the processes first need to coordinate about the composition of the new view. The algorithm executes an instance of uniform consensus to decide which processes are to be included in the next view. A process invokes consensus only after it has detected that at least one member of the current view has crashed. The wait flag is used to prevent a process from triggering a new consensus instance before the previous consensus instance has terminated. When the consensus decides, a new view is delivered. In order to preserve the uniform agreement property, a process p may sometimes install a view containing a process that p already knows to have crashed, because the perfect failure-detector module at p has already detected that process. In this case, after installing that view, p will initiate a new consensus instance to trigger the installation of another view that excludes the crashed process.
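
As an illustration (and not as a reproduction of Algorithm 6.7 itself), the following minimal Python-style sketch captures this logic under stated assumptions: the uc_factory handle, which yields one uniform consensus instance per view change, and the crash notifications of the perfect failure detector are hypothetical placeholder interfaces.

```python
# Hypothetical sketch of consensus-based group membership.
# Assumptions: uc_factory(view_id) returns a uniform consensus instance with a
# propose(value, on_decide) method, and the perfect failure detector calls
# on_crash(p) once for every crashed process p.

class GroupMembership:
    def __init__(self, processes, uc_factory, trigger_view):
        self.view_id = 0
        self.members = set(processes)      # membership of the current view
        self.correct = set(processes)      # processes not yet detected by P
        self.wait = False                  # a consensus instance is in progress
        self.uc_factory = uc_factory
        self.trigger_view = trigger_view   # upper-layer callback for < View | V >
        trigger_view(self.view_id, frozenset(self.members))  # install initial view

    def on_crash(self, p):                 # < Crash | p > from the failure detector
        self.correct.discard(p)
        self._maybe_propose()

    def _maybe_propose(self):
        # Start a view change only if no instance runs and a view member crashed.
        if not self.wait and not self.members <= self.correct:
            self.wait = True
            uc = self.uc_factory(self.view_id + 1)
            uc.propose(frozenset(self.members & self.correct), self.on_decide)

    def on_decide(self, new_members):      # decision of the consensus instance
        self.view_id += 1
        self.members = set(new_members)
        self.wait = False
        self.trigger_view(self.view_id, frozenset(new_members))
        self._maybe_propose()              # another member may already be suspected
```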

An execution of the “Consensus-Based Group Membership” algorithm is illustrated in Fig. 6.3. In the execution with four processes p, q, r, and s, the first two processes, p and q, crash initially. Process s subsequently detects the crash of p and initiates a consensus instance to define a new view without p. Process r then detects the crash of q and proposes a different view to consensus. The first consensus instance decides the proposal from s, and, as a result, process p is excluded from the view with identifier 1. As process r has already detected the crash of q, it triggers another consensus instance to exclude q. Eventually, process s also detects the crash of q and also participates in the second consensus instance to install the view with identifier 2. This view includes only the correct processes.
Fig. 6.3

Sample execution of consensus-based group membership

As a possible optimization of the algorithm, consider the moment when a process has decided on a set M′ in an instance of uniform consensus and is about to start a new view (id, M′). Then the process might also set correct to correct ∩ M′, in order to accelerate the detection of crashed processes.

Algorithm 6.7 actually provides stronger guarantees than required by the group membership abstraction. In particular, it satisfies a linear group membership property in the sense that every process installs the same sequence of views and never skips a view in the sequence. In contrast, the monotonicity property would actually allow that a process installs a view with a much higher identifier than its current view, or that all processes skip a view with a particular identifier. Furthermore, the algorithm also ensures a strict form of the monotonicity property, in that the membership of a view installed after the current one is always strictly smaller than the membership of the current view. Monotonicity alone would allow a subsequent view to have the same membership. There are practical group communication systems that exploit this flexibility of our group membership abstraction.

Correctness.

The monotonicity property follows directly from the algorithm, because a process only initiates the formation of a new view when its set correct becomes properly contained in the current view membership. The uniform agreement property follows directly from the uniform agreement property of the underlying consensus abstraction.

The completeness property follows from the strong completeness property of the perfect failure-detector abstraction, which says that if a process p crashes then eventually every correct process detects the crash. According to the algorithm, there will then be a consensus instance in which every proposal value no longer includes p. By the validity property of consensus, this means that the correct processes eventually install a view that does not include p.

The accuracy property of the group membership algorithm follows analogously from the use of the perfect failure detector \(\mathcal{P}\). If some process p proposed a set of processes to consensus that did not include a process q, then q must have been detected to have crashed by process p. In this case, the strong accuracy property of \(\mathcal{P}\) implies that q has indeed crashed.

Performance.

The algorithm requires at most one consensus execution for each process that crashes.

Variant.

We focus here only on the uniform variant of the group membership abstraction: a regular group membership abstraction is specified by replacing the uniform agreement property with a regular agreement property. An algorithm implementing a regular group membership abstraction might use regular consensus instead of uniform consensus.

6.8 View-Synchronous Communication

6.8.1 Overview

The view-synchronous communication abstraction is also called view-synchronous broadcast and integrates two abstractions introduced earlier: reliable broadcast and group membership. In the following, we discuss a subtle issue that arises when these two primitives are combined. This difficulty motivates the introduction of view-synchronous communication as a new first-class primitive.

Consider the following scenario of a group of processes exchanging messages, where one of them, say, process q, crashes. Assume that this failure is detected and that the membership abstraction installs a new view V = (id, M) at the processes such that q ∉ M. Suppose that after V has been installed, some process p delivers a message m that was originally broadcast by q. Note that such a scenario is possible, as nothing in the specification of reliable broadcast prevents a message that was broadcast by a process that has failed from being delivered later. In fact, in order to ensure the agreement property of reliable broadcast, messages originally broadcast by q are typically relayed by other processes, especially for the case where q has failed. However, it feels strange and counterintuitive for the application programmer to handle a message from a process q in view V, from which q has been expelled. It would, thus, be desirable for p to simply discard m. Unfortunately, it may also happen that some other process r has already delivered m before installing view V. So, in this scenario, the communication primitive is faced with two conflicting goals: to ensure the reliability of the broadcast, which means that m must be delivered by p, but, at the same time, to guarantee the consistency of the view information, which means that m cannot be delivered in the new view and p must discard it.

The solution to this dilemma, which is offered by view-synchronous communication, integrates the installation of views with the delivery of messages and orders every new view with respect to the message flow. If a message m is delivered by a (correct) process before it installs a view V then m should be delivered by all processes that install V, before they install the view. This abstraction is also called view-synchronous broadcast because it gives the illusion that failures are synchronized and appear to happen atomically with respect to the delivered messages.

6.8.2 Specification

View-synchronous communication extends both the reliable broadcast abstraction and the group membership abstraction: as a consequence, its interface contains the events of both primitives. Specifically, the interface of a view-synchronous communication primitive contains a request event ⟨ Broadcast ∣ m ⟩ to broadcast a message m, an indication event ⟨ Deliver ∣ p,m ⟩ that outputs a message m from sender p (from Module 3.2), and another indication event ⟨ View ∣ V  ⟩, which installs view V (from Module 6.8). The abstraction adds two more events used for synchronization with the communication layer above; these are introduced after the formal definition.

Module 6.9 states the properties of view-synchronous communication; it defines the view synchrony concept as the combination of group membership, in the uniform variant considered in the previous section, with the regular variant of reliable broadcast. Other combinations are possible, in particular combinations of regular and uniform variants and, optionally, adding the properties of FIFO delivery order and causal delivery order for reliable broadcast. Given such a wide choice, there are many different possible flavors of view-synchronous communication.

In Module 6.10, we introduce a uniform view-synchronous communication abstraction, obtained by combining the group membership abstraction with the uniform reliable broadcast abstraction.

The new element in the specification of view-synchronous communication, which integrates the way that messages should be delivered with respect to view changes, lies in the view inclusion property that first appears in Module 6.9. We say that a process delivers or broadcasts a message m in a view V if the process delivers or broadcasts m, respectively, after installing view V and before installing any subsequent view. The view inclusion property requires that every message be delivered only in the view in which it was broadcast. This solves the problem mentioned before, as the condition implies that messages coming from processes that have already been excluded from a view can no longer be delivered.

In order to make the view inclusion property feasible, the interface and properties of view-synchronous communication contain an additional feature. As messages must be delivered in the same view in which they are broadcast, the view change poses a problem: if new messages are continuously broadcast then the installation of a new view may be postponed indefinitely. In other words, it is not possible to implement the view-synchronous communication abstraction without any control on the broadcast pattern. Therefore, the interface of this abstraction includes two specific events that handle the interaction between the view-synchronous communication primitive and the layer above (i.e., the application layer). They provide flow control through a ⟨ Block ⟩ indication event and a ⟨ BlockOk ⟩ request event. By triggering the ⟨ Block ⟩ event, the view-synchronous communication layer requests that the higher layer stop broadcasting messages in the current view. When the higher-level module agrees to that, it acknowledges the block request with the ⟨ BlockOk ⟩ event.

We assume that the layer above is well behaved and that whenever it is asked to stop broadcasting messages (through a request to block), it indeed does not trigger any further broadcasts after acknowledging the request to block. It may again broadcast new messages after the next view is installed. On the other hand, we require from the view-synchronous communication abstraction that it only requests the higher layer to block if a view change is imminent, i.e., only if a process that is a member of the current view has failed and a new view must be installed. (We do not explicitly state these properties in Modules 6.9 and 6.10 as we consider them to be of a different nature than the view inclusion property.)
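
As an illustration, the following small sketch shows one way a well-behaved application layer might honor this handshake; the vs handle and its broadcast and block_ok methods are assumed names chosen for the example, not an interface defined in this book.

```python
# Minimal sketch of the < Block > / < BlockOk > handshake, assuming a
# hypothetical view-synchronous handle `vs` with broadcast() and block_ok().

class Application:
    def __init__(self, vs):
        self.vs = vs
        self.blocked = False

    def broadcast(self, m):
        if not self.blocked:      # never broadcast after acknowledging a block
            self.vs.broadcast(m)

    def on_block(self):           # < Block > indication from the layer below
        self.blocked = True
        self.vs.block_ok()        # acknowledge with < BlockOk >

    def on_view(self, view):      # installing the next view lifts the block
        self.blocked = False
```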

6.8.3 Fail-Stop Algorithm: TRB-Based View-Synchronous Communication

Algorithm 6.8–6.9, called “TRB-Based View-Synchronous Communication,” implements the view-synchronous communication abstraction according to Module 6.9. The key element of the algorithm is a collective flush procedure, executed by the processes after they receive a view change from the underlying group membership primitive and before they install this new view at the view-synchronous communication level. During this step, every process uses an instance of the uniform TRB primitive to rebroadcast all messages that it has view-synchronously delivered in the current view.

The algorithm for an instance vs of view-synchronous communication works as follows. During its normal operation within a view V = (vid, M), a process simply adds vid to every message that it receives for vs-broadcast and broadcasts it in a Data message using an underlying best-effort broadcast primitive beb. When a process beb-delivers a Data message with a view identifier that matches vid, the identifier of the current view, it immediately vs-delivers the message contained inside. Every process also maintains a set inview, with all sender/message pairs for the messages that it vs-delivered during the normal operation of the current view.

The collective flush procedure is initiated when the group membership primitive installs a new view. Each process first requests from its caller that it stop vs-broadcasting messages in the current view. The higher layer agrees to this with a ⟨ BlockOk ⟩ event at each process. When the view-synchronous communication algorithm receives this event, it stops vs-delivering new messages and discards any Data message that still arrives via the underlying best-effort broadcast primitive. The process then proceeds to resend all messages that it vs-delivered in the old view using a TRB primitive.

Every process initializes one instance of uniform TRB for each process in M, the membership of the old view, and rebroadcasts its set inview with the TRB instance for which it is the sender. Eventually, when the TRB instances have delivered such sets (or the failure indication) from all processes in M, each process computes the union of these sets. The result is a global set of messages that have been vs-delivered by those processes in view V that have not crashed so far. Each process then vs-delivers any message contained in this set that it has not yet vs-delivered, before it installs the new view.

Note that discarding Data messages from the old view during the flush procedure and later causes no problem, because if a message is vs-delivered by any correct process in the old view then it will be rebroadcast through the TRB abstraction and vs-delivered as well.

Whenever a process vs-delivers a message during the algorithm, it verifies that the message has not been vs-delivered before. The algorithm maintains a variable delivered for this purpose that stores all vs-delivered messages so far (in the current view and in earlier views).

The new views output by the underlying group membership abstraction are appended to a queue pendingviews. This ensures that they do not get lost and are still processed in the same sequence as installed by group membership. When a process has started to handle messages in a new view, it keeps checking this queue; as soon as it finds another view at the head of the queue, it invokes the flush procedure and this leads to the next view change. Variable pendingviews is a list of views; recall that our operations on lists are append(L, x) to append an element x to a list L, remove(L, x) to remove x from L, and head(L), which returns the first element of L.
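
The flush step can be sketched roughly as follows; this is only an illustration under stated assumptions, with trb_factory, vs_deliver, and install standing in as hypothetical interfaces for the uniform TRB instances, the delivery indication, and the view installation.

```python
# Illustrative sketch of the flush step of the TRB-based algorithm.
# Assumption: trb_factory(vid, sender, on_deliver) returns a uniform TRB
# instance that calls on_deliver(sender, payload) exactly once, where payload
# is the rebroadcast set of (source, message) pairs, or None for the failure
# indication (△).

def flush(self_id, old_view, inview, delivered, trb_factory, vs_deliver, install):
    vid, members = old_view
    collected = {}                                  # sender -> rebroadcast set or None

    def on_trb_deliver(sender, payload):
        collected[sender] = payload
        if len(collected) == len(members):          # all TRB instances have terminated
            sets = [s for s in collected.values() if s is not None]
            seen = set().union(*sets)               # union of all rebroadcast sets
            for (src, m) in seen:                   # deliver missing old-view messages
                if m not in delivered:
                    delivered.add(m)
                    vs_deliver(src, m)
            install()                               # only now install the new view

    for p in members:                               # one TRB instance per old-view member
        trb = trb_factory(vid, sender=p, on_deliver=on_trb_deliver)
        if p == self_id:
            trb.broadcast(frozenset(inview))        # rebroadcast own vs-delivered set
```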

An example execution of the algorithm is shown in Fig. 6.4. Process p vs-broadcasts two messages m1 and m2 and then crashes. Message m1 arrives at r and s via best-effort broadcast and is immediately vs-delivered, but the best-effort broadcast to q is delayed. On the other hand, message m2 is vs-delivered by q but not by r and s, because the sender p crashed. Additionally, process s also vs-broadcasts a message m3, which is soon vs-delivered by all correct processes, before a view change is initiated. When the underlying membership module installs a new view and excludes p from the group, the flush procedure starts. The processes initialize an instance of TRB for each process in the old view, and each process broadcasts the set of messages that it has vs-delivered in the old view. For instance, the TRB with sender q outputs m2 and m3, since q has not yet vs-delivered m1. The union of all sets output by the TRB instances, {m1, m2, m3}, must be vs-delivered by every correct process before it installs the new view. Note that the best-effort broadcast Data message with m1 is eventually delivered to q, but is discarded because it is not from the current view at q.
Fig. 6.4

Sample execution of the TRB-based view-synchronous algorithm

Correctness.

Consider first the view inclusion property. Let m be any message that is vs-delivered by some process q with sender p in a given view V. If q is the sender of the message then q directly vs-delivers the message upon vs-broadcasting it, in the same view. Consider now the case where the sender p is a different process. There are two possibilities. Either process q vs-delivers m in response to beb-delivering a Data message containing m, or in response to utrb-delivering the rebroadcast set of delivered messages from some process r. In the first case, the algorithm checks whether the view in which the message was vs-broadcast is the current one, and if not, the message is discarded. In the second case, process r has utrb-broadcast its set inview, which contains only messages that have been vs-broadcast and vs-delivered in the current view.

The no creation broadcast property directly follows from the properties of the underlying best-effort broadcast and TRB abstractions. The no duplication broadcast property follows from the use of the variable delivered and the check, applied after beb-delivering a Data message, that only messages vs-broadcast in the current view are vs-delivered. Consider the agreement broadcast property (VS5). Assume that some correct process p has vs-delivered a message m. Every correct process eventually vs-delivers m after beb-delivering it, or, if a new view needs to be installed, upon utrb-delivering a set of delivered messages from the same view that contains m. To show the validity property of broadcast, let p be some correct process that vs-broadcasts a message m. Process p directly vs-delivers m and, because of the agreement broadcast property, every correct process eventually vs-delivers m.

Consider now the properties inherited from group membership. The monotonicity, uniform agreement (VS7), and accuracy properties directly follow from the corresponding properties of the underlying group membership abstraction and from the algorithm, which preserves the order of views. The completeness property is ensured by the completeness property of the underlying group membership primitive, the termination property of TRB, and the assumption that the higher-level module is well behaved (i.e., it stops vs-broadcasting messages when it is asked to do so).

Performance.

During periods where the view does not need to change, the cost of view-synchronously delivering a message is the same as the cost of a best-effort broadcast, that is, one single message transmission. For a view change from a view (vid, M), however, the algorithm requires the execution of a group membership instance, plus the (parallel) execution of one TRB for each process in M, in order to install the new view. Considering the consensus-based algorithms used to implement the group membership and TRB primitives, installing a new view requires 1 + |M| consensus instances. In an exercise (at the end of the chapter), we discuss how to optimize Algorithm 6.8–6.9 by running a single instance of consensus to agree both on the new view and on the set of messages to be vs-delivered before the new view is installed.

6.8.4 Fail-Stop Algorithm: Consensus-Based Uniform View-Synchronous Communication

The view-synchronous broadcast algorithm of the previous section (Algorithm 6.8–6.9) implements view-synchronous communication (Module 6.9). It is uniform in the sense that no two processes, be they correct or not, install different views. The algorithm is not uniform in the message-delivery sense required by uniform view-synchronous communication (Module 6.10), however. That is, one process might view-synchronously deliver a message and crash, but no other process delivers that message. For instance, the sender p of a message m could vs-deliver m and its best-effort broadcast might reach only one other process q, which also vs-delivers m. But if p and q crash without any further actions then no other process ever learns anything about m.

One might think that Algorithm 6.8–6.9 could be made to satisfy the uniform agreement broadcast property of Module 6.10 simply by replacing the underlying best-effort broadcast abstraction with a uniform reliable broadcast primitive (say, an instance urb). However, the following scenario illustrates that this does not work. Suppose process p vs-broadcasts m, urb-broadcasts m, and then vs-delivers m after urb-delivering it. The only guarantee here is that all correct processes will eventually urb-deliver m; they might do so after installing a new view, however, which means that m would not be vs-delivered correctly.

We present Algorithm 6.10–6.11, called “Consensus-Based Uniform View-Synchronous Communication,” which ensures uniform agreement in two ways: first, in the sense of group membership and, second, in the sense of reliable broadcast. In other words, Algorithm 6.10–6.11 implements uniform view-synchronous communication (Module 6.10).

The algorithm invokes a uniform consensus primitive directly and relies on a perfect failure-detector abstraction, but does not use group membership or TRB. It works as follows. When a process vs-broadcasts a message m, it beb-broadcasts a Data message with m and the current view identifier, and adds m to the set of messages it has beb-broadcast. When a process p extracts such an m from a beb-delivered Data message that a process q sent in the current view, it adds q to the set of processes ack[m] that have acknowledged m. Then p beb-broadcasts m and thereby acknowledges m, if it has not done so already, and adds m to the set of messages it has beb-broadcast. The latter set is stored in a variable pending, which contains all messages that a process has received in the current view.

The process also maintains a variable delivered with all messages that it has ever vs-delivered. When all processes in the current view are contained in ack[m] at a given process p, then p vs-delivers the message m and adds m to delivered. As the reader might recall from Chap. 3, the same approach has already been used in Algorithm 3.4.

If any process detects the crash of at least one member of the current view, the process initiates a collective flush procedure as in Algorithm 6.8–6.9 from the previous section. The process first broadcasts (using best-effort guarantees) its set pending, containing all messages received in the current view; note that some messages in pending might not have been vs-delivered. As soon as a process p has collected the set pending from every other process that p did not detect to have crashed, it proposes a new view through an instance of uniform consensus. The view consists of all processes that are correct according to the failure-detector output at p.

Apart from a new candidate view, process p also proposes the collection of pending sets received from all processes in the candidate view. The union of these sets contains all messages that the processes have “seen” and potentially vs-delivered in the ending view. The consensus primitive then decides on a new view and on such a collection of sets. Before installing the new view, each process parses all sets of pending messages in the consensus decision and vs-delivers those messages that it has not vs-delivered yet. Finally, the process installs the new view decided by consensus and resumes normal operation in the next view.
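
A compact sketch of the normal-operation delivery rule (acknowledgment by re-broadcast, delivery once every view member has acknowledged) might look as follows; the beb handle, the callback names, and the omission of the flush and consensus steps are all simplifying assumptions made for this illustration.

```python
# Illustrative sketch of the acknowledgment-based delivery rule of the
# consensus-based uniform view-synchronous algorithm (flush and view change
# omitted); `beb` and `vs_deliver` are hypothetical interfaces.

class UniformVS:
    def __init__(self, self_id, view, beb, vs_deliver):
        self.self_id = self_id
        self.view = view                    # view = (vid, set of members)
        self.beb = beb
        self.vs_deliver = vs_deliver
        self.pending = set()                # (source, message) pairs seen in this view
        self.delivered = set()              # messages ever vs-delivered
        self.ack = {}                       # message -> set of acknowledging processes

    def broadcast(self, m):                 # vs-broadcast of message m
        vid, _ = self.view
        self.pending.add((self.self_id, m))
        self.beb.broadcast(("DATA", vid, self.self_id, m))

    def on_beb_deliver(self, q, msg):
        kind, vid, src, m = msg
        if kind != "DATA" or vid != self.view[0]:
            return                          # discard messages from other views
        self.ack.setdefault(m, set()).add(q)
        if (src, m) not in self.pending:    # acknowledge m by re-broadcasting it once
            self.pending.add((src, m))
            self.beb.broadcast(("DATA", vid, src, m))
        if self.view[1] <= self.ack[m] and m not in self.delivered:
            self.delivered.add(m)           # every member of the view acknowledged m
            self.vs_deliver(src, m)
```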

Correctness.

The arguments for correctness are similar to those of Algorithm 6.8–6.9. The view inclusion property directly follows from the algorithm because no process buffers a message from the previous view when it installs the next view.

We first consider the broadcast-related properties. The delivered variable ensures that no message is vs-delivered twice, which establishes the no duplication property. It is also easy to see that a message can only be vs-delivered if some process has indeed vs-broadcast it, as stated by the no creation property.

The validity property follows because a correct process p includes every message that it vs-broadcasts in its set pending. As p is correct and, therefore, never detected by the failure detector, this set is always contained in the decision of the uniform consensus instance that switches to the next view. Process p vs-delivers every one of its messages at the latest before installing the next view.

For the uniform agreement property, consider any process that vs-delivers some message m. If the process does this during the flush procedure that installs a new view, then all correct processes install the same new view as well and vs-deliver m. Otherwise, the process vs-delivers m because every process in the current view has acknowledged m, and it follows that all correct processes must have stored the message in their pending variable. Hence, during the next view change, m is contained in the set of all “seen” messages and is eventually vs-delivered by every correct process.

The properties related to group membership follow directly from the algorithm, because it contains almost the same steps as the “Consensus-Based Group Membership” algorithm from Sect. 6.7, which also uses uniform consensus and perfect failure-detector primitives.

Performance.

During periods where the view does not change, the cost of vs-delivering a message is the same as the cost of a reliable broadcast, namely \(O(N^2)\) messages and only two communication steps. To install a new view, the algorithm requires the parallel execution of best-effort broadcasts for all processes in the view, followed by an execution of uniform consensus to agree on the next view. The algorithm uses only one instance of uniform consensus and is therefore more efficient than Algorithm 6.8–6.9, which may invoke up to N + 1 instances of uniform consensus for a view change.

6.9 Exercises

Exercise 6.1:

Would it make sense to add the total-order property of total-order broadcast to a best-effort broadcast abstraction?

Exercise 6.2:

What happens in our “Consensus-Based Total-Order Broadcast” algorithm (Algorithm 6.1) if the set of messages delivered in a round is not sorted deterministically after deciding in the consensus abstraction, but before it is proposed to consensus? What happens in that algorithm if the set of messages decided on by consensus is not sorted deterministically at all?

Exercise 6.3:

The “Consensus-Based Total-Order Broadcast” algorithm (Algorithm 6.1) transforms a consensus abstraction (together with a reliable broadcast abstraction) into a total-order broadcast abstraction. Describe a transformation between these two primitives in the other direction, that is, implement a (uniform) consensus abstraction from a (uniform) total-order broadcast abstraction.

Exercise 6.4:

Discuss algorithms for total-order broadcast in the fail-silent model.

Exercise 6.5:

Discuss the specification of total-order broadcast and its implementation in the fail-recovery model.

Exercise 6.6:

Discuss the relation between Byzantine consensus and Byzantine total-order broadcast.

Exercise 6.7:

Design a more efficient Byzantine total-order broadcast algorithm than Algorithm 6.2, in which every underlying consensus instance decides at least one message that can be btob-delivered. Use digital signatures and a Byzantine consensus primitive with the anchored validity property, as introduced in Exercise 5.11.

Exercise 6.8:

Give a specification of a state-machine replication abstraction and design an algorithm to implement it using a total-order broadcast primitive.

Exercise 6.9:

Consider a fault-tolerant service, implemented by a replicated state machine, from the perspective of a group of clients. Clients invoke operations on the service and expect to receive a response. How do the clients access the replicated state-machine abstraction from Exercise 6.8? What changes if the replicated state-machine abstraction is implemented in the fail-arbitrary model, with Byzantine processes?

Exercise 6.10:

Can we implement a TRB abstraction with an eventually perfect failure-detector abstraction (\(\lozenge \mathcal{P}\)) if we assume that at least one process can crash?

Exercise 6.11:

Can we implement a perfect failure-detector abstraction \(\mathcal{P}\) from multiple TRB instances, such that every process can repeatedly broadcast messages, in a model where any number of processes can crash?

Exercise 6.12:

Consider the fast Byzantine consensus abstraction with the following stronger form of the fast termination property. It guarantees a one-round decision in any execution as long as all correct processes propose the same value:

FBC1’: Strong fast termination: If all correct processes propose the same value, then every correct process decides some value after one communication step. Otherwise, every correct process eventually decides some value.

Describe how to modify Algorithm 6.5 such that it transforms a Byzantine consensus abstraction into fast Byzantine consensus with strong fast termination. As a hint, we add that the strong fast termination property can only be achieved under the assumption that N > 7f.

Exercise 6.13:

Recall that our implementation of a nonblocking atomic commit (NBAC) abstraction (Algorithm 6.6) relies on consensus. Devise two algorithms that do not use consensus and implement relaxations of the NBAC abstraction, where the termination property has been replaced with:
  1. Weak termination: Let p be some process that is known to the algorithm; if p does not crash then all correct processes eventually decide.

  2. Very weak termination: If no process crashes then all processes eventually decide.

Exercise 6.14:

Can we implement an NBAC primitive with an eventually perfect failure detector \(\lozenge \mathcal{P}\), if we assume that at least one process can crash? What if we consider a weaker specification of NBAC, in which the (regular or uniform) agreement property is not required?

Exercise 6.15:

Do we need the perfect failure-detector primitive \(\mathcal{P}\) to implement NBAC if we consider a system where at least two processes can crash, but a majority is correct? What if we assume that at most one process can crash?

Exercise 6.16:

Give an algorithm that implements a view-synchronous communication abstraction such that a single consensus instance is used for every view change (unlike Algorithm 6.8–6.9), and every process directly vs-delivers every message after vs-broadcasting it or after first learning about the existence of the message (unlike Algorithm 6.10–6.11).

6.10 Solutions

Solution 6.1:

The resulting abstraction would not make much sense in a failure-prone environment, as it would not preclude the following scenario. Assume that a process p broadcasts several messages with best-effort properties and then crashes. Some correct processes might end up delivering all those messages (in the same order) whereas other correct processes might end up not delivering any message.

Solution 6.2:

If the deterministic sorting is done prior to proposing the set for consensus, instead of a posteriori upon deciding, the processes would not agree on a set but on a sequence of messages. But if they to-deliver the messages in the decided order, the algorithm still ensures the total order property.

If the messages on which the algorithm agrees in consensus are never sorted deterministically within every batch (neither a priori nor a posteriori), then the total order property does not hold. Even if the processes decide on the same batch of messages, they might to-deliver the messages within this batch in different orders. In fact, the total order property would be ensured only with respect to batches of messages, but not with respect to individual messages. We thus get a coarser granularity in the total order.

We could avoid using the deterministic sort function at the cost of proposing a single message at a time in the consensus abstraction. This means that we would need exactly as many consensus instances as there are messages exchanged between the processes. If messages are generated very slowly by processes, the algorithm ends up using one consensus instance per message anyway. If the messages are generated rapidly then it is beneficial to use several messages per instance: within one instance of consensus, several messages would be gathered, i.e., every message of the consensus algorithm would concern several messages to to-deliver. Agreeing on large batches with many messages at once is important for performance in practice, because it considerably reduces the number of times that the consensus algorithm is invoked.

Solution 6.3:

Given a total-order broadcast primitive to, a consensus abstraction is obtained as follows: when a process proposes a value v in consensus, it to-broadcasts v. When the first message containing some value x is to-delivered, a process decides x. Since total-order broadcast delivers the same sequence of messages at every correct process, and every to-delivered message has been to-broadcast, this reduction implements a consensus abstraction.
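
The reduction is short enough to sketch directly; the tob handle and its register and broadcast methods are assumed placeholder interfaces used only for this illustration.

```python
# Sketch of the reduction from consensus to total-order broadcast described
# above; tob.register(callback) and tob.broadcast(v) are assumed interfaces.

class ConsensusFromTOB:
    def __init__(self, tob, on_decide):
        self.decided = False
        self.on_decide = on_decide
        self.tob = tob
        tob.register(self.on_to_deliver)

    def propose(self, v):
        self.tob.broadcast(v)        # to-broadcast the proposal

    def on_to_deliver(self, sender, x):
        if not self.decided:         # decide the first to-delivered value
            self.decided = True
            self.on_decide(x)
```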

Solution 6.4:

Our algorithm for total-order broadcast in the fail-stop model also works in the fail-silent model, as it does not use a failure-detector abstraction directly, but relies on primitives for reliable broadcast and consensus. Reliable broadcast can be implemented in the fail-silent model, assuming a majority of correct processes. The consensus abstraction, however, cannot be implemented in the fail-silent model, as explained in Chap. 5, but only in the fail-noisy or in the randomized fail-silent model.

Solution 6.5:

We introduce a specification of total-order broadcast in the fail-recovery model and an algorithm that implements it.

We apply the same approach as used to derive “logged” abstractions in the previous chapters. We depart from an abstraction designed for the fail-stop model and adapt its interface with adjacent modules to use logged delivery, add logging operations for relevant states, and define a recovery procedure. Any underlying primitives are implemented in the fail-recovery model as well.

We illustrate here only the uniform variant of logged total-order broadcast, presented in Module 6.11. Its interface is similar to the interface of the logged broadcasts from Chap. 3 (see Module 6.5, for instance), with the only change that the variable delivered, used to log-deliver messages from the primitive, is now an ordered list and no longer a set. Newly log-delivered messages are always appended to delivered. Recall that the abstraction log-delivers a message m from sender s whenever an event ⟨ Deliver ∣ delivered ⟩ occurs such that delivered contains the pair (s, m) for the first time.

To implement the abstraction, we present the “Logged Uniform Total-Order Broadcast” algorithm in Algorithm 6.12; it closely follows the algorithm for the fail-stop model presented in Sect. 6.1 and works as follows. Every message in a total-order broadcast request is disseminated using the underlying uniform reliable broadcast primitive for the fail-recovery model. The total-order broadcast algorithm maintains two variables with messages: a set unordered of messages that have been delivered by the underlying reliable broadcast module, and a list delivered, containing the totally ordered sequence of log-delivered messages. The algorithm operates in a sequence of rounds and invokes one logged uniform consensus primitive per round. At the end of every round, it sorts the newly decided batch of messages and appends them to delivered.

The algorithm starts a new instance of logged uniform consensus whenever it notices that there are unordered messages that have not yet been ordered by the consensus instances of previous rounds. When proposing a batch of messages for consensus in some round, the algorithm logs the proposal in stable storage. The wait flag is also used to ensure that consensus instances are invoked in serial order.

During the recovery operation after a crash, the total-order algorithm runs again through all rounds executed before the crash and executes the same consensus instances once more. (We assume that the runtime environment re-instantiates all instances of consensus that had been dynamically initialized before the crash.) This ensures that every instance of consensus actually decides. Because the algorithm proposes the logged message batches again for consensus, every consensus instance is always invoked with exactly the same parameters. Although it may not be strictly needed (depending on the implementation of consensus), this is consistent with the invariant that each process proposes the message batch stored in stable storage.

The algorithm has the interesting feature of never storing the set of unordered messages and of not logging the delivered sequence for its own use (however, the algorithm must write delivered to stable storage for log-delivering its output). These two data structures are simply reconstructed from scratch upon recovery, based on the stable storage kept internally by the reliable broadcast primitive and by the consensus primitive. Because the initial values proposed for each consensus instance are logged, the process may invoke all past instances of consensus again to obtain the same sequence of messages ordered and delivered in previous rounds.

The algorithm requires at least one communication step to execute the reliable broadcast primitive and at least two communication steps to execute every consensus instance. Therefore, even if no failures occur, at least three communication steps are required.

Solution 6.6:

The Byzantine total-order broadcast and the Byzantine consensus primitives stand in a similar relation to each other as their counterparts with crash-stop processes, which was discussed in Exercise 6.3. As demonstrated through Algorithm 6.2, Byzantine consensus can be used to implement Byzantine total-order broadcast.

An emulation in the other direction works as well, assuming that N > 3f. The processes run a Byzantine total-order broadcast primitive and broadcast their consensus proposals. Once the first N − f messages from distinct senders have been delivered by total-order broadcast, the processes apply a deterministic decision function: it decides a value v in Byzantine consensus if more than f of these messages were equal to v; otherwise, it decides □. This emulation shows that Byzantine total-order broadcast can be used to implement Byzantine consensus.
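
The decision function itself is simple; a hedged sketch follows, where BOX stands for the default decision value □ and the collection of the first N − f delivered proposals is assumed to happen elsewhere.

```python
# Sketch of the decision function used in the emulation of Byzantine consensus
# from Byzantine total-order broadcast (assuming N > 3f); BOX represents □.

from collections import Counter

BOX = object()   # placeholder for the default decision value

def decide(first_proposals, f):
    """first_proposals: the first N - f proposals btob-delivered from distinct senders."""
    value, occurrences = Counter(first_proposals).most_common(1)[0]
    return value if occurrences > f else BOX   # more than f equal values are required
```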

Solution 6.7:

A simple solution, which solves the problem almost but not completely, works as follows. Every process maintains a variable undelivered with the set of input messages that it has itself btob-broadcast, but that have not yet been btob-delivered. In parallel, the processes execute rounds of Byzantine consensus with the anchored validity property. For every consensus instance, a process p computes a digital signature σ on the round number and its current set of undelivered messages, and proposes the tuple (p, undelivered, σ) for consensus. The predicate implementing anchored validity verifies that a proposal contains only messages that have not yet been delivered up to the current round, that there is at least one undelivered message, and that the signature in the proposal is valid.

In order to prevent a correct process from stalling the sequence of consensus executions because its set undelivered is empty, the processes also periodically exchange their proposals for consensus, using point-to-point links. When a process with an empty set of input messages receives a proposal, it adopts the proposal of the other process and proposes it in consensus. When consensus decides a set of messages from some sender process s, every process btob-delivers all messages in a deterministic order. Because the signature on the proposal is valid, the messages must have been btob-broadcast by sender s (if s is correct).

According to the anchored validity notion and the predicate imposed by the correct processes, this algorithm decides and btob-delivers at least one message in every round of consensus. It may violate the validity property of Byzantine total-order broadcast, though, because consensus may forever avoid deciding on the set of messages sent by a particular correct process.

An extension of this algorithm avoids the above problem. To ensure that every btob-broadcast message m from a correct sender s is eventually btob-delivered, the sender first constructs a triple (m, s, σ), containing a signature σ on the message, and sends this triple to all processes using authenticated point-to-point links. Every process now maintains the received triples containing undelivered messages in its undelivered variable.

An initial dissemination phase is added to every round; in the second phase of the round, the processes again execute an instance of Byzantine consensus with anchored validity. In the dissemination phase, every process signs its variable undelivered and sends it with the signature in an Undelivered message to all processes. Every process waits to receive such Undelivered messages containing only valid signatures from more than f processes.

A process then enters the second phase of the round and proposes the received list of f + 1 Undelivered messages to Byzantine consensus. The predicate implementing anchored validity verifies that a proposal contains Undelivered messages signed by more than f distinct processes, that no (btob-level) message contained in it has yet been delivered up to the current round, that there is at least one undelivered message, and that the signatures in all triples are valid. When consensus decides a proposal, the algorithm proceeds as before and btob-delivers all messages extracted from the triples in a deterministic order.

The extended algorithm ensures the validity property because a triple (m, s, σ), with a btob-broadcast message m sent by a correct process s and with signature σ, is eventually contained in the undelivered set of every correct process. In the next round after that time, every Undelivered message therefore contains m. Since the Byzantine consensus instance decides a set of more than f Undelivered messages, at least one of them is from a correct process and also contains m.

Solution 6.8:

A state machine consists of variables and commands that transform its state and may produce some output. Commands consist of deterministic programs such that the outputs of the state machine are solely determined by the initial state and by the sequence of commands that it has executed. A state machine can be made fault-tolerant by replicating it on different processes.

A replicated state-machine abstraction can be characterized by the properties listed in Module 6.12. Basically, its interface presents two events: first, a request event ⟨ Execute ∣ command ⟩ that a client uses to invoke the execution of a program command of the state machine; and, second, an indication event ⟨ Output ∣ response ⟩, which is produced by the state machine and carries the output from executing the last command in the parameter response. For the sake of brevity, we assume that the command parameter of the execution operation includes both the name of the command to be executed and any relevant parameters.

As an example, an atomic register could be implemented as a state machine. In this case, the state of the machine would hold the current value of the register, and the relevant commands would be (1) a write(v) command that writes a value v to the register and outputs a parameter-less response, which only indicates that the write has concluded, and (2) a read command that causes the state machine to output the value of the register as the response. Of course, more sophisticated objects can be replicated the same way.

Algorithm 6.13 implements a replicated state-machine primitive simply by disseminating all commands to execute using a uniform total-order broadcast primitive. When a command is delivered by the broadcast module, the process executes it on the state machine, and outputs the response.

As the state machine is deterministic, is started from the same initial state at every process, and executes the same sequence of commands at every process, the responses are equal at all processes.
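
A minimal sketch of this construction is shown below, using the register example from above; the utob handle with register and broadcast methods is an assumed interface, and the sketch does not reproduce the pseudocode of Algorithm 6.13 itself.

```python
# Illustrative sketch of a replicated state machine layered on uniform
# total-order broadcast; `utob` and the callback wiring are assumed interfaces.

class RegisterStateMachine:
    def __init__(self):
        self.value = None

    def execute(self, command):          # deterministic transition function
        name, *args = command
        if name == "write":
            self.value = args[0]
            return "ok"                  # parameter-less acknowledgment
        if name == "read":
            return self.value

class ReplicatedStateMachine:
    def __init__(self, utob, machine, output):
        self.machine = machine
        self.output = output             # callback for < Output | response >
        self.utob = utob
        utob.register(self.on_deliver)

    def execute(self, command):          # < Execute | command >
        self.utob.broadcast(command)

    def on_deliver(self, sender, command):
        # Every replica executes the same sequence of commands, so the
        # responses are identical at all replicas.
        self.output(self.machine.execute(command))
```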

Solution 6.9:

Every client assigns unique identifiers to its own commands. A client first sends an identifier/command pair to one replica using a point-to-point message (a replica is a process that executes the replicated state machine). When it receives this pair, the replica executes the command on the state machine and includes the originating client and the command identifier together with the command description. When the client does not receive a response to a command after some time, it resends the same identifier/command pair to another replica.

All replicas process the outputs of the state machine; we assume that every response also carries the command from which it was produced, including the originating client and the identifier. When the state machine at a replica outputs a response, the replica determines the client from which the command originated and sends the response with the command identifier back to the client, using a point-to-point message. A client must wait to receive the first response for every command identifier and may discard duplicate responses.

Almost the same scheme works with Byzantine processes. Even though a client might send the command to some Byzantine replicas, the client eventually hits a correct replica by repeatedly sending the command to different replicas. Once it sends the command to a correct replica, the state machine eventually executes the command and outputs a response. But as the Byzantine replicas may send a wrong response to the client, the client needs to receive f + 1 matching responses with a given command identifier: this ensures that at least one of the responses is from a correct replica.
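
On the client side, the Byzantine case amounts to collecting votes per command identifier; the following sketch illustrates that rule, with all names chosen for illustration only.

```python
# Sketch of the client-side acceptance rule for the Byzantine case: a response
# is accepted once f + 1 replicas return the same value for a command identifier.

from collections import Counter

class Client:
    def __init__(self, f):
        self.f = f
        self.votes = {}                        # command id -> {replica: response value}

    def on_response(self, cmd_id, replica, value):
        entries = self.votes.setdefault(cmd_id, {})
        entries[replica] = value               # at most one vote counted per replica
        best, occurrences = Counter(entries.values()).most_common(1)[0]
        if occurrences >= self.f + 1:          # at least one correct replica agrees
            return best                        # safe to accept this response
        return None                            # keep waiting or resend to another replica
```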

Solution 6.10:

The answer is no. Consider an instance trb of TRB with sender process s. We show that it is impossible to implement TRB from an eventually perfect failure-detector primitive \(\lozenge \mathcal{P}\) if even one process can crash.

Consider an execution E1, in which process s crashes initially, and observe the possible actions of some correct process p: due to the termination property of TRB, there must be a time T at which p trb-delivers △.

Consider a second execution E2 that is similar to E1 up to time T, except that the sender s is correct and trb-broadcasts some message m, but all communication messages to and from s are delayed until after time T. The failure detector behaves in E2 as in E1 until after time T. This is possible because the failure detector is only eventually perfect. Up to time T, process p cannot distinguish E1 from E2 and trb-delivers △. According to the agreement property of TRB, process s must trb-deliver △ as well, and s delivers exactly one message due to the termination property. But this contradicts the validity property of TRB, since s is correct, has trb-broadcast some message m ≠ △, and must trb-deliver m.

Solution 6.11:

The answer is yes, which shows that the perfect failure detector is not only sufficient to implement TRB but also necessary. In other words, the TRB abstraction is equivalent to a perfect failure-detector primitive.

Consider a model where any number of processes can crash and suppose that, for every process p, multiple instances of TRB with p as sender are available. We explain how to construct a perfect failure detector from these primitives. The idea behind the transformation is to have every process repeatedly invoke instances of TRB with all processes in the system as senders. If one such instance with sender s ever trb-delivers △ at a process p, the module \(\mathcal{P}\) at p detects s to have crashed from then on.

The transformation satisfies the strong completeness property of \(\mathcal{P}\) because the TRB abstraction delivers △ if the sender s has crashed, by its termination property. On the other hand, the strong accuracy property of \(\mathcal{P}\) (which states that if a process is detected, then it has crashed) holds because the properties of TRB imply that a process p only delivers △ when the sender s has crashed.

Solution 6.12:

The only change to Algorithm 6.5 concerns the number of equal proposal values necessary to decide fast. The algorithm as stated before decides fast if all N − f received Proposal messages contain the same value. Clearly, even one Byzantine process could prevent fast termination according to the strong notion, by sending a different proposal value and causing it to be received by all correct processes. More generally, the algorithm must tolerate that up to f of the received Proposal messages differ from the rest and still decide fast. Reducing the bound for deciding fast to finding N − 2f equal values among the N − f received Proposal messages achieves this.

The modified algorithm still ensures agreement under the assumption that N > 7f. Again, there are three cases to consider: two correct processes p and q both decide fast, or p decides fast and q decides after bc-deciding, or both decide after bc-deciding. Only the second case changes substantially compared to Algorithm 6.5. If p decides fast, it has received N − f Proposal messages and found a common value v in at least N − 2f of them. As the system contains N processes, every other correct process must also have received Proposal messages from at least N − 2f of those processes whose proposal was received by p. No more than f of these processes might be Byzantine; thus, every correct process must have received at least N − 3f proposals containing v. But the assumption that N > 7f implies that
$$N - 3f > \frac{N - f} {2}$$
and, therefore, every correct process adopts v as its proposal and bc-proposes v. Applying the same argument as before, it follows that process q also decides v in fast Byzantine consensus.
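
For completeness, the chain of equivalences behind this last implication is elementary (in fact N > 5f already suffices for this particular step, which the assumption N > 7f certainly covers):
$$N - 3f > \frac{N - f}{2} \;\Longleftrightarrow\; 2N - 6f > N - f \;\Longleftrightarrow\; N > 5f.$$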

All other properties of fast Byzantine consensus follow from the same arguments as used to show the correctness of Algorithm 6.5.

Solution 6.13:

Both algorithms are reminiscent of atomic commit methods used by some practical distributed transaction processing systems.

  1. The first algorithm may rely on the globally known process p to enforce termination. The algorithm uses a perfect failure detector \(\mathcal{P}\) and works as follows. All processes send their proposal over a point-to-point link to p. This process collects the proposals from all processes that \(\mathcal{P}\) does not detect to have crashed. Once process p knows something from every process in the system, it may decide unilaterally. In particular, it decides Commit if all processes propose Commit and no process is detected by \(\mathcal{P}\), and it decides Abort otherwise, i.e., if some process proposes Abort or is detected by \(\mathcal{P}\) to have crashed. Process p then uses best-effort broadcast to send its decision to all processes. Any process that delivers the message with the decision from p decides accordingly. If p crashes, then all processes are blocked.

     Of course, the algorithm could be improved in some cases, because the processes might figure out the decision by themselves, such as when p crashes after some correct process has decided, or when some correct process decides Abort. However, the improvement does not always work: if all correct processes propose Commit but p crashes before any other process decides then no correct process can decide. This algorithm is also known as the “Two-Phase Commit” (2PC) algorithm. It implements a variant of atomic commitment that is blocking. (A sketch of the decision rules of both variants appears after this list.)

  2. The second algorithm is simpler because it only needs to satisfy termination if all processes are correct. All processes use best-effort broadcast to send their proposals to all processes. Every process waits to deliver proposals from all other processes. If a process obtains the proposal Commit from all processes, then it decides Commit; otherwise, it decides Abort. Note that this algorithm does not make use of any failure detector.
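
As announced in item 1, here is a hedged sketch of the two decision rules; the coordinator-side function corresponds to the 2PC variant and the second function to the very weak termination variant, with all waiting and messaging assumed to happen elsewhere.

```python
# Sketch of the decision rules of the two relaxed atomic commitment algorithms.
# Assumption: the caller has already collected the relevant proposals (and, for
# the 2PC variant, the perfect failure detector's output) before calling.

def coordinator_decide(proposals, detected):
    """2PC coordinator p: `proposals` maps each undetected process to its vote,
    `detected` is the set of processes reported crashed by the failure detector."""
    if detected or any(vote == "Abort" for vote in proposals.values()):
        return "Abort"      # a process voted Abort or is suspected to have crashed
    return "Commit"         # every process voted Commit and none was detected

def decide_very_weak(all_proposals):
    """Very weak termination: decide once a proposal from every process arrived."""
    return "Commit" if all(vote == "Commit" for vote in all_proposals) else "Abort"
```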

Solution 6.14:

The answer is no. To explain why, we consider an execution E1, where all processes are correct and propose Commit, except for some process p that proposes Abort and crashes initially, without sending any message. All correct processes must therefore decide Abort in E1, as deciding Commit would violate the commit-validity property. Let T be the time at which the first (correct) process q decides Abort. It does so presumably after receiving some output of \(\lozenge \mathcal{P}\), which indicated that p crashed.

Consider now an execution E2 that is similar to E1 except that p is correct and proposes Commit, but all its messages are delayed until after time T. The failure detector behaves in E2 as in E1 until time T and suspects p to have crashed; this is possible because \(\lozenge \mathcal{P}\) is only eventually perfect. Hence, no process apart from p can distinguish between E1 and E2, and q also decides Abort in E2. But this violates the abort-validity property, as all processes are correct and propose Commit, yet they decide Abort.

In this argument, the (uniform or regular) agreement property of NBAC was not explicitly needed. This shows that even a specification of NBAC that does not require agreement cannot be implemented with an eventually perfect failure detector if some process can crash.

Solution 6.15:

Consider first a system where at least two processes can crash but a majority is correct. We will argue that in this case the perfect failure detector is not needed. Specifically, we exhibit a failure detector that is strictly weaker than the perfect failure detector (\(\mathcal{P}\)) in a precise sense, but that is strong enough to implement NBAC.

The failure-detector abstraction in question is called the anonymously perfect failure detector, and denoted by ?\(\mathcal{P}\). This failure detector ensures the strong completeness and eventual strong accuracy properties of an eventually perfect failure detector (Module 2.8), plus the following property:

Anonymous detection: Every correct process eventually outputs a failure indication value F if and only if some process has crashed.

Recall that an eventually perfect failure-detector primitive is not sufficient to implement NBAC, as shown in Exercise 6.14.

Given that we assume a majority of correct processes and given that the failure detector ?\(\mathcal{P}\) satisfies at least the properties of an eventually perfect failure detector, one can use ?\(\mathcal{P}\) to implement a uniform consensus primitive (for instance, using the fail-noisy Algorithm 5.7 from Chap. 5).

We now describe how to implement NBAC with the help of a ?\(\mathcal{P}\) abstraction and a uniform consensus primitive. The NBAC algorithm works as follows. All processes initially use a best-effort broadcast primitive beb to send their proposal to all processes. Every process p waits (1) to beb-deliver Commit from all processes, or (2) to beb-deliver Abort from some process, or (3) for ?\(\mathcal{P}\) to output F. In case (1), process p invokes consensus and proposes Commit. In cases (2) and (3), p invokes consensus with proposal Abort. When the consensus primitive outputs a decision, p also decides this value for NBAC. It is easy to see that the algorithm implements NBAC, since the anonymous detection property gives the processes enough information to decide correctly.
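
A hedged sketch of this construction follows; the beb and uc handles and their callback-style methods are hypothetical interfaces chosen for illustration, not definitions from this book.

```python
# Sketch of NBAC from best-effort broadcast, the anonymously perfect failure
# detector ?P, and uniform consensus; all handles are assumed interfaces.

class NBAC:
    def __init__(self, processes, beb, uc, my_proposal, on_decide):
        self.remaining = set(processes)      # processes whose Commit is still missing
        self.uc = uc
        self.on_decide = on_decide           # the consensus decision is the NBAC decision
        self.proposed = False
        beb.broadcast(my_proposal)           # "Commit" or "Abort"

    def _propose(self, value):
        if not self.proposed:                # invoke consensus at most once
            self.proposed = True
            self.uc.propose(value, self.on_decide)

    def on_beb_deliver(self, p, proposal):
        if proposal == "Abort":
            self._propose("Abort")           # case (2): some process voted Abort
        else:
            self.remaining.discard(p)
            if not self.remaining:
                self._propose("Commit")      # case (1): Commit received from everyone

    def on_failure_indication(self):         # case (3): ?P output the value F
        self._propose("Abort")
```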

Now we discuss in which sense ?\(\mathcal{P}\) is strictly weaker than \(\mathcal{P}\). Assume a system where at least two processes can crash. Consider an execution E1 where two processes p and q crash initially, and an execution E2 where only p crashes initially. Let r be any correct process. Using ?\(\mathcal{P}\), at any particular time T, process r cannot distinguish between executions E1 and E2 if the messages of q are delayed until after T. When process r obtains an output F from ?\(\mathcal{P}\), it knows that some process has indeed crashed but not which one. With \(\mathcal{P}\), process r would know precisely which process has crashed.

Hence, in a system where two processes can crash but a majority is correct, a perfect failure-detector primitive \(\mathcal{P}\) is not needed to implement NBAC. There is a failure-detector abstraction ?\(\mathcal{P}\), called the anonymously perfect failure detector, which is strictly weaker than \(\mathcal{P}\) and strong enough to implement NBAC.

Consider now the second part of the exercise. Assume that at most one process can crash. We argue that in such a system, we can emulate a perfect failure-detector abstraction given a primitive for NBAC. The algorithm causes all processes to go through sequential rounds. In each round, the processes use best-effort broadcast to send an “I-am-alive” message to all processes, and then invoke an instance of NBAC. In a given round, every process p waits for NBAC to decide on an outcome: if the outcome is Commit then p moves to the next round; if the outcome is Abort then p waits to beb-deliver N − 1 messages and declares the process that should have sent the missing message to have crashed. Clearly, this algorithm emulates the behavior of a perfect failure detector \(\mathcal{P}\) in a system where at most one process crashes.
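
One round of this emulation might be sketched as follows, under strong simplifying assumptions: the NBAC proposal call is taken to block until a decision is available, and wait_for_alive is a hypothetical helper that blocks until this round's I-am-alive messages from the given number of processes have been beb-delivered.

```python
# Hedged sketch of one round of the emulation of a perfect failure detector
# from NBAC, in a system with at most one crash; every handle is hypothetical.

def run_round(self_id, processes, round_no, beb, nbac, detect, wait_for_alive):
    beb.broadcast(("ALIVE", round_no, self_id))        # "I-am-alive" message
    outcome = nbac.propose("Commit")                   # assumed to block until NBAC decides
    if outcome == "Abort":
        alive = wait_for_alive(len(processes) - 1)     # N - 1 messages eventually arrive
        (crashed,) = set(processes) - set(alive)       # exactly one sender is missing
        detect(crashed)                                # emulate < Crash | crashed > of P
    return outcome
```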

Solution 6.16:

Algorithm 6.14–6.15 for view-synchronous communication presented here uses a reliable broadcast primitive, a uniform consensus module, and a perfect failure-detector abstraction.

The algorithm combines the simple communication approach of Algorithm 6.8–6.9 with methods from Algorithm 6.10–6.11; it works as follows. When a process detects the crash of at least one member of the current view, the process initiates a collective flush procedure as in the algorithms of Sect. 6.8. The purpose of the flush procedure is again to collect all messages that have been view-synchronously delivered by at least one process (that has not been detected to have crashed). These messages must be vs-delivered by all processes that are about to install the new view. To execute the flush procedure, each process first blocks the normal message flow as before (by triggering a ⟨ Block ⟩ output event for the layer above and waiting for the corresponding ⟨ BlockOk ⟩ input event). Once the message flow is blocked, the process stops broadcasting and delivering view-synchronous application messages. The process then broadcasts its set of vs-delivered messages in the current view (stored in variable inview) to every other process.

As soon as a process p has collected the vs-delivered message set from every other process that p did not detect to have crashed, p proposes a new view through a consensus instance. More precisely, process p proposes to consensus the new set of view members as well as their corresponding sets of vs-delivered messages (in variable seen). Because the flush procedure might be initiated by processes which have detected different failures (or detected the failures in a different order), and, furthermore, some processes might fail during the flush procedure, different processes might propose different values to consensus. But it is important to note that each of these proposals contains a valid set of processes for the next view and a valid set of vs-delivered messages. (The only risk here is to end up vs-delivering fewer or more messages of processes which have crashed, but this does no harm.) Consensus guarantees that the same new view is selected by all correct processes. Before installing the new view, process p parses the vs-delivered message sets of all other processes that proceed to the new view and vs-delivers those messages that it has not vs-delivered yet. Finally, p installs the new view and resumes the normal operation within the view.
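
A small Python sketch of this view-installation step may clarify the catch-up phase; the function and variable names are illustrative and not those of Algorithm 6.14–6.15.

    # Executed when consensus decides the next view. 'decided_members' is
    # the membership of the new view, 'decided_seen' maps each member to
    # the messages it reported as vs-delivered in the old view, 'delivered'
    # is this process's own set of vs-delivered messages, and 'vs_deliver'
    # is a callback that delivers a single message to the layer above.
    def install_view(decided_members, decided_seen, delivered, vs_deliver):
        for q in decided_members:
            for m in decided_seen[q] - delivered:
                vs_deliver(m)            # catch up while still in the old view
                delivered.add(m)
        return decided_members           # now install the new view and resume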

Any message is vs-delivered by its sender immediately after the message is vs-broadcast. The message is also added to the set inview of messages vs-delivered by the sender. If the sender remains correct, the message will be vs-delivered before the next view change by the algorithm described earlier (remember that the algorithm uses a perfect failure detector). Furthermore, the set of vs-delivered messages (from the variable seen of some process) will be made available to all noncrashed processes as an output of the consensus instance that decides on the next view. Since all correct processes parse this set for missing messages before they install the next view, all contained messages are vs-delivered in the same view at all correct processes.

During periods where the view does not need to change, the cost of vs-delivering a message is the same as the cost of a best-effort broadcast, that is, only O(N) messages and one communication step. To install a new view, the algorithm requires every process to broadcast one message using best-effort broadcast and to execute one instance of uniform consensus to agree on the next view.

6.11 Chapter Notes

  • Total-order broadcast is probably the most important abstraction for practical applications of distributed programming. It has, therefore, received a lot of interest, starting with the pioneering work of Lamport (1978). Schneider (1990) gives a concise introduction to total-order broadcast and state-machine replication.

  • Our total-order broadcast specifications and the algorithm implementing total-order broadcast in the fail-stop model are inspired by the work of Chandra and Toueg (1996) and of Hadzilacos and Toueg (1993). These implementations are instructive because they are modular and use consensus as an abstract primitive. In practice, system implementors have preferred optimized monolithic implementations of total-order broadcast, such as the “Paxos” algorithm (Lamport 1998) or viewstamped replication (Oki and Liskov 1988), whose consensus mechanisms we discussed in Chap. 5. Many systems deployed today provide total-order broadcast; one representative implementation is described by Chandra, Griesemer, and Redstone (2007).

  • Our total-order broadcast specification and the algorithm in the fail-recovery model (in Exercise 6.5) were defined only more recently (Boichat et al. 2003a; Boichat and Guerraoui 2005; Rodrigues and Raynal 2003).

  • We considered that messages that need to be totally ordered were broadcast to all processes in the system, and hence it was reasonable to have all processes participate in the ordering activity. As for reliable broadcast, it is also possible to formulate a total-order multicast abstraction where the sender can select the subset of processes to which the message needs to be sent, and require that no other process besides the sender and the multicast set participates in the ordering (Rodrigues, Guerraoui, and Schiper 1998; Guerraoui and Schiper 2001).

  • It is possible to design total-order algorithms with crash-stop processes that exploit particular features of concrete networks. Such algorithms can be seen as sophisticated variants of the basic approach taken by the algorithms presented here (Chang and Maxemchuk 1984; Veríssimo et al. 1989; Kaashoek and Tanenbaum 1991; Moser et al. 1995; Rodrigues et al. 1996; Rufino et al. 1998; Amir et al. 2000).

  • Byzantine total-order broadcast has a shorter history than its counterpart with crash-stop processes. Our Byzantine atomic broadcast algorithms in Sect. 6.2 and Exercise 6.7 follow the modular presentation by Cachin et al. (2001). In contrast to this, the PBFT algorithm of Castro and Liskov (2002) and many subsequent algorithms (Doudou et al. 2005; Abd-El-Malek et al. 2005; Abraham et al. 2006; Martin and Alvisi 2006; Kotla et al. 2009; Guerraoui et al. 2010) implement Byzantine total-order broadcast directly.

  • The TRB problem was studied by Hadzilacos and Toueg (1993) in the context of crash failures. This abstraction is a variant of the “Byzantine Generals” problem (Lamport, Shostak, and Pease 1982). While the original Byzantine Generals problem uses a fail-arbitrary model with processes that might behave arbitrarily and maliciously, the TRB abstraction assumes that processes may only fail by crashing.

  • The fast consensus abstraction and the algorithm that implements it were discussed by Brasileiro et al. (2001). Our description in Sect. 6.4 follows their presentation.

  • The fast Byzantine consensus abstraction in Sect. 6.5 and its variant from Exercise 6.12 have been introduced by Song and van Renesse (2008).

  • The atomic commit problem, sometimes also called “atomic commitment,” was introduced by Gray (1978), together with the “Two-Phase Commit” algorithm, which we studied in Exercise 6.13. The atomic commit problem corresponds to our nonblocking atomic commit (NBAC) abstraction without the termination property.

  • The nonblocking atomic commit problem has been introduced by Skeen (1981) and was refined later (Guerraoui 2002; Delporte-Gallet et al. 2004). Our “Consensus-Based Non-Blocking Atomic Commit” algorithm presented in this chapter is a modular variant of Skeen’s decentralized three-phase algorithm. It is more modular in the sense that we encapsulate many subtle issues of NBAC within consensus.

  • The group membership problem was initially discussed by Birman and Joseph (1987). They also introduced the view-synchronous communication abstraction. Both primitives have been implemented in the influential ISIS system. The specification of view-synchronous communication presented here was introduced by Friedman and van Renesse (1996). This is a strong specification, as it ensures that messages are always delivered in the same view in which they were broadcast. Weaker specifications were also considered (Babaoglu et al. 1997; Fekete et al. 2001; Lesley and Fekete 2003; Pereira et al. 2003). A comprehensive survey of group communication specifications and systems providing view-synchronous communication is presented by Chockler, Keidar, and Vitenberg (2001).

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Christian Cachin, IBM Research Zürich, Rüschlikon, Switzerland
  • Rachid Guerraoui, Fac. Informatique et Communications, Lab. Programmation Distribuée (LPD), Ecole Polytechnique Fédérale Lausanne (EPFL), Lausanne, Switzerland
  • Luís Rodrigues, INESC-ID, Instituto Superior Técnico, Lisboa, Portugal
