Correctness and concurrent complexity of the Black-White Bakery Algorithm

Lamport’s Bakery Algorithm (Commun ACM 17:453–455, 1974) implements mutual exclusion for a fixed number of threads with the first-come first-served property. It has the disadvantage, however, that it uses integer communication variables that can become arbitrarily large. Taubenfeld’s Black-White Bakery Algorithm (Proceedings of the DISC. LNCS, vol 3274, pp 56–70, 2004) keeps the integers bounded, and is adaptive in the sense that the time complexity only depends on the number of competing threads, say N. The present paper offers an assertional proof of correctness and shows that the concurrent complexity for throughput is linear in N, and for individual progress is quadratic in N. This is proved with a bounded version of UNITY, i.e., by assertional means.


Introduction
The advent of multiprocessors and multicore architectures has revived the interest in concurrent algorithms. Concurrent algorithms are difficult to design, however, because they can unexpectedly misbehave due to subtle bugs or race conditions. They are almost impossible to test. Verification is not easy either, but if one has a good proof assistant, it can be done.
A typical concurrency problem is mutual exclusion. Over the years, many mutual exclusion algorithms have been proposed. Recently, we performed an investigation [BDH15] of 20 of these algorithms: the algorithms were implemented and their performance compared, under both zero and high contention. It turned out that some algorithms for mutual exclusion based on reading and writing of atomic variables perform almost as well as algorithms based on stronger hardware primitives. This justifies a renewed interest in the theoretical performance analysis of these algorithms.
One of the most elegant mutual exclusion algorithms ever proposed is Lamport's Bakery Algorithm [Lam74]. This algorithm has the so-called first-come first-served property (FCFS). In particular, it has no starvation. A disadvantage is that it requires unbounded integers. In 2004, Taubenfeld [Tau04] proposed the so-called Black-White Bakery Algorithm, which shares some of the good properties of the Bakery Algorithm, in particular FCFS, but does not need unbounded integers.
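As an illustration of the ticket discipline that both algorithms share, the following is a minimal Python sketch of Lamport's original Bakery Algorithm protecting a shared counter. The variable names and the thread/iteration counts are our own choices, and the sketch leans on CPython's interpreter lock to make the individual list accesses atomic; it is an illustration, not a production lock.

```python
import threading

N_THREADS, ITERS = 4, 50
choosing = [False] * N_THREADS
number = [0] * N_THREADS
counter = 0  # shared resource protected by the lock

def lock(i):
    # doorway: take a ticket larger than every ticket seen
    choosing[i] = True
    number[i] = 1 + max(number)
    choosing[i] = False
    # waiting: defer to every thread with a smaller (ticket, id) pair
    for j in range(N_THREADS):
        if j == i:
            continue
        while choosing[j]:
            pass
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass

def unlock(i):
    number[i] = 0

def worker(i):
    global counter
    for _ in range(ITERS):
        lock(i)
        counter += 1  # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200 = N_THREADS * ITERS
```

Note that the tickets in `number` grow without bound over the lifetime of the lock; this is exactly the disadvantage that the Black-White variant removes.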

Mutual exclusion
The problem of mutual exclusion was proposed in 1965 by Dijkstra [Dij65]. It can be formulated as follows. Consider a system of concurrent threads or processes that can communicate by shared variables. From time to time these threads need exclusive access to some shared resource. Such exclusive access is called the critical section CS. When a thread is in the CS, other threads that need the resource must wait. Mutual exclusion is the design of an entry and exit protocol that protects the CS so that there is never more than one thread in the CS.

Atomicity
In a concurrent system, the atomic commands of the threads are interleaved in arbitrary ways. It is therefore important to specify the grain of atomicity of the commands. This must be done in such a way that it can be respected by the implementation. According to the principle of single critical reference [OG76,(3.1)] and [AdBO09,p. 273], an atomic command shall read or write at most one shared variable (not both), unless it is specifically provided by the operating system (e.g. a CAS or a semaphore action). The principle serves to forbid (e.g.) atomic commands of the form x := x + 1 when they are not explicitly provided by the operating system. Actions on private variables can be added to atomic commands because they never give interference.
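Why x := x + 1 must be split into a separate read and a separate write can be demonstrated deterministically by enumerating all interleavings of two such two-step increments; the encoding below is ours:

```python
from itertools import permutations

# Each thread's non-atomic increment is two atomic steps:
# ('r', t): thread t reads x into its private register
# ('w', t): thread t writes register + 1 back to x
steps = [('r', 0), ('w', 0), ('r', 1), ('w', 1)]

def run(schedule):
    x = 0
    reg = {0: None, 1: None}
    for op, t in schedule:
        if op == 'r':
            reg[t] = x
        else:
            x = reg[t] + 1
    return x

# All interleavings that keep each thread's read before its write.
results = set()
for sched in set(permutations(steps)):
    if sched.index(('r', 0)) < sched.index(('w', 0)) and \
       sched.index(('r', 1)) < sched.index(('w', 1)):
        results.add(run(sched))

print(sorted(results))  # [1, 2]: the lost update shows up as final value 1
```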
When one has to implement an algorithm with fine-grain concurrency on hardware with a weak memory model, one may have to insert memory fences in the code to ensure that the intended atomicity is respected by the hardware. In Sect. 2, we describe how this can be done for the algorithm at hand.

Correctness
The correctness requirements of concurrent algorithms are distinguished in safety (no bad things happen) and progress (eventually something good happens). In general, the safety properties are (and must be) the first concern. For a mutual exclusion algorithm, this primarily means that there is never more than one thread in the critical section, and that the system cannot reach a deadlock state. There are two progress requirements: general progress, i.e. when there are threads that need to enter the critical section, eventually, some do (this is called deadlock freedom); and individual progress: any thread that needs to enter the critical section, eventually does so (this is called lockout freedom). Usually, the proof of progress needs several of the properties established in the proof of safety. All the more reason to treat safety first and carefully.

Concurrent complexity
In principle, progress is unquantified, but for practical purposes it is useful to know that the progress to some well-defined goal does not take too much time. This leads to the question of time complexity.
Due to the many possible interleavings, it is not easy to come up with a faithful time-complexity measure for concurrent algorithms. In [Hes98,Hes15a], we proposed a concept of concurrent complexity based on "rounds". This concept is closely tied to the UNITY logic of [CM88,Mis01], in the sense that, in most cases, a progress proof with UNITY can easily be adapted to also give an upper bound on the concurrent complexity.
In the analysis of a concurrent algorithm, a transition system is constructed that models the algorithm, but also its environment, which contains the clients of the system. One therefore needs to distinguish two kinds of steps: the forward steps that are executed by the threads for the sake of the algorithm, and environment steps that model uncontrollable actions of the environment. In general, the steps of an algorithm are forward steps. See Sect. 3.3 for a more detailed discussion. Progress can be hampered by the disabling of forward steps, e.g., when a thread needs to wait for a semaphore. In general, disabling of environment steps improves performance. The distinction between forward steps and environment steps corresponds to Guarantee and Rely in Rely/Guarantee approaches.
An execution fragment is a nonempty finite sequence of states such that every pair of subsequent states is connected by a step of the transition system. Two execution fragments can be concatenated if the last state of the first fragment equals the first state of the second fragment. An execution fragment is called a round if, for every thread p, it either contains at least one forward step of p, or at least one state in which the forward steps of p are disabled. Informally speaking, in a round, every thread is scheduled at least once.
Finally, the concurrent complexity n of reaching a postcondition Q from a precondition P is expressed as the assertion "P leads to Q within n rounds", notation P Lt n Q. This is defined to mean that every execution fragment that starts in a state where P holds, and that contains a concatenation of n rounds, contains a state where Q holds. This concept of leads-to-within specializes the leads-to concept of UNITY and temporal logic. For example, the assertion (p in Entry) Lt 9 (p in CS) would mean that, when some thread p is in the entry protocol, it will enter the critical section within nine rounds. Note that the predicates p in Entry and p in CS need not be stable.
Remark. The approach implicitly requires a weak kind of scheduling fairness. There need not be a fair scheduler. Yet, if the next forward command of some thread is never done, we cannot expect progress of this thread, and the absence of the command may even block progress for all other threads. Therefore, in order to prove progress, we need some assumption that enforces so-called weak fairness. This assumption is built into the idea of rounds.

Problem setting and progress estimates
The mutual exclusion problem is traditionally modelled as follows. The threads are in an infinite loop of the form

loop  NCS ; Entry ; CS ; Exit  end.

Here, NCS and CS are given program fragments that stand for the noncritical section and the critical section, respectively. NCS need not terminate; CS is guaranteed to terminate. The problem is to implement Entry and Exit in such a way that the number of threads in CS is guaranteed to remain ≤ 1 (mutual exclusion). Lamport [Lam86a] also required that Exit be waitfree in the sense that every thread can pass Exit without waiting, in a bounded number of its own steps.
The progress requirement is that, when some thread has entered Entry, eventually some thread will enter CS. Individual progress (lockout-freedom) is the condition that, if some thread has entered Entry, eventually it will enter CS, go through Exit, and return to NCS.
The first-come-first-served property FCFS is defined as follows [Lam74]. It is required that the program fragment Entry is a sequential composition of two fragments Doorway and Waiting, such that Doorway is waitfree and that, when a thread has passed Doorway, it will enter CS before any other thread that is currently not in Entry. See Sect. 3.4 for the formalization we have used.
The Black-White Bakery (BWB) Algorithm is adaptive, in the sense that, if the number of competing threads is bounded by a number N, the concurrent complexity is bounded by a function of N. Two kinds of concurrent complexity are distinguished: throughput and individual progress.
Throughput is measured by a shared history variable rc (return counter), which is incremented by 1 whenever any thread returns to NCS. Let AI be the condition that all threads are idle, i.e., at NCS. Of course, there is no throughput when AI holds. A linear estimate of throughput is therefore a pair of constants A, B such that, for all i, m,

(0)   rc = m  Lt (A · i + B)  (AI ∨ rc ≥ m + i).

In words, given a number i, if the number of rounds is large enough (A · i + B) and the threads do not become all idle, at least i times a thread returns to NCS. The number A is the throughput factor; the smaller it is, the better the performance of the algorithm. The number B is a kind of initial delay. According to Theorem 3 below, for the BWB algorithm, throughput (0) holds with a throughput factor A ∈ O(N). Individual progress of thread p is expressed and quantified by

(1)   true  Lt n  (p at NCS).
This says that, from every location, thread p returns to the noncritical section within n rounds. The number n is (an upper bound for) the individual delay. According to Theorem 4 below, for the BWB algorithm, individual progress (1) holds with individual delay n ∈ O(N²).
Remark. In the Formulas (0) and (1), passage of the critical section is assumed to take only one round.

The Black-White Bakery Algorithm
Our version of the Black-White Bakery Algorithm [Tau04, Fig. 3] is given in Fig. 1. It uses a set object [AST99] with three methods: join, leave, and getset. The method getset returns a set that contains all threads that have completed their last call of join and have not yet started leave, and that does not contain any threads that have completed leave and have not yet started a next call of join. Formally, the object can be regarded as a large boolean array. We come back to this in Sect. 3.1.
Entry to the critical section is guarded by two queues, distinguished by the shared variable color : bit. In line 25, an entering thread lines up in the queue of color. The threads in the queue of 1 − color have priority. Thread p computes its priority lev.p in this queue in line 26. It announces the queue chosen and its priority by the assignment in line 27: the integer pair(p) encodes the queue thread p has chosen, mcol.p = pair(p) mod 2, as well as its priority lev.p = pair(p) div 2 in this queue.
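The encoding of the pair value can be made explicit with a two-line sketch; the function names are our own:

```python
def encode_pair(lev, mcol):
    # low bit: chosen queue color; remaining bits: priority level
    return 2 * lev + mcol

def decode_pair(pair):
    return pair // 2, pair % 2   # (lev, mcol)

p = encode_pair(3, 1)
print(p, decode_pair(p))  # 7 (3, 1)
```

Packing both values into one integer lets a thread announce its queue and its priority with a single write, which matters for the atomicity discussion of Sect. 1.2.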
In order to prevent interference between the writing of pair in line 27 and the reading of pair in lines 33, 34, the doorway 22-29 of thread p is guarded by the boolean cho(p), just as in Lamport's Bakery Algorithm.
The algorithm thus uses the shared variables color : bit, and the thread-indexed arrays pair of integers and cho, partic of booleans. The initial condition is that pair(q) = 0 and cho(q) = partic(q) = false for all threads q; color can initially be arbitrary. Thread p only writes the array elements pair(p), cho(p), partic(p).
The main communication variable is the array pair. Thread p writes pair(p) in the lines 27 and 37. It reads pair(thr.p) in the lines 26, 32, 33, and 34. In the lines 26, 33, 34, thread p processes the value of pair(thr.p) by means of private functions fn, guardA, guardB, which also use its private variables mcol, lev, and thr. The ordering used in guardA is the lexical ordering:

(a, b) < (c, d)   ≡   a < c ∨ (a = c ∧ b < d).

In the lines 30-34, thread p waits for any other participating thread thr, first to conclude its lines 24-28, next to conclude its waiting section if thr has priority over p. After this waiting section, thread p can enter the critical section CS. Subsequently, it resets color, but only when its private color mcol.p equals the public color.
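In Python, the lexical ordering is exactly tuple comparison, which makes the tie-breaking by thread id easy to see (the function name is ours):

```python
def precedes(num_r, r, num_q, q):
    # lexical order: compare tickets first, break ties by thread id
    return (num_r, r) < (num_q, q)

print(precedes(2, 5, 3, 1))  # True: smaller ticket wins
print(precedes(2, 1, 2, 5))  # True: equal tickets, smaller id wins
```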
The algorithm of Fig. 1 deviates at two points from Algorithm 3 of [Tau04]. The latter algorithm violates the principle of single critical reference of Sect. 1.2. At the point of our line 25, it reads the shared variable color and immediately writes the value read to the shared variable pair(p). For the sake of the verification, we need to identify the atomic action in such a way that the principle of single critical reference is satisfied. We therefore introduce a private variable mcol to hold the value read in line 25, and postpone the assignment to pair(p) to line 27. A more innocent deviation is that the computation of the maximum over all threads in set1 is split into a sequence of steps in line 26.
Note that the program now almost satisfies the principle of single critical reference: in every transition (i.e., at every line number) at most one shared variable is read or written, and not both. This is the reason to separate the lines 31 and 32. Strictly speaking, line 34 violates the principle because it inspects pair(thr) and color. This is allowed, however, because the thread is waiting for a disjunction: it can pass when either of the disjuncts holds.
If one has to execute the algorithm on hardware with a weak memory model, one may have to insert fences after every write operation that is followed by a read operation. Therefore, Fig. 1 offers optional fences after the lines 23 and 28.
Remark. For the sake of simplicity, or when the set of threads is small enough, one can remove the variable partic and the lines 22 and 38. In the lines 24 and 29, partic must then be replaced by the set thread of all threads. The result is more or less equivalent to Fig. 2 of [Tau04].

Verification of safety
In order to verify the BWB algorithm, it is modelled as a transition system with a global state that comprises the values of all shared and private variables, including program counters. In this system, the threads perform steps in arbitrary order. This transition system is then used to prove the relevant safety and liveness properties.
The transition system is developed in Sect. 3.1. Section 3.2 contains the proof of mutual exclusion. Absence of deadlock states is proved in Sect. 3.3. The FCFS property is proved in Sect. 3.4. In Sect. 3.5, it is proved that the communication variables can remain bounded.

The transition system
The program of Fig. 1 is extended and transformed into the transition system of Fig. 2. This is a formalization step, not subject to verification by PVS. Indeed, Fig. 2 is the starting point of the PVS verification.
First, at line 21, a noncritical section NCS has been added, where thread p resides initially. This is also the location thread p goes back to after line 38. The decision at NCS to aim at the critical section and to go to line 22 is an environment step because it is done by the client of the system.
During the design and verification of an algorithm, we occasionally have to change line numbers and numbered invariants. To avoid introducing mistakes in the PVS proof when modifying the files with query-replace, we use line numbers of two digits. Therefore, in Fig. 2, the transitions are numbered from 21 onward (the choice of 21 is arbitrary). Every thread has a private variable pc that holds the current line number. Every transition of thread p implicitly increments pc.p, unless this is overridden by a branch or goto instruction.
We thus use the line numbers to refer to the steps of the algorithm. We distinguish the steps at line 26 into step 26B, the execution of the loop body (which does not change pc), and step 26E, the jump to line 27 when set1 is empty. Similarly, step 30B goes to line 31, while step 30E jumps to line 35 when set2 is empty. Note that, in Fig. 2, the variables set1 and set2 change in the loop bodies: they now serve to hold the threads for which the loop body has yet to be executed.
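The incremental maximum computation of line 26 can be sketched as follows; the function name, and the convention that the new level is one more than the maximum seen (as in the classical Bakery Algorithm), are our assumptions:

```python
def doorway_level(num, set1):
    # Line 26, one loop-body step per iteration: each step 26B reads a
    # single shared num(thr) and updates the private maximum; set1 holds
    # the threads for which the loop body has yet to be executed.
    set1 = set(set1)
    lev = 0
    while set1:
        thr = set1.pop()
        lev = max(lev, num[thr])
    return 1 + lev  # step 26E: proceed with the chosen level

print(doorway_level({1: 2, 2: 5, 3: 0}, {1, 2, 3}))  # 6
```

Splitting the maximum into single-read steps is what keeps each transition within the principle of single critical reference.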
In order to verify the FCFS property, we let thread p register, when it becomes competing, the threads that it must give priority to in the ghost variable predec(p). When p leaves the CS, it disclaims all its priorities by removing itself from the sets predec(q). When thread p becomes idle again, in line 38, it increments a private ghost variable cnt.p. We come back to this below in the Sects. 3.4 and 4.3, respectively.
Reading the set partic element by element amounts to a repetition; formally, fairness is used to imply that this repetition terminates. This treatment of the set partic as a safe variable in the sense of Lamport [Lam86b] precisely captures the properties postulated in Sect. 2. See also [Hes13a, Section 1.4].
For the ease of verification, the array pair is split into arrays col and num with col(q) = pair(q) mod 2 and num(q) = pair(q) div 2, so that pair(q) = 2 · num(q) + col(q). Therefore, line 27 now holds a concurrent assignment to fields of these arrays, and line 37 only resets num(p).

Proof of mutual exclusion
Mutual exclusion is the property that there is never more than one thread in the CS, i.e., if thread q is in CS, any thread (say r) in CS equals q:

MX:   q in CS ∧ r in CS  ⇒  r = q.

Implicitly, by postulating such an invariant, we mean that it should hold for all values of the free variables (here q and r).
Remark. Predicate MX expresses mutual exclusion in an idealized environment. One may employ a Rely/Guarantee framework (e.g. [NLWSD14]) to express how clients of the data structure can benefit from this. This falls out of the scope of this paper, and it would be the same for almost all mutual exclusion algorithms.
In the invariants, we use q (and r ) as free variables of type thread. In the discussion, we use p for the acting thread, because an invariant about q (and r ) can be falsified by actions of any thread p. Of course, p, q, r always range over all threads, and equalities between them are not excluded.
In order to prove that MX is indeed invariant, we need to establish quite a number of other invariants. There are two ways of finding invariants: either bottom-up by looking at the algorithm, or top-down by weakening the required invariant (here MX). For the present algorithm, we begin with a bottom-up approach.
As thread q is the only one that writes the fields partic(q), cho(q), num(q), col(q), we clearly have a number of straightforward invariants (Iq0, Iq1, and so on). Similarly, the variables set1 and set2 satisfy invariants of their own (among them Iq5). After this preparation, we take a top-down approach. As announced, the competing threads q with mcol.q ≠ color have priority over those with mcol.q = color. This may suggest the predicate Jq0a. This predicate easily follows from Iq5 and the postulate Jq0. We turn to the proof that Jq0 is indeed an invariant. This proof was constructed using the proof assistant PVS. It requires human creativity to invent or generalize invariants, but the proof assistant is needed to verify obvious steps, to handle the numerous case distinctions, and to list proof obligations.
Initially both threads q and r are at line 21, so that Jq0 holds. Predicate Jq0 is threatened only by the steps 29, 33, 34, and 36. This means that, for all other steps of the transition system, the precondition Jq0 implies that Jq0 also holds in the postcondition. For the steps mentioned, we need additional information about the precondition to infer Jq0 in the postcondition.
Step 33 preserves Jq0 because of the new postulate Jq1. Step 34 preserves Jq0 because of Iq2, Iq3, and the new postulate Jq2. Indeed, step 34 threatens Jq0 only when thread q does the step and r = thr.q, while r is in 26-37 and mcol.q = color ≠ mcol.r. Then Iq2 implies num(r) > 0 and Iq3 implies col(r) = mcol.r, and hence col(r) ≠ mcol.q. It follows that the guard of step 34 is false, and the step cannot be taken.
Step 36 of thread p preserves Jq0 for q and r because of Iq5 and Jq0. Indeed, step 36 of thread p threatens Jq0 for q and r only when p is at 36 and modifies color, and q is in 30-37 with mcol.q = color. As p modifies color, it has mcol.p = color. Therefore, Jq0 for p and q implies that q ∈ set2.p, contradicting Iq5.
Predicate Jq1 is threatened only by the steps 32 and 36. It is preserved by step 32 because of Iq3 and Jq2, and by step 36 because of Iq5 and Jq0. Similarly, predicate Jq2 is threatened only by the steps 31 and 36. It is preserved by step 31 because of Iq1, and by step 36 because of Iq5 and Jq0.
This concludes the proof of preservation of Jq0, and hence of Jq0a. Predicate Jq0a implies that if threads q and r are both in 35-37, then mcol.q = mcol.r. It therefore remains to consider threads near CS with the same private colors. At this point, the algorithm is very similar to the Bakery Algorithm, see [Lam74] or e.g. [Hes13a].
We postulate the invariant Jq3. Predicate Jq3 is threatened only by the steps 27, 29, 33, and 34. It is preserved by step 27 because of Iq2, Iq4, and one further new postulate.

Absence of deadlock
A thread is said to be idle iff it is at line 21. A thread is said to be competing iff it is in 22-38. A step of the transition system is called a forward step if it starts in one of the lines 22-38 and either modifies pc or modifies the private variable set1 (in case of line 26). A thread is said to be enabled if it can do a forward step. The step from lines 21 to 22 is not a forward step but an environment step because this step is not part of the system that provides mutual exclusion, but it is done by a process using the system when it needs access to the critical section.
Note that idle threads cannot do forward steps, and that the only non-forward steps of a competing thread are flickering steps at lines 22 and 38.
It is easy to verify that thread p is enabled if and only if it satisfies a predicate ena(p), in which r stands for thr.p.
The transition system is said to be in deadlock iff there are competing threads and no (competing) thread can do a forward step. Absence of deadlock means that deadlock states are not reachable.
In order to prove absence of deadlock, we observe the obvious invariants Kq0, Kq1, and Kq2.

Theorem 1 (Absence of deadlock) Assume that there are no enabled threads. Then all threads are idle.
Proof As there are no enabled threads, it follows from ena and Kq0 that all threads are at the lines 21, 31, 33, or 34. By Kq1, it follows that cho(q) is false for all threads q, so that all threads at line 31 are enabled. Therefore all threads are at the lines 21, 33, or 34. For every thread p at line 34, we have that p is not enabled, so that the thread r = thr.p satisfies num(r) > 0 and col(r) ≠ mcol.p = color; by Kq2 and Iq3, this implies that r is at line 33 and has col(r) ≠ color.
It follows that, if there is a thread at line 34, then the set S0 = {r | r at 33 ∧ col(r) ≠ color} is nonempty. Let q ∈ S0 be the minimal element for the lexical ordering, i.e., (num(q), q) ≤ (num(r), r) for all r ∈ S0. As thread q is disabled and at line 33, the thread r = thr.q satisfies num(r) > 0 and col(r) = mcol.q. By Kq2, Iq2, Iq3, and the previous paragraph, it follows that r ∈ S0, and hence (num(q), q) ≤ (num(r), r), so that thread q is enabled (by Iq2), a contradiction. This proves there are no threads at line 34. Therefore all threads are at the lines 21 or 33. Now consider the set S1 = {r | r at 33}. If this set is nonempty, let q be the minimal element of this set for the lexical order. By the arguments of the previous paragraph, again, thread q is enabled. This implies that S1 is empty. Therefore all threads are at line 21, i.e., they are idle. □
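The core of the argument, that the lexically minimal waiting thread is blocked by nobody, can be checked on a toy instance; the tickets below are arbitrary, and the guard is reduced to the bare lexical comparison:

```python
# Hypothetical instance: threads waiting at line 33 with equal colors,
# identified by their ids, holding tickets num.
num = {1: 2, 2: 2, 3: 5}
order = sorted(num, key=lambda t: (num[t], t))
q = order[0]  # lexically minimal (ticket, id) pair
# q would be blocked only by some r with a strictly smaller pair:
blocked = any((num[r], r) < (num[q], q) for r in num if r != q)
print(q, blocked)  # 1 False
```

By minimality no such r exists, so the minimal thread can always proceed, which is exactly the contradiction used twice in the proof above.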

First-come first-served
The first-come first-served property (FCFS) must be distinguished from first-in first-out (FIFO). The point is that, in almost all mutual exclusion algorithms, the moment of "first-in" cannot be communicated between the threads. The first-come first-served property is therefore defined by Lamport [Lam86b] in the following way. It is required that the entry part of the protocol is a sequential composition of two fragments Doorway and Waiting, such that Doorway is waitfree and that, when a thread has passed Doorway, it will enter CS before any other thread that is currently not in Entry.
In our case, Doorway is the fragment of the lines 22-29, which is indeed waitfree, and Waiting is the loop 30-34. The ghost variable predec (set of predecessors) is introduced to verify FCFS. Any thread p that enters Doorway at line 21 registers all threads in 30-35 in predec(p). Every thread that exits CS removes itself from all sets predec(q). Now FCFS is expressed by the condition that any thread q cannot exit Waiting before predec(q) is empty, as formalized in the predicate FCFS. In order to prove predicate FCFS, we observe that it is logically implied by Iq5 and the new postulate Lq0. Predicate Lq0 is threatened only by the steps 29, 33, and 34. It is preserved by step 29 because of Iq0 and the new postulate Lq1. It is preserved by step 33 because of Iq2, Iq4, Lq1, and the new postulates Lq2 and

Lq3:   thr.q ∈ predec(q) ∧ q at 33  ⇒  col(thr.q) = mcol.q.
Indeed, step 33 threatens Lq0 only when q does the step and r = thr.q ∈ predec(q). Then Lq1 implies that r is in 30-35, and Iq2 implies num(r) > 0. Lq3 implies col(r) = mcol.q. Therefore Lq2 together with Iq4 imply num(r) < lev.q. It follows that the guard of step 33 of q is false.
Predicate Lq0 is preserved by step 34 because of Iq2, Lq1, and one further new postulate. This concludes the proof of the invariants Lq*, and thus of FCFS.

Bounding the tickets
The Black-White Bakery Algorithm was designed as a remedy for the unbounded integers needed in the original Bakery Algorithm [Lam74]. This is verified by the next result.
Theorem 2 Assume that the number of competing threads is always bounded by some number N. Then the tickets num(q) are also bounded by N.
Proof In order to prove this, the transition system is parametrized with the number N, and step 21 is forbidden whenever there are N competing threads (i.e., threads not at line 21). This implies the invariant Mq0, which bounds the number of competing threads by N. The theorem is proved by distinguishing the threads that hold the current color from those that do not. For the first class, we define a suitable set of threads and bound the tickets chosen in it. It follows that pair(q) ≤ 2 · N + 1 always holds. □

Progress
Progress of the algorithm is expressed in operational semantics, presented in Sect. 4.1. The operational progress assertions, however, are not proved by operational arguments but by means of "bounded UNITY" [Hes15a], presented in Sect. 4.2.
We proceed with an investigation of the quantitative throughput in Sect. 4.3, and of individual progress in Sect. 4.4, both under the assumption of Sect. 3.5 that the number of competing threads is bounded by N .

Formal operational semantics
The state of the system is given by the values of all shared and private variables. Usually, we prefer to keep the state implicit, but formally all invariants are boolean functions of the state. We let X be the set of all states. If P is a predicate on the state, it is also regarded as the subset of X where predicate P holds. P ⊆ Q therefore means that every state that satisfies P also satisfies Q (i.e. that P implies Q). Let start be the initial predicate, i.e., the set of initial states.
For thread p, relation step(p) is defined as the set of the pairs (x, y) of states such that in state x thread p can do a step of the algorithm that results in state y. Relation step is defined as the union of the relations step(p) for all threads p, together with the identity relation of the state space. An execution is defined to be an infinite sequence xs of states with xs_0 ∈ start, and (xs_n, xs_{n+1}) ∈ step for all n ∈ ℕ. A predicate P is an invariant if and only if it contains all states of all executions. We write X0 ⊆ X for the intersection of all invariants obtained. So this is the set of the states that satisfy all invariants obtained in Sect. 3.
An execution fragment of length n ≥ 0 is a nonempty finite sequence (xs_0, ..., xs_n) in X0 such that (xs_i, xs_{i+1}) ∈ step for all i with 0 ≤ i < n. Two execution fragments can be concatenated when the final state of the first fragment equals the initial state of the second fragment.
Coming back to the algorithm, recall from Sect. 3.3 that the forward steps are defined to be the steps 22-38 that modify pc or set1. Relation fwd (p) ⊆ step(p) is defined to be the set of forward steps of thread p. Thread p is therefore enabled in state x if and only if there is a state y with (x , y) ∈ fwd (p). Recall that enabledness is expressed by the predicate ena(p).
An occurrence of thread p in an execution fragment (xs_0, ..., xs_n) is a number i with 0 ≤ i < n such that (xs_i, xs_{i+1}) ∈ fwd(p) or xs_i ∉ ena(p). The execution fragment is called a round if it contains an occurrence of every thread. In other words, in the fragment, every thread is scheduled, and either executed or found to be disabled. The latter applies, e.g., when thread p is always at line 21.
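The definition of a round can be turned into a small executable check; the representation of fragments and steps below is our own encoding, not the PVS formalization:

```python
def is_round(fragment, step_thread, forward, enabled, threads):
    # fragment: states xs_0..xs_n; step i goes from fragment[i] to fragment[i+1];
    # step_thread[i] names the acting thread, forward[i] says whether the step
    # is a forward step. An occurrence of p is an index i where p does a
    # forward step, or where p is disabled in state xs_i.
    def occurs(p):
        return any(
            (step_thread[i] == p and forward[i]) or not enabled(p, fragment[i])
            for i in range(len(fragment) - 1)
        )
    return all(occurs(p) for p in threads)

# Two threads a, b; b is disabled in state 1, so the fragment is a round
# even though b never takes a step.
frag = [0, 1, 2]
print(is_round(frag, ['a', 'a'], [True, True],
               lambda p, s: not (p == 'b' and s == 1), ['a', 'b']))  # True
```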
Progress of the algorithm will be proved under the assumption that all threads do enough forward steps unless they are disabled. More precisely, progress will be proved for any execution fragment that contains a concatenation of sufficiently many rounds.

UNITY and bounded UNITY
UNITY logic [CM88,Mis01] is a way to systematically prove assertions of the form P leads to Q (notation P → Q), meaning "if P holds at any time t during a computation, Q will hold at some time t′ ≥ t".
Example. Individual progress of the algorithm means that a thread, say p, in the entry protocol, will eventually reach the critical section at line 35. This is expressed by: p in {22 . . . 34} → p at 35. □

UNITY logic begins with defining two relations, co and co!, between predicates:

P co Q   ≡   ∀ (x, y) ∈ step : x ∈ P ⇒ y ∈ Q,
P co! Q  ≡   ∃ r : P ⊆ ena(r) ∧ (∀ (x, y) ∈ fwd(r) : x ∈ P ⇒ y ∈ Q).

P co Q means that every step that starts in P ends in Q. According to co!, there is a specific thread r that is able to establish Q. UNITY logic is based on the relations unless and ensures defined by:

P unless Q   ≡   (P ∧ ¬Q) co (P ∨ Q),
P ensures Q  ≡   (P unless Q) ∧ ((P ∧ ¬Q) co! Q).

UNITY's leads-to relation → is defined inductively by three rules: if P ensures Q then P → Q; the relation → is transitive; and it is disjunctive in its precondition. Bounded UNITY is a version of UNITY in which the leads-to relation is quantified by a natural number: P leads to Q within n rounds, notation P Lt n Q, is defined to mean that every execution fragment that contains a concatenation of n rounds and has its initial state in P, contains a state in Q. The basic proof rules are

• If P ⊆ Q, then P Lt n Q for every n ≥ 0.
• If P ensures Q, then P Lt 1 Q.
• If P Lt k Q and Q Lt m R, then P Lt (k + m) R.
• For any family (P_i), i ∈ I, if P_i Lt n Q for all i ∈ I, then (∃ i ∈ I : P_i) Lt n Q.
The first rule is called the subset rule, the second one the ensures rule, the third one transitivity, and the fourth one the disjunction rule. There is also the Progress-Safety-Progress Rule [CM88]: if P Lt n Q and R unless B, then (P ∧ R) Lt n ((Q ∧ R) ∨ B). The soundness of these proof rules has been proved mechanically [Hes15a]. The set of proof rules is not complete, but they are enough for the present purposes.
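The rule algebra can be exercised with a small bookkeeping sketch; the predicates and bounds below are made-up examples, and the disjunction operation merges different bounds using the easily checked monotonicity of Lt in n:

```python
# Bounded leads-to facts as triples (P, n, Q), meaning P Lt n Q.
def transitivity(fact1, fact2):
    (P, k, Q), (Q2, m, R) = fact1, fact2
    assert Q == Q2, "middle predicates must match"
    return (P, k + m, R)

def disjunction(facts):
    targets = {Q for (_, _, Q) in facts}
    assert len(targets) == 1, "all disjuncts must share the target"
    # Lt is monotone in n, so the maximum is a common bound for all disjuncts.
    n = max(n for (_, n, _) in facts)
    return (" or ".join(P for (P, _, _) in facts), n, targets.pop())

# Hypothetical facts, chained as a proof of entry-to-CS progress would be:
step1 = ("p in 22..29", 4, "p at 30")
step2 = ("p at 30", 5, "p at 35")
print(transitivity(step1, step2))  # ('p in 22..29', 9, 'p at 35')
```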
Some progress properties are easily expressed by means of a numerical measure. For instance, as discussed in Sect. 1.5, the throughput of a mutual exclusion algorithm can be expressed by the growth of the sum rc, see Formula (0). We develop a small theory to estimate the growth of such a function.
A numerical state function vf : X → ℤ is called a forward measure if it satisfies three requirements which together guarantee that vf grows with the number of rounds, unless all threads are disabled; this is expressed in Formula (2). Useful progress properties are rarely coupled directly to the number of rounds. It can happen, however, that a useful progress property is measured by an integer-valued state function svf that is proportional to a forward measure vf, for some factor F > 0 and some delay D > 0; see the Formulas (3) and (4). If the Formulas (3) and (4) hold, they imply that svf grows in n rounds with at least (n − D)/F. In the limit where the initial delay D counts no longer, svf grows at least with speed F⁻¹.

Throughput
The throughput of the algorithm is defined as the number of times threads come back to the noncritical section.
To measure this, a private ghost variable cnt.p is introduced, which is incremented in line 38, see Fig. 2. The throughput during an execution fragment is the growth of the sum rc = Σ_p cnt.p over all threads p, see Sect. 1.5.
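A minimal sketch of this bookkeeping: each call of line_38 below models the ghost increment at line 38, and rc is the sum of the counters. The number of threads and the event trace are invented for illustration.

```python
# Sketch of the throughput bookkeeping: each thread p has a ghost counter
# cnt[p] that is incremented whenever p executes line 38 and returns to the
# noncritical section; rc is the sum over all threads. The number of
# threads and the event trace are invented for illustration.

N = 3
cnt = {p: 0 for p in range(N)}

def line_38(p):
    """Ghost action at line 38: thread p completes a passage."""
    cnt[p] += 1

def rc():
    return sum(cnt.values())

# A fragment in which threads 0, 1 and 0 complete a passage, in that order;
# its throughput is the growth of rc, here from 0 to 3.
for p in (0, 1, 0):
    line_38(p)
print(rc())   # 3
```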
Before analysing the growth of rc, we note some more invariants. As announced, we assume the invariant Mq0 of Sect. 3.5. It is easy to see that this implies the invariant Mq0a, and using this, it is easy to verify the invariants Kq0, Nq0, Nq1, and Nq2. We also need the obvious invariant q ∈ predec(q).
It follows from Kq0, Nq0, and Nq1 that lvf is bounded: 0 ≤ lvf(q) < A. The function lvf(q) increases under most steps of thread q: more precisely, it decreases under step 38, remains constant under the flickering steps of the lines 22 and 38, and increases under all other steps. For the steps 24 and 29, this follows from Mq0a. For the backward jumps from the lines 33 and 34 to line 30, it follows from Nq2. All steps of other threads leave lvf(q) constant.
The function lvf is connected to the ghost variable cnt in the function avf(q) = A · cnt.q + lvf(q). The bounds on lvf immediately imply A · cnt.q ≤ avf(q) < A · (cnt.q + 1).
Function avf(q) remains constant under the flickering steps of the lines 22 and 38, and it increases under all other steps of thread q. This holds in particular for step 38 of thread q because of the bounds for lvf. All steps of other threads leave avf(q) constant. The sum Savf = Σ_q avf(q) now satisfies bounds in terms of rc: the left-hand inequality is easy; the right-hand inequality follows from Mq0 and the fact that lvf(q) = 0 when q is not competing. The function Savf remains constant under the flickering steps, and it increases under all other steps of all threads. If thread p is enabled and it does a flickering step, it remains enabled. Therefore, function Savf is a forward measure, see Formula (2). By Formula (6), function rc is proportional to Savf with factor A and delay (A − 1) · N + 1. According to Theorem 1, when there are no enabled threads, all threads are idle; with AI the condition that all threads are idle, this means that Savf keeps growing as long as ¬AI holds. Therefore, Formula (5) implies that the algorithm has a throughput factor A = 5 · N + 9, linear in N, and a throughput delay B = (A − 1) · (N − 1).
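The following sketch checks the bounds A · cnt.q ≤ avf(q) < A · (cnt.q + 1) numerically, assuming (as the stated bounds suggest) that avf(q) = A · cnt.q + lvf(q) with 0 ≤ lvf(q) < A, and evaluates the throughput factor A = 5 · N + 9 and delay B = (A − 1) · (N − 1); the sample values of N are our own.

```python
# Sanity check of the bounds A * cnt.q <= avf(q) < A * (cnt.q + 1),
# assuming (as the stated bounds suggest) avf(q) = A * cnt.q + lvf(q)
# with 0 <= lvf(q) < A, plus an evaluation of the throughput factor
# A = 5 * N + 9 and delay B = (A - 1) * (N - 1). The sample values of
# N are invented for illustration.

def check_avf_bounds(A):
    for cnt_q in range(5):
        for lvf_q in range(A):              # 0 <= lvf(q) < A
            avf = A * cnt_q + lvf_q
            assert A * cnt_q <= avf < A * (cnt_q + 1)
    return True

for N in (2, 8, 32):
    A = 5 * N + 9                           # throughput factor, linear in N
    B = (A - 1) * (N - 1)                   # throughput delay
    assert check_avf_bounds(A)
    print(N, A, B)
```

The printed triples show the linear growth of the factor A with N, while the delay B grows quadratically; the delay only affects the start-up phase of an execution fragment.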

Individual progress
As the algorithm satisfies FCFS, individual progress follows from general progress, just as in the case of the algorithm of Lycklama-Hadzilacos-Aravind [Hes15a]. The key step is an application of the PSP rule.
The problem is to guarantee that a thread q in the region 30-35 eventually leaves this region. If it does not, all newly entering threads r will collect and keep q in predec(r); by FCFS, such threads cannot reach the critical section, and this would contradict Theorem 3. To formalize this argument, consider the predicate WF(q, m). While thread q remains in 30-35, no thread r can enter NP, because when r enters {22 . . . }, it puts q into predec(r). Thread r can only leave NP by executing line 38, i.e., by incrementing rc. Conversely, when thread r increments rc, it executes line 38 and therefore leaves the set NP because of FCFS. This proves the required leads-to assertions. As n = 1 + 3 + (N + 7) · … = 10 · N² + 13 · N + 2, the combination of the last three leads-to assertions by means of transitivity and disjunction gives the result on individual progress: Theorem 4. true Lt (10 · N² + 13 · N + 2) (q at 21).
So, the individual delay is bounded by 10 · N² + 13 · N + 2 and is therefore of order O(N²).
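To get a feeling for the quadratic bound, the following snippet evaluates 10 · N² + 13 · N + 2 for a few sample values of N; the sample values are our own.

```python
# Evaluation of the individual-delay bound 10 * N**2 + 13 * N + 2 (in
# rounds) for a few sample values of N; the sample values are invented.

def individual_delay(N):
    return 10 * N * N + 13 * N + 2

for N in (2, 4, 8, 16):
    print(N, individual_delay(N))
# Doubling N roughly quadruples the bound, as expected for order N squared.
```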

In conclusion
All assertions in this paper about the transition system of Sect. 3 have been proved with the proof assistant PVS [OSRSC01]. The starting point is a formal description in PVS of Fig. 2 in relational semantics. PVS helps primarily with exhaustive case distinctions and the administration of the proof obligations. How this can be done and our experiences with this proof assistant are described in [Hes13a]. The proof script for the present paper is available on [Hes15b].
The safety properties of the algorithm (mutual exclusion, absence of deadlock, FCFS, boundedness of the tickets) are all proved by means of invariants, as is usual. It is a lot of work, but with experience and a powerful proof assistant it can be done.
The treatment of progress is more innovative. The numerical quantification in the Theorems 3 and 4 does not require much more effort than a standard UNITY proof for the corresponding progress assertions. The UNITY proof seems to be easier than a temporal logic proof such as given in [Hes13a], primarily because the UNITY concepts ensures and leads-to are more intuitive than sets of executions can ever be.
The result is that the Black-White Bakery Algorithm has a throughput factor linear in N , and individual delay quadratic in N . This can also be proved for the ordinary Bakery Algorithm. It can be compared with the result of [Hes15a] for the algorithm of Lycklama-Hadzilacos-Aravind: there the throughput factor is quadratic in N and the individual delay is cubic in N . On the other hand, we conjecture that the tournament algorithm Peterson-Buhr of [BDH15, Section 18.6] has a throughput factor logarithmic in N and individual delay linear in N .
If one wants to implement the Black-White Bakery Algorithm for a fixed and modest number of threads, the active set partic can be removed from the algorithm of Fig. 1. This means removal of the lines 22 and 38, and replacing getset(partic) by thread in the lines 24 and 29.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.