Rely-guarantee bound analysis of parameterized concurrent shared-memory programs

We present a thread-modular proof method for complexity and resource bound analysis of concurrent, shared-memory programs. To this end, we lift Jones’ rely-guarantee reasoning to assumptions and commitments capable of expressing bounds. The compositionality (thread-modularity) of this framework allows us to reason about parameterized programs, i.e., programs that execute arbitrarily many concurrent threads. We automate reasoning in our logic by reducing bound analysis of concurrent programs to the sequential case. As an application, we automatically infer time complexity for a family of fine-grained concurrent algorithms, lock-free data structures, to our knowledge for the first time.


Program complexity and resource bound analysis
Program complexity and resource bound analysis (bound analysis) aims to statically determine upper bounds on the resource usage of a program as expressions over its inputs. Despite the recent discovery of powerful bound analysis methods for sequential imperative programs (e.g., [4,6,9,12,20,23,36]), few techniques address concurrent, shared-memory imperative programs (cf. Sect. 6). In addition, it is often necessary to reason about parameterized programs that execute an arbitrary number of concurrent threads.
At the same time, from a practical point of view, bound analysis is an important step towards proving correctness criteria of programs in resource-constrained environments: For example, in real-time systems intermediary results must be available within certain time bounds, and in embedded systems applications must not exceed hard constraints on CPU time, memory consumption, or network bandwidth.

Non-blocking data structures
We illustrate the necessity of extending bound analysis to concurrent, shared-memory programs on the example of non-blocking data structures: Devised to circumvent shortcomings of lock-based concurrency (like deadlocks or priority inversion), they have been adopted widely in engineering practice [25]. For example, the Michael-Scott non-blocking queue [31] is implemented in the Java standard library's ConcurrentLinkedQueue class.
Automated techniques have been introduced for proving both correctness (e.g., [2,8,13,40]) and progress (e.g., [22,27]) properties of non-blocking data structures. In this work, we focus on the progress property of lock-freedom, a liveness property that ensures absence of livelocks: Despite interleaved execution of multiple threads altering the data structure, some thread is guaranteed to complete its operation eventually.
From a practical, engineering point of view, it is not enough to prove that a data structure operation completes eventually. Rather, it needs to make progress using a bounded, measurable amount of resources: Petrank et al. [34] formalize and study bounded lock-free progress as bounded lock-freedom, and discuss its relevance for practical applications. They describe its verification for a fixed number of threads and a given progress bound using model checking, but leave finding the bound to the user. Existing approaches for automatically proving progress properties, like the ones presented in [22,27], are limited to eventual (unbounded) progress. To our knowledge, bounded progress guarantees have not been inferred automatically before.

Overview
Reasoning about the resource consumption of non-blocking algorithms is an intricate problem and tedious to perform manually. To illustrate this point, consider the following common design pattern for lock-free data structures: A thread aiming to manipulate the data structure starts by taking as many steps as possible without synchronization, preparing its intended update. Then, it attempts to alter the globally visible state by synchronizing on a single word in memory at a time. Interference from other threads may cause this synchronization to fail and force the thread to retry from the beginning. From the viewpoint of a single thread that accesses the data structure:
1. The amount of interference by other threads directly affects its resource consumption. In general, this means reasoning about an unbounded number of concurrent threads, even to infer resource bounds on a single thread.
2. The point of interference may occur at any point in the execution, due to the fine granularity of concurrency.
In this paper, we present an automated bound analysis for concurrent, shared-memory programs to remedy this situation: In particular, our method analyzes the parameterized system of N concurrent lock-free data structure client threads. To reason about this infinite family of systems and its interactions, we leverage and extend rely-guarantee reasoning [28], which we briefly introduce in the next section.

Fig. 1 Jones' rely/guarantee proof rules for safety

Introduction to rely-guarantee reasoning
Rely-guarantee (RG) reasoning [28,42] extends Hoare logic to concurrency: It makes interference from other threads of execution explicit in the specifications. In particular, Hoare triples {S} P {S′} are extended to RG quintuples R, G ⊢ {S} P {S′}, where the effect summaries R and G capture interference: They are binary relations on program states that over-approximate the state transitions of executions: The rely R specifies other threads' effects (thread P's environment) that P can tolerate while still satisfying its precondition S and postcondition S′. The guarantee G specifies the effect that P can inflict on its environment. Intuitively, encoding a thread's environment in rely and guarantee relations abstracts away the order in which a thread performs its actions, which thread performs which action, and the number of times each action is performed. For termination analysis, the last point is crucial: A thread may not terminate under infinite interference, but may do so under finite interference. For bound analysis, this abstraction may still be too coarse: To compute bounds on the thread, we may need to bound the amount of interference from its environment. Therefore, we extend RG reasoning to bound analysis by introducing bound information into the relies and guarantees. We give new proof rules for such specifications that allow us to reason not just about safety, but also about bounds. Finally, the compositionality of our proof rules allows us to reason even about an unbounded number of threads, i.e., about parameterized systems.
In the following we outline the major contributions of this paper.

Contributions
1. We present the first extension of rely-guarantee specifications to bound analysis and formulate proof rules to reason about these extended specifications (Sects. 3.1-3.4). Apart from their specific use case in this work, we believe the proof rules are interesting in their own right, for example in comparison to Jones' original RG rules [28,42], or the reasoning rules for liveness presented in [22] (cf. the discussion in Sect. 6).
2. We instantiate our proof rules to derive a novel proof rule for parameterized systems. In addition, we present an algorithm that automates reasoning about the unboundedly many threads of parameterized systems (Sect. 3.5).
3. We reduce rely-guarantee bound analysis of concurrent pointer programs to bound analysis of sequential integer programs, and obtain an algorithm for bound analysis of lock-free algorithms (Sect. 4).
4. We implement our algorithm in the tool Coachman and apply it to lock-free algorithms from the literature. To our knowledge, we are the first to automatically infer runtime complexity for widely studied lock-free data structures such as Treiber's stack [38] or the Michael-Scott queue [31] (Sect. 5).
This is an extended version of the conference paper that appeared at FMCAD 2018 [33]. Besides making the material more accessible through additional explanations and discussions, it adds the following contributions:
1. It contains full proofs of Theorems 1 and 2 that were omitted from the conference version.
2. We extend and improve the structure of Sects. 3 and 4 to first introduce a standalone rely-guarantee framework for bound analysis (Sect. 3), and then instantiate it for the analysis of lock-free data structures (Sect. 4).
3. We extend our experiments (Sect. 5) to include nine additional benchmark cases. In addition to the conference version, we include further lock-free data structures, as well as benchmark cases that are not lock-free or have non-linear complexity.
4. Some of these new results were made possible by major performance improvements to our implementation Coachman. Its updated version is available online [14].

Motivating example
We start by giving an informal explanation of our method and of the paper's main contributions on a running example. Figure 2 shows the implementation of a lock-free concurrent stack known as Treiber's stack [38]. Our input programs are represented as control-flow graphs with edges labeled by guarded commands of the form g ▷ c, where guard g must hold for command c to execute; we omit g if g = true. As a convention, we write global variables shared among threads in uppercase (e.g., T) and local variables to be replicated in each thread in lowercase (e.g., t). Further, we assume that edges in the control-flow graphs are executed atomically, and that programs execute in presence of a garbage collector; the latter prevents the so-called ABA problem and is a common assumption in the design of lock-free algorithms [25]. Values stored on the stack do not influence the number of times its operations are executed, thus we abstract them away for readability. The stack is represented by a null-terminated singly-linked list, with the shared variable T pointing to the top element. The push and pop methods may be called concurrently, with synchronization occurring at the guarded commands originating in location 3 for push and location 13 for pop. These low-level atomic synchronization commands are usually implemented in hardware, through instructions like compare-and-swap (CAS) [25]. In Fig. 2, we highlight these synchronization points and edges in bold.

Running example: Treiber's Stack
Fig. 2 Treiber's lock-free stack [38]. Stack pointer T is the sole global variable. Synchronization points and edges (corresponding to CAS instructions) are highlighted in bold.

The stack operations are implemented as follows: Starting with an empty stack, T points to NULL. The push operation (Fig. 2a)
1. allocates a new list node n (0 → 1),
2. reads the shared stack pointer T (1 → 2),
3. updates the newly allocated node's next field to the read value of T (2 → 3),
4. atomically: compares the value read in (2) to the actual value of T; if equal, T is updated to point to n, otherwise the operation restarts (3 → 4 and 3 → 1, respectively).
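To make the control flow concrete, the following Python sketch mirrors the push and pop loops of Fig. 2. It is our own illustration: since Python offers no hardware compare-and-swap, the CAS is simulated with a lock, and the names TreiberStack and Node are ours, not the paper's.

```python
import threading

class Node:
    def __init__(self, next=None):
        self.next = next  # stack values are abstracted away, as in Fig. 2

class TreiberStack:
    def __init__(self):
        self.T = None  # shared stack pointer T; empty stack points to NULL/None
        self._lock = threading.Lock()

    def _cas(self, expected, new):
        # simulated compare-and-swap on T; atomic in real implementations
        with self._lock:
            if self.T is expected:
                self.T = new
                return True
            return False

    def push(self):
        n = Node()               # edge 0 -> 1: allocate a new node
        while True:
            t = self.T           # edge 1 -> 2: read shared stack pointer
            n.next = t           # edge 2 -> 3: link the new node
            if self._cas(t, n):  # edge 3 -> 4: CAS succeeded, push complete
                return
            # edge 3 -> 1: another thread changed T, retry

    def pop(self):
        while True:
            t = self.T
            if t is None:
                return False     # empty stack
            if self._cas(t, t.next):
                return True      # popped one node
            # interference: retry
```

A failed _cas corresponds to the bold retry edge 3 → 1: the amount of interference on T determines how often the loop body runs, which is exactly the quantity the bound analysis tracks.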

Problem statement
Consider a general data structure client P = op1() [] . . . [] opM(), where op1, . . . , opM are the data structure's operations, and [] denotes non-deterministic choice. We compose N concurrent client threads P 1 to P N accessing the data structure: Our goal is to design a procedure that automatically infers upper bounds, for all system sizes N , on
1. the thread-specific resource usage caused by a control-flow edge of a single thread P 1 when executed concurrently with P 2 ∥ · · · ∥ P N , and
2. the total resource usage caused by a control-flow edge over all threads P 1 to P N .
Remark 1 (Cost model) To measure the amount of resource usage, bound analyses are usually parameterized by a cost model that assigns each operation or instruction a cost amounting to the resources consumed. In this paper, we adopt a uniform cost model that assigns a constant cost to each control-flow edge. When we speak of the (time) complexity of a program, we adopt a specific uniform cost model that assigns cost 1 to each control-flow back edge and cost 0 to all other edges; this reflects the asymptotic time complexity of the program.
Running example Consider N concurrent copies P 1 ∥ · · · ∥ P N of the Treiber stack's client program push() [] pop(), and the push operation's control-flow edge 1 → 2 . A manual analysis yields a thread-specific bound for P 1 telling us that this edge is executed at most N times by P 1 : Each time that another thread successfully modifies stack pointer T, P 1 's copy in t may become outdated, causing the test at 3 to fail (t ≠ T), and P 1 to restart. After at most N − 1 iterations, all other threads have finished their operations and returned, and P 1 's operation succeeds without further interference. Similarly, a total bound for P 1 ∥ · · · ∥ P N tells us that edge 1 → 2 is executed at most N (N + 1)/2 times by all threads P 1 to P N in total: The first thread to successfully synchronize at 3 sees no interference and executes 1 → 2 once. The second thread may need to restart once due to the first thread modifying T, and executes 1 → 2 at most twice, etc. The last thread to synchronize has the worst-case bound we established as thread-specific bound for P 1 : it executes 1 → 2 at most N times. We obtain N (N + 1)/2 as closed form for the total bound. In the following, we illustrate how to formalize and automate this reasoning.
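The counting argument above can be replayed mechanically; the sketch below (our own illustration) checks that the per-thread worst cases sum to the closed form N (N + 1)/2:

```python
def thread_specific_bound(N):
    # a single thread retries at most once per successful push of each of
    # the other N-1 threads, so it executes edge 1 -> 2 at most N times
    return N

def total_bound(N):
    # the k-th thread to synchronize successfully at location 3
    # executes edge 1 -> 2 at most k times; sum over all threads
    return sum(k for k in range(1, N + 1))

def total_bound_closed_form(N):
    return N * (N + 1) // 2
```

For every system size the summation and the closed form agree, e.g. total_bound(4) == total_bound_closed_form(4) == 10.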

Environment abstraction
Client program N P from above is parameterized in the number of concurrent threads N . To reason about this infinite family of parallel client programs, we base our analysis on Jones' rely-guarantee reasoning [28]. For each thread, RG reasoning over-approximates the following as sets of binary relations over program states (thread-modular [18] effect summaries):
- the thread's effect on the global state (its guarantee),
- the effect of all other threads (its rely) as the union of those threads' guarantees.
The effect of all other threads (the thread's environment) is thus effectively abstracted into a single relation. Crucially, this also abstracts away how often each effect is executed by the environment, rendering Jones' RG reasoning unsuitable for concurrent bound analysis.
Running example A set of effect summaries A = {A Push , A Pop , A i, j Id } summarizes the globally visible effect of P 1 's environment P 2 ∥ · · · ∥ P N for all N > 0. In particular, we obtain one effect summary for each control-flow edge: A Push summarizes the effect of an environment thread executing edge 3 → 4 from the point of view of thread P 1 , A Pop that of 13 → 14 , and A i, j Id that of all other edges i → j . We discuss how to obtain A in Sect. 4.2.
As is, the effect summaries in A may be executed infinitely often. Our informal derivation of the bound in Sect. 2.2, however, had to determine how often other threads could interfere with the reference thread P 1 (altering pointer T) in order to bound its number of loop iterations.
Hence, we lift Jones' RG reasoning to concurrent bound analysis by enriching RG relations with bounds. We emphasize our focus on progress properties in this work: Although our framework extends Jones' RG reasoning and can express safety properties, we only use it to reason about bounds; tighter integration is left for future work.

Rely-guarantee reasoning for bound analysis
In particular, relies and guarantees in our setting are maps {A 1 → b 1 , . . . } from effect summaries A i (which are binary relations over program states) to bound expressions b i . Each relation describes an effect summary, and the bound expression describes how often that summary may occur on a run of the program.
We present a program logic for thread-modular reasoning [18] about bounds: A judgement in our logic takes the form R, G ⊢ {S} P {S′}, where {S} P {S′} is a Hoare triple, and R and G are a rely and a guarantee. Its informal meaning is: For any execution of program P starting in a state from S, with environment interference described by the relations in R and occurring at most the number of times given by the respective bounds in R, P changes the shared state according to the relations in G, at most the number of times described by the respective bounds in G. In addition, the execution is safe (does not reach an error state), and if P terminates, its final state is in S′.
Running example For readability, we focus on the analysis of Treiber's push method; the steps for pop are similar. Our technique computes exactly one effect summary for each of the method's control-flow edges, in order to express one bound per edge (Fig. 2c). For a rely or guarantee, we fix the order of effect summaries and write a tuple of bounds (b 1 , . . . , b 5 ). First, our method states the RG quintuple R, G ⊢ {Inv} P 1 {true}, where R = (∞, ∞, ∞, ∞, ∞), G = (1, ∞, ∞, ∞, 1), and Inv is a data structure invariant over shared variables in a suitable assertion language (e.g., separation logic [35]). We use invariant Inv to ensure that the computed bounds are valid for all computations starting from all legal stack configurations. Despite the unbounded environment R (which corresponds to Fig. 2c), we can already bound two edges of P 1 , 0 → 1 and 3 → 4 , and thus the corresponding effect summaries in G: These edges are not part of a loop and, despite any interference from the environment, can be executed at most once. We show how to automatically discharge (or rather, discover) such RG quintuples in Sect. 4.3. Next, we use the bound information obtained in G to refine the environment R until a fixed point of the rely is reached. This refinement is formalized in Sect. 3.5 in Theorem 2.
Running example (continued) We already established that thread P 1 can execute effect summaries A 0,1 Id and A Push at most once. In our example, all threads are symmetric, thus each of the N − 1 other threads can execute A 0,1 Id and A Push at most once as well. The abstract environment representing these N − 1 threads can thus execute each summary A 0,1 Id and A Push at most N − 1 times. We obtain the refined rely R′ = (N − 1, ∞, ∞, ∞, N − 1).
As we have reasoned in Sect. 2.2, once the number of executions of the A Push effect summary is bounded, P 1 loops only that number of times. We obtain the refined guarantee G′. By the same reasoning as above, we multiply G′ by (N − 1) (componentwise) and obtain the refined rely R″. From R″, we cannot obtain any tighter bounds, i.e., G″ = G′ is a fixed point, and we report G′ and G′ + R″ as the thread-specific and total bounds of P 1 and P 1 ∥ · · · ∥ P N , respectively (one thread-specific and one total bound per edge). We demonstrate in Sect. 5 that for more complex examples, more than two iterations of the rely-refinement are necessary to bound all edges. We formalize our reasoning by giving a compositional proof system in Sect. 3, instantiate it for pointer programs and the analysis of lock-free algorithms in Sect. 4, and experimentally evaluate our technique in Sect. 5.

Rely-guarantee bound analysis
In this section, we formalize the technique illustrated informally above. We start by stating our program model and formally define the kind of bounds we consider:

Definition 1 (Program) Let P = (L, T , 0 ) be a program, where L is a finite set of locations, 0 ∈ L is the initial location, and T ⊆ L × GC × L is a finite set of transitions labeled by guarded commands. Let S be a predicate over Var that is evaluated over program states. We overload ⟦·⟧ and write ⟦S⟧ ⊆ Σ for the set of states satisfying S. We represent executions of P as sequences of steps r ∈ Σ × T × Σ and write σ t −→ σ ′ for a step (σ, t, σ ′). A run of P from S is a sequence of steps ρ = σ 0 t 1 −→ σ 1 t 2 −→ · · · starting in an initial state σ 0 ∈ ⟦S⟧.
Given a program P over local and shared variables Var = LVar ∪ SVar, we write N P = P 1 ∥ · · · ∥ P N where N ≥ 1 for the N -times interleaving of program P with itself, where P i over Var i is obtained from P by suitably renaming local variables such that LVar 1 , . . . , LVar N are pairwise disjoint. Given a predicate S over Var, we write N S for the conjunction S 1 ∧ · · · ∧ S N where S i over Var i is obtained by the same renaming.

Definition 3 (Expression) Let Var be a set of integer program variables. We denote by Expr(Var) the set of arithmetic expressions over Var ∪ Z ∪ {∞}. The semantics function ⟦·⟧ : Expr(Var) → Σ → (Z ∪ {∞}) evaluates an expression in a given program state. We assume the usual expression semantics; in addition, a • ∞ = ∞ and a ≤ ∞ for all a ∈ Z ∪ {∞} and • ∈ {+, ×}.
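The convention a • ∞ = ∞ differs from IEEE floating-point arithmetic, where 0 × inf is NaN; a faithful Python model (our own sketch, using math.inf for ∞) therefore special-cases infinity:

```python
import math

INF = math.inf  # models the symbolic bound ∞

def ext_add(a, b):
    """a + b over Z ∪ {∞}: a + ∞ = ∞ for all a."""
    return INF if INF in (a, b) else a + b

def ext_mul(a, b):
    """a × b over Z ∪ {∞}: a × ∞ = ∞ for all a, including a = 0,
    unlike IEEE arithmetic where 0 * inf is NaN."""
    return INF if INF in (a, b) else a * b

def ext_leq(a, b):
    """a ≤ b over Z ∪ {∞}: a ≤ ∞ holds for all a."""
    return b == INF or (a != INF and a <= b)
```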

Definition 4 (Bound) Let P = (L, T , 0 ) be a program and let S be a predicate over Var describing its initial states. An expression b ∈ Expr(Var) is a bound for transition t ∈ T of P from S if, on every run ρ of P from S starting in σ 0 ∈ ⟦S⟧, transition t is executed at most ⟦b⟧(σ 0 ) times.

Given a program P = (L, T , 0 ) and predicate S over local and shared variables Var = LVar ∪ SVar, our goal is to compute a function Bound : T → Expr(SVar ∪ {N }), such that for all transitions t ∈ T and all system sizes N ≥ 1, Bound(t) is a bound for t of P 1 on all runs of N P = P 1 ∥ · · · ∥ P N from N S = S 1 ∧ · · · ∧ S N . That is, Bound gives us the thread-specific bounds for transitions of P 1 . In Sect. 3.5, we explain how to obtain total bounds on N P from that.

Extending rely-guarantee reasoning for bound analysis
To analyze the infinite family of programs N P = P 1 · · · P N , we abstract P 1 's environment P 2 · · · P N : We define effect summaries which provide an abstract, thread-modular view of transitions by abstracting away local variables and program locations.

Definition 5 (Effect summary) Let Σ S be a set of program states over shared variables SVar.
An effect summary A ⊆ Σ S × Σ S over SVar is a binary relation over shared program states. Where convenient, we treat an effect summary A as a guarded command whose effect ⟦A⟧ is exactly A.
Sound effect summaries over-approximate the state transitions of the program they abstract: Definition 6 (Soundness of effect summaries) Let P = (L, T , 0 ) be a program over local and shared variables Var = LVar ∪ SVar, and let S over Var be a predicate describing P's initial states. We denote by Effects(P, S) the state transitions reachable by P from program location 0 and all initial states σ 0 ∈ ⟦S⟧, projected onto shared variables SVar.
Let A over SVar be a finite set of effect summaries, and let A * denote all sequentially composed programs of effect summaries in A (its Kleene iteration). A is sound for P from S if Effects(P ∥ A * , S) ⊆ Effects(A * , S).
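For finite shared-state domains, the inclusion Effects(P ∥ A * , S) ⊆ Effects(A * , S) can be checked by brute force. The sketch below uses a toy counter program and summary of our own devising (not from the paper) and checks that every program step taken from a state reachable under arbitrary A-interference is covered by some summary:

```python
# Shared state: a single counter in {0, 1, 2, 3}.
STATES = range(4)

# Toy program effect: the guarded command (x < 3) |> x := x + 1,
# given directly as its set of shared-state transitions.
prog_steps = {(x, x + 1) for x in STATES if x < 3}

# Candidate effect summary A_inc: "increment the counter by one".
A_inc = {(x, x + 1) for x in STATES if x < 3}

def reachable(initial, summaries):
    """Shared states reachable via A* (Kleene iteration of the summaries)."""
    steps = {p for A in summaries for p in A}
    seen = set(initial)
    frontier = set(initial)
    while frontier:
        frontier = {t for (s, t) in steps if s in frontier} - seen
        seen |= frontier
    return seen

def summaries_sound(initial, prog_steps, summaries):
    """Finite-state stand-in for Effects(P || A*, S) ⊆ Effects(A*, S):
    every program step from an A*-reachable state must be covered by
    some summary."""
    reach = reachable(initial, summaries)
    covered = {p for A in summaries for p in A}
    return all((s, t) in covered for (s, t) in prog_steps if s in reach)
```

Here summaries_sound({0}, prog_steps, [A_inc]) holds, while dropping the transition (2, 3) from the summary makes the check fail.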
In Sect. 4.2 we show how to compute A in a preliminary analysis step such that it overapproximates P (or P 1 P 2 ). We extend the above notion of soundness of effect summaries to parallel composition and the parameterized case in Lemma 1 and Corollary 1 below. Intuitively, if the effects of each individual program P 1 , P 2 , . . . interleaved with A * are included in effects of A * , then so are the effects of their parallel composition. It is thus sufficient to check soundness for a finite number of programs and still obtain sound summaries of parameterized systems.

Lemma 1
Let P be a program over local and shared variables Var = LVar ∪ SVar and let S be a predicate over Var describing its initial states. Let P 1 , P 2 , . . . , P N be programs over variables Var 1 , Var 2 , . . . , Var N obtained by renaming local variables in P such that P 1 , P 2 , . . . , P N do not share local variables, i.e., their local variable sets are pairwise disjoint. Further, let S 1 , S 2 , . . . , S N be predicates obtained from S using the same renaming. Let A be a sound set of effect summaries for P from S. If Effects(P i ∥ A * , S i ) ⊆ Effects(A * , S i ) for each 1 ≤ i ≤ N , then Effects((P 1 ∥ · · · ∥ P N ) ∥ A * , S 1 ∧ · · · ∧ S N ) ⊆ Effects(A * , S 1 ∧ · · · ∧ S N ).

Corollary 1 In particular, if A is sound for P from S, then A is sound for the parameterized system N P = P 1 ∥ · · · ∥ P N from N S for all N ≥ 1.
Effect summaries are capable of expressing relies and guarantees in Jones' RG reasoning (cf. Sect. 1.4). In the following, we extend this notion to bound analysis by equipping each effect summary with a bound expression. We call these extended interference specifications environment assertions: Definition 7 (Environment assertion) Let A = {A 1 , . . . , A n } be a finite set of effect summaries over shared variables SVar. Let N be a symbolic parameter describing the number of threads in the system. An environment assertion E A : A → Expr(SVar ∪ {N }) over A is a function that maps effect summaries to bound expressions over SVar and N . We omit A from E A wherever it is clear from the context.
We use sequences a of effect summaries to describe interference: Intuitively, the bound E A (A) describes how often summary A ∈ A is permissible in such a sequence. Finally, we define rely-guarantee quintuples over environment assertions as the specifications in our compositional proofs: Definition 8 (Rely-guarantee quintuple) We abstract environment threads of interleaved programs as rely-guarantee quintuples (RG quintuples) of either form R, G ⊢ {S} P {S′} or R, (G 1 , G 2 ) ⊢ {S} P 1 ∥ P 2 {S′}, where P and P 1 ∥ P 2 are programs, S and S′ are predicates such that ⟦S⟧ ⊆ Σ are initial program states and ⟦S′⟧ ⊆ Σ are final program states, and rely R and guarantees G and G 1 , G 2 are environment assertions over a finite set of effect summaries A.
In particular, R abstracts P's or P 1 ∥ P 2 's environment. The guarantees G and (G 1 , G 2 ) allow us to express both thread-specific and total bounds on interleaved programs: The guarantee G of quintuple R, G ⊢ {S} P 1 ∥ P 2 {S′} contains total bounds for P 1 ∥ P 2 , while the guarantees G 1 , G 2 of R, (G 1 , G 2 ) ⊢ {S} P 1 ∥ P 2 {S′} contain the respective thread-specific bounds of threads P 1 and P 2 .
Note that the relies and guarantees of a single RG quintuple are defined over the same set of effect summaries A. This is not a limitation: in case we had different sets of effect summaries A and A , we can always use their union A ∪ A and set the respective bounds to zero.
Remark 2 (Notation of environment assertions) We choose to write relies and guarantees as functions over A as it simplifies notation throughout the paper. The reader may prefer to think of environment assertions {A 1 → b 1 , . . . } as sets of pairs of an effect summary and a bound { (A 1 , b 1 ), . . . }, in contrast to just a set of effect summaries {A 1 , . . . } as in Jones' RG reasoning.

Trace semantics of rely-guarantee quintuples
We model executions of RG quintuples as traces, which abstract runs of the concrete system. This allows us to over-approximate bounds by considering the traces induced by RG quintuples.
Definition 9 (Trace) Let P = P 1 ∥ P 2 be a program over local and shared variables Var = LVar ∪ SVar. Further, let S be a predicate over Var and let A be a finite sound set of effect summaries for P from S. We represent executions of P interleaved with effect summaries in A as sequences of trace transitions δ ∈ (L × Σ) × (L × Σ ∪ {⊥}) × {1, 2, e} × A, where the first two components define the change in program location and state, the third component defines whether the transition was taken by program P 1 (1), P 2 (2), or the environment (e), and the last component defines which effect summary encompasses the state change. We write a trace transition δ = ((ℓ, σ ), (ℓ′, σ ′), α, A) as (ℓ, σ ) −→ α,A (ℓ′, σ ′). A trace τ of program P starts in a pair (ℓ 0 , σ 0 ) of initial program location and state, and is a (possibly empty) sequence of trace transitions. Let |τ | ∈ (N 0 ∪ {∞}) denote the number of transitions of τ . We define the set of traces of program P as the set traces(S, P) such that for all τ ∈ traces(S, P), we have σ 0 ∈ ⟦S⟧, and trace τ 's i th transition (0 < i ≤ |τ |) either is labeled α ∈ {1, 2} and corresponds to a step of thread P α whose shared-state change is encompassed by its effect summary A, or is labeled e and its state change is described by some effect summary A ∈ A. The projection τ ↓ C of a trace τ ∈ traces(S, P) to components C ⊆ {1, 2, e} is the sequence of effect summaries defined as the image of τ under the homomorphism that maps ((ℓ, σ ), (ℓ′, σ ′), α, A) to A if α ∈ C, and otherwise to the empty word.
We now define the meaning of RG quintuples over traces. Given an environment assertion E A over effect summaries A, interference by an effect summary A ∈ A is described by E A (A), giving an upper bound on how often A can interfere.

Fig. 3 Rely/guarantee proof rules for bound analysis. We write G for either G or (G 1 , G 2 ). In the latter case, ⊆ is applied componentwise.

Definition 10 (Validity) Let A be a finite set of effect summaries over shared variables SVar, let A ∈ A be an effect summary, and let a be a finite or infinite word over effect summaries A. Let E A be an environment assertion over A, and let σ ∈ Σ S be a program state over SVar. We overload #(A, a) ∈ N 0 ∪ {∞} to denote the number of times A appears on a and define a ⊨ σ E A iff #(A, a) ≤ ⟦E A (A)⟧(σ ) for all A ∈ A. We define R, G ⊨ {S} P {S′} iff for all traces τ ∈ traces(S, P) such that τ starts in state σ 0 ∈ ⟦S⟧ and τ ↓ {e} ⊨ σ 0 R (τ 's environment transitions satisfy the rely):
- if τ is finite and ends in (ℓ, σ ) for some ℓ, then σ ≠ ⊥ (the program is safe) and σ ∈ ⟦S′⟧ (the program is correct), and
- τ 's program transitions satisfy the guarantee: τ ↓ {1, 2} ⊨ σ 0 G. Analogously, R, (G 1 , G 2 ) ⊨ {S} P 1 ∥ P 2 {S′} additionally requires τ ↓ {1} ⊨ σ 0 G 1 and τ ↓ {2} ⊨ σ 0 G 2 .

Proof rules for rely-guarantee bound analysis
Inspired by Jones' proof rules for safety [28,42] (cf. Fig. 1) and the rely-guarantee rules for liveness and termination in [15], we propose inference rules to facilitate reasoning about our bounded RG quintuples. First, we define the addition and multiplication of environment assertions, as well as the subset relation over them: Definition 11 (Operations and relations on environment assertions) Let A be a finite set of effect summaries over shared variables SVar, let A ∈ A be an effect summary, and let E A and E′ A be environment assertions over A. Let σ ∈ Σ S be a program state over SVar and let e ∈ Expr(SVar) be an expression over SVar. For all effect summaries A ∈ A we define (E A + E′ A )(A) = E A (A) + E′ A (A) and (e × E A )(A) = e × E A (A). Further, let S be a predicate over SVar. We define E A ⊆ E′ A iff ⟦E A (A)⟧(σ ) ≤ ⟦E′ A (A)⟧(σ ) for all A ∈ A and all σ ∈ ⟦S⟧.
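For a fixed program state (so that bound expressions are already evaluated to numbers), the operations of Definition 11 can be modeled directly on dictionaries; the sketch below is our own, with math.inf standing in for ∞:

```python
import math

def ea_add(E1, E2):
    """(E1 + E2)(A) = E1(A) + E2(A), summary-wise."""
    return {A: E1[A] + E2[A] for A in E1}

def ea_mul(e, E):
    """(e × E)(A) = e × E(A); per the paper's expression semantics,
    e × ∞ = ∞ even for e = 0, so infinity is special-cased."""
    return {A: math.inf if E[A] == math.inf else e * E[A] for A in E}

def ea_subset(E1, E2):
    """E1 ⊆ E2 iff E1(A) ≤ E2(A) for every effect summary A."""
    return all(E1[A] <= E2[A] for A in E1)
```

For the running example, ea_mul(N - 1, G) yields the environment contribution of the N − 1 other threads, and ea_subset orders it below an unbounded rely.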
Proof rules. The proof rules for our extended RG quintuples, using environment assumptions to specify interference, are shown in Fig. 3: -Par interleaves two threads P 1 and P 2 and expresses their thread-specific guarantees in (G 1 , G 2 ). -Par-Merge combines thread-specific guarantees (G 1 , G 2 ) into a total guarantee G 1 + G 2 .
-Conseq is similar to the consequence rule of Hoare logic or RG reasoning: it allows us to strengthen the precondition and rely, and to weaken the postcondition and guarantee(s).
Keeping rules Par and Par-Merge separate is not only useful to express thread-specific bounds, but sometimes necessary to carry out the proofs below.
Leaf rules of the proof system. Note that our proof system comes without leaf rules. We offload the computation of correct guarantees G from a given program P, a precondition S, and a rely R to a bound analyzer (cf. Sect. 4.3). From this, we can immediately state valid RG quintuples R, G ⊨ {S} P {S′} for sequential programs and use the rules from Fig. 3 only to infer guarantees on the parallel composition of programs.
Relation to Jones' original RG rules. Note that our proof rules are a natural extension of Jones' original RG rules (Fig. 1): If we replace set union ∪ with addition of environment assumptions + (Definition 11) and the standard subset relation ⊆ with our overloaded one on environment assumptions (Definition 11), Jones' rule J-Par equals the composed application of Par and Par-Merge, and Jones' J-Conseq equals our Conseq rule.
Postconditions of RG quintuples. Although our proof rules allow us to infer both bounds (in the guarantees) and safety (through the postconditions), in this work we focus on the former. We still write postconditions because our proof rules are sound with them, and because this notation is already familiar to many readers. As postconditions are not relevant for inferring bounds in this work, they default to true in the examples below.

Theorem 1 (Soundness) The rules in Fig. 3 are sound.
Proof We give an intuition here and refer the reader to Appendix A for the full proof.
Proof sketch: We build on the trace semantics of Definition 9. For each rule Par, Par-Merge, Conseq we assume validity (Definition 10) of the rule's premises. We then consider a trace τ of the program in the conclusion, such that it satisfies the judgement's precondition and rely (i.e., the premises of validity), and show that the trace also satisfies the judgement's guarantee and postcondition.
-For rule Par, we prove satisfaction of the guarantee by induction on the length of a trace τ ∈ traces(S 1 ∧ S 2 , P 1 ∥ P 2 ) and by case-splitting on the labeling of the last transition.

The proof rules in Fig. 3 together with procedure SynthG defined below allow us to compute rely-guarantee bounds for the parallel composition of a fixed number of threads.
Definition 12 (Synthesis of guarantees) Let SynthG(S, P, R) be a procedure that takes a predicate S, a non-interleaved program P, and a rely R and computes a guarantee G, such that R, G ⊨ {S} P {true} holds. Further, let procedure SynthG be monotone, i.e., for all predicates S and programs P, if R′ ⊆ R then SynthG(S, P, R′) ⊆ SynthG(S, P, R).
For now, we assume that SynthG exists. We give an implementation in Sect. 4.3.

Running example
We show how to infer bounds for two threads P 1 ∥ P 2 concurrently executing Treiber's push method. Let 0 = (0, . . . , 0) denote the empty environment. Our goal is to find valid premises for rule Par (Fig. 3) to conclude 0, (G 1 , G 2 ) ⊢ {Inv} P 1 ∥ P 2 {true}: That is, in an otherwise empty environment (rely R = 0), when run as P 1 ∥ P 2 , each thread has the bounds given in G 1 and G 2 . Recall from Sect. 2.4 that Inv is a data structure invariant over shared variables. We assume its existence for now and describe its computation in Sect. 4.1.
Since R is empty, the premises of rule Par become G2, G1 ⊨ {Inv} P1 {true} and G1, G2 ⊨ {Inv} P2 {true}. Assuming a rely G2 that soundly over-approximates P2 in an environment of P1, we can compute G1 as G1 = SynthG(Inv, P1, G2). As this argument is circular, the only sound assumption we can make at this point is to let G2 = (∞, ∞, ∞, ∞, ∞), i.e., to assume that P2 interferes up to infinitely often with P1.

Extension to parameterized systems and automation
The proof rules given in Sect. 3.4 allow us to infer bounds for systems composed of a fixed number of threads. We now turn towards deriving bounds for parameterized systems, i.e., systems with a finite but unbounded number N of concurrent threads N P = P1 ∥ · · · ∥ PN. To this end, we use the proof rules from Sect. 3.4 to derive the symmetry argument stated in Theorem 2 below: It allows us to switch the roles of reference thread and environment, i.e., to infer bounds on P2 ∥ · · · ∥ PN in an environment of P1 from already computed bounds on P1 in an environment of P2 ∥ · · · ∥ PN.

Theorem 2 (Generalization of single-thread guarantees) Let P be a program over local and shared variables Var = LVar ∪ SVar and let N P = P1 ∥ · · · ∥ PN be its N-times interleaving. Let S be a predicate over SVar. Let A over SVar be a sound set of effect summaries for P started from S, and let R and G be environment assertions over A. Let 0 = (0, . . . , 0) denote the empty environment. If (N − 1) × G ⊆ R and R, G ⊨ {S} P1 {true}, then 0, (N − 1) × G ⊨ {S} P2 ∥ · · · ∥ PN {true}. I.e., if (N − 1) × G is smaller than R, and if R, G ⊨ {S} P1 {true} holds, then in an empty environment, P1's environment P2 ∥ · · · ∥ PN executes effect summaries A no more than (N − 1) × G times.

Algorithm 1: Parameterized bound analysis
Input: A program P over effect summaries A, and an initial state S.
Output: Guarantees G1 and G2, such that 0, (G1, G2) ⊨ {S} P1 ∥ (P2 ∥ · · · ∥ PN) {true}.
Proof We give an intuition here and refer the reader to Appendix B for the full proof.
Proof sketch: We prove the property by induction for k threads up to a total of N. The main idea is to keep the effect of these k threads, k × G, in the guarantee, and the effect of the remaining N − k threads, (N − k) × G, in the rely. For the induction base (k = 2), we apply rule Conseq to the premises of Theorem 2 and obtain the interleaved guarantees of the two threads using rule Par. In the induction step, we add a (k + 1)-th thread using rule Par and merge the guarantees using Par-Merge. Finally, for k = N we get an empty environment 0 in the rely, and N × G in the guarantee.

Algorithm 1 shows our procedure for rely-guarantee bound computation of parameterized systems. It uses Theorem 2 and procedure SynthG (Definition 12) to compute the bound of a parameterized system P1 ∥ (P2 ∥ · · · ∥ PN) as the greatest fixed point of environment assertions ordered by ⊆. It alternates between
1. computing a guarantee G1 for P1 in R, G1 ⊨ {S} P1 {true} (Line 2), and
2. inferring a guarantee G2 for P2 ∥ · · · ∥ PN in 0, G2 ⊨ {S} P2 ∥ · · · ∥ PN {true} (Line 3).
Intuitively, if R in step 1 overapproximates the effects of P2 ∥ · · · ∥ PN, then G1 is a valid guarantee for P1 in an environment of P2 ∥ · · · ∥ PN. In step 2, our algorithm uses Theorem 2 to generalize this guarantee G1 on P1 in an environment of P2 ∥ · · · ∥ PN to a guarantee G2 on P2 ∥ · · · ∥ PN in an environment of P1. Theorem 3 below formalizes this argument.
Finally, if the algorithm reaches a fixed point, it returns the results of the analysis: 1. Thread-specific bounds of P 1 are directly returned as G 1 .
2. For total bounds of P1 ∥ · · · ∥ PN, apply rule Par-Merge to G1 and G2 to sum up the guarantees of P1 and P2 ∥ · · · ∥ PN.
- For each subsequent iteration, let G1′, G2′, R′ refer to the variables' evaluation in the previous iteration. We have R = G2′ = (N − 1) × G1′ ⊆ R′. Since by assumption SynthG is monotonically decreasing, from R ⊆ R′ we have G1 ⊆ G1′ and thus (N − 1) × G1 ⊆ R.
Termination. From the above, we have that the evaluations of G 1 (and G 2 , R, respectively) are strictly decreasing in each iteration. The lattice of environment assertions ordered by ⊆ is finite and bounded from below by the least element (0, . . . , 0). Thus no infinitely descending chains of evaluations of G 1 exist and Algorithm 1 terminates.
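The iteration structure above can be sketched in a few lines of Python. The procedure synth_g below is a mock that hard-codes the guarantees SynthG computes for Treiber's push in the running example (it is not a real bound analyzer); the loop mirrors Lines 1-3 of Algorithm 1: start from an unbounded rely, synthesize a per-thread guarantee, generalize it via Theorem 2, and stop when the rely no longer shrinks.

```python
import math

INF = math.inf

def synth_g(rely):
    # Mock of SynthG for Treiber's push (values from the running example):
    # the CAS-success and return edges are always bounded by 1; the
    # remaining edges become bounded once the environment's successful
    # pushes (last component of the rely) are bounded.
    r = rely[4]
    if r == INF:
        return (1, INF, INF, INF, 1)
    return (1, r + 1, r + 1, r, 1)

def algorithm1(n):
    rely = (INF,) * 5                          # Line 1: unbounded environment
    while True:
        g1 = synth_g(rely)                     # Line 2: guarantee for P1
        g2 = tuple((n - 1) * b for b in g1)    # Line 3: Theorem 2
        if g2 == rely:                         # fixed point reached
            return g1, g2
        rely = g2                              # refine the rely and iterate

g1, g2 = algorithm1(4)                         # N = 4 threads
assert g1 == (1, 4, 4, 3, 1)
assert g2 == (3, 12, 12, 9, 3)
```

Because every refinement shrinks the rely pointwise and the mock guarantee is monotone in the rely, the loop reaches a fixed point after two refinements, matching the iteration count reported for treiber in Sect. 5.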
Running example. Let us return to the task of computing bounds for N threads N P = P1 ∥ · · · ∥ PN concurrently executing Treiber's push method. Our method starts from the RG quintuple with unknown guarantee "?":

0, ? ⊨ {Inv} P1 ∥ (P2 ∥ · · · ∥ PN) {true} (5)

Recall from Sect. 2.4 that Inv is a data structure invariant over shared variables. We assume its existence for now and describe its computation in Sect. 4.1.

Algorithm 1 starts by computing a correct-by-construction guarantee for the RG quintuple in (5): It summarizes P1's environment P2 ∥ · · · ∥ PN in the rely R. At this point, it cannot safely assume any bounds on P2 ∥ · · · ∥ PN, and thus on R. Therefore, it lets R = (∞, ∞, ∞, ∞, ∞) (Line 1 of Algorithm 1), which amounts to stating the query from (5) above as R, ? ⊨ {Inv} P1 {true}. Next, Line 2 of Algorithm 1 runs the RG bound analysis procedure SynthG. As we have argued in Sect. 2.4, this yields SynthG(Inv, P1, R) = (1, ∞, ∞, ∞, 1), i.e., we have

R, (1, ∞, ∞, ∞, 1) ⊨ {Inv} P1 {true} (7)

At this point, our method cannot establish tighter bounds for P1 unless it obtains tighter bounds for its environment P2 ∥ · · · ∥ PN and thus R. In Sect. 2.4, we informally argued that if G = (1, ∞, ∞, ∞, 1) is a guarantee for P1, then (N − 1) × G = (N − 1, ∞, ∞, ∞, N − 1) must be a guarantee for the N − 1 threads in P1's environment P2 ∥ · · · ∥ PN. Line 3 of Algorithm 1 applies Theorem 2 to (7) and obtains 0, (N − 1, ∞, ∞, ∞, N − 1) ⊨ {Inv} P2 ∥ · · · ∥ PN {true}. From the above, we have that (N − 1, ∞, ∞, ∞, N − 1) is a bound for P1's environment P2 ∥ · · · ∥ PN when run in parallel with P1. Going back to the RG quintuple (5), our technique refines the rely R, which models P2 ∥ · · · ∥ PN, by letting R = G2 = (N − 1, ∞, ∞, ∞, N − 1). This means that we can refine our query for a guarantee from above to R, ? ⊨ {Inv} P1 {true} with this new R, iterating our fixed point search. This second iteration again runs SynthG, which returns (1, N, N, N − 1, 1).
Thus, we have

R, (1, N, N, N − 1, 1) ⊨ {Inv} P1 {true} (10)

and by Theorem 2 we have 0, (N − 1) × (1, N, N, N − 1, 1) ⊨ {Inv} P2 ∥ · · · ∥ PN {true}. Another refinement of R from G2 and another run of SynthG again gives the guarantee (1, N, N, N − 1, 1). This time, the guarantee has not improved any further over the one in (10), i.e., our method has reached a fixed point and stops the iteration. Applying Theorem 2 gives G1 = (1, N, N, N − 1, 1) and G2 = (N − 1) × G1, of which (G1, G2) are returned as the algorithm's result.
To compute thread-specific bounds for the transitions of P1, our method may stop here; the bounds can be read off G1. For example, the fourth component of G1 indicates that back edge ℓ3 → ℓ1 is executed at most N − 1 times. Note that according to Remark 1, this gives an upper bound on the asymptotic time complexity of the corresponding loop.
To compute total bounds for the transitions of the whole interleaved system P1 ∥ · · · ∥ PN, our technique simply applies rule Par-Merge, which gives

G = (N, N², N², (N − 1)N, N) (14)

Again, bounds can be read off G; for example, the second component indicates that transition ℓ1 → ℓ2 is executed at most N² times by all N threads in total.
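The total bound in (14) is the pointwise sum G1 + (N − 1) × G1 = N × G1, evaluated componentwise. A quick check (our own sketch, not part of the tool) confirms the closed form for a range of values of N:

```python
def totals(n):
    """Par-Merge of the per-thread and environment guarantees for push."""
    g1 = (1, n, n, n - 1, 1)                     # per-thread guarantee
    g2 = tuple((n - 1) * b for b in g1)          # environment: (N-1) x G1
    return tuple(a + b for a, b in zip(g1, g2))  # pointwise sum

# Componentwise, G1 + (N-1) x G1 = N x G1 = (N, N^2, N^2, (N-1)N, N):
for n in range(1, 10):
    assert totals(n) == (n, n * n, n * n, (n - 1) * n, n)
```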

Application: proving that non-blocking algorithms have bounded progress
In Sects. 1 and 2, we presented our motivation for computing bounds of non-blocking algorithms and data structures in order to prove bounded lock-freedom. Accordingly, we instantiate Algorithm 1's inputs: the precondition S, the set of effect summaries A, and the black-box method SynthG. This leaves Algorithm 1 parameterized only by the program P, i.e., the non-blocking algorithm to analyze. In particular, we pass S, A, and SynthG as:

1. A suitable data structure invariant Inv to use as a precondition in RG quintuples RA, GA ⊨ {Inv} P {true}.
2. A finite set of effect summaries A as the domain of thread-modular environment assertions RA and GA.
3. An implementation of the bound analyzer SynthG(Inv, P, RA).
Variants of the above have been discussed throughout the literature. In this section, we show how we adapt and combine these techniques for our purpose.

Data structure invariants via shape analysis
A method manipulating a data structure may usually start executing in any legal configuration of the data structure.
Running example. For example, the push method of Treiber's stack may be called on an empty stack, or on a stack containing some number of elements (Fig. 4).
Thus, our goal is to compute bounds that are valid for all computations starting from all memory configurations the data structure may be in. Given a program P = (L, T, ℓ0), a thread-modular shape analysis (e.g., [10,11,21]) computes a symbolic data structure invariant Inv that describes all possible memory configurations (when projected onto shared variables) that the parameterized program N P = P1 ∥ · · · ∥ PN may reach.

Effect summary generation
The second ingredient to computing progress bounds for non-blocking algorithms is the generation of thread-modular effect summaries (Definition 5) that over-approximate the effect of threads on the global state. Many methods for obtaining effect summaries have been described in the literature. Using the nomenclature from [26], these can be grouped into three different approaches:
- The merge-and-project approach (e.g., [7,19,28,29]) first merges reachable, partial (from the point of view of a specific thread) program states, lets one thread perform a sequential step, and then projects the result onto what is seen by other threads.
- The learning approach (e.g., [32,41]) uses symbolic execution embedded in a fixed point computation to infer symbolic update patterns on the shared program state.
- Finally, the effect summary approach [26] discovers a stateless summary program that over-approximates the analyzed program's effects on the shared program state.
We follow the effect summary approach. Holík et al. [26] demonstrate how to compute such effect summaries using a heuristic based on copy propagation and program slicing, followed by a simple soundness check. We obtain A = {A1, . . . , Am} as a stateless program Stateless(A), i.e., a program with a single control location and one self-loop transition per summary Ai. In addition to A, this method outputs a function EffectOf : A → 2^T that maps an effect summary to the transitions it abstracts (cf. Fig. 2c). Since we are interested in computing bounds per transition, we compute one effect summary per transition of the original program. In general, coarser effect summaries may be chosen.

Rely-guarantee bound analysis: procedure SYNTHG
Finally, we present our bound analysis procedure SynthG(S, P, R): Given a precondition S, a program P, and a rely R over effect summaries A, it computes bounds for the transitions of P in an environment of R when started in a state in S. SynthG proceeds in the following way:

1. It instruments the stateless effect summary program Stateless(A) with additional counters to allow only runs that obey the bounds given by R. Call the resulting program Instr(R), and let I = P ∥ Instr(R) be the interleaving of the program P to analyze and its environment. Note that according to the product constructions of Definition 2, I again is a (sequential) program.
2. Most sequential bound analyzers target integer programs. Thus, as an intermediate step, our method translates program I = P ∥ Instr(R) into an equivalent (bisimilar) integer program Î.
3. Finally, we use an off-the-shelf bound analyzer for sequential integer programs to obtain bounds on Î. Note that bounds on transitions of Î that correspond to transitions of P are bounds for P in an environment of R.

Our main insight is that constructing the interleaved program P ∥ Instr(R) yields just a sequential program that can be given to a sequential bound analyzer, thus reducing RG bound analysis to the sequential case. We describe each of the above steps in further detail:

Recall from Sect. 4.2 that we obtain the finite set of effect summaries A as a stateless program Stateless(A). Our method instruments Stateless(A) with fresh counter variables ξ A i to enforce the bounds in R:
Let Instr(RA) = ({ℓ}, T, ℓ) be the program over additional variables ξA1, . . . , ξAm and a fresh location ℓ with initial states g0, where, like Stateless(A), T contains one transition per effect summary. The definition of each transition's guarded command and of the initial states g0 depends on whether effect summary A is bounded by R: if R bounds A by some b ≠ ∞, then g0 initializes ξA to b, and A's transition tests ξA > 0 and decrements ξA before executing A's update; if R leaves A unbounded, A's transition executes A's update unconditionally.
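As an illustration (our own Python sketch, not the tool's code), Instr(R) can be modelled as a single-location transition system whose transitions fire an effect summary only while its counter is positive; summaries that R leaves unbounded need no effective guard. The summary names below are hypothetical.

```python
import math

INF = math.inf

def make_instr(rely):
    """Build Instr(R): one counter xi_A per summary, initialized from R."""
    xi = dict(rely)                  # counter valuation, i.e., states g0
    def step(a):
        # Guarded command for summary a: test and decrement the counter,
        # or fire unconditionally if R does not bound a.
        if xi[a] == INF:
            return True              # unbounded summary: always enabled
        if xi[a] > 0:
            xi[a] -= 1
            return True
        return False                 # guard xi_a > 0 fails: a is disabled
    return step

rely = {"PushSucc": 3, "PushFail": INF}     # hypothetical summaries, N-1 = 3
step = make_instr(rely)
assert sum(step("PushSucc") for _ in range(10)) == 3  # fires at most 3 times
assert all(step("PushFail") for _ in range(10))       # unbounded summary
```

Every run of this environment obeys the bounds in R by construction, which is exactly what makes the product P ∥ Instr(R) amenable to sequential bound analysis.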

Translation to integer programs
Our goal is to analyze the sequential pointer program I = P ∥ Instr(R). To make use of the wide range of existing sequential bound analyzers for integer programs (e.g., [4,6,9,12,20,23,36]), our method translates the pointer program I into an equivalent integer program Î: Using the technique of [8], our algorithm translates the interleaved program with pointers I = P ∥ Instr(R) and the predicate Inv ∧ g0 into a bisimilar integer program Î and a corresponding predicate over its integer variables. Alternatively, one could directly compute bounds on the pointer program I using techniques such as those described in [3,17,37].

Off-the-shelf bound analysis
Note that Î is a sequential integer program that can be given to an off-the-shelf sequential bound analyzer. We require the bound analyzer to be sound (i.e., it only reports transition bounds that hold for all runs of the program), but not necessarily complete (i.e., it may fail to bound a transition, even though a bound exists). The latter is expected due to the undecidable nature of (even sequential) bound analysis, and causes our analysis to be incomplete as well. Let T̂ denote the transitions of Î. Our method runs the sequential bound analyzer on Î with initial states Inv ∧ g0, which computes a function SeqBound : T̂ → Expr(VarZ ∪ {N, ∞}), such that for all t ∈ T̂ and all N ≥ 1, SeqBound(t) is a bound for t on all runs of Î from Inv ∧ g0.
Then, our technique maps the bounds obtained on transitions of Î back to the corresponding transitions of P in I = P ∥ Instr(R), which allows it to compute the desired guarantee for P: Letting G(A) be the sum of SeqBound over the Î-transitions corresponding to EffectOf(A), for all A ∈ A, gives a guarantee G for R, ? ⊨ {Inv} P {true}, i.e., we have R, G ⊨ {Inv} P {true} as required from procedure SynthG. Thus, we have reduced RG bound analysis to sequential bound analysis.
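This final mapping step can be sketched as follows (our own illustration; the transition names, the one-to-one EffectOf mapping, and the numeric bounds — the running-example values with N fixed to 4 — are all assumptions of the sketch; the real analysis returns symbolic expressions over N):

```python
# Sequential bounds computed on the integer program for each of its
# transitions (here keyed by name, with N fixed to 4):
seq_bound = {"t1": 1, "t2": 4, "t3": 4, "t4": 3, "t5": 1}

# EffectOf maps each effect summary to the P-transitions it abstracts
# (a hypothetical one-to-one mapping for this sketch):
effect_of = {"A1": ["t1"], "A2": ["t2"], "A3": ["t3"],
             "A4": ["t4"], "A5": ["t5"]}

def guarantee(seq_bound, effect_of):
    """G(A) = sum of SeqBound(t) over the transitions t abstracted by A."""
    return {a: sum(seq_bound[t] for t in ts) for a, ts in effect_of.items()}

G = guarantee(seq_bound, effect_of)
assert tuple(G[a] for a in ("A1", "A2", "A3", "A4", "A5")) == (1, 4, 4, 3, 1)
```

With coarser summaries, EffectOf would map one summary to several transitions, and the summation over EffectOf(A) is what keeps the resulting guarantee sound.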

Remark 3 (Bounds over parameters)
Note that the instrumentation step Instr(RA) can introduce additional global variables (like N) as initialization of the instrumentation counters ξAi. This allows the sequential bound analyzer to find bounds over parameters that it would otherwise not know about.
Running example. Assume that we are in our second iteration of computing bounds for N concurrent copies of Treiber's push method P, i.e., we are now looking to compute a guarantee for R, ? ⊨ {Inv} P1 {true} with R = (N − 1, ∞, ∞, ∞, N − 1). For space reasons, we restrict ourselves to the case where N P is started from a non-empty stack.
Instrumentation of bounds. Recall from Sect. 2 that the effect summary A for push is the one shown in Fig. 5a. Our method starts by instrumenting the bounds from R = (N − 1, ∞, ∞, ∞, N − 1) into effect summary A. We obtain Instr(RA) as the stateless program shown in Fig. 5b.

Translation to integer programs. Next, using the technique of [8], we transform I = push() ∥ Instr(RA) into the bisimilar integer program Î shown in Fig. 7: Solid lines correspond to push, dashed lines to the effect summary APush. We omit transitions corresponding to identity actions (Id : skip) in Fig. 5b. Applying the technique of [8] also yields the corresponding initial states. Intuitively, each integer variable xi corresponds to the length of an uninterrupted list segment between two pointers: Consider Fig. 6. Applying the technique of [8] would abstract the depicted program state into a state over two integer variables {x1 → 1, x2 → 4}, where the valuations of the variables correspond to the length of the list segment between n and T (x1 = 1) and between T and ⊥ (x2 = 4). The mapping of pointers to integer variables {n → x1, T → x2} and the next-list-segment relation n → T → ⊥ are encoded into the control locations of the integer program by [8] and omitted from Fig. 7 for space reasons.
Off-the-shelf bound analysis. Note that Î in Fig. 7 contains a single loop (i.e., a strongly connected component) formed by program locations ℓ4–ℓ12. Also note that each cycle through the loop contains an edge corresponding to APush, and that the guarded command on each such edge tests that ξPush > 0 and decrements ξPush by one. Since ξPush is initialized to N − 1 and is nowhere incremented, our bound analysis procedure concludes that paths inside the loop execute at most N − 1 times. Edges outside the loop are taken at most once. Finally, summing up the respective bounds into a guarantee gives G = (1, N, N, N − 1, 1).
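The ranking argument of this paragraph can be replayed concretely (a sketch under the stated initialization, not the analyzer itself): ξPush starts at N − 1, is only ever decremented, and guards every cycle through the loop, so the loop body runs at most N − 1 times.

```python
def instrumented_push_loop(n):
    """Count retries of the CAS loop when every retry is caused by one
    firing of the (counter-guarded) A_Push summary."""
    xi_push = n - 1          # counter initialized from R: at most N-1 pushes
    retries = 0
    while True:
        # Each cycle through locations l4..l12 contains an A_Push edge
        # guarded by xi_push > 0; if the guard fails, the environment can
        # no longer interfere, the CAS succeeds, and the loop exits.
        if xi_push > 0:
            xi_push -= 1     # environment interferes: CAS fails, retry
            retries += 1
        else:
            return retries

assert instrumented_push_loop(4) == 3    # at most N-1 retries
assert instrumented_push_loop(1) == 0    # no environment, no retries
```

A sequential bound analyzer finds the same bound automatically because ξPush is a ranking function for the loop.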

Experimental evaluation
In this section, we report on our implementation of the method of Sect. 4 and experiments that we perform on well-known concurrent algorithms from the literature. For each benchmark case, our tool constructs a general client program P = op1() [] . . . [] opM(), and analyzes its parameterized N-times interleaving N P = P1 ∥ · · · ∥ PN for thread-specific bounds of a single thread Pi and total bounds of P1 ∥ · · · ∥ PN as described in Sect. 2.

Implementation
Our tool Coachman [14] implements the RG bound analyzer for pointer programs described in Sect. 4. For the invariant analysis and effect summary generation, we use invariants from [11] and effect summaries from [26] where available. In all other cases, we manually describe the initial memory layout, apply the summary computation algorithm from [26], and manually convince ourselves of the summaries' soundness. For the sequential bound analyzer, we implement an algorithm based on difference constraint abstraction [36].

Table 1 summarizes the experimental results. The compl(exity) column lists the benchmark's thread-specific complexity as computed by Coachman, which is asymptotically tight for all benchmark cases. |V| and |E| indicate the size of the generated integer program as a transition system with |V| control locations and |E| edges, it(erations) gives the number of iterations for Algorithm 1 to reach a fixed point, and runtime is our tool's runtime in hours:minutes:seconds. Column speedup lists the speedup compared to the tool version reported in the conference proceedings [33].

Benchmarks

We group our benchmarks into four sets: The first set of benchmarks is taken from [21] and consists of non-blocking stack and queue implementations. Treiber's stack [38] (treiber) has been thoroughly discussed in our running example (Sect. 2). dcas-stack is a modified version of Treiber's stack using a double-compare-and-swap (DCAS) instruction that atomically compares two memory locations and conditionally updates the first [39]. The HSY elimination stack [24] (hsy-elimination) allows a pair of concurrent push and pop operations to exchange values without going through the bottleneck of the stack's shared top pointer. The Michael-Scott lock-free queue [31] (michael-scott) has, e.g., been implemented in the ConcurrentLinkedQueue class of the Java standard library. The DGLM queue [16] is a more recent, optimized version of the Michael-Scott queue.
We omit the two remaining benchmarks from [21] that our implementation currently does not handle: a list-based set and an n-ary CAS variant (due to their use of bit-vector arithmetic on pointers and partitioned memory regions, respectively). This is solely a limitation of our implementation (more precisely, the used integer abstraction from [8]) rather than a limitation of the overall rely-guarantee approach to bound analysis presented in this work. We leave refining the integer abstraction for these cases as future work.
In addition to the benchmarks from [21], we include two additional standard non-blocking data structures [25]: A simple atomic reference (atomic-ref) that can be atomically read and updated, and a bounded priority queue whose two buckets are each backed by a lock-free stack (prio-queue).
Designers of concurrent data structures usually aim for complexity to be linear in the number of concurrent threads N . To confirm that our tool works for further complexity classes, we designed benchmarks quadratic and cubic: They consist of 2 (resp. 3) nested CAS calls and have complexity N 2 (resp. N 3 ).
Finally, we expose our tool to benchmarks that have unbounded complexity: spinlock-tas and spinlock-ttas implement a busy-waiting (test-and-)test-and-set lock [25]. treiber-partial is a partial variant of Treiber's stack [25], where the pop method busy-waits for an element in case the stack is empty.

Discussion of results
First of all, our tool computes and confirms asymptotically tight bounds for all benchmark cases. In the following, we summarize its operation and results.
Example 1 (Treiber's stack) For a single CAS-guarded loop (e.g., in Treiber's stack from Fig. 2), our tool takes 2 iterations: Considering the product of a single thread and its abstracted environment (given as - still unbounded - effect summaries), its first iteration establishes a bound for the CAS-edge leaving the loop. It then applies Theorem 2 to obtain a bound on the corresponding effect summaries. In the second iteration, the bounded environment edges induce a bound on the remaining loop edges. This also establishes a fixed point, as all effect summaries have been bounded and no smaller bound has been established.
Example 2 (Michael-Scott queue) In contrast to Treiber's stack, the transitions of the Michael-Scott queue cannot be bounded with just a single refinement operation: It synchronizes via two CAS operations, the first one breaking/looping as in Treiber's stack, the second one located on a back edge of the main loop. Thus our algorithm cannot immediately bound the summary edge corresponding to the second CAS. Rather, it first bounds the first CAS' effect summary, then refines and bounds the second CAS' summary, and after a final refinement bounds all other edges.
Other data structure benchmarks. Complexity of the remaining data structure benchmarks is established similarly.
Benchmarks with polynomial complexity. Nested loops each guarded with a CAS on pairwise different words in memory increase the polynomial complexity by one degree for each nesting level. This is showcased by benchmarks quadratic and cubic, for which our tool correctly computes the quadratic / cubic bound.
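The nesting argument can be replayed with two retry budgets (our own sketch; the reset-per-outer-iteration behavior is an assumption modelling the worst case, not the benchmark's actual code): the outer CAS retries up to N − 1 times, and each outer attempt faces a fresh budget of N − 1 inner retries, giving O(N²) inner iterations in total.

```python
def nested_cas_iterations(n):
    """Worst-case inner-loop iterations for two nested CAS loops on
    pairwise different memory words, with n concurrent threads."""
    total_inner = 0
    outer_budget = n - 1                 # environment CAS successes, word 1
    while True:
        inner_budget = n - 1             # fresh budget on word 2 per attempt
        while inner_budget > 0:          # inner CAS loop retries
            inner_budget -= 1
            total_inner += 1
        if outer_budget > 0:
            outer_budget -= 1            # outer CAS fails: restart
        else:
            return total_inner

assert nested_cas_iterations(3) == 6     # n * (n - 1), i.e., Theta(N^2)
```

Adding a third nesting level multiplies in another factor of N − 1, matching the cubic bound Coachman computes for the cubic benchmark.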
Intuitively, the number of iterations until a fixed point is reached is determined by a dependency relation between the CAS operations: Each CAS c that can only be bounded after another CAS c′ ≠ c (usually the CAS guarding the loop containing c) has been bounded adds one iteration.
Unbounded (non-terminating) benchmarks. Finally, we test our tool on benchmarks that do not -in general -terminate, and thus have unbounded complexity. This is confirmed by benchmarks spinlock-tas, spinlock-ttas, and treiber-partial, for which Coachman correctly fails to find a ranking function and thus to establish bounds.
Bounds on control-flow edges. So far we have only considered the overall thread-specific complexity of our benchmark cases. This corresponds to the complexity cost model described in Remark 1 of Sect. 2.2. Adoption of other cost models is possible and useful: Our bound analysis allows us to infer bounds on an individual control-flow edge e of the program template. This corresponds to a uniform cost model that sets the cost of e to 1 and that of all other edges to 0.
We demonstrate its usefulness on the TAS and TTAS spinlocks (Fig. 8): The TAS spinlock's (Fig. 8a) busy-waiting loop (ℓ0 → ℓ0) corresponds to a failing CAS call, while the TTAS spinlock (Fig. 8b) wraps this check in a simple if-then-else (ℓ10 → ℓ10) and performs the CAS operation only at ℓ12. Note that the TAS spinlock executes the expensive CAS operation unboundedly often, while the TTAS spinlock executes it at most N times. This fact is well-known in the literature, and is one of the main considerations for preferring TTAS over TAS [25].

Fig. 8 Test-and-set (TAS) and test-and-test-and-set (TTAS) spinlocks with computed bounds
Runtime. Performance results were obtained on a single core of a 2.3 GHz Intel Core i5-8259U processor. The runtime of our implementation is negligible in most cases; however, larger benchmarks (integer programs with |V| > 1,000 control locations and |E| > 10,000 control-flow edges) can take significant time.
Since the translation to integer programs described in Sect. 4.3 is purely syntactic, the resulting program contains paths that are unreachable from the initial state. We prune these paths by computing invariants over the interval abstract domain. This pruning step is currently implemented as a naïve worklist algorithm, which is the main bottleneck of our implementation and could be further optimized.
In comparison to the tool version reported in the conference proceedings [33], we have made major performance improvements by solving graph isomorphism queries to reduce the size of the generated integer programs (cf. Sect. 4.3). This enables us to prove bounds for the additional data structures included in this extended version within reasonable time. In fact, our improved tool achieves up to 12x speedup compared to the version presented in the conference proceedings.

Related work
Albert et al. [5] describe an RG bound analysis for actor-based concurrency. They use heuristics to guess a (possibly unsound) guarantee and justify it by proving that all state changes by environment threads not captured by the guarantee occur only finitely often. We note that the approach of [5] leaves state changes by the environment that are not captured by the guarantee completely unconstrained, i.e., they may change the program state arbitrarily, leading to coarser bounds than necessary. In contrast, our approach includes all state changes by environment threads, recognizes that environment state changes occurring boundedly often already carry ranking information for the corresponding effect summaries, and leaves their handling to the sequential bound analyzer.
More closely related to our work, Gotsman et al. [22] present a general framework for expressing liveness properties in RG specifications and apply it to prove termination, i.e., unbounded lock-freedom. They give rely and guarantee as words over effect summaries, and instantiate the framework for properties stating that a set of summaries does not occur infinitely often. They automatically discharge such properties in an iterative proof search over the powerset of effect summaries. Our approach differs in various aspects: First, while our RG quintuples may be formulated as words over effect summaries, the instantiation in [22] is suitable only for termination, but too weak for bound analysis. Second, the focus on liveness properties leads to more complicated proof rules in [22], which have to account for the fact that naive circular reasoning about liveness properties is unsound [1,22,30]. In contrast, all sequences of effect summaries expressible by our environment assertions are safety-closed, allowing us to use the full power of RG-style circular arguments in the premises of rule Par. Finally, we obtain bounds for all effect summaries at once in a refinement step by reduction to sequential bound analysis, rather than iteratively querying a termination prover whether a particular effect summary is executed only finitely often.

Conclusion
We have presented the first extension of rely-guarantee reasoning to bound analysis, and automated bound analysis of concurrent programs by a reduction to sequential bound analysis. Our implementation Coachman is freely available and for the first time automatically infers bounds for widely-studied concurrent algorithms.

Future work
While our framework extends Jones' RG reasoning, we have only given proof rules for parallel composition and a consequence rule and have left the concrete programming language and corresponding rules abstract. Our only requirement regarding safety is that the effect summaries obtained in Sect. 4.2 over-approximate any thread's effect on the global state. However, obviously the precision of effect summaries is an important trade-off between scalability of the analysis and finding (tight) bounds. In our experiments (Sect. 5), effect summaries strong enough to show correctness, a safety property, proved highly useful. Giving a full set of rules and exploring a tighter integration between safety and (bounded) liveness properties is left for future work.
Another interesting question is the completeness of our approach. Computing bounds, just as termination, amounts to finding a ranking function, possibly into the ordinal numbers. A possible construction would thus extend bounds from the integers to the ordinals. We leave this investigation for future work.
While lock-freedom guarantees absence of live-locks, it does not guarantee starvation-freedom: If a thread's environment interferes infinitely often, the thread may loop forever. Wait-freedom is a stronger progress property that guarantees that each individual thread makes progress (i.e., freedom from starvation). Its implementation exposes shared variables per thread; handling this is an interesting problem for the future. Other interesting application domains for further investigation include distributed algorithms and protocol implementations.
In terms of practical improvements, our tool currently discovers constant bounds and bounds that are expressions over the number of concurrent threads N . Other bounds, e.g., an expression over the list's length, are possible and occur in practice. Finding ways to symbolically express such bounds, as well as extending the sequential bound analysis to synthesize appropriate ranking functions poses interesting challenges for the future. Another practical improvement left for future work is the extension to memory shapes other than singly linked lists.
We first show that P1 ∥ P2 satisfies the guarantee (G1, G2). The proof proceeds by induction on the length of τ.
Suppose τ is finite. Then so is τ′, and from (18) by Definition 10 we have that τ′ ends in (ℓ, σ) where σ ≠ ⊥ and σ ∈ S′. By construction, the same holds for τ.

B Proof of Theorem 2
Theorem 2 (Generalization of single-thread guarantees to N threads) Let P be a program over local and shared variables Var = LVar ∪ SVar and let N P = P1 ∥ · · · ∥ PN be its N-times interleaving. Let S be a predicate over SVar. Let A over SVar be a sound set of effect summaries for P started from S, and let R and G be environment assertions over A. If (N − 1) × G ⊆ R and R, G ⊨ {S} P1 {true}, then 0, (N − 1) × G ⊨ {S} P2 ∥ · · · ∥ PN {true}. I.e., if (N − 1) × G is smaller than R, and if R, G ⊨ {S} P1 {true} holds, then in an empty environment, P1's environment P2 ∥ · · · ∥ PN executes effect summaries A no more than (N − 1) × G times.