Reconciling Preemption Bounding with DPOR

Abstract. There are two major techniques for scaling up stateless model checking: dynamic partial order reduction (DPOR), which only explores executions that differ in the ordering of racy accesses, and preemption bounding, which only explores executions containing up to k preemptions (preemptive context-switches). Combining these two techniques is challenging because DPOR-equivalent executions often contain a different number of preemptions, making it incorrect to cut explorations that exceed the preemption bound. To restore completeness, prior work has weakened the DPOR algorithm, which often results in the exploration of many redundant executions. We propose an alternative approach. Starting from an optimal DPOR algorithm, we achieve completeness by allowing some slack on the preemption bound of the explored executions. We prove that the required slack does not exceed the number of threads of the program (minus two), and that this upper limit is tight.


Introduction
Stateless model checking (SMC) [12] is an effective bug-finding technique for concurrent programs that systematically explores all interleavings of the given input program. As such, it suffers from the state-space explosion problem: the number of possible interleavings of a program grows rapidly with the program size. There are two main approaches to attack this problem in the literature.
Dynamic partial order reduction (DPOR) [11] is based on the idea that permutations of independent instructions in an interleaving lead to the same state. DPOR deems such interleavings equivalent and strives to explore only one representative interleaving from each equivalence class. Preemption bounding (PB, a.k.a. context bounding) [26] is based on the idea that concurrency bugs in practice can be exposed with a small number of preemptions [25]. Leveraging this insight, PB only explores the interleavings that arise with at most k preemptions (for some fixed k), thereby guaranteeing a partial coverage of the state space.
Combining the two approaches is non-trivial. Simply modifying a DPOR algorithm to discard any explored executions that exceed the desired bound k is not complete, as some executions with at most k preemptions are missed. To restore completeness, Coons et al. [10] weaken DPOR by adding extra backtracking points, but such an approach negates any optimality properties of the underlying DPOR algorithm, and can lead to the (redundant) exploration of multiple equivalent interleavings.
In this paper, we propose a different approach. We adapt a state-of-the-art optimal DPOR algorithm with polynomial memory requirements called TruSt [16] to support preemption-bounded search.
We first observe that the preemption-bound definition of Coons et al. [10] is overly pessimistic for incomplete executions (i.e., executions where at least one thread is enabled) in that an incomplete execution can often be extended to a complete one with a smaller preemption-bound. Updating the definition to be more optimistic, however, does not fully resolve the issue: an intermediate execution that exceeds the bound might still be needed in order to reveal a conflicting instruction that leads to the exploration of the desired execution.
Our solution is to allow the exploration of executions exceeding the bound, as long as they only exceed it by a small amount, which we call slack. For programs with N ≥ 2 threads, we show that a slack value of N − 2 suffices to maintain completeness (up to the provided bound). Unlike Coons et al. [10], our approach is optimal in the sense that it does not explore equivalent executions more than once. Although it may explore executions with larger bound than the desired one, we argue that these executions are useful, because they can still reveal bugs.
We have implemented our bounding approach in GenMC [19], a state-of-the-art open-source stateless model checker. We show that for small preemption bounds (and despite the slack), bounded search can perform significantly faster than full search. Moreover, we experimentally confirm the literature observation that small bounds suffice to expose most concurrency bugs. We therefore argue that our combination of preemption bounding and DPOR is useful as a practical testing approach, which also provides certain coverage guarantees.

Background
In this section, we recall the basic DPOR approach and how prior work has tried to incorporate preemption-bounded search into it. Subsequently, we review the TruSt algorithm [16], which we later build upon to obtain our results.

The Basics of Dynamic Partial Order Reduction
DPOR starts by exploring one thread interleaving. In the process, it detects conflicting transitions, i.e., instructions that, if executed in the opposite order, will alter the state of the system. At each state, when an earlier transition t is in conflict with a possible transition t′ that can be taken by another thread in this state, DPOR considers the execution where t′ is fired before t. To accomplish this, DPOR adds the transition t′ to the backtrack set of the state immediately before t was fired, to be explored later.
We illustrate DPOR by running it on the following example ( Fig. 1).  After firing the transitions (r x ) and (r y ) (trace 1 ), DPOR adds transition (w 1 ) to the backtrack set of the state after the firing of transition (r x ), since transition (w 1 ) is in conflict with transition (r y ). When the initial exploration is finished (trace 2 ), DPOR backtracks to 1 and considers the second exploration option, i.e., firing transition (w 1 ) and thus reaching 3 .
Subsequently, DPOR fires (r y ) (trace 4 ) and notices that this is in conflict with (w 2 ); it then adds (w 2 ) as an alternative exploration option for the state before the firing of (r y ) in 4 . Again, DPOR finishes with the exploration where the read instruction reads the value 1 (trace 5 ) and backtracks to 3 . Now, (w 2 ) is fired (trace 6 ) and the algorithm continues with the remaining transition, leading to 7 . DPOR now terminates since there is no other exploration option.
This way, DPOR manages to explore all three equivalence classes (representatives 2 , 5 , 7 ) of the 6 interleavings that correspond to this program.
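The count of 6 interleavings and 3 equivalence classes can be reproduced by brute force. The sketch below assumes that rr+ww has the shape suggested by the discussion (Fig. 1 is not reproduced here): one thread reads x and then y, while the other writes 1 and then 2 to y. It further assumes that, for this particular program, two interleavings are equivalent exactly when their reads observe the same values. The names `T1`, `T2`, `interleavings`, and `reads_observed` are illustrative only.

```python
from itertools import combinations

# Assumed shape of rr+ww: thread 1 reads x then y; thread 2 writes y twice.
T1 = [("r", "x"), ("r", "y")]
T2 = [("w", "y", 1), ("w", "y", 2)]


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that preserves each thread's order."""
    n = len(t1) + len(t2)
    for slots in combinations(range(n), len(t1)):
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]


def reads_observed(trace):
    """Run the trace on a zero-initialized memory and record what reads see."""
    mem = {"x": 0, "y": 0}
    seen = []
    for op in trace:
        if op[0] == "w":
            mem[op[1]] = op[2]
        else:
            seen.append((op[1], mem[op[1]]))
    return tuple(seen)


all_traces = list(interleavings(T1, T2))
# Assumption: for this program, equal read results means equivalent traces.
classes = {reads_observed(t) for t in all_traces}
print(len(all_traces), len(classes))
```

Unlike DPOR, this enumeration visits every interleaving, so it serves only as a sanity check of the numbers quoted above, not as a reduction technique.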

Bounded Partial Order Reduction
Preemption bounding (PB) [26] prunes the state space by discarding executions that contain more preemptions than a given constant bound k. A preemption occurs at index i of a sequence of events τ whenever (1) events $\tau_i$ and $\tau_{i+1}$ originate from different threads and (2) the thread of $\tau_i$ remains enabled after $\tau_i$; in particular, $\tau_i$ is not the last event of its thread.
Combining DPOR and PB is non-trivial. Specifically, simply pruning from DPOR's exploration space any trace with more than k preemptions is incorrect, because exploring such traces may be necessary to reach traces with at most k preemptions.
To see this, consider the run of rr+ww with k = 0. DPOR reaches the state where (r x ) is fired and (w 1 ) is considered as an alternative option in the backtrack set. Firing transition (w 1 ) will lead to trace 3 , which exceeds the bound, since there is a transition from the second thread present, while the first thread is still enabled. By discarding this state, the execution where b = 2 (which is equivalent to 7 ) would never be considered, even though it respects the bound.
To address this issue, Coons et al. [10] conservatively add more backtrack points accounting for such bound-induced dependencies. Concretely, when the two transitions of the first thread are fired (trace 1 ), Coons et al. [10] add (w 1 ) to the backtrack set not only of the state before the firing of (r y ) in 2 , as in the unmodified DPOR algorithm, but also of the initial state. Additionally, the initial transition from a state is always picked so that it is from the same thread as the last fired transition, if possible. As a result, when the state with only (w 1 ) being fired is reached (due to the additional backtrack point), (w 2 ) will be fired immediately afterwards, and eventually the interleaving that corresponds to the right-to-left execution of the threads will be explored.
While this solution guarantees that no execution within the bound is lost, it weakens DPOR, i.e., it leads to the exploration of equivalent interleavings that would otherwise not be considered. In rr+ww, for k > 0, Coons et al. [10] explore interleavings that only differ in the order of (r x ) and (w 1 ).

TruSt: Optimal Dynamic Partial Order Reduction
The basic DPOR algorithm described in § 2.1 does not guarantee optimality, i.e., that only one execution from each equivalence class will be explored. There are several improvements of the basic algorithm, some of which achieve optimality (e.g., [2,18]). Here, we follow the most recent such improvement, TruSt [16], which achieves optimality with polynomial memory consumption.
TruSt represents program executions as execution graphs, a concept that appeared in previous works for DPOR under weak memory models [15,18]. An execution graph G consists of a set of nodes G.E (a.k.a. events) representing the individual thread instructions executed, such as read events R and write events W, and three kinds of directed edges encoding the ordering between events: the program order G.po, which orders events of the same thread; the coherence order G.co, which orders writes to the same location; and the reads-from mapping G.rf, which shows where each read is reading from.
For an execution graph G, we define the following derived relations: The causality order, porf, relates two events if there is a path of program-order or reads-from edges between them, while fr orders a read event before every write that is coherence-after the write it reads from. An execution graph is SC-consistent (sequentially consistent) if there is a total ordering of its events respecting po such that each read event reads from the immediately preceding same-location write in the total order. Equivalently, a graph is SC-consistent if porf ∪ co ∪ fr is acyclic.
Execution graphs enable the efficient reversal of many conflicting events. If a write or a read event is in conflict with a previous write event, there is no need to backtrack to the state before the write event was added. Instead, the new event can be directly added to the execution and either read from a co-earlier write in case of a read event, or be placed co-before the conflicting write in case of a write event.
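The acyclicity characterization of SC-consistency can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding that is not taken from TruSt or GenMC: `po` is a set of ordered pairs, `rf` maps each read to the write it reads from, and `co` maps each location to its writes in coherence order, with an explicit initial write placed first.

```python
def acyclic(edges):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    nodes = {n for edge in edges for n in edge}
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while ready:
        n = ready.pop()
        visited += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return visited == len(nodes)


def is_sc_consistent(po, rf, co):
    """Check that po, rf, co, and the derived fr relation form no cycle."""
    edges = set(po) | {(w, r) for r, w in rf.items()}
    for writes in co.values():
        # co edges between every ordered pair of same-location writes
        edges |= {(writes[i], writes[j])
                  for i in range(len(writes))
                  for j in range(i + 1, len(writes))}
        # fr edges: a read precedes every write co-after the one it reads
        for r, w in rf.items():
            if w in writes:
                edges |= {(r, later) for later in writes[writes.index(w) + 1:]}
    return acyclic(edges)


# Message-passing shape: T1 = Wx1; Wy1 and T2 = Ry; Rx (ix, iy are initial writes).
po = {("Wx1", "Wy1"), ("Ry", "Rx")}
co = {"x": ["ix", "Wx1"], "y": ["iy", "Wy1"]}
stale = {"Ry": "Wy1", "Rx": "ix"}    # sees the flag but not the data
fresh = {"Ry": "Wy1", "Rx": "Wx1"}   # sees both writes
print(is_sc_consistent(po, stale, co), is_sc_consistent(po, fresh, co))
```

In the `stale` graph, the cycle runs through po, rf, po, and fr, which is why no sequentially consistent total order exists for it.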
The only reversals where backtracking is necessary are those between a write event and a previously added read event: when a read event is added, it does not have the option to read from a write that has not yet been added. These reversals are referred to as backward revisits. To avoid exponential memory consumption, TruSt considers each exploration option eagerly when the new event is added, instead of maintaining backtrack sets for later exploration. In the case of backward revisits, TruSt removes the part of the execution that was added after the read event but is not in the prefix of the write event. The prefix of an event is defined as the set of events that precede it in the porf order. This allows the write event to be directly added to the execution graph. Because many different execution graphs can lead to the same execution after a backward revisit, TruSt only considers the revisit if the events to be removed respect a maximality condition, which is defined so that there is always exactly one such set of deleted events, thereby achieving an optimal exploration.

Bounded Optimal DPOR: Obstacles
We discuss the two main obstacles that complicate the application of preemption-bounded search to a DPOR algorithm.

Pessimistic Bound Definition
The first problem concerns the definition of preemptions for incomplete executions. Recall in the rr+ww example why the naive adaptation of DPOR with preemption bound k = 0 (incorrectly) does not generate the execution reading b = 2. The partial trace 3 is discarded because it contains at least one preemption according to the definition of Musuvathi et al. [24]. (Both threads are enabled and have executed one instruction each.) We argue that this trace should be deemed to have no preemptions because of monotonicity. Trace 3 can be extended to a full trace (namely, 7 ) that (is equivalent to one that) does not have any preemptions.
We therefore modify the definition of preemptions as follows. A preemption occurs at index i of an event sequence τ whenever (1) events $\tau_i$ and $\tau_{i+1}$ originate from different threads and (2) the thread of $\tau_i$ remains enabled after $\tau_i$, and has further events in the trace $\tau_{i+1} \tau_{i+2} \ldots \tau_{|\tau|}$. According to our new definition, both interleavings that are equivalent with 3 have zero preemptions, because when switching to another thread, the first thread has no further events in the trace.
Our new definition satisfies monotonicity and coincides with the original on complete executions. We note, however, that partial executions with k preemptions cannot always be extended to a complete execution with k preemptions. Consider, for example, trace 4 of rr+ww, which has no preemptions. Firing the only remaining transition leads to trace 5 , which has one preemption. A DPOR algorithm that employs our definition of preemptions might thus reach states that are bound-blocked: the current explored execution respects the bound but there is no final execution reachable from this state that respects the bound. In our experience (see §6), bound-blocked executions do not seem to have a significant effect on the performance of our algorithm.
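To make the difference between the two definitions concrete, the sketch below counts preemptions in a single trace under either one. It is only an illustration under stated assumptions: a trace is given as a list of thread identifiers, `thread_len` (a hypothetical parameter) gives the total number of instructions of each thread, and a thread is taken to be enabled exactly when it has not yet executed all of its instructions, i.e., blocking is ignored.

```python
def count_preemptions(trace, thread_len, optimistic):
    """Count preemptions in a trace given as a list of thread ids.

    optimistic=False follows the original definition; optimistic=True
    additionally requires the preempted thread to reappear later in the trace.
    Assumption: a thread is enabled iff it still has unexecuted instructions.
    """
    executed = {}
    count = 0
    for i in range(len(trace) - 1):
        cur, nxt = trace[i], trace[i + 1]
        executed[cur] = executed.get(cur, 0) + 1
        enabled = executed[cur] < thread_len[cur]
        if cur == nxt or not enabled:
            continue
        if optimistic and cur not in trace[i + 1:]:
            continue  # the thread has no further events in this trace
        count += 1
    return count


# rr+ww: two threads with two instructions each (assumed program shape).
lens = {1: 2, 2: 2}
partial = [1, 2]          # one event of each thread, as in trace 3
complete = [1, 2, 1, 2]   # a full interleaving
print(count_preemptions(partial, lens, optimistic=False),
      count_preemptions(partial, lens, optimistic=True))
```

On the partial trace the two definitions disagree (one preemption versus none), whereas on the complete trace both report the same count, as expected from the claim that they coincide on complete executions.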

Need For Slack
Monotonicity alone is not enough to incorporate bounded search in an algorithm like TruSt, without still forfeiting completeness: some executions that respect the bound might still be lost. Intuitively, since DPOR algorithms operate by detecting conflicting instructions during an interleaving's exploration and reversing the conflict to obtain a new interleaving, it might be the case that for the conflict to be revealed, an execution that exceeds the bound needs to be explored.
We illustrate this point with the example in Fig. 2 where all the variables are initialized to zero. Consider a run of TruSt that always adds the next event from the left-most enabled thread. To reach the final execution that results from executing the threads from right to left, TruSt needs to pass through the execution depicted on the right of Fig. 2 before reaching this final execution. In the next step, the second write of the third thread will be added, which will reveal a conflict with the first read of y of the second thread. The algorithm will then perform a backward revisit, removing the events of the second thread after the first read of y, and change the read's incoming rf edge to the new write event. The desired final execution will be reached after the remaining events of the second thread are added again.
It is easy to see that, while the final execution has zero preemptions, the depicted intermediate execution has at least one preemption, and would thus be discarded. This example can in fact be generalized by adding more threads identical to the third one; to reach the final right-to-left execution that has zero preemptions, TruSt must visit an execution that has at least N − 2 preemptions, where N is the total number of threads. In §4, we show that this is in fact an upper limit; a final execution with k preemptions is always reachable through a sequence of executions that never exceed k + N − 2 preemptions. This result directly enables us to incorporate preemption-bounded search into TruSt by allowing some slack to the bound.

Recovering Completeness via Slack
Our bounded DPOR algorithm, Buster, can be seen in Algorithm 1, where we have highlighted the differences w.r.t. TruSt [16].
We first discuss some additional notation used in the algorithm. First, each execution graph generated by the algorithm keeps track of the order $<_G$ in which events were added to it. Second, given a graph $G$ and a set of events $E$, we write $G|_E$ for the restriction of $G$ to $E$. Third, let $G.\mathsf{cprefix}(e)$ be the causal prefix of an event $e$ in an execution graph $G$, i.e., the set of all events that causally precede it (including $e$ itself). Formally, $G.\mathsf{cprefix}(e) = \{ e' \mid \langle e', e \rangle \in G.\mathsf{porf}^* \}$. Fourth, a subscript $\mathsf{loc}(a)$ restricts a set of events to those that access the same location as event $a$. Fifth, the function $\mathsf{SetRF}(G, a, w)$ adds an rf edge from $w$ to $a$, and $\mathsf{SetCO}(G, w_p, a)$ places $a$ immediately after $w_p$ in co. Finally, we define the traces of an execution graph as the linearizations of $(G.\mathsf{porf} \cup G.\mathsf{co} \cup G.\mathsf{fr})$ on $G.E$. We lift the definition of preemptions to an execution graph $G$: $\mathsf{preemptions}(G)$ is the minimum number of preemptions in the traces of $G$.
Apart from only exploring SC-consistent executions, Buster eagerly discards executions with more preemptions than the user-provided value k plus the slack (Line 5). If both tests fail, Buster continues by picking a new event to extend the current execution (Line 6). For correctness, we fix $\mathsf{next}_P(G)$ to always return the event that corresponds to the left-most available thread. Depending on the type of the new event, the algorithm proceeds in a different way. We discuss the interesting cases of read and write events.
If the new event a is a read event, Buster simply considers every possible write event as an rf option for a (Line 13), and eagerly explores the corresponding execution. If a is a write event, first every co placement is considered and explored (Line 15). Afterwards, Buster considers possible backward revisits: for every read event r that is not in the causal prefix of a, the execution where r reads from a is considered, after deleting the events that were added after r and are not in the causal prefix of a (Line 19). To avoid redundant revisits, the backward revisit is performed only when the set of deleted events satisfies a maximality condition (Line 18; see [16] for more details).
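As an illustration of preemptions(G) as a minimum over traces, the sketch below enumerates all linearizations of a small graph by brute force. It assumes the modified preemption definition of §3.1 and that threads never block, so a context switch counts as a preemption exactly when the thread being left has further events later in the trace. The event names and the function names are hypothetical, and the enumeration is exponential, so it is meant only for tiny examples.

```python
from itertools import permutations


def preemptions_of_trace(trace, tid):
    """Preemptions of one trace: switches away from a thread that returns later."""
    count = 0
    for i in range(len(trace) - 1):
        cur, nxt = tid[trace[i]], tid[trace[i + 1]]
        if cur != nxt and any(tid[e] == cur for e in trace[i + 1:]):
            count += 1
    return count


def preemptions(events, tid, order):
    """Minimum over all linearizations of `order` (pairs from porf, co, fr)."""
    best = None
    for perm in permutations(events):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in order):
            cost = preemptions_of_trace(perm, tid)
            best = cost if best is None else min(best, cost)
    return best


# rr+ww (assumed shape): thread 1 = rx; ry and thread 2 = w1; w2 (writes to y).
tid = {"rx": 1, "ry": 1, "w1": 2, "w2": 2}
# Partial execution where ry reads from w1 and w2 has not been added yet.
partial = {("rx", "ry"), ("w1", "ry")}
# Full execution where ry reads from w1: po, co, rf, and fr (ry before w2).
reads_1 = {("rx", "ry"), ("w1", "w2"), ("w1", "ry"), ("ry", "w2")}
# Full execution where ry reads from w2.
reads_2 = {("rx", "ry"), ("w1", "w2"), ("w2", "ry")}

print(preemptions(["rx", "w1", "ry"], tid, partial),
      preemptions(list(tid), tid, reads_1),
      preemptions(list(tid), tid, reads_2))
```

Under these assumptions the partial graph and the graph reading 2 need no preemption, while the graph reading 1 needs one, which matches the discussion of traces 4, 5, and 7 in §3.1.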
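The deletion performed by a backward revisit can be sketched as follows. This is a simplified illustration, not the TruSt or GenMC implementation: the graph is reduced to its insertion order and its porf edges, the names `cprefix` and `backward_revisit` are hypothetical, and the maximality condition of Line 18 is not modelled at all.

```python
def cprefix(event, porf_edges):
    """Events that reach `event` via porf edges, including `event` itself."""
    preds = {}
    for a, b in porf_edges:
        preds.setdefault(b, set()).add(a)
    seen, stack = {event}, [event]
    while stack:
        cur = stack.pop()
        for p in preds.get(cur, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def backward_revisit(added_order, porf_edges, read, write):
    """Split events into those kept and those deleted when `write` revisits `read`.

    Kept: everything added up to and including `read`, plus the causal prefix
    of `write`. Deleted: events added after `read` outside that prefix.
    """
    prefix = cprefix(write, porf_edges)
    assert read not in prefix, "a write cannot revisit a read in its own prefix"
    cut = added_order.index(read)
    kept = [e for i, e in enumerate(added_order) if i <= cut or e in prefix]
    deleted = [e for e in added_order if e not in kept]
    return kept, deleted


# Thread 1: Ry; Wz.  Thread 2: Wx; Wy.  Events were added left to right.
added = ["Ry", "Wz", "Wx", "Wy"]
porf = {("Ry", "Wz"), ("Wx", "Wy")}
kept, deleted = backward_revisit(added, porf, read="Ry", write="Wy")
print(kept, deleted)
```

Here `Wz` is removed because it was added after the revisited read and does not causally precede `Wy`; it would be re-added once `Ry` reads from `Wy`.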

Properties of TruSt
We now present some key properties of the TruSt algorithm, i.e., Algorithm 1 without Line 5, that are used to prove Buster's correctness (Theorem 1).
From TruSt's correctness argument, we know that every SC-consistent execution $G_f$ has exactly one sequence of $\mathsf{Visit}_P$ calls that leads to it. We call the sequence of the corresponding graphs a production sequence for $G_f$.
Let a maximal step of an execution $G$ be an execution that results from extending a thread of $G$ by an event $e$ in a maximal way, i.e., if $e \in \mathsf{R}$, then $e$ is made to read from the co-latest write and if $e \in \mathsf{W}$, then $e$ is placed at the end of co. We write $G \rightarrow G'$ when $G'$ is a maximal step of $G$, and $G \rightarrow_e G'$ when $G \rightarrow G'$ and $e$ is the added event. We say that a sequence of maximal steps is non-decreasing when the sequence of the thread identifiers of the added events is non-decreasing. Finally, we write $\mathsf{tid}(e)$ for the thread identifier of an event $e$.
A key property of TruSt (stated in Prop. 1) is that every execution $G$ in the production sequence of an SC-consistent execution $G_f$ is either a prefix of $G_f$, or it contains a read event $r$ that does not read from the "correct" write, but there is a prefix $\hat{G}$ of $G_f$ that can be extended to $G$ by a non-decreasing sequence of maximal steps starting with $r$ and not including events of at least one thread to the right of $r$.
Proposition 1. Let $S$ be the production sequence of an SC-consistent final execution $G_f$, and $G$ be an execution in $S$. Then, either $G \sqsubseteq G_f$ or there exist an execution $G_b$ that is before $G$ in $S$, a read event $r = \mathsf{next}_P(G_b)$, a thread $t > \mathsf{tid}(r)$, and an execution $\hat{G}$ such that $G_b \sqsubseteq \hat{G} \sqsubseteq G$, $\hat{G} \sqsubseteq G_f$, there is a non-decreasing sequence of maximal steps s.t. $\hat{G} \rightarrow_r \rightarrow^* G$, and $\forall e \in G.E \setminus \hat{G}.E.\ \mathsf{tid}(e) \neq t$.
Intuitively, TruSt tries to construct $G_f$ by exploring an increasing sequence of its prefixes. This is not always possible, because when a read event $r$ is added to $G_b$, the write event $w$ that it should read from might not yet be present in $G_b$. In that case, $r$ is made to read from another write and is later revisited by $w$, leading to the execution $G_b' = G_f|_{G_b.E \cup G_f.\mathsf{cprefix}(r)}$, which is a prefix of $G_f$. It is possible that additional backward revisit steps may happen between $G_b$ and $G_b'$. Due to maximality, however, for every intermediate execution $G$ in the production sequence between $G_b$ and $G_b'$, there will be an execution $\hat{G}$ with $G_b \sqsubseteq \hat{G} \sqsubseteq G_b'$ that can be extended to $G$ by a non-decreasing sequence of maximal steps. Execution $\hat{G}$ is exactly the part of $G$ that is not deleted or revisited in a later step in $S$. Hence, if $w$ is the first write that performed a backward revisit in $S$ after $G$, then the events of thread $t = \mathsf{tid}(w)$ are already included in $\hat{G}$. Finally, it can be shown that $t$ is to the right of $r$. The formal proof of this proposition can be found in the extended version of this paper [23].

Correctness of Slacked Bounding
To see why executions in the production sequence of a graph $G_f$ can have at most $\mathsf{preemptions}(G_f) + N - 2$ preemptions, we start with a definition. A witness of a graph $G$ is a trace of $G$ that contains $\mathsf{preemptions}(G)$ preemptions. Next, we observe that preemptions are monotone w.r.t. execution prefixes (Lemma 1). That is, if an execution $G$ requires a certain number of preemptions to be produced, a larger execution $G' \sqsupseteq G$ requires at least that many preemptions. To prove this, take a witness of $G'$ and restrict it to the events of $G$, thereby obtaining a trace of $G$. The restriction can only remove preemptions.
Further, we note that the number of preemptions of an execution is unaffected if we extend its last executed thread with a maximal step; if a maximal step adds an event to a different thread, the number is increased by at most one.
Lemma 2. Let $G$ and $G'$ be SC-consistent executions and $r \in G'.E$ such that $G \rightarrow_r \rightarrow^* G'$. Then, $\mathsf{preemptions}(G') \leq \mathsf{preemptions}(G) + S$, where $S$ is the number of threads that were extended to obtain $G'$ from $G$.
Proof. Consider a witness $w$ of $G$ and extend it by appending the missing events in the same order they were added in the sequence of maximal steps. Notice that, by construction of the maximal step, the resulting sequence is a trace of $G'$. Each time we add an event $e$ to the trace such that the last event of the trace was not in the thread of $e$, we increase the preemption count by at most one: a thread that was previously considered completed is now extended with a new event. However, this can only happen $S$ times: the maximal steps keep adding events of the same thread and, when another thread is picked, the first is not extended again (the maximal steps are non-decreasing). This gives us a trace of $G'$ with at most $\mathsf{preemptions}(G) + S$ preemptions, which concludes our proof.
We can now prove that Buster is complete (Theorem 1), i.e., it visits every full, SC-consistent execution that respects the bound.
Proof. Consider a full, SC-consistent execution $G_f$ of $P$ with at most $k$ preemptions. From the completeness of TruSt, we know that a run of Algorithm 1 without the test on Line 5 will visit $G_f$. It thus suffices to show that every execution $G$ in the production sequence of $G_f$ has at most $k + N - 2$ preemptions, where $N$ is the number of threads of $P$. If $G \sqsubseteq G_f$, then from Lemma 1 we have $\mathsf{preemptions}(G) \leq \mathsf{preemptions}(G_f) \leq k$. Otherwise, from Prop. 1, there exist an execution $G_b$ that is before $G$ in the production sequence of $G_f$ and an execution $\hat{G}$ such that $G_b \sqsubseteq \hat{G} \sqsubseteq G$, $\hat{G} \rightarrow_r \rightarrow^* G$, and no events in $G.E \setminus \hat{G}.E$ are in thread $t$, for some thread $t$ to the right of $r$.
From the last two properties and Lemma 2, we have $\mathsf{preemptions}(G) \leq k + N - 1$, since $\mathsf{preemptions}(\hat{G}) \leq \mathsf{preemptions}(G_f)$ (by $\hat{G} \sqsubseteq G_f$ and Lemma 1) and at most $N - 1$ threads are extended from $\hat{G}$ to $G$.
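Spelled out, the bound above is the following chain of inequalities, where $S$ denotes, as in Lemma 2, the number of threads extended from $\hat{G}$ to $G$. It is only a restatement of the argument in the preceding paragraph.

```latex
\begin{align*}
\mathsf{preemptions}(G)
  &\le \mathsf{preemptions}(\hat{G}) + S
     && \text{(Lemma 2)} \\
  &\le \mathsf{preemptions}(G_f) + S
     && \text{(Lemma 1, since } \hat{G} \sqsubseteq G_f \text{)} \\
  &\le k + (N - 1)
     && \text{(} \mathsf{preemptions}(G_f) \le k \text{ and } S \le N - 1 \text{)}
\end{align*}
```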
To complete the proof, we show that $\mathsf{preemptions}(G) = k + N - 1$ leads to a contradiction. The equality implies that $\hat{G}$ had $k$ preemptions and that $N - 1$ threads were extended in the maximal steps from $\hat{G}$ to $G$, and all of them increased the preemptions by one. The sequence of maximal steps from $\hat{G}$ to $G$ is non-decreasing and starts with the thread of $r$. Since there are at most $N$ threads, $N - 1$ are extended, and at least one thread to the right of $r$ is not extended, $r$ is in the leftmost thread.
Let $t_r = \mathsf{tid}(r)$ be the leftmost thread and $w = G_f.\mathsf{rf}(r)$. From the proof of TruSt, we can infer that all events of $G_b$ are in the porf-prefix of the last event of $t_r$. TruSt will eventually add the write $w$ and revisit the read $r$, reaching the execution $G_b' \sqsubseteq G_f$ that contains all events added before $r$, i.e., the events of $G_b$, the events in the porf-prefix of $r$, and $r$. Hence, all events in $G_b'.E \setminus \{r\}$ are in the porf-prefix of $r$, which implies that any witness of $G_b'$ ends with $r$.
Since $G_b' \sqsubseteq G_f$, any witness $t$ of $G_b'$ has at most $k$ preemptions. Let $G'$ be the execution $G_b'$ without $r$, and $G''$ the unique execution s.t. $\hat{G} \rightarrow_r G''$. Removing the last event $r$ from $t$ gives us a trace $t'$ of $G'$ with at most $k$ preemptions. If $t'$ ends with an event of $t_r$, then we can restrict $t'$ to the events of $\hat{G}$ and add $r$ at the end, obtaining a trace of $G''$ with at most $k$ preemptions. Otherwise, $t'$ does not end with an event of $t_r$, and thus trace $t$ has one more preemption than $t'$, i.e., $t'$ has at most $k - 1$ preemptions. Then, we can again restrict $t'$ to the events of $\hat{G}$ and add $r$ at the end, obtaining again a trace of $G''$ with at most $k$ preemptions. This contradicts our assumption that $\mathsf{preemptions}(\hat{G}) = k$ and all $N - 1$ threads that are extended from $\hat{G}$ increase the number of preemptions, since the first thread $t_r$ can be extended without incurring any more preemptions.
Buster inherits TruSt's optimality, as it only explores a subset of the executions that TruSt does. Here, optimality refers to avoiding redundant work; due to the slack, Verify(P, k) can also visit executions with more than k preemptions.
Theorem 2. Verify(P, k) explores each graph G of a program P at most once.

Implementation
We have implemented Buster on top of the GenMC tool [19], which implements the TruSt algorithm [16]. Since GenMC supports weak memory models and the standard notion of preemption bounding only makes sense for sequential consistency, we enforce SC in our benchmarks by using only SC memory accesses and selecting GenMC's RC11 model [21].
The bulk of our modifications to GenMC concern the checking of whether the preemption-bound of an execution G exceeds a value k. Generally, deciding whether the preemption-bound of a Mazurkiewicz trace exceeds a value is an NP-complete problem [24]. We use an adaptation of the bound computation in Musuvathi et al. [24] to execution graphs, but instead of recursively computing preemptions(G) (and caching computations across calls to amortize the cost), we recursively compute the predicate Φ(G, k) = (preemptions(G) ≤ k). The benefit of this method is that we can avoid calculating preemptions(G) exactly when its value exceeds the desired bound. Furthermore, there is no additional state that needs to be stored; Buster remains stateless.
As an optimization, we use as slack (Line 5) the minimum between N − 2 and the number of threads that have no deletable events; an event is not deletable if it is in the porf-prefix of a write that backward revisited. Intuitively, the events that are added in G to reach $\hat{G}$ (Prop. 1) are the events that will later be deleted to eventually reach a graph that is a prefix of the final graph $G_f$.
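The benefit of deciding Φ(G, k) rather than computing preemptions(G) can be illustrated with a depth-first search that abandons a partial linearization as soon as it has used more than k preemptions. The sketch below is a hypothetical brute-force illustration under the same assumptions as before (modified preemption definition, no blocking); it does not reproduce the adaptation of Musuvathi et al.'s computation used in the implementation, and the name `within_bound` is an assumption.

```python
def within_bound(events, tid, order, k):
    """Decide whether some linearization of `order` has at most k preemptions.

    The search stops extending a prefix once its preemption count exceeds k,
    so the exact minimum is never computed when it lies above the bound.
    """
    preds = {e: set() for e in events}
    for a, b in order:
        preds[b].add(a)
    remaining = {}
    for e in events:
        remaining[tid[e]] = remaining.get(tid[e], 0) + 1

    def search(placed, last, used):
        if used > k:
            return False          # prune: this prefix already breaks the bound
        if len(placed) == len(events):
            return True
        for e in events:
            if e in placed or not preds[e] <= placed:
                continue
            t = tid[e]
            # Leaving a thread that still has unplaced events is a preemption.
            switch = last is not None and t != last and remaining[last] > 0
            placed.add(e)
            remaining[t] -= 1
            found = search(placed, t, used + (1 if switch else 0))
            remaining[t] += 1
            placed.remove(e)
            if found:
                return True
        return False

    return search(set(), None, 0)


# rr+ww execution where ry reads from w1 (assumed program shape).
tid = {"rx": 1, "ry": 1, "w1": 2, "w2": 2}
order = {("rx", "ry"), ("w1", "w2"), ("w1", "ry"), ("ry", "w2")}
print(within_bound(list(tid), tid, order, 0), within_bound(list(tid), tid, order, 1))
```

Like the predicate described above, this search keeps no state between calls; its worst-case cost remains exponential, which is consistent with the NP-completeness of the problem.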

Evaluation
To evaluate Buster, we answer the following questions:
§ 6.1 How many preemptions suffice to expose common concurrency bugs? Is Buster effective at finding such concurrency bugs?
§ 6.2 How good is preemption bounding at pruning the search space? Up to what bound does Buster run faster than vanilla DPOR?
§ 6.3 What is the overhead induced by the bound calculation?
To that end, we evaluate Buster against GenMC on a diverse set of benchmarks. Unfortunately, we cannot include the approach of Coons et al. [10] in our comparison because their implementation is not available. We can draw two major conclusions from our evaluation. First, most bugs do manifest with a small number of preemptions (≤ 2), an observation that has been made in the literature before [26,28]. Second, even though the bound calculation can be fairly expensive, for small bounds Buster outperforms GenMC and can find bugs faster than GenMC.
Table 1. Buggy benchmarks. An indicates that an error was found.

Experimental Setup
We conducted all experiments on a Dell PowerEdge M620 blade system with two Intel Xeon E5-2667 v2 CPUs (8 cores @ 3.3 GHz) and 256GB of RAM. We used LLVM 11.0.1 for GenMC and Buster. All reported times are in seconds. We set a timeout limit of 30 minutes.

Bound and Bug Manifestation
To validate that most bugs require a small number of preemptions, we run Buster and GenMC on three sets of benchmarks: the unsafe concurrent benchmarks of the SCT suite [28], the unsafe benchmarks of the pthread category of SV-COMP [27] included in GenMC's test suite, and a set of concurrent data structures (CDs) from GenMC's test suite with randomly induced bugs.
In all cases, we configure Buster to disregard any errors that occur in executions that exceed the bound and are explored due to the slack. We note that this configuration may delay bug finding, since Buster may by chance quickly come across a buggy execution with more than k preemptions (due to slack) before finding any buggy execution with up to k preemptions. Nevertheless, we follow it to ensure that the bugs found arise in executions with up to the desired number of preemptions, so as to be able to validate the claim that bugs manifest in executions with a small number of preemptions.
Table 1 reports our outcomes on the first two classes of benchmarks. As can be seen, Buster was able to find most bugs using a bound of 1. In fact, for most benchmarks, Buster found the bug before exploring a complete execution, hence the "0" entries in the table. The only benchmarks where Buster needs a bound greater than 1 are the synthetic benchmarks triangular, which needs a bound of 8, as it was specifically designed to make bug discovery difficult and push model checkers to their limits, and reorder-20 and twostage-100, which have a large number of threads (20 and 100, respectively). Buster times out on the latter two benchmarks because the large number of threads puts a lot of stress on the bound checking procedure. We note that for twostage-100, GenMC also fails to terminate within the time limit.
Table 2 reports our results for our CD benchmarks. For these benchmarks, we have taken CD implementations from the GenMC test suite, and induced bugs into them by randomly dropping a synchronization instruction or replacing a CAS instruction with a normal write or an unconditional exchange instruction, thereby introducing a possible atomicity violation. We then construct medium-sized clients (with 2-3 threads and up to 12 operations per thread) of these data structures that check for their intended semantics (for example, that a queue has FIFO semantics).
In all cases, the induced bugs lead to violations of the assertions in the client programs, and occasionally even to memory errors. Buster can find these bugs easily; a bound of k = 2 suffices to expose them. By contrast, GenMC times out for most of these benchmarks, as their state space is enormous.

Comparison with Plain DPOR on Safe Benchmarks
We have already seen that, apart from specially crafted synthetic benchmarks, a small preemption bound is sufficient for finding bugs in practice. Moreover, Buster is effective at finding such bugs in concurrent data structures. We now evaluate the application of Buster on a collection of safe benchmarks. For this purpose, we use different variations of the benchmarks of Table 2 (after repairing them so that no assertion is violated), as well as a few locking benchmarks. Table 3 compares the performance of Buster for small values of k and GenMC. As can be seen, GenMC struggles with these benchmarks, whereas Buster with k = 2 (and often also with k = 3) terminates fairly quickly. This is because only a small fraction of the total executions of sizeable benchmarks have few preemptions. Therefore, restricting the search to only those executions makes Buster run much faster than GenMC, and guarantees that the program under consideration does not have any common bugs.
In the last column of Table 3 we include the maximum value of k such that Buster terminates faster than GenMC, for the benchmarks that terminate under GenMC. In most cases Buster is faster than GenMC even for k > 3. For the dglm-fifo benchmarks Buster is only faster for k ∈ {0, 1}, because for these benchmarks a small k suffices to fully explore the state space.
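The observation that only a small fraction of executions have few preemptions can be illustrated on a toy model. The following Python sketch is not part of Buster or GenMC. It assumes straight-line threads that never block, treats a preemption as a context switch away from a thread that still has steps left, and enumerates raw interleavings rather than DPOR equivalence classes. All function names are hypothetical.

```python
from collections import Counter


def count_preemptions(schedule, steps):
    """Count switches away from a thread that still has steps left.

    `schedule` is a sequence of thread ids; `steps[t]` is the number of
    steps of thread t. Assumption: threads never block, so a thread is
    enabled exactly when it has not finished.
    """
    remaining = list(steps)
    preemptions = 0
    prev = None
    for tid in schedule:
        if prev is not None and tid != prev and remaining[prev] > 0:
            preemptions += 1
        remaining[tid] -= 1
        prev = tid
    return preemptions


def interleavings(remaining):
    """Yield every complete schedule for the given remaining step counts."""
    if not any(remaining):
        yield ()
        return
    for tid in range(len(remaining)):
        if remaining[tid] > 0:
            remaining[tid] -= 1
            for rest in interleavings(remaining):
                yield (tid,) + rest
            remaining[tid] += 1


def preemption_histogram(steps):
    """Map each preemption count to the number of interleavings having it."""
    hist = Counter()
    for schedule in interleavings(list(steps)):
        hist[count_preemptions(schedule, steps)] += 1
    return hist


if __name__ == "__main__":
    hist = preemption_histogram([3, 3, 3])
    total = sum(hist.values())
    for k in sorted(hist):
        within = sum(n for p, n in hist.items() if p <= k)
        print(f"k <= {k}: {within} of {total} interleavings")
```

In this toy model, three threads of three steps each have 1680 interleavings, of which only 6 are preemption-free (one per order in which the threads run to completion).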

Bound Calculation Overhead
We now measure the cost of checking that each encountered execution is below the specified bound. As we discussed in §5, checking whether an execution graph's preemption-bound exceeds a value is an NP-complete problem, and thus we expect this calculation to impact the performance of our tool. To carefully account for this cost, we compare Buster against the baseline GenMC implementation on benchmarks where preemption bounding does not reduce the number of executions that are explored. In Table 4, we report results on simple CD clients of the Treiber stack [29] and the TTAS lock [13] that have only one operation per thread. The clients are designed so that Buster can explore the full set of program executions with a small bound k. We suffix the name of each benchmark with the number of writer and reader threads for the Treiber stack, and with the total number of threads for TTAS.
Table 4. Overhead w.r.t. GenMC (left) and blocking in benchmarks (right).
Column b contains the minimal bound k for which Buster explores the same number of executions as GenMC does. Note that since these benchmarks contain several threads, exploration up to a certain bound (e.g., k = 0) does not mean that only executions with up to k preemptions are visited; due to slack, executions with more preemptions may be visited, and so it is possible for the exploration to cover the entire state space for a smaller bound than intrinsically necessary. In the subsequent columns we report the time overhead (as a percentage) for bounds k = b, k = b + 1, and k = b + 2 w.r.t. GenMC's execution time, which is shown in the last column. The maximum overhead is observed for k = b (the minimal value sufficient to cover the entire state space). This is expected because k = b places the most burden on the calculation of whether the number of preemptions in a given execution is below k. For larger values of k, the overhead drops because it is easier to show that the number of preemptions is below the bound; one does not have to calculate the number of preemptions of an execution precisely. Overall, for the Treiber stack benchmark, the overhead introduced by calculating the bounds is fairly low and does not exceed 23% of the execution time of GenMC. For the plain runs of ttas-lock, the maximal overhead is a bit larger, up to 38%. We note, however, that such overhead only occurs in clients with a large number of threads (7); smaller clients are not affected as much.
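To make concrete why the check need not compute the exact number of preemptions, the following Python sketch decides whether some linearization of a small partial order stays within a budget of k preemptions, and stops at the first witness. It is a brute-force illustration, not Buster's procedure. It assumes that an execution is given as per-thread event counts plus cross-thread ordering constraints, and that any switch away from an unfinished thread counts as a preemption, even when a constraint forces the switch. The function name `within_bound` and the input encoding are hypothetical.

```python
def within_bound(sizes, deps, k):
    """Return True iff some linearization has at most k preemptions.

    sizes[t] is the number of events of thread t. Each element of `deps`
    is ((src_thread, src_index), (dst_thread, dst_index)), meaning the
    source event must be scheduled before the destination event.
    Assumption: a switch away from a thread that has not finished is a
    preemption. The search is exponential in the worst case.
    """
    preds = {}
    for src, dst in deps:
        preds.setdefault(dst, []).append(src)

    def enabled(pos, t):
        # The next event of t exists and all its predecessors have run.
        if pos[t] >= sizes[t]:
            return False
        return all(pos[st] > si for st, si in preds.get((t, pos[t]), []))

    failed = set()  # states already shown to have no witness

    def search(pos, cur, budget):
        if all(p == n for p, n in zip(pos, sizes)):
            return True  # witness found; no exact minimum is needed
        state = (pos, cur, budget)
        if state in failed:
            return False
        failed.add(state)
        for t in range(len(sizes)):
            if not enabled(pos, t):
                continue
            switching = cur is not None and t != cur
            cost = 1 if switching and pos[cur] < sizes[cur] else 0
            if cost > budget:
                continue
            nxt = pos[:t] + (pos[t] + 1,) + pos[t + 1:]
            if search(nxt, t, budget - cost):
                return True
        return False

    return search(tuple(0 for _ in sizes), None, k)


if __name__ == "__main__":
    # Two threads with two events each: event 0 of thread 0 precedes
    # event 0 of thread 1, which in turn precedes event 1 of thread 0.
    deps = {((0, 0), (1, 0)), ((1, 0), (0, 1))}
    for k in range(3):
        print(k, within_bound([2, 2], deps, k))
```

A larger budget prunes fewer branches, so a witness linearization tends to be reached earlier in the search; this mirrors, in a simplified form, why the overhead is expected to be highest near k = b.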

Overhead due to Bound-Blocked Executions
Finally, we measure the overhead caused by bound-blocked executions, by evaluating how often they arise in practice. Specifically, we ran Buster on GenMC's test suite for various preemption-bound values, as well as on the safe CD clients used in § 6.2, and counted the number of such bound-blocked executions.
For GenMC's test suite, the results are summarized in Table 4 (right). We have restricted our attention to the runs with at least 10 executions, so that our results are not skewed by benchmarks that have very few executions. We have also excluded 8 benchmarks from the test suite that use barriers because they are currently not supported by our tool. As can be seen, bound-blocked executions are rare: most runs lead to one bound-blocked execution, and only 6 lead to more than 8 bound-blocked executions. Bound-blocked executions are on average no more than 6% of the total number of executions explored.
For the CD clients, bound-blocked executions are even rarer; out of the 22 clients, Buster encounters bound-blocked executions in only 4 of them, for some k. We again exclude from the discussion runs with very few executions. Of the remaining runs, only two encounter a considerable number of bound-blocked executions, and these become negligible as the bound is increased: around 10% for k = 1 and less than 1% for k = 2.

Related Work
There is a large body of work that has improved the original DPOR algorithm of Flanagan et al. [11]. Abdulla et al. [2] introduced the first optimal DPOR algorithm, which, however, suffers from possibly exponential memory consumption. Kokologiannakis et al. [16] developed TruSt, which is the first optimal DPOR algorithm that consumes polynomial memory.
Agarwal et al. [6], Chalupa et al. [8], Chatterjee et al. [9], and Huang [14] have extended DPOR to partitions coarser than the one we have focused on in this paper, i.e., Mazurkiewicz traces. Abdulla et al. [1,4,5] consider DPOR under various weak memory models, while the works of Kokologiannakis et al. [16,18,20] provide a DPOR algorithm that is parametric in the choice of the memory model, provided it respects some basic properties.
Qadeer et al. [26] showed the decidability of context-bound verification of concurrent boolean programs. Musuvathi et al. [25] propose iterative context bounding, a search algorithm that prioritizes executions with fewer preemptions. Musuvathi et al. [24] combine partial-order reduction with a preemption-bound search, and prove that judging whether the preemption-bound of a Mazurkiewicz trace exceeds a certain value is an NP-complete problem.
To our knowledge, the only attempt to combine DPOR and preemption bounding is by Coons et al. [10], who identify the difficulty of maintaining completeness of the exploration, and resolve it by weakening DPOR.
Abdulla et al. [3] and Atig et al. [7] have extended the notion of preemption bounding to weak memory models. We leave a possible extension of our approach to weak memory models for future work.