Stateless Model Checking under a Reads-Value-From Equivalence

Stateless model checking (SMC) is one of the standard approaches to the verification of concurrent programs. As scheduling non-determinism creates exponentially large spaces of thread interleavings, SMC attempts to partition this space into equivalence classes and explore only a few representatives from each class. The efficiency of this approach depends on two factors: (a) the coarseness of the partitioning, and (b) the time to generate representatives in each class. For this reason, the search for coarse partitionings that are efficiently explorable is an active research challenge. In this work we present RVF-SMC, a new SMC algorithm that uses a novel reads-value-from (RVF) partitioning. Intuitively, two interleavings are deemed equivalent if they agree on the value obtained in each read event, and read events induce consistent causal orderings between them. The RVF partitioning is provably coarser than recent approaches based on Mazurkiewicz and reads-from partitionings. Our experimental evaluation reveals that RVF is quite often a very effective equivalence, as the underlying partitioning is exponentially coarser than other approaches. Moreover, RVF-SMC generates representatives very efficiently, as the reduction in the partitioning is often met with significant speed-ups in the model checking task.


Introduction
The verification of concurrent programs is one of the key challenges in formal methods. Interprocess communication adds a new dimension of non-determinism in program behavior, which is resolved by a scheduler. As the programmer has no control over the scheduler, program correctness has to be guaranteed under all possible schedulers, i.e., the scheduler is adversarial to the program and can generate erroneous behavior if one can arise out of scheduling decisions. During program testing, on the other hand, the adversarial nature of the scheduler tends to hide erroneous runs, making bugs extremely difficult to reproduce by testing alone (aka Heisenbugs [1]). Consequently, the verification of concurrent programs rests on rigorous model checking techniques [2] that cover all possible program behaviors that can arise out of scheduling non-determinism, leading to early tools such as VeriSoft [3,4] and CHESS [5].

arXiv:2105.06424v1 [cs.PL] 13 May 2021
To battle the state-space explosion problem, effective model checking for concurrency is stateless. A stateless model checker (SMC) explores the behavior of the concurrent program by manipulating traces instead of states, where each (concurrent) trace is an interleaving of event sequences of the corresponding threads [6]. To further improve performance, various techniques try to reduce the number of explored traces, such as context bounded techniques [7,8,9,10]. As many interleavings induce the same program behavior, SMC partitions the interleaving space into equivalence classes and attempts to sample a few representative traces from each class. The most popular approach in this domain is partial-order reduction techniques [11,6,12], which deem interleavings as equivalent based on the way that conflicting memory accesses are ordered, also known as the Mazurkiewicz equivalence [13]. Dynamic partial order reduction [14] constructs this equivalence dynamically, when all memory accesses are known, and thus does not suffer from the imprecision of earlier approaches based on static information. Subsequent works managed to explore the Mazurkiewicz partitioning optimally [15,16], while spending only polynomial time per class.
The performance of an SMC algorithm is generally a product of two factors: (a) the size of the underlying partitioning that is explored, and (b) the total time spent in exploring each class of the partitioning. Typically, the task of visiting a class requires solving a consistency-checking problem, where the algorithm checks whether a semantic abstraction, used to represent the class, has a consistent concrete interleaving that witnesses the class. For this reason, the search for effective SMC is reduced to the search of coarse partitionings for which the consistency problem is tractable, and has become a very active research direction in recent years. In [17], the Mazurkiewicz partitioning was further reduced by ignoring the order of conflicting write events that are not observed, while retaining polynomial-time consistency checking. Various other works refine the notion of dependencies between events, yielding coarser abstractions [18,19,20]. The work of [21] used a reads-from abstraction and showed that the consistency problem admits a fully polynomial solution in acyclic communication topologies. Recently, this approach was generalized to arbitrary topologies, with an algorithm that remains polynomial for a bounded number of threads [22]. Finally, recent approaches define value-centric partitionings [23], as well as partitionings based on maximal causal models [24]. These partitionings are very coarse, as they attempt to distinguish only between traces which differ in the values read by their corresponding read events. We illustrate the benefits of value-based partitionings with a motivating example.

Motivating Example
Consider the simple concurrent program shown in Figure 1. The program has 98 different orderings of the conflicting memory accesses, and each ordering corresponds to a separate class of the Mazurkiewicz partitioning. Utilizing the reads-from abstraction reduces the number of partitioning classes to 9. However, when taking into consideration the values that the events can read and write, the number of cases to consider can be reduced even further. In this specific example, there is only a single behavior the program may exhibit, in which both read events read the only observable value.

Thread1: 1. w(x, 1)  2. w(y, 1)
Thread2: 1. w(x, 1)  2. w(y, 1)  3. r(x)
Thread3: 1. w(x, 1)  2. w(y, 1)  3. r(y)

Equivalence classes: Mazurkiewicz [15]: 98; reads-from [22]: 9; value-centric [23]: 7; this work: 1.
Fig. 1: Concurrent program and its underlying partitioning classes.
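The class counts of Fig. 1 can be reproduced by brute force. The sketch below is illustrative only (not the paper's tooling) and assumes a hypothetical event encoding (thread, index, kind, variable, value); since every write writes the value 1, a single value signature exists.

```python
# Brute-force sketch reproducing the class counts of Fig. 1.
T1 = [(1, 0, 'w', 'x', 1), (1, 1, 'w', 'y', 1)]
T2 = [(2, 0, 'w', 'x', 1), (2, 1, 'w', 'y', 1), (2, 2, 'r', 'x', None)]
T3 = [(3, 0, 'w', 'x', 1), (3, 1, 'w', 'y', 1), (3, 2, 'r', 'y', None)]
threads = {1: T1, 2: T2, 3: T3}
events = T1 + T2 + T3

def interleavings(pos=None, acc=None):
    # enumerate all interleavings that respect per-thread order
    if pos is None:
        pos, acc = {t: 0 for t in threads}, []
    if all(pos[t] == len(threads[t]) for t in threads):
        yield tuple(acc)
        return
    for t in threads:
        if pos[t] < len(threads[t]):
            pos[t] += 1; acc.append(threads[t][pos[t] - 1])
            yield from interleavings(pos, acc)
            acc.pop(); pos[t] -= 1

def conflict(e, f):  # different threads, same variable, one is a write
    return e[0] != f[0] and e[3] == f[3] and 'w' in (e[2], f[2])

maz, rf_sigs, val_sigs = set(), set(), set()
for tr in interleavings():
    idx = {e: i for i, e in enumerate(tr)}
    # Mazurkiewicz signature: orientation of every conflicting pair
    maz.add(frozenset((e, f) for e in events for f in events
                      if conflict(e, f) and idx[e] < idx[f]))
    last, rf, vals = {}, [], []
    for e in tr:
        if e[2] == 'w':
            last[e[3]] = e
        else:                       # read observes the latest write
            rf.append((e, last[e[3]]))
            vals.append((e, last[e[3]][4]))
    rf_sigs.add(tuple(sorted(rf)))
    val_sigs.add(tuple(sorted(vals)))
print(len(maz), len(rf_sigs), len(val_sigs))
```

Running the sketch yields the counts 98, 9 and 1 reported in Fig. 1 for the Mazurkiewicz, reads-from and value signatures, respectively.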
The above benefits have led to recent attempts in performing SMC using a value-based equivalence [24,23]. However, as the realizability problem is NP-hard in general [25], both approaches suffer significant drawbacks. In particular, the work of [23] combines the value-centric approach with the Mazurkiewicz partitioning, which creates a refinement with exponentially many more classes than potentially necessary. The example program in Figure 1 illustrates this: although both read events can only observe one possible value, the work of [23] further enumerates all Mazurkiewicz orderings of all-but-one threads, resulting in 7 partitioning classes. Separately, the work of [24] relies on SMT solvers, thus spending exponential time to solve the realizability problem. Hence, each approach suffers an exponential blow-up a-priori, which motivates the following question: is there an efficient parameterized algorithm for the consistency problem? That is, we are interested in an algorithm that is exponential-time in the worst case (as the problem is NP-hard in general), but efficient when certain natural parameters of the input are small, and thus only becomes slow in extreme cases.
Another disadvantage of these works is that each of the exploration algorithms can end up in the same class of the partitioning many times, further hindering performance. To see an example, consider the program in Figure 1 again. The work of [23] assigns values to reads one by one, and in this example it needs to consider both permutations of the two reads as separate cases for the order in which the values are assigned. This is to ensure completeness in cases where some write events are causally dependent on some read events (e.g., a write event appearing only if its thread-predecessor reads a certain value). However, no causally dependent write events are present in this program, and our work uses a principled approach to detect this and avoid the redundant exploration. An example demonstrating how [24] revisits partitioning classes is somewhat more involved; the phenomenon stems from the lack of information sharing between the spawned subroutines (which, on the positive side, enables the approach to be massively parallelized), and has already been discussed in prior works [21,26,23].

Our Contributions
In this work we tackle the two challenges illustrated in the motivating example in a principled, algorithmic way. In particular, our contributions are as follows.
(1) We study the problem of verifying sequential consistency of executions. The problem is known to be NP-hard [25] in general, already for 3 threads. We show that the problem can be solved in O(k^(d+1) · n^(k+1)) time for an input of n events, k threads and d variables. Thus, although the problem is NP-hard in general, it can be solved in polynomial time when the number of threads and the number of variables are bounded. Moreover, our bound reduces to O(n^(k+1)) for the class of programs in which every variable is written by only one thread (while read by many threads). Hence, in this case the bound is polynomial for a fixed number of threads, without any dependence on the number of variables.
(2) We define a new equivalence between concurrent traces, called the reads-value-from (RVF) equivalence. Intuitively, two traces are RVF-equivalent if they agree on the value obtained in each read event, and read events induce consistent causal orderings between them. We show that RVF induces a coarser partitioning than the partitionings explored by recent well-studied SMC algorithms [15,21,23], and thus reduces the search space of the model checker.
(3) We develop a novel SMC algorithm called RVF-SMC, and show that it is sound and complete for local safety properties such as assertion violations. Moreover, RVF-SMC has complexity k^d · n^(O(k)) · β, where β is the size of the underlying RVF partitioning. Under the hood, RVF-SMC uses our consistency-checking algorithm of Item 1 to visit each RVF class during the exploration. Moreover, RVF-SMC uses a novel heuristic to significantly reduce the number of revisits of any given RVF class, compared to the value-based explorations of [24,23].
(4) We implement RVF-SMC in the stateless model checker Nidhugg [27]. Our experimental evaluation reveals that RVF is quite often a very effective equivalence, as the underlying partitioning is exponentially coarser than other approaches. Moreover, RVF-SMC generates representatives very efficiently, as the reduction in the partitioning is often met with significant speed-ups in the model checking task.

Preliminaries
General notation. Given a natural number i ≥ 1, we let [i] be the set {1, 2, ..., i}. Given a map f : X → Y, we let dom(f) = X denote the domain of f. We represent maps f as sets of tuples {(x, f(x))}_x. Given two maps f1, f2 over the same domain X, we write f1 = f2 to denote that f1(x) = f2(x) for each x ∈ X. Given a set X' ⊆ X, we denote by f|X' the restriction of f to X'. A binary relation ∼ on a set X is an equivalence iff ∼ is reflexive, symmetric and transitive.

Concurrent Model
Here we describe the computational model of concurrent programs with shared memory under the Sequential Consistency (SC) memory model. We follow a standard exposition of stateless model checking, similar to [14,15,21,22,28,23]. Concurrent program. We consider a concurrent program H = {thr_1, ..., thr_k} of k deterministic threads. The threads communicate over a shared memory G of global variables with a finite value domain D. Threads execute events of the following types.
(1) A write event w writes a value v ∈ D to a global variable x ∈ G.
(2) A read event r reads the value v ∈ D of a global variable x ∈ G.
Additionally, threads can execute local events which do not access global variables and thus are not modeled explicitly.
Given an event e, we denote by thr(e) its thread and by var(e) its global variable. We denote by E the set of all events, and by R (resp. W) the set of read (resp. write) events. Given two events e1, e2 ∈ E, we say that they conflict, denoted e1 ⋈ e2, if they access the same global variable and at least one of them is a write event. Concurrent program semantics. The semantics of H are defined by means of a transition system over a state space of global states. A global state consists of (i) a memory function that maps every global variable to a value, and (ii) a local state for each thread, which contains the values of the local variables and the program counter of the thread. We consider the standard setting of Sequential Consistency (SC), and refer to [14] for formal details. As usual, H is execution-bounded, which means that the state space is finite and acyclic. Event sets. Given a set of events X ⊆ E, we write R(X) = X ∩ R for the set of read events of X, and W(X) = X ∩ W for the set of write events of X. Given a set of events X ⊆ E and a thread thr, we denote by X_thr and X_{≠thr} the events of thr in X, and the events of all other threads in X, respectively. Sequences and Traces. Given a sequence of events τ = e1, ..., ej, we denote by E(τ) the set of events that appear in τ. We further denote R(τ) = R(E(τ)) and W(τ) = W(E(τ)).
Given a sequence τ and two events e1, e2 ∈ E(τ), we write e1 <_τ e2 when e1 appears before e2 in τ, and e1 ≤_τ e2 to denote that e1 <_τ e2 or e1 = e2. Given a sequence τ and a set of events A, we denote by τ|A the projection of τ on A, which is the unique subsequence of τ that contains all events of A ∩ E(τ), and only those events. Given a sequence τ and a thread thr, let τ_thr be the subsequence of τ with events of thr, i.e., τ|E(τ)_thr. Given two sequences τ1 and τ2, we denote by τ1 • τ2 the sequence that results from appending τ2 after τ1.
A (concrete, concurrent) trace is a sequence of events σ that corresponds to a concrete valid execution of H. We let enabled(σ) be the set of enabled events after σ is executed, and call σ maximal if enabled(σ) = ∅. As H is bounded, all executions of H are finite, and the length of the longest execution in H is a parameter of the input. Reads-from and value functions. Given a sequence of events τ, we define the reads-from function of τ, denoted RF_τ : R(τ) → W(τ), as follows. Given a read event r ∈ R(τ), RF_τ(r) is the latest write (of any thread) conflicting with r and occurring before r in τ, i.e., (i) RF_τ(r) ⋈ r, (ii) RF_τ(r) <_τ r, and (iii) for each w ∈ W(τ) such that w ⋈ r and w <_τ r, we have w ≤_τ RF_τ(r). We say that r reads-from RF_τ(r) in τ. For simplicity, we assume that H has an initial salient write event on each variable. Further, given a trace σ, we define the value function of σ, denoted val_σ : E(σ) → D, such that val_σ(e) is the value of the global variable var(e) after the prefix of σ up to and including e has been executed. Intuitively, val_σ(e) captures the value that a read (resp. write) event e reads (resp. writes) in σ. The value function val_σ is well-defined as σ is a valid trace and the threads of H are deterministic.
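Both functions can be computed in a single left-to-right pass over a sequence. The sketch below is illustrative only: it assumes a hypothetical event encoding (thread, kind, variable, value), identifies events with their positions in the trace, and assumes every variable is written before it is first read (the salient initial writes can be prepended to the trace).

```python
# Sketch: computing RF_tau and val_sigma in one pass.
def rf_and_val(trace):
    rf, val, last_write = {}, {}, {}
    for i, (thr, kind, var, value) in enumerate(trace):
        if kind == 'w':
            last_write[var] = i   # the latest conflicting write so far
            val[i] = value
        else:                     # a read observes the latest write
            rf[i] = last_write[var]
            val[i] = trace[last_write[var]][3]
    return rf, val

trace = [(1, 'w', 'x', 5), (2, 'r', 'x', None), (1, 'w', 'x', 7)]
rf, val = rf_and_val(trace)
# rf[1] == 0: the read at position 1 reads-from the write at position 0,
# and hence val[1] == 5.
```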

Partial Orders
In this section we present relevant notation around partial orders, which are a central object in this work. Partial orders. Given a set of events X ⊆ E, a (strict) partial order P over X is an irreflexive, antisymmetric and transitive relation <_P ⊆ X × X. Given two events e1, e2 ∈ X, we write e1 ≤_P e2 to denote that e1 <_P e2 or e1 = e2. Two distinct events e1, e2 ∈ X are unordered by P, denoted e1 ∥_P e2, if neither e1 <_P e2 nor e2 <_P e1, and ordered (denoted e1 ∦_P e2) otherwise. Given a set Y ⊆ X, we denote by P|Y the projection of P on the set Y, where for every pair of events e1, e2 ∈ Y, we have e1 <_{P|Y} e2 iff e1 <_P e2. Given two partial orders P and Q over a common set X, we say that Q refines P, denoted Q ⊑ P, if for every pair of events e1, e2 ∈ X, e1 <_P e2 implies e1 <_Q e2. A linearization of P is a total order that refines P. Lower sets. Given a pair (X, P), where X is a set of events and P is a partial order over X, a lower set of (X, P) is a set Y ⊆ X such that for every event e1 ∈ Y and event e2 ∈ X with e2 ≤_P e1, we have e2 ∈ Y. Visible writes. Given a partial order P over a set X and a read event r ∈ R(X), the set of visible writes of r is defined as VisibleW_P(r) = { w ∈ W(X) : (i) w ⋈ r, (ii) not r <_P w, and (iii) for each w' ∈ W(X) with w' ⋈ r, if w <_P w' then not w' <_P r }, i.e., the set of write events w conflicting with r that are not "hidden" from r by P. The program order PO. The program order PO of H is a partial order <_PO ⊆ E × E that defines a fixed order between some pairs of events of the same thread, reflecting the semantics of H.
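The visible-writes definition translates into a direct check. The sketch below is a hypothetical encoding (not from the paper): P is given as a transitively closed set of ordered pairs, and events are tuples (thread, kind, variable).

```python
# Sketch: VisibleW_P(r) by direct check. P is a transitively closed set
# of pairs (a, b) meaning a <_P b; events are (thread, kind, variable).
def visible_writes(P, writes, r):
    def lt(a, b):
        return (a, b) in P
    out = set()
    for w in writes:
        if w[2] != r[2]:                  # must access r's variable
            continue
        if lt(r, w):                      # w ordered after r: not visible
            continue
        hidden = any(w2[2] == r[2] and lt(w, w2) and lt(w2, r)
                     for w2 in writes)    # a conflicting write in between
        if not hidden:
            out.add(w)
    return out

w1, w2 = (1, 'w', 'x'), (2, 'w', 'x')
r = (3, 'r', 'x')
P = {(w1, w2), (w2, r), (w1, r)}          # w1 <_P w2 <_P r
# w1 is "hidden" from r by w2, so only w2 is visible to r
```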
A set of events X ⊆ E is proper if (i) it is a lower set of (E, PO), and (ii) for each thread thr, the events X_thr are totally ordered in PO (i.e., for each distinct e1, e2 ∈ X_thr we have e1 ∦_PO e2). A sequence τ is well-formed if (i) its set of events E(τ) is proper, and (ii) τ respects the program order (formally, τ ⊑ PO|E(τ)). Every trace σ of H is well-formed, as it corresponds to a concrete valid execution of H. Each event of H is then uniquely identified by its PO predecessors, and by the values its PO-predecessor reads have read. Causally-happens-before partial orders. A trace σ induces a causally-happens-before partial order →_σ ⊆ E(σ) × E(σ), which is the weakest partial order such that (i) it refines the program order (i.e., →_σ ⊑ PO|E(σ)), and (ii) for every read event r ∈ R(σ), its reads-from source RF_σ(r) is ordered before it (i.e., RF_σ(r) →_σ r). Intuitively, →_σ contains the causal orderings in σ, i.e., it captures the flow of write events into read events in σ together with the program order. Figure 2 presents an example of a trace and its causal orderings.
Fig. 2: A trace σ; the displayed events E(σ) are vertically ordered as they appear in σ. The solid black edges represent the program order PO. The dashed red edges represent the reads-from function RF_σ. The transitive closure of all the edges then gives the causally-happens-before partial order →_σ.
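As the caption of Figure 2 suggests, →_σ can be computed as the transitive closure of the program-order and reads-from edges. A minimal sketch, assuming events are encoded as (thread, kind, variable), identified by trace position, and totally PO-ordered within each thread:

```python
# Sketch: causally-happens-before as the transitive closure of
# program-order and reads-from edges.
def causal_order(trace):
    edges, last_of_thread, last_write = set(), {}, {}
    for i, (thr, kind, var) in enumerate(trace):
        if thr in last_of_thread:
            edges.add((last_of_thread[thr], i))   # program-order edge
        last_of_thread[thr] = i
        if kind == 'w':
            last_write[var] = i
        else:
            edges.add((last_write[var], i))       # reads-from edge
    n = len(trace)                                # transitive closure
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if (i, k) in edges and (k, j) in edges:
                    edges.add((i, j))
    return edges

trace = [(1, 'w', 'x'), (2, 'r', 'x'), (2, 'w', 'y'), (3, 'r', 'y')]
hb = causal_order(trace)
# the write at position 0 causally precedes the read at position 3
```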

Reads-Value-From Equivalence
In this section we present our new equivalence on traces, called the reads-value-from equivalence (RVF equivalence, or ∼_RVF for short). Then we illustrate that ∼_RVF has some desirable properties for stateless model checking.
Reads-Value-From equivalence. Given two traces σ1 and σ2, we say that they are reads-value-from-equivalent, written σ1 ∼_RVF σ2, if the following hold: (1) the traces consist of the same events, i.e., E(σ1) = E(σ2); (2) they agree on the value of each event, i.e., val_σ1 = val_σ2; and (3) their causal orderings agree on the read events, i.e., →_σ1 |R(σ1) = →_σ2 |R(σ2).
Soundness. The RVF equivalence induces a partitioning on the maximal traces of H. Any algorithm that explores each class of this partitioning provably discovers every reachable local state of every thread, and thus RVF is a sound equivalence for local safety properties, such as assertion violations, in the same spirit as other recent works [22,21,23,24]. This follows from the fact that for any two traces σ1 and σ2 with E(σ1) = E(σ2) and val_σ1 = val_σ2, the local states of each thread are equal after executing σ1 and σ2.
Coarseness. Here we describe the coarseness properties of the RVF equivalence, as compared to other equivalences used by state-of-the-art approaches in stateless model checking. Figure 4 summarizes the comparison.
Consider first the reads-from equivalence [21,22], which deems two traces σ1 and σ2 equivalent if E(σ1) = E(σ2) and RF_σ1 = RF_σ2. These two conditions imply that the induced causally-happens-before partial orders are equal, i.e., →_σ1 = →_σ2, and thus trivially also →_σ1 |R = →_σ2 |R. Further, by a simple inductive argument the value functions of the two traces are also equal, i.e., val_σ1 = val_σ2. Hence any two reads-from-equivalent traces are also RVF-equivalent, which makes the RVF equivalence always at least as coarse as the reads-from equivalence.
The work of [23] utilizes a value-centric equivalence, which deems two traces equivalent if they satisfy all the conditions of our RVF equivalence, and also some further conditions (note that these conditions are necessary for correctness of the SMC algorithm in [23]). Thus the RVF equivalence is trivially always at least as coarse. The value-centric equivalence preselects a single thread thr, and then requires two extra conditions for the traces to be equivalent, namely: (1) For each read of thr, either the read reads-from a write of thr in both traces, or it does not read-from a write of thr in either of the two traces. (2) For each conflicting pair of events not belonging to thr, the ordering of the pair is equal in the two traces.
Both the reads-from equivalence and the value-centric equivalence are in turn as coarse as the data-centric equivalence of [21]. Given two traces, the data-centric equivalence has the equivalence conditions of the reads-from equivalence, and additionally, it preselects a single thread thr (just like the value-centric equivalence) and requires the second extra condition of the value-centric equivalence, i.e., equality of orderings for each conflicting pair of events outside of thr.
Finally, the data-centric equivalence is as coarse as the classical Mazurkiewicz equivalence [13], the baseline equivalence for stateless model checking [14,15,29]. Mazurkiewicz equivalence deems two traces equivalent if they consist of the same set of events and they agree on their ordering of conflicting events.
While RVF is always at least as coarse, it can be even exponentially coarser than each of the other above-mentioned equivalences. We illustrate this in Appendix B. We summarize these observations in the following proposition.
In this work we develop our SMC algorithm RVF-SMC around the RVF equivalence, with the guarantee that the algorithm explores at most one maximal trace per class of the RVF partitioning, and thus can perform significantly fewer steps than algorithms based on the above equivalences. To utilize RVF, the algorithm in each step solves an instance of the verification of sequential consistency problem, which we tackle in the next section. Afterwards, we present RVF-SMC.

Verifying Sequential Consistency
In this section we present our contributions towards the problem of verifying sequential consistency (VSC). We present an algorithm VerifySC for VSC, and we show how it can be efficiently used in stateless model checking.
The VSC problem. Consider an input pair (X, GoodW), where X is a proper set of events and GoodW : R(X) → 2^(W(X)) is a good-writes function that maps each read r ∈ R(X) to a set GoodW(r) of conflicting writes. A witness of (X, GoodW) is a linearization τ of X (i.e., E(τ) = X) respecting the program order (i.e., τ ⊑ PO|X), such that each read r ∈ R(τ) reads-from one of its good-writes in τ, formally RF_τ(r) ∈ GoodW(r) (we then say that τ satisfies the good-writes function GoodW). The task is to decide whether (X, GoodW) has a witness, and to construct one in case it exists. VSC in stateless model checking. The VSC problem naturally ties in with our SMC approach of enumerating the equivalence classes of the RVF trace partitioning. In our approach, we generate instances (X, GoodW) such that (i) each witness σ of (X, GoodW) is a valid program trace, and (ii) all witnesses σ1, σ2 of (X, GoodW) are pairwise RVF-equivalent (σ1 ∼_RVF σ2). Hardness of VSC. Given an input (X, GoodW) to the VSC problem, let n = |X|, let k be the number of threads appearing in X, and let d be the number of variables accessed in X. The classic work of [25] establishes two important lower bounds on the complexity of VSC: (1) VSC is NP-hard even when restricted only to inputs with k = 3.
(2) VSC is NP-hard even when restricted only to inputs with d = 2.
The first bound eliminates the possibility of any algorithm with time complexity O(n^(f(k))), where f is an arbitrary computable function. Similarly, the second bound eliminates algorithms with complexity O(n^(f(d))) for any computable f.
In this work we show that the problem is parameterizable in k + d, and thus admits efficient (polynomial-time) solutions when both parameters are bounded.

Algorithm for VSC
In this section we present our algorithm VerifySC for the problem VSC. First we define some relevant notation. In our definitions we consider a fixed input pair (X, GoodW) to the VSC problem, and a fixed well-formed sequence τ with E(τ) ⊆ X. Active writes. Given a variable x, the active write of x in τ is the last write w ∈ W(τ) with var(w) = x (if any). We can then say that w is the active write of the variable var(w) in τ.
Held variables. Consider a read r ∈ R(X) \ R(τ) all of whose good-writes already appear in τ (i.e., GoodW(r) ⊆ W(τ)), and let x = var(r). In such a case we say that r holds x in τ. Note that several distinct reads may hold a single variable in τ.
Executable events. An event e ∈ X \ E(τ) is executable in τ if E(τ) ∪ {e} is a lower set of (X, PO) and the following hold.
(1) If e is a read, it has an active good-write w ∈ GoodW(e) in τ .
(2) If e is a write, its variable var(e) is not held in τ.
Memory maps. A memory map of τ is a function MMap_τ : G → [k] from global variables to thread indices, where for each variable x ∈ G, the value MMap_τ(x) captures the thread of the active write of x in τ. Witness states. The sequence τ is a witness prefix if the following hold.
(1) Each read r ∈ R(τ) reads-from one of its good-writes in τ, i.e., RF_τ(r) ∈ GoodW(r).
(2) For each r ∈ X \R(τ ) that holds its variable var(r) in τ , one of its good-writes w ∈ GoodW(r) is active in τ .
Intuitively, τ is a witness prefix if it satisfies all VSC requirements modulo its events, and if each read not in τ has at least one good-write still available to read-from in potential extensions of τ. For a witness prefix τ we call its corresponding event set and memory map a witness state.
Fig. 3: Events r_x and w_x are executable in τ. Event r_y is not, as its good-write is not active in τ. Event w_y is also not executable, as its variable y is held by r_y. The memory map of τ is MMap_τ(x) = 1 and MMap_τ(y) = 3. τ is a witness prefix, and E(τ) together with MMap_τ forms its witness state.
Algorithm. We are now ready to describe our algorithm VerifySC; Algorithm 1 presents its pseudocode.

Algorithm 1: VerifySC
Input: Proper event set X and good-writes function GoodW
1: S ← {ε}; Done ← ∅
2: while S ≠ ∅ do
3:     extract a witness prefix τ from S
4:     if E(τ) = X then return τ // All events executed, witness found
5:     foreach event e executable in τ do
6:         τ' ← τ • e
7:         if the witness state of τ' is not in Done then
8:             add τ' to S and its witness state to Done
9: return ⊥ // All reachable witness states processed, no witness exists

We attempt to construct a witness of (X, GoodW) by enumerating the witness states reachable by the following process. We start (Line 1) with the empty sequence as the first witness prefix (and state). We maintain a worklist S of so-far unprocessed witness prefixes, and a set Done of reached witness states. We then iteratively obtain new witness prefixes (and states) by considering an already obtained prefix (Line 3) and extending it with each possible executable event (Line 6). Crucially, when we arrive at a sequence τ • e, we include it only if no sequence with an equal corresponding witness state has been reached yet (Line 7). We stop when we successfully create a witness (Line 4) or when we have processed all reachable witness states (Line 9). Correctness and Complexity. We now highlight the correctness and complexity properties of VerifySC, and refer to Appendix C for the proofs. Soundness follows straightforwardly from the fact that each sequence in S is a witness prefix; this in turn follows from a simple inductive argument that extending a witness prefix with an executable event yields another witness prefix. Completeness follows from the fact that given two witness prefixes τ1 and τ2 with equal induced witness states, these prefixes are "equi-extendable" to a witness. Indeed, if a suffix τ* exists such that τ1 • τ* is a witness of (X, GoodW), then τ2 • τ* is also a witness of (X, GoodW). The time complexity of VerifySC is bounded by O(n^(k+1) · k^(d+1)) for n events, k threads and d variables. The bound follows from the fact that there are at most n^k · k^d pairwise distinct witness states. We thus have the following theorem. Theorem 1. VSC for n events, k threads and d variables is solvable in O(n^(k+1) · k^(d+1)) time. Implications. We now highlight some important implications of Theorem 1.
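The search over witness states can be sketched in a few lines of Python. This is a minimal, illustrative reading of the algorithm, not the actual implementation: events are assumed to be encoded as (tid, idx, kind, var) tuples, the witness state keys on the active write itself rather than only on its thread (slightly finer than the memory maps described above, but sound), and, as a simplifying assumption, a read is taken to hold its variable once all of its good-writes have been executed.

```python
from collections import deque

# Sketch of VerifySC: BFS over witness states with de-duplication.
def verify_sc(threads, goodw):
    tids = sorted(threads)
    total = sum(len(seq) for seq in threads.values())

    def executed(counts):
        return {e for i, t in enumerate(tids) for e in threads[t][:counts[i]]}

    def holds(done, var):
        # a pending read holds var once all its good-writes are executed:
        # a further write to var would bury its last remaining option
        return any(r not in done and r[3] == var and set(goodw[r]) <= done
                   for r in goodw)

    def executable(counts, active):
        done = executed(counts)
        for i, t in enumerate(tids):
            if counts[i] == len(threads[t]):
                continue
            e = threads[t][counts[i]]
            if e[2] == 'r' and active.get(e[3]) in goodw[e]:
                yield i, e               # read with an active good-write
            elif e[2] == 'w' and not holds(done, e[3]):
                yield i, e               # write to an unheld variable

    start = (tuple(0 for _ in tids), ())
    work, seen = deque([(start, [])]), {start}
    while work:
        (counts, act), trace = work.popleft()
        if len(trace) == total:
            return trace                 # witness found
        for i, e in executable(counts, dict(act)):
            c2 = counts[:i] + (counts[i] + 1,) + counts[i + 1:]
            a2 = dict(act)
            if e[2] == 'w':
                a2[e[3]] = e
            state = (c2, tuple(sorted(a2.items())))
            if state not in seen:
                seen.add(state)
                work.append((state, trace + [e]))
    return None                          # no witness exists

threads = {1: [(1, 0, 'w', 'x')], 2: [(2, 0, 'r', 'x')]}
good = {(2, 0, 'r', 'x'): {(1, 0, 'w', 'x')}}
witness = verify_sc(threads, good)
# witness == [(1, 0, 'w', 'x'), (2, 0, 'r', 'x')]
```

Keying the `seen` set on (per-thread counts, active writes) is what bounds the enumeration by the number of witness states rather than the number of interleavings.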
Although VSC is NP-hard [25], the theorem shows that the problem is parameterizable in k + d, and thus solvable in polynomial time when both parameters are bounded. Moreover, even when only k is bounded, the problem is fixed-parameter tractable in d, meaning that d only exponentiates a constant as opposed to n (e.g., we retain a polynomial bound even when d = O(log n)). Finally, the algorithm is polynomial for a fixed number of threads regardless of d when every memory location is written by only one thread (e.g., in producer-consumer settings, or in the concurrent-read-exclusive-write (CREW) concurrency model). These important facts brought forward by Theorem 1 indicate that VSC is likely to be efficiently solvable in many practical settings, which in turn makes RVF a good equivalence for SMC.

Practical heuristics for VerifySC in SMC
We now turn our attention to some practical heuristics that are expected to further improve the performance of VerifySC in the context of SMC. 1. Limiting the Search Space. We employ two straightforward improvements to VerifySC that significantly reduce the search space in practice. Consider the for-loop in Line 5 of Algorithm 1 enumerating the possible extensions of τ . This enumeration can be sidestepped by the following two greedy approaches.
(1) If there is a read r executable in τ, then extend τ with r and do not enumerate other options. (2) Let w be an active write in τ such that w is not a good-write of any r ∈ R(X) \ E(τ), and let w' ∈ W(X) \ E(τ) be an unexecuted write of the same variable (var(w') = var(w)); note that w' is executable in τ. If w' is also not a good-write of any r ∈ R(X) \ E(τ), then extend τ with w' and do not enumerate other options.
The enumeration of Line 5 then proceeds only if neither of the above two techniques can be applied for τ . This extension of VerifySC preserves completeness (not only when used during SMC, but in general), and it can be significantly faster in practice. For clarity of presentation we do not fully formalize this extended version, as its worst-case complexity remains the same. 2. Closure. We introduce closure, a low-cost filter for early detection of VSC instances (X, GoodW) with no witness. The notion of closure, its beneficial properties and construction algorithms are well-studied for the reads-from consistency verification problems [21,22,30], i.e., problems where a desired reads-from function is provided as input instead of a desired good-writes function GoodW. Further, the work of [23] studies closure with respect to a good-writes function, but only for partial orders of Mazurkiewicz width 2 (i.e., for partial orders with no triplet of pairwise conflicting and pairwise unordered events). Here we define closure for all good-writes instances (X, GoodW), with the underlying partial order (in our case, the program order PO) of arbitrary Mazurkiewicz width.
Given a VSC instance (X, GoodW), its closure P is the weakest partial order over X that refines the program order (P ⊑ PO|X) and further satisfies the following conditions. Given a read r ∈ R(X), let Cl(r) = GoodW(r) ∩ VisibleW_P(r). The following must hold.
(3) If (Cl(r), P|Cl(r)) has a greatest element w, then for each w' ∈ W(X) \ GoodW(r) with r ⋈ w', if w' <_P r then w' <_P w. (4) For each w ∈ W(X) \ GoodW(r) with r ⋈ w, if each w' ∈ Cl(r) satisfies w' <_P w, then we have r <_P w.
Finally, we explain how closure is used by VerifySC. Given an input (X, GoodW), the closure procedure is carried out before VerifySC is called. Once the closure P of (X, GoodW) is constructed, since each solution of VSC(X, GoodW) has to refine P, we restrict VerifySC to only consider sequences refining P. This is ensured by an extra condition in Line 5 of Algorithm 1, where we proceed with an event e only if it is minimal in P restricted to the events not yet in the sequence. This preserves completeness, while further reducing the search space that VerifySC considers. 3. VerifySC guided by auxiliary trace. In our SMC approach, each time we generate a VSC instance (X, GoodW), we additionally have available an auxiliary trace σ. In σ, either all, or all-but-one, of the good-writes conditions of GoodW are satisfied. If all good-writes in GoodW are satisfied, we already have σ as a witness of (X, GoodW), and hence we do not need to run VerifySC at all. On the other hand, if all-but-one are satisfied, we use σ to guide the search of VerifySC, as described below.
We guide the search by deciding the order in which we process the sequences of the worklist S in Algorithm 1, using the auxiliary trace σ with E(σ) = X. We use S as a last-in-first-out stack; this way we search for a witness in a depth-first fashion. Then, in Line 5 of Algorithm 1, we enumerate the extension events in the reverse order of how they appear in σ. We enumerate in reverse order because each resulting extension is pushed onto our worklist S, which is a stack (last-in-first-out). As a result, in Line 3 of the subsequent iterations of the main while-loop, we pop extensions from S in the order induced by σ.
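The effect of the reverse push can be seen with a toy example (placeholder event names):

```python
# Pushing extensions in reverse of their order in the auxiliary trace
# makes the LIFO worklist pop them back in auxiliary-trace order.
stack = []
extensions = ['e1', 'e2', 'e3']   # order of appearance in the trace
for e in reversed(extensions):
    stack.append(e)
popped = [stack.pop() for _ in range(len(extensions))]
# popped == ['e1', 'e2', 'e3']
```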

Stateless Model Checking
We are now ready to present our SMC algorithm RVF-SMC that uses RVF to model check a concurrent program. RVF-SMC is a sound and complete algorithm for local safety properties, i.e., it is guaranteed to discover all local states that each thread visits.
RVF-SMC is a recursive algorithm. Each recursive call of RVF-SMC takes as argument a tuple (X, GoodW, σ, C) where: (1) X is a proper set of events. (2) GoodW is a good-writes function for the reads of X.
(3) σ is a valid trace that is a witness of (X, GoodW).
(4) C : R → Threads → N is a partial function called causal map that tracks implicitly, for each read r, the writes that have already been considered as reads-from sources of r.
Further, we maintain a function ancestors : R(X) → {true, false}, where for each read r ∈ R(X), ancestors(r) stores a boolean backtrack signal for r. We now provide details on the notions of causal maps and backtrack signals.
Causal maps. The causal map C serves to ensure that no more than one maximal trace is explored per class of the RVF partitioning. Given a read r ∈ enabled(σ) enabled in a trace σ, we define forbids C σ (r) as the set of writes in σ that C forbids r to read-from. Formally, forbids C σ (r) = ∅ if r ∉ dom(C), and otherwise forbids C σ (r) = {w ∈ W(σ) | w is within the first C(r)(thr(w)) events of σ thr(w) }. We say that a trace σ satisfies C if for each r ∈ R(σ) we have RF σ (r) ∉ forbids C σ (r).
Backtrack signals. Each call of RVF-SMC (with its GoodW) operates with a trace σ satisfying GoodW that has only reads as enabled events. Consider one of those enabled reads r ∈ enabled(σ). Each maximal trace satisfying GoodW contains r, and further, one of the following two cases holds: (1) In all maximal traces σ' satisfying GoodW, r reads-from some write of W(σ) in σ'. (2) There exists a maximal trace σ' satisfying GoodW such that r reads-from a write not in W(σ) in σ'.
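The definitions of forbids and of C-satisfaction can be rendered as a short sketch (our own simplified encoding: threads map to lists of their events, a write is a pair ('w', name), and C(r) is a per-thread event count):

```python
def forbids(C, r, trace_by_thread):
    """Writes that the causal map C forbids the read r to observe:
    for each thread t, the writes among the first C[r][t] events of t.
    If C is undefined on r, nothing is forbidden.  Our own sketch of
    the definition in the text."""
    if r not in C:
        return set()
    out = set()
    for t, events in trace_by_thread.items():
        for e in events[:C[r].get(t, 0)]:   # prefix of thread t covered by C
            if e[0] == 'w':
                out.add(e)
    return out

def satisfies(C, rf, trace_by_thread):
    """A trace (given via its reads-from map `rf`) satisfies C iff no
    read observes a forbidden write."""
    return all(w not in forbids(C, r, trace_by_thread) for r, w in rf.items())
```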
Whenever we can prove that the first above case is true for r, we can use this fact to prune away some recursive calls of RVF-SMC while maintaining completeness. Specifically, we leverage the following crucial lemma.
Lemma 1. Consider a call RVF-SMC(X, GoodW, σ, C) and a trace σ̄ extending σ maximally such that no event of the extension is a read. Let r ∈ enabled(σ̄) such that r ∉ dom(C). If there exists a trace σ' that (i) satisfies GoodW and C, and (ii) contains r with RF σ' (r) ∉ W(σ̄), then there exists a trace σ'' that (i) satisfies GoodW and C, (ii) contains r with RF σ'' (r) ∈ W(σ̄), and (iii) contains a write w ∉ W(σ̄) conflicting with r and with thr(r) ≠ thr(w).
We then compute a boolean backtrack signal for a given RVF-SMC call and read r ∈ enabled(σ̄) to capture satisfaction of the consequent of Lemma 1. If the computed backtrack signal is false, we can safely stop the RVF-SMC exploration of this specific call and backtrack to its recursion parent.
Algorithm. We are now ready to describe our algorithm RVF-SMC in detail. Algorithm 2 captures the pseudocode of RVF-SMC(X, GoodW, σ, C). First, in Line 1 we extend σ to σ̄ maximally such that no event of the extension is a read. Then in Lines 2-5 we update the backtrack signals for ancestors of our current recursion call. After this, in Lines 6-11 we construct a sequence of the reads enabled in σ̄. Finally, we proceed with the main while-loop in Line 13. In each while-loop iteration we process an enabled read r (Line 14), and we perform no more while-loop iterations in case we receive a false backtrack signal for r. When processing r, we first collect its viable reads-from sources in Line 17, then we group the sources by the value they write in Line 18, and then in iterations of the for-loop in Line 19 we consider each value group. In Line 20 we form the event set, and in Line 21 we form the good-writes function that designates the value group as the good-writes of r. In Line 22 we use VerifySC to generate a witness, and in case it exists, we recursively call RVF-SMC in Line 26 with the newly obtained events, good-writes constraint for r, and witness.
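The per-read processing of Lines 17-26 — collect the sources, group them by written value, and recurse once per value group that admits a witness — can be sketched as follows (a simplified skeleton with illustrative names; `verify` stands in for VerifySC and `recurse` for the recursive RVF-SMC call):

```python
from collections import defaultdict

def explore(read, sources, verify, recurse):
    """One iteration of the main loop described above, in our own
    simplified form: group the candidate reads-from sources of `read`
    by the value they write, and for each value group admitting a
    witness (checked by `verify`), recurse with the group installed as
    the good-writes set of `read`.  `sources` is a list of
    (write_event, value) pairs; `verify` returns a witness trace or
    None; the names are illustrative, not the paper's."""
    by_value = defaultdict(list)
    for w, v in sources:
        by_value[v].append(w)              # group sources by written value
    explored = []
    for v, group in by_value.items():
        witness = verify(read, group)      # VerifySC on the new instance
        if witness is not None:
            recurse(read, group, witness)  # recursive RVF-SMC call
            explored.append(v)
    return explored
```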
To preserve the completeness of RVF-SMC, the backtrack-signals technique can be utilized only for reads r with undefined causal map, i.e., r ∉ dom(C) (cf. Lemma 1). The order of the enabled reads imposed by Lines 6-11 ensures that in subsequent iterations of the loop in Line 13 we first consider all the reads where we can utilize the backtrack signals. This is an insightful heuristic that often helps in practice, though it does not improve the worst-case complexity.
Example. Figure 6 displays a simple concurrent program on the left, and its corresponding RVF-SMC (Algorithm 2) run on the right. In the figure, below each circle is its corresponding event set E(σ̄) and the enabled reads (dashed grey); writes with green background are good-writes (GoodW) of the corresponding-variable read; writes with red background are forbidden by C for the corresponding-variable read; dashed arrows represent recursive calls.
We start with RVF-SMC(∅, ∅, ε, ∅) (A). By performing the extension (Line 1) we obtain the events and enabled reads as shown below (A). First we process read r 1 (Line 14). The read can read-from w 1 and w 3 ; both write the same value, so they are grouped together as the good-writes of r 1 . A witness is found and a recursive call to (B) is performed. In (B), the only enabled event is r 2 . It can read-from w 2 and w 4 ; both write the same value, so they are grouped for r 2 . A witness is found, a recursive call to (C) is performed, and (C) concludes with a maximal trace. Crucially, in (C) the event w 5 is discovered, and since it is a potential new reads-from source for r 1 , a backtrack signal is sent to (A). Hence after RVF-SMC backtracks to (A), in (A) it needs to perform another iteration of the Line 13 while-loop. In (A), first the causal map C is updated to forbid w 1 and w 3 for r 1 . Then, read r 2 is processed from (A), creating (D). In (D), r 1 is the only enabled event, and w 5 is its only C-allowed write. This results in (E), which reports a maximal trace. The algorithm backtracks and concludes, reporting two maximal traces in total.
Novelties of the exploration. Here we highlight some key aspects of RVF-SMC. First, we note that RVF-SMC constructs the traces incrementally with each recursion step, as opposed to other approaches such as [15,22] that always work with maximal traces. The reason for incremental traces is technical and has to do with the value-based treatment of the RVF partitioning. We note that the other two value-based approaches [24,23] also operate with incremental traces. However, RVF-SMC brings certain novelties compared to these two methods. First, the exploration algorithm of [24] can visit the same class of the partitioning (and even the same trace) an exponential number of times through different recursion branches, leading to significant performance degradation. The exploration algorithm of [23] alleviates this issue using the causal map data structure, similar to our algorithm. The causal map data structure provably limits the number of revisits to polynomial (for a fixed number of threads), and although this is an improvement over exponentially many revisits, it can still affect performance. To further improve performance in this work, our algorithm combines causal maps with a new technique, the backtrack signals.
Causal maps and backtrack signals together are very effective in preventing different branches of the recursion from visiting the same RVF class.
Beyond the RVF partitioning. While RVF-SMC explores the RVF partitioning in the worst case, in practice it often operates on a partitioning coarser than the one induced by the RVF equivalence. Specifically, RVF-SMC may treat two traces σ 1 and σ 2 with the same events (E(σ 1 ) = E(σ 2 )) and value function (val σ1 = val σ2 ) as equivalent even when they differ in some causal orderings (→ σ1 |R ≠ → σ2 |R). To see an example of this, consider the program and the RVF-SMC run in Figure 6. The recursion node (C) spans all traces where (i) r 1 reads-from either w 1 or w 3 , and (ii) r 2 reads-from either w 2 or w 4 . Consider two such traces σ 1 and σ 2 , with RF σ1 (r 2 ) = w 2 and RF σ2 (r 2 ) = w 4 . We have r 1 → σ1 r 2 but not r 1 → σ2 r 2 , and yet σ 1 and σ 2 are (soundly) considered equivalent by RVF-SMC. Hence the RVF partitioning is used to upper-bound the time complexity of RVF-SMC. We remark that the algorithm is always sound, i.e., it is guaranteed to discover all thread states even when it does not explore the RVF partitioning in full.

Experiments
In this section we describe the experimental evaluation of our SMC approach RVF-SMC. We have implemented RVF-SMC as an extension in Nidhugg [27], a state-of-the-art stateless model checker for multithreaded C/C++ programs that operates on LLVM Intermediate Representation. First we assess the advantages of utilizing the RVF equivalence in SMC as compared to other trace equivalences. Then we perform ablation studies to demonstrate the impact of the backtrack signals technique (cf. Section 5) and the VerifySC heuristics (cf. Section 4.2).
In our experiments we compare RVF-SMC with several state-of-the-art SMC tools utilizing different trace equivalences. First we consider VC-DPOR [23], the SMC approach operating on the value-centric equivalence. Then we consider Nidhugg/rfsc [22], the SMC algorithm utilizing the reads-from equivalence. Further we consider DC-DPOR [21], which operates on the data-centric equivalence, and finally we compare with Nidhugg/source [15], which utilizes the Mazurkiewicz equivalence. The works of [22] and [31] in turn compare the Nidhugg/rfsc algorithm with additional SMC tools, namely GenMC [28] (with reads-from equivalence), RCMC [29] (with Mazurkiewicz equivalence), and CDSChecker [32] (with Mazurkiewicz equivalence), and thus we omit those tools from our evaluation.
There are two main objectives to our evaluation. First, from Section 3 we know that the RVF equivalence can be up to exponentially coarser than the other equivalences, and we want to discover how often this happens in practice. Second, in cases where RVF does provide a reduction in the trace-partitioning size, we aim to see whether this reduction is accompanied by a reduction in the runtime of RVF-SMC operating on the RVF equivalence.
Setup. We consider 119 benchmarks in total in our evaluation. Each benchmark comes with a scaling parameter, called the unroll bound. The parameter controls the bound on the number of iterations in all loops of the benchmark. For each benchmark and unroll bound, we capture the number of explored maximal traces, and the total running time, subject to a timeout of one hour. In Appendix E we provide further details on our setup.

Results.
We provide a number of scatter plots summarizing the comparison of RVF-SMC with other state-of-the-art tools. In Figure 7, Figure 8, Figure 9 and Figure 10 we provide comparisons in both runtimes and explored traces, for VC-DPOR, Nidhugg/rfsc, DC-DPOR, and Nidhugg/source, respectively. In each scatter plot, both axes are log-scaled, the opaque red line represents equality, and the two semi-transparent lines represent an order-of-magnitude difference. The points are colored green when RVF-SMC achieves trace reduction in the underlying benchmark, and blue otherwise.
Discussion: Significant trace reduction. In Table 1 we provide the results for several benchmarks where RVF achieves significant reduction in the trace-partitioning size. This is typically accompanied by significant runtime reduction, allowing us to scale the benchmarks to unroll bounds that other tools cannot handle. Examples of this are 27 Boop4 and scull loop, two toy Linux kernel drivers.
In several benchmarks the number of explored traces remains the same for RVF-SMC even when scaling up the unroll bound; see 45 monabsex1, reorder 5 and singleton in Table 1. The singleton example is further interesting, in that while VC-DPOR and DC-DPOR also explore few traces, they still suffer in runtime due to additional redundant exploration, as described in Sections 1 and 5.
Discussion: Little-to-no trace reduction. Table 2 presents several benchmarks where the RVF partitioning achieves little-to-no reduction. In these cases the well-engineered Nidhugg/rfsc and Nidhugg/source dominate the runtime.
RVF-SMC ablation studies. Here we demonstrate the effect of our RVF-SMC algorithm utilizing the approach of backtrack signals (see Section 5) and the heuristics of VerifySC (see Section 4.2). These techniques have no effect on the number of explored traces, thus we focus on the runtime.
Table 1: Benchmarks with trace reduction achieved by RVF-SMC. The unroll bound is shown in the column U. Symbol "-" indicates a one-hour timeout. Bold-font entries indicate the smallest numbers for the respective benchmark and unroll.
The left plot of Figure 11 compares RVF-SMC as is with a RVF-SMC version that does not utilize the backtrack signals (achieved by simply keeping the backtrack flag in Algorithm 2 always true). The right plot of Figure 11 compares RVF-SMC as is with a RVF-SMC version that employs VerifySC without the closure and auxiliary-trace heuristics. We can see that the techniques almost always result in improved runtime. The improvement is mostly within an order of magnitude, and in a few cases there is a several-orders-of-magnitude improvement.
Table 2: Benchmarks with little-to-no trace reduction by RVF-SMC. Symbol † indicates that a particular benchmark operation is not handled by the tool.
Finally, in Figure 12 we illustrate how much time during RVF-SMC is typically spent on VerifySC (i.e., on solving VSC instances generated during RVF-SMC).

Conclusions
In this work we developed RVF-SMC, a new SMC algorithm for the verification of concurrent programs using a novel equivalence called reads-value-from (RVF). On our way to RVF-SMC, we have revisited the famous VSC problem [25]. Despite its NP-hardness, we have shown that the problem is parameterizable in k+d (for k threads and d variables), and becomes even fixed-parameter tractable in d when k is constant. Moreover, we have developed practical heuristics that solve the problem efficiently in many practical settings.
Our RVF-SMC algorithm couples our solution for VSC to a novel exploration of the underlying RVF partitioning, and is able to model check many concurrent programs where previous approaches time-out. Our experimental evaluation reveals that RVF is very often the most effective equivalence, as the underlying partitioning is exponentially coarser than other approaches. Moreover, RVF-SMC generates representatives very efficiently, as the reduction in the partitioning is often met with significant speed-ups in the model checking task. Interesting future work includes further improvements over the VSC, as well as extensions of RVF-SMC to relaxed memory models.

A Extensions of the concurrent model
For presentation clarity, in our exposition we considered a simple concurrent model with a static set of threads, and with only read and write events. Here we describe how our approach handles the following extensions of the concurrent model: (1) Read-modify-write and compare-and-swap events. (2) Mutex acquire and release events.
(3) Spawn and join events for dynamic thread creation.
Read-modify-write and compare-and-swap events. We model a read-modify-write atomic operation on a variable x as a pair of two events rmw r and rmw w , where rmw r is a read event of x, rmw w is a write event of x, and for each trace σ either the events are both not present in σ, or they are both present and appear together in σ (rmw r immediately followed by rmw w in σ). We model a compare-and-swap atomic operation similarly, obtaining a pair of events cas r and cas w . In addition we consider a local event happening immediately after the read event cas r , evaluating the "compare" condition of the compare-and-swap instruction. Thus, in traces σ that contain cas r where the "compare" condition evaluates to true, we have that cas r is immediately followed by cas w in σ. In traces σ that contain cas r where the "compare" condition evaluates to false, cas w is not present in σ.
We now discuss our extension of VerifySC to handle the VSC(X, GoodW) problem (Section 4) in the presence of read-modify-write and compare-and-swap events. First, observe that as the event set X and the good-writes function GoodW are fixed, we possess the information on whether each compare-and-swap instruction satisfies its "compare" condition or not. Then, in case we have in our event set a read-modify-write event pair e 1 = rmw r and e 2 = rmw w (resp. a compare-and-swap event pair e 1 = cas r and e 2 = cas w ), we proceed as follows. When the first of the two events e 1 becomes executable in Line 5 of Algorithm 1 for τ , we proceed only in case e 2 is also executable in τ • e 1 , and in such a case in Line 6 we consider straight away the sequence τ • e 1 • e 2 . This ensures that in all sequences we consider, the event pair of the read-modify-write (resp. compare-and-swap) appears as one event immediately followed by the other.
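A minimal sketch of this rule (our own rendering, not the tool's code): the read part of a pair is only ever appended together with its write part.

```python
def extend_with_pair(tau, e1, e2, executable):
    """Treat the pair (e1, e2) of a read-modify-write (or the read and
    write part of a successful compare-and-swap) as one unit, as
    described above: e1 is appended only if e2 is executable right
    after it, and then both are appended together.  `executable(seq, e)`
    stands in for the usual executability check of VerifySC."""
    if executable(tau, e1) and executable(tau + [e1], e2):
        return tau + [e1, e2]      # the pair appears back-to-back
    return None                    # never consider e1 on its own
```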
In the presence of read-modify-write and compare-and-swap events, the SMC approach RVF-SMC can be utilized as presented in Section 5, after an additional corner case is handled for backtrack signals. Specifically, when processing the extension events in Line 2 of Algorithm 2, we additionally process in the same fashion reads cas r enabled in σ̄ that are part of a compare-and-swap instruction. These reads cas r are then treated as potential novel reads-from sources for ancestor mutations cas r * ∈ dom(ancestors) (Line 4), where cas r * is also the read-part of a compare-and-swap instruction.
Mutex events. Mutex acquire and release events are naturally handled by our approach as follows. We consider each lock-release event release as a write event and each lock-acquire event acquire as a read event; the corresponding unique mutex they access is considered a global variable of G.
In SMC, we enumerate good-writes functions whose domain also includes the lock-acquire events. Further, a good-writes set of each lock-acquire admits only a single conflicting lock-release event, thus obtaining constraints of the form GoodW(acquire) = {release}. During closure (Section 4.2), given GoodW(acquire) = {release}, we consider the following condition: thr(acquire) ≠ thr(release) implies release < P acquire. Thus P totally orders the critical sections of each mutex, and therefore VerifySC does not need to take additional care of mutexes. Indeed, respecting P trivially solves all GoodW constraints of lock-acquire events, and further preserves the property that no thread tries to acquire an already acquired (and so-far unreleased) mutex. No modifications to the RVF-SMC algorithm are needed to incorporate mutex events.
Dynamic thread creation. For simplicity of presentation, we assumed a static set of threads for a given concurrent program. However, our approach straightforwardly handles dynamic thread creation, by including in the program order PO the orderings naturally induced by spawn and join events. In our experiments, all our considered benchmarks spawn threads dynamically.
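The mutex encoding can be sketched as follows (our own rendering, assuming the match between acquires and releases is already fixed by the enumerated good-writes function): each acquire behaves as a read whose only good write is its matching release, and a release from a different thread is ordered before its acquire in P.

```python
def mutex_constraints(acquires, matching_release, thr):
    """Encode mutex events as described above: each acquire gets the
    singleton good-writes set {its matching release}, and when acquire
    and release come from different threads, the ordering
    release < P acquire is added to the closure.  Returns
    (good_writes, orderings); names are illustrative."""
    good = {}
    order = set()
    for a in acquires:
        rel = matching_release[a]
        good[a] = {rel}                # GoodW(acquire) = {release}
        if thr[a] != thr[rel]:
            order.add((rel, a))        # release < P acquire
    return good, order
```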

B Details of Section 3
Consider the simple programs of Figure 13. In each program, all traces of the program are pairwise RVF-equivalent, while the other equivalences induce exponentially many inequivalent traces.
Figure 13: Programs with one RVF-equivalence class, and Ω(2^n) equivalence classes for the reads-from, value-centric, data-centric, and Mazurkiewicz equivalences.

C Details of Section 4
Here we present the proof of Theorem 1.
Proof. We argue separately about soundness, completeness, and complexity of VerifySC (Algorithm 1).
Soundness. We prove by induction that each sequence in the worklist S is a witness prefix. The base case with an empty sequence trivially holds. For the inductive case, observe that extending a witness prefix τ with an event e executable in τ yields a witness prefix. Indeed, if e is a read, it has an active good-write in τ , thus its good-writes condition is satisfied in τ • e. If e is a write, new reads r may start holding the variable var(e) in τ • e, but for all these reads e is their good-write, and it shall be active in τ • e. Hence the soundness follows.
Completeness. First notice that for each witness τ of VSC(X, GoodW), each prefix of τ is a witness prefix. What remains to prove is that given two witness prefixes τ 1 and τ 2 with equal induced witness state, if a suffix exists that extends τ 1 to a witness of VSC(X, GoodW), then such a suffix also exists for τ 2 . Note that since τ 1 and τ 2 have an equal witness state, their lengths are equal too (since E(τ 1 ) = E(τ 2 )). We thus prove the argument by induction with respect to |X \ E(τ 1 )|, i.e., the number of events remaining to be added to τ 1 resp. τ 2 . The base case with |X \ E(τ 1 )| = 0 is trivially satisfied. For the inductive case, let there be an arbitrary suffix τ * such that τ 1 • τ * is a witness of VSC(X, GoodW). Let e be the first event of τ * ; we have that τ 1 • e is a witness prefix. Note that τ 2 • e is also a witness prefix. Indeed, if e is a read, the equality of the memory maps MMap τ1 and MMap τ2 implies that since e reads a good-write in τ 1 • e, it also reads the same good-write in τ 2 • e. If e is a write, since E(τ 1 ) = E(τ 2 ), each read either holds its variable in both τ 1 and τ 2 or in neither of them. Finally, observe that MMap τ1•e = MMap τ2•e : we have MMap τ1 = MMap τ2 ; if e is a read, neither memory map changes, and if e is a write, the only change compared to MMap τ1 and MMap τ2 is that MMap τ1•e (var(e)) = MMap τ2•e (var(e)) = thr(e). Hence τ 1 • e and τ 2 • e are both witness prefixes with the same induced witness state, and we can apply our induction hypothesis.
Complexity. There are at most n^k · k^d pairwise distinct witness states, since the number of different lower sets of (X, PO) is bounded by n^k, and the number of different memory maps is bounded by k^d. Hence we have a bound of n^k · k^d on the number of iterations of the main while-loop in Line 2. Further, each iteration of the main while-loop spends O(n · k) time. Indeed, there are at most k iterations of the for-loop in Line 5; in each iteration it takes O(n) time to check whether the event is executable, and the other operations take constant time (manipulating Done in Line 7 and Line 8 takes amortized constant time with hash sets).
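The witness state that drives this counting argument — one executed-event counter per thread (which determines the lower set) together with the memory map — can be sketched as follows (our own encoding with events as (tid, idx, kind, var) tuples). Keying the Done set of Algorithm 1 by this state is exactly what caps the while-loop at the stated bound.

```python
def witness_state(seq, threads):
    """The induced witness state of a prefix, per the proof above: how
    many events of each thread have been executed (these k counters
    determine a lower set of the program order) together with the
    memory map (thread of the last write per variable).  Two prefixes
    with equal state are interchangeable, so deduplicating on this key
    bounds the search.  Our simplified encoding."""
    counts = {t: 0 for t in threads}
    mmap = {}
    for e in seq:
        counts[e[0]] += 1                  # one counter per thread
        if e[2] == 'w':
            mmap[e[3]] = e[0]              # thread of last write to var(e)
    return (tuple(sorted(counts.items())), tuple(sorted(mmap.items())))
```

Two interleavings of the same events that agree on the last writer of each variable hash to the same state, even though the sequences differ.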

D Details of Section 5
In this section we present the proofs of Lemma 1 and Theorem 2. We first prove Lemma 1, and then we refer to it when proving Theorem 2.
Lemma 1. Consider a call RVF-SMC(X, GoodW, σ, C) and a trace σ̄ extending σ maximally such that no event of the extension is a read. Let r ∈ enabled(σ̄) such that r ∉ dom(C). If there exists a trace σ' that (i) satisfies GoodW and C, and (ii) contains r with RF σ' (r) ∉ W(σ̄), then there exists a trace σ'' that (i) satisfies GoodW and C, (ii) contains r with RF σ'' (r) ∈ W(σ̄), and (iii) contains a write w ∉ W(σ̄) conflicting with r and with thr(r) ≠ thr(w).
Proof. We prove the statement by a sequence of reasoning steps.
(1) Let S be the set of writes w * ∉ W(σ̄) such that there exists a trace σ * that (i) satisfies GoodW and C, and (ii) contains r with RF σ * (r) = w * . Observe that RF σ' (r) ∈ S, hence S is nonempty.
(2) Let Y be the set of events containing r and the causal future of r in σ', and let σ'' = σ'|(E(σ') \ Y ) be the subsequence of σ' containing all events except Y . Then σ'' is a valid trace, and from Y ∩ E(σ̄) = ∅ we get that σ'' satisfies GoodW and C.
(3) Let T 1 be the set of all partial traces that (i) contain all of E(σ̄), (ii) satisfy GoodW and C, (iii) do not contain r, and (iv) contain some w * ∈ S. T 1 is nonempty due to (2).
(4) Let T 2 ⊆ T 1 be the traces σ * of T 1 where for each w * ∈ S ∩ E(σ * ), we have that w * is the last event of its thread in σ * . The set T 2 is nonempty: since w * ∉ W(σ̄), the events of its causal future in σ * are also not in E(σ̄), and thus they are not good-writes to any read in GoodW.
(5) Let T 3 ⊆ T 2 be the traces of T 2 with the least number of read events in total. Trivially T 3 is nonempty. Further note that in each trace σ * ∈ T 3 , no read reads-from any write w * ∈ S ∩ E(σ * ). Indeed, such a write can only be read-from by reads r * outside of E(σ̄) (traces of T 3 satisfy GoodW). Further, events of the causal future of such reads r * are not good-writes to any read in GoodW (they are all outside of E(σ̄)). Thus the presence of r * violates the property of having the least number of read events in total.
(6) Let σ 1 be an arbitrary partial trace from T 3 . Let S 1 = S ∩ E(σ 1 ); by (3) we have that S 1 is nonempty. Let σ 2 = σ 1 |(E(σ 1 ) \ S 1 ). Note that σ 2 is a valid trace, as each w * ∈ S 1 is by (4) the last event of its thread, and by (5) it is not read-from by any read in σ 1 .
(7) Since r ∈ enabled(σ̄) and E(σ̄) ⊆ E(σ 2 ) and r ∉ E(σ 2 ), we have that r ∈ enabled(σ 2 ). Let w * ∈ S 1 be arbitrary; by the previous step we have w * ∈ enabled(σ 2 ). Now consider σ = σ 2 • r • w * .
Notice that (i) σ satisfies GoodW and C, (ii) RF σ (r) ∈ W(σ̄) (there is no write outside of W(σ̄) present in σ 2 ), and (iii) for w * ∈ E(σ), since w * ∈ S we have that w * conflicts with r and thr(r) ≠ thr(w * ).
Now we are ready to prove Theorem 2.
Proof. We argue separately about soundness, completeness, and complexity.
Soundness. The soundness of RVF-SMC follows from the soundness of VerifySC used as a subroutine to generate traces that RVF-SMC considers.
Completeness. Let nd = RVF-SMC(X, GoodW, σ, C) be an arbitrary recursion node of RVF-SMC. Let σ' be an arbitrary valid full program trace satisfying GoodW and C. The goal is to prove that the exploration rooted at nd explores a good-writes function GoodW' : R(σ') → 2^W(σ') such that for each r ∈ R(σ') we have RF σ' (r) ∈ GoodW'(r).
We prove the statement by induction on the length of the maximal possible extension, i.e., the largest possible number of reads not defined in GoodW that a valid full program trace satisfying GoodW and C can have. As a reminder, given nd = RVF-SMC(X, GoodW, σ, C) we first consider a trace σ̄ = σ • σ * where σ * is a maximal extension such that no event of σ * is a read.
Base case: maximal possible extension of length 1. There is exactly one enabled read r ∈ enabled(σ̄). All other threads have no enabled event, i.e., they are fully extended in σ̄. Because of this, our algorithm considers every possible source that r can read-from in traces satisfying GoodW and C. Completeness of VerifySC then implies completeness of this base case.
Inductive case. Let MAXEXT be the length of maximal possible extension of nd = RVF-SMC(X, GoodW, σ, C). By induction hypothesis, RVF-SMC is complete when rooted at any node with maximal possible extension length < MAXEXT. The rest of the proof is to prove completeness when rooted at nd, and the desired result then follows.
Inductive case: RVF-SMC without backtrack signals. We first consider a simpler version of RVF-SMC, where the boolean signal backtrack is always set to true (i.e., Algorithm 2 without Line 16). After we prove the inductive case of this version, we use it to prove the inductive case of the full version of RVF-SMC.
Let r 1 , ..., r k be the reads enabled in σ̄. We distinguish two cases: (1) there exists 1 ≤ i ≤ k with RF σ' (r i ) ∈ E(σ̄) ∪ {init event}; (2) there exists no such i.
Let us prove that (2) is impossible. Towards a contradiction, suppose it holds, and let r j be the first read out of r 1 , ..., r k in the order of appearance in σ'. Consider the thread of RF σ' (r j ). It has to be the thread of one of r 1 , ..., r k , as all other threads have no enabled event in σ̄ and are thus fully extended in σ̄. It cannot be the thread of r j , because all thread-predecessors of r j are in E(σ̄). Thus let it be the thread of r m for some 1 ≤ m ≤ k with m ≠ j. Since RF σ' (r j ) ∉ E(σ̄), RF σ' (r j ) comes after r m in σ'. This gives us r m < σ' RF σ' (r j ) < σ' r j , which contradicts r j being the first out of r 1 , ..., r k in σ'. Hence we know that the above case (1) is the only possibility.
Let 1 ≤ j ≤ k be the smallest index with RF σ' (r j ) ∈ E(σ̄) ∪ {init event}. Since σ' satisfies GoodW and C, the read r j is not forbidden by C to read-from RF σ' (r j ). Consider nd per- [...] Hence σ 4 is a valid trace containing all events of E(σ̄). Further, σ 4 satisfies GoodW and C, because σ' satisfies GoodW and C and σ 4 contains the same subsequence of the events E(σ̄). Finally, RF σ4 (r x ) = RF σ' (r x ) ∈ W(σ̄). The inductive case, and hence the completeness result, follows.
Complexity. Each recursive call of RVF-SMC (Algorithm 2) trivially spends n^O(k) time in total, except for the VerifySC subroutine of Line 22. For the VerifySC subroutine we utilize the complexity bound O(n^(k+1) · k^(d+1)) from Theorem 1; thus the total time spent in each call of RVF-SMC is n^O(k) · O(k^d).
Next we argue that no two leaves of the recursion tree of RVF-SMC correspond to the same class of the RVF trace partitioning. Towards a contradiction, consider two such distinct leaves l 1 and l 2 . Let a be their last (starting from the root recursion node) common ancestor. Let c 1 and c 2 be the children of a on the way to l 1 and l 2 , respectively. We have c 1 ≠ c 2 since a is the last common ancestor of l 1 and l 2 . The recursion proceeds from a to c 1 (resp. c 2 ) by issuing a good-writes set to some read r 1 (resp. r 2 ). If r 1 = r 2 , then the two good-writes sets issued to r 1 = r 2 in a differ in the value that the writes of the two sets write (see Line 18 of Algorithm 2). Hence l 1 and l 2 cannot represent the same RVF partitioning class, as representative traces of the two classes differ in the value that r 1 = r 2 reads. Hence the only remaining possibility is r 1 ≠ r 2 . In iterations of Line 13 in a, wlog assume that r 1 is processed before r 2 . For any pair of traces σ 1 and σ 2 that are class representatives of l 1 and l 2 respectively, we have that RF σ1 (r 1 ) ≠ RF σ2 (r 1 ). This follows from the update of the causal map C in Line 29 of the Line 13-iteration of a processing r 1 . Further, we have that RF σ2 (r 1 ) is a thread-successor of a read r ≠ r 1 that was among the enabled reads in a. From this we have r → σ2 r 1 but not r → σ1 r 1 . Thus the traces σ 1 and σ 2 differ in the causal orderings of the read events, contradicting that l 1 and l 2 correspond to the same class of the RVF trace partitioning.
Finally we argue that for each class of the RVF trace partitioning, represented by the (X, GoodW) of its RVF-SMC recursion leaf, at most n^k calls of RVF-SMC can be performed whose event set and good-writes function are subsets of X and GoodW, respectively. This follows from two observations. First, in each call of RVF-SMC, the event set is extended maximally by enabled writes, and further by one read, while the good-writes function is extended by defining one further read. Second, the number of lower sets of the partial order (R(X), PO) is bounded by n^k.
The desired complexity result follows.

E Details of Section 6
Here we present additional details on our experimental setup.
Handling assertion violations. Some of the benchmarks in our experiments contain assertion violations, which are successfully detected by all algorithms we consider in our experimental evaluation. After performing this sanity check, we have disabled all assertions, in order not to have the measured parameters be affected by how fast a violation is discovered, as the latter is arbitrary. Our primary experimental goal is to characterize the size of the underlying partitionings, and the time it takes to explore these partitionings.
Identifying events. As mentioned in Section 2, an event is uniquely identified by its predecessors in PO, and by the values its PO-predecessors have read. In our implementation, we rely on the interpreter built inside Nidhugg to identify events. An event e is defined by a pair (a e , b e ), where a e is the thread identifier of e and b e is the sequential number of the last LLVM instruction (of the corresponding thread) that is part of e (the event e corresponds to zero or more LLVM instructions not accessing shared variables, and exactly one LLVM instruction accessing a shared variable). It can happen that there exist two traces σ 1 and σ 2 , and two different events e 1 ∈ σ 1 , e 2 ∈ σ 2 , such that their identifiers are equal, i.e., a e1 = a e2 and b e1 = b e2 . However, this means that the control-flow leading to each event is different. In this case, σ 1 and σ 2 differ in the value read by a common event that is ordered by the program order PO both before e 1 and before e 2 , hence e 1 and e 2 are treated as inequivalent.
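The identifier scheme can be sketched as a small value type (our simplified rendering; as the text explains, equality of identifiers is necessary but not sufficient for two events to be the same event):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventId:
    """Identifier of an event in the spirit of the text: the thread id
    and the sequential number of the last LLVM instruction that is part
    of the event (our simplified rendering)."""
    thread: int
    last_instr: int

def same_event(eid1, ctx1, eid2, ctx2):
    """Two events from different traces are treated as the same event
    only if their identifiers match AND the values read by their
    PO-predecessors agree (`ctx` stands for that tuple of values);
    equal identifiers alone are not sufficient, as explained above."""
    return eid1 == eid2 and ctx1 == ctx2
```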
Technical details. For our experiments we have used a Linux machine with Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (12 CPUs) and 128GB of RAM.
We have run Nidhugg with Clang and LLVM version 8.
Scatter plots setup. Each scatter plot compares our algorithm RVF-SMC with some other algorithm X. In a fixed plot, each benchmark provides a single data point, obtained as follows. For the benchmark, we consider the highest unroll bound where neither of the algorithms RVF-SMC and X timed out. Then we plot the times (resp. traces) obtained on that benchmark and unroll bound by the two algorithms RVF-SMC and X.