Stateless model checking for TSO and PSO
Abstract
We present a technique for efficient stateless model checking of programs that execute under the relaxed memory models TSO and PSO. The basis for our technique is a novel representation of executions under TSO and PSO, called chronological traces. Chronological traces induce a partial order relation on relaxed memory executions, capturing dependencies that are needed to represent the interaction via shared variables. They are optimal in the sense that they only distinguish computations that are inequivalent under the widely used representation by Shasha and Snir. This allows an optimal dynamic partial order reduction algorithm to explore a minimal number of executions while still guaranteeing full coverage. We apply our techniques to check, under the TSO and PSO memory models, LLVM assembly produced for C/pthreads programs. Our experiments show that our technique reduces the verification effort for relaxed memory models to be almost that for the standard model of sequential consistency. This article is an extended version of Abdulla et al. (Tools and algorithms for the construction and analysis of systems, Springer, New York, pp 353–367, 2015), appearing in TACAS 2015.
1 Introduction
Verification and testing of concurrent programs is difficult, since one must consider all the different ways in which instructions of different threads can be interleaved. To make matters worse, most architectures implement relaxed memory models, such as TSO and PSO [4, 36], which make threads interact in even more and subtler ways than by standard interleaving. For example, a processor may reorder loads and stores by the same thread if they target different addresses, or it may buffer stores in a local queue.
A successful technique for finding concurrency bugs (i.e., defects that arise only under some thread schedulings), and for verifying their absence, is stateless model checking (SMC) [18], also known as systematic concurrency testing [24, 39]. Starting from a test, i.e., a way to run a program and obtain some expected result, which is terminating and thread-wisely deterministic (e.g., no data nondeterminism), SMC systematically explores the set of all thread schedulings that are possible during runs of this test. A special runtime scheduler drives the SMC exploration by making decisions on scheduling whenever such decisions may affect the interaction between threads, so that the exploration covers all possible executions and detects any unexpected test results, program crashes, or assertion violations. The technique is completely automatic, has no false positives, does not suffer from memory explosion, and can easily reproduce the concurrency bugs it detects. SMC has been successfully implemented in tools such as VeriSoft [19], Chess [28], and Concuerror [12].
There are two main problems for using SMC in programs that run under relaxed memory models (RMM). The first problem is that already under the standard model of sequential consistency (SC) the number of possible thread schedulings grows exponentially with the length of program execution. This problem has been addressed by partial order reduction (POR) techniques that achieve coverage of all thread schedulings, by exploring only a representative subset [13, 17, 30, 38]. POR has been adapted to SMC in the form of Dynamic Partial Order Reduction (DPOR) [16], which has been further developed in recent years [1, 22, 24, 32, 33, 37]. DPOR is based on augmenting each execution by a happens-before relation, which is a partial order that captures dependencies between operations of the threads. Two executions can be regarded as equivalent if they induce the same happens-before relation, and it is therefore sufficient to explore one execution in each equivalence class (called a Mazurkiewicz trace [27]). DPOR algorithms guarantee to explore at least one execution in each equivalence class, thus attaining full coverage with reduced cost. A recent optimal algorithm [1] guarantees to explore exactly one execution per equivalence class.
The second problem is that in order to extend SMC to handle relaxed memory models, the operational semantics of programs must be extended to represent the effects of RMM. The natural approach is to augment the program state with additional structures, e.g., store buffers in the case of TSO, that model the effects of RMM [3, 5, 29]. This causes blowups in the number of possible executions, in addition to those possible under SC. However, most of these additional executions are equivalent to some SC execution. To efficiently apply SMC to handle RMM, we must therefore extend DPOR to avoid redundant exploration of equivalent executions. The natural definition of “equivalent” under RMM can be derived from the abstract representation of executions due to Shasha and Snir [35], here called Shasha–Snir traces, which is often used in model checking and runtime verification [7, 8, 10, 11, 21, 23]. Shasha–Snir traces consist of an ordering relation between dependent operations, which generalizes the standard happens-before relation on SC executions; indeed, under SC, the equivalence relation induced by Shasha–Snir traces coincides with Mazurkiewicz traces. It would thus be natural to base DPOR for RMM on the happens-before relation induced by Shasha–Snir traces. However, this relation is in general cyclic (due to reorderings possible under RMM) and can therefore not be used as a basis for DPOR (since it is not a partial order). To develop an efficient technique for SMC under RMM we therefore need to find a different representation of executions under RMM. The representation should define an acyclic happens-before relation. Also, the induced trace equivalence should coincide with the equivalence induced by Shasha–Snir traces.
Contribution In this paper, we show how to apply SMC to TSO and PSO in a way that achieves maximal possible reduction using DPOR, in the sense that redundant exploration of equivalent executions is avoided. A cornerstone in our contribution is a novel representation of executions under RMM, called chronological traces, which define a happens-before relation on the events in a carefully designed representation of program executions. Chronological traces are a succinct canonical representation of executions, in the sense that there is a one-to-one correspondence between chronological traces and Shasha–Snir traces. Furthermore, the happens-before relation induced by chronological traces is a partial order, and can therefore be used as a basis for DPOR. In particular, the Optimal-DPOR algorithm of [1] will explore exactly one execution per Shasha–Snir trace. Moreover, for so-called robust programs that are not affected by RMM (these include data-race-free programs), Optimal-DPOR will explore as many executions under RMM as under SC: this follows from the one-to-one correspondence between chronological traces and Mazurkiewicz traces under SC. Furthermore, robustness can itself be considered a correctness criterion (as in e.g. [7, 8, 10, 11]), which can also be automatically checked with our method (by checking whether the number of equivalence classes is increased when going from SC to RMM).
We show the power of our technique by using it to implement an efficient stateless model checker, which for C programs with pthreads explores all executions of a test case or a program, up to some bounded length. During exploration of an execution, our implementation generates the corresponding chronological trace. Our implementation employs the Source-DPOR algorithm [1], which is simpler than Optimal-DPOR, but about equally effective. Our experimental results for analyses under SC, TSO and PSO of a number of intensely racy benchmarks and programs written in C/pthreads, show that (i) the effort for verification under TSO and PSO is not much larger than the effort for verification under SC, and (ii) our implementation compares favourably against CBMC [6] and goto-instrument [5], on a number of terminating and data-deterministic benchmarks.
2 Overview of main concepts
To see why this buffering semantics may cause unexpected program behaviors, consider the small program in Fig. 1. It consists of two threads \(p \) and \(q \). The thread \(p \) first stores 1 to the memory location x, and then loads the value at memory location y into its register $ r. The thread \(q \) is similar, but with the roles of x and y reversed. All memory locations and registers are assumed to have initial values 0. It is easy to see that under the SC semantics, it is impossible for the program to terminate in a state where both registers $ r and $ s hold the value 0. However, under the buffering semantics of TSO, such a final state is possible. Fig. 2 shows one such program execution. We see that the store to x happens at the beginning of the execution, but does not take effect with respect to memory until the very end of the execution. Thus the store to x and the load from y appear to take effect in an order opposite to how they occur in the program code. This allows the execution to terminate with \(\textsf {\$}{} { r} = \textsf {\$}{} { s} = 0\).
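The claim about the two-thread program can be checked by brute force. The following is a minimal sketch, not the paper's tool: it enumerates every interleaving of the program described above, once with stores written directly to memory (SC) and once with a per-thread FIFO store buffer (a simplified TSO). The function name `final_registers` and the encoding of the program are assumptions made for illustration.

```python
def final_registers(buffered):
    """Collect every reachable final (r, s) pair for the two-thread program.

    buffered=True  : stores go through a per-thread FIFO buffer (TSO-like).
    buffered=False : stores hit memory immediately (SC).
    """
    # p: store x := 1; load r := y      q: store y := 1; load s := x
    progs = {"p": [("store", "x"), ("load", "y")],
             "q": [("store", "y"), ("load", "x")]}
    results = set()

    def explore(pc, mem, buf, reg):
        if all(pc[t] == len(progs[t]) for t in progs):
            results.add((reg["p"], reg["q"]))
        for t in progs:
            # Option 1: thread t executes its next instruction.
            if pc[t] < len(progs[t]):
                op, loc = progs[t][pc[t]]
                pc2 = {**pc, t: pc[t] + 1}
                if op == "store" and buffered:
                    explore(pc2, mem, {**buf, t: buf[t] + ((loc, 1),)}, reg)
                elif op == "store":
                    explore(pc2, {**mem, loc: 1}, buf, reg)
                else:
                    # Buffer forwarding: newest own buffered value, else memory.
                    own = [v for (l, v) in buf[t] if l == loc]
                    val = own[-1] if own else mem[loc]
                    explore(pc2, mem, buf, {**reg, t: val})
            # Option 2: the oldest buffered store of t reaches memory.
            if buf[t]:
                (loc, v), rest = buf[t][0], buf[t][1:]
                explore(pc, {**mem, loc: v}, {**buf, t: rest}, reg)

    explore({"p": 0, "q": 0}, {"x": 0, "y": 0},
            {"p": (), "q": ()}, {"p": 0, "q": 0})
    return results


print(sorted(final_registers(buffered=False)))  # [(0, 1), (1, 0), (1, 1)]
print(sorted(final_registers(buffered=True)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) appears only in the buffered run, which matches the behaviour described for Fig. 2.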
A Shasha–Snir trace is a directed graph, where edges capture observed event orderings. The nodes in a Shasha–Snir trace are the executed instructions. For each thread, there are edges between each pair of subsequent instructions, creating a total order for each thread. For two instructions i and j in different threads, there is an edge \(i\rightarrow j\) in a trace when i causally precedes j. This happens when j reads a value that was written by i, when i reads a memory location that is subsequently updated by j, or when i and j are subsequent writes to the same memory location. In Fig. 3 we show the Shasha–Snir trace for the execution in Fig. 2.
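To make the edge rules concrete, the sketch below builds the graph for the execution in which both loads read 0, using only the rules stated in this paragraph (the figure itself is not reproduced here, so the edge list is derived by hand and should be read as an assumption). It then checks the graph for a cycle with a depth-first search. For comparison, it also builds the graph of one SC execution of the same program.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GREY
        for m in graph[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# Execution where both loads read the initial value 0 (possible under TSO).
tso_trace = [
    ("p: st x", "p: ld y"),  # program order in p
    ("q: st y", "q: ld x"),  # program order in q
    ("p: ld y", "q: st y"),  # p reads y before q's store to y takes effect
    ("q: ld x", "p: st x"),  # q reads x before p's store to x takes effect
]

# An SC execution: p runs to completion, then q (r = 0, s = 1).
sc_trace = [
    ("p: st x", "p: ld y"),
    ("q: st y", "q: ld x"),
    ("p: ld y", "q: st y"),  # p reads y, later overwritten by q
    ("p: st x", "q: ld x"),  # q reads the value written by p
]

print(has_cycle(tso_trace))  # True
print(has_cycle(sc_trace))   # False
```

The cycle in the first graph is the reason such a relation cannot serve directly as a partial order for DPOR.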
These observations suggest defining a representation of traces that separates stores from updates. In Fig. 4 we have redrawn the trace from Fig. 3. Updates are separated from stores, and we order updates, rather than stores, with operations of other threads. Thus, there are edges between updates to and loads from the same memory location, and between two updates to the same memory location. In Fig. 4, there is an edge from each store to the corresponding update, reflecting the principle that the update cannot occur before the store. There are edges between loads and updates of the same memory location, reflecting that swapping their order will affect the observed values. However, notice that for this program there are no edges between the updates and loads of the same thread, since they access different memory locations.
 1.
A load is never directly related to an update originating in the same thread. This captures the intuition that swapping the order of such a load and update has no effect other than changing a load from memory into a load of the same value from buffer, as seen when comparing Fig. 6b, c.
 2.
A load ld from a memory location x by a thread \(p \) is never directly related to an update by another thread \(q \), if the update by \(q \) precedes some update to x originating in a store by \(p \) that precedes ld. This is because the value written by the update of \(q \) is effectively hidden from the load ld by the update to x by \(p \). When we compare Fig. 6a, b, we see that the order between the update by \(q \) and the load is irrelevant, since the update by \(q \) is hidden by the update by \(p \) (note that the update by \(p \) originates in a store that precedes the load).
When we apply these rules to the example of Fig. 5, all of the three representations in Fig. 6a–c merge into a single representation shown in Fig. 7b. In total, we reduce the number of distinguished cases for the program from six to three. This is indeed the minimal number of cases that must be distinguished by any representation, since the different cases result in different values being loaded by the load instruction or different values in memory at the end of the execution. We show in Theorem 1 of Sect. 3 that our proposed representation is in general optimal.
Chronological traces for PSO The TSO and PSO memory models are very similar. The difference is that PSO does not enforce program order between stores by the same thread to different memory locations. To capture this, chronological traces are constructed differently under TSO and PSO. In particular, under TSO there will always be edges between all updates of the same thread, but under PSO we omit those edges when the updates access different memory locations. In Sect. 5 we describe in more detail how to adapt the chronological traces described above to the PSO memory model.
DPOR based on chronological traces Here, we illustrate how stateless model checking performs DPOR based on chronological traces, in order to explore one execution per chronological trace. As an example, we use the small program of Fig. 5. This example shows only the intuition of the process, and is intentionally vague. A detailed description of the algorithm is given in Sect. 4.
The algorithm initially explores an arbitrary execution of the program, and simultaneously generates the corresponding chronological trace. In our example, this execution can be the one shown in Fig. 8a, along with its chronological trace. The algorithm then finds those edges of the chronological trace that can be reversed by changing the thread scheduling of the execution. In Fig. 8a, the reversible edges are the ones from \(p: \textsf {update}\) to \(q: \textsf {update}\), and from \(p: \textsf {load: }{\textsf {\$}{} { r}}\,{:=}\,{\mathbf{x}}\) to \(q: \textsf {update}\). For each such edge, the program is executed with this edge reversed. Reversing an edge can potentially lead to a completely different continuation of the execution, which must then be explored.
In the example, reversing the edge from \(p: \textsf {load: }{\textsf {\$}{} { r}}\,{:=}\,{\mathbf{x}}\) to \(q: \textsf {update}\) will generate the execution and chronological trace in Fig. 8b. Notice that the new execution is observably different from the previous one: the load reads the value 2 instead of 1.
The chronological traces in both Fig. 8a, b display a reversible edge from \(p: \textsf {update}\) to \(q: \textsf {update}\). The algorithm therefore initiates an execution where \(q: \textsf {update}\) is performed before \(p: \textsf {update}\). The algorithm will generate the execution and chronological trace in Fig. 8c.
3 Formalization
In this section we summarize our formalization of the concepts of Sect. 2. We introduce our representation of program executions, define chronological traces, formalize Shasha–Snir traces for TSO, and prove a one-to-one correspondence between chronological traces and Shasha–Snir traces.
Preliminaries For a function f, we use the notation \(f[x\hookleftarrow v]\) to denote the function \(f'\) such that \(f'(x) = v\) and \(f'(y) = f(y)\) whenever \(y \ne x\). We use \(w\cdot {}w'\) to denote the concatenation of the words w and \(w'\).
Parallel programs We consider parallel programs consisting of a number of threads that run in parallel, each executing a deterministic code, written in an assembly-like programming language. The language includes instructions store: x := $ r, load: $ r := x, and fence. Other instructions do not access memory, and their precise syntax and semantics are ignored for brevity. Here, and in the remainder of this text, x, y, z are used to name memory locations, \(u{}, v{}, w{}\) are used to name values, and $ r, $ s, $ t are used to name processor registers. We use the short forms st(x) and ld(x) to denote some store and load of x respectively, where the value is not interesting. We use TID to denote the set of all thread identifiers, and MemLoc to denote the set of all memory locations.
Formal TSO semantics We formalize the TSO model by an operational semantics. Define a configuration as a pair \((\mathbb {L},\mathbb {M})\), where \(\mathbb {M}\) maps memory locations to values, and \(\mathbb {L}\) maps each thread \(p \) to a local configuration of the form \(\mathbb {L}(p)=(\mathbb {R},\mathbb {B})\). Here \(\mathbb {R}{}\) is the state of local registers (their valuation denoted \(\mathbb {R}(\textsf {\$}{} { r})\)) and program counter of \(p \), and \(\mathbb {B}{}\) is the contents of the store buffer of \(p \). This content is a word over pairs \((\mathbf{x},v)\) of memory locations and values. We let the notation \(\mathbb {B}(\mathbf{x})\) denote the value \(v\) such that \((\mathbf{x},v)\) is the rightmost (i.e., most recently inserted) pair in \(\mathbb {B}{}\) of form \((\mathbf{x},\_)\). If there is no such pair in \(\mathbb {B}{}\), then \(\mathbb {B}(\mathbf{x}) = \,\perp \).
In order to accommodate memory updates in our operational semantics we will introduce the notion of auxiliary threads. For each thread \(p \in \textsf {TID}\), we assume that there is an auxiliary thread \(\textsf {upd} (p)\). The auxiliary thread \(\textsf {upd} (p)\) will nondeterministically perform memory updates from the store buffer of \(p {}\), when the buffer is nonempty. We use \(\textsf {AuxTID}= \{\textsf {upd} (p) \mid p \in \textsf {TID}\}\) to denote the set of auxiliary thread identifiers. We will use \(p \) and \(q \) to refer to real or auxiliary threads in \(\textsf {TID}\cup \textsf {AuxTID}\) as convenient.
For configurations \(c= (\mathbb {L},\mathbb {M})\) and \(c' = (\mathbb {L}',\mathbb {M}')\), we write \(c\xrightarrow {p}c'\) to denote that from configuration \(c{}\), thread \(p {}\) can execute its next instruction, thereby changing the configuration into \(c'\). We define the transition relation \(c\xrightarrow {p}c'\) depending on what the next instruction op of \(p {}\) is in \(c\). In the following we assume \(c= (\mathbb {L},\mathbb {M})\) and \(c' = (\mathbb {L}',\mathbb {M}')\) and \(\mathbb {L}(p) = (\mathbb {R},\mathbb {B})\). Let \(\mathbb {R}_{\textsf {pc}}\) be obtained from \(\mathbb {R}\) by advancing the program counter after \(p \) executes its next instruction. Depending on this next instruction op, we have the following cases.
Store If op has the form store: x := $ r, then \(c\xrightarrow {p}c'\) iff \(\mathbb {L}'=\mathbb {L}[p \hookleftarrow (\mathbb {R}_{\textsf {pc}},\mathbb {B}\cdot (\mathbf{x},v))]\) where \(v= \mathbb {R}(\textsf {\$}{} { r})\) and \(\mathbb {M}' = \mathbb {M}\). Intuitively, under TSO, instead of updating the memory with the new value v, we insert the entry \((\mathbf{x},v)\) at the end of the store buffer of the thread.
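Load If op has the form load: $ r := x, then \(c\xrightarrow {p}c'\) iff \(\mathbb {M}' = \mathbb {M}\) and either
 1.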
(From memory) \(\mathbb {B}(\mathbf{x})=\,\perp \) and \(\mathbb {L}'=\mathbb {L}[p \hookleftarrow (\mathbb {R}_{\textsf {pc}}[\textsf {\$}{} { r} \hookleftarrow \mathbb {M}(\mathbf{x})],\mathbb {B})]\), or
 2.
(Buffer forwarding) \(\mathbb {B}(\mathbf{x})\ne \, \perp \) and \(\mathbb {L}'=\mathbb {L}[p \hookleftarrow (\mathbb {R}_{\textsf {pc}}[\textsf {\$}{} { r} \hookleftarrow \mathbb {B}(\mathbf{x})],\mathbb {B})]\).
Fence If op has the form fence, then \(c\xrightarrow {p}c'\) iff \(\mathbb {B}= \varepsilon \) and \(\mathbb {L}'=\mathbb {L}[p \hookleftarrow (\mathbb {R}_{\textsf {pc}},\mathbb {B})]\) and \(\mathbb {M}' = \mathbb {M}\). A fence can only be executed when the store buffer of the thread is empty.
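The transition rules can be read as small state-transforming functions. The sketch below is an illustration under stated assumptions rather than the paper's implementation: registers and memory are Python dictionaries, the store buffer is a list of (location, value) pairs, the program counter is left out, and `None` stands for \(\perp \). The `update` function is an assumption based on the description of the auxiliary thread \(\textsf {upd} (p)\): it moves the oldest buffer entry to memory, which is the usual FIFO behaviour of TSO.

```python
def buffer_lookup(buf, x):
    """B(x): value of the rightmost (x, _) pair in the buffer, or None for 'bottom'."""
    for loc, v in reversed(buf):
        if loc == x:
            return v
    return None


def store(regs, buf, mem, x, r):
    """store: x := r. Memory is unchanged; (x, v) is appended to the buffer."""
    return regs, buf + [(x, regs[r])], mem


def load(regs, buf, mem, r, x):
    """load: r := x. Read from the own buffer if possible, else from memory."""
    v = buffer_lookup(buf, x)
    if v is None:
        v = mem[x]                      # case 1: from memory
    return {**regs, r: v}, buf, mem     # case 2 otherwise: buffer forwarding


def fence_enabled(buf):
    """A fence can only execute when the store buffer is empty."""
    return len(buf) == 0


def update(regs, buf, mem):
    """Assumed update step of upd(p): the oldest buffered store reaches memory."""
    (x, v), rest = buf[0], buf[1:]
    return regs, rest, {**mem, x: v}


# Usage: one thread stores 7 to x, reads it back, then flushes.
regs, buf, mem = {"$r": 7, "$s": 0}, [], {"x": 0}
regs, buf, mem = store(regs, buf, mem, "x", "$r")
regs, buf, mem = load(regs, buf, mem, "$s", "x")
print(regs["$s"], mem["x"], fence_enabled(buf))   # 7 0 False
regs, buf, mem = update(regs, buf, mem)
print(mem["x"], fence_enabled(buf))               # 7 True
```

Returning fresh dictionaries and lists instead of mutating them mirrors the \(f[x\hookleftarrow v]\) notation, where the original function is left unchanged.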
Program executions Based on the operational semantics defined above, a program execution can be defined as a sequence \(c_0\xrightarrow {p _0}c_1\xrightarrow {p _1}\cdots \xrightarrow {p _{n-1}}c_n\) of configurations related by transitions labelled by actual or auxiliary thread IDs. Since each transition of each program thread (including the auxiliary threads of form \(\textsf {upd} (q)\)) is deterministic, a program run is uniquely determined by its sequence of thread IDs. Formally, we will therefore define each execution as a word of events. Each event is a triple \((p,i,j)\) which represents one transition in the run. Here the thread \(p \in \textsf {TID}\cup \textsf {AuxTID}\) is a regular or auxiliary thread, executing an instruction i (which may be an update \(\textsf {u(}{\mathbf{x}}\textsf {)}\)). The natural number j is used to disambiguate events. We let j be such that \((p,i,j)\) is the j-th event of \(p \) in the execution (counting from 1). For an event \(e=(p,i,j)\), we define \(\textsf {tid(}{e}\textsf {)}=p \). We will use \(\textsf {Event}\) to denote the set of all possible events. Figure 9 shows three sample executions.
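As a small illustration of the event encoding, the sketch below turns a sequence of (thread, instruction) steps into event triples by counting, per thread, how many events that thread has produced so far. The helper name `to_events` and the string encoding of instructions are assumptions made for this example.

```python
from collections import Counter


def to_events(steps):
    """Attach the per-thread index j (counting from 1) to each (thread, instruction) step."""
    seen = Counter()
    events = []
    for p, i in steps:
        seen[p] += 1
        events.append((p, i, seen[p]))
    return events


steps = [("p", "st(x)"), ("q", "st(y)"), ("upd(p)", "u(x)"), ("p", "ld(y)")]
for e in to_events(steps):
    print(e)
# ('p', 'st(x)', 1)
# ('q', 'st(y)', 1)
# ('upd(p)', 'u(x)', 1)
# ('p', 'ld(y)', 2)
```

Note that the auxiliary thread `upd(p)` has its own counter, separate from that of `p`.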
For an execution \(\tau \) and two events \(e,e'\) in \(\tau \), we say that \(e<_{\tau }e'\) iff \(e\) strictly precedes \(e'\) in \(\tau \). We define two dummy events \(e^0=(\perp ,\perp ,0)\) and \(e^\infty =(\perp ,\perp ,\infty )\), and we extend \(<_{\tau }\) such that for every event \(e\not \in \{e^0,e^\infty \}\) we have \(e^0<_{\tau } e<_{\tau } e^\infty \).
For an execution \(\tau \) and an event \(e=(p,\textsf {st(}{\mathbf{x}}\textsf {)},j)\) in \(\tau \), we define \(\textsf {upd}{}_\textsf {st}(e)\) to be the update event in \(\tau \) corresponding to the store event \(e\). Formally, let k be the number of events \(e_w=(p ',\textsf {st(}{\mathbf{y}}\textsf {)},j')\) for any memory location y in \(\tau \) such that \(p '=p \) and \(j'\le {}j\). Then \(\textsf {upd}{}_\textsf {st}(e)=(\textsf {upd} (p),\textsf {u(}{\mathbf{x}}\textsf {)},k)\) if there is such an event in \(\tau \). Otherwise \(\textsf {upd}{}_\textsf {st}(e)=e^\infty \), denoting that the update is still pending at the end of \(\tau \). Figure 9a illustrates the typical case, where the store \(e_s\) is eventually followed by its corresponding update \(\textsf {upd}{}_\textsf {st}(e_s)=e_u\). Figure 9b shows the case when the update corresponding to the store \(e_s\) is still pending at the end of the execution, and therefore \(\textsf {upd}{}_\textsf {st}(e_s)=e^\infty \).
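The definition of \(\textsf {upd}{}_\textsf {st}\) amounts to counting stores: the k-th store of \(p \) is matched with the k-th event of \(\textsf {upd} (p)\). A minimal sketch follows, assuming events are encoded as tuples `(thread, (kind, location), j)` and that the dummy event \(e^\infty \) is represented by the constant `E_INF`; both encodings are illustrative choices, not taken from the paper.

```python
E_INF = (None, None, float("inf"))  # stands for the dummy event e^infinity


def upd_st(tau, e):
    """Return the update event matching store event e in execution tau, or E_INF."""
    p, (kind, x), j = e
    assert kind == "st", "upd_st is only defined for store events"
    # k = number of stores by p (to any location) up to and including e.
    k = sum(1 for (q, (kd, _), jj) in tau if q == p and kd == "st" and jj <= j)
    candidate = (f"upd({p})", ("u", x), k)
    return candidate if candidate in tau else E_INF


tau = [
    ("p", ("st", "x"), 1),
    ("p", ("ld", "y"), 2),
    ("p", ("st", "z"), 3),
    ("upd(p)", ("u", "x"), 1),
]
print(upd_st(tau, tau[0]))  # ('upd(p)', ('u', 'x'), 1)
print(upd_st(tau, tau[2]) == E_INF)  # True: the store to z is still pending
```

The load in the middle does not affect k, since only stores of \(p \) are counted.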
Chronological traces We can now introduce the main conceptual contribution of the paper, viz. chronological traces. For an execution \(\tau \) we define its chronological trace \(\mathcal {T}_C(\tau )\) as a directed graph \(\langle V,E\rangle \). The vertices V are all the events in \(\tau \); both events representing instructions and events representing updates. The set of edges E is the union of six relations: \(E=\rightarrow ^{\textsf {po}}_{\tau }\cup \rightarrow ^{\textsf {su}}_{\tau }\cup \rightarrow ^{\textsf {uu}}_{\tau }\cup \rightarrow ^{\textsf {srcct}}_{\tau }\cup \rightarrow ^{\textsf {cfct}}_{\tau }\cup \rightarrow ^{\textsf {uf}}_{\tau }\).
We will illustrate the definition on an execution of the program in Fig. 10a, which contains an idiom that occurs in the mutual exclusion algorithm of Peterson [31]. It is mostly the same as that from Dekker’s mutual exclusion algorithm. But it has two additional accesses in each thread to a separate memory location z. These provide an opportunity to display buffer forwarding. Figure 10c shows an example of an execution and Fig. 10b shows its corresponding chronological trace.
We define the edge relations of chronological traces as follows, for two arbitrary events \(e=(p,i,j)\in {}V\) and \(e'=(p ',i',j')\in {}V\):
Program order \(e\rightarrow ^{\textsf {po}}_{\tau }e'\) iff \(p = p '\) and \(j' = j+1\). For example, we see in Fig. 10b that there is a program order edge from the store instruction A (i.e., the event \((p,\textsf {st(}{x}\textsf {)},1)\)) to the store instruction B (i.e., the event \((p,\textsf {st(}{z}\textsf {)},2)\)) which immediately follows it in the program of thread \(p \). Similarly, the updates of each thread are program ordered. E.g., \(A'\rightarrow ^{\textsf {po}}_{\tau }{}B'\).
Store to update \(e\rightarrow ^{\textsf {su}}_{\tau }e'\) iff \(i = \textsf {st(}{\mathbf{x}}\textsf {)}\) for some \(\mathbf{x} \) and \(\textsf {upd}{}_\textsf {st}(e) = e'\). I.e., \(e'\) is the update corresponding to the store \(e\). This is illustrated in Fig. 10b where there is an su-edge from each store to its corresponding update. E.g., \(A\rightarrow ^{\textsf {su}}_{\tau }{}A'\).
Update to update \(e\rightarrow ^{\textsf {uu}}_{\tau }e'\) iff \(i = \textsf {u(}{\mathbf{x}}\textsf {)}\) and \(i' = \textsf {u(}{\mathbf{x}}\textsf {)}\) for some x and \(e<_{\tau }e'\) and there is no event \(e'' = (p '',\textsf {u(}{\mathbf{x}}\textsf {)},j'')\) such that \(e<_{\tau }e''<_{\tau }e'\). I.e., \(\rightarrow ^{\textsf {uu}}_{\tau }\) chronologically orders all updates for each memory location. In Fig. 10b we see that the two updates \(B'\) and \(F'\) to z are uu-ordered with each other in the same order as they appear in the execution in Fig. 10c. However, they are not uu-ordered with the updates \(A'\) and \(E'\) to x and y.
Source \(e\rightarrow ^{\textsf {srcct}}_{\tau }e'\) iff for some x it holds that \(i = \textsf {u(}{\mathbf{x}}\textsf {)}\) and \(i' = \textsf {ld(}{\mathbf{x}}\textsf {)}\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}<_{\tau }e<_{\tau }e'\) and there is no update \(e'' = (p '',\textsf {u(}{\mathbf{x}}\textsf {)},j'')\) to x such that \(e<_{\tau }e''<_{\tau }e'\). I.e., if the source of the value read by \(e'\) is an update \(e\) from a different thread, then \(e\rightarrow ^{\textsf {srcct}}_{\tau }e'\). Otherwise, there is no incoming \(\rightarrow ^{\textsf {srcct}}_{\tau }\) edge to \(e'\). Since the definition forces the strict order \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}<_{\tau }e<_{\tau }e'\), it excludes the possibility of the update \(e\) originating in the same thread as the load \(e'\) (as no update from \(p '\) can come after \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}\) but before \(e'\)). Therefore a load is never srcct-related to an update from the same thread. In Fig. 10b we see that the load H takes its value from the update \(A'\). Therefore the events are srcct-related. But the loads C and G to z both read the value written by their own thread, and therefore have no srcct-relation. The “ct” in the name of the relation stands for “chronological trace”, and serves to distinguish the relation \(\rightarrow ^{\textsf {srcct}}_{\tau }\) for chronological traces from the similar, but different relation \(\rightarrow ^{\textsf {srcss}}_{\tau }\) for Shasha–Snir traces (introduced below).
Conflict \(e\rightarrow ^{\textsf {cfct}}_{\tau }e'\) iff \(i = \textsf {ld(}{\mathbf{x}}\textsf {)}\) and \(i' = \textsf {u(}{\mathbf{x}}\textsf {)}\) for some x and \(e'\) is the first (w.r.t. \(<_{\tau }\)) event \(e_u\) of the form \((\_,\textsf {u(}{\mathbf{x}}\textsf {)},\_)\) such that both \(e<_{\tau }e_u\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e}\textsf {)}<_{\tau }e_u\). The intuition here is that \(e\rightarrow ^{\textsf {cfct}}_{\tau }e'\) when \(e'\) is the first update which succeeds \(e\) in the coherence order of \(\mathbf{x} \). Equivalently, \(e'\) is the update that overwrites the value that was read by \(e\). In Fig. 10b, the load D to y by \(p {}\) reads the initial value of y , which is then overwritten by the update \(E'\) to y by \(q {}\). Therefore we have \(D\rightarrow ^{\textsf {cfct}}_{\tau }{}E'\). The load C reads the value of the update \(B'\) by buffer forwarding. That value is later overwritten in memory by the update \(F'\). Therefore we have \(C\rightarrow ^{\textsf {cfct}}_{\tau }{}F'\).
Update to fence \(e\rightarrow ^{\textsf {uf}}_{\tau }e'\) iff \(i = \textsf {u(}{\mathbf{x}}\textsf {)}\) for some x , and \(i' = \textsf {fence}\) and \(p = \textsf {upd} (p ')\) and \(e<_{\tau }e'\) and there is no event \(e'' = (p,\textsf {u(}{\mathbf{y}}\textsf {)},j'')\) for any y such that \(e<_{\tau }e''<_{\tau }e'\). The intuition here is that the fence cannot be executed until all pending updates of the same thread have been flushed from the buffer. Hence the updates are ordered before the fence.
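Three of the six relations (program order, store to update, update to update) depend only on event indices and on the order of updates, so they are easy to compute from an execution. The sketch below does this for a small hypothetical execution; it is not the execution of Fig. 10. Events are encoded as tuples `(thread, (kind, location), j)`, which is an assumption of this example. The source and conflict relations are left out because they rely on \({\textsf {upd}}{}_{\textsf {ld}}\), and update to fence is left out because the example contains no fence.

```python
def po_edges(tau):
    """Program order: consecutive events of the same (real or auxiliary) thread."""
    return {(e, f) for e in tau for f in tau
            if e[0] == f[0] and f[2] == e[2] + 1}


def su_edges(tau):
    """Store to update: the k-th store of p is matched with the k-th event of upd(p)."""
    edges, stores_so_far = set(), {}
    for e in tau:
        p, (kind, x), _ = e
        if kind == "st":
            stores_so_far[p] = stores_so_far.get(p, 0) + 1
            u = (f"upd({p})", ("u", x), stores_so_far[p])
            if u in tau:
                edges.add((e, u))
    return edges


def uu_edges(tau):
    """Update to update: chain consecutive updates to the same location."""
    edges, last = set(), {}
    for e in tau:
        kind, x = e[1]
        if kind == "u":
            if x in last:
                edges.add((last[x], e))
            last[x] = e
    return edges


# Hypothetical execution: both threads store to x; q's update reaches memory first.
A = ("p", ("st", "x"), 1)
B = ("q", ("st", "x"), 1)
B_upd = ("upd(q)", ("u", "x"), 1)
A_upd = ("upd(p)", ("u", "x"), 1)
C = ("p", ("ld", "x"), 2)
tau = [A, B, B_upd, A_upd, C]

print(po_edges(tau) == {(A, C)})                   # True
print(su_edges(tau) == {(A, A_upd), (B, B_upd)})   # True
print(uu_edges(tau) == {(B_upd, A_upd)})           # True
```

In this execution the load C reads the value of its own thread's update, so by the first rule of Sect. 2 it would not receive a source edge.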
Shasha–Snir traces We will now formalize Shasha–Snir traces, and prove that chronological traces are equivalent to Shasha–Snir traces, in the sense that they induce the same equivalence relation on executions. We first recall the definition of Shasha–Snir traces. We follow the formalization by Bouajjani et al. [8].
First, we introduce the notion of a completed execution. We say that an execution \(\tau \) is completed when all stores have reached memory, i.e., when for every event \(e= (p,\textsf {st(}{\mathbf{x}}\textsf {)},j)\) in \(\tau \) we have \(\textsf {upd}{}_\textsf {st}(e) \ne e^\infty \). In the context of Shasha–Snir traces, we will restrict ourselves to completed executions.
For a completed execution \(\tau \), we define the Shasha–Snir trace of \(\tau \) as the graph \(\mathcal {T}(\tau ) = \langle V,E\rangle \) where V is the set of all non-update events \((p,i,j)\) in \(\tau \) where \(i\ne \textsf {u(}{\mathbf{x}}\textsf {)}\) for all x. The edge set E is the union of four relations \(E=\rightarrow ^{\textsf {po}}_{\tau }\cup \rightarrow ^{\textsf {st}}_{\tau }\cup \rightarrow ^{\textsf {srcss}}_{\tau }\cup \rightarrow ^{\textsf {cfss}}_{\tau }\).
For two arbitrary events \(e=(p,i,j)\in {}V\) and \(e'=(p ',i',j')\in {}V\), we define the relations as follows:
Program order \(e\rightarrow ^{\textsf {po}}_{\tau }e'\) iff \(p = p '\) and \(j' = j+1\). This is the same program order as for chronological traces.
Store order \(e\rightarrow ^{\textsf {st}}_{\tau }e'\) iff \(i = \textsf {st(}{\mathbf{x}}\textsf {)}\) and \(i' = \textsf {st(}{\mathbf{x}}\textsf {)}\) for some \(\mathbf{x} \) and the corresponding updates are ordered in \(\tau \) s.t. \(\textsf {upd}{}_\textsf {st}(e)<_{\tau }\textsf {upd}{}_\textsf {st}(e')\) and there is no other update event \(e'' = (p '',\textsf {u(}{\mathbf{x}}\textsf {)},j'')\) such that \(\textsf {upd}{}_\textsf {st}(e)<_{\tau }e''<_{\tau }\textsf {upd}{}_\textsf {st}(e')\). I.e., for each memory location \(\mathbf{x} \), the transitive closure \({\rightarrow ^{\textsf {st}}_{\tau }}^{*}\) is a total order on all stores to \(\mathbf{x} \) based on the order in which they reach memory.
Source \(e\rightarrow ^{\textsf {srcss}}_{\tau }e'\) iff \(i' = \textsf {ld(}{\mathbf{x}}\textsf {)}\) and \(e\) is the maximal store event \(e'' = (p '',\textsf {st(}{\mathbf{x}}\textsf {)},j'')\) with respect to \({\rightarrow ^{\textsf {st}}_{\tau }}^*\) such that either \(\textsf {upd}{}_\textsf {st}(e'')<_{\tau }e'\) or \(e''{\rightarrow ^{\textsf {po}}_{\tau }}^{*}e'\). I.e., \(e\rightarrow ^{\textsf {srcss}}_{\tau }e'\) when \(e'\) is a load which reads its value from \(e\), via memory or by buffer forwarding.
Conflict \(e\rightarrow ^{\textsf {cfss}}_{\tau }e'\) iff \(i = \textsf {ld(}{\mathbf{x}}\textsf {)}\) and \(i' = \textsf {st(}{\mathbf{x}}\textsf {)}\) and if there is an event \(e''\) such that \(e''\rightarrow ^{\textsf {srcss}}_{\tau }e\) then \(e''\rightarrow ^{\textsf {st}}_{\tau }e'\), otherwise \(e'\) has no predecessor in \(\rightarrow ^{\textsf {st}}_{\tau }\). I.e., \(e'\) is the store which overwrites the value that was read by \(e\).
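The store order relation can be computed by walking through the updates of a completed execution and mapping each update back to the store it belongs to. The sketch below assumes events are encoded as tuples `(thread, (kind, location), j)` and that auxiliary threads are named `"upd(<thread>)"`; both are illustrative conventions, and the example execution is hypothetical.

```python
def store_order(tau):
    """Store order of a completed execution: stores to the same location,
    linked in the order in which their updates reach memory."""
    # Index the k-th store of each real thread.
    nth_store, count = {}, {}
    for e in tau:
        p, (kind, _), _ = e
        if kind == "st":
            count[p] = count.get(p, 0) + 1
            nth_store[(p, count[p])] = e
    edges, last = set(), {}
    for q, (kind, x), k in tau:
        if kind == "u":
            owner = q[len("upd("):-1]       # "upd(p)" -> "p"
            s = nth_store[(owner, k)]       # store that produced this update
            if x in last:
                edges.add((last[x], s))
            last[x] = s
    return edges


# Hypothetical completed execution: q's store to x reaches memory before p's.
A = ("p", ("st", "x"), 1)
B = ("q", ("st", "x"), 1)
tau = [A, B, ("upd(q)", ("u", "x"), 1), ("upd(p)", ("u", "x"), 1)]

print(store_order(tau) == {(B, A)})  # True
```

Although A is executed before B, the store order places B before A, because the order is taken from the updates rather than from the store instructions.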
The definition of Shasha–Snir traces is illustrated in Fig. 10d. We are now ready to state the equivalence theorem.
Theorem 1
(Equivalence of Shasha–Snir traces and chronological traces) For a given program \(\mathcal {P}\) with two completed executions \(\tau ,\tau '\), it holds that \(\mathcal {T}(\tau ) = \mathcal {T}(\tau ')\) iff \(\mathcal {T}_C(\tau ) = \mathcal {T}_C(\tau ')\).
We decompose the theorem into the following two lemmas, which are proven separately.
Lemma 1
(Equivalence of Shasha–Snir traces and chronological traces \(\Rightarrow \) direction) For a given program \(\mathcal {P}\) with two completed executions \(\tau ,\tau '\), it holds that if \(\mathcal {T}(\tau ) = \mathcal {T}(\tau ')\) then \(\mathcal {T}_C(\tau ) = \mathcal {T}_C(\tau ')\).
Lemma 2
(Equivalence of Shasha–Snir traces and chronological traces \(\Leftarrow \) direction) For a given program \(\mathcal {P}\) with two completed executions \(\tau ,\tau '\), it holds that if \(\mathcal {T}_C(\tau ) = \mathcal {T}_C(\tau ')\) then \(\mathcal {T}(\tau ) = \mathcal {T}(\tau ')\).
Proof of Lemma 1
First, we determine that the events are the same in both chronological traces: \(V_C = V'_C\). From \(V_{SS} = V'_{SS}\) we have that the non-update events in \(\tau \) are the same as the ones in \(\tau '\). Since \(\tau \) and \(\tau '\) contain the same stores for each thread in the same per-thread order, it follows from the fact that \(\tau \) and \(\tau '\) are completed, and from the TSO semantics, that \(\tau \) and \(\tau '\) also have the same update events. Hence \(V_C = V'_C\).
We see that the definitions of program order and store to update order in chronological traces are entirely determined by which events exist in the execution for each thread. Since both executions have the same events, we conclude that \(\rightarrow ^{\textsf {po}}_{\tau } \,=\, \rightarrow ^{\textsf {po}}_{\tau '}\) and \(\rightarrow ^{\textsf {su}}_{\tau }=\, \rightarrow ^{\textsf {su}}_{\tau '}\). The equality \(\rightarrow ^{\textsf {uf}}_{\tau }\,=\,\rightarrow ^{\textsf {uf}}_{\tau '}\) of update to fence order follows similarly.
Let us consider the definitions of update to update order for chronological traces and store order for Shasha–Snir traces. We see that there is a one-to-one correspondence between relations \(e\rightarrow ^{\textsf {st}}_{\tau }e'\) for stores in Shasha–Snir traces and relations \(\textsf {upd}{}_\textsf {st}(e)\rightarrow ^{\textsf {uu}}_{\tau }\textsf {upd}{}_\textsf {st}(e')\) in chronological traces. Since the store orders are the same for \(\tau \) and \(\tau '\), we thus conclude that the update to update orders are also the same: \(\rightarrow ^{\textsf {uu}}_{\tau } \,=\, \rightarrow ^{\textsf {uu}}_{\tau '}\).
We now turn our attention to proving that \(\rightarrow ^{\textsf {srcct}}_{\tau }\,=\,\rightarrow ^{\textsf {srcct}}_{\tau '}\). We will first prove that \(\rightarrow ^{\textsf {srcct}}_{\tau }\,\subseteq \, \rightarrow ^{\textsf {srcct}}_{\tau '}\). From symmetry it then follows that \(\rightarrow ^{\textsf {srcct}}_{\tau '}\,\subseteq \,\rightarrow ^{\textsf {srcct}}_{\tau }\), and hence that \(\rightarrow ^{\textsf {srcct}}_{\tau }\,=\,\rightarrow ^{\textsf {srcct}}_{\tau '}\). Let us assume that the relation \(e\rightarrow ^{\textsf {srcct}}_{\tau }e'\) exists in \(\rightarrow ^{\textsf {srcct}}_{\tau }\) for some events \(e\,=\, (p,\textsf {u(}{\mathbf{x}}\textsf {)},j)\) and \(e' \,= \,(p ',\textsf {ld(}{\mathbf{x}}\textsf {)},j')\). We will prove that the same relation \(e\rightarrow ^{\textsf {srcct}}_{\tau '}e'\) exists in \(\rightarrow ^{\textsf {srcct}}_{\tau '}\). From the definition of \(\rightarrow ^{\textsf {srcct}}_{\tau }\) we have that \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}<_{\tau } e<_{\tau } e'\) and there is no update \(e'' \,=\,(p '',\textsf {u(}{\mathbf{x}}\textsf {)},i'')\) to the same memory location such that \(e<_{\tau }e''<_{\tau }e'\). Since \(e'\) is preceded in \(\tau \) by at least one update to \(\mathbf{x} \), there must be a store event \(e_w\) such that \(e_w\rightarrow ^{\textsf {srcss}}_{\tau }e'\) in \(\tau \). From the definition of \(\rightarrow ^{\textsf {srcss}}_{\tau }\) we have that \(e_w\) is the maximal event \((p '',\textsf {st(}{\mathbf{x}}\textsf {)},j'')\) with respect to \({\rightarrow ^{\textsf {st}}_{\tau }}^{*}\) such that either \(\textsf {upd}{}_\textsf {st}(e_w)<_{\tau }e'\) or \(e_w{\rightarrow ^{\textsf {po}}_{\tau }}^*e'\). If \(e_w{\rightarrow ^{\textsf {po}}_{\tau }}^*e'\), then \(\textsf {upd}{}_\textsf {st}(e_w) =\, {\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}\). 
But then the maximality of \(e_w\) contradicts \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}<_{\tau } e<_{\tau } e'\). Hence we have \(\textsf {upd}{}_\textsf {st}(e_w)<_{\tau }e'\). Maximality of \(e_w\) now gives that \(\textsf {upd}{}_\textsf {st}(e_w) \,=\, e\). Since \(\rightarrow ^{\textsf {srcss}}_{\tau } \,=\, \rightarrow ^{\textsf {srcss}}_{\tau '}\) we have that in \(\tau '\) also \(e_w\rightarrow ^{\textsf {srcss}}_{\tau '}e'\). From the definition of \(\rightarrow ^{\textsf {srcss}}_{\tau '}\) and \(\lnot (e_w{\rightarrow ^{\textsf {po}}_{\tau '}}^*e')\) we know that \(\textsf {upd}{}_\textsf {st}(e_w)\) is the store-order-maximal update to \(\mathbf{x} \) that precedes \(e'\) in \(\tau '\). Since the store order is the same for \(\tau \) and \(\tau '\) we have \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e'}\textsf {)}<_{\tau '}e\). But then \(e\,=\, \textsf {upd}{}_\textsf {st}(e_w)\) satisfies the criteria for \(e\rightarrow ^{\textsf {srcct}}_{\tau '}e'\).
Finally, we will show that \(\rightarrow ^{\textsf {cfct}}_{\tau } \,=\, \rightarrow ^{\textsf {cfct}}_{\tau '}\). Similarly to the proof for \(\rightarrow ^{\textsf {srcct}}_{\tau } \,= \,\rightarrow ^{\textsf {srcct}}_{\tau '}\), it suffices here to show that \(\rightarrow ^{\textsf {cfct}}_{\tau } \subseteq \rightarrow ^{\textsf {cfct}}_{\tau '}\). Assume therefore that \(e_r\rightarrow ^{\textsf {cfct}}_{\tau }e_u\) for some events \(e_r \,=\, (p,\textsf {ld(}{\mathbf{x}}\textsf {)},j)\), \(e_u \,= \,(p ',\textsf {u(}{\mathbf{x}}\textsf {)},j')\). We will show that \(e_r\rightarrow ^{\textsf {cfct}}_{\tau '}e_u\). The definition of \(\rightarrow ^{\textsf {cfct}}_{\tau }\) gives that \(e_u\) is the first (w.r.t. \(<_{\tau }\)) event \(e\) of the form \((\_,\textsf {u(}{\mathbf{x}}\textsf {)},\_)\) such that both \(e_r<_{\tau }e\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}<_{\tau }e\). Let \(e_w\) be the store event such that \(\textsf {upd}{}_\textsf {st}(e_w) \,= \,e_u\). We will split the proof in cases depending on whether or not there exists a source event for \(e_r\) in the Shasha–Snir traces.
Assume therefore first (i) that there is no event \(e_{src}\) such that \(e_{src}\rightarrow ^{\textsf {srcss}}_{\tau }e_r\). Then there is no update to x that precedes \(e_r\) in \(<_{\tau }\). Furthermore \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)} \,=\, e^0\). This tells us that \(e_w\) has no predecessor in \(\rightarrow ^{\textsf {st}}_{\tau }\). Since \(\rightarrow ^{\textsf {st}}_{\tau } \,= \,\rightarrow ^{\textsf {st}}_{\tau '}\), we also have that \(e_w\) has no predecessor in \(\rightarrow ^{\textsf {st}}_{\tau '}\). Furthermore, since \(e_r\) has no source event in \(\tau '\), it must be the case that \(e_r<_{\tau '}e_u\). But then, \(e_u\) is the first update event in \(\tau '\) which is after both \(e_r\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\). And so we have \(e_r\rightarrow ^{\textsf {cfct}}_{\tau '}e_u\).
Next assume (ii) that there is an event \(e_{src}\) with \(e_{src}\rightarrow ^{\textsf {srcss}}_{\tau }e_r\) and that \(\textsf {tid(}{e_{src}}\textsf {)} \,= \,\textsf {tid(}{e_r}\textsf {)}\). Then it must be the case that \(\textsf {upd}{}_\textsf {st}(e_{src}) \,= \,{\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\). Since \(\rightarrow ^{\textsf {srcss}}_{\tau } \,= \,\rightarrow ^{\textsf {srcss}}_{\tau '}\), we have that \(e_{src}\rightarrow ^{\textsf {srcss}}_{\tau '}e_r\). There can be no update event \(e\) to the same memory location x such that \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}<_{\tau }e<_{\tau }e_r\). If there were such an \(e\), then \(e_{src}\) wouldn’t be the source of \(e_r\). The same argument goes in \(\tau '\). This tells us that \(e_u\) is the immediate store order successor of \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\), i.e., \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\rightarrow ^{\textsf {uu}}_{\tau }e_u\) and \(e_{src}\rightarrow ^{\textsf {st}}_{\tau }e_w\). Since \(\rightarrow ^{\textsf {uu}}_{\tau } = \rightarrow ^{\textsf {uu}}_{\tau '}\), we have \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\rightarrow ^{\textsf {uu}}_{\tau '}e_u\). Hence \(e_u\) is the first update event which succeeds both \(e_r\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\) in \(<_{\tau '}\). Thus \(e_r\rightarrow ^{\textsf {cfct}}_{\tau '}e_u\).
Lastly, we assume (iii) that there is an event \(e_{src}\) such that \(e_{src}\rightarrow ^{\textsf {srcss}}_{\tau }e_r\) and that \(\textsf {tid(}{e_{src}}\textsf {)} \ne \textsf {tid(}{e_r}\textsf {)}\). Then it is the case in \(\tau \) that \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}<_{\tau }\textsf {upd}{}_\textsf {st}(e_{src})<_{\tau }e_r\). And there is no update event \(e\) to \(\mathbf{x} \) such that \(\textsf {upd}{}_\textsf {st}(e_{src})<_{\tau }e<_{\tau }e_r\). The same holds in \(\tau '\). Since \(e_u\) is the first update to \(\mathbf{x} \) after \(e_r\) in \(\tau \), this means that we have \(\textsf {upd}{}_\textsf {st}(e_{src})\rightarrow ^{\textsf {uu}}_{\tau }e_u\). We have \(\rightarrow ^{\textsf {uu}}_{\tau } \,= \,\rightarrow ^{\textsf {uu}}_{\tau '}\), so \(\textsf {upd}{}_\textsf {st}(e_{src})\rightarrow ^{\textsf {uu}}_{\tau '}e_u\). Now it must be the case that \(e_r<_{\tau '}e_u\). Otherwise, \(e_{src}\) would not be the source of \(e_r\) in \(\tau '\), and we know \(e_{src}\rightarrow ^{\textsf {srcss}}_{\tau '}e_r\). Hence \(e_u\) is an update event that succeeds both \(e_r\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}\) in \(<_{\tau '}\). It remains to show that it is the first such update. Suppose \(e\ne e_u\) is an update event to \(\mathbf{x} \) such that \(e_r<_{\tau '}e<_{\tau '}e_u\). Since \(\textsf {upd}{}_\textsf {st}(e_{src})<_{\tau '}e_r\), it would then be the case that \(\textsf {upd}{}_\textsf {st}(e_{src})<_{\tau '}e<_{\tau '}e_u\). But this would contradict \(\textsf {upd}{}_\textsf {st}(e_{src})\rightarrow ^{\textsf {uu}}_{\tau '}e_u\). Thus we have \(e_r\rightarrow ^{\textsf {cfct}}_{\tau '}e_u\).
This concludes the proof of \(\mathcal {T}_C(\tau ) = \mathcal {T}_C(\tau ')\). \(\square \)
Proof of Lemma 2
We will prove that \(\mathcal {T}(\tau ) \,=\, \mathcal {T}(\tau ')\). We know that \(V_{SS}\) (respectively \(V'_{SS}\)) is precisely the nonupdates of \(V_C\) (respectively \(V'_C\)). Since \(V_C \,=\, V'_C\) we have \(V_{SS}\,=\,V'_{SS}\).
For the relations \(\rightarrow ^{\textsf {po}}_{\tau }\) and \(\rightarrow ^{\textsf {st}}_{\tau }\), reasoning analogous to that in the \(\Rightarrow \) direction gives that \(\rightarrow ^{\textsf {po}}_{\tau } \,=\, \rightarrow ^{\textsf {po}}_{\tau '}\) and \(\rightarrow ^{\textsf {st}}_{\tau } =\, \rightarrow ^{\textsf {st}}_{\tau '}\).
We will show that \(\rightarrow ^{\textsf {srcss}}_{\tau }\subseteq \rightarrow ^{\textsf {srcss}}_{\tau '}\). Symmetry then gives \(\rightarrow ^{\textsf {srcss}}_{\tau '}\subseteq \rightarrow ^{\textsf {srcss}}_{\tau }\), and hence \(\rightarrow ^{\textsf {srcss}}_{\tau } \,=\, \rightarrow ^{\textsf {srcss}}_{\tau '}\). Assume therefore that \(e_w\rightarrow ^{\textsf {srcss}}_{\tau }e_r\) holds for some events \(e_w = (p,\textsf {st(}{\mathbf{x}}\textsf {)},j)\) and \(e_r \,=\, (p ',\textsf {ld(}{\mathbf{x}}\textsf {)},j')\). Then by the definition of \(\rightarrow ^{\textsf {srcss}}_{\tau }\) we have that \(e_w\) is the maximal event \(e\,=\, (p '',\textsf {st(}{\mathbf{x}}\textsf {)},j'')\) with respect to \({\rightarrow ^{\textsf {st}}_{\tau }}^*\) such that either \(\textsf {upd}{}_\textsf {st}(e)<_{\tau }e_r\) or \(e{\rightarrow ^{\textsf {po}}_{\tau }}^*e_r\). We will separate the proof by cases: either \(\textsf {tid(}{e_w}\textsf {)} \,=\, \textsf {tid(}{e_r}\textsf {)}\) or \(\textsf {tid(}{e_w}\textsf {)} \ne \textsf {tid(}{e_r}\textsf {)}\).
Assume first (i) that \(\textsf {tid(}{e_w}\textsf {)} \,=\, \textsf {tid(}{e_r}\textsf {)}\). Then it holds that \(e_w{\rightarrow ^{\textsf {po}}_{\tau }}^*e_r\), since the events must be program ordered, and the other direction implies \(e_r <_{\tau } \textsf {upd}{}_\textsf {st}(e_w)\). Program order is the same in \(\tau '\) as in \(\tau \), so we also have \(e_w{\rightarrow ^{\textsf {po}}_{\tau '}}^*e_r\). It remains to show that \(e_w\) is maximal in \(\tau '\). First we conclude that there can be no store event \(e\) such that \(e_w\rightarrow ^{\textsf {st}}_{\tau '}e\) and \(e{\rightarrow ^{\textsf {po}}_{\tau '}}^*e_r\). This is because both the program order and the store order are the same in \(\tau '\) as in \(\tau \), and hence such an event \(e\) would contradict the assumed maximality of \(e_w\) w.r.t. \(\tau \). As a corollary we have \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)} = \textsf {upd}{}_\textsf {st}(e_w)\). Next we need to conclude that there is no event \(e\) such that \(e_w\rightarrow ^{\textsf {st}}_{\tau '}e\) and \(\textsf {upd}{}_\textsf {st}(e)<_{\tau '}e_r\). We know that there is no such event in \(\tau \): i.e., there is no event \(e\) such that \(e_w\rightarrow ^{\textsf {st}}_{\tau }e\) and \(\textsf {upd}{}_\textsf {st}(e)<_{\tau }e_r\). Hence by the definition of \(\rightarrow ^{\textsf {srcct}}_{\tau }\) there is no event \(e_{src}^C\) which is source related with \(e_r\) in the chronological trace: \(e_{src}^C\rightarrow ^{\textsf {srcct}}_{\tau }e_r\). Since \(\rightarrow ^{\textsf {srcct}}_{\tau } \,= \,\rightarrow ^{\textsf {srcct}}_{\tau '}\), the same holds in \(\tau '\). Now if there were an event such as \(e\) in \(\tau '\), then \(e_r\) would have a source according to \(\rightarrow ^{\textsf {srcct}}_{\tau '}\). This is a contradiction, and so there can be no such \(e\) in \(\tau '\). Hence, \(e_w\) is the maximal store event w.r.t. 
\({\rightarrow ^{\textsf {st}}_{\tau '}}^*\) which is either updated \(<_{\tau '}\)-before \(e_r\) or program-order-before \(e_r\). That concludes the proof for the case that \(\textsf {tid(}{e_w}\textsf {)} = \textsf {tid(}{e_r}\textsf {)}\).
Next assume (ii) that \(\textsf {tid(}{e_w}\textsf {)} \ne \textsf {tid(}{e_r}\textsf {)}\). Clearly \(e_w\) is not program ordered with \(e_r\). Hence the definition of \(\rightarrow ^{\textsf {srcss}}_{\tau }\) gives that \(\textsf {upd}{}_\textsf {st}(e_w)<_{\tau }e_r\). The maximality of \(e_w\) gives that \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)} <_{\tau } \textsf {upd}{}_\textsf {st}(e_w)\), and that there is no update event \(e= (p '',\textsf {u(}{\mathbf{x}}\textsf {)},j'')\) such that \(\textsf {upd}{}_\textsf {st}(e_w)<_{\tau }e<_{\tau }e_r\). Then by the definition of \(\rightarrow ^{\textsf {srcct}}_{\tau }\) we have \(\textsf {upd}{}_\textsf {st}(e_w)\rightarrow ^{\textsf {srcct}}_{\tau }e_r\). By \(\rightarrow ^{\textsf {srcct}}_{\tau } = \,\rightarrow ^{\textsf {srcct}}_{\tau '}\) we also have \(\textsf {upd}{}_\textsf {st}(e_w)\rightarrow ^{\textsf {srcct}}_{\tau '}e_r\). By the definition of \(\rightarrow ^{\textsf {srcct}}_{\tau '}\) we now have that \(e_w\) is the greatest (w.r.t. \(<_{\tau '}\)) store event with \(\textsf {upd}{}_\textsf {st}(e_w)<_{\tau '}e_r\). We also have that \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}<_{\tau '}\textsf {upd}{}_\textsf {st}(e_w)\). Since there can be no event \(e\,=\, (\_,\textsf {st(}{\mathbf{x}}\textsf {)},\_)\) such that \(e{\rightarrow ^{\textsf {po}}_{\tau '}}^*e_r\) and \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_r}\textsf {)}<_{\tau '}\textsf {upd}{}_\textsf {st}(e)\), we have that \(e_w\) is the maximal event \(e\,= \,(\_,\textsf {st(}{\mathbf{x}}\textsf {)},\_)\) with respect to \({\rightarrow ^{\textsf {st}}_{\tau }}^*\) such that either \(\textsf {upd}{}_\textsf {st}(e)<_{\tau '}e_r\) or \(e{\rightarrow ^{\textsf {po}}_{\tau '}}^*e_r\). Hence \(e_w\rightarrow ^{\textsf {srcss}}_{\tau '}e_r\). This concludes the proof for \(\rightarrow ^{\textsf {srcss}}_{\tau } \,= \,\rightarrow ^{\textsf {srcss}}_{\tau '}\).
Since \(\rightarrow ^{\textsf {cfss}}_{\tau }\) (respectively \(\rightarrow ^{\textsf {cfss}}_{\tau '}\)) is entirely determined by \(\rightarrow ^{\textsf {srcss}}_{\tau }\) and \(\rightarrow ^{\textsf {st}}_{\tau }\) (respectively \(\rightarrow ^{\textsf {srcss}}_{\tau '}\) and \(\rightarrow ^{\textsf {st}}_{\tau '}\)), and we know that \(\rightarrow ^{\textsf {srcss}}_{\tau } \,= \,\rightarrow ^{\textsf {srcss}}_{\tau '}\) and \(\rightarrow ^{\textsf {st}}_{\tau } \,=\, \rightarrow ^{\textsf {st}}_{\tau '}\), we immediately get that \(\rightarrow ^{\textsf {cfss}}_{\tau } \,=\, \rightarrow ^{\textsf {cfss}}_{\tau '}\). This concludes the proof. \(\square \)
4 DPOR algorithm for TSO
A DPOR algorithm can exploit chronological traces to perform stateless model checking of programs that execute under TSO (and PSO), as illustrated at the end of Sect. 2. The explored executions follow the semantics of TSO in Sect. 3. For each execution, its happens-before relation is computed, which is the transitive closure of the edge relation \(\rightarrow ^{\textsf {ct}}_{\tau }\,=\,\rightarrow ^{\textsf {po}}_{\tau }\cup \rightarrow ^{\textsf {su}}_{\tau }\cup \rightarrow ^{\textsf {uu}}_{\tau }\cup \rightarrow ^{\textsf {srcct}}_{\tau }\cup \rightarrow ^{\textsf {cfct}}_{\tau }\cup \rightarrow ^{\textsf {uf}}_{\tau }\) of the corresponding chronological trace. This happens-before relation can in principle be exploited by any DPOR algorithm to explore at least one execution per equivalence class induced by Shasha–Snir traces. In this section, we will show concretely how to compute the happens-before relation, and how to use it to instantiate a DPOR algorithm. To do so, we will first need to introduce some further concepts.
The happens-before relation \(\rightarrow ^{\textsf {ct}}_{\tau }\) is computed on the fly, using vector clocks, while taking the particular structure of chronological traces into account. The main difference from computing happens-before relations for sequentially consistent executions (see, e.g., [32]) is that load events which get their value by store forwarding are not immediately synchronized with the vector clock of the memory location. Instead, the load is associated with the store buffer entry from which it got its value. The load is then synchronized with the memory location at the time when the store buffer entry is updated to memory.
Formally, we extend the TSO configurations described in Sect. 3 to keep track of the necessary information about relations between different events. Below we need vector clocks. A vector clock is a function \(C:(\textsf {TID}\cup \textsf {AuxTID})\mapsto \mathbb {N}\). The intuition is that C captures a set of observed events. For every thread p, the first \(C(p)\) events by \(p \) have been observed. We let \(\textsf {VecClocks}=((\textsf {TID}\cup \textsf {AuxTID})\mapsto \mathbb {N})\) denote the set of vector clocks.
For two vector clocks \(v,v'\) we use the notation \(v\sqcup {}v'\) to denote the vector clock \(v''\) such that \(v''(p)=\max (v(p),v'(p))\) for all \(p \). For two vector clocks \(v,v'\) we say that \(v\le {}v'\) when \(v(p)\le {}v'(p)\) for all \(p \). We say that \(v<v'\) if \(v\le {}v'\) and at least one of the inequalities is strict. For an event \(e\) and a set E of events we define \(E\oplus e=\{e'\in {}E\mid \textsf {tid(}{e'}\textsf {)}\ne \textsf {tid(}{e}\textsf {)}\}\cup \{e\}\), i.e. \(E\oplus e\) is E where \(e\) replaces the previous event \(e'\in E\) s.t. \(\textsf {tid(}{e'}\textsf {)}=\textsf {tid(}{e}\textsf {)}\). We use the shorthand \(f[x_0,x_1,\cdots ,x_n\hookleftarrow v]\) to denote \(f[x_0\hookleftarrow v][x_1\hookleftarrow v]\cdots [x_n\hookleftarrow v]\), i.e., an assignment of the same value to multiple function arguments.
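The vector clock operations above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: clocks are dictionaries in which a missing thread counts as 0, events are encoded as `(tid, index)` pairs, and the function names `join`, `leq`, `lt` and `oplus` are hypothetical.

```python
from typing import Dict, Set, Tuple

Clock = Dict[str, int]   # thread id -> number of observed events (missing = 0)
Event = Tuple[str, int]  # assumed encoding: (thread id, index within thread)


def join(v: Clock, w: Clock) -> Clock:
    """Pointwise maximum of two clocks (the join of v and v')."""
    return {p: max(v.get(p, 0), w.get(p, 0)) for p in v.keys() | w.keys()}


def leq(v: Clock, w: Clock) -> bool:
    """v <= v' when every component of v is bounded by that of v'."""
    return all(n <= w.get(p, 0) for p, n in v.items())


def lt(v: Clock, w: Clock) -> bool:
    """v < v' when v <= v' and at least one component is strictly smaller."""
    return leq(v, w) and any(v.get(p, 0) < n for p, n in w.items())


def oplus(events: Set[Event], e: Event) -> Set[Event]:
    """E (+) e: e replaces any event of the same thread in E."""
    return {other for other in events if other[0] != e[0]} | {e}
```

Representing absent entries as 0 keeps the clocks sparse, which is convenient when auxiliary update threads are created lazily.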

\({\mathcal {C}}\) \(: (\textsf {TID}\cup \textsf {AuxTID}\cup \textsf {Event}\cup \{\perp \})\mapsto \textsf {VecClocks}\)
maps each (real or auxiliary) thread identifier \(p \) to a vector clock representing which parts of the execution have been seen by \(p \). Also, \(\mathcal {C}\) maps each event \(e\) to the value of \({\mathcal {C}}(\textsf {tid(}{e}\textsf {)})\) at the time immediately after executing \(e\). We fix that \(\mathcal {C}(\perp ) = (\lambda x . 0)\) is a zeroed clock.

\({\mathcal {B}}\) \(: \textsf {TID}\mapsto (\textsf {MemLoc}\times \textsf {Event}\times (\textsf {Event}\cup \{\perp \}))^*\)
maps each real (not auxiliary) thread ID \(p \) to a word of letters \((\mathbf{x},e_s,e_l)\), each of which keeps an auxiliary state for the corresponding letter in the store buffer in \(\mathbb {L}(p)\). Here \(\mathbf{x} \) is the accessed memory location, \(e_s\) is the store event that produced that letter, and \(e_l\) is the latest buffer forwarded load event for which the letter has been the source (if there is no such event then \(e_l \,=\, \perp \)).

\(\mathcal {M}\) \(: \textsf {MemLoc}\mapsto ((\textsf {Event}\cup \{\perp \})\times 2^{\textsf {Event}})\)
maps each memory location \(\mathbf{x} \) to a pair \((e_u,E_l)\), where \(e_u\) is the latest update event that accessed \(\mathbf{x} \) (or \(\perp \) if \(\mathbf{x} \) has never been updated), and where \(E_l\) is a set which for each thread \(p \) that has read x contains the latest event of \(p \) that read the value of \(\mathbf{x} \).
The idea here is that as we execute memory accesses, we update the vector clock of the executing thread to reflect which new events have been observed.
For example, when we execute an update \(e_{\mathbf{x}}\) corresponding to a buffer entry \((\mathbf{x},e_s,e_l)\), we look up the memory state \(\mathcal {M}(\mathbf{x}) = (e_u,E_l)\). We know that the update event is ordered after the previous update \(e_u\), as well as the previous loads in \(E_l\) and the store event \(e_s\) which enabled the update \(e_{\mathbf{x}}\). We update the vector clock \(\mathcal {C}(\textsf {tid(}{e_{\mathbf{x}}}\textsf {)})\) of the auxiliary thread to include all these newly observed events.
The procedure for a load from memory is similar, except that we do not observe previous loads. More interesting are loads that are satisfied by buffer forwarding. When we execute a buffer forwarded load \(e_l\) to x , we do not observe any new event, since the load was not able to reach and synchronize with the memory. Instead we save the load event with the buffer entry from which it read its value. When that entry is updated to memory, by the update event \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_l}\textsf {)}\), we move \(e_l\) to the set of loads that have been observed by \(\mathcal {M}(\mathbf{x} {})\). By this scheme the load event \(e_l\) becomes available for observation by precisely the update events which succeed \({\textsf {upd}}{}_{\textsf {ld}}\textsf {(}{e_l}\textsf {)}\). In the remainder of this section we will make this intuition formal.
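The deferred registration of buffer-forwarded loads can be illustrated with a small Python sketch. It covers only the bookkeeping of \(\mathcal {B}\) and \(\mathcal {M}\), not the vector clocks. The class name `ForwardingTracker`, its method names, and the use of plain strings for events are hypothetical. The sketch also assumes that the set of recorded loads is reset whenever an update reaches memory, which is consistent with the values of \(\mathcal {M}(\mathbf{x})\) reported in the example exploration but is not spelled out here.

```python
from typing import Dict, List, Optional, Tuple


class ForwardingTracker:
    """Bookkeeping sketch for buffer entries and per-location load sets."""

    def __init__(self) -> None:
        # thread -> FIFO of [location, store event, latest forwarded load]
        self.buffers: Dict[str, List[list]] = {}
        # location -> (latest update event, {thread: latest load event})
        self.memory: Dict[str, Tuple[Optional[str], Dict[str, str]]] = {}

    def store(self, p: str, x: str, e: str) -> None:
        self.buffers.setdefault(p, []).append([x, e, None])

    def load(self, p: str, x: str, e: str) -> str:
        # Search the buffer of p from newest to oldest for a store to x.
        for entry in reversed(self.buffers.get(p, [])):
            if entry[0] == x:
                entry[2] = e  # forwarded: remembered on the buffer entry only
                return "forwarded"
        update, loads = self.memory.get(x, (None, {}))
        self.memory[x] = (update, {**loads, p: e})  # visible immediately
        return "memory"

    def update(self, p: str, e: str) -> None:
        x, _store, forwarded = self.buffers[p].pop(0)  # oldest entry first
        # Assumption: earlier loads are dropped; the forwarded load, if any,
        # becomes visible to later updates only from this point on.
        loads = {p: forwarded} if forwarded is not None else {}
        self.memory[x] = (e, loads)
```

Replaying `A, B, A'` (load forwarded from the buffer) and `A, A', B` (load from memory) for a thread `p` that stores to and then loads from `x` leaves the same entry for `x` in `memory`, which reflects that the two orders are not distinguished.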
4.1 Instantiating Source-DPOR for chronological traces
The DPOR algorithm takes three parameters: the current execution \(\tau \), the current extended state \((\mathbb {L},\mathbb {M},\mathcal {C},\mathcal {B},\mathcal {M})\) and a sleep set \({ Sleep}\) of threads which are currently blocked from being executed. The algorithm recursively explores executions which are continuations of \(\tau \). On line 1 we pick a thread \(p \) that can be executed in the current state, and which is not in the sleep set. The next instruction of \(p \) will be the first continuation of \(\tau \) which is explored. As races are discovered between events during the exploration, new alternative continuations will be added to the set backtrack. Such alternatives will be explored in subsequent iterations through the loop on lines 3–30.
For each event \(e\) that is added to \(\tau \) in one iteration of the loop, three main steps are performed: On line 7, the configuration is updated to reflect the effect of executing \(e\), as well as the new edges in the happens-before relation \(\rightarrow ^{\textsf {ct}}_{\tau }\). At the same time, we compute the set cnf of earlier events which race with \(e\). On lines 8–16, we add new branches to backtrack, corresponding to the races that have been detected and collected in the set cnf. On lines 17–27 we update the sleep set in order to unblock those threads whose next instruction races with \(e\).
The first step is handled by the auxiliary algorithm in Fig. 12. It is explained in Sect. 4.2 below.
In the next step, on lines 8–16, we add new branches to the set backtrack, corresponding to each of the races that have been collected in cnf. For each event \(e_c\) which is in a race with \(e\), we want to add an alternative branch, where \(e_c\) is delayed, allowing the possibility for \(e\) to execute before \(e_c\). Therefore, at line 9, we identify the position in \(\tau \) where \(e_c\) was executed, and the subexecutions \(\tau _0\) and \(\tau _1\) preceding and succeeding \(e_c\). We then identify the set I of events in \(\tau _1\) which may be the first executed event after \(\tau _0\) in a hypothetical other execution where \(e\) executes before \(e_c\). We make certain at lines 12–15 that at least one of the events in I is represented as an alternative branch to explore after \(\tau _0\).
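The selection of initial events and the conditional extension of the backtrack set can be sketched as follows. This is a simplified illustration, not the algorithm of Fig. 11: the function names, the `happens_before` predicate and the `choose` parameter are hypothetical, and the sketch considers every event of \(\tau _1\cdot e\) as a candidate, whereas the actual algorithm may restrict the candidates further.

```python
from typing import Callable, Iterable, List, Sequence, Set


def initials(seq: Sequence[str],
             happens_before: Callable[[str, str], bool]) -> List[str]:
    """Events of seq that have no happens-before predecessor within seq."""
    return [e for i, e in enumerate(seq)
            if not any(happens_before(seq[j], e) for j in range(i))]


def extend_backtrack(backtrack: Set[str],
                     initial_threads: Iterable[str],
                     choose: Callable[[Iterable[str]], str] = min) -> Set[str]:
    """Add one thread of an initial event unless one is already present."""
    threads = set(initial_threads)
    if backtrack & threads:
        return set(backtrack)
    return set(backtrack) | {choose(threads)}
```

Any initial event is an admissible choice, so `choose` only fixes which of the equally valid alternatives is scheduled.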
Example exploration
Recall the program given in Fig. 5. In the example of Fig. 8 in the preliminaries, we gave a high-level explanation of how DPOR with chronological traces would proceed to explore that program. In Fig. 13, we revisit that exploration, and point to how it is achieved by the algorithm given in Fig. 11.
At first, the sleep set is empty, as well as the backtrack set of the current execution \(\tau =\varepsilon \). At this point the threads \(p \) and \(q \) are enabled, but not the auxiliary threads \(\textsf {upd} (p)\) and \(\textsf {upd} (q)\). On line 1 of the algorithm, we pick the enabled thread \(p \), and proceed to insert \(p \) into the backtrack set of \(\tau \) and execute the first instruction A of \(p \). In Fig. 13, we see the execution of A at the top left. The \(slp:\emptyset \) and \(bt:\{p \}\) above A indicate that the sleep set at this point is empty, and the backtrack set is the singleton set containing \(p \).
We continue arbitrarily scheduling the instructions of the enabled threads in the sequence \(A',B,C,C'\), down along the leftmost column of Fig. 13. This gives us the first execution.
As we execute each event, the call to TSOpost on line 7 identifies the earlier events with which the current event has a conflict. During the first execution, two conflicts are identified: When the update \(C'\) is executed, we will have \(\mathcal {M}(\mathbf{x})=(A',\{B\})\), and we will find a conflict from \(A'\) to \(C'\) and one from B to \(C'\). Hence on line 7 when \(C'\) is executed, cnf will be assigned \(\{A',B\}\). When we enter the loop on lines 8–16, with \(e_c=A'\) we will first split the execution \(\tau =AA'BC\) into \(\tau _0=A\) and \(\tau _1=BC\) on line 9. Then we will identify the events in \(\tau _1\cdot e=BCC'\) which are initial. Notice that C precedes \(C'\) in the happens-before order, but no events in \(BCC'\) precede B or C. Hence we assign \(\texttt {I}=\{B,C\}\). On lines 12–15 we make certain that either B or C is in the backtrack set corresponding to the position immediately before \(A'\) was executed. In this case we choose to insert \(\textsf {tid(}{C}\textsf {)}=q \) into \(\texttt {backtrack}(A)\). Similarly in the next iteration of the loop on lines 8–16, we insert \(q \) into \(\texttt {backtrack}(AA')\).
As the first execution has been completely explored, the algorithm now starts to backtrack. After \(C'\) has been explored, on the bottom left in Fig. 13, its thread \(\textsf {upd} (q)\) is added to the sleep set on line 29. Since the only thread in the backtrack set (\(\textsf {upd} (q)\)) is also in the sleep set, the loop on lines 3–30 terminates, and the call to TSOSourceDPOR returns. The call to TSOSourceDPOR which executed C immediately returns in the same way. In the call which executed B however, the backtrack set now contains one additional thread \(q \) which should be explored. Therefore, the algorithm takes an extra lap in the loop on lines 3–30, this time exploring the instruction C of \(q \). This new branch is illustrated in Fig. 13 as the middle column. Notice that \(p \) is present in the sleep set, which prevents us from scheduling the load B until some other conflicting event has been executed.
As the update \(C'\) is executed in the middle execution, we again identify that it has a conflict with the earlier update \(A'\). Again we find that C is an initial event along the execution between \(A'\) and \(C'\). But since \(\textsf {tid(}{C}\textsf {)}=q \) is already present in the backtrack set where \(A'\) is executed, we do not need to insert it again (i.e., \(\texttt {I}\cap \texttt {backtrack}(\tau _0)\ne \emptyset \) on line 12 in the algorithm).
We also identify that the update \(C'\) conflicts with the event B which is the next event of the thread \(p \) which is in the sleep set. Therefore, we remove \(p \) from the sleep set on lines 17–27 in the algorithm.
This leaves the algorithm free to schedule B as the next and last event of the middle execution. As we execute the load B, we have \(\mathcal {M}(\mathbf{x})=(C',\emptyset )\), we therefore detect a conflict from the update \(C'\) to the load B. As a result, \(\textsf {tid(}{B}\textsf {)}=p \) is inserted into the backtrack set \(\texttt {backtrack}(AA'C)\) immediately before \(C'\).
As before, we now start to backtrack. When we reach the call to TSOSourceDPOR where \(C'\) was executed, we find both \(\textsf {upd} (q)\) and \(p \) in the backtrack set. However, both are also in the sleep set, and so the call returns without starting a new branch. The next call returns similarly, and we return to the leftmost column in Fig. 13. On the call which executed \(A'\), we find the thread \(q \) in the backtrack set but not in the sleep set. We then start the new branch corresponding to the rightmost column in Fig. 13.
The thread \(\textsf {upd} (p)\) corresponding to the update \(A'\) is added to the sleep set and cannot be scheduled until the conflicting update \(C'\) has been executed. This effectively ensures that the order of the two updates is reversed in the last execution. As \(A'\) is executed, we identify that it is in conflict with the earlier \(C'\), and therefore add \(\textsf {tid(}{A'}\textsf {)}=\textsf {upd} (p)\) to the backtrack set \(\texttt {backtrack}(AC)\). When the last event, the load B, is executed, we have \(\mathcal {M}(\mathbf{x})=(A',\emptyset )\). Since the update \(A'\) originates in the same thread as the load B, there is no conflict from \(A'\) to B, and so we do not update any backtrack sets.
When backtracking after the last execution, at no point do we have a thread which is present in the backtrack set but not in the sleep set. Therefore, no new branches are initiated, and the algorithm terminates.
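As a sanity check on the walkthrough, the following Python sketch enumerates all TSO executions of the example by brute force and groups them by the parts of the Shasha–Snir trace that can vary. It assumes, as the walkthrough suggests, that the program of Fig. 5 consists of a thread \(p \) that stores to \(\mathbf{x} \) (A, with update \(A'\)) and then loads \(\mathbf{x} \) (B), and a thread \(q \) that stores to \(\mathbf{x} \) (C, with update \(C'\)). The encoding and all function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterator, List, Optional, Tuple

EVENTS = ["A", "A'", "B", "C", "C'"]
# (a, b): a must precede b (program order, or a store before its update).
ORDER = [("A", "A'"), ("A", "B"), ("C", "C'")]

Trace = Tuple[Tuple[str, ...], Optional[str]]


def executions() -> Iterator[Tuple[str, ...]]:
    """All interleavings of the five events that respect ORDER."""
    for perm in permutations(EVENTS):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in ORDER):
            yield perm


def trace(execution: Tuple[str, ...]) -> Trace:
    """Store order on x and the store that the load B reads from."""
    in_memory: Optional[str] = None  # store whose update reached memory last
    a_buffered = False               # is A still in the store buffer of p?
    store_order: List[str] = []
    source_of_b: Optional[str] = None
    for e in execution:
        if e == "A":
            a_buffered = True
        elif e == "A'":
            a_buffered, in_memory = False, "A"
            store_order.append("A")
        elif e == "C'":
            in_memory = "C"
            store_order.append("C")
        elif e == "B":
            # Buffer forwarding takes priority over the value in memory.
            source_of_b = "A" if a_buffered else in_memory
    return tuple(store_order), source_of_b


def classes() -> Dict[Trace, List[Tuple[str, ...]]]:
    grouped: Dict[Trace, List[Tuple[str, ...]]] = {}
    for ex in executions():
        grouped.setdefault(trace(ex), []).append(ex)
    return grouped
```

Under these assumptions the 20 possible interleavings fall into three classes, matching the three columns explored in Fig. 13: program order is fixed, and the conflict relation is determined by the source relation and the store order.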
4.2 Computing the next configuration and happens-before relation
We call the algorithm TSOpost(\((\mathbb {L},\mathbb {M},\mathcal {C},\mathcal {B},\mathcal {M})\),\(e\)), shown in Fig. 12, to compute the extended configuration which is reached by executing the event \(e\) from the configuration \((\mathbb {L},\mathbb {M},\mathcal {C},\mathcal {B},\mathcal {M})\). The algorithm performs a case split based on the type of instruction that is being executed. We will here pay some attention to the case of updates, and leave the other cases undescribed, since they are similar. The update case is covered on lines 10–23. First, on lines 11–13 we remove the oldest store from the store buffer, and update the memory, as described in the TSO semantics above. On the next two lines, we remove the corresponding entry from the buffer \(\mathcal {B}\) in the extended configuration. At the same time, we take note that the event \(e_s\) is the store corresponding to this update, and that \(e_r\) is the latest load to which this store has been buffer forwarded. On lines 16–21, we update the information about the memory location \(\mathbf{x} \) in the extended configuration. We change it such that \(e\) is recorded as the latest update for \(\mathbf{x} \). Furthermore, if the update \(e\) has been buffer forwarded to any load \(e_r\), then that load is recorded as the latest load of \(\mathbf{x} \) by \(p \). By recording \(e_r\) in \(\mathcal {M}(\mathbf{x})\) at this point, rather than at the point when the load itself was executed, we ensure that the load is recorded as racing with updates which succeed \(e\), but not with “hidden” updates which precede \(e\). Next, on line 22, we assign a new vector clock to both the thread \(p \) and the event \(e\). The new vector clock is the pointwise maximum of the vector clocks of all events that precede \(e\) in the \(\rightarrow ^{\textsf {ct}}_{\tau }\) order. 
The new clock includes \(\texttt {c}{}_{p}\), which captures the program order \(\rightarrow ^{\textsf {po}}_{\tau }\), and \(\mathcal {C}(e_s)\) which captures the relation \(\rightarrow ^{\textsf {su}}_{\tau }\) to \(e\) from the store \(e_s\) from which it originates. It includes \(\mathcal {C}(e_u)\) which captures the relation \(\rightarrow ^{\textsf {uu}}_{\tau }\) to \(e\) from the last previous update to \(\mathbf{x} \), and it includes \(\mathcal {C}(e_l)\) for all previous loads \(e_l\) to \(\mathbf{x} \), capturing the conflict relation \(\rightarrow ^{\textsf {cfct}}_{\tau }\). Finally, on line 23, we record the previous memory accesses which are in a race with \(e\), i.e., the events originating in different threads, and which immediately precede \(e\) in \(\rightarrow ^{\textsf {uu}}_{\tau }\cup \rightarrow ^{\textsf {srcct}}_{\tau }\cup \rightarrow ^{\textsf {cfct}}_{\tau }\). Hence, we select events from the update \(e_u\) and the loads in L which have not already been ordered in \(\rightarrow ^{\textsf {ct}}_{\tau }\) with \(e\) or its corresponding store \(e_s\), and which have different thread identifiers.
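The clock computation and race selection for the update case can be sketched as follows. This is a minimal illustration, not the code of Fig. 12: vector clocks are modelled as dictionaries from thread identifiers to counters, and the names `join`, `ordered_before`, `update_clock` and `racing_accesses` are hypothetical. The sketch only shows the pointwise maximum and the race filter; bookkeeping for \(\mathcal {B}\) and \(\mathcal {M}\) is omitted.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: a vector clock maps a thread identifier to an event counter.
Clock = Dict[str, int]


def join(clocks: Iterable[Clock]) -> Clock:
    """Pointwise maximum of a collection of vector clocks."""
    out: Clock = {}
    for clock in clocks:
        for thread, count in clock.items():
            out[thread] = max(out.get(thread, 0), count)
    return out


def ordered_before(a: Clock, b: Clock) -> bool:
    """True if clock a is pointwise below or equal to clock b."""
    return all(count <= b.get(thread, 0) for thread, count in a.items())


def update_clock(c_p: Clock, c_store: Clock, c_prev_update: Clock,
                 c_loads: List[Clock]) -> Clock:
    """Join of the clocks of the predecessors of an update e:
    c_p (program order), C(e_s) (originating store), C(e_u) (previous
    update to x) and C(e_l) for all previous loads of x."""
    return join([c_p, c_store, c_prev_update, *c_loads])


def racing_accesses(candidates: List[Tuple[str, Clock]], thread: str,
                    c_p: Clock, c_store: Clock) -> List[int]:
    """Indices of candidate accesses (the previous update and the recorded
    loads) that come from another thread and are not already ordered
    before e (via c_p) or before its store e_s (via c_store)."""
    return [
        i for i, (other, clock) in enumerate(candidates)
        if other != thread
        and not ordered_before(clock, c_p)
        and not ordered_before(clock, c_store)
    ]
```

A candidate is reported as racing only if neither the program-order clock nor the clock of the originating store already covers it, which mirrors the filter described for line 23.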
4.3 Correctness of the DPOR algorithms
We state the following theorem of correctness for DPOR applied to chronological traces.
Theorem 2
(Correctness of DPOR algorithms) The Source-DPOR and Optimal-DPOR algorithms of [1], based on the happens-before relation induced by chronological traces, explore at least one execution per equivalence class induced by Shasha–Snir traces. Moreover, Optimal-DPOR explores exactly one execution per equivalence class.
Proof of Theorem 2
The proof relies on the notion of a valid happens-before assignment (cf. [1]): an assignment of a happens-before relation \(\rightarrow _{\tau }\) to each execution \(\tau \) is valid if it satisfies the following conditions:
1. \(\rightarrow _{\tau }\) is a partial order on the events in \(\tau \), which is included in \(<_{\tau }\),
2. the events of each thread are totally ordered by \(\rightarrow _{\tau }\),
3. if \(\tau '\) is a prefix of \(\tau \), then \(\rightarrow _{\tau }\) and \(\rightarrow _{\tau '}\) are the same on \(\tau '\),
4. the assignment of happens-before relations to executions partitions the set of executions into equivalence classes; i.e., if \(\tau '\) is a linearization of the happens-before relation on \(\tau \), then \(\tau '\) is assigned the same happens-before relation as \(\tau \); we use \(\simeq \) to denote the corresponding equivalence relation,
5. whenever \(\tau \) and \(\tau '\) are equivalent, they end up in the same global program state,
6. for any sequences \(\tau \), \(\tau '\) and \(\tau ''\), such that \(\tau \cdot {}\tau ''\) is an execution, we have \(\tau \simeq \tau '\) if and only if \(\tau \cdot {}\tau '' \simeq \tau '\cdot {}\tau ''\), and
7. if \(\tau \cdot {}(p,i,j)\) is an execution, whose last event is performed by thread p, and q, r are different threads, such that (p, i, j) would “happen before” a subsequent event by r but not a subsequent event by q, then (p, i, j) would also “happen before” \((r,i'',j'')\) in the execution \(\tau \cdot {}(p,i,j)\cdot {}(q,i',j')\cdot {}(r,i'',j'')\).
The theorem can now be proven by establishing that the happens-before assignment induced by chronological traces is valid. Conditions 1, 2, 3, and 6 follow straightforwardly from the definitions. Condition 4 follows by observing that changing the order between unrelated events does not affect the definition of the chronological trace. Condition 5 follows by observing that the chronological trace captures all dependencies that are needed for determining which values are read and written by loads and stores. Finally, Condition 7 follows by noting that an arrow between (p, i, j) and \((r,i'',j'')\) in a chronological trace cannot be removed by inserting an event that is independent of (p, i, j). This concludes the proof of Theorem 2.
5 Adaptation for PSO
In this section, we show how our techniques can be adapted to the PSO memory model with minor changes. Before we see how to apply our methods to it, we give an informal description of the PSO memory model.
5.1 PSO semantics
PSO is a strictly more relaxed model than TSO. As described previously, TSO allows reordering of stores with subsequent loads. PSO allows the same reordering, but also allows the reordering of stores with subsequent stores to different memory locations.
This behavior can be modelled by an operational semantics similar to the one described in Sect. 3 for TSO, but where each thread has a separate store buffer for each memory location. Each store buffer is FIFO-ordered, so stores to the same memory location by the same thread cannot be reordered. But there is no order maintained between stores in different buffers, so stores by the same thread to different locations may update in reversed order.
In Fig. 14a we give an example of a program where PSO allows more behaviors than TSO. The execution in Fig. 14b shows how the stores by \(p \) to x and y update to memory in reversed order. This allows the thread \(q \) to read first \(\mathbf{y} = 1\) then \(\mathbf{x} = 0\), which would be impossible both under SC and TSO.
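The difference between the two buffer disciplines can be checked on a small example by exhaustive enumeration. The sketch below assumes a two-thread program in the spirit of the one described above, where \(p \) stores 1 to x and then to y, and \(q \) loads y and then x; the exact program of Fig. 14a is not reproduced here, and the function name `outcomes` is hypothetical. Per-location FIFO buffers are modelled by allowing the oldest pending store for each location to update, whereas TSO only allows the oldest pending store overall.

```python
from typing import Dict, Set, Tuple

# Assumed example program (initially x = y = 0):
#   thread p: x := 1; y := 1        thread q: a := y; b := x
P_STORES = (("x", 1), ("y", 1))
Q_LOADS = ("y", "x")


def outcomes(model: str) -> Set[Tuple[int, ...]]:
    """All values (a, b) that q can read under model "TSO" or "PSO"."""
    results: Set[Tuple[int, ...]] = set()

    def explore(p_pc: int, buf: Tuple[Tuple[str, int], ...],
                mem: Dict[str, int], q_vals: Tuple[int, ...]) -> None:
        # buf holds the pending stores of p in program order.
        done_p = p_pc == len(P_STORES) and not buf
        if done_p and len(q_vals) == len(Q_LOADS):
            results.add(q_vals)
            return
        # Thread p executes its next store: the store is only buffered.
        if p_pc < len(P_STORES):
            explore(p_pc + 1, buf + (P_STORES[p_pc],), mem, q_vals)
        # Update step: move a buffered store to memory.
        seen = set()
        for i, (loc, val) in enumerate(buf):
            if loc in seen:
                continue  # not the oldest pending store for this location
            seen.add(loc)
            if model == "TSO" and i > 0:
                break  # single FIFO buffer: only the overall oldest store
            new_mem = dict(mem)
            new_mem[loc] = val
            explore(p_pc, buf[:i] + buf[i + 1:], new_mem, q_vals)
        # Thread q executes its next load (q has no stores, so it reads memory).
        if len(q_vals) < len(Q_LOADS):
            explore(p_pc, buf, mem, q_vals + (mem[Q_LOADS[len(q_vals)]],))

    explore(0, (), {"x": 0, "y": 0}, ())
    return results
```

Comparing `outcomes("TSO")` with `outcomes("PSO")` on this assumed program isolates the effect of the buffering discipline: the enumeration is identical except for which pending stores are allowed to update.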
5.2 Chronological traces for PSO
The adaptation of chronological traces to PSO is straightforward. The following simple adjustment suffices: Since stores from the same thread \(p {}\) to different memory locations x and y are updated by different auxiliary threads upd (\(p \),x) and upd (\(p \),y), there is no program order edge between the update events for different memory locations under PSO.
Formally, we reuse the definition of chronological traces for TSO, from Sect. 3, with some minor changes:
We need to reformulate the definition of \(\textsf {upd}{}_\textsf {st}\) to reflect that there are now multiple store buffers per thread: For an execution \(\tau \) and an event \(e=(p,\textsf {st(}{\mathbf{x}}\textsf {)},j)\) in \(\tau \), let k be the number of events \(e_w=(p ',\textsf {st(}{\mathbf{x}}\textsf {)},j')\) in \(\tau \) such that \(p '=p \) and \(j'\le {}j\). Then \(\textsf {upd}{}_\textsf {st}(e)=(\textsf {upd} (p,\mathbf{x}),\textsf {u(}{\mathbf{x}}\textsf {)},k)\) if there is such an event in \(\tau \). Otherwise \(\textsf {upd}{}_\textsf {st}(e)=e^\infty \). This new definition of \(\textsf {upd}{}_\textsf {st}\) replaces the old one in the definition of chronological traces for PSO.
We can then define chronological traces for PSO in the same way as for TSO, except that the definition of \(\rightarrow ^{\textsf {uf}}_{\tau }\) needs to be reformulated as follows:
For two events \(e=(p,i,j)\) and \(e'=(p ',i',j')\) we say that \(e\rightarrow ^{\textsf {ufpso}}_{\tau }e'\) iff \(i=\textsf {u(}{\mathbf{x}}\textsf {)}\) for some x , and \(i'=\textsf {fence}\) and \(p =\textsf {upd} (p ',\mathbf{x})\) and \(e<_{\tau }e'\) and there is no event \(e'' = (p,\textsf {u(}{\mathbf{x}}\textsf {)},j'')\) such that \(e<_{\tau }e''<_{\tau }e'\). I.e., under PSO, we put an edge in \(\rightarrow ^{\textsf {ufpso}}_{\tau }\) to the fence from the last preceding update by the thread for each memory location, rather than as under TSO only from the single last preceding update by the thread to any memory location.
A chronological trace for PSO is illustrated in Fig. 14c. Notice that there is no program order edge from \((\textsf {upd} (p,\mathbf{x}),\textsf {u(}{\mathbf{x}}\textsf {)},1)\) to \((\textsf {upd} (p,\mathbf{y}),\textsf {u(}{\mathbf{y}}\textsf {)},1)\). Had there been one, the trace would be cyclic.
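The reformulated definition of \(\textsf {upd}{}_\textsf {st}\) for PSO can be illustrated with a short sketch. The encoding is an assumption made for illustration only: an event is a triple (thread, instruction, index), instructions are pairs such as `("st", "x")` or `("u", "x")`, the auxiliary thread upd (\(p \),x) is the tuple `("upd", p, "x")`, and \(e^\infty \) is represented by `None`.

```python
from typing import List, Optional, Tuple

# Assumed encoding: (thread, (kind, location), index within the thread).
Event = Tuple[object, Tuple[str, str], int]


def upd_st_pso(tau: List[Event], e: Event) -> Optional[Event]:
    """Return the update event corresponding to store e under PSO,
    or None (standing for e-infinity) if that update is not in tau."""
    p, (kind, x), j = e
    if kind != "st":
        raise ValueError("upd_st is only defined for store events")
    # k counts the stores to x by p up to and including e.
    k = sum(
        1 for (p2, instr, j2) in tau
        if p2 == p and instr == ("st", x) and j2 <= j
    )
    update = (("upd", p, x), ("u", x), k)
    return update if update in tau else None


# Example execution: p stores to x, y, x; the update to y reaches memory
# before the update to x, and the second store to x is still buffered.
tau_example: List[Event] = [
    ("p", ("st", "x"), 1),
    ("p", ("st", "y"), 2),
    ("p", ("st", "x"), 3),
    (("upd", "p", "y"), ("u", "y"), 1),
    (("upd", "p", "x"), ("u", "x"), 1),
]
```

Because k is counted per memory location, the stores to x and y are matched against the events of different auxiliary threads, so no ordering between their updates is implied.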
6 Implementation
To show the effectiveness of our techniques we have implemented a stateless model checker for C programs. The tool, called Nidhugg, is available as open source at https://github.com/nidhugg/nidhugg. Major design decisions have been that Nidhugg: (i) should not be bound to a specific hardware architecture and (ii) should use an existing, mature implementation of C semantics, not implement its own. Our choice was to use the LLVM compiler infrastructure [26] and work at the level of its intermediate representation (IR). LLVM IR is low-level and allows us to analyze assembly-like but target-independent code, which is produced after employing all optimizations and transformations that the LLVM compiler performs up to this stage.
Nidhugg detects assertion violations and robustness violations that occur under the selected memory model. We implement the Source-DPOR algorithm from Sect. 5 in Abdulla et al. [1], adapted to relaxed memory in the manner described in this paper. Before applying Source-DPOR, each spin loop is replaced by an equivalent single load and assume statement. This substantially improves the performance of Source-DPOR, since a waiting spin loop may generate a huge number of unproductive loads, all returning the same wrong value; all of these loads will cause races, which will cause the number of explored traces to explode. Exploration of program executions is performed by interpretation of LLVM IR, based on the interpreter lli which is distributed with LLVM. We support concurrency through the pthreads library. This is done by hooking calls to pthread functions, and executing changes to the execution stacks (adding new threads, joining, etc.) as appropriate within the interpreter.
7 Experimental results
We have applied our implementation to several intensely racy benchmarks, all implemented in C/pthreads. They include classical benchmarks, such as Dekker’s, Lamport’s (fast) and Peterson’s mutual exclusion algorithms. Other programs, such as indexer.c, are designed to showcase races that are hard to identify statically. Yet others (stack_safe.c) use pthread mutexes to entirely avoid races. Lamport’s algorithm and stack_safe.c originate from the TACAS Competition on Software Verification (SV-COMP). Some benchmarks originate from industrial code: apr_1.c, apr_2.c, pgsql.c and parker.c.
We show the results of our tool Nidhugg in Table 1. For comparison we also include the results of two other analysis tools, CBMC [6] and goto-instrument [5], which also target C programs under relaxed memory. The techniques of goto-instrument and CBMC are described in more detail in Sect. 8.
Since both SMC and BMC require that all runs of the analyzed program terminate within some finite bound, we apply loop bounding when analyzing the benchmarks. The bound is indicated in the LB column of Table 1. Furthermore, all benchmarks are data-deterministic, since this is a requirement for SMC, as mentioned earlier.
Table 1 Analysis times (in seconds) for our implementation Nidhugg, as well as CBMC and goto-instrument, under the SC, TSO and PSO memory models

| Benchmark | Fence | LB | CBMC SC | CBMC TSO | CBMC PSO | goto-instrument SC | goto-instrument TSO | goto-instrument PSO | Nidhugg SC | Nidhugg TSO | Nidhugg PSO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| apr_1.c | – | 5 | t/o | t/o | t/o | t/o | ! | ! | 5.88 | 6.06 | 16.98 |
| apr_2.c | – | 5 | t/o | t/o | t/o | ! | ! | ! | 2.60 | 2.20 | 5.39 |
| dcl_singleton.c | – | 7 | 5.95 | 31.47 | 18.01* | 5.33 | 5.36 | 0.18* | 0.08 | 0.08 | 0.08* |
| dcl_singleton.c | pso | 7 | 5.88 | 30.98 | 29.45 | 5.20 | 5.18 | 5.17 | 0.08 | 0.08 | 0.08 |
| dekker.c | – | 10 | 2.42 | 3.17* | 2.84* | 1.68 | 4.00* | 220.11* | 0.10 | 0.11* | 0.09* |
| dekker.c | tso | 10 | 2.39 | 5.65 | 3.51* | 1.62 | 297.62 | t/o | 0.11 | 0.12 | 0.08* |
| dekker.c | pso | 10 | 2.55 | 5.31 | 4.83 | 1.72 | 428.86 | t/o | 0.11 | 0.12 | 0.12 |
| fib_false.c | – | – | 1.63* | 3.38* | 3.00* | 1.60* | 1.58* | 1.56* | 2.39* | 5.57* | 6.20* |
| fib_false_join.c | – | – | 0.98* | 1.10* | 1.91* | 1.31* | 0.88* | 0.80* | 0.32* | 0.62* | 0.71* |
| fib_true.c | – | – | 6.28 | 9.39 | 7.72 | 6.32 | 7.63 | 7.62 | 25.83 | 75.06 | 86.32 |
| fib_true_join.c | – | – | 6.61 | 8.37 | 10.81 | 7.09 | 5.94 | 5.92 | 1.20 | 2.88 | 3.19 |
| indexer.c | – | 5 | 193.01 | 210.42 | 214.03 | 191.88 | 70.42 | 69.38 | 0.10 | 0.09 | 0.09 |
| lamport.c | – | 8 | 7.78 | 11.63* | 10.53* | 6.89 | t/o | t/o | 0.08 | 0.08* | 0.08* |
| lamport.c | tso | 8 | 7.60 | 26.31 | 15.85* | 6.80 | 513.67 | t/o | 0.09 | 0.08 | 0.07* |
| lamport.c | pso | 8 | 7.72 | 30.92 | 27.51 | 7.43 | t/o | t/o | 0.08 | 0.08 | 0.08 |
| parker.c | – | 10 | 12.34 | 91.99* | 86.10* | 11.63 | | | 1.50 | 0.09* | 0.08* |
| parker.c | pso | 10 | 12.72 | 141.24 | 166.75 | 11.76 | 10.66 | 10.64 | 1.50 | 1.92 | 2.94 |
| peterson.c | – | – | 0.35 | 0.38* | 0.35* | 0.18 | 0.20* | 0.21* | 0.07 | 0.07* | 0.07* |
| peterson.c | tso | – | 0.35 | 0.39 | 0.35* | 0.19 | 0.18 | | 0.07 | 0.07 | 0.07* |
| peterson.c | pso | – | 0.35 | 0.41 | 0.40 | 0.18 | 0.18 | 0.19 | 0.07 | 0.07 | 0.08 |
| pgsql.c | – | 8 | 19.80 | 60.66 | 4.63* | 21.03 | 46.57 | 296.77* | 0.08 | 0.07 | 0.08* |
| pgsql.c | pso | 8 | 23.93 | 71.15 | 121.51 | 19.04 | t/o | t/o | 0.07 | 0.07 | 0.08 |
| pgsql_bnd.c | pso | (4) | 3.57 | 9.55 | 12.68 | 3.59 | t/o | t/o | 89.44 | 106.04 | 112.60 |
| stack_safe.c | – | – | 44.53 | 516.01 | 496.36 | 45.11 | 42.39 | 42.50 | 0.34 | 0.36 | 0.43 |
| stack_unsafe.c | – | – | 1.40* | 1.87* | 2.08* | 1.00* | 0.81* | 0.79* | 0.08* | 0.08* | 0.09* |
| szymanski.c | – | – | 0.40 | 0.44* | 0.43* | 0.23 | 0.89* | 1.16* | 0.07 | 0.13* | 0.07* |
| szymanski.c | tso | – | 0.40 | 0.50 | 0.43* | 0.23 | 0.23 | | 0.08 | 0.08 | 0.07* |
| szymanski.c | pso | – | 0.39 | 0.50 | 0.49 | 0.23 | 0.24 | 0.24 | 0.08 | 0.08 | 0.08 |
All experiments were run on a machine equipped with a 3 GHz Intel i7 processor and 6 GB RAM running 64-bit Linux. We used version 4.9 of goto-instrument and CBMC. The benchmarks have been tweaked to work for all tools, in communication with the developers of CBMC and goto-instrument. All benchmarks are available at https://github.com/nidhugg/benchmarks_tacas2015.
Table 1 shows that our technique performs well compared to the other tools for most of the examples. We will briefly highlight a few interesting results.
We see that in most cases Nidhugg pays a very modest performance price when going from sequential consistency to TSO and PSO. The explanation is that the number of executions explored by our stateless model checker is close to the number of Shasha–Snir traces, which increases very modestly when going from sequential consistency to TSO and PSO for typical benchmarks. Consider for example the benchmark stack_safe.c, which is robust, and therefore has equally many Shasha–Snir traces (and hence also chronological traces) under all three memory models. Our technique is able to benefit from this, and has almost the same run time under TSO and PSO as under SC.
The effect of the optimization to replace each spin loop by a load and assume statement can be seen in the pgsql.c benchmark. For comparison, we also include the benchmark pgsql_bnd.c, where the spin loop has been modified such that Nidhugg fails to automatically replace it by an assume statement.
The only other benchmark where Nidhugg is not faster is fib_true.c. The benchmark has two threads that perform the actual work, and one separate thread that checks the correctness of the computed value, causing many races, as in the case of spin loops. We show with the benchmark fib_true_join.c that in this case, the problem can be alleviated by forcing the threads to join before checking the result.
Most benchmarks in Table 1 are small program cores, ranging from 36 to 118 lines of C code, exhibiting complicated synchronization patterns. To show that our technique is also applicable to real life code, we include the benchmarks apr_1.c and apr_2.c. They each contain approximately 8000 lines of code taken from the Apache Portable Runtime library, and exercise the library primitives for thread management, locking, and memory pools. Nidhugg is able to analyze the code within a few seconds. We notice that despite the benchmarks being robust, the analysis under PSO suffers a slowdown of about three times compared to TSO. This is because the benchmarks access a large number of different memory locations. Since PSO semantics require one store buffer per memory location, this affects analysis under PSO more than under SC and TSO.
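The PSO slowdown mentioned for the Apache Portable Runtime benchmarks can be recomputed directly from the Nidhugg columns of Table 1. The snippet below is only a worked arithmetic check; the dictionary and function names are hypothetical, and the timings are copied from the table.

```python
# Nidhugg analysis times in seconds, copied from Table 1.
NIDHUGG_TIMES = {
    "apr_1.c": {"SC": 5.88, "TSO": 6.06, "PSO": 16.98},
    "apr_2.c": {"SC": 2.60, "TSO": 2.20, "PSO": 5.39},
}


def pso_slowdown(benchmark: str) -> float:
    """Ratio of the PSO analysis time to the TSO analysis time."""
    times = NIDHUGG_TIMES[benchmark]
    return times["PSO"] / times["TSO"]


if __name__ == "__main__":
    for name in NIDHUGG_TIMES:
        print(f"{name}: PSO/TSO = {pso_slowdown(name):.2f}")
```

Both ratios fall between two and three, which is the range the text summarizes as a slowdown of about three times.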
8 Related work
To the best of our knowledge, our work, together with the work by Zhang et al. [41], is the first to apply stateless model checking techniques to the setting of relaxed memory models; see e.g. [1] for a recent survey of related work on stateless model checking and dynamic partial order reduction techniques. The work by Zhang et al. [41] was developed independently and concurrently with the work presented in this paper, and shares many similarities with it.
There have been many previous works dedicated to the verification and checking of programs running under RMM (e.g., [3, 7, 8, 9, 10, 11, 21, 23, 25, 40]). Some of them propose precise analyses for checking safety properties or robustness of finite-state programs under TSO (e.g., [3, 8]). Others describe monitoring and testing techniques for programs under RMM (e.g., [10, 11, 25]). There are also a number of efforts to design bounded model checking techniques for programs under RMM (e.g., [9, 40]) which encode the verification problem in SAT.
The two closest works to ours are those presented in [5, 6]. The first of them [6] develops a bounded model checking technique that can be applied to different memory models (e.g., TSO, PSO, and Power). That technique makes use of the fact that the trace of a program under RMM can be viewed as a partially ordered set. This results in a bounded model checking technique aware of the underlying memory model when constructing the SMT/SAT formula. The second line of work reduces the verification problem of a program under RMM to verification under SC of a program constructed by a code transformation [5]. This technique tries to encode the effect of the RMM semantics by augmenting the input program with buffers and queues. This work introduces also the notion of Xtop objects. Although an Xtop object is a valid acyclic representation of Shasha–Snir traces, it will in many cases distinguish executions that are semantically equivalent according to the Shasha–Snir traces. This is never the case for chronological traces.
There has also been some recent work on SAT-directed stateless model checking [20], including consideration of RMM [14]. The main idea is to encode some key parts of the concurrent program into a SAT formula and hand it over to a general-purpose SMT solver to produce additional interleavings. For the most relevant tool, SATCheck [14], the authors claim that this approach scales better with the length of program execution, basing their evaluation on increasing the length of the traces by increasing the number of iterations of a small program core. We were not able to evaluate SATCheck on our own benchmark set, as the tool is currently at a prototype level and requires preprocessing by an expert user in order to handle arbitrary programs. Nevertheless, experimentation on one of our benchmark programs (dekker.c) confirms that the performance of Nidhugg and SATCheck is similar for small programs.
9 Concluding remarks
We have presented a technique for efficient stateless model checking which is aware of the underlying relaxed memory model. To this end, we have introduced chronological traces which are novel representations of executions under the TSO and PSO memory models, and induce a happens-before relation that is a partial order and can be used as a basis for DPOR. Furthermore, we have established a strict one-to-one correspondence between chronological and Shasha–Snir traces. Nidhugg, our publicly available tool, detects bugs in LLVM assembly code produced for C/pthreads programs and can be instantiated to the SC, TSO, and PSO memory models.
We plan to extend Nidhugg to more memory models such as Power, ARM, and the C/C++ memory model. This will require adapting the definition of chronological traces to those models in order to guarantee the one-to-one correspondence with Shasha–Snir traces.
References
 1. Abdulla, P., Aronis, S., Jonsson, B., Sagonas, K.: Optimal dynamic partial order reduction. In: POPL, pp. 373–384. ACM (2014)
 2. Abdulla, P.A., Aronis, S., Atig, M.F., Jonsson, B., Leonardsson, C., Sagonas, K.: Stateless model checking for TSO and PSO. In: Tools and algorithms for the construction and analysis of systems, pp. 353–367. Springer, Heidelberg (2015)
 3. Abdulla, P.A., Atig, M.F., Chen, Y.F., Leonardsson, C., Rezine, A.: Counter-example guided fence insertion under TSO. In: TACAS, vol. 7214 of LNCS, pp. 204–219. Springer (2012)
 4. Adve, S.V., Gharachorloo, K.: Shared memory consistency models: a tutorial. Computer 29(12), 66–76 (1996)
 5. Alglave, J., Kroening, D., Nimal, V., Tautschnig, M.: Software verification for weak memory via program transformation. In: ESOP, vol. 7792 of LNCS, pp. 512–532. Springer (2013)
 6. Alglave, J., Kroening, D., Tautschnig, M.: Partial orders for efficient bounded model checking of concurrent software. In: CAV, vol. 8044 of LNCS, pp. 141–157. Springer (2013)
 7. Alglave, J., Maranget, L.: Stability in weak memory models. In: CAV, vol. 6806 of LNCS, pp. 50–66. Springer (2011)
 8. Bouajjani, A., Derevenetc, E., Meyer, R.: Checking and enforcing robustness against TSO. In: ESOP, vol. 7792 of LNCS, pp. 533–553. Springer (2013)
 9. Burckhardt, S., Alur, R., Martin, M.M.K.: CheckFence: checking consistency of concurrent data types on relaxed memory models. In: PLDI, pp. 12–21. ACM (2007)
 10. Burckhardt, S., Musuvathi, M.: Effective program verification for relaxed memory models. In: CAV, vol. 5123 of LNCS, pp. 107–120. Springer (2008)
 11. Burnim, J., Sen, K., Stergiou, C.: Sound and complete monitoring of sequential consistency for relaxed memory models. In: TACAS, vol. 6605 of LNCS, pp. 11–25. Springer (2011)
 12. Christakis, M., Gotovos, A., Sagonas, K.: Systematic testing for detecting concurrency errors in Erlang programs. In: ICST, pp. 154–163. IEEE (2013)
 13. Clarke, E.M., Grumberg, O., Minea, M., Peled, D.: State space reduction using partial order techniques. STTT 2(3), 279–287 (1999)
 14. Demsky, B., Lam, P.: SATCheck: SAT-directed stateless model checking for SC and TSO. In: OOPSLA (2015)
 15. Dijkstra, E.W.: Cooperating sequential processes. Springer, Heidelberg (2002)
 16. Flanagan, C., Godefroid, P.: Dynamic partial-order reduction for model checking software. In: POPL, pp. 110–121. ACM (2005)
 17. Godefroid, P.: Partial-Order Methods for the Verification of Concurrent Systems - An Approach to the State-Explosion Problem, vol. 1032 of LNCS. Springer (1996)
 18. Godefroid, P.: Model checking for programming languages using VeriSoft. In: POPL, pp. 174–186. ACM Press (1997)
 19. Godefroid, P.: Software model checking: the VeriSoft approach. Formal Methods Syst. Des. 26(2), 77–101 (2005)
 20. Huang, J.: Stateless model checking concurrent programs with maximal causality reduction. In: PLDI, pp. 165–174. ACM, New York, NY, USA (2015)
 21. Krishnamurthy, A., Yelick, K.A.: Analyses and optimizations for shared address space programs. J. Parallel Distrib. Comput. 38(2), 130–144 (1996)
 22. Lauterburg, S., Karmani, R., Marinov, D., Agha, G.: Evaluating ordering heuristics for dynamic partial-order reduction techniques. In: FASE, vol. 6013 of LNCS, pp. 308–322 (2010)
 23. Lee, J., Padua, D.A.: Hiding relaxed memory consistency with a compiler. IEEE Trans. Comput. 50(8), 824–833 (2001)
 24. Lei, Y., Carver, R.: Reachability testing of concurrent programs. IEEE Trans. Softw. Eng. 32(6), 382–403 (2006)
 25. Liu, F., Nedev, N., Prisadnikov, N., Vechev, M.T., Yahav, E.: Dynamic synthesis for relaxed memory models. In: PLDI, pp. 429–440. ACM (2012)
 26. The LLVM compiler infrastructure. http://llvm.org
 27. Mazurkiewicz, A.: Trace theory. In: Advances in Petri Nets (1986)
 28. Musuvathi, M., Qadeer, S., Ball, T., Basler, G., Nainar, P., Neamtiu, I.: Finding and reproducing heisenbugs in concurrent programs. In: OSDI, pp. 267–280. USENIX (2008)
 29. Park, S., Dill, D.L.: An executable specification, analyzer and verifier for RMO (relaxed memory order). In: SPAA, pp. 34–41. ACM (1995)
 30. Peled, D.: All from one, one for all, on model-checking using representatives. In: CAV, vol. 697 of LNCS, pp. 409–423. Springer (1993)
 31. Peterson, G., Stickel, M.: Myths about the mutual exclusion problem. Inf. Process. Lett. 12(3), 115–116 (1981)
 32. Saarikivi, O., Kähkönen, K., Heljanko, K.: Improving dynamic partial order reductions for concolic testing. In: ACSD. IEEE (2012)
 33. Sen, K., Agha, G.: A race-detection and flipping algorithm for automated testing of multi-threaded programs. In: Haifa Verification Conference, vol. 4383 of LNCS, pp. 166–182 (2007)
 34. Sewell, P., Sarkar, S., Owens, S., Nardelli, F.Z., Myreen, M.O.: x86-TSO: a rigorous and usable programmer’s model for x86 multiprocessors. Comm. ACM 53(7), 89–97 (2010)
 35. Shasha, D., Snir, M.: Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst. 10(2), 282–312 (1988)
 36. SPARC International, Inc.: The SPARC Architecture Manual, Version 9 (1994)
 37. Tasharofi, S. et al.: TransDPOR: A novel dynamic partial-order reduction technique for testing actor programs. In: FMOODS/FORTE, vol. 7273 of LNCS, pp. 219–234 (2012)
 38. Valmari, A.: Stubborn sets for reduced state space generation. In: Advances in Petri Nets, vol. 483 of LNCS, pp. 491–515. Springer (1989)
 39. Wang, C., Said, M., Gupta, A.: Coverage guided systematic concurrency testing. In: ICSE, pp. 221–230. ACM (2011)
 40. Yang, Y., Gopalakrishnan, G., Lindstrom, G., Slind, K.: Nemos: A framework for axiomatic and executable specifications of memory consistency models. In: IPDPS. IEEE (2004)
 41. Zhang, N., Kusano, M., Wang, C.: Dynamic partial order reduction for relaxed memory models. In: Proceedings of the 36th ACM SIGPLAN conference on programming language design and implementation, pp. 250–259. ACM (2015)