Modular Relaxed Dependencies in Weak Memory Concurrency

We present a denotational semantics for weak memory concurrency that avoids thin-air reads, provides data-race free programs with sequentially consistent semantics (DRF-SC), and supports a compositional refinement relation for validating optimisations. Our semantics identifies false program dependencies that might be removed by compiler optimisation, and leaves in place just the dependencies necessary to rule out thin-air reads. We show that our dependency calculation can be used to rule out thin-air reads in any axiomatic concurrency model, in particular C++. We present a tool that automatically evaluates litmus tests, show that we can augment C++ to fix the thin-air problem, and we prove that our augmentation is compatible with the previously used compilation mappings over key processor architectures. We argue that our dependency calculation offers a practical route to fixing the longstanding problem of thin-air reads in the C++ specification.


Introduction
It has been a longstanding problem to define the semantics of programming languages with shared memory concurrency in a way that does not allow unwanted behaviours, especially observing thin-air values [8,7], and that does not forbid compiler optimisations that are important in practice, as is the case with Java and Hotspot [30,29]. Recent attempts [16,11,25,15] have abandoned the style of axiomatic models, which is the de facto paradigm of industrial specification [8,2,6]. Axiomatic models comprise rules that allow or forbid individual program executions. While it is impossible to solve all of the problems in an axiomatic setting [7], abandoning it completely casts aside mature tools for automatic evaluation [3], automatic test generation [32], and model checking [23], as well as the hard-won refinements embodied in existing specifications like C++, where problems have been discovered and fixed [8,7,18]. Furthermore, the industrial appetite for fundamental change is limited. In this paper we offer a solution to the thin-air problem that integrates with existing axiomatic models.
The thin-air problem in C++ stems from a failure to account for dependencies [22]: false dependencies are those that optimisation might remove, and real dependencies must be left in place to forbid unwanted behaviour [7]. A single execution is not sufficient to discern real and false dependencies. A key insight from previous work [14,15] is that event structures [33,34] give us a simultaneous overview of all traces at once, allowing us to check whether a write is sure to happen in every branch of execution. Unfortunately, previous work does not integrate well with axiomatic models, nor lend itself to automatic evaluation.
To address this, we construct a denotational semantics in which the meaning of an entire program is constructed by combining the meanings of its subcomponents via a compositional function over the program text. This approach can be particularly amenable to automatic evaluation, reasoning and compiler certification [19,24], and fits with the prevailing axiomatic approach.
This paper uses this denotational approach to capturing program dependencies to explore the thin-air problem, resulting in a concrete proposal for fixing the thin-air problem in the ISO standard for C++.
Contributions. There are two parts to the paper. In the first, we develop a denotational model called "Modular Relaxed Dependencies model" (MRD) and build metatheory around it. The model uses a relatively simple account of synchronisation, but it demonstrates separation between the calculation of dependency and the enforcement of synchronisation. In the second, we evaluate the dependency calculation by combining it with the fully-featured axiomatic models RC11 [18] and IMM [26].
The denotational semantics has the following advantages:
1. It is the first thin-air solution to support fork/join (§2.2).
2. It satisfies the DRF-SC property for a compositional model (§5): programs without data races behave according to sequential consistency.
3. It comes with a refinement relation that validates program transformations, including the optimisation that makes Hotspot unsound for Java [30,29], and a list of others from the Java Causality Tests [27] (§7).
4. It is shown to be equivalent to a global semantics that first performs a dependency calculation and then applies an axiomatic model.
5. An example in Section 10 illustrates a case in which thin-air values are observable in the current state-of-the-art models but forbidden in ours.
We adopt the dependency calculation from the global semantics of point 4 as the basis of our C++ model, which we call MRD-C11. We establish the C++ DRF-SC property described in the standard [13] (§9.1) and we provide several desirable properties for a solution to the thin-air problem in C++:
5. We show that our dependency calculation is the first that can be applied to any axiomatic model, and in particular the RC11 and IMM models that cover C++ concurrency (§8).
6. Our augmented IMM model, which we call MRD+IMM, is provably implementable over x86, Power, ARMv8, ARMv7 and RISC-V, with the compiler mappings provided by the IMM [26] (§8.1).
7. These augmented models of C++ are the first that solve the thin-air problem to have a tool that can automatically evaluate litmus tests (§11).

Modular Relaxed Dependency by example
To simplify things for now, we will attach an Init program to the beginning of each example to initialise all global variables to zero. Doing this makes the semantics non-compositional, but it is a natural starting place and aligns well with previous work in the area. Later, after we have made all of our formal definitions, we will see why the Init program is not necessary. For now, consider a simple programming language where all values are booleans, registers (ranging over r) are thread-local, and variables (ranging over x, y) are global. Informally, an event structure for a program consists of a directed graph of events. Events represent the global variable reads and writes that occur on all possible paths that the program can take. This can be built up over the program as follows: each write generates a single event, while each read generates two, one for each possible value that could be read. These read events are put in conflict with each other to indicate that they cannot both happen in a single execution; this is indicated with a zig-zag red arrow between the two events. Additionally, the event structure tracks true dependencies via a further relation which we call semantic dependencies (dp). These are yellow arrows from read events to write events.
For example, consider the program $r_1 := x;\ y := r_1$ ($\mathrm{LB}_1$) that reads from a variable x and then writes the result to y. The interpretation of this program is an event structure, depicted as follows. Each event has a unique identifier (the number attached to the box). The straight black arrows represent program order, the curved yellow arrows indicate a causal dependency between the reads and writes, and the red zigzag represents a conflict between two events. If two events are in conflict, then their respective continuations are in conflict too.
If we interpret the program Init; LB 1 , as below, we get a program where the Init event sets the variables to zero.
In the above event structure, we highlight events {1, 2, 3} to identify an execution. The green dotted arrow indicates that event 2 reads its value from event 1; we call this relation reads-from (rf). This execution is complete, as all of its reads read from a write and it is closed w.r.t. conflict-free program order.
We interpret the program ($\mathrm{LB}_2$) similarly, leading to a symmetrical event structure where the write to x is dependent on the read from y. The interpretation of $\mathit{Init}; (\mathrm{LB}_1 \parallel \mathrm{LB}_2)$ gives the event structure where ($\mathrm{LB}_1$) and ($\mathrm{LB}_2$) are simply placed alongside one another.
The interpretation of parallel composition is the union of the event structures from $\mathrm{LB}_1$ and $\mathrm{LB}_2$ without any additional conflict edges. When parallel composing the semantics of two programs, we add all rf-edges that satisfy a coherence axiom. Here we present an axiom that provides desirable behaviour in this example (Section 4 provides our model's complete axioms).
$$(dp \cup rf) \text{ is acyclic}$$
The program $\mathit{Init}; (\mathrm{LB}_1 \parallel \mathrm{LB}_2)$ allows executions of the following three shapes.
Note that in this example, we are not allowed to read the value 1: reading a value that does not appear in the program is one sort of thin-air behaviour, as described by Batty et al. [7]. For example, the execution {1, 4, 5, 8, 9} does not satisfy the coherence axiom, as $4 \xrightarrow{dp} 5 \xrightarrow{rf} 8 \xrightarrow{dp} 9 \xrightarrow{rf} 4$ forms a cycle.
We now substitute ($\mathrm{LB}_2$) with a code snippet ($\mathrm{LB}_3$) in which the value written to the variable x is a constant. Its generated event structure is depicted as follows. In this program, for each branch, we can reach a write of value 1 to location x. Hence, this will happen no matter which branch is chosen: we say b and d are independent writes and we draw no dependency edges from their preceding reads.
Consider now the program ($\mathrm{LB}_3$) in parallel with $\mathrm{LB}_1$ introduced earlier in this section. As usual, we interpret the Init program in sequence with $(\mathrm{LB}_1 \parallel \mathrm{LB}_3)$ as follows. The resulting event structure is very similar to that of $(\mathrm{LB}_1 \parallel \mathrm{LB}_2)$, but the executions permitted in this event structure are different. The dependency edges calculated when adding the read are preserved, and now executions {1, 2, 3, a, b} and {1, a, b, 4, 5} are allowed. However, this event structure also contains the execution in which d is independent. There is no rf or dp edge between d and c that can create a cycle, hence this is a valid complete execution in which we can observe x = 1, y = 1. Note that the Init is irrelevant to the consistency of this execution.
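The coherence axiom used in this example can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function names `has_cycle` and `coherent` are ours, and for the second program we assume that a and c are the reads of y (values 0 and 1) and b and d the writes of 1 to x that follow them. It tests whether (dp ∪ rf) is acyclic for the two executions discussed above.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        # Depth-first search; meeting a GREY node again means a back edge.
        colour[node] = GREY
        for succ in graph[node]:
            if colour[succ] == GREY:
                return True
            if colour[succ] == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


def coherent(dp, rf):
    """The example axiom: (dp ∪ rf) must be acyclic."""
    return not has_cycle(set(dp) | set(rf))


# LB1 || LB2, execution {1, 4, 5, 8, 9}: 4 -dp-> 5 -rf-> 8 -dp-> 9 -rf-> 4.
lb2_ok = coherent(dp={(4, 5), (8, 9)}, rf={(5, 8), (9, 4)})

# LB1 || LB3, execution {1, 4, 5, c, d}: d is an independent write,
# so there is no dp edge from c to d (assumed event naming, see above).
lb3_ok = coherent(dp={(4, 5)}, rf={(5, "c"), ("d", 4)})
```

Under these assumptions the first execution is rejected and the second accepted, matching the discussion above.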
Modularity. It is worthwhile underlining the role that modularity plays here. In order to compute the behaviour of $(\mathrm{LB}_1 \parallel \mathrm{LB}_2)$ and $(\mathrm{LB}_1 \parallel \mathrm{LB}_3)$ we did not have to compute the behaviour of $\mathrm{LB}_1$ again. In fact, we computed the semantics of $\mathrm{LB}_1$, $\mathrm{LB}_2$ and $\mathrm{LB}_3$ in isolation and then we observed the behaviour in parallel composition.
Thin-air values. The program $(\mathrm{LB}_1 \parallel \mathrm{LB}_3)$ is a standard example in the weak memory literature called load buffering. In the program $(\mathrm{LB}_1 \parallel \mathrm{LB}_2)$, if event 5 or 9 were allowed in a complete execution, that would be an undesirable thin-air behaviour: there is no value 1 in the program text, nor does any operation in the program compute the value 1. The program $(\mathrm{LB}_1 \parallel \mathrm{LB}_3)$ is similar, but now contains a write of value 1 in the program text, so this is no longer a thin-air value. Note that the execution given for it is not sequentially consistent, but nonetheless a weak memory model needs to allow it so that a compiler can, for example, swap the order of the two commands in $\mathrm{LB}_3$, which are completely independent of each other from its perspective.

Event Structures
Event structures will form the semantic domain of our denotational semantics in Section 5. Our presentation follows the essential ideas of Winskel [33] and is further influenced by the treatment of shared memory by Jeffrey and Riely [15].

Background
A partial order $(E, \sqsubseteq)$ is a set E equipped with a reflexive, transitive and antisymmetric relation $\sqsubseteq$. A well-founded partial order is a partial order that has no infinite decreasing chains of the form $\cdots \sqsubseteq e_{i-1} \sqsubseteq e_i \sqsubseteq e_{i+1} \sqsubseteq \cdots$.
A prime event structure is a triple $(E, \sqsubseteq, \#)$. E is a set of events, $\sqsubseteq$ is a well-founded partial order on E and $\#$ is a conflict relation on E. $\#$ is binary, symmetric and irreflexive such that, for all $c, d, e \in E$, if $c \mathrel{\#} d \sqsubseteq e$ then $c \mathrel{\#} e$. We write Con(E) for the set of conflict-free subsets of E, i.e. those subsets $C \subseteq E$ for which there is no $c, d \in C$ such that $c \mathrel{\#} d$.
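As a small illustration of these definitions, the following Python sketch closes a conflict relation under the inheritance condition (if c # d and d precedes e in the order then c # e) and tests membership in Con(E). The function names and the encoding of relations as sets of pairs are assumptions made for this sketch only.

```python
from itertools import combinations


def inherit_conflict(order, conflict):
    """Close `conflict` under symmetry and under: c # d and d below e imply c # e.

    `order` is a set of pairs (d, e) meaning d is strictly below e; it need
    not be transitively closed, because the loop runs to a fixed point.
    """
    closed = set(conflict) | {(d, c) for c, d in conflict}
    changed = True
    while changed:
        changed = False
        for c, d in list(closed):
            for lower, upper in order:
                if lower == d and (c, upper) not in closed:
                    closed |= {(c, upper), (upper, c)}
                    changed = True
    return closed


def conflict_free(subset, conflict):
    """Membership test for Con(E), given a symmetric conflict relation."""
    return all((c, d) not in conflict for c, d in combinations(subset, 2))


# Four events: 1 below 2 below 3, and 1 below 4, with 2 # 4 given explicitly.
order = {(1, 2), (2, 3), (1, 4)}
conflict = inherit_conflict(order, {(2, 4)})
```

Here the conflict between 3 and 4 is not given explicitly but is inferred, so {1, 2, 3} is conflict-free while {1, 3, 4} is not.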

Notation.
We use E to range over (prime/labelled/memory) event structures, and also the event set contained within, when there is no ambiguity. We also use E for event structures.
A labelled event structure $(E, \sqsubseteq, \#, \lambda)$, over a set of labels Σ, is a prime event structure together with a function $\lambda : E \to \Sigma$ which assigns a label to an event. We make events explicit using the notation $\{e : \sigma\}$ for $\lambda(e) = \sigma$. We sometimes avoid using names and just write the label σ when there is no risk of confusion.
Consider the labelled event structure formed by the set {1, 2, 3, 4}, where the order relation is defined such that $1 \sqsubseteq 2 \sqsubseteq 3$ and $1 \sqsubseteq 4$, the conflict relation is defined such that $2 \mathrel{\#} 4$ and $3 \mathrel{\#} 4$, and the labelling function is defined such that λ(1) = (W x 0), λ(2) = (R x 0), λ(3) = (W y 1) and λ(4) = (R x 1). The event structure is visualised on the left (we elide conflict edges that can be inferred from order).
Given labelled event structures $E_1$ and $E_2$, define the product labelled event structure $E_1 \times E_2$ as $(E, \sqsubseteq, \#, \lambda)$, where E is $E_1 \cup E_2$, assuming $E_1$ and $E_2$ to be disjoint, and the order, conflict and labelling are the unions of the corresponding components of $E_1$ and $E_2$. The coproduct labelled event structure $E_1 + E_2$ is the same as the product, except that the conflict relation $\#$ additionally puts every event of $E_1$ in conflict with every event of $E_2$. We can use a similar construction for the coproduct of an infinite set of pairwise-disjoint labelled event structures, indexed by I: we take infinite unions on the underlying sets and relations, along with extra conflicts for every pair of indices.
Where the $E_i$ are not disjoint, we can make them so by renaming with fresh event identifiers. In particular, we will need the infinite coproduct $\sum_{i \in I} E$ with as many copies of E as the cardinality of the set I, and all the events between each copy in conflict. Each of these copies will be referred to as $E^i$.
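The product and coproduct can be sketched in a few lines of Python. This is an illustration under our reading of the definitions, namely that the product takes componentwise unions and the coproduct additionally places every pair of events from different components in conflict. The tuple encoding (events, order, conflict, labels) and the function names are assumptions of the sketch.

```python
def product(a, b):
    """Componentwise union of two disjoint labelled event structures."""
    events_a, order_a, conflict_a, labels_a = a
    events_b, order_b, conflict_b, labels_b = b
    assert not (events_a & events_b), "rename events apart first"
    return (events_a | events_b,
            order_a | order_b,
            conflict_a | conflict_b,
            {**labels_a, **labels_b})


def coproduct(a, b):
    """As the product, plus conflict between every event of a and every event of b."""
    events, order, conflict, labels = product(a, b)
    cross = {(x, y) for x in a[0] for y in b[0]}
    cross |= {(y, x) for x, y in cross}
    return (events, order, conflict | cross, labels)


# Two branches of a read of y, each followed by a write of 1 to x.
left = ({"a", "b"}, {("a", "b")}, set(), {"a": ("R", "y", 0), "b": ("W", "x", 1)})
right = ({"c", "d"}, {("c", "d")}, set(), {"c": ("R", "y", 1), "d": ("W", "x", 1)})

par = product(left, right)
alt = coproduct(left, right)
```

In `par` the two branches may occur together, whereas in `alt` no event of one branch can occur in the same execution as an event of the other.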

The fork-join event structure
Our language supports parallel composition nested under sequential composition, so we will need to model spawning threads and a subsequent wait for their termination. To support this, we define the fork-join composition of two labelled event structures, $E_1$ and $E_2$. First we define the leaves, $\downarrow(E)$, as the $\sqsubseteq$-maximal elements of E. Let I be the set of maximal conflict-free subsets of $\downarrow(E_1)$. Intuitively, each event set in I corresponds to the last events of one way of executing the concurrent threads in $E_1$. We then generate a fresh copy of $E_2$ for each of the executions. The set of events, E, is the set $E_1$ plus all the elements from the copies of $E_2$. The order, $\sqsubseteq$, is constructed by linking every event in the copy $E_2^i$ with all the events in the set i, plus the obvious order from $E_1$ and the order in the local copy $E_2^i$. Finally, the conflict relation is the union of the conflict in $E_1$ and in the copies of $E_2$.
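The index set I of the fork-join construction can be computed by brute force on small examples. The following Python sketch is an illustration only: the function names are ours, relations are encoded as sets of pairs, and the enumeration is exponential in the number of leaves.

```python
from itertools import combinations


def leaves(events, order):
    """Events that are maximal in the order: nothing lies strictly above them."""
    return {e for e in events
            if not any(lower == e and upper != e for lower, upper in order)}


def maximal_conflict_free_subsets(candidates, conflict):
    """All conflict-free subsets of `candidates` that cannot be extended."""
    items = sorted(candidates)
    free = [frozenset(s)
            for size in range(len(items) + 1)
            for s in combinations(items, size)
            if all((x, y) not in conflict and (y, x) not in conflict
                   for x, y in combinations(s, 2))]
    return {s for s in free if not any(s < t for t in free)}


# Thread A: event 0 followed by a read with two conflicting outcomes 1 and 2.
# Thread B: a single event 3.
events = {0, 1, 2, 3}
order = {(0, 1), (0, 2)}
conflict = {(1, 2), (2, 1)}

index_set = maximal_conflict_free_subsets(leaves(events, order), conflict)
```

In this example the index set has two elements, {1, 3} and {2, 3}, so the construction would generate two fresh copies of the second event structure.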

Coherent event structure
The signature of labels, Σ, comprises (W x v), (R x v), L and U, where (W x v) and (R x v) are the usual write and read operations and L, U are the lock and unlock operations respectively.
A coherent event structure is a tuple $(E, S, \vdash, \leq)$ where E is a labelled event structure. S is a set of partial executions, where each execution is a tuple comprising a maximal conflict-free set of events, together with an intra-thread reads-from relation $rf_i$, an extra-thread reads-from relation $rf_e$, a dependency relation dp, and a partial order on lock/unlock events lk. The justification relation, $\vdash$, is a relation between conflict-free sets and events. The preserved program order has two parts: $\leq_X$ is a restriction of the program order, $\sqsubseteq$, to events on the same variable, and $\leq_L$ is the restriction of program order to events related in program order with locks or unlocks. Finally, we define rf to be $rf_e \cup rf_i$ and $\leq$ to be $\leq_X \cup \leq_L$. For a partial execution, $X \in S$, we denote its components as $lk_X$, $rf_X$ and $dp_X$.
Justification, $\vdash$, collects dependency information in the program and is used to calculate $dp_X$. For a conflict-free set C and an event e, we say C justifies e or e depends on C whenever $C \vdash e$. We collect dependencies between events modularly in order to identify the so-called independent writes which will be introduced shortly.
For a given partial execution, X, we define the order $hb_X$ as the reflexive transitive closure of $(\sqsubseteq \cup\, lk_X)$. A coherent event structure contains a data race if there exists an execution X, with two events on the same variable x, at least one of which is a write, that are not ordered by $hb_X$. A coherent event structure is data-race-free if it does not contain any data race. A racy $rf_X$-edge is when two events w and r are racy and $w \xrightarrow{rf_e}_X r$. Note that $rf_i$ edges cannot ever be racy. We now define a coherent partial execution.
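The definition of a data race can be made concrete with a short Python sketch. It is an illustration under assumptions: labels are encoded as tuples such as ("W", "x", 1), ("L",) and ("U",), relations are sets of pairs, and the function names are ours. Happens-before is computed as the transitive closure of program order and lock order restricted to the execution; reflexivity is irrelevant here because only distinct events are compared.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


def data_races(execution, labels, order, lk):
    """Pairs of same-variable accesses, at least one a write, unordered by hb."""
    edges = {(a, b) for a, b in set(order) | set(lk)
             if a in execution and b in execution}
    hb = transitive_closure(edges)
    races = set()
    for a, b in combinations(sorted(execution), 2):
        label_a, label_b = labels[a], labels[b]
        if label_a[0] not in "RW" or label_b[0] not in "RW":
            continue  # locks and unlocks are not memory accesses
        same_variable = label_a[1] == label_b[1]
        has_write = "W" in (label_a[0], label_b[0])
        ordered = (a, b) in hb or (b, a) in hb
        if same_variable and has_write and not ordered:
            races.add((a, b))
    return races


# Thread 1: L; x := 1; U      Thread 2: L; r := x; U
labels = {1: ("L",), 2: ("W", "x", 1), 3: ("U",),
          4: ("L",), 5: ("R", "x", 1), 6: ("U",)}
order = {(1, 2), (2, 3), (4, 5), (5, 6)}
execution = {1, 2, 3, 4, 5, 6}

with_locks = data_races(execution, labels, order, lk={(3, 4)})
without_lock_order = data_races(execution, labels, order, lk=set())
```

With the unlock of the first thread ordered before the lock of the second, the write and the read are related by happens-before and no race is reported; dropping the lock order exposes the race between events 2 and 5.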

Definition 1 (Coherent Partial Execution).
A partial execution X is coherent if and only if:
A complete execution X is an execution where all read events r have a write w that they read from, i.e. $w \xrightarrow{rf}_X r$.

Weak memory model
Central to the model is the way it records program dependencies in $\vdash$ and dp. Justification, $\vdash$, records the structure of those dependencies in the program that may be influenced by further composition. As we shall see, composing programs may add or remove dependencies from justification: for example, composing a read may make later writes dependent, or the coproduct mechanism, introduced shortly, may remove them. In some parts of the program, e.g. inside locked regions, dependencies do not interact with the context. In this case, we freeze the justifications, using them to calculate dp. Following a freeze, the justification relation is redundant and can be forgotten: dp can be used to judge which executions are coherent.
Freezing. Here we define a function freeze which takes a justification $C \vdash (w : W\,x\,v)$ and gives the corresponding dependency relation, in which each read r in C is related to w by dp. We lift freeze to a function on an event structure, where S contains all the executions where, for each write $w_i \in X_1$, we choose a justification so that $C_1 \vdash_1 w_1, \ldots, C_n \vdash_1 w_n$ covers all writes in $X_1$. Furthermore, with dp obtained by freezing the chosen justifications, $X_1$ must be a coherent execution. We prove that for a coherent execution there always exists a choice of write justifications that freeze into dependencies to form a coherent execution.
We will illustrate freezing of a program whose event structure is as follows. The rules later on in this section will provide us with justifications $\{(6 : R\,t\,1)\} \vdash (9 : W\,y\,1)$ and $\{(2 : R\,x\,1)\} \vdash (9 : W\,y\,1)$ (but not the independent justification $\vdash (9 : W\,y\,1)$). So in this program there are two minimal justifications of (9 : W y 1). The result of freezing is to duplicate all partial executions for each choice of write justifications. In this case, we get an execution containing $2 \xrightarrow{dp} 9$ and another one containing $6 \xrightarrow{dp} 9$.
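The choice step of freezing can be sketched as a Cartesian product over the justifications of each write. The Python below is a simplified illustration, not the paper's definition: the function name and data encoding are ours, justification sets are assumed to contain only reads, and the coherence check on the resulting executions is omitted.

```python
from itertools import product


def freeze(justifications, writes):
    """Enumerate one dp relation per choice of justification for every write.

    `justifications` maps each write to a list of justifying sets of reads.
    Each chosen set C for a write w contributes the edges (r, w) for r in C.
    """
    ordered_writes = sorted(writes)
    options = [justifications[w] for w in ordered_writes]
    return [frozenset((r, w)
                      for w, chosen in zip(ordered_writes, choice)
                      for r in chosen)
            for choice in product(*options)]


# The example from the text: write 9 is justified by {6} and by {2}.
frozen = freeze({9: [frozenset({6}), frozenset({2})]}, writes={9})

# An independent write has the empty justification and yields no dp edge.
independent = freeze({7: [frozenset()]}, writes={7})
```

For the example above this produces two dependency relations, one containing the edge from 2 to 9 and one containing the edge from 6 to 9.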

Prepending single events
When prepending loads and stores, we model forwarding optimisations by updating the justification relation: e.g. when prepending a write, (w : W x 0), to an event structure where $\{(r : R\,x\,0)\} \vdash w'$, write forwarding satisfies the read of the justification, leaving an independently justified write, $\vdash w'$.
Forwarding is forbidden if there exists e in E such that $w \leq e \leq r$, as in the example on the left. In this example we do not forward 1 to 6. The rules of this section give us that $\{1, 3, 6\} \vdash 9$: we have preserved program order over the accesses of x, $1 \leq 3 \leq 6$, and we do not forward across the intervening read 3.

Read Semantics
We now define the semantics of read prepending. Preserved program order $\leq$ is built straightforwardly out of $\leq_1$, ordering locks, unlocks and same-location accesses, and S is defined as the set of all $(X \cup \{r\}, lk_X, rf_X, dp_X)$, where X is a partial execution of $S_1$. The justification relation $\vdash$ is the smallest relation generated from each $C \vdash_1 e$, with LF being the "Load Forwarded" set of reads, i.e. the set of reads consecutively following the matching prepended one. This allows for load forwarding optimisations, and coherence is satisfied by construction.

Write Semantics
The write semantics are then defined in the same style: $\leq$ is built as in the read rule and S contains all coherent executions built from some $X \in S_1$, with $w \xrightarrow{rf_i} r$ for any set of matching reads r in $E_1$ such that condition (1.2) of coherence is satisfied. Adding $rf_i$ edges leaves condition (1.1) satisfied.
The justification relation is the smallest upward-closed relation generated from each $C \vdash_1 e$, with SF being the Store Forwarding set of reads, i.e. the set of reads matching the write we are prepending that we are going to remove from the justification set for later events. When prepending a write to an event structure, we add it to justifications that contain a read to the same variable. Failing to do so would invalidate the DRF-SC property. We provide an example in Section 6.3, but we need to complete the definition of the semantics first; in particular, we need to explain how the writes are lifted. This is covered in the next section (Section 4.2).

Coproduct semantics
The coproduct mechanism is responsible for making writes independent of prior reads if they are sure to happen, regardless of the value read. It produces the independent writes that enabled relaxed behaviour in the example in Section 1.
In the definition of coproduct we use an upward-closure of justification to enable the lifting of more dependencies. Whenever $C \vdash e$ we define $\uparrow(C)$ as the upward-closed justification set, i.e. $D \vdash e$ if $C \vdash e$ and D is a conflict-free lock-free set with $C \subseteq D$, such that for all $e' \in D$, if $e''$ is an event such that $e'' \leq e'$ then $e'' \in D$.
Now we define the coproduct operation on event structures $E_1$ and $E_2$ whose top events are the conflicting reads $r_1$ and $r_2$ respectively. Whenever $\{r_1\} \cup C_1 \vdash_1 (w : W\,y\,v)$ and $\{r_2\} \cup C_2 \vdash_2 (w' : W\,y\,v)$, then if the following conditions hold, we have $D \vdash w$ and $D' \vdash w'$:
1. there exists a $D \in \uparrow(C_1)$ that is isomorphic to a $D' \in \uparrow(C_2)$, that is, there exists $f : D \to D'$ that is a λ-preserving and $\leq_X$-preserving bijection,
2. there is no event e in D such that $r_1 \leq_X e$.
The example of Section 1 illustrates the application of condition (1) of coproduct. Recall the event structures of ($\mathrm{LB}_1$) and ($\mathrm{LB}_3$) respectively.
In each case, the event structure is built as the coproduct of the conflicting events. In ($\mathrm{LB}_3$), prior to applying coproduct we have $\{a\} \vdash b$ and $\{c\} \vdash d$. The writes have the same label for both read values so, taking $C_1$ and $C_2$ to be empty, coproduct makes them independent, adding the independent writes $\vdash b$ and $\vdash d$.
As for condition (2), if there is an event in the justification set that is ordered in $\leq_X$ with the respective top read, then the top read cannot be erased from the justification. Doing so would break the $\leq_X$ link.
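The simplest instance of lifting, where the remaining justification sets are empty, can be sketched as follows. This Python fragment is a deliberately restricted illustration: it only handles writes justified by the top read alone, ignores upward closure and condition (2), and uses a function name and data encoding of our own. The event names 2 to 5 and a to d follow the running examples, with the labels of a to d assumed.

```python
def lift_independent(just1, just2, labels, r1, r2):
    """Return writes made independent by coproduct in the empty-context case.

    `just1` and `just2` map each write of a branch to its list of justifying
    sets; `r1` and `r2` are the conflicting top reads of the two branches.
    A pair of writes is lifted when both carry the same label and each is
    justified by its top read alone.
    """
    lifted = set()
    for w1, sets1 in just1.items():
        for w2, sets2 in just2.items():
            same_label = labels[w1] == labels[w2]
            if same_label and frozenset({r1}) in sets1 and frozenset({r2}) in sets2:
                lifted |= {w1, w2}
    return lifted


# Second program of the example: both branches write 1 to x.
labels3 = {"b": ("W", "x", 1), "d": ("W", "x", 1)}
lb3_lifted = lift_independent({"b": [frozenset({"a"})]},
                              {"d": [frozenset({"c"})]},
                              labels3, r1="a", r2="c")

# First program of the example: the written value depends on the value read.
labels1 = {3: ("W", "y", 0), 5: ("W", "y", 1)}
lb1_lifted = lift_independent({3: [frozenset({2})]},
                              {5: [frozenset({4})]},
                              labels1, r1=2, r2=4)
```

The first call lifts both writes, while the second lifts nothing because the two branches write different values.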
When having value sets that contain more than two values, we use $\sum_{v \in V}$ to denote a simultaneous coproduct (rather than the infinite sum). More precisely, if we coproduct the event structures $E_0, E_1, \cdots, E_n$ in a pairwise fashion as follows, $(\cdots(E_0 + E_1) + \cdots) + E_n$, we would get liftings that are undesirable. To see this, it suffices to consider the program, where the write to x of 1 is independent for a coproduct over values 1 and 2, but not when considering the event structure following (R x 3).

Lock semantics
When prepending a lock, we order the lock before following events in $\leq$ and we freeze the justifications into dependencies. By freezing, we prevent justifications from events after the lock from interacting with newly appended events. This disables optimisations across the lock, e.g. store and load forwarding. We define the semantics of locks as follows, where $\leq_X$ remains unchanged and $(E_1, \emptyset, S_1', \leq_1) = \mathit{freeze}(E_1, \vdash_1, S_1, \leq_1)$, where S contains all partial executions of the form $(X \cup \{l\}, (lk_X \cup lk), dp_X, rf_X)$, where $X \in S_1'$ and the lock order lk is such that for every lock or unlock event $l' \in X$, $l \xrightarrow{lk} l'$. Finally, $\leq_L$ is $\leq_{L_1}$ extended with the lock ordered before all events in $E_1$.
The semantics for the unlock is similar.

Parallel composition
We define the parallel semantics as follows. Note that this operation freezes the constituent denotations before combining them, erasing their respective justification relations. This choice prevents the optimisation of dependencies across forks and it makes thread inlining optimisations unsound, as they are in the Promising Semantics [16] and the Java memory model [21].
where S contains all coherent partial executions that combine an execution of each thread, and lk is a total order over the lock/unlock operations such that no lock/unlock operation is introduced between a lock and the next unlock on the same thread. Finally, we add all $(w : W\,x\,v) \xrightarrow{rf_e} (r : R\,x\,v)$ edges such that the execution satisfies condition (1.1) of coherence (Definition 1) and such that w belongs to S F 1 and r belongs to S F 2 or vice versa.

Join Semantics
We define the join composition as follows: where $\leq$ is built as in the read rule and S are all executions of the form where $X_1 \in S_1$ and $X_2 \in S_2$ with $X_1$ and $X_2$ conflict-free. Lock order lk orders all lock/unlock events of $X_1$ before all lock/unlock events of $X_2$, and $w \xrightarrow{rf_i} r$ whenever $w \in X_1$ and $r \in X_2$ such that the execution is still coherent.

Language and Semantics
We consider an imperative language that has sequential and parallel composition, and mutable shared memory.
We have standard boolean expressions, B, and expressions, M, represented by natural numbers, n, or registers, r. Finally we have the set of command statements, P, where skip is the command that performs no action, r := x reads from a global variable and stores the value in r, x := M computes the expression M and stores its value to the global variable x, $P_1; P_2$ is sequential composition, and $P_1 \parallel P_2$ is parallel composition. We have standard conditional statements, while loops, locks and unlocks. Moreover, a program P is lock-well-formed if on every thread, every lock is paired with a following unlock instruction and vice versa, and there is no lock or unlock operation between pairs. A register environment, $R \to V$, is a function from the set of local registers, R, to the set of values, V. A continuation is a function taking a register environment, $R \to V$, to an event structure, E. We write ∅ as a short-hand for λρ.∅, the continuation returning the empty event structure.
We interpret the syntax defined above into the semantic domain defined in Section 4. In Figure 1, we define $\llbracket \cdot \rrbracket$ as a function which takes a step-index n, a register environment ρ, and a continuation κ, and returns a coherent event structure.
The interpretation function $\llbracket \cdot \rrbracket$ is defined first by induction on the step-index and then by induction on the syntax of the program. When n = 1 the interpretation gives the empty event structure (undefined). Otherwise we proceed by induction on the structure of the program. skip is just the continuation applied to the environment. A read is interpreted as a set of conflicting read events, one for each value v, each attached to a continuation applied to the environment where the register is updated with v.
A write is interpreted as a write with a following continuation. We interpret sequencing by interpreting the second program and passing it on to the interpretation of the first as a continuation. Parallel composition is the interpretation of the two programs with empty continuations passed to the × operator. The conditional statement is interpreted as usual. For interpreting the while-loops we use the induction hypothesis on the step-index [9].
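The continuation-passing shape of the interpretation can be illustrated with a small Python sketch covering only skip, reads, writes and sequencing over boolean values. It is not the semantics of Figure 1: justification, executions, locks, parallel composition, loops and the step-index are all omitted, and the tuple encoding of programs, the helper names and the event numbering are assumptions of the sketch.

```python
from itertools import count

VALUES = (0, 1)              # the example language is boolean-valued
EMPTY = ({}, set(), set())   # (labels, order, conflict)


def prepend(event, label, structure):
    """Put `event` before every event of `structure`."""
    labels, order, conflict = structure
    return ({event: label, **labels},
            order | {(event, e) for e in labels},
            set(conflict))


def coproduct_all(branches):
    """Union of the branches, with events of different branches in conflict."""
    labels, order, conflict = {}, set(), set()
    for b_labels, b_order, b_conflict in branches:
        conflict |= {(x, y) for x in labels for y in b_labels}
        conflict |= {(y, x) for x in labels for y in b_labels}
        conflict |= b_conflict
        labels.update(b_labels)
        order |= b_order
    return labels, order, conflict


def interpret(prog, rho, kappa, fresh):
    """Interpret `prog` in register environment `rho` with continuation `kappa`."""
    kind = prog[0]
    if kind == "skip":
        return kappa(rho)
    if kind == "write":                      # ("write", x, M)
        _, x, m = prog
        value = rho[m] if isinstance(m, str) else m
        event = next(fresh)
        return prepend(event, ("W", x, value), kappa(rho))
    if kind == "read":                       # ("read", r, x)
        _, r, x = prog
        branches = []
        for v in VALUES:
            event = next(fresh)
            branches.append(prepend(event, ("R", x, v), kappa({**rho, r: v})))
        return coproduct_all(branches)
    if kind == "seq":                        # ("seq", P1, P2)
        _, p1, p2 = prog
        return interpret(p1, rho,
                         lambda rho2: interpret(p2, rho2, kappa, fresh), fresh)
    raise ValueError(f"unsupported command: {kind}")


# r1 := x; y := r1
lb1 = ("seq", ("read", "r1", "x"), ("write", "y", "r1"))
labels, order, conflict = interpret(lb1, {}, lambda rho: EMPTY, count(1))
```

Sequencing is handled exactly as described above: the second program is interpreted inside the continuation passed to the first, so the write sees the register value chosen by each branch of the read.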
When parallel composing two threads, we want to forbid any reordering with events sequenced before or after the composition (as thread inlining would do). To forbid this local reordering we surround this composition with two lock-unlock pairs.

Compositionality
We define the language of contexts inductively in the standard way.

Definition 3 (Context).
$C ::= [-] \mid P; C \mid C; P \mid \cdots$
In the base case, the context is a hole, denoted by $[-]$. The inductive cases follow the structure of the program syntax. In particular, a context can be a program P in sequence with a context, a context in sequence with a program P, and so on. For a context C we denote by $C[P]$ the inductively defined function on the context C that substitutes the program P in every hole.
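Hole substitution is a straightforward structural recursion. The following Python sketch illustrates it on a tuple encoding of programs and contexts that is our own assumption, with `HOLE` standing for the hole.

```python
HOLE = ("hole",)


def plug(context, program):
    """Substitute `program` for every hole in `context`."""
    if context == HOLE:
        return program
    if isinstance(context, tuple):
        # Recurse into sub-terms; leave names and constants untouched.
        return tuple(plug(part, program) if isinstance(part, tuple) else part
                     for part in context)
    return context


# A context that runs a write and then whatever is placed in the hole.
context = ("seq", ("write", "x", 1), HOLE)
filled = plug(context, ("read", "r1", "x"))
```

A context with several holes receives the same program in each of them, as in the definition above.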

Fig. 1: Semantic interpretation
The following lemma shows that the semantics preserve context application. This falls out from the fact that the semantic interpretation is compositional, that is, we define every constructor in terms of its subcomponents.

Lemma 1 (Compositionality).
For all programs $P_1$, $P_2$, if $\llbracket P_1 \rrbracket = \llbracket P_2 \rrbracket$ then for all contexts C, $\llbracket C[P_1] \rrbracket = \llbracket C[P_2] \rrbracket$.
The proof is a straightforward induction on the context C and it follows from the fact that the semantics is inductively defined on the program syntax. The attentive reader may note that to prove $\llbracket P_1 \rrbracket = \llbracket P_2 \rrbracket$ in the first place we have to assume n, ρ and κ and prove $\llbracket P_1 \rrbracket_n\,\rho\,\kappa = \llbracket P_2 \rrbracket_n\,\rho\,\kappa$. It is customary however in denotational semantics to have programs denoted by functions that are equal if they are equal at all inputs [31].

Data Race Freedom
Data race freedom ensures that we forbid optimisations which could lead to unexpected behaviour even in the absence of data races. We first define the closed semantics for a program P. For all n, the semantics of P, namely $\llbracket P \rrbracket$, is $\llbracket \mathit{Init}(P) \rrbracket_n\,(\lambda x.0)\,\emptyset$, where Init(P) is the program that takes the global variables in P and initialises them to 0. We now establish that race-free programs interpreted in the closed semantics have sequentially consistent behaviour.
DRF semantics. Rather than proving DRF-SC directly, we prove that race-free programs behave according to an intermediate semantics. This semantics differs from $\llbracket \cdot \rrbracket$ in only two ways: program order is used in the calculation of coherence instead of preserved program order, and no dependency edges are recorded (as these are subsumed by program order). More precisely, the semantics is calculated as in Figure 1 but we check that $(rf_e \cup lk \cup \sqsubseteq)$ is acyclic.
Note that race-free executions of the intermediate semantics satisfy the constraints of the model of Boehm and Adve [10], and the definition of race is the same between the two models. Boehm and Adve prove that in the absence of races, their model provides sequential consistency.
The DRF-SC theorem is stated as follows.
Theorem 1. For any program P, if P is data race free then every execution D in $\llbracket P \rrbracket$ is a sequentially consistent execution, i.e. D is in the intermediate semantics of P.

Tests and Examples
In this section, four examples demonstrate aspects of the semantics: the first recognises a false dependency, the second forbids unintended behaviour allowed by Jeffrey and Riely [15], the third motivates the choice to add forwarded writes to justification, and the last shows how we support an optimisation forbidden by Java but performed by the Hotspot compiler.

LB+ctrl-double
In the first example, from Batty et al. [7], the compiler collapses conditionals to transform P 1 to P 2 .
Coproduct ensures that the denotations of $P_1$ and $P_2$ are identical, with the event structure above, together with justifications $\vdash b$ and $\vdash d$. From compositionality (Lemma 1) and equality of the denotations, we have equal behaviour of $P_1$ and $P_2$ in any context, and the optimisation is allowed.

Jeffrey and Riely's TC7
The next test is Java TC7. The outcome where r 1 , r 2 and r 3 all have value 1 is forbidden by Jeffrey and Riely [15,Section 7], but allowed in the Java Causality Test Cases [27]. x := 1 (TC7) As noted by Jeffrey and Riely [15], the failure of this test "indicates a failure to validate the reordering of independent reads".
In the event structure of $T_1$ above, the justification relation is constructed according to Section 5. In particular, the rule for prepending reads (equation (
When parallel composing, we connect the rf-edges that respect coherence. Thus we obtain the execution $\{16 \xrightarrow{rf} 8 \xrightarrow{dp} 10 \xrightarrow{rf} 12 \xrightarrow{dp} 14 \xrightarrow{rf} 6\}$, which is coherent, allowing the outcome with $r_1$, $r_2$ and $r_3$ all 1 as desired.

Adding writes to justifications
In the definition of prepending writes (equation (3), condition (2)) we state that for any given justification, if there is an event in the justification set that is related via ≤ X with the write we are prepending, then that write must be in the justification set as well.
To see why we made this choice consider the following program, x := 1; and its associated event structure, We focus on the interpretation of the left-hand side thread. In the equation This execution is not sequentially consistent, but under SC, the program is race free. Without writes in justifications, the model would violate the DRF-SC property described in Section 5.2.

Java memory model, Hotspot.
Finally, we discuss redundant read-after-read elimination, an optimisation performed by the Hotspot compiler but forbidden by the Java memory model. It is the first optimisation in the following sequence from Ševčík and Aspinall [30, Figure 5], used to demonstrate that the Java memory model is too strict, and unsound with respect to the observable behaviour of Sun's Hotspot compiler. Consider the event structures of the unoptimised T3 and optimised T1.
The optimisation removes the apparently redundant pair of reads (4, 6), then reorders the now-independent write. This redundancy is represented in justification: when prepending the top read of y to the right-hand side of the event structure, the existing justification $6 \vdash 7$ is replaced by $3 \vdash 7$. When coproduct is applied, this matches with justification $1 \vdash 2$, leading to the independent writes 2 and 7. In a weak memory context however, a parallel thread could write a value to y between the two reads, thereby changing the value written to x. For this reason, we keep event 4 in the denotation and create the dependency edge $4 \xrightarrow{dp} 5$.
Despite exhibiting the same behaviour here, the denotations of T 3 and T 2 do not match. We establish that the optimisation is sound in any context in the next section.
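The effect of interference between the two reads can be sketched in a few lines of Python. The thread shape used here (read y; if the value is 1, read y again and write that value to x, otherwise write 1 to x) is an assumption made for illustration, as are the function names and the three-value domain. The sketch is not the event structure discussed above and does not use its event numbering.

```python
VALUES = (0, 1, 2)  # assumed value domain for y


def unoptimised_x(first_read: int, second_read: int) -> int:
    """Value written to x by the hypothetical thread with two reads of y."""
    r2 = first_read
    if r2 == 1:
        r3 = second_read  # second read of y
        return r3         # x := r3
    return 1              # x := 1


def optimised_x(first_read: int) -> int:
    """After read-after-read elimination, every path writes x := 1."""
    return 1


# No interference: the second read returns what the first one did.
quiet = {unoptimised_x(v, v) for v in VALUES}
# Interference: another thread may change y between the two reads.
noisy = {unoptimised_x(a, b) for a in VALUES for b in VALUES}
optimised = {optimised_x(v) for v in VALUES}

print(quiet)               # {1}
print(noisy)               # {0, 1, 2}
print(optimised <= noisy)  # True
```

Without interference the two versions write the same value, but with interference the unoptimised thread can write more values. The optimised behaviours are a subset of the unoptimised ones, so the transformation removes behaviour rather than adding it, even though the two threads are not equal.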

Refinement
We have shown in Section 5.1 that our semantics enjoys a compositionality property: if we can prove that two programs have the same semantics (w.r.t. set-theoretic equality) then they cannot be distinguished by any context. We also explained how equality is too strict, as it does not allow us to relate all programs that ought to be deemed semantically equivalent. Our Java Hotspot compiler example in Section 6 shows that the program T3 is in practice optimised to T2 and then to T1. However, it is clearly not true that $\llbracket T_1 \rrbracket_n\,\rho\,\kappa$ is a subset of $\llbracket T_2 \rrbracket_n\,\rho\,\kappa$.
In this section we present a coarser-grained relation, which we call refinement ($\sqsubseteq$). This relation permits the optimisations we want, remains sound w.r.t. the intuitive notion of observational equivalence, and is closed under context application in the same way as equality.
To show soundness we define observational refinement ($\sqsubseteq_{\mathrm{Obs}}$), which captures the intuitive notion of program equivalence: one program is a permissible optimisation of another if it does not increase the set of observable behaviours, defined here as changes to values of observed variables. The definition identifies related executions and compares the ordering of observable events, recognising that adding happens-before edges restricts behaviour. We then define a refinement relation and show this relation is a subset of observational refinement. This is formally stated in the following lemma:
Lemma 2 (Soundness of Refinement ($\sqsubseteq\ \subseteq\ \sqsubseteq_{\mathrm{Obs}}$)). For all P1 and P2, if $P_1 \sqsubseteq P_2$ then $P_1 \sqsubseteq_{\mathrm{Obs}} P_2$.
Note that the refinement relation is defined over a tweaked version of the semantics, $\llbracket\cdot\rrbracket^{T}$, a variant of $\llbracket\cdot\rrbracket$ in which the registers are explicit in the event structure.
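Finally we show that $\sqsubseteq$ is compositional:
Theorem 2 (Compositionality of Refinement ($\sqsubseteq$)). For all programs P1 and P2, and indexes n, if for all ρ, $\llbracket P_1 \rrbracket^{T}_{n}\,\rho\,\emptyset \sqsubseteq \llbracket P_2 \rrbracket^{T}_{n}\,\rho\,\emptyset$, then for all contexts C, ρ, κ and κ′ such that $\kappa \sqsubseteq \kappa'$ we have that $\llbracket C[P_1] \rrbracket^{T}_{n}\,\rho\,\kappa \sqsubseteq \llbracket C[P_2] \rrbracket^{T}_{n}\,\rho\,\kappa'$.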

Showing implementability via IMM
In this section we show that our calculation of relaxed dependencies can easily be reused to solve the thin-air problem in other state-of-the-art axiomatic models, drawing the advantages of these models over to ours. In particular, we augment the IMM and RC11 models of Podkopaev et al. [26]. We adopt their language, given below. It covers C++ atomics, fences, fetch-and-add and compare-and-swap operations but excludes locks. Note that locks are implementable using compare-and-swap operations. First we provide a model, written (for a program P) as $\llbracket P \rrbracket_{\mathrm{MRD+IMM}}$, that combines our relaxed dependencies with the axiomatic model of IMM, here written as $\llbracket P \rrbracket_{\mathrm{IMM}}$. We will make these definitions precise shortly. We then show that $\llbracket P \rrbracket_{\mathrm{MRD+IMM}}$ is weaker than $\llbracket P \rrbracket_{\mathrm{IMM}}$, making $\llbracket P \rrbracket_{\mathrm{MRD+IMM}}$ implementable over hardware architectures like x86-TSO, ARMv7, ARMv8 and Power. Secondly, we relax the RC11 axiomatic model by using our relaxed dependencies model MRD to create a new model $\llbracket P \rrbracket_{\mathrm{MRD\text{-}C11}}$, and show this model weaker than the RC11 model. We argue that the mathematical description of $\llbracket P \rrbracket_{\mathrm{MRD\text{-}C11}}$ is lightweight and close to the C++ standard; it would therefore require minimal work to augment the standard with the ideas presented in this paper.
To prove implementability over hardware architectures we define a pre-execution semantics, where the relaxed dependency relation dp is calculated along with the data and control dependencies from IMM. To combine our model with IMM, we redefine the ar relation (we refer the reader to the IMM paper [26] for the details on ar) such that it is parametrised by an arbitrary relation which we put in place of the relations (data ∪ ctrl). ar(data ∪ ctrl) equals the original axiom ar and ar(dp) is the same axiom where dp is put in place of data ∪ ctrl.
We define the executions in $\llbracket P \rrbracket_{\mathrm{MRD+IMM}}$ as the maximal conflict-free sets such that ar(dp) is acyclic, and executions in $\llbracket P \rrbracket_{\mathrm{IMM}}$ as the maximal conflict-free sets such that ar(data ∪ ctrl) is acyclic.
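The idea of parametrising the acyclicity axiom by a dependency relation can be sketched as follows. This is a deliberately simplified stand-in: the real ar relation of IMM is defined in [26] and is considerably richer, whereas here `ar` just unions the supplied dependencies with a fixed set `OTHER` of remaining edges. The event names, the choice of which syntactic dependency is false, and the assumption that dp is contained in data ∪ ctrl are all hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def acyclic(edges: Iterable[Edge]) -> bool:
    """True iff the directed graph given by the edges has no cycle."""
    preds: Dict[str, Set[str]] = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    try:
        # static_order raises CycleError when the graph is cyclic.
        tuple(TopologicalSorter(preds).static_order())
    except CycleError:
        return False
    return True


def ar(dep: Set[Edge], other: Set[Edge]) -> Set[Edge]:
    """Toy ar, parametrised by the dependency relation put in place of data ∪ ctrl."""
    return other | dep


def allowed(dep: Set[Edge], other: Set[Edge]) -> bool:
    return acyclic(ar(dep, other))


# Load-buffering shape: each thread reads one location, then writes the other.
OTHER = {("w_y", "r_y"), ("w_x", "r_x")}      # cross-thread reads-from edges
DATA_CTRL = {("r_x", "w_y"), ("r_y", "w_x")}  # syntactic dependencies
DP = {("r_x", "w_y")}                         # assume the second one is false

print(allowed(DATA_CTRL, OTHER))  # False
print(allowed(DP, OTHER))         # True
```

Because removing edges can only remove cycles, any execution accepted with the larger relation is also accepted with a smaller one. In this toy, that monotonicity is what makes the dp-based model the weaker of the two.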

Implementability
We can now state and prove that the MRD model is implementable over IMM, which gives us that MRD is implementable over x86-TSO, ARMv7, ARMv8, Power and RISC-V by combining our result with the implementability result of IMM.
Theorem 3 (MRD+IMM is weaker than IMM). For all programs P covered by the IMM model, $\llbracket P \rrbracket_{\mathrm{MRD+IMM}} \supseteq \llbracket P \rrbracket_{\mathrm{IMM}}$.

Modular Relaxed Dependencies in RC11: MRD-C11
We refer to the RC11 [18] model, as specified in Podkopaev et al. [26]. We call this model $\llbracket P \rrbracket_{\mathrm{RC11}}$. While it forbids thin-air executions, it is not weak enough: it forbids common compiler optimisations by imposing that the union of program order and rf is acyclic. We relax this condition by similarly replacing program order with our relaxed dependency relation dp, this time calculated on our preserved program order relation (≤). We call this model $\llbracket P \rrbracket_{\mathrm{MRD\text{-}C11}}$. Mathematically, this is done by imposing that (dp ∪ rf) is acyclic.
At this point, we prove the following lemma:
Lemma 3 (Implementability of MRD-C11). For all programs P, $\llbracket P \rrbracket_{\mathrm{MRD\text{-}C11}} \supseteq \llbracket P \rrbracket_{\mathrm{RC11}}$.
To show this it suffices to show that dp is always contained in program order. This is straightforward by induction on the structure of P, observing that the only place where dependencies go against program order is when hoisting a write in the coproduct case. However, in the same construction we always preserve the dependencies coming from the different branches of the structure which, by the inductive hypothesis, always agree with program order.
Theorem 4 (MRD-C11 is DRF-SC). For a program P whose atomic accesses are all SC-ordered, if there are no SC-consistent executions with a race over non-atomics, then the outcomes of P under MRD-C11 coincide with those under SC.
Sketch proof. In the absence of races and relaxed atomics, the no-thin-air guarantee of RC11 is made redundant by the guarantee of happens-before acyclicity shared by RC11 and MRD-C11. The result follows from this observation, Lemma 3 and Theorem 4 from Lahav et al. [18].

On the Promising Semantics and weakestmo
In this section we present examples that differentiate the Promising Semantics and weakestmo from our MRD and MRD-C11 models. First, we show that MRD correctly forbids the out-of-thin-air behaviour in the litmus test Coh-CYC from Chakraborty and Vafeiadis [11]. The test, given below, differentiates Promising and weakestmo: only the latter avoids the outcome r1 = 3, r2 = 2 and r3 = 1. MRD correctly forbids this outcome: it identifies a dependency on the left-hand thread from the read of 3 from x to the write y := 1, and on the right-hand thread from the read of 1 from y to the write x := 3. The outcome in question then has a cycle in dependency and reads-from, and is therefore forbidden.
Chakraborty and Vafeiadis ascribe the behaviour to "a violation of coherence or a circular dependency", and include specific machinery in weakestmo that checks for global coherence violations at each step of program execution. These global checks forbid the unwanted outcome.
The Promising Semantics, on the other hand, can make promises that are not sensitive to coherence order, and therefore allows the above outcome erroneously.
In Coh-CYC, enforcing coherence ordering at each step in weakestmo was enough to forbid the thin-air behaviour, but it is not adequate in all cases. The example below features an outcome that Promising and weakestmo allow, and that MRD-C11 and MRD forbid. It demonstrates that cycles in dependency can arise without violating coherence in weakestmo. The program is an adaptation of a Java test, where the unwanted outcome represents a violation of type safety [20]. Observing the thin-air behaviour where a = 1 in the adaptation above is the analogue of the unwanted outcome in the original test. If in the end a = 1, then the second branch of the conditional in the rightmost thread must execute. It contains a read of 1 from y, and a dependent write of x := 1. On the middle thread there is a read of 1 from x, and a dependent write of y := 1. These dependencies form the archetypal thin-air shape in the execution where a = 1. MRD correctly identifies these dependencies and the outcome is prohibited due to its cycle in reads-from and dependency.
The a = 1 outcome is allowed in the Promising Semantics: a promise can be validated against the write of x := 1 in the true branch of the right-hand thread, and later switched to a validation with x := r0 from the false branch, ignoring the dependency on the read of y.
In the previous example, Coh-CYC, a stepwise global coherence check caused weakestmo to forbid the unwanted behaviour allowed by Promising, but that machinery does not apply here. weakestmo allows the unwanted outcome, and we conjecture that this deficiency stems from the structure of the model. Dependencies are not represented as a relation at the level of the global axiomatic constraint, so one cannot check that they are consistent with the dynamic execution of memory, as represented by the other relations. Adopting a coherence check in the stepwise generation of the event structure mitigates this concern for Coh-CYC, but not for the test above.
In contrast, MRD does represent dependencies as a relation, allowing us to check consistency with the rf relation here. The axiom that requires acyclicity of (dp ∪ rf) forbids the unwanted outcome, as desired.
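The acyclicity check on (dp ∪ rf) can be made concrete with a small cycle finder. The sketch below encodes only the four events named in the discussion above (the read of 1 from y and dependent write x := 1 on the rightmost thread, and the read of 1 from x and dependent write y := 1 on the middle thread). The event labels and the function `find_cycle` are illustrative names, and the rest of the program is omitted.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str]


def find_cycle(edges: Iterable[Edge]) -> Optional[List[str]]:
    """Return one cycle as a list of events (first == last), or None."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge closes a cycle
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Dependencies identified within each thread.
dp = {("R y 1", "W x 1"), ("R x 1", "W y 1")}
# Reads-from edges of the execution where a = 1.
rf = {("W x 1", "R x 1"), ("W y 1", "R y 1")}

print(find_cycle(dp | rf))  # a cycle through all four events
print(find_cycle(rf))       # None
```

The reads-from edges alone are acyclic. Only when the dependency edges are added as a relation alongside them does the cycle appear.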

Evaluating MRD-C11 with the MRD-er tool
MRD-C11 is the first weak memory model to solve the thin-air problem for C++ atomics that has a tool for automatically evaluating litmus tests. Our tool, MRD-er, evaluates litmus tests under the base model, RC11 augmented with MRD, and IMM augmented with MRD. It has been used to check the result of every litmus test in this paper, together with many tests from the literature, including the Java Causality Test Cases [7,11,15,16,18,25,26,27].
When evaluating whether a particular execution is allowed for a given test, a model that solves the thin-air problem must take other executions of the program into account. For example, the semantics of Pichon-Pharabod et al., having explored one execution path, may ultimately backtrack [25]. Jeffrey and Riely phrase their semantics as a two player game where at each turn, the player explores all forward executions of the program [15]. At each operational step, the Promising Semantics [16] has to run forwards in a limited local way to validate that promised writes will be reached. The invisible events of Chakraborty et al. [11] are used to similar effect.
In MRD-C11, it is the calculation of justification that draws in information from other executions. This mechanism is localised, it avoids making choices about the execution that prune behaviours, and it does not require backtracking. MRD-C11 acts in a "bottom-up" fashion, and modularity ensures that justifications drawn from the continuation need not be recalculated. These properties have supported the development of MRD-er: automation of the model requires only a single pass across the program text to construct the denotation.

Discussion
Four recent papers have presented models that forbid thin-air values and permit previously challenging compiler optimisations. The key insight from these papers is that it is necessary to consider multiple program executions simultaneously. To do this, three of the four [15,25,11] use event structures, while the Promising Semantics [16] is a small-step operational semantics that explores future traces in order to take a step.
Although the Promising Semantics [16] is quite different from MRD, its mechanism for promising focuses on future writes, and MRD has parallels in its calculation of independent writes. Note also that both Promising's certification mechanism and MRD's lifting are thread-local.
The previous event-structure-based models are superficially similar to MRD, but all have a fundamentally different approach from ours: Pichon-Pharabod and Sewell [25] use event structures as the state of a rewriting system; Jeffrey and Riely [14,15] build whole-program event structures and then use a global mechanism to determine which executions are allowed; and Chakraborty et al. [11] transform an event structure using an operational semantics. In contrast, we follow a more traditional approach [33] where our event structures are used as the co-domain of a denotational semantics. Further, Jeffrey and Riely [14,15] and Pichon-Pharabod and Sewell [25] do not cover a significant subset of C++ relaxed concurrency primitives.
MRD does not suffer from known problems with existing models. As noted by Kang et al. [16], the Pichon-Pharabod and Sewell model produces behaviour incompatible with the ARM architecture. The Jeffrey and Riely model forbids the reordering of independent reads, as demonstrated by Java Causality Test 7 (see Section 6.2). The Promising semantics allows the cyclic coherence ordering of the problematic Coh-CYC example [11]. weakestmo allows the thin-air outcome in the Java-inspired test of Section 10. In all four cases MRD provides the correct behaviour.
MRD is also highly compatible with the existing C++ standard text. The dp relation generated by MRD can be used directly in the axiomatic model to forbid thin-air behaviour. We are working on standards text with the ISO C++ committee based on this work, and have a current working paper with them [5].
The notion in C++ that data-race free programs should not exhibit observable weak behaviours goes back to Adve and Hill [1], and formed the basis of the original proposal for C++ [10]. This was formalised by Batty et al. [8] and adopted into the ISO standard. Despite the pervasiveness of DRF-SC theorems for weak memory models, these have remained whole-program theorems that do not support breaking a program into separate DRF and racy components. Our DRF theorem for our denotational model demonstrates a limited form of modularity that merits further exploration.
Other denotational approaches to relaxed concurrency have not tackled the thin-air problem. Dodds et al. [12] build a denotational model based on an axiomatic model similar to C++. It forms the basis of a sound refinement relation and is used to validate data-structures and optimisations. Their context language is too restrictive to support a compositional semantics, and their compromise to disallow thin-air executions forbids important optimisations. Kavanagh and Brookes [17] provide a denotational account of TSO concurrency, but their model is based on pomsets and suffers from the same limitation as axiomatic models [7]: it cannot be made to recognise false dependencies.
Future Work. We envisage a generalised theorem that would, on augmentation with MRD, extend an axiomatic DRF-SC proof to a proof that applies to the augmented model.
The ISO have struggled to define memory_order::consume [13]. It is intended to provide ordering through dependencies that the compiler will not optimise away. The semantic dependency relation calculated by MRD identifies just these dependencies, and may support a better definition.
Finally, where we have used a global semantics to provide a full C++ model, it would be interesting to extend the denotational semantics to also cover all of C++, thereby allowing reasoning about C++ code in isolation from its context.

Conclusions
We have used the relatively recent insight that to avoid thin-air problems, a semantics should consider some information about what might happen in other program executions. We codify that into a modular notion of justification, leading to a semantic notion of independent writes, and finally of dependency (dp). We demonstrate the effectiveness of these concepts in three ways. One, we define a denotational semantics for a weak memory model, show it supports DRF-SC, and build a compositional refinement relation strong enough to verify difficult optimisations. Two, we show how to use dp with other axiomatic models, supporting the first optimal implementability proof for a thin-air solution via IMM, and showing how to repair the ISO C++ model. Three, we build a tool for executing litmus tests, allowing us to check a large number of examples.