# Explaining Relaxed Memory Models with Program Transformations

## Abstract

Weak memory models determine the behavior of concurrent programs. While they are often understood in terms of reorderings that the hardware or the compiler may perform, their formal definitions are typically given in a very different style—either axiomatic or operational. In this paper, we investigate to what extent weak behaviors of existing memory models can be fully explained in terms of reorderings and other program transformations. We prove that TSO is equivalent to a set of two local transformations over sequential consistency, but that non-multi-copy-atomic models (such as C11, Power and ARM) cannot be explained in terms of local transformations over sequential consistency. We then show that transformations over a basic non-multi-copy-atomic model account for the relaxed behaviors of (a large fragment of) Power, but that ARM’s relaxed behaviors cannot be explained in a similar way. Our positive results may be used to simplify correctness of compilation proofs from a high-level language to TSO or Power.

## 1 Introduction

In the simplest semantics of concurrent programs, the memory accesses of the different threads are interleaved, and every read returns the value of the most recent write to the same location (a model known as *sequential consistency*). In multiprocessor machines and/or with optimizing compilers, however, more behaviors are possible; they are formally described by what is known as a *weak memory model*.

Typical examples of such “weak” behaviors arise in the SB (store buffering) and LB (load buffering) programs below: assuming all variables are 0 initially, the weak behaviors in question are the ones in which *a* and *b* have the values mentioned in the program comments. In the SB program on the left, this behavior is allowed by all existing weak memory models, and can be easily explained in terms of reordering: the hardware may execute the independent store to *x* and load from *y* in reverse order. Similarly, the behavior in the LB program on the right, which is allowed by some models, can be explained by reordering the load from *x* and the subsequent store to *y*. This explanation remains the same whether the hardware itself performs out-of-order execution, or the compiler, as part of its optimization passes, performs these transformations and the hardware actually runs the reordered program.
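Since the SB and LB listings themselves did not survive formatting, the following Python sketch (our own encoding, not from the paper) reconstructs the SB program and enumerates all sequentially consistent interleavings, confirming that the weak outcome *a* = *b* = 0 is unreachable under SC but becomes reachable once the first thread's independent store and load are reordered:

```python
from itertools import permutations

# Thread 1: x := 1; a := y        Thread 2: y := 1; b := x
# ("W", loc, value) is a store; ("R", reg, loc) is a load.
def run(interleaving, threads):
    mem = {"x": 0, "y": 0}
    regs = {}
    pcs = [0] * len(threads)
    for t in interleaving:
        op = threads[t][pcs[t]]
        pcs[t] += 1
        if op[0] == "W":
            mem[op[1]] = op[2]
        else:
            regs[op[1]] = mem[op[2]]
    return regs

def outcomes(threads):
    """All final (a, b) values over every SC interleaving of the threads."""
    slots = [i for i, t in enumerate(threads) for _ in t]
    return {(r["a"], r["b"])
            for order in set(permutations(slots))
            for r in [run(order, threads)]}

sb = [[("W", "x", 1), ("R", "a", "y")],
      [("W", "y", 1), ("R", "b", "x")]]
# After write-read reordering in the first thread:
sb_reordered = [[("R", "a", "y"), ("W", "x", 1)],
                [("W", "y", 1), ("R", "b", "x")]]

assert (0, 0) not in outcomes(sb)        # SC forbids the weak outcome
assert (0, 0) in outcomes(sb_reordered)  # reordering explains it
```

The same harness applied to LB (with the load-store reordering) illustrates the second example analogously.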

Formal memory models, however, choose a somewhat more complex explanation. Specifically, axiomatic memory model definitions construct a graph of memory access events for each program execution and impose various constraints on which store each load can read from. Similarly, operational definitions introduce concepts such as buffers, where the stores reside for some time before being propagated to other processors.

In this paper, we try to reconcile the formal model definitions with the more intuitive explanations in terms of program transformations. We consider the mainstream implemented memory models of TSO [16], C11’s Release/Acquire fragment [7], Power [4], and ARM [12], and investigate whether their weak behaviors can be fully accounted for in terms of program transformations that are allowed in these models. In this endeavor, we have both positive and negative results to report on.

First, in Sect. 3, we show that the TSO memory model of the x86 and SPARC architectures can be precisely characterized in terms of two transformations over sequential consistency: write-read reordering and read-after-write elimination.

Second, in Sect. 4, we present examples showing that C11’s Release/Acquire memory model cannot be defined in terms of a set of transformations over sequential consistency. This, in fact, holds for any memory model that allows non-multi-copy-atomic behaviors (where two different threads may observe a store of a third thread at different times), such as the full C11, Power, ARM, and Itanium models. Here, besides local instruction reorderings and eliminations, we also consider the sequentialization transformation, which explains some non-multi-copy-atomic behaviors but fails to account for all of them.

Next, in Sect. 5, we consider the Power memory model of Alglave et al. [4]. We show that the weak behaviors of this model, restricted to its fragment without “control fences” (Power’s \({{\mathbf {\mathtt{{isync}}}}}\) instructions), can be fully explained in terms of local reorderings over a stronger model that does not allow cycles in the entire program order together with the reads-from relation. In Sect. 6, we show that this is not possible for the ARM model: it allows some weak behaviors that cannot be explained in terms of local transformations over such stronger model.

Finally, in Sect. 7, we outline a possible application of the positive results of this paper, namely to simplify correctness of compilation proofs from a high-level language to either TSO or Power.

The proofs of this paper have been formalized in Coq and are available at: http://plv.mpi-sws.org/trns/.

### 1.1 Related Work

Previous papers studied soundness of program transformations under different memory models (see, e.g., [15, 18]), while we are interested in the “completeness” direction, namely whether program transformations completely characterize a memory model.

Concerning TSO, it has been assumed that it can be defined in terms of the two transformations mentioned above (e.g., in [2, 9]), but to our knowledge a formal equivalence to the specification in [16] has not been established before. In the broader context of proposing a fixed memory model for Java, Demange et al. [10] prove a closely related result, relating a TSO-like machine and local transformations of executions. Nevertheless, one of the transformations of [10] does not correspond to a local program transformation (as it depends on the write that was read by each read). We also note that the proofs in [10] are based on an operational model, while we utilize an equivalent axiomatic presentation of TSO, which allows for simpler arguments.

Alglave et al. [3] provide a method for reducing verification under a weak memory model to a verification problem under sequential consistency. This approach relies on a global program transformation of a completely different nature from ours, which uses additional data structures to simulate the threads’ buffers.

Finally, assuming a sequentially consistent hardware, Ševčík [19] proves that a large class of compiler transformations respect the DRF guarantee (no weak behaviors for programs with no data races) and a basic non-thin-air guarantee (all read values are mentioned in some statement of the program). The results of the current paper allow the application of Ševčík’s theorems to TSO, as it is fully explained by transformations that are already covered as compiler optimizations. For the other models, however, our negative results show that the DRF and non-thin-air guarantees do not follow immediately from Ševčík’s theorems.

## 2 Preliminaries: Axiomatic Memory Model Definitions

In this section, we present the basic axiomatic way of defining memory models.

*Basic Notations.* Given a binary relation *R*, \(R^?\), \(R^+\), and \(R^*\) respectively denote its reflexive, transitive, and reflexive-transitive closures. The inverse relation is denoted by \(R^{-1}\). We denote by \(R_1;R_2\) the left composition of two relations \(R_1,R_2\). A relation *R* is called *acyclic* if \(R^+\) is irreflexive. When *R* is a strict partial order, \({R}{|_{\text {imm}}}\) denotes the relation consisting of all *immediate R-edges*, i.e., pairs \({\langle {a,b}\rangle }\in R\) such that for every *c*, \({\langle {c,b}\rangle }\in R\) implies \({\langle {c,a}\rangle }\in R^?\), and \({\langle {a,c}\rangle }\in R\) implies \({\langle {b,c}\rangle }\in R^?\). Finally, we denote by [*A*] the identity relation on a set *A*. In particular, \([A];R;[B] = R\cap (A\times B)\).
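These relational operations are all computable on finite relations; the following Python sketch (names and encoding our own) implements composition, transitive closure, acyclicity, and immediate edges exactly as defined above, which may help when experimenting with the definitions:

```python
def compose(r1, r2):
    """Left composition R1;R2: all (a, c) with (a, b) in R1 and (b, c) in R2."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}

def trans_closure(r):
    """R^+, computed by iterating composition to a fixed point."""
    closure = set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger

def acyclic(r):
    """R is acyclic iff R^+ is irreflexive."""
    return all(a != b for (a, b) in trans_closure(r))

def immediate(r):
    """R|_imm: pairs (a, b) in R such that every c with (c, b) in R satisfies
    c = a or (c, a) in R, and every c with (a, c) in R satisfies c = b or (b, c) in R."""
    return {(a, b) for (a, b) in r
            if all(c == a or (c, a) in r for (c, d) in r if d == b)
            and all(c == b or (b, c) in r for (d, c) in r if d == a)}

po = {(1, 2), (2, 3), (1, 3)}  # a strict total order on three events
assert trans_closure({(1, 2), (2, 3)}) == po
assert acyclic(po) and not acyclic(po | {(3, 1)})
assert immediate(po) == {(1, 2), (2, 3)}
```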

We assume finite sets \({\mathsf {Tid}}\), \({\mathsf {Loc}}\), and \({\mathsf {Val}}\) of thread identifiers, locations, and values. We use *i* as a metavariable for thread identifiers, *x*, *y*, *z* for locations, and *v* for values. Axiomatic memory models associate a set of graphs (called *executions*) to every program. The nodes of these graphs are called *events*, and they are related by different kinds of edges.

*Events.* An *event* consists of an identifier (natural number), a thread identifier (or 0 for initialization events), and a *type*, which can be \({\texttt {R}}\) (“read”), \({\texttt {W}}\) (“write”), \({\texttt {U}}\) (“atomic update”), or \({\texttt {F}}\) (“fence”). For memory accesses (\(\texttt {R},\texttt {W},\texttt {U}\)) the event also contains the accessed location, as well as the read and/or written value. Events in each specific memory model may contain additional information (e.g., fence type or C11-style access ordering). We use \(a,b,\ldots \) as metavariables for events. The functions \({tid}\), \({typ}\), \({loc}\), \({val_r}\) and \({val_w}\) respectively return (when applicable) the thread identifier, type, location, read value, and written value of an event.

### Notation 1

Given a relation *R* on events, \(R|_{x}\) denotes the restriction of *R* to events accessing location *x*, and \(R|_{ loc }\) denotes the restriction of *R* to events accessing the same location (i.e., \(R|_{x} = \{{\langle {a,b}\rangle } \in R \; | \;{loc}(a)={loc}(b)=x\}\) and \(R|_{ loc } = \bigcup _{x\in loc } R|_{x}\)).

*Executions.* An *execution* *G* consists of:

- 1.
a finite set \(G.{\mathtt {E}}\) of events with distinct identifiers. This set always contains a set \(G.{\mathtt {E}}_0\) of initialization events, consisting of one write event assigning the initial value for every location. We assume that all initial values are 0.

- 2.
a binary relation \(G. po \), called *program order*, which is a disjoint union of relations \(\{ po _i\}_{i\in {\mathsf {Tid}}}\), such that \(G.{\mathtt {E}}_0\times (G.{\mathtt {E}}\setminus G.{\mathtt {E}}_0)\subseteq G. po \), and for every \(i\in {\mathsf {Tid}}\), the relation \( po _i\) is a strict total order on \(\{a\in G.{\mathtt {E}}\; | \;{tid}(a)=i\}\).

- 3.
a binary relation \(G. rf {}\), called *reads-from*, which is a set of reads-from edges. These are pairs \({\langle {a,b}\rangle } \in G.{\mathtt {E}}\times G.{\mathtt {E}}\) satisfying \(a\ne b\), \({typ}(a)\in \{\texttt {W},\texttt {U}\}\), \({typ}(b)\in \{\texttt {R},\texttt {U}\}\), \({loc}(a)={loc}(b)\), and \({val_w}(a)={val_r}(b)\). It is required that an event cannot read from two different events (i.e., if \({\langle {a_1,b}\rangle },{\langle {a_2,b}\rangle }\in rf {}\), then \(a_1=a_2\)).

- 4.
a binary relation \(G. mo \), called *modification order*, whose properties vary from one model to another.

We often identify an execution *G* with a set of tagged elements with the tags \({\mathtt {E}}\), \( po \), \( rf {}\), and \( mo \). For example, \(\{{\mathtt {E}}{:}a,\, {\mathtt {E}}{:}b,\, po {:}{\langle {a,b}\rangle }\}\) (where *a* and *b* are events) denotes an execution with \(G.{\mathtt {E}}=\{a,b\}\), \(G. po =\{{\langle {a,b}\rangle }\}\), and \(G. rf {}=G. mo =\emptyset \). Further, for a set *E* of events, \(\{{\mathtt {E}}:E\}\) denotes the set \(\{{\mathtt {E}}:e \; | \;e\in E\}\). A similar notation is used for the other tags, and it is particularly useful when writing expressions like \(G\cup \{ rf {:} rf {}\}\) (which stands for the extension of an execution *G* with a set \( rf {}\) of reads-from edges). In addition, we denote by \(G.\texttt {T}\) (\(\texttt {T}\in \{\texttt {R},\texttt {W},\texttt {U},\texttt {F}\}\)) the set \(\{e\in G.{\mathtt {E}}\; | \;{typ}(e)=\texttt {T}\}\). We may also concatenate the event set notations, and use a subscript to denote the accessed location (e.g., \(G.{\mathtt {R}}{\mathtt {W}}= G.{\mathtt {R}}\cup G.{\mathtt {W}}\), and \(G.{\mathtt {W}}_x\) denotes all events \(a\in G.{\mathtt {W}}\) with \({loc}(a)=x\)). We omit the prefix “*G*.” when it is clear from the context.

The exact definition of the set of executions associated with a given program depends on the particular programming language and the memory model. Figure 1 provides an example. Note that at this initial stage the read values are not restricted in any way, and the reads-from relation \( rf {}\) and the modification order \( mo \) are still empty. We refer to such executions as *plain executions*.

Now, the main part of a memory model is the specification of which of the executions of a program *P* are allowed. The first requirement, shared by all memory models, is that every read should be justified by some write. Such executions are called *complete* (formally, *G* is complete if for every \(b\in {\mathtt {R}}{\mathtt {U}}\), we have \({\langle {a,b}\rangle }\in rf {}\) for some event *a*). To filter out disallowed executions among the complete ones, each memory model \(\mathsf {M} \) defines a notion of when an execution *G* is \(\mathsf {M} \)*-coherent*, which is typically defined with the help of a few *derived relations*, and places several restrictions on the \( rf {}\) and \( mo \) relations. Then, we say that a plain execution *G* is \(\mathsf {M} \)*-consistent* if there exist relations \( rf {}\) and \( mo \) such that \(G\cup \{ rf {:} rf {}\}\cup \{ mo {:} mo \}\) is a complete and \(\mathsf {M}\)-coherent execution. The semantics of a program under \(\mathsf {M} \) is taken to be the set of its \(\mathsf {M}\)-consistent executions.

### 2.1 Sequential Consistency

As a simple instance of this framework, we define sequential consistency (\(\mathsf {SC}\)). There are multiple equivalent axiomatic presentations of \(\mathsf {SC}\). Here, we choose one that is specifically tailored for studying the relation to TSO in Sect. 3.

### Definition 1

An execution *G* is \(\mathsf {SC} \)*-coherent* if the following hold:

### Proposition 1

Our notion of \(\mathsf {SC}\)-consistency defines sequential consistency [14].

### Proof (Outline)

The \(\mathsf {SC}\)-coherence conditions above guarantee that the union of \( po \), \( rf {}\), and \( mo \) is acyclic, so that its transitive closure is a partial order. Following [17], any total order extending this partial order defines an interleaving of the memory accesses, which agrees with \( po \) and ensures that every read/update obtains its value from the last previous write/update to the same location. For the converse, one can take \( mo \) to be the restriction of the interleaving order to \({\mathtt {W}}{\mathtt {U}}{\mathtt {F}}\). \(\square \)

## 3 TSO

In this section, we study the TSO (*total store ordering*) memory model provided by the x86 and SPARC architectures. Its common presentation is operational: on top of the usual interleaving semantics, each hardware thread has a queue of pending memory writes (called a *store buffer*), which non-deterministically propagate (in order) to the main memory [16]. When a thread reads from a location *x*, it obtains the value of the last write to *x* that appears in its buffer, or the value of *x* in the memory if no such write exists. Fence instructions flush the whole buffer into the main memory, and atomic updates perform a flush, read, write, and flush again in one atomic step.
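This operational description can be prototyped directly. The following Python sketch (a simplified machine of our own, without fences or atomic updates, over locations *x* and *y* only) exhaustively explores all runs of a two-thread program with store buffers, and confirms that the SB weak outcome *a* = *b* = 0 is reachable under TSO:

```python
def explore(threads):
    """Exhaustively explore a toy TSO machine: stores go to a per-thread
    FIFO buffer, reads check the own buffer before memory, and buffers
    drain to memory nondeterministically."""
    init = (frozenset({("x", 0), ("y", 0)}), ((),) * len(threads),
            (0,) * len(threads), frozenset())
    seen, stack, finals = set(), [init], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        mem, bufs, pcs, regs = state
        moved = False
        for i, prog in enumerate(threads):
            if bufs[i]:  # drain the oldest buffered store of thread i
                loc, v = bufs[i][0]
                mem2 = frozenset({p for p in mem if p[0] != loc} | {(loc, v)})
                stack.append((mem2, bufs[:i] + (bufs[i][1:],) + bufs[i+1:],
                              pcs, regs))
                moved = True
            if pcs[i] < len(prog):  # execute the next instruction of thread i
                op = prog[pcs[i]]
                pcs2 = pcs[:i] + (pcs[i] + 1,) + pcs[i+1:]
                if op[0] == "W":
                    bufs2 = bufs[:i] + (bufs[i] + ((op[1], op[2]),),) + bufs[i+1:]
                    stack.append((mem, bufs2, pcs2, regs))
                else:  # read: the latest buffered store wins, else memory
                    hits = [v for (l, v) in bufs[i] if l == op[2]]
                    v = hits[-1] if hits else dict(mem)[op[2]]
                    stack.append((mem, bufs, pcs2, frozenset(regs | {(op[1], v)})))
                moved = True
        if not moved:
            finals.add(tuple(sorted(regs)))
    return finals

sb = [[("W", "x", 1), ("R", "a", "y")],
      [("W", "y", 1), ("R", "b", "x")]]
assert (("a", 0), ("b", 0)) in explore(sb)  # TSO allows the SB weak outcome
```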

To simplify our formal development, we use an *axiomatic* definition of TSO from [13]. By [16, Theorem 3] and [13, Theorem 5], this definition is equivalent to the operational one.^{2}

### Definition 2

An execution *G* is \(\mathsf {TSO} \)*-coherent* if the following hold:

Next, we present the key lemma that identifies more precisely the difference between \(\mathsf {TSO}\) and \(\mathsf {SC}\).

### Lemma 1

Let *G* be a complete and \(\mathsf {TSO}\)-coherent execution. Then *G* is also \(\mathsf {SC}\)-coherent if the following holds, where \( po '\) and \( rfi {}\) are derived relations with \( rfi {}= rf {}\cap po \) (the internal reads-from relation):

Now, we turn to our first main positive result, showing that \(\mathsf {TSO}\) is precisely characterized by write-read reordering and read-after-write elimination over sequential consistency. First, we define write-read reordering.

### Definition 3 (Write-Read Reordering)

For an execution *G* and events *a* and *b*, \(\mathsf {ReorderWR}({G,a,b})\) is the execution \(G'\) obtained from *G* by inverting the program order from *a* to *b*, i.e., it is given by: \(G'. po =(G. po \setminus \{{\langle {a,b}\rangle }\})\cup \{{\langle {b,a}\rangle }\}\), and \(G'.{\mathtt {C}}=G.{\mathtt {C}}\) for every other component \({\mathtt {C}}\). \(\mathsf {ReorderWR}({G,a,b})\) is defined only when \(a\in {\mathtt {W}}\), \(b\in {\mathtt {R}}\), \({\langle {a,b}\rangle }\in{ po }{|_{\text {imm}}}\), and \({loc}(a)\ne {loc}(b)\).

Note that this transformation does not inspect the \( rf {}\) and \( mo \) components of *G*, and thus also applies to plain executions. This fact ensures that it corresponds to a program transformation. Note that additional rewriting is sometimes needed in order to make two adjacent accesses in the program’s execution correspond to adjacent instructions in the program. For example, to reorder the store \(x:=1\) and the load \(a:=y\) in the following program, one can first rewrite the program so that the two instructions become adjacent. Similarly, reordering of local register assignments and unrolling of loops may be necessary. To relate reorderings on plain executions to reorderings on (non-straightline) programs, one should assume that these transformations may be freely applied.

### Remark 1

Demange et al. [10, Definition 5.3] introduce a related *write-read-read* reordering, which allows reordering a read before a write and a sequence of subsequent reads *reading from that write*. This reordering does not correspond to a local program transformation, as it inspects the reads-from relation, which is not available in plain executions and cannot be inferred from the program code.

The second transformation we use, called *WR-elimination*, replaces a read of some location that directly follows a write to that location with the value written by that write (e.g., \(x:=1;a:=x \leadsto x:=1;a:=1\)). Again, we place conditions to ensure that the execution transformation corresponds to a program transformation.

### Definition 4 (Read-After-Write Elimination)

For an execution *G* and events *a* and *b*, \(\mathsf {RemoveWR}({G,a,b})\) is the execution \(G'\) obtained by removing *b* from *G*, i.e., \(G'\) is given by: \(G'.{\mathtt {E}}=G.{\mathtt {E}}{\setminus } \{b\}\), and \(G'.{\mathtt {C}}=G.{\mathtt {C}}\cap (G'.{\mathtt {E}}\times G'.{\mathtt {E}})\) for every other component \({\mathtt {C}}\). \(\mathsf {RemoveWR}({G,a,b})\) is defined only when \(a\in {\mathtt {W}}\), \(b\in {\mathtt {R}}\), \({\langle {a,b}\rangle }\in{ po }{|_{\text {imm}}}\), \({loc}(a) = {loc}(b)\), and \({val_w}(a) = {val_r}(b)\).

Note that WR-reordering is unsound under \(\mathsf {SC}\) (the reordered program may exhibit behaviors that are not possible in the original program). WR-elimination, however, is sound under \(\mathsf {SC}\). Nevertheless, WR-elimination is needed below, since, by removing a read access, it may create new opportunities for WR-reordering.
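The interplay just described, where an elimination unlocks a reordering, can be illustrated at the program level. The following Python sketch (a toy straight-line instruction language of our own; `defer_const` is the register-assignment reordering mentioned earlier) derives \(b:=y;\, x:=1;\, a:=1\) from \(x:=1;\, a:=x;\, b:=y\):

```python
# ("W", loc, v) stores, ("R", reg, loc) loads,
# ("C", reg, v) register-constant assignments produced by elimination.
def eliminate_wr(prog, i):
    """WR-elimination: W x v ; R r x  ~>  W x v ; r := v."""
    w, r = prog[i], prog[i + 1]
    assert w[0] == "W" and r[0] == "R" and w[1] == r[2]
    return prog[:i + 1] + [("C", r[1], w[2])] + prog[i + 2:]

def reorder_wr(prog, i):
    """WR-reordering: W x v ; R r y  ~>  R r y ; W x v, for x != y."""
    w, r = prog[i], prog[i + 1]
    assert w[0] == "W" and r[0] == "R" and w[1] != r[2]
    return prog[:i] + [r, w] + prog[i + 2:]

def defer_const(prog, i):
    """Move a register-constant assignment past the next instruction."""
    c = prog[i]
    assert c[0] == "C"
    return prog[:i] + [prog[i + 1], c] + prog[i + 2:]

# Thread: x := 1; a := x; b := y.  Elimination unlocks the reordering:
t = [("W", "x", 1), ("R", "a", "x"), ("R", "b", "y")]
t = eliminate_wr(t, 0)   # x := 1; a := 1; b := y
t = defer_const(t, 1)    # x := 1; b := y; a := 1
t = reorder_wr(t, 0)     # b := y; x := 1; a := 1
assert t == [("R", "b", "y"), ("W", "x", 1), ("C", "a", 1)]
```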

We can now state the main theorem of this section. We write \(G \leadsto _{\mathsf {TSO}}G'\) if \(G'=\mathsf {ReorderWR}({G,a,b})\) or \(G'=\mathsf {RemoveWR}({G,a,b})\) for some *a*, *b*.

### Theorem 1

A plain execution *G* is \(\mathsf {TSO}\)-consistent iff \(G \leadsto _{\mathsf {TSO}}^* G'\) for some \(\mathsf {SC}\)-consistent execution \(G'\).

The rest of this section is devoted to the proof of Theorem 1. First, the soundness of the two transformations under \(\mathsf {TSO}\) is well-known.

### Proposition 2

If \(G \leadsto _{\mathsf {TSO}}G'\) and \(G'\) is \(\mathsf {TSO}\)-consistent, then so is *G*.

The converse is not generally true. It does (trivially) hold for eliminations:

### Proposition 3

Let *G* be a complete and \(\mathsf {TSO}\)-coherent execution. Then, \(\mathsf {RemoveWR}({G,a,b})\), if defined, is complete and \(\mathsf {TSO}\)-coherent.

### Proof

Removing a read event from an execution reduces all relations mentioned in Definition 2, and hence preserves their irreflexivity. \(\square \)

### Proposition 4

Let *G* be a complete and \(\mathsf {TSO}\)-coherent execution, and let *a*, *b* be events such that \(\mathsf {ReorderWR}({G,a,b})\) is defined. If \({\langle {a,b}\rangle }\not \in rf {}\), then \(\mathsf {ReorderWR}({G,a,b})\) is complete and \(\mathsf {TSO}\)-coherent.

### Proposition 5

Suppose that *G* is complete and \(\mathsf {TSO}\)-coherent but not \(\mathsf {SC}\)-coherent. Then, \(G \leadsto _{\mathsf {TSO}}G'\) for some \(\mathsf {TSO}\)-coherent complete execution \(G'\).

### Proof

By Lemma 1, there must exist events \(a\in {\mathtt {W}}\) and \(b\in {\mathtt {R}}\) such that \({\langle {a,b}\rangle }\in po ' \cup rfi {}\) (where \( po '\) and \( rfi {}\) are the relations defined in Lemma 1). Now, if \({\langle {a,b}\rangle } \in po '\), we can apply WR-reordering and take \(G'=\mathsf {ReorderWR}({G,a,b})\). By Proposition 4, \(G'\) is complete and \(\mathsf {TSO}\)-coherent. Otherwise, \({\langle {a,b}\rangle }\in rfi {}\). In this case, we can apply WR-elimination and take \(G'=\mathsf {RemoveWR}({G,a,b})\). By Proposition 3, \(G'\) is complete and \(\mathsf {TSO}\)-coherent. \(\square \)

We can now prove the main theorem.

*Proof (of Theorem* 1*).* The right-to-left direction is easily proven using Proposition 2, by induction on the number of transformations in the sequence deriving \(G'\) from *G* (note that the base case trivially holds, as \(\mathsf {SC}\)-consistency implies \(\mathsf {TSO}\)-consistency). We prove the converse. Given two plain executions *G* and \(G'\), we write \(G'<G\) if either \(G'.{\mathtt {E}}\subsetneq G.{\mathtt {E}}\), or \(G'.{\mathtt {E}}=G.{\mathtt {E}}\) and \(G'. po \) contains strictly fewer pairs in \({\mathtt {W}}\times {\mathtt {R}}\) than \(G. po \). Clearly, < is a well-founded partial order. We prove the claim by induction on *G* (using < on the set of all executions). Let *G* be an execution, and assume that the claim holds for all \(G'<G\). Suppose that *G* is \(\mathsf {TSO}\)-consistent. If *G* is \(\mathsf {SC}\)-consistent, then we are done. Otherwise, by Proposition 5, \(G \leadsto _{\mathsf {TSO}}G'\) for some \(\mathsf {TSO}\)-consistent execution \(G'\). It is easy to see that we have \(G' < G\). By the induction hypothesis, \(G' \leadsto _{\mathsf {TSO}}^* G''\) for some \(\mathsf {SC}\)-consistent execution \(G''\). Then, we also have \(G \leadsto _{\mathsf {TSO}}^* G''\). \(\square \)

## 4 Release-Acquire

Next, we turn to the *non-multi-copy-atomic* memory model of C11’s Release/Acquire, under which two different threads may observe a store of a third thread at different times. By \(\mathsf {RA}\) we refer to the memory model of C11, as defined in [7], restricted to programs in which all reads are acquire reads, all writes are release writes, and all atomic updates are acquire-release read-modify-writes (RMWs). We further assume that this model has no fence events. Fence instructions under \(\mathsf {RA}\), as proposed in [13], can be implemented using atomic updates to an otherwise-unused distinguished location.

### Definition 5

An execution *G* is \(\mathsf {RA} \)*-coherent* if the following hold:

- 1.
\( mo \) is a disjoint union of relations \(\{ mo _x\}_{x\in {\mathsf {Loc}}}\), such that each relation \( mo _x\) is a strict total order on \({\mathtt {W}}_x {\mathtt {U}}_x\).

- 2.
\( hb \) is irreflexive, where \( hb =( po \cup rf {})^+\) is the happens-before relation.

- 3.
\( mo ; hb \) is irreflexive.

- 4.
\( rb ; hb \) is irreflexive, where \( rb = rf {}^{-1}; mo \).

- 5.
\( rb ; mo \) is irreflexive.

One may consider adding *sequentialization* to the set of program transformations, to allow transformations of the form \(C_1 \parallel C_2 \leadsto C_1 ;C_2\) and \(C_1 ; C_1' \parallel C_2 \leadsto C_1 ;C_2; C_1'\). By sequentializing the \(x:=1\) store instruction to be before its corresponding load, we obtain the program on the left. Now, this behavior is allowed under \(\mathsf {SC}\) after applying a WR-elimination followed by a WR-reordering in the first thread (obtaining the program on the right). At the execution level, sequentialization increases the \( po \) component, and it is sound under \(\mathsf {RA}\), simply because it may only increase all the relations mentioned in Definition 5. Note that, unlike under \(\mathsf {RA}\) and \(\mathsf {SC}\), sequentialization is unsound under \(\mathsf {TSO}\): while the weak behavior of the IRIW program is forbidden under \(\mathsf {TSO}\), it is allowed after applying sequentialization. Other examples show that sequentialization is unsound under \(\mathsf {Power}\) and \(\mathsf {ARM}\) as well [1].

*x* after the two other writes to *x* in \( mo \)). In this program, no sound reorderings or eliminations can explain the weak behavior, and, moreover, any possible sequentialization will forbid this behavior.

In fact, the above example applies also to \(\mathsf {SRA}\), the stronger version of \(\mathsf {RA}\) studied in [13], obtained by requiring that \( mo \) is a total order on \({\mathtt {W}}{\mathtt {U}}\) (as in \(\mathsf {TSO}\)), instead of condition 1 in Definition 5 (but still excluding an irreflexivity condition that is required for \(\mathsf {SC}\)-coherence). Like \(\mathsf {RA}\), \(\mathsf {SRA}\) admits no thread-local transformations in this program, but allows its weak behavior.

## 5 Power

- 1.
Like \(\mathsf {RA}\), the Power model is non-multi-copy-atomic, and thus it cannot be explained using transformations over \(\mathsf {SC}\). Instead, we explain Power’s weak behaviors starting from a stronger non-multi-copy-atomic model that, we believe, is easier to understand and reason about than the Power model.

- 2.
Power’s control fence (\({{\mathbf {\mathtt{{isync}}}}}\)) is used to enforce a stronger ordering on memory reads. Its special effect cannot be accounted for by program transformations (see the example in [1]). Hence, we only consider here a restricted fragment of the Power model, which has two types of fence events: \({\texttt {sync}}\) (“strong fence”) and \({\texttt {lwsync}}\) (“lightweight fence”). \(G.\mathtt {F}_\texttt {sync}\) and \(G.\mathtt {F}_\texttt {lwsync}\) respectively denote the sets of events \(a\in G.{\mathtt {E}}\) with \({typ}(a)\) being \(\texttt {sync}\) and \(\texttt {lwsync}\).

The Power architecture performs out-of-order and speculative execution, but respects dependencies between instructions. Accordingly, \({\mathsf {Power}}\)’s axiomatic executions keep track of additional relations for data, address, and control dependencies between events, which are derived directly from the program syntax. For example, in all executions of \(a:=x;~ y:=a\), we will have a data dependency edge from the read event to the write event, since the load and store use the same register *a*. Here, we include all sorts of dependencies in one relation between events, denoted by \( deps \). Note that we always have \( deps \subseteq po \), and that only read and update events may have outgoing dependency edges.

A central component of the model is the *preserved program order*, denoted \( ppo \), which is a subset of \( po \) that is guaranteed to be preserved. The exact definition of \( ppo \) is somewhat intricate (we refer the reader to [4] for details). For our purposes, it suffices to use the following properties of \( ppo \):

### Remark 2

Atomic updates are not considered in the text of [4]. In the accompanying herd simulator, they are modeled using pairs of read and write events related by an atomicity relation. Here we follow a different approach, modeling atomic updates using a single update event, and adapt herd’s model accordingly. Thus, we only consider Power programs in which \(\mathtt {lwarx}\) and \(\mathtt {stwcx}\) appear in separate adjacent pairs. These instructions are used to implement locks and compare-and-swap commands, and they indeed appear only in such pairs when following the intended mapping of programs to Power [6].

Using the preserved program order, \(\mathsf {Power}\)-coherence is defined as follows (the reader is referred to [4] for further explanations and details).

### Definition 6

An execution *G* is \(\mathsf {Power} \)*-coherent* if the following hold, where one of the derived relations is defined as in Definition 1, and:

In particular, \(\mathsf {Power}\) allows the weak behavior in the LB program presented in the introduction. Indeed, unlike the other models discussed above, the \(\mathsf {Power}\) model does not generally forbid \(( po \cup rf {})\)-cycles. Thus, \(\mathsf {Power}\)-consistent executions are not “prefix-closed”: it may happen that *G* is \(\mathsf {Power}\)-consistent, but some \( po \)-prefix of *G* is not. This makes reasoning about the \(\mathsf {Power}\) model extremely difficult, because it precludes understanding a program in terms of its partial executions, and forbids proofs by induction on \( po \)-prefixes of an execution. In the following, we show that all weak behaviors of \(\mathsf {Power}\) can be explained by starting from a stronger prefix-closed model and applying various reorderings of independent adjacent memory accesses to different locations. First, we define the stronger model.

### Definition 7

An execution *G* is \(\mathsf {SPower} \)*-coherent* if it is \(\mathsf {Power}\)-coherent and \( po \cup rf {}\) is acyclic.

Note that this additional acyclicity condition is a strengthening of the *“no-thin-air”* condition in Definition 6. A similar strengthening for the C11 memory model was suggested in [8] as a straightforward solution to the “out-of-thin-air” problem (see also [5]). In addition, the same acyclicity condition was assumed for proving soundness of FSL [11] (a program logic for C11’s relaxed accesses).
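The effect of this acyclicity condition on the LB program can be checked mechanically. The following Python sketch (event names our own) builds the LB execution realizing *a* = *b* = 1 and tests whether the union of program order and reads-from is acyclic, before and after reordering each thread's load and store:

```python
def acyclic(edges, nodes):
    """Depth-first search for a back edge; True iff the graph is acyclic."""
    succ = {n: [b for (a, b) in edges if a == n] for n in nodes}
    state = dict.fromkeys(nodes, 0)  # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in succ[n]:
            if state[m] == 1 or (state[m] == 0 and not dfs(m)):
                return False
        state[n] = 2
        return True
    return all(state[n] != 0 or dfs(n) for n in nodes)

# LB realizing a = b = 1:  thread 1: a := x; y := 1   thread 2: b := y; x := 1
nodes = {"R1", "W1", "R2", "W2"}
po = {("R1", "W1"), ("R2", "W2")}
rf = {("W1", "R2"), ("W2", "R1")}  # each load reads the other thread's store

assert not acyclic(po | rf, nodes)  # a (po U rf)-cycle: not SPower-consistent
# Reordering each thread's load after its store inverts the po edges:
po_reordered = {("W1", "R1"), ("W2", "R2")}
assert acyclic(po_reordered | rf, nodes)
```

This matches the discussion above: the weak LB execution is \(\mathsf {Power}\)-consistent but not \(\mathsf {SPower}\)-consistent, while its reordered counterpart is.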

Next, we turn to relate \(\mathsf {Power} \) and \(\mathsf {SPower} \) using general reorderings of adjacent memory accesses.

### Definition 8 (Reordering)

For an execution *G* and events *a* and *b*, \(\mathsf {Reorder}({G,a,b})\) is the execution \(G'\) obtained from *G* by inverting the program order from *a* to *b*, i.e., it is given by: \(G'. po =(G. po \setminus \{{\langle {a,b}\rangle }\})\cup \{{\langle {b,a}\rangle }\}\), and \(G'.{\mathtt {C}}=G.{\mathtt {C}}\) for every other component \({\mathtt {C}}\). \(\mathsf {Reorder}({G,a,b})\) is defined only when \(a,b\not \in {\mathtt {F}}\), \({\langle {a,b}\rangle }\in ({ po }{|_{\text {imm}}})\setminus deps \), and \({loc}(a) \ne {loc}(b)\).

We write \(G \leadsto _{\mathsf {Power}}G'\) if \(G'=\mathsf {Reorder}({G,a,b})\) for some *a*, *b*.

### Proposition 6

Suppose that \(G \leadsto _{\mathsf {Power}}G'\). Then, *G* is \(\mathsf {Power}\)-coherent iff \(G'\) is \(\mathsf {Power}\)-coherent.

The following observation is useful in the proof below.

### Proposition 7

### Theorem 2

A plain execution *G* is \(\mathsf {Power}\)-consistent iff \(G \leadsto _{\mathsf {Power}}^* G'\) for some \(\mathsf {SPower}\)-consistent execution \(G'\).

### Proof

The right-to-left direction is proven by induction using Proposition 6. We prove the converse. Let *G* be a \(\mathsf {Power}\)-consistent plain execution, and let \( rf {}\) and \( mo \) be relations such that \(G\cup \{ rf {:} rf {}\}\cup \{ mo {:} mo \}\) is complete and \(\mathsf {Power}\)-coherent. Let *S* be a strict total order on \({\mathtt {E}}\) extending the relation *R* given in Proposition 7. Let \(G'\) be the execution obtained from *G* by rearranging each thread’s events according to *S* (keeping the initialization events \({\mathtt {E}}_0\) before all other events), while all other components of \(G'\) are as in *G*. It is easy to see that \(G \leadsto _{\mathsf {Power}}^* G'\). Indeed, recall that a list *L* of elements totally ordered by < can be sorted by repeatedly swapping adjacent unordered elements \(l_i > l_{i+1}\) (as done in “bubble sort”). Since \(R\subseteq S\), no reordering step from *G* to \(G'\) will reorder dependent events, events accessing the same location, or fence events. Now, Proposition 6 ensures that \(G'\cup \{ rf {:} rf {}\}\cup \{ mo {:} mo \}\) is complete and \(\mathsf {Power}\)-coherent. To see that it is also \(\mathsf {SPower}\)-coherent, note that \(G'. po \cup rf {}\subseteq S\), and hence \( po \cup rf {}\) is acyclic in \(G'\). \(\square \)
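The "bubble sort" step of this proof can be made concrete. The following Python sketch (our own encoding, with a hypothetical four-event thread) sorts an event list into an order compatible with a total order *S* using only adjacent transpositions, and checks that no swap touches a pair related by \(R\subseteq S\):

```python
def sort_by_adjacent_swaps(events, s_rank):
    """Bubble sort: order events by s_rank using only adjacent transpositions,
    returning the sorted list and the list of swapped pairs."""
    events, swaps = list(events), []
    changed = True
    while changed:
        changed = False
        for i in range(len(events) - 1):
            a, b = events[i], events[i + 1]
            if s_rank[a] > s_rank[b]:
                events[i], events[i + 1] = b, a
                swaps.append((a, b))
                changed = True
    return events, swaps

# Hypothetical thread with events 1..4; R (e.g., dependency or same-location
# pairs) is contained in S, so no swap ever involves an R-related pair.
r = {(1, 3)}
s_rank = {2: 0, 1: 1, 4: 2, 3: 3}  # a strict total order S extending R
sorted_events, swaps = sort_by_adjacent_swaps([1, 2, 3, 4], s_rank)
assert sorted_events == [2, 1, 4, 3]
assert all(pair not in r for pair in swaps)
```

Each recorded swap corresponds to one \(\mathsf {Reorder}\) step in the proof, and since swaps happen only for pairs ordered oppositely to *S*, pairs in \(R\subseteq S\) are never swapped.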

### Remark 3

Note that the reordering operation does not affect the dependency relation. To allow this, and still understand reordering on the program level, we actually consider a slightly weaker model of Power than the one in [4], which does not carry control dependencies across branches. For instance, in a program like \(a:=y; ({{\mathbf {\mathtt{{if}}}}}\,{a}\, {{\mathbf {\mathtt{{then}}}}}\, z:=1); x:=1\), which can be the result of reordering the stores to *x* and *z* in \(a:=y; x:=1; ({{\mathbf {\mathtt{{if}}}}}\,{a}\, {{\mathbf {\mathtt{{then}}}}}\,{z:=1})\), we will not have a control dependency between the load of *y* and the store to *x*.

## 6 ARM

We now turn to the ARM architecture and show that it cannot be modeled by any sequence of sound reorderings and eliminations over a basic model satisfying \(( po \cup rf )\)-acyclicity.

*x* in the first thread. Nevertheless, this behavior is allowed under both the axiomatic ARMv7 model of Alglave et al. [4] and the ARMv8 Flowing and POP models of Flur et al. [12].

The axiomatic ARMv7 model [4] is the same as the \(\mathsf {Power}\) model presented in Sect. 5, with the only difference being the definition of \( ppo \) (preserved program order). In particular, this model does not satisfy (ppo-lower-bound), since the required lower bound is not included in ARM's \( ppo \). Hence, the first thread's program order in the example above is not included in \( ppo \), and there is no happens-before cycle. For the same reason, our proof for \(\mathsf {Power}\) does not carry over to \(\mathsf {ARM}\).
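The cycle argument can be checked mechanically. The sketch below is our own reconstruction of the example's execution (a same-location load-store pair in the first thread, dependent load-store pairs in the other two, with each load reading from the previous thread's store, as in the Flowing-model walkthrough below); it approximates happens-before as \( ppo \cup rf _e\) and tests for a cycle with and without the first thread's pair in \( ppo \).

```python
# Our reconstruction of the three-thread example: thread 1 runs a:=x; x:=1
# (same location, so the pair is not reorderable), threads 2 and 3 run
# dependent load-store pairs; rfe links each store to the next thread's load.
# Happens-before is approximated as ppo ∪ rfe.

from itertools import product

events = ['R1x', 'W1x', 'R2x', 'W2y', 'R3y', 'W3x']
po = [('R1x', 'W1x'), ('R2x', 'W2y'), ('R3y', 'W3x')]
rfe = [('W1x', 'R2x'), ('W2y', 'R3y'), ('W3x', 'R1x')]

def has_cycle(edges):
    """Detect a cycle via a naive transitive closure."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return any((e, e) in closure for e in events)

assert has_cycle(po + rfe)           # ppo = po: a happens-before cycle
ppo_arm = po[1:]                     # drop thread 1's pair from ppo
assert not has_cycle(ppo_arm + rfe)  # no cycle: the behavior is allowed
```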

In the ARMv8 Flowing model [12], consider the topology where the first two threads share a queue and the third thread is separate. The following execution is possible: (1) the first thread issues a load request from *x* and immediately commits the \(x:=1\) store; (2) the second thread then issues a load request from *x*, which is satisfied by the \(x:=1\) store, and (3) issues the store \(y:=1\); (4) the store to *y* is reordered with the *x*-accesses and flows to the third thread; (5) the third thread then reads \(y=1\) and issues the store \(x:=1\), which flows to the memory; (6) the first thread's load of *x* flows to the next level and is satisfied by the third thread's \(x:=1\) store; and (7) finally, the first thread's \(x:=1\) store also flows to the next level. The POP model is strictly weaker than the Flowing model, and thus also allows this outcome.
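The seven steps above can be replayed as a toy script. This is a heavy simplification of the actual Flowing semantics of [12] (all names and data structures are ours): a queue is a list whose front flows into flat memory, adjacent entries to different locations may be swapped, and a pending load may be satisfied by a store sitting closer to memory in the same queue.

```python
# Toy replay of the seven-step Flowing execution (not the real semantics).
memory = {'x': 0, 'y': 0}
results = {}   # thread id -> value read

def flow(queue):
    """Flow the front entry of a queue into memory."""
    kind, thread, loc, val = queue.pop(0)
    if kind == 'W':
        memory[loc] = val
    else:                                  # load satisfied from memory
        results[thread] = memory[loc]

def swap(queue, i):
    """Reorder two adjacent entries accessing different locations."""
    assert queue[i][2] != queue[i + 1][2]
    queue[i], queue[i + 1] = queue[i + 1], queue[i]

def satisfy_in_queue(queue, i):
    """Satisfy the load at position i by the nearest same-location store
    between it and memory; the satisfied load leaves the queue."""
    kind, thread, loc, _ = queue[i]
    assert kind == 'R'
    store = next(e for e in reversed(queue[:i]) if e[0] == 'W' and e[2] == loc)
    results[thread] = store[3]
    del queue[i]

q12 = []                                   # queue shared by threads 1 and 2
q12.append(('R', 1, 'x', None))            # (1) thread 1 issues a:=x ...
q12.append(('W', 1, 'x', 1))               #     ... and commits x:=1
q12.append(('R', 2, 'x', None))            # (2) thread 2 issues b:=x,
satisfy_in_queue(q12, 2)                   #     satisfied by thread 1's x:=1
q12.append(('W', 2, 'y', 1))               # (3) thread 2 issues y:=1
swap(q12, 1); swap(q12, 0)                 # (4) y:=1 passes the x-accesses
flow(q12)                                  #     and flows onward
q3 = [('R', 3, 'y', None)]                 # (5) thread 3 reads y ...
flow(q3)
q3 = [('W', 3, 'x', 1)]                    #     ... and issues x:=1
flow(q3)
flow(q12)                                  # (6) thread 1's load reads x=1
flow(q12)                                  # (7) thread 1's x:=1 flows

assert (results[1], results[2], results[3]) == (1, 1, 1)  # the weak outcome
```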

## 7 Application: Correctness of Compilation

Our theorems can be useful for proving the correctness of compilation from a programming language with some memory model (such as C11) to the TSO and Power architectures. We outline this idea in a more abstract setting.

*P* under memory model \(\mathsf {M}\). A formal definition of a behavior can be given using a distinguished *world* location, whose values are inspected by an external observer. Assume some compilation scheme from a source language *C* to a target language *A* (i.e., a mapping of *C* instructions to sequences of *A* ones), and let \({\mathtt {compile}({P_C})}\) denote the program \(P_A\) obtained by applying this scheme to a program \(P_C\). Further, assume memory models \(\mathsf {M} _C\) and \(\mathsf {M} _A\) (we do not assume that \(\mathsf {M} _C\) has an axiomatic presentation; an operational one would work equally well). Correct compilation then requires that every behavior of \({\mathtt {compile}({P_C})}\) under \(\mathsf {M} _A\) is also a behavior of \(P_C\) under \(\mathsf {M} _C\).

By our theorems, this holds provided that the set of transformations characterizing \(\mathsf {M} _A\) (*i*) is sound for \(\mathsf {M} _C\), and (*ii*) captures all target transformations from a compiled program.

Fulfilling the second requirement is typically easy, because the source language, its memory model, and the mapping of its statements to processor instructions are often explicitly designed to enable such transformations. In fact, when one aims to validate an *optimizing* compiler, the first part of the second requirement has to be established anyway. For example, consider the compilation of C11 to TSO. Here, we need to show that WR-reordering and WR-elimination on compiled code can be performed by C11-sound transformations on the corresponding instructions of the source. Indeed, the mapping of C11 accesses to TSO instructions (see [7]) ensures that any adjacent WR-pair results from adjacent C11 accesses with access ordering strictly weaker than \({\texttt {sc}}\) (sequentially consistent accesses). Reorderings and eliminations of such accesses are known to be sound under the C11 memory model [18].
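The argument about adjacent WR-pairs can be sanity-checked on one common variant of the C11-to-x86/TSO mapping, in which an \({\texttt {sc}}\) store compiles to MOV;MFENCE and other accesses compile to plain MOVs. The encoding below is our own sketch, not code from the paper: because of the trailing MFENCE, an adjacent store-load pair in compiled code can never stem from an \({\texttt {sc}}\) store.

```python
# Sketch of one standard C11-to-x86/TSO mapping (store side fenced for sc).
STORE = {'rlx': ['MOV'], 'rel': ['MOV'], 'sc': ['MOV', 'MFENCE']}
LOAD = {'rlx': ['MOV'], 'acq': ['MOV'], 'sc': ['MOV']}

def compile_accesses(accesses):
    """Compile a list of (kind, mode) C11 accesses to tagged instructions."""
    out = []
    for kind, mode in accesses:
        table = STORE if kind == 'W' else LOAD
        out += [(ins, kind, mode) for ins in table[mode]]
    return out

def adjacent_wr_pairs(compiled):
    """Adjacent store-load (MOV-MOV) instruction pairs after compilation."""
    return [(a, b) for a, b in zip(compiled, compiled[1:])
            if a[0] == 'MOV' and a[1] == 'W' and b[1] == 'R']

# Any adjacent WR-pair comes from a store weaker than sc, so the pair may
# be handled by a C11-sound transformation on the source [18].
for modes in [('rlx', 'rlx'), ('rel', 'acq'), ('sc', 'sc')]:
    compiled = compile_accesses([('W', modes[0]), ('R', modes[1])])
    for store_ins, _ in adjacent_wr_pairs(compiled):
        assert store_ins[2] != 'sc'  # sc stores end in MFENCE: never adjacent
```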

## 8 Conclusion

In this paper, we have shown that the \(\mathsf {TSO}\) memory model and (a substantial fragment of) the \(\mathsf {Power}\) memory model can be defined by a set of reorderings and eliminations starting from a stronger and simpler memory model. Nevertheless, the counterexamples in Sects. 4 and 6 suggest that there is more to weak memory consistency than just instruction reorderings and eliminations.

We further sketched a possible application of the alternative characterizations of \(\mathsf {TSO}\) and \(\mathsf {Power}\): proofs of compilation correctness can be simplified by using the soundness of local transformations in the source language. To follow this approach in a formal proof of correctness of a compiler, however, further work is required to formulate precisely the syntactic transformations in the target programming language. In the future, we also plan to investigate the application of these characterizations for proving soundness of program logics with respect to \(\mathsf {TSO}\) and \(\mathsf {Power}\).

## Footnotes

- 1. Different models may include some additional relations (e.g., a dependency relation between events is used for Power, see Sect. 5).
- 2. Lahav et al. [13] treat fence instructions as syntactic sugar for atomic updates of a distinguished location. Here, we have fences as primitive instructions that induce fence events in the program executions.

## Notes

### Acknowledgments

We would like to thank the FM’16 reviewers for their feedback. This research was supported by an ERC Consolidator Grant for the project “RustBelt”, funded under Horizon 2020 grant agreement no. 683289.

### References

- 1. Coq development for this paper and further supplementary material. http://plv.mpi-sws.org/trns/
- 2. Adve, S.V., Gharachorloo, K.: Shared memory consistency models: a tutorial. Computer **29**(12), 66–76 (1996)
- 3. Alglave, J., Kroening, D., Nimal, V., Tautschnig, M.: Software verification for weak memory via program transformation. In: Felleisen, M., Gardner, P. (eds.) ESOP 2013. LNCS, vol. 7792, pp. 512–532. Springer, Heidelberg (2013). doi:10.1007/978-3-642-37036-6_28
- 4. Alglave, J., Maranget, L., Tautschnig, M.: Herding cats: modelling, simulation, testing, and data mining for weak memory. ACM Trans. Program. Lang. Syst. **36**(2), 7:1–7:74 (2014)
- 5. Batty, M., Memarian, K., Nienhuis, K., Pichon-Pharabod, J., Sewell, P.: The problem of programming language concurrency semantics. In: Vitek, J. (ed.) ESOP 2015. LNCS, vol. 9032, pp. 283–307. Springer, Heidelberg (2015). doi:10.1007/978-3-662-46669-8_12
- 6. Batty, M., Memarian, K., Owens, S., Sarkar, S., Sewell, P.: Clarifying and compiling C/C++ concurrency: from C++11 to POWER. In: Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2012, pp. 509–520. ACM, New York (2012)
- 7. Batty, M., Owens, S., Sarkar, S., Sewell, P., Weber, T.: Mathematizing C++ concurrency. In: Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, pp. 55–66. ACM, New York (2011)
- 8. Boehm, H.J., Demsky, B.: Outlawing ghosts: avoiding out-of-thin-air results. In: Proceedings of the Workshop on Memory Systems Performance and Correctness, MSPC 2014, pp. 7:1–7:6. ACM, New York (2014)
- 9. Burckhardt, S., Musuvathi, M., Singh, V.: Verifying local transformations on relaxed memory models. In: Gupta, R. (ed.) CC 2010. LNCS, vol. 6011, pp. 104–123. Springer, Heidelberg (2010). doi:10.1007/978-3-642-11970-5_7
- 10. Demange, D., Laporte, V., Zhao, L., Jagannathan, S., Pichardie, D., Vitek, J.: Plan B: a buffered memory model for Java. In: Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2013, pp. 329–342. ACM, New York (2013)
- 11. Doko, M., Vafeiadis, V.: A program logic for C11 memory fences. In: Jobstmann, B., Leino, K.R.M. (eds.) VMCAI 2016. LNCS, vol. 9583, pp. 413–430. Springer, Heidelberg (2016). doi:10.1007/978-3-662-49122-5_20
- 12. Flur, S., Gray, K.E., Pulte, C., Sarkar, S., Sezgin, A., Maranget, L., Deacon, W., Sewell, P.: Modelling the ARMv8 architecture, operationally: concurrency and ISA. In: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pp. 608–621. ACM, New York (2016)
- 13. Lahav, O., Giannarakis, N., Vafeiadis, V.: Taming release-acquire consistency. In: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pp. 649–662. ACM, New York (2016)
- 14. Lamport, L.: How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput. **28**(9), 690–691 (1979)
- 15. Morisset, R., Pawan, P., Zappa Nardelli, F.: Compiler testing via a theory of sound optimisations in the C11/C++11 memory model. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2013, pp. 187–196. ACM, New York (2013)
- 16. Owens, S., Sarkar, S., Sewell, P.: A better x86 memory model: x86-TSO. In: Berghofer, S., Nipkow, T., Urban, C., Wenzel, M. (eds.) TPHOLs 2009. LNCS, vol. 5674, pp. 391–407. Springer, Heidelberg (2009). doi:10.1007/978-3-642-03359-9_27
- 17. Shasha, D., Snir, M.: Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst. **10**(2), 282–312 (1988)
- 18. Vafeiadis, V., Balabonski, T., Chakraborty, S., Morisset, R., Zappa Nardelli, F.: Common compiler optimisations are invalid in the C11 memory model and what we can do about it. In: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, pp. 209–220. ACM, New York (2015)
- 19. Ševčík, J.: Safe optimisations for shared-memory concurrent programs. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, pp. 306–316. ACM, New York (2011)