In this section we describe a small C-style language called \(\lambda \mathsf {MC} \), which features non-determinism in expressions. We define its semantics by translation into an ML-style functional language with concurrency called HeapLang.
We briefly describe the \(\lambda \mathsf {MC} \) source language (Sect. 2.1) and the HeapLang target language (Sect. 2.2) of the translation. Then we describe the translation scheme itself (Sect. 2.3). We explain in several steps how to exploit concurrency and monadic programming to give a concise and clear definitional semantics.
2.1 The Source Language \(\lambda \mathsf {MC} \)
The syntax of our source language called \(\lambda \mathsf {MC} \) is as follows:
The values include integers, \(\mathtt {NULL}\) pointers, concrete locations \(\mathtt {l}\), function pointers \(\mathtt {f}\), structs with two fields (tuples), and the unit value \(\texttt {()}\) (for functions without return value). There is a global list of function definitions. Most of the expression constructs resemble standard C notation, with some exceptions. To keep our language uniform, we do not differentiate between expressions and statements. As such, if-then-else and sequencing constructs are not duplicated for both expressions and statements. Moreover, we do not differentiate between lvalues and rvalues [22, 6.3.2.1]. Hence, there is no address operator and, similarly to ML, the load and assignment operators take a reference as their first argument.
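As a rough illustration of the value forms listed above, the following Python sketch models them as immutable records. The class names and the use of integer addresses for locations are assumptions made for this example only; they are not part of \(\lambda \mathsf {MC} \).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IntV:
    n: int  # an integer value


@dataclass(frozen=True)
class NullV:
    pass  # the NULL pointer


@dataclass(frozen=True)
class LocV:
    addr: int  # assumption: a concrete location is modelled as an integer address


@dataclass(frozen=True)
class FunPtrV:
    name: str  # a function pointer, referring to a global definition by name


@dataclass(frozen=True)
class PairV:
    fst: "Value"  # a struct with two fields (a tuple)
    snd: "Value"


@dataclass(frozen=True)
class UnitV:
    pass  # the unit value, for functions without a return value


Value = Union[IntV, NullV, LocV, FunPtrV, PairV, UnitV]


def is_pointer(v) -> bool:
    """A pointer is either NULL or a concrete location."""
    return isinstance(v, (NullV, LocV))
```

Because there are no separate lvalues, a pointer is just another value that the load and assignment operators receive as their first argument.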
The sequenced bind operator generalizes the normal sequencing operator of C by binding the result of its first operand to the variable \(\mathtt {x}\) in its second operand. As such, it can be thought of as the declaration of an immutable local variable \(\mathtt {x}\). We omit mutable local variables for now, but these can easily be added as an extension to our method, as shown in Sect. 7. We use a shorthand notation for a sequenced bind in which we do not care about the return value of the first operand.
To focus on the key topics of the paper (non-determinism and the sequence point restriction), we take a minimalistic approach and omit most other features of C. Notably, we omit non-local control (return, break, continue, and goto). Our memory model is simplified; it only supports structs with two fields (tuples), but no arrays, unions, or machine integers. In Sect. 7 we show that some of these features (arrays, pointer arithmetic, and mutable local variables) can be incorporated.
2.2 The Target Language HeapLang
The target language of our definitional semantics of \(\lambda \mathsf {MC} \) is an \(\textsf {ML} \)-style functional language with concurrency primitives and a call-by-value semantics. This language, called HeapLang, is included as part of the Iris Coq development [21]. The syntax is as follows:
The language contains some concurrency primitives that we will use to model non-determinism in \(\lambda \mathsf {MC} \): the parallel composition operator \((\mathbin {||_{\textsf {\tiny HL}}})\), and primitives for creating, acquiring, and releasing a mutex. The parallel composition operator executes its two operand expressions in parallel, and returns a tuple of their results. The creation primitive returns a new mutex. If \({\textit{lk}}\) is a mutex that was created this way, then the acquire primitive tries to acquire it and blocks until no other thread is using \({\textit{lk}}\). An acquired mutex can be released using the release primitive.
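The behavior of these primitives can be approximated with Python threads. In the sketch below, the names `par`, `new_mutex`, `acquire`, and `release` are hypothetical stand-ins chosen for this example, and thunks (zero-argument functions) stand in for unevaluated expressions. For brevity, exceptions raised inside a thread are not propagated.

```python
import threading


def par(e1, e2):
    """Run the thunks e1 and e2 in parallel and return the pair of their results."""
    results = [None, None]

    def worker(i, thunk):
        results[i] = thunk()

    t1 = threading.Thread(target=worker, args=(0, e1))
    t2 = threading.Thread(target=worker, args=(1, e2))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return (results[0], results[1])


def new_mutex():
    """Create a new, initially free mutex."""
    return threading.Lock()


def acquire(lk):
    """Block until no other thread holds lk, then take it."""
    lk.acquire()


def release(lk):
    """Release a mutex that was previously acquired."""
    lk.release()


# Two threads increment a shared counter inside a critical section.
counter = [0]
lk = new_mutex()


def bump():
    acquire(lk)
    try:
        counter[0] += 1
        return counter[0]
    finally:
        release(lk)


pair = par(bump, bump)
```

Which thread observes 1 and which observes 2 is not determined, but the mutex guarantees that the two increments never overlap.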
2.3 The Monadic Definitional Semantics of \(\lambda \mathsf {MC} \)
We now give the semantics of \(\lambda \mathsf {MC} \) by translation into HeapLang. The translation is carried out in several stages, each stage implementing and illustrating a specific aspect of C. First, we model non-determinism in expressions by concurrency, parallelizing execution of subexpressions (step 1). After that, we add checks for sequence point violations in the translation of the assignment and dereferencing operations (step 2). Finally, we add function calls and demonstrate how the translation can be simplified using a monadic notation (step 3).
Step 1: Non-determinism via Parallel Composition. We model the unspecified evaluation order of binary expressions by executing their subexpressions in parallel using the \((\mathbin {||_{\textsf {\tiny HL}}})\) operator:
Since our memory model is simple, the value interpretation is straightforward:
The only interesting case is the translation of locations. Since there is no concept of a \(\mathtt {NULL}\) pointer in HeapLang, we use the option type to distinguish \(\mathtt {NULL}\) pointers from concrete locations (\(\mathtt {l}\)). The interpretation of assignments thus contains a pattern match to check that no \(\mathtt {NULL}\) pointers are dereferenced. A similar check is performed in the interpretation of the load operation. Moreover, each location contains an option to distinguish freed from active locations.
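The two uses of the option type described above can be sketched in Python as follows. This is only an approximation under stated assumptions: the `Heap` class, the tagged-tuple encoding of options, and the convention that a store returns the stored value are all choices made for this example.

```python
class UndefinedBehavior(Exception):
    """Signals that the modelled C program has undefined behavior."""


NULL = None  # the 'None' case of the option models a NULL pointer


def some(x):
    """The 'Some' case of the option, encoded as a tagged tuple."""
    return ("Some", x)


class Heap:
    def __init__(self):
        # address -> some(value) for an active cell, None for a freed cell
        self.cells = {}
        self.next_addr = 0

    def alloc(self, v):
        addr = self.next_addr
        self.next_addr += 1
        self.cells[addr] = some(v)
        return some(addr)  # a concrete location, as opposed to NULL

    def _active_addr(self, ptr):
        # Pattern match on the pointer: NULL must not be dereferenced.
        if ptr is NULL:
            raise UndefinedBehavior("NULL pointer dereference")
        _, addr = ptr
        # Pattern match on the cell: freed locations must not be accessed.
        if self.cells[addr] is None:
            raise UndefinedBehavior("access to a freed location")
        return addr

    def store(self, ptr, v):
        addr = self._active_addr(ptr)
        self.cells[addr] = some(v)
        return v  # assumption: an assignment evaluates to the assigned value

    def load(self, ptr):
        addr = self._active_addr(ptr)
        return self.cells[addr][1]

    def free(self, ptr):
        addr = self._active_addr(ptr)
        self.cells[addr] = None
```

For example, after `h = Heap()` and `p = h.alloc(3)`, the call `h.store(p, 4)` succeeds, whereas `h.store(NULL, 4)` and any access to `p` after `h.free(p)` raise `UndefinedBehavior`.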
Step 2: Sequence Points. So far we have not accounted for undefined behavior due to sequence point violations. For instance, a program that adds the results of two assignments to the same location, one writing 3 and the other writing 4, gets translated into a HeapLang expression that updates the value of that location non-deterministically to either 3 or 4, and returns 7. However, in C, the behavior of this program is undefined, as it exhibits a sequence point violation: there is a write conflict for the location.
To give a semantics for sequence point violations, we follow the approach by Norrish [44], Ellison and Rosu [17], and Krebbers [29, 30]. We keep track of a set of locations that have been written to since the last sequence point. We refer to this set as the environment of our translation, and represent it using a global variable \({\textit{env}}\). Because our target language HeapLang is concurrent, all updates to the environment \({\textit{env}}\) must be executed atomically, i.e., inside a critical section, which we enforce by employing a global mutex \({\textit{lk}}\). The interpretation of assignments now becomes:
Whenever we assign to (or read from) a location l, we check that l is not already present in the environment \({\textit{env}}\). If l is present, then it has already been written to since the last sequence point, so accessing it constitutes undefined behavior (see the check in the interpretation of assignments above). In the interpretation of assignments, we furthermore insert the location l into the environment \({\textit{env}}\).
In order to make sure that one can access a variable again after a sequence point, we define the sequenced bind operator as follows:
After we have finished executing the first expression, we clear the environment \({\textit{env}}\), so that all locations are accessible again in the second expression.
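The bookkeeping of step 2 can be approximated in Python as follows. This is a sketch under stated assumptions rather than the translation itself: the `Machine` class, the use of strings as locations, and the helper `par` (which re-raises undefined behavior detected in either thread) are hypothetical names introduced for this example.

```python
import threading


class UndefinedBehavior(Exception):
    """Signals a sequence point violation in the modelled program."""


class Machine:
    def __init__(self):
        self.heap = {}              # location -> value (flat store, for brevity)
        self.env = set()            # locations written since the last sequence point
        self.lk = threading.Lock()  # guards every access to env

    def assign(self, loc, v):
        with self.lk:
            if loc in self.env:
                raise UndefinedBehavior("write to %r after an unsequenced write" % loc)
            self.env.add(loc)
            self.heap[loc] = v
        return v

    def load(self, loc):
        with self.lk:
            if loc in self.env:
                raise UndefinedBehavior("read of %r after an unsequenced write" % loc)
            return self.heap[loc]

    def seq_bind(self, e1, e2):
        v = e1()
        with self.lk:
            self.env.clear()  # sequence point: all locations become accessible again
        return e2(v)


def par(e1, e2):
    """Run two thunks in parallel; re-raise undefined behavior from either one."""
    out = [None, None]

    def work(i, thunk):
        try:
            out[i] = ("ok", thunk())
        except UndefinedBehavior as exc:
            out[i] = ("ub", exc)

    threads = [threading.Thread(target=work, args=(i, t)) for i, t in enumerate((e1, e2))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for tag, payload in out:
        if tag == "ub":
            raise payload
    return (out[0][1], out[1][1])


# Two unsequenced writes to the same location: flagged in either execution order.
m1 = Machine()
try:
    par(lambda: m1.assign("x", 3), lambda: m1.assign("x", 4))
    conflict_detected = False
except UndefinedBehavior:
    conflict_detected = True

# The same two writes separated by a sequence point: accepted.
m2 = Machine()
result = m2.seq_bind(lambda: m2.assign("x", 3), lambda _: m2.assign("x", 4))
```

Whichever thread runs second finds the location already in the environment, so the conflict is detected regardless of scheduling, while the sequenced variant succeeds because the environment is cleared between the two writes.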
Step 3: Non-interleaved Function Calls. As the final step, we present the correct translation scheme for function calls. Unlike the other expressions, function calls are not interleaved during the execution of subexpressions [22, 6.5.2.2p10]. For instance, in a program whose two subexpressions each call a function, the possible orders of execution are: either all the instructions of the first function followed by all the instructions of the second, or all the instructions of the second function followed by all the instructions of the first.
To model this, we execute each function call atomically. In the previous step we used a global mutex for guarding the access to the environment. We could use that mutex for function calls too. However, reusing a single mutex for entering each critical section would not work, because the body of a function may contain invocations of other functions. To that end, we use multiple mutexes to reflect the hierarchical structure of function calls.
To handle multiple mutexes, each C expression is interpreted as a HeapLang function that receives a mutex and returns its result. That is, each C expression is modeled by a monadic expression in the reader monad. For consistency's sake, we now also use the monad to thread through the reference to the environment, instead of using a global variable \({\textit{env}}\) as we did in the previous step.
We use a small set of monadic combinators, shown in Fig. 1, to build the translation in a more abstract way. The return and bind operators are standard for the reader monad. The parallel operator runs two monadic expressions concurrently, propagating the environment and the mutex. The \(\mathtt {atomic}\) combinator invokes a monadic expression with a fresh mutex. The \(\mathtt {atomic\_env}\) combinator atomically executes its body with the current environment as an argument. The \(\mathtt {run}\) function executes the monadic computation by instantiating it with a fresh mutex and a new environment.
Selected clauses for the translation are presented in Fig. 2. The translation of the binary operations remains virtually unchanged, except for the usage of monadic parallel composition instead of the standard one. The translation for the assignment and the sequenced bind uses the \(\mathtt {atomic\_env}\) combinator for querying and updating the environment. We also have to adapt our translation of values, by wrapping it in \(\mathtt {ret}\).
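The combinators described above might be sketched in Python as follows, with a monadic computation represented as a function of the environment and the current mutex. This is an illustration under stated assumptions, not the definitions of Fig. 1: in particular, it assumes that `atomic` holds the current mutex while its body runs under the fresh one, and the helpers `then`, `step`, and `trace` are hypothetical names used only for the demonstration.

```python
import threading


def ret(v):
    return lambda env, lk: v


def bind(m, f):
    return lambda env, lk: f(m(env, lk))(env, lk)


def par(m1, m2):
    def go(env, lk):
        out = [None, None]

        def work(i, m):
            out[i] = m(env, lk)  # both sides share the environment and the mutex

        threads = [threading.Thread(target=work, args=(i, m)) for i, m in enumerate((m1, m2))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return (out[0], out[1])

    return go


def atomic(m):
    def go(env, lk):
        with lk:                              # assumption: hold the current mutex ...
            return m(env, threading.Lock())   # ... while the body uses a fresh one

    return go


def atomic_env(f):
    def go(env, lk):
        with lk:
            return f(env)

    return go


def run(m):
    return m(set(), threading.Lock())


# Demonstration: two atomically executed "function bodies" never interleave,
# and a nested call inside one of them does not deadlock.
trace = []


def step(label):
    return lambda env, lk: trace.append(label)  # records one "instruction"


def then(m1, m2):
    return bind(m1, lambda _: m2)


h_body = step("h")
f_body = then(step("f"), then(atomic(h_body), step("f")))
g_body = then(step("g"), step("g"))

run(par(atomic(f_body), atomic(g_body)))
```

After the run, `trace` holds either all steps of the first body followed by all steps of the second, or the reverse. Reusing one non-reentrant mutex for the nested call would block forever, which is why each `atomic` hands a fresh mutex to its body.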
A global function definition is translated as a top-level let-binding. A function call is then just an atomically executed function invocation in HeapLang, modulo the fact that the function pointer and the arguments are computed in parallel. In addition, sequence points occur at the beginning of each function call and at the end of each function body [22, Annex C], and we reflect that in our translation by clearing the environment at the appropriate places.
Our semantics by translation can easily be extended to cover other features of C, e.g., a more advanced memory model (see Sect. 7). However, the fragment presented here already illustrates the challenges that non-determinism and sequence point violations pose for verification. In the next section we describe a logic for reasoning about the semantics by translation given in this section.