Correct program parallelisations

A commonly used approach to develop deterministic parallel programs is to augment a sequential program with compiler directives that indicate which program blocks may potentially be executed in parallel. This paper develops a verification technique to reason about such compiler directives, in particular to show that they do not change the behaviour of the program. Moreover, the verification technique is tool-supported and can be combined with proving functional correctness of the program. To develop our verification technique, we propose a simple intermediate representation (syntax and semantics) that captures the main forms of deterministic parallel programs. This language distinguishes three kinds of basic blocks: parallel, vectorised and sequential blocks, which can be composed using three different composition operators: sequential, parallel and fusion composition. We show how a widely used subset of OpenMP can be encoded into this intermediate representation. Our verification technique builds on the notion of iteration contract to specify the behaviour of basic blocks; we show that if iteration contracts are manually specified for single blocks, then that is sufficient to automatically reason about data race freedom of the composed program. Moreover, we also show that it is sufficient to establish functional correctness on a linearised version of the original program to conclude functional correctness of the parallel program. Finally, we exemplify our approach on an example OpenMP program, and we discuss how tool support is provided.


Introduction
A common approach to handle the complexity of parallel programming is to write a sequential program augmented with parallelisation compiler directives that indicate which part of the code might be parallelised. A parallelising compiler consumes the annotated sequential program and automatically generates a parallel version. This parallel programming approach is often called deterministic parallel programming, as the parallelisation of a deterministic sequential program augmented with correct compiler directives is always deterministic. Deterministic parallel programming is supported by different languages and libraries, such as, for example, OpenMP [20], and is often used for financial and scientific applications (see e.g. [4,11,17,21]).
Although it is relatively easy to write parallel programs in this way, careless use of compiler directives can easily introduce data races and consequently non-deterministic program behaviour. This paper proposes a tool-supported static verification technique to prove that parallelisation as indicated by the compiler directives does not introduce such non-determinism. Our technique is not fully automatic: the user has to add some additional annotations, and verification of these annotations gives the guarantee that program behaviour is not changed by the compiler directives. Moreover, we also show that it is sufficient to prove functional correctness on a sequential version of the program, in order to conclude functional correctness of the parallel program. We develop a verification technique to reason about data race freedom and functional correctness on an intermediate representation language, called PPL (for Parallel Programming Language), which captures the core features of deterministic parallel programming. We then show that a commonly used subset of a deterministic programming language such as OpenMP can be encoded into this intermediate representation, and thus, our verification technique allows us to reason about the correctness of compiler directives in OpenMP. The verification technique is implemented as part of our program verifier VerCors. That means, if we (manually) annotate an OpenMP program with specifications, data race freedom and functional correctness can be verified automatically. We illustrate this approach on some characteristic examples.
In essence, our intermediate representation language PPL is defined in terms of the composition of code blocks. We identify three kinds of basic blocks: a parallel block, a vectorised block and a sequential block. Basic blocks are composed by three binary block composition operators: sequential composition, parallel composition and fusion composition where the fusion composition allows two parallel basic blocks to be merged into one. An operational semantics for PPL is presented.
Our verification technique requires that users specify each basic block by an iteration contract that describes which memory locations are read and written by a thread. We introduce these contracts and present verification rules for basic blocks. Moreover, the program itself can be specified by a global contract. To verify the global contract, we show that the block compositions are memory safe (i.e. data race free) by proving that for all the iterations that might run in parallel, all accesses to shared memory are non-conflicting, meaning that they are disjoint or they are read accesses. If all block compositions are memory safe, then it is sufficient to prove that the sequential composition of all the basic blocks w.r.t. program order is memory safe and functionally correct to conclude that the parallelised program is functionally correct.
The main contributions of this paper are the following:
- An intermediate representation language PPL that captures the core features of deterministic parallel programming, with a suitable operational semantics.
- An algorithm that encodes a commonly used subset of OpenMP into its PPL intermediate representation.
- A tool-supported verification approach for reasoning about data race freedom and functional correctness of OpenMP programs by using the encoding of OpenMP into PPL.
This paper is an extended version of our paper presented at NFM 2017 [12]. In addition, it contains (1) a rephrasing of the verification rules for parallel and vectorised loops, presented at FASE 2015 [5] in the setting of PPL, i.e. rephrasing them for basic blocks, and (2) an algorithm that encodes a commonly used subset of OpenMP into PPL. This paper is organised as follows. After some background information on OpenMP and our program specification language, Sect. 3 introduces our intermediate representation language PPL, presenting syntax and semantics. Then, Sect. 4 shows how OpenMP programs are encoded into PPL. Section 5 presents the verification rules for basic blocks, while Sect. 6 presents the verification rules for block compositions. Section 7 provides more information on how the tool support is provided, while Sect. 8 uses our technique on an OpenMP program. Finally, Sect. 9 presents related work, and Sect. 10 concludes the paper and discusses future work.

Background
This section provides some background information on the OpenMP compiler directives and briefly introduces syntax and semantics of our program specification language.

OpenMP
As mentioned above, in this paper we consider a frequently used subset of OpenMP constructs, using only the following pragmas: omp parallel, omp for, omp simd, omp for simd, omp sections, and omp single, as well as all allowed clauses. We illustrate these OpenMP features by means of examples. For full details on OpenMP, we refer to [20]. Later, Sect. 4 shows how programs in this subset are encoded into our core parallel programming language, and Sect. 8 shows how to verify that these programs can safely be parallelised, after the user has added the necessary program contracts. Figure 1 presents a sequential C program augmented by OpenMP compiler directives (called pragmas). The pivotal parallelisation annotation in OpenMP is omp parallel which denotes a parallelisable code block (called parallel region). Threads are forked upon entering a parallel region and joined back into a single thread at the end of the region.

Example 1
This example shows a parallel region with three for-loops L1, L2, and L3. The loops are marked as omp for meaning that they are parallelisable (i.e. their iterations are allowed to be executed in parallel). To precisely define the behaviour of threads in the parallel region, omp for annotations are extended by clauses. For example, the combined use of the nowait and schedule(static) clauses indicates the fusion of the parallel loops L1 and L2, meaning that the corresponding iterations of L1 and L2 are executed by the same thread without waiting. The clause nowait implies that the implicit barrier at the end of the omp for construct is removed.
In OpenMP, all variables which are not local to a parallel region are considered as shared by default unless they are explicitly declared as private (using the private clause) when they are passed to a parallel region.
Since OpenMP 4.0, support for the single instruction multiple data (SIMD) execution model has been added to the OpenMP standard. The SIMD execution model is a well-known technique to speed up vector arithmetic, specifically in scientific applications. Figure 2 presents an OpenMP example to illustrate this. The first loop uses the omp simd annotation to vectorise the for-loop L1, which partitions the iterations of the loop into smaller chunks, where the size of each chunk is equal to the vectorisation size given by the extra clause simdlen (i.e. M in this example). The loop execution is defined as the sequential execution of chunks, where each chunk is executed in a vectorised fashion.
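The chunked execution order described above can be sketched as follows. This is a minimal model in Python, not OpenMP itself: the names simd_chunks and run_simd are hypothetical, the last chunk is assumed to be shorter when the iteration count is not a multiple of the vectorisation size, and the vectorised execution of a chunk is only modelled by visiting each of its indices.

```python
from typing import Callable, List


def simd_chunks(n: int, m: int) -> List[List[int]]:
    """Partition iterations 0..n-1 into consecutive chunks of at most m iterations."""
    if m <= 0:
        raise ValueError("vectorisation size must be positive")
    return [list(range(start, min(start + m, n))) for start in range(0, n, m)]


def run_simd(n: int, m: int, body: Callable[[int], None]) -> None:
    """Model of a loop marked omp simd simdlen(m): chunks run one after another."""
    for chunk in simd_chunks(n, m):
        # A real implementation would execute the chunk as one vector operation;
        # this model simply applies the loop body to every index in the chunk.
        for i in chunk:
            body(i)


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
c = [0] * 5


def add_body(i: int) -> None:
    c[i] = a[i] + b[i]


run_simd(5, 2, add_body)
```

For the omp for simd form, the same chunks would be handed to different threads instead of being executed one after another.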

Example 2
The second for-loop (L2) shows the other form of OpenMP vectorisation using the omp for simd annotation. In this case, the loop execution is defined similarly; however, the iteration chunks are executed in parallel rather than sequentially. Figure 3 visualises the execution of these loops.
Figure 4 presents how the parallel execution of two parallel regions is defined in OpenMP. The example consists of three parallel regions: P1 in lines 4-11, P2 in lines 14-23 and P3 in lines 26-29. Similar to the previous examples, the behaviour of each thread is defined by further OpenMP compiler directives. We use the omp sections annotation, which defines the blocks of the code (marked by omp section) which are executed in parallel. For example, two threads are forked upon entering the parallel region P1: one executes the method add and the other one executes the method mul. Note that the bodies of the methods are also parallel regions. Therefore, the threads executing the add and mul methods fork more threads upon entering the parallel regions P2 and P3. The parallel region P2 is a fusion and the parallel region P3 is a single parallel loop, where omp parallel for is a shorthand for an omp parallel with a single omp for.
Figure 5 shows an OpenMP program using incorrect compiler directives, which results in data races. As there is a data dependence between the two loops, we need a barrier between them when we parallelise the loops. However, the clause schedule(static) nowait explicitly removes the barrier, which results in an erroneous parallelisation. Using our approach, as a user has to specify iteration contracts for the two loops, we can detect that parallelisation of this program would lead to data races.

Program specifications: syntax and semantics
Our program specification language is based on permission-based separation logic, combined with the look-and-feel of the Java Modeling Language (JML) [18]. In this way, we exploit the expressiveness and readability of JML, while using the power of separation logic to support thread-modular reasoning. We briefly explain the syntax and semantics of the permission-based separation logic formulas and how they extend the standard JML program annotations in first-order logic.
Syntax. Threads hold permissions to access memory locations. Permissions are encoded by fractional values, as introduced by Boyland [9]: any fraction in the interval (0, 1) denotes a read permission, while 1 denotes a write permission. Permissions can be split and combined, but soundness of the logic ensures that for every memory location the total sum of permissions over all threads to access this location does not exceed 1. This guarantees that if the permission specifications can be verified, the program is data-race-free. The set of permissions that a thread holds is typically called its resources.
Formulas F in our program specification language are built from first-order logic formulas b, permission predicates Perm(e1, e2), conditional expressions (·?· : ·), separating conjunction ⋆, and universal separating conjunction ⊛ over a finite set I. The syntax of formulas is formally defined as follows:

$$
\begin{array}{rcl}
F & ::= & b \mid \mathit{Perm}(e_1, e_2) \mid b \,?\, F : F \mid F \star F \mid \circledast_{i \in I} F(i) \\
e & ::= & [e] \mid v \mid n \mid \dots
\end{array}
$$

where b is a side-effect free Boolean expression, e is a side-effect free arithmetic expression, [.] is a unary dereferencing operator (thus [e] returns the value stored at the address e in shared memory), v ranges over variables and n ranges over numerals. We assume the first argument of the Perm(e1, e2) predicate is always an address and the second argument is a fraction. For convenience, we often use the keyword read instead of an explicit fraction to specify an arbitrary read permission, and the keyword write instead of 1 to denote a write permission. We use the array notation a[e] as syntactic sugar for [a+e] where a is a variable containing the base address of the array a and e is the subscript expression; together they point to the address a + e in shared memory.
Semantics. Our semantics mixes concepts of implicit dynamic frames [25] and separation logic with fractional permissions, which makes it different from the traditional separation logic semantics and more aligned towards the way separation logic is implemented using traditional first-order logic tooling. For further reading on the relationship between separation logic and implicit dynamic frames, we refer to the work of Parkinson and Summers [22].
To define the semantics of formulas, we assume the existence of the following domains: Loc, the set of memory locations, VarName, the set of variable names, Val, the set of all values, including memory locations, and Frac, the set of fractions ([0, 1]).
We define memory as a map from locations to values h : Loc → Val. A memory mask is a map from locations to fractions π : Loc → Frac, with unit element π0 : l ↦ 0 with respect to the point-wise addition of memory masks. A store is a function from variable names to values: σ : VarName → Val.
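As a small illustration of these definitions, a memory mask can be modelled as a finite map with point-wise addition. The sketch below is in Python and only approximates the formal domains: locations are plain integers, locations absent from a dictionary are taken to hold fraction 0, and the names Mask, add_masks, is_valid and EMPTY are hypothetical.

```python
from fractions import Fraction
from typing import Dict

Mask = Dict[int, Fraction]  # location -> fraction; absent locations hold 0


def add_masks(p: Mask, q: Mask) -> Mask:
    """Point-wise addition of two memory masks."""
    return {l: p.get(l, Fraction(0)) + q.get(l, Fraction(0)) for l in set(p) | set(q)}


def is_valid(p: Mask) -> bool:
    """A mask is well formed if every fraction lies in the interval [0, 1]."""
    return all(0 <= f <= 1 for f in p.values())


EMPTY: Mask = {}  # plays the role of the unit element pi_0

half = {100: Fraction(1, 2)}      # a read permission on location 100
full = add_masks(half, half)      # two halves combine into a write permission
```

Adding a further fraction to full would push the total for location 100 above 1, which is_valid rejects.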
Formulas can access the memory directly; the fractional permissions to access the memory are provided by the Perm predicate. A strict form of self-framing is enforced, meaning that the Boolean formulas expressing the functional properties in pre- and postconditions and invariants should be framed by sufficient resources (i.e. there should be sufficient access permissions for the memory locations that are accessed by the Boolean formula, in order to evaluate this formula).
The semantics of an expression e depends on a store σ, a memory h, and a memory mask π and yields a value: σ, h, π [e⟩ v. The store σ and the memory h are used to determine the value v, and the memory mask π is used to determine if the expression is correctly framed, i.e. sufficient access permissions are available. For example, the rule for array access is:

$$
\frac{\sigma, h, \pi \,[e\rangle\, i \qquad \pi(\sigma(a) + i) > 0}{\sigma, h, \pi \,[a[e]\rangle\, h(\sigma(a) + i)}
$$

where σ(a) is the initial address of array a in the memory and i is the array index that results from evaluating the index expression e. Apart from the check for correct framing as explained above, the evaluation of expressions is standard and we do not explain it any further.
The semantics of a formula F, given in Fig. 6, depends on a store, a memory, and a memory mask and yields a memory mask: σ, h, π [F⟩ π′. The given mask π denotes the permissions by which the formula F is framed. The yielded mask π′ denotes the additional permissions provided by the formula. Thus, a Boolean expression is valid if it is true and yields no additional permissions (rule Boolean), while evaluating a Perm(e1, e2) predicate yields additional permissions to the location, provided the expressions e1 and e2 are properly framed (rule Permission). Note that evaluation of expression e1 results in a location l, while evaluation of expression e2 results in a fraction f. The rule checks that the permissions already held on location l plus the additional fraction f do not exceed 1. The rules for evaluation of a conditional formula are standard (rules Cond 1 and Cond 2).
We overload standard addition +, summation Σ, and comparison operators to be, respectively, used as pointwise addition, summation and comparison over the memory masks. These operators are used in the rules SepConj and USepConj. In the rule SepConj, the formulas F1 and F2 each yield a separate memory mask, π′ and π″, respectively, and the final memory mask is calculated by pointwise addition of the two memory masks, π′ + π″. The rule checks if F1 is framed by π and F2 is framed by π + π′. Note that since F2 is framed by π + π′, this implicitly guarantees that the permissions per location never exceed 1. Finally, the rule USepConj extends the similar evaluation by quantifying over a set of formulas conjoined by the universal separating conjunction operator. Again, rule USepConj checks that the permission fractions on any location in the memory cannot exceed 1.
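The way the rules thread the memory mask through a formula can be sketched with a tiny evaluator. This is an illustrative model in Python under simplifying assumptions, not the semantics of Fig. 6 itself: the store is omitted, locations are integers, Boolean expressions are Python callables that read the memory through a checked read function, and all names (Unsatisfied, boolean, perm, sep_conj) are hypothetical.

```python
from fractions import Fraction
from typing import Callable, Dict

Mask = Dict[int, Fraction]   # location -> fraction; missing locations hold 0
Heap = Dict[int, int]        # location -> value
Formula = Callable[[Heap, Mask], Mask]


class Unsatisfied(Exception):
    """The formula is false, badly framed, or claims more than 1 on a location."""


def plus(p: Mask, q: Mask) -> Mask:
    """Point-wise addition of memory masks."""
    return {l: p.get(l, Fraction(0)) + q.get(l, Fraction(0)) for l in set(p) | set(q)}


def boolean(test: Callable[[Callable[[int], int]], bool]) -> Formula:
    """Boolean formula: must hold, and yields no additional permissions."""
    def ev(h: Heap, pi: Mask) -> Mask:
        def read(l: int) -> int:
            # Framing check: reading l needs a non-zero fraction in the given mask.
            if pi.get(l, Fraction(0)) <= 0:
                raise Unsatisfied(f"location {l} is read without permission")
            return h[l]
        if not test(read):
            raise Unsatisfied("Boolean formula is false")
        return {}
    return ev


def perm(l: int, f: Fraction) -> Formula:
    """Permission predicate: yields fraction f on l if the total stays within 1."""
    def ev(h: Heap, pi: Mask) -> Mask:
        if pi.get(l, Fraction(0)) + f > 1:
            raise Unsatisfied(f"permissions on location {l} would exceed 1")
        return {l: f}
    return ev


def sep_conj(f1: Formula, f2: Formula) -> Formula:
    """Separating conjunction: f1 is framed by pi, f2 by pi plus what f1 yields."""
    def ev(h: Heap, pi: Mask) -> Mask:
        p1 = f1(h, pi)
        p2 = f2(h, plus(pi, p1))
        return plus(p1, p2)
    return ev


heap = {0: 7}
framed = sep_conj(perm(0, Fraction(1, 2)), boolean(lambda read: read(0) == 7))
yielded = framed(heap, {})   # the Boolean part is framed by the preceding Perm
```

Evaluating the Boolean part on its own with an empty mask raises Unsatisfied, which mirrors the self-framing requirement.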
Finally, a formula F is valid for a given store σ, memory h and memory mask π if, starting with the empty memory mask π0, the memory mask required by F is less than or equal to π:

$$
\sigma, h, \pi \models F \iff \exists \pi'.\; \sigma, h, \pi_0 \,[F\rangle\, \pi' \,\wedge\, \pi' \le \pi
$$

Figure 7 presents an example of how we annotate a sequential program using our specification language. The formulas in the annotations are interpreted using the semantics as defined in Fig. 6. The program logic rules are […] This sequential program has a loop (lines 11-17) that adds the corresponding elements of two arrays (named a and b) and stores the result in a different array (named c) in line 17. Annotations are provided to give a function specification (lines 1-7) and a loop invariant (lines 12-16). Note that \forall* indicates universal separating conjunction, ⊛ over i ∈ I, over permission predicates and \forall denotes standard universal conjunction over logical predicates. Preconditions and postconditions, using keywords requires and ensures (lines 3-6), should hold at the beginning and the end of the function, respectively. We use the keyword context to abbreviate a combined requires and ensures clause. This is convenient, because permission pre- and postconditions are often the same. The keyword context_everywhere is used to specify an invariant property (lines 1-2) that must hold throughout the function. As pre- and postcondition, we have read permissions over all elements in arrays a and b (lines 3-4) and write permissions over all elements in array c (line 5). The loop invariant specifies the permissions that are used in the loop (lines 12-14). Further, the loop invariant specifies that when iteration i starts, we have added the elements from a and b from the beginning up to location i − 1 (line 15). Therefore, at the end of the loop (and the function), we have added all elements (specified as a postcondition in line 6).

[…] where N is a positive integer variable that denotes the number of parallel threads, i.e. the block's parallelisation level, S is a sequence of statements and V is a sequence of guarded assignments b ⇒ assg.

Syntax
In the grammar, we define a vectorised block at a different level than the other basic blocks, because this allows us to define the semantics in a more convenient way, while it does not prevent us from writing programs such as the parallel or fusion composition of a parallel and a vectorised block.
We assume a restricted syntax for fusion composition such that its operands are parallel basic blocks with the same parallelisation levels. This is checked by an extra well-formedness condition over PPL programs. Each basic block has a local read-only variable tid ∈ [0..N) called thread identifier, where N is the block's parallelisation level. We (ab)use the term iteration to refer to the computations of a single thread in a basic block. So a parallel or vectorised block with parallelisation level N has N iterations. For simplicity, but without loss of generality, threads have access to a single shared array which we refer to as heap. We assume all memory locations in the heap are allocated initially. A thread may update its local variables by performing a local computation (v := e), or by reading from the heap (v := mem(e)). A thread may update the heap by writing the value of one of its local variables to it (mem(e) := v). For arrays, we use the notation a[e] as syntactic sugar for [a+e] where a is a variable containing the base address of the array a and e is the subscript expression.
Figure 9, lines 1 and 2, contains a PPL expression that captures the program in lines 4-13. In this example, the two basic blocks are composed using parallel composition (||). Figure 10 shows another example of a PPL expression and its corresponding OpenMP program where the basic parallel and vectorised blocks are composed sequentially (lines 1-3). Note that tid1 refers to the thread identifier of the parallel block, while tid2 refers to the thread identifier of the vectorised block.

Semantics
The behaviour of PPL programs is described using a small step operational semantics. For a convenient and understandable definition, the operational semantics is defined in several layers, as defined below. Throughout, we assume the existence of the finite domains:
- VarName, the set of variable names,
- Val, the set of all values, which includes the memory locations,
- Loc, the set of memory locations, and
- [0..N) for thread identifiers.
The Init state consists of a block statement P. The ParC state consists of two block states, while the SeqC state contains a block state and a block statement P; they capture all the states that a parallel composition and a sequential composition of two blocks might be in, respectively. The basic block state Par captures all the states that a parallel basic block Par (N) S might be in during its execution. It contains a mapping LS ∈ [0..N) → LocalState, which maps each thread to its local state, to model the parallel execution of the threads. There are three kinds of local states: a vectorised state Vec, a sequential state Seq, and a terminated sequential state Done.
$$
\begin{array}{rcll}
LS \in \mathit{LocalState} & \triangleq & \mathit{Vec}(\Sigma, E, V, \sigma, S) & \text{vectorised basic block states} \\
 & \mid & \mathit{Seq}(\sigma, S) & \text{sequential basic block states} \\
 & \mid & \mathit{Done} & \text{terminated sequential basic block states}
\end{array}
$$

The Vec block state captures all states that a vectorised basic block Vec (N) V might be in during its execution. It consists of Σ ∈ [0..N) → PrivateMem, which maps each thread to its private memory, the body to be executed V, a private memory σ, and a statement S. As vectorised blocks may appear inside a sequential block, keeping σ and S allows continuation of the sequential basic block after termination of the vectorised block. To model vectorised execution, the state contains an auxiliary set E ⊆ [0..N) that models which threads have already executed the current instruction. Only when E equals [0..N) is the next instruction ready to be executed. Finally, the Seq block state consists of private memory σ and a statement S.
To simplify our notation, each thread receives a copy of the program store as part of its private memory when it initialises. This is captured in rules Init Par and Init Seq (Fig. 11), where the local store γ is passed as an argument to the Seq block state.
Operational Semantics. The operational semantics is defined as a transition relation between program states, →_p ⊆ (BlockState × Store × SharedMem) × (BlockState × Store × SharedMem) (Fig. 11), using an auxiliary transition relation between thread local states (Fig. 12) and a standard transition relation →_assg ⊆ (PrivateMem × S × SharedMem) × (PrivateMem × SharedMem) to evaluate assignments (Fig. 13). The semantics of expression e and Boolean expression b over private memory σ, written E⟦e⟧σ and B⟦b⟧σ, respectively, is standard and not discussed any further. We use the standard notation for function update.
As mentioned, the main transition relation between program states is defined in Fig. 11. Program execution starts in a program state (Init(P), γ, h) where P is the program's entry block. Depending on the form of P, a transition is made into an appropriate block state, leaving the heap unchanged (see rules Init ParC, Init SeqC, Init Fuse, Init Par and Init Seq).
The evaluation of a ParC state non-deterministically evaluates one of its block states (i.e. EB1 or EB2), until both blocks are done (rule ParC Done).
Evaluation of a sequential block is done by evaluating the local state. The evaluation of a SeqC state evaluates its block state EB step by step. When this evaluation is done, evaluation of the subsequent block is initialised.
Rule Lift Seq captures that evaluation of a thread local state is defined in terms of the local thread execution (as defined in Fig. 12). When the local thread state is fully evaluated, this results in a terminated block state (rule Local Done).
The evaluation of a parallel basic block is defined by the rules Par Step and Par Done. To allow all possible interleavings of the threads in the block's thread pool, each thread has its own local state LS, which can be executed independently, modelled by the mapping LS. A thread in the parallel block terminates if there are no more statements to be executed and a parallel block terminates if all threads executing the block are terminated.
The evaluation of a sequential basic block's statements as defined in Fig. 12 is standard, except when it contains a vectorised block. The evaluation of a vectorised block (Fig. 12) is done in lock-step, i.e. all threads execute the same instruction and no thread can proceed to the next instruction until all are done, meaning that they all share the same program counter. As explained, we capture this by maintaining an auxiliary set, E, which contains the identifiers of the threads that have already executed the vector instruction (i.e. the guarded assignment b ⇒ assg). When a thread executes a vector instruction, its thread identifier is added to E (rule Vec Step). The semantics of a vector instruction (i.e. a guarded assignment) is the semantics of the assignment if the guard evaluates to true, and it does nothing otherwise. When all threads have executed the current vector instruction, the condition E = dom(Σ) holds, and execution moves on to the next vector instruction of the block (with an empty auxiliary set) (rule Vec Sync). The semantics of assignments as defined in Fig. 13 is standard and does not require further discussion.
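The role of the auxiliary set E can be illustrated with a small simulation of lock-step execution. The sketch below is written in Python and is only a model under stated assumptions: guarded assignments are pairs of Python callables over a thread identifier and a shared list, the non-deterministic choice of the next thread is made with a seeded random generator, and the name run_vec is hypothetical.

```python
import random
from typing import Callable, List, Tuple

# A guarded assignment b => assg, modelled as (guard, assignment) callables.
Guarded = Tuple[Callable[[int, list], bool], Callable[[int, list], None]]


def run_vec(n: int, body: List[Guarded], mem: list, seed: int = 0) -> List[Tuple[int, int]]:
    """Run body on n threads in lock-step; return the trace of (instruction, tid) steps."""
    rng = random.Random(seed)
    all_threads = set(range(n))
    trace = []
    for pc, (guard, assign) in enumerate(body):
        executed = set()                      # the auxiliary set E, empty at the start
        while executed != all_threads:        # some thread still has to take its step
            tid = rng.choice(sorted(all_threads - executed))
            if guard(tid, mem):               # a false guard turns the step into a no-op
                assign(tid, mem)
            executed.add(tid)
            trace.append((pc, tid))
        # E now covers all threads, so execution moves on to the next instruction.
    return trace


def double_into_upper(tid: int, mem: list) -> None:
    mem[4 + tid] = mem[tid] * 2


def copy_left_neighbour(tid: int, mem: list) -> None:
    mem[tid] = mem[4 + tid - 1]


program = [
    (lambda tid, mem: True, double_into_upper),
    (lambda tid, mem: tid > 0, copy_left_neighbour),
]
memory = [1, 2, 3, 4, 0, 0, 0, 0]
steps = run_vec(4, program, memory)
```

Because no thread starts the second instruction before all threads have finished the first, the final contents of memory do not depend on the order chosen within an instruction.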

Encoding OpenMP into PPL
In order to show that PPL indeed captures the core of deterministic parallel programming languages, this section shows how a widely used subset of OpenMP can be encoded into PPL. Figure 14 defines a grammar which captures a commonly used subset of OpenMP [2]. This grammar defines the OpenMP programs that can be encoded into PPL (and thus can be verified using the verification technique presented below).

Subset of OpenMP
Our grammar supports the following OpenMP annotations: omp parallel, omp for, omp simd, omp for simd, omp sections, and omp single. Every program is a finite and non-empty list of Jobs enclosed by omp parallel. The body of omp for, omp simd, and omp for simd is a for-loop.

OpenMP to PPL encoding
This section discusses the encoding of OpenMP programs that can be derived from the grammar in Fig. 14 into PPL. The encoding algorithm is presented in Fig. 15 in a functional programming-like style.
Lines 2 to 7 of the algorithm define syntactic macros for several program patterns, to improve readability of the algorithm. Note that in the macro ParVec, tid1 refers to the thread identifier of the parallel block, while tid2 refers to the thread identifier of the vectorised block. The algorithm consists of two steps: a recursive translate step, and a compose step. The translate step recursively encodes all Jobs into their equivalent PPL code blocks without caring about how they will be composed. Later, the compose step conjoins the translated code blocks together to build a PPL program.
The translation step is a map, which applies the function match to the list of input jobs and returns a list of equivalent PPL code blocks. The input jobs are encoded in the form (A, C) where A is an OpenMP annotation and C is a code block written in C. The translation returns a list of the form (P, [A]), where P is the PPL program corresponding to the C code, and [A] are the OpenMP annotations that are needed to decide how to combine this PPL block with the other code blocks. Notice that the resulting PPL program is not necessarily a single basic block. The function match works as follows:
- an OpenMP for annotation for a for-loop is translated into a parallel block;
- an OpenMP simd annotation for a for-loop is translated into a loop of vectorised statements (taking into account the simdlen(M) argument);
- an OpenMP for simd annotation for a for-loop is translated into a parallel composition of several vectorised statements (taking into account the simdlen(M) argument);
- an OpenMP sections annotation is translated into the parallel composition of the individual statements; and
- an OpenMP single annotation encodes the statements in the single block recursively.
The match function uses the function sec which recursively calls match on nested parallel blocks. A sequence of sequential statements with a contract is encoded as a parallel block with a single thread. Notice that in these cases, any nested OpenMP clauses are passed on; therefore, the match function returns a pair of a PPL program and a list of OpenMP annotations.
The compose step takes as its input a list of tuples in the form (P, [A]) (the output of the translate step); then it inserts appropriate PPL composition operators between adjacent program blocks in the list, provided certain conditions hold. To properly bind tuples to the composition operators, the operators are inserted in three individual passes, one pass for each composition operator, based on the binding precedence of the operators from high to low. Operator insertion is done by the function bundle (lines 40-44). In each pass, bundle consumes the input list recursively. Each recursive call takes the first two tuples of the list and inserts a composition operator if the tuples satisfy the conditions of the composition operator; otherwise, it moves one tuple forward and starts the same process again. Notice that ultimately the head of the list x is composed with the head of the recursive call, rather than with the second element of the list. This is sound, because the composition to be applied is determined locally, and is not affected by the compositions of the other blocks.
For each composition operator, the conditions are different. The conditions for fusion and parallel compositions are checked by the functions fusible and par_able, respectively. As explained in Sect. 2, fusion of two parallel loops L1 and L2 means that the corresponding iterations of L1 and L2 are executed by the same thread without waiting. Therefore, fusion composition is inserted between two consecutive tuples where:
- the annotation lists of both tuples are single-element lists containing an omp for annotation,
- the clauses of both annotations include schedule(static), and
- the clauses of […]
The parallel composition is inserted between any two tuples in the program where the clauses of the first tuple include a nowait. Otherwise, the sequential composition is inserted. The final outcome is a single merged tuple (P, [A]) where P is the result of the encoding and [A] can be eliminated.
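The compose step can be sketched as below. This Python sketch is not the algorithm of Fig. 15; it only follows the prose description and relies on several assumptions that the text does not confirm: the passes run in the order fusion, parallel, sequential; fusion additionally requires nowait on the first loop; a merged tuple keeps the annotations of its right operand; and PPL terms are kept as plain strings, with (+) standing for fusion. The class names Ann and Item are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Ann:
    kind: str                                    # "for", "simd", "sections", ...
    clauses: Set[str] = field(default_factory=set)


@dataclass
class Item:
    prog: str                                    # PPL term, kept as plain text here
    anns: List[Ann]


def fusible(x: Item, y: Item) -> bool:
    def static_for(t: Item) -> bool:
        return (len(t.anns) == 1 and t.anns[0].kind == "for"
                and "schedule(static)" in t.anns[0].clauses)
    # ASSUMPTION: the first loop must also carry nowait.
    return static_for(x) and static_for(y) and "nowait" in x.anns[0].clauses


def par_able(x: Item, y: Item) -> bool:
    return any("nowait" in a.clauses for a in x.anns)


def bundle(op: str, cond: Callable[[Item, Item], bool], items: List[Item]) -> List[Item]:
    if len(items) < 2:
        return list(items)
    x, y = items[0], items[1]
    rest = bundle(op, cond, items[1:])
    if cond(x, y):                               # decided locally on x and y ...
        z = rest[0]                              # ... but composed with the head of the recursive result
        # ASSUMPTION: the merged tuple keeps the annotations of its right operand.
        return [Item(f"({x.prog} {op} {z.prog})", z.anns)] + rest[1:]
    return [x] + rest


def compose(items: List[Item]) -> str:
    """Expects a non-empty list, as every program contains at least one Job."""
    # ASSUMPTION: passes run in the order fusion, parallel, sequential.
    for op, cond in (("(+)", fusible), ("||", par_able), (";", lambda x, y: True)):
        items = bundle(op, cond, items)
    return items[0].prog


loops = [
    Item("L1", [Ann("for", {"schedule(static)", "nowait"})]),
    Item("L2", [Ann("for", {"schedule(static)"})]),
    Item("L3", [Ann("for")]),
]
result = compose(loops)
```

In this made-up input, L1 and L2 are fused in the first pass, and the barrier that remains after L2 forces a sequential composition with L3.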

Example translations
To illustrate the encoding, we discuss the translation of two small OpenMP programs into PPL.

Example 7
To translate the OpenMP program in Fig. 1 […] Using the compose function on the list with these two pairs results in the following PPL program:

Verification of basic blocks
The first step of our verification technique deals with the verification of basic blocks. As mentioned above, there are three kinds of basic blocks: parallel, vectorised and sequential blocks. For each basic block, we specify an iteration contract, which is a contract for each thread executing in the block. Thus, for a sequential block, the iteration contract coincides with a standard block contract (as there is only one thread executing the block), while for parallel and vectorised blocks, the iteration contract specifies the behaviour of one single thread executed in parallel or in lock-step, respectively. We call this an iteration contract, as it corresponds to the specification of a single iteration of a parallelisable or vectorisable block.

Iteration contracts
An iteration contract consists of a resource contract rc(i) and a functional contract fc(i), where i is the block's iteration variable. A resource contract indicates the permissions to access memory locations and a functional contract is related to values in the memory locations. Both the resource contract and the functional contract consist of a precondition and a postcondition. We use P(i) to denote the functional precondition, and Q(i) to denote the functional postcondition. In case the resource pre- and postcondition are the same, we simply write rc(i); otherwise, we distinguish them by $rc_{pre}(i)$ and $rc_{post}(i)$. In an example iteration contract written in this notation, the first two lines show a resource contract and the last line indicates a functional contract. Note that ** is the ASCII notation for the separating conjunction $\star$.

Verification rules for basic blocks
As mentioned above, a sequential block is executed by a single thread, thus its iteration contract coincides with its block contract, and no special verification rule is needed.
Parallel basic blocks are verified by the rule ParBlock presented in Fig. 16, where S(i) is the body of the i-th iteration of the parallel basic block. This rule states that if each single thread respects its iteration contract, the contract for the basic block is composed by the universal separating conjunction of the iteration contract's precondition and postcondition, respectively. As the threads execute completely independently, there is no permission transfer, and the resource pre- and postcondition coincide. Notice further that soundness of this rule implies that all threads in a parallel block must be independent, because otherwise the universal separating conjunction would not be satisfiable.
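The satisfiability argument at the end of this paragraph can be made concrete with a small permission-accounting sketch in Python. It assumes, purely for illustration, that a resource contract is a map from heap locations to fractional permissions and that the universal separating conjunction is satisfiable exactly when the permissions on every location sum to at most 1; the function name and the two example loops are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def star_satisfiable(contracts):
    """True if the per-location permission sums never exceed 1."""
    total = defaultdict(Fraction)
    for rc in contracts:
        for loc, perm in rc.items():
            total[loc] += perm
    return all(p <= 1 for p in total.values())


N = 4
half = Fraction(1, 2)

# c[i] = a[i] + b[i]: write permission on c[i], read permissions on a[i], b[i].
independent = [{("c", i): Fraction(1), ("a", i): half, ("b", i): half}
               for i in range(N)]

# a[i] = a[i+1] + 1: write permission on a[i], read permission on a[i+1].
dependent = [{("a", i): Fraction(1), ("a", i + 1): half} for i in range(N)]

print(star_satisfiable(independent))  # prints True
print(star_satisfiable(dependent))    # prints False
```

In the second loop, iteration i reads the location that iteration i + 1 writes, so the permissions on a[i + 1] add up to 3/2 and no parallel block contract can be formed.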
For vectorised blocks, the ParBlock rule can be used in case there are no inter-iteration data dependencies. If there are inter-iteration data dependencies, we need to provide extra annotations that indicate how permissions are transferred inside the vectorised block. In a vectorised block, implicitly all threads synchronise between every instruction. During such a synchronisation, permissions may be transferred from the iteration containing the source of a dependence to the iteration containing the sink of that dependence. To specify such a transfer we introduce send and receive ghost statements. Remember that according to the PPL grammar, the body of a vectorised block is a sequence of guarded assignments b ⇒ assg. A guard $b_s(i)$ denotes the guard of statement s in iteration i.
A send annotation specifies that at label $L_s$, if a guard $b_s(i)$ is true, the permissions and properties denoted by formula φ are transferred to the statement labelled $L_r$ in iteration i + d, where i is the current iteration and d is the distance of the dependence. A receive annotation specifies that the permissions and properties denoted by formula ψ are received by the current iteration from iteration i − d. These annotations always come in pairs. In practice, the information provided by either the send or the receive annotation is sufficient to infer the other. Therefore, to reduce the annotation overhead, only one of them has to be provided by the developer. However, by providing them both, we make the specifications easier to understand. In order to verify this example, we need a proof rule for vectorised blocks, as well as for the send and receive ghost statements.

Example 10
The rule for the verification of vectorised blocks is given in Fig. 17. It is similar in spirit to the ParBlock rule, but does not require the resource pre- and postcondition to be the same.
The rules for the send and receive ghost statements are similar in spirit to the rules that are typically used for permission transfer upon lock acquisition and release (see e.g. [15]). In particular, send is used to give up resources that the receive acquires. This is captured by two proof rules (Eq. 1).
Receiving permissions and properties that were not sent is unsound. Therefore, send and receive annotations have to be properly matched, meaning that: (i) send and receive annotations always come in pairs; (ii) if the receive is enabled in iteration j, then d iterations earlier, the send should be enabled; and (iii) the information and resources received should be implied by those sent. In other words, the rules in Eq. 1 cannot be used unless the syntactic criterion (i) and the proof obligations (ii) and (iii) hold.
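The proof obligations (ii) and (iii) can be illustrated by a brute-force check over a fixed number of iterations. The Python sketch below is only an approximation of the logical obligations: it assumes that guards are executable predicates, that the transferred formulas are reduced to permission maps, and that "implied by" means pointwise no more permission received than sent. The function name matched and the example loop are hypothetical.

```python
from fractions import Fraction


def matched(n, d, send_guard, recv_guard, sent, received):
    """Check obligations (ii) and (iii) for iterations 0..n-1 and distance d."""
    for j in range(n):
        if not recv_guard(j):
            continue
        src = j - d
        # (ii): the send must be enabled d iterations earlier.
        if not (0 <= src < n and send_guard(src)):
            return False
        # (iii): what is received must be covered by what was sent.
        give, take = sent(src), received(j)
        if any(take[loc] > give.get(loc, Fraction(0)) for loc in take):
            return False
    return True


N, D = 5, 1
half = Fraction(1, 2)

# Vectorised body (illustrative): a[i] = b[i] + 1; if i > 0: c[i] = a[i-1] + 2
ok = matched(N, D,
             send_guard=lambda i: i < N - 1,
             recv_guard=lambda j: j > 0,
             sent=lambda i: {("a", i): half},
             received=lambda j: {("a", j - 1): half})

# Same annotations, but the send guard is too strong for the last receiver.
bad = matched(N, D,
              send_guard=lambda i: i < N - 2,
              recv_guard=lambda j: j > 0,
              sent=lambda i: {("a", i): half},
              received=lambda j: {("a", j - 1): half})

print(ok, bad)   # prints True False
```

Criterion (i) is syntactic and is not modelled here; the sketch simply takes one send/receive pair as given.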

Soundness
This section discusses the soundness of the proof rules ParBlock and VecBlock above. To show soundness of these rules, we have to show that in order to prove correctness of a parallel or vectorised block, it is sufficient to reason about the body of the block, and to prove independence or inter-iteration data dependence of that body. As always, the interpretation of a Hoare triple {P}S{Q} is the following: if the precondition P holds in a state s, and if execution of statement S from state s terminates in a state s′, then the postcondition Q holds in this state s′. As the proof rules are adapted from the proof rules for parallel and vectorised loops presented in [5], the soundness argument is also similar.
To construct the proof, we define the set of possible execution traces of atomic steps over the vectorised and parallel blocks. In addition, we also define the instrumented sequentialised execution traces for those blocks, which are the executions (1) if all iterations are executed in order and (2) such that validity of each iteration contract is checked for each separate iteration.
To prove soundness of the rule ParBlock, we show that all execution traces of this statement are equivalent to the instrumented sequentialised execution trace of the parallel block. To prove soundness of the rule VecBlock, we show that all execution traces of this statement are equivalent to the instrumented sequentialised execution trace of the vectorised block.
Functional equivalence of the two traces is shown by transforming the computations in one trace into the computations in the other trace by swapping adjacent independent execution steps.

Denotational semantics of blocks
To phrase the soundness proof, we prefer to use a denotational semantics for the parallel and vectorised blocks, where the semantic domain is a set of traces, seen as sequences of instructions. The denotational semantics that is defined in this section is equivalent to the operational semantics as defined in Sect. 3, but the proof is omitted from the paper. We develop our formalisation for non-nested blocks with K guarded statements. We instantiate the block body for each iteration of the block; thus, we have one labelled statement instance for each guarded statement in each iteration.
Definition 2 An execution trace c is a finite sequence $t_1, t_2, \dots, t_m$ of statement instances such that $t_1$ is executed first, then $t_2$ is executed and so on until the last statement $t_m$. We write $\epsilon$ for an empty execution trace.
To characterise the set of execution traces for parallel and vectorised blocks, we define auxiliary operators concatenation and interleaving.

Definition 3
The plain concatenation (++) operator takes two sets of execution traces and creates a new set that concatenates all execution traces in the first set with all execution traces in the second set.

Definition 4 The synchronised concatenation (#) operator inserts a barrier b between the execution traces.
The intuition here is that the insertion of a barrier b indicates an implicit synchronisation point. When defining the interleaving of traces, the barrier restricts what interleavings are possible.
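The two concatenation operators can be sketched directly on sets of traces. The following Python fragment is an illustration under the assumption that a trace is a tuple of statement instances and that the barrier is a distinguished element named BARRIER; both names and the tuple encoding are choices made for this sketch only.

```python
BARRIER = "b"   # assumed marker for the implicit synchronisation point


def concat(c1, c2):
    """Plain concatenation (++): every trace of c1 followed by every trace of c2."""
    return {t1 + t2 for t1 in c1 for t2 in c2}


def sync_concat(c1, c2):
    """Synchronised concatenation (#): as concat, with a barrier in between."""
    return {t1 + (BARRIER,) + t2 for t1 in c1 for t2 in c2}


first = {("s1",), ("s2",)}
second = {("t",)}
print(sorted(concat(first, second)))       # prints [('s1', 't'), ('s2', 't')]
print(sorted(sync_concat(first, second)))  # prints [('s1', 'b', 't'), ('s2', 'b', 't')]
```

Because the barrier is part of the resulting trace, a later interleaving step can refuse to move statements across it.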
We lift concatenation to multiple sets of execution traces.
Next, interleaving defines how to weave several execution traces into a single execution trace. This uses a happens-before order <, in order not to violate restrictions imposed by the program semantics. This happens-before order < is defined such that it maintains program order (PO), i.e. it maintains the order of statements executed by the same thread, and it also maintains synchronisation order (SO), i.e. it maintains the order between a barrier and the statements preceding and following it.
To define the interleaving operator (Interleave), we first define an auxiliary operator $\mathit{Interleave}_i$ that denotes interleaving with a fixed first statement s of thread i. If the complete execution trace of thread i has been interleaved, there are two possible cases. If all other threads are also done, then this returns an empty execution trace (as a base case). If any other thread can still take a step, then this call for thread i returns an empty set of interleavings. If thread i has a non-empty execution trace to interleave, i.e. it is of the form $s_1 \cdot c_i$, then we obtain all interleavings that start with $s_1$, extended with the (recursive) interleaving of all other execution traces and the remainder $c_i$ of this execution trace. Note that this extension is only allowed if it does not violate the happens-before order <. Next we define the full interleaving operator, which basically considers all interleavings for all threads.
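The recursive structure of the interleaving operator can be sketched in Python. The sketch covers only the program-order part of the happens-before order <: it has no barriers, so it corresponds to the situation of a parallel block. The function name interleave and the encoding of traces as tuples are assumptions of this illustration.

```python
def interleave(traces):
    """All interleavings of the given traces that preserve program order."""
    traces = list(traces)
    if all(len(t) == 0 for t in traces):
        return {()}                      # base case: every thread is done
    result = set()
    for i, t in enumerate(traces):
        if not t:
            continue                     # thread i is done: contributes nothing
        # Fix the first statement of thread i, then interleave the rest.
        rest = traces[:i] + [t[1:]] + traces[i + 1:]
        for tail in interleave(rest):
            result.add((t[0],) + tail)
    return result


thread0 = (("s1", 0), ("s2", 0))
thread1 = (("s1", 1), ("s2", 1))
all_traces = interleave([thread0, thread1])
print(len(all_traces))   # prints 6
```

Program order is preserved by construction, since a statement of a thread only becomes available once its predecessor in the same thread has been placed.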
Now we can define the denotational semantics of parallel and vectorised blocks. The semantics of a parallel block is any interleaving of all statement instances that preserves the program order PO. The semantics of a vectorised block is any interleaving of the synchronised concatenation of the execution traces of the individual traces, thus with an implicit barrier added after the execution steps of each statement instance. Formally, these are defined as follows.

Definition 6
The denotational semantics of a vectorised block is defined in terms of synchronised concatenation and interleaving, as described above.
Next, we define the sequentialised execution trace of a parallel or vectorised block. This is the sequential execution of all iterations in the block.

Definition 7 The sequential execution trace of a parallel and vectorised block is
Finally, we define the instrumented sequentialised execution trace of a parallel or vectorised block. This is the sequential execution of all iterations, where in addition all preconditions and postconditions are checked. Below we will show that all parallel and vectorised execution traces are equivalent to this instrumented sequentialised execution trace.

Definition 8 The instrumented sequentialised execution traces of a parallel and vectorised block are
where Assert checks the pre- and postcondition before and after each iteration. If the asserted property φ holds, Assert φ behaves as a skip; otherwise, it aborts (i.e. there is no execution). Note that the sequential execution trace is in happens-before order.
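The behaviour of the instrumented sequentialised execution can be mimicked by a short Python sketch. It assumes, for illustration only, that the program state is a dictionary, that pre- and postconditions are executable predicates, and that aborting is modelled by returning None; run_instrumented and the example loop are hypothetical names.

```python
def run_instrumented(n, pre, body, post, state):
    """Run iterations 0..n-1 in order, asserting pre/post around each one."""
    state = dict(state)
    for i in range(n):
        if not pre(i, state):
            return None          # Assert fails: there is no execution
        body(i, state)
        if not post(i, state):
            return None
    return state


def body(i, s):
    s[("c", i)] = s[("a", i)] + s[("b", i)]


def post(i, s):
    return s[("c", i)] == s[("a", i)] + s[("b", i)]


N = 3
initial = {**{("a", i): i for i in range(N)},
           **{("b", i): 10 for i in range(N)},
           **{("c", i): 0 for i in range(N)}}

final = run_instrumented(N, lambda i, s: True, body, post, initial)
print([final[("c", i)] for i in range(N)])   # prints [10, 11, 12]

aborted = run_instrumented(N, lambda i, s: s[("b", i)] > 10, body, post, initial)
print(aborted)                               # prints None
```

A successful run behaves exactly like the plain sequential execution, which is why the assertions can be seen as skips when all contracts hold.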

Correctness of parallel blocks
In the previous section, we defined a denotational semantics of parallel and vectorised blocks in terms of possible traces of atomic steps. In addition, we defined the instrumented sequentialised execution of parallel and vectorised blocks. Now, we argue correctness of the rules for parallel and vectorised blocks (Figs. 16 and 17).
We prove that every execution trace in $[\![\mathrm{Par}(N)\ S]\!]$ is functionally equivalent to the single execution trace $[\![\mathrm{Par}(N)\ S]\!]^{Seq}_{Spec}$ if all contracts hold, by showing that any execution trace can be reordered until it is the sequential execution order.

Theorem 1 All execution traces in $[\![\mathrm{Par}(N)\ S]\!]$ and $[\![\mathrm{Par}(N)\ S]\!]^{Seq}_{Spec}$ are functionally equivalent if all contracts hold.
Proof sketch 1 Assume that the first n steps of the given execution trace are in the same order as the sequential execution trace. Then, step $t_{n+1}$ in the sequential execution has to be somewhere in the given sequence. Because each sequence contains the same steps and the sequential execution trace is in happens-before order, all the steps that have to happen before $t_{n+1}$ are already included in the prefix. Hence, in the given sequence, all the steps between the end of the prefix and $t_{n+1}$ are independent of step $t_{n+1}$ itself. Therefore, step $t_{n+1}$ can be swapped with all these intermediate steps. We then repeat until the whole sequence matches.
We proved that any legal execution trace of a parallel block can be reordered into the sequential one, i.e. $[\![\mathrm{Par}(N)\ S]\!] = S_0\, S_1\, S_2 \dots S_N$. Now suppose that in the initial state $P_0 \star P_1 \star \dots \star P_N$ holds. Since all instructions are independent, after the execution of $S_0$, $Q_0$ holds and $P_1 \star P_2 \star \dots \star P_N$ is preserved. After the execution of $S_1$, $Q_1$ holds and $P_2 \star P_3 \star \dots \star P_N$ is preserved. Moreover, $S_1$ will not make $Q_0$ invalid. After the execution of $S_2$, $Q_2$ holds and $P_3 \star P_4 \star \dots \star P_N$ is preserved. In addition, $S_2$ will not make $Q_0 \star Q_1$ invalid. By continuing in this way, in the final state of the execution trace $Q_0 \star Q_1 \star \dots \star Q_N$ holds. Therefore, we can conclude that for any legal execution trace in $[\![\mathrm{Par}(N)\ S]\!]$ starting in the precondition, the postcondition will hold for the final state.
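The reordering argument can be replayed on a small example. The Python sketch below takes an interleaved trace whose iterations touch disjoint locations, bubbles it into iteration order by swapping adjacent steps, and checks after every swap that the final state is unchanged. The step encoding, the function names and the example body are assumptions of this illustration, and the check is a test on one input, not a proof.

```python
def run_trace(trace, state):
    """Execute all steps of a trace on a copy of the state."""
    s = dict(state)
    for _, step in trace:
        step(s)
    return s


def sequentialise(trace, state):
    """Reorder by adjacent swaps into iteration order, checking each swap."""
    trace = list(trace)
    expected = run_trace(trace, state)
    swapped = True
    while swapped:
        swapped = False
        for k in range(len(trace) - 1):
            # Only steps of different iterations are swapped, so the
            # program order inside one iteration is never disturbed.
            if trace[k][0] > trace[k + 1][0]:
                trace[k], trace[k + 1] = trace[k + 1], trace[k]
                assert run_trace(trace, state) == expected
                swapped = True
    return trace


def double(i):
    def step(s):
        s[("t", i)] = 2 * s[("a", i)]
    return step


def incr(i):
    def step(s):
        s[("c", i)] = s[("t", i)] + 1
    return step


state = {("a", 0): 3, ("a", 1): 5}
interleaved = [(1, double(1)), (0, double(0)), (1, incr(1)), (0, incr(0))]
ordered = sequentialise(interleaved, state)
print([i for i, _ in ordered])   # prints [0, 0, 1, 1]
```

If two iterations wrote to the same location, the assertion inside sequentialise could fail, which mirrors the role of the independence requirement in the proof sketch.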
As a corollary of Theorem 1, we can also conclude that all executions in $[\![\mathrm{Par}(N)\ S]\!]$ are data-race-free. We can apply the same argument for vectorised blocks, but as a vectorised block is defined in terms of SynchConcat, swapping past barriers is never necessary.

Theorem 2 All execution traces in $[\![\mathrm{Vec}(N)\ S]\!]$ and $[\![\mathrm{Vec}(N)\ S]\!]^{Seq}_{Spec}$ are functionally equivalent.
Note that the sequentialised instrumented execution trace now also contains send/receive ghost annotations and barriers between each iteration.

Verification of block composition
Now that we have seen how correctness of a basic block can be verified in isolation, the next step is to verify their composition. We show how this can be done on the basis of the block iteration contracts only, by proving that all the heap accesses of all iterations which are not ordered sequentially are non-conflicting (i.e. they are disjoint or they are read accesses). If this condition holds, correctness of the PPL program can be derived from the correctness of a linearised variant of the program.
We first discuss how we can verify programs where the resources in the iteration contracts are constant, i.e. the resource pre- and postconditions are always the same. Next, we sketch how to extend the approach to the case where the resource pre- and postconditions of an iteration contract differ.

Verification of block composition without resource transfers
As mentioned above, we first assume that each basic block of a program is specified by an iteration contract with constant resources rc(i) for iteration i. Further, we assume that the program is globally specified by a contract G which consists of the program's resource contract $RC_P$ and the program's functional contract $FC_P$ with the program's precondition $P_P$ and the program's postcondition $Q_P$. Let $\mathcal{P}$ be the set of all PPL programs and $P \in \mathcal{P}$ be an arbitrary PPL program, assuming that each basic block in P is identified by a unique label. We define $B_P = \{b_1, b_2, \dots, b_n\}$ as the finite set of basic block labels of the program P. For a basic block b with parallelisation level m, we define a finite set of iteration labels $I_b = \{0_b, 1_b, \dots, (m-1)_b\}$, where $i_b$ indicates the i-th iteration of the block b. Let $I_P = \bigcup_{b \in B_P} I_b$ be the finite set of all iterations of the program P.
To state our proof rule, we first define the set of all iterations that are not ordered sequentially, the incomparable iteration pairs $I_P^{\perp}$, in terms of $\prec_e \subseteq I_P \times I_P$, the least partial order which defines an extended happens-before relation. The extension addresses the iterations which happen before each other because their blocks are fused. We define $\prec_e$ based on two partial orders over the program's basic blocks: $\prec\ \subseteq B_P \times B_P$ and $\prec_{\oplus} \subseteq B_P \times B_P$. The former is the standard happens-before relation of blocks that are sequentially composed, and the latter is a happens-before relation w.r.t. fusion composition $\oplus$. They are defined by means of an auxiliary partial order generator function $G(P, \delta)$, which takes a program $P \in \mathcal{P}$ and a composition operator $\delta$ (sequential composition or fusion $\oplus$) and yields a relation over $B_P \times B_P$: $\prec$ is the result of G for P and sequential composition, and $\prec_{\oplus} = G(P, \oplus)$. The function G computes the set of all block pairs of the input program P which are in relation w.r.t. the given composition operator $\delta$. This computation is basically a syntactical analysis over the input program. The extended partial order $\prec_e$ is then defined from $\prec$ and $\prec_{\oplus}$; it determines when an iteration $i_b$ happens-before an iteration of another block.
We define the block level linearisation (b-linearisation for short) blin as a program transformation on $\mathcal{P}$ which substitutes all non-sequential compositions by a sequential composition. Its results lie in the subset of $\mathcal{P}$ in which only sequential composition is allowed as composition operator.
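The syntactic analysis behind the incomparable iteration pairs can be sketched in Python. Since the formal definitions are not reproduced here, the sketch fixes its own reading, which should be treated as an assumption: G relates every basic block of the left operand to every basic block of the right operand of the given operator, and an iteration $i_b$ precedes $j_{b'}$ if b precedes b' sequentially, or if b is fused with b' and i = j. The tuple-based program representation and all function names are hypothetical, and no transitive closure is computed.

```python
def blocks(p):
    """Basic blocks of a program as (label, number_of_iterations) pairs."""
    if p[0] == "block":
        return [(p[1], p[2])]
    return blocks(p[1]) + blocks(p[2])


def G(p, op):
    """Block pairs related by the composition operator op ("seq" or "fuse")."""
    if p[0] == "block":
        return set()
    pairs = G(p[1], op) | G(p[2], op)
    if p[0] == op:
        pairs |= {(b, c) for b, _ in blocks(p[1]) for c, _ in blocks(p[2])}
    return pairs


def incomparable(p):
    """Unordered pairs of iterations of different blocks that are not ordered."""
    seq, fuse = G(p, "seq"), G(p, "fuse")
    its = [(b, i) for b, m in blocks(p) for i in range(m)]

    def before(x, y):
        (b, i), (c, j) = x, y
        return (b, c) in seq or ((b, c) in fuse and i == j)

    return {(x, y)
            for k, x in enumerate(its) for y in its[k + 1:]
            if x[0] != y[0] and not before(x, y) and not before(y, x)}


B1, B2, B3 = ("block", "B1", 2), ("block", "B2", 2), ("block", "B3", 2)
program = ("par", ("fuse", B1, B2), B3)
pairs = incomparable(program)
print(len(pairs))                            # prints 10
print((("B1", 0), ("B2", 0)) in pairs)       # prints False
```

For a fused pair running in parallel with a third block, only the identical iterations of the two fused blocks are ordered; every other cross-block pair has to be checked for conflicts.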

Example 11
As an example, the b-linearisation of the PPL program in Example 7 is obtained by replacing its fusion and parallel compositions by sequential compositions.
Figure 18 presents the rule b-linearise. In this rule, $rc_b(i)$ and $rc_{b'}(j)$ are the resource contracts of two different basic blocks b and b', where $i_b \in I_b$ and $j_{b'} \in I_{b'}$. Application of the rule results in two new proof obligations. The first ensures that all heap accesses of all incomparable iteration pairs (the iterations that may run in parallel) are non-conflicting (i.e. all block compositions in P are memory safe). This reduces the correctness proof of P to the correctness proof of its b-linearised variant blin(P) (the second proof obligation). Then, the second proof obligation is discharged in two steps: (1) proving the correctness of each basic block against its iteration contract (using the proof rules discussed above) and (2) proving the correctness of blin(P) against the program contract.
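The b-linearisation itself is a simple structural transformation. The Python sketch below applies it to a hypothetical tuple-based program representation; the constructor names "seq", "par" and "fuse" and the ASCII rendering of the operators are assumptions of this illustration.

```python
OPS = {"seq": ";", "par": "||", "fuse": "(+)"}   # assumed ASCII rendering


def blin(p):
    """Replace every composition operator by sequential composition."""
    if p[0] == "block":
        return p
    return ("seq", blin(p[1]), blin(p[2]))


def show(p):
    """Render a program as a fully parenthesised string."""
    if p[0] == "block":
        return p[1]
    return f"({show(p[1])} {OPS[p[0]]} {show(p[2])})"


B1, B2, B3 = ("block", "B1"), ("block", "B2"), ("block", "B3")
program = ("par", ("fuse", B1, B2), B3)
print(show(program))         # prints ((B1 (+) B2) || B3)
print(show(blin(program)))   # prints ((B1 ; B2) ; B3)
```

The basic blocks and their order are untouched; only the way they are composed changes, which is what allows the functional proof to be carried out on the linearised program.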

Soundness
Now we are ready to show that a PPL program with provably correct iteration contracts and a global contract that is provable in our logic (including the rule b-linearise) is indeed data race free and functionally correct w.r.t. its specifications. To show this, we prove (i) soundness of the b-linearise rule and (ii) that each verified program is free of data races.
For the soundness proof, we show that for each program execution there exists a corresponding b-linearised execution with the same functional behaviour (i.e. they end in the same terminal state if they start in the same initial state) if all independent iterations are non-conflicting. From the rule's assumption, we know that if the precondition holds for the initial state of the b-linearised execution (which is also the initial state of the program execution), then its terminal state satisfies the postcondition. As both executions end in the same terminal state, the postcondition thus also holds for the program execution. To prove that there exists a matching b-linearised execution for each program execution, we first show that any valid program execution can be normalised w.r.t. program order and second that any normalised execution can be mapped to a b-linearised execution. To formalise this argument, we first define: an execution, an instrumented execution, and a normalised execution.
We assume that all blocks of the program, including basic and composite blocks, have a block label and that the program's statements are labelled by the label of the block to which they belong. We also assume that there exists a total order over the block labels. To distinguish between valid and invalid executions, we instrument our operational semantics with heap masks (memory masks). A heap mask models the access permissions to every heap location. It is defined as a map from locations to fractions π : Loc → Frac, where Frac is the set of fractions ([0, 1]). Any fraction in (0, 1) is a read permission and 1 is a write permission. The instrumented semantics ensures that each transition has sufficient access permissions to the heap locations that it accesses. We first add a heap mask π to all block state constructors (Init, ParC, SeqC and so on) and local state constructors (Vec, Seq and Done). Then, we extend the operational semantics rules such that in each block initialisation state with heap mask π an extra premise should be discharged, which states that there are n ≥ 2 heap masks $\pi_1, \dots, \pi_n$, one for each newly initialised state, such that $\sum_{i=1}^{n} \pi_i \le \pi$. The heap masks are carried along by the computation and termination transitions without any extra premises, while in the termination transitions heap masks of the terminated blocks are forgotten as they are not required after termination. As an example, Fig. 19 presents the instrumented versions of the rules Init ParC, ParC Done, rdsh, and wrsh, where $\rightarrow_{p,i}$ and $\rightarrow_{assg,i}$ denote program and assignment transition relations in the instrumented semantics, respectively. If a transition cannot satisfy its premises, it blocks.
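The premises that the heap masks add to the semantics can be sketched as three small checks in Python. The sketch assumes that a heap mask is a dictionary from locations to fractions with missing locations counting as 0; the function names can_split, can_read and can_write are hypothetical and only mirror the initialisation premise and the premises of the read and write rules described above.

```python
from fractions import Fraction

ZERO = Fraction(0)


def can_split(parent, children):
    """Initialisation premise: the child masks sum to at most the parent mask."""
    locs = {loc for child in children for loc in child}
    return all(
        sum((child.get(loc, ZERO) for child in children), ZERO)
        <= parent.get(loc, ZERO)
        for loc in locs)


def can_read(mask, loc):
    """A read needs some non-zero fraction."""
    return mask.get(loc, ZERO) > 0


def can_write(mask, loc):
    """A write needs the full permission 1."""
    return mask.get(loc, ZERO) == 1


half = Fraction(1, 2)
parent = {"x": Fraction(1), "y": half}
print(can_split(parent, [{"x": half}, {"x": half, "y": half}]))  # prints True
print(can_split(parent, [{"x": Fraction(1)}, {"x": half}]))      # prints False
print(can_read({"x": half}, "x"), can_write({"x": half}, "x"))   # prints True False
```

A transition whose check fails would block in the instrumented semantics, so an execution that runs to completion has passed all of these checks.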

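Definition 10 (Instrumented Execution). An instrumented execution of a program P is a finite sequence of state transitions under $\rightarrow^{*}_{p,i}$ that leads from an initial configuration with block state Init(P, π) and heap h to a terminal configuration with block state Done(π′) and heap h′. The set of all instrumented executions of P is written as $IE_P$.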
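Lemma 1 If (1) all incomparable iteration pairs are non-conflicting and (2) the iteration contracts are valid for a program P (i.e. every basic block in P respects its iteration contract), then for any execution E of the program P, there exists a corresponding instrumented execution.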
Proof sketch 2 Given an execution E, we assign heap masks to all program states that the execution E might be in. The program's initial state is assigned a heap mask $\pi \le 1$. Assumption (1) implies that all iterations which might run in parallel are non-conflicting, which implies that for all Init ParC transitions, there exist $\pi_1$ and $\pi_2$ such that $\pi_1 + \pi_2 \le \pi$, where π is the heap mask of the state in which Init ParC evaluates. In all computation transitions the successor state receives a copy of the heap mask of its predecessor. Assumption (2) implies that all iterations of all parallel and vectorised basic blocks are non-conflicting. This implies that for an arbitrary Init Par or Init Vec transition which initialises a basic block b, there exist $\pi_1, \dots, \pi_n$ such that $\sum_{i=1}^{n} \pi_i \le \pi_b$ holds in b's initialisation transition, and in all computation transitions of an arbitrary iteration i of the block b the premises of the rdsh and wrsh transitions are satisfiable by $\pi_i$.
Fig. 19 The instrumented versions of the rules Init ParC, ParC Done, rdsh, and wrsh
Lemma 2 All instrumented executions of a program P are data-race-free.

Proof sketch 3
The proof proceeds by contradiction. Assume that there exists an instrumented execution that has a data race. Thus, there must be two parallel threads such that one writes to and the other one reads from or writes to a shared heap location e. Because all instrumented executions are non-blocking, the premises of all transitions hold. Therefore, $\pi_1(e) = 1$ holds for the first thread, and $\pi_2(e) > 0$ holds for the second thread, whether it writes or reads. Also, because the program starts with one single main thread, both threads should have a single common ancestor thread z such that $\pi_x(e) + \pi_y(e) \le \pi_z(e)$, where x and y are the ancestors of the first and the second thread, respectively. A thread only gains permission from its parent; therefore $\pi_1(e) + \pi_2(e) \le \pi_z(e)$ holds. Permission fractions are in the range [0, 1] by definition; therefore, $\pi_1(e) + \pi_2(e) \le 1$ holds. This implies that if $\pi_1(e) = 1$, then $\pi_2(e) \le 0$, which is a contradiction.
A normalised execution is an instrumented execution that respects the program order, which is defined using an auxiliary labelling function $\mathcal{L} : T \to B^{all}_P \times L$, where T is the set of all transitions, L is the set of labels {I, C, T}, and $B^{all}_P$ is the set of block labels (including both composite and basic block labels). The function $\mathcal{L}$ labels each transition t according to whether it initialises a block, computes a statement s, or terminates a block; for example, a transition t that terminates a block block is labelled (LB(block), T), where LB returns the label of each block or statement in the program. The order on transitions compares a transition t with label (b, l) to a transition t′ by means of these labels and of $LB_{sub}(b)$, which returns the label set of all blocks of which b is composed.

Definition 11 (Normalised Execution). An instrumented execution labelled by $\mathcal{L}$ is normalised if the labels of its transitions are in non-decreasing order.
We transform an instrumented execution to a normalised one by safely commuting the transitions whose labels do not respect the program order.

Lemma 3 For each instrumented execution of a program P, there exists a normalised execution such that they both end in the same terminal state.

Proof sketch 4 Given an instrumented execution IE that contains two adjacent transitions $t_1$ and $t_2$, where $t_1$ is taken from a state $s_1$, a state $s_x$ exists such that a new instrumented execution $IE' = IE_1 : (s_1, t_2) : (s_x, t_1) : IE_2$ can be constructed by swapping the two adjacent transitions $t_1$ and $t_2$, where $IE_1$ and $IE_2$ are the parts of IE before and after these transitions. As the swap is on an instrumented execution, from Lemma 2 we know that this is data-race-free, thus any accesses of $t_1$ and $t_2$ to a shared heap location must be reads. Because $t_1$ and $t_2$ are adjacent transitions, no other write may happen in between; therefore, the swap preserves the functionality of IE, yielding the same terminal state for IE and IE′. Thus, the corresponding normalised execution of IE obtained by applying a finite number of such swaps yields the same terminal state as IE.

Lemma 4 For each normalised execution of a program P, there exists a b-linearised execution blin(P), such that they both end in the same terminal state.
Proof sketch 5 An execution of blin(P) is constructed by applying the map M : BlockState → BlockState to each state of a normalised execution. In the definition of M, $LS_2^0$ is the initial mapping of thread local states of $P_2$ and $\mathrm{Par}(LS_1 +\!+ LS_2^0)$ indicates the state of two fused parallel blocks $\mathrm{Par}(LS_1)$ and $\mathrm{Par}(LS_2^0)$, where ++ is overloaded and indicates pairwise concatenation of statements in the local states $LS_1$ and $LS_2^0$ (i.e. $S_1 +\!+ S_2$).
Definition 12 (Validity of Hoare Triple). The Hoare triple $\{RC_P \star P_P\}\ P\ \{RC_P \star Q_P\}$ is valid if for any execution E (i.e. an execution under $\rightarrow^{*}_{p}$ from block state Init(P) with heap h to block state Done with heap h′), if $RC_P \star P_P$ is valid for the heap h and heap mask π in the initial state of E, then $RC_P \star Q_P$ is valid for the heap h′ and heap mask π′ in its terminal state.
The validity of these formulas is defined by the semantics of formulas presented in Sect. 2.2.
From assumption (2) and the soundness of the program logic used to prove it [5], we conclude (3), which is a statement over all basic blocks $b \in B_P$.
Given a program P, implication (3), assumption (1) and Lemma 1 imply that there exists an instrumented execution IE for P. Lemma 3 and Lemma 4 imply that there exists an execution E for the b-linearised variant of P, blin(P), such that both IE and E end in the same terminal state. The initial states of both IE and E satisfy the precondition $RC_P \star P_P$. From assumption (2) and the soundness of the program logic used to prove it [5], $RC_P \star Q_P$ holds in the terminal state of E, and thus also in the terminal state of IE, as they both end in the same terminal state.
Finally, we show that a verified program is indeed data-race-free.

Proposition 1 A verified program is data-race-free.
Proof sketch 7 Given a program P, with the same reasoning steps mentioned in Theorem 3, we conclude that there exists an instrumented execution IE for P. From Lemma 2 all instrumented executions are data-race-free. Thus, all executions of a verified program are data-race-free.

Verification of block composition with resource transfers
Next we look at how to adapt this rule in case there are intra-block dependencies; in that case, the resource pre- and postconditions of individual iterations are different, and we need send/receive annotations in order to verify the blocks. This makes the independence check more involved: instead of just checking that the resource contracts for independent iterations are non-conflicting ($\forall (i_b, j_{b'}) \in I_P^{\perp}.\ (RC_P \rightarrow rc_b(i) \star rc_{b'}(j))$), we now need to check the absence of conflicts for all combinations of resource pre- and postconditions. This new version of the rule b-linearise is sound, because:
1. the check guarantees that the resource precondition of iteration i is disjoint from the resource pre- and postcondition of iteration j;
2. the check also guarantees that the resource postcondition of iteration i is disjoint from the resource pre- and postcondition of iteration j;
3. the resources specified in the resource precondition of iteration i either are sent to another iteration (say k) in the same block or they should be part of the resource postcondition of iteration i. The rule guarantees that it will also be checked that the resource pre- and postconditions of iteration k are disjoint from the resource pre- and postconditions of iteration j (because if i and j are independent, then k and j will also be independent).
However, if multiple resource transfers happen within a block, it can happen that at an intermediate point in the block, the thread holds more permissions than it holds at the beginning and the end of the block. To address this, we need to define the intermediate maximal resource contract for an intermediate statement S as the universal separating conjunction of the iteration's precondition, and all the resources that are received by all statements that happen-before S. Absence of conflicts is then defined as a check over all intermediate resource contracts. It is future work to define this formally.
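The check over all combinations of resource pre- and postconditions can be sketched as follows. The Python fragment assumes, for illustration, that each iteration is described by a pair (rc_pre, rc_post) of permission maps and that two maps are non-conflicting when their permissions sum to at most 1 on every location; compatible and no_conflict are hypothetical names, and the intermediate maximal resource contracts mentioned above are not modelled.

```python
from fractions import Fraction
from itertools import product

ZERO = Fraction(0)


def compatible(rc1, rc2):
    """True if the two permission maps can be separated (sums at most 1)."""
    return all(rc1.get(loc, ZERO) + rc2.get(loc, ZERO) <= 1
               for loc in set(rc1) | set(rc2))


def no_conflict(contract_i, contract_j):
    """Check pre/pre, pre/post, post/pre and post/post of two iterations."""
    return all(compatible(x, y) for x, y in product(contract_i, contract_j))


half = Fraction(1, 2)

# Iteration i receives a read permission on y inside its own block.
iter_i = ({"x": half}, {"x": half, "y": half})
# Iteration j of another block writes y.
iter_j = ({"y": Fraction(1)}, {"y": Fraction(1)})
# Iteration k of another block only touches z.
iter_k = ({"z": Fraction(1)}, {"z": Fraction(1)})

print(compatible(iter_i[0], iter_j[0]))   # prints True
print(no_conflict(iter_i, iter_j))        # prints False
print(no_conflict(iter_i, iter_k))        # prints True
```

The first two outputs show why comparing preconditions alone is not enough once permissions are transferred: the conflict on y only becomes visible when the postcondition of iteration i is taken into account.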

Tool support
As mentioned above, our verification technique is supported by the VerCors program verifier; the tool and a list of case studies and verified examples are available at https://github.com/utwente-fmt/vercors. This section briefly discusses how our approach is implemented in VerCors. VerCors is a verifier to specify and verify (concurrent and parallel) programs written in a high-level language such as (subsets of) Java, C, OpenCL, OpenMP and PVL, where PVL is VerCors' internal language for prototyping new features. The programs are annotated with pre-/postconditions in permission-based separation logic [1,6]. Then, VerCors encodes annotated programs via several program transformation steps into the intermediate representation language (Silver) of the Viper framework [19,26], and then the encoded program is verified using the Viper technology (Fig. 20).
Using this approach, OpenMP programs are verified with VerCors in the following steps:
1. Specify the OpenMP program (i.e. provide an iteration contract for each block and write the program contract for the outermost OpenMP parallel region).
2. Encode the specified OpenMP program into its PPL counterpart, carrying along the original OpenMP specifications (as discussed in Sect. 4).
3. Check the PPL program against its specifications, by transforming the PPL program into a Viper program.
Steps 2 and 3 are fully automatic; the user only has to provide the specifications for the OpenMP program. This section provides more details about the encoding of PPL programs into Viper.

Encoding of basic blocks into Viper
To verify our iteration contracts using Viper, we encode the behaviour of the basic blocks and the send/receive annotations as method contracts. The idea is that every block annotated with an iteration contract is encoded by a call to the method basic_block, whose contract encodes the application of the suitable Hoare Logic rule for basic blocks, instantiated for the specific iteration contract.

/*@
  requires (\forall* int j; 0<=j && j<N; pre(j));
  ensures (\forall* int j; 0<=j && j<N; post(j));
@*/
basic_block(int N, free(S));

We also need to verify that every iteration respects the iteration contract. This is encoded by a method, parametrised by the thread identifier, containing the basic block's body, and specified by the iteration contract.

/*@
  requires (0<=j && j<N) ** pre(j);
  ensures post(j);
@*/
block_body(int j, int N, free(S)) {
  body;
}

Within the body of the basic block there may be send and receive statements.

Encoding of the b-linearise rule into Viper
Finally, for the verification of block composition, we implemented the rule b-linearise as part of the encoding into Viper. This means we implemented in VerCors:
- a function to compute the set $I_P^{\perp}$, and
- the program transformation blin, resulting in a Viper program called blin_program().
This implementation basically follows the formal definition presented in Sect. 6. Next, as part of the Viper encoding, we encode the first proof obligation as lemmas for all independent iteration pairs $(i^{b}, j^{b'})$, i.e. these are encoded as specifications for empty method bodies of the following form:

    /*@ requires RC_P; ensures rc_b(i) ⋆ rc_b'(j); @*/
    indep_iteration(b, b', i, j);

Finally, for the b-linearised program, we prove that it satisfies the global method specification:

    /*@ requires {RC_P ⋆ P_P}; ensures {RC_P ⋆ Q_P} @*/
    blin_program();

Verifying an OpenMP program with VerCors requires the user to specify a program contract and an iteration contract for each SpecS block in the OpenMP program, from which all the required PPL contracts can be obtained. We demonstrate this in detail on two of the OpenMP programs presented in Sect. 2.1, which are successfully verified by VerCors.

Figure 21 shows the required contracts for the example discussed in Fig. 1 (in Sect. 2.1). There are four specifications. The first one is the program contract attached to the outermost parallel block. The other contracts are the iteration contracts of the loops L1, L2 and L3, where the context keyword is used as a shorthand notation for both requiring and ensuring the same predicate, and \forall* denotes the universal separating conjunction $\bigstar_{i \in I}$. Example 7 already showed how this OpenMP program was encoded into PPL. After adding the annotations in Fig. 21 to the OpenMP program, VerCors generates a PPL program P of the following form:

    P = (B_1 ⊕ B_2) ∥ B_3

Program P contains three parallel basic blocks $B_1$, $B_2$ and $B_3$. The fusion of $B_1$ and $B_2$ creates a composite block that is enclosed by the parentheses. Then, the composite block is composed with the basic block $B_3$ using the parallel composition operator. The program is verified by discharging two proof obligations:

1. prove that all heap accesses of all incomparable iteration pairs (i.e. all iteration pairs except the identical iterations of $B_1$ and $B_2$) are non-conflicting, which implies that the fusion of $B_1$ and $B_2$ and the parallel composition of $B_1 \oplus B_2$ and $B_3$ are memory safe, and
2. prove that each parallel basic block by itself satisfies its iteration contract, $\forall b \in \{1, 2, 3\}.\ \{\bigstar_{i \in [0..L)} IC_b(i)\}\ B_b\ \{\bigstar_{i \in [0..L)} IC_b(i)\}$, and then prove the correctness of the b-linearised variant of P against its program contract, $\{RC_P \star P_P\}\ B_1 ; B_2 ; B_3\ \{RC_P \star Q_P\}$.

Figure 22 illustrates the necessary contract for the other example in Sect. 2.1 (Fig. 2). We have implemented a slightly more general variant of PPL in our VerCors tool, which supports variable declarations and method calls. To check the first proof obligation in the tool, we quantify over pairs of blocks, which allows the number of iterations in each block to be a parameter rather than a fixed number. Our implementation successfully verified the example in 25 seconds.
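The first proof obligation can be illustrated with a small brute-force sketch. It is not how VerCors discharges the obligation (the tool reasons symbolically over resource contracts); it only enumerates iteration pairs for a fixed bound. The loop bodies in the comments, the read/write footprints and the names `FOOTPRINT`, `ordered`, `conflicting` and `racy_pairs` are assumptions made for illustration and are not taken from Fig. 1 or from the tool.

```python
from itertools import combinations

L = 4  # assumed iteration count, identical for all three blocks

# Hypothetical per-iteration footprints: (reads, writes) as sets of
# (array name, index) locations. The loop bodies are illustrative only.
FOOTPRINT = {
    1: lambda i: ({("a", i)}, {("c", i)}),              # B1: c[i] = a[i]
    2: lambda i: ({("c", i), ("b", i)}, {("c", i)}),    # B2: c[i] = c[i] + b[i]
    3: lambda i: ({("a", i), ("b", i)}, {("d", i)}),    # B3: d[i] = a[i] * b[i]
}

def ordered(p, q):
    """In (B1 fused with B2) in parallel with B3, only identical
    iterations of B1 and B2 are ordered; every other pair is incomparable."""
    (bp, ip), (bq, iq) = p, q
    return {bp, bq} == {1, 2} and ip == iq

def conflicting(p, q, footprint):
    """Two iterations conflict if one writes a location the other reads or writes."""
    rp, wp = footprint[p[0]](p[1])
    rq, wq = footprint[q[0]](q[1])
    return bool(wp & (rq | wq)) or bool(wq & rp)

def racy_pairs(footprint, n=L):
    """Return all incomparable iteration pairs whose heap accesses conflict."""
    iterations = [(b, i) for b in (1, 2, 3) for i in range(n)]
    return [(p, q) for p, q in combinations(iterations, 2)
            if not ordered(p, q) and conflicting(p, q, footprint)]
```

Under these assumed footprints, `racy_pairs(FOOTPRINT)` is empty, which corresponds to the situation in which the first obligation holds. Changing the footprint of B3 so that it writes `c[i]` instead of `d[i]` makes the check report pairs between B3 and the other two blocks.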

Related work
Botincan et al. propose a proof-directed parallelisation synthesis, which takes as input a sequential program with a proof in separation logic and outputs a parallelised counterpart by inserting barrier synchronisations [7,8]. Hurlin uses a proof-rewriting method to parallelise a sequential program's proof [16]. Compared to them, we prove the correctness of parallelisation by reducing the parallel proof to a b-linearised proof. Moreover, our approach allows verification of sophisticated block compositions, which enables reasoning about state-of-the-art parallel programming languages (e.g. OpenMP), while their work remains rather theoretical.
Raychev et al. use abstract interpretation to make a nondeterministic program (obtained by naive parallelisation of a sequential program) deterministic by inserting barriers [23]. This technique over-approximates the possible program behaviours. As a result, the behaviour of the determinised program is fixed by a set of rules that decide between feasible schedules, rather than by the behaviour of the original sequential program. Unlike them, we do not generate any parallel program. Instead, we prove that parallelisation annotations can safely be applied, and that the parallelised program is functionally correct and exhibits the same behaviour as its sequential counterpart.
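The property that a correctly annotated program behaves like its sequential counterpart can be illustrated, though not proved, by sampling schedules of a toy program. The sketch below assumes three hypothetical loops (the bodies shown in the comments) composed as a fusion of the first two in parallel with the third; `run`, `linearised` and `random_schedule` are illustrative names, not part of VerCors or PPL.

```python
import random

def run(schedule, a, b):
    """Execute the (block, iteration) pairs in the given order on fresh outputs."""
    n = len(a)
    c, d = [0] * n, [0] * n
    for block, i in schedule:
        if block == 1:
            c[i] = a[i]              # hypothetical body of the first loop
        elif block == 2:
            c[i] = c[i] + b[i]       # hypothetical body of the second loop
        else:
            d[i] = a[i] * b[i]       # hypothetical body of the third loop
    return c, d

def linearised(n):
    """Sequential order: all of block 1, then all of block 2, then all of block 3."""
    return [(blk, i) for blk in (1, 2, 3) for i in range(n)]

def random_schedule(n, rng):
    """Random interleaving that keeps iteration i of block 1 before
    iteration i of block 2 (the only ordering imposed by the fusion)."""
    sched = linearised(n)
    rng.shuffle(sched)
    pos = {it: k for k, it in enumerate(sched)}
    for i in range(n):
        p1, p2 = pos[(1, i)], pos[(2, i)]
        if p1 > p2:
            # Swap the two entries; no other pair uses these positions.
            sched[p1], sched[p2] = sched[p2], sched[p1]
    return sched
```

Comparing `run(random_schedule(n, rng), a, b)` with `run(linearised(n), a, b)` for many random schedules gives identical results for this toy program. Such sampling can only expose counterexamples, whereas the static technique described in this paper covers all schedules.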
Barthe et al. synthesise SIMD code given pre- and postconditions for loop kernels in C++ STL or C# BCL [3]. We alternatively enable verification of SIMD loops, by encoding them into vectorised basic blocks. Moreover, we address the parallel or sequential composition of those loops with other forms of parallelised blocks.
Dodds et al. introduce a higher-order variant of concurrent abstract predicates (CAP) to support modular verification of synchronisation constructs for deterministic parallelism [13]. While their proofs make explicit use of nested region assertions and higher-order protocols, they do not address the semantic difficulties introduced by these features. As mentioned in the paper, the reasoning is unsound in certain corner cases, which was fixed in an expanded version of their paper using iCAP [14]. Their approach relies on a powerful program logic and focuses much less on automation of the verification process.
Salamanca et al. [24] propose a run-time loop-carried dependence checker as an extension to OpenMP which helps programmers to detect hidden data dependencies in omp parallel for. Compared to them, we statically detect any violation of data dependencies without any run-time overhead and we address a larger subset of OpenMP constructs.
Bubel et al. [10] provide a formal trace semantics for data dependences and a program logic to analyse and reason about dependences in imperative programming languages. They use ghost variables to extend the program states to keep track of heap memory. The authors implement their approach in the KeY verifier and show its effectiveness by experimenting on Java programs. Their approach for loop-free programs is highly automatic, but for programs containing loops, user interaction is required. In particular, for programs with loops, their approach requires users to provide loop invariants, while we only require iteration contracts (which we believe are often easier to specify).
Praun et al. [27] propose an abstract model to capture data dependences. The model represents these dependences as a density metric to predict potential concurrency of programs. This metric categorises the programs into high, medium and low densities. Programs with high density are good candidates for parallelism, while those with low density are not. Programs with medium density require a scheduler that is aware of the algorithmic dependences. In contrast to our approach, their model abstracts from runtime aspects such as the number of threads and concurrency control and does not prove correctness of parallelised programs. Their work can benefit from our approach to guarantee correctness after discovering dependencies and parallelising the programs.

Conclusion and future work
We have presented the PPL language that captures the main forms of deterministic parallel programming, and we have shown how a commonly used subset of OpenMP can be encoded into PPL. Then, we proposed a verification technique to reason about data race freedom and functional correctness of PPL programs. The verification technique consists of two parts: reasoning about the correctness of basic blocks, and reasoning about the composition of blocks. Finally, we illustrated the technique by verifying the correctness of an example OpenMP program.
As future work, we plan to look into adapting annotation generation techniques to automatically generate iteration contracts, including both resource formulas and functional properties. This will lead to fully automatic verification of deterministic parallel programs. Moreover, our technique can be extended to address a larger subset of OpenMP programs by supporting more complex OpenMP patterns for scheduling iterations and omp task constructs. We also plan to identify the subset of atomic operations that can be combined with our technique, which would allow verification of the widely used reduction operations.
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.