Refinement of Parallel Algorithms down to LLVM

We present a stepwise refinement approach to develop verified parallel algorithms, down to efficient LLVM code. The performance of the resulting algorithms is competitive with their counterparts implemented in C/C++. Our approach is backwards compatible with the Isabelle Refinement Framework, such that existing sequential formalizations can easily be adapted or re-used. As a case study, we verify a parallel quicksort algorithm, show that it performs on par with its C++ implementation, and show that it is competitive with state-of-the-art parallel sorting algorithms.


Introduction
We present a stepwise refinement approach to develop verified and efficient parallel algorithms. Our method can verify total correctness down to LLVM intermediate code. The resulting verified implementations are competitive with state-of-the-art unverified implementations. Our approach is backwards compatible with the Isabelle Refinement Framework (IRF), a powerful tool to verify efficient sequential software, such as model checkers [10,7,38], SAT solvers [24,25,11], or graph algorithms [22,28,29]. This paper adds parallel execution to the IRF's toolbox, without invalidating the existing formalizations, which can now be used as sequential building blocks for parallel algorithms, or be modified to add parallelization. As a case study, we verify total correctness of a parallel quicksort algorithm, re-using an existing verification of state-of-the-art sequential sorting algorithms [27]. Our verified parallel sorting algorithm is competitive with state-of-the-art parallel sorting algorithms.

Overview
This paper is based on the Isabelle Refinement Framework, a continuing effort to verify efficient implementations of complex algorithms, using stepwise refinement techniques. Figure 1 displays the components of the Isabelle Refinement Framework.
The back end layer handles the translation from Isabelle/HOL to the actual target language. The instructions of the target language are shallowly embedded into Isabelle/HOL, using a state-error (SE) monad. An instruction with undefined behaviour, or behaviour outside our supported fragment, raises an error. The state of the monad is the memory, represented via a memory model. The code generator translates the instructions to actual code. These components form the trusted code base, while all the remaining components of the Isabelle Refinement Framework generate proofs. In the back end, the preprocessor transforms expressions to the syntactically restricted format required by the code generator, proving semantic equality of the original and transformed expression. While there exist back ends for purely functional code [30,21] and sequential imperative code [23,26], this paper describes a back end for parallel imperative LLVM code (Section 2). On top of the back end, a program logic is used to prove programs correct. It uses separation logic, and provides automation like a verification condition generator (VCG). In Section 3, we describe our formalization of concurrent separation logic [33], and our VCG.
At the level of the program logic and VCG, our framework can already be used to verify simple low-level algorithms and data structures, like dynamic arrays and linked lists. More complex developments typically use a stepwise refinement approach, starting at purely functional programs modelled in a nondeterminism-error (NE) monad [30]. A semi-automatic refinement procedure (Sepref [23,26]) translates from the purely functional code to imperative code, refining abstract functional data types to concrete imperative ones. In Section 4, we describe our extensions to support refinement to parallel executions, and a fine-grained tracking of pointer equalities, required to parallelize computations that work on disjoint parts of the same array.
Using our approach, complex algorithms and data structures can be developed and refined to optimized efficient code. The stepwise refinement ensures a separation of concerns between high-level algorithmic ideas and low-level optimizations. We have used this approach to verify a wide range of practically efficient algorithms [10,7,38,24,25,11,22,28,29,27]. In Section 5, we use our techniques to verify a parallel sorting algorithm, with competitive performance wrt. unverified state-of-the-art algorithms. Section 6 concludes the paper and discusses related and future work.

A Back End for LLVM with Parallel Execution
We formalize a semantics for parallel execution, shallowly embedded into Isabelle/HOL. As for the existing sequential back ends [23,26], the shallow embedding is key to the flexibility and feasibility of the approach. The main idea is to make an execution report the memory that it accesses, and use this information to raise an error when joining executions that would have exhibited a data race. We use this to model an instruction that calls two functions in parallel, and waits until both have returned.

State-Nondeterminism-Error Monad with Access Reports
We define the underlying monad in two steps. We start with a nondeterminism-error monad, and then lift it to a state monad and add access reports. Defining a nondeterminism-error monad is straightforward in Isabelle/HOL: a program either fails, or yields a set of possible results (spec P), described by its characteristic function P. The return operation yields exactly one result, and bind combines all possible results, failing if there is a possibility to fail. Now assume that we have a state (memory) type ′µ, and an access report type ′ρ, which forms a monoid (0, +). With this, we define our state-nondeterminism-error monad with access reports, just called M for brevity. Here, return does not change the state and reports no accesses (0), and bind sequentially composes the executions, threading through the state µ, and adding up the access reports r₁ and r₂.
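The lifting can be illustrated with a small executable model (our own Python rendering, not the Isabelle definition): a program maps a memory to either FAIL or a set of (result, access report, new memory) triples, with frozensets of addresses standing in for the report monoid (empty set as 0, union as +).

```python
FAIL = "FAIL"  # distinguished failure outcome

def return_(x):
    """return: one result, state unchanged, empty access report (the monoid zero)."""
    def prog(mem):
        return {(x, frozenset(), mem)}
    return prog

def bind(m, f):
    """bind: run m, then f on each result, threading the state and
    adding up the access reports; fails if any continuation can fail."""
    def prog(mem):
        r1 = m(mem)
        if r1 == FAIL:
            return FAIL
        out = set()
        for (x, rho1, mem1) in r1:
            r2 = f(x)(mem1)
            if r2 == FAIL:
                return FAIL
            for (y, rho2, mem2) in r2:
                out.add((y, rho1 | rho2, mem2))  # reports add up (monoid +)
        return out
    return prog

# Illustrative instruction: read an address, reporting the access.
read = lambda a: (lambda mem: {(mem[a], frozenset({a}), mem)})
```

For example, `bind(read(0), lambda x: return_(x + 1))` run on the memory `(5,)` yields one result 6 with access report `{0}`.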
Typically, the access report will contain read and written addresses, such that data races can be detected. Moreover, if parallel executions can allocate memory, we must detect those executions where the memory manager allocated the same block in both parallel strands. As we assume a thread-safe memory manager, those infeasible executions can safely be ignored. Let norace, feasible :: ′ρ ⇒ ′ρ ⇒ bool be symmetric predicates, and let combine be a commutative operator that composes two pairs of access reports and states. Then, we define a parallel composition operator for M along the following lines:

  assume (feasible ρ₁ ρ₂);                          — ignore infeasible combinations
  assert (norace ρ₁ ρ₂);                            — fail on data race
  return_ne ((x₁,x₂), combine (ρ₁,µ₁) (ρ₂,µ₂))      — combine results

Here, we use assume to ignore infeasible executions, and assert to fail on data races. Note that, if one parallel strand fails, and the other parallel strand has no possible results (spec (λ_. False)), the behaviour of the parallel composition is not clear. For this reason, we fix an invariant invar_M, which implies that every non-failing execution has at least one possible result. We define the actual type M as the subtype satisfying invar_M. Thus, we have to prove that every combinator and instruction of our semantics preserves the invariant, which is an important sanity check. As an additional sanity check, we prove symmetry of parallel composition.

A block is either fresh, freed, or allocated, and a memory is a mapping from block indexes to blocks, such that only finitely many blocks are not fresh. Every block's state transitions from fresh to allocated to freed. This avoids ever reusing the same block, and thus allows us to semantically detect use-after-free errors. Every program execution can only allocate finitely many blocks, such that we will never run out of fresh blocks 1. An allocated block contains an array of values, modelled as a list. Thus, an address consists of a block number and an index into the array.
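In the same executable model as before (our Python sketch, with programs as functions from memory to FAIL or a set of result triples), the parallel composition can be sketched as follows; the predicates and the combine operator are passed in as parameters, mirroring the paper's setup:

```python
FAIL = "FAIL"

def par(m1, m2, feasible, norace, combine):
    """Parallel composition: run both strands on the same initial memory,
    drop infeasible pairings (assume), fail on data races (assert),
    and combine the surviving (report, memory) pairs."""
    def prog(mem):
        r1, r2 = m1(mem), m2(mem)
        if r1 == FAIL or r2 == FAIL:
            return FAIL
        out = set()
        for (x1, rho1, mu1) in r1:
            for (x2, rho2, mu2) in r2:
                if not feasible(rho1, rho2):
                    continue            # assume: ignore infeasible combinations
                if not norace(rho1, rho2):
                    return FAIL         # assert: fail on a data race
                out.add(((x1, x2),) + combine((rho1, mu1), (rho2, mu2)))
        return out
    return prog
```

For two read-only strands, any pairing is feasible and race-free, so the composition simply pairs up the results and unions the reports.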
To access and modify memory, we define the functions valid, get, and put: where |xs| is the length of list xs, xs!i returns the ith element of list xs, and xs[i:=x] replaces the ith element of xs by x. Note that our LLVM semantics does not support conversion of pointers to integers, nor comparison or difference of pointers to different blocks. This way, a program cannot see the internal representation of a pointer, and we can choose a simple abstract representation, while being faithful wrt. any actual memory manager implementation.
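A plausible executable rendering of this memory model (our Python sketch; the names valid, get, and put are from the paper, the representation is ours) keeps a mapping from block indexes to block states, where an allocated block is a list of values:

```python
FRESH, FREED = "fresh", "freed"   # block states; allocated blocks hold a list of values

def valid(mem, addr):
    """An address (block, i) is valid iff the block is allocated and i is in bounds."""
    b, i = addr
    blk = mem.get(b, FRESH)
    return isinstance(blk, list) and 0 <= i < len(blk)

def get(mem, addr):
    """Read the value at a valid address; detects use-after-free and out-of-bounds."""
    b, i = addr
    assert valid(mem, addr), "use after free / out of bounds"
    return mem[b][i]

def put(mem, addr, x):
    """Write x at a valid address, returning the new memory (xs[i := x])."""
    b, i = addr
    assert valid(mem, addr), "use after free / out of bounds"
    xs = mem[b]
    mem2 = dict(mem)
    mem2[b] = xs[:i] + [x] + xs[i + 1:]
    return mem2
```

Because freed blocks stay in the mapping (rather than being recycled), an access to a freed block is semantically detectable, as described above.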

Access Reports
We now fix the state of the M-monad to be memory, and the access reports to be sets of read and written addresses, as well as sets of allocated and freed blocks: acc ≡ (r :: addr set, w :: addr set, a :: nat set, f :: nat set). Two parallel executions are feasible if they did not allocate the same block, and they have a data race if one strand accesses addresses or blocks modified by the other strand.

The invariant for M states that blocks transition only from fresh to allocated to freed, that allocated blocks never change their size, and that the access report matches the observable state change (consistent). It also states that, for each finite set B of blocks, there is an execution that does not allocate blocks from B. The latter is required to show that we always find feasible parallel executions.

The combine function joins the access reports and memories, preferring allocated over fresh, and freed over allocated memory. When joining two allocated blocks, the written addresses from the access report are used to join the blocks. We skip the rather technical definition of combine, and just state the relevant properties: let ρ₁ = (r₁,w₁,a₁,f₁) and ρ₂ = (r₂,w₂,a₂,f₂) be feasible and race-free access reports, let µ₁, µ₂ be memories that have evolved from a common memory µ, consistently with the access reports, and let a be an address that is valid in the combined memory µ′. The properties (1)-(3) define the state of blocks in the combined memory: a fresh block in µ′ was fresh already in µ, and has not been allocated (1); an allocated block was already allocated or has been allocated, but has not been freed (2); and a freed block was already freed, or has been freed (3). The properties (4)-(6) define the content: addresses written or allocated in the first or second execution get their content from µ₁ (4) or µ₂ (5) respectively. Addresses not written or allocated at all keep their original content (6).
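One plausible reading of the feasibility and race conditions, as an executable sketch (our Python rendering; addresses are (block, index) pairs, and the field names follow the acc record):

```python
from collections import namedtuple

# read addresses, written addresses, allocated blocks, freed blocks
Acc = namedtuple("Acc", "r w a f")

def feasible(a1, a2):
    """Parallel strands are feasible iff they did not allocate the same block."""
    return a1.a.isdisjoint(a2.a)

def norace(a1, a2):
    """No strand may access an address written by the other strand,
    nor an address inside a block the other strand allocated or freed."""
    def ok(x, y):
        for ad in x.r | x.w:
            if ad in y.w or ad[0] in (y.a | y.f):
                return False
        return True
    return ok(a1, a2) and ok(a2, a1)
```

Two concurrent reads of the same address are race-free, while a concurrent read and write of the same address is not.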

LLVM Instructions
Based on the M-monad, we define shallowly embedded LLVM instructions. For most instructions, this is analogous to the sequential case [26]. The exceptions are memory allocation, which nondeterministically allocates some available block (the original formalization deterministically counted up the block indexes), and an instruction for parallel function calls. The code generator only accepts such a call if f and g are constants (i.e., function names). It then generates some type-casting boilerplate, and a call to an external parallel function, which we implement using the Threading Building Blocks [36] library; i.e., the two functions f1(x1) and f2(x2) are called in parallel. The generated boilerplate code sets up x1 and x2 to point to both the actual arguments and the space for the results.
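The operational behaviour of the parallel-call instruction (call both functions, wait until both have returned, pair the results) can be sketched as follows; this is our illustrative Python stand-in for the generated TBB-based runtime function, not the generated code itself:

```python
import threading

def llc_par(f, g, x1, x2):
    """Run f(x1) and g(x2) in parallel, wait for both, return the pair of
    results (mirroring tbb::parallel_invoke in the actual runtime)."""
    box = [None, None]

    def run_g():
        box[1] = g(x2)

    t = threading.Thread(target=run_g)
    t.start()
    box[0] = f(x1)   # run one strand on the calling thread
    t.join()         # wait until both strands have returned
    return box[0], box[1]
```

Race freedom is not checked here; in the framework it is established once and for all by the semantics and the separation logic proof.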

Parallel Separation Logic
In the previous section, we have defined a shallow embedding of LLVM programs into Isabelle/HOL. We now describe how to reason about these programs, using separation logic.

Separation Algebra
In order to reason about memory with separation logic, we define an abstraction function from the memory into a separation algebra [8]. Separation algebras formalize the intuition of combining disjoint parts of memory. They come with a zero (0) that describes the empty part, a disjointness predicate a#b describing that the parts a and b do not overlap, and a disjoint union a + b that combines two disjoint parts. For the exact definition of a separation algebra, we refer to [8,20]. We note that separation algebras naturally extend over functions and pairs, in a pointwise manner.
▶ Example 1 (Trivial Separation Algebra). The type α option = None | Some α forms a separation algebra with

  0 ≡ None        a # b ≡ (a = None ∨ b = None)
  None + b ≡ b    a + None ≡ a

Intuitively, this separation algebra does not allow for combination of contents, except if one side is zero. While it is not very useful on its own, the trivial separation algebra is a useful building block for more complex separation algebras.
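A small executable model of the trivial separation algebra, including its pointwise extension over functions (here, dictionaries); the Python encoding with None as zero is ours:

```python
ZERO = None  # the zero element of the trivial separation algebra

def disj(a, b):
    """a # b: combination is only allowed if one side is zero."""
    return a is None or b is None

def plus(a, b):
    """a + b: defined only for disjoint elements; the non-zero side wins."""
    assert disj(a, b), "undefined for two non-zero sides"
    return a if b is None else b

# Pointwise extension over functions, modelled as dicts (missing key = zero):
def disj_map(m1, m2):
    return all(disj(m1.get(k), m2.get(k)) for k in set(m1) | set(m2))

def plus_map(m1, m2):
    return {k: plus(m1.get(k), m2.get(k)) for k in set(m1) | set(m2)}
```

The pointwise extension is exactly how the trivial algebra becomes useful: maps from addresses to option-values combine iff they own disjoint addresses.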
For our memory model, we define the following abstraction function: an abstract memory α µ consists of two parts. αm µ is a map from addresses to the values stored there. It is used to reason about load and store operations. αb µ is a map from block indexes to the sizes of the corresponding blocks. It is used to ensure that one owns all addresses of a block when freeing it. We continue to define a separation logic: assertions are predicates over separation algebra elements. The basic connectives are defined as follows: the assertion false never holds, and the assertion true holds for all abstract memories. The empty assertion □ holds for the zero memory, and the separating conjunction P * Q holds if the memory can be split into two disjoint parts, such that P holds for one, and Q holds for the other part. The lifting assertion ↑ϕ holds iff the Boolean value ϕ is true. It is used to lift plain logical statements into separation logic assertions owning no memory. When clear from the context, we omit the ↑-symbol, and just mix plain statements with separation logic assertions.
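The connectives can be illustrated over a toy separation algebra of partial heaps (our Python sketch: frozensets of (address, value) pairs, with disjoint union as +); `sep` enumerates all splits, which is fine for illustration but of course not how the Isabelle automation works:

```python
import itertools

def emp(h):
    """The empty assertion: holds only for the zero memory."""
    return h == frozenset()

def pure(phi):
    """Lifting assertion ↑ϕ: owns no memory, asserts a plain fact."""
    return lambda h: bool(phi) and h == frozenset()

def sep(P, Q):
    """Separating conjunction P * Q: some split of h into disjoint
    parts satisfies P on one part and Q on the other."""
    def holds(h):
        hs = list(h)
        for k in range(len(hs) + 1):
            for part in itertools.combinations(hs, k):
                h1 = frozenset(part)
                if P(h1) and Q(h - h1):
                    return True
        return False
    return holds

def pts(a, v):
    """a ↦ v: owns exactly address a, containing value v."""
    return lambda h: h == frozenset({(a, v)})
```

For instance, `sep(pts(0, 1), pts(1, 2))` holds for the two-cell heap containing both points-to facts, but `sep(pts(0, 1), pts(0, 1))` does not, since the single cell cannot be split in two.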

Weakest Preconditions and Hoare Triples
We define a weakest precondition predicate directly via the semantics: wp m Q µ holds iff the program m, run on memory µ, does not fail, and all possible results (return value x, access report ρ, new memory µ′) satisfy the postcondition Q.
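A plausible rendering of this definition (our notation, modulo the exact Isabelle syntax, with run m µ denoting the outcome of executing m on µ):

```latex
\mathrm{wp}\; m\; Q\; \mu \;\equiv\;
  \neg\,\mathrm{is\_fail}\,(\mathrm{run}\; m\; \mu) \;\wedge\;
  \bigl(\forall (x,\rho,\mu').\;
      (x,\rho,\mu') \in \mathrm{results}\,(\mathrm{run}\; m\; \mu)
      \longrightarrow Q\; x\; \rho\; \mu'\bigr)
```

Note that the postcondition Q here also sees the access report ρ; this is what the frame-restricted wpf below exploits.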
To set up a verification condition generator based on separation logic, we standardize the postcondition: the reported memory accesses must be disjoint from some abstract memory amf, called the frame. We define the weakest precondition with frame: when executed on memory µ, the program c does not fail, every return value x and new memory µ′ satisfy Q, and no memory described by the frame amf is accessed. Equipped with a weakest precondition with access restrictions, we define a Hoare-triple: the predicate ABS amf P µ specifies that the abstract memory α µ can be split into a part am and the given frame amf, such that am satisfies the precondition P. A Hoare-triple ht P c Q specifies that for all memories and frames for which the precondition holds (ABS amf P µ), the program will succeed, not using any memory of the frame, and every result will satisfy the postcondition wrt. the original frame (ABS amf (Q x) µ′).

Verification Condition Generator
The verification condition generator is implemented as a proof tactic that works on subgoals of the form wpf amf c Q µ. The tactic is guided by the syntax of the command c. Basic monad combinators are broken down using dedicated rules. For other instructions and user-defined functions, the VCG expects a Hoare-triple to be already proved. It then uses the following rule:

  ht P c Q ∧ ABS amf P′ µ      — match Hoare triple and current state
  =⇒ wpf amf c Q′ µ            — continue with postcondition

To process a command c, the first assumption is instantiated with the Hoare-triple for c, and the second assumption with the assertion P′ for the current state. Then, a simple syntactic heuristic infers a frame F and proves that the current assertion P′ entails the required precondition P and the frame. Finally, verification condition generation continues with the postcondition Q and the frame as the current assertion.

Hoare-Triples for Instructions
To use the VCG to verify LLVM programs, we have to prove Hoare triples for the LLVM instructions. For parallel calls, we prove the well-known disjoint concurrency rule [33]: That is, commands with disjoint preconditions can be executed in parallel.
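The disjoint concurrency rule referred to here is the standard one from concurrent separation logic; in the paper's ht-notation it reads:

```latex
\frac{\mathrm{ht}\; P_1\; c_1\; Q_1 \qquad \mathrm{ht}\; P_2\; c_2\; Q_2}
     {\mathrm{ht}\; (P_1 * P_2)\; (c_1 \parallel c_2)\;
        \bigl(\lambda (x_1,x_2).\; Q_1\, x_1 * Q_2\, x_2\bigr)}
```

The separating conjunction in the precondition guarantees that the two strands operate on disjoint memory, which is exactly what the norace check in the semantics demands.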
For memory operations, we prove:

Refinement for Parallel Programs
At this point, we have described a separation logic framework for parallel programs in LLVM. It is largely backwards compatible with the framework for sequential programs described in [26], such that we could easily port the algorithms formalized there to our new framework. The next step towards verifying complex programs is to set up a stepwise refinement framework. In this section we describe the refinement infrastructure of the Isabelle Refinement Framework, focusing on our changes to support parallel algorithms.

Abstract Programs
Abstract programs are shallowly embedded into the nondeterminism-error monad ′a neM (cf. Section 2.1). They are purely functional, neither modifying memory nor differentiating between sequential and parallel execution. We define a refinement ordering on neM: intuitively, m₁ ≤ m₂ means that m₁ returns fewer possible results than m₂, and may only fail if m₂ may fail. Note that neM with ≤ forms a complete lattice, with top element fail.
We use refinement and assertions to specify that a program m satisfies a specification with precondition P and postcondition Q: If the precondition is false, the right hand side is fail, and the statement trivially holds. Otherwise, m cannot fail, and every possible result x of m must satisfy Q.
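In a finite executable model (our Python sketch: a program is FAIL or the set of its possible results), the ordering and the specification pattern look as follows:

```python
FAIL = "FAIL"

def refines(m1, m2):
    """m1 <= m2: fewer possible results, and m1 may only fail if m2 may.
    fail is the top element of the ordering."""
    if m2 == FAIL:
        return True
    if m1 == FAIL:
        return False
    return m1 <= m2          # subset of possible results

def satisfies(m, P, Q, universe):
    """m meets spec (P, Q): if precondition P holds, m refines spec Q,
    i.e. m cannot fail and every result satisfies Q. If P is false,
    the right-hand side is fail and the statement holds trivially."""
    return (not P) or refines(m, {x for x in universe if Q(x)})
```

The finite `universe` parameter is an artefact of the model; in the Isabelle development, spec Q is simply the characteristic-function predicate.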
For a detailed description on using the ne-monad for stepwise refinement based program verification, we refer the reader to [30].

The Sepref Tool
The Sepref tool [23,26] symbolically executes an abstract program in the ne-monad, keeping track of refinements for every abstract variable to a concrete representation, which may use pointers to dynamically allocated memory. During the symbolic execution, the tool synthesizes an imperative Isabelle-LLVM program, together with a refinement proof. The synthesis is automatic, but requires annotations to the abstract program.
The main concept of the Sepref tool is refinement between an abstract program c in the ne-monad and a concrete program c† in the M-monad, as expressed by the hnr-predicate: either the abstract program c fails, or, for a memory described by assertion Γ, the LLVM program c† succeeds with a result x†, such that the new memory is described by Γ′ * R x x†, for a possible result x of the abstract program c. Moreover, the predicate CP holds for the concrete result. Note that hnr trivially holds for a failing abstract program. This makes sense, as we prove that the abstract program does not fail anyway. Moreover, it allows us to assume that assertions actually hold during the refinement proof: ▶ Example 2 (Refinement of lists to arrays). We define abstract programs for indexing and updating a list:

lget xs i   ≡ assert (i < |xs|); return xs!i
lset xs i x ≡ assert (i < |xs|); return xs[i := x]
These programs assert that the index is in bounds, and then return the accessed element (xs!i) or the updated list (xs[i:=x]), respectively. The following assertion links a pointer to a list of elements stored at the pointed-to location: for every i < |xs|, p + i points to the ith element of xs. On arrays, indexing and updating are implemented by:

  aget p i   ≡ ll_ofs_ptr p i; ll_load p
  aset p i x ≡ ll_ofs_ptr p i; ll_store x p; return p

The abstract and concrete programs are linked by the following refinement theorems: if the list xs is refined by the array xs†, and the natural number i is refined by the fixed-width 2 word i† (idx A i i†), the aget operation will return the same result as the lget operation (id A). The resulting memory will still contain the original array. Note that there is no explicit precondition that the array access is in bounds, as this follows already from the assertion in the abstract lget operation. The aset operation will return a pointer to an array that refines the updated list returned by lset. As the array is updated in place, the original refinement of the array is no longer valid. Moreover, the returned pointer r will be the same as the argument pointer xs†. This information is important for refining to parallel programs on disjoint parts of an array (cf. Section 4.3).
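The abstract/concrete contrast in Example 2 can be made concrete in Python (an illustration of the refinement relationship, not of the Isabelle terms): the abstract operations are functional and bounds-asserting, the concrete ones update in place and return the same "pointer".

```python
def lget(xs, i):
    """Abstract: assert bounds, return xs!i."""
    assert i < len(xs)
    return xs[i]

def lset(xs, i, x):
    """Abstract: assert bounds, return the updated *list* xs[i := x]
    (the original list is untouched)."""
    assert i < len(xs)
    return xs[:i] + [x] + xs[i + 1:]

def aget(arr, i):
    """Concrete: plain array read; in-bounds follows from the abstract assert."""
    return arr[i]

def aset(arr, i, x):
    """Concrete: in-place update, returning the *same* array object,
    mirroring the pointer equality r = xs† tracked by CP."""
    arr[i] = x
    return arr
```

That `aset` returns the identical object (`r is arr` in Python terms) is exactly the pointer-equality information the CP condition records.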
Given refinement assertions for the parameters, and hnr-rules for all operations in a program, the Sepref tool automatically synthesizes an LLVM program from an abstract neM program. The tool tries to automatically discharge additional proof obligations, typically arising from translating arithmetic operations from unbounded numbers to fixed-width numbers. Where the automatic proof fails, the user has to add assertions to the abstract program to help the proof. The main difference of our tool wrt. the existing Sepref tool [26] is the additional condition (CP) on the concrete result, which is used to track pointer equalities. We have added a heuristic to automatically synthesize and discharge these equalities.

Array Splitting
An important concept for parallel programs is to concurrently operate on disjoint parts of the memory, e.g., different slices of the same array. However, abstractly, arrays are just lists. They are updated by returning a new list, and there is no way to express that the new list is stored at the same address as the old list. Nevertheless, in order to refine a program that updates two disjoint slices of a list to one that updates disjoint parts of the array in place, we need to know that the result is stored in the same array as the input. This is handled by the CP argument to hnr. To indicate that operations shall be refined to disjoint parts of the same array, we introduce the combinator with_split for abstract programs. Abstractly, this is an annotation that is inlined when proving the abstract program correct. However, Sepref will translate it to the concrete combinator awith_split: the refinement of the function f to f† requires an additional proof that the returned pointers are equal to the argument pointers (xs†₁′ = xs†₁ ∧ xs†₂′ = xs†₂). Sepref tries to prove this automatically, using a simple heuristic.
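Abstractly, the annotation behaves like the following executable sketch (our Python rendering: split the list, run the given function on both halves, glue the results back); concretely, awith_split would instead hand out two index ranges into the very same array and rely on the pointer equalities to reassemble nothing at all:

```python
def with_split(xs, m, f):
    """Abstract splitting combinator: split list xs at position m,
    apply f to both slices, and concatenate the results."""
    xs1, xs2 = xs[:m], xs[m:]
    ys1, ys2 = f(xs1, xs2)
    # the refined, in-place version requires the slices to keep their lengths
    assert len(ys1) == len(xs1) and len(ys2) == len(xs2)
    return ys1 + ys2
```

For instance, sorting the two halves of a list independently through with_split yields the concatenation of the two sorted slices, which is exactly the shape of the recursive step in the parallel quicksort below.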

Refinement to Parallel Execution
The purely functional abstract programs have no notion of parallel execution. To indicate that refinement to parallel execution is desired, we define an abstract annotation npar. Its hnr-rule can be used to automatically parallelize any (independent) abstract computations. For convenience, we also define nseq. Abstractly, it is the same as npar, but Sepref translates it to sequential execution.

A Parallel Sorting Algorithm
To test the usability of our framework, we verify a parallel sorting algorithm. We start with the abstract specification of an algorithm that sorts a list:

  sort_spec xs ≡ spec (λxs′. mset xs′ = mset xs ∧ sorted xs′)

That is, we return a sorted permutation of the original list. Note that this is a standard specification of sorting in Isabelle. Reusing the existing development of an abstract introsort algorithm [27], we easily prove with a few refinement steps that the following abstract algorithm implements sort_spec. This algorithm is derived from the well-known quicksort and introsort algorithms [32]: like quicksort, it partitions the list (line 7), and then recursively sorts the partitions in parallel (l. 11). Like introsort, when the recursion gets too deep, or the list too short, we fall back to some (not yet specified) sequential sorting algorithm (l. 5). Similarly, when the partitioning is very unbalanced (l. 8), we sort the partitions sequentially (l. 10). These optimizations aim at not spawning threads for small sorting tasks, where the overhead of thread creation outweighs the advantages of parallel execution. A more technical aspect is the extra parameter n that we introduce for the list length. Thus, we can refine the list to just a pointer to an array, and still access its length 3.
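The skeleton of the algorithm can be sketched as follows (our illustrative Python version: a plain Lomuto partitioner and Python's built-in sort stand in for the verified partitioner and the verified sequential fallback; the unbalance factor 8 and the thresholds are ours):

```python
import threading

THRESHOLD = 8      # below this size, use the sequential fallback

def partition(xs, lo, hi):
    """Lomuto partition of xs[lo:hi] around xs[lo]; returns m such that
    xs[lo:m] < xs[m] and xs[m] <= all of xs[m:hi]."""
    pivot = xs[lo]
    i = lo + 1
    for j in range(lo + 1, hi):
        if xs[j] < pivot:
            xs[i], xs[j] = xs[j], xs[i]
            i += 1
    xs[lo], xs[i - 1] = xs[i - 1], xs[lo]
    return i - 1

def psort_aux(xs, lo, hi, depth):
    """Parallel quicksort skeleton: fallback for small or too-deep calls,
    sequential recursion on very unbalanced splits, parallel otherwise."""
    if hi - lo <= THRESHOLD or depth == 0:
        xs[lo:hi] = sorted(xs[lo:hi])          # fallback sorting algorithm
        return
    m = partition(xs, lo, hi)
    if min(m - lo, hi - m) * 8 < hi - lo:      # very unbalanced partition
        psort_aux(xs, lo, m, depth - 1)
        psort_aux(xs, m, hi, depth - 1)
    else:                                      # sort partitions in parallel
        t = threading.Thread(target=psort_aux, args=(xs, m, hi, depth - 1))
        t.start()
        psort_aux(xs, lo, m, depth - 1)
        t.join()

def psort(xs):
    psort_aux(xs, 0, len(xs), depth=16)
    return xs
```

The two recursive calls operate on the disjoint slices xs[lo:m] and xs[m:hi] of the same array, which is precisely the situation that with_split and the CP pointer equalities make provable on the refined level.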

Implementation and Correctness Theorem
Next, we have to provide implementations for the fallback sort_spec and for partition_spec. These implementations must be proved to be in-place, i.e., to return a pointer to the same array. It was straightforward to amend our existing formalization of pdqsort [27] with the in-place proofs: once we had amended the refinement statements, and fixed some bugs in the pointer-equality proving heuristic that we added to Sepref, the proofs were automatic.
Given the implementations of sort_spec and partition_spec, the Sepref tool generates an LLVM program psort† from the abstract psort, and proves a corresponding refinement lemma:

  hnr (arr A xs xs† * idx A n n†) (psort† xs† n†) (idx A n n†) arr A (λr. r = xs†) (psort xs n)

Combining this with the correctness lemma of the abstract psort algorithm, and unfolding the definition of hnr, we prove the following Hoare-triple for our final implementation:

  ht (arr A xs xs† * idx A n n† * n = |xs|)
     (psort† xs† n†)
     (λr. r = xs† * ∃xs′. arr A xs′ xs† * sorted xs′ * mset xs′ = mset xs)

That is, for a pointer xs† to an array, whose contents are described by the list xs (arr A), and a fixed-size word n† representing the natural number n (idx A), which must be the number of elements in the list xs, our sorting algorithm returns the original pointer xs†, and the array contents are now xs′, which is sorted and a permutation of xs. Note that this statement uses our semantically defined Hoare triples (cf. Section 3.2). In particular, its correctness does not depend on the refinement steps, the Sepref tool, or the VCG.

A Sampling Partitioner
While we could simply re-use the existing partitioning algorithm from the pdqsort formalization, which uses a pseudomedian-of-nine pivot selection, we observe that the quality of the pivot is particularly important for a balanced parallelization. Moreover, the partitioning in the psort_aux procedure is only done for arrays above a fairly large size threshold. Thus, we can invest a little more work to find a good pivot, which is still negligible compared to the cost of sorting the resulting partitions. We choose a sampling approach, using the median of 64 equidistant samples as pivot. The highly optimized partitioning algorithms that we use swap the pivot to the front of the partition, such that we need to determine its index, rather than just its value. We simply use quicksort to find the median 4:

  sample xs ≡ is ← equidist |xs| 64; is ← sort_wrt (λi j. xs!i < xs!j) is; return (is!32)

Proving that this algorithm finds a valid pivot index is straightforward. More challenging is to refine it to purely imperative LLVM code, which does not support closures like λi j. xs!i < xs!j. We resolve such closures over the comparison function manually: using Isabelle's locale mechanism [19], we parametrize over the comparison function. Moreover, we thread through an extra parameter for the data captured by the closure:

  locale pcmp =
    fixes lt :: ′p ⇒ ′e ⇒ ′e ⇒ bool and lt† :: ′p† ⇒ ′e† ⇒ ′e† ⇒ bool
      and par A :: ′p ⇒ ′p† ⇒ assn and elem A :: ′e ⇒ ′e† ⇒ assn
    assumes ∀p. weak_ordering (lt p)
    assumes hnr (par A p pi * elem A a ai * elem …

This defines a context in which we have an abstract compare function lt for the abstract elements of type ′e. It takes an extra parameter of type ′p (e.g. the list xs), and forms a weak ordering 5. Note that the strict compare function lt also induces a non-strict version le p a b ≡ ¬lt p b a. Moreover, we have a concrete implementation lt† of the compare function, wrt.
the refinement assertions par A for the parameter and elem A for the elements. Our sorting algorithm is developed and verified in the context of this locale (to avoid confusion, our presentation has, up to now, simply used <, ≤, and sorted instead of lt p, le p, and sorted_wrt (le p)). To get a sorting algorithm for an actual compare function, we have to instantiate the locale, providing an abstract and a concrete compare function, along with a proof that the abstract function is a weak ordering, and that the concrete function refines the abstract one. For our example of sorting indexes into an array, where the array elements are themselves compared by a parametrized function lt, the instantiation yields sorting algorithms for sorting indexes, taking an extra parameter for the array to index into. For our sampling application, we use idx.introsort xs.
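The sampling pivot selection described above can be sketched as follows (our Python rendering; Python's `key=` argument plays the role of the index-into-array comparator λi j. xs!i < xs!j, and the built-in sort stands in for idx.introsort):

```python
def equidist(n, k):
    """k equidistant sample indexes into [0, n); assumes n >= k."""
    return [i * n // k for i in range(k)]

def sample_pivot_index(xs, k=64):
    """Median-of-k sampling: sort the sample indexes by the element
    they point to, and return the middle index as the pivot index."""
    idx = equidist(len(xs), k)
    idx.sort(key=lambda i: xs[i])   # sort indexes by comparing xs!i < xs!j
    return idx[k // 2]
```

The function returns an index (not a value), as required by the partitioners that swap the pivot to the front of the partition.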

Code Generation
Finally, we instantiate the sorting algorithms to sort unsigned integers and strings. This yields implementations unat.psort† and str.psort†, and automatically proves instantiated versions of the correctness theorems. In a last step, we use our code generator to generate actual LLVM text, as well as a C header file with the signatures of the generated functions 6:

  export_llvm
    unat.psort† is uint64_t* psort(uint64_t*, int64_t)
    str.psort† is llstring* str_psort(llstring*, int64_t)
    defines typedef struct {int64_t size; struct {int64_t capacity; char *data;};} llstring;
    file psort.ll

This checks that the specified C signatures are compatible with the actual types, and then generates psort.ll and psort.h, which can be used in a standard C/C++ toolchain.

Benchmarks
We have benchmarked our verified sorting algorithm against a direct implementation of the same algorithm in C++. Both implementations have the same runtime, up to minor noise. This indicates that there is no systemic slowdown: algorithms verified with our framework run as fast as their unverified counterparts implemented in C++. We also benchmarked against the state-of-the-art implementations std::sort with execution policy par_unseq from the GNU C++ standard library [12], and sample_sort from the Boost C++ libraries [4,5]. We benchmarked the algorithms on two different machines, and on various input distributions. The results are shown in Figure 2. While our verified algorithm is clearly competitive for integer sorting on the less parallel laptop machine, it is slightly less efficient for sorting strings on the highly parallel server machine. Nevertheless, we believe that our verified implementation is already useful in practice, and leave further optimizations to future work.
Finally, we measured the speedup that the implementations achieve for a certain number of cores. The results are displayed in Figure 3. While the speedup on the moderately parallel laptop is comparable to the one of the C++ standard library, our implementation achieves lower speedups than the state-of-the-art on the highly parallel server. Again, we leave further optimizations to future work.

Conclusions
We have presented a stepwise refinement approach to verify total correctness of efficient parallel algorithms. Our approach targets LLVM as back end, and there is no systemic efficiency loss in our approach when compared to unverified algorithms implemented in C++. The trusted code base of our approach is relatively small: apart from Isabelle's inference kernel, it contains our shallow embedding of a small fragment of the LLVM semantics, and the code generator. All other tools that we used, e.g., our Hoare logic, Sepref tool, and Refinement Framework for abstract programs, ultimately prove a correctness theorem that only depends on our shallowly embedded semantics.
As a case study, we have implemented a parallel sorting algorithm. It uses an existing verified sequential pdqsort algorithm as a building block, and is competitive with state-of-the-art parallel sorting algorithms, at least on moderately parallel hardware.
The main idea of our parallel extension is to shallowly embed the semantics of a parallel combinator into a sequential semantics, by making the semantics report the accessed memory locations, and fail if there is a potential data race. We only needed to change the lower levels of our existing framework for sequential LLVM [26]. Higher-level tools like the VCG and Sepref remained largely unchanged and backwards compatible. This greatly simplified reusing of existing verification projects, like the sequential pdqsort algorithm [27].

Related Work
While there is extensive work on parallel sorting algorithms (e.g. [9,1]), there seems to be almost no work on their formal verification. The only work we are aware of is a distributed merge sort algorithm [16], for which "no effort has been made to make it efficient" [16, Sec. 2], nor has any executable code been generated or benchmarked. Another verification [34] uses the VerCors deductive verifier to prove the permutation property (mset xs′ = mset xs) of odd-even transposition sort [13], but neither the sortedness property nor termination.
Concurrent separation logic is used by many verification tools such as VerCors [3], and also formalized in proof assistants, for example in the VST [37] and IRIS [18] projects for Coq [2]. These formalizations contain elaborate concepts to reason about communication between threads via shared memory, and are typically used to verify partial correctness of subtle concurrent algorithms (e.g. [31]). Reasoning about total correctness is more complicated in the step-indexed separation logic provided by IRIS, and currently only supported for sequential programs [35]. Our approach is less expressive, but naturally supports total correctness, and is already sufficient for many practically relevant parallel algorithms like sorting, matrix-multiplication, or parallel algorithms from the C++ STL.

Future Work
An obvious next step is to implement a fractional separation logic [6], to reason about parallel threads that share read-only memory. While our semantics already supports shared read-only memory, our separation logic does not. We believe that implementing a fractional separation logic will be straightforward, with the main technical issues arising in automatic frame inference.
Another obvious next step is to verify a state-of-the-art parallel sorting algorithm, like Boost's sample_sort. Like our current algorithm, sample_sort does not require advanced synchronization concepts, and can be implemented with only a parallel combinator.
Moreover, the Sepref framework has recently been extended to reason about the complexity of (sequential) LLVM programs [14,15]. This line of work could be combined with our parallel extension, to verify the complexity (e.g. work and span) of parallel algorithms.
Extending our approach towards more advanced synchronization like locks or atomic operations may be possible: instead of accessed memory addresses, a thread could report a set of possible traces, which are checked for race-freedom and then combined.
Finally, our framework currently targets multicore CPUs. Another important architecture are general purpose GPUs. As LLVM is also available for GPUs, porting our framework to this architecture should be possible. We even expect that barrier synchronization, which is important in the GPU context, can be integrated into our approach.