1 Introduction

We present a stepwise refinement approach to develop verified and efficient parallel algorithms. Our method can verify total correctness down to LLVM intermediate code. The resulting verified implementations are competitive with state-of-the-art unverified implementations. Our approach is backwards compatible with the Isabelle Refinement Framework (IRF) [29], a powerful tool to verify efficient sequential software, such as model checkers [8, 11, 44], SAT solvers [13, 27, 28], or graph algorithms [25, 34, 35]. This paper adds parallel execution to the IRF’s toolbox, without invalidating the existing formalizations, which can now be used as sequential building blocks for parallel algorithms, or be modified to add parallelization.

As a case study, we verify total correctness of a parallel quicksort algorithm, re-using an existing verification of state-of-the-art sequential sorting algorithms [30]. Our verified parallel sorting algorithm is competitive with state-of-the-art parallel sorting algorithms from GNU’s C++ standard library and the Boost C++ Libraries.

This paper is an extended version of our ITP 2022 paper [31]. The main new contribution is a verified parallel partitioning algorithm, which significantly improves the efficiency and scalability of our sorting algorithm from [31]. To this end, we added a description of the interval list data structure (Sect. 4.3) and the array splitting combinator (Sect. 4.4), which are required by the parallel partitioner. The parallel partitioning algorithm itself is described in Sect. 5.3, and Sect. 5.5 contains updated and more extensive benchmarks.

Isabelle LLVM is hosted at https://lammich.github.io/isabelle_llvm/index.html. The version described in this paper has been archived [32].

1.1 Overview

This paper is based on the Isabelle Refinement Framework, a continuing effort to verify efficient implementations of complex algorithms, using stepwise refinement techniques [15, 24, 26, 29, 33, 36]. Figure 1 displays the components of the Isabelle Refinement Framework.

Fig. 1: Components of the IRF, with focus on the back end

The back end layer handles the translation from Isabelle/HOL to the actual target language. The instructions of the target language are shallowly embedded into Isabelle/HOL, using a state-error (SE) monad. An instruction with undefined behaviour, or behaviour outside our supported fragment, raises an error. The state of the monad is the memory, represented via a memory model. The code generator translates the instructions to actual code. These components form the trusted code base, while all the remaining components of the Isabelle Refinement Framework generate proofs. In the back end, the preprocessor transforms expressions to the syntactically restricted format required by the code generator, proving semantic equality of the original and transformed expression. While there exist back ends for purely functional code [24, 36], and sequential imperative code [26, 29], this paper describes a back end for parallel imperative LLVM code (Sect. 2).

On top of the back end, we formalize a concurrent separation logic [39] and implement a verification condition generator (VCG), cf. Sect. 3.

At the level of the program logic and VCG, our framework can be used to verify simple low-level algorithms and data structures, like dynamic arrays and linked lists. More complex developments typically use a stepwise refinement approach, starting at purely functional programs modelled in a nondeterminism-error (NE) monad [36]. A semi-automatic refinement procedure (Sepref [26, 29]) translates from the purely functional code to imperative code, refining abstract functional data types to concrete imperative ones. In Sect. 4, we describe our extensions to support refinement to parallel executions, and a fine-grained tracking of pointer equalities, required to parallelize computations that work on disjoint parts of the same array.

Using our approach, complex algorithms and data structures can be developed and refined to optimized efficient code. The stepwise refinement ensures a separation of concerns between high-level algorithmic ideas and low-level optimizations. We have already used this approach to verify a wide range of practically efficient sequential algorithms [8, 11, 13, 25, 27, 28, 30, 34, 35, 44]. In Sect. 5, we use our extended techniques to verify a parallel sorting algorithm, with competitive performance wrt. unverified state-of-the-art algorithms.

Section 6 concludes the paper and discusses related and future work.

1.2 Notation

Formal statements in this paper correspond to theorems proved in our Isabelle/HOL formalization, though sometimes simplified to improve the clarity of the presentation.

We mainly use Isabelle/HOL notation, with some shortcuts and adaptations for presentation in a paper. In this section, we give examples of the more unusual notations: for implication, we always write \(\Longrightarrow \), even where Isabelle distinguishes between the meta-level implication and the HOL implication \(\longrightarrow \). Free variables are universally quantified at the top level.

Type variables are \(\alpha , \beta \), and function types are written curried as \(\alpha \Rightarrow \beta \Rightarrow \gamma \). Types can be annotated to any term, e.g., \(t::\alpha \). Function application is written as f x y, and function update is f(x := y). For definitions, we use \(\equiv \).

Algebraic datatypes are written with the datatype keyword, which also defines the corresponding selector functions. Values of the tuple type are written as (a, b). Lists have type \(\alpha \ list\), and we write [] for the empty list, x#xs for the list with head x and tail xs, and xs@ys for list concatenation. We also use the list notation \([x_1, \ldots , x_n]\). The length of list xs is |xs|.

The empty set is \(\emptyset \). The set \(\{l..<h\}\) contains all elements between l inclusive and h exclusive, and the set \(\{l..\}\) contains all elements greater than or equal to l. Disjoint union is written like ordinary union, with the disjointness of the operands as an implicit side constraint. We only use this notation if it is clear from the context where the disjointness constraint belongs; otherwise, we state the constraint explicitly. The cardinality of a finite set s is |s|.

Lambda abstraction is written as \(\lambda x.\ t\), and we omit it when clear from the context. We also use Haskell-like sections for infix operators, e.g., (+1) for \(\lambda x.\ x+1\), and underscores to indicate the parameter positions, e.g., f _ y for \(\lambda x.\ f\ x\ y\).

2 A Back End for LLVM with Parallel Execution

We formalize a semantics for parallel execution, shallowly embedded into Isabelle/HOL. As for the existing sequential back ends [26, 29], the shallow embedding is key to the flexibility and feasibility of the approach. The main idea is to make an execution report its accessed memory, and use this information to raise an error when joining executions that would have exhibited a data race. We use this to model an instruction that calls two functions in parallel, and waits until both have returned.

2.1 State-Nondeterminism-Error Monad with Access Reports

We define the underlying monad in two steps. We start with a nondeterminism-error monad, and then lift it to a state monad and add access reports. Defining a nondeterminism-error monad is straightforward in Isabelle/HOL:

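For illustration only, the following Haskell sketch models the same monad; the names and the use of finite lists in place of characteristic functions are our modelling choices, not part of the formalization:

```haskell
-- Model of the nondeterminism-error (NE) monad: a program either
-- fails, or yields a set of possible results (here a finite list,
-- to keep the model executable).
data NE a = Fail | Res [a] deriving Show

returnNE :: a -> NE a
returnNE x = Res [x]              -- exactly one possible result

bindNE :: NE a -> (a -> NE b) -> NE b
bindNE Fail _ = Fail
bindNE (Res xs) f                 -- combine all possible results,
  | any isFail ys = Fail          -- failing if any path may fail
  | otherwise     = Res (concat [rs | Res rs <- ys])
  where
    ys = map f xs
    isFail Fail = True
    isFail _    = False
```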

A program either fails, or yields a set of possible results, described by its characteristic function. The return operation yields exactly one result, and bind combines all possible results, failing if there is a possibility to fail. For these, we use the usual monadic program notation.

Now assume that we have a state (memory) type \(\mu \), and an access report type, which forms a monoid with zero 0 and addition +. With this, we define our state-nondeterminism-error monad with access reports, just called M for brevity:

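Continuing the illustrative Haskell model (reusing NE, returnNE, and bindNE from above; an arbitrary Monoid plays the role of the access-report monoid):

```haskell
-- Model of the M monad: NE lifted with a memory state and an access
-- report drawn from a monoid.
newtype M r s a = M { runM :: s -> NE (a, r, s) }

returnM :: Monoid r => a -> M r s a
returnM x = M (\mu -> returnNE (x, mempty, mu))  -- state kept, no accesses

bindM :: Monoid r => M r s a -> (a -> M r s b) -> M r s b
bindM m f = M (\mu ->
  runM m mu      `bindNE` \(x, r1, mu')  ->
  runM (f x) mu' `bindNE` \(y, r2, mu'') ->
  returnNE (y, r1 <> r2, mu''))                  -- thread state, add reports
```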

Here, return does not change the state and reports no accesses (0), and bind sequentially composes the executions, threading through the state \(\mu \) and adding up the access reports. Note that we consistently use the name \(\mu \) for memories.

Typically, the access report will contain read and written addresses, such that data races can be detected. Moreover, if parallel executions can allocate memory, we must detect those executions where the memory manager allocated the same block in both parallel strands. As we assume a thread-safe memory manager, those infeasible executions can safely be ignored. Let feasible and race be symmetric predicates, and let combine be a commutative operator to combine two pairs of access reports and states. Then, we define a parallel composition operator for M:

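In the same illustrative model, parallel composition can be sketched as follows, with feasible, race, and combine as parameters:

```haskell
-- Parallel composition: run both strands on the same initial memory,
-- ignore infeasible result pairs, fail on data races, and otherwise
-- combine access reports and final memories.
parM :: (r -> r -> Bool)              -- feasible
     -> (r -> r -> Bool)              -- race
     -> ((r, s) -> (r, s) -> (r, s))  -- combine (commutative)
     -> M r s a -> M r s b -> M r s (a, b)
parM feasible race combine ma mb = M (\mu ->
  case (runM ma mu, runM mb mu) of
    (Fail, _) -> Fail
    (_, Fail) -> Fail
    (Res as, Res bs) ->
      let feas = [ (p, q) | p@(_, r1, _) <- as, q@(_, r2, _) <- bs
                          , feasible r1 r2 ]
      in if any (\((_, r1, _), (_, r2, _)) -> race r1 r2) feas
         then Fail                    -- a data race is an error
         else Res [ ((x, y), r, s')
                  | ((x, r1, s1), (y, r2, s2)) <- feas
                  , let (r, s') = combine (r1, s1) (r2, s2) ])
```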

Here, we use feasible to ignore infeasible executions, and race to fail on data races. Note that, if one parallel strand fails, and the other parallel strand has no possible results, the behaviour of the parallel composition is not clear. For this reason, we fix an invariant which implies that every non-failing execution has at least one possible result. We define the actual type M as the subtype satisfying this invariant. Thus, we have to prove that every combinator and instruction of our semantics preserves the invariant, which is an important sanity check. As an additional sanity check, we prove that parallel composition is symmetric.


2.2 Memory Model

Our memory model supports blocks of values, where values can be integers, structures, or pointers into a block:

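A possible rendering of this memory model in the illustrative Haskell sketch (type and field names are ours):

```haskell
import qualified Data.Map as Map

-- Values are integers, structures, or pointers into a block; a block
-- is fresh, freed, or allocated with an array (list) of values.
data Val   = VInt Integer | VStruct [Val] | VPtr Addr deriving (Eq, Show)
data Block = Fresh | Freed | Alloc [Val]              deriving (Eq, Show)

-- An address: block index plus index into the block's value array.
data Addr = Addr { blockIdx :: Int, valIdx :: Int } deriving (Eq, Ord, Show)

-- Only finitely many blocks are not fresh: absent keys mean Fresh.
type Memory = Map.Map Int Block
```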

A block is either fresh, freed, or allocated, and a memory is a mapping from block indexes to blocks, such that only finitely many blocks are not fresh. Every block’s state transitions from fresh to allocated to freed. This avoids ever reusing the same block, and thus allows us to semantically detect use-after-free errors. Every program execution can only allocate finitely many blocks, such that we will never run out of fresh blocks. An allocated block contains an array of values, modelled as a list. Thus, an address consists of a block index and an index into the array.

To access and modify memory, we define the corresponding functions on our memory representation,


where |xs| is the length of list xs, xs!i returns the ith element of list xs, and xs[i := x] replaces the ith element of xs by x.

Note that our LLVM semantics does not support conversion of pointers to integers, nor comparison or difference of pointers to different blocks. This way, a program cannot see the internal representation of a pointer, and we can choose a simple abstract representation, while being faithful wrt. any actual memory manager implementation.

2.3 Access Reports

We now fix the state of the M-monad to be memory, and the access reports to be tuples of read and written addresses, as well as sets of allocated and freed blocks:

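In the Haskell sketch, the access reports could be rendered as follows (field names are ours):

```haskell
import qualified Data.Set as Set

-- Access report: read/written addresses, allocated/freed blocks.
data Acc = Acc
  { rd :: Set.Set Addr   -- addresses read
  , wr :: Set.Set Addr   -- addresses written
  , al :: Set.Set Int    -- blocks allocated
  , fr :: Set.Set Int }  -- blocks freed

instance Semigroup Acc where
  a <> b = Acc (rd a <> rd b) (wr a <> wr b) (al a <> al b) (fr a <> fr b)

instance Monoid Acc where
  mempty = Acc Set.empty Set.empty Set.empty Set.empty
```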

Two parallel executions are feasible if they did not allocate the same block. They have a data race if one execution accesses addresses or blocks modified by the other:

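The two checks can then be approximated as follows; this is a reading of the prose description, not the exact Isabelle definition:

```haskell
-- Feasible: the strands did not allocate the same block.
feasibleA :: Acc -> Acc -> Bool
feasibleA a b = Set.null (al a `Set.intersection` al b)

-- Race: one strand reads or writes addresses (or blocks) that the
-- other strand wrote, allocated, or freed.
raceA :: Acc -> Acc -> Bool
raceA a b = conflict a b || conflict b a
  where
    accessed x = rd x <> wr x
    conflict x y =
      not (Set.null (accessed x `Set.intersection` wr y))
        || not (Set.null (Set.map blockIdx (accessed x)
                          `Set.intersection` (al y <> fr y)))
```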

The combine function joins the access reports and memories, preferring allocated over fresh, and freed over allocated memory. When joining two allocated blocks, the written addresses from the access reports are used to join the blocks. We skip the rather technical definition of combine, and just state the relevant properties. Let the two access reports be feasible and race-free, and let the two result memories have evolved from a common memory, consistently with the respective access reports. Properties (1)–(3) define the state of blocks in the combined memory: a fresh block was fresh already in the common memory, and has not been allocated (1); an allocated block was already allocated or has been allocated, but has not been freed (2); and a freed block was already freed, or has been freed (3). Properties (4)–(6) define the content: addresses written or allocated in the first or second execution get their content from the first (4) or second (5) result memory, respectively. Addresses neither written nor allocated keep their original content (6).

2.4 The Interface of the M-monad

The invariant for M states that blocks transition only from fresh to allocated to freed, that allocated blocks never change their size, and that the access report matches the observable state change. It also states that, for each finite set B of blocks, there is an execution that does not allocate blocks from B. The latter is required to show that we always find feasible parallel executions.


To define functions in the M-monad, we have to show that they satisfy this invariant. For return and bind, this is straightforward. The proof for the parallel operator is slightly more involved, using the properties of combine and the invariant for the operands to obtain a feasible parallel execution.

Moreover, the M-monad provides memory management functions for allocation, deallocation, loading, and storing, as well as a check whether a given address is valid, which is used to decide if pointer arithmetic can be performed on that address. Currently, this check behaves like loading from the address; in particular, it does not support pointers one past the end of an allocated block. We leave the integration of such pointers to future work.

Example 1

(Memory allocation) To define memory allocation in M, we first define the allocation function in the underlying nondeterminism-error monad:

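In the illustrative model, allocation can be sketched like this; the Isabelle definition chooses nondeterministically among all fresh blocks, whereas the executable sketch enumerates only a few candidates:

```haskell
-- Allocation in the underlying NE monad: pick a fresh block b,
-- initialize it with the given values, and report the allocation.
allocNE :: [Val] -> Memory -> NE (Addr, Acc, Memory)
allocNE vs mu =
  Res [ (Addr b 0, report b, Map.insert b (Alloc vs) mu)
      | b <- take 3 fresh ]          -- three of infinitely many choices
  where
    fresh = [ b | b <- [0 ..], isFresh (Map.findWithDefault Fresh b mu) ]
    isFresh Fresh = True
    isFresh _     = False
    report b = mempty { al = Set.singleton b }
```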

This function selects an arbitrary fresh block b, and initializes it with the given list of values. It returns the allocated block, an access report for the allocation, and the updated memory.

We then show that this function satisfies the invariant of M: we correctly report the allocated block. Moreover, we can select any fresh block: as our memory model guarantees an infinite supply of fresh blocks, any finite set of blocks can be avoided.


Finally, we define the corresponding function in the M-monad, using Isabelle’s lifting and transfer package [18].


The other memory management functions are defined analogously.

2.5 LLVM Instructions

Based on the M-monad, we define shallowly embedded LLVM instructions. For most instructions, this is analogous to the sequential case [29]. Additionally, we define an instruction for a parallel function call.


The code generator only accepts this instruction if both function arguments are constants (i.e., function names). It then generates some type-casting boilerplate, and a call to an external function, which we implement using the Threading Building Blocks [19] library: the two functions are called in parallel, and the generated boilerplate code sets up pointers to both the actual arguments and the space for the results.

3 Parallel Separation Logic

In the previous section, we have defined a shallow embedding of LLVM programs into Isabelle/HOL. We now reason about these programs, using separation logic.

3.1 Separation Algebra

In order to reason about memory with separation logic, we define an abstraction function from the memory into a separation algebra [9]. Separation algebras formalize the intuition of combining disjoint parts of memory. They come with a zero (0) that describes the empty part, a disjointness predicate \(a\#b\) describing that the parts a and b do not overlap, and a disjoint union \(a+b\) that combines two disjoint parts. For the exact definition of a separation algebra, we refer to [9, 23]. We note that separation algebras naturally extend over functions and pairs in a pointwise manner.

Example 2

(Trivial separation algebra) Any type, extended with an explicit zero element, forms a separation algebra in which two parts are disjoint iff one of them is zero.


Intuitively, this separation algebra does not allow for combination of contents, except if one side is zero. While it is not very useful on its own, the trivial separation algebra is a useful building block for more complex separation algebras.
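A Haskell rendering of the idea, modelling the extension of a type with a zero element via Maybe (our modelling choice):

```haskell
-- A separation algebra: zero, disjointness (#), and disjoint union (+).
class SepAlgebra a where
  zero :: a
  disj :: a -> a -> Bool   -- a # b
  plus :: a -> a -> a      -- a + b; meaningful only for disjoint parts

-- Trivial separation algebra: contents combine only if one side is zero.
instance SepAlgebra (Maybe v) where
  zero = Nothing
  disj Nothing _ = True
  disj _ Nothing = True
  disj _ _       = False
  plus Nothing y = y
  plus x _       = x
```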

For our memory model, we define an abstraction function into a separation algebra whose elements consist of two parts: a map from addresses to the values stored there, used to reason about load and store operations, and a map from block indexes to the sizes of the corresponding blocks, used to ensure that one owns all addresses of a block when freeing it.

We continue to define a separation logic: assertions are predicates over separation algebra elements. The basic connectives are defined as follows:

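The connectives are the standard ones; reconstructed here in conventional notation to match the prose, with a ranging over abstract memories:

\begin{align*}
  \mathsf{false}\;a \;&\equiv\; \mathrm{False} &
  \mathsf{true}\;a  \;&\equiv\; \mathrm{True} \\
  \mathsf{emp}\;a   \;&\equiv\; a = 0 &
  (P * Q)\;a        \;&\equiv\; \exists a_1\,a_2.\;
     a = a_1 + a_2 \wedge a_1 \# a_2 \wedge P\;a_1 \wedge Q\;a_2
\end{align*}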

That is, the assertion false never holds, and the assertion true holds for all abstract memories. The empty assertion emp holds only for the zero memory, and the separating conjunction \(P*Q\) holds if the memory can be split into two disjoint parts, such that P holds for one, and Q holds for the other part. The lifting assertion \(\uparrow \!\Phi \) holds iff the Boolean value \(\Phi \) is true and the memory is empty:

\[ \uparrow \!\Phi \;\equiv\; \lambda a.\; a = 0 \wedge \Phi \]

It is used to lift plain logical statements into separation logic assertions owning no memory. When clear from the context, we omit the \(\uparrow \)-symbol, and just mix plain statements with separation logic assertions.

3.2 Weakest Preconditions and Hoare Triples

We define a weakest precondition predicate directly via the semantics:

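In the illustrative Haskell model from Sect. 2, this definition reads:

```haskell
-- Weakest precondition over the model semantics: m on memory mu must
-- not fail, and every possible outcome must satisfy the postcondition.
wp :: M r s a -> (a -> r -> s -> Bool) -> s -> Bool
wp m q mu = case runM m mu of
  Fail   -> False
  Res rs -> all (\(x, r, mu') -> q x r mu') rs
```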

That is, wp m Q \(\mu \) holds iff the program m run on memory \(\mu \) does not fail, and all possible results (return value, access report, new memory) satisfy the postcondition Q.

To set up a verification condition generator based on separation logic, we standardize the postcondition: the reported memory accesses must be disjoint from some abstract memory F, called the frame. We thus define a weakest precondition with frame: when executed on memory \(\mu \), the program does not fail, every return value and new memory satisfy the postcondition, and no memory described by the frame F is accessed.

Equipped with the weakest precondition with frame, we define Hoare triples. The precondition specifies that the abstract memory can be split into a part satisfying the precondition P and the given frame F. A Hoare triple then states that, for all memories and frames for which the precondition holds, the program will succeed without using any memory of the frame, and every result will satisfy the postcondition wrt. the original frame.

3.3 Verification Condition Generator

The verification condition generator is implemented as a proof tactic that works on weakest-precondition subgoals.


The tactic is guided by the syntax of the command. Basic monad combinators are broken down by corresponding rules.


For other instructions and user-defined functions, the VCG expects a Hoare-triple to be already proved, and applies it via a rule with two assumptions.


To process a command c, the first assumption is instantiated with the Hoare-triple for c, and the second assumption with the assertion for the current state. Then, a simple syntactic heuristic infers a frame and proves that the current assertion entails the required precondition and the frame. Finally, verification condition generation continues with the postcondition and the frame as the current assertion.

3.4 Hoare-Triples for Instructions

To use the VCG to verify LLVM programs, we have to prove Hoare triples for the LLVM instructions. For parallel calls, we prove the well-known disjoint concurrency rule [39]:

\[
\frac{\{P_1\}\;c_1\;\{Q_1\} \qquad \{P_2\}\;c_2\;\{Q_2\}}
     {\{P_1 * P_2\}\;c_1 \parallel c_2\;\{\lambda (r_1,r_2).\;Q_1\ r_1 * Q_2\ r_2\}}
\]

That is, commands with disjoint preconditions can be executed in parallel.

For memory operations, we prove corresponding rules. The block assertion states that a pointer p points to the beginning of a block of size n, and the range assertion describes that, for all \(i\in I\), \(p+i\) points to value f i. Intuitively, allocation creates a block of size n, initialized with the default value, and a tag. If one possesses both the whole block and the tag, the block can be deallocated by free. The rules for load and store are straightforward, using the points-to assertion that p points to value x.

4 Refinement for (Parallel) Programs

At this point, we have described a separation logic framework for parallel programs in LLVM. It is largely backwards compatible with the framework for sequential programs described in [29], such that we could easily port the algorithms formalized there to our new framework. The next step towards verifying complex programs is to set up a stepwise refinement framework. In this section we describe the refinement infrastructure of the Isabelle Refinement Framework.

4.1 Abstract Programs

Abstract programs are shallowly embedded into the nondeterminism-error (NE) monad (cf. Sect. 2.1). They are purely functional and have no notion of parallel execution. We define a refinement ordering \(\le \) on NE:

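For the executable NE sketch from Sect. 2.1, the ordering reads:

```haskell
-- Refinement: m refines m' iff m has fewer possible results, and may
-- only fail if m' may fail; Fail is the top element of the ordering.
refines :: Eq a => NE a -> NE a -> Bool
refines _        Fail     = True
refines Fail     _        = False
refines (Res xs) (Res ys) = all (`elem` ys) xs
```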

Intuitively, \(m \le m'\) means that m returns fewer possible results than \(m'\), and may only fail if \(m'\) may fail. Note that NE is a complete lattice, with fail as the top element.

We use refinement and assertions to specify that a program m satisfies a specification with precondition P and postcondition Q.


If the precondition is false, the specification is the trivial one (fail), and the statement trivially holds. Otherwise, m cannot fail, and every possible result x of m must satisfy Q.

For a detailed description of using the NE-monad for stepwise refinement based program verification, we refer the reader to [36].

Example 3

(Swapping multiple elements) We specify an operation to perform multiple swaps. It takes two disjoint sets of indexes and a list, and swaps each index of the first set with some index of the second. The precondition of this operation assumes that the index sets are in range, disjoint, and have the same cardinality.


The postcondition ensures that the resulting list is a permutation of the original list, that the elements at indexes outside the two sets are unchanged, and that each element of the first set is swapped with one of the second.


Here, permutation is expressed in the standard Isabelle way, as equality of the multisets of elements of the two lists.

As a sanity check, we prove that our specification is not vacuous, i.e., that for every input that satisfies the precondition, there exists an output that satisfies the postcondition.


Note that this is only a sanity-check lemma to detect problems early. Should we accidentally insert a vacuous specification here, we would not be able to prove refinement to an M-monad program later, as M-monad programs cannot be vacuous by the invariant of M.

In the NE monad, we then specify the corresponding program.


In Sect. 5.3 we will refine this specification to a parallel implementation in LLVM.

4.2 The Sepref Tool

The Sepref tool [26, 29] symbolically executes an abstract program in the NE-monad, keeping track of refinements of every abstract variable to a concrete representation, which may use pointers to dynamically allocated memory. During the symbolic execution, the tool synthesizes an Isabelle-LLVM program, together with a refinement proof. The synthesis is automatic, but requires annotations to the abstract program.

The main concept of the Sepref tool is refinement between an abstract program in the NE-monad and a concrete program in the M-monad, as expressed by a refinement predicate.


This predicate states that either the abstract program fails, or, for a memory described by the precondition assertion, the LLVM program succeeds with a result such that the new memory is described by the postcondition assertion for some possible result of the abstract program. Moreover, an additional pointer-equality condition holds for the concrete result. Note that the predicate trivially holds for a failing abstract program. This makes sense, as we prove that the abstract program does not fail anyway. It allows us to assume abstract assertions during the refinement proof.


Example 4

(Refinement of lists to arrays) We define abstract programs for indexing and updating a list.


These programs assert that the index is in bounds, and then return the accessed element (xs!i) or the updated list (xs[i := x]), respectively. We define an array assertion that links a pointer to the list of elements stored at the pointed-to location: for every \(i<|xs|\), \(p+i\) points to the ith element of xs. Assertions that relate concrete to abstract values are called refinement relations. If we want to emphasize that they depend on the heap, we also call them refinement assertions.

Indexing and updating of arrays are implemented via the Isabelle-LLVM instruction that offsets a pointer by an index. The abstract and concrete programs are linked by refinement theorems.


These state that, if the list is refined by an array pointer, and the natural number index by a fixed-width word, the indexing operation returns the same result as its abstract counterpart. The resulting memory will still contain the original array. Note that there is no explicit precondition that the array access is in bounds, as this follows already from the assertion in the abstract operation. The update operation returns a pointer to an array that refines the updated list. As the array is updated in place, the original refinement of the array is no longer valid. Moreover, the returned pointer is the same as the argument pointer. This information is important for refining to parallel programs on disjoint parts of an array (cf. Sect. 4.4).

To increase readability, we introduce an (almost) point-free notation for refinement theorems, in which the theorems for the array operations above can also be written.


In this notation, the first theorem simply states how the arguments and the result are refined. The second theorem adds the annotation \(\cdot ^d\) to the refinement of the array argument, indicating that this argument is destroyed, i.e., the refinement is no longer valid when the function returns. Moreover, it binds the array argument and the result to names, which are used in the pointer-equality predicate at the end, indicating that the result is the same pointer as the array argument.

Given refinement relations for the parameters, and refinement theorems for all operations in a program, the Sepref tool automatically synthesizes an LLVM program from an abstract program. The tool tries to automatically discharge additional proof obligations, typically arising from translating arithmetic operations from unbounded numbers to fixed-width numbers. Where automatic proof fails, the user has to add assertions to the abstract program to help the proof. The main difference of our tool wrt. the existing Sepref tool [29] is the additional pointer-equality condition on the concrete result. We have added a heuristic to automatically synthesize and discharge these equalities.

4.3 Modular Data Structure Development

The Refinement Framework allows us to build more complex data structures, using already existing ones as building blocks, and chaining together several refinements. We describe the development of an interval list data structure, which we need for our parallel partitioning algorithm (cf. Sect. 5.3).

A pair (l, h) of natural numbers can be used to represent the set \(\{l..<h\}\). We define a refinement relation between intervals (pairs) and sets. Moreover, we define operations for constructing an interval, testing if an interval is empty, intersection, and cardinality. We show that these operations refine the corresponding operations on sets:

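An illustrative Haskell sketch of these operations (over Int; for natural numbers or unsigned words, the guard in the cardinality is what prevents underflow):

```haskell
-- Half-open intervals (l, h) representing {l..<h}.
type Iv = (Int, Int)

mkIv :: Int -> Int -> Iv
mkIv l h = (l, h)                  -- no l <= h check needed

emptyIv :: Iv -> Bool
emptyIv (l, h) = h <= l            -- any (l, h) with h <= l is empty

interIv :: Iv -> Iv -> Iv
interIv (l1, h1) (l2, h2) = (max l1 l2, min h1 h2)

cardIv :: Iv -> Int
cardIv (l, h) = if h <= l then 0 else h - l   -- guard avoids underflow
```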

Note that any interval (l, h) with \(h \le l\) represents the empty set, so we do not enforce \(l\le h\) for our representation. Thus, no checks are needed for construction and intersection. However, we use a check to avoid underflow when computing the cardinality.

Analogously, we implement open intervals with a single number, and define operations to construct an open interval, and to intersect a closed and an open interval.


Next, we use Sepref to implement the natural numbers by fixed-width words. For example, given the definition of the intersection operation, and an annotation to implement natural numbers by 64-bit words, Sepref synthesizes the Isabelle-LLVM program and proves a corresponding refinement theorem.


We then define the refinement relation from words directly to sets as the composition of the two refinements (word to nat to set). With the help of Sepref’s FCOMP tool, we can automatically compose the refinement lemmas, for example those for the two intersection operations.


Thus, we obtain imperative implementations of the set operations. We proceed analogously for open intervals.

Next, we implement a set as the union of a list of non-empty, pairwise disjoint, and finite sets. While that seems to make little sense at first glance, we will later implement the sets in the list by intervals, and the list itself by dynamic arrays, to obtain an imperative interval list data structure. We define operations for constructing an empty set, emptiness test, disjoint union with a single set, cardinality, and a more specialized operation which splits off a non-empty set from the list.


We then refine the lists of sets to array lists (dynamic arrays) of intervals. Here, we use the refinement assertion from lists to the array-list data structure from the IRF collections library; as argument, it takes the refinement relation for the list elements. Again, Sepref automatically generates imperative implementations of the functions and proves the corresponding refinement lemmas. Combining them with the refinements to sets yields the desired imperative interval list data structure. For example, we obtain implementations for joining a single interval to the list and for splitting off an interval.


Note that we update the underlying dynamic array destructively, hence the \(\cdot ^d\) annotation to the argument refinements.

In a last step, we define some operations on (finite) sets, and use Sepref to directly refine them to arrays, without any explicit intermediate steps. For example, consider intersecting two finite sets.


It is straightforward to prove that this intersection algorithm returns \(s_1 \cap s_2\). Also, Sepref can implement \(s_1\) with a closed or open interval, and \(s_2\) with an interval array, yielding imperative intersection operations.


We have demonstrated one way of modularly developing an interval list data structure based on a dynamic array. By separating the actual intervals from the list data structure, the proofs about the interval list were independent of the interval implementation. This is a design choice, and a more direct design, e.g., with interval pairs as the intermediate data structure, is certainly possible.

4.4 Array Splitting

An important concept for parallel programs is to concurrently operate on disjoint parts of the memory, e.g., different slices of the same array. However, abstractly, arrays are just lists. They are updated by returning a new list, and there is no way to express that the new list is stored at the same address as the old list. Nevertheless, in order to refine a program that updates two disjoint slices of a list to one that updates disjoint parts of the array in place, we need to know that the result is stored in the same array as the input. This is handled by the pointer-equality condition of our refinement predicate. To indicate that operations shall be refined to disjoint parts of the same array, we introduce a split combinator for abstract programs.


Abstractly, this is an annotation that is inlined when proving the abstract program correct. However, Sepref will translate it to a corresponding concrete combinator on arrays.


The corresponding refinement theorem can be stated equivalently in point-free or in pointwise notation.

The refinement of the function argument requires an additional proof that the returned pointers are equal to the argument pointers. Sepref tries to prove this automatically, using its pointer-equality heuristic.

Splitting an array into two parts allows us to abstractly treat the array and its two parts just as lists, which simplifies the abstract proofs: the fact that the two parts come from the same array is only visible at a later refinement stage. However, while splitting an array into parts is adequate for many operations, it is not a workable abstraction for swapping multiple elements in parallel (cf. Example 3): while, in theory, we could split the array element-wise, this would incur a considerable proof burden.

A more elegant solution is to keep track of which elements of a list can be accessed already on the abstract level. To this end, we model the list as a list of option values, where None means that we cannot access the element. We start by defining functions to abstractly handle lists of option values. These functions work on the actual list and on the structure of the list, which is a list of Booleans indicating which elements we do not own. Making the structure an explicit concept simplifies abstract proofs, as, typically, the values in the list will change, while the structure is preserved. The following functions obtain the structure of a list, and determine if two structures are compatible, i.e., have the same lengths and own disjoint indexes.


Compatibility is expressed via the natural relator for lists: the two structures have the same length and, for each index, the flag in at least one of the lists is true, i.e., each index is owned by at most one of the lists.

We also define functions to split and join lists:

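An illustrative Haskell sketch of these functions over option lists (function names are ours):

```haskell
import Data.Maybe (isNothing)

-- The structure of a list: True marks indexes we do NOT own.
struct :: [Maybe a] -> [Bool]
struct = map isNothing

-- Compatible: same length, and ownership is disjoint per index.
compat :: [Maybe a] -> [Maybe a] -> Bool
compat xs ys =
  length xs == length ys && and (zipWith (||) (struct xs) (struct ys))

-- splitL s xs: own exactly the indexes in s that xs owns.
splitL :: [Int] -> [Maybe a] -> [Maybe a]
splitL s xs = [ if i `elem` s then x else Nothing | (i, x) <- zip [0 ..] xs ]

-- joinL: merge the elements of two compatible lists element-wise.
joinL :: [Maybe a] -> [Maybe a] -> [Maybe a]
joinL = zipWith pick
  where
    pick (Just x) _ = Just x
    pick _        y = y
```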

Here, the split function returns a list that owns exactly the indexes that are in the given set and owned by the argument list, and the join function merges the elements of two (compatible) lists. Both are defined via a helper that combines two lists element-wise, using a binary function.

Analogously to the array split combinator, we define a combinator that splits the list into the part with the indexes in a given set and the part without them, executes two operations on these parts, and joins the resulting lists.


On the concrete side, we define a refinement assertion between arrays and lists of options, which only owns those indexes of the array where the list element is not None.


We implement abstract operations for accessing the list, and show the corresponding refinement lemmas.


We also define conversion operations between plain lists and lists of option values.


These conversion operations are important to limit the proof overhead when using lists of option values: only where fine-grained ownership control is needed, we use option values. When we are done, and have reassembled all parts of the list, we convert it back to a plain list.

Finally, we define the concrete counterpart of this combinator and prove its refinement theorem.


Note that the set s of indexes does not have a concrete counterpart. It is a ghost variable that controls the split on the abstract level.

4.5 Refinement to Parallel Execution

Our abstract programs have no notion of parallel execution. To indicate that refinement to parallel execution is desired, we define an abstract annotation.


We prove a corresponding refinement rule.


This rule can be used to automatically parallelize any (independent) abstract computations. For convenience, we also define a sequential variant of the annotation: abstractly, it is the same, but Sepref translates it to sequential execution.

5 A Parallel Sorting Algorithm

To test the usability of our framework, we verify a parallel sorting algorithm. We start with the abstract specification of an algorithm that sorts a list, i.e., returns a sorted permutation of the original list. This is a standard specification of sorting in Isabelle, and easily proved equivalent to other, more explicit specifications.

Fig. 2: Abstract version of our parallel quicksort algorithm

Figure 2 shows our abstract parallel sorting algorithm. This algorithm is derived from the well-known quicksort and introsort algorithms [38]: like quicksort, it partitions the list (line 8), and then recursively sorts the partitions in parallel (l. 12). Like introsort, when the recursion gets too deep, or the list too short, we fall back to some (not yet specified) sequential sorting algorithm (l. 6). Similarly, when the partitioning is very unbalanced (l. 9), we sort the partitions sequentially (l. 11). These optimizations aim at not spawning threads for small sorting tasks, where the overhead of thread creation outweighs the advantages of parallel execution. A more technical aspect is the extra parameter that we introduce for the list length: it allows us to refine the list to just a pointer to an array, and still access the length.
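As the pseudocode of Fig. 2 cannot be reproduced here, the following sequential Haskell sketch models the same scheme; the pivot choice, thresholds, and the unbalancedness test are placeholders, not the verified versions:

```haskell
import Data.List (partition, sort)

-- Scheme of Fig. 2: partition, then sort the partitions (in parallel
-- in the verified algorithm); fall back to a sequential sort when the
-- recursion is too deep or the list too short, and sort sequentially
-- when the partitioning is very unbalanced.
psortModel :: Ord a => Int -> [a] -> [a]
psortModel depth xs
  | depth <= 0 || length xs < minParallel = sort xs   -- fallback sort
  | unbalanced = sort ls ++ [p] ++ sort rs            -- sort sequentially
  | otherwise  = psortModel (depth - 1) ls
                 ++ [p] ++ psortModel (depth - 1) rs  -- in parallel in Fig. 2
  where
    (p : rest)  = xs                 -- naive pivot; cf. Sect. 5.2
    (ls, rs)    = partition (< p) rest
    unbalanced  = min (length ls) (length rs) < length xs `div` 16
    minParallel = 10000              -- placeholder threshold
```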

Reusing our existing development of an abstract introsort algorithm [30], we prove with a few refinement steps that our parallel algorithm implements the sorting specification.


5.1 Implementation and Correctness Theorem

Next, we have to provide implementations for the fallback sorting algorithm and for the partitioner. These implementations must be proved to be in-place, i.e., to return a pointer to the same array. It was straightforward to amend our existing formalization [30] with the in-place proofs: once we had amended the refinement statements and fixed some bugs in the pointer-equality proving heuristic, the proofs were automatic.

Given these implementations, Sepref generates an LLVM program from the abstract algorithm, and proves a corresponding refinement lemma.


Combining this with the correctness lemma of the abstract algorithm, and unfolding the sorting specification, we prove the following Hoare-triple for our final implementation:

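In conventional separation-logic notation, the statement has roughly the following shape; the assertion and function names are approximations of the Isabelle identifiers:

\[
\{\, \mathit{array}\ xs\ p * \uparrow (n = |xs|) \,\}\quad
\mathit{sort}\ p\ n\quad
\{\, \lambda r.\ \uparrow (r = p) * \exists xs'.\ \mathit{array}\ xs'\ p *
  \uparrow (\mathrm{sorted}\ xs' \wedge \mathrm{mset}\ xs' = \mathrm{mset}\ xs) \,\}
\]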

That is, for a pointer p to an array whose content is described by the list xs, and a fixed-size word n that must be the number of elements in xs, our sorting algorithm returns the original pointer p, and the array content now is a list that is sorted and a permutation of xs. Note that this statement uses our semantically defined Hoare triples (cf. Sect. 3.2). In particular, it does not depend on the refinement steps, the Sepref tool, or the VCG.

5.2 Sampling Pivot Selection

While we could simply re-use the existing partitioning algorithm from the pdqsort formalization, which uses a pseudomedian-of-nine pivot selection, we observe that the quality of the pivot is particularly important for a balanced parallelization. Moreover, the partitioning in the parallel procedure is only done for arrays above a fairly large size threshold. Thus, we can invest a little more work to find a good pivot, which is still negligible compared to the cost of sorting the resulting partitions. We choose a sampling approach, using the median of 64 equidistant samples as pivot. We simply use quicksort to find the index of the pivot:

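An illustrative Haskell sketch of this pivot selection (the library sort stands in for the verified quicksort):

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Sampling pivot selection: take up to 64 equidistant sample indexes,
-- sort them by the elements they point to, and return the index of
-- the median sample. Precondition: the list is non-empty.
samplePivotIdx :: Ord a => [a] -> Int
samplePivotIdx xs = sorted !! (length sorted `div` 2)
  where
    n      = length xs
    step   = max 1 (n `div` 64)
    idxs   = take 64 [0, step .. n - 1]   -- equidistant sample indexes
    sorted = sortBy (comparing (xs !!)) idxs
```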

Proving that this algorithm finds a valid pivot index is straightforward. More challenging is to refine it to purely imperative LLVM code, which does not support closures like the comparison function that captures the list.

We resolve such closures over the comparison function manually: using Isabelle’s locale mechanism [22], we parametrize over the comparison function, and thread through an extra parameter for the data captured by the closure.


This defines a context in which we have an abstract compare function on the abstract elements. It takes an extra parameter (e.g., the list xs), and forms a weak ordering. Note that the strict compare function also induces a non-strict version. Moreover, we have a concrete implementation of the compare function, wrt. refinement assertions for the parameter and for the elements.

Our sorting algorithms are developed and verified in the context of this locale (to avoid confusion, our presentation has, up to now, elided the extra parameter). To get an actual sorting algorithm, we instantiate the locale with an abstract and a concrete compare function, proving that the abstract function is a weak ordering, and that the concrete function refines the abstract one. For our example of sorting indexes into an array, where the array elements themselves are compared by a given function, we obtain a corresponding instantiation.


This instantiates the generic sorting algorithms defined in the locale to compare indexes via the array elements, taking the array as an extra parameter. To sort our list of sample indexes, we use the corresponding instantiation of the introsort algorithm.

5.3 Parallel Partitioning

While our parallel quicksort scheme parallelizes the sorting, partitioning is still a bottleneck: before the first thread is even spawned, the whole array needs to be partitioned. On the next recursion level, only two partitionings can run in parallel, and so on. That is, initially, most processors will be idle. To address this, the partitioning itself can be parallelized. The parallel partitioning algorithms used in the latest research on practically efficient sorting algorithms [2] are branchless k-way algorithms, which use atomic operations to orchestrate the parallel threads. In contrast, we only verify a 2-way partitioning algorithm that uses parallel calls as its only synchronization mechanism. This is a compromise between verification effort and efficiency, taking into account the features currently supported by Isabelle-LLVM. The idea of our parallel partitioning algorithm is sketched in Fig. 3.

Fig. 3: Illustration of the phases of our parallel partitioning algorithm, after a pivot p has been picked. Step 1 splits the array into slices. Step 2 partitions the slices in parallel. Step 3 computes the start index m of the right partition as the sum of the sizes of all left partitions. It then determines the indexes of misplaced elements: the set bs contains right-partition elements that are left of m, and the set ss contains left-partition elements that are right of m. Step 4 then swaps the misplaced elements in parallel. While Steps 1 and 3 are computationally cheap, Steps 2 and 4 do the main part of the work in parallel
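As a sequential, illustrative Haskell model of these four phases (slice count k ≥ 1; all names ours):

```haskell
import Data.List (partition)
import qualified Data.Map as Map

-- The four phases, sequentially: returns the partitioned list and the
-- start index m of the right partition. Requires k >= 1 slices.
partitionModel :: Ord a => Int -> a -> [a] -> ([a], Int)
partitionModel k p xs = (zipWith place [0 ..] xs', m)
  where
    -- Step 1: split into slices; Step 2: partition each slice
    -- (in parallel in the verified algorithm).
    slices = chunks (max 1 (length xs `div` k)) xs
    parts  = map (partition (< p)) slices
    xs'    = concatMap (\(l, r) -> l ++ r) parts
    -- Step 3: m is the total size of the left partitions; bs and ss
    -- are the misplaced indexes (cf. Fig. 3).
    m  = sum [ length l | (l, _) <- parts ]
    bs = [ i | (i, x) <- zip [0 ..] xs', i <  m, x >= p ]
    ss = [ i | (i, x) <- zip [0 ..] xs', i >= m, x <  p ]
    -- Step 4: swap the misplaced elements pairwise (|bs| == |ss|).
    swaps     = Map.fromList (zip bs ss ++ zip ss bs)
    place i x = maybe x (xs' !!) (Map.lookup i swaps)
    chunks _ [] = []
    chunks c ys = let (h, t) = splitAt c ys in h : chunks c t
```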

We specify our algorithm on sets of indexes, and then refine it to intervals and interval arrays (cf. Sect. 4.3). First, we specify Steps 1 and 2, i.e., returning a permutation of the list, along with the set of indexes that belong to a left partition (the blue (\(\le p\)) intervals in Fig. 3) and the set of indexes that belong to a right partition (the red (\(\ge p\)) intervals).


The whole partitioning algorithm is specified accordingly, and a straightforward proof, only using arguments on lists and sets of indexes, shows that it partitions the list.

The set operations in Step 3 are implemented by the operations of our interval arrays (cf. Sect. 4.3), using simple set-algebraic equalities. Thus, we are only missing implementations for the parallel slice partitioning (Steps 1 and 2) and for the parallel swapping (Step 4).

The refinement of the parallel slice partitioning is similar to that of the parallel sorting algorithm: using the array split combinators from Sect. 4.4, we split the list into slices that we then partition with a sequential algorithm.

The parallel swapping algorithm is refined as follows.


The main procedure first checks if there are any indexes to swap. Then, it converts the plain list to a list of option values, invokes the actual parallel swapping procedure, and converts the result back to a plain list.

The parallel swapping procedure first splits off equally sized, non-empty sets from the two index sets, and then swaps these in parallel with the rest. Here, the swap operation is lifted to lists of option values, and an assertion ensures that it owns the necessary elements.

We prove that our algorithm is correct, and use Sepref, a straightforward implementation of sequential swapping, and our interval array implementation to refine it to efficient imperative Isabelle-LLVM code.

Combining all refinements in this section gives us a parallel partitioning algorithm. When we wanted to show that it satisfies the specification as required by our parallel sorting algorithm, we discovered that we also need to prove that neither partition can be empty. While this is certainly possible along the same lines as it is proved for sequential partitioning, we chose a pragmatic solution here: we dynamically check for the extreme case of one partition being empty, and fix that with an additional swap. The runtime impact of this check is negligible, but it greatly simplifies the correctness proof.

5.4 Code Generation

Finally, we instantiate the sorting algorithms to sort unsigned integers and strings.


Here, we use a version of the locale without the extra parameter to the compare function. This yields sorting implementations for unsigned integers and for strings, and instantiated versions of the correctness theorem.

We then use our code generator to generate actual LLVM text, as well as a C header file with the signatures of the generated functions.


The code generator checks that the specified C signatures are compatible with the actual types, and then generates the LLVM file and the header file, which can be used in a standard C/C++ toolchain.

5.5 Benchmarks

Fig. 4: Runtimes in milliseconds for sorting various distributions of \(10^8\) unsigned 64-bit integers and \(10^7\) strings with our verified parallel sorting algorithm, C++’s standard parallel sorting algorithm, and Boost’s parallel sample sort algorithm. The experiments were performed on a server machine with 22 AMD Opteron 6176 cores and 128 GiB of RAM, and a laptop with a 6-core (12 threads) i7-10750H CPU and 32 GiB of RAM

Fig. 5: Speedup of the various implementations for sorting \(10^8\) integers and \(10^7\) strings with a random distribution. The x-axis ranges over the number of cores, and the y-axis gives the speedup wrt. the same implementation run on only one core. The thin black lines indicate linear speedup

Fig. 6: Runtimes for sorting small arrays of randomly distributed integers and strings. The y-axis shows the runtime in milliseconds, and the x-axis the array size. Note that both axes are logarithmic

We have benchmarked our verified sorting algorithm against a direct implementation of the same algorithm in C++. The result was that both implementations have the same runtime, up to some minor noise. This indicates that there is no systemic slowdown: algorithms verified with our framework run as fast as their unverified counterparts implemented in C++.

We also benchmarked against the old verified algorithm with sequential partitioning from our ITP 2022 paper [31], as well as against the state-of-the-art parallel implementations from the GNU C++ standard library [42] (via execution policies) and from the Boost C++ libraries [5, 6]. We benchmarked the algorithms on two different machines, and on various input distributions. The results are shown in Fig. 4: our verified algorithm is clearly competitive with the unverified state-of-the-art implementations. Only for a few string-sorting benchmarks is it slightly slower. We leave improving on this for future work. Compared to the old verified algorithm, the parallel partitioning algorithm is more efficient in many cases, and only slightly less efficient in a few.

We also measured the speedup that the implementations achieve for a certain number of cores. The results are displayed in Fig. 5: again, our verified implementation is clearly competitive, and it scales better than the old verified algorithm.

The previous two benchmarks used relatively large input sizes. Figure 6 displays a benchmark for smaller input sizes. While we are still competitive with one of the reference implementations, the other is clearly faster for small arrays. This result is expected: our parallelization uses hard-coded thresholds to switch to sequential algorithms, which are independent of the number of available processors and the input size. We leave fine-tuning of the parallelization scheme to future work.

6 Conclusions

We have presented a stepwise refinement approach to verify total correctness of efficient parallel algorithms. Our approach targets LLVM as back end, and there is no systemic efficiency loss in our approach when compared to unverified algorithms implemented in C++.

The trusted code base of our approach is relatively small: apart from Isabelle’s inference kernel, it contains our shallow embedding of a small fragment of the LLVM semantics, and the code generator. All other tools that we used, e.g., our Hoare logic, the Sepref tool, and the Refinement Framework for abstract programs, ultimately prove a correctness theorem that only depends on our shallowly embedded semantics.

As a case study, we have implemented a parallel sorting algorithm. It uses an existing verified sequential pdqsort algorithm as a building block, and is competitive with state-of-the-art parallel sorting algorithms.

The main idea of our parallel extension is to shallowly embed the semantics of a parallel combinator into a sequential semantics, by making the semantics report the accessed memory locations, and fail if there is a potential data race. We only needed to change the lower levels of our existing framework for sequential LLVM [29]. Higher-level tools like the VCG and Sepref remained largely unchanged and backwards compatible. This greatly simplified the reuse of existing verification projects, like the sequential pdqsort algorithm [30].

While the verification of our sorting algorithms uses a top-down approach, we actually started with implementing, benchmarking, and fine-tuning the algorithms in C++. This gave us a quick way to find an algorithm that is efficient and, at the same time, not too complex for verification, without having to verify each intermediate step towards this algorithm. Only then did we use our top-down approach to first formalize the abstract ideas behind the algorithm, and then refine it to an efficient implementation close to what we had written in C++. At this point, one may ask why we did not directly verify the C++ implementation. While this might be possible, the required work and steps would be similar: to manage the complexity of such a verification, several bottom-up refinement steps would be necessary, ultimately arriving at something similarly abstract as our initial abstract algorithm.

6.1 Development Effort

The Isabelle Refinement Framework for LLVM consists of roughly 45 thousand lines (kLOC) of theory text and Isabelle-ML code. The sorting algorithms comprise another 14 kLOC.

Integrating the initial version of the parallel semantics into the framework, and verifying the parallel sorting algorithm with sequential partitioning, as reported in our ITP 2022 paper [31], took us roughly 6 months. During this time, we abandoned an initial formalization for the Imperative/HOL back end, as the achievable performance was unsatisfactory. We also abandoned a naive formalization of the parallel operator for the LLVM back end, to finally arrive at what is described in our ITP 2022 paper. Effectively, we added about 6 kLOC to the framework and adapted most of the remaining code. The initial formalization of the parallel sorting algorithm required roughly 1 kLOC. The parallel partitioning algorithm and its support data structures add 4 kLOC, and took us one month to develop.

6.2 Related Work

While there is extensive work on parallel sorting algorithms (e.g., [1, 2, 10]), there seems to be almost no work on their formal verification. The only work we are aware of is a distributed merge sort algorithm [17], for which “no effort has been made to make it efficient” [17, Sect. 2], nor has any executable code been generated or benchmarked. Another verification [40] uses the VerCors deductive verifier to prove the permutation property of odd-even transposition sort [14], but neither the sortedness property nor termination.

Concurrent separation logic is used by many verification tools such as VerCors [4], and also formalized in proof assistants, for example in the VST [43] and IRIS [21] projects for Coq [3]. These formalizations contain elaborate concepts to reason about communication between threads via shared memory, and are typically used to verify partial correctness of subtle concurrent algorithms (e.g. [37]). Reasoning about total correctness is more complicated in the step-indexed separation logic provided by IRIS, and currently only supported for sequential programs [41]. Our approach is less expressive, but naturally supports total correctness, and is already sufficient for many practically relevant parallel algorithms like sorting, matrix-multiplication, or parallel algorithms from the C++ STL.

6.3 Future Work

An obvious next step is to implement a fractional separation logic [7], to reason about parallel threads that share read-only memory. While our semantics already supports shared read-only memory, our separation logic does not. We believe that implementing a fractional separation logic will be straightforward, and mainly pose technical issues for automatic frame inference.

Extending our approach towards more advanced synchronization like locks or atomic operations may be possible: instead of accessed memory addresses, a thread could report a set of possible traces, which are checked for race-freedom and then combined. Moreover, our framework currently targets multicore CPUs. Another important architecture is general-purpose GPUs. As LLVM is also available for GPUs, porting our framework to this architecture should be possible. We even expect that we can model barrier synchronization, which is important in the GPU context.

Finally, the Sepref framework has recently been extended to reason about complexity of (sequential) LLVM programs [15, 16]. This could be combined with our parallel extension, to verify the complexity (e.g. work and span) of parallel algorithms.

An important aspect is the scalability of our tools. The current implementation scales to small software systems, like the verified IsaSAT solver [12], which actually uses our sequential sorting algorithms. However, for projects of this size, the times needed by Sepref and the LLVM code exporter easily reach the order of dozens of minutes, and some manual performance tweaks are required (cf. [12, Sect. 5]). Further improvement of the scalability is left to future work.

Another direction for future work is to further optimize our verified sorting algorithm. We expect that tuning our parallelization scheme will improve the speedup for smaller input sizes. Also, while we are competitive with standard library implementations, recent research indicates that there is still some room for improvement, for example with the IPS\(^4\)o algorithm [2]. While this algorithm uses atomic operations in one place, many of its other optimizations, for example branchless decision trees for multi-way partitioning, only require features already supported by our framework.