Abstract
We present a stepwise refinement approach to develop verified parallel algorithms, down to efficient LLVM code. The resulting algorithms’ performance is competitive with their counterparts implemented in C++. Our approach is backwards compatible with the Isabelle Refinement Framework, such that existing sequential formalizations can easily be adapted or re-used. As a case study, we verify a parallel quicksort algorithm that is competitive with unverified state-of-the-art algorithms.
1 Introduction
We present a stepwise refinement approach to develop verified and efficient parallel algorithms. Our method can verify total correctness down to LLVM intermediate code. The resulting verified implementations are competitive with state-of-the-art unverified implementations. Our approach is backwards compatible with the Isabelle Refinement Framework (IRF) [29], a powerful tool to verify efficient sequential software, such as model checkers [8, 11, 44], SAT solvers [13, 27, 28], or graph algorithms [25, 34, 35]. This paper adds parallel execution to the IRF’s toolbox, without invalidating the existing formalizations, which can now be used as sequential building blocks for parallel algorithms, or be modified to add parallelization.
As a case study, we verify total correctness of a parallel quicksort algorithm, re-using an existing verification of state-of-the-art sequential sorting algorithms [30]. Our verified parallel sorting algorithm is competitive with state-of-the-art parallel sorting algorithms from GNU’s C++ standard library and the Boost C++ Libraries.
This paper is an extended version of our ITP 2022 paper [31]. The main new contribution is a verified parallel partitioning algorithm, which significantly improves the efficiency and scalability of our sorting algorithm from [31]. To this end, we added a description of the interval list data structure (Sect. 4.3) and of a new combinator (Sect. 4.4), which are required by the parallel partitioner. The parallel partitioning algorithm itself is described in Sect. 5.3, and Sect. 5.5 contains updated and more extensive benchmarks.
Isabelle LLVM is hosted at https://lammich.github.io/isabelle_llvm/index.html. The version described in this paper has been archived [32].
1.1 Overview
This paper is based on the Isabelle Refinement Framework, a continuing effort to verify efficient implementations of complex algorithms, using stepwise refinement techniques [15, 24, 26, 29, 33, 36]. Figure 1 displays the components of the Isabelle Refinement Framework.
The back end layer handles the translation from Isabelle/HOL to the actual target language. The instructions of the target language are shallowly embedded into Isabelle/HOL, using a state-error (SE) monad. An instruction with undefined behaviour, or behaviour outside our supported fragment, raises an error. The state of the monad is the memory, represented via a memory model. The code generator translates the instructions to actual code. These components form the trusted code base, while all the remaining components of the Isabelle Refinement Framework generate proofs. In the back-end, the preprocessor transforms expressions to the syntactically restricted format required by the code generator, proving semantic equality of the original and transformed expression. While there exist back ends for purely functional code [24, 36], and sequential imperative code [26, 29], this paper describes a back end for parallel imperative LLVM code (Sect. 2).
On top of the back-end, we formalize a concurrent separation logic [39] and implement a verification condition generator (VCG), cf. Sect. 3.
At the level of the program logic and VCG, our framework can be used to verify simple low-level algorithms and data structures, like dynamic arrays and linked lists. More complex developments typically use a stepwise refinement approach, starting at purely functional programs modelled in a nondeterminism-error (NE) monad [36]. A semi-automatic refinement procedure (Sepref [26, 29]) translates from the purely functional code to imperative code, refining abstract functional data types to concrete imperative ones. In Sect. 4, we describe our extensions to support refinement to parallel executions, and a fine-grained tracking of pointer equalities, required to parallelize computations that work on disjoint parts of the same array.
Using our approach, complex algorithms and data structures can be developed and refined to optimized efficient code. The stepwise refinement ensures a separation of concerns between high-level algorithmic ideas and low-level optimizations. We have already used this approach to verify a wide range of practically efficient sequential algorithms [8, 11, 13, 25, 27, 28, 30, 34, 35, 44]. In Sect. 5, we use our extended techniques to verify a parallel sorting algorithm, with competitive performance wrt. unverified state-of-the-art algorithms.
Section 6 concludes the paper and discusses related and future work.
1.2 Notation
Formal statements in this paper correspond to theorems proved in our Isabelle/HOL formalization, though sometimes simplified to improve the clarity of the presentation.
We mainly use Isabelle/HOL notation, with some shortcuts and adaptations for presentation in a paper. In this section, we give examples of the more unusual notations: for implication, we always write ⟹, e.g., A ⟹ B ⟹ C (in Isabelle, one must use ⟦A; B⟧ ⟹ C here). Free variables are universally quantified at the top-level. Type variables are α, β, …, and function types are written curried as α ⇒ β ⇒ γ. Types can be annotated to any term, e.g., t :: α. Function application is written as f x y, and function update is f(x := y). For definitions, we use ≡, e.g., f x ≡ x + 1.
Algebraic datatypes are written as datatype α option = None | Some (the: α). This also defines the selector function the :: α option ⇒ α. Values of the tuple type α × β are written as (a, b). Lists have type α list, and we write [] for the empty list, x # xs for the list with head x and tail xs, and xs @ ys for list concatenation. We also use the list notation [x₁, …, xₙ]. The length of list xs is |xs|.
The empty set is {}. The set {l..&lt;h} contains all elements between l inclusive and h exclusive, and the set {l..} contains all elements greater than or equal to l. Disjoint union is written as s₁ ⊎ s₂, i.e., the union s₁ ∪ s₂ under the implicit constraint that s₁ and s₂ are disjoint. We only write ⊎ if it is clear from the context where the disjointness constraint belongs to. Otherwise, we will explicitly write, e.g., s₁ ∩ s₂ = {}. The cardinality of a finite set s is |s|.
Lambda abstraction is written as λx. t. When clear from the context, we omit the λ. We also use Haskell-like sections for infix operators, e.g., (+ 1) for λx. x + 1, and underscores to indicate the parameter positions, e.g., f _ y for λx. f x y.
2 A Back End for LLVM with Parallel Execution
We formalize a semantics for parallel execution, shallowly embedded into Isabelle/HOL. As for the existing sequential back ends [26, 29], the shallow embedding is key to the flexibility and feasibility of the approach. The main idea is to make an execution report its accessed memory, and use this information to raise an error when joining executions that would have exhibited a data race. We use this to model an instruction that calls two functions in parallel, and waits until both have returned.
2.1 State-nondeterminism-error Monad with Access Reports
We define the underlying monad in two steps. We start with a nondeterminism-error monad, and then lift it to a state monad and add access reports. Defining a nondeterminism-error monad is straightforward in Isabelle/HOL:
![figure an](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figan_HTML.png)
A program either fails, or yields a set of possible results, described by its characteristic function. The return operation yields exactly one result, and bind combines all possible results, failing if there is a possibility to fail. We use the notation x ← m; f x for bind m f, and m₁; m₂ when the result of m₁ is not used.
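To make this construction concrete, the nondeterminism-error monad can be modelled executably. The following Python sketch (our own model, not the Isabelle formalization) represents failure by a distinguished value and a non-failing program by its set of possible results:

```python
# Illustrative Python model of the nondeterminism-error (NE) monad.
# FAIL is a distinguished value; a non-failing program is a (finite)
# set of possible results.

FAIL = "FAIL"

def ret(x):
    # return: exactly one possible result
    return {x}

def bind(m, f):
    # bind: combine all possible results; fail if any path can fail
    if m == FAIL:
        return FAIL
    results = set()
    for x in m:
        mx = f(x)
        if mx == FAIL:
            return FAIL
        results |= mx
    return results

# Example: nondeterministically pick 1 or 2, then double the result.
prog = bind({1, 2}, lambda x: ret(2 * x))
```

Note how bind propagates failure eagerly: if any possible execution path can fail, the composed program fails, matching the definition above.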
Now assume that we have a state (memory) type, and an access report type, which forms a monoid (with addition + and neutral element 0). With this, we define our state-nondeterminism-error monad with access reports, just called M for brevity:
![figure ba](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figba_HTML.png)
Here, return does not change the state, and reports no accesses (0), and bind sequentially composes the executions, threading through the state \(\mu \) and adding up the access reports r₁ and r₂. Note that we use the names r, r₁, … for reports, μ, μ₁, … for memories, and m, m₁, … for monad values.
Typically, the access report will contain read and written addresses, such that data races can be detected. Moreover, if parallel executions can allocate memory, we must detect those executions where the memory manager allocated the same block in both parallel strands. As we assume a thread-safe memory manager, those infeasible executions can safely be ignored. Let there be symmetric predicates for feasibility and for data races, and let combine be a commutative operator to combine two pairs of access reports and states. Then, we define a parallel composition operator for M:
![figure bn](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figbn_HTML.png)
Here, we use the feasibility check to ignore infeasible executions, and the race check to fail on data races. Note that, if one parallel strand fails, and the other parallel strand has no possible results (the empty result set), the behaviour of the parallel composition is not clear. For this reason, we fix an invariant, which implies that every non-failing execution has at least one possible result. We define the actual type M as the subtype satisfying this invariant. Thus, we have to prove that every combinator and instruction of our semantics preserves the invariant, which is an important sanity check. As additional sanity check, we prove symmetry of parallel composition:
![figure bu](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figbu_HTML.png)
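The M monad and its parallel composition can likewise be sketched in Python. The model below is simplified and uses our own naming: an access report is just a pair (reads, writes) of address sets and allocation is omitted, so the feasibility check disappears and only the race check remains:

```python
# Illustrative Python model of the M monad with access reports, and of
# parallel composition (simplified: a report is a pair (reads, writes)
# of address sets; memory is a dict; allocation is omitted).

FAIL = "FAIL"

def load(a):
    def run(mem):
        if a not in mem:
            return FAIL                                   # invalid address
        return [(mem[a], (frozenset({a}), frozenset()), mem)]
    return run

def store(a, v):
    def run(mem):
        if a not in mem:
            return FAIL
        mem2 = dict(mem); mem2[a] = v
        return [(None, (frozenset(), frozenset({a})), mem2)]
    return run

def bind(m, f):
    def run(mem):
        res = m(mem)
        if res == FAIL:
            return FAIL
        out = []
        for x, (r1, w1), mem1 in res:
            res2 = f(x)(mem1)
            if res2 == FAIL:
                return FAIL
            out += [(y, (r1 | r2, w1 | w2), mem2) for y, (r2, w2), mem2 in res2]
        return out
    return run

def race(acc1, acc2):
    # a data race: one strand writes what the other reads or writes
    (r1, w1), (r2, w2) = acc1, acc2
    return bool((w1 & (r2 | w2)) | (w2 & r1))

def par(m1, m2):
    # both strands run on the same initial memory; the final memory takes
    # each strand's writes, and any data race makes the composition fail
    def run(mem):
        res1, res2 = m1(mem), m2(mem)
        if res1 == FAIL or res2 == FAIL:
            return FAIL
        out = []
        for x, acc1, mem1 in res1:
            for y, acc2, mem2 in res2:
                if race(acc1, acc2):
                    return FAIL
                comb = dict(mem)
                comb.update({a: mem1[a] for a in acc1[1]})
                comb.update({a: mem2[a] for a in acc2[1]})
                out.append(((x, y), (acc1[0] | acc2[0], acc1[1] | acc2[1]), comb))
        return out
    return run
```

For instance, two parallel stores to different addresses combine cleanly, while a store racing with a load of the same address fails; two concurrent loads are race free.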
2.2 Memory Model
Our memory model supports blocks of values, where values can be integers, structures, or pointers into a block:
![figure bv](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figbv_HTML.png)
A block is either fresh, freed, or allocated, and a memory is a mapping from block indexes to blocks, such that only finitely many blocks are not fresh. Every block’s state transitions from fresh to allocated to freed. This avoids ever reusing the same block, and thus allows us to semantically detect use after free errors. Every program execution can only allocate finitely many blocks, such that we will never run out of fresh blocks (Footnote 1). An allocated block contains an array of values, modelled as a list. Thus, an address consists of a block index b, and an index i into the block’s array.
To access and modify memory, we define functions to load the value at an address, to store a value at an address, and to query a block’s size:
![figure cd](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figcd_HTML.png)
where |xs| is the length of list xs, xs ! i returns the ith element of list xs, and xs[i := x] replaces the ith element of xs by x.
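The block-based memory model can be sketched executably as follows (a Python model with our own naming). Note how freed blocks are kept in the memory, so use-after-free is detected rather than a block being silently reused:

```python
# Illustrative Python model of the block-based memory: every block is
# fresh, allocated (with a list of values), or freed; blocks are never
# reused. Errors are modelled by returning None.

FRESH, ALLOC, FREED = "fresh", "alloc", "freed"

def new_memory():
    return {}                       # block index -> (state, values)

def alloc(mem, values):
    b = len(mem)                    # pick a fresh block index (never reused)
    mem2 = dict(mem); mem2[b] = (ALLOC, list(values))
    return b, mem2

def free(mem, b):
    if mem.get(b, (FRESH, None))[0] != ALLOC:
        return None                 # double free / invalid free: error
    mem2 = dict(mem); mem2[b] = (FREED, None)
    return mem2

def load(mem, addr):
    b, i = addr                     # address = (block index, index into block)
    state, vals = mem.get(b, (FRESH, None))
    if state != ALLOC or not (0 <= i < len(vals)):
        return None                 # use after free or out of bounds: error
    return vals[i]

def store(mem, addr, x):
    b, i = addr
    state, vals = mem.get(b, (FRESH, None))
    if state != ALLOC or not (0 <= i < len(vals)):
        return None
    vals2 = list(vals); vals2[i] = x
    mem2 = dict(mem); mem2[b] = (ALLOC, vals2)
    return mem2
```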
Note that our LLVM semantics does not support conversion of pointers to integers, nor comparison or difference of pointers to different blocks. This way, a program cannot see the internal representation of a pointer, and we can choose a simple abstract representation, while being faithful wrt. any actual memory manager implementation.
2.3 Access Reports
We now fix the state of the M-monad to be memory, and the access reports to be tuples (r, w, a, f) of read and written addresses, as well as sets of allocated and freed blocks:
![figure co](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figco_HTML.png)
Two parallel executions are feasible if they did not allocate the same block. They have a data race if one execution accesses addresses or blocks modified by the other:
![figure cp](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figcp_HTML.png)
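These two checks can be illustrated by an executable sketch (with our own naming), over access reports modelled as tuples (reads, writes, allocated, freed):

```python
# Executable sketch of the feasibility and race checks. An access report
# is a tuple (reads, writes, allocated, freed), where reads and writes
# are sets of addresses (block, index), and allocated and freed are sets
# of block indexes.

def feasible(acc1, acc2):
    # the two strands must not have allocated the same block
    return not (acc1[2] & acc2[2])

def accesses_modified(a, b):
    # does strand a access an address or block modified by strand b?
    accessed = a[0] | a[1]                    # addresses read or written by a
    if accessed & b[1]:                       # ... also written by b
        return True
    mod_blocks = b[2] | b[3]                  # blocks allocated or freed by b
    return any(blk in mod_blocks for (blk, _i) in accessed)

def race(acc1, acc2):
    return accesses_modified(acc1, acc2) or accesses_modified(acc2, acc1)
```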
The combine function joins the access reports and memories, preferring allocated over fresh, and freed over allocated memory. When joining two allocated blocks, the written addresses from the access report are used to join the blocks. We skip the rather technical definition of combine, and just state the relevant properties: Let r₁ and r₂ be feasible and race-free access reports, and let μ₁ and μ₂ be memories that have evolved from a common memory μ, consistently with the access reports r₁ and r₂. Let (r′, μ′) = combine (r₁, μ₁) (r₂, μ₂). Then
![figure cw](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figcw_HTML.png)
Moreover, for all valid addresses of the combined memory μ′:
![figure cz](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figcz_HTML.png)
The properties (1)–(3) define the state of blocks in the combined memory: a fresh block in μ′ was fresh already in μ, and has not been allocated (1); an allocated block was already allocated or has been allocated, but has not been freed (2); and a freed block was already freed, or has been freed (3). The properties (4)–(6) define the content: addresses written or allocated in the first or second execution get their content from μ₁ (4) or μ₂ (5) respectively. Addresses not written nor allocated at all keep their original content (6).
2.4 The Interface of the M-monad
The invariant for M states that blocks transition only from fresh to allocated to freed, allocated blocks never change their size, and the access report matches the observable state change. It also states that for each finite set of blocks B, there is an execution that does not allocate blocks from B. The latter is required to show that we always find feasible parallel executions:
![figure dg](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figdg_HTML.png)
To define functions in the M-monad, we have to show that they satisfy this invariant. For return and bind, this is straightforward. The proof for the parallel operator is slightly more involved, using the properties of combine, and the invariant for the operands to obtain a feasible parallel execution.
Moreover, the M monad provides memory management functions for allocation, deallocation, loading, and storing, as well as a function that checks if a given address is valid, which is used to check if pointer arithmetic can be performed on that address. Currently, the validity check behaves like loading from that address; in particular, it does not support pointers one past the end of an allocated block. We leave integration of such pointers to future work.
Example 1
(Memory allocation) To define memory allocation in the M monad, we first define the allocation function in the underlying nondeterminism-error monad:
![figure dq](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figdq_HTML.png)
This function selects an arbitrary fresh block b, and initializes it with the given list of values. It returns the allocated block, an access report for the allocation, and the updated memory.
We then show that the allocation function satisfies the invariant of M: we correctly report the allocated block. Moreover, we can select any fresh block. As our memory model guarantees an infinite supply of fresh blocks, any finite set of blocks can be avoided.
![figure du](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figdu_HTML.png)
Finally, we define the corresponding function in the M monad, using Isabelle’s lifting and transfer package [18]:
![figure dv](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figdv_HTML.png)
The other memory management functions are defined analogously.
2.5 LLVM Instructions
Based on the M-monad, we define shallowly embedded LLVM instructions. For most instructions, this is analogous to the sequential case [29]. Additionally, we define an instruction for a parallel function call:
![figure dw](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figdw_HTML.png)
The code generator only accepts this instruction if the two functions are constants (i.e., function names). It then generates some type-casting boilerplate, and a call to an external parallel-invocation function, which we implement using the Threading Building Blocks [19] library:
![figure ea](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figea_HTML.png)
i.e., the two functions are called in parallel. The generated boilerplate code sets up the argument pointers to point to both the actual arguments and space for the results.
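The pattern followed by the generated code can be sketched with the Python standard library. This is an illustration of the parallel-call pattern only, not the actual generated C++, and the function name is our own:

```python
# Sketch of the parallel call pattern: call f on a and g on b in
# parallel, and wait for both to return. The real back end emits C++
# using Threading Building Blocks; here we use Python threads.
from concurrent.futures import ThreadPoolExecutor

def par_call(f, g, a, b):
    with ThreadPoolExecutor(max_workers=2) as ex:
        fut_f = ex.submit(f, a)                 # start both strands
        fut_g = ex.submit(g, b)
        return fut_f.result(), fut_g.result()  # join: wait for both results
```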
3 Parallel Separation Logic
In the previous section, we have defined a shallow embedding of LLVM programs into Isabelle/HOL. We now reason about these programs, using separation logic.
3.1 Separation Algebra
In order to reason about memory with separation logic, we define an abstraction function from the memory into a separation algebra [9]. Separation algebras formalize the intuition of combining disjoint parts of memory. They come with a zero (0) that describes the empty part, a disjointness predicate \(a\#b\) describing that the parts a and b do not overlap, and a disjoint union \(a+b\) that combines two disjoint parts. For the exact definition of a separation algebra, we refer to [9, 23]. We note that separation algebras naturally extend over functions and pairs in a pointwise manner.
Example 2
(Trivial separation algebra) The option type α option forms a separation algebra with:
![figure eh](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figeh_HTML.png)
Intuitively, this separation algebra does not allow for combination of contents, except if one side is zero. While it is not very useful on its own, the trivial separation algebra is a useful building block for more complex separation algebras.
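A minimal executable sketch of this algebra (in Python, with None playing the role of the zero element) looks as follows:

```python
# Minimal model of the trivial separation algebra: two contents are
# disjoint only if at least one of them is the zero element.

ZERO = None

def disjoint(a, b):
    return a is ZERO or b is ZERO

def add(a, b):
    # combination is only defined on disjoint parts
    assert disjoint(a, b)
    return b if a is ZERO else a
```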
For our memory model, we define the following abstraction function:
![figure ei](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figei_HTML.png)
An abstract memory consists of two parts: the first is a map from addresses to the values stored there; it is used to reason about load and store operations. The second is a map from block indexes to the sizes of the corresponding blocks; it is used to ensure that one owns all addresses of a block when freeing it.
We continue to define a separation logic: assertions are predicates over separation algebra elements. The basic connectives are defined as follows:
![figure em](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figem_HTML.png)
That is, the false assertion never holds, and the true assertion holds for all abstract memories. The empty assertion □ holds for the zero memory, and the separating conjunction P ∗ Q holds if the memory can be split into two disjoint parts, such that P holds for one, and Q holds for the other part. The lifting assertion ↑Φ holds iff the Boolean value Φ is true:
![figure ev](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figev_HTML.png)
It is used to lift plain logical statements into separation logic assertions owning no memory. When clear from the context, we omit the ↑-symbol, and just mix plain statements with separation logic assertions.
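These connectives can be illustrated by a small Python model, where an abstract memory is a partial map (a dict) and an assertion is a predicate on such maps. The separating conjunction is checked by enumerating all splits; the naming is our own, and the enumeration is exponential and for illustration only:

```python
# Illustrative model of the separation logic connectives: an assertion
# is a predicate on abstract memories (dicts).

def false_a(mem):       # never holds
    return False

def true_a(mem):        # holds for every abstract memory
    return True

def emp(mem):           # holds exactly for the zero (empty) memory
    return mem == {}

def lift(phi):          # pure assertion: phi holds and no memory is owned
    return lambda mem: phi and mem == {}

def sep(P, Q):
    # P * Q: the memory splits into disjoint parts satisfying P and Q
    def holds(mem):
        items = list(mem.items())
        n = len(items)
        for mask in range(2 ** n):               # enumerate all splits
            part1 = dict(items[i] for i in range(n) if mask >> i & 1)
            part2 = dict(items[i] for i in range(n) if not mask >> i & 1)
            if P(part1) and Q(part2):
                return True
        return False
    return holds

def pts_to(a, x):       # a points to x, owning exactly that address
    return lambda mem: mem == {a: x}
```

For instance, pts_to(0, 1) ∗ pts_to(1, 2) holds on the two-cell memory, but pts_to(0, 1) ∗ pts_to(0, 1) holds on no memory, since the single cell cannot be split into two disjoint copies.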
3.2 Weakest Preconditions and Hoare Triples
We define a weakest precondition predicate directly via the semantics:
![figure ex](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figex_HTML.png)
That is, the weakest precondition of program m and postcondition Q holds on memory μ iff m, run on μ, does not fail, and all possible results (return value, access report, and new memory) satisfy the postcondition Q.
To set up a verification condition generator based on separation logic, we standardize the postcondition: the reported memory accesses must be disjoint from some abstract memory F, called the frame. We define the weakest precondition with frame:
![figure fg](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figfg_HTML.png)
that is, when executed on memory μ, the program m does not fail, every return value and new memory satisfy the postcondition, and no memory described by the frame F is accessed.
Equipped with the weakest precondition with frame, we define a Hoare-triple:
![figure fo](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figfo_HTML.png)
The precondition predicate specifies that the abstract memory can be split into a part and the given frame F, such that this part satisfies the precondition P. A Hoare-triple specifies that for all memories and frames for which the precondition holds, the program will succeed, not using any memory of the frame, and every result will satisfy the postcondition wrt. the original frame.
3.3 Verification Condition Generator
The verification condition generator is implemented as a proof tactic that works on subgoals of the form:
![figure fx](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figfx_HTML.png)
The tactic is guided by the syntax of the command in the Hoare-triple. Basic monad combinators are broken down using the following rules:
![figure fz](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figfz_HTML.png)
For other instructions and user defined functions, the VCG expects a Hoare-triple to be already proved. It then uses the following rule:
![figure ga](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figga_HTML.png)
To process a command c, the first assumption is instantiated with the Hoare-triple for c, and the second assumption with the assertion for the current state. Then, a simple syntactic heuristic infers a frame and proves that the current assertion entails the required precondition combined with the frame. Finally, verification condition generation continues with the postcondition and the frame as current assertion.
3.4 Hoare-Triples for Instructions
To use the VCG to verify LLVM programs, we have to prove Hoare triples for the LLVM instructions. For parallel calls, we prove the well-known disjoint concurrency rule [39]:
![figure gi](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figgi_HTML.png)
That is, commands with disjoint preconditions can be executed in parallel.
For memory operations, we prove:
![figure gj](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figgj_HTML.png)
Here, the first assertion states that p points to the beginning of a block of size n, and the second describes that for all \(i\in I\), \(p+i\) points to value f i. Intuitively, allocation creates a block of size n, initialized with the default value, and a tag. If one possesses both the whole block and the tag, the block can be deallocated by free. The rules for load and store are straightforward, where p ↦ x describes that p points to value x.
4 Refinement for (Parallel) Programs
At this point, we have described a separation logic framework for parallel programs in LLVM. It is largely backwards compatible with the framework for sequential programs described in [29], such that we could easily port the algorithms formalized there to our new framework. The next step towards verifying complex programs is to set up a stepwise refinement framework. In this section we describe the refinement infrastructure of the Isabelle Refinement Framework.
4.1 Abstract Programs
Abstract programs are shallowly embedded into the nondeterminism-error (NE) monad (cf. Sect. 2.1). They are purely functional and have no notion of parallel execution. We define a refinement ordering ≤ on this monad:
![figure gt](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figgt_HTML.png)
Intuitively, m ≤ m′ means that m returns fewer possible results than m′, and may only fail if m′ may fail. Note that the refinement ordering forms a complete lattice, with the failing program as top element.
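An executable sketch of this ordering (in Python, with our own naming): failure is the top element, and on result sets refinement is simply set inclusion:

```python
# Illustrative model of the refinement ordering on the nondeterminism-
# error monad: FAIL refines nothing but FAIL, and everything refines FAIL.

FAIL = "FAIL"

def refines(m, m_abs):
    # m <= m_abs: fewer results, and may only fail if m_abs may fail
    if m_abs == FAIL:
        return True                 # FAIL is the top element
    if m == FAIL:
        return False
    return m <= m_abs               # subset of the possible results
```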
We use refinement and assertions to specify that a program m satisfies a specification with precondition P and postcondition Q:
![figure hd](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fighd_HTML.png)
If the precondition is false, the right hand side is the top element, and the statement trivially holds. Otherwise, m cannot fail, and every possible result x of m must satisfy Q.

For a detailed description on using the NE-monad for stepwise refinement based program verification, we refer the reader to [36].
Example 3
(Swapping multiple elements) We specify an operation to perform multiple swaps. It takes two disjoint sets of indexes I₁ and I₂, and a list xs. It then swaps each index in I₁ with some index in I₂. The precondition of this operation assumes that the index sets are in range, disjoint, and have the same cardinality:
![figure hl](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fighl_HTML.png)
The postcondition ensures that the resulting list is a permutation of the original list, that the elements at indexes outside I₁ ∪ I₂ are unchanged, and that each element in I₁ is swapped with one in I₂:
![figure hp](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fighp_HTML.png)
Here, mset xs is the multiset of elements of the list xs, and equality of the multisets of two lists is the standard way to express permutation in Isabelle.
As a sanity check, we prove that our specification is not vacuous, i.e., that for every input that satisfies the precondition, there exists an output that satisfies the postcondition:
![figure ht](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fight_HTML.png)
Note that this is only a sanity check lemma to detect problems early. Should we accidentally insert a vacuous specification here, we won’t be able to prove refinement to an M-monad program later, which cannot be vacuous due to its invariant.

In the NE monad, we then specify:
![figure hw](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fighw_HTML.png)
In Sect. 5.3 we will refine this specification to a parallel implementation in LLVM.
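To illustrate the specification from Example 3, here is an executable Python sketch (names are our own): a precondition check, and one admissible implementation that pairs the two index sets in sorted order, which is just one of the many results the nondeterministic specification allows:

```python
# Python sketch of the multi-swap specification from Example 3.
# mswap_pre checks the precondition; mswap performs one admissible
# multi-swap by pairing the two index sets in sorted order.

def mswap_pre(I1, I2, xs):
    return (all(i < len(xs) for i in I1 | I2)   # indexes in range
            and not (I1 & I2)                   # disjoint index sets
            and len(I1) == len(I2))             # same cardinality

def mswap(I1, I2, xs):
    assert mswap_pre(I1, I2, xs)
    ys = list(xs)
    for i, j in zip(sorted(I1), sorted(I2)):    # one possible pairing
        ys[i], ys[j] = ys[j], ys[i]
    return ys
```

The result is a permutation of the input in which elements at indexes outside I₁ ∪ I₂ are unchanged, as the postcondition demands.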
4.2 The Sepref Tool
The Sepref tool [26, 29] symbolically executes an abstract program in the NE-monad, keeping track of refinements for every abstract variable to a concrete representation, which may use pointers to dynamically allocated memory. During the symbolic execution, the tool synthesizes an Isabelle-LLVM program, together with a refinement proof. The synthesis is automatic, but requires annotations to the abstract program.

The main concept of the Sepref tool is refinement between an abstract program m in the NE-monad, and a concrete program c in the M monad, as expressed by a refinement predicate:
![figure id](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figid_HTML.png)
That is, either the abstract program m fails, or for a memory described by the precondition assertion, the LLVM program c succeeds with some concrete result, such that the new memory is described by the postcondition assertion, for a possible result x of the abstract program m that is related to the concrete result. Moreover, an additional predicate holds for the concrete result; it is used to track pointer equalities. Note that the refinement predicate trivially holds for a failing abstract program. This makes sense, as we prove that the abstract program does not fail anyway. It allows us to assume abstract assertions during the refinement proof:
![figure in](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figin_HTML.png)
Example 4
(Refinement of lists to arrays) We define abstract programs for indexing and updating a list:
![figure io](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figio_HTML.png)
These programs assert that the index is in bounds, and then return the accessed element or the updated list, respectively. The following assertion links a pointer to a list of elements stored at the pointed-to location:
![figure ir](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figir_HTML.png)
That is, for every \(i<|xs|\), \(p+i\) points to the ith element of xs. Assertions like this, which relate concrete to abstract values, are called refinement relations. If we want to emphasize that they depend on the heap, we also call them refinement assertions.
Indexing and updating of arrays is implemented by:
![figure it](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figit_HTML.png)
where the offsetting operation is the Isabelle-LLVM instruction for offsetting a pointer by an index. The abstract and concrete programs are linked by the following refinement theorems:
![figure iv](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figiv_HTML.png)
That is, if the list xs is refined by array p, and the natural number i is refined by the fixed-width word i′ (Footnote 2), the concrete indexing operation will return the same result as the abstract indexing operation. The resulting memory will still contain the original array. Note that there is no explicit precondition that the array access is in bounds, as this follows already from the assertion in the abstract indexing operation. The update operation will return a pointer to an array that refines the updated list returned by the abstract update operation. As the array is updated in place, the original refinement of the array is no longer valid. Moreover, the returned pointer will be the same as the argument pointer. This information is important for refining to parallel programs on disjoint parts of an array (cf. Sect. 4.4).
To increase readability, we introduce an (almost) point-free notation for refinement theorems. The theorems for the array operations above can also be written as:
![figure jj](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figjj_HTML.png)
The first theorem simply states that the first argument is refined by the array assertion, the second argument by the word refinement, and the result by the element refinement. The second theorem adds the annotation \(\cdot ^d\) to the refinement for the array argument, indicating that this argument will be destroyed, i.e., the refinement is no longer valid when the function returns. Moreover, it binds the array argument to the name p and the result to the name r. These names are used in the pointer equality predicate at the end, indicating that the result will be the same pointer as the array argument.

Given refinement relations for the parameters, and refinement theorems for all operations in a program, the Sepref tool automatically synthesizes an LLVM program from an abstract NE program. The tool tries to automatically discharge additional proof obligations, typically arising from translating arithmetic operations from unbounded numbers to fixed-width numbers. Where automatic proof fails, the user has to add assertions to the abstract program to help the proof. The main difference of our tool wrt. the existing Sepref tool [29] is the additional condition on the concrete result, which is used to track pointer equalities. We have added a heuristic to automatically synthesize and discharge these equalities.
4.3 Modular Data Structure Development
The Refinement Framework allows us to build more complex data structures, using already existing ones as building blocks, and chaining together several refinements. We describe the development of an interval list data structure, which we need for our parallel partitioning algorithm (cf. Sect. 5.3).
A pair of natural numbers
can be used to represent the set
. We define
to be the refinement relation between intervals (pairs) and sets. Moreover, we define operations for constructing an interval, testing if an interval is empty, intersection, and cardinality. We show that these operations refine the corresponding operations on sets:
![figure ju](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figju_HTML.png)
Note that
, and we do not enforce \(l\le h\) for our representation. Thus, no checks are needed for construction and intersection. However, we use a check to avoid underflow when computing the cardinality.
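In ordinary functional notation, these operations might be sketched as follows (an illustrative model, not the formalization; the function names are ours, and we assume the pair (l, h) represents the set {l, ..., h-1}, where natural subtraction would underflow for h < l):

```python
def ivl(l, h):
    # construction: no check needed, (l, h) with h <= l is simply empty
    return (l, h)

def ivl_is_empty(iv):
    l, h = iv
    return h <= l

def ivl_inter(a, b):
    # intersection also needs no check
    return (max(a[0], b[0]), min(a[1], b[1]))

def ivl_card(iv):
    # guard against underflow of natural subtraction
    l, h = iv
    return h - l if l < h else 0
```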
Analogously, we implement open intervals with a single number, and define operations to construct an open interval, and to intersect a closed and an open interval:
![figure jw](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figjw_HTML.png)
Next, we use Sepref to implement the natural numbers by fixed-sized words (
). For example, given the definition of
, and an annotation to implement natural numbers by 64 bit words, Sepref synthesizes the Isabelle LLVM program
and proves the refinement theorem:
![figure ka](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figka_HTML.png)
We then define
as the composition of the two refinements (word to nat to set). With the help of Sepref’s FCOMP tool, we can automatically compose the refinement lemmas. For example, composing the refinement lemmas for
and
yields:
![figure ke](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figke_HTML.png)
Thus, we obtain imperative implementations of the set operations. We proceed analogously for open intervals.
Next, we implement a set as the union of a list of non-empty, pairwise disjoint, and finite sets. While that seems to make little sense at first glance, we will later implement the sets in the list by intervals, and the list itself by dynamic arrays, to obtain an imperative interval list data structure. We define operations for constructing an empty set, emptiness test, disjoint union with a single set, cardinality, and a more specialized operation
, which splits off a non-empty set from the list:
![figure kg](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkg_HTML.png)
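The abstract level of this data structure might be modeled as follows (an illustrative Python sketch with names of our own choosing; a set is represented as a list of non-empty, pairwise disjoint, finite sets):

```python
def ls_empty():
    return []

def ls_is_empty(xs):
    return not xs

def ls_union(xs, s):
    # disjoint union with a single non-empty set
    assert s and all(s.isdisjoint(t) for t in xs)
    return xs + [s]

def ls_card(xs):
    # the sets are disjoint, so cardinalities simply add up
    return sum(len(s) for s in xs)

def ls_split(xs):
    # split off a non-empty set from the (non-empty) list
    assert not ls_is_empty(xs)
    return xs[0], xs[1:]
```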
We then refine the lists of sets to array lists (dynamic arrays) of intervals:
. Here,
is the refinement assertion from lists to the array list data structure from the IRF collections library. As argument, it takes the refinement relation for the list elements. Again, Sepref automatically generates imperative implementations of the
functions and proves the corresponding refinement lemmas. Combining them with the refinements to sets yields the desired imperative interval list data structure, with the refinement relation
. For example, for joining a single interval to the list, and for splitting off an interval, we get:
![figure kl](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkl_HTML.png)
Note that we update the underlying dynamic array destructively, hence the \(\cdot ^d\) annotation to the argument refinements.
In a last step, we define some operations on (finite) sets, and use Sepref to directly refine them to arrays, without any explicit intermediate steps. For example, intersecting two finite sets can be expressed as:
![figure km](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkm_HTML.png)
It is straightforward to prove that this algorithm returns
. Also, Sepref can implement \(s_1\) with a closed or open interval, and \(s_2\) with an interval array, yielding:
![figure ko](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figko_HTML.png)
We have demonstrated one way of modularly developing an interval list data structure based on a dynamic array. By separating the actual intervals from the list data structure, the proofs about the interval list were independent of the interval implementation. This is a design choice; a more direct design, e.g., using
as intermediate data structure, is certainly possible.
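Executably, intersecting a single interval with an interval list might look like the following sketch (our own illustrative code, with intervals as pairs (l, h) representing {l, ..., h-1}):

```python
def inter_ivl_list(s1, s2):
    # intersect each interval in the list s2 with the interval s1,
    # dropping empty results to keep the list's invariant
    out = []
    for (l, h) in s2:
        r = (max(s1[0], l), min(s1[1], h))
        if r[0] < r[1]:
            out.append(r)
    return out
```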
4.4 Array Splitting
An important concept for parallel programs is to concurrently operate on disjoint parts of the memory, e.g., different slices of the same array. However, abstractly, arrays are just lists. They are updated by returning a new list, and there is no way to express that the new list is stored at the same address as the old list. Nevertheless, in order to refine a program that updates two disjoint slices of a list to one that updates disjoint parts of the array in place, we need to know that the result is stored in the same array as the input. This is handled by the
argument to
. To indicate that operations shall be refined to disjoint parts of the same array, we introduce the combinator
for abstract programs:
![figure kt](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkt_HTML.png)
Abstractly, this is an annotation that is inlined when proving the abstract program correct. However, Sepref will translate it to the concrete combinator
:
![figure kv](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkv_HTML.png)
The corresponding refinement theorem is:
![figure kw](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkw_HTML.png)
or, equivalently, in pointwise notation:
![figure kx](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figkx_HTML.png)
The refinement of the function argument (
to
) requires an additional proof that the returned pointers are equal to the argument pointers (
). Sepref tries to prove that automatically, using its pointer equality heuristics.
Splitting an array into two parts allows us to abstractly treat the array and its two parts just as lists, which simplifies the abstract proofs: the fact that the two parts come from the same array only becomes visible at a later refinement stage. However, while splitting an array into parts is adequate for many operations, it is not a workable abstraction for swapping multiple elements in parallel (cf. Sect. 3): while, in theory, we could split the array element-wise, this would incur a considerable proof burden.
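The abstract behavior of the split combinator might be sketched as follows (an illustrative model; the refinement then runs f and g on disjoint slices of the same array, in place):

```python
def with_split(i, xs, f, g):
    # apply f to the first i elements and g to the rest, then rejoin;
    # both functions must preserve the length of their part
    ys1, ys2 = f(xs[:i]), g(xs[i:])
    assert len(ys1) == i and len(ys2) == len(xs) - i
    return ys1 + ys2
```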
A more elegant solution is to keep track of which elements of a list can be accessed already on the abstract level. To this end, we model the list as a list of option values, where a None entry means that we cannot access this element. We start by defining functions to abstractly handle lists of option values. These functions work on the actual list and on the structure of the list, which is a list of Booleans indicating which elements we do not own. Making the structure of the list an explicit concept simplifies abstract proofs, as, typically, the values in the list change while the structure is preserved. The following functions obtain the structure of a list, and determine if two structures are compatible, i.e., have the same length and own disjoint indexes:
![figure lc](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figlc_HTML.png)
Here,
is the natural relator for lists, i.e.,
means that the lists
and
have the same length, and that for each index, the entry of at least one of the lists is true.
We also define functions to split and join lists:
![figure lh](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figlh_HTML.png)
Here,
returns a list that owns the indexes that are in
and owned by
, and
joins the elements of two (compatible) lists. The function
combines two lists element-wise, using the binary function
.
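An illustrative Python model of these functions (names are ours; None marks elements we do not own):

```python
def struct_of(xs):
    # structure of a list: True at the indexes we do *not* own
    return [x is None for x in xs]

def compat(s1, s2):
    # same length, and the owned indexes are disjoint
    return len(s1) == len(s2) and all(a or b for a, b in zip(s1, s2))

def split_idx(idxs, xs):
    # keep ownership of the indexes in idxs, give up the rest
    own = [x if i in idxs else None for i, x in enumerate(xs)]
    rest = [x if i not in idxs else None for i, x in enumerate(xs)]
    return own, rest

def join(xs1, xs2):
    # element-wise merge of two compatible lists
    assert compat(struct_of(xs1), struct_of(xs2))
    return [a if a is not None else b for a, b in zip(xs1, xs2)]
```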
Analogously to
, we define a combinator
, that splits the list
into the lists with the indexes
, and without the indexes
, executes
on these lists, and joins the resulting lists:
![figure lu](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figlu_HTML.png)
On the concrete side, we define the refinement assertion
between arrays and lists of options. It asserts ownership of only those indexes of the array whose abstract value is not None:
![figure lw](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figlw_HTML.png)
We implement abstract operations for accessing the list, and show the corresponding refinement lemmas:
![figure lx](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figlx_HTML.png)
We also define conversion operations between plain lists and lists of option values:
![figure ly](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figly_HTML.png)
These conversion operations are important to limit the proof overhead when using lists of option values: we use option values only where fine-grained ownership control is needed. When we are done and have reassembled all parts of the list, we convert it back to a plain list.
Finally, we define
and prove its refinement theorem:
![figure ma](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figma_HTML.png)
Note that the set s of indexes does not have a concrete counterpart. It is a ghost variable that controls the split on the abstract level.
4.5 Refinement to Parallel Execution
Our abstract programs have no notion of parallel execution. To indicate that refinement to parallel execution is desired, we define an abstract annotation
:
![figure mc](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmc_HTML.png)
Its refinement rule is:
![figure md](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmd_HTML.png)
This rule can be used to automatically parallelize any (independent) abstract computations. For convenience, we also define
. Abstractly, it is the same as
, but Sepref translates it to sequential execution.
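The intended behavior of the two annotations might be modeled as follows (an illustrative Python analogue of parallel vs. sequential execution of two independent computations, not the formalization itself):

```python
from concurrent.futures import ThreadPoolExecutor

def par(f, a, g, b):
    # run two independent computations in parallel, pair the results
    with ThreadPoolExecutor(max_workers=2) as ex:
        ra, rb = ex.submit(f, a), ex.submit(g, b)
        return ra.result(), rb.result()

def seq(f, a, g, b):
    # abstractly identical, but executed sequentially
    return f(a), g(b)
```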
5 A Parallel Sorting Algorithm
To test the usability of our framework, we verify a parallel sorting algorithm. We start with the abstract specification of an algorithm that sorts a list:
![figure mg](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmg_HTML.png)
i.e., we return a sorted permutation of the original list. This is a standard specification of sorting in Isabelle, and easily proved equivalent to other, more explicit specifications:
![figure mh](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmh_HTML.png)
Figure 2 shows our abstract parallel sorting algorithm
. This algorithm is derived from the well-known quicksort and introsort algorithms [38]: like quicksort, it partitions the list (line 8), and then recursively sorts the partitions in parallel (l. 12). Like introsort, when the recursion gets too deep, or the list too short, we fall back to some (not yet specified) sequential sorting algorithm (l. 6). Similarly, when the partitioning is very unbalanced (l. 9), we sort the partitions sequentially (l. 11). These optimizations aim at not spawning threads for small sorting tasks, where the overhead of thread creation outweighs the advantages of parallel execution. A more technical aspect is the extra parameter
that we introduced for the list length. Thus, we can refine the list to just a pointer to an array, and still access its length.Footnote 3
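The overall scheme might be sketched as follows (an illustrative Python model; the thresholds, the pivot choice, and the fallback sort are placeholders, not the verified algorithm's actual parameters):

```python
from concurrent.futures import ThreadPoolExecutor

SEQ_SIZE = 16  # illustrative size threshold for the sequential fallback

def psort(xs, depth=0, max_depth=8):
    # too deep or too short: fall back to a sequential sort
    if depth >= max_depth or len(xs) <= SEQ_SIZE:
        return sorted(xs)
    # partition around a pivot
    p = xs[len(xs) // 2]
    left = [x for x in xs if x < p]
    mid = [x for x in xs if x == p]
    right = [x for x in xs if x > p]
    # very unbalanced partitioning: sort the partitions sequentially
    if min(len(left), len(right)) < len(xs) // 8:
        return sorted(left) + mid + sorted(right)
    # otherwise, sort the partitions in parallel
    with ThreadPoolExecutor(max_workers=2) as ex:
        fl = ex.submit(psort, left, depth + 1, max_depth)
        fr = ex.submit(psort, right, depth + 1, max_depth)
        return fl.result() + mid + fr.result()
```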
Reusing our existing development of an abstract introsort algorithm [30], we prove with a few refinement steps that
implements
:
![figure mm](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmm_HTML.png)
5.1 Implementation and Correctness Theorem
Next, we have to provide implementations for the fallback
, and for
. These implementations must be proved to be in-place, i.e., return a pointer to the same array. It was straightforward to amend our existing formalization of
[30] with the in-place proofs: once we had adjusted the refinement statements and fixed a bug in the pointer-equality proving heuristics, the proofs were automatic.
Given implementations of
and
, Sepref generates an LLVM program
from the abstract
, and proves a corresponding refinement lemma:
![figure mu](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmu_HTML.png)
Combining this with the correctness lemma of the abstract
algorithm, and unfolding the definition of
, we prove the following Hoare-triple for our final implementation:
![figure mx](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figmx_HTML.png)
That is, for a pointer
to an array, whose content is described by list
(
), and a fixed-size word
representing the natural number
(
), which must be the number of elements in the list
, our sorting algorithm returns the original pointer
, and the array content now is
, which is sorted and a permutation of
. Note that this statement uses our semantically defined Hoare triples (cf. Sect. 3.2). In particular, it does not depend on the refinement steps, the Sepref tool, or the VCG.
5.2 Sampling Pivot Selection
While we could simply re-use the existing partitioning algorithm from the pdqsort formalization, which uses a pseudomedian of nine pivot selection, we observe that the quality of the pivot is particularly important for a balanced parallelization. Moreover, the partitioning in the
procedure is only done for arrays above a rather large size threshold. Thus, we can invest a little more work to find a good pivot, which is still negligible compared to the cost of sorting the resulting partitions. We choose a sampling approach, using the median of 64 equidistant samples as pivot. We simply use quicksort to find the index of the pivotFootnote 4:
![figure nj](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fignj_HTML.png)
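In a language with closures, the sampling scheme might be sketched as follows (an illustrative model; the sample count is a parameter, and the paper sorts the sample indexes with quicksort):

```python
def sample_pivot(xs, n_samples=64):
    # indexes of n_samples equidistant sample elements
    n = len(xs)
    samples = [i * n // n_samples for i in range(n_samples)]
    # sort the sample indexes by the elements they point to
    samples.sort(key=lambda i: xs[i])
    # index of the median sample
    return samples[n_samples // 2]
```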
Proving that this algorithm finds a valid pivot index is straightforward. More challenging is to refine it to purely imperative LLVM code, which does not support closures like
.
We resolve such closures over the comparison function manually: using Isabelle’s locale mechanism [22], we parametrize over the comparison function. Moreover, we thread through an extra parameter for the data captured by the closure:
![figure nl](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Fignl_HTML.png)
This defines a context in which we have an abstract compare function
for the abstract elements of type
. It takes an extra parameter of type
(e.g. the list xs), and forms a weak ordering.Footnote 5 Note that the strict compare function
also induces a non-strict version
. Moreover, we have a concrete implementation
of the compare function, wrt. the refinement assertions
for the parameter and
for the elements.
Our sorting algorithms are developed and verified in the context of this locale (to avoid confusion, our presentation has, up to now, just used
,
, and
instead of
,
, and
). To get an actual sorting algorithm, we instantiate the locale with an abstract and concrete compare function, proving that the abstract function is a weak ordering, and that the concrete function refines the abstract one. For our example of sorting indexes into an array, where the array elements themselves are compared by a function
, we get:
![figure ob](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figob_HTML.png)
This instantiates the generic sorting algorithms defined in the
locale to use
as comparison function, taking
as an extra parameter. To sort our list
of sample indexes into
, we use the instantiation of the introsort algorithm:
.
5.3 Parallel Partitioning
While our parallel quicksort scheme parallelizes the sorting, partitioning is still a bottleneck: before the first thread is even spawned, the whole array needs to be partitioned. On the next recursion level, only two partitionings can run in parallel, and so on. That is, initially, most processors will be idle. To remedy this, the partitioning itself can be parallelized. The parallel partitioning algorithms used in the latest research on practically efficient sorting algorithms [2] are branchless k-way algorithms, which use atomic operations to orchestrate the parallel threads. In contrast, we verify only a 2-way partitioning algorithm that uses parallel calls as its only synchronization mechanism. This is a compromise between verification effort and efficiency, taking into account the features currently supported by Isabelle-LLVM. The idea of our parallel partitioning algorithm is sketched in Fig. 3.
Illustration of the phases of our parallel partitioning algorithm, after a pivot p has been picked. Step 1 splits the array into slices. Step 2 partitions the slices in parallel. Step 3 computes the start index m of the right partition as the sum of the sizes of all left partitions. It then determines the indexes of misplaced elements: the set bs contains right-partition elements that are left of m, and the set ss contains left-partition elements that are right of m. Step 4 then swaps the misplaced elements in parallel. While Steps 1 and 3 are computationally cheap, Steps 2 and 4 do the main part of the work in parallel
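A sequential Python model of the four phases (an illustrative sketch, not the verified algorithm; Steps 2 and 4 run in parallel in the actual implementation, and the slice count is a placeholder):

```python
def par_partition(xs, p, n_slices=4):
    # Steps 1+2: split the array into slices and partition each slice
    n = len(xs)
    bounds = [k * n // n_slices for k in range(n_slices + 1)]
    small, big = [], []  # indexes of left-/right-partition elements
    for lo, hi in zip(bounds, bounds[1:]):
        mid = lo
        for i in range(lo, hi):
            if xs[i] < p:
                xs[i], xs[mid] = xs[mid], xs[i]
                mid += 1
        small += range(lo, mid)
        big += range(mid, hi)
    # Step 3: start index m of the right partition, and the index sets
    # of misplaced elements on either side of m
    m = len(small)
    bs = [i for i in big if i < m]
    ss = [i for i in small if i >= m]
    # Step 4: swap the misplaced elements pairwise
    for i, j in zip(bs, ss):
        xs[i], xs[j] = xs[j], xs[i]
    return m
```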
We specify our algorithm on sets of indexes, and then refine it to intervals and interval arrays (cf. Sect. 4.3). First, we specify Steps 1 and 2, i.e., returning a permutation of the list, along with the sets
of indexes that belong to a left partition, and
of indexes that belong to a right partition. In Fig. 3,
corresponds to the set of blue (\(\le p\)) intervals, and
to the set of red (\(\ge p\)) intervals:
![figure om](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figom_HTML.png)
The whole partitioning algorithm is specified as follows:
![figure on](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figon_HTML.png)
A straightforward proof, only using arguments on lists and sets of indexes, shows that this algorithm partitions the list:
![figure oo](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figoo_HTML.png)
The set operations in Step 3 are implemented by the operations
and
of our interval arrays (cf. Sect. 4.3), using the equalities:
and
. Thus, we are only missing implementations for
and
.
The refinement of the parallel partitioning
is similar to that of the parallel sorting algorithm
: using
and
, we split the list into slices that we then partition with a sequential algorithm.
The parallel swapping algorithm is refined as follows:
![figure oz](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figoz_HTML.png)
The main procedure
first checks if there are any indexes to swap. Then, it converts the plain list to a list of option values (
), invokes the actual parallel swapping procedure
, and converts the result back to a plain list (
).
The
procedure first splits off equally sized, non-empty sets
and
from the index sets, and then swaps these in parallel with the rest. Here,
is the lifting of
to lists of option values, and
ensures that the swap operation will own the necessary elements.
We prove that our algorithm is correct (
), and use Sepref, a straightforward implementation of sequential swapping, and our interval array implementation to refine it to efficient imperative Isabelle-LLVM code.
Combining all refinements in this section gives us a parallel partitioning algorithm. When we wanted to show that it satisfies the specification
as required by our parallel sorting algorithm, we discovered that we also need to prove that neither partition can be empty. While this could certainly be proved along the same lines as for sequential partitioning, we chose a pragmatic solution here: we dynamically check for the extreme case of one partition being empty, and fix it with an additional swap. The runtime impact of this check is negligible, but it greatly simplifies the correctness proof.
5.4 Code Generation
Finally, we instantiate the sorting algorithms to sort unsigned integers and strings:
![figure pm](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figpm_HTML.png)
Here, the locale
is the version of
without an extra parameter to the compare function.Footnote 6 This yields implementations
and
, and instantiated versions of the correctness theorem.
We then use our code generator to generate actual LLVM text, as well as a C header file with the signatures of the generated functionsFootnote 7:
![figure ps](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10817-024-09701-w/MediaObjects/10817_2024_9701_Figps_HTML.png)
This checks that the specified C signatures are compatible with the actual types, and then generates
and
, which can be used in a standard C/C++ toolchain.
5.5 Benchmarks
Runtimes in milliseconds for sorting various distributions of \(10^8\) unsigned 64 bit integers and \(10^7\) strings with our verified parallel sorting algorithm, C++’s standard parallel sorting algorithm, and Boost’s parallel sample sort algorithm. The experiments were performed on a server machine with 22 AMD Opteron 6176 cores and 128GiB of RAM, and a laptop with a 6 core (12 threads) i7-10750H CPU and 32GiB of RAM
We have benchmarked our verified sorting algorithm against a direct implementation of the same algorithm in C++. Both implementations have the same runtime, up to minor noise. This indicates that there is no systemic slowdown: algorithms verified with our framework run as fast as their unverified counterparts implemented in C++.
We also benchmarked against the old verified algorithm with sequential partitioning from our ITP 2022 paper [31], as well as against the state-of-the-art implementations
with execution policy
from the GNU C++ standard library [42], and
from the Boost C++ Libraries [5, 6]. We have benchmarked the algorithms on two different machines and various input distributions. The results are shown in Fig. 4: our verified algorithm is clearly competitive with the unverified state-of-the-art implementations; only for a few string-sorting benchmarks is it slightly slower. We leave improving on this for future work. Compared to the old verified algorithm, the version with parallel partitioning is more efficient in many cases, and only slightly less efficient in a few.
We also measured the speedup that the implementations achieve for a certain number of cores. The results are displayed in Fig. 5: again, our verified implementation is clearly competitive, and it scales better than the old verified algorithm.
The previous two benchmarks used relatively large input sizes. Figure 6 displays a benchmark for smaller input sizes. While we are still competitive with
,
is clearly faster for small arrays. This result is expected: our parallelization uses hard-coded thresholds to switch to sequential algorithms, which are independent of the number of available processors or the input size. We leave fine-tuning of the parallelization scheme to future work.
6 Conclusions
We have presented a stepwise refinement approach to verify total correctness of efficient parallel algorithms. Our approach targets LLVM as back end, and there is no systemic efficiency loss in our approach when compared to unverified algorithms implemented in C++.
The trusted code base of our approach is relatively small: apart from Isabelle’s inference kernel, it contains our shallow embedding of a small fragment of the LLVM semantics, and the code generator. All other tools that we used, e.g., our Hoare logic, the Sepref tool, and the Refinement Framework for abstract programs, ultimately prove a correctness theorem that only depends on our shallowly embedded semantics.
As a case study, we have implemented a parallel sorting algorithm. It uses an existing verified sequential pdqsort algorithm as a building block, and is competitive with state-of-the-art parallel sorting algorithms.
The main idea of our parallel extension is to shallowly embed the semantics of a parallel combinator into a sequential semantics, by making the semantics report the accessed memory locations, and fail if there is a potential data race. We only needed to change the lower levels of our existing framework for sequential LLVM [29]. Higher-level tools like the VCG and Sepref remained largely unchanged and backwards compatible. This greatly simplified the reuse of existing verification projects, like the sequential pdqsort algorithm [30].
While the verification of our sorting algorithms uses a top-down approach, we actually started with implementing, benchmarking, and fine-tuning the algorithms in C++. This gave us a quick way to find an algorithm that is efficient and, at the same time, not too complex for verification, without having to verify each intermediate step towards this algorithm. Only then did we use our top-down approach to first formalize the abstract ideas behind the algorithm, and then refine it to an efficient implementation close to what we had written in C++. At this point, one may ask why we did not directly verify the C++ implementation: while this might be possible, the required work would be similar: to manage the complexity of such a verification, several bottom-up refinement steps would be necessary, ultimately arriving at something similarly abstract as our initial abstract algorithm.
6.1 Development Effort
The Isabelle Refinement Framework for LLVM consists of roughly 45 kLOC (thousand lines) of theory text and Isabelle-ML code. The sorting algorithms comprise another 14 kLOC.
Integrating the initial version of the parallel semantics into the framework, and verifying the parallel sorting algorithm with sequential partitioning, as reported in our ITP 2022 paper [31], took us roughly 6 months.Footnote 8 During this time, we abandoned an initial formalization for the Imperative/HOL back end, as the achievable performance was unsatisfactory. We also abandoned a naive formalization of the parallel operator for the LLVM back end, to finally arrive at what is described in our ITP 2022 paper. Effectively, we added about 6kLOC to the framework and adapted most of the remaining code. The initial formalization of the parallel sorting algorithm required roughly 1kLOC. The parallel partitioning algorithm and its support data structures add 4kLOC, and took us one month to develop.
6.2 Related Work
While there is extensive work on parallel sorting algorithms (e.g. [1, 2, 10]), there seems to be almost no work on their formal verification. The only work we are aware of is a distributed merge sort algorithm [17], for which “no effort has been made to make it efficient” [17, Sect. 2], and for which no executable code has been generated or benchmarked. Another verification [40] uses the VerCors deductive verifier to prove the permutation property (
) of odd-even transposition sort [14], but neither the sortedness property nor termination.
Concurrent separation logic is used by many verification tools such as VerCors [4], and also formalized in proof assistants, for example in the VST [43] and IRIS [21] projects for Coq [3]. These formalizations contain elaborate concepts to reason about communication between threads via shared memory, and are typically used to verify partial correctness of subtle concurrent algorithms (e.g. [37]). Reasoning about total correctness is more complicated in the step-indexed separation logic provided by IRIS, and currently only supported for sequential programs [41]. Our approach is less expressive, but naturally supports total correctness, and is already sufficient for many practically relevant parallel algorithms like sorting, matrix-multiplication, or parallel algorithms from the C++ STL.
6.3 Future Work
An obvious next step is to implement a fractional separation logic [7], to reason about parallel threads that share read-only memory. While our semantics already supports shared read-only memory, our separation logic does not. We believe that implementing a fractional separation logic will be straightforward, with the main technical issues lying in automatic frame inference.
Extending our approach towards more advanced synchronization like locks or atomic operations may be possible: instead of accessed memory addresses, a thread could report a set of possible traces, which are checked for race-freedom and then combined. Moreover, our framework currently targets multicore CPUs. Another important class of architectures is general-purpose GPUs. As LLVM is also available for GPUs, porting our framework to this architecture should be possible. We even expect that we can model barrier synchronization, which is important in the GPU context.
Finally, the Sepref framework has recently been extended to reason about complexity of (sequential) LLVM programs [15, 16]. This could be combined with our parallel extension, to verify the complexity (e.g. work and span) of parallel algorithms.
An important aspect is the scalability of our tools. The current implementation scales to small software systems, like the verified IsaSAT-solver [12], which actually uses our sequential sorting algorithms. However, for projects of this size, the times needed by Sepref and the LLVM code exporter easily get into the order of dozens of minutes, and some manual performance tweaks are required (cf. [12, Sect. 5]). Further improvement of the scalability is left to future work.
Another direction for future work is to further optimize our verified sorting algorithm. We expect that tuning our parallelization scheme will improve the speedup for smaller input sizes. Also, while we are competitive with standard library implementations, recent research indicates that there is still some room for improvement, for example with the IPS\(^4\)o algorithm [2]. While this algorithm uses atomic operations in one place, many other of its optimizations, for example branchless decision trees for multi-way partitioning, only require features already supported by our framework.
Notes
If the actual system does run out of memory, we will terminate the program in a defined way.
We use Isabelle’s word library here, which encodes the actual width as a type variable, such that our functions work with any bit width. For code generation, we will fix the width to 64 bit.
Alternatively, we could refine a list to a pair of array pointer and length.
We leave verification of efficient median algorithms, e.g., quickselect, to future work. Note that the overhead of sorting 64 elements is negligible compared to the large partition that has to be sorted.
A weak ordering is induced by a mapping of the elements into a total ordering. It is the standard prerequisite for sorting algorithms in C++ [20].
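As a concrete illustration (ours, not from [20]): comparing strings by length alone induces a strict weak ordering, because the mapping to lengths is into a total order. Distinct strings of equal length are equivalent but not equal, so this is weaker than a total order on the strings themselves, yet it is exactly what C++ sorting algorithms accept.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Illustrative comparator: a strict weak ordering on strings induced by
// mapping each string to its length. Strings of equal length ("bb", "cc")
// are equivalent under this ordering but not equal.
inline bool by_length(const std::string& a, const std::string& b) {
  return a.size() < b.size();
}

// Sort by the induced weak ordering. Equivalent elements may end up in any
// relative order, since std::sort is not stable.
inline std::vector<std::string> sort_by_length(std::vector<std::string> v) {
  std::sort(v.begin(), v.end(), by_length);
  return v;
}
```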
Parameters to the compare function are currently not supported for parallel sorting algorithms, as we cannot efficiently share the parameter between multiple threads. Integrating fractional separation logic into Sepref, which would enable such a sharing, is left to future work.
For technical reasons, we represent the array size as a non-negative signed integer, thus the C signature uses a signed integer type for the size. Moreover, we use a string implementation based on dynamic arrays, rather than C’s zero-terminated strings.
As we do not have precise work time logs, we report calendar months during which we worked almost full-time on the project.
References
Asiatici, M., Maiorano, D., Ienne, P.: How many CPU cores is an FPGA worth? Lessons learned from accelerating string sorting on a CPU-FPGA system. J. Signal Process. Syst. 93, 1–13 (2021)
Axtmann, M., Witt, S., Ferizovic, D., Sanders, P.: Engineering in-place (shared-memory) sorting algorithms. ACM Trans. Parallel Comput. 9(1), 2:1–2:62 (2022). https://doi.org/10.1145/3505286
Bertot, Y., Castéran, P.: Interactive Theorem Proving and Program Development: Coq’Art: The Calculus of Inductive Constructions, 1st edn. Springer, Heidelberg (2010)
Blom, S., Darabi, S., Huisman, M., Oortwijn, W.: The VerCors tool set: verification of parallel and concurrent software. In: Polikarpova, N., Schneider, S. (eds.) Integrated Formal Methods, pp. 102–110. Springer, Cham (2017)
Boost C++ Libraries Sorting Algorithms. https://www.boost.org/doc/libs/1_77_0/libs/sort/doc/html/index.html
Boost C++ Libraries. https://www.boost.org/
Bornat, R., Calcagno, C., O’Hearn, P., Parkinson, M.: Permission accounting in separation logic. In: Proc. of POPL, pp. 259–270. ACM, New York, NY, USA (2005). https://doi.org/10.1145/1040305.1040327
Brunner, J., Lammich, P.: Formal verification of an executable LTL model checker with partial order reduction. J. Autom. Reasoning 60(1), 3–21 (2018). https://doi.org/10.1007/s10817-017-9418-4
Calcagno, C., O’Hearn, P.W., Yang, H.: Local action and abstract separation logic. In: LICS 2007, pp. 366–378 (2007)
Chhugani, J., Nguyen, A.D., Lee, V.W., Macy, W., Hagog, M., Chen, Y.-K., Baransi, A., Kumar, S., Dubey, P.: Efficient implementation of sorting on multi-core SIMD CPU architecture. Proc. VLDB Endow. 1(2), 1313–1324 (2008)
Esparza, J., Lammich, P., Neumann, R., Nipkow, T., Schimpf, A., Smaus, J.-G.: A fully verified executable LTL model checker. In: CAV. LNCS, vol. 8044, pp. 463–478. Springer, Saint Petersburg (2013)
Fleury, M., Lammich, P.: A more pragmatic CDCL for IsaSAT and targetting LLVM (short paper). In: Pientka, B., Tinelli, C. (eds.) Automated Deduction - CADE 29 - 29th International Conference on Automated Deduction, Rome, Italy, July 1–4, 2023, Proceedings. Lecture Notes in Computer Science, vol. 14132, pp. 207–219. Springer, Rome, Italy (2023). https://doi.org/10.1007/978-3-031-38499-8_12
Fleury, M., Blanchette, J.C., Lammich, P.: A verified SAT solver with watched literals using Imperative HOL. In: Proc. of CPP, pp. 158–171 (2018)
Habermann, A.N.: Parallel Neighbor-Sort. Carnegie Mellon University, Pittsburgh (1972). https://doi.org/10.1184/R1/6608258.v1
Haslbeck, M.P.L., Lammich, P.: For a few dollars more - verified fine-grained algorithm analysis down to LLVM. In: Yoshida, N. (ed.) Proc. of ESOP. LNCS, vol. 12648, pp. 292–319. Springer, Luxembourg (2021). https://doi.org/10.1007/978-3-030-72019-3_11
Haslbeck, M.P.L., Lammich, P.: For a few dollars more - verified fine-grained algorithm analysis down to LLVM. TOPLAS, S.I. ESOP’21
Hinrichsen, J.K., Bengtson, J., Krebbers, R.: Actris: session-type based reasoning in separation logic. Proc. ACM Program. Lang. (2019). https://doi.org/10.1145/3371074
Huffman, B., Kuncar, O.: Lifting and transfer: a modular design for quotients in Isabelle/HOL. In: Gonthier, G., Norrish, M. (eds.) Proc. of CPP. LNCS, vol. 8307, pp. 131–146. Springer, Melbourne (2013). https://doi.org/10.1007/978-3-319-03545-1_9
Intel oneAPI Threading Building Blocks. https://software.intel.com/en-us/intel-tbb
Josuttis, N.M.: The C++ Standard Library: A Tutorial and Reference, 2nd edn. Addison-Wesley Professional, Boston (2012)
Jung, R., Krebbers, R., Jourdan, J., Bizjak, A., Birkedal, L., Dreyer, D.: Iris from the ground up: a modular foundation for higher-order concurrent separation logic. J. Funct. Program. 28, 20 (2018). https://doi.org/10.1017/S0956796818000151
Kammüller, F., Wenzel, M., Paulson, L.C.: Locales: a sectioning concept for Isabelle. In: Bertot, Y., Dowek, G., Théry, L., Hirschowitz, A., Paulin, C. (eds.) TPHOLs, pp. 149–165. Springer, Nice (1999)
Klein, G., Kolanski, R., Boyton, A.: Mechanised separation algebra. In: ITP, pp. 332–337. Springer, Princeton (2012)
Lammich, P.: Automatic data refinement. In: ITP. LNCS, vol. 7998, pp. 84–99. Springer, Rennes (2013)
Lammich, P.: Verified efficient implementation of Gabow’s strongly connected component algorithm. In: International Conference on Interactive Theorem Proving, pp. 325–340. Springer (2014)
Lammich, P.: Refinement to Imperative/HOL. In: ITP. LNCS, vol. 9236, pp. 253–269. Springer, Nanjing (2015)
Lammich, P.: Efficient verified (UN)SAT certificate checking. In: Proc. of CADE. Springer, Gothenburg (2017)
Lammich, P.: The GRAT tool chain: efficient (UN)SAT certificate checking with formal correctness guarantees. In: SAT, pp. 457–463 (2017)
Lammich, P.: Generating verified LLVM from Isabelle/HOL. In: Harrison, J., O’Leary, J., Tolmach, A. (eds.) ITP. LIPIcs, vol. 141, pp. 22:1–22:19. Dagstuhl Publishing, Portland (2019). https://doi.org/10.4230/LIPIcs.ITP.2019.22
Lammich, P.: Efficient verified implementation of introsort and pdqsort. In: Peltier, N., Sofronie-Stokkermans, V. (eds.) Proc. of IJCAR (II). LNCS, vol. 12167, pp. 307–323. Springer, Paris (2020). https://doi.org/10.1007/978-3-030-51054-1_18
Lammich, P.: Refinement of parallel algorithms down to LLVM. In: Andronick, J., Moura, L. (eds.) ITP. LIPIcs, vol. 237, pp. 24:1–24:18. Dagstuhl Publishing, Haifa (2022). https://doi.org/10.4230/LIPIcs.ITP.2022.24
Lammich, P., Fleury, M.: lammich/isabelle_llvm: parallel sorting: artefact release. https://doi.org/10.5281/zenodo.10869631
Lammich, P., Lochbihler, A.: The Isabelle Collections Framework. In: ITP 2010. LNCS, vol. 6172, pp. 339–354. Springer, Edinburgh (2010)
Lammich, P., Sefidgar, S.R.: Formalizing the Edmonds-Karp algorithm. In: Proc. of ITP, pp. 219–234 (2016)
Lammich, P., Sefidgar, S.R.: Formalizing network flow algorithms: a refinement approach in Isabelle/HOL. J. Autom. Reasoning 62(2), 261–280 (2019). https://doi.org/10.1007/s10817-017-9442-4
Lammich, P., Tuerk, T.: Applying data refinement for monadic programs to Hopcroft’s algorithm. In: Beringer, L., Felty, A.P. (eds.) ITP 2012. LNCS, vol. 7406, pp. 166–182. Springer, Princeton (2012)
Mével, G., Jourdan, J.-H.: Formal verification of a concurrent bounded queue in a weak memory model. Proc. ACM Program. Lang. (2021). https://doi.org/10.1145/3473571
Musser, D.R.: Introspective sorting and selection algorithms. Softw. Pract. Exp. 27(8), 983–993 (1997)
O’Hearn, P.W.: Resources, concurrency and local reasoning. In: Gardner, P., Yoshida, N. (eds.) CONCUR 2004: Concurrency Theory, pp. 49–67. Springer, Berlin (2004)
Safari, M., Huisman, M.: A generic approach to the verification of the permutation property of sequential and parallel swap-based sorting algorithms. In: International Conference on Integrated Formal Methods, pp. 257–275. Springer (2020)
Spies, S., Gäher, L., Gratzer, D., Tassarotti, J., Krebbers, R., Dreyer, D., Birkedal, L.: Transfinite Iris: resolving an existential dilemma of step-indexed separation logic. In: Proc. of PLDI, pp. 80–95 (2021)
The GNU C++ Library 3.4.28. https://gcc.gnu.org/onlinedocs/libstdc++/
Verified Software Toolchain Project Web Page. https://vst.cs.princeton.edu/
Wimmer, S., Lammich, P.: Verified model checking of timed automata. In: TACAS 2018, Thessaloniki, pp. 61–78 (2018)
Contributions
The only author of the paper is (obviously) responsible for all contributions.
Ethics declarations
Conflict of interest
The author declares no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Lammich, P. Refinement of Parallel Algorithms Down to LLVM: Applied to Practically Efficient Parallel Sorting. J Autom Reasoning 68, 14 (2024). https://doi.org/10.1007/s10817-024-09701-w