For a Few Dollars More

We present a framework to verify both functional correctness and worst-case complexity of practically efficient algorithms. We implemented a stepwise refinement approach, using the novel concept of resource currencies to naturally structure the resource analysis along the refinement chain, and allow a fine-grained analysis of operation counts. Our framework targets the LLVM intermediate representation. We extend its semantics from earlier work with a cost model. As a case study, we verify the correctness and O(n log n) worst-case complexity of an implementation of the introsort algorithm, whose performance is on par with the state-of-the-art implementation found in the GNU C++ Library.


Introduction
In general, not only the correctness but also the complexity of algorithms is important. While it is obvious that the performance observed during experiments is essential to solve practical problems efficiently, the theoretical worst-case complexity of algorithms is crucial as well: a good worst-case complexity avoids timing regressions when hitting worst-case input, and, even more importantly, prevents denial of service attacks that intentionally produce worst-case scenarios to overload critical computing infrastructure.
For example, the C++ standard requires implementations of std::sort to have worst-case complexity O(n log n) [7]. Note that this rules out quicksort [12], which is very fast in practice, but has quadratic worst-case complexity. Nevertheless, some standard libraries, most prominently LLVM's libc++ [20], still use sorting algorithms with quadratic worst-case complexity.³ A practically efficient sorting algorithm with O(n log n) worst-case complexity is Musser's introsort [22]. It combines quicksort with the O(n log n) heapsort algorithm, which is used as a fallback when the quicksort recursion depth exceeds a certain threshold. This makes it possible to implement standard-compliant, practically efficient sorting algorithms. Introsort is implemented by, e.g., the GNU C++ Library (libstdc++) [8].

³ See, e.g., https://bugs.llvm.org/show_bug.cgi?id=20837.
In this paper, we present techniques to formally verify both correctness and worst-case complexity of practically efficient implementations. We build on two previous lines of research by the authors.
On the one hand, we have the Isabelle Refinement Framework [19], which allows for a modular top-down verification approach. It utilizes stepwise refinement to separate the different aspects of an efficient implementation, such as algorithmic idea and low-level optimizations. It provides a nondeterminism monad to formalize programs and refinements, and the Sepref tool to automate canonical data refinement steps. Its recent LLVM back end [15] makes it possible to verify algorithms with competitive performance compared to (unverified) highly optimized C/C++ implementations. The Refinement Framework has been used to verify the functional correctness of an implementation of introsort that performs on par with libstdc++'s implementation [17].
On the other hand, we have already extended the Refinement Framework to reason about complexity [11]. However, this extension only supports the Imperative/HOL back end [16]. It generates implementations in functional languages, which are inherently less efficient than highly optimized C/C++ implementations. This paper combines and extends these two approaches. Our main contributions are:
• We present a generalized nondeterminism monad with resource cost, apply it to resource functions to model fine-grained currencies (Section 2) and show how they can be used to naturally structure refinement.
• We extend the LLVM back end [15] with a cost model, and amend its basic reasoning infrastructure (Section 3).
• We extend the Sepref tool (Section 4) to synthesize executable imperative code in LLVM, together with a proof of correctness and complexity. Our approach seamlessly supports imperative and amortized data structures.
• We extend the verification of introsort to also show a worst-case complexity of O(n log n), thus meeting the C++11 stdlib specification [7] (Section 5). The performance of our implementation is still on par with libstdc++. We believe that this is the first time that both, correctness and complexity of a sorting algorithm have been formally verified down to a competitive implementation.

Specification of Algorithms With Resources
We use the formalism of monads [24] to elegantly specify programs with resource usage. We first describe a framework that works for a very generic notion of resource, and then instantiate it with resource functions, which model resources of different currencies. We then describe a refinement calculus and show how currencies can be used to structure stepwise refinement proofs. Finally, we report on automation and give some examples.

Nondeterministic Computations With Resources
Let us examine the features we require for our computation model. First, we want to specify programs by their desired properties, without having to fix a concrete implementation. In general, those programs have more than one correct result for the same input. Consider, e.g., sorting a list of pairs of numbers by the first element. For the input [(1,2), (2,2), (1,3)], both [(1,2), (1,3), (2,2)] and [(1,3), (1,2), (2,2)] are valid results. Formally, this is modelled as a set of possible results. When we later fix an implementation, the set of possible results may shrink. For example, the (stable) insertion sort algorithm always returns the list [(1,2), (1,3), (2,2)]. We say that insertion sort refines our specification of sorting.
Second, we want to define recursion by a standard fixed-point construction over a flat lattice. The bottom of this lattice must be a dedicated element, which we call fail. It represents a computation that may not terminate.
Finally, we want to model the resources required by a computation. For nondeterministic programs, these may vary depending on the nondeterministic choices made during the computation. As we model computations by their possible results, rather than by the exact path in the program that leads to the result, we also associate resource cost with possible results. When more than one computation path leads to the same result, we take the supremum of the used resources. The notion of refinement is now extended to a subset of results that are computed using fewer resources.
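The first requirement above, a specification as a set of admissible results that an implementation may narrow down, can be illustrated outside Isabelle. The following Python sketch is only an illustration under our own naming (`sort_spec` and `insertion_sort` are hypothetical helpers, not part of the formalization); it enumerates the results admitted by "sort by first component" and checks that a stable insertion sort picks one of them.

```python
from itertools import permutations

def sort_spec(pairs):
    """All results admitted by 'sort by first component' (as a set of tuples)."""
    return {p for p in permutations(pairs)
            if all(p[i][0] <= p[i + 1][0] for i in range(len(p) - 1))}

def insertion_sort(pairs):
    """Stable insertion sort on the first component; one deterministic result."""
    out = []
    for x in pairs:
        i = len(out)
        # Move left only past strictly larger keys, which keeps equal keys in order.
        while i > 0 and out[i - 1][0] > x[0]:
            i -= 1
        out.insert(i, x)
    return tuple(out)

xs = [(1, 2), (2, 2), (1, 3)]
spec_results = sort_spec(xs)            # two admissible results
impl_results = {insertion_sort(xs)}     # exactly one result
print(impl_results <= spec_results)     # refinement as set inclusion
```

Refinement in this resource-free setting is plain set inclusion; the resource-aware ordering additionally compares the cost attached to each result.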
We now formalize the above intuition: The type (α,γ) NREST = fail | res (α → γ option) models a nondeterministic computation with results of type α and resources of type γ.⁴ That is, a computation is either fail, or res M, where M is a partial function from possible results to resources. We define spec Φ T as a computation of any result r that satisfies Φ r using T r resources: spec Φ T = res (λr. if Φ r then Some (T r) else None). By abuse of notation, we write spec x T for spec (λr. r=x) (λ_. T).
Based on an ordering on the resources γ, we define the refinement ordering on NREST, by first lifting the ordering to option with None as the bottom element, then pointwise to functions and finally to (α,γ) NREST, setting fail as the top element. This matches the intuition of refinement: m ≤ m′ reads as m refines m′, i.e., m has fewer possible results than m′, computed with fewer resources.
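To make the type and its ordering concrete, the following Python sketch models a computation as either the marker `FAIL` or a dictionary from results to resources, where an absent key plays the role of None. This is an illustration under simplifying assumptions: resources are single numbers (with `math.inf` for ∞), results range over a finite `universe`, and all names are ours.

```python
import math

FAIL = "fail"  # top element of the refinement ordering

def spec(phi, cost, universe):
    """res M with M r = cost(r) for every r satisfying phi; absent keys model None."""
    return {r: cost(r) for r in universe if phi(r)}

def refines(m, m2):
    """m <= m2: fail is top; otherwise every result of m is a result of m2
    with at most the resources that m2 allows for it."""
    if m2 == FAIL:
        return True
    if m == FAIL:
        return False
    return all(r in m2 and t <= m2[r] for r, t in m.items())

abstract = spec(lambda r: r % 2 == 0, lambda r: 5, range(6))   # {0: 5, 2: 5, 4: 5}
unbounded = spec(lambda r: True, lambda r: math.inf, range(6))
concrete = {2: 3}                                              # one result, cheaper
print(refines(concrete, abstract), refines(abstract, concrete))
```

The concrete computation refines the abstract one because it offers a subset of the results at no greater cost, while the converse fails.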
We require the resources γ to have a complete lattice structure, such that we can form suprema over the (possibly infinitely many) paths that lead to the same result. Moreover, when sequentially composing computations, we need to add up the resources. This naturally leads to a monoid structure (γ, 0, +), where 0, intuitively, stands for no resources.
We call such types γ resource types, if they have a complete lattice and monoid structure. Note that, in an earlier iteration of this work [11], the resource type was fixed to extended natural numbers (enat = ℕ ∪ {∞}), measuring the resource consumption with a single number. Also note that (α,unit) NREST is isomorphic to our original nondeterministic result monad without resources [19].
If γ is a resource type, so is η → γ. Intuitively, such resources consist of coins of different resource currencies η, the amount of coins being measured by γ.

Example 1. In the following we use the resource type ecost = string → enat, i.e., we have currencies described by a string, whose amount is measured by extended natural numbers, where ∞ models arbitrary resource usage. Note that, while the resource type string → enat guides intuition, most of our theory works for general resource types of the form η → γ or even just γ.
We define the function $_s n to be the resource function that uses n :: enat coins of the currency s :: string, and write $_s as a shortcut for $_s 1.
A program that sorts a list in O(n²) can be specified by: that is, a list xs can result in any sorted list xs′ with the same elements, and the computation takes (at most) quadratically many q coins in the list length, and one c coin, independently of the list length. Intuitively, the q and c coins represent the constant factors of an algorithm that implements that specification and are later elaborated by exchanging them into several coins of more fine-grained currencies, corresponding to the concrete operations in the algorithm, e.g., comparisons and memory accesses. Abstract currencies like q and c only "have value" if they can be exchanged to meaningful other currencies, and finally pay for the resource costs of a concrete implementation.
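Resource functions with several currencies can be mimicked by dictionaries from currency names to amounts. The sketch below is a hedged illustration, not the Isabelle definition: `coin`, `add`, `leq` and `sort_cost` are names we introduce, an absent key means zero coins, and `sort_cost` encodes the budget described above (quadratically many q coins plus one c coin).

```python
import math

def coin(s, n=1):
    """$_s n: n coins of currency s, zero coins of every other currency."""
    return {s: n}

def add(a, b):
    """Pointwise sum of two resource functions (absent currency = 0)."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in a.keys() | b.keys()}

def leq(a, b):
    """Pointwise ordering of resource functions."""
    return all(v <= b.get(k, 0) for k, v in a.items())

def sort_cost(xs):
    """Budget from the text: |xs|^2 coins 'q' plus one coin 'c'."""
    return add(coin("q", len(xs) ** 2), coin("c"))

budget = sort_cost([3, 1, 2])                                  # {'q': 9, 'c': 1}
anything = add(coin("q", math.inf), coin("c", math.inf))       # arbitrary usage
print(budget, leq(coin("q", 4), budget), leq(budget, anything))
```

A cost stated in a currency the budget does not mention, such as `coin("lookup")`, is not below the budget; this is the incompatibility that exchanging coins later resolves.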

Atomic Operations and Control Flow
In order to conveniently model actual computations, we define some combinators. The elapse m t combinator adds the (constant) resources t to all results of m. The program⁵ return x computes the single result x without using any resources. The combinator bind m f models the sequential composition of computations m and f, where f may depend on the result of m. If the first computation m fails, then the sequential composition also fails. Otherwise, we consider all possible results x with resources t of m, invoke f x, and add the cost t for computing x to the results of f x. The supremum aggregates the cases where f yields the same result via different intermediate results of m, and also makes the whole expression fail if one of the f x fails.
Example 2. We now illustrate an effect that stems from our decision to aggregate the resource usage of different computation paths that lead to the same result. Consider the program res (λn::nat. Some ($_c n)); return 0. It first chooses an arbitrary natural number n consuming n coins of currency c, and then returns the result 0. That is, there are arbitrarily many paths that lead to the result 0, consuming arbitrarily many c coins. The supremum of this is ∞, such that the above program is equal to elapse (return 0) ($_c ∞). Note that none of the computation paths actually attains the aggregated resource usage. We will come back to this in Section 4.4.
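The combinators and the aggregation effect of Example 2 can be sketched in Python. The sketch assumes a single numeric resource (so the supremum is `max`) and finitely many results; `choose_then_zero` is a bounded variant of Example 2 that we introduce for illustration, since the unbounded choice cannot be enumerated.

```python
import math

FAIL = "fail"

def ret(x):
    """return x: the single result x, using no resources."""
    return {x: 0}

def elapse(m, t):
    """Add the constant resources t to every result of m."""
    return FAIL if m == FAIL else {r: c + t for r, c in m.items()}

def bind(m, f):
    """Sequential composition: costs add up along a path, and paths reaching
    the same result are aggregated by the supremum (here: max)."""
    if m == FAIL:
        return FAIL
    out = {}
    for x, t in m.items():
        fx = f(x)
        if fx == FAIL:          # one failing continuation makes the whole bind fail
            return FAIL
        for r, c in fx.items():
            out[r] = max(out.get(r, 0), t + c)
    return out

def choose_then_zero(bound):
    """Bounded variant of Example 2: pick n < bound at cost n, then return 0."""
    choice = {n: n for n in range(bound)}
    return bind(choice, lambda _n: ret(0))

print(choose_then_zero(5))              # the single result 0 carries the largest cost
limit = elapse(ret(0), math.inf)        # what the unbounded program is equal to
```

With a bound of 5 the aggregated cost is 4, and it grows with the bound; for the unbounded choice of Example 2 the supremum is ∞ although no individual path consumes infinitely many coins.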
Finally, we use Isabelle/HOL's if-then-else and define a recursion combinator rec via a fixed-point construction [13], to get a complete set of basic combinators. As these combinators also incur cost in the target LLVM, we define resource-aware variants if_c and rec_c. Furthermore, we derive a while combinator. Here, the guard of if_c is a computation itself, and we consume an additional if coin to account for the conditional branching in the target model. Similarly, every recursive call consumes an additional call coin.
Assertions fail if their condition is not met, and return unit otherwise: assert P = if P then return () else fail. They are used to express preconditions of a program. A Hoare triple for program m, with precondition P, postcondition Q and resource usage t is written as a refinement condition: m ≤ assert P; spec Q (λ_. t).

Example 3. Comparison of two list elements at a cost of t can be specified by: where xs!i is the i-th element of list xs. Instead of fixing the cost for specifications, we pass them as parameter t. This allows us to refine different instances of abstract data types (here lists) by different concrete data structures with different costs. To make bigger programs more readable, we note the cost parameter in parentheses at the end of the line, as, e.g., in Example 4.

Refinement on NREST
We have used the refinement ordering to express Hoare triples. Two other applications of refinement are data refinement and currency refinement.
Data Refinement. A typical use case of refinement is to implement an abstract data type by a concrete data type. For example, we could implement (finite) sets of numbers by sorted lists. We define a refinement relation R between sorted lists and sets. A concrete computation m† that yields sorted lists then refines an abstract computation m that yields sets, if every possible concrete result is related to a possible abstract result. Formally, m† ≤ ⇓_D R m, where the operator ⇓_D is defined, for arguments R and m, by the following two rules.
Again, we use the supremum to aggregate the costs of all abstract results that are related to a concrete result. As in Example 2, this leads to the possibility that the supremum cost is not attained, which we discuss in Section 4.4.
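The prose description of ⇓_D can be turned into a small executable sketch. The code below is our reading of that description, not a transcription of the two defining rules: it assumes a single numeric resource, a finite set of candidate concrete values, and a relation given as a Python predicate; all names are hypothetical.

```python
FAIL = "fail"

def down_d(rel, m, concrete_universe):
    """Sketch of the data refinement operator: a concrete value c is assigned the
    supremum (max) of the costs of all abstract results related to it."""
    if m == FAIL:
        return FAIL
    out = {}
    for c in concrete_universe:
        costs = [t for a, t in m.items() if rel(c, a)]
        if costs:                       # unrelated concrete values are not results
            out[c] = max(costs)
    return out

def refines(m, m2):
    """Refinement ordering with fail as top element."""
    if m2 == FAIL:
        return True
    if m == FAIL:
        return False
    return all(r in m2 and t <= m2[r] for r, t in m.items())

def sorted_list_rel(c, a):
    """The tuple c is the sorted, duplicate-free listing of the finite set a."""
    return list(c) == sorted(a)

abstract = {frozenset({1, 2}): 7}                       # yields a set, cost 7
candidates = [(1, 2), (2, 1), (1,)]
allowed = down_d(sorted_list_rel, abstract, candidates)
print(allowed, refines({(1, 2): 5}, allowed), refines({(2, 1): 5}, allowed))
```

Only the sorted representation `(1, 2)` is admitted, and a concrete computation producing it with cost 5 refines the abstract one with cost 7.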
Currency Refinement. Suppose we want to refine Example 3 into a program that first accesses the elements and then compares them.
Example 4. We refine idxs_cmp_spec ($_idxs_cmp) from Example 3 as follows: idxs_cmp xs i j = assert (i<|xs| ∧ j<|xs|); xsi ← list_get_spec xs i; where list_get_spec xs i (T) = assert (i < |xs|); spec (xs!i) T and return x (T) returns the result x incurring cost T. Note that idxs_cmp and idxs_cmp_spec use different, incompatible currency systems. To compare them, we need to exchange coins: one idxs_cmp coin will be traded for two lookup coins and one less coin.
To make that happen we introduce the currency refinement ⇓_C E m. Here, the exchange rate E :: η_a → η_c → γ specifies for each abstract currency c_a :: η_a how many of the coins of the concrete currency c_c :: η_c are needed. Note that, in general, one abstract coin may be exchanged into multiple coins of different currencies. For a resource type γ that provides a multiplication operation (*) we define the operator ⇓_C with the following two rules.
The refined computation has the same results as the original. To get the amount of a concrete coin c_c for some result r with resource function t, we sum, over all abstract coins c_a, the amount of abstract coins needed in the original computation (t c_a) weighted by the exchange rate (E c_a c_c).
For the sum to make sense, there must be only finitely many abstract coins c_a with t c_a * E c_a c_c ≠ 0. This can be ensured by restricting the resource functions t of the computation to use finitely many different coins, or by restricting the exchange rate E accordingly. The latter can be checked syntactically in practice.
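The weighted sum described above is easy to state executably for resource functions with finitely many coins. In the following sketch, which is an illustration with names of our own choosing, an exchange rate is a nested dictionary and `rate_idxs_cmp` encodes the trade mentioned earlier: one idxs_cmp coin for two lookup coins and one less coin.

```python
def exchange(rate, t):
    """Amount of each concrete coin cc: the sum, over abstract coins ca, of
    t[ca] * rate[ca][cc]. Currencies missing from `rate` exchange to nothing."""
    out = {}
    for ca, amount in t.items():
        for cc, factor in rate.get(ca, {}).items():
            out[cc] = out.get(cc, 0) + amount * factor
    return out

# Hypothetical encoding of the exchange rate: zero everywhere except idxs_cmp.
rate_idxs_cmp = {"idxs_cmp": {"lookup": 2, "less": 1}}

print(exchange(rate_idxs_cmp, {"idxs_cmp": 1}))
print(exchange(rate_idxs_cmp, {"idxs_cmp": 3}))
```

Because both dictionaries are finite, the finiteness condition on the sum holds by construction in this sketch.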
Example 5. For refining the specification idxs_cmp_spec we can use the exchange rate E_1 = 0(idxs_cmp := $_lookup 2 + $_less), which does the correct exchange for idxs_cmp and is zero everywhere else. Here, + and 0 are lifted to functions in a pointwise manner, and f(·:=·) denotes a function update. We can now prove:

Refinement Patterns
In practice, we encounter certain recurring patterns of refinement, which we describe in this section.

Refinement of Specifications
Instead of only asking whether a program m satisfies a specification res M, we also ask by how much it satisfies the specification, i.e., what the difference is between the resources specified and those actually used, denoted by gwp m M.⁶ We have the following equality: m ≤ res M ⇔ Some 0 ≤ gwp m M.
To get some intuition let us fix the resource to be time. Then, gwp m M is the latest feasible time at which we can start m to still match the deadline M. If there is no feasible starting time (gwp m M = None), m does not fulfill the specification M. If it has some value t, this is the latest feasible starting time of all computation paths in m.
Using gwp, we can implement a syntax driven verification condition generator, as already described in [11].
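The "latest feasible starting time" reading of gwp can be sketched for the special case where the resource is a single number interpreted as time. The definition of gwp is not reproduced in this paper, so the code below is our interpretation of the intuition only; `gwp` returns `None` where the text says None, and `refines_res` states m ≤ res M directly.

```python
import math

FAIL = "fail"

def gwp(m, M):
    """Latest feasible start time of m against the deadlines M, or None."""
    if m == FAIL:
        return None
    slack = math.inf                    # no results: any start time is feasible
    for r, t in m.items():
        if r not in M or t > M[r]:
            return None                 # result not allowed, or deadline missed
        # An infinite deadline never constrains the start time.
        slack = min(slack, math.inf if M[r] == math.inf else M[r] - t)
    return slack

def refines_res(m, M):
    """m <= res M, stated directly."""
    return m != FAIL and all(r in M and t <= M[r] for r, t in m.items())

deadlines = {"a": 10, "b": 6, "c": 1}
print(gwp({"a": 3, "b": 5}, deadlines))   # limited by result b
print(gwp({"a": 7}, {"a": 5}))            # deadline missed
```

On these finite examples, `refines_res(m, M)` holds exactly when `gwp(m, M)` is a number that is at least 0, mirroring the equality m ≤ res M ⇔ Some 0 ≤ gwp m M.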
Lockstep Refinement. We often refine a compound program by refining some of its components. Let A and C be two structurally equal programs (i.e., they have the same structure of combinators if_c, rec_c, bind, etc.), and let A_i and C_i be the pairs of corresponding basic components, for i ∈ {0, …, n}. Provided with refinement lemmas for each of those pairs,⁷ an automatic procedure walks through the program and establishes a refinement C ≤ ⇓_D R_n (⇓_C E A). This process generates verification conditions for ensuring the preconditions Φ_i, which can be discharged automatically or, if required, via interactive proof.

⁶ The definition of gwp requires γ to provide a difference operator, dual to its + operator. It is a straightforward generalization of the concept defined in [11], and thus omitted here. We only note that the resource types unit, enat, and ecost provide a suitable difference operator.

⁷ The refinement relations R_i and R′_i relate the parameters and the result of those components, respectively.
Note that, while the data refinements R i can be different for each component i, the exchange rate E must be the same for all components. Currently, we align the exchange rates by manually deriving specialized versions of the component refinement lemmas. However, we believe that this can be automated in many practical cases, by collecting constraints on the exchange rate during the lockstep refinement, which are solved afterwards to obtain a unified exchange rate. We leave the implementation of this idea to future work.

Separating Analysis of Resource Usage and Correctness
We can disregard resource usage and only focus on refinement of functional correctness, and then add resource usage analysis later. This is useful to separate the concerns of functional correctness and resource usage proof. We will describe a practical example later (Section 5.5), and only present an alternative way to prove the refinement in Example 4 here.

First, for functional correctness, we use the specification idxs_cmp_spec (∞) and a program idxs_cmp_∞ similar to idxs_cmp but with all the costs replaced by ∞. Proving the refinement idxs_cmp_∞ xs i j ≤ idxs_cmp_spec xs i j (∞) only requires showing verification conditions that correspond to functional properties and termination. In particular, assertions and annotated invariants in the concrete program have to be proved. Proof obligations on resource usage, however, collapse into the trivial t ≤ ∞. For the same reason, we get idxs_cmp xs i j ≤ idxs_cmp_∞ xs i j, and by transitivity obtain idxs_cmp xs i j ≤ idxs_cmp_spec xs i j (∞).

Next, we prove idxs_cmp xs i j ≤_n spec (λ_. True) ($_lookup 2 + $_less). Here, the refinement relation m ≤_n m′ = (m ≠ fail ⟹ m ≤ m′) assumes that the concrete program does not fail. This has the effect that, during the refinement proof, assertions and annotated invariants in the concrete program can be assumed to hold, and we can focus on the resource usage proof.
Finally, the two refinements can be combined to obtain idxs_cmp xs i j ≤ idxs_cmp_spec xs i j ($_lookup 2 + $_less).

LLVM With Cost Semantics
The NREST-monad allows us to specify programs with their resource usage in abstract currencies. Those currencies only have a meaning when they finally can be exchanged for the costs of concrete computations. In the following we present such a concrete computation model, namely a shallow embedding of the LLVM semantics into Isabelle/HOL. The embedding is an extension of our earlier work [15] to also account for costs. In Section 4 we then report on linking the LLVM back end with the NREST front end.

Basic Monad
At the basis of our LLVM formalization is a monad that provides the notions of non-termination, failure, state, and execution costs.
Here, cost is a type for execution costs, which forms a monoid with operation + and neutral element 0, and state is an arbitrary type.⁸ The type α M describes a program that, when executed on a state, either does not terminate (NTERM), fails (FAIL), or returns a result of type α, its execution costs, and a new state (SUCC).
It is straightforward to define the monad operations return and bind, as well as a recursion combinator rec over M. Thanks to the shallow embedding, we can also use Isabelle/HOL's if-then-else to get a complete set of basic operations. As an example, we show the definition of the bind operation, in the case that both arguments successfully compute a result: That is, the result x and state s_1 after the first operation m are passed into the second operation f, and the result and state after the bind are what emerges from f. The cost for the bind is the sum of the costs for both operations.
The basic monad operations do not cost anything. To account for execution costs, we define an explicit operation consume c s = SUCC () c s.⁹
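The behaviour of this monad can be imitated in a few lines of Python. The sketch below is an illustration under assumptions of our own: a program is a function from a state to either the marker `NTERM`, the marker `FAIL`, or a `("SUCC", result, cost, state)` tuple, and costs are `collections.Counter` objects, which form a monoid under `+`. None of the helper names beyond return, bind and consume come from the formalization.

```python
from collections import Counter

NTERM, FAIL = "NTERM", "FAIL"

def succ(x, cost, state):
    """Successful outcome: result, execution cost, new state."""
    return ("SUCC", x, cost, state)

def ret(x):
    """return x: result x, zero cost, state unchanged."""
    return lambda s: succ(x, Counter(), s)

def consume(c):
    """consume c: unit result, cost c, state unchanged."""
    return lambda s: succ((), Counter(c), s)

def bind(m, f):
    """Run m, pass its result and state to f, and add up both costs."""
    def run(s):
        r1 = m(s)
        if r1 in (NTERM, FAIL):
            return r1
        _, x, c1, s1 = r1
        r2 = f(x)(s1)
        if r2 in (NTERM, FAIL):
            return r2
        _, y, c2, s2 = r2
        return succ(y, c1 + c2, s2)
    return run

# A program that pays for one 'load' and one 'add', then returns 42.
prog = bind(consume({"load": 1}),
            lambda _: bind(consume({"add": 1}), lambda _: ret(42)))
print(prog("s0"))
```

Since return and bind themselves add nothing to the cost, all cost in `prog` stems from the two explicit `consume` steps.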

Shallowly Embedded LLVM Semantics
The formalization of the LLVM semantics is organized in layers. At the bottom, there is a memory model that stores deeply embedded values, and comes with basic operations for allocation/deallocation, loading, storing, and pointer manipulation. The basic arithmetic operations are also defined on deeply embedded integers. These operations are phrased in the basic monad, but consume no costs. This way, we could take them unchanged from our original LLVM formalization without cost [15]. For example, the low-level load operation has the signature raw_load :: raw_ptr → val M. Here, raw_ptr is the pointer type of our memory model, consisting of a block address and an offset, and val is our value type, which can be an integer, a pointer, or a pair of values.
On top of the basic layer, we define operations that correspond to the actual LLVM instructions. Here, we map from deeply embedded values to shallowly embedded values, and add the execution costs.
For example, the semantics of LLVM's load instruction is defined as follows:

The code generator checks that the set of definitions is complete and adheres to the required shape. It then translates them into LLVM code, which merely amounts to pretty printing and translating the structured control flow by if and while¹² statements to the unstructured control flow of LLVM. A powerful preprocessor can convert a more general class of terms to the restricted shape required by the code generator. This conversion is done inside the logic, i.e., the processed program is proved to be equal to the original. Preprocessing steps include monomorphization of polymorphic constants, extraction of fixed-point combinators to recursive function definitions, and conversion of tuple constructors and destructors to LLVM's insertvalue and extractvalue instructions.

⁸ Note that this differs from the NREST monad in Section 2.1: it is deterministic, and provides a state. Because of determinism, we never need to form a supremum, and thus can base our cost model on natural numbers rather than enats. We leave a unification of the two monads to future work.

⁹ For NREST, we defined a higher-order operation elapse, while we use the first-order operation consume here. This is for historical reasons. Note that elapse can be defined in terms of consume, and vice versa.
In summary, the layered architecture of our LLVM formalization allowed for a smooth integration of the cost aspect, reusing most of the existing formalization nearly unchanged. Note that we opted to integrate the cost aspect into the existing top layer, which converts between deep and shallow embedding. Alternatively, we could have added another layer on top of the shallow embedding. While the latter would have been the cleaner design, we opted for the former approach to avoid the boilerplate of adding a new layer. This was feasible as the original top layer was quite thin, such that adding another aspect there did not result in excessive complexity.

¹⁰ See Section 3.3 for an explanation of our cost model.

¹¹ Actually, the only change to the original formalization is the introduction of the ll_call instruction, to make the costs of a function call visible.

¹² Primitive while loops are not strictly required, as they can always be replaced by tail recursion. Indeed, our code generator can be configured to not accept while loops, and our preprocessor can automatically convert while loops to tail-recursive functions. However, the efficiency of the generated code then relies on LLVM's optimization pass to detect the tail recursion and transform it to a loop again.

Cost Model
As a cost model for running time, we chose to count how often each instruction is executed. That is, we set cost = string → nat, where the string encodes the name of an instruction. It is straightforward to define 0 and + such that (cost,0,+) forms a monoid. It is thus a valid cost model for our monad.

But how realistic is our cost model, counting LLVM instructions? During compilation, LLVM text will be transformed by LLVM's optimizer, and finally, LLVM's back end will translate LLVM instructions to machine instructions. Moreover, the actual running time of a machine program does not only depend on the number of executed instructions; effects like pipeline flushes and cache misses also play an important role. Thus, without factoring in the details of the optimization passes and the target machine architecture, our cost model can, at best, be a rough approximation of the actual running time.
However, we can sensibly assume that a single instruction in the original LLVM text will result in at most a (small) constant number of machine instructions, and that each machine instruction has a constant worst-case execution time. Thus, the steps counted by our model linearly correlate to an upper bound of the actual execution time, though the exact correlation depends on the actual program, optimizer passes, and target architecture. Hence, while our cost model cannot be used for precise statements about execution time, it can be used to prove worst-case complexity. That is, a program that we have proved efficient will be compiled to an efficient machine program. Moreover, we can hope that the constant factors in the proved complexity are related to the actual constant factors in the machine program, i.e., an LLVM program with small constant factors will compile to a machine program with small constant factors.
The above discussion justifies the following design choices: The insertvalue and extractvalue instructions, which are used to construct and destruct tuple values, have no associated costs. The main reason for this design is to enable transparent use of tupled values, e.g., to encode the state of a while loop. We expect LLVM to translate the members of the tuple to separate registers anyway, such that no real costs are associated with tupling/untupling.
We define the malloc instruction to take cost proportional to the number of allocated elements. Note that LLVM itself does not provide memory management, and our code generator forwards memory management instructions to the libc implementation of the target platform. We use the calloc function here, which is supposed to initialize the allocated memory with zeros. While the exact costs of that are implementation dependent, they certainly will depend on the size of the allocated block.
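The design choices above can be summarized in a small counting sketch. It is an illustration only: the instruction names in the example trace, the helper names, and the decision to charge malloc as `n_elems` coins of a single "malloc" currency are assumptions of ours, chosen to reflect "proportional to the number of allocated elements".

```python
from collections import Counter

FREE = {"insertvalue", "extractvalue"}   # tupling is modelled as cost-free

def instruction_cost(name, n_elems=1):
    """Cost of one executed instruction under the counting model."""
    if name in FREE:
        return Counter()
    if name == "malloc":
        return Counter({"malloc": n_elems})   # proportional to the block size
    return Counter({name: 1})

def trace_cost(trace):
    """Sum the costs of a list of (instruction, n_elems) pairs."""
    total = Counter()
    for name, n in trace:
        total += instruction_cost(name, n)
    return total

# Hypothetical execution trace: allocate 8 elements, two loads, a tuple build, an add.
trace = [("malloc", 8), ("load", 1), ("load", 1), ("insertvalue", 1), ("add", 1)]
print(trace_cost(trace))
```

Counters add pointwise and have the empty counter as neutral element, which is the monoid structure required of cost.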
Charguéraud and Pottier [6, §2.7] discuss the adequacy of abstract cost models in a functional setting. In their classification, our abstraction is on Level 2.

Reasoning Setup
Once we have defined the semantics, we need to set up some basic reasoning infrastructure. The original Isabelle-LLVM already comes with a quite generic separation logic and verification condition generation framework. Here, we report on our extensions to resources using time credits.

Separation Logic with Time Credits
Our reasoning infrastructure is based on separation logic with time credits [1,6,10]. We follow the algebraic approach of Calcagno et al. [3], using an earlier extension [15] of Klein et al. [18].
A separation algebra on type α induces a separation logic on assertions that are predicates over α. To guide intuition, elements of α are called heaps here. We use the following separation logic operators: The assertion ↑Φ holds for an empty heap if Φ holds, □ = ↑True describes the empty heap, and ∃_A is the existential quantifier lifted to assertions. The separating conjunction P ∗ Q describes a heap comprised of two disjoint parts, one described by P and the other described by Q, and entailment P ⊢ Q states that Q holds for every heap described by P.
Separation algebras naturally extend over product and function types, i.e., for separation algebras α, β, and any type γ, also α × β and γ → α are separation algebras, where the operations are lifted pointwise.
Note that enat forms a separation algebra, where elements, i.e., time credits, are always disjoint. Hence, ecost = string → enat and amemory × ecost are also separation algebras, where amemory is the separation algebra that we already used in [15] to describe the abstract memory of LLVM. Thus, amemory × ecost induces a separation logic with time credits that match our cost model. The time credit assertion $c = (λa. a = (0, c)) describes an empty memory (0) and precisely the time c. The primitive assertions on amemory are lifted analogously to describe no time credits.
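The product structure can be sketched in Python. The sketch makes a strong simplifying assumption that is ours alone: the abstract memory is a plain dictionary from addresses to values (the actual amemory is richer), and credits are dictionaries from currency names to amounts. It shows that credits never conflict, that joining adds them up, and how the time credit assertion singles out the empty memory.

```python
def disjoint(a, b):
    """Two abstract states are disjoint iff their memories do not overlap;
    time credits are always disjoint."""
    (mem_a, _), (mem_b, _) = a, b
    return mem_a.keys().isdisjoint(mem_b.keys())

def join(a, b):
    """Combine disjoint states: union of memories, pointwise sum of credits."""
    assert disjoint(a, b)
    (mem_a, cr_a), (mem_b, cr_b) = a, b
    credits = {k: cr_a.get(k, 0) + cr_b.get(k, 0)
               for k in cr_a.keys() | cr_b.keys()}
    return ({**mem_a, **mem_b}, credits)

def credit(c):
    """$c: the empty memory together with precisely the credits c."""
    return lambda a: a == ({}, c)

left = ({}, {"load": 1})
right = ({}, {"load": 2})
both = join(left, right)
print(both, credit({"load": 3})(both))
```

Splitting `both` back into `left` and `right` is what a separating conjunction of two credit assertions describes: three load credits can be divided into one and two.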

Weakest Precondition and Hoare Triples
We start by defining a concrete state cstate that describes the memory content and the available resources: where memory is the memory type from our original LLVM formalization. Based on this, we define the weakest precondition predicate:

Intuitively, the costs cc stored in the state are the credit available to the program. The weakest precondition holds if the program runs with real costs c that are within the available credit, and Q holds for the result r, the new memory s′, and the new credit, cc−c, which is the old credit reduced by the actually required costs. Note that actual costs have type cost = string → nat, i.e., are always finite, while the credits have type ecost = string → enat, i.e., there can be infinite credits. Setting the credit to be infinite for all instruction types yields the classical weakest precondition that requires termination, but enforces no time limit.
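The intuition behind the weakest precondition can be made executable. The sketch below follows the prose description only; it is not the Isabelle definition. We assume, for illustration, that a program maps a memory to `NTERM`, `FAIL`, or a `(result, cost, memory)` triple, that the concrete state is a pair of memory and credits, and that `load` is a hypothetical one-instruction program.

```python
import math

NTERM, FAIL = "NTERM", "FAIL"

def wp(m, Q, cstate):
    """Holds if m terminates successfully from memory s with real cost c within
    the credit cc, and Q holds for the result, new memory, and remaining credit."""
    s, cc = cstate
    res = m(s)
    if res in (NTERM, FAIL):
        return False
    r, c, s2 = res
    if any(n > cc.get(k, 0) for k, n in c.items()):
        return False                       # not enough credit for some instruction
    rest = {k: v - c.get(k, 0) for k, v in cc.items()}
    return Q(r, (s2, rest))

def load(addr):
    """Hypothetical program: read memory at addr for one 'load' coin."""
    return lambda s: (s[addr], {"load": 1}, s)

mem = {0: 7}
enough = wp(load(0), lambda r, st: r == 7 and st[1]["load"] == 1, (mem, {"load": 2}))
too_few = wp(load(0), lambda r, st: True, (mem, {"load": 0}))
no_limit = wp(load(0), lambda r, st: r == 7, (mem, {"load": math.inf}))
print(enough, too_few, no_limit)
```

With infinite credit the cost check can never fail, which corresponds to the classical weakest precondition that only requires successful termination.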
Our concrete state type, in particular the memory, does not form a separation algebra, as the natural memory model of LLVM has no natural notion of partial memories. Thus, we define an abstraction function that maps a concrete state to an abstract state astate, which forms a separation algebra: astate = amemory × ecost and abs (m, c) = (abs_m m, c). Again, amemory and abs_m are the abstract state and abstraction function from the original LLVM formalization. The costs already form a separation algebra, so we do not abstract them further.

With this, we can instantiate a generic VCG infrastructure: let cstate be concrete states, wp :: α M → (α → cstate → bool) → cstate → bool be a weakest precondition predicate, and astate an abstract state, linked to concrete states via an abstraction function abs :: cstate → astate. Further, assume that wp distributes over conjunctions. Finally, let ⊤ be an affine top [5], i.e., an assertion with □ ⊢ ⊤ and ⊤ ∗ ⊤ = ⊤, which captures resources that can be safely discarded. We define the Hoare triple {P} c {Q} to hold iff:

Intuitively, {P} c {Q} holds if, for all states that contain a part described by assertion P, command c terminates with result r and a state where that part is replaced by a part described by Q r ∗ ⊤, and the rest of the state has not changed. Here, Q r is the postcondition of the Hoare triple, and ⊤ describes resources that may be left over and can be discarded.
In our case, we set ⊤ to describe the empty memory and any amount of time credits. This matches the intuition that a program must free all its memory, but may run faster than estimated, i.e., leave over some time credits. Note that our wp distributes over conjunctions.
The generic VCG infrastructure now provides us with a syntax-driven VCG with a simple frame inference heuristic.

Primitive Setup
Once we have defined the basic reasoning infrastructure, we have to prove Hoare triples for the basic LLVM instructions and control flow combinators. As we have added the cost aspect only at the top level of our semantics, we can reuse most of the material from our original LLVM formalization without time. Technically, we instantiate our reasoning infrastructure with a weakest precondition predicate wpn, which only holds for programs that consume no costs. The resulting reasoning infrastructure is identical to the one of our original formalization, most of which could be reused. Only for the topmost level, i.e., for those functions that correspond to the functional semantics of the actual LLVM instructions, we lift the Hoare triples over wpn to Hoare triples over wp.
Using the VCG and the Hoare triples for the LLVM instructions, we can now define data structures and algorithms and prove them correct. While this works smoothly for simple data structures like arrays, it does not scale to more complex developments. In contrast, NREST does scale, but lacks support for the low-level pointer reasoning required for basic data structures. In the next section, we show how to combine both approaches, with the LLVM level providing basic data structures and the NREST level using them as building blocks for larger algorithms.

Automatic Refinement
In this section we describe a tool to synthesize a concrete program in the LLVM-monad from an abstract algorithm in the NREST-monad. It can automatically refine abstract functional data structures to imperative heap-based ones. We will describe the synthesis predicate hnr that connects the two monads, the synthesis tool, and a way to extract Hoare triples from hnr predicates. Finally, we will discuss an effect that prevents combining hnr with data refinements in the NREST-monad in the general case.

Heap nondeterminism refinement
The heap nondeterminism refinement predicate hnr Γ m† Γ′ R m intuitively expresses that the concrete program m† computes a concrete result that relates, via the refinement assertion R, to a result in the abstract program m, using at most the resources specified by m for that result. A refinement assertion describes how an abstract variable is refined by a concrete value on the heap. It can also contain time credits. The assertions Γ and Γ′ describe the heaps before and after the computation and typically are separating conjunctions of refinement assertions for the respective parameters of m† and m. Formally, the predicate holds if either the abstract program fails or if, for all heaps and resources (s, c) that satisfy the pre-assertion Γ with some frame F, there exists an abstract result and cost (r_a, c_a) that refine m, and m† terminates with concrete result r in a state s′ where Γ′ with the frame holds, and r relates to the abstract result via assertion R. The execution costs of m† and the time credits c′ required by the post-assertion Γ′ are paid for by the specified cost c_a and the time credits c described by the pre-assertion Γ. Thus, the real costs are paid by a combination of the advertised costs in the abstract program and the potential difference of Γ and Γ′, which allows us to seamlessly model amortized computation costs. Using the affine top, it is possible for the program to throw away portions of the heap. Note that our affine top can only discard time credits. Memory must be explicitly freed by the concrete program m†.
Also note that hnr is not tied to the LLVM semantics specifically. It actually is a general pattern for combining the NREST-monad with any other program semantics that provides a weakest precondition and a separation algebra for data and resources.

The Sepref Tool
The Sepref tool [14,15] automatically synthesizes a concrete program in the LLVM-monad from an abstract algorithm in the NREST-monad. It symbolically executes the abstract program while maintaining refinements for the abstract variables to a concrete representation and generates a concrete program as well as a valid hnr predicate. Proof obligations that occur during this process are discharged automatically, guided by user-provided hints where necessary.
The synthesis requires rules for all abstract combinators. For example, bind is processed as follows: to refine x ← m; f x, we first execute m, synthesizing a concrete program for it, and then synthesize a concrete program for f x, where y is the result of f x. Now, the intermediate variable x goes out of scope and has to be deallocated. The predicate MK_FREE R_x free (line 3) states that free is a deallocator for data structures implemented by refinement assertion R_x. Note that free can only use time credits that are stored in R_x. Typically, these are paid for during creation of the data structure. This way, amortization can be used effectively to hide the necessary free operation and its costs in the abstract program.
All other combinators (rec_c, if_c, while_c, etc.) have similar rules that are used to decompose an abstract program into parts, synthesize corresponding concrete parts recursively and combine them afterwards with the respective combinators from LLVM. At the leaves of this decomposition, atomic operations need to be provided with suitable synthesis predicates.
An example is a list lookup that is implemented by an array. Here, array_A, snat_A and id_A relate a list with an array, an unbounded natural number with a bounded signed word, and identical elements, respectively. With an array at address p holding the list xs and an index i† that is a bounded signed word representing an unbounded natural number i, array_nth leaves the parameters unchanged and extracts the element specified by list_get_spec, incurring costs array_get_cost = $ofs_ptr + $load.
Ideally, each operation has its own currency (e.g., list_get). However, as our definition of hnr does not support currency refinement, the basic operations must use the currencies of the LLVM cost model. To still obtain modular hnr rules, we encapsulate specifications for data structures with their cost, e.g., by defining array_get_spec = list_get_spec (λ_. array_get_cost). These can easily be introduced in an additional refinement step. Automating this process, and possibly integrating currency refinement into hnr, is left to future work.

Extracting Hoare Triples
Note that hnr predicates cannot always be expressed as Hoare triples, as the running time bound of the abstract program may depend on the result, which we cannot refer to in the precondition of a Hoare triple, where we have to express the allowed running time as time credits. However, if the running time bound does not depend on the result, we can write hnr as a Hoare triple. While intermediate components might not be of this form, final algorithms typically are. At the end of a development, this rule allows us to extract a Hoare triple in the underlying LLVM semantics, cutting out the NREST-monad. For validating the correctness claim of an algorithm, only the final Hoare triple needs to be inspected, which only uses concepts of the underlying semantics.
Note that the above rule is an equivalence. Thus, it can also be used to obtain synthesis rules from Hoare triples provided by the basic VCG infrastructure.

Attain Supremum
We comment on a problem that arises when composing hnr predicates and data refinement in the NREST-monad. Consider the following programs and relations. Data refinement defines the resource bound for a concrete result (here z) as the supremum over all bounds of related results (here x, y). Thus, we have m ≤ ⇓_D R_R m′. Moreover, we trivially have hnr m† R_A m. Intuitively, we want to compose these two refinements to obtain hnr m† (R_A • R_R) m′. However, as our definition of hnr does not form a supremum, this would require $a + $b ≤ $a or $a + $b ≤ $b, which obviously does not hold.
We have not yet found a way to define hnr or ⇓_D in a form that does not exhibit this effect. Instead, we explicitly require that the supremum of the data refinement has a witness. The predicate attains_sup m m′ R_R characterizes that situation: it holds if, for all results r of m, the supremum of the set of all abstractions (r, r′) ∈ R_R applied to m′ is in that set. This trivially holds if R_R is single-valued, i.e., any concrete value is related to at most one abstract value, or if m′ is one-time, i.e., assigns the same resource bound to all its results.
In practice, we do encounter non-single-valued relations, but they only occur as intermediate results where the composition with an hnr predicate is not necessary. Also, collapsing synthesis predicates and refinements in the NREST-monad typically is performed for the final algorithm, whose running time does not depend on the result and which thus is one-time and ultimately satisfies attains_sup.

Case Study: Introsort
In this section, we apply our framework to the introsort algorithm [22]. We build upon the verification of its functional correctness [17] to verify its running time and synthesize competitive, efficient LLVM code for it. Following the "top-down" mantra, we use several intermediate steps to refine a specification down to an implementation.

Specification of Sorting
We start with the specification of sorting a slice of a list:

slice_sort_spec xs₀ l h (T) = assert (l ≤ h ∧ h ≤ length xs₀); spec (λxs. slice_sort_aux xs₀ l h xs) (λ_. T)

where slice_sort_aux xs₀ l h xs states that xs is a permutation of xs₀, and that xs is sorted between l and h and equal to xs₀ anywhere else.

Introsort's Idea
The introsort algorithm is based on quicksort. Like quicksort, it finds a pivot element, partitions the list around the pivot, and recursively sorts the two partitions. Unlike quicksort, however, it keeps track of the recursion depth, and if it exceeds a certain value (typically 2⌊log n⌋), it falls back to heapsort to sort the current partition. Intuitively, quicksort's worst-case behaviour can only occur when unbalanced partitioning causes a high recursion depth, and the introsort algorithm limits the recursion depth, falling back to the O(n log n) heapsort algorithm. This combines the good practical performance of quicksort with the good worst-case complexity of heapsort. Our implementation of introsort follows the implementation of libstdc++, which includes a second optimization: a first phase executes quicksort (with fallback to heapsort), but stops the recursion when the partition size falls below a certain threshold τ. Then, a second phase sorts the whole list with one final pass of insertion sort. This exploits the fact that insertion sort is actually faster than quicksort for almost-sorted lists, i.e., lists where any element is less than τ positions away from its final position in the sorted list. While the optimal threshold τ needs to be determined empirically, it does not influence the worst-case complexity of the final insertion sort, which is O(τn) = O(n) for constant τ. The threshold τ will be an implicit parameter from now on.
While this seems like a quite concrete optimization, the two phases are already visible in the abstract algorithm introsort, which is defined in NREST. Here, almost_sort_spec (T) specifies an algorithm that almost-sorts a list, consuming at most T resources, and final_sort_spec (T) specifies an algorithm that sorts an almost-sorted list, consuming at most T resources.
The program introsort leaves trivial lists unchanged and otherwise executes the first and second phase. Its resource usage is bounded by the sum of the first and second phase and some overhead for the subtraction, comparison, and if-then-else. Using the verification condition generator, we prove that introsort is correct, i.e., refines the specification of sorting a slice. Here, E_is = 0(sort := introsort_cost) is the exchange rate used at this step, and introsort_cost = $sub + $if + $lt + $almost_sort + $final_sort is the total allotted cost for introsort.
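The two-phase idea described in this section can be sketched as an ordinary executable program. The following Python code is only an illustration and not the verified NREST or LLVM program: the threshold value TAU = 16, the use of Python's heapq module as the heapsort fallback, and all function names are assumptions. The partition follows the classic Hoare scheme with the median of three moved to the front, which guarantees two non-empty parts for slices of length at least two.

```python
import heapq

TAU = 16  # hypothetical threshold; the text leaves its value to experiments

def heapsort_slice(xs, l, h):
    """Fallback: sort xs[l:h] in O(n log n) using a binary heap."""
    heap = xs[l:h]
    heapq.heapify(heap)
    for i in range(l, h):
        xs[i] = heapq.heappop(heap)

def partition(xs, l, h):
    """Hoare partition of xs[l:h] (length >= 2) around a median-of-3 pivot.
    Returns m with l < m < h such that every element of xs[l:m] is <= every
    element of xs[m:h]."""
    mid = (l + h) // 2
    # Move the median of first, middle and last element to position l.
    _, k = sorted([(xs[l], l), (xs[mid], mid), (xs[h - 1], h - 1)])[1]
    xs[l], xs[k] = xs[k], xs[l]
    pivot = xs[l]
    i, j = l - 1, h
    while True:
        j -= 1
        while xs[j] > pivot:
            j -= 1
        i += 1
        while xs[i] < pivot:
            i += 1
        if i >= j:
            return j + 1
        xs[i], xs[j] = xs[j], xs[i]

def quicksort_phase(xs, l, h, d):
    """First phase: almost-sort xs[l:h]; slices of length <= TAU stay unsorted."""
    if h - l > TAU:
        if d == 0:
            heapsort_slice(xs, l, h)       # depth limit reached: fall back
        else:
            m = partition(xs, l, h)
            quicksort_phase(xs, l, m, d - 1)
            quicksort_phase(xs, m, h, d - 1)

def final_insertion_sort(xs):
    """Second phase: one pass of insertion sort over the whole list."""
    for i in range(1, len(xs)):
        x, j = xs[i], i
        while j > 0 and xs[j - 1] > x:
            xs[j] = xs[j - 1]
            j -= 1
        xs[j] = x

def introsort(xs):
    n = len(xs)
    if n > 1:
        depth = 2 * (n.bit_length() - 1)   # 2 * floor(log2 n)
        quicksort_phase(xs, 0, n, depth)
        final_insertion_sort(xs)
```

Calling introsort(xs) sorts the list in place. Passing a depth of 0 to quicksort_phase forces the heapsort fallback immediately, which is a convenient way to exercise that branch.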

Quicksort Scheme
The first phase can be implemented in the following way:

    rec_c (λintrosort_rec (xs,l,h,d).
      …
 8          slice_sort_spec xs l h ($sort_c (μ (h−l)))
 9        else
10          (xs,m) ← partition_spec xs l h; ($partition_c (h−l))
11          xs ← introsort_rec (xs,l,m,d′);
13          xs ← introsort_rec (xs,m,h,d′);
14          return xs
15      else return xs
16    ) (xs,l,h,d)

where partition_spec partitions a slice into two non-empty partitions, returning the start index m of the second partition, and depth_spec specifies 2⌊log(h − l)⌋. Let us first analyze the recursive part: if the slice is shorter than the threshold τ, it is simply returned (line 15). Unless the recursion depth limit is reached, the slice is partitioned using h − l partition_c coins, and the procedure is called recursively for both partitions (lines 10-14). Otherwise, the slice is sorted at a price of μ (h−l) sort_c coins (line 8). The function μ here represents the leading term in the asymptotic costs of the used sorting algorithm, and the sort_c coin can be seen as the constant factor. This currency will later be exchanged into the respective currencies that are used by the sorting algorithm. Note that we use currency sort_c to describe costs per comparison of a sorting algorithm, while currency sort describes the cost for a whole sorting algorithm.
Showing that the procedure results in an almost-sorted list is straightforward. The running time analysis, however, is a bit more involved. We presume a function μ that maps the length of a slice to an upper bound on the abstract steps required for sorting the slice. We will later use heapsort with μ_nlogn n = n log n.
Consider the recursion tree of a call in introsort_rec: We pessimistically assume that for every leaf in the recursion tree we need to call the fallback sorting algorithm. Furthermore, we have to partition at every inner node. This has cost linear in the length of the current slice. For each following inner level, the lengths of the slices add up to the current one's, and so do the incurred costs. Finally, we have some overhead at every level including the final one. The cost of the recursive part of introsort_aux is therefore bounded by a sum of three summands, one for each of these contributions. The correctness of the running time bound is proved by induction over the recursion of introsort_rec. If the recursion limit is reached (d = 0), the first summand pays for the fallback sorting algorithm. If d > 0, part of the second summand pays for the partitioning of the current slice, then the list is split into two and the recursive costs are paid for by parts of all three summands. To bound the costs for the fallback sorting algorithm, μ needs to be superadditive: μ a + μ b ≤ μ (a+b). In both cases, the third summand pays for the overhead in the current call.
For d = 2⌊log n⌋ and an O(n log n) fallback sorting algorithm (μ = μ_nlogn), introsort_rec_cost μ_nlogn is in O(n log n). In fact, any d ∈ O(log n) would do.
Note that specifications typically use a single coin of a specific currency for their abstract operation, which is then exchanged for the actual costs, usually depending on the parameters.
This concludes the interesting part of the running time analysis of the first phase. It is now left to plug in an O(n log n) fallback sorting algorithm, and a linear partitioning algorithm.
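The inductive argument for the first two summands can be checked by brute force on small inputs. The sketch below is an illustration under several assumptions: the overhead summand is ignored, sort_c and partition_c coins are added up in a single number, the logarithm is taken as ⌊log₂ n⌋, the threshold TAU = 4 is arbitrary, and the bound μ(n) + d·n is an assumed shape rather than the cost expression of the formalization.

```python
from functools import lru_cache

TAU = 4  # hypothetical small threshold so that short slices become leaves

def mu(n):
    """mu_nlogn with an integer logarithm: n * floor(log2 n), and 0 for n < 2."""
    return n * (n.bit_length() - 1) if n >= 2 else 0

@lru_cache(maxsize=None)
def worst_cost(n, d):
    """Worst case, over all possible splits, of the coins spent by the
    recursion on a slice of length n with remaining depth d."""
    if n <= TAU:
        return 0                       # short slices are returned unchanged
    if d == 0:
        return mu(n)                   # fallback sort: mu(n) sort_c coins
    # n partition_c coins for this node, then the worst split into two
    # non-empty parts, each handled with depth d - 1.
    return n + max(worst_cost(m, d - 1) + worst_cost(n - m, d - 1)
                   for m in range(1, n))

def bound(n, d):
    """Assumed shape of the bound without overhead: mu(n) + d * n."""
    return mu(n) + d * n
```

The check worst_cost(n, d) <= bound(n, d) succeeds because mu is superadditive, so the two recursive calls together never need more sort_c coins than the parent was given, and each level of the recursion tree spends at most n partition_c coins.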

Heapsort
Independently of introsort, we have proved correctness and worst-case complexity of heapsort, yielding the following refinement lemma:

heapsort xs l h ≤ ⇓_C (E_hs (h−l)) (slice_sort_spec xs l h ($sort))

where E_hs n = 0(sort := c₁ + log n * c₂ + n * c₃ + (n * log n) * c₄) for some constants c_i :: ecost.
Assuming that n ≥ 2, we can estimate E_hs n sort ≤ μ_nlogn n * c, for c = c₁ + c₂ + c₃ + c₄, and thus get, for E_hs = 0(sort_c := c):

⇓_C (E_hs (h−l)) (slice_sort_spec xs l h ($sort)) ≤ ⇓_C E_hs (slice_sort_spec xs l h ($sort_c (μ_nlogn (h−l))))

and, by transitivity,

heapsort xs l h ≤ ⇓_C E_hs (slice_sort_spec xs l h ($sort_c (μ_nlogn (h−l))))

Note that our framework allowed us to easily convert the abstract currency from a single operation-specific sort coin to a sort_c coin for each comparison operation.
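The estimate for n ≥ 2 can be spelled out term by term. The following derivation is a sketch that assumes log denotes the base-2 logarithm (possibly rounded down), so that log n ≥ 1 for n ≥ 2, and that the ordering on costs is pointwise.

```latex
% Assumption: \log is the base-2 logarithm (possibly rounded down),
% hence \log n \ge 1 and therefore 1,\ \log n,\ n \le n \log n for n \ge 2.
\begin{align*}
E_{hs}\,n\,\mathit{sort}
  &= c_1 + \log n \cdot c_2 + n \cdot c_3 + (n \log n) \cdot c_4 \\
  &\le (n \log n) \cdot c_1 + (n \log n) \cdot c_2
     + (n \log n) \cdot c_3 + (n \log n) \cdot c_4 \\
  &= (n \log n) \cdot (c_1 + c_2 + c_3 + c_4) \\
  &= \mu_{nlogn}\,n \cdot c .
\end{align*}
```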

Partition and Depth Computation
We implement partitioning with the Hoare partitioning scheme using the median-of-3 as the pivot element. Moreover, we implement the computation of the depth limit (2⌊log(h − l)⌋) by a loop that counts how often we can divide by two until zero is reached. This yields refinement lemmas for the implementations pivot_partition and calc_depth.

Combining the Refinements
We replace slice_sort_spec, partition_spec and depth_spec by their implementations heapsort, pivot_partition and calc_depth. We call the resulting implementation introsort_aux₂, and prove

introsort_aux₂ xs l h ≤ ⇓_C (E_aux (h−l)) (introsort_aux μ_nlogn xs l h)

where the exchange rate E_aux combines the exchange rates E_hs, E_pp and E_cd for the component refinements.
Transitive combination with the correctness lemma for introsort_aux then yields the correctness lemma for introsort_aux₂, where E_isa2 n = 0(almost_sort := ↓_C (E_aux n) (introsort_aux_cost n)) and the operation ↓_C E t applies an exchange rate E to a resource function t.

Refining Resources
The stepwise refinement approach allows us to structure an algorithm verification in a way that correctness arguments can be conducted on a high level and implementation details can be added later. Resource currencies permit the same for the resource analysis of algorithms: they summarize compound costs, allow reasoning on a higher level of abstraction and can later be refined into fine-grained costs. For example, in the resource analysis of introsort_aux, the currencies sort_c and partition_c abstract the cost of the respective subroutines. The abstract resource argument is independent of their implementation details, which are only added in a subsequent refinement step, via the exchange rate E_aux.
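Exchanging abstract coins for fine-grained ones can be pictured with a few lines of code. The following Python sketch is an illustration only: the function exchange, the encoding of costs as Counter objects, and the concrete rates are hypothetical and do not reflect the numbers of the formalization. As with the notation 0(sort := c), currencies without an entry in the rate are exchanged for nothing.

```python
from collections import Counter

def exchange(rate, cost):
    """Apply an exchange rate to a cost.

    `cost` maps currencies to coin counts; `rate` maps a currency to the
    bundle of finer-grained coins that one coin of it is worth.
    Currencies that the rate does not mention are exchanged for zero."""
    result = Counter()
    for currency, coins in cost.items():
        for target, price in rate.get(currency, {}).items():
            result[target] += coins * price
    return result

# Hypothetical rates: what one abstract coin costs in lower-level currencies.
RATE_AUX = {
    "sort_c": {"lt": 1, "load": 2, "store": 2},
    "partition_c": {"lt": 1, "load": 1},
}

abstract_cost = Counter(sort_c=3, partition_c=2)
concrete_cost = exchange(RATE_AUX, abstract_cost)
```

In this example, concrete_cost contains 5 "lt" coins, 8 "load" coins and 6 "store" coins. The abstract analysis only ever talks about sort_c and partition_c; the finer currencies enter through the rate alone.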

Final Insertion Sort
The second phase is implemented by insertion sort, repeatedly calling the subroutine insert. The specification of insert for an index i captures the intuition that it goes from a slice that is sorted up to index i−1 to one that is sorted up to index i. Insertion is implemented by moving the last element to the left, as long as the element left of it is greater (or the start of the list has been reached). Moving an element to its correct position takes at most τ steps, as after the first phase the list is almost sorted, i.e., any element is less than τ positions away from its final position in the sorted list. Moreover, elements originally at positions greater than τ will never reach the beginning of the list, which allows for the unguarded optimization. It omits the bounds check for those elements, saving one index comparison in the innermost loop. Formalizing these arguments yields the implementation final_insertion_sort, which satisfies a corresponding refinement lemma, where E_fis n = 0(final_sort := final_insertion_cost n), and final_insertion_cost n is linear in n.
Note that final_insertion_sort and introsort_aux₂ use the same currency system. Plugging both refinements into introsort yields introsort₂ and a corresponding refinement lemma, where the exchange rate E_is2 combines the rates E_isa2 and E_fis.
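The guarded and unguarded variants of insert can be contrasted in a short sketch. The Python code below is an illustration and not the verified implementation; the function names and the split at index tau are assumptions modelled on the description above. Note that Python would silently wrap around at index −1 instead of faulting, so the unguarded variant relies entirely on its precondition.

```python
def insert_guarded(xs, i):
    """Move xs[i] left into the sorted prefix xs[:i], checking the bound j > 0."""
    x, j = xs[i], i
    while j > 0 and xs[j - 1] > x:
        xs[j] = xs[j - 1]
        j -= 1
    xs[j] = x

def insert_unguarded(xs, i):
    """Same as insert_guarded but without the bounds check.
    Precondition: some element left of position i is <= xs[i]."""
    x, j = xs[i], i
    while xs[j - 1] > x:
        xs[j] = xs[j - 1]
        j -= 1
    xs[j] = x

def final_insertion_sort(xs, tau):
    """Sort an almost-sorted list: every element is assumed to be less than
    tau positions away from its final position (tau >= 1)."""
    n = len(xs)
    for i in range(1, min(tau, n)):
        insert_guarded(xs, i)
    # The minimum of the list sits among the first tau elements, so after the
    # loop above xs[0] is a global minimum and stops every unguarded insert.
    for i in range(tau, n):
        insert_unguarded(xs, i)
```

For example, a list built from reversed blocks of four elements, such as [3, 2, 1, 0, 7, 6, 5, 4, ...], satisfies the precondition for tau = 4 and is sorted by final_insertion_sort(xs, 4).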

Separating Correctness and Complexity Proofs
A crucial function in heapsort is sift_down, which restores the heap property by moving the top element down in the heap. To implement this function, we first prove correct a version sift_down₁, which uses swap operations to move the element. In a next step, we refine this to sift_down₂, which saves the top element, then executes upward moves instead of swaps, and, after the last step, moves the saved top element to its final position. This optimization spares half of the memory accesses, exploiting the fact that the next swap operation will overwrite an element just written by the previous swap operation.
However, this refinement is not structural: it replaces swap operations by move operations, and adds an additional move operation at the end. At this point, we chose to separate the functional correctness and resource aspect, to avoid the complexity of a combined non-structural functional and currency refinement. It turns out that proving the complexity of the optimized version sift_down₂ directly is straightforward. Thus, as sketched in Section 2.4, we first prove sift_down₂ ≤ sift_down₁ ≤ sift_down_spec (∞), ignoring the resource aspect. Separately, we prove sift_down₂ ≤_n spec (λ_. True) sift_down_cost, and combine the two statements to get sift_down₂ ≤ sift_down_spec sift_down_cost.
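The difference between the swap-based and the move-based version is easy to see in code. The following Python sketch is illustrative only: it works on a max-heap stored in a list, the function names are hypothetical, and it counts only array writes in a one-element list as a stand-in for the memory accesses mentioned above.

```python
def sift_down_swap(heap, i, n, writes):
    """Sift heap[i] down in the max-heap heap[:n] using swaps.
    writes[0] is incremented by 2 for every swap (two array writes)."""
    while True:
        child = 2 * i + 1
        if child >= n:
            return
        if child + 1 < n and heap[child + 1] > heap[child]:
            child += 1                      # pick the larger child
        if heap[child] <= heap[i]:
            return
        heap[i], heap[child] = heap[child], heap[i]
        writes[0] += 2
        i = child

def sift_down_move(heap, i, n, writes):
    """Same result, but saves the top element, moves children upward,
    and writes the saved element once at its final position."""
    saved = heap[i]
    while True:
        child = 2 * i + 1
        if child >= n:
            break
        if child + 1 < n and heap[child + 1] > heap[child]:
            child += 1
        if heap[child] <= saved:
            break
        heap[i] = heap[child]               # upward move, one write
        writes[0] += 1
        i = child
    heap[i] = saved                         # final move of the saved element
    writes[0] += 1
```

For an element that sinks k levels, the swap version performs 2k writes and the move version k + 1. The move version also shows why the refinement is not structural: the final write has no counterpart in the swap version.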

Refining to LLVM
The above abstract programs implicitly come with a fixed type and comparison operator for the elements of the list to be sorted. Those programs use abstract operations and currencies for arithmetic operations on indexes, control flow, comparisons and read/write of a random-access iterator (abstracted by lists with update and lookup operations).
When we further assume an LLVM program that refines the comparison operator in LLVM, and specify how the random-access data structure should be implemented (we choose arrays), we can automatically synthesize an LLVM program introsort_impl that refines introsort₂, i.e., satisfies a corresponding hnr theorem. Combination with the refinement lemmas for introsort₂ and introsort, followed by conversion to a Hoare triple, yields our final correctness statement, where introsort_impl_cost :: nat → ecost is the cost bound obtained from applying the exchange rates E_is and then E_is2 to $sort.
Note that this statement is independent of the Refinement Framework. Thus, to believe in its meaningfulness, one only has to check the formalization of Hoare triples, separation logic, and the LLVM semantics.
To formally prove the statement "introsort_impl has complexity O(n log n)", we observe that introsort_impl_cost uses only finitely many currencies, and only finitely many coins of each currency. We define the overall number of coins as

introsort_impl_allcost n = Σc. introsort_impl_cost n c

which expands to

introsort_impl_allcost n = 4693 + 5 * log n + 231 * n + 455 * (n * log n)

which, in turn, is routinely proved to be in O(n log n).
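The closed form above can be evaluated directly, and a concrete constant for the O(n log n) bound can be checked numerically. The sketch below assumes that log denotes ⌊log₂ n⌋, which the text does not state explicitly; the function names and the constant 5384 (the sum of the four coefficients) are choices made for this illustration, not part of the formal proof.

```python
def floor_log2(n):
    """floor(log2 n) for n >= 1, computed exactly on integers."""
    return n.bit_length() - 1

def allcost(n):
    """Total number of coins according to the closed form in the text,
    with log n read as floor(log2 n) (an assumption)."""
    log_n = floor_log2(n)
    return 4693 + 5 * log_n + 231 * n + 455 * (n * log_n)

def nlogn_bound(n):
    """Candidate bound: (4693 + 5 + 231 + 455) * n * log n."""
    return 5384 * n * floor_log2(n)
```

For n ≥ 2, each of 1, log n and n is at most n log n, so every term of allcost is bounded by its coefficient times n log n, which is why summing the coefficients yields a valid constant.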
As a last step, we instantiate the element type to 64-bit unsigned integers and the comparison operation to LLVM's icmp ult instruction, to obtain a program that sorts integers in ascending order. Our code generator can export this to actual LLVM text and a corresponding header file for interfacing our sorting algorithm from C or C++.
As LLVM does not support generics, we cannot implement a replacement for C++'s generic std::sort<T>. However, by repeating the last step for different types and compare operators, we can implement a replacement for any fixed T.

Benchmarks
In this section we present benchmarks comparing the code extracted from our formalization with the real world implementation of introsort from the GNU C++ Library (libstdc++). Also, as a regression test, we compare with the code extracted from an earlier formalization of introsort [17] that did not verify the running time complexity and used an earlier iteration of the Sepref framework and LLVM semantics without time.
The results are shown in Figure 1. As expected, all three implementations have similar running times. Note that the small differences are well within the noise of the measurements. We conclude that adding the complexity proof to our introsort formalization, and the time aspect to our refinement process has not introduced any timing regressions in the generated code. Note, however, that the code generated by our current formalization is not identical to what the original formalization generated. This is mainly due to small changes in the formalization introduced when adding the timing aspect.

Conclusions
We have presented a refinement framework for the simultaneous verification of functional correctness and complexity of algorithm implementations with competitive practical performance.
We use stepwise refinement to separate high-level algorithmic ideas from low-level optimizations, enabling convenient verification of highly optimized algorithms. The novel concept of resource currencies also allows structuring of the complexity proofs along the refinement chain. Our framework refines down to the LLVM intermediate representation, such that we can use a state-of-the-art compiler to generate performant programs. As a case study, we have proved the functional correctness and complexity of the introsort sorting algorithm. Our verified implementation performs on par with the (unverified) state-of-the-art implementation from the GNU C++ Library. It also provably meets the C++11 standard library [7] specification for std::sort, which in particular requires a worst-case time complexity of O(n log n). We are not aware of any other verified real-world implementations of sorting algorithms that come with a complexity analysis.
Fig. 1. Comparison of the running time measured for the code generated by the formalization described in this paper (Isabelle-LLVM), the original formalization from [17] (notime), and the libstdc++ implementation. Arrays with 10^8 uint64s with various distributions were sorted, and we display the smallest time of 10 runs. The programs were compiled with clang-10 -O3, and run on an Intel XEON E5-2699 with 128GiB RAM and 256K/55M L2/L3 cache. See [17] for details of the benchmarking method.
Our work is a combination and substantial extension of an earlier refinement framework for functional correctness [15], which also comes with a verification of introsort [17], and a refinement framework for a single enat-valued currency [11]. In particular, we have generalized the refinement framework to arbitrary resources, introduced currencies that help organize refinement proofs, extended the LLVM semantics and reasoning infrastructure with a cost model, connected it to the refinement framework via a new version of the Sepref tool, and, finally, added the complexity analysis for introsort.

Related Work
Nipkow et al. [23, §4.1] collect verification efforts concerning sorting algorithms. We add a few instances verifying running time: Wang et al. use TiML [25] to verify correctness and asymptotic time complexity of mergesort automatically. Zhan and Haslbeck [26] verify functional correctness and asymptotic running time of imperative versions of insertion sort and mergesort. We build on earlier work by Lammich [17] and provide the first verification of functional correctness and asymptotic running time of heapsort and introsort.
The idea to generalize the nres monad [19] to resource types originates from Carbonneaux et al. [4]. They use potential functions (state → enat) instead of predicates (state → bool), present a quantitative Hoare logic and extend the CompCert compiler to preserve properties of stack-usage from programs in Clight to compiled programs.
We see our paper in the line of research concerning simultaneously verifying functional correctness and worst-case time complexity of algorithms. Atkey [1] pioneered resource analysis with separation logic; Guéneau et al. [9] present a framework that uses time credits in Coq and apply it to involved algorithms and data structures [10,6]. We further develop their work in three ways: First, while time credits usually are natural numbers [1,9,26,21,6] or integers [10], we generalize to an abstract resource type and specifically use resource currencies for a fine-grained analysis. Second, we use stepwise refinement to structure the verification and make the resource analysis of larger use cases manageable. Third, we provide facilities to automatically extract efficient, competitive code from the verification. The following are the most complex algorithms and data structures with verified running time analysis using time credits and separation logic we are aware of: a linear time selection algorithm [26], an incremental cycle detection algorithm [10], Union-Find [6], Edmonds-Karp and Kruskal's algorithm [11].

Future Work
A verified compiler down to machine code would further reduce the trusted code base of our approach. While that is not expected to be available soon for LLVM in Isabelle, the NREST-monad and the Sepref tool are general enough to connect to a different back end. Formalizing one of the CompCert C semantics [2] in Isabelle, connecting it to the NREST-monad and then processing synthesized C code with CompCert's verified compiler would be a way to go.
In this paper we apply our framework to verify an involved algorithm that only uses basic data structures, i.e. arrays. A next step is to verify more involved data structures, e.g. by porting existing verifications of the Imperative Collections Framework [16] to LLVM. We do not yet see how to reason about the running time of data structures like hash maps, where worst-case analysis would be possible but not useful. In general, extending the framework to average-case analysis and probabilistic programs are exciting roads to take.
We plan to implement more automation, saving the user from writing boilerplate code when handling resource currencies and exchange rates.
Neither the LLVM nor the NREST level of our framework is tied to running time. Applying it to other resources like maximum heap space consumption might be a next step.