1 Introduction

The idea of verifying that a program meets a given specification for all possible inputs has been studied for a long time. Part of the appeal of software verification is that it can ensure theoretical correctness of a software module for all possible usages. This is complementary to testing, which, by acting at a more concrete level, may detect resource or hardware errors that are typically outside the scope of software verification [43].

According to Hoare’s vision, a verifying compiler “uses automated mathematical and logical reasoning to check the correctness of the programs that it compiles” [80]. A variety of tools have blossomed in this space, including Spec# [16], Dafny [104], Why3 [65], OpenJML [46], ESC/Java [68], VeriFast [85], SPARK/Ada [13, 40, 121], AutoProof for Eiffel [163], Frama-C [52], KeY [2] and Whiley [139, 159]. Automated theorem provers are integral to such tools and are responsible for discharging proof obligations [16, 45, 68, 85]. Various automated provers, most commonly satisfiability modulo theories (SMT) solvers, are used for this, such as Z3 [53], CVC4 [19, 20], Yices2 [61], Alt-Ergo [49], Vampire [82, 95] or Simplify [55]. These provide handcrafted implementations of important decision procedures, e.g. for linear and nonlinear arithmetic [23, 44, 62, 144], congruence [126, 128] and quantifier instantiation [54, 71, 145, 146]. Different solvers are suited to different tasks, so the ability to use multiple solvers can improve the chances of successful verification.

Verifying compilers often target an intermediate verification language, such as Boogie [14], WhyML [28, 65] or Viper [125], as these provide a clean separation of concerns and allow different theorem provers to be used interchangeably. SMT-LIB [21] provides another standard readily accepted by modern automated theorem provers, although it is often considered rather low level [28]. One issue faced by intermediate verification languages is the potential for an impedance mismatch [139] (see Sect. 5). This arises when constructs in the source language cannot easily be translated into those of the intermediate verification language (and vice versa).

Whiley is a programming language with first-class support for software specifications that is designed to simplify verification [43, 134, 135, 137, 138, 139, 140, 159, 168, 169]. An important goal was to develop a system which is as accessible as possible and which one could imagine being used in a day-to-day setting. As such, Whiley superficially resembles a modern imperative language and employs flow typing [77, 133, 160] to eliminate unnecessary casts (which also aids specification). The ultimate aim is that all programs written in Whiley will be verified at compile time to ensure their specifications hold, which has obvious applications in, for example, safety-critical systems [40, 134]. In this paper, we explore Boogie as an intermediate verification language for Whiley. Our motivation is the desire to improve the verification capability of Whiley by leveraging the significant resources already invested in the development of Boogie (and Z3). A particular concern is the potential for an impedance mismatch arising from, for example, Whiley’s type system (which supports union types and flow typing).

The contributions of this paper include:

  • (Translation) A comprehensive account of our encoding of Whiley programs into Boogie for the purpose of verification. Whilst in many cases the translation is straightforward, a number of challenges had to be overcome arising from Whiley’s design, including: the encoding of Whiley’s expressive type system and support for flow typing and generics; Whiley’s implicit assumption that expressions in specifications are well defined; the ability to invoke methods from within expressions; the ability to return multiple values from a function or method; the presence of unrestricted lambda functions; and Whiley’s limited syntax for framing.

  • (Evaluation) An empirical comparison between Boogie/Z3 and the native Whiley verifier using the existing suite of 1100+ tests provided for the Whiley compiler. The results confirm that Boogie/Z3 significantly outperforms the native Whiley verifier in terms of the number of tests passing.

  • (Case Studies) A report into the use of Boogie/Z3 to verify a number of larger Whiley programs, including a web-based implementation of Conway’s Game of Life and a number of challenges from the VerifyThis 2019 competition [59]. From these case studies we identify several areas in which the Whiley language or libraries could be improved to better exploit Boogie.

We note also that our work provides further evidence of Boogie’s utility as a general-purpose intermediate verification language. In particular, unlike Dafny or Spec#, Whiley was developed entirely independently of Boogie and includes various design choices that are not necessarily a natural fit. As such, it was unclear from the outset of this project whether or not Boogie would be sufficiently general for this task. Finally, compared with our earlier paper [164], this paper represents a significant evolution and improvement of our translation. We also provide a much more detailed account which covers almost the entire language, including generics, lambdas, references and the handling of various soundness issues. Our evaluation now includes a number of larger case studies, and we have expanded the related work discussion.

Organisation. The remainder of this paper is organised as follows: Sect. 2 provides an introduction to Whiley and Boogie; Sect. 3 provides a detailed description of our Whiley-to-Boogie translator and discusses the various challenges encountered; Sect. 4 presents our evaluation using the existing Whiley compiler test suite and various case studies; Sect. 5 examines the related work; and Sect. 6 concludes. Finally, for reference, “Appendix A” illustrates our verified version of Conway’s Game of Life.

2 Background

We begin with an overview of Whiley and then a brief discussion of Boogie.

2.1 Whiley

The Whiley programming language has been developed to enable compile-time verification of programs and, furthermore, to make this accessible to everyday programmers [139, 159]. The Whiley Compiler (WyC) attempts to ensure that all functions and methods in a program meet their specifications. When this succeeds, we know that: (i) all function/method postconditions are met (assuming their preconditions held on entry); (ii) all invocations meet the respective function or method precondition; (iii) runtime errors such as divide-by-zero, out-of-bounds accesses and null-pointer dereferences cannot occur. Notwithstanding, such programs may still loop indefinitely and/or exhaust available resources (e.g. stack or heap).

2.1.1 Primitive Types

Whiley provides a small number of primitive types, including bool, byte and int (for unbounded integers). Likewise, types can be composed into records (e.g. {int x, int y}), arrays (e.g. int[]) and unions (e.g. int|null). Here, the latter represents a type which is either null or an int. Records can be constructed using literals (e.g. {x: 1, y: 2}), whilst arrays can be constructed using either literals (e.g. [1, 2, 3]) or generators (e.g. [0; 3], which gives [0, 0, 0]). The length of an array can also be queried dynamically (e.g. |xs|). As expected, user-defined types are supported and can be declared as follows:

figure o

Whiley also supports type polymorphism (i.e. generics) and recursive types (which are similar to algebraic data types) as follows:

figure p

The record type in this declaration has two fields. Thus, a list is either null or a record with the given structure. For completeness, we note that subtyping of generic types follows an (implicit) definition-site variance protocol [3]. Furthermore, user-defined types in Whiley offer greater flexibility than is typically found in implementations of algebraic data types (e.g. in Haskell). For example:

figure v

The above illustrates how one recursive type can implicitly subtype another. This highlights a key advantage of typing in Whiley over, for example, algebraic data types. The approach to typing taken in Whiley is, in fact, closer to structural typing [36, 60, 72, 117, 118], with certain caveats to ensure safe treatment of type invariants (see below).

2.1.2 Flow Typing

An unusual feature of Whiley is the use of a flow typing system [77, 132, 133, 160] coupled with union types [12, 84]. Union types support runtime type tests to discriminate their cases, as the following illustrates (recall the list type from above):

figure z

This counts the number of nodes in a list. Here, we see flow typing in action, as the tested variable is automatically retyped to the record type on the false branch [132, 133]. Flow typing turns out to be particularly useful when specifying programs. Specifically, in a conjunction whose left-hand side is a type test, the tested variable has the tested type within the right-hand side. This helps, for example, when writing postconditions (as we’ll see shortly).

2.1.3 Value Semantics

The semantics of Whiley diverge from many mainstream languages (e.g. Java) in the treatment of compound data types, such as arrays. Specifically, arrays and records in Whiley have value semantics. This means they are passed and returned by value (as in Pascal, MATLAB [98] or most functional languages). But, unlike functional languages (and like Pascal), values of compound types can be updated in place [129, 151]. This latter point serves to give Whiley the appearance of an imperative language when, in fact, Whiley has a functional core. The following illustrates:

figure ag

Despite appearances, the above is a pure function which has no side effects. This contrasts with languages like Java, where arrays are references and updating them has unavoidable side effects. The following attempts to clarify this further:

figure ah

In a language like Java, the assertion would fail because the two variables would alias each other. However, since this is not the case in Whiley, the above verifies without problem. We can think of arrays and records in Whiley as being immutable, so that updating them effectively means cloning them. Whiley adopts this semantics to facilitate their use in specification. Indeed, without a fundamental immutable collection type, verification is inherently challenging [99].

2.1.4 Side Effects

A function in Whiley is pure and cannot have side effects. In contrast, a method is impure and may have side effects, such as mutating the global heap or performing I/O. Whiley provides reference types which are allocated from a single global heap. For example, &int is a reference to an integer variable. The following illustrates the syntax:

figure ao

Here, an assignment through one reference also affects the other (because they are aliases), and hence, the final assertion holds. We note that, at the time of writing, Whiley supports allocation but not deallocation (and, hence, currently relies on garbage collection).

Statements which mutate the heap must appear within the body of a method and, for example, are not permitted within a function. To illustrate a more complete example, here is the classical algorithm for reversing a linked list [6]:

figure at

We note that the above is not yet fully specified, and this would be necessary before its behaviour could be fully verified (more on this later).

2.1.5 Packaging

Whiley currently supports a relatively limited form of packages and package management. For example, the standard library, std, can be added as a dependency and compiled against. The following illustrates a simple example:

figure av

The above illustrates a simple function for converting an integer array into a string. This employs functions from two standard library modules.

Fig. 1
figure 1

Implementation of a search function in Whiley, returning the least index at which the array matches the given item, or null if no match exists

2.1.6 Specification and Verification

We now consider those features of Whiley provided for specifying and verifying programs. Figure 1 provides an initial example to illustrate the salient features:

  • Properties are used to specify things of interest, particularly to help with verification. They are interpreted, meaning that, during verification, they can be expanded/unrolled as necessary. To facilitate this, they have a restricted form allowing them to be substituted in place for their body. In contrast, functions are uninterpreted, which helps ensure verification remains (mostly) modular [78]. This means that, during verification, their actual implementation is ignored at call sites (more on this below).

  • Preconditions are given by requires clauses and postconditions by ensures clauses. Multiple clauses are simply conjoined together. We have found that allowing multiple requires and/or ensures clauses can help readability, and note that JML [48], Spec# [16] and Dafny [104] also permit this.

  • Loop invariants are given by where clauses. Figure 1 illustrates an inductive loop invariant covering indices from zero up to the loop variable (exclusive). Similarly, type invariants arise from where clauses. For example, Figure 1 uses a type with an invariant for the loop variable, which avoids the need for a corresponding loop invariant. We consider good use of type invariants critical to improving the readability of function specifications.

  • Assertions must be statically checked during verification, thus providing a useful debugging tool. For example, if during verification we are struggling to understand why a given postcondition is not met, assertions can be added to check our beliefs at a given point. In contrast, assumptions are not statically checked and, instead, are simply assumed to hold during verification. As such, they are a useful tool for overriding the verifier in cases where it cannot establish something we know to be true.

  • Flow typing simplifies postconditions (amongst other things) by ensuring that casts need not be given. For example, without flow typing, the first ensures clause from Figure 1 would require a cast on the right-hand side.

Being uninterpreted means a function’s implementation can change arbitrarily without affecting callers provided it still meets its specification. However, it also means that functions need to be properly specified before they can be used, which is sometimes problematic (e.g. when several functions are developed in tandem). For example, consider the following:

figure bo

Whilst the above function is implemented correctly, it has yet to be specified. Perhaps this has arisen because it is, in fact, part of a larger function being developed:

figure bp

At this point, the larger function cannot be statically verified, because the specification of the function it calls (or lack thereof) yields insufficient information at the call site.

Framing. A related aspect of static verification is the need for clarity around side effects and framing [9, 91, 92, 130, 131, 153]. A key issue is the ability to distinguish the value of state before a method call from its value after it. Languages such as Dafny, JML and Boogie support this by allowing one to refer to the “old” state of a location (i.e. the value it held on entry). For example, in JML, a method’s postcondition can compare a location with its \old value to state that the method increases the value stored there. Whiley supports similar syntax, as the following illustrates:

figure bu

This simple method swaps the values referred to by its two reference parameters and, to specify it, we had to use the old syntax. With the above specification, we can verify, for example, the following snippet:

figure bz

Here, the first assertion follows from the specification of the swap method. In contrast, the second follows because the state it refers to is not reachable from any parameter passed to the method and, hence, could not have been modified by it.
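To make the framing argument concrete, the following Python sketch (not Whiley or Boogie) models the heap as a dictionary from hypothetical reference identifiers to integers. It checks the two facts a verifier relies on after a call: the postcondition relating new and old values, and a frame condition stating that locations not passed to the method are unchanged. All names here are illustrative assumptions.

```python
from typing import Dict

# Hypothetical heap model: reference identifier -> integer stored there.
Heap = Dict[str, int]


def swap(heap: Heap, x: str, y: str) -> Heap:
    """Model a swap method: build the post-state, leaving the pre-state ('old') intact."""
    post = dict(heap)
    post[x], post[y] = heap[y], heap[x]
    return post


def satisfies_contract(old: Heap, new: Heap, x: str, y: str) -> bool:
    # Postcondition, informally: *x == old(*y) and *y == old(*x)
    post_ok = new[x] == old[y] and new[y] == old[x]
    # Frame condition: every location other than x and y keeps its old value
    frame_ok = all(new[r] == old[r] for r in old if r not in (x, y))
    return post_ok and frame_ok
```

For example, swapping "a" and "b" in {"a": 1, "b": 2, "c": 3} yields {"a": 2, "b": 1, "c": 3}, and the check succeeds precisely because "c" is untouched.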

2.2 Boogie

Boogie [14] is an intermediate verification language developed by Microsoft Research as part of the Spec# project [16]. Boogie is intended as a back end for other programming languages and verification systems [106], and has found use in various tools, such as Dafny [104], VCC [45] and others (e.g. [25]). Boogie is both a specification language (which shares some similarity with Dijkstra’s language of guarded commands [57]) and a tool for checking that Boogie “programs” are correct. The original Boogie language was “somewhat like a high-level assembly language in that the control flow is unstructured but the notions of statically scoped locals and procedural abstraction are retained” [14]. However, later versions support structured if and while statements to improve readability. Nevertheless, a non-deterministic goto statement is retained for encoding arbitrary control flow; it permits multiple target labels, with the choice between them made non-deterministically. Boogie provides various primitive types, including int, bool and map types, which can be used to model arrays and records. Concepts such as a “program heap” can also be modelled using a map from references to values.

Boogie supports function and procedure declarations, which have an important distinction. In general, functions are pure and can be used within the Boogie logic, such as in axioms and specifications. In contrast, procedures are potentially impure and are intended to model methods in the source language. A procedure can be given a specification composed of requires and ensures clauses, and also a modifies clause indicating non-local state that can be modified. Most importantly, a procedure can be given an implementation, and the tool will attempt to ensure this implementation meets the given specification. The requires and ensures clauses of procedures demarcate proof obligations, for which Boogie emits verification conditions in first-order logic to be discharged by Z3. In addition, the implementation of a procedure may include assert and assume statements. The former lead to proof obligations, whilst the latter give properties which the underlying theorem prover can exploit.
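The different roles of assert and assume can be illustrated with the classical weakest-precondition rules: wp(assert P, Q) = P ∧ Q, whereas wp(assume P, Q) = P ⇒ Q. The Python sketch below encodes these two rules over predicates on program states and checks validity by brute-force enumeration of a small, hypothetical state space standing in for Z3. Boogie's actual verification-condition generation is considerably more sophisticated; this only shows the underlying idea.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, int]
Pred = Callable[[State], bool]


def wp_assert(p: Pred, post: Pred) -> Pred:
    # 'assert p' creates an obligation: p must hold, and so must the rest
    return lambda s: p(s) and post(s)


def wp_assume(p: Pred, post: Pred) -> Pred:
    # 'assume p' only adds a hypothesis: states violating p are discarded
    return lambda s: (not p(s)) or post(s)


def valid(pred: Pred, states: Iterable[State]) -> bool:
    """Brute-force validity check over a finite set of states (a stand-in for Z3)."""
    return all(pred(s) for s in states)


def done(s: State) -> bool:
    # Postcondition of the empty remainder of the program
    return True


STATES = [{"x": x} for x in range(-3, 4)]
```

Thus "assume x > 0; assert x >= 1" is valid over STATES, whereas "assert x > 0" on its own is not.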

To illustrate Boogie, Figure 2 provides an example encoding of the function from Figure 1 into Boogie. Note that the example encodings used in this section differ somewhat from the more sophisticated encoding used later in the paper. At first glance, it is perhaps surprising how close to an actual programming language Boogie has become. Various features of the language are demonstrated with this example. Firstly, an array length operator is encoded using an uninterpreted function and an accompanying axiom. Secondly, the input array is modelled using a map, which is a total mapping from arbitrary integers to arbitrary integers. For example, the map has a valid element at a negative index, even though such an index is not normally a valid array index (e.g. in Whiley). We can refine this to something closer to an array through additional constraints, as shown in the next section.

Whilst the structured form of Boogie is preferred where possible, it is also useful to consider the unstructured form, which we use for a few Whiley constructs (see Sect. 3.4.1). Figure 3 provides an unstructured encoding of the function from Figure 2. In this version, the loop is decomposed using a non-deterministic goto statement: this allows control to jump to either label, but the assume statements after those labels block progress if their condition is false. Likewise, in this unstructured encoding, the loop condition and invariant are explicitly assumed (lines 8, 9 and 12) and asserted (lines 15 and 16), rather than being handled implicitly by the tool (as in Figure 2). The havoc statement “assigns an arbitrary value to each indicated variable” [14], so it is used here to indicate that the loop variable contains an arbitrary integer value at this point.

Finally, we note that Boogie allows one to designate preconditions, postconditions and loop invariants as free. This allows Boogie to assume these conditions hold without checking them, thereby (potentially) reducing overall verification time [103].

Fig. 2
figure 2

Simple Boogie program encoding an implementation of the function, making extensive use of the structured syntax provided in later versions of Boogie

Fig. 3
figure 3

Unstructured encoding of the example from Figure 2 (the pre-/postconditions and supporting declarations are omitted as they are unchanged from above)

3 Modelling Whiley in Boogie

Our goal is to model as much of the Whiley language as possible in Boogie, so that we can utilise Boogie for verifying Whiley programs. Indeed, the motivation for this project was the hope that Boogie would offer significantly better verification capability than the existing (and relatively ad hoc) native verifier used in Whiley (and, as Sect. 4 shows, this is the case). At a superficial level, Whiley’s native verifier is not so different from Boogie/Z3. In particular, it employs an intermediate assertion language in which verification conditions are encoded and then discharged using a purpose-built SMT solver [139]. A key advantage is that the generated verification conditions resemble the Whiley source language much more closely. Nevertheless, whilst this toolchain has potential, it remains relatively immature compared with Boogie/Z3 and the considerable resources invested in their development [16]. Moving to Boogie is not without challenges, however, since, despite their obvious similarities, there remain significant differences between Whiley and Boogie:

  • Types. Whiley has a relatively rich (structural) type system which includes: union, record, array, reference and lambda types. Furthermore, there is support for type polymorphism through generics.

  • Flow Typing. Whiley’s support for flow typing is also problematic, as a given variable may have different types at different program points and there is a need to support runtime type tests [133].

  • Functions. Whiley functions are defined via code bodies, whereas the body of a Boogie function can contain only a single expression.

  • Methods. Whiley methods correspond quite well with procedures in Boogie, but may be invoked from within expressions in Whiley.

  • Definedness. Whiley implicitly assumes specification elements (e.g. pre-/postconditions and invariants) are well defined. This differs from other tools (e.g. Dafny) which require programmers to explicitly ensure that specification elements are well defined.

To understand the definedness issue, consider a precondition that contains an array access such as xs[i]. In a language like Dafny, one would additionally need to specify explicitly that i lies within the bounds of xs, to avoid the verifier reporting an out-of-bounds error. Such preconditions are implicit in Whiley, so they must be extracted automatically by our translator and made explicit in the generated Boogie.
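Extracting these implicit conditions amounts to a traversal of the expression tree. The Python sketch below illustrates the idea on a tiny hypothetical expression language (variables, constants, array indexing, division and comparison); it is not the Wy2B implementation. Each indexing contributes two bounds conditions and each division a non-zero divisor condition, with the conditions of subexpressions emitted first.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Index:  # arr[idx]
    arr: "Expr"
    idx: "Expr"


@dataclass(frozen=True)
class Div:  # num / den
    num: "Expr"
    den: "Expr"


@dataclass(frozen=True)
class Gt:  # lhs > rhs
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Var, Const, Index, Div, Gt]


def show(e: Expr) -> str:
    """Render an expression in Whiley-like concrete syntax."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Const):
        return str(e.value)
    if isinstance(e, Index):
        return f"{show(e.arr)}[{show(e.idx)}]"
    if isinstance(e, Div):
        return f"({show(e.num)} / {show(e.den)})"
    return f"{show(e.lhs)} > {show(e.rhs)}"


def definedness(e: Expr) -> List[str]:
    """Collect the side conditions needed for e to be well defined (subterms first)."""
    if isinstance(e, (Var, Const)):
        return []
    if isinstance(e, Index):
        i, a = show(e.idx), show(e.arr)
        return definedness(e.arr) + definedness(e.idx) + [f"0 <= {i}", f"{i} < |{a}|"]
    if isinstance(e, Div):
        return definedness(e.num) + definedness(e.den) + [f"{show(e.den)} != 0"]
    return definedness(e.lhs) + definedness(e.rhs)
```

For the precondition xs[i] > 0, this yields ["0 <= i", "i < |xs|"]. A full translator must also respect short-circuiting operators such as && and ==>, where conditions from the right-hand side only need to hold when the left-hand side permits evaluation; this sketch deliberately omits them.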

We now present the main contribution of this paper, namely a mechanism for translating Whiley programs into Boogie, which is implemented in our translator program, called Wy2B.

3.1 Types

Finding an appropriate representation of Whiley types is a challenge. We begin by considering the straightforward (i.e. naive) shallow translation of Whiley types into Boogie, and highlight why this fails. Then, we present a more sophisticated approach which corresponds more closely with a deep embedding of types.

Shallow Embedding. The simple and obvious translation of Whiley types into Boogie would be a direct translation to the built-in types of Boogie. Here, an int in Whiley is translated into a Boogie int, which is appropriate since both languages support unbounded integers. A Whiley array (e.g. int[]) then translates to a Boogie map (e.g. [int]int, with appropriate constraints), and Whiley records can also be translated using Boogie’s map type. However, by itself, this is not sufficient to model all Whiley types. For example, a union type such as int|null has no obvious corresponding representation in Boogie. Likewise, a Whiley type test (e.g. x is int) requires additional machinery. So this shallow embedding, where Whiley types are directly translated into Boogie types, is insufficient.

Deep Embedding. To support the more complex types found in Whiley, such as unions, we provide a deep embedding of all types into Boogie. Specifically, we model all Whiley values as disjoint members of a single set, and model the various Whiley types as subsets of this:

figure dt

For each Whiley type T, we define a membership predicate that holds exactly for the values in T, an extraction function that maps such values to a Boogie type, and an injection function which does the reverse. We axiomatise these two functions to define a partial bijection between a type’s representation and its corresponding subset of the set of all values. We also add Boogie axioms to ensure the subsets which correspond to each built-in Whiley type (e.g. null, bool and int) are mutually disjoint. This embedding has several advantages. Firstly, it is easy to model a Whiley user-defined subtype by defining its predicate as the conjunction of the predicate for its underlying type and its invariant. Secondly, union types simply map to disjunctions of these type predicates. Thirdly, Boogie can prove equality of two values only if they are constructed using the same injection function from values that are equal.
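The following Python sketch gives an executable analogue of this deep embedding, under the assumption that a tagged value can stand in for a member of the single set of Whiley values. The names Value, is_int, from_int, to_int, union and subtype are hypothetical; in Boogie these are uninterpreted functions constrained by axioms rather than executable code.

```python
from dataclasses import dataclass
from typing import Any, Callable

ValuePred = Callable[["Value"], bool]


@dataclass(frozen=True)
class Value:
    """One universe of embedded values; the tag keeps built-in types disjoint."""
    tag: str
    payload: Any


NULL = Value("null", None)


def is_null(v: Value) -> bool:
    return v.tag == "null"


def is_int(v: Value) -> bool:  # membership predicate for int
    return v.tag == "int"


def from_int(i: int) -> Value:  # injection: Boogie int -> embedded value
    return Value("int", i)


def to_int(v: Value) -> int:  # extraction: only meaningful when is_int(v)
    assert is_int(v)
    return v.payload


def union(*preds: ValuePred) -> ValuePred:
    # A union type is the disjunction of its members' predicates
    return lambda v: any(p(v) for p in preds)


def subtype(base: ValuePred, inv: ValuePred) -> ValuePred:
    # A user-defined type: underlying type's predicate conjoined with its invariant
    return lambda v: base(v) and inv(v)
```

Note that from_int(1) differs from Value("bool", True) even though Python considers 1 == True: the tags play the role of the disjointness axioms. Because subtype tests the base predicate first, the invariant is only evaluated on values of the right underlying type.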

Finally, to aid with the translation of compound types in Whiley (such as arrays; see Sect. 3.1.2 below), a special constant is used:

figure el

Observe that, since this value has (by design) no counterpart in Whiley, we must ensure it remains disjoint from all other Whiley values.

3.1.1 Primitives

Integers. The mapping functions for the Whiley type of unbounded integers are as follows (recall that int is also the Boogie name for integers).

figure eo

Bits and Bytes. Whiley includes a native byte type which supports the usual plethora of bitwise operators, including left and right shifts. For this, Boogie provides a family of bitvector types (e.g. bv8, bv16, etc.), of which bv8 provides a suitable match. To use this, however, we must exploit various built-in functions of the underlying solver to implement the bitwise operators, as follows:
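As a reference point for what these operators must compute, the Python sketch below models 8-bit bitvector semantics by masking every result to the range 0 to 255. It mirrors the behaviour of the SMT-LIB operations bvand, bvor, bvxor, bvnot, bvshl and bvlshr at width 8; the Python function names are hypothetical, and shift amounts are assumed to be non-negative.

```python
MASK = 0xFF  # bv8 values live in [0, 255]; results wrap modulo 2**8


def bv8(x: int) -> int:
    """Truncate a Python integer to an 8-bit unsigned bitvector."""
    return x & MASK


def bv_and(a: int, b: int) -> int:
    return bv8(a & b)


def bv_or(a: int, b: int) -> int:
    return bv8(a | b)


def bv_xor(a: int, b: int) -> int:
    return bv8(a ^ b)


def bv_not(a: int) -> int:
    # Python's ~ yields a negative number; masking recovers the 8-bit complement
    return bv8(~a)


def bv_shl(a: int, n: int) -> int:
    # Bits shifted past position 7 are lost, as with SMT-LIB bvshl
    return bv8(a << n)


def bv_lshr(a: int, n: int) -> int:
    # Logical shift right: zeros are shifted in from the left
    return bv8(a >> n)
```

For instance, shifting 0x81 left by one gives 0x02, because the top bit falls off the end of the byte.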

figure et

Coercions. In order to utilise our deep embedding, values must be coerced to and from primitive Boogie types. Consider assigning an integer expression to a variable whose type is a union (e.g. int|null). Since union types in Whiley are encoded in Boogie using the single type of all embedded values, we must coerce the value (of Boogie type int) into its embedded form via the integer injection function, and the assignment is translated accordingly. In general, our translation attempts to minimise the amount of boxing and unboxing. For example, generated expressions which apply an extraction function directly to the result of the matching injection function are automatically reduced to the injected argument. Amongst other things, this helps to simplify debugging!
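The reduction of redundant boxing can be pictured as a bottom-up rewrite over expression trees. In this Python sketch, expressions are nested tuples of the form (function_name, *args) and the extraction/injection pairs (to_int/from_int, to_bool/from_bool) are hypothetical names. Only extract(inject(e)) is rewritten, since the partial bijection guarantees that composition is the identity; the reverse composition inject(extract(v)) is only the identity when v is already known to belong to the type, so it is left alone.

```python
from typing import Tuple, Union

Expr = Union[str, Tuple]  # a variable name, or (function_name, *args)

# Hypothetical extraction -> matching injection function names
INVERSES = {"to_int": "from_int", "to_bool": "from_bool"}


def simplify(e: Expr) -> Expr:
    """Bottom-up removal of extract(inject(x)) pairs, e.g. to_int(from_int(x)) -> x."""
    if isinstance(e, str):
        return e
    head, *args = e
    args = [simplify(a) for a in args]
    if head in INVERSES and len(args) == 1:
        inner = args[0]
        if isinstance(inner, tuple) and inner[0] == INVERSES[head]:
            return inner[1]
    return (head, *args)
```

For example, ("+", ("to_int", ("from_int", "x")), "y") simplifies to ("+", "x", "y"), whereas a mismatched pair such as to_int(from_bool(b)) is kept.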

3.1.2 Arrays

Whiley arrays are fixed-length sequences of values whose length can be queried at runtime (recall from Sect. 2.1.3 that they have value semantics). We model Whiley arrays using: (1) a Boogie map from integers to values; and (2) an uninterpreted function returning the length. The embedding requires a number of additional axioms. As before, we provide extraction/injection functions as follows:

figure fg

A key aspect of our embedding is the treatment of indices which are out of bounds. The primary issue is that Boogie maps are infinite structures with no concept of bounds: elements which have not been explicitly defined always exist with some arbitrary value. This presents a problem for equality of arrays, as illustrated in Figure 4. To resolve this, we fix all out-of-bounds indices to the special constant introduced above, and enforce this throughout the axioms that follow.
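The equality problem can be reproduced with a small Python model of a total map: explicit entries plus a single default value shared by every other index. Two maps are extensionally equal only if they agree everywhere, including on that infinite default region. The names TotalMap, map_equal, encode and VOID are hypothetical; VOID stands in for the special constant.

```python
from dataclasses import dataclass, field
from typing import Dict, List

VOID = object()  # hypothetical stand-in for the special constant


@dataclass
class TotalMap:
    """A total map from int: explicit entries, plus one default for every other index."""
    entries: Dict[int, object] = field(default_factory=dict)
    default: object = None

    def get(self, i: int) -> object:
        return self.entries.get(i, self.default)


def map_equal(m: TotalMap, n: TotalMap) -> bool:
    # Infinitely many indices fall in both default regions, so the defaults must
    # agree, as well as the values at every explicitly defined index.
    keys = m.entries.keys() | n.entries.keys()
    return m.default == n.default and all(m.get(k) == n.get(k) for k in keys)


def encode(xs: List[object]) -> TotalMap:
    """Embed an array, fixing every out-of-bounds index to VOID."""
    return TotalMap(dict(enumerate(xs)), VOID)
```

Two maps holding [1, 2] but with different arbitrary defaults compare unequal, exactly as for the first and fourth arrays of Figure 4, whereas two arrays embedded with encode compare equal whenever their lengths and in-bounds elements agree.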

Array Length. We employ the following function for extracting the length of an array:

figure fj

In the above, we take steps to ensure the axioms remain consistent. To understand this, consider the last axiom above, which keeps the array length invariant across an update. The value being assigned cannot be the special constant as, otherwise, we could artificially reduce an array’s length (e.g. by assigning the special constant to the last element). Finally, whilst our encoding of arrays may appear somewhat elaborate, it does allow us to exploit Boogie’s built-in notion of equality. An alternative would be to define a bespoke equality operator for arrays (though this is complicated by the presence of unions and recursive types).
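Why the update axiom must exclude the special constant can be seen in the following Python model. It assumes, for illustration only, that the length of an encoded array is the boundary between its leading non-void indices and the void region; the paper's actual axioms are not reproduced here, and EncodedArray and VOID are hypothetical names. An update is permitted only at an in-bounds index with a non-void value, and under those side conditions the length is preserved.

```python
from typing import Dict, Iterable

VOID = object()  # hypothetical special constant; never a Whiley value


class EncodedArray:
    """An array as an index -> value map; every index outside [0, length) holds VOID."""

    def __init__(self, elems: Iterable[object] = ()) -> None:
        self.map: Dict[int, object] = dict(enumerate(elems))
        assert VOID not in self.map.values()

    def get(self, i: int) -> object:
        return self.map.get(i, VOID)

    def length(self) -> int:
        # Assumed model: length = number of leading indices that are not VOID
        n = 0
        while self.get(n) is not VOID:
            n += 1
        return n

    def update(self, i: int, v: object) -> "EncodedArray":
        """Functional update a[i := v], allowed only where length preservation is sound."""
        if v is VOID:
            raise ValueError("assigning VOID could shrink the array")
        if not 0 <= i < self.length():
            raise IndexError("update index out of bounds")
        out = EncodedArray()
        out.map = {**self.map, i: v}  # copy: the original array is unchanged
        return out
```

Writing VOID directly into the last slot of a three-element array, bypassing update, makes the modelled length drop to 2, which is precisely the inconsistency the axiom's side condition rules out.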

Fig. 4
figure 4

A pictorial illustration of four arrays embedded using (infinite) Boogie maps, where undefined (i.e. out-of-bounds) values are shown as “?”. We might expect the first and fourth arrays to be equal (i.e. since they have the same length and the same values within bounds), but this also depends on whether their out-of-bounds values are equal. To ensure these two arrays are indeed equal, we fix these undefined values to some known constant

Array Initialisers. Array values in Whiley can be constructed using the array literal syntax (e.g. [1, 2, 3]). This creates an array containing the given values (zero-indexed). To translate this, we employ a constructor as follows:

figure fq

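The intuition is that the constructor, given a size, builds an uninitialised array of that length, whose elements must then be initialised individually. For example, an array literal with three elements is translated into the constructor applied to 3, followed by one update for each element.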
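The "constructor then one update per element" shape of this translation can be sketched in Python as follows. The constructor name ctor is a hypothetical placeholder (the actual Boogie name is not reproduced here); the rendering uses Boogie's functional map-update syntax m[i := e], and None marks in-bounds slots whose contents are still unknown.

```python
from typing import Dict, List


def array_ctor(n: int) -> Dict[int, object]:
    """Hypothetical constructor: n in-bounds slots whose contents are not yet known (None)."""
    return {i: None for i in range(n)}


def translate_literal(elems: List[object]) -> Dict[int, object]:
    """Build [e0, ..., ek] as ctor(k+1) followed by one functional update per element."""
    arr = array_ctor(len(elems))
    for i, e in enumerate(elems):
        arr = {**arr, i: e}  # mirrors Boogie's map update m[i := e]
    return arr


def render(elems: List[str]) -> str:
    """Print the corresponding Boogie-style term; 'ctor' is a hypothetical name."""
    return f"ctor({len(elems)})" + "".join(f"[{i} := {e}]" for i, e in enumerate(elems))
```

Rendering the literal [a, b] gives the term ctor(2)[0 := a][1 := b], and evaluating translate_literal on [1, 2, 3] fills every slot left unknown by the constructor.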

Array Generators. Array values can also be constructed using the array generator syntax, [v; n] (recall Sect. 2.1.1). A second constructor is used for translating these as follows:

figure fx
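As an executable statement of the properties that an axiomatisation of the generator constructor would plausibly need to capture, the Python sketch below models [v; n] as a total function from indices to values together with its length: every in-bounds element equals v and every out-of-bounds index holds the special constant (VOID, a hypothetical name). Rejecting negative sizes is an assumption of this sketch.

```python
from typing import Callable, Tuple

VOID = object()  # hypothetical stand-in for the special constant


def generator(v: object, n: int) -> Tuple[Callable[[int], object], int]:
    """Model of [v; n]: a total index -> value function plus its length."""
    if n < 0:
        raise ValueError("generator size must be non-negative")

    def at(i: int) -> object:
        # In bounds: the generated value; out of bounds: the special constant
        return v if 0 <= i < n else VOID

    return at, n
```

For example, [0; 3] corresponds to a length of 3 with elements 0, 0 and 0, and VOID at indices -1 and 3.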

3.1.3 Records

Records are encoded using maps from field names to values, where a dedicated Boogie type characterises field names. For every field name used within the program, a unique constant is created. For example, if a record type with two fields is used, then the following constants are generated:

figure gb

These constants are then used as indices for the map encoding of the record (and of any other record type containing either field). The constants are marked unique to ensure they are distinct. Thus, the number of constants generated depends on exactly what types are used within the target program. As for arrays, care must be taken when encoding a given record to ensure that all other fields are mapped to the special constant. Again, various functions and axioms are provided to allow records to be embedded within other compound types:

figure gg

Like arrays, all fields not in a given record should hold . This cannot be enforced with an axiom as it depends upon the record type in question (i.e. what fields it has). Instead, this is enforced using constraints on parameters, returns and local variables as necessary.
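The record encoding and the per-type field constraint can be modelled in Python as below. FIELDS plays the role of the program-wide set of unique field-name constants, and record and has_exactly are hypothetical helpers: the first builds a total map over all field names with absent fields set to the special constant (VOID), and the second is the kind of constraint that would be attached to a variable of a particular record type.

```python
from typing import Dict, Set

VOID = object()  # hypothetical special constant

# One distinct constant per field name used anywhere in the program (cf. Boogie 'const unique')
FIELDS = ("x", "y", "z")


def record(**values: object) -> Dict[str, object]:
    """Encode a record as a total map over FIELDS; absent fields hold VOID."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise KeyError(f"undeclared fields: {sorted(unknown)}")
    return {f: values.get(f, VOID) for f in FIELDS}


def has_exactly(r: Dict[str, object], fields: Set[str]) -> bool:
    """Type constraint: the type's own fields are non-VOID, every other field is VOID."""
    return all((r[f] is not VOID) == (f in fields) for f in FIELDS)
```

Because every field is always present in the map, two records with the same fields and values are equal regardless of construction order, while a record with an extra field is distinct, since that field holds a value rather than VOID.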

Record Literals. As for arrays, a simple constructor is used for translating record literals:

figure gi

As an example, a record literal would be translated into Boogie by applying this constructor to the embedded values of its fields.

3.1.4 Generics

Type polymorphism in Whiley presents a number of challenges when translating to Boogie. Roughly speaking, we translate generic types into the Boogie type representing all Whiley values. We will return to discuss this in more detail later (see Sect. 3.4).

3.1.5 Lambdas

The ability to pass around first-class functions and methods as lambdas also presents some challenges, since lambdas in Boogie are relatively restricted. We return to discuss this in more detail later (see Sect. 3.4.1), but for now it suffices to introduce the following type, which represents the set of all lambda values:

figure gn

3.1.6 References

References in Whiley are modelled in a relatively standard fashion as indices into a heap, represented as a map from references to values [103]. Again, we return to discuss this in more detail later (see Sect. 3.5), and for now, we simply introduce the reference type:

figure gq

In addition, the following constant is provided for describing an arbitrary heap:

figure gr

The above is useful in various situations where there is no logical heap (more on this later). In particular, since it does not provide any guarantee about its contents, it cannot be relied upon at all.

3.1.7 User-Defined Types

Our treatment of user-defined types follows naturally from our embedding of types discussed above. Roughly speaking, we can consider that every user-defined type in Whiley consists of two parts: firstly, its base or underlying type; secondly, its invariants (if any). For example, consider the following Whiley declaration:

figure gs

The underlying type here is int, and the declaration enforces a single invariant. In our translation to Boogie, this declaration would produce the following:

figure gw

This allows for several different use cases. For example, if we have an unboxed variable of this type and wish to assume or assert its invariant, then the generated invariant predicate can be applied directly. Alternatively, if we are reading such a variable from a boxed position (e.g. out of an array or record), then the generated membership predicate can be applied. Observe also that, for uniformity, such predicates always accept a heap parameter even if (as in this case) it is not used. This parameter is necessary for user-defined types which are, or contain, references. For example, consider this declaration which builds upon the previous definition:

figure hc

This describes the type of references to integer values which enforce the constraint. This would be translated as follows:

figure he

Here, enforces upon the element in referred to by . Thus it becomes clear that the embedding of a reference type only makes sense in the context of a given .
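To make the role of the heap parameter concrete, here is a minimal Python model rather than the generated Boogie. It assumes, purely for illustration, a hypothetical type `nat` whose invariant is `x >= 0`, a heap modelled as a dictionary from integer addresses to values, and hypothetical predicate names `nat_inv` and `nat_ref_inv`. The same reference satisfies the reference type's invariant in one heap but not in another, which is why such predicates must take the heap as an argument.

```python
# Hypothetical model: a heap maps integer addresses to values.
Heap = dict[int, object]


def nat_inv(heap: Heap, x: int) -> bool:
    # Invariant of a hypothetical user-defined type `nat` (x >= 0).
    # The heap parameter is accepted for uniformity but is not used.
    return x >= 0


def nat_ref_inv(heap: Heap, r: int) -> bool:
    # Invariant of a hypothetical reference-to-nat type. It constrains the
    # value stored at location r, so it only makes sense relative to a heap.
    return r in heap and isinstance(heap[r], int) and nat_inv(heap, heap[r])


print(nat_ref_inv({0: 5}, 0))   # True
print(nat_ref_inv({0: -3}, 0))  # False: same reference, different heap
```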

3.2 Constants

Global constants in Whiley require care to ensure a safe translation. A well-known issue with Boogie arises when specifications written by the user (i.e. in Whiley) are translated into unguarded Boogie axioms. In such cases, the user can be considered as maliciously injecting problematic (though rarely useful) code.Footnote 4 For example, a user can (perhaps accidentally) insert (or some equivalent thereof) into the generated Boogie file. Unfortunately, the presence of such a declaration allows Boogie to immediately verify all assertions in the file (i.e. regardless of whether they are correct or not) [102]. More importantly, Boogie does not report this as an error, and hence, it happens silently without the user being made aware. To see how this applies to constants in Whiley, consider the following (recall the definition of from page 19):

figure hm

The challenge here is to ensure the value being assigned adheres to any type invariant(s) required of . One approach is to generate a typing axiom, such as , for this. Whilst this is sufficient for the above example, a problem arises if the value assigned were instead of . In such a case, the translation leads to the following:

figure hr

Unfortunately, these axioms conflict, as they imply both and (which is equivalent to ). To protect against this, we stratify our translation into two levels: the first establishes that global constants are correctly initialised, and the second verifies functions and methods assuming that they are. Following the approach taken in Dafny [104], this is done using a special constant . The following illustrates the translation of our example above:

figure hw

The above verifies without trouble. However, were to be initialised with , Boogie would now correctly report a failed proof obligation inside the method. Furthermore, note that all procedure bodies generated from functions or methods in Whiley require to ensure access to ’s invariant (see Figure 6 below).
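The danger of conflicting axioms can be illustrated by brute force. In the Python sketch below, an assertion counts as valid if it holds in every model of the axioms, where models are enumerated over a small finite domain (an assumption made purely for illustration; Boogie and its SMT back end reason over all integers). The two axioms are hypothetical stand-ins for a typing axiom and an invalid initialiser. Once they conflict, no model exists and even `False` is reported as valid, which is exactly the silent failure described above.

```python
def valid(axioms, assertion, domain):
    """An assertion is valid if it holds in every model of the axioms;
    models are enumerated by brute force over a finite domain."""
    return all(assertion(c) for c in domain if all(ax(c) for ax in axioms))


def typing_axiom(c):  # hypothetical invariant on a constant C
    return c >= 0


def bad_init(c):      # hypothetical (invalid) initialiser for C
    return c == -1


domain = range(-5, 6)  # illustrative finite domain only

# Consistent axioms: a false assertion is correctly rejected.
print(valid([typing_axiom], lambda c: c < 0, domain))            # False
# Conflicting axioms: no model exists, so even False is "valid".
print(valid([typing_axiom, bad_init], lambda c: False, domain))  # True
```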

3.3 Properties

Properties in Whiley are straightforward as they can be translated directly as Boogie functions. For example, consider the following property in Whiley:

figure ic

This is translated directly as follows (again name mangling would be applied in practice):

figure id

As for types, properties always accept a parameter for uniformity even when not needed. A key observation is that Boogie functions are strictly more expressive than properties in Whiley, and we will return later to consider the impact of this (see Sect. 4.4).

3.4 Functions

Recall that functions in Whiley are pure, have bodies comprised of statement blocks and may have multiple return values. This differs from functions in Boogie, whose bodies are made up of a single expression and can only return a single value. This presents challenges: firstly, the body of a Whiley function corresponds more closely with a Boogie procedure; but, secondly, functions in Whiley can be called from specification elements (e.g. pre-/postconditions) whereas Boogie procedures cannot. As such, we provide a two-pronged translation (similar to that found in Dafny [102]) comprising: a prototype implemented as a Boogie function which can be invoked from a specification element; and a body, implemented as a Boogie procedure, which can be invoked directly from the body of other functions or methods.

Fig. 5
figure 5

Illustrating a simple function in Whiley which, for brevity, has not been fully specified

Fig. 6
figure 6

Illustrating the generated Boogie code for the example. Note that this is somewhat simplified as various details related to name mangling and parameter shadowing are omitted

Figure 5 illustrates a simple function written in Whiley which we adopt as a running example, whilst the generated Boogie for this is shown in Figure 6. We will endeavour to fully clarify all aspects of this figure over the coming pages, but for now we focus on the procedure’s specification. Here, additional clauses are included to enforce the type of and, likewise, additional clauses for the type of . Whilst the soundness assumption for constants was discussed above, we will return to discuss the purpose of the function prototype and linkage later. We note also that, whilst functions in Whiley cannot modify the heap, they can manipulate references as simple values (though cannot mutate through them).

Generics. Since Whiley supports type polymorphism, we might like to upgrade our function as follows:

figure il

As discussed in Sect. 3.1.4, we translate the Whiley type as Boogie type . In terms of verifying the above function in isolation, this presents no problems. However, in most cases, call sites of this function would expect to receive an array of the same type they put in. For example, consider this:

figure io

In this case, Boogie must be able to determine that the return from is an array of integers. Thus, a mechanism is required to enable our translation to state meta properties about the relationships between variable types (e.g. that they are the same). To do this, we introduce meta types as follows:

figure iq

Here, the Boogie type represents the set of all meta types, whilst performs a similar function to, for example, (but for an arbitrary meta type). In this way, we extend the generated procedure for as follows:

figure iv

Here, we see the generic type is now passed as an argument to procedure and using this we can, for example, make statements about the return value. For example, the postcondition now tells us at a given call site that all elements in the returned array have the same type as those elements in the input array. To make this work, we still need one additional piece. Specifically, for every type which can be used to instantiate a type variable (e.g. in ) we construct a unique meta type constant. For example, the meta type constant for is declared as follows:

figure jb

Finally, we note that user-defined types must be extended to use meta types as well. For example, consider the following:

figure jc

The various Boogie support functions generated for this type (recall Sect. 3.1.7) must now accept a meta type parameter. For example, is defined as:

figure je

Overloading and Parameters. Overloading on parameter types is supported in Whiley, but not in Boogie. To resolve this, we employ name mangling for every property, function, method and type. Types are mangled as well, since mangled names also encode package and module information. Likewise, parameters of Boogie procedures are immutable, whereas parameters to functions or methods can be mutated in Whiley. To resolve this, our translator generates a shadow variable for each parameter which is assigned the parameter’s value on entry.

Function Linkage. Since functions in Whiley can be called from specification elements, a key question arises as to how such calls are encoded. Consider the following partial implementation of a stack:

figure jf

Here, the \(\ldots \) postcondition of uses other publicly visible functions to hide the implementation of .Footnote 5 Our translation of looks roughly as follows:

figure jj

The key here is that refers to the function prototype of , rather than its procedure. Furthermore, since is ensured in the postcondition of procedure , we can verify statements such as the following:

figure jo

Partial Correctness. An important limitation of Whiley is that it cannot ensure termination. For example, there is no equivalent syntax to as found in Dafny. As a result, non-terminating recursive functions can be verified with almost any postcondition. The following illustrates such an example:

figure jq

Observe that the above function will never violate its postcondition and, hence, is correct up to non-termination. In the future, we expect Whiley to be extended with support for variant expressions such that a well-founded ordering over recursive calls can be specified to ensure termination.

3.4.1 Statements and Expressions

Translating most Whiley statements and expressions into Boogie is straightforward (see the similarities between Figures 1 and 2). Here, we describe only the interesting cases that present specific challenges.

Variable Scoping.

Boogie requires all local variables to be declared at the start of a procedure body whereas, like most modern languages, Whiley allows variables to be declared with block scope. Whilst this is trivial to manage in most cases, name clashes can arise. The following illustrates:

figure jr

In this case, the same variable is declared twice with different types. This is a problem because they have incompatible types, and hence, we cannot declare a single Boogie variable to cover both. Instead, we apply name mangling to ensure variables in different scopes have unique names.
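A minimal Python sketch of this renaming follows. The statement tuples and the `name$n` mangling scheme are hypothetical, not the translator's actual encoding; the point is that each declaration is hoisted under a fresh name, and each use is resolved against the innermost enclosing scope.

```python
def hoist(stmts):
    """Hoist block-scoped declarations to the top of a procedure body.

    Statements are hypothetical tuples: ("decl", name, type), ("use", name)
    and ("block", [stmts]). Every declaration receives a unique mangled name
    (name$0, name$1, ...), so same-named variables of different types in
    disjoint scopes no longer clash once hoisted.
    """
    decls, body, counter = [], [], {}

    def walk(block, scope):
        scope = dict(scope)  # inner declarations must not leak outwards
        for stmt in block:
            if stmt[0] == "decl":
                _, name, ty = stmt
                n = counter.get(name, 0)
                counter[name] = n + 1
                scope[name] = f"{name}${n}"
                decls.append((scope[name], ty))
            elif stmt[0] == "use":
                body.append(("use", scope[stmt[1]]))
            elif stmt[0] == "block":
                walk(stmt[1], scope)

    walk(stmts, {})
    return decls, body


prog = [
    ("block", [("decl", "x", "int"), ("use", "x")]),
    ("block", [("decl", "x", "bool"), ("use", "x")]),
]
print(hoist(prog))
# ([('x$0', 'int'), ('x$1', 'bool')], [('use', 'x$0'), ('use', 'x$1')])
```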

Well-Definedness. As highlighted already, Whiley’s treatment of expressions (especially when used in specification elements such as pre-/postconditions) differs from other comparable systems (e.g. Dafny). In fact, handling this is straightforward and has been covered reasonably extensively elsewhere [102]. Essentially, when translating a Whiley expression, care must be taken to insert checks as necessary to ensure expressions are well defined. The following illustrates a simple example:

figure js

Here, there is an implicit assumption that and . Of course, this may not actually be the case and we employ statements to check such preconditions. As such, the above is translated roughly as follows:

figure jw

Whilst the extraction of such checks is straightforward in many cases, there are some challenges. In particular, we employ window inference [148] here. To understand why, consider the following:

figure jx

For this example, the following translation is not sufficient:

figure jy

This translation is invalid because the second may not hold. The problem is that this definedness check concerns only part of the condition, and that part is evaluated in a context where the preceding parts are already known to hold. Instead, for every check extracted, we must additionally extract the facts which have become known at that point within the expression. Thus, our translation is in fact as follows:

figure ka
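The idea can be sketched in Python. The expression tuples and the textual output are hypothetical; the sketch only shows how each extracted bounds check is guarded by the facts already established by earlier conjuncts (or by the negation of an earlier disjunct), rather than being emitted on its own.

```python
def show(e):
    """Render a hypothetical expression tuple as Whiley-like text."""
    kind = e[0]
    if kind in ("var", "const"):
        return str(e[1])
    if kind == "len":
        return f"|{show(e[1])}|"
    if kind == "index":
        return f"{show(e[1])}[{show(e[2])}]"
    if kind == "bin":
        return f"{show(e[2])} {e[1]} {show(e[3])}"
    if kind in ("and", "or"):
        op = "&&" if kind == "and" else "||"
        return f"({show(e[1])} {op} {show(e[2])})"
    raise ValueError(kind)


def checks(e, ctx=()):
    """Definedness checks for e, each guarded by the facts already known at
    that point within the expression (the window-inference idea)."""
    kind = e[0]
    if kind == "and":  # right operand is only evaluated when left holds
        return checks(e[1], ctx) + checks(e[2], ctx + (show(e[1]),))
    if kind == "or":   # right operand is only evaluated when left fails
        return checks(e[1], ctx) + checks(e[2], ctx + (f"!({show(e[1])})",))
    if kind == "index":
        arr, i = show(e[1]), show(e[2])
        check = f"0 <= {i} && {i} < |{arr}|"
        guarded = f"{' && '.join(ctx)} ==> ({check})" if ctx else check
        return checks(e[1], ctx) + checks(e[2], ctx) + [guarded]
    if kind == "len":
        return checks(e[1], ctx)
    if kind == "bin":
        return checks(e[2], ctx) + checks(e[3], ctx)
    return []  # variables and constants are always well defined


elem = ("bin", "==", ("index", ("var", "xs"), ("var", "i")), ("const", 0))
cond = ("and",
        ("and", ("bin", "<=", ("const", 0), ("var", "i")),
                ("bin", "<", ("var", "i"), ("len", ("var", "xs")))),
        elem)
print(checks(cond))
# ['(0 <= i && i < |xs|) ==> (0 <= i && i < |xs|)']
```

For this (assumed) condition, the guarded check is trivially valid, whereas checking `xs[i] == 0` in isolation yields the unguarded `0 <= i && i < |xs|`, which need not hold.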

Another aspect of this issue is the well-definedness of specification elements, such as pre-/postconditions and loop invariants. Consider the following (albeit contrived) example:

figure kb

Since the precondition for this function requires facts about , it follows (implicitly) that must be well defined (i.e. that is within bounds). Thus, our translator extracts such additional requirements as necessary, as the following illustrates:

figure kf

A similar approach is taken to handling loop invariants and, perhaps surprisingly, also for postconditions. For example, consider the following (albeit also contrived) example:

figure kg

In this case, it follows from the postcondition that holds and, hence, is translated as follows:

figure ki

Finally, we note that care must be taken in a number of contexts when extracting well-definedness conditions, such as for expressions nested within quantifiers.

Type Invariants. Our translation must ensure type invariants are properly preserved at all points. For example, consider the following (recall the definition of from page 19):

figure kk

In this case, we must establish that holds after is initialised, and also after it is subsequently reassigned. To do this, the above is translated as follows:

figure kn

Here, the Boogie function encapsulates the invariant for and is generated when translating the type declaration (recall Sect. 3.1.7).

Looking at Figure 6 provides further insight into this process. No assertion for invariant preservation is generated for because the type of variable is unconstrained. In other words, since the check would correspond to we simply optimise it away. However, such optimisation remains relatively simplistic, as checks are still produced unnecessarily for the assignment.

Invocation. Translating function invocations into Boogie presents something of a challenge, since functions can be invoked from arbitrary expressions (including specification elements discussed previously in §3.4). However, Boogie does not permit invocations from within an expression, and provides only a simple statement form for calling procedures (e.g. ).Footnote 6 In short, this means function invocations must be extracted from expressions. Consider the following snippet in Whiley:

figure kw

The above is translated into the following Boogie sequence:

figure kx

Here a temporary variable, , is introduced to hold the value returned from . Thus, the order of evaluation for expressions is exposed by the order in which the calls are made prior to the final expression. In general, this approach works fine, but there are challenges. Short-circuit semantics presents the first challenge. For example, consider the following:

figure la

In this case, we cannot just extract the function invocation and execute it before the statement. Such a translation would model being executed every time the statement is executed, which is not the case. Instead, we must carefully preserve short-circuit semantics using unstructured branching as necessary. For example, we can translate the above as follows:

figure le

We can see that, whilst this gives a faithful rendition of the original program, it is quite low level and harder to comprehend. This issue is further compounded with loops, whose unstructured representation is far more verbose (recall Figure 2 versus Figure 3).

Finally, we note that our approach above is reminiscent of that used for Spec# [113], but differs from Dafny (which does not permit method calls within expressions).
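The extraction of invocations can be sketched as follows, in Python. Expressions are hypothetical tuples, temporaries are named `tmp0`, `tmp1` and so on, and their `var` declarations are omitted. Unlike the unstructured branching described above, this sketch uses Boogie's structured `if` for readability; the essential point is the same: a call in the right operand of `&&` is only executed when the left operand holds.

```python
import itertools


def has_call(e):
    """Does the (hypothetical) expression tuple e contain an invocation?"""
    if e[0] == "call":
        return True
    return any(isinstance(c, tuple) and has_call(c) for c in e[1:])


class Extractor:
    """Hoists invocations out of an expression into Boogie `call`
    statements on fresh temporaries, preserving evaluation order."""

    def __init__(self):
        self.stmts = []
        self.fresh = itertools.count()

    def tmp(self):
        return f"tmp{next(self.fresh)}"

    def expr(self, e, ind=""):
        kind = e[0]
        if kind in ("var", "const"):
            return str(e[1])
        if kind == "bin":
            return f"({self.expr(e[2], ind)} {e[1]} {self.expr(e[3], ind)})"
        if kind == "call":
            args = ", ".join(self.expr(a, ind) for a in e[2])
            t = self.tmp()
            self.stmts.append(f"{ind}call {t} := {e[1]}({args});")
            return t
        if kind == "and":
            left = self.expr(e[1], ind)
            if not has_call(e[2]):
                return f"({left} && {self.expr(e[2], ind)})"
            # The right operand contains a call: only run it when the left
            # operand holds, so short-circuit semantics are preserved.
            r = self.tmp()
            self.stmts += [f"{ind}{r} := {left};", f"{ind}if ({r}) {{"]
            right = self.expr(e[2], ind + "  ")
            self.stmts += [f"{ind}  {r} := {right};", f"{ind}}}"]
            return r
        raise ValueError(kind)


ex = Extractor()
cond = ex.expr(("and",
                ("bin", ">", ("var", "i"), ("const", 0)),
                ("bin", ">", ("call", "f", [("var", "i")]), ("const", 1))))
print("\n".join(ex.stmts))
print(cond)
```

Here the call to `f` lands inside the branch guarded by `i > 0`, and the final condition is simply the temporary `tmp0`.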

Assignments. Boogie supports assignments to variables (e.g. ) and map elements (e.g. ). Unfortunately, our choice to represent arrays uniformly with Boogie type presents some minor challenges. For a Whiley variable of type , we could translate directly as . However, for a Whiley variable of type a direct translation fails because the Boogie type for is still (i.e. not as needed for a direct translation). For simplicity, we translate array assignments uniformly regardless of the nesting level. For example, consider the following:

figure lr

As seen in Figure 6, the above is translated using Boogie’s operator as follows:

figure lt

A similar approach is needed for assignments to records and to the heap via references.

A slightly more challenging issue arises from multiple assignments in Whiley. These have interesting semantics from a verification perspective. Consider this example:

figure lu

The semantics of multiple assignments mean that the type invariant of must hold after the assignment (hence the above correctly preserves its invariant). Observe, however, that attempting to assign each field individually would give a verification error, as the type invariant for would be temporarily broken. Thus, our translation of the above would be:

figure lx

Notice that the values of and are first stored in temporary variables to avoid interference between the left- and right-hand sides.

Another important aspect of multiple assignments is the semantics for conflicting assignments [75, 76]. The following illustrates:

figure ma

The key question is what value is assigned to when . We follow Gries by resolving this based on the order of the right-hand side. Thus, when above, holds after the assignment since is first assigned then . This differs from Dafny, where the above would be rejected unless was known.
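Both aspects of multiple assignment can be captured in a small Python model: all right-hand sides are evaluated against the old state (the role of the temporaries), and targets are then written left to right, so that when two targets alias, the rightmost assignment wins. The state representation and the target tuples are assumptions of this sketch, and array indices are assumed to have been evaluated in the old state already.

```python
import copy


def multi_assign(env, lhs, rhs):
    """Whiley-style multiple assignment `l1, l2, ... = r1, r2, ...`.

    env maps variable names to ints or lists. lhs holds hypothetical
    targets ("var", name) or ("elem", name, index); rhs holds functions
    of the state. Every right-hand side is evaluated against the OLD state
    first (the temporaries), and targets are then written left to right,
    so if two targets alias, the rightmost write wins.
    """
    temps = [f(env) for f in rhs]          # 1. evaluate all right-hand sides
    env = copy.deepcopy(env)
    for target, value in zip(lhs, temps):  # 2. assign in left-to-right order
        if target[0] == "var":
            env[target[1]] = value
        else:
            _, name, index = target
            env[name][index] = value
    return env


# Swap: without temporaries, the second write would see the first.
print(multi_assign({"x": 1, "y": 2},
                   [("var", "x"), ("var", "y")],
                   [lambda s: s["y"], lambda s: s["x"]]))
# {'x': 2, 'y': 1}

# Conflicting writes with i == j == 1: the later (rightmost) write wins.
print(multi_assign({"xs": [0, 0, 0]},
                   [("elem", "xs", 1), ("elem", "xs", 1)],
                   [lambda s: 1, lambda s: 2]))
# {'xs': [0, 2, 0]}
```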

Switches. Like many languages, Whiley supports multi-way branching via statements. Although Boogie has no switch statement, it does support non-deterministic . Hence, rather than using a sequence of if–else statements, we exploit this with appropriate constraints. The following illustrates:

figure ml

Here, corresponds with whilst corresponds with the default case. Note also that cases do not fall through by default in Whiley. Furthermore, if there are nested / statements these are translated into s as well.
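The following Python function sketches how such a translation might be generated. Label names are hypothetical (real output would be mangled), each case is assumed to match a single value, and the generated text is only indicative of the shape of the Boogie code: a non-deterministic `goto` to every branch, an `assume` that prunes non-matching branches, and a jump to a common exit so that cases do not fall through.

```python
def switch_to_boogie(scrutinee, cases, default):
    """Render a Whiley-style switch as Boogie using a non-deterministic
    goto. cases is a list of (value, [statements]); default is a list of
    statements. Each branch begins with an `assume` that prunes it unless
    its case matches, and jumps to a common exit label (no fall-through).
    Label names are hypothetical; real output would be mangled.
    """
    labels = [f"SWITCH_CASE{i}" for i in range(len(cases))]
    lines = [f"goto {', '.join(labels + ['SWITCH_DEFAULT'])};"]
    for label, (value, body) in zip(labels, cases):
        lines += [f"{label}:", f"  assume {scrutinee} == {value};"]
        lines += [f"  {s}" for s in body] + ["  goto SWITCH_EXIT;"]
    # The default branch is only feasible when no case value matched.
    unmatched = " && ".join(f"{scrutinee} != {v}" for v, _ in cases) or "true"
    lines += ["SWITCH_DEFAULT:", f"  assume {unmatched};"]
    lines += [f"  {s}" for s in default] + ["  goto SWITCH_EXIT;"]
    lines.append("SWITCH_EXIT:")  # followed by the code after the switch
    return "\n".join(lines)


print(switch_to_boogie("x", [(1, ["y := 0;"]), (2, ["y := 5;"])], ["y := 1;"]))
```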

Loops. Loops are also relatively easy to translate. Since Boogie supports only loops, all other looping forms found in Whiley must be translated using this. Furthermore, since Boogie has no or statement, we translate these using s as for statements. We note also that, for a do–while loop in Whiley, the loop invariant need not hold before the first iteration (which makes some proofs easier). Furthermore, if desired, one can always check the invariant on entry using an explicit statement.

One challenge faced in translating loops is the handling of types for variables which are modified in a loop. For example, in our translation of our translator inserted additional loop invariants to preserve the type of variable (recall Figure 6). This is necessary because the postcondition for restates that is an array of integers and this is not expressed explicitly in the type . Indeed, this is stated for in the function’s precondition but, since is modified in the loop, this information is lost within and after the loop (because Boogie havocs its value). To resolve this, we must reassert this type information as a loop invariant. In general, this is done for every variable modified in the loop.
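The following Python sketch illustrates the idea: collect the variables assigned anywhere in a loop body, then emit one invariant per variable restating its type. The statement tuples and predicate names such as `Int_is` are hypothetical placeholders rather than the translator's actual encoding.

```python
def modified(stmts):
    """Variables assigned anywhere in a hypothetical statement list made of
    ("assign", var), ("if", then_stmts, else_stmts) and ("while", body)."""
    found = []
    for s in stmts:
        names = [s[1]] if s[0] == "assign" else [
            v for block in s[1:] for v in modified(block)]
        for v in names:
            if v not in found:
                found.append(v)
    return found


def type_invariants(body, type_of):
    """One invariant per modified variable, restating its declared type so
    that the fact survives Boogie havocking the variable at the loop head."""
    return [f"invariant {type_of[v]};" for v in modified(body)]


loop_body = [("assign", "i"), ("if", [("assign", "xs")], [("assign", "i")])]
types = {"i": "Int_is(i)", "xs": "IntArray_is(xs)", "n": "Int_is(n)"}
print(type_invariants(loop_body, types))
# ['invariant Int_is(i);', 'invariant IntArray_is(xs);']
```

Variables that the loop never assigns (such as `n` here) keep their values across the loop and need no such invariant.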

A related issue, which our translator does not currently address, is that of preserving immutable properties of variables. Consider again the example from Figure 5. In fact, this example does not verify as is with our translator! Again, key information about \(\mathtt{xs}\) is lost within and after the loop. In this case, the information that needs to be preserved is that the length of \(\mathtt{xs}\) is unchanged by the loop. In principle, our translator could be extended with a static analysis to infer this and add it implicitly as a loop invariant (but this remains future work). We note that this extends to records, as the following illustrates:

figure ng

Perhaps surprisingly, this also does not verify because the property is not preserved across the loop. This can be fixed by performing the assignment to after the loop. Or, we could add a loop invariant to ensure is preserved.

Lambdas. Boogie provides syntax (e.g. ) for lambdas (with map type in this case). They are comparable with Boogie functions and cannot, for example, call procedures, etc. As such, they are insufficient for representing lambdas in Whiley which can have side effects. Instead, we translate them into named Boogie procedures. Mostly this is straightforward, but a few challenges arise with captured variables. For example, consider the following:

figure nm

Translating the lambda into a standalone requires identifying captured variables ( in this case) and adding them as parameters. The following illustrates:

figure np

Here, the procedure contains the body of the lambda, which will include any necessary checks on the lambda itself. Likewise, the constant is generated to represent this particular lambda. When translating an indirect invocation, we automatically generate a suitable prototype to invoke. For example:

figure nr

The above Whiley snippet is then translated (roughly speaking) as follows:

figure ns

Here, the function is generated (in practice, with a suitable mangling) to represent the anonymous function being invoked. It accepts the lambda as a parameter, thus allowing one to exploit the fact that the same lambda returns the same value(s) when given the same parameter. Finally, we note that work remains to improve our translation of lambdas. In particular, information known about captured variables is not currently transferred to the generated . Thus, the following fails to verify:

figure nv

Here, the generated accepts the captured variables and , but does not include a corresponding precondition. Whilst this would be relatively easy to fix in this case, other cases are more challenging (e.g. when a parameter has been modified prior to being captured). One approach would be to apply the weakest precondition transformer [17, 57, 101] to the body of the lambda (which should be relatively straightforward, since this is just an expression).
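Identifying captured variables is a free-variable computation. The Python sketch below computes them for a hypothetical expression representation and lifts a lambda into a named procedure header whose parameters are the captured variables followed by the lambda's own. For simplicity, every variable is assumed to have Boogie type `int`, and the name `lambda0` is illustrative.

```python
def free_vars(e, bound=frozenset()):
    """Free variables of a hypothetical expression tuple: ("var", x),
    ("const", c), ("bin", op, a, b) or ("lambda", [params], body)."""
    kind = e[0]
    if kind == "var":
        return set() if e[1] in bound else {e[1]}
    if kind == "const":
        return set()
    if kind == "bin":
        return free_vars(e[2], bound) | free_vars(e[3], bound)
    if kind == "lambda":
        return free_vars(e[2], bound | set(e[1]))
    raise ValueError(kind)


def lift(name, lam):
    """Lift a lambda into a named procedure header whose parameters are the
    captured variables (sorted, for determinism) followed by its own
    parameters. Every variable is assumed to be an int for simplicity."""
    captured = sorted(free_vars(lam))
    params = ", ".join(f"{p}: int" for p in captured + list(lam[1]))
    return captured, f"procedure {name}({params}) returns (r: int);"


lam = ("lambda", ["x"], ("bin", "+", ("var", "x"), ("var", "y")))
print(lift("lambda0", lam))
# (['y'], 'procedure lambda0(y: int, x: int) returns (r: int);')
```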

3.5 Methods and Framing

Recall from Sect. 2.1.4 that methods in Whiley are permitted to have side effects and, for example, manipulate heap-allocated data through references. As such, Whiley methods correspond closely with procedures in Boogie. However, methods in Whiley can be called from expressions used in statements (though not from specification elements, such as pre-/postconditions or loop invariants). In many ways, the translation of methods follows that for functions, but with some important differences which we now consider.

Framing. Whilst the Whiley language provides relatively limited support for describing the effect a method has on the heap, a lot of machinery is nevertheless required to manage what can be expressed. As highlighted before, we adopt a relatively standard approach to modelling the heap. Specifically, a global variable of type is provided to model this. For example, consider the following Whiley method:

figure ob

Our translation produces both a procedure prototype and implementation in Boogie. The prototype for the above method looks roughly as follows:

figure oc
Fig. 7
figure 7

Illustrating the Boogie definition of a predicate for determining whether a reference is within the footprint of given variable . More specifically, it searches the contents of (whatever that might be) looking for , whilst traversing references as necessary

Observe that the clause is provided as we must conservatively assume methods may modify the heap. Note also that in Whiley is translated directly using Boogie’s syntax. There are two essential issues here: typing and framing. The former simply makes explicit guarantees on the shape of the heap provided by Whiley’s type system. For example, that for an integer reference there is indeed an integer value at , etc. The latter aspect of framing is perhaps more interesting. We divide framing into two separate conditions (both of which are marked since Whiley’s type system guarantees them). These conditions rely on a simple predicate for determining whether a reference is reachable from—or within—the frame of a given variable (see Figure 7).

The first frame condition enforces self-framing [92] by ensuring that only locations within the method’s frame can be modified:

figure on

There are three parts of the condition as follows:

  1. (Mutable) This identifies which locations could be modified by the method and, for these, does not provide a connection between the heap beforehand and that after.

  2. (Immutable) For locations which could not be modified by the method, an explicit connection is made between the heap beforehand and that after, ensuring that they are unchanged.

  3. (Allocated) As a special case, heap locations which did not exist prior to the method (i.e. were mapped to ) can have arbitrary values afterwards.

In essence, the footprint of a method (i.e. those locations it could write) is conservatively tied to its frame (i.e. those locations it could read). This provides a straightforward and extensible basis for reasoning about how methods modify the heap. For example, if syntax for describing the old heap in postconditions were added to Whiley, it would layer easily on top. The key is that, in the absence of more expressive syntax for restricting the locations a may modify, we must adopt a worst-case assumption that any reachable location could be modified.

The second frame condition (known as the swinging pivots restriction [92]) prevents unreachable locations from “migrating” into the frame:

figure oq

In essence, this ensures that any reference reachable from parameters or after the method was either freshly allocated, or was reachable from them beforehand. Note that, whilst these conditions are trivial for this particular method, they are required in general (e.g. for handling linked structures).
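The two frame conditions can be checked on a concrete pair of heaps, as in the Python sketch below. It assumes a heap modelled as a dictionary from addresses to values, with lists standing in for arrays and records, and a hypothetical `reachable` function playing the role of the footprint predicate of Figure 7. The example respects self-framing but violates the swinging pivots restriction, because a location outside the frame migrates into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    addr: int  # a reference is an index into the heap


def reachable(heap, value, seen=None):
    """Addresses reachable from value, following references through the
    heap and descending into lists (standing in for arrays and records)."""
    seen = set() if seen is None else seen
    if isinstance(value, Ref):
        if value.addr not in seen:
            seen.add(value.addr)
            if value.addr in heap:
                reachable(heap, heap[value.addr], seen)
    elif isinstance(value, list):
        for v in value:
            reachable(heap, v, seen)
    return seen


def frame(heap, params):
    return set().union(*(reachable(heap, p) for p in params))


def self_framing(old, new, params):
    """Immutable part: previously allocated locations outside the old frame
    keep their values. Fresh locations (absent from old) are unconstrained."""
    f = frame(old, params)
    return all(new.get(a) == v for a, v in old.items() if a not in f)


def no_swinging_pivots(old, new, params):
    """Every location reachable from the parameters afterwards was either
    reachable beforehand or freshly allocated."""
    before = frame(old, params)
    return all(a in before or a not in old for a in frame(new, params))


p = Ref(1)
old = {1: Ref(3), 2: 5, 3: 0}  # p's cell refers to location 3
new = {1: Ref(2), 2: 5, 3: 0}  # ... and afterwards to location 2
print(self_framing(old, new, [p]))        # True: location 2 is unchanged
print(no_swinging_pivots(old, new, [p]))  # False: 2 migrated into the frame
```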

As a further example to illustrate the challenges addressed by the frame conditions, consider the following:

figure ot

Establishing that is not modified by the calls to above requires both frame conditions (something which is not immediately obvious at first glance). It is clear that the first frame condition (self-framing) allows us to establish that is not modified by the first call. One might then conclude that the first condition is sufficient to establish this across both calls; however, that is not the case! The challenge is that ensures is not modified, but allows to be modified. Without the second frame condition, the verifier might then consider that was within after the first call (e.g. that ). In such a case, it would rightly conclude that could be modified by the second call. As such, we see how the second frame condition helps to ensure that disjoint frames remain disjoint.

Finally, we note that our encoding makes heavy use of a recursive predicate (see Figure 7) which (as we have observed) can lead to the butterfly effect [111]. That is, where the verifier loops indefinitely unrolling predicates fruitlessly. In our experience, this typically happens when the condition being checked is invalid, and hence, the verifier cannot quickly find a proof by contradiction.

Allocation. Since data can be allocated on the heap in Whiley methods using the operator, a translation of this operator is required. To this end, we employ the following:

figure pf

This simply returns an arbitrary location which was not previously allocated, and ensures it now holds the requested value. Recall that, at the time of writing, Whiley does not support explicit memory deallocation, and hence, no counterpart for this is required. Finally, we note that, since allocations result in calls to , they must be extracted from expressions as for method invocations above.
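A minimal Python model of allocation follows, assuming a dictionary-based heap and a hypothetical `VOID` sentinel for unallocated locations. Boogie leaves the choice of location arbitrary; this sketch deterministically takes the smallest free address, which is one valid choice among many.

```python
import itertools

VOID = object()  # hypothetical marker for "not allocated"


def allocate(heap, value):
    """Model of allocation: choose a location that was not previously
    allocated, store value there, and return the updated heap together
    with the new address. Boogie leaves the choice of location arbitrary;
    this sketch simply takes the smallest free address."""
    addr = next(a for a in itertools.count() if heap.get(a, VOID) is VOID)
    updated = dict(heap)
    updated[addr] = value
    return updated, addr


heap, r = allocate({0: 1, 1: 2}, 42)
print(r, heap[r])  # 2 42
```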

4 Experimental Results

In this section, we compare our Wy2B translator against the Whiley native verifier using the existing compiler test suite which consists of \(1100+\) (mostly) small Whiley programs. In particular, we are concerned with the number of tests that Wy2B can pass correctly, and note that the existing Whiley native verifier does not pass all the tests (e.g. because of outstanding bugs, etc.). In addition, we discuss our experiences using the new Wy2B toolchain on several larger case studies.

4.1 Micro-test Statistics

The Whiley compiler system includes a comprehensive suite of “micro”-test programs, which are small Whiley programs intended to methodically test all Whiley language features, including the Whiley native verifier. At the time of this evaluation (May 2021), this test suite included 731 “valid” micro-test case programs that should be verifiable, as well as 461 “invalid” micro-test case programs that should generate compiler errors or verification failures (to ensure that the compiler correctly catches them). Our first step in evaluating the correctness and usefulness of our new verifier is to apply it to this test suite. We use Boogie v2.8.26 and Z3 v4.8.10 for these evaluations.

When we applied our new Wy2B verifier to the invalid programs, ignoring 7 programs that are marked as IGNORE due to current limitations of the compiler front end, we found that all 454 of the remaining programs failed as expected. This confirms that the Boogie back end is correctly detecting verification issues in programs that should not be verifiable. For completeness, we illustrate one such example:

figure ph

The above “invalid” program is used to test that the verifier correctly reports a potential out-of-bounds access on line 2. Both the native verifier and our Wy2B verifier pass this test.

The valid micro-test programs are small Whiley programs (ranging from 3 to 250 lines of code with an average length of 18 lines) that each contain several (2.2 on average) function and method definitions, some with specifications and some without. Around one third of the programs have functions or methods with requires/ensures specifications, one third use arrays (which generate array bound proof obligations), and 21% have loops with invariants. On average, our Wy2B translator generates 6.0 explicit proof obligations per micro-test program (to check array bounds, function call preconditions, etc.). This is on top of any explicit statements in the Whiley program and also in addition to the main proof obligations of Boogie, which are that each function or method body correctly implements its specification, and that every loop invariant is correctly preserved. Again, for completeness we illustrate one such example:

figure pj

The above “valid” program is expected to pass verification without raising any errors. This means that, amongst other things, the verifier must prove that the body of satisfies its specification, and within must establish the precondition for the call and that the final holds. Again, both the native verifier and our Wy2B verifier pass this test.

Fig. 8
figure 8

Stacked bar chart of the Whiley native verifier and Boogie-based verifier results on the “valid” test programs. Green (left and middle bars) indicates percentage of programs fully verified, and red (right) indicates percentage where one or more proofs failed or timed out

Figure 8 compares the percentages of these “valid” micro-tests that the native Whiley verifier and the Wy2B Boogie-based verifier can verify respectively. The leftmost bar on each row corresponds to the programs that both verifiers can verify (604 programs, or 82.6%). The middle bars show that the Whiley native verifier can verify an extra 7 programs (1.0%), whereas the Boogie verifier can verify an additional 102 programs (14.0%). So in total, the Whiley native verifier can verify 83.6% of the programs, whilst the Boogie verifier can verify a total of 96.6%.
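These percentages follow directly from the raw counts over the 731 valid tests, as the short Python check below confirms. By subtraction, it also gives the number of programs that neither verifier can verify.

```python
TOTAL = 731                          # "valid" micro-test programs
BOTH, NATIVE_ONLY, BOOGIE_ONLY = 604, 7, 102


def pct(n):
    """Percentage of all valid tests, rounded to one decimal place."""
    return round(100 * n / TOTAL, 1)


native_total = BOTH + NATIVE_ONLY    # 611
boogie_total = BOTH + BOOGIE_ONLY    # 706
neither = TOTAL - BOTH - NATIVE_ONLY - BOOGIE_ONLY

print(pct(BOTH), pct(NATIVE_ONLY), pct(BOOGIE_ONLY))  # 82.6 1.0 14.0
print(pct(native_total), pct(boogie_total))           # 83.6 96.6
print(neither)                                        # 18
```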

We investigated the 7 programs that the Whiley native verifier could verify but Boogie could not, and found that 4 of them are verifiable by a later version of Boogie (v2.9.6.0) and Z3 (v4.8.12). The remaining three are due to outstanding issues with the translation to Boogie related to lambda functions that return union types (Issue #59 in the Whiley2Boogie repository) and to proving the type invariants of cyclic data structures (Issue #61).

The larger number of programs that are verifiable by Boogie but not by the Whiley native verifier is largely due to several Whiley language features that the native verifier does not support, such as:

  • heap updates;

  • reasoning about the results of calls to lambda functions;

  • some kinds of generic types.

The Wy2B+Boogie toolchain takes 15:30 minutes (930 seconds) to translate and verify just the 706 test programs that it can verify, on a Dell Precision 5520 laptop with an Intel i7-7820HQ CPU @ 2.90 GHz and 32 GB RAM, and a 60-second timeout for Boogie. This is 1.3 seconds on average for each small valid test program, which is acceptable performance for real-world usage. When run on all 731 programs with a timeout of 60 seconds, the whole test run takes around 17:55 minutes, because some of the more difficult programs hit the 60-second timeout and fail. This is around 1.5 seconds on average for each test, with a maximum of 60 seconds for those that time out, which is still reasonable.

Another interesting performance issue is that we run Boogie with the -useArrayTheory flag by default—this uses the built-in SMT theory of arrays within Z3, which handles large arrays better, usually gives better performance, and enables more programs to be verified (without this flag, Boogie can verify only 665/731 = 91% of the valid test suite). However, there are a few programs (e.g. While_Valid_71.whiley) where performance becomes dramatically worse with this flag—it takes 4.5 minutes to report 5 unverifiable proof obligations with the flag, but less than one second to finish and report 7 unverifiable proof obligations without the flag.

The Whiley native verifier takes only four minutes to process the 600+ test programs that it can verify (around 2.5 programs/sec), which is significantly faster than the Boogie verifier, but takes 18:32 minutes to process all the 731 valid tests (around 1.5 secs/test on average). However, it is difficult to compare the actual proof times, because the Whiley verifier runs within a single Java JVM process, whereas the Wy2B+Boogie toolchain creates several separate processes and intermediate files for each test program.

Fig. 9: Illustrating the web-based implementation of Conway’s Game of Life developed in Whiley

4.2 Case Study: Conway’s Game of Life

The first case study we discuss is an interactive web page for playing the Game of Life by Conway [70]. This consists of a small index.html file to load the game, plus three Whiley modules:

  • model.whiley (141 lines): defines the 2D board and the logic of the game;

  • view.whiley (26 lines): defines how to draw the board onto an HTML canvas;

  • main.whiley (87 lines): defines mouse event handlers and other controller methods.

The Whiley compiler compiles these three modules and generates JavaScript as output, which can then run in a standard web browser (see Figure 9). We focussed on verifying just the model component, since the other two modules are the view and controller components, whose correct functioning is generally evident from the visual updates to the canvas. We aimed to specify and verify as much of the functional behaviour of the model as possible, to explore the limits of the Boogie verification path. Figure 10 shows the main data structure that represents the board, plus a Whiley function that counts the number of neighbouring cells that are alive. Appendix A gives the full listing of model.whiley plus links to the corresponding output Boogie code.

Fig. 10: Snippets from the Game of Life case study: the State data structure with its invariants, and the count_living function that counts how many neighbouring cells are alive. As explained in the text, the input of count_living is a cell location (x, y) given as a single array index. Note that is defined in the standard library and has the same definition as (recall Figure 1)
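To make the shape of Figure 10 concrete, here is a minimal Python sketch of a board record with runtime-checked invariants and a neighbour count that takes a single index. It is not the Whiley code: the field names, the row-major layout `y * width + x`, the invariant `len(cells) == width * height`, and the treatment of off-board neighbours as dead (no wrap-around) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """Hypothetical Python analogue of the Whiley State record (names illustrative)."""
    width: int
    height: int
    cells: List[bool]  # assumed row-major: cell (x, y) is stored at y * width + x

    def __post_init__(self) -> None:
        # Assumed type invariants, checked at runtime here rather than proved statically.
        assert self.width > 0 and self.height > 0
        assert len(self.cells) == self.width * self.height


def count_living(s: State, index: int) -> int:
    """Count the live neighbours of the cell at `index`; off-board cells count as dead."""
    assert 0 <= index < len(s.cells)          # precondition: a valid cell index
    x, y = index % s.width, index // s.width  # re-derive the coordinates from the index
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if (dx, dy) != (0, 0) and 0 <= nx < s.width and 0 <= ny < s.height:
                if s.cells[ny * s.width + nx]:
                    count += 1
    assert 0 <= count <= 8                    # postcondition
    return count
```

For example, on a 3×3 board whose middle row is alive, the centre cell (index 4) has two live neighbours, while the top-middle cell (index 1) has three.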

In addition to adding specifications to model.whiley, we made some small changes to the code to make specification or verification easier:

  • The original board function took the board dimensions as arbitrary pixel sizes, but we changed these to be cell counts rather than pixels (since the size in pixels is just a GUI display issue) and required them to be greater than zero to avoid empty boards, which are not interesting in practice;

  • We moved the cell-update code out of a doubly nested loop into a separate function, for better modularity and easier specification;

  • Whiley currently supports only one-dimensional arrays, so the code implemented the 2D board as a one-dimensional array, where each (x, y) location was translated into a single array index. We respected this data representation choice, but initially Boogie struggled to verify in-range assertions about these indexes, due to the nonlinear multiplication (the board size is initialised at the start of each game, so it is not a static constant). Frequently, Boogie would go into an infinite loop trying to prove these assertions (or terminate with a timeout error if we set a time limit). Eventually we found that upgrading Z3 from version 4.8.9 to 4.8.10 solved most of these problems, and Boogie was then able to prove most of the required assertions, or give a quick failure result for those it could not prove. Even then, we found that it was sometimes necessary to try several different ways of specifying indexes and bounds before finding one that Boogie could verify. For example, it was much easier to verify the function when it took a single index parameter rather than separate x and y parameters; this is why, in our final version, the function re-derives the x and y coordinates from the index. This meant that only one variable needed to be quantified in the loop invariant, instead of both the x and y coordinates. Typically, we found that cases where Boogie did not terminate were due to array accesses that it could not prove were within bounds, and that adding redundant constraints to the specification to make it clear that they were in bounds would fix that problem. This process was rather frustrating, but reflects a limitation of SMT solvers (nonlinear integer arithmetic is undecidable) rather than of the Whiley-to-Boogie translation.
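The in-range property that proved troublesome can be derived by hand. Assuming a row-major layout \(i = y \cdot w + x\) for a board of width \(w\) and height \(h\), together with an invariant \(|\mathit{cells}| = w \cdot h\) (both assumptions, since the exact expressions are not reproduced here), the only nonlinear step is the multiplication by \(w\). Stating an intermediate bound such as the first line explicitly is one plausible form of the redundant constraints mentioned above.

\[
\begin{aligned}
0 \le y \le h-1,\; w > 0 &\implies 0 \le y\,w \le (h-1)\,w \\
&\implies 0 \le y\,w + x \le (h-1)\,w + (w-1) = h\,w - 1 \qquad (\text{using } 0 \le x \le w-1)\\
&\implies 0 \le i < h\,w = |\mathit{cells}|
\end{aligned}
\]

Conversely, for \(0 \le i < h\,w\), taking \(x = i \bmod w\) and \(y = i \operatorname{div} w\) gives \(0 \le x < w\), \(0 \le y < h\) and \(y\,w + x = i\), which is how a function taking a single index can recover the coordinates.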

After these changes, Boogie (v2.8.26) can easily verify all the functions in this program in 2.2 seconds, plus 2.8 seconds for the translation from Whiley to Boogie.

4.3 VerifyThis 2019 Competition Challenges

In this section, we briefly discuss our experience of translating and verifying several of the Dafny and JML solutions to the ‘VerifyThis 2019’ verification challenge [59]. These challenges involve sophisticated algorithms with full specifications of their functional behaviour, and so are reasonably demanding verification tasks.

Dafny and Boogie were designed to work together, whereas Whiley was independently designed, and originally used several generations of custom-built “native” provers to discharge proof obligations. It is only recently that we have developed the Boogie back end as an alternative verifier. So a useful way to evaluate the usability of the Whiley+Boogie verifier is to take verification solutions that are written in Dafny, translate them into Whiley and see how well the verification works in comparison with Dafny+Boogie. This can help us to understand how various language features of Whiley help or hinder the verification process and how well Whiley translates into the underlying Boogie verifier, which is a common back end for both languages.

We translated and verified the following challenge solutions using Boogie v2.9.6.0 and Z3 v4.8.12; the resulting Whiley solutions can be seen on GitHub.

Challenge 1A: Monotonic Segments

This challenge takes an array and cuts it into monotonic segments, which are either increasing or decreasing. The Dafny solution used the built-in extensible sequences to specify some of the operations, but Whiley has only fixed-size arrays. So, to replicate Dafny’s sequence append operator, we defined in Whiley an append function that adds an element to the end of an array. To replicate the functionality of Dafny’s sequence slicing, we added start and end parameters where necessary to each of the properties used, as these were the only uses of sequence slicing in the Dafny version. Interestingly, the lemmas found in the Dafny version were not necessary, as they were needed to prove properties of Dafny’s sequence manipulations that were not relevant to Whiley. For the same reason, some assertions in the Dafny version were also not needed in the Whiley code. The Dafny solution was 72 lines of non-comment specification and source code (excluding curly braces), while the Whiley solution was slightly shorter at 56 lines (including 11 lines for the append function) and took roughly 30 seconds to verify without the -useArrayTheory flag. With that flag enabled, however, Boogie did not verify the program within 20 minutes.
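The cut-point computation itself is short. Below is a Python transcription of the algorithm as we understand it from the VerifyThis 2019 problem statement; it is not the Dafny or Whiley solution. The helper `append` is a hypothetical name: it returns a fresh array one element longer, mimicking how an append has to be written when only fixed-size arrays are available.

```python
from typing import List


def append(a: List[int], v: int) -> List[int]:
    """Return a fresh array one element longer than `a` (fixed-size-array style)."""
    result = [0] * (len(a) + 1)
    for i in range(len(a)):
        result[i] = a[i]
    result[len(a)] = v
    return result


def cutpoints(a: List[int]) -> List[int]:
    """Indices where the monotonic segments of `a` start, ending with len(a)."""
    n = len(a)
    cut = [0]
    x, y = 0, 1
    while y < n:
        increasing = a[x] < a[y]
        # extend the current segment while its direction stays the same
        while y < n and (a[y - 1] < a[y]) == increasing:
            y += 1
        cut = append(cut, y)
        x = y
        y = x + 1
    if x < n:  # a trailing single-element segment
        cut = append(cut, n)
    return cut
```

For example, `[1, 2, 3, 2, 1, 4]` is cut at `[0, 3, 5, 6]`, giving the segments `[1, 2, 3]`, `[2, 1]` and `[4]`.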

Challenge 1B: GHC Sort

This challenge was to verify a sorting algorithm used by the GHC Haskell compiler, which takes the monotonic segments from the previous challenge, reverses the decreasing ones, and then pairwise merges the segments into a sorted result. For this challenge, we added the same function as above. As 01_ghc_sort builds upon 01_findcuts, the same start-and-end modifications to the properties were made, but a separate function was also added, as Dafny’s sequence slicing was used more extensively than in the cutpoints solution. The lemmas from the Dafny code were not needed in the Whiley code, and neither were any of the assertions. New assertions had to be added to the Whiley code in the and functions to demonstrate properties of the implemented slice function. The function was simplified, as the Dafny solution uses an extra while loop to copy its output sequence into an array, which is not necessary in Whiley as arrays are used throughout. The function was re-implemented slightly to avoid an on every iteration. The Dafny language includes multisets (bags), and the Dafny solution used these to prove that one sequence is a permutation of the other. However, the authors comment that the “specification (and hence proofs) that the output is a permutation of the input is incomplete”. Whiley does not have built-in support for multisets, and it is difficult to recreate them using uninterpreted functions. As such, we also did not establish the permutation property in the Whiley version. Overall, the Dafny solution was 137 lines of non-comment specification and source code (excluding curly braces), and the Whiley solution was slightly longer at 152 lines. The Wy2B+Boogie verifier takes roughly 20 seconds to verify this program, and again failed to verify it within 20 minutes with the Boogie -useArrayTheory flag.
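A compact Python sketch of the whole sort follows, assuming the same cut-point computation as in Challenge 1A (repeated so the block is self-contained) and a standard two-way merge. Python slicing stands in for the separate slice function, and no specification is checked, so this only illustrates the algorithm, not the verified Whiley version.

```python
from typing import List


def cutpoints(a: List[int]) -> List[int]:
    # Same cut-point computation as in Challenge 1A.
    n, cut, x, y = len(a), [0], 0, 1
    while y < n:
        increasing = a[x] < a[y]
        while y < n and (a[y - 1] < a[y]) == increasing:
            y += 1
        cut.append(y)
        x, y = y, y + 1
    if x < n:
        cut.append(n)
    return cut


def merge(p: List[int], q: List[int]) -> List[int]:
    """Standard two-way merge of two non-decreasing lists."""
    out: List[int] = []
    i = j = 0
    while i < len(p) and j < len(q):
        if p[i] <= q[j]:
            out.append(p[i])
            i += 1
        else:
            out.append(q[j])
            j += 1
    return out + p[i:] + q[j:]


def ghc_sort(a: List[int]) -> List[int]:
    cuts = cutpoints(a)
    segments = []
    for k in range(len(cuts) - 1):
        seg = a[cuts[k]:cuts[k + 1]]
        # a segment of length >= 2 that does not start with a rise is
        # non-increasing, so reversing it makes it non-decreasing
        if len(seg) >= 2 and not seg[0] < seg[1]:
            seg = seg[::-1]
        segments.append(seg)
    # merge adjacent segments pairwise until at most one remains
    while len(segments) > 1:
        merged = [merge(segments[k], segments[k + 1])
                  for k in range(0, len(segments) - 1, 2)]
        if len(segments) % 2 == 1:
            merged.append(segments[-1])
        segments = merged
    return segments[0] if segments else []
```

For `[3, 1, 2, 5, 4, 4, 0]` the segments are `[3, 1]`, `[2, 5]` and `[4, 4, 0]`; after reversal and two merge rounds the result is `[0, 1, 2, 3, 4, 4, 5]`.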

Challenge 2A: Cartesian Trees

This challenge was to verify a stack-based algorithm for finding the nearest smaller value for each item in an array. There was no Dafny solution, so we started from the OpenJML solution, which has a single function with a doubly nested loop.

For this challenge, it was only necessary to add two loop invariants to the outer loop, stating that the sizes of the arrays are unchanged. Whiley is not able to determine this automatically when the loop body only updates valid indexes in the arrays, whereas OpenJML can infer this invariant.
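A Python sketch of the stack-based nearest-smaller-value computation, as we understand the challenge: for each position, the index of the closest preceding element that is strictly smaller, with -1 meaning there is none (an assumed convention). The stack is a fixed-size array plus a stack pointer, as an array-only language would require, and the runtime `assert` restates the kind of size invariants that had to be supplied by hand.

```python
from typing import List


def nearest_smaller_left(a: List[int]) -> List[int]:
    """For each x, the index of the nearest earlier element smaller than a[x], or -1."""
    n = len(a)
    left = [-1] * n
    stack = [0] * n  # fixed-size array used as a stack of indices
    sp = 0           # number of entries currently on the stack
    for x in range(n):
        # outer-loop invariants: the array sizes never change
        assert len(left) == n and len(stack) == n
        # pop every index whose value is not smaller than a[x]
        while sp > 0 and a[stack[sp - 1]] >= a[x]:
            sp -= 1
        left[x] = stack[sp - 1] if sp > 0 else -1
        stack[sp] = x  # sp <= x < n, so this write stays in bounds
        sp += 1
    return left
```

The inner `while` loop is the second level of the doubly nested loop; each index is pushed and popped at most once, so the whole computation is linear in the array length.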

The OpenJML solution was 38 lines of code and specifications, and the Whiley solution is 35 lines. The Wy2B+Boogie verifier takes 1.8 seconds to verify this program, or 5 seconds if we add the Boogie -useArrayTheory flag.

4.4 Discussion

From our case studies and our micro-test results, we have observed that using Boogie to verify Whiley programs has significantly increased the verification abilities of Whiley. This is partly due to Boogie making it easier to provide proof support for a wider range of Whiley language features and partly due to the maturity and power of the underlying proof tools—the decades of careful proof engineering that have gone into Z3.

However, the use of Boogie and Z3 is not yet perfect. The Boogie -useArrayTheory flag is necessary in some case studies to handle large arrays, but in other case studies it can lead to vastly increased proof times or even non-termination. Also, we have observed that Boogie can often make effective use of recursive predicates to prove a valid proof obligation, but can go into an infinite unfolding loop if that proof obligation is difficult or unprovable.

On the Whiley side, we found that when reasoning about arrays it is helpful to define various supporting properties, such as taking slices of an array, appending two arrays, and counting the occurrences of a given element. It would be useful to develop a Whiley library of these supporting properties and this would be easier if Whiley properties could return arbitrary values, rather than being limited to Boolean results. This would allow them to be used as specification-only functions, which would make it easier for Boogie to reason about the domain-specific concepts that are captured by those functions.

5 Related Work

We now consider various tools with similar aims to Whiley, including several which also compile to Boogie.

5.1 Extended Static Checkers

The Extended Static Checker for Java (ESC/Java) [68] and its later successor (ESC/Java2) is perhaps one of the most influential tools in the area of verifying compilers [38, 48]. The tool essentially provides a verifying compiler for Java programs whose specifications are given as annotations in a subset of the Java Modelling Language (JML) [38, 39, 99]. JML provides a standard notation for expressing contracts in Java, and the following illustrates a simple method in JML which ESC/Java verifies as correct:

(Figure: a simple JML-annotated Java method that ESC/Java verifies as correct)

Here, we can see pre- and postconditions are given for the method, along with an appropriate loop invariant. Since refers to i on entry to the loop, we have in this case. Despite some unsoundness (e.g. ignoring arithmetic overflow and unrolling loops a fixed number of times), the tool has been demonstrated in real-world settings. For example, Cataño and Huisman [37] used it to check specifications given for an independently developed implementation of an electronic purse. In addition to ESC/Java, a Runtime Assertion Checker (RAC) was developed for JML [34, 39, 99] as well as various utilities for specification-based testing [32, 42, 170, 171]. Likewise, Krakatoa [67] provided an alternative to ESC/Java for statically verifying Java programs based on the original Why platform. Finally, whilst the development of JML and its associated tooling stagnated somewhat over the last decade, we note more recent efforts through the OpenJML initiative [29, 46, 47, 149].

The approach taken to generating verification conditions in an earlier tool, ESC/Modula-3, was also adopted in ESC/Java [56]. In fact, ESC/Modula-3 was one of the earliest tools to use an intermediate verification language (based on Dijkstra’s language of guarded commands [57]) and, in many ways, is Boogie’s predecessor. Such a language typically includes assignment, and statements and non-deterministic choice. Notably, the guarded command language used in ESC/Modula-3 lacked type information and used an encoding of types similar to ours, although Modula-3 has a simpler type system than Whiley. For example, a predicate \(\mathtt{isT}\) was defined for each type to determine whether a given variable was in the type \(\mathtt{T}\). A similar approach was also taken in Leino’s Ecstatic tool, where the subtyping relation was encoded using a \(\mathtt{subtype()}\) predicate [100]. Again, every type was given a membership predicate, with specific axioms stating their non-intersection; these formed part of what Leino calls the background predicate, which was included with each generated verification condition. A key difference from ESC/Modula-3 is that ESC/Java employed a multi-stage process allowing “high-level” guarded command programs to be desugared into a lower-level form. Further refinements included the “passive form”, which reduced the size of generated verification conditions and supported unstructured control flow [17].
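To illustrate how verification conditions arise from such a language, here is a minimal semantic sketch of Dijkstra-style weakest preconditions for assignment, assert, assume, sequencing and non-deterministic choice. Predicates are represented as Python functions on states, and the tuple encoding of commands is our own; this is the textbook calculus, not the symbolic, formula-based implementation used in ESC/Modula-3 or Boogie.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, int]
Pred = Callable[[State], bool]
# ("assign", var, expr) | ("assert", p) | ("assume", p) | ("seq", c1, c2) | ("choice", c1, c2)
Cmd = Tuple


def wp(cmd: Cmd, post: Pred) -> Pred:
    """Weakest precondition of a guarded command with respect to `post`."""
    kind = cmd[0]
    if kind == "assign":  # wp(v := e, Q) = Q[e/v]
        _, var, expr = cmd
        return lambda s: post({**s, var: expr(s)})
    if kind == "assert":  # wp(assert P, Q) = P and Q
        _, p = cmd
        return lambda s: p(s) and post(s)
    if kind == "assume":  # wp(assume P, Q) = P implies Q
        _, p = cmd
        return lambda s: (not p(s)) or post(s)
    if kind == "seq":     # wp(S1; S2, Q) = wp(S1, wp(S2, Q))
        _, c1, c2 = cmd
        return wp(c1, wp(c2, post))
    if kind == "choice":  # wp(S1 [] S2, Q) = wp(S1, Q) and wp(S2, Q)
        _, c1, c2 = cmd
        w1, w2 = wp(c1, post), wp(c2, post)
        return lambda s: w1(s) and w2(s)
    raise ValueError(f"unknown command {kind!r}")


# y := |x|, written as a non-deterministic choice between two guarded branches
abs_cmd = ("choice",
           ("seq", ("assume", lambda s: s["x"] >= 0), ("assign", "y", lambda s: s["x"])),
           ("seq", ("assume", lambda s: s["x"] < 0), ("assign", "y", lambda s: -s["x"])))
pre = wp(abs_cmd, lambda s: s["y"] >= 0)
```

Here `pre` holds in every state, which is exactly the verification condition that a prover would be asked to discharge for the postcondition `y >= 0`.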

5.2 Spec#

The Spec# system followed ESC/Java and benefited from many of the insights gained in that project. Spec# added proper support for handling loop invariants [16], for handling safe object initialisation [64] and for allowing temporary violations of object invariants through the expose keyword [109]. The latter is necessary to address the so-called packing problem, which was essentially ignored by ESC/Java [15]. Two further improvements meant Spec# was capable of verifying a wider range of programs than ESC/Java: firstly, Spec# incorporated the new Z3 automated theorem prover (as opposed to Simplify) [53]; secondly, Spec# refined the language of guarded commands used in ESC/Java to form Boogie. Boogie was described as an “effective intermediate language for verification condition generation of object-oriented programs because it lacks the complexities of a full-featured object-oriented programming language” [14]. In essence, Boogie was a version of the guarded command language from ESC/Java which also supported a textual syntax, type checking, and static analysis for inferring loop invariants. Other important innovations included the ability to specify triggers to help guide quantifier instantiation, and the use of trace semantics to formalise the meaning of Boogie [107].

Leino and Schulte [113] provide an excellent account of how Spec# programs are encoded in Boogie, and there is much similarity with that presented here. For example, the heap is modelled using a global variable, and a special field tracks whether a location is allocated. Like Whiley, Spec# permits method calls within expressions, and hence a similar mechanism for safely extracting them is employed. Furthermore, key challenges arise in preserving class invariants across inheritance and ownership relationships. The approach adopted was based on packing/unpacking [15], which identifies code regions where class invariants are not required to hold.

5.3 Dafny

Dafny [104, 105] is perhaps the most comparable related work to Whiley, and was developed independently at roughly the same time. That said, the goals of the Dafny project are somewhat different. In particular, the primary goal of Dafny is to provide a proof assistant for verifying algorithms rather than, for example, generating efficient executable code (though it does compile to C#). In contrast, Whiley aims to generate code which is, for example, suitable for embedded systems [134, 156]. Dafny is an imperative language with simple support for objects and classes without inheritance and, more recently, traits [1]. Like Whiley, Dafny employs unbounded arithmetic and distinguishes between pure and impure functions. Dafny provides algebraic data types (which are similar to Whiley’s recursive data types) and supports immutable collection types with value semantics that are primarily used for ghost fields to enable specification of pointer-based programs. Dynamic memory allocation is possible in Dafny, but no explicit deallocation mechanism is given and presumably any implementation would require a garbage collector.

Leino [102] provides a detailed description of how Dafny programs are translated into Boogie, much of which has already been touched upon earlier in this paper. Dafny also supports generic types and, unlike Whiley, dynamic frames [91]. As discussed in §2.1.6, the latter provides a suitable mechanism for reasoning about pointer-based programs. For example, Dafny has been used successfully to verify the Schorr–Waite algorithm for marking reachable nodes in a graph [104]. Finally, Dafny has been used to successfully verify benchmarks from the VSTTE’08 [108], VSCOMP’10 [93] and VerifyThis’12 [83] challenges, among others.

Leino and Pit-Claudel [111] characterise the “Butterfly Effect”, where minor changes to the program source cause significant instabilities in verification time. The authors argue that one reason for this is so-called matching loops, where the SMT solver repeatedly instantiates quantifiers or recursive predicates without making actual progress towards either a proof or a contradiction. Their approach is prototyped in Dafny and moves responsibility for trigger selection out of the SMT solver. This enables trigger selection to occur before quantifiers are rewritten into lower-level forms (i.e. as necessary for the SMT solver) where important triggers are obscured. Furthermore, whilst the authors do not expect Dafny users to write triggers themselves, they are expected to understand them in order to diagnose verification performance problems.

5.4 Why3

In addition to Boogie, the other main intermediate verification language in use is WhyML [28, 65]. This is part of the Why3 verification platform, which is intended to enable a range of different theorem provers to be used in proving correctness, depending on the nature of the program being verified. For example, a short but extremely intricate C program for solving the N-Queens problem has been fully verified with the aid of Why3 [66]. This was achieved by abstracting the original program into WhyML, and the proof required the use of three distinct theorem provers to discharge 41 verification conditions. Of these, 35 were discharged automatically by Alt-Ergo [158] or CVC3 [19], whilst the remainder were discharged manually using Coq [24]. Indeed, the authors of Why3 state [28]:

The Why3 platform can be used by itself, as some kind of standalone “meta” theorem prover, but the main purpose of Why3 is to be used as an intermediate language.

WhyML is a first-order language with polymorphic types, pattern matching, inductive predicates, records and type invariants. It has also been used in the verification of C, Java and Ada programs (amongst others). Like Boogie, WhyML provides structured statements (e.g. while and if statements). In addition, a standard library is included which provides support for different theories (e.g. integer and real arithmetic, sets and maps).

Of note here is the Boogie to WhyML translation developed by Ameri and Furia [4] which, although largely successful, did expose some important mismatches between them. Their primary motivation was the wide support for alternative (even interactive) provers with Why3. The structured nature of WhyML presented some problems in handling Boogie’s unstructured branching, and aspects of Boogie’s polymorphic maps and bitvectors were problematic. They showed that Why3 could verify 83% of the translated programs with the same outcome as Boogie. However, they also identified three simple Boogie programs which Boogie either did not verify or incorrectly verified. Why3, on the other hand, handles these cases by virtue of its ability to use a wider range of provers. One of the cases, for example, failed to verify because of the way Z3 handles quantifier instantiation through triggers.

As another example, SPARK/Ada is a commercially developed verifying compiler building upon Why3, which has seen good industrial uptake [13, 87]. For example, it has been used for (amongst other things) space-control systems [33], aviation systems [40], automobile systems [81] and railway systems [58].

Finally, given our success here using Boogie to verify Whiley programs, we note it would be interesting future work to explore a WhyML back end for verifying Whiley programs as well.

5.5 Viper

Müller et al. [125] observed that existing intermediate verification languages (e.g. Boogie or WhyML) do not support separation logics and related permission-based logics. They identify that such logics have more of a “higher-order nature” than typical software verification problems, and make extensive use of recursive predicates (which Boogie/Z3 does not support well). They developed an alternative intermediate verification language (Viper) which offers more precise handling of recursive predicates and protects against “infinite unrolling” using a least fixed-point semantics. The tool also supports two back ends, one of which generates an encoding in Boogie. This builds on earlier work looking at the encoding of abstract predicates and abstraction functions in the context of permission-based logics [79]. Here, abstract predicates describe the (potentially infinite) set of access permissions a given object has, but this is problematic for an SMT solver, which cannot arbitrarily unroll them. To handle this, an encoding is employed which “versions” predicates to prevent arbitrary unrolling, along with various tactics to prevent unlimited matching loops.

An example of work utilising Viper is that of Ter-Gabrielyan et al., who argue that SMT solvers typically provide limited support for graph reachability problems, which is prohibitive for reasoning about mutable data structures that admit sharing in various forms [157]. By restricting themselves to problems involving acyclic structures of bounded outdegree, they obtained an encoding amenable to first-order theorem provers, which they demonstrated in the context of Viper. Finally, we note that Viper currently acts as the intermediate verification language for Chalice [110, 114], Prusti [8], Nagini [63], VerCors [5, 27] and more.

5.6 VeriFast

VeriFast is a modular program verifier for concurrent and sequential programs written in C and Java, which employs separation logic and fractional permissions to ensure memory safety [85, 86]. The tool comes from a line of work exploring the use of dynamic frames in the context of verification [152,153,154]. VeriFast is unusual in eschewing the use of quantifiers within specifications. Instead, inductive predicates are provided to model properties that would otherwise be expressed using quantified formulae. VeriFast supports algebraic data types to allow specifications to reason about locations contained in linked structures. Finally, VeriFast has been used to reason about memory safety in JavaCard programs and Linux device drivers [85], and also in the verification of FreeRTOS [124].

5.7 Frama-C

Frama-C [52] provides a set of sound software analyses for the industrial analysis of ISO C99 source code. The system uses the ACSL specification language as a platform on which different solver plugins can operate. For example, different plugins may use different approaches to checking that functions meet their specifications, such as abstract interpretation or deductive verification. The ACSL specification language is based loosely upon JML and supports a variant of separation logic through the command. An unusual feature of Frama-C (e.g. compared with Dafny or Whiley) is that multiple loop invariants may be specified at different positions within the loop [22].

Volkov et al. [165] developed an extension for lemma functions (similar to those in Dafny) which enables a more “interactive” style of verification, and applied this to various functions from the Linux Kernel. We note that, whilst Whiley lacks specific support for lemmas, a similar effect can be achieved using a with a return. Kosmatov and Signoles illustrate runtime assertion checking with Frama-C, which they argue provides a useful stepping stone prior to static verification [94]. We note similar findings in the context of an automated testing tool for Whiley [43].

Finally, Frama-C has been applied to a range of real-world problems. For example, it has been used in the context of Air Traffic Management systems to reason about floating point operations and establish bounds on rounding errors [73]. Similarly, Airbus has investigated the use of Frama-C within the context of the DO-178B standard for software in airborne systems and equipment [155]. Frama-C has also been used in the context of IoT devices employing AES encryption [26], for verifying components of Contiki, an open-source operating system for IoT [119], and the Xen hypervisor [143]. It has also been applied to verifying railway software [142] and combined with CBMC [97] for test case generation in the context of automotive controllers [127].

5.8 AutoProof

Eiffel [122] is an influential and widely used language that promotes the idea of “Design by Contract” as a lightweight alternative to formal specification [123].

Tschannen et al. characterise the AutoProof verifier for Eiffel as being auto-active, meaning it lies somewhere between fully automatic and manual (i.e. interactive) [163]. Here, an automated theorem prover is used (as for Dafny or Whiley) in conjunction with appropriate annotations (e.g. pre-/postconditions, loop invariants, etc.). AutoProof translates Eiffel programs into Boogie which, for example, allows strengthening of postconditions and weakening of preconditions in subclasses [162]. Of relevance here is the approach to framing. Whilst Eiffel has no specific syntax for framing, a was implemented as a pragma. A default frame condition is employed which assumes only references mentioned in the postcondition can be modified. This utilises a similar rule to that in §3.5 for state preservation across method calls.
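The effect of such a default frame condition can be sketched as a check over concrete heaps. Representing a heap as a map from (object, field) pairs to values, and the names `respects_frame` and `modifiable`, are illustrative assumptions rather than AutoProof’s actual Boogie encoding; under the default rule, `modifiable` would be the set of references mentioned in the postcondition.

```python
from typing import Dict, Hashable, Set, Tuple

Loc = Tuple[Hashable, str]  # (object reference, field name)
Heap = Dict[Loc, object]


def respects_frame(pre: Heap, post: Heap, modifiable: Set[Hashable]) -> bool:
    """True iff every pre-state location of a non-modifiable object is unchanged.

    Mirrors a frame condition of the form
        forall o, f :: o not in modifiable ==> post[o, f] == pre[o, f]
    restricted to the (finite) set of locations present in the pre-state.
    """
    return all(post.get(loc) == value
               for loc, value in pre.items()
               if loc[0] not in modifiable)
```

A verifier assumes this property after each call rather than checking it at runtime; the function simply makes explicit what such an assumption says.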

Finally, we note AutoProof has been used in various settings, such as for teaching a graduate course on software verification [69].

5.9 Other

Aside from various descriptions of Boogie’s syntax and semantics [14, 103, 112], several works focus on the usability of Boogie as an intermediate verification language. For example, Chen and Furia were concerned with the “brittleness” of verification tools [41]. Specifically, a verifier is brittle if small (inconsequential) changes can have major impacts on the outcome (e.g. the program no longer verifies). They investigated this in the context of Boogie by mutating various (verified) programs in ways that preserved correctness and found, perhaps surprisingly, several issues. For example, the ordering of declarations in a Boogie program affected the chance of success. Indeed, for one program consisting of five (independent) statements, Boogie managed to verify only half of the 120 possible orderings. In a similar vein, the Boogie verification debugger (BVD) addresses the disconnect between counterexamples generated at the Boogie level and the source language above [74]. The tool employs plugins to convert Boogie counterexamples into a form recognisable in the source language, with plugins provided for VCC and Dafny. Likewise, Boogaloo attempts to improve the process of debugging failed verification attempts by generating concrete inputs (e.g. arguments) that illustrate the failing trace [141]. A key challenge was providing a runtime semantics for Boogie; for input generation, a mixture of symbolic execution and constraint solving with Z3 was applied.

Segal and Chalin [150] attempted a systematic comparison of Boogie and Pilar. Here, Pilar is a component of the open-source Sireum framework and is similar in many ways to Boogie. They stated that it is “not trivial to define a common intermediate language that can still support the syntax and semantics of many source languages”. Their research method was to develop translations from Ruby into both Boogie and Pilar, and then compare. Various aspects of Ruby proved challenging for Boogie, including its dynamically typed nature and arrays. Their solution bears similarity to ours, as they defined an abstract Boogie type as the root of all Ruby values. Overall, they concluded that Boogie’s type system makes it “more flexible for languages with non-traditional type systems” whereas Pilar is more suitable for traditional Object-Oriented languages.

Arlt et al. [7] presented a translation from SOOT’s intermediate bytecode language (Jimple) to Boogie, with an aim of identifying unreachable code. As such, an important aspect of the translation was the preservation of feasible execution paths. Overall, they found many aspects of the translation straightforward. For example, Java’s operator was modelled using an uninterpreted function. Another interesting aspect of their translation was the use of multiple typed heaps (a Burstall–Bornat heap [30]) to model the Java heap. However, some aspects of impedance mismatch were present and they had difficulty with monitor bytecodes, exceptions, certain chains of if–else statements and finally blocks.
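The idea behind a Burstall–Bornat heap can be sketched with one separate map per field. This is a minimal functional illustration, not Arlt et al.’s Boogie encoding, and the names `FieldHeaps`, `read` and `write` are hypothetical. Because a write replaces only the map for its own field, it cannot change the value read from any other field, whereas with a single map indexed by (reference, field) pairs that fact must be derived from the disequality of field names.

```python
from typing import Dict, Hashable

# One map per field name: field -> (reference -> value).
FieldHeaps = Dict[str, Dict[Hashable, object]]


def read(heaps: FieldHeaps, ref: Hashable, field: str) -> object:
    return heaps[field][ref]


def write(heaps: FieldHeaps, ref: Hashable, field: str, value: object) -> FieldHeaps:
    """Return updated heaps; only the map for `field` is replaced."""
    updated = dict(heaps)  # maps for the other fields are shared unchanged
    updated[field] = {**heaps.get(field, {}), ref: value}
    return updated
```

The input heaps are never mutated, which matches the way a verifier treats each heap map as a value that is updated functionally.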

On a related note, Cook et al. [50] focus on the impedance mismatch, arguing that “existing theorem provers, such as Simplify, lack precise support for important programming language constructs such as pointers, structures and unions”. For example, they note that integer types are almost never unbounded in practice, though verification tools often assume they are. Likewise, they note that the lack of support for nonlinear arithmetic is often a problem (though useful advances have been made in the intervening years [23, 44, 51, 88, 96]). Their tool, Cogent, “implements an eager and accurate translation of ANSI-C expressions (including features such as bitvectors, structures, unions, pointers and pointer arithmetic) into propositional logic”. It is in essence a layer that sits above tools like Boogie and encodes ANSI-C data types using bitvectors; we note the obvious similarity with the more recent tool, Frama-C, discussed above.

Rust provides another interesting perspective, as there has also been growing interest in exploiting its safety guarantees for program verification. For example, RustHorn translates Rust programs into Constrained Horn Clauses (CHC), which can then be discharged by a specialised CHC solver [120]. Likewise, Astrauskas et al. leverage Rust’s type system to simplify the specification and verification of systems software [8]. Their tool, Prusti, extends Rust with a specification language embedded using annotations and statically checked using Viper [125]. The SMACK verifier, which translates LLVM IR to Boogie/Z3 [14, 53], was also extended to Rust [11]. The CRUST tool [161] enables unsafe code to be checked using the C Bounded Model Checker (CBMC) [97]. This employs a custom C code generator for , and correctly identified bugs arising during development of Rust’s standard library. The widely used symbolic execution tool, Klee [35], was also extended for Rust, allowing assertions to be checked statically [115, 116]. Finally, we note ongoing work to formalise subsets of the Rust language which could assist the development of verification tools [89, 90, 136, 166, 167].

Finally, Jahob supports multiple provers and is concerned with recursive data structures (e.g. trees) and their encoding in first-order logic [31, 147]. Bannwart and Müller presented a Hoare-style logic for a sequential bytecode language similar to JVM bytecode or MSIL [10]. As expected, the unstructured nature of bytecode languages presented a key challenge here. In a similar fashion, Barnett and Leino consider the problem of translating MSIL bytecode into a form suitable for Boogie, in particular by turning unstructured loops into quantified expressions [18].

6 Conclusion

Using Boogie as an intermediate verification language eases the development of a verifying compiler, particularly as it handles verification condition generation, and offers high-level structures such as while loops and procedures with specifications. However, as with any intermediate language, there is potential for an impedance mismatch when Boogie structures do not exactly match the source language. Fortunately, this impedance mismatch can be circumvented in a variety of ways, such as translating to lower-level Boogie statements (e.g. with unstructured control flow). Furthermore, Boogie provides a good level of flexibility to define the “background theory” of a source language, such as its type system, its object structure, and support for heaps. This background theory is at a similar level of abstraction in Boogie as it would be in SMT-LIB so, whilst Boogie offers no major advantages in this area, it also has no disadvantages.
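As an illustration of lowering to unstructured Boogie, the Python sketch below emits goto-based Boogie text for a structured while loop. The function `lower_while` and its label-naming scheme are hypothetical and not part of Wy2B. The sketch relies on Boogie's convention (as we understand it) that `assert` statements at the start of a loop-head block are treated as the loop's invariants, and it models the loop test with `assume` statements on the two successors of a nondeterministic `goto`.

```python
def lower_while(cond: str, invariants: list[str], body: list[str], tag: str) -> list[str]:
    """Emit Boogie lines for `while (cond) invariant ...; { body }` using gotos.

    Hypothetical helper for illustration; `tag` keeps labels unique per loop.
    """
    head, body_label, exit_label = f"{tag}_head", f"{tag}_body", f"{tag}_exit"
    lines = [f"  goto {head};", f"{head}:"]
    # Leading assertions in the loop-head block serve as the loop invariants.
    lines += [f"  assert {inv};" for inv in invariants]
    # Nondeterministic branch: each successor assumes one outcome of the test.
    lines.append(f"  goto {body_label}, {exit_label};")
    lines.append(f"{body_label}:")
    lines.append(f"  assume {cond};")
    lines += [f"  {stmt}" for stmt in body]
    # Back edge to the loop head.
    lines.append(f"  goto {head};")
    lines.append(f"{exit_label}:")
    lines.append(f"  assume !({cond});")
    return lines


if __name__ == "__main__":
    for line in lower_while("i < n", ["0 <= i && i <= n"], ["i := i + 1;"], "L0"):
        print(line)
```

Because every jump is explicit in this form, source constructs that do not map neatly onto Boogie's structured statements can be expressed by jumping to the appropriate label.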

Our work provides a comprehensive account of the encoding of an independently developed non-trivial source language (Whiley) into Boogie. In doing this, we faced many challenges in figuring out a good encoding and, unfortunately, encountered many dead ends along the way. As such, we hope this work can offer guidance to researchers when developing verifying compilers for other languages. Indeed, it would be beneficial to have a repository of knowledge about different ways of encoding various language constructs. Some alternatives (particularly for various heap encoding techniques and procedure framing axioms) are discussed in the published Boogie papers, but there is no central repository of techniques or publications comparing encoding techniques. A major benefit of Boogie is, of course, its easy access to Z3. We have shown that the Wy2B/Boogie/Z3 stack offers significant advantages over the native Whiley verifier in terms of the percentage of programs that can be verified automatically. We note, however, that whilst Boogie/Z3 offers tangible benefits, it is not without its own challenges. For example, understanding why Boogie/Z3 cannot verify a particular program, or why it loops indefinitely, still requires considerable expertise. We have also shown that a number of non-trivial case studies written in Whiley can be successfully verified with Boogie. This has also helped us identify areas in which the Whiley language itself could be improved to better exploit Boogie.

Finally, an interesting direction for future work would be to explore translating Boogie’s counterexample models back into Whiley-like notation to improve error reporting. We would also like to extend Whiley’s support for framing and, consequently, our Boogie back end. Another interesting path would be a more detailed comparison against the Whiley verifier. As discussed in Sect. 3, this has a layered design based on an intermediate assertion language and an underlying SMT solver. Hence, one could swap out the SMT solver for Z3 to provide a more accurate comparison with Boogie itself.