1 Introduction

Bounded Exhaustive Testing (BET, for short) automates unit testing of a function by checking one of its properties for all admissible inputs up to some size. Although this method is limited to small input data, its relevance is recognized [15, 21] since it facilitates debugging by providing the smallest counterexamples, and provides confidence by guaranteeing the absence of errors below some size bound. This makes BET complementary to methods adapted to data of larger size, such as random testing. The purpose of this paper, however, is not to compare BET with other test methods, but to improve the quality and availability of BET tools.

BET was first used to check properties in functional languages, as exemplified by SmallCheck in Haskell [20]. BET was then adapted to several proof assistants, e.g., to Isabelle in Quickcheck [4] and more recently to Coq, in an extension of QuickChick [14] named CUT (Coq Unit Testing) [7].

BET is also relevant to check properties produced by deductive verification, i.e., verification conditions stating that a given program satisfies a given specification. We present a prototypical implementation of BET in the deductive verification tool Why3 [3]. Programs for Why3 are written in WhyML, a verification-oriented dialect of ML with some functional features, such as polymorphic algebraic types, but also imperative features, such as loops or records with mutable fields. The functional behavior of WhyML programs can be specified with formal annotations: preconditions, postconditions, invariants and loop variants, assertions, etc., in a first-order logic with polymorphic types. The Why3 standard library defines theories or data structures for common types such as integers, lists or arrays. Why3 reduces programs and specifications to logical verification conditions whose validity entails that the programs meet their specifications. Then, automated provers (e.g., SMT solvers) or proof assistants (e.g., Coq) can be used to prove these logical statements. Why3 also provides extraction to get correct-by-construction OCaml programs.

Some BET tools implement techniques such as constraint solving or local choice with backtracking, either to enumerate data or to derive enumeration programs from data definitions (see [6, Section 7] for references). However, these techniques may fail or enumerate data too slowly. For effectiveness, we consider BET using a distinct handwritten enumeration program for each family of data of interest. Dubois and Giorgetti proposed BET for Coq with such custom enumeration programs, defined either in Coq or in Why3 language [6].

Confidence in BET is increased if its enumeration programs are certified, ideally with formal proofs of their properties. Genestier et al. [10] developed a first version of the ENUM library, gathering enumeration programs in C language, formally specified with ACSL clauses and proved with Frama-C plugin WP for deductive verification. An adaptation to Why3 of a small fragment of this library has been presented to the French community [11, 12]. Here we present a larger version of this library and its certification with Why3.

Another challenge for BET is to design and implement efficient enumeration algorithms. We examine here several ways to reduce their algorithmic cost: by implementing algorithms in a more efficient language (C versus WhyML), or by using optimized compilation. We also study the negative impact that these optimizations might have on certification.

The first contribution of this work is an implementation of BET to check Why3 properties (Sect. 2). The second contribution is a library of enumeration programs certified with Why3 (Sect. 3). The third contribution is an experimental study on optimizing enumeration programs without sacrificing too much of their certification (Sects. 4 and 5).

2 Bounded Exhaustive Testing for Why3

This section presents our implementation of bounded exhaustive testing for Why3 properties. It consists of a generic BET function (described in Sect. 2.2) and a library of enumeration programs (detailed in Sect. 3). All enumeration programs implement the same interface, described in Sect. 2.1. Two examples of BET are given in Sects. 2.3 and 2.4, respectively with success and exhibiting a counterexample.

2.1 Common Interface of Enumeration Programs

Since enumeration is a particular form of iteration, we specify and implement enumeration programs (sometimes hereafter called generators) by adapting the modular iterators defined by Filliâtre and Pereira [8, 9]. Our generators modify a state, called a cursor, whose type is

figure a

in WhyML. The field stores the last data generated so far. For simplicity, it is here a mutable array of integers, but other types can be used similarly. The Boolean flag is set to if and only if the data stored in the field has already been exploited, for instance to test a property.

The generators presented in this paper are composed of two functions (declared on Lines 3 and 4 in Listing 1.1): a constructor initiates the cursor with the first element of the iteration, and a function replaces the data in the cursor with the next one, if it exists. Otherwise, it sets the field to false.
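To make the shape of this interface more concrete, here is a minimal C sketch of a cursor with a constructor and a step function. It is an illustration only, not the WhyML code of the library: the names cursor, fresh, create_cursor, next and the bound MAX_N are hypothetical, the flag convention (false once the enumeration is exhausted) is our reading of the text above, and the enumerated family (arrays of length n with values in [0..n-1], in lexicographic order) is chosen only for its simplicity.

```c
#include <stdbool.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

/* Hypothetical cursor: last generated array, its length, and a flag
   that becomes false once the enumeration is exhausted. */
typedef struct {
    int  current[MAX_N];
    int  len;
    bool fresh;
} cursor;

/* Constructor: the first array is all zeros (no data if n is out of range). */
cursor create_cursor(int n) {
    cursor c = { {0}, 0, false };
    if (n >= 0 && n <= MAX_N) {
        c.len = n;
        c.fresh = true;
    }
    return c;
}

/* Step: lexicographic successor among arrays with values in [0..len-1],
   or fresh = false when the current array is the last one. */
void next(cursor *c) {
    int i = c->len - 1;
    while (i >= 0 && c->current[i] == c->len - 1) i--;  /* rightmost incrementable index */
    if (i < 0) { c->fresh = false; return; }
    c->current[i]++;
    for (int j = i + 1; j < c->len; j++) c->current[j] = 0;
}
```

A client typically loops with `for (cursor c = create_cursor(n); c.fresh; next(&c))` and reads `c.current` in the loop body.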

2.2 BET Function

BET is implemented by the generic function in Listing 1.1, whose execution tests the property defined by the function (first parameter) for all data of size n (second parameter). The first parameter of the module (on Line 2) is a characteristic predicate of the enumerated data.

Note that the input type for the function is a list rather than an array, because Why3 has limited support for function parameters that are functions working with mutable data. For the same reason, the generator functions cannot be input parameters for function. Therefore we define them as module parameters (on Lines 3–4). They can be instantiated thanks to Why3’s module cloning mechanism, as detailed in Sect. 2.3.

The return type verdict is composed of the field storing either a counterexample, if it exists, or the empty list ( ) otherwise, and the field rank storing either the number of data tested when the witness is found, or the total number of tested data if there is no counterexample. The function first creates the cursor (line 12), then converts the cursor array into a list (line 18), by using the function from Why3 standard library. Finally, tests each generated data with the (line 19). If a counterexample is found, it is stored in the local variable (line 22), the enumeration is stopped and the function returns the counterexample and the number of data tested so far. Otherwise, the function stops when all data have been tested.

The clause (on Line 10) declares that the function is not guaranteed to terminate. To prove its termination it is necessary to annotate its loop with a variant, an integer expression whose value is non-negative before the loop and strictly decreases between two successive loop iterations. Defining a unique variant for all kinds of enumerated data is a challenging task out of the scope of the present study.

[Listing 1.1]
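The control flow described in this section can be sketched in C as follows. This is a hedged analogue, not the WhyML function of Listing 1.1: the names bet, verdict, found, witness_len and the demo generator and oracles are hypothetical, the generator is passed through function pointers (whereas the WhyML version uses module parameters), and the oracle receives the array directly instead of a list.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

typedef struct { int current[MAX_N]; int len; bool fresh; } cursor;
typedef struct { int witness[MAX_N]; int witness_len; bool found; long rank; } verdict;

typedef cursor (*create_fn)(int n);
typedef void   (*next_fn)(cursor *c);
typedef bool   (*oracle_fn)(const int *a, int n);

/* Tests the oracle on every data of size n; stops at the first counterexample. */
verdict bet(oracle_fn oracle, int n, create_fn create, next_fn next) {
    verdict v = { {0}, 0, false, 0 };
    cursor c = create(n);
    while (c.fresh && !v.found) {
        v.rank++;                                  /* one more data tested */
        if (!oracle(c.current, c.len)) {
            memcpy(v.witness, c.current, (size_t)c.len * sizeof(int));
            v.witness_len = c.len;
            v.found = true;
        } else {
            next(&c);
        }
    }
    return v;
}

/* Demo generator (assumption): binary arrays of length n, lexicographic order. */
cursor bin_create(int n) {
    cursor c = { {0}, 0, false };
    if (n >= 0 && n <= MAX_N) { c.len = n; c.fresh = true; }
    return c;
}

void bin_next(cursor *c) {
    int i = c->len - 1;
    while (i >= 0 && c->current[i] == 1) i--;
    if (i < 0) { c->fresh = false; return; }
    c->current[i] = 1;
    for (int j = i + 1; j < c->len; j++) c->current[j] = 0;
}

/* Demo oracles (assumptions): one that always holds, one that can fail. */
bool always_true(const int *a, int n) { (void)a; (void)n; return true; }

bool no_adjacent_ones(const int *a, int n) {
    for (int i = 1; i < n; i++)
        if (a[i - 1] == 1 && a[i] == 1) return false;
    return true;
}
```

As in the text, the loop carries no termination argument: it stops only because the generator eventually clears its flag.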

2.3 Example of BET

We illustrate our BET for Why3 with functions and properties on permutations of a given size. Permutations on a finite set are an important topic in combinatorics and group theory. They have recently been formalized as injective endofunctions in Coq [6, Section 3]. The present example is the first step of an adaptation of that case study to Why3.

The permutation p on the set \([0..n-1]\) of the first n natural numbers is encoded by the Why3 integer array a of its images, i.e., \(a[i] = p(i)\) for \(0 \le i < n\). We characterize these permutation arrays with the predicate

figure z

where specifies that the values of array a are in \([0...\texttt {a.length}-1]\) and specifies injectivity of the function represented by a, i.e., uniqueness of values in a.

Let us consider the function in Listing 1.2. The function reverses the order of the elements of its input array. For instance, it turns the array \( {\begin{array}{|c|c|c|c|} \hline 4 &{} 1 &{} 0 &{} 7\\ \hline \end{array}} \) into the array \( {\begin{array}{|c|c|c|c|} \hline 7 &{} 0 &{} 1 &{} 4\\ \hline \end{array}} \). It proceeds by exchanging symmetrical elements with respect to the middle of the array. We want to prove that the function preserves permutations. This property is specified by the precondition and the postcondition on Lines 2–3.

[Listing 1.2]
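For readers more familiar with C, the reversal algorithm described above (exchange of elements that are symmetric with respect to the middle of the array) can be sketched as follows. This is not the WhyML code of Listing 1.2; the name reverse and its signature are chosen here for illustration.

```c
/* In-place reversal: swap a[i] with its mirror a[n-1-i] for the first half. */
void reverse(int *a, int n) {
    for (int i = 0; i < n / 2; i++) {
        int tmp = a[i];
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = tmp;
    }
}
```

When n is odd, the middle element is its own mirror and is left untouched.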

Since WhyML predicates are not necessarily decidable, all specifications are ignored when a program is run. In particular, the postcondition ( ) is not executable. In order to test it, a Boolean function implementing the logical predicate has to be provided. A Boolean function implementing a logical predicate, when it exists, is a decision procedure for this predicate. The Boolean function and a proof that it corresponds to the predicate are together called a Boolean reflection. This mechanism has several applications, e.g., proof automation [13].

The Boolean function

figure ah

decides the predicate if and respectively are decision procedures for the predicates and . We only detail the Boolean reflection of the predicate

figure ao

a naive (i.e., non-optimized) implementation of the predicate being similar. The predicate

figure aq

is a specificity of WhyML. It is indeed both a logical predicate and a Boolean function, because it is also the case for comparison operators on integers. Thus, we have its Boolean reflection for free.

The Boolean function in Listing 1.3 is a decision procedure for the predicate . The universal quantification is implemented by a loop that stops at the first array value not in the interval \([0..n-1]\). The postcondition (on Line 2) ensures that the Boolean function decides the logical predicate : the function returns if and only if the predicate holds for the input array .

[Listing 1.3]

A loop invariant (on Line 6) helps to prove the postcondition. It uses the generalization

figure ay

of which controls that each element of the subarray \(a[l..u-1]\) is in the interval \([0...b-1]\).
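The executable side of this Boolean reflection can be sketched in C as below. The function names (in_range_sub, is_range, is_inj, is_permut) are hypothetical stand-ins for the Boolean functions discussed above, the injectivity check is the naive quadratic one, and, unlike in WhyML, nothing here is proved: the comments only state what each function is meant to decide.

```c
#include <stdbool.h>

/* Decides whether every element of a[l..u-1] lies in [0..b-1];
   the loop stops at the first value outside the interval. */
bool in_range_sub(const int *a, int l, int u, int b) {
    for (int i = l; i < u; i++)
        if (a[i] < 0 || a[i] >= b) return false;
    return true;
}

/* Values of an array of length n are in [0..n-1]. */
bool is_range(const int *a, int n) {
    return in_range_sub(a, 0, n, n);
}

/* Naive injectivity check: no value occurs twice. */
bool is_inj(const int *a, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] == a[j]) return false;
    return true;
}

/* A permutation array is in range and injective. */
bool is_permut(const int *a, int n) {
    return is_range(a, n) && is_inj(a, n);
}
```

The partial-range helper plays the role of the generalized predicate: after i iterations of the loop, the property holds for the subarray a[l..i-1], which is the shape of the loop invariant mentioned above.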

Whereas implementing a decision procedure is in general a difficult problem, it becomes simple for the family of first-order properties on integer arrays where all quantifications on array indices and values are bounded. All such universal quantifications (\(\forall \)) can be implemented by a for loop as in the former example, and implementing an existential quantification (\(\exists \)) is similar. Genestier et al. [10] showed that these array properties are common in combinatorics. They proposed a general pattern of Boolean reflection, when the properties are specified by ACSL predicates and implemented by Boolean functions in C. The decidability property is proved generically, once and for all, for all kinds of predicates. So, it holds for free (without requiring specific annotations) for each pattern instantiation. The adaptation of this feature to WhyML is left as future work.

[Listing 1.4]

A simple program to test that the function preserves permutations is presented in Listing 1.4. The declarations on Lines 1 and 2 import other modules. The module provides the predicate and its Boolean reflection . The module provides a cursor and its functions to enumerate permutations. The declaration on Lines 4–7 imports a clone of the generic module , instantiated with the characteristic predicate and the enumeration functions for permutations. This cloning provides the type and the right instance of the generic function to test properties for all permutations with a given size. For the size \(n=6\) the test program (on Lines 9–13) uses this instance and an anonymous oracle function working as follows: as required by , its input is a list of integers. The function from Why3 standard library transforms it into an array , then reversed in-place by application of the function. Finally the Boolean function is applied to the resulting array.

For efficiency and to get an explicit test result, the test code is executed in OCaml, after extraction of the test program and related modules. Thanks to some additional lines of OCaml code, the test result is displayed as follows:

figure bq

meaning that the test was successful for the \(6! = 720\) permutations of size 6. This BET is executed in less than one second, in the environment used for the experimentation described in Sect. 4, where more efficiency results are provided.

The current prototype does not allow setting a time limit for BET, but it can be extended with this feature. The approach is suitable for arrays containing integers in a small interval, as is the case for permutations here. For larger integer ranges, random generation is preferable.

2.4 Counterexample

What happens if there is an error in a tested function? To illustrate the behavior of in that case we inject an error on Line 9 of the function (in Listing 1.2) that becomes the following one:

figure bt

When running the same test (in Listing 1.4) for this erroneous version, the following output

figure bu

provides as counterexample a permutation that the false version of the function transforms into the array

figure bw

which is not a permutation. This BET discovers the error after generating only one test case. In general, more test cases may be required.
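The two outcomes of Sects. 2.3 and 2.4 can be reproduced in spirit with the following self-contained C sketch. It is not the code of the listings: all names are hypothetical, permutations are enumerated in lexicographic order by a textbook successor function, and the injected bug (a swap that forgets to save the overwritten element) is our own example, not necessarily the error injected in the paper.

```c
#include <stdbool.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

bool is_permut(const int *a, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] < 0 || a[i] >= n) return false;
        for (int j = i + 1; j < n; j++)
            if (a[i] == a[j]) return false;
    }
    return true;
}

/* Lexicographic successor; returns false if a is the last permutation. */
bool next_perm(int *a, int n) {
    int i = n - 2;
    while (i >= 0 && a[i] > a[i + 1]) i--;
    if (i < 0) return false;
    int j = n - 1;
    while (a[j] < a[i]) j--;
    int t = a[i]; a[i] = a[j]; a[j] = t;
    for (int l = i + 1, r = n - 1; l < r; l++, r--) {
        t = a[l]; a[l] = a[r]; a[r] = t;
    }
    return true;
}

void reverse(int *a, int n) {
    for (int i = 0; i < n / 2; i++) {
        int t = a[i]; a[i] = a[n - 1 - i]; a[n - 1 - i] = t;
    }
}

/* Hypothetical bug: the old value of a[i] is overwritten before being saved. */
void buggy_reverse(int *a, int n) {
    for (int i = 0; i < n / 2; i++) {
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = a[i];
    }
}

/* Applies rev to each permutation of size n and checks that the result is a
   permutation. Returns the number of data tested; on failure, *found is true
   and witness holds the input permutation that exposed the error. */
long bet_reverse(int n, void (*rev)(int *, int), int *witness, bool *found) {
    int p[MAX_N], r[MAX_N];
    long rank = 0;
    *found = false;
    if (n < 0 || n > MAX_N) return 0;
    for (int i = 0; i < n; i++) p[i] = i;          /* first permutation: identity */
    do {
        rank++;
        for (int i = 0; i < n; i++) r[i] = p[i];
        rev(r, n);
        if (!is_permut(r, n)) {
            for (int i = 0; i < n; i++) witness[i] = p[i];
            *found = true;
            return rank;
        }
    } while (next_perm(p, n));
    return rank;
}
```

With n = 6, the correct reversal passes all 720 tests, whereas the buggy one already fails on the first generated permutation (the identity), which it turns into an array with repeated values.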

3 Certified Library of Enumeration Programs

ENUM is a library of certified enumeration programs for BET, freely distributed at https://github.com/alaingiorgetti/enum. Its first releases were composed of C programs specified in ACSL and verified with the Frama-C plugin WP for deductive verification [10]. This section presents a new part of ENUM, composed of enumeration programs specified and implemented in WhyML. It is an almost complete adaptation to WhyML of the C/ACSL enumeration programs, complemented by new generators. Its programs implement algorithms that enumerate combinatorial structures [2] and have various applications in combinatorics.

Section 3.1 introduces some expected properties of these generators and their formalization in WhyML. Section 3.2 presents a simple way to define a generator, by filtering the output of another generator. Section 3.3 describes the techniques we use to assist formal proofs that the generators satisfy their expected properties. Finally, the content of the library is detailed in Sect. 3.4.

3.1 Properties

Each data enumeration program is expected to satisfy the following three behavioral properties. Soundness is the property that each generated data satisfies the characteristics (or data invariant) of its family, such as being a duplicate-free or a sorted array. Completeness is the property that the program produces all existing data with a given size, without omitting any of them. Generally, proving completeness is more challenging than proving soundness. Therefore, we limit ourselves to algorithms enumerating data in a predefined strict total order, hereafter denoted by \(\prec \), and we adopt two strategies. The first strategy is to specify completeness as the conjunction of the following three properties: the property min that the first generated data is the smallest one, the property max that the last generated data is the largest one, and the property inc (for “incrementality”) that each data \(a_2\) generated from data \(a_1\) is the smallest data strictly greater than \(a_1\). In other words, no sound data \(a_3\) is such that \(a_1 \prec a_3 \prec a_2\). When proving completeness seems too difficult, the second strategy is to address the less challenging property – named progress – that each generated data is strictly greater than the former generated data. Since we assume that there are finitely many data with each size, progress entails termination of bounded-exhaustive enumeration.

[Listing 1.5]

Listing 1.5 shows a declaration of the enumeration functions with their contracts (pre- and postconditions) formalizing these properties in WhyML. The precondition on Line 2 specifies that the size n of data should be a natural number. The function (resp. ) should set the cursor field to false if and only if there is no data for a given size n (resp. the input cursor contains the last data). Therefore, most of the properties are formalized by postconditions guarded by the condition that the Boolean flag is true.

We assume that a predicate

figure cc

encapsulates the data invariant. Then, the generator is sound if the first generated data satisfies this predicate (postcondition on Line 3) and if the output of the function satisfies this predicate (postcondition on Line 8) whenever its input does (precondition on Line 7). The progress property is formalized on Line 9, with a predicate formalizing the strict total order \(\prec \). (The expressions and in a function postcondition respectively denote the values of the expression before and after the function call.) The properties min, inc and max (entailing completeness) are respectively formalized on Lines 4, 10 and 11, with predicates , and respectively formalizing minimality, incrementality and maximality of the restriction of the order \(\prec \) to data satisfying the data invariant .
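In WhyML these properties are proved statically. As a complementary intuition, the following C sketch checks soundness and progress dynamically on one small generator. Everything in it is an assumption made for illustration: the names, the choice of the lexicographic order for \(\prec \), and the enumerated family (arrays f of length n with \(0 \le f[i] \le i\) for every index i).

```c
#include <stdbool.h>
#include <string.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

/* Strict lexicographic order on arrays of the same length n. */
bool lt_lex(const int *a, const int *b, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] < b[i]) return true;
        if (a[i] > b[i]) return false;
    }
    return false;
}

/* Data invariant of the demo family: 0 <= f[i] <= i for all i. */
bool is_fact(const int *f, int n) {
    for (int i = 0; i < n; i++)
        if (f[i] < 0 || f[i] > i) return false;
    return true;
}

/* Successor in lexicographic order; returns false on the last array. */
bool next_fact(int *f, int n) {
    int i = n - 1;
    while (i >= 0 && f[i] == i) i--;
    if (i < 0) return false;
    f[i]++;
    for (int j = i + 1; j < n; j++) f[j] = 0;
    return true;
}

/* Runs the generator for size n and checks soundness and progress at each
   step. Returns the number of generated arrays, or -1 on a violation. */
long check_fact(int n) {
    int cur[MAX_N] = {0}, prev[MAX_N];
    long count = 0;
    if (n < 0 || n > MAX_N) return -1;
    do {
        if (!is_fact(cur, n)) return -1;                    /* soundness */
        if (count > 0 && !lt_lex(prev, cur, n)) return -1;  /* progress  */
        memcpy(prev, cur, sizeof cur);
        count++;
    } while (next_fact(cur, n));
    return count;
}
```

Such a run gives evidence for one size only. Comparing the returned count with the expected cardinality (n! for this family) is a cheap sanity check of completeness, but it does not replace the min, inc and max proofs.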

3.2 Enumeration by Filtering

Assume you already have implemented, specified and certified an enumeration program for some family of data. Then an enumeration program for those data that satisfy an additional constraint can easily be implemented by running your program and selecting among its outputs those satisfying that constraint. Of course, the more data are rejected, the less effective is the resulting program. However, we show in this section that this filtering technique provides a specification, an implementation and a certification of the resulting enumeration program almost for free.

The generic module in Listing 1.6 formalizes filtering in WhyML. It provides an enumeration program for a family of integer arrays by filtering those arrays in a family (characterized by the predicate ) that satisfy the additional constraint , implemented by the Boolean function . The module is parameterized by the predicates and , the Boolean function and the enumeration functions and of data. The module provides enumeration functions and of data in family .

The function searches the first data by enumeration of data started from the first one (given by ) and selection of the first enumerated data satisfying , if it exists. (Otherwise, the field is set to false by the function .)

The function proceeds similarly, but from the current cursor . If the current data in the cursor is the last one satisfying but subsequent data exist, then they are enumerated (by ) in the cursor. If furthermore none of them are in the family, then the cursor no longer contains a sound data. This is acceptable because, in that case, the field is set to false. As specified on Line 11 of Listing 1.5, the cursor is expected to contain the maximal data only as input of the function when it sets the field to false, not necessarily as its output. It is possible to restore the maximal data in the output cursor, but this makes the generator less effective.

When the Boolean function decides the predicate and the enumeration functions and satisfy their contract given in Listing 1.5, the resulting enumeration functions and satisfy the same contract. This is automatically proved by Why3. So, it also holds for all instantiations of the module , for free.

[Listing 1.6]
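The filtering scheme can be sketched in C as follows, for strictly increasing arrays selected among arrays of length n with values in [0..k-1]. The names (next_barray, is_increasing, create_comb, next_comb, count_comb) are hypothetical and the code is only an executable paraphrase of the idea, without the contracts that the WhyML module carries.

```c
#include <stdbool.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

/* Base generator: lexicographic successor among arrays with values in [0..k-1]. */
bool next_barray(int *a, int n, int k) {
    int i = n - 1;
    while (i >= 0 && a[i] == k - 1) i--;
    if (i < 0) return false;
    a[i]++;
    for (int j = i + 1; j < n; j++) a[j] = 0;
    return true;
}

/* Filter: the additional constraint, here strict increase. */
bool is_increasing(const int *a, int n) {
    for (int i = 1; i < n; i++)
        if (a[i - 1] >= a[i]) return false;
    return true;
}

/* Filtered constructor: start from the first base data (all zeros) and
   advance until the filter holds. Returns false if no data passes. */
bool create_comb(int *a, int n, int k) {
    for (int i = 0; i < n; i++) a[i] = 0;
    if (n > 0 && k <= 0) return false;   /* the base family itself is empty */
    while (!is_increasing(a, n))
        if (!next_barray(a, n, k)) return false;
    return true;
}

/* Filtered step: advance in the base family until the filter holds again. */
bool next_comb(int *a, int n, int k) {
    do {
        if (!next_barray(a, n, k)) return false;
    } while (!is_increasing(a, n));
    return true;
}

/* Number of filtered data of length n over [0..k-1]. */
long count_comb(int n, int k) {
    int a[MAX_N];
    long count = 0;
    if (n < 0 || n > MAX_N) return 0;
    for (bool ok = create_comb(a, n, k); ok; ok = next_comb(a, n, k)) count++;
    return count;
}
```

When next_comb returns false, the array is left on the last data of the base family, which in general does not satisfy the filter. This mirrors the remark above that the cursor may hold an unsound data once its flag is false.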

3.3 Auto-active and Interactive Verification

We combine the following two techniques to assist deductive verification of the enumeration programs. Auto-active verification [16] consists in providing additional specifications, such as variants (for termination), invariants, assertions and lemmas (for partial correctness), before running an automated prover. Interactive verification consists in reducing the proof goal step by step, by applying rules – named tactics in Coq and transformations in Why3.

3.4 Contents of ENUM Library

Metrics on the library and its contents are collected in Table 1. The first column assigns a name to each generator. The number of lines of code (resp. WhyML annotations) is recorded in the second (resp. third) column. The fourth (resp. fifth) column gives the number of transformations (resp. lemmas) needed for the proof of the soundness, progress and completeness properties. All of them have been proved automatically with Why3 1.2.0 and the SMT solvers Alt-Ergo 2.2.0, CVC4 1.6 and Z3 4.7.1, except the completeness property for the generator of permutations, which required an interactive proof of two lemmas with Coq 8.9.0.

The first block of lines in Table 1 concerns effective enumeration programs. The first four are adaptations of C++ programs proposed in [2]. The program rgf (for “restricted growth function”) enumerates the arrays a of length n such that \(a[0]=0\) and \(a[i] \le a[i-1]+1\) for \(1 \le i \le n-1\). sorted generates all arrays from \(\{0,...,n-1\}\) to \(\{0,...,k-1\}\) sorted in increasing order. perm enumerates the permutations on \(\{0,...,n-1\}\). barray (for “bounded array”) and endo (for “endo-array”) enumerate the arrays of length n whose values are in \(\{0,...,k-1\}\) and \(\{0,...,n-1\}\), respectively. fact enumerates the n! factorial arrays [12] f of length n such that \(0 \le f[i] \le i\) for \(1 \le i \le n-1\).
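As an example of what such a generator computes, here is a hedged C sketch of a successor function for the rgf family. It is neither the WhyML program of the library nor the C++ program of [2]; the names are hypothetical, the order is lexicographic, and we additionally assume that array values are natural numbers (\(a[i] \ge 0\)), which the definition above leaves implicit.

```c
#include <stdbool.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

/* Data invariant: a[0] = 0 and 0 <= a[i] <= a[i-1] + 1 for 1 <= i <= n-1. */
bool is_rgf(const int *a, int n) {
    if (n > 0 && a[0] != 0) return false;
    for (int i = 1; i < n; i++)
        if (a[i] < 0 || a[i] > a[i - 1] + 1) return false;
    return true;
}

/* Lexicographic successor: increment the rightmost element that can grow
   (a[i] <= a[i-1]) and reset everything to its right to 0. */
bool next_rgf(int *a, int n) {
    int i = n - 1;
    while (i >= 1 && a[i] > a[i - 1]) i--;
    if (i < 1) return false;             /* a[0] is fixed to 0 */
    a[i]++;
    for (int j = i + 1; j < n; j++) a[j] = 0;
    return true;
}

/* Counts the arrays of length n >= 1, checking the invariant on the way.
   Returns -1 if an unsound array is produced, 0 if n is out of range. */
long count_rgf(int n) {
    int a[MAX_N] = {0};
    long count = 0;
    if (n < 1 || n > MAX_N) return 0;
    do {
        if (!is_rgf(a, n)) return -1;
        count++;
    } while (next_rgf(a, n));
    return count;
}
```

For n = 3 this sketch visits 000, 001, 010, 011 and 012 in that order.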

Table 1. Verification results.

The second block concerns enumeration programs obtained by filtering (Sect. 3.2). We denote by Z \(\subset \) X an enumeration program of data Z by filtering among more general data X. For instance, sorted \(\subset \) barray enumerates increasing arrays filtered among bounded arrays. By filtering from barray we get generators for the following data families: arrays sorted in increasing order, injections from \(\{0,...,n-1\}\) to \(\{0,...,k-1\}\), for \(n \le k\) (inj \(\subset \) barray), surjections from \(\{0,...,n-1\}\) to \(\{0,...,k-1\}\), for \(n \ge k\) (surj \(\subset \) barray), and combinations of n elements selected from k, (comb \(\subset \) barray), which are encoded by arrays c of length n such that \(0 \le c[0]\) < ...< \(c[n-1]\) \(\le \) \(k-1\).

4 Experimentation Protocol

This section presents the experimental protocol we have designed in order to compare various ways of implementing, certifying and optimizing data enumeration programs. We consider two programming and specification languages, C/ACSL and WhyML, the properties detailed in Sect. 3.1, and the execution techniques (interpretation, extraction and compilation) detailed in Sect. 4.1. The goal of the experimentation is to answer the research questions detailed in Sect. 4.2.

All proofs and time measurements were performed on an Ubuntu 18.04 virtual machine, with a Core i5-8259U processor.

4.1 Execution

There are several ways to run an enumeration program: with Why3 as an interpreter (command why3 execute), by executing code compiled from OCaml source code extracted from WhyML code, or by compiling and executing C code, either extracted from WhyML code or written by hand. Indeed, Rieu-Helft [19] has developed a method to extract a subset of WhyML programs to C. The C code can be compiled with gcc or with the certified C compiler CompCert [17]. When a program is compiled with an ordinary compiler like gcc, there is no assurance that the executed code has the same semantics as the source code. In contrast, the CompCert compiler is formally verified, using machine-assisted mathematical proofs, to be exempt from miscompilation issues.

4.2 Research Questions

We gather experimental data in order to answer the following research questions. In a nutshell, RQ1 is about certification only, RQ2 about efficiency only and RQ3 about how to find a good compromise between both quality criteria.

  • RQ1: What is the most convenient approach to certify the enumeration programs? Since we have two versions, one in C/ACSL and another one in WhyML, we want to compare the effort required to prove their properties with Frama-C/WP and Why3. We quantify this proof effort with the number of lines of specification. These numbers for WhyML version are in Table 1.

  • RQ2: What is the most efficient way to run our programs? The efficiency of our generators is estimated by computing their speed, i.e., the number of data generated per second, for all the ways to run our programs presented in Sect. 4.1. Indeed, we implement algorithms already optimal in memory, producing each data on the fly, starting from the data previously produced. Thus, only one data is stored in memory at a time.

  • RQ3: Since certification and optimization are two desirable but potentially antagonistic quality criteria, which language and tool combination provides the best compromise between both? From the answers to the former two questions we try to derive a good compromise between data generation speed and proof effort.
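The speed metric of RQ2 can be measured along the following lines. This C sketch is only an assumption about how such a measurement may be set up (it is not the benchmark harness used for the experiments): it enumerates all permutations of size n with a textbook lexicographic successor function and reports the elapsed CPU time obtained from the standard clock function. The names next_perm and time_perms are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

#define MAX_N 16  /* hypothetical upper bound on the data size */

/* Lexicographic successor; returns false if a is the last permutation. */
static bool next_perm(int *a, int n) {
    int i = n - 2;
    while (i >= 0 && a[i] > a[i + 1]) i--;
    if (i < 0) return false;
    int j = n - 1;
    while (a[j] < a[i]) j--;
    int t = a[i]; a[i] = a[j]; a[j] = t;
    for (int l = i + 1, r = n - 1; l < r; l++, r--) {
        t = a[l]; a[l] = a[r]; a[r] = t;
    }
    return true;
}

/* Enumerates all permutations of size n, stores their number in *count
   and returns the elapsed CPU time in seconds (-1.0 if n is out of range). */
double time_perms(int n, long *count) {
    int p[MAX_N];
    long c = 0;
    *count = 0;
    if (n < 0 || n > MAX_N) return -1.0;
    for (int i = 0; i < n; i++) p[i] = i;
    clock_t start = clock();
    do { c++; } while (next_perm(p, n));   /* only one data in memory at a time */
    clock_t stop = clock();
    *count = c;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The speed is then the count divided by the returned time, provided the time is positive. For small sizes the elapsed time may be below the clock resolution, so meaningful figures require sizes for which the run lasts at least a noticeable fraction of a second.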

5 Experimentation Results

This section exploits experimental results to answer our research questions.

To answer RQ1 we first analyze some metrics collected in Table 1 for the version in WhyML and the metrics in Table 2 for the version in C/ACSL, for the most effective programs (the first 5 in Table 1, without filtering). These metrics are the numbers of lines of code and specification and the time required for proofs. The number of transformations is not comparable, as Frama-C/WP does not offer a transformation mechanism. We also do not compare the number of lemmas, because all lemmas in WhyML are used to prove completeness, but completeness is neither specified nor proved in the C/ACSL version. Nevertheless, the average proof time with C/ACSL is 1.69 times longer than with WhyML. The total numbers of lines of code and specifications are 76 and 154 in the C/ACSL programs and 134 and 174 in the WhyML programs, i.e., not much more for one additional specified property.

Table 2. Verification results with the C/ACSL version.

Since the completeness property was not previously proved with Frama-C/WP, we have tried to adapt to that environment its specification and successful proof with Why3. Although the adaptation of the specification to ACSL did not require much effort, we have not yet managed to prove any fragment of the completeness property with Frama-C/WP. We assume that this is due to the different memory models used by Why3 and WP. A memory model defines links between the program variables and the mathematical terms used in the proof obligations. It represents a mapping of the memory, management processes (reading, writing, allocating, releasing) and their properties. While Why3 has a simple memory model for arrays, producing concise proof obligations, the WP memory model produces more complex proof obligations. This convinces us that Why3 is more convenient than Frama-C/WP for the certification of ENUM.

To answer RQ2 we compare the speed of data generation of various interpretations or compilations of implementations and extractions in WhyML, OCaml and C of the same enumeration algorithm. We consider an algorithm to enumerate permutations [2, p. 243], and assume that speeds would be classified in the same order for other generators.

The first column of Table 3 gives the size of the generated permutations. The other columns display the number of millions of data generated per second, for four implementations and execution scenarios. A dash (-) indicates that generation exceeds the 6 h time limit.

Table 3. Speed of data generation (number of millions of data per second).

The interpretation of WhyML code is the least efficient enumeration method. It is not surprising since the other methods include a compilation, usually more efficient than an interpretation. Next comes the execution of its extraction in OCaml. For instance, the OCaml program enumerates \(5.58\times 10^{6}\) permutations of size 12 in 1 s. This may be appropriate in some applications, but is well below the speeds of the C programs. Indeed, C is a low-level imperative programming language. It has been designed to provide low-level memory access, which allows it to reduce the memory allocation required and optimize performance, particularly through the use of pointers.
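To put this figure in perspective, here is a back-of-the-envelope estimate of our own, not a measurement from the experiments. Assuming that the reported speed stays roughly constant during the whole run, the exhaustive enumeration of the \(12!\) permutations of size 12 by the extracted OCaml program would take about

\[ \frac{12!}{5.58\times 10^{6}\ \text{s}^{-1}} = \frac{479\,001\,600}{5.58\times 10^{6}}\ \text{s} \approx 86\ \text{s}. \]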

Although the extracted C code is behind the handwritten one, its speed is much higher than that of the OCaml code. Its performance allows us to continue our efficiency study only for the C code extracted from the WhyML code. Figure 1 shows data generation speeds for this C code compiled with gcc (without and with -O3 optimization option) and CompCert compilers. This experiment confirms the claim that code compiled with CompCert is about twice as fast as that compiled by gcc without optimization, and quantifies the claim that it is a bit slower than that compiled by gcc with higher levels of optimization: the code compiled by gcc with its third level of optimization is about 40% faster than the one compiled by CompCert.

Fig. 1. Speed of data generation for different compilations.

To answer RQ3 we first draw some conclusions from the former answers to RQ1 and RQ2. Firstly, thanks to its elementary theory of arrays, Why3 makes it possible to prove more challenging properties – such as completeness – than Frama-C and its WP plugin. Moreover, C code automatically extracted from WhyML code is almost as fast as handwritten C code, for a much lower implementation effort. If a higher speed (resp. more confidence) is expected, the C code can be compiled with gcc -O3 (resp. CompCert).

It remains to evaluate the additional effort required to specify and implement in WhyML enumeration programs suitable for C extraction. For the pointer-adapted permutation generator, we had to write 49 lines of code (only 7 lines more than for the original program), and 107 lines of specifications, i.e., 21 lines more than for the original program. The number of specification lines is mainly related to the fact that we control the memory manually. Discharging the proofs takes 56.31 s, 3.44 times more than for the original program. In addition, in the case of this program, completeness is not proved. Other generators (barray and fact) were also adapted for extraction. All properties were proved for these programs, but the specification effort was also greater than for their original versions. However, we noted that many specifications were common to all programs.

6 Conclusion

We have presented a prototypical implementation of a bounded exhaustive testing tool to check properties in the deductive verification tool Why3. It relies on enumeration programs which are specified, implemented and certified by formal proofs with Why3. The impact of several execution scenarios on their efficiency has been evaluated experimentally.

Obviously, we do not claim that BET and our prototype are competing with advanced property testing tools, such as QuickCheck and its commercial version QuviQ [1]. Such a comparison would be of little interest, because we pursue different goals. Our first goal is to certify the test tool, which as far as we know has so far been done only for and with the Coq proof assistant, in the QuickChick tool [18]. Our second goal is to offer a free test tool to Why3 users, complementing prover-based counterexample generation [5].

This is ongoing work and directions for future work are numerous. First, the presented certification of enumeration programs should be extended to the entire testing tool. Data enumeration should be generalized to address functions with several parameters, complex datatypes (e.g. tree-like) and constraints between parameters. The specification and certification of more efficient enumeration programs may also be explored.

An important possible improvement concerns Boolean reflection, i.e., implementation and certification of a decision procedure for the characteristic predicate of test data. We have shown two applications of this procedure: as a test oracle, and as a filter to select the test data among a wider family. In the presented prototype the user has to write each procedure manually. A short-term objective is to provide the user with an automated mechanism for deriving these procedures, covering at least a first-order theory including integers and integer arrays.