1 Introduction

The paradigm of deductive verification  [24, 32] combines manual annotations and semi-automated theorem proving to prove programs correct. Programmers annotate code they develop with contracts and inductive invariants, and use high-level directives to an underlying, mostly-automated logic engine to verify their programs correct. Several mature tools have emerged that support such verification, in particular tools based on the intermediate verification language Boogie  [3] and the SMT solver Z3  [18] (e.g., Vcc  [13] and Dafny  [42]).

Viewed through the lens of deductive verification, the primary challenges in automating verification are two-fold. First, even when strong annotations in terms of contracts and inductive invariants are given, the validity problem for the resulting verification conditions is often undecidable (e.g., in reasoning about the heap, reasoning with quantified logics, and reasoning with non-linear arithmetic). Second, the synthesis of loop invariants and strengthenings of contracts that prove a program correct needs to be automated so as to lift this burden currently borne by the programmer.

A standard technique to solve the first problem (i.e., intractability of validity checking of verification conditions) is to build automated, sound but incomplete verification engines for validating verification conditions, thus skirting the undecidability barrier. Several such techniques exist; for instance, for reasoning with quantified formulas, tactics such as model-based quantifier instantiation  [28] are effective in practice, and they are known to be complete in certain settings  [44]. In the realm of heap verification, the so-called natural proof method explicitly aims to provide automated and sound but incomplete methods for checking validity of verification conditions with specifications in separation logic  [12, 44, 53, 55].

Turning to the second problem of invariant generation, several techniques have emerged that are able to automatically synthesize invariants if the verification conditions fall in a decidable logic. Prominent among these are interpolation [46] and IC3/PDR [6, 20]. Moreover, a class of counterexample-guided inductive synthesis (CEGIS) methods has emerged recently, including the ICE learning model [26], for which various instantiations exist [10, 22, 26, 27, 38, 58]. The key feature of the latter methods is a program-agnostic, data-driven learner that learns invariants in tandem with a verification engine that provides concrete program configurations as counterexamples to incorrect invariants.

Although classical invariant synthesis techniques, such as Houdini  [23], are sometimes used with incomplete verification engines, to the best of our knowledge there is no fundamental argument as to why this should work in general. In fact, we are not aware of any systematic technique for synthesizing invariants when the underlying verification problem falls in an undecidable theory. When verification is undecidable and the engine resorts to sound but incomplete heuristics to check validity of verification conditions, it is unclear how to extend interpolation/IC3/PDR techniques to this setting. Data-driven learning of invariants is also hard to extend since the verification engine typically cannot generate a concrete model for the negation of verification conditions when verification fails. Hence, it cannot produce the concrete configurations the learner needs.

The main contribution of this paper is a general, learning-based invariant synthesis framework that learns invariants using non-provability information provided by verification engines. Intuitively, when a conjectured invariant results in verification conditions that cannot be proven, the idea is that the verification engine must return information that generalizes the reason for non-provability, hence pruning the space of future conjectured invariants.

Our framework, which we present in Sect. 2, assumes a verification engine for an undecidable theory \({\mathscr {U}}\) that reduces verification conditions to a decidable theory \({\mathscr {D}}\) (e.g., using heuristics such as bounded quantifier instantiation to remove universal quantifiers, function unfolding to remove recursive definitions, and so on) that permits producing models for satisfiable formulas. The translation is assumed to be conservative in the sense that if the translated formula in \({\mathscr {D}}\) is valid, then we are assured that the original verification condition is \({\mathscr {U}}\)-valid. If the verification condition is found to be not \({\mathscr {D}}\)-valid (i.e., its negation is satisfiable), on the other hand, our framework describes how to extract non-provability information from the \({\mathscr {D}}\)-model. This information is encoded as conjunctions and disjunctions in a Boolean theory \({\mathscr {B}}\), called conjunctive/disjunctive non-provability information (CD-NPI), and communicated back to the learner. To complete our framework, we show how the formula-driven problem of learning expressions from CD-NPI constraints can be reduced to the data-driven ICE model. This reduction allows us to use a host of existing ICE learning algorithms and results in a robust invariant synthesis framework that guarantees to synthesize a provable invariant if one exists.

However, our CD-NPI learning framework has non-trivial requirements on the verification engine, and building or adapting appropriate engines is not straightforward. To show that our framework is indeed applicable and effective in practice, our second contribution is an application of our technique to two verification domains where the underlying verification is undecidable:

  • Our first setting, presented in Sect. 3, is the verification of dynamically manipulated data structures against rich logics that combine properties of structure, separation, arithmetic, and data. We show how natural proof verification engines [44, 53], which are sound but incomplete verification engines that translate a powerful undecidable separation logic called Dryad to decidable logics, can be fit into our framework. Moreover, we implement a prototype of such a verification engine on top of the program verifier Boogie [3] and demonstrate that this prototype is able to fully automatically verify a large suite of benchmarks, containing standard algorithms for manipulating singly and doubly linked lists, sorted lists, as well as balanced and sorted trees. Automatically synthesizing invariants for this suite of heap-manipulating programs against an expressive separation logic is very challenging, and we do not know of any other technique that can automatically prove all of them. In fact, heap-based reasoning is already challenging when invariants are given, and even then the number of automatic tools able to handle heap verification for rich separation logic specifications is low. For instance, the SL-Comp benchmarks for separation logic [62] are still at the level of logic solvers for entailment in separation logic, rather than for program verification. Thus, we have to leave a comparison to other approaches for future work.

  • The second setting, presented in Sect. 4, addresses the verification of programs against specifications with universal quantification, which renders verification undecidable in general. In this situation, automated verification engines commonly use a variety of bounded quantifier instantiation techniques (such as E-matching, triggers, and model-based quantifier instantiation) to replace universal quantification by conjunctions over a specific set of terms. This soundly reduces satisfiability checking of the negated verification conditions to a decidable theory. Based on such techniques, we implement our framework and we show that it is able to effectively generate invariants that prove a challenging suite of programs correct against universally quantified specifications.

To the best of our knowledge, our framework is the first to systematically address the problem of invariant synthesis for incomplete verification engines that work by soundly reducing undecidable logics to decidable ones. We believe our experimental results provide the first evidence of the tractability of this important problem.

This work is an extended version of a conference paper [48]. Compared to the conference version, this work contains all proofs, an illustrative example of our framework, a section discussing the limitations of our framework (Sect. 2.4), as well as an extended description of the implementation of our framework for verifying heap-manipulating programs (Sect. 3). In addition, this work presents a second case study, namely the implementation of our framework for verifying programs against specifications with universal quantification as well as an empirical evaluation for this setting (Sect. 4), neither of which was contained in the conference paper.

1.1 Related Work

Techniques for invariant synthesis include abstract interpretation [15], interpolation [46], IC3 [6], predicate abstraction [2], abductive inference [19], as well as synthesis algorithms that rely on constraint solving [14, 29, 30]. Complementing them are data-driven invariant synthesis approaches that are based on machine learning. Examples include techniques that learn likely invariants, such as Daikon [21], and techniques that learn inductive invariants, such as Houdini [23], ICE [26], and Horn-ICE [10, 22]. The latter typically require a teacher that can generate counterexamples if the conjectured invariant is not adequate or not inductive. Classically, this is possible only when the verification conditions of the program fall in decidable logics. In this paper, we investigate data-driven invariant synthesis for incomplete verification engines and show that the problem can be reduced to ICE learning if the learning algorithm learns from non-provability information and produces hypotheses in a class that is restricted to positive Boolean formulas over a fixed set of predicates. Data-driven synthesis of invariants has regained interest recently [25, 26, 38, 51, 52, 57, 58, 59, 60, 61, 63], and our work addresses the important problem of synthesizing invariants for programs whose verification conditions fall in undecidable fragments.

Our application to learning invariants for heap-manipulating programs builds upon the logic Dryad [53, 55] and the natural proof technique line of work for heap verification developed by Qiu et al. Techniques similar to Dryad for automated reasoning about programs that manipulate dynamic data structures have also been proposed in [11, 12]. However, unlike our current work, none of these works synthesize heap invariants. Given invariant annotations in their respective logics, they provide procedures to check whether the verification conditions are valid. There has also been a lot of work on synthesizing invariants for separation logic using shape analysis [9, 41, 56]. However, most of these techniques are tailored for memory safety and shallow properties and do not handle rich properties that check full functional correctness of data structures as we do in this work. Interpolation has also been suggested recently to synthesize invariants involving a combination of data and shape properties [1]. It is, however, not clear how the technique can be applied to a more complicated heap structure, such as an AVL tree, where shape and data properties are not cleanly separated but are intricately connected. Recent work also includes synthesizing heap invariants in the logic from [33] by extending IC3 [34, 35].

In this work, our learning algorithm synthesizes invariants over a fixed set of predicates. When all programs belong to a specific class, such as the class of programs manipulating data structures, these predicates can be uniformly chosen using templates. Investigating automated ways for discovering candidate predicates is a very interesting future direction. Related work in this direction includes recent works  [51, 52].

Fig. 1 A non-provability information (NPI) framework for invariant synthesis

2 An Invariant Synthesis Framework for Incomplete Verification Engines

In this section, we develop our framework for synthesizing inductive invariants for incomplete verification engines, using a counter-example guided inductive synthesis approach. We do this in a setting where the hypothesis space consists of formulas that are Boolean combinations of a fixed set of predicates \({\mathscr {P}}\), which need not be finite for the general framework (although we assume \({\mathscr {P}}\) to be a finite set of predicates when developing concrete learning algorithms later). For the rest of this section, let us fix a program P that is annotated with assertions (and possibly with some partial annotations describing pre-conditions, post-conditions, and assertions). Moreover, we say that a formula \(\alpha \) is weaker (stronger) than a formula \(\beta \) in a logic \(\mathscr {L}\) if \(\vdash _\mathscr {L} \beta \Rightarrow \alpha \) (\(\vdash _\mathscr {L} \alpha \Rightarrow \beta \)) where \(\vdash _\mathscr {L} \varphi \) means that \(\varphi \) is valid in \(\mathscr {L}\).

Figure 1 depicts our general framework of invariant synthesis when verification is undecidable. We fix several parameters for our verification effort. First, let us assume a uniform signature for logics in terms of constant symbols, relation symbols, functions, and types. For simplicity of exposition, we use the same syntactic logic for the various logics \({\mathscr {U}}\), \({\mathscr {D}}\), \({\mathscr {B}}\) in our framework as well as for the logic \({\mathscr {H}}\) used to express invariants.

Let us fix \({\mathscr {U}}\) as the underlying theory that is ideally needed for validating the verification conditions that arise for the program; we presume that validity of formulas in \({\mathscr {U}}\) is undecidable. Since \({\mathscr {U}}\) is an undecidable theory, the engine will resort to sound approximations (e.g., bounded quantifier instantiation via mechanisms such as triggers [17], bounded unfolding of recursive functions, or natural proofs [44, 53]) to reduce this logical task to a decidable theory \({\mathscr {D}}\). This reduction is assumed to be sound in the sense that if the resulting formulas in \({\mathscr {D}}\) are valid, then the verification conditions are valid in \({\mathscr {U}}\) as well. If a formula is found not valid in \({\mathscr {D}}\), then we require that the logic solver for \({\mathscr {D}}\) returns a model for the negation of the formula (see Footnote 1). Note that this model may not be a model for the negation of the formula in \({\mathscr {U}}\).

Moreover, we fix a hypothesis class \({\mathscr {H}}\) for invariants consisting of positive Boolean combinations of predicates from a fixed set \({\mathscr {P}}\). Note that considering only positive formulas over \({{\mathscr {P}}}\) is not a restriction in general because one can always add negations of predicates to \({{\mathscr {P}}}\), thus effectively synthesizing any Boolean combination of predicates (in negation normal form). The restriction to positive Boolean formulas is in fact desirable as it allows us to prevent invariants from negating certain predicates, which is useful when predicates have intuitionistic definitions (as several recursive definitions of heap properties do).

The invariant synthesis proceeds in rounds, where in each round the synthesizer proposes invariants in \(\mathscr {H}\). The verification engine generates verification conditions in accordance with these invariants in the underlying theory \({\mathscr {U}}\). It then translates them into the decidable theory \({\mathscr {D}}\) and gives them to a solver that decides their validity in \({\mathscr {D}}\). If the verification conditions are found to be \({\mathscr {D}}\)-valid, we have successfully proven the program correct by virtue of the fact that the verification engine reduced verification conditions in a sound fashion to \({\mathscr {D}}\).

However, if a verification condition is found not to be \({\mathscr {D}}\)-valid, the solver returns a \({\mathscr {D}}\)-model for its negation. The verification engine then extracts from this model certain non-provability information (NPI), expressed as Boolean formulas in a Boolean theory \({\mathscr {B}}\), which captures more general reasons why the verification failed (the rest of this section is devoted to developing this notion of non-provability information). This non-provability information is communicated to the synthesizer, which then proceeds to synthesize a new conjectured invariant that satisfies the non-provability constraints provided in all previous rounds.
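The round structure just described can be sketched as a generic loop. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `synthesize`, `consistent`, and `verify` are hypothetical, the learner is a naive enumerator over a finite candidate list, and `verify` stands in for the whole pipeline (VC generation in \({\mathscr {U}}\), translation to \({\mathscr {D}}\), solving, and NPI extraction), returning `None` on success and one non-provability constraint otherwise.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

H = TypeVar("H")  # candidate invariant (hypothesis)
C = TypeVar("C")  # non-provability constraint

def synthesize(candidates: Iterable[H],
               consistent: Callable[[H, List[C]], bool],
               verify: Callable[[H], Optional[C]]) -> Optional[H]:
    """Propose, verify, collect NPI, and repeat until success or exhaustion."""
    constraints: List[C] = []
    for h in candidates:
        # The learner may only propose conjectures that satisfy the
        # non-provability constraints returned in all previous rounds.
        if not consistent(h, constraints):
            continue
        npi = verify(h)  # None means all VCs were found D-valid
        if npi is None:
            return h
        constraints.append(npi)
    return None  # finite hypothesis space exhausted
```

Because constraints only accumulate and a conjunctive consistency check can only become harder to satisfy, a candidate skipped once never becomes consistent later, so a single pass over a finite candidate list suffices in this sketch.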

In order for the verification engine to extract meaningful non-provability information, we make the following natural assumption, called normality, which essentially states that the engine can do at least some minimal Boolean reasoning (if a Hoare triple is not provable, then Boolean weakenings of the precondition and Boolean strengthenings of the post-condition must also be unprovable):

Definition 1

A verification engine is normal if it satisfies two properties:

  1. If the engine cannot prove the validity of the Hoare triple \(\{ \alpha \} s \{ \gamma \}\) and \({}\vdash _{\mathscr {B}}\delta \Rightarrow \gamma \), then it cannot prove the validity of the Hoare triple \(\{ \alpha \} s \{ \delta \}\).

  2. If the engine cannot prove the validity of the Hoare triple \(\{ \gamma \} s \{ \beta \}\) and \({}\vdash _{\mathscr {B}}\gamma \Rightarrow \delta \), then it cannot prove the validity of the Hoare triple \(\{ \delta \} s \{ \beta \}\).

Throughout this section, we use a running example to illustrate the components of our framework. Let us begin by introducing this example and specific logics \(\mathscr {U}\), \({\mathscr {D}}\), and \(\mathscr {B}\).

Fig. 2 Synthesizing invariants for the program that constructs an inverse B of a bijective (i.e., injective and surjective) mapping A, taken from the Verified Software Competition [37]

Example 1

Consider the program in Fig. 2. This program is taken from the Verified Software Competition [37] and computes the inverse B of a bijective (i.e., injective and surjective) mapping A. Note that the post-condition of this program expresses that the mapping B is injective. The program appears under the name “inverse” in Sect. 4.

To prove this program correct, one needs to specify adequate invariants at the loop header and before the return statement of the function inverse. We wish to synthesize these invariants.

For simplicity, let us assume we are provided with a small set \({\mathscr {P}} = \{ p_1, p_2, p_3\}\) of predicates that serve as the basic building blocks for the invariants to be synthesized: \(p_1\) at the loop header and \(p_2\), \(p_3\) before the return statement. Therefore, the task is to synthesize adequate invariants for this program over the predicates \({\mathscr {P}}\) (see Footnote 2).

For this specific verification task, we choose

  • the universally quantified theory of arrays as undecidable theory \(\mathscr {U}\).

Moreover, we assume a verification engine that soundly reduces verification conditions in \(\mathscr {U}\) to

  • the decidable theory \({\mathscr {D}}\) of linear arithmetic and uninterpreted functions by means of bounded quantifier instantiation.

Finally, we fix

  • propositional logic as the Boolean theory \(\mathscr {B}\).

Note that the constant Boolean function inImage is crucially required to validate certain verification conditions by triggering appropriate quantifier instantiations in the surjectivity condition. Moreover, note that a verification engine using the theories \({\mathscr {U}}\) and \({\mathscr {D}}\) as above is normal. \(\square \)

In Sect. 2.1, we now develop an appropriate language to communicate non-provability constraints, which allow the learner to appropriately weaken or strengthen a future hypothesis. It turns out that pure conjunctions and pure disjunctions over \({{\mathscr {P}}}\), which we term CD-NPI constraints (conjunctive/disjunctive non-provability information constraints), are sufficient for this purpose. We also describe concretely how the verification engine can extract this non-provability information from \({\mathscr {D}}\)-models that witness that negations of VCs are satisfiable. Then, in Sect. 2.2, we show how to build learners for CD-NPI constraints by reducing this learning problem to another, well-studied learning framework for invariants called ICE learning. Finally, Sect. 2.3 argues the correctness of our framework, while Sect. 2.4 discusses its limitations.

2.1 Conjunctive/Disjunctive Non-provability Information

We assume that the underlying decidable theory \({\mathscr {D}}\) is stronger than the propositional theory \({\mathscr {B}}\), meaning that every valid statement in \({\mathscr {B}}\) is valid in \({\mathscr {D}}\) as well. The reader may want to keep the following as a running example where \({\mathscr {D}}\) is, say, the decidable theory of uninterpreted functions and linear arithmetic. In this setting, a formula is \({\mathscr {B}}\)-valid if, when treating atomic formulas as Boolean variables, the formula is propositionally valid. For instance, \(f(x)=y \Rightarrow f(f(x))=f(y)\) is not \({\mathscr {B}}\)-valid though it is \({\mathscr {D}}\)-valid, while \(f(x)=y \vee \lnot (f(x)=y)\) is \({\mathscr {B}}\)-valid.
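To make the distinction between \({\mathscr {B}}\)-validity and \({\mathscr {D}}\)-validity concrete, the following sketch checks \({\mathscr {B}}\)-validity by abstracting every atomic formula to a Boolean variable and enumerating truth assignments. The tuple encoding of formulas and the names `atoms`, `evaluate`, and `b_valid` are our own illustrative choices. Atoms are opaque strings, so the congruence reasoning that makes the first formula \({\mathscr {D}}\)-valid is deliberately unavailable.

```python
from itertools import product

# Formulas are nested tuples ("not", f), ("and", ...), ("or", ...),
# ("implies", a, b); atoms are opaque strings such as "f(x)=y" that
# the Boolean theory B treats as propositional variables.
def atoms(f):
    if isinstance(f, str):
        return {f}
    return set().union(*(atoms(g) for g in f[1:]))

def evaluate(f, val):
    if isinstance(f, str):
        return val[f]
    op, *args = f
    if op == "not":
        return not evaluate(args[0], val)
    if op == "and":
        return all(evaluate(g, val) for g in args)
    if op == "or":
        return any(evaluate(g, val) for g in args)
    if op == "implies":
        return (not evaluate(args[0], val)) or evaluate(args[1], val)
    raise ValueError(f"unknown connective: {op}")

def b_valid(f):
    """True iff f is propositionally valid with atoms read as Boolean variables."""
    names = sorted(atoms(f))
    return all(evaluate(f, dict(zip(names, bits)))
               for bits in product([False, True], repeat=len(names)))

# D-valid by congruence, but the two atoms are unrelated in B.
print(b_valid(("implies", "f(x)=y", "f(f(x))=f(y)")))  # False
print(b_valid(("or", "f(x)=y", ("not", "f(x)=y"))))     # True
```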

To formally define CD-NPI constraints and their extraction from a failed verification attempt, let us first introduce the following notation. For any \({\mathscr {U}}\)-formula \(\varphi \), let \({{ approx}}(\varphi )\) denote the \({\mathscr {D}}\)-formula that the verification engine generates such that the \({\mathscr {D}}\)-validity of \({{ approx}}(\varphi )\) implies the \({\mathscr {U}}\)-validity of \(\varphi \). Moreover, for any Hoare triple of the form \(\{\alpha \} s \{\beta \}\), let \(VC(\{\alpha \} s \{\beta \})\) denote the verification condition in \({\mathscr {U}}\) corresponding to the Hoare triple that the verification engine generates.

For the sake of a simpler exposition, let us assume that

  1. the program has a single annotation hole A where we need to synthesize an inductive invariant to prove the program correct; and

  2. every snippet s of the program for which a verification condition is generated has been augmented with a set of ghost variables \(g_1, \ldots , g_n\) that track the predicates \(p_1, \ldots , p_n\) over which the learner synthesizes the invariant (i.e., these ghost variables are assigned the values of the predicates).

As a shorthand notation, we denote the values of the ghost variables before the execution of the snippet s by \(\mathbf {v}= \langle v_1, \ldots , v_n\rangle \) and their values after the execution of s by \(\mathbf {v}' = \langle v'_1, \ldots , v'_n \rangle \).

Suppose now that the learner conjectures an annotation \(\gamma \) as an inductive invariant for the annotation hole A, and the verification engine fails to prove the verification condition corresponding to a Hoare triple \(\{\alpha \} s \{\beta \}\), where \(\alpha \), \(\beta \), or both could involve the synthesized annotation. This means that the negation of \({{ approx}}( VC( \{ \alpha \} s \{ \beta \} ) )\) is \({\mathscr {D}}\)-satisfiable, and the verification engine needs to extract non-provability information from a model of it. To this end, the verification engine first extracts the values \(\mathbf {v}\) and \(\mathbf {v}'\) of the ghost variables before and after the execution of s. Then, it generates one of three different types of non-provability information, depending on where the annotation appears in the Hoare triple \(\{\alpha \} s \{\beta \}\) (in \(\alpha \), in \(\beta \), or in both). We now handle the three cases individually.

  • Assume the verification of a Hoare triple of the form \(\{ \alpha \} s \{ \gamma \}\) fails (i.e., the verification engine cannot prove a verification condition where the pre-condition \(\alpha \) is a user-supplied annotation and the post-condition is the synthesized annotation \(\gamma \)). Then, \({{ approx}}( VC( \{ \alpha \} s \{ \gamma \} ) )\) is not \({\mathscr {D}}\)-valid, and the decision procedure for \({\mathscr {D}}\) would generate a model for its negation.

    Let \(P_{false} = \{ p_i \mid v_i' = {false}\}\) be the set of predicates mapped to false by \(\mathbf {v}'\). Since \(\gamma \) is a positive Boolean combination of predicates, these predicates are the reason why \(\mathbf {v}'\) does not satisfy \(\gamma \): by monotonicity, no valuation that maps all predicates in \(P_{false}\) to false satisfies \(\gamma \), and hence \(\vdash _{\mathscr {B}} \gamma \Rightarrow \bigvee P_{false}\). Intuitively, this means that the \({\mathscr {D}}\)-solver is not able to prove the predicates in \(P_{false}\). In other words, \(\{ \alpha \} s \{ \bigvee P_{false} \}\) is unprovable (a witness to this fact is the model of the negation of \({{ approx}}( VC( \{ \alpha \} s \{ \gamma \} ) )\) from which the values \(\mathbf {v}'\) are derived). Note that any invariant \(\gamma '\) that is stronger than \(\bigvee P_{false}\) will result in an unprovable verification condition due to the verification engine being normal. Consequently, we can choose \(\chi =\bigvee P_{false}\) as the weakening constraint, demanding that future invariants should not be stronger than \(\chi \).

    The verification engine now communicates \(\chi \) to the synthesizer, asking it never to conjecture in future rounds invariants \(\gamma ''\) that are stronger than \(\chi \) (i.e., future conjectures must satisfy \(\not \vdash _{{\mathscr {B}}} \gamma '' \Rightarrow \chi \)).

  • The next case is when a Hoare triple of the form \(\{ \gamma \} s \{ \beta \}\) fails to be proven (i.e., the verification engine cannot prove a verification condition where the post-condition \(\beta \) is a user-supplied annotation and the pre-condition is the synthesized annotation \(\gamma \)). By an argument similar to the one above, the conjunction \(\eta = \bigwedge \{ p_i \mid v_i = {true} \} \) of the predicates mapped to true by \(\mathbf {v}\) in the corresponding \({\mathscr {D}}\)-model gives a stronger precondition \(\eta \) such that \(\{ \eta \} s \{ \beta \}\) is not provable. Hence, \(\eta \) is a valid strengthening constraint. The verification engine now communicates \(\eta \) to the synthesizer, asking it never to conjecture in future rounds invariants \(\gamma ''\) that are weaker than \(\eta \) (i.e., future conjectures must satisfy \(\not \vdash _{{\mathscr {B}}} \eta \Rightarrow \gamma ''\)).

  • Finally, consider the case when the Hoare triple is of the form \(\{ \gamma \} s \{ \gamma \}\) and fails to be proven (i.e., the verification engine cannot prove a verification condition where both the pre- and the post-condition are the synthesized annotation \(\gamma \)). In this case, the verification engine can offer advice on how \(\gamma \) can be strengthened or weakened to avoid this model. Analogous to the two cases above, the verification engine extracts a pair of formulas \(( \eta , \chi )\), called an inductivity constraint, based on the predicates mapped to true by \(\mathbf {v}\) and to false by \(\mathbf {v}'\). The meaning of such a constraint is that the invariant synthesizer must conjecture in future rounds invariants \(\gamma ''\) such that either \(\not \vdash _{{\mathscr {B}}} \eta \Rightarrow \gamma '' \) or \(\not \vdash _{{\mathscr {B}}} \gamma '' \Rightarrow \chi \) holds.
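The key step behind these cases is the monotonicity of positive formulas: if \(\mathbf {v}'\) falsifies \(\gamma \), then \(\vdash _{\mathscr {B}} \gamma \Rightarrow \bigvee P_{false}\), and if \(\mathbf {v}\) satisfies \(\gamma \), then \(\vdash _{\mathscr {B}} \bigwedge P_{true} \Rightarrow \gamma \), where \(P_{true}\) denotes the predicates mapped to true. The sketch below checks both facts exhaustively by truth table for sample positive formulas over three predicates. It is a sanity check of the argument, not a proof, and the helper names are hypothetical. A consequence worth noting is that a failed conjecture \(\gamma \) always violates the constraint extracted from it, so the learner cannot propose it again.

```python
from itertools import product

PREDS = ["p1", "p2", "p3"]

def holds(f, val):
    """Evaluate a positive formula: an atom, ('and', ...) or ('or', ...)."""
    if isinstance(f, str):
        return val[f]
    op, *args = f
    if op == "and":
        return all(holds(g, val) for g in args)
    if op == "or":
        return any(holds(g, val) for g in args)
    raise ValueError(f"not a positive connective: {op}")

def valuations():
    for bits in product([False, True], repeat=len(PREDS)):
        yield dict(zip(PREDS, bits))

def b_entails(lhs, rhs):
    """Truth-table check that lhs => rhs is B-valid (lhs, rhs: valuation -> bool)."""
    return all(rhs(u) for u in valuations() if lhs(u))

def check_constraints_sound(gamma):
    """For every valuation v, the extracted constraint is implied as claimed."""
    for v in valuations():
        p_true = [p for p in PREDS if v[p]]
        p_false = [p for p in PREDS if not v[p]]
        if not holds(gamma, v):
            # v plays the role of v' in a failed {alpha} s {gamma}: gamma => OR P_false
            assert b_entails(lambda u: holds(gamma, u),
                             lambda u: any(u[p] for p in p_false))
        else:
            # v plays the role of v in a failed {gamma} s {beta}: AND P_true => gamma
            assert b_entails(lambda u: all(u[p] for p in p_true),
                             lambda u: holds(gamma, u))
    return True

print(check_constraints_sound(("or", ("and", "p1", "p2"), "p3")))  # True
```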

This leads to the following scheme, where \(\gamma \) denotes the conjectured invariant:

  • When a Hoare triple of the form \(\{ \alpha \} s \{ \gamma \}\) fails, the verification engine returns the \({\mathscr {B}}\)-formula \(\bigvee _{i \mid v_i' = {false}} p_i\) as a weakening constraint.

  • When a Hoare triple of the form \(\{ \gamma \} s \{ \beta \}\) fails, the verification engine returns the \({\mathscr {B}}\)-formula \(\bigwedge _{i \mid v_i = {true}} p_i\) as a strengthening constraint.

  • When a Hoare triple of the form \(\{ \gamma \} s \{ \gamma \}\) fails, the verification engine returns the pair \((\bigwedge _{i \mid v_i = {true}} p_i, \bigvee _{i \mid v'_i = {false}} p_i)\) of \({\mathscr {B}}\)-formulas as an inductivity constraint.
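The scheme above translates almost directly into code. The sketch below assumes that the ghost-variable values \(\mathbf {v}\) and \(\mathbf {v}'\) have already been read off the \({\mathscr {D}}\)-model and keyed by predicate name; a conjunction or disjunction over \({\mathscr {P}}\) is represented simply by the set of its predicates, and the function names are hypothetical. Under this representation, an empty disjunction denotes false and an empty conjunction denotes true.

```python
from typing import Dict, FrozenSet, Tuple

Valuation = Dict[str, bool]  # predicate p_i -> value of its ghost variable g_i

def weakening(v_post: Valuation) -> FrozenSet[str]:
    """{alpha} s {gamma} failed: disjunction of the predicates false after s."""
    return frozenset(p for p, val in v_post.items() if not val)

def strengthening(v_pre: Valuation) -> FrozenSet[str]:
    """{gamma} s {beta} failed: conjunction of the predicates true before s."""
    return frozenset(p for p, val in v_pre.items() if val)

def inductivity(v_pre: Valuation,
                v_post: Valuation) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """{gamma} s {gamma} failed: the (conjunction, disjunction) pair."""
    return strengthening(v_pre), weakening(v_post)
```

For example, `inductivity({"p1": True, "p2": False}, {"p1": False, "p2": True})` yields the pair `(frozenset({"p1"}), frozenset({"p1"}))`, i.e., the constraint \((p_1, p_1)\).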

It is not hard to verify that the above formulas are proper strengthening and weakening constraints in the sense that any invariant with which the verification engine can prove the program correct must satisfy these constraints. This motivates the following form of non-provability information.

Definition 2

(CD-NPI Samples) Let \({\mathscr {P}}\) be a set of predicates. A CD-NPI sample (short for conjunction-disjunction-NPI sample) is a triple \(\mathfrak S = (W, S, I)\) consisting of

  • a finite set W of disjunctions over \({\mathscr {P}}\) (weakening constraints);

  • a finite set S of conjunctions over \({\mathscr {P}}\) (strengthening constraints); and

  • a finite set I of pairs, where the first element is a conjunction and the second is a disjunction over \({\mathscr {P}}\) (inductivity constraints).

An annotation \(\gamma \) is consistent with a CD-NPI sample \(\mathfrak S = (W, S, I)\) if \(\not \vdash _{\mathscr {B}}\gamma \Rightarrow \chi \) for each \(\chi \in W\), \(\not \vdash _{\mathscr {B}}\eta \Rightarrow \gamma \) for each \(\eta \in S\), and \(\not \vdash _{\mathscr {B}}\eta \Rightarrow \gamma \) or \(\not \vdash _{\mathscr {B}}\gamma \Rightarrow \chi \) for each \((\eta , \chi ) \in I\).
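For a finite \({\mathscr {P}}\), consistency with a CD-NPI sample can be checked directly from Definition 2 by deciding each \({\mathscr {B}}\)-implication with a truth table. The sketch below does exactly that, using the set representation of conjunctions and disjunctions and a tuple encoding of positive formulas. It is exponential in \(|{\mathscr {P}}|\) and meant only to illustrate the definition, not as an efficient learner; all names are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Sequence, Tuple

Conj = FrozenSet[str]  # conjunction over P, given by its set of predicates
Disj = FrozenSet[str]  # disjunction over P, given by its set of predicates

def holds(f, val) -> bool:
    """Evaluate a positive formula: an atom, ('and', ...) or ('or', ...)."""
    if isinstance(f, str):
        return val[f]
    op, *args = f
    if op == "and":
        return all(holds(g, val) for g in args)
    if op == "or":
        return any(holds(g, val) for g in args)
    raise ValueError(f"not a positive connective: {op}")

def b_valid_implication(lhs, rhs, preds: Sequence[str]) -> bool:
    """Truth-table check that lhs => rhs is B-valid (lhs, rhs: valuation -> bool)."""
    for bits in product([False, True], repeat=len(preds)):
        u = dict(zip(preds, bits))
        if lhs(u) and not rhs(u):
            return False
    return True

def is_consistent(gamma, W: List[Disj], S: List[Conj],
                  I: List[Tuple[Conj, Disj]], preds: Sequence[str]) -> bool:
    def gam(u):
        return holds(gamma, u)

    def stronger_than(chi: Disj) -> bool:  # |-_B gamma => OR chi
        return b_valid_implication(gam, lambda u: any(u[p] for p in chi), preds)

    def weaker_than(eta: Conj) -> bool:    # |-_B AND eta => gamma
        return b_valid_implication(lambda u: all(u[p] for p in eta), gam, preds)

    return (not any(stronger_than(chi) for chi in W)
            and not any(weaker_than(eta) for eta in S)
            and all(not weaker_than(eta) or not stronger_than(chi)
                    for eta, chi in I))

P = ["p1", "p2", "p3"]
W, S = [frozenset({"p2"})], [frozenset({"p1"})]
print(is_consistent("p2", W, S, [], P))                # False: p2 => p2 is valid
print(is_consistent(("or", "p2", "p3"), W, S, [], P))  # True
```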

A CD-NPI learner is an effective procedure that synthesizes, given a CD-NPI sample, an annotation \(\gamma \) consistent with the sample. In our framework, the process of proposing candidate annotations and checking them repeats until the learner proposes a valid annotation or detects that no valid annotation exists (e.g., if the class of candidate annotations is finite and all annotations are exhausted). We comment on using a CD-NPI learner in this iterative fashion in the next section.

Example 2

Let us continue Example 1 and assume that the learner conjectures \(\gamma _L = p_1\) as the loop invariant and \(\gamma _R = p_2 \wedge p_3\) as the invariant at the return statement. Moreover, suppose that the verification condition \(VC( \{ \gamma _L \} s \{ \gamma _R \} )\) along the path from the loop exit to the return statement, though valid in the undecidable theory \({\mathscr {U}}\), is not provable in the decidable theory \({\mathscr {D}}\) (i.e., \({{ approx}}( VC( \{ \gamma _L \} s \{ \gamma _R \} ) )\) is not \({\mathscr {D}}\)-valid).

In this situation, the \({\mathscr {D}}\)-solver returns a model \(\mathscr {M}\) for the negation of the approximated verification condition \({{ approx}}( VC( \{ \gamma _L \} s \{ \gamma _R \} ) )\). The verification engine now inspects this model and extracts the values of the predicates \(p_1\), \(p_2\), and \(p_3\). As explained above, we assume that the program is equipped with ghost variables that track the values of all predicates (in our case \(g_1\), \(g_2\), and \(g_3\)), and a verification engine can simply extract the value of a predicate from the value of the corresponding ghost variable in the model.

For this particular verification condition, let us assume that \(\mathscr {M}(g_1) = {true}\), \(\mathscr {M}(g_2) = {false}\), and \(\mathscr {M}(g_3) = {true}\), indicating that \(p_1\) and \(p_3\) hold in \(\mathscr {M}\), while \(p_2\) does not hold. From this information, the verification engine constructs a pair of formulas \(( \eta , \chi )\) with \(\eta = p_1\) and \(\chi = p_2\), which it communicates as an inductivity constraint to the learner. Intuitively, this constraint means that the verification condition obtained by substituting \(\gamma _L\) with \(\eta \) and \(\gamma _R\) with \(\chi \) is itself not provable. In subsequent rounds, the learner thus needs to conjecture only invariants for which \(\gamma _L\) is not weaker than \(\eta \) (i.e., \(\not \vdash _{\mathscr {B}}p_1 \Rightarrow \gamma _L\)) or \(\gamma _R\) is not stronger than \(\chi \) (i.e., \(\not \vdash _{\mathscr {B}}\gamma _R \Rightarrow p_2\)). Note that both \(\eta \) and \(\chi \) are formulas in the logic \(\mathscr {B}\) since we assume a uniform signature for all logics (i.e., \(p_1, p_2\), and \(p_3\) are seen as propositional variables in \(\mathscr {B}\)). Moreover, note that the formula \(\eta \) is a conjunction with a single conjunct, whereas the formula \(\chi \) is a disjunction with a single disjunct. \(\square \)
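Example 2 can be replayed mechanically. The sketch below assumes the model values stated above, reads off \(\eta \) from the pre-state predicate \(p_1\) and \(\chi \) from the post-state predicates \(p_2, p_3\), and then tests candidate pairs \((\gamma _L, \gamma _R)\) against the resulting inductivity constraint. The helper names and the tuple encoding of formulas are illustrative assumptions.

```python
from itertools import product

def holds(f, val):
    """Evaluate a positive formula: an atom, ('and', ...) or ('or', ...)."""
    if isinstance(f, str):
        return val[f]
    op, *args = f
    if op == "and":
        return all(holds(g, val) for g in args)
    if op == "or":
        return any(holds(g, val) for g in args)
    raise ValueError(f"not a positive connective: {op}")

def b_implies(lhs, rhs, preds):
    """Truth-table check that the formula lhs => rhs is B-valid over preds."""
    for bits in product([False, True], repeat=len(preds)):
        u = dict(zip(preds, bits))
        if holds(lhs, u) and not holds(rhs, u):
            return False
    return True

# Ghost-variable values in the D-model M assumed in Example 2.
model = {"g1": True, "g2": False, "g3": True}
tracks = {"g1": "p1", "g2": "p2", "g3": "p3"}
PRE, POST = ["p1"], ["p2", "p3"]  # predicates of gamma_L and gamma_R

eta = ("and", *[tracks[g] for g in sorted(model) if tracks[g] in PRE and model[g]])
chi = ("or", *[tracks[g] for g in sorted(model) if tracks[g] in POST and not model[g]])
# eta == ("and", "p1") and chi == ("or", "p2")

def allowed(gamma_l, gamma_r):
    """Inductivity constraint: not (eta => gamma_L) or not (gamma_R => chi)."""
    return not b_implies(eta, gamma_l, PRE) or not b_implies(gamma_r, chi, POST)

print(allowed("p1", ("and", "p2", "p3")))  # False: the current conjecture is excluded
print(allowed("p1", "p3"))                 # True
```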

2.2 Building CD-NPI Learners

Let us now turn to the problem of building efficient learning algorithms for CD-NPI constraints. To this end, we assume that the set of predicates \({{\mathscr {P}}}\) is finite.

Roughly speaking, the CD-NPI learning problem is to synthesize annotations that are positive Boolean combinations of predicates in \({\mathscr {P}}\) and that are consistent with a given CD-NPI sample. Though this is a learning problem where samples are formulas, in this section we reduce CD-NPI learning to a learning problem from data. In particular, we show that CD-NPI learning reduces to the ICE learning framework for learning positive Boolean formulas. The latter is a well-studied framework, and the reduction allows us to use efficient learning algorithms developed for ICE learning in order to build CD-NPI learners.

We first recap the ICE learning framework and then reduce CD-NPI learning to ICE learning. Finally, we briefly sketch how the popular Houdini algorithm can be seen as an ICE learning algorithm, which, in turn, allows us to use Houdini as a CD-NPI learning algorithm.

2.2.1 The ICE Learning Framework

Although the ICE learning framework  [26] is a general framework for learning inductive invariants, we here consider the case of learning Boolean formulas. To this end, let us fix a set B of Boolean variables. Moreover, let \({\mathscr {H}}\) be a subclass of positive Boolean formulas over B (i.e., Boolean combinations of variables from B without negation). This class, called the hypothesis class, specifies the admissible solutions to the learning task.

The objective of the (passive) ICE learning algorithm is to learn a formula in \({\mathscr {H}}\) from positive examples, negative examples, and implication examples. More formally, if \(\mathscr {V}\) is the set of valuations \(v :B \rightarrow \{{true}, {false}\}\) (mapping variables in B to true or false), then an ICE sample is a triple \(\mathscr {S} = (S_+, S_-, S_\Rightarrow )\) where \(S_+ \subseteq \mathscr {V}\) is a set of positive examples, \(S_- \subseteq \mathscr {V}\) is a set of negative examples, and \(S_\Rightarrow \subseteq \mathscr {V} \times \mathscr {V}\) is a set of implications. Note that positive and negative examples are concrete valuations of the variables in B, and the implication examples are pairs of such concrete valuations.

A formula \(\varphi \) is said to be consistent with an ICE sample \(\mathscr {S}\) if it satisfies the following three conditions (Footnote 3):

  • \(v \models \varphi \) for each \(v \in S_+\);

  • \(v \not \models \varphi \) for each \(v \in S_-\); and

  • \(v_1 \models \varphi \) implies \(v_2 \models \varphi \), for each \((v_1, v_2) \in S_\Rightarrow \).
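The three consistency conditions can be checked mechanically. The Python sketch below is illustrative only and rests on assumptions of our own: valuations are tuples of 0/1 values indexed by the variables in B, a formula is represented by its satisfaction relation (a function from valuations to Booleans), and the name `is_consistent` is hypothetical.

```python
from typing import Callable, Iterable, Tuple

Valuation = Tuple[int, ...]            # v[i] == 1 iff the i-th variable of B is true
Formula = Callable[[Valuation], bool]  # a formula, given by its satisfaction relation


def is_consistent(phi: Formula,
                  positives: Iterable[Valuation],
                  negatives: Iterable[Valuation],
                  implications: Iterable[Tuple[Valuation, Valuation]]) -> bool:
    """Return True iff phi is consistent with the ICE sample (S+, S-, S=>)."""
    return (all(phi(v) for v in positives)               # v |= phi for every v in S+
            and not any(phi(v) for v in negatives)       # v does not satisfy phi for v in S-
            and all(not phi(v1) or phi(v2)               # v1 |= phi implies v2 |= phi
                    for v1, v2 in implications))


# Example over B = {b1, b2, b3} with phi = b1 AND b3.
phi = lambda v: bool(v[0] and v[2])
print(is_consistent(phi, [(1, 0, 1)], [(1, 1, 0)], [((1, 0, 0), (1, 0, 1))]))  # True
```

The implication in the example is satisfied vacuously, because its antecedent \((1, 0, 0)\) does not satisfy \(b_1 \wedge b_3\).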

In algorithmic learning theory, one distinguishes between passive learning and iterative learning. The former refers to a learning setting in which a learning algorithm is confronted with a finite set of data and has to learn a concept that is consistent with this data. Using our terminology, the passive ICE learning problem for a hypothesis class \({\mathscr {H}}\) is then

“given an ICE sample \(\mathscr {S}\), find a formula in \({\mathscr {H}}\) that is consistent with \(\mathscr {S}\)”.

Recall that we here require the learning algorithm to learn positive Boolean formulas, which is stricter than the original ICE framework  [26].

Iterative learning, on the other hand, is the iteration of passive learning where new data is added to the sample from one iteration to the next. In a verification context, this new data is generated by the verification engine in response to incorrect annotations and used to guide the learning algorithm towards an annotation that is adequate to prove the program. To reduce our learning framework to ICE learning, it is therefore sufficient to reduce the (passive) CD-NPI learning problem described above to the passive ICE learning problem. We do this next.

2.2.2 Reduction of Passive CD-NPI Learning to Passive ICE Learning

Let \({\mathscr {H}}\) be a subclass of positive Boolean formulas. We reduce the CD-NPI learning problem for \({\mathscr {H}}\) to the ICE learning problem for \({\mathscr {H}}\). The main idea is to (a) treat each predicate \(p \in {\mathscr {P}}\) as a Boolean variable for the purpose of ICE learning and (b) translate a CD-NPI sample \(\mathfrak S\) into an equi-consistent ICE sample \(\mathscr {S}_\mathfrak S\), meaning that a positive Boolean formula is consistent with \(\mathfrak S\) if and only if it is consistent with \(\mathscr {S}_\mathfrak S\). Then, learning a consistent formula in the CD-NPI framework reduces to learning a consistent formula in the ICE learning framework.

The following lemma will help us translate between the two frameworks. Its proof is straightforward and rests on the following observation about any positive formula \(\alpha \): if a valuation v sets to true every variable that \(v'\) sets to true (and possibly more) and \(v' \models \alpha \), then \(v \models \alpha \) holds as well.

Lemma 1

Let v be a valuation of \({\mathscr {P}}\) and \(\alpha \) be a positive Boolean formula over \({\mathscr {P}}\). Then, the following holds:

  • \(v \models \alpha \) if and only if \({}\vdash _{\mathscr {B}}(\bigwedge _{p \mid v(p)={true}} p ) \Rightarrow \alpha \) (and, therefore, \(v \not \models \alpha \) if and only if \({}\not \vdash _{\mathscr {B}}(\bigwedge _{p \mid v(p)={true}} p ) \Rightarrow \alpha )\).

  • \(v \models \alpha \) if and only if \({}\not \vdash _{\mathscr {B}}\alpha \Rightarrow (\bigvee _{p \mid v(p)={false}} p )\).
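For small predicate sets, Lemma 1 can be sanity-checked by brute force. The sketch below is not part of the formal development; it assumes a hypothetical nested-tuple encoding of positive formulas and decides validity in \(\mathscr {B}\) by enumerating all valuations.

```python
from itertools import product

# Positive formulas over p_0..p_{n-1} (hypothetical encoding):
#   ("var", i) | ("and", f, g) | ("or", f, g)
def holds(f, v):
    if f[0] == "var":
        return bool(v[f[1]])
    if f[0] == "and":
        return holds(f[1], v) and holds(f[2], v)
    return holds(f[1], v) or holds(f[2], v)   # f[0] == "or"


def valid_implication(lhs, rhs, n):
    """Decide |-_B lhs => rhs by checking every valuation of the n predicates."""
    return all(rhs(w) for w in product((0, 1), repeat=n) if lhs(w))


def check_lemma1(alpha, n):
    """Check both items of Lemma 1 for every valuation v of the n predicates."""
    a = lambda w: holds(alpha, w)
    for v in product((0, 1), repeat=n):
        true_ps = [i for i in range(n) if v[i]]
        false_ps = [i for i in range(n) if not v[i]]
        conj = lambda w: all(w[i] for i in true_ps)    # empty conjunction = true
        disj = lambda w: any(w[i] for i in false_ps)   # empty disjunction = false
        if holds(alpha, v) != valid_implication(conj, a, n):          # first item
            return False
        if holds(alpha, v) != (not valid_implication(a, disj, n)):    # second item
            return False
    return True


alpha = ("or", ("and", ("var", 0), ("var", 1)), ("var", 2))   # (p_0 AND p_1) OR p_2
print(check_lemma1(alpha, 3))  # True
```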

This motivates our translation, which relies on two functions, c and d. The function c translates a conjunction \(\bigwedge J\), where \(J \subseteq {\mathscr {P}}\), into the valuation

$$\begin{aligned} c \bigl ( \bigwedge J \bigr ) = v \text { with } v(p) = {true} \text { if and only if } p \in J. \end{aligned}$$

The function d, on the other hand, translates a disjunction \(\bigvee J\), where \(J \subseteq {\mathscr {P}}\) is a subset of propositions, into the valuation

$$\begin{aligned} d \bigl ( \bigvee J \bigr ) = v \text { with } v(p) = {false} \text { if and only if } p \in J. \end{aligned}$$
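The functions c and d are straightforward to compute once the predicates in \({\mathscr {P}}\) are listed in a fixed order. The Python sketch below assumes that a conjunction or disjunction is given by its set J of predicate names and that valuations are 0/1 vectors in the order of a tuple `preds`; both encodings are our own choices.

```python
from typing import AbstractSet, Tuple


def c(J: AbstractSet[str], preds: Tuple[str, ...]) -> Tuple[int, ...]:
    """Valuation for the conjunction of J: p is true (1) iff p is in J."""
    return tuple(1 if p in J else 0 for p in preds)


def d(J: AbstractSet[str], preds: Tuple[str, ...]) -> Tuple[int, ...]:
    """Valuation for the disjunction of J: p is false (0) iff p is in J."""
    return tuple(0 if p in J else 1 for p in preds)


P = ("p1", "p2", "p3")
print(c({"p1"}, P))  # (1, 0, 0)
print(d({"p2"}, P))  # (1, 0, 1)
```

With \({\mathscr {P}} = \{p_1, p_2, p_3\}\) in this order, the two calls yield the valuations \(c(p_1)\) and \(d(p_2)\) used in Example 3.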

By substituting v in Lemma 1 with \(c(\bigwedge J)\) and \(d(\bigvee J)\), respectively, one immediately obtains the following result.

Lemma 2

Let \(J \subseteq {\mathscr {P}}\) and \(\alpha \) be a positive Boolean formula over \({\mathscr {P}}\). Then, the following holds:

  • \(c \bigl ( \bigwedge J \bigr ) \models \alpha \) if and only if \({}\vdash _{\mathscr {B}}\bigwedge J \Rightarrow \alpha \) (and, therefore, \(c \bigl ( \bigwedge J \bigr ) \not \models \alpha \) if and only if \({}\not \vdash _{\mathscr {B}}\bigwedge J \Rightarrow \alpha \)).

  • \(d \bigl ( \bigvee J \bigr ) \models \alpha \) if and only if \({}\not \vdash _{\mathscr {B}}\alpha \Rightarrow \bigvee J\).

Based on the functions c and d, the translation of a CD-NPI sample into an equi-consistent ICE sample is as follows.

Definition 3

Given a CD-NPI sample \(\mathfrak S = (W, S, I)\), the ICE sample \(\mathscr {S}_\mathfrak S = (S_+, S_-, S_\Rightarrow )\) is defined by

  • \(S_+ = \bigl \{ d(\bigvee J) \mid \bigvee J \in W \bigr \}\);

  • \(S_- = \bigl \{ c(\bigwedge J) \mid \bigwedge J \in S \bigr \}\); and

  • \(S_\Rightarrow = \bigl \{ \bigl ( c(\bigwedge J_1), d(\bigvee J_2) \bigr ) \mid (\bigwedge J_1, \bigvee J_2) \in I \bigr \}\).
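Definition 3 is a direct, constraint-by-constraint translation. A minimal Python sketch, under an illustrative encoding of our own (each conjunction or disjunction given by its set J of predicate names, valuations as 0/1 vectors over an ordered tuple of predicates, and a hypothetical function name `to_ice`), might look as follows.

```python
def c(J, preds):
    """Valuation for the conjunction of J: p is true (1) iff p is in J."""
    return tuple(1 if p in J else 0 for p in preds)


def d(J, preds):
    """Valuation for the disjunction of J: p is false (0) iff p is in J."""
    return tuple(0 if p in J else 1 for p in preds)


def to_ice(W, S, I, preds):
    """Translate a CD-NPI sample (W, S, I) into the ICE sample of Definition 3.

    W: weakening constraints (disjunctions), S: strengthening constraints
    (conjunctions), I: inductivity constraints (conjunction, disjunction);
    every conjunction/disjunction is given by its set J of predicates.
    """
    s_pos = {d(J, preds) for J in W}                          # S+
    s_neg = {c(J, preds) for J in S}                          # S-
    s_imp = {(c(J1, preds), d(J2, preds)) for J1, J2 in I}    # S=>
    return s_pos, s_neg, s_imp


# The CD-NPI sample of Example 3: W = S = {} and I = {(p1, p2)}.
print(to_ice([], [], [({"p1"}, {"p2"})], ("p1", "p2", "p3")))
# (set(), set(), {((1, 0, 0), (1, 0, 1))})
```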

By virtue of Lemma 2, we can now establish the correctness of the reduction from the CD-NPI learning problem to the ICE learning problem as follows.

Theorem 1

Let \(\mathfrak S = (W, S, I)\) be a CD-NPI sample, \(\mathscr {S}_\mathfrak S = (S_+, S_-, S_\Rightarrow )\) the ICE sample as in Definition 3, and \(\gamma \) a positive Boolean formula over \({\mathscr {P}}\). Then, \(\gamma \) is consistent with \(\mathfrak S\) if and only if \(\gamma \) is consistent with \(\mathscr {S}_\mathfrak S\).

Proof

Let \(\mathfrak S = (W, S, I)\) be a CD-NPI sample, and let \(\mathscr {S}_\mathfrak S = (S_+, S_-, S_\Rightarrow )\) be the ICE sample as in Definition 3. Moreover, let \(\gamma \) be a positive Boolean formula. We prove Theorem 1 by considering each weakening, strengthening, and inductivity constraint individually, together with its corresponding positive, negative, or implication example.

  • Pick a weakening constraint \(\bigvee J \in W\), and let \(v \in S_+\) with \(v = d(\bigvee J)\) be the corresponding positive sample. Moreover, assume that \(\gamma \) is consistent with \(\mathfrak S\) and, thus, \(\not \vdash _{\mathscr {B}}\gamma \Rightarrow \bigvee J\). By Lemma 2, this is true if and only if \(d \bigl ( \bigvee J \bigr ) \models \gamma \). Hence, \(v \models \gamma \).

    Conversely, assume that \(\gamma \) is consistent with \(\mathscr {S}_\mathfrak S\). Thus, \(v \models \gamma \), which means \(d \bigl ( \bigvee J \bigr ) \models \gamma \). By Lemma 2, this is true if and only if \(\not \vdash _{\mathscr {B}}\gamma \Rightarrow \bigvee J\).

  • Pick a strengthening constraint \(\bigwedge J \in S\), and let \(v \in S_-\) with \(v = c(\bigwedge J)\) be the corresponding negative sample. Moreover, assume that \(\gamma \) is consistent with \(\mathfrak S\) and, thus, \(\not \vdash _{\mathscr {B}}\bigwedge J \Rightarrow \gamma \). By Lemma 2, this is true if and only if \(c \bigl ( \bigwedge J \bigr ) \not \models \gamma \). Hence, \(v \not \models \gamma \).

    Conversely, assume that \(\gamma \) is consistent with \(\mathscr {S}_\mathfrak S\). Thus, \(v \not \models \gamma \), which means \(c \bigl ( \bigwedge J \bigr ) \not \models \gamma \). By Lemma 2, this is true if and only if \(\not \vdash _{\mathscr {B}}\bigwedge J \Rightarrow \gamma \).

  • Following the definition of implication, we split the proof into two cases, depending on whether \(\not \vdash _{\mathscr {B}}\bigwedge J \Rightarrow \gamma \) or \(\not \vdash _{\mathscr {B}}\gamma \Rightarrow \bigvee J\) (and \(v_1 \not \models \gamma \) or \(v_2 \models \gamma \) for the reverse direction). However, the proof of the former case uses the same arguments as the proof for strengthening constraints, while the proof of the latter case uses the same arguments as the proof for weakening constraints. Hence, combining both proofs immediately yields the claim.\(\square \)

Let us illustrate the reduction of Definition 3 with an example.

Example 3

We continue Example 2 (on Page 10), in which the verification engine returned an inductivity constraint \(( \eta , \chi )\) with \(\eta = p_1\) and \(\chi = p_2\). The task of the learner is now to construct invariants where \(\gamma _L\) is not weaker than \(\eta \) (i.e., \(\not \vdash _{\mathscr {B}}p_1 \Rightarrow \gamma _L\)) or \(\gamma _R\) is not stronger than \(\chi \) (i.e., \(\not \vdash _{\mathscr {B}}\gamma _R \Rightarrow p_2\)). To simplify this example, let us assume that the CD-NPI sample \(\mathfrak S = (W, S, I)\) of the learner was initially empty and now only contains the returned inductivity constraint \((p_1, p_2)\) (i.e., \(W = S = \emptyset \) and \(I = \{ (p_1, p_2) \}\)).

As described above, our learner works by reducing the CD-NPI learning problem to ICE learning. More precisely, the learner translates its CD-NPI sample \(\mathfrak S = (W, S, I)\) into an equi-consistent ICE sample \(\mathscr {S}_\mathfrak S = (S_+, S_-, S_\Rightarrow )\), where the elements of the ICE sample are valuations \(v :{\mathscr {P}} \rightarrow \{ {true}, {false} \}\). For the sake of simplicity, we write these valuations as Boolean vectors of length three, whose i-th component is 1 if \(v(p_i) = {true}\) and 0 otherwise.

For the actual translation, the learner applies the functions c and d described above. Given the inductivity constraint \((p_1, p_2)\), it constructs the implication constraint

$$\begin{aligned} \bigl ( c(p_1), d(p_2) \bigr ) = \bigl ( (1, 0, 0), (1, 0, 1) \bigr ) \end{aligned}$$

and adds it to its ICE sample (recall that \(p_1\) is in fact a conjunction with a single conjunct, while \(p_2\) is a disjunction with a single disjunct). Thus, the learner obtains the ICE sample \(\mathscr {S}_\mathfrak S\) with \(S_+ = S_- = \emptyset \) and \(S_\Rightarrow = \bigl \{ \bigl ( (1, 0, 0), (1, 0, 1) \bigr ) \bigr \}\). Note that \(\mathfrak S\) and \(\mathscr {S}_\mathfrak S\) are indeed equi-consistent, as stated by Theorem 1. \(\square \)

2.2.3 ICE Learners for Boolean Formulas

The reduction above allows us to use any ICE learning algorithm in the literature that synthesizes positive Boolean formulas. As we have mentioned earlier, we can add negations of predicates as first-class predicates and, hence, synthesize invariants over the more general class of all Boolean combinations as well.

Passive ICE learning for a single round, that is, synthesizing a formula that is consistent with the ICE sample, can usually be done efficiently and in a variety of ways. However, the crucial aspect is not the complexity of learning in one round but the number of rounds it takes to converge to an adequate invariant that proves the program correct. When the set \({\mathscr {P}}\) of candidate predicates is large (hundreds in our experiments), building an effective learner is not easy, since the number of Boolean formulas over \({\mathscr {P}}\) is doubly exponential in \(n = |{\mathscr {P}}|\).
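To give a feel for this growth, the sketch below counts, by brute force over all \(2^{2^n}\) truth tables, the monotone Boolean functions over n variables. Exactly the non-constant ones are expressible as positive Boolean formulas (the constants cannot be written without constant symbols), so for \(n \ge 1\) the number of semantically distinct positive formulas is this count minus two. The enumeration is itself doubly exponential and is only meant as an illustration for very small n; the function name is our own.

```python
from itertools import product


def count_monotone(n: int) -> int:
    """Count monotone Boolean functions of n variables (constants included) by brute force."""
    points = list(product((0, 1), repeat=n))              # all 2^n valuations
    below = [(i, j) for i, u in enumerate(points)         # pairs with u <= w componentwise
             for j, w in enumerate(points)
             if all(a <= b for a, b in zip(u, w))]
    count = 0
    for table in product((0, 1), repeat=len(points)):     # all 2^(2^n) truth tables
        if all(table[i] <= table[j] for i, j in below):   # monotone: u <= w implies f(u) <= f(w)
            count += 1
    return count


print([count_monotone(n) for n in range(4)])  # [2, 3, 6, 20]
```

For \(n = 3\), this gives \(20 - 2 = 18\) semantically distinct positive formulas, whereas there are only \(2^3 = 8\) subsets of three predicates from which to form conjunctions.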

However, one class of formulas is particularly amenable to efficient ICE learning: conjunctions of predicates over \({\mathscr {P}}\). For this specific case, there exist ICE learning algorithms that are guaranteed to learn an invariant in a linear number of rounds, provided that an invariant expressible as a conjunction over \({\mathscr {P}}\) exists  [23, 50]. Note that such a learner essentially finds an invariant in a hypothesis class \({\mathscr {H}}\) of size \(2^n\) in a linear number of rounds.

Houdini  [23] is such a learning algorithm for conjunctive formulas. Though it is typically seen as a particular way to synthesize invariants, it is a prime example of an ICE learner for conjunctions, as described in the work by Garg et al.  [26]. In fact, Houdini is similar to the classical PAC learning algorithm for conjunctions  [36], but extended to the ICE model. The time Houdini spends in each round is polynomial in the size of the sample, and it is guaranteed to converge to an invariant in at most \(n + 1\) rounds (or to report that no conjunctive invariant over \({\mathscr {P}}\) exists). In our applications, we use this ICE learner to build a CD-NPI learner for conjunctions.
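The following Python sketch shows a Houdini-style ICE learner for conjunctions in the spirit of the construction described by Garg et al.  [26]. It is our own simplified rendering for a single annotation over predicates indexed 0 to n-1, not the implementation used in our prototype, and all names in it are hypothetical. It starts from the conjunction of all predicates and drops every predicate falsified by a positive example. Whenever an implication's antecedent satisfies the current conjunction but its consequent does not, the consequent is treated as a further positive example: every consistent conjunction is weaker than the current one, hence also holds at the antecedent, and must therefore hold at the consequent. If the resulting strongest candidate still satisfies a negative example, no consistent conjunction exists.

```python
from typing import FrozenSet, Iterable, List, Optional, Set, Tuple

Valuation = Tuple[int, ...]  # v[i] == 1 iff predicate i holds


def satisfies(v: Valuation, conj: Set[int]) -> bool:
    """v |= conjunction of the predicates in conj (the empty conjunction is true)."""
    return all(v[i] for i in conj)


def learn_conjunction(n: int,
                      positives: Iterable[Valuation],
                      negatives: Iterable[Valuation],
                      implications: Iterable[Tuple[Valuation, Valuation]]
                      ) -> Optional[FrozenSet[int]]:
    """Strongest conjunction consistent with the ICE sample, or None if none exists."""
    conj = set(range(n))                 # start with the conjunction of all predicates
    pos: List[Valuation] = list(positives)
    imps = list(implications)            # iterated repeatedly below
    changed = True
    while changed:
        changed = False
        for v in pos:                    # weaken: drop predicates falsified by a positive
            conj -= {i for i in conj if not v[i]}
        for v1, v2 in imps:
            if satisfies(v1, conj) and not satisfies(v2, conj):
                # every consistent conjunction is weaker than conj, hence holds at v1,
                # so it must hold at v2 as well: treat v2 as a positive example
                pos.append(v2)
                changed = True
    if any(satisfies(v, conj) for v in negatives):
        return None                      # even the strongest candidate admits a negative
    return frozenset(conj)


print(learn_conjunction(3, [(1, 1, 0)], [(0, 1, 1)], [((1, 1, 1), (1, 0, 1))]))  # frozenset({0})
```

Each pass of the loop that changes anything removes at least one predicate, so the loop runs at most \(n + 1\) times. In the example call, index 0 stands for \(p_1\), and the learner returns the conjunction consisting of \(p_1\) alone.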

Example 4

Let us continue Example 3 and assume that the learner uses Houdini as its ICE learning algorithm. Given the ICE sample \(\mathscr {S}_\mathfrak S\), the learner now constructs the invariants \(\gamma _L = p_1\) and \(\gamma _R = p_3\). Note that these formulas are consistent with both the ICE sample \(\mathscr {S}_\mathfrak S\) and the CD-NPI sample \(\mathfrak S\).

With this new hypothesis, the verification engine can prove all verification conditions of the program valid in the theory \({\mathscr {D}}\). At this point, our invariant synthesis procedure terminates with these as adequate inductive invariants. \(\square \)

2.3 Correctness and Convergence of the Invariant Learning Framework

To state the main result of this paper, let us first assume that the set \({\mathscr {P}}\) of predicates is finite. We comment on the case of infinitely many predicates at the end of this section.

Theorem 2

Assume a normal verification engine for a program P to be given. Moreover, let \({\mathscr {P}}\) be a finite set of predicates over the variables in P and \({\mathscr {H}}\) a hypothesis class consisting of positive Boolean combinations of predicates in \({\mathscr {P}}\). If there exists an annotation in \({\mathscr {H}}\) that the verification engine can use to prove P correct, then the CD-NPI framework described in Sect.  2.1 is guaranteed to converge to such an annotation in finite time.

Proof

The proof proceeds in three steps. First, we show that a normal verification engine is honest, meaning that the non-provability information returned by such an engine does not rule out any adequate and provable annotation. Second, we show that any consistent learner (i.e., a learner that only produces consistent hypotheses), when paired with an honest verification engine, makes progress from one round to the next. Finally, we combine both results to show that the framework eventually converges to an adequate and provable annotation.

Honesty of the verification engine. We show honesty of the verification engine by contradiction.

  • Suppose that the verification engine replies to a candidate invariant \(\gamma \) proposed by the learner with a weakening constraint \(\chi \) because it could not prove the validity of the Hoare triple \(\{ \alpha \} s \{ \gamma \}\). This effectively forces any future conjecture \(\gamma '\) to satisfy \(\not \vdash _{\mathscr {B}}\gamma ' \Rightarrow \chi \).

    Now, suppose that there exists an invariant \(\delta \) such that \(\vdash _{\mathscr {B}}\delta \Rightarrow \chi \) and the verification engine can prove the validity of \(\{ \alpha \} s \{ \delta \}\) (in other words, the adequate invariant \(\delta \) is ruled out by the weakening constraint \(\chi \)). Due to the fact that the verification engine is normal (in particular, by contraposition of Part 1 of Definition 1), this implies that the verification engine can also prove the validity of \(\{ \alpha \} s \{ \chi \}\). However, this is a contradiction to \(\chi \) being a weakening constraint.

  • Suppose that the verification engine replies to a candidate invariant \(\gamma \) proposed by the learner with a strengthening constraint \(\eta \) because it could not prove the validity of the Hoare triple \(\{ \gamma \} s \{ \beta \}\). This effectively forces any future conjecture \(\gamma '\) to satisfy \(\not \vdash _{\mathscr {B}}\eta \Rightarrow \gamma '\).

    Now, suppose that there exists an invariant \(\delta \) such that \(\vdash _{\mathscr {B}}\eta \Rightarrow \delta \) and the verification engine can prove the validity of \(\{ \delta \} s \{ \beta \}\) (in other words, the adequate invariant \(\delta \) is ruled out by the strengthening constraint \(\eta \)). Since the verification engine is normal (in particular, by contraposition of Part 2 of Definition 1), this implies that the verification engine can also prove the validity of \(\{ \eta \} s \{ \beta \}\). However, this is a contradiction to \(\eta \) being a strengthening constraint.

  • Combining the arguments for weakening and strengthening constraints immediately results in a contradiction for the case of inductivity constraints as well.

Progress of the learner. Now suppose that the learning algorithm is consistent, meaning that it always produces an annotation that is consistent with the current sample. Moreover, assume that the sample in iteration \(i \in \mathbb N\) is \(\mathfrak S_i\) and the learner produces the annotation \(\gamma _i\). If \(\gamma _i\) is inadequate to prove the program correct, the verification engine returns a constraint c. The learner adds this constraint to the sample, obtaining the sample \(\mathfrak S_{i+1}\) of the next iteration.

Since verification with \(\gamma _i\) failed, as witnessed by c, we know that \(\gamma _i\) is not consistent with c. The next conjecture \(\gamma _{i+1}\), however, is guaranteed to be consistent with \(\mathfrak S_{i+1}\) (which contains c) because the learner is consistent. Hence, \(\gamma _i\) and \(\gamma _{i+1}\) are semantically different. Since samples only grow (i.e., \(\mathfrak S_{j+1} \subseteq \mathfrak S_i\) for all \(j < i\)), the same argument shows that each annotation \(\gamma _i\) that a consistent learner produces is consistent with the constraint that refuted \(\gamma _j\) and is, hence, semantically different from every previous annotation \(\gamma _j\) with \(j < i\).

Convergence. We first make two observations.

  1. The number of semantically different hypotheses in the hypothesis space \({\mathscr {H}}\) is finite because the set \({\mathscr {P}}\) is finite. Recall that \({\mathscr {H}}\) is the class of all positive Boolean combinations of predicates in \({\mathscr {P}}\).

  2. Due to the honesty of the verification engine, every annotation that the verification engine can use to prove the program correct is guaranteed to be consistent with any sample produced during the learning process.

Now, suppose that there exists an annotation that the verification engine can use to prove the program correct. Since the learner is consistent, all conjectures produced during the learning process are pairwise semantically different. Thus, due to Observation 1, the learner will at some point have exhausted all inadequate annotations in \({\mathscr {H}}\). However, there exists at least one annotation that the verification engine can use to prove the program correct. Moreover, any such annotation is guaranteed to be consistent with the current sample (due to Observation 2). Thus, the annotation conjectured next is necessarily one that the verification engine can use to prove the program correct.\(\square \)

Under certain realistic assumptions on the CD-NPI learning algorithm, Theorem 2 remains true even if the number of predicates is infinite. An example of such an assumption is that the learning algorithm always conjectures a smallest consistent annotation with respect to some fixed total order on \({\mathscr {H}}\). In this case, one can show that such a learner will at some point have proposed all inadequate annotations up to the smallest annotation the verification engine can use to prove the program correct. It will then conjecture this annotation in the next iteration. A formal proof of this argument in a more general setting, called abstract learning frameworks for synthesis, has been given by Löding, Madhusudan, and Neider  [43].
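The smallest-consistent-hypothesis learner and the convergence argument of Theorem 2 can be illustrated on a deliberately simplified toy. In the Python sketch below, which is our own illustration and not the CD-NPI framework itself, hypotheses are all monotone Boolean functions over n variables in a fixed enumeration order, the only adequate annotation is a single fixed target, and an honest teacher answers each wrong conjecture with a labelled valuation on which the conjecture and the target disagree (i.e., plain positive and negative examples rather than CD-NPI constraints). Because the target is never ruled out and every wrong conjecture is refuted by a new example, the loop terminates after at most \(|{\mathscr {H}}|\) rounds.

```python
from itertools import product
from typing import Dict, List, Tuple

Table = Dict[Tuple[int, ...], int]  # truth table of a Boolean function


def monotone_tables(n: int) -> List[Table]:
    """All monotone Boolean functions over n variables, in a fixed enumeration order."""
    points = list(product((0, 1), repeat=n))
    tables = []
    for bits in product((0, 1), repeat=len(points)):
        t = dict(zip(points, bits))
        if all(t[u] <= t[w] for u in points for w in points
               if all(a <= b for a, b in zip(u, w))):
            tables.append(t)
    return tables


def learn(n: int, target: Table) -> Tuple[Table, int]:
    """Iterate a smallest-consistent learner against an honest teacher for `target`."""
    hypotheses = monotone_tables(n)          # fixed total order on the hypothesis class
    sample: Dict[Tuple[int, ...], int] = {}  # labelled valuations collected so far
    rounds = 0
    while True:
        rounds += 1
        # smallest hypothesis consistent with the sample; the target always qualifies
        h = next(t for t in hypotheses if all(t[v] == b for v, b in sample.items()))
        diff = [v for v in h if h[v] != target[v]]
        if not diff:
            return h, rounds                 # conjecture is adequate: stop
        sample[diff[0]] = target[diff[0]]    # honest counterexample, consistent with target


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
h, rounds = learn(2, AND)
print(h == AND)  # True
```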

2.4 Limitations

One limiting factor is the effort and knowledge required to generate a set of suitable predicates over which to search for invariants. For several domains of programs and types of specifications, however, searching for invariants over a fixed set of predicates has proven to be extremely effective. Prominent examples include Microsoft’s Static Driver Verifier  [39, 47] (specifically, its underlying tool Corral  [40], an industry-strength tool that leverages exactly this approach) as well as GPUVerify  [4], a fully automated tool to verify race freedom of GPU kernels. Note that in both examples, the predicates are generated automatically based on the code of the programs and/or the specification to verify, similar to what we did in Sect. 3.

In case the current set of predicates is insufficient to prove the program correct, a simple recourse is to add the negations of predicates that do not already occur negated or to add more complex predicates (e.g., as we do in Sect. 3). Although this increases the complexity of the verification task (as it enlarges the search space), the specific learning algorithm we use (i.e., Houdini) only requires a number of rounds linear in the number of predicates to find an invariant (or to report that there is no conjunctive invariant that the verification engine can use to prove the program correct). In fact, the increased number of rounds was never a limiting factor in our experiments (see Sect. 3 and “Appendix A”). Note also that the succinctness of a class of formulas is less of a concern than the size of the hypothesis space, since the number of rounds depends on the latter.

Finally, note that our framework is designed to prove the correctness of programs, not to find bugs. As such, it does not necessarily terminate if the input program is not correct with respect to its specification. In our experience, however, the ICE framework can often detect specification violations, which manifest as inconsistent ICE samples (i.e., samples with internal inconsistencies such as an implication leading from a positive example to a negative example).

3 Application: Learning Invariants that Aid Natural Proofs for Heap Reasoning

We now develop an instantiation of our learning framework for verification engines based on natural proofs for heap reasoning  [53, 55].

3.1 Background: Natural Proofs and Dryad

Dryad  [53, 55] is a dialect of separation logic that allows expressing second-order properties using recursive functions and predicates. Dryad has a few restrictions, such as disallowing negations inside recursive definitions and in sub-formulas connected by spatial conjunctions (see Pek, Qiu, and Madhusudan  [53]). However, it is expressive enough to define a variety of data structures (singly and doubly linked lists, sorted lists, binary search trees, AVL trees, maxheaps, treaps) and recursive definitions over them that map to numbers (length, height, etc.). Dryad also allows expressing properties about the data stored within the heap (the multiset of keys stored in lists, trees, etc.).

Natural proofs  [53, 55] is a sound but incomplete strategy for deciding satisfiability of Dryad formulas. The first step the natural proof verifier performs is to convert all predicates and functions in a Dryad-annotated program to classical logic. This translation introduces heaplets (modeled as sets of locations) explicitly in the logic. Furthermore, it introduces assertions demanding that the accesses of each method are confined to the heaplet implicitly defined by its pre-condition (taking into account newly allocated or freed nodes) and that the modified heaplet at the end of the program precisely matches the heaplet implicitly defined by the post-condition.

In the second step, the natural proof verifier applies the following three transformations to the program: (a) it abstracts all recursive definitions on the heap using uninterpreted functions and introduces finite-depth unfoldings of these definitions at every place in the code where locations are dereferenced, (b) it models heaplets and other sets using the decidable theory of maps, and (c) it inserts frame reasoning explicitly in the code, which allows the verifier to derive that certain properties continue to hold across a heap update (or function call) using the heaplet that is being modified. Subsequently, the natural proof verifier translates the transformed program to Boogie  [3], an intermediate verification language that handles proof obligations using automatic theorem provers (typically SMT solvers).

To perform both steps automatically, we used the tool VCDryad  [53], which extends VCC  [13] and operates on heap-manipulating C programs. The result is a Boogie program with no recursive definitions, where all verification conditions are in decidable logics, and where a standard SMT solver, such as Z3  [18], can return models when formulas are satisfiable. The program in question can then be verified if supplied with correct inductive loop-invariants and adequate pre/post-conditions. We refer the reader to the work on Dryad  [55] and VCDryad  [53] for more details.

3.1.1 Learning Heap Invariants

We have implemented a prototype (Footnote 4) that consists of the entire VCDryad pipeline, which takes C programs annotated in Dryad and converts them to Boogie programs via the natural proof transformations described above. We then apply our transformation to the ICE learning framework and pair Boogie with the ICE learning algorithm Houdini  [23] that learns conjunctive invariants over a finite set of predicates (we describe shortly how these predicates are generated). After these transformations, Boogie satisfies the requirements on verification engines of our framework.

Given the Dryad definitions of data structures, we automatically generated a set \({\mathscr {P}}\) of predicates, which serve as the basic building blocks of our invariants. Following other successful work on template-based verification, such as GPUVerify  [4], we constructed the predicates from generic templates, which we obtained by inspecting a small number of programs. These templates were then instantiated using all combinations of program variables that occur in the program being verified. Figure 3 presents these templates in detail.

Fig. 3

Templates of Dryad predicates. The operator \(\le _\mathsf {set}\) denotes comparison between integer sets, where \(A \le _\mathsf {set} B\) if and only if \(x \le y\) holds for all \(x \in A\) and \(y \in B\). The operator \(\ge _\mathsf {set}\) is similarly defined. Shape properties such as \(\mathsf {LinkedList}\), \(\mathsf {AVLtree}\), and so on are recursively defined in Dryad (not shown here) and are extensible to any class of Dryad-definable shapes. Similarly, the definitions related to keys stored in a data structure and the sizes of data structures also stem from recursive definitions in Dryad

Our templates cover a fairly exhaustive set of predicates. This includes properties of the store (equality of pointer variables, equality and inequalities between integer variables, etc.), shape properties (singly and doubly linked lists and list segments, sorted lists, trees, binary search trees, AVL trees, treaps, etc.), and recursive definitions that map data structures to numbers, involving arithmetic relationships and set relationships (keys/data stored in a structure, lengths of lists and list segments, height of trees). In addition, our templates include predicates describing heaplets of various structures, which involve set operations, disjointness, and equalities. The structures and predicates are extensible, of course, to any recursive definition expressible in Dryad.

The templates are grouped into three categories, roughly in increasing complexity. Predicates of Category 1 involve shape-related properties, predicates of Category 2 involve properties related to the keys stored in the data-structure, and predicates of Category 3 involve size-related properties (lengths of lists and heights of trees). Given a program to verify and its annotations, we choose the category of templates depending on whether the specification refers to shape only, shapes and keys, or shapes, keys, and sizes (choosing a category includes the predicates of lower category as well). Then, predicates are automatically generated by instantiating the templates with all (combinations of) program variables. This approach gives us a fairly fine-grained control over the set of predicates used in the verification process.
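The instantiation step can be sketched as follows. The templates below are hypothetical placeholders written as plain strings (they are not the templates of Fig. 3), and the function `instantiate` is our own illustration: it collects the templates of the chosen category and of all lower ones and fills each placeholder with every ordered choice of distinct program variables.

```python
from itertools import permutations
from typing import Dict, List

# Hypothetical templates (NOT those of Fig. 3); {x} and {y} are variable placeholders.
TEMPLATES: Dict[int, List[str]] = {
    1: ["{x} == {y}", "list({x})", "lseg({x}, {y})"],  # Category 1: shape
    2: ["keys({x}) <=set keys({y})"],                  # Category 2: keys
    3: ["len({x}) == len({y}) + 1"],                   # Category 3: sizes
}


def instantiate(category: int, variables: List[str]) -> List[str]:
    """Instantiate all templates of the given category and of every lower category."""
    preds = []
    for cat in range(1, category + 1):
        for tmpl in TEMPLATES[cat]:
            arity = 2 if "{y}" in tmpl else 1        # templates here use one or two variables
            for args in permutations(variables, arity):
                # for unary templates args[-1] is args[0]; str.format ignores unused keywords
                preds.append(tmpl.format(x=args[0], y=args[-1]))
    return preds


print(len(instantiate(1, ["x", "y"])))  # 6
```

Instantiating with ordered pairs keeps both `x == y` and `y == x`; a practical generator would likely prune such symmetric duplicates to keep \(|{\mathscr {P}}|\) small.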

3.1.2 Benchmarks

We have evaluated our prototype on ten benchmark suites that contain standard algorithms on dynamic data structures, such as searching, inserting, or deleting items in lists and trees. These benchmarks were taken from the following sources:

  1. GNU C Library (glibc) singly/sorted linked lists;

  2. GNU C Library (glibc) doubly linked lists;

  3. OpenBSD SysQueue;

  4. GRASShopper  [54] singly linked lists;

  5. GRASShopper  [54] doubly linked lists;

  6. GRASShopper  [54] sorted linked lists;

  7. VCDryad  [53] sorted linked lists;

  8. VCDryad  [53] binary search trees, AVL trees, and treaps;

  9. AFWP  [33] singly/sorted linked lists; and

  10. ExpressOS  [45] MemoryRegion.

The specifications for these programs are generally checks for their full functional correctness, such as preserving or altering shapes of data structures, inserting or deleting keys, filtering or finding elements, as well as sortedness of elements. The specifications, hence, involve separation logic with arithmetic as well as recursive definitions that compute numbers (such as lengths and heights), data-aggregating recursive functions (such as multisets of keys stored in the data structures), and complex combinations of these properties (e.g., to specify binary search trees, AVL trees, and treaps). All programs are annotated in Dryad, and checking validity of the resulting verification conditions is undecidable.

From these benchmark suites, we first picked all programs that contained iterative loops and erased the user-provided loop invariants. We also selected some programs that were purely recursive and where the contract for the function had manually been strengthened to make the verification succeed. We weakened these contracts to only state the specification (typically by removing formulas in the post-conditions of recursively called functions) and introduced annotation holes instead. The goal was to synthesize strengthenings of these contracts that allow proving the program correct (we left out programs where weakening was unnatural). We also chose five straight-line programs, deleted their post-conditions, and evaluated whether our prototype was able to learn post-conditions for them (since our conjunctive learner learns the strongest invariant expressible as a conjunction, we can use our framework to synthesize post-conditions as well). Since our framework requires a verification engine that terminates when given incorrect invariants, we had to modify some axiomatizations for sets (to handle heaplets). We were unable to make a small number of programs work with these simpler axioms and had to exclude them. In total, we obtained 83 routines (Footnote 5).

After removing annotations from the benchmarks, we automatically inserted appropriate predicates over which to build invariants and contracts as described above. For all benchmark suites, conjunctions of these predicates were sufficient to prove the programs correct.
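The predicate-insertion step is not spelled out in this section, but Sect. 4.1 states that, as for the Dryad benchmarks, predicate templates taken from the specifications were instantiated with all combinations of program variables. The following is a minimal sketch under that assumption only: the function `instantiate_templates` and the Dryad-style template strings `list({0})` and `keys({0}) = keys({1})` are hypothetical illustrations, not the prototype's actual code or templates.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def instantiate_templates(templates: Sequence[Tuple[str, int]],
                          variables: Sequence[str]) -> List[str]:
    """Instantiate each (template, arity) pair with every ordered tuple of distinct variables."""
    predicates: List[str] = []
    for template, arity in templates:
        # permutations yields ordered tuples of `arity` distinct program variables
        for args in permutations(variables, arity):
            predicates.append(template.format(*args))
    return predicates


# Hypothetical Dryad-style templates; real ones would come from the benchmark's specifications.
TEMPLATES = [("list({0})", 1), ("keys({0}) = keys({1})", 2)]
base_predicates = instantiate_templates(TEMPLATES, ["x", "y"])
```

Instantiating with ordered tuples produces symmetric duplicates such as `keys(x) = keys(y)` and `keys(y) = keys(x)`; a real generator might deduplicate them, which would shrink the set of base predicates.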

3.1.3 Experimental Results

We performed all experiments sequentially on a single core of an Intel Core i5-6600 3.3 GHz CPU with 16 GB of RAM running Debian GNU/Linux 9.12 (stretch). For each benchmark, we limited the memory available to the underlying SMT solver (Z3  [18] in our case) to 1 GB. This amount was sufficient for all benchmarks.

The box plots in Fig. 4 summarize the results of this empirical evaluation aggregated by benchmark suite (full details can be found in “Appendix A”). This includes the time required to verify the programs, the number of automatically inserted base predicates (i.e., \(|{\mathscr {P}}|\)), and the number of iterations in the learning process. Each box in the diagrams shows the lower and upper quartile (left and right borders of the box, respectively), the median (line within the box), as well as the minimum and maximum (left and right whisker, respectively).

Fig. 4 Aggregated experimental results of the Dryad benchmarks

Our prototype was successful in learning invariants and contracts for all 83 programs. The median verification time is below \(10\,s\) for the great majority of benchmark suites, which shows that our technique is highly effective in finding inductive Dryad invariants. Although many examples start with hundreds of base predicates, which suggests a worst-case complexity of hundreds of iterations, the learner required far fewer iterations. Moreover, the number of predicates in the final invariant is small. This shows that the non-provability information in our framework is far more informative than the worst case suggests.

To the best of our knowledge, our prototype is currently the only tool capable of fully automatically verifying these challenging benchmark suites. We must emphasize, however, that subsets of these benchmarks can be verified by reformulating the verification problem in decidable fragments of separation logic; we refer the reader to the related work in Sect. 1 for a survey of such work. Our goal in this evaluation is not to compete with other, mature tools on a subset of benchmarks, but to measure the efficacy of our proposed CD-NPI-based invariant synthesis framework on the complete benchmark set.

4 Application: Learning Invariants in the Presence of Bounded Quantifier Instantiation

Software verification must deal with quantification in numerous application domains. For instance, quantifiers are often needed to axiomatize theories that are not already equipped with decision procedures, to specify properties of unbounded data structures and dynamically allocated memory, and to define recursive properties of programs. As an example, the power-of-two function can be axiomatized using quantifiers:

$$\begin{aligned} {pow}_2(0) = 1 ~~\wedge ~~ \forall n \in \mathbb N :n > 0 \Rightarrow {pow}_2(n) = 2 \cdot {pow}_2(n-1). \end{aligned}$$

Although various important first-order theories are undecidable (e.g., the first-order theory of arithmetic with uninterpreted functions), modern SMT solvers implement a host of heuristics to cope with quantifier reasoning. Quantifier-instantiation techniques, including pattern-based quantifier instantiation (e.g., E-matching  [16]) and model-based quantifier instantiation  [28], are particularly effective heuristics in this context. The key idea of instantiation-based heuristics is to instantiate universally quantified formulas with a finite number of ground terms and then to check the validity of the resulting quantifier-free formulas (whose theory needs to be decidable). The choice of ground terms varies from method to method, but most instantiation-based heuristics are necessarily incomplete in general because the underlying decision problems are undecidable.
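To make instantiation concrete for the \({pow}_2\) axiom, here is a minimal sketch of a bounded instantiation policy. It is a hypothetical, simplified stand-in and not E-matching or model-based quantifier instantiation: starting from the arguments of \({pow}_2\) that occur in the goal, it collects ground terms by repeatedly applying \(n \mapsto n-1\) up to a fixed depth and instantiates the universally quantified conjunct at each of them. The names `instantiation_terms`, `instantiate_pow2_axiom`, and the `depth` parameter are assumptions made for illustration.

```python
from typing import Iterable, List


def instantiation_terms(goal_args: Iterable[int], depth: int) -> List[int]:
    """Ground arguments reachable from the goal's pow2 arguments by at most `depth`
    applications of n -> n - 1 (a hypothetical bounded instantiation policy)."""
    terms = set(goal_args)
    frontier = set(goal_args)
    for _ in range(depth):
        frontier = {n - 1 for n in frontier if n > 0} - terms
        terms |= frontier
    return sorted(terms)


def instantiate_pow2_axiom(terms: Iterable[int]) -> List[str]:
    """Replace  forall n. n > 0 => pow2(n) = 2 * pow2(n - 1)  by its instances at the given
    ground terms; the ground conjunct pow2(0) = 1 is kept as it is."""
    instances = ["pow2(0) = 1"]
    for n in terms:
        # An instance with a false guard (n = 0) is trivially true, so it is omitted.
        if n > 0:
            instances.append(f"pow2({n}) = 2 * pow2({n - 1})")
    return instances


shallow = instantiate_pow2_axiom(instantiation_terms([3], depth=1))
deep = instantiate_pow2_axiom(instantiation_terms([3], depth=3))
```

With `depth=1`, only \(n = 2\) and \(n = 3\) are instantiated, so \({pow}_2(1)\) is left unconstrained; with `depth=3`, the instances fix every value up to \({pow}_2(3)\). This illustrates why the amount of instantiation determines which facts become provable.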

We can apply our invariant synthesis framework to verification engines that employ quantifier instantiation as follows. Assume that \({\mathscr {U}}\) is an undecidable first-order theory allowing uninterpreted functions and that \({\mathscr {D}}\) is its decidable quantifier-free fragment. Then, quantifier instantiation can be seen as a transformation of a \({\mathscr {U}}\)-formula \(\varphi \) (potentially containing quantifiers) into a \({\mathscr {D}}\)-formula \({{ approx}}(\varphi )\) in which all existential quantifiers have been eliminated (e.g., using skolemization) and all universal quantifiers have been replaced by finite conjunctions over ground terms (Footnote 6). This means that if the \({\mathscr {D}}\)-formula \({{ approx}}(\varphi )\) is valid, then the \({\mathscr {U}}\)-formula \(\varphi \) is valid as well. On the other hand, if \({{ approx}}(\varphi )\) is not valid, one cannot deduce any information about the validity of \(\varphi \). However, a \({\mathscr {D}}\)-model of \({{ approx}}(\varphi )\) can be used to derive non-provability information as described in Sect. 2.1.
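The following toy example, built on the \({pow}_2\) axiom, shows how \({{ approx}}(\varphi )\) can fail to be valid even though \(\varphi \) is valid. Take \(\varphi \) to be "axiom implies \({pow}_2(3) = 8\)". If the universal conjunct is instantiated only at \(n = 1\) and \(n = 2\), then \({pow}_2(3)\) is unconstrained and a model of the negated approximation exists. This is a sketch under stated assumptions: a brute-force search over a bounded value range stands in for the decidable \({\mathscr {D}}\)-solver, and the names `approx_holds` and `find_countermodel` are hypothetical. How such a model is turned into non-provability information is the subject of Sect. 2.1 and is not reproduced here.

```python
from itertools import product
from typing import Dict, Iterable, Optional, Sequence


def approx_holds(pow2: Dict[int, int], terms: Sequence[int]) -> bool:
    """Evaluate the instantiated axioms: pow2(0) = 1 plus one instance per positive ground term."""
    return pow2[0] == 1 and all(pow2[n] == 2 * pow2[n - 1] for n in terms if n > 0)


def find_countermodel(terms: Sequence[int], goal_arg: int, goal_value: int,
                      values: Iterable[int] = range(17)) -> Optional[Dict[int, int]]:
    """Search for an interpretation of pow2 on the relevant ground arguments that satisfies
    the instantiated axioms but violates pow2(goal_arg) = goal_value.

    pow2 is applied only to distinct integer literals, so any assignment of values to them
    extends to a total function. A returned model shows approx(phi) is not valid; None only
    means no counter-model exists within the bounded value range searched."""
    args = sorted({0, goal_arg} | set(terms) | {n - 1 for n in terms if n > 0})
    values = list(values)
    for assignment in product(values, repeat=len(args)):
        pow2 = dict(zip(args, assignment))
        if approx_holds(pow2, terms) and pow2[goal_arg] != goal_value:
            return pow2
    return None


weak_model = find_countermodel([1, 2], goal_arg=3, goal_value=8)
strong_model = find_countermodel([1, 2, 3], goal_arg=3, goal_value=8)
```

Here `weak_model` is `{0: 1, 1: 2, 2: 4, 3: 0}`: it satisfies every instance that was generated yet sets \({pow}_2(3) = 0\). This is exactly the kind of model that says "not provable with this instantiation" rather than "false". Adding the instance at \(n = 3\) removes all counter-models, so `strong_model` is `None`.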

4.1 Benchmarks

Our benchmark suite consists of twelve slightly simplified programs from IronFleet  [31] (provably correct distributed systems), the Verified Software Competition  [37], ExpressOS  [45] (a secure operating system for mobile devices), and tools for sparse matrix multiplication  [8]. In these programs, quantifiers are used to specify recursively defined predicates, such as power(n, m) and sum(n), as well as various array properties, such as no duplicate elements, periodic properties of array elements, and bijective (injective and surjective) maps. Reasoning about all of these specifications is undecidable in general. In particular, the array specifications fall outside of the decidable array property fragment  [7] because they involve strict comparison between universally quantified index variables, array accesses in the index guard, nested array accesses (e.g., \(a_1[a_2[i]]\)), arithmetic expressions over universally quantified index variables, and alternation of universal and existential quantifiers.
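The following formulas are illustrative shapes of three of the listed array properties. They are not the benchmarks' exact specifications; the array \(a\), the map \(p\), the length \(n\), and the period \(k\) are assumed names. Each exhibits one of the features named above: a strict comparison between quantified indices, arithmetic over a quantified index, and a universal–existential alternation.

$$\begin{aligned} &\text {no duplicates:} && \forall i, j :0 \le i< j < n \Rightarrow a[i] \ne a[j]\\ &\text {periodicity:} && \forall i :0 \le i < n - k \Rightarrow a[i] = a[i + k]\\ &\text {surjectivity:} && \forall j :0 \le j < n \Rightarrow \exists i :0 \le i < n \wedge p[i] = j \end{aligned}$$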

From this benchmark suite, we erased the user-defined loop invariants and generated a set of predicates that serve as the building blocks of our invariants. To this end, we used the pre/post-conditions of the program being verified as templates from which the actual predicates were generated—as in the case of Dryad benchmarks, the templates were instantiated using all combinations of program variables that occur in the program. Additionally, we generated predicates for octagonal constraints over the integer variables in the programs (i.e., relations between two integer variables of the form \(\pm x \pm y \le c\)). For programs involving arrays, we also generated octagonal constraints over array access expressions that appear in the program.
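The octagonal predicates can be generated mechanically. The sketch below follows the two-variable form \(\pm x \pm y \le c\) stated above. The section does not say which constants \(c\) were used, so the constant set is an assumption, and `octagonal_predicates` is a hypothetical name.

```python
from itertools import combinations, product
from typing import List, Sequence


def octagonal_predicates(variables: Sequence[str], constants: Sequence[int]) -> List[str]:
    """Generate candidate predicates  ±x ± y <= c  for every pair of distinct variables."""
    predicates: List[str] = []
    for x, y in combinations(variables, 2):       # unordered pairs {x, y}
        for sx, sy in product("+-", repeat=2):     # the four sign combinations
            for c in constants:                    # assumed set of bounds c
                predicates.append(f"{sx}{x} {sy} {y} <= {c}")
    return predicates


preds = octagonal_predicates(["i", "j", "n"], [0, 1])
```

The count grows as \(4 \cdot \binom{k}{2} \cdot |C|\) for \(k\) variables and \(|C|\) constants (24 here). Together with the template instances, this helps explain how \(|{\mathscr {P}}|\) reaches the hundreds for programs with many variables.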

4.1.1 Experimental Results

We have implemented a prototype (Footnote 7) based on Boogie  [3] and Z3  [18] as the verification engine and Houdini  [23] as a conjunctive ICE learning algorithm. As in the case of the Dryad benchmarks, all experiments were conducted sequentially on a single core of an Intel Core i5-6600 3.3 GHz CPU with 16 GB of RAM running Debian GNU/Linux 9.12 (stretch). Again, we limited the memory available to the underlying SMT solver to 1 GB. The results of these experiments are listed in Table 1.
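The conjunctive learning loop explains the worst-case bound on iterations mentioned in the results. A Houdini-style learner starts from the conjunction of all base predicates and, in every round, drops each predicate that the returned counterexample falsifies. Since at least one predicate disappears per round, there are at most \(|{\mathscr {P}}| + 1\) rounds. The sketch below is a simplification under stated assumptions. The teacher `toy_teacher` is hypothetical and enumerates reachable states of a toy loop instead of querying a verification engine for non-provability information. It also returns only states that must satisfy the invariant, so the handling of implication counterexamples in the ICE setting is not modeled.

```python
from typing import Callable, Dict, Optional, Set, Tuple

State = Dict[str, int]
Predicate = Callable[[State], bool]


def houdini(predicates: Dict[str, Predicate],
            teacher: Callable[[Set[str]], Optional[State]]) -> Tuple[Set[str], int]:
    """Houdini-style conjunctive learner: start from all base predicates and drop every
    predicate falsified by a counterexample state until the teacher accepts."""
    candidate = set(predicates)
    rounds = 0
    while True:
        rounds += 1
        cex = teacher(candidate)
        if cex is None:                  # teacher accepts the current conjunction
            return candidate, rounds
        falsified = {p for p in candidate if not predicates[p](cex)}
        if not falsified:                # a state satisfying the candidate cannot refute it
            raise ValueError("counterexample does not falsify any predicate")
        candidate -= falsified           # at least one predicate is removed per round


# Hypothetical toy program: i = 0; while (i < n) i := i + 1;  with n = 3.
PREDICATES: Dict[str, Predicate] = {
    "i >= 0": lambda s: s["i"] >= 0,
    "i <= n": lambda s: s["i"] <= s["n"],
    "i == 0": lambda s: s["i"] == 0,
    "i < n": lambda s: s["i"] < s["n"],
    "i == n": lambda s: s["i"] == s["n"],
}
REACHABLE = [{"i": k, "n": 3} for k in range(4)]   # loop-head states


def toy_teacher(candidate: Set[str]) -> Optional[State]:
    """Stand-in for a verifier: return the first reachable loop-head state violating the candidate."""
    for state in REACHABLE:
        if not all(PREDICATES[p](state) for p in candidate):
            return state
    return None


invariant, rounds = houdini(PREDICATES, toy_teacher)
```

On this toy loop, the learner starts with five predicates and needs four rounds to converge to `{"i >= 0", "i <= n"}`, the strongest conjunction over these predicates that holds in all reachable loop-head states.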

Table 1 Experimental results of the quantifier instantiation benchmarks. The column \(|{\mathscr {P}}|\) refers to the number of automatically inserted base predicates, the column # Iterations to the number of iterations of the teacher and learner, and the column |Inv| to the number of predicates in the inferred invariant.

As can be seen from this table, our prototype was effective in finding inductive invariants and was able to prove each program correct in less than 1 min (in \(75\,\%\) of the programs in less than \(10\,s\)). Although many examples have hundreds of base predicates, which suggests a worst-case complexity of hundreds of rounds, the learner found an inductive invariant in far fewer rounds. As in the case of the Dryad benchmarks, the non-provability information provided by the verification engine was far more informative than the worst case suggests.

5 Conclusions and Future Work

We have presented a learning-based framework for invariant synthesis in the presence of sound but incomplete verification engines. To demonstrate that our technique is effective in practice, we have successfully applied it to two important and challenging verification settings: verifying heap-manipulating programs against specifications expressed in an expressive and undecidable dialect of separation logic, and verifying programs against specifications with universal quantification. For the former setting in particular, we are not aware of any other technique that can handle our challenging benchmark suite.

Several future research directions are interesting. First, the framework we have developed is based on the principle of counterexample-guided inductive synthesis: the invariant synthesizer synthesizes invariants from non-provability information but does not work directly on the program’s structure. It would be interesting to extend white-box invariant generation techniques, such as interpolation and IC3/PDR, to work directly on \({\mathscr {D}}\) (or \({\mathscr {B}}\)) abstractions of the program in order to synthesize invariants. Second, in the CD-NPI learning framework we have put forth, it would be interesting to replace the underlying logic of communication \({\mathscr {B}}\) with a richer logic, say the theory of arithmetic and uninterpreted functions. The challenge here would be to extract non-provability information in this richer theory from the models and to pair it with synthesis engines that synthesize expressions against constraints in \({\mathscr {B}}\). Finally, we believe that invariant learning should also draw on experience gained in verifying other programs in the past, both manually and automatically. A learning algorithm that combines logic-based synthesis with experience and priors gained from repositories of verified programs could be more effective.