ICE-Based Refinement Type Discovery for Higher-Order Functional Programs

We propose a method for automatically finding refinement types of higher-order functional programs. Our method is an extension of the Ice framework of Garg et al. for finding invariants. In addition to the usual positive and negative samples in machine learning, their Ice framework uses implication constraints, which consist of pairs (x, y) such that if x satisfies an invariant, so does y. From these constraints, Ice infers inductive invariants effectively. We observe that the implication constraints in the original Ice framework are not suitable for finding invariants of recursive functions with multiple function calls. We thus generalize the implication constraints to those of the form ({x1, …, xk}, y), which means that if all of x1, …, xk satisfy an invariant, so does y. We extend their algorithms for inferring likely invariants from samples, verifying the inferred invariants, and generating new samples. We have implemented our method and confirmed its effectiveness through experiments.


Introduction
Higher-order functional program verification is an interesting and challenging problem. Over the past two decades, several approaches have been proposed: refinement types with manual annotations [12,33], liquid types [24], and reduction to higher-order recursion schemes [26]. These approaches face the same problem found in imperative and synchronous data-flow program verification: the need for predicates describing how loops and components behave for the verification and/or abstraction method to work in practice [8,14,19]. This paper proposes to address this issue by combining refinement types with the recent machine-learning-based invariant discovery framework Ice from [13,14].
Consider for instance a function f from integers to integers such that if its input n is less than or equal to 101, then its output is 91, otherwise it is n − 10. (This is the case of the mc_91 function on Fig. 1.) Then our objective is to automatically discover, by using an adaptation of Ice, the refinement type f : {n : int | true} → {r : int | (n > 101 ∧ r = n − 10) ∨ r = 91}.
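The claimed behavior can be checked concretely; below is a minimal Python transcription of the recursive definition (following the mc_91 function of Fig. 1) together with the refinement predicate for its result:

```python
def mc_91(n):
    # McCarthy's 91 function: nested recursive calls in the else branch.
    if n > 100:
        return n - 10
    return mc_91(mc_91(n + 11))

def rho_2(n, r):
    # Candidate result refinement: (n > 101 and r = n - 10) or r = 91.
    return (n > 101 and r == n - 10) or r == 91

# The refinement holds on every sampled input.
assert all(rho_2(n, mc_91(n)) for n in range(-50, 200))
```

Sampling of course only suggests the refinement; the point of this paper is to discover and verify it automatically.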
That is, function f accepts any integer n that satisfies true as input, and yields an integer r equal to n − 10 when n > 101, and equal to 91 otherwise. The traditional Ice framework is not appropriate for our use-case. We briefly summarize it below, and then discuss how this approach needs to be extended for the purpose of functional program verification.

Brief review of the Ice framework. Let S be a transition system ⟨s, I(s), T(s, s′)⟩, with s its vector of state variables, I(s) its initial predicate, and T(s, s′) the transition relation between consecutive states. Suppose we wish to prove that Prop(s) is an invariant, i.e., that the property Prop(s) holds for any state s reachable from an initial state. Then it suffices to find a predicate Inv(s) that satisfies the following conditions.

I(s) ⊨ Inv(s) (1)
Inv(s) ∧ T(s, s′) ⊨ Inv(s′) (2)
Inv(s) ⊨ Prop(s) (3)

The predicate Inv(s) is an invariant that is inductive in that it is preserved by the transition relation, as guaranteed by (2). We call such an Inv(s) a strengthening inductive invariant for Prop(s). It serves as a certificate that Prop(s) is a (plain) invariant. Given a candidate for Inv(s), the conditions (1)-(3) can be checked by an SMT [2] solver. In the rest of this section, "invariant" will always mean "strengthening inductive invariant". The Ice framework is a machine-learning-based method combining a learner that incrementally produces candidate invariants, and a teacher that checks whether the candidates satisfy (1)-(3). If a given candidate is not an invariant, the teacher produces learning data as follows, so that the learner can produce a better candidate. A candidate is an arbitrary Boolean combination of atomic predicates called qualifiers. Given a candidate C_k(s), the teacher checks whether (1) holds, using an SMT solver for instance. If it does not, a concrete state e is extracted and will be given to the learner as an example: the next candidate C_{k+1} should be such that C_{k+1}(e) holds, i.e., it must include the example. Conversely, if (3) does not hold, a concrete state c is extracted and will be given as a counterexample: the next candidate should be such that C_{k+1}(c) does not hold, i.e., it must exclude the counterexample.
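For intuition, the three conditions can be checked exhaustively on a toy finite instance (a hypothetical counter that steps by two, not an example from the paper; a real teacher would query an SMT solver instead of enumerating states):

```python
# Toy transition system: initial state 0, transition s -> s + 2.
STATES = range(0, 200)
def init(s): return s == 0
def trans(s, s2): return s2 == s + 2
def prop(s): return s % 2 == 0          # property: reachable states are even

# Candidate strengthening inductive invariant.
def inv(s): return s % 2 == 0

# (1) every initial state satisfies Inv
assert all(inv(s) for s in STATES if init(s))
# (2) Inv is preserved by the transition relation (inductiveness)
assert all(inv(s2) for s in STATES for s2 in STATES if inv(s) and trans(s, s2))
# (3) Inv implies the property to prove
assert all(prop(s) for s in STATES if inv(s))
```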
Unlike traditional machine-learning approaches, in Ice the teacher also extracts learning data from (2) when it does not hold. It takes the form of a pair of (consecutive) concrete states (i, i′), and is called an implication constraint: the next candidate should be such that C_{k+1}(i) ⇒ C_{k+1}(i′). Implication constraints are crucial for the learner to discover inductive invariants, as they let it know why its current candidate failed the induction check. The Ice framework does not specify how the learner generates candidates, but this is typically done by building a classifier consistent with the learning data, in the form of a decision tree, discussed further in Sect. 3.

Refinement type inference as a predicate synthesis problem. We now discuss why the original Ice framework is ill-suited for functional program verification. Consider McCarthy's 91 function from Fig. 1. To prove this program correct in a refinement type setting, it is enough to find some refinement type {n : int | ρ1(n)} → {r : int | ρ2(n, r)} for mc_91, where ρ1 and ρ2 are such that

ρ1(n) ∧ n > 100 ∧ r = n − 10 ⊨ ρ2(n, r) (4)
ρ1(n) ∧ n ≤ 100 ⊨ ρ1(n + 11) (5)
ρ1(n) ∧ n ≤ 100 ∧ ρ2(n + 11, tmp) ⊨ ρ1(tmp) (6)
ρ1(n) ∧ n ≤ 100 ∧ ρ2(n + 11, tmp) ∧ ρ2(tmp, r) ⊨ ρ2(n, r) (7)
true ⊨ ρ1(m) (8)
m ≤ 101 ∧ ρ2(m, res) ⊨ res = 91 (9)

We can observe some similarities between the Horn clauses above and (1)-(3). The constraints (8) and (9) respectively correspond to the constraints (1) and (3) on initial states and the property to be proved, whereas the constraints (4)-(7) correspond to the induction constraint (2). This observation motivates us to reuse the Ice framework for refinement type inference. There are, however, two obstacles in adapting the Ice framework to refinement type inference. First, we must infer not one but several mutually-dependent predicates.
Second, and more importantly, we need to generalize the notion of implication constraint because of the nested recursive calls found in functional programs. To illustrate, let us assume that we realized that mc_91's precondition is ρ1(n) = true. Then the third constraint from the else branch, (7), simplifies to

n ≤ 100 ∧ ρ2(n + 11, tmp) ∧ ρ2(tmp, r) ⊨ ρ2(n, r)

Contrary to the ones found in the original Ice framework, this Horn clause is non-linear: it has more than one application of the same predicate (ρ2, here) in its antecedents. Now, assuming we have a candidate for which this constraint is falsifiable, the implication constraint should have the form ({(n1, r1), (n2, r2)}, (n, r)), which means that the next candidate C should be such that C(n1, r1) ∧ C(n2, r2) ⇒ C(n, r). This is because there are two occurrences of ρ2 on the left-hand side of the implication.
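The semantics of such a generalized constraint is easy to state operationally. A small helper (hypothetical names, a sketch only) checking whether a candidate C respects a constraint ({x1, …, xk}, y):

```python
def respects_implication(candidate, constraint):
    """A candidate respects ({x1, ..., xk}, y) unless it accepts every
    antecedent sample while rejecting the consequent sample."""
    antecedents, consequent = constraint
    if all(candidate(*x) for x in antecedents):
        return candidate(*consequent)
    return True

# Candidate from Sect. 1: rho_2(n, r) = (n > 101 and r = n - 10) or r = 91.
cand = lambda n, r: (n > 101 and r == n - 10) or r == 91
# A constraint from the non-linear clause: ({(n1, r1), (n2, r2)}, (n, r)).
assert respects_implication(cand, ([(111, 101), (101, 91)], (100, 91)))
```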
The need to infer more than one predicate and support non-linear Horn clauses is not specific to higher-order functional program verification. After all, McCarthy's 91 function is first-order and is occasionally mentioned in first-order imperative program verification papers [4]. Sv-Comp [3], the main (imperative) software verification competition, features 3247 verification problems in its linear arithmetic track which can be encoded as Horn clauses, 54 of which contain non-linear Horn clauses. In our context of higher-order functional program verification the ratio is much higher, with 63 of our 164 OCaml [22] programs yielding non-linear Horn clauses.
The main contribution of this paper is to address the two aforementioned issues and propose a modified Ice framework suitable in particular for higher-order program verification. While adapting machine-learning techniques to higher-order program verification has been done before [37,38], transposing implication constraints to this context is, to the best of our knowledge, new work. We also present various simplifications/optimizations for the encoding of the problem and the modified Ice framework, which prove extremely useful in practice. We have implemented our approach as a program verifier for a subset of OCaml and report on our experiments.
The rest of the paper is organized as follows. Section 2 introduces our target language and describes verification condition generation and simplification. The modified Ice framework is discussed in Sect. 3. We report on our implementation and experiments of the approach in Sect. 4. Section 5 describes and evaluates ongoing work for adapting our approach to Algebraic Data Types. Finally, we discuss related work in Sect. 6 before concluding in Sect. 7.
This article is an extended version of previous work [6]. This version adds information and examples that make the discussion more understandable, as well as a completely new section (Sect. 5) discussing preliminary work on adapting our approach to Algebraic Data Types. We implemented this work in our Horn clause solver HoIce [7] and report on our experimental evaluation.

Target Language and Verification Conditions
In this section, we first introduce the target language of our refinement type inference method. We then introduce a refinement type system and associated verification conditions (i.e., sufficient conditions for the typability of a given program).

Language
The target of the method is a simply-typed, call-by-value, higher-order functional language with recursion. Its syntax is given by:

P ::= { f1(x̃1) = e1, …, fn(x̃n) = en }
e ::= a | ⊕{ai ⇒ ei}1≤i≤n | let x = * in e | let x = a in e | let x = y z̃ in e | fail
a ::= n | x | op(a1, …, ak)

We use the meta-variables x, y, …, f, g, … for variables. We write x̃ for a sequence; for example, x̃ is a sequence of variables. For the sake of simplicity, we consider only integers as base values. We represent Booleans using integers, and treat 0 as false and non-zero values as true. We sometimes write true for 1 and false for 0. We briefly explain programs and expressions; the formal semantics is given later. We use let-normal-form style for simplicity. A program P is a set of mutually recursive function definitions f(z̃) = e. The expression ⊕{ai ⇒ ei}1≤i≤n evaluates ei non-deterministically if the value of ai is non-zero, which can also be used to generate non-deterministic Booleans/integers. We also write (a1 ⇒ e1) ⊕ ⋯ ⊕ (an ⇒ en) for ⊕{ai ⇒ ei}1≤i≤n, and write if a then e1 else e2 for (a ⇒ e1) ⊕ (¬a ⇒ e2). The expression let x = * in e generates an integer, binds x to it, and evaluates e. The expression let x = a in e (let x = y z̃ in e, resp.) binds x to the value of a (y z̃, resp.), and then evaluates e. The expression fail aborts the program. An assert expression assert(a) can be represented as if a then 0 else fail. An arithmetic expression consists of integer constants, (integer) variables, and primitive operators, denoted by op; we assume that the set of primitive operators contains standard integer/Boolean operations/relations like +, <, ∨, ∧, ⋯. A value v is either an integer constant or a function closure; the latter is a partial function application of the form fi ṽ, where the length |ṽ| of the arguments ṽ must be (strictly) smaller than the arity of fi. We assume that a program is well-typed under the standard simple type system.
We also assume that every function in P has a non-zero arity, that the body of each function definition has the integer type, and that P contains a distinguished function symbol main ∈ {f1, …, fn} whose simple type is int → int.
The operational semantics of the target language is given on Fig. 2, where we extend the syntax of expressions with let x = v in e and let x = e in e. The goal of our verification is to find an invariant (represented in the form of refinement types) of the program that is sufficient to verify that, for every integer n, main n does not fail (i.e., is not reduced to fail).
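This verification goal can be made concrete with a small test harness (a hypothetical Python sketch, not the formal semantics of Fig. 2): fail is modeled as an exception, assert(a) as its sugar, and main is run on a range of inputs.

```python
class Fail(Exception):
    """Models the `fail` expression: reaching it aborts the program."""

def fail():
    raise Fail()

def assert_(a):
    # assert(a) is sugar for: if a then 0 else fail
    return 0 if a else fail()

# A tiny example program, transcribing mc_91 and a main checking its spec.
def main(n):
    def mc_91(n):
        if n > 100:
            return n - 10
        tmp = mc_91(n + 11)
        return mc_91(tmp)
    res = mc_91(n)
    return assert_(not (n <= 101) or res == 91)

# main n does not fail for any tested integer n.
for n in range(-20, 150):
    main(n)
```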

Refinement Type System
We present a refinement type system for the target language. The syntax of refinement types is given by:

T ::= {x : int | a} | (x : T1) → T2

The refinement type {x : int | a} denotes the set of integers that satisfy a, i.e., for which the value of a is non-zero. For example, {x : int | x ≥ 0} represents the natural numbers. The type (x : T1) → T2 denotes the set of functions that take an argument x of type T1 and return a value of type T2. Here, note that x may occur in T2. We write int for {x : int | true}, and T1 → T2 for (x : T1) → T2 when x does not occur in T2. By abuse of notation, we sometimes (as in Sect. 1) write {x : int | a} → T even when x occurs in T, letting the refinement variable double as the binder.

Figure 3 shows the typing rules, which are the standard ones. We have three kinds of type judgments: Γ ⊢ e : T for expressions, ⊢ P : Γ for programs, and Γ ⊢ T <: T′ for subtyping. A judgment Γ ⊢ e : T means that the expression e has the refinement type T under the refinement type environment Γ, which is a sequence of refinement type bindings and guard predicates:

Γ ::= ∅ | Γ, x : T | Γ, a

Here, x : T means that x has refinement type T, and a means that a holds. When Γ = Γ1, x : T, Γ2 where Γ2 does not contain a binding of the form x : T′, we write Γ(x) for T. A judgment ⊢ P : Γ means that the program P is well-typed, where Γ describes the type of each function defined in P. A subtyping judgment Γ ⊢ T <: T′ means that a value of type T may be used as a value of type T′. In the rules, we implicitly assume that all the variables occurring in an arithmetic expression a have type int. Though the typing rules are fairly standard, we explain a few of them. In rule T-Fail, the premise requires that the constraint implied by the type environment Γ entails false; this ensures that there is no environment that conforms to Γ, so that fail is unreachable. In rule T-AExp, the information that x is bound to a is propagated to the type of x. Since the type T of e may contain x, we substitute a for x in the conclusion. The rule T-Sub allows the type of an expression to be weakened. For example, if we have Γ ⊢ e : {x : int | x = 1}, it can be weakened to Γ ⊢ e : {x : int | x > 0} by using T-Sub.
The type system is sound in the sense that if ⊢ P : Γ holds for some Γ, then main n does not fail for any integer n. We omit the soundness proof, as the type system is a rather well-known one [24,30,31]. The type system is, however, incomplete: there are programs that never fail but are not typable in the refinement type system. Implicit parameters are required to make the type system complete [32].

Verification Conditions
Our goal has now been reduced to finding Γ such that ⊢ P : Γ, if such a Γ exists. To this end, we first infer simple types for the target program by using the Hindley-Milner type inference algorithm. From the simple types, we construct a template for the refinement type environment Γ and generate verification conditions, i.e., constraints on the predicate variables that describe a sufficient condition for ⊢ P : Γ. The construction of the verification conditions is also rather standard [24,30,31]; we present it here in the form of a function VC(⊢ P : Γ) defined on Fig. 4.

We note that the verification conditions generated by the function VC(⊢ P : Γ) can be normalized to a set of Horn clauses [4].

In Fig. 4, VC(⊢ P : Γ) is defined by using three sub-procedures VC_f, VC_<:, and VC_e. The procedure VC_f(f(x1, …, xk) = e) generates a condition for the function definition f(x1, …, xk) = e to be well-typed. The procedures VC_<:(Γ ⊢ T <: T′) and VC_e(Γ ⊢ e : T) respectively generate conditions for Γ ⊢ T <: T′ and Γ ⊢ e : T to be derivable by the rules in Fig. 3. Each of the computation rules for VC_<: and VC_e follows from the corresponding typing rule in Fig. 3.

Example 1
Consider the following program and its associated simple types: By assigning a unique predicate variable to each integer type, we can obtain the following refinement type templates.
We then extract the following verification conditions from the body of the program:

Simplifying Verification Conditions
The number of unknown predicates to infer is critical to the efficiency of our algorithm in Sect. 3, because the algorithm succeeds only when the learner comes up with correct solutions for all the unknown predicates. We discuss here a couple of techniques to reduce the number of unknown predicates. The first one takes place at the level of Horn clauses and is not limited to refinement type inference over functional programs. Suppose that some predicate ρ occurs only in the two clauses ϕ ⊨ ρ(x̃) and C[ρ] ⊨ ϕ′, where C[ρ] is a formula having only positive occurrences of ρ, and ρ does not occur in ϕ, in ϕ′, nor in any other clause of the verification condition. Then, we can replace the two clauses above with C[ϕ] ⊨ ϕ′ and set ρ ≡ ϕ. For example, recall incr/twice from the example above: the predicate ρ1 occurs only in a clause defining it and in the antecedent of the clause ρ1(n) ⊨ ρ2(n, n + 1). Thus, we can inline the definition of ρ1 into the latter clause and eliminate ρ1 altogether. In this manner we can reduce the number of unknown predicate variables. This optimization itself is not specific to our context of functional program verification; similar (and more sophisticated) techniques are also discussed in [4]. We found this optimization particularly useful in our context, because the standard verification condition generation for higher-order functional programs discussed above introduces too many predicate variables.
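The effect of this reduction can be illustrated on a toy symbolic clause representation (a hypothetical string-based sketch handling only the simplest case of a unary predicate with a concrete defining body; real implementations work on proper clause ASTs):

```python
# A clause is (antecedent, consequent), both written as plain strings.
clauses = [
    ("true", "rho1(n)"),            # phi |= rho(x): defines rho1
    ("rho1(n)", "rho2(n, n + 1)"),  # C[rho] |= phi': uses rho1 positively
]

def inline(clauses, pred, body):
    """Replace every application of `pred` by `body` and drop the defining
    clause, reducing the number of unknown predicate variables by one."""
    out = []
    for ante, cons in clauses:
        if cons.startswith(pred):
            continue  # drop the defining clause phi |= rho(x)
        out.append((ante.replace(pred + "(n)", body), cons))
    return out

reduced = inline(clauses, "rho1", "true")
assert reduced == [("true", "rho2(n, n + 1)")]
```

After the reduction, only rho2 remains to be inferred, and rho1 is recorded as the substituted body.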
The other optimization is specific to our context of refinement type inference. Suppose that the simple type of a function f is int → int. Then, in general, we prepare the refinement type template {x : int | ρ1(x)} → {r : int | ρ2(x, r)}. If the evaluation of f(n) does not fail for any integer n, however, then this template is equivalent to the simpler one with ρ1(x) replaced by true, resulting in fewer predicates to infer. For instance, in the mc_91 example from Sect. 1, it is obvious that mc_91(n) never fails, as its body contains no assertions and only calls to itself. Thus, we can actually set ρ1(n) to true.
In practice we use effect analysis [23] to check whether a function can fail. To this end, we extend simple types to effect types defined by:

σ ::= int | σ1 →ξ σ2

where ξ is either the empty effect ϵ, or a failure f. The type σ1 →ξ σ2 describes functions that take an argument of type σ1 and return a value of type σ2, with a possible side effect ξ. We can infer these effect types using a standard effect inference algorithm [23]. A function with effect type int →ϵ σ takes an integer as input and returns a value of type σ without effect, i.e., without failure. For such a function, we then use the simpler refinement type template {x : int | true} → ⋯ instead of {x : int | ρ(x)} → ⋯. For example, since mc_91 has effect type int →ϵ int, we assign the template (x : int) → {r : int | ρ(x, r)} for the refinement type of mc_91.
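A crude approximation of this analysis (a hypothetical sketch, not the type-based inference of [23]) computes, by a fixpoint over the call graph, which functions can transitively reach fail:

```python
def may_fail(bodies):
    """bodies maps each function name to (contains_fail, set of callees).
    Returns the set of functions that can transitively reach `fail`."""
    failing = {f for f, (has_fail, _) in bodies.items() if has_fail}
    changed = True
    while changed:
        changed = False
        for f, (_, callees) in bodies.items():
            # a function fails if it directly fails or calls a failing one
            if f not in failing and callees & failing:
                failing.add(f)
                changed = True
    return failing

program = {
    "mc_91": (False, {"mc_91"}),  # only calls itself, no assertion
    "main":  (True,  {"mc_91"}),  # contains an assertion, hence a fail
}
assert may_fail(program) == {"main"}
```

Since mc_91 is not in the failing set, its precondition template can be set to true.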

Modified ICE Framework
This section discusses our modified Ice framework tackling the predicate synthesis problem extracted from the input functional program as detailed in Sect. 2. Algorithm 1 details how the teacher supervises the learning process. Following the original Ice approach, teacher and learner only communicate by exchanging guesses for the predicates (from the latter to the former) and positive (P), negative (N), and implication (I) data (from the former to the latter). These three sets of learning data are incrementally populated as long as the verification conditions are falsifiable, as discussed below.

Algorithm 1: Teacher supervising the learning process.
Input: the set VC of verification conditions with predicate variables ρ1, …, ρn
Result: concrete predicates for ρ1, …, ρn for which VC is valid

Teacher
We now describe our modified version of the Ice teacher that, given some candidate predicates for ρ1, …, ρn, returns learning data if the verification conditions instantiated on the candidates are falsifiable. Since there are several predicates to discover, the positive, negative and implication learning data (concrete values) will always be annotated with the predicate(s) concerned. Now, all the constraints from the verification condition set VC have one of the following shapes, reminiscent of the original Ice's (1)-(3) from Sect. 1:

C ∧ α1 ∧ ⋯ ∧ αm ⊨ αm+1 (10)
C ∧ α1 ∧ ⋯ ∧ αm+1 ⊨ false (11)

where each α1, …, αm+1 is an application of one of the ρ1, …, ρn to variables of the program, and C is a concrete formula ranging over the variables of the program. In the following, we write ρ(αi) for the predicate αi is an application of. To illustrate, recall constraint (7) of the example from Fig. 1: it has the same shape as (10), with ρ(α1) = ρ1, ρ(α2) = ρ(α3) = ρ(α4) = ρ2, and C being n ≤ 100. Given some guesses p1, …, pn for the predicates ρ1, …, ρn, the teacher can check whether VC[ρ1 := p1, …, ρn := pn] (the verification conditions obtained from VC by substituting pi for each ρi) is falsifiable using an SMT solver. If it is, then function extract_data (Algorithm 1 line 4) extracts new learning data as follows. If a verification condition with shape (10) and m = 0 can be falsified, then we extract some values x̃ from the model produced by the solver. This constitutes a positive example (ρ(α1), x̃), since ρ(α1) should evaluate to true for x̃.
From a counterexample model for a verification condition of the form (11), we extract a negative constraint {(ρ(α1), x̃1), …, (ρ(αm+1), x̃m+1)}: at least one of the ρ(αi) should evaluate to false on the corresponding x̃i. Last, an implication constraint comes from a counterexample model for a verification condition of shape (10) with m > 0, and is a pair ({(ρ(α1), x̃1), …, (ρ(αm), x̃m)}, (ρ(αm+1), x̃m+1)). Similarly to the original Ice implication constraints, this constraint means that if ρ(αi)(x̃i) holds for all 1 ≤ i ≤ m, then ρ(αm+1)(x̃m+1) must hold. Those positive examples, negative constraints, and implication constraints are accumulated in P, N, and I, respectively, in Algorithm 1.
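On finite domains, the teacher's data extraction can be mimicked by brute force. The sketch below (hypothetical helper names; a real teacher queries an SMT solver) hard-codes clause (7) with ρ1 := true and extracts a generalized implication constraint from a wrong guess:

```python
def falsify_clause7(rho2, domain):
    """Search for a counterexample to clause (7) with rho1 := true:
        n <= 100 /\ rho2(n + 11, tmp) /\ rho2(tmp, r)  |=  rho2(n, r).
    On failure, return the generalized implication constraint."""
    for n in domain:
        for tmp in domain:
            for r in domain:
                if n <= 100 and rho2(n + 11, tmp) and rho2(tmp, r) \
                        and not rho2(n, r):
                    # ({(n + 11, tmp), (tmp, r)}, (n, r)), tagged with rho2
                    return ([(n + 11, tmp), (tmp, r)], (n, r))
    return None  # no counterexample on the explored domain

# A wrong first guess: rho2(n, r) iff r = n - 10.
bad = lambda n, r: r == n - 10
constraint = falsify_clause7(bad, range(80, 120))
assert constraint is not None
```

The correct guess from Sect. 1 admits no counterexample on this domain, so the teacher would move on to the remaining clauses.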

Remark 1 Note that negative examples and implication constraints in the original Ice framework are special cases of the negative constraints and implication constraints above. A negative example of the original Ice is just a singleton set {(ρ(α1), x̃1)}, and an implication constraint of Ice is a special case of the implication constraint where m = 1. Due to the generalization of learning data, negative constraints also contain unclassified data (unless they are singletons).

Learner: Building Candidates
We now start describing the learning part of our approach, which is an adaptation of the decision tree construction procedure from the original Ice framework [14]. The main difference is that the unclassified data can also contain values from negative constraints, as explained in Remark 1. This impacts decision tree construction as we now need to make sure the negative constraints are respected, in addition to checking that the implication constraints hold. Also, we adapted the qualifier selection heuristic (discussed in Sect. 3.4) to fit our context.
The learner takes, in addition to learning data (P, N , and I), a mapping quals from predicate variables to sets of qualifiers as input. The learner then tries to find solutions for Horn clauses as Boolean combinations of qualifiers, by running Algorithm 2, as explained below. For the moment, we assume that the qualifier mapping quals is given a priori; how to find it is discussed in Sect. 3.5.
The learner needs to synthesize predicates for the variables ρ 1 , . . . , ρ n that respect the learning data. To do so, the learning data is projected on the different predicates and partially classified in the class mapping (in Algorithm 2 lines 1-5) following the semantics of the learning data given in Sect. 3.1. Notice the way each element of N is classified depending on whether it only has one predicate/values pair, as a consequence of Remark 1. The algorithm also maintains a partial classification class (line 3) of the data from unc. This mapping encodes the choices made on the unclassified data: if (ρ, x) → true (resp. false), then a previous choice forced (ρ, x) to be considered a positive (resp. negative) example.
It then calls build_tree (Algorithm 3) for each unknown predicate ρ, to construct a decision tree that encodes a candidate solution for ρ. A tree T is defined by

T ::= Leaf(b) | Node(q, T1, T2)

where b is a Boolean constant and q a qualifier. The formula it corresponds to is given by the function f, defined inductively by

f(Leaf(b)) = b
f(Node(q, T1, T2)) = (q ∧ f(T1)) ∨ (¬q ∧ f(T2))

Algorithm 3 shows the decision tree construction process for a given predicate ρ. We now discuss the algorithm formally and will illustrate it on an example in Sect. 3.3. Building a decision tree consists in choosing qualifiers splitting the learning data until there is no negative (positive) data left and the unclassified data can be classified as positive (negative) in each branch. The main difference with the tree construction from the original Ice framework is that the classification checks now take into account the negative constraints introduced earlier.
Qualifier selection is discussed separately in Sect. 3.4.
Function can_be_pos checks whether all the unclassified data can be classified as positive. This consists in making sure that the negative and implication constraints are either already verified or still contain unclassified data, meaning future choices are able to (and will) verify the constraints. More precisely, given unclassified data U, constraint sets N and I, and the classifier mapping class, can_be_pos checks that classifying every u ∈ U as positive leaves each constraint satisfiable; the checks are phrased in terms of class(n) ≃ b, meaning that class(n) is unknown or equal to b. Conversely, function can_be_neg checks that all the unclassified data can be classified as negative. The next section unfolds this algorithm on a simple example. While we did not specify the order in which the trees are constructed (Algorithm 2 line 10), it can impact performance greatly because the classification choices influence later runs of build_tree. Hence, it is better to treat the predicates that have the least amount of unclassified data first. Doing so directs the choices of the qualifier q (Algorithm 3 line 8, discussed below) on as much classified data as possible. The data is then split (lines 9 and 10) using q: more classified data thus means more informed splits, leading to more relevant classifications of unclassified data in the terminal cases of the decision tree construction.
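The overall shape of the construction can be sketched as follows (a toy version for a single predicate with plain positive/negative samples only; the real algorithm additionally threads negative and implication constraints through the classification of unclassified data):

```python
def build_tree(quals, pos, neg):
    """Recursively choose a qualifier splitting the samples until a
    branch contains only positive or only negative data.
    Assumes some qualifier always splits mixed data."""
    if not neg:
        return ("leaf", True)
    if not pos:
        return ("leaf", False)
    data = pos + neg
    # keep only qualifiers that actually split the remaining data
    useful = [q for q in quals if 0 < sum(map(q, data)) < len(data)]
    # pick the qualifier that best separates positives from negatives
    q = max(useful, key=lambda q: abs(sum(map(q, pos)) - sum(map(q, neg))))
    rest = [p for p in useful if p is not q]
    return ("node", q,
            build_tree(rest, [x for x in pos if q(x)],
                             [x for x in neg if q(x)]),
            build_tree(rest, [x for x in pos if not q(x)],
                             [x for x in neg if not q(x)]))

def evaluate(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, q, t_pos, t_neg = tree
    return evaluate(t_pos if q(x) else t_neg, x)

quals = [lambda x: x > 0, lambda x: x > 100]
tree = build_tree(quals, pos=[1, 5, 50], neg=[-3, -7, 200])
assert all(evaluate(tree, x) for x in [1, 5, 50])
assert not any(evaluate(tree, x) for x in [-3, -7, 200])
```

The resulting tree encodes the candidate formula (x > 0) ∧ ¬(x > 100) for this data.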

Learner: Example
We now illustrate the decision tree building process discussed above using the 91 function from Fig. 1, with verification conditions (4)-(9). Again, say that we realize that ρ 1 (n) = true, so that we only need to synthesize ρ 2 (n, r ). Suppose that the learner is called with

Qualifier Selection in Algorithm 3
We now discuss how to choose the qualifier q ∈ Q on line 8 of Algorithm 3. The choice of the qualifier q used to split the learning data D = (P, N, U) into D_q = (P_q, N_q, U_q) and D_¬q = (P_¬q, N_¬q, U_¬q) is crucial. In [14], the authors introduce two heuristics based on the notion of Shannon entropy ε:

ε(D) = −p log2(p) − (1 − p) log2(1 − p), where p = |P| / (|P| + |N|)

which yields a value between 0 and 1. This entropy rates the ratio of positive and negative examples: it gets close to 1 when |P| and |N| are close. A small entropy is preferred, as it indicates that the data contains significantly more of one than the other. The information gain γ of a split is

γ(q) = ε(D) − (|D_q| / |D|) ε(D_q) − (|D_¬q| / |D|) ε(D_¬q)

where |D| = |P| + |N| for D = (P, N, U). A high information gain means q separates the positive examples from the negative ones. Note that the information gain ignores unclassified data, a shortcoming the Ice framework [14] addresses by proposing two qualifier selection heuristics. The first subtracts a penalty from the information gain: it penalizes qualifiers separating data coming from the same implication constraint, called cutting the implication. The second heuristic changes the definition of entropy by introducing a function approximating the probability that a non-classified example will eventually be classified as positive. We present here our adaptation of this second heuristic, as it is much more natural to transfer to our use-case. The idea is to create a function Pr that approximates the probability that some values from the projected learning data D = (P, N, U) end up classified as positive. More precisely, Pr(v) approximates the ratio between the number of legal (constraint-abiding) classifications in which v is classified positively and the number of all legal classifications. Computing this ratio for the whole data is impractical: it falls in the class of counting problems and is #P-complete [1]. The approximation we propose uses a notion of degree whose three terms are based on the following remarks.
Let v be some value in the projected learning data. If (x̄, v) ∈ I, there is only one classification of x̄ that forces v to be true: the one where all the elements of x̄ are classified positively. More elements in x̄ generally mean more legal classifications where one of them is false and v need not be true: Pr(v) should be higher if x̄ has few elements. If v appears in the antecedents of a constraint (x̄, y), then Pr(v) should be lower. Still, if x̄ has many elements it means v is less constrained: there are statistically more classifications in which v is true without triggering the implication, and thus more legal classifications where v is true. Last, if v appears in a negative constraint x̄, then it is less likely to be true. Again, a bigger x̄ means v is less constrained, since there are statistically more legal classifications where v is true.
Our Pr function compresses the degree into a value between 0 and 1, and we define a new multi-predicate-friendly entropy function based on Pr to compute the information gain for D = (P, N, U). Note that it can happen that none of the qualifiers can split the data, i.e., there is no qualifier left or they all have an information gain of 0. In this case we synthesize qualifiers that we know will split the data, as described in the next subsection.
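The classified-data-only quantities can be written down directly; a sketch with the standard definitions (the Pr-based entropy, which also weighs unclassified data, is omitted):

```python
from math import log2

def entropy(pos, neg):
    """Shannon entropy of a set of classified samples, in [0, 1]."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0  # pure data: minimal entropy
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def gain(pos, neg, split):
    """Information gain of splitting (pos, neg) into
    ((pos_q, neg_q), (pos_nq, neg_nq)); unclassified data is ignored."""
    (pq, nq), (pn, nn) = split
    total = pos + neg
    return (entropy(pos, neg)
            - (pq + nq) / total * entropy(pq, nq)
            - (pn + nn) / total * entropy(pn, nn))

assert entropy(5, 5) == 1.0   # balanced data: maximal entropy
assert entropy(10, 0) == 0.0  # pure data: minimal entropy
# a perfect split has information gain equal to the initial entropy
assert gain(5, 5, ((5, 0), (0, 5))) == 1.0
```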

Mining and Synthesizing Qualifiers
We now discuss how to prepare the set Q of qualifiers used in Algorithm 3. The learner, in both the original Ice approach and our modified version, spends a lot of time evaluating qualifiers. Having too many of them slows down the learning process considerably, while not considering enough of them reduces the expressiveness of the candidates. The compromise we propose is to i) mine a few qualifiers from the clauses, and ii) synthesize (possibly many) qualifiers when needed, driven by the data we need to split.
To mine for qualifiers, for every clause C and for every predicate application of the form ρ( v) in C, we add every atomic predicate a in C as a qualifier for ρ as long as all the free variables of a are in v. All the other qualifiers are synthesized during the analysis.
Based on our experience, we have chosen the following synthesis strategy. With v1, …, vn the formal inputs of ρ, for all (x1, …, xn) ∈ P ∪ N ∪ U, we generate a set of new qualifiers comparing each vi with the corresponding sample value xi. Adding these qualifiers allows us to split the data on these (strict, when negated) inequalities, and to encode (dis)equalities by combining them in the decision tree. Also, notice that when no qualifier can split the data, we have in general small P, N and U sets, and the number of new qualifiers is quite tractable. The learning process is an iterative one where relatively few new samples are added at each step, compared to the set of all samples. Since we could split the samples from the previous iteration, it is very often the case that P, N and U contain mostly new samples. Last, our approach shares the limitation of the original Ice: it will not succeed if a particular relation between the variables is needed to conclude, but no qualifier of the right shape is ever mined or synthesized.
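A plausible instance of such data-driven synthesis (hypothetical; the paper's exact family of generated qualifiers may differ) generates per-coordinate bounds from each sample point:

```python
def synthesize_qualifiers(samples):
    """From each sample point (x1, ..., xn), generate the qualifiers
    v_i <= x_i and v_i >= x_i for every coordinate i, so that each
    sample can be separated from the others along some coordinate."""
    quals = set()
    for point in samples:
        for i, x in enumerate(point):
            quals.add(("le", i, x))  # qualifier: v_i <= x
            quals.add(("ge", i, x))  # qualifier: v_i >= x
    return quals

def holds(qual, point):
    kind, i, c = qual
    return point[i] <= c if kind == "le" else point[i] >= c

quals = synthesize_qualifiers([(100, 91), (102, 92)])
# some synthesized qualifier separates the two sample points
assert any(holds(q, (100, 91)) != holds(q, (102, 92)) for q in quals)
```

Negating a non-strict inequality yields a strict one, which is how disequalities and equalities are recovered by combining nodes in the tree.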

Experimental Evaluation
Our implementation consists of two parts. First, RType is a frontend (written in OCaml) generating Horn clauses from programs written in a subset of OCaml, as discussed in Sect. 2. It relies on an external Horn clause solver for actually solving the clauses, and post-processes the solution (if any) to yield refinement types for the original program. HoIce, written in Rust, is one such Horn clause solver and implements the modified Ice framework presented in this paper. All experiments in this section use RType v1.0 and HoIce v1.0. Under the hood, HoIce relies on the Z3 SMT solver [10] for satisfiability checks. In the following experiments, RType uses HoIce as the Horn clause solver. Note that the input OCaml programs are not annotated: the Horn clauses correspond to the verification conditions encoding the fact that the input program cannot falsify its assertion(s). RType supports a subset of OCaml including (mutually) recursive functions and integers, without algebraic data types.
We now report on our experimental evaluation. Our benchmark suite of 162 programs includes the programs from [26,37] in the fragment RType supports, along with programs automatically generated by the termination verification tool from [20], and 10 new benchmarks written by ourselves. We only considered programs that are safe, since RType is not refutation-sound. These benchmarks range from very simple to relatively complex, including in particular a program computing a solution to the N-Queens problem using arrays. The verification challenge for this program is to prove that it performs no out-of-bound array accesses. Here are a few statistics on the number of lines in the original OCaml programs, and the number of predicates and clauses in the Horn clause problems. Note that the Horn clause data was generated from the encoding discussed in Sect. 2, including optimizations.
All the experiments presented in this section ran on a machine running Ubuntu (Xeon E5-2680v3, 64GB of RAM) with a timeout of 100 seconds. The number between parentheses in the keys of the graphs is the number of benchmarks solved. We begin by evaluating the optimizations discussed in Sect. 2, followed by a comparison against automated verification tools for OCaml programs. Last, we evaluate our predicate synthesis engine against other Horn-clause-level solvers. Figure 6a shows our evaluation of the effect analysis (EA) and clause reduction (Red) simplifications discussed in Sect. 2. It is clear that both effect analysis and Horn reduction speed up the learning process significantly. They work especially well together and can drastically reduce the number of predicates on relatively big synthesis problems, as shown on Fig. 6c.

Evaluation of the Optimizations
The 11 programs that we fail to verify show inherent limitations of our approach. Two of them require an invariant of the form x + y ≥ z. Our current compromise for qualifier mining and synthesis (Sect. 3.5) does not consider such qualifiers unless they appear explicitly in the program. We are currently investigating how to alter our qualifier synthesis approach to raise its expressiveness with a reasonable impact on performance. The remaining nine programs are not typable with refinement types, meaning that the verification conditions generated by RType are actually unsatisfiable. An extension of the type system is required to prove these programs correct [32].

Comparison with Other OCaml Program Verifiers
The first tool we compare RType to is the higher-order program verifier MoCHi from [26] (Fig. 6b). MoCHi infers intersection types, which makes it more expressive than RType. The nine programs that MoCHi proves but RType cannot verify are the (refinement-)untypable ones discussed above. While this shows a clear advantage of intersection types over our approach in terms of expressiveness, the rest of the experiments make it clear that, when applicable, RType outperforms MoCHi on a significant part of our benchmarks.
We also evaluated our implementation against DOrder from [37,38]. This comparison is interesting because DOrder also uses machine learning to infer refinement types, but does not support implication constraints. DOrder compensates by conducting test runs of the program on random inputs to gather better positive data. It supports a different subset of OCaml than RType, though; after removing the programs it does not support, 124 programs are left. The results are given on Fig. 6d and show that RType overwhelmingly outperforms DOrder. This is consistent with the results reported for the original Ice framework: the benefit gained by considering implication constraints is huge.
These results show that, despite its limitations, our approach is competitive and often outperforms other state-of-the-art automated verification tools for OCaml programs.

Horn-Clause-Level Evaluation
Last, we compare our Horn clause solver HoIce to other solvers (Fig. 7): Spacer [18], Duality [21], Z3's PDR [15], and Eldarica [16]. The first three are implemented in Z3 (C++), while Eldarica is implemented in Scala. The benchmarks are the Horn clauses encoding the safety of the 162 programs mentioned above, plus two additional programs that were omitted from the previous evaluation because they are unsafe.
HoIce solves the most benchmarks, at 162. The fastest tool overall is Spacer, which solves slightly fewer benchmarks. The two timeouts for HoIce come from the programs discussed above, for which HoIce does not have the appropriate qualifiers to conclude. Because it mixes IC3 [5] with interpolation, Spacer infers the right predicates quite quickly. Thus, in our use-case, our approach is competitive with state-of-the-art Horn clause solvers in terms of speed, in addition to being more precise. We also include a comparison with Spacer on the SV-COMP benchmarks on Fig. 7b. HoIce is generally competitive, but times out on a significant part of the benchmarks. Quite a few of them are unsatisfiable, and the Ice framework is not made to be efficient at proving unsatisfiability (Fig. 7 shows the comparison with Horn clause solvers). The rest of the timeouts require qualifiers that we neither mine for nor synthesize, showing that more work is needed on this aspect of the approach. In our experience, it is often the case that HoIce's models are significantly simpler than Spacer's and PDR's (as illustrated in "Appendix"). Note that simple models are useful if the Horn clause solver is placed inside a CEGAR loop such as the one in MoCHi [26]; indeed, Sato et al. [25] have recently employed CHC solving as a backend of MoCHi, and observed that HoIce was more effective than Spacer as the backend CHC solver in that context.

Algebraic Data Types
This section discusses how to extend our approach to deal with functional programs that manipulate algebraic data types (ADTs) such as lists and trees. We assume here that ADTs do not contain functions; we do not consider, for example, a list of functions. As we discuss below, most of our framework need not be changed, including verification condition generation and the main procedures for the teacher and the learner, as long as the backend SMT solver supports ADTs; the main new issue is how to find/synthesize appropriate qualifiers. Below, after briefly discussing the overall extension of our framework in Sect. 5.1, we explain our approach to qualifier synthesis in Sect. 5.2. We then report preliminary experiments on the extension in Sect. 5.3.

Overview
As mentioned above, except for the fact that a new method is required for qualifier discovery, our framework can be smoothly extended to deal with ADTs; this is an advantage of our ICE-based approach. Below we sketch the extension of each component through the following running example. The function ins takes an integer i and an integer list lst as arguments, and returns a list obtained by inserting i into lst.
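The definition of ins appears as a figure in the original; a standard insertion function matching this description would look as follows (our reconstruction, assuming insertion into a sorted list, so the exact program may differ):

```ocaml
(* A plausible definition of the running example: insert i into the
   (sorted) list lst.  Note that the result is always non-empty. *)
let rec ins (i : int) (lst : int list) : int list =
  match lst with
  | [] -> [i]
  | hd :: tl -> if i <= hd then i :: hd :: tl else hd :: ins i tl
```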

Refinement Types and Verification Condition Generation
Under the assumption that ADTs do not contain functions, the simplest way to extend the refinement type system and verification condition generation of Sect. 2 is to treat ADTs just like ordinary base types. The syntax of refinement types is extended by:

T (refinement types) ::= {x : A | a} | (x : T_1) → T_2
A (base and algebraic data types) ::= int | int list | · · ·

Here, the set of expressions ranged over by a is extended to allow operations on algebraic data types. No change is required on the verification condition generation procedure of Fig. 4 (except for the extension of the syntax of types and expressions).
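As a concrete illustration, the extended syntax can be rendered as an OCaml datatype; the constructor names below are ours, not RType's:

```ocaml
(* Illustrative AST for the extended refinement types:
     T ::= {x : A | a} | (x : T1) -> T2
     A ::= int | int list | ...
   Refinement predicates a are kept as plain text for simplicity. *)
type base_ty = TInt | TIntList

type rty =
  | Refined of string * base_ty * string   (* {x : A | a} *)
  | Arrow of string * rty * rty            (* (x : T1) -> T2 *)
```

For instance, a (trivially refined) type template for the running example ins would be an Arrow node for i, nested inside an Arrow node for lst, ending in a Refined node over int list.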
For the running example above, we obtain the following refinement type templates and verification conditions.
Teacher Assuming that the backend SMT solver supports ADTs, the teacher procedure described in Sect. 3.1 can be used as it is. For the example above, given a candidate solution ρ_ins(x, l, r) ≡ r ≠ [], the teacher just needs to check that the candidate satisfies the verification conditions by using the SMT solver.

Learner The learner procedure described in Sect. 3.2 can also be used as it is. In fact, for the example above, the learner can easily find the valid candidate solution ρ_ins(x, l, r) ≡ r ≠ [] by using equality constraints as qualifiers.
The remaining issue is how to find appropriate qualifiers. In Sect. 3.5, we have discussed how to mine and synthesize qualifiers on integers. That is not sufficient for programs manipulating ADTs, like the following one. From that example, we obtain the following refinement type templates and verification conditions.
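The program referred to above also appears as a figure in the original. Judging from the clauses gathered for ρ_sorted in the next subsection, it plausibly contains a sortedness check of the following shape (our reconstruction, not necessarily the exact code):

```ocaml
(* A sortedness check whose clauses match those gathered for ρ_sorted:
   the empty and singleton lists are sorted; a longer list is sorted iff
   its first two elements are ordered and its tail is sorted. *)
let rec sorted (lst : int list) : bool =
  match lst with
  | [] | [_] -> true
  | hd1 :: (hd2 :: _ as tl) -> hd1 <= hd2 && sorted tl
```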

Qualifier Synthesis for ADTs
Our approach to qualifier synthesis is, given a set of CHCs representing verification conditions, to extract (possibly recursive) functions that take ADTs as input and return base-type values, and to allow them to be used in the qualifier mining and synthesis discussed in Sect. 3.5 (thus, v_i in Sect. 3.5 may now be a function application f(v_1, . . . , v_n)). Let us explain this through the last example. We first gather the CHCs on ρ_sorted:

lst = [] ⟹ ρ_sorted(lst, true)
tl = [] ⟹ ρ_sorted(hd::tl, true)
hd_1 > hd_2 ⟹ ρ_sorted(hd_1::hd_2::tl, false)
hd_1 ≤ hd_2 ∧ ρ_sorted(hd_2::tl, res) ⟹ ρ_sorted(hd_1::hd_2::tl, res)

We turn them into the following function definition for the underlying SMT solver.
(define-fun-rec sorted ((lst Intlist)) Bool
  (ite (or (= lst nil) (= (tl lst) nil))
    true
    (ite (> (hd lst) (hd (tl lst)))
      false
      (sorted (tl lst)))))

We call this process function reconstruction; it will be explained later. The learner may now use, as qualifiers, arithmetic constraints involving the function sorted above, and may return a candidate solution built from them. The teacher can verify it as a valid solution, as long as the underlying SMT solver can properly deal with recursive functions (and indeed, Z3 can verify the validity of such a solution instantly).

Remark 2
In the example above, we picked ρ_sorted for function reconstruction. Our criterion for a predicate ρ to be eligible for function reconstruction is that it only appears in i) negative clauses and ii) clauses that only mention ρ (called the defining clauses of ρ). When there is more than one candidate predicate, we heuristically choose one with the simplest signature (lowest arity and lowest number of ADT-valued parameters) and complexity (lowest number of non-negative clauses mentioning ρ).
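This heuristic can be sketched as follows (illustrative OCaml with made-up record fields, not HoIce's internals): rank the candidates lexicographically by arity, number of ADT-valued parameters, and number of defining clauses, and pick the smallest.

```ocaml
(* Sketch of the selection heuristic of Remark 2: rank candidate
   predicates by signature simplicity, then by definition complexity. *)
type candidate = {
  name : string;
  arity : int;
  adt_params : int;        (* number of ADT-valued parameters *)
  defining_clauses : int;  (* non-negative clauses mentioning the predicate *)
}

let choose (cs : candidate list) : candidate option =
  let key c = (c.arity, c.adt_params, c.defining_clauses) in
  match List.sort (fun a b -> compare (key a) (key b)) cs with
  | [] -> None
  | best :: _ -> Some best
```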

Function Reconstruction
We now explain how to reconstruct a function definition from the defining clauses of a predicate ρ. We can assume that each of the defining clauses is of the following form, where C is a conjunction of atomic constraints without predicate variables. We first eliminate the variables other than x_1, . . . , x_k, y_1, . . . , y_ℓ in a heuristic manner. For example, if C contains x = z_1::z_2, we eliminate z_1 and z_2 by adding is-cons(x) and replacing z_1 and z_2 with hd(x) and tl(x), respectively. If the variable elimination fails, then we give up the function reconstruction for ρ.
Using this variable elimination, the defining clauses are normalized. For the sake of simplicity, we assume below that C_1, . . . , C_n do not contain variables other than x. We now check that i) C_1, . . . , C_n are mutually exclusive and exhaustive, and ii) the term t_{i,j} contains none of the variables y_j, . . . , y_{ℓ_i}. We then construct the function definition accordingly. Finally, we check that the above definition makes f_ρ total by requiring that C_i implies t_{i,j} < x with respect to a certain well-founded order <.

Remark 3
Our approach above relies on the assumption that the underlying SMT solver can effectively reason about recursive functions. We have used Z3 and applied some optimizations to help it reason about recursive functions. Omitting solver-specific tweaks, the most rewarding optimization was to add an invariant discovery step at the end of function reconstruction. The length function over lists, for instance, without the invariant that its output is always non-negative, can be a problem for solvers even on relatively simple queries. This invariant discovery step generates candidate invariants based on the signature and definition of the function, which it attempts to prove by checking that they are preserved in each branch of the definition. If some invariant inv is discovered for a function f, then whenever the definition of f is given to a solver, it is followed by the assertion ∀v, inv(f(v)).
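The following OCaml sketch illustrates the idea on the length example (the names and the sampling-based check are ours; the actual implementation proves candidates branch by branch with the SMT solver rather than testing them):

```ocaml
(* Illustration of the invariant-discovery step on a reconstructed
   length function: candidate invariants over the integer output are
   generated from the signature, then checked against sample inputs.
   A real implementation would instead prove them branch by branch. *)
let rec length = function [] -> 0 | _ :: tl -> 1 + length tl

let candidate_invariants =
  [ ("non-negative", fun n -> n >= 0);
    ("positive", fun n -> n > 0) ]

(* Keep the candidate invariants that hold on all sample inputs. *)
let discovered samples =
  List.filter
    (fun (_, inv) -> List.for_all (fun l -> inv (length l)) samples)
    candidate_invariants
```

Here only "non-negative" survives (the empty list refutes "positive"), and it is exactly the invariant that helps the solver on queries involving length.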

Evaluation
While the work on ADTs presented in this section is still at an early stage, we implemented and evaluated it against Spacer [18] and the recent work [9], where the authors reduce Horn clause problems over ADTs to equisatisfiable problems over basic sorts. Unfortunately, as far as we know, this latter approach does not always allow producing models for the predicates when the problem is satisfiable, while Spacer and our approach do. Also, we were not able to retrieve a binary for the implementation of [9], called ECZ3. The results presented on Fig. 8 for ECZ3 are those reported in [9], where the experiments ran on a different machine. As a consequence, the runtimes reported below are not comparable, and readers should focus on the number of benchmarks solved. This evaluation uses the set of benchmarks from [9]. The two versions of our implementation presented, HoIce and "HoIce no rec", run with and without function reconstruction respectively. The difference in precision is quite noticeable, despite the fact that the approach and the implementation are still ongoing work. The benchmarks HoIce is not able to solve fail for various reasons and indicate future directions of research. In a few cases, the problem is that the underlying solver (Z3) returns unknown, at which point HoIce is forced to give up. In other cases, the solver does not return in reasonable time on a query. This often happens in the teacher while checking a valid candidate: all the corresponding queries are then unsatisfiable, which the solver struggles to prove. In a few other cases, the functions reconstructed are not enough for HoIce to reach a conclusion, and it keeps trying to find a model forever.
ECZ3 from [9] is by far the best in terms of precision, with the drawback that models for the original predicates are not available. Spacer yields performance similar to HoIce without function reconstruction, which suggests that it would also benefit from function reconstruction. In fact, we ran Spacer on a handful of problems in which we manually forced the definitions given by function reconstruction, and Spacer was able to solve the modified versions. This is a good indication that the approach we suggest in this section extends beyond sampling- and template-based techniques such as our generalized Ice framework.
Last, for the sake of reproducibility, we should mention that we used Z3 version 4.7.1 both as HoIce's underlying SMT solver and in the evaluation of Spacer. More recent versions of Z3 (4.8.* at the time of writing) seem far less efficient when it comes to dealing with ADTs and recursive functions. Running HoIce with Z3 4.8.* on the benchmarks mentioned in this section yields a huge number of timeouts and unknown results (meaning that Z3 cannot answer one of the teacher's queries).

Related Work
There has been a lot of work on sampling-based approaches to program invariant discovery during the last decade [13,14,27-29,36-38]. Among others, most closely related to this paper are Garg et al.'s Ice framework [13,14] (which this paper extends) and Zhu et al.'s refinement type inference methods [36-38]. To the best of our knowledge, Zhu et al. [36-38] were the first to apply a sampling-based approach to refinement type inference for higher-order functional programs. They did not, however, consider implication constraints. As discussed in Sect. 4, their tool fails to verify some programs due to the lack of implication constraints.
There are other automated/semi-automated methods for the verification of higher-order functional programs [17,24,30-32,34,37,38], based on combinations of Horn clause solving, automated theorem proving, counterexample-guided abstraction refinement, (higher-order) model checking, etc. As a representative of such methods, we have chosen MoCHi and compared our tool with it in Sect. 4. As the experimental results indicate, our tool often outperforms MoCHi, although not always. Thus, we think that our learning-based approach is complementary to the aforementioned ones; a good integration of our approach with them is left for future work. Liquid types [24], another representative approach, is semi-automated in that users have to provide qualifiers as hints. By preparing a fixed, default set of qualifiers, Liquid types may also be used as an automated method. From that viewpoint, the main advantage of our approach is that we can infer arbitrary Boolean combinations of qualifiers as refinement predicates, whereas Liquid types can infer only conjunctions of qualifiers. On the downside, since we synthesize (a potentially infinite number of) qualifiers, and build candidates which are arbitrary Boolean combinations of these qualifiers, our approach has no guarantee of termination, unlike Liquid types.
Since the publication of our original paper, on which this extended version is based, at least two related approaches to Horn clause solving have been published [11,35]. Both rely on ideas similar to ours: produce candidates for the predicates based on data accumulated by refuting previous candidates. The main difference with our work is that neither uses implication constraints, which we believe to be important for learning inductive invariants. Also, they seem to mainly target verification problems stemming from imperative programs, while our approach was designed with functional program verification in mind.

Conclusion
In this paper we proposed an adaptation of the machine-learning-based invariant discovery framework Ice to refinement type inference. The main challenge was that implication constraints and negative examples were ill-suited for solving Horn clauses of the form ρ(x_1) ∧ · · · ∧ ρ(x_n) ∧ . . . ⟹ ρ(x), which tend to appear often in our context of functional program verification because of nested recursive calls.
We addressed this issue by generalizing Ice's notion of implication constraint. For similar reasons, we also adapted negative examples by turning them into negative constraints. This means that, unlike in the original Ice framework, our learner might have to make classification choices to respect the negative learning data. We have introduced a modified version of the Ice framework accounting for these adaptations, and have implemented it, along with optimizations based on effect analysis. Our evaluation on a representative set of programs shows that it is competitive with state-of-the-art OCaml model-checkers and Horn clause solvers.
We also reported on preliminary work on adapting our approach to algebraic data types by reconstructing, when relevant, functions that the framework can leverage to build useful qualifiers. The evaluation of our prototype implementation shows that doing so is rewarding but, in its current state, fails to outperform a recent technique that encodes Horn clause problems over ADTs as ADT-free Horn clauses, with the drawback of not being able to generate models for the original predicates when the problem is satisfiable.