1 Introduction

The Vampire prover [15] has supported higher-order reasoning since 2019 [6]. Until recently, this support was via a translation from higher-order logic (HOL) to polymorphic first-order logic using combinators. The approach had its benefits; in particular, it avoided the need for higher-order unification. However, our experience suggested that, for problems requiring complex unifiers, the approach was not competitive with calculi that do rely on higher-order unification. This intuition was supported by results at the CASC system competition [25].

For this reason, we recently devised an entirely new higher-order superposition calculus, this time based on a standard presentation of HOL. The key idea behind our calculus is that rather than using full higher-order unification, we use a depth-bounded version. That is, when searching for higher-order unifiers, once a predefined number of projection and imitation steps has taken place, the search is backtracked. The crucial difference between our approach and similar approaches is that rather than failing on reaching the depth limit, we turn the set of remaining unification pairs into negative constraint literals, which are returned along with the substitution formed up to that point. This is similar to recent developments in the field of theory reasoning [5].

The new calculus has now been implemented in Vampire along with a dedicated strategy schedule. Together, these developments propelled Vampire to first place in the THF division of the 2023 edition of the CASC competition. As the completeness of the calculus is an open question on which we are working, we have not published a description of the calculus to date.

In this paper, we describe the calculus, discuss its implementation in Vampire and also provide some details of the strategy schedule and its formation.

2 Preliminaries

We assume familiarity with higher-order logic and higher-order unification. Detailed presentations of these can be found in recent literature [2, 4, 29].

We work with a rank-1 polymorphic, clausal, higher-order logic. For the syntax of the logic we follow a more-or-less standard presentation such as that of Bentkamp et al. [2]. Higher-order applications such as \( f \, a \, c\) contain subterms with no first-order equivalents, such as f and \(f \, a\). We refer to these as prefix subterms. We represent term variables with x, y, z, function symbols with f, g, h, and terms with s and t. To keep the presentation simple, we omit typing information from our terms.

A substitution is a mapping of variables to terms. Unification is the process of finding a substitution \(\sigma \) for terms \(t_1\) and \(t_2\) such that \(t_1\sigma \approx t_2\sigma \) for some definition of equality (\(\approx \)) of interest. It is well known that first-order syntactic unification is decidable and that unique most general unifiers exist. In the higher-order case, unification is not decidable, and the set of incomparable unifiers is potentially infinite. A commonly used higher-order unification procedure for enumerating unifiers is Huet’s preunification routine [13]. Unlike full higher-order unification, preunification does not attempt to unify terms if both have variable head symbols. As a consequence, preunification does not require infinitely branching rules [29].

The two main rules that extend first-order unification in Huet’s procedure are projection and imitation. We provide a flavour of these via an example. Consider unifying terms \(s = x \, a\) and \(s' = a\). In searching for a suitable instantiation of the variable x, we can either attempt to copy the head symbol of \(s'\) leading to the substitution \(x \rightarrow \lambda y. \, a\), or we can bring one of x’s arguments to the head position leading to the substitution \(x \rightarrow \lambda y. \, y\). The first is known as imitation and the second as projection.

We use the concept of a depth\(_n\) unifier. We do not define the term formally, but provide an intuitive understanding. Consider a higher-order preunification algorithm. Any substitution formed by following a path of the unification tree, starting from the root, that contains exactly n imitation and projection steps, or reaches a leaf using fewer than n such steps, is a depth\(_n\) unifier. For terms s and t, let \(U_n(s, t)\) be the set of all depth\(_n\) unifiers of s and t. Note that this set is finite as we are assuming preunification and hence the tree is finitely branching.

For terms s and t, for each depth\(_n\) unifier \(\sigma \in U_n(s, t)\), we associate a set of negative equality literals \(C_\sigma \) formed by turning the unification pairs that remain when the depth limit is reached into negative equalities. In the case \(\sigma \) is an actual unifier of s and t, \(C_\sigma \) is of course the empty set.

To make this clearer, consider the unification tree presented in Fig. 1. There are two depth\(_2\) unifiers labelled \(\sigma _1\) and \(\sigma _2\) in the figure. Related to these, we have \(C_{\sigma _1} = C_{\sigma _2} = \{ x_2 \, a \, b \, \not \approx b \}\). There are four depth\(_3\) unifiers (not shown in the figure) and zero depth\(_n\) unifiers for \(n > 3\).

Fig. 1. Unification tree for terms \(x \, a \, b\) and \(f \, b \, a\)
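
To make the interplay between depth-bounded unification and constraint generation concrete, the following Python sketch enumerates depth-bounded unifiers together with their leftover constraint pairs for the terms of Fig. 1. It is a deliberately simplified illustration, not Vampire's implementation: terms are untyped, input terms contain no lambdas, and projection simply returns an argument.

from itertools import count

# Toy applicative terms in spine form: a term is (head, args), where head is
# ('C', f) for a function symbol or constant, or ('V', x) for a free variable,
# and args is a tuple of terms.
fresh = count(1)

def apply_bindings(t, binds):
    # binds maps a variable name to a Python function from argument tuples to
    # terms, standing in for the lambda term the variable is bound to.
    (kind, name), args = t
    args = tuple(apply_bindings(arg, binds) for arg in args)
    if kind == 'V' and name in binds:
        return apply_bindings(binds[name](args), binds)
    return ((kind, name), args)

def unify(pairs, binds, steps, depth):
    """Yield (list of binding descriptions, leftover constraint pairs) for
    every depth-bounded unifier of the given unification pairs."""
    pairs = [(apply_bindings(s, binds), apply_bindings(t, binds)) for s, t in pairs]
    for i, (s, t) in enumerate(pairs):
        (skind, shead), sargs = s
        (tkind, thead), targs = t
        rest = pairs[:i] + pairs[i + 1:]
        if s == t:                                  # delete trivial pairs
            yield from unify(rest, binds, steps, depth)
            return
        if skind == 'C' and tkind == 'C':           # rigid-rigid: decompose
            if shead == thead and len(sargs) == len(targs):
                yield from unify(rest + list(zip(sargs, targs)), binds, steps, depth)
            return                                  # (head clash: no unifier)
        if skind == 'C':                            # orient the pair flex-rigid
            s, t = t, s
            (skind, shead), sargs = s
            (tkind, thead), targs = t
        if tkind == 'C':                            # flex-rigid pair
            if depth == 0:                          # budget exhausted: remaining
                yield steps, pairs                  # pairs become constraints
                return
            # Imitation: shead -> lambda y1..yn. thead (z1 ys) ... (zm ys)
            zs = [f'z{next(fresh)}' for _ in targs]
            imit = lambda a, h=thead, zs=zs: (('C', h), tuple((('V', z), a) for z in zs))
            yield from unify(pairs, {**binds, shead: imit},
                             steps + [f'{shead} imitates {thead}'], depth - 1)
            # Projection: shead -> lambda y1..yn. yk (arguments of base type only)
            for k in range(len(sargs)):
                proj = lambda a, k=k: a[k]
                yield from unify(pairs, {**binds, shead: proj},
                                 steps + [f'{shead} projects argument {k + 1}'], depth - 1)
            return
    yield steps, pairs                              # only flex-flex pairs remain

# Unify  x a b  with  f b a  up to depth 2, as in Fig. 1.
a, b = (('C', 'a'), ()), (('C', 'b'), ())
for steps, constraints in unify([((('V', 'x'), (a, b)), (('C', 'f'), (b, a)))], {}, [], 2):
    print(steps, '; constraints:', constraints)

Run at depth 2, the sketch reports two unifiers, each accompanied by one leftover flex-rigid pair, mirroring the two depth\(_2\) unifiers of Fig. 1; at depth 3 it reports four unifiers with no leftover constraints.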

3 Calculus

Our calculus is parameterised by a selection function and an ordering \(\succ \). Together these give rise to the concept of literals being (strictly) \(\succ \)-eligible with respect to a substitution \(\sigma \) [2]. When discussing eligibility we drop \(\succ \) and \(\sigma \) and rely on the context to make these clear. We call a literal \(s \not \approx t\), where both s and t have variable heads, a flex-flex literal. Such a literal is never selected in the calculus. We present the primary inference rule, Sup, below.

figure a
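
A plausible rendering of the rule, assuming that the constraint literals \(C_\sigma \) associated with \(\sigma \) (see Sect. 2) are simply appended to the usual superposition conclusion, is:

$$\begin{aligned} \frac{C \vee t \approx t' \qquad D \vee s \langle \, u \, \rangle \, \dot{\approx } \,s'}{(C \vee D \vee s \langle \, t' \, \rangle \, \dot{\approx } \,s')\sigma \vee C_\sigma } \ \textsf {Sup} \end{aligned}$$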

In the rule above, we use \(\, \dot{\approx } \,\) to denote either a positive or negative equality. We use \(s \langle \, u \, \rangle \) to denote that u is a first-order subterm of s. That is, a non-prefix subterm that is not below a lambda. The side conditions of the inference are \(\sigma \in U_n(t, u)\), u is not a variable, \(t \approx t'\) is strictly eligible in the left premise, \(s \langle \, u \, \rangle \, \dot{\approx } \,s'\) is eligible in the right premise, and the other standard ordering conditions. The remaining core inference rules are EqRes and EqFact.

figure b

For both rules, \(\sigma \in U_n(t, s)\). For EqFact, \(s \approx s'\) is eligible in the premise and for EqRes \(s \not \approx s'\) is eligible. We also include the inferences ArgCong (see [3]) and FlexFlexSimp, which derives the empty clause, \(\bot \), from a clause containing only flex-flex literals.

figure c
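
A plausible rendering of these two rules, with ArgCong following the formulation of Bentkamp et al. [3] and each \(l_i\) below a flex-flex literal, is:

$$\begin{aligned} \frac{C \vee s \approx s'}{C\sigma \vee s\sigma \, x \approx s'\sigma \, x} \ \textsf {ArgCong} \qquad \frac{l_1 \vee \ldots \vee l_k}{\bot } \ \textsf {FlexFlexSimp} \end{aligned}$$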

For ArgCong, \(s \approx s'\) is eligible in the premise, \(\sigma \) is the type unifier of s and \(s'\), and x is a fresh variable. In our implementation, the depth parameter n is set via a user option. In the case it is set to 0, the following pair of inferences is added to the calculus.

figure d
figure e
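
A plausible rendering of the two rules, assuming the standard Huet imitation and projection bindings and that the selected literal is retained (to be unified further) in the conclusion, is:

$$\begin{aligned} \frac{C \vee x \, \overline{s}_n \not \approx f \, \overline{t}_m}{(C \vee x \, \overline{s}_n \not \approx f \, \overline{t}_m)\sigma } \ \textsf {Imitate} \quad \text {where } \sigma = \{ x \rightarrow \lambda \overline{y}_n. \, f \, (z_1 \, \overline{y}_n) \ldots (z_m \, \overline{y}_n) \} \end{aligned}$$

$$\begin{aligned} \frac{C \vee x \, \overline{s}_n \not \approx f \, \overline{t}_m}{(C \vee x \, \overline{s}_n \not \approx f \, \overline{t}_m)\sigma } \ \textsf {Project} \quad \text {where } \sigma = \{ x \rightarrow \lambda \overline{y}_n. \, y_i \, (z_1 \, \overline{y}_n) \ldots (z_p \, \overline{y}_n) \} \end{aligned}$$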

Where j ranges from 1 to m in Imitate and 1 to p in Project, and each \(z_j\) is a fresh variable. The literals \(x \, \overline{s}_n \not \approx f \, \overline{t}_m\) are eligible in the premises and p is the arity of \(y_i\), the projected variable. The idea behind introducing these rules is to facilitate the instantiation of head variables with suitable lambda terms when this is not being done as part of unification. Our intuition is that by intertwining the unification and calculus rules in the spirit of the EP calculus [21], the need for explosive rules (such as FluidSup [2]) that simulate superposition underneath variables is removed. The examples we present below support this intuition.

Besides the core inference rules, the calculus has a set of rules to handle reasoning about Boolean terms. These are similar to rules discussed in the literature [20, 30]. Extensionality is supported either via an axiom or by using unification with abstraction as described by Bhayat [4]. Similarly, Hilbert choice can be supported via a lightweight inference in the manner of Leo-III [20] or via the addition of the Skolemized choice axiom. The calculus also contains various well-known simplification rules such as Demodulation and Subsumption.

Soundness and Completeness. The soundness of the calculus described above is relatively straightforward to show. On the other hand, the completeness of the calculus with respect to Henkin semantics is an open question. We hypothesise that given the right ordering, and with tuning of inference side conditions, the depth\(_0\) variant of the calculus (with the Imitate and Project rules) is refutationally complete. A proof is unlikely to be straightforward due to the fact that we do not select flex-flex literals.

Example 1

Consider the following unsatisfiable clause set. Assume a depth of 1. Selected literals are underlined.

$$\begin{aligned} C = \underline{x \, a \, b \not \approx f \, b \, a} \vee x \, c \, d \not \approx f \, b \, a \end{aligned}$$

An EqRes binds x to \(\lambda y, z. \, f \, (x_1 \, y \, z) \, (x_2 \, y \, z)\) and results in \(C_1 = \underline{f \, (x_1 \, a \, b) (x_2 \, a \, b)}\) \(\underline{\not \approx f \, b \, a} \vee f \, (x_1 \, c \, d) (x_2 \, c \, d) \not \approx f \, b \, a\). An EqRes on \(C_1\) binds \(x_1\) to \(\lambda y, z. b\) and results in \(C_2 = \underline{x_2 \, a \, b \not \approx a} \vee f \, b \, (x_2 \, c \, d) \not \approx f \, b \, a\). A final EqRes on \(C_2\) binds \(x_2\) to \(\lambda y, z. a\) and results in \(\underline{f \, b \, a \not \approx f \, b \, a}\) from which it is trivial to obtain the empty clause \(\bot \).

Example 2

(Example 1 of Bentkamp et al. [3]). Consider the following unsatisfiable clause set. Assume the depth\(_0\) version of the calculus.

$$\begin{aligned} C_1 = f \, a \approx c \qquad C _2 = h \, (y \, b) \, (y \, a) \not \approx h \, (g \, (f \, b)) \, (g \, c) \end{aligned}$$

An EqRes inference on \(C_2\) results in \(C_3 = y \, b \not \approx g \, (f \, b) \vee y \, a \not \approx g \, c\). An Imitate inference on the first literal of \(C_3\) followed by the application of the substitution and some \(\beta \)-reduction results in \(C_4 = g \, (z \, b) \not \approx g \, (f \, b) \vee g \, (z \, a) \not \approx g \, c\). A further double application of EqRes gives us \(C_5 = z \, b \not \approx f \, b \vee z \, a \not \approx c\). We again carry out Imitate on the first literal followed by an EqRes to leave us with \(C_6 = x \, b \not \approx b \vee f \, (x \, a) \not \approx c\). We can now carry out a Sup inference between \(C_1\) and \(C_6\) resulting in \(C_7 = x \, b \not \approx b \vee c \not \approx c \vee x \, a \not \approx a\) from which it is simple to derive \(\bot \) via an application of Project on either the first or the third literal. Note that the empty clause was derived without the need for an inference that simulates superposition underneath variables, unlike in [3].

4 Implementation

The calculus described above, along with a dedicated strategy schedule, has been implemented in the Vampire theorem prover. Vampire natively supports rank-1 polymorphic first-order logic. Therefore, we translate higher-order terms into polymorphic first-order terms using the well-known applicative encoding. Note that we use the symbol \(\mapsto \), in a first-order type, to separate the argument types from the return type. It should not be confused with the binary, higher-order function type constructor \(\rightarrow \) that we assume to be in the type signature. Application is represented by a polymorphic symbol \(app : \mathsf {\Pi }\alpha _1, \alpha _2. \, ((\alpha _1 \rightarrow \alpha _2) \times \alpha _1) \mapsto \alpha _2\).

Lambda terms are stored internally using De Bruijn indices. A lambda is represented by a polymorphic symbol \(lam : \mathsf {\Pi }\alpha _1, \alpha _2.\, \alpha _2 \mapsto (\alpha _1 \rightarrow \alpha _2)\). De Bruijn indices are represented by a family of polymorphic symbols \(d_i : \mathsf {\Pi }\alpha . \, \alpha \) for \( i \in \mathbb {N}\). Thus, the term \(\lambda x : \tau . \, x\) is represented internally as \(lam(\tau , \tau , d_0(\tau ))\). The term \(\lambda x. \, f (\lambda z. x)\) is represented internally (now ignoring type arguments) as \(lam(app(f, lam(d_1)))\).
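
As an illustration of this representation, the following simplified Python sketch (ignoring type arguments, as in the examples above; it is not Vampire's actual data structure) translates a named lambda term into the internal applicative De Bruijn form:

# A named term is ('var', x), ('const', f), ('app', s, t) or ('lam', x, body).
# The internal form uses ('const', f), ('db', i) for the De Bruijn index d_i,
# ('app', s, t) for the app symbol and ('lam', body) for the lam symbol;
# free variables are kept as ('var', x).

def to_internal(term, bound=()):
    """Translate a named lambda term into the applicative De Bruijn form."""
    kind = term[0]
    if kind == 'var':
        name = term[1]
        if name in bound:
            return ('db', bound.index(name))      # distance to the binder
        return ('var', name)                      # free variables stay as is
    if kind == 'const':
        return term
    if kind == 'app':
        return ('app', to_internal(term[1], bound), to_internal(term[2], bound))
    if kind == 'lam':
        _, name, body = term
        return ('lam', to_internal(body, (name,) + bound))
    raise ValueError(term)

# lambda x. f (lambda z. x)  becomes  lam(app(f, lam(d_1))), as in the text.
print(to_internal(('lam', 'x', ('app', ('const', 'f'), ('lam', 'z', ('var', 'x'))))))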

Some of the most important options available are: hol_unif_depth, which controls the depth to which unification proceeds; func_ext, which controls how function extensionality is handled; cnf_on_the_fly, which controls how eager or lazy the clausification algorithm is; and applicative_unif, which replaces higher-order unification with (applicative) first-order unification and is surprisingly helpful in some cases. Besides the options listed above, there are many other higher-order specific options, as well as options that impact both higher-order and first-order reasoning. These options can be viewed by building Vampire and running it with --help.

5 Strategies and the Schedule

We generally followed the Spider [27] methodology for strategy discovery and schedule creation. This starts with randomly sampling strategies to solve as-yet unsolved problems (or to improve the best known time for problems already known to be solvable). Each newly discovered strategy is optimized with local search to work even better on the single problem which it just solved. This is done by trying out alternative values for each option, possibly in several rounds. A variant of the strategy that improves the solution time, or at least uses a default value of an option, is preferred. The final strategy is then evaluated on the totality of all considered problems and the process repeats.

In our case, we sought strategies to cover the 3914 TH0 problems of the TPTP library [24], version 8.1.2. The strategy space consisted of 87 options inherited from first-order Vampire and 26 dedicated higher-order options. To sample a random strategy, we considered each option separately and picked its value based on a guess of how useful each value is. (E.g., for applicative_unif we used the relative frequencies on: 3, off: 10.) During the strategy discovery process we adapted the maximum running time per problem several times, both for the random probes and for the final strategy evaluation: from the order of \({1}\,\text {s}\) up to \({100}\,\text {s}\). In total, we collected 1158 strategies over the course of approximately two weeks of continuous 60-core CPU computation. The strategies cover 2804 unsatisfiable problems, including 50 problems of TPTP rating 1.0 (which means these problems had not officially been solved by an ATP before).
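
As a rough illustration of the sampling and local-improvement steps described above (a simplified sketch: apart from applicative_unif, whose frequencies are quoted above, the option values, frequencies and the evaluation callback are invented for the example):

import random

# Hypothetical relative frequencies of option values; only applicative_unif
# and hol_unif_depth are real Vampire options named in this paper, and the
# numbers for hol_unif_depth are made up for illustration.
OPTION_FREQS = {
    'applicative_unif': {'on': 3, 'off': 10},
    'hol_unif_depth': {'0': 2, '1': 5, '2': 5, '4': 2},
}

def sample_strategy():
    """Pick each option value independently, weighted by its frequency."""
    strategy = {}
    for opt, freqs in OPTION_FREQS.items():
        values, weights = zip(*freqs.items())
        strategy[opt] = random.choices(values, weights=weights)[0]
    return strategy

def local_search(strategy, improves):
    """One round of local improvement: for each option in turn, try the
    alternative values and keep a change whenever the (hypothetical)
    evaluation callback reports an improvement on the problem at hand."""
    for opt, freqs in OPTION_FREQS.items():
        for value in freqs:
            candidate = {**strategy, opt: value}
            if improves(candidate, strategy):
                strategy = candidate
    return strategy

print(local_search(sample_strategy(), lambda new, old: False))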

Once a sufficiently rich set of strategies has been discovered and evaluated, schedule building can be posed as a constraint programming task in which one seeks to allot time slices to individual strategies so as to cover as many problems as possible while not exceeding a given overall time bound T [12, 19]. We had good experience with a weighted set cover formulation and a greedy algorithm [9]: starting from an empty schedule, at any point we decide to extend it by scheduling a strategy S for an additional t units of time if this step is currently the best among all possible strategy extensions in terms of “the number of problems that will additionally get covered divided by t”. This greedy approach does not guarantee an optimal result, but it runs in polynomial time and gives a meaningful answer uniformly for any overall time bound T (see [8] for more details).
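
A minimal sketch of this greedy selection, assuming as input a table of which problems each strategy solves within a few candidate slice lengths (the sketch simplifies matters by scheduling each strategy at most once, rather than also extending already scheduled strategies):

def greedy_schedule(covers, budget):
    """covers[strategy][t] is the set of problems the strategy solves within
    t seconds.  Greedily pick (strategy, t) slices maximising newly covered
    problems per second, while staying within the overall time budget."""
    schedule, covered, used = [], set(), 0
    while True:
        best = None                       # (score, strategy, t, newly covered)
        for strat, by_time in covers.items():
            for t, solved in by_time.items():
                new = solved - covered
                if not new or used + t > budget:
                    continue
                score = len(new) / t      # problems gained per second of schedule
                if best is None or score > best[0]:
                    best = (score, strat, t, new)
        if best is None:
            return schedule
        _, strat, t, new = best
        schedule.append((strat, t))
        covered |= new
        used += t

# Toy example: three strategies, candidate slices of 1 or 5 seconds.
covers = {
    's1': {1: {'p1', 'p2'}, 5: {'p1', 'p2', 'p3'}},
    's2': {1: {'p3'}, 5: {'p3', 'p4'}},
    's3': {1: {'p5'}, 5: {'p5'}},
}
print(greedy_schedule(covers, budget=7))   # prints [('s1', 1), ('s2', 1), ('s3', 1)]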

Our final higher-order schedule tries to cover, in this greedy sense, as many problems as possible at several increasing time bounds: starting from the \({1}\,\text {s}\), \({5}\,\text {s}\), and \({10}\,\text {s}\) bounds relevant for impatient users, all the way up to the CASC limit of 16 min (2 min on 8 cores) and beyond. In the end, it makes use of 278 of the 1158 available strategies and manages to cover all the known-to-be-solvable problems in a bit less than 1 h of single-core computation. We stress that our final schedule is a single monolithic sequence and does not branch on any problem characteristics or features.

Table 1. The most important options in terms of contribution to problem coverage

Most Important Options: In Table 1, we list the top five options, sorted in descending order of “how many problems we would not be able to cover if the given option could not be varied in strategies” (in other words, as if the listed default value were “wired in” to the prover code).

Based on existing research [28], it is unsurprising to see that varying clausification has a large impact. The same holds for varying the unification depth. What is perhaps more surprising is that replacing higher-order unification with applicative first-order unification can be beneficial. equality_to_equiv turns equality between Boolean terms into equivalence before the original clausification pass is carried out. The effectiveness of this option is also somewhat surprising.

Table 2. Number of problems solved by a single good higher-order strategy and our schedule at various time limit cutoffs. Run on the 3914 TH0 TPTP problems

Performance Statistics: It has long been known [26, 31] that a strategy schedule can improve over the performance of a single good strategy by a large margin. Table 2 confirms this phenomenon in our case. For this comparison, we selected one of the best performing single strategies (at the \({60}\,\text {s}\) time limit mark) that we had previously evaluated. From the higher-order perspective, the strategy is interesting for setting hol_unif_depth to 4 and supporting choice reasoning via an inference rule (choice_reasoning on).

Although our schedule has been developed on (and for) the TH0 TPTP problems, it helps the new higher-order Vampire solve more problems of other origins too. Of the Sledgehammer problems exported by Desharnais et al. in their most recent prover comparison [10], namely the 5000 problems denoted in their work TH0\(^-\), Vampire can now solve 2425, compared to 2179 obtained by Desharnais et al. with the previous Vampire version (both under \({30}\,\text {s}\) per problem).

We remark that we also developed a different schedule, specifically adapted to Sledgehammer problems (in various TPTP dialects, i.e., not just TH0), which has been available to Isabelle [16] users since the September 2023 release.

6 Related Work

The idea to intertwine superposition and unification appears in earlier work, particularly in the EP calculus implemented in Leo-III [21]. The main differences between our calculus and EP are:

1. We do not move first-order unification to the calculus level. Hence, there are no equivalents to the Triv, Bind and Decomp rules of EP.

2. Our Project and Imitate rules are instances of EP’s FlexRigid rule. We do not include an equivalent to EP’s FlexFlex rule since we never select flex-flex literals. Instead, we leave such literals until one of the head variables becomes instantiated, or the clause contains only flex-flex literals, at which point FlexFlexSimp can be applied.

3. Our core inference rules are parameterised by a selection function and an ordering.

4. Whilst EP always applies unification lazily, our calculus can control how lazily unification is carried out by varying the depth bound.

We also incorporate more recent work on higher-order superposition, mainly from the Matryoshka project [2, 28]. Of course, the use of constraints in automated reasoning extends far beyond the realm of higher-order logic. Constraints have been researched in the context of theory reasoning [14, 18] and basic superposition [1].

7 Conclusion

In this paper, we have presented a new higher-order superposition calculus and discussed its implementation in Vampire. We have also described the newly created higher-order strategy schedule. The combination of calculus, implementation and schedule has already proven effective. However, we believe that there is ample room for further exploration and improvement. On the theoretical side, we wish to prove refutational completeness of the calculus (or a variant thereof). On the practical side, we wish to refine the implementation, most notably by adding further simplification rules.