
1 Introduction

In the last few decades, proof assistants have become indispensable tools for developing trustworthy formal proofs. They are used both in academia to verify mathematical theories [17] and in industry to verify the correctness of hardware [21] and software [16, 22, 24]. However, due to the lack of strong built-in proof automation, proving seemingly simple goals can be a tedious manual task. To mitigate this, many proof assistants include a subsystem such as CoqHammer, HOL(y)Hammer, or Sledgehammer [9] that translates higher-order goals to first-order logic and passes them to efficient first-order automatic provers. If a first-order prover succeeds, the proof is reconstructed and the goal is closed.

Unfortunately, the translation of higher-order constructs is clumsy and leads to poor performance on goals that require higher-order reasoning. Using native higher-order provers such as Satallax [10] as backends is not always a good solution because they are much less efficient than their first-order counterparts [37]. To bridge this gap, in 2016 we proposed to develop a new generation of higher-order provers that extend the arguably most successful first-order calculus, superposition, to higher-order logic, starting from a position of strength.

Our research has focused on three milestones: supporting \(\lambda \)-free higher-order logic, adding \(\lambda \)-terms, and adding first-class Boolean terms. In 2019, we extended the state-of-the-art first-order prover E [32] with a \(\lambda \)-free superposition calculus [42], obtaining a version of E called Ehoh, as a stepping stone towards full higher-order logic. Together with Bentkamp, Tourret, and Waldmann, we have since developed calculi, called \(\lambda \)-superposition, corresponding to the other two milestones [4, 5] and implemented them in the experimental superposition prover Zipperposition [14]. This OCaml prover is not nearly as efficient as E. Nevertheless, it has won the higher-order division of the CASC prover competition [39] in 2020, 2021, and 2022, ending nearly a decade of Satallax domination.

We now fulfill a four-year-old promise: We present the extension of Ehoh to full higher-order logic (Sect. 2) based on incomplete variants of \(\lambda \)-superposition. We call this prover \(\lambda \)E. In \(\lambda \)E’s implementation, we used our extensive experience with Zipperposition to choose a set of effective rules that could easily be retrofitted into an originally first-order prover. Another guiding principle was gracefulness: Our changes should not impact the strong first-order performance of E and Ehoh.

One of the main challenges we faced was retrofitting \(\lambda \)-terms into Ehoh’s term representation (Sect. 3). Furthermore, Ehoh’s inference engine assumes that inferences compute a most general unifier. We implemented a higher-order unification procedure [41] that can return multiple unifiers (Sect. 4) and integrated it into the inference engine. Finally, we extended and adapted the superposition rule, resulting in an incomplete, pragmatic variant of \(\lambda \)-superposition (Sect. 5).

We evaluated \(\lambda \)E on a selection of proof assistant benchmarks as well as all higher-order theorems in the TPTP library [38] (Sect. 6). \(\lambda \)E outperformed all other higher-order provers on the proof assistant benchmarks; on the TPTP benchmarks, it ended up second only to the cooperative version of Zipperposition, which employs Ehoh as a backend. An arguably fairer comparison without the backend puts \(\lambda \)E in first place for both benchmark suites. We also compared the performance of \(\lambda \)E with E on first-order problems and found that the extension to higher-order logic introduces no overhead.

\(\lambda \)E is part of the E prover’s development repository and will be part of E 3.0. It can be enabled by passing the option --enable-ho to the configure script. E and \(\lambda \)E’s source code is freely available online.

2 Logic

Our target logic is monomorphic classical higher-order logic with Hilbert choice. The following text is partly based on Vukmirović et al. [40, Sect. 2].

Terms \(s, t, u, v\) are inductively defined as free variables \(F, X, \ldots \), bound variables \(x, y, z, \ldots \), constants \(\textsf{f}, \textsf{g}, \textsf{a}, \textsf{b}, \ldots \), applications \(s \, t\), and \(\lambda \)-abstractions \(\lambda x.\> s\). Bound variables may be loose (e.g., y in \(\lambda x.\> y \, \textsf{a}\)) [27].

We let \(s \, \overline{t}_n\) stand for \(s \, t_1 \, \ldots \, t_n\) and \(\lambda \overline{x}_n.\> s\) for \(\lambda x_1. \ldots \lambda x_n. \> s\). Every \(\beta \)-normal term can be written as \(\lambda \overline{x}_m.\> s \, \overline{t}_n\), where s is not an application; we call s the head of the term. If s is a free variable, we call the term flex; otherwise, the term is rigid. A term of type o, where o is the distinguished Boolean type, is called a formula. A term whose type is of the form \(\tau _1 \rightarrow \cdots \rightarrow \tau _n \rightarrow o\) is called a predicate. Logical symbols are part of the signature and may thus occur within terms. We write them in bold: \(\pmb \bot , \pmb \top , \pmb {\lnot }, \pmb \wedge , \pmb \vee , \pmb \rightarrow , \pmb \leftrightarrow , \pmb \forall , \pmb \exists , \pmb \approx \).

On top of the terms, we define some clausal structure. This structure is needed by \(\lambda \)-superposition. A literal l is an equation \(s \approx t\) or a disequation \(s \not \approx t\). A clause is a finite multiset of literals, interpreted and written disjunctively: \(l_1 \mathrel \vee \cdots \mathrel \vee l_n\).

3 Terms

E is designed around perfect term sharing [25], a principle that we kept in Ehoh and \(\lambda \)E: Any two structurally identical terms are guaranteed to be the same object in memory. This is achieved through term cells, which represent individual terms. Each cell has (among other fields) (1) f_code, an integer corresponding to the symbol at the head of the term (negative if the head is a free variable, positive otherwise); (2) num_args, corresponding to the number of arguments applied to the head; and (3) args, an array of size num_args of pointers to argument terms. We use the notation \(\textsf{f}(s_1, \ldots , s_n)\) to denote a cell whose f_code corresponds to \(\textsf{f}\), num_args equals n, and args points to the cells for \(s_1, \ldots , s_n\).
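As a rough illustration, the following C sketch shows how such a cell could be declared. The field names f_code, num_args, args, properties, and binding are the ones discussed in this paper; the concrete types and everything else are our simplifying assumptions, not E’s actual definitions.

```c
typedef long FunCode; /* negative: free variable; positive: symbol */

/* Sketch of a perfectly shared term cell. Only the fields discussed
   in this paper are shown; E's actual TermCell contains more. */
typedef struct termcell {
    FunCode           f_code;     /* symbol at the head of the term  */
    int               num_args;   /* arguments applied to the head   */
    struct termcell **args;       /* num_args pointers to arguments  */
    unsigned int      properties; /* precomputed property bits       */
    struct termcell  *binding;    /* substitution slot for variables */
} TermCell;
```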

Like Leo-III [33, Sect. 4.8], Ehoh represents \(\lambda \)-free higher-order terms using a flattened, spine notation [12]. Thus, the terms \(\textsf{f}\), \(\textsf{f} \, \textsf{a}\), and \(\textsf{f} \, \textsf{a} \, \textsf{b}\) are represented by the cells \(\textsf{f}\), \(\textsf{f}(\textsf{a})\), and \(\textsf{f}(\textsf{a}, \textsf{b})\). To ensure that free variables are perfectly shared, Ehoh treats applied free variables differently: Arguments are not applied directly to a free variable, but using a distinguished symbol \(\texttt {@}\) of variable arity. For example, the term \(X \, \textsf{a} \, \textsf{b}\) is represented by the cell \(\texttt {@}(X, \textsf{a}, \textsf{b})\). This ensures that two different occurrences of the free variable X correspond to the same object, which makes substitutions more efficient [42].

Representation of \({\pmb {\lambda }}\)-Terms. To support full higher-order logic, Ehoh’s \(\lambda \)-free cell data structure must be extended to support the \(\lambda \) binder. We use the locally nameless representation [13]: De Bruijn indices represent (possibly loose) bound variables, whereas we keep the current representation for free variables.

Extending the term representation of Ehoh with a new term kind involves intricate manipulation of the cell data structure. De Bruijn indices must be represented like other cells with either a negative or a positive f_code, but the code must clearly identify that the cell is a De Bruijn index.

Apart from during \(\beta \)-reduction, De Bruijn indices mostly behave like constants. Therefore, we choose to represent De Bruijn indices using positive f_codes: The De Bruijn index i will have f_code i. To ensure that De Bruijn indices are not mistaken for function symbols, we use the cell’s properties bitfield, which holds precomputed properties. We introduce the property IsDBVar to denote that the cell represents a De Bruijn index. De Bruijn indices are systematically created through a dedicated function that sets the IsDBVar property. When given the same De Bruijn index and type, this function always returns the same object. Finally, we guard all the functions and macros that manipulate function codes to check if the property IsDBVar is set. To ensure perfect sharing of De Bruijn indices, arguments to De Bruijn indices are applied like for free variables, using \(\texttt {@}\).
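The following minimal sketch, reusing the TermCell declaration above, illustrates how such a property bit might be stored and queried; the bit position and macro names are illustrative, not E’s actual ones.

```c
#include <stdbool.h>

/* Illustrative property bit and query macro. */
#define IsDBVar (1u << 0)

#define TermCellQueryProp(t, p) (((t)->properties & (p)) != 0)

/* Since De Bruijn indices reuse positive f_codes, any code that
   interprets f_code as a function symbol must first rule out the
   De Bruijn case. */
static bool term_is_db_index(const TermCell *t)
{
    return TermCellQueryProp(t, IsDBVar);
}
```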

Extending cells to support \(\lambda \)-abstraction is easier. Each \(\lambda \)-abstraction has the distinguished function code \(\texttt {LAM}\) as the head symbol and two arguments: (1) a De Bruijn index 0 of the type of the abstracted variable; (2) the body of the \(\lambda \)-abstraction. Consider the term \(\lambda x. \, \lambda y.\, \textsf{f}\, x \, x\), where both x and y have the type \(\iota \). This term is represented as \(\lambda \,\lambda \, \textsf{f} \, \textbf{1} \, \textbf{1}\) in locally nameless representation, where bold numbers represent De Bruijn indices. In \(\lambda \)E, the same term is represented by the cell \(\texttt {LAM}(\textbf{0}, \texttt {LAM}(\textbf{0}, \textsf{f}(\textbf{1}, \textbf{1})))\), where all De Bruijn variables have type \(\iota \).

The first argument of \(\texttt {LAM}\) is redundant, since it can be deduced from the type of the \(\lambda \)-abstraction. However, basic \(\lambda \)-term manipulation operations often require access to this term. We store it explicitly to avoid creating it repeatedly.

Efficient \(\pmb {\beta }\)-Reduction. Terms are stored in \(\beta \eta \)-reduced form. As these two reductions are performed very often, they ought to be efficient. Ehoh performs \(\beta \)-reduction by reducing the leftmost outermost \(\beta \)-redex first. To represent \(\beta \)-redexes, E uses the \(\texttt {@}\) symbol. Thus, the term \((\lambda x.\, \lambda y.\, (x \, y)) \, \textsf{f} \, \textsf{a}\) is represented by \(\texttt {@}(\texttt {LAM}(\textbf{0}, \texttt {LAM}(\textbf{0}, \texttt {@}(\textbf{1}, \textbf{0}))), \textsf{f}, \textsf{a})\). Another option would have been to add arguments applied to \(\lambda \)-terms directly to the \(\lambda \) representation (as in \(\texttt {LAM}(\textbf{0}, \texttt {LAM}(\textbf{0}, \texttt {@}(\textbf{1}, \textbf{0})), \textsf{f}, \textsf{a})\)), but this would break the invariant that \(\texttt {LAM}\) has two arguments. Furthermore, replacing free variables with \(\lambda \)-abstractions (e.g., replacing X with \(\lambda x. \, x\) in \(\texttt {@}(X, \textsf{a})\)) would require additional normalization.

A term can be \(\beta \)-reduced as follows: When a cell \( \texttt {@}(\texttt {LAM}(\textbf{0}, s),t)\) is encountered, the field binding (normally used to record the substitution for a free variable) of the cell \(\textbf{0}\) is set to t. Then s is traversed, and every loose occurrence of \(\textbf{0}\) in s is replaced by binding, with binding’s loose De Bruijn indices shifted by the number of \(\lambda \) binders above the occurrence in s [20]. Next, this procedure is applied to the resulting term and its subterms, in leftmost outermost fashion.
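To make the index arithmetic concrete, here is a self-contained sketch of the substitution step on a simplified tree representation of locally nameless terms. It deliberately ignores E’s flattened cells, perfect sharing, and the binding field; it only illustrates the shifting described above.

```c
#include <stdlib.h>

/* Simplified locally nameless terms, for illustration only. */
typedef enum { DB, CONST, APP, LAMBDA } Tag;

typedef struct term {
    Tag          tag;
    int          idx;   /* DB: the index; CONST: a symbol code      */
    struct term *l, *r; /* APP: function and argument; LAMBDA: l is
                           the body, r is unused                    */
} Term;

static Term *mk(Tag tag, int idx, Term *l, Term *r)
{
    Term *t = malloc(sizeof *t);
    t->tag = tag; t->idx = idx; t->l = l; t->r = r;
    return t;
}

/* Shift loose De Bruijn indices >= cutoff up by d. */
static Term *shift(Term *t, int d, int cutoff)
{
    switch (t->tag) {
    case DB:    return t->idx >= cutoff ? mk(DB, t->idx + d, NULL, NULL) : t;
    case CONST: return t;
    case APP:   return mk(APP, 0, shift(t->l, d, cutoff),
                                  shift(t->r, d, cutoff));
    default:    return mk(LAMBDA, 0, shift(t->l, d, cutoff + 1), NULL);
    }
}

/* Replace the loose index j in s by u; beta-reducing (LAMBDA s) u
   amounts to subst(s, 0, u). Indices above j are lowered, since one
   binder disappears; u's loose indices are shifted by the number of
   binders crossed. */
static Term *subst(Term *s, int j, Term *u)
{
    switch (s->tag) {
    case DB:
        if (s->idx == j) return shift(u, j, 0);
        if (s->idx >  j) return mk(DB, s->idx - 1, NULL, NULL);
        return s;
    case CONST: return s;
    case APP:   return mk(APP, 0, subst(s->l, j, u), subst(s->r, j, u));
    default:    return mk(LAMBDA, 0, subst(s->l, j + 1, u), NULL);
    }
}
```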

\(\lambda \)E’s \(\beta \)-normalization works in this way, but it features a few optimizations. First, given a term of the form \((\lambda \overline{x}_n.\> s) \,\overline{t}_n\), \(\lambda \)E, like Leo-III [34], replaces the bound variables \(x_i\) with \(t_i\) in parallel. Avoiding the construction of intermediate terms reduces the number of recursive function calls and calls to the cell allocator.

Second, in line with the gracefulness principle, we want \(\lambda \)E to incur little (or no) overhead on first-order problems and to excel on higher-order problems with a large first-order component. If \(\beta \)-reduction is implemented naively, finding a \(\beta \)-redex involves traversing the entire term. On purely first-order terms, \(\beta \)-reduction is then a waste of time. To avoid this, we use Ehoh’s perfectly shared terms and their properties field. We introduce the property HasBetaReducibleSubterm, which is set if a cell is \(\beta \)-reducible. Whenever a new cell that contains a \(\beta \)-reducible term as a direct subterm is shared, the property is set. Setting of the property is inductively continued when further superterms are shared. For example, in the term \(t = \textsf{f} \, \textsf{a} \, (\textsf{g} ((\lambda x.\, x)\,\textsf{a}))\), the cells for \((\lambda x.\, x)\,\textsf{a}\), \(\textsf{g}\,((\lambda x.\, x)\,\textsf{a})\), and t itself have the property HasBetaReducibleSubterm set. When it needs to find \(\beta \)-reducible subterms, \(\lambda \)E will visit only the cells with this property set. This further means that on first-order subterms, a single bit masking operation is enough to determine that no subterm should be visited.
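The propagation can be pictured as follows; this sketch reuses the earlier TermCell declaration and property macros, and the codes for \(\texttt {@}\) and \(\texttt {LAM}\) are hypothetical placeholders.

```c
#define HasBetaReducibleSubterm (1u << 1)

extern const FunCode SIG_APP_CODE; /* hypothetical code for @   */
extern const FunCode SIG_LAM_CODE; /* hypothetical code for LAM */

/* Called once when a new cell is inserted into the sharing table:
   the property is set if the cell is itself a top-level beta-redex
   or if any direct subterm already carries the property. */
static void update_beta_property(TermCell *t)
{
    if (t->f_code == SIG_APP_CODE && t->num_args >= 2 &&
        t->args[0]->f_code == SIG_LAM_CODE)
        t->properties |= HasBetaReducibleSubterm;
    for (int i = 0; i < t->num_args; i++)
        if (TermCellQueryProp(t->args[i], HasBetaReducibleSubterm))
            t->properties |= HasBetaReducibleSubterm;
}
```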

Along similar lines, we introduce a property HasDBSubterm that caches whether the cell contains a De Bruijn subterm. This makes instantiating De Bruijn indices during \(\beta \)-normalization faster, since only the subterms that contain De Bruijn indices must be visited. Similarly, some other operations such as shifting De Bruijn indices or determining whether a term is closed (i.e., it contains no loose bound variables) can be sped up or even avoided if the term is first-order.

Efficient \(\pmb {\eta }\)-Reduction. The term \(\lambda x.\, s \, x\) is \(\eta \)-reduced to s whenever x does not occur unbound in s. Observing that a term cannot be \(\eta \)-reduced if it contains no \(\lambda \)-abstractions, we introduce a property HasLambda that notes the presence of \(\lambda \)’s in a term. Only terms with \(\lambda \)’s are visited during \(\eta \)-reduction.

\(\lambda \)E performs parallel \(\eta \)-reduction: It recognizes terms of the form \(\lambda \overline{x}_m.\> s \, \overline{x}_m \) such that none of the \(x_i\) occurs unbound in s. If done naively, reducing terms of this kind requires up to m traversals of s to check whether each \(x_i\) occurs in s. In \(\lambda \)E, exactly one traversal of s is required. More precisely, when \(\eta \)-reducing a cell \(\texttt {LAM}(\textbf{0}, s)\), \(\lambda \)E considers all \(\lambda \) binders in s as well. In general, the cell will be of the form \(\texttt {LAM}(\textbf{0}, \ldots , \texttt {LAM}(\textbf{0}, t) \ldots )\), where t is not a \(\lambda \)-abstraction and l is the number of \(\texttt {LAM}\) symbols above t. Then \(\lambda \)E breaks the body t down into a decomposition \(u \, (\textbf{n}{-}\textbf{1}) \, \ldots \, \textbf{1} \, \textbf{0}\), where u is not of the form \(\ldots \, \textbf{n}\); such a decomposition is unique. If \(n = 0\), the cell is not \(\eta \)-reducible. Otherwise, u is traversed to determine the minimal index j of a loose De Bruijn index, taking \(j = \infty \) if no such index exists. \(\lambda \)E can then remove the \(k = \min \{j,l,n\}\) innermost \(\lambda \) binders in \(\texttt {LAM}(\textbf{0}, \ldots , \texttt {LAM}(\textbf{0}, t) \ldots )\) and replace t by the variant of \(u \, (\textbf{n}{-}\textbf{1}) \, \ldots \, (\textbf{k}{+}\textbf{1}) \, \textbf{k}\) obtained by shifting the loose De Bruijn indices down by k.

To illustrate this convoluted De Bruijn arithmetic, we consider the term \(\lambda x. \, \lambda y. \, \lambda z.\, \textsf{f} \, x \, x \, y \, z\). This term is represented by the cell \(\texttt {LAM}(\textbf{0}, \texttt {LAM}(\textbf{0}, \texttt {LAM}(\textbf{0}, \textsf{f}(\textbf{2}, \textbf{2}, \textbf{1}, \textbf{0}))))\). \(\lambda \)E splits \(\textsf{f}(\textbf{2}, \textbf{2}, \textbf{1}, \textbf{0})\) into two parts: \(u = \textsf{f} \, \textbf{2}\) and the arguments \(\textbf{2}, \textbf{1}, \textbf{0}\). Since the minimal index in u is \(\textbf{2}\), we can omit the De Bruijn indices \(\textbf{1}\) and \(\textbf{0}\) and their \(\lambda \) binders, yielding the \(\eta \)-reduced cell \(\texttt {LAM}(\textbf{0}, \textsf{f}(\textbf{0}, \textbf{0}))\).

Parallel \(\eta \)-reduction both speeds up \(\eta \)-reduction and avoids creating intermediate terms. For finding the minimal loose De Bruijn index, optimizations such as the HasDBSubterm property are used.
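The decomposition at the heart of this algorithm can be sketched on the simplified Term type from the \(\beta \)-reduction example: one pass strips the maximal trailing argument block \((\textbf{n}{-}\textbf{1}) \, \ldots \, \textbf{1} \, \textbf{0}\), and a second pass over the remaining term u computes the minimal loose index j, from which \(k = \min \{j, l, n\}\) follows. The helper names are ours.

```c
#include <limits.h>

/* Strip the maximal trailing argument block (n-1) ... 1 0 from the
   body *t, returning n and leaving u in *t. */
static int strip_eta_args(Term **t)
{
    int n = 0;
    while ((*t)->tag == APP && (*t)->r->tag == DB && (*t)->r->idx == n) {
        *t = (*t)->l;   /* drop one trailing De Bruijn argument */
        n++;
    }
    return n;
}

/* Minimal loose De Bruijn index of t, or INT_MAX if t is closed. */
static int min_loose_index(const Term *t, int depth)
{
    switch (t->tag) {
    case DB:    return t->idx >= depth ? t->idx - depth : INT_MAX;
    case CONST: return INT_MAX;
    case APP: {
        int a = min_loose_index(t->l, depth);
        int b = min_loose_index(t->r, depth);
        return a < b ? a : b;
    }
    default:    return min_loose_index(t->l, depth + 1); /* LAMBDA */
    }
}
```

On the example above, strip_eta_args yields \(n = 3\) with \(u = \textsf{f} \, \textbf{2}\), and min_loose_index returns \(j = 2\), giving \(k = \min \{2, 3, 3\} = 2\), as described.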

Representation of Boolean Terms. E and Ehoh represent Boolean terms using cells whose f_codes are reserved for logical symbols. Quantified formulas are represented by cells in which the first argument is the quantified variable and the second one is the body of the quantified formula. For example, the term \(\pmb \forall x.\, \textsf{p} \, x\) corresponds to the cell \(\pmb \forall (X, \textsf{p}(X))\), where X is a free variable. This representation is convenient for parsing and clausification, which is what E and Ehoh use it for, but in full higher-order logic, it is problematic during proof search: Booleans can occur as subterms in clauses, as in \(\textsf{q}(X) \mathrel \vee \textsf{p}(\pmb \forall (X, \textsf{r}(X)))\), and instantiating X in the first literal should not affect X in the second literal.

To avoid this issue, in \(\lambda \)E we use \(\lambda \) binders to represent quantified formulas, as is customary in higher-order logic [1, §51]. Thus, \(\pmb \forall x. \, s\) is represented by \(\pmb \forall \, (\lambda x. \, s)\). Quantifiers are then unary symbols that do not directly bind the variables. Since \(\lambda \)E represents bound variables using De Bruijn indices, this solves all \(\alpha \)-conversion issues. However, this solution is incompatible with thousands of lines of decades-old clausification code that assume E’s representation of quantifiers. Therefore, \(\lambda \)E converts quantified formulas only after clausification, and only for Boolean terms that occur in a higher-order context (e.g., as argument to a function symbol).

New Term Orders. The \(\lambda \)-superposition calculus is parameterized by a term order that is used to break symmetries in the search space. We implemented the versions of the Knuth–Bendix order (KBO) and lexicographic path order (LPO) for higher-order terms described by Bentkamp et al. [4]. These orders encode \(\lambda \)-terms as first-order terms and then invoke the standard KBO or LPO. For efficiency, we implemented separate KBO and LPO functions that compute the order directly, intertwining the encoding and the order computation.

Ehoh cells contain a binding field that can be used to store the substitution for a free variable. Substitutions can then be applied by following the binding pointers, replacing each free variable with its instance. Thus, when Ehoh needs to perform a KBO or LPO comparison of an instantiated term, it needs only follow the binding pointers. In full higher-order logic, however, instantiating a variable can trigger a chain of \(\beta \eta \)-reductions, changing the shape of the term dramatically. To prevent this, \(\lambda \)E computes the \(\beta \eta \)-reduced instances of the terms before comparing them using KBO or LPO.

4 Unification, Matching, and Term Indexing

Standard superposition crucially depends on the concept of a most general unifier (MGU). In higher-order logic, the concept is replaced by that of a complete set of unifiers (CSU), which may be infinite. Vukmirović et al. [41] designed an efficient procedure to enumerate a CSU for a term pair. It is implemented in Zipperposition, together with some extensions to term indexing. In \(\lambda \)E, we further improve the performance of this procedure by implementing a terminating, incomplete variant. We also introduce a new indexing data structure.

The Unification Procedure. The unification procedure works by maintaining a list of unification pairs to be solved. After choosing a pair, it first normalizes it by \(\beta \)-reducing and instantiating the heads of both terms in the pair. Then, if either head is a variable, it computes an appropriate binding for this variable, thereby approximating the solution.

Unlike in first-order and \(\lambda \)-free higher-order unification, in the full higher-order case there may be many bindings that lead to a solution. To reduce this mostly blind guessing of bindings, the procedure features support for oracles [41]. These are procedures that solve the unification problem for a subclass of higher-order terms on which unification is decidable and, for \(\lambda \)E, unitary. Oracles help increase performance, avoid nontermination, and avoid redundant bindings.

Vukmirović et al. described their procedure as a transition system. In \(\lambda \)E, the procedure is implemented nonrecursively, and the unifiers are enumerated using an iterator object that encapsulates the state of the unifier search. The iterator consists of five fields: (1) constraints, which holds the unification constraints; (2) bt_state, a stack that contains information necessary to backtrack to a previous state; (3) branch_iter, which stores how far we are in exploring different possibilities from the current search node; (4) steps, which remembers how many different unification bindings (such as imitation, projection, and identification) are applied; and (5) subst, a stack storing the variables bound so far.

The iterator is initialized to hold the original problem in \( constraints \), and all other fields are initially empty. The unifiers are retrieved one by one by calling the function \(\textsc {ForwardIter}\). It returns \(\textsc {True}\) if the iterator made progress, in which case the unifier can be read via the iterator’s \( subst \) field. Otherwise, no more unifiers can be found, and the iterator is no longer valid. The function’s pseudocode is given below, including two auxiliary functions:

(Pseudocode of \(\textsc {ForwardIter}\) and its two auxiliary functions.)
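Since the pseudocode itself is not reproduced above, the following C skeleton paraphrases the description that follows; the types, helper functions, and control flow are our schematic reconstruction, not \(\lambda \)E’s actual source.

```c
#include <stdbool.h>

/* Hypothetical types and helpers standing in for lambda-E's internal
   data structures; this whole skeleton is a paraphrase of the prose. */
typedef struct stack Stack;
typedef struct term  Term;
typedef struct { Term *lhs, *rhs; } UnifPair;
typedef enum { UNIFIABLE, NOT_UNIFIABLE, NOT_IN_FRAGMENT } OracleResult;

typedef struct {
    Stack *constraints;  /* (1) unification pairs left to solve     */
    Stack *bt_state;     /* (2) saved states for backtracking       */
    int    branch_iter;  /* (3) position among the current branches */
    int    steps;        /* (4) binding rules applied so far        */
    Stack *subst;        /* (5) variables bound so far              */
} UnifIterator;

extern bool         stack_is_empty(const Stack *s);
extern UnifPair     next_constraint(UnifIterator *it); /* pops, instantiates
                                                          heads, beta-reduces,
                                                          aligns lambda prefixes */
extern bool         is_flex(const Term *t);
extern bool         same_heads(const UnifPair *p);
extern void         push_arg_constraints(UnifIterator *it, const UnifPair *p);
extern OracleResult run_oracles(UnifIterator *it, const UnifPair *p);
extern bool         apply_next_binding(UnifIterator *it, const UnifPair *p);
extern bool         backtrack(UnifIterator *it);

bool ForwardIter(UnifIterator *it)
{
    /* The previous call solved all constraints: resume from a saved
       backtracking point, or report that no unifiers are left. */
    if (stack_is_empty(it->constraints) && !backtrack(it))
        return false;

    while (!stack_is_empty(it->constraints)) {
        UnifPair p = next_constraint(it);
        if (is_flex(p.lhs) || is_flex(p.rhs)) {
            switch (run_oracles(it, &p)) {
            case UNIFIABLE:                 /* MGU recorded in subst */
                continue;
            case NOT_UNIFIABLE:
                if (!backtrack(it)) return false;
                continue;
            case NOT_IN_FRAGMENT:
                if (apply_next_binding(it, &p))
                    continue;               /* guessed a binding     */
                /* bindings exhausted: flex-flex with equal heads is
                   decomposed; everything else backtracks */
                if (same_heads(&p)) { push_arg_constraints(it, &p); continue; }
                if (!backtrack(it)) return false;
                continue;
            }
        }
        /* rigid-rigid: heads must agree, then decompose */
        if (same_heads(&p)) push_arg_constraints(it, &p);
        else if (!backtrack(it)) return false;
    }
    return true;  /* a unifier can be read off it->subst */
}
```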

\(\textsc {ForwardIter}\) begins by backtracking if the previous attempt was successful (i.e., all constraints were solved). If it finds a state from which it can continue, it takes term pairs from \( constraints \) until there are no more constraints or it is determined that no unifier exists. The terms are normalized by instantiating the head variable with its binding and reducing any top-level \(\beta \)-redex that appears. This instantiation and reduction process is repeated until there are no more top-level \(\beta \)-redexes and the head is not a variable bound to some term. Then the term with the shorter \(\lambda \) prefix is expanded (only at the top level) so that both \(\lambda \) prefixes have the same length. Finally, the \(\lambda \) prefixes are ignored, and we focus only on the bodies. In this way, we avoid fully substituting and normalizing terms and perform just enough operations to determine the next step of the procedure.

If either term of the constraint is flex, we first invoke oracles to solve the constraint. \(\lambda \)E implements the most efficient oracles implemented in Zipperposition: fixpoint and pattern [41, Sect. 6]. An oracle can return three results: (1) there is an MGU for the pair (\(\textsc {Unifiable}\)), which is recorded in subst, and the next pair in \(constraints \) is tried; (2) no MGU exists for the pair (\(\textsc {NotUnifiable}\)), which causes the iterator to backtrack; (3) the pair does not belong to the subclass that the oracle can solve (\(\textsc {NotInFragment}\)), in which case we generate possible variable bindings—that is, we guess the approximate form of the solution.

\(\lambda \)E has a dedicated module that generates bindings (\(\textsc {NextBinding}\)). This module is given the current constraint and the values of branch_iter and steps, and it either returns the next binding and the new values of branch_iter and steps or reports that all different variable bindings are exhausted. The bindings that \(\lambda \)E’s unification procedure creates are imitation, Huet-style projection, identification, and elimination (one argument at a time) [41, Sect. 3]. A limit on the total number of applied binding rules can be set, as well as a limit on the number of individual rule applications. The binding module checks whether limits are reached using the iterator’s \( steps \) field.
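One way to picture the branch bookkeeping is as an enumeration ranging from \(\textsc {BindBegin}\) to \(\textsc {BindEnd}\), with the limit check reading the iterator’s steps field; the intermediate constant names below are illustrative, not \(\lambda \)E’s.

```c
#include <stdbool.h>

/* Possible positions of branch_iter, mirroring the description in
   the text. */
typedef enum {
    BIND_BEGIN,     /* no binding created yet              */
    BIND_IMITATE,   /* imitation of a rigid head           */
    BIND_PROJECT,   /* Huet-style projection               */
    BIND_IDENTIFY,  /* identification                      */
    BIND_ELIMINATE, /* elimination, one argument at a time */
    BIND_END        /* all bindings exhausted              */
} BranchIter;

/* Limit check on the total number of applied binding rules;
   max_steps < 0 encodes "no limit". */
static bool binding_limit_reached(int steps, int max_steps)
{
    return max_steps >= 0 && steps >= max_steps;
}
```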

Computing bindings is the only point in the procedure where the search tree branches and different possibilities are explored. Thus, when \(\lambda \)E follows the branch indicated by the binding module, it records the state to which it needs to return should the followed branch be backtracked. The state consists of the values of \( constraints \), \( steps \), and \( subst \) before the branch is followed and the value of \( branch\_iter \) that points past the followed branch. The possible values of \( branch\_iter \) are \(\textsc {BindBegin}\), which denotes that no binding was created; intermediate values that \(\textsc {NextBinding}\) uses to remember how far through the bindings it is; and \(\textsc {BindEnd}\), which indicates that all bindings are exhausted.

If all bindings are exhausted, the procedure checks whether the pair is flex–flex and both sides have the same head. If so, the pair is decomposed and constraints are derived from the pair’s arguments; otherwise, the iterator backtracks. If the pair is rigid–rigid, for unification to succeed, the heads of both sides must be the same. Unification then continues with new constraints derived from the arguments. Otherwise, the iterator must be backtracked.

Matching. In E, the matching algorithm is mostly used inside simplification rules such as demodulation and subsumption [29]. As these rules must be performed efficiently, using a complex matching algorithm is not viable. Instead, we provide a matching algorithm for the pattern class of terms [27] to complement Ehoh’s \(\lambda \)-free higher-order matching algorithm [42, Sect. 4]. A term is a pattern if each of its free variables either has no arguments (as in first-order logic) or is applied to distinct De Bruijn indices.

To help determine whether to use the pattern or \(\lambda \)-free algorithm, we introduce a cached property HasNonPatternVar, which is set for terms of the form \(X \, \overline{s}_n\) where \(n>0\) and either there exists some \(s_i\) that is not a De Bruijn index or there exist indices \(i < j\) such that \(s_i = s_j\) is a De Bruijn index. This property is propagated to the superterms when they are perfectly shared. This allows later checks if a term belongs to the pattern class to be performed in constant time.
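A sketch of the corresponding check, run once when an applied-variable cell \(\texttt {@}(X, s_1, \ldots , s_n)\) is shared; it reuses the TermCell sketch and term_is_db_index from above, and relies on perfect sharing to detect repeated indices by pointer equality.

```c
/* Returns true if the applied-variable cell @(X, s1, ..., sn)
   violates the pattern restriction: some argument is not a De
   Bruijn index, or two arguments are the same index. args[0] is X. */
static bool violates_pattern(const TermCell *t)
{
    for (int i = 1; i < t->num_args; i++) {
        if (!term_is_db_index(t->args[i]))
            return true;                  /* non-index argument      */
        for (int j = i + 1; j < t->num_args; j++)
            if (t->args[i] == t->args[j]) /* shared cells: identical */
                return true;              /* repeated index          */
    }
    return false;
}
```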

We modify the \(\lambda \)-free higher-order matching algorithm to treat \(\lambda \) prefixes as in the unification procedure above—by bringing the prefixes to the same length and ignoring them afterwards. This ensures that the algorithm never tries to match a free variable against a \(\lambda \)-abstraction, so that no \(\beta \)-redexes appear. We also modify the algorithm to ensure that free variables are never bound to terms that have loose bound variables. This algorithm cannot find complex matching substitutions (matchers), but it can efficiently determine whether two terms are variable renamings of each other or whether a simple matcher exists, as in the case of \((X \, (\lambda x. \,x) \, \textsf{b}, \textsf{f} \, (\lambda x. \,x) \, \textsf{b})\), where \({X \mapsto \textsf{f}}\) is usually the desired matcher. If this algorithm does not find a matcher and both terms are patterns, pattern matching is tried.

Indexing. E, like other modern theorem provers, efficiently retrieves unifiable or matchable pairs of terms using indexing data structures. To find terms unifiable with a query term or instances of a query term, it uses fingerprint indexing [30]. Vukmirović et al. extended this data structure to support full higher-order terms in Zipperposition [41, Sect. 6]. We use the same approach in \(\lambda \)E, and we extend feature vector indices [31] in the same way.

E uses perfect discrimination trees [26] to find generalizations of the query term (i.e., terms of which the query term is an instance). This data structure is a trie that indexes terms by representing them in a serialized, flattened form. The left branch from the root in Figure 1 shows how the first-order terms \(\textsf{f}\, \textsf{a} \, X\) and \(\textsf{f}\, \textsf{a} \, \textsf{a}\) are stored. In Ehoh, this data structure is extended to support partial application and applied variables [42].

Fig. 1. First-order, \(\lambda \)-free higher-order, and higher-order pattern terms in a perfect discrimination tree

In \(\lambda \)E, we extend this structure to support \(\lambda \)-abstractions and the higher-order pattern matching algorithm. To this end, we change the way in which terms are serialized. First, we require that all terms are fully \(\eta \)-expanded (except for arguments of variables applied in patterns). Then, when the term is serialized, we use a single node for applied variable terms \(X \, \overline{s}_n\), instead of a node for X followed by nodes for the arguments \(\overline{s}_n\). We serialize the \(\lambda \)-abstraction \(\lambda x.\, s\) using a dedicated node \(\texttt {LAM}_\tau \), where \(\tau \) is the type of x, followed by the serialization of s. Other than these changes, serialization remains as in Ehoh, following the gracefulness principle. Figure 1 shows how \(\textsf{g} \, (X\,\textsf{a}\,\textsf{b}) \, \textsf{c}\) and \(\textsf{h} \, (\lambda x. \, \lambda y. \, X \, y \, x)\) are serialized. Since the terms are stored in serialized form, it is hard to manipulate \(\lambda \) prefixes of stored terms during matching. Performing \(\eta \)-expansion when serializing terms ensures that matchable terms have \(\lambda \) prefixes of the same length.
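The serialization can be sketched as a preorder traversal that emits one node per symbol or De Bruijn index, a single node per (possibly applied) variable subterm, and a \(\texttt {LAM}_\tau \) node before each abstraction body. The node kinds and helper functions below are our illustrative names, not \(\lambda \)E’s.

```c
#include <stdbool.h>

typedef enum { N_SYM, N_DB, N_LAMBDA, N_APPLIED_VAR } NodeKind;

extern void emit(NodeKind kind, const TermCell *t); /* append a node */
extern bool is_applied_var(const TermCell *t);      /* @(X, ...)     */
extern bool is_lambda(const TermCell *t);           /* LAM(0, body)  */

static void serialize(const TermCell *t)
{
    if (t->f_code < 0 || is_applied_var(t)) {
        emit(N_APPLIED_VAR, t);       /* X s1 ... sn: one single node */
    } else if (is_lambda(t)) {
        emit(N_LAMBDA, t);            /* LAM_tau node ...             */
        serialize(t->args[1]);        /* ... followed by the body     */
    } else {
        emit(term_is_db_index(t) ? N_DB : N_SYM, t);
        for (int i = 0; i < t->num_args; i++)
            serialize(t->args[i]);    /* arguments in preorder        */
    }
}
```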

We have dedicated separate nodes for applied variables because access to arguments of applied variables is necessary for the pattern matching algorithm. Even though arguments can be obtained by querying the arity n of the variable and taking the next n arguments in the serialization, this is both inefficient and inelegant. As for De Bruijn indices, we treat them the same as function symbols.

Following the notation from the extension of perfect discrimination trees to \(\lambda \)-free higher-order logic [42], we now describe how enumeration of generalizations is performed. To traverse the tree, \(\lambda \)E begins at the root node and maintains two stacks: \(\texttt {term\_stack}\) and \(\texttt {term\_proc}\), where \(\texttt {term\_stack}\) contains the subterms of the query term that have to be matched, and \(\texttt {term\_proc}\) contains processed terms that are used to backtrack to previous states. Initially, \(\texttt {term\_stack}\) contains the query term, the current matching substitution \(\sigma \) is empty, and the successor node is chosen among the child nodes as follows:

A. If the node is labeled with a symbol \(\xi \) (where \(\xi \) is either a De Bruijn index or a constant) and the top item t of \(\texttt {term\_stack}\) is of the form \(\xi \, \overline{t}_n\), replace t by the n new items \(t_1,\dots ,t_n\), and push t onto \(\texttt {term\_proc}\).

B. If the node is labeled with a symbol \(\texttt {LAM}_\tau \) and the top item t of \(\texttt {term\_stack}\) is of the form \(\lambda x.\, s\), where x has type \(\tau \), replace t by s, and push t onto \(\texttt {term\_proc}\).

C. If the node is labeled with a possibly applied variable \(X \, \overline{s}_n\) (where \(n \ge 0\)) and the top item of \(\texttt {term\_stack}\) is t, the matching algorithm described above is run on \(X \, \overline{s}_n\) and t. The algorithm takes into account the substitution \(\sigma \) built so far and extends it if necessary. If it succeeds, pop t from \(\texttt {term\_stack}\), push it onto \(\texttt {term\_proc}\), and save the original value of \(\sigma \) in the node.

Backtracking works in the opposite direction: If the current node is labeled with a De Bruijn index or function symbol node of arity n, pop n terms from \(\texttt {term\_stack}\) and move the top of \(\texttt {term\_proc}\) to \(\texttt {term\_stack}\). If the node is labeled with \(\texttt {LAM}_\tau \), pop the top of \(\texttt {term\_stack}\) and move the top of \(\texttt {term\_proc}\) to \(\texttt {term\_stack}\). Finally, if the node is labeled with a possibly applied variable, move the top of the \(\texttt {term\_proc}\) to \(\texttt {term\_stack}\) and restore the value of \(\sigma \).
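A single descent step of this traversal might look as follows on the earlier TermCell sketch; the Node type, the stack helpers, and try_match_variable (which runs the matching algorithm of this section and extends \(\sigma \)) are hypothetical.

```c
#include <stdbool.h>

typedef struct stack Stack;
typedef struct node  Node;
typedef struct subst Subst;

extern bool      node_is_variable(const Node *n);  /* case C nodes  */
extern bool      node_is_lambda(const Node *n);    /* LAM_tau nodes */
extern bool      node_matches_symbol(const Node *n, const TermCell *t);
extern bool      node_lambda_type_fits(const Node *n, const TermCell *t);
extern bool      try_match_variable(const Node *n, TermCell *t, Subst *sigma);
extern TermCell *stack_peek(Stack *s);
extern TermCell *stack_pop(Stack *s);
extern void      stack_push(Stack *s, TermCell *t);

/* One descent step from the current tree node into child,
   implementing cases A-C; returns false if child does not fit. */
static bool descend(const Node *child, Stack *term_stack,
                    Stack *term_proc, Subst *sigma)
{
    TermCell *t = stack_peek(term_stack);

    if (node_is_variable(child)) {              /* case C */
        if (!try_match_variable(child, t, sigma))
            return false;
        stack_pop(term_stack);
        stack_push(term_proc, t);
        return true;
    }
    if (node_is_lambda(child)) {                /* case B */
        if (!is_lambda(t) || !node_lambda_type_fits(child, t))
            return false;
        stack_pop(term_stack);
        stack_push(term_stack, t->args[1]);     /* replace t by its body */
        stack_push(term_proc, t);
        return true;
    }
    if (!node_matches_symbol(child, t))         /* case A: symbol or */
        return false;                           /* De Bruijn index   */
    stack_pop(term_stack);
    for (int i = t->num_args; i-- > 0; )
        stack_push(term_stack, t->args[i]);     /* leftmost arg on top */
    stack_push(term_proc, t);
    return true;
}
```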

As an example of how finding a generalization works, when looking for generalizations of \(\textsf{g} \, (\textsf{f} \, \textsf{a} \, \textsf{b}) \, \textsf{c}\) in the tree of Figure 1, the following states of stacks and substitutions emerge, from left to right:

(Successive values of \(\texttt {term\_stack}\), \(\texttt {term\_proc}\), and \(\sigma \) during the traversal.)

5 Preprocessing, Calculus, and Extensions

Ehoh’s simple \(\lambda \)-free higher-order calculus performed well on Sledgehammer problems and formed a promising stepping stone to full higher-order logic [42]. When implementing support for full higher-order logic, we were guided by efficiency and gracefulness with respect to Ehoh’s calculus rather than completeness. Whereas Zipperposition provides both complete and incomplete modes, \(\lambda \)E only offers incomplete modes.

Preprocessing. Our experience with Zipperposition showed the importance of flexibility in preprocessing the higher-order problems [40]. Therefore, we implemented a flexible preprocessing module in \(\lambda \)E.

To maintain compatibility with Ehoh, \(\lambda \)E can optionally transform all \(\lambda \)-abstractions into named functions. This process is called \(\lambda \)-lifting [19]. \(\lambda \)E also removes all occurrences of Boolean subterms (other than \(\pmb \bot , \pmb \top \), and free variables) in higher-order contexts using a FOOL-like transformation [23]. For example, the formula \(\textsf{f}(\textsf{p} \pmb \wedge \textsf{q})\,\pmb \approx \,\textsf{a}\) becomes \((\textsf{p} \pmb \wedge \textsf{q} \pmb \rightarrow \textsf{f}(\pmb \top )\,\pmb \approx \,\textsf{a}) \pmb \wedge (\pmb {\lnot \,}(\textsf{p} \pmb \wedge \textsf{q}) \pmb \rightarrow \textsf{f}(\pmb \bot )\,\pmb \approx \,\textsf{a})\).

Many TPTP problems use the definition role to identify the definitions of symbols. \(\lambda \)E can treat definition axioms as rewrite rules, and replace all occurrences of defined symbols during preprocessing. Furthermore, during SInE [18] axiom selection, it can always include the defined symbol in the trigger relation.

Calculus. \(\lambda \)E implements the same superposition calculus as Ehoh with three important changes. First, wherever Ehoh requires the MGU of terms, \(\lambda \)E enumerates unifiers from a finite subset of the CSU, as explained in Sect. 4. Second, \(\lambda \)E uses versions of the KBO and LPO orders designed for \(\lambda \)-terms.

The third difference is more subtle. One of the main features of Ehoh is prefix optimization [42, Sect. 1]: a method that, given a demodulator \(s \approx t\), makes it possible to replace both applied and unapplied occurrences of s by t by traversing only the first-order subterms of a rewritable term. In a \(\lambda \)-free setting, this optimization is useful, but in the presence of \(\beta \eta \)-normalization, the shapes of terms can change drastically, making it much harder to track prefixes of terms. This is why we disable the prefix optimization in \(\lambda \)E. To compensate for losing this optimization, we introduce the argument congruence rule AC in \(\lambda \)E and enable positive and negative functional extensionality (PE and NE) by default:

\[ \frac{s \approx t \mathrel \vee C}{s \, X \approx t \, X \mathrel \vee C} \; \textsc {AC} \qquad \frac{s \, X \approx t \, X \mathrel \vee C}{s \approx t \mathrel \vee C} \; \textsc {PE} \qquad \frac{s \not \approx t \mathrel \vee C}{s \, (\textsf{sk} \, \overline{X}) \not \approx t \, (\textsf{sk} \, \overline{X}) \mathrel \vee C} \; \textsc {NE} \]

AC and NE assume that s and t are of function type. In NE, \(\overline{X}\) denotes all the free variables occurring in s and t, and \(\textsf{sk}\) is a fresh Skolem symbol of the appropriate type. PE has a side condition that X may not occur in s, t, or C.

Saturation. E’s saturation procedure assumes that each attempt to perform an inference will either result in a single clause or fail due to one of the inference side conditions. Unification procedures that produce multiple substitutions break this invariant, and the saturation procedure needed to be adjusted.

For Zipperposition, Vukmirović et al. developed a variant of the saturation procedure that interleaves computing unifiers and scheduling inferences to be performed [40]. Since completeness was not a design goal for \(\lambda \)E, we did not implement this version of the saturation procedure. Instead, in places where previously a single unifier was expected, \(\lambda \)E consumes all elements of the iterator used for enumerating unifiers, converting them into clauses.

Reasoning about Formulas. Even though most of the Boolean structure is removed during preprocessing, formulas can reappear at the top level of clauses during saturation. For example, after instantiating X with \(\lambda x. \, \lambda y.\, x \pmb \wedge y\), the clause \(X \, \textsf{p} \, \textsf{q} \mathrel \vee \textsf{a} \approx \textsf{b}\) becomes \((\textsf{p} \pmb \wedge \textsf{q}) \mathrel \vee \textsf{a} \approx \textsf{b}\). \(\lambda \)E converts every clause of the form \(\varphi \mathrel \vee C\), where \(\varphi \) either has a logical symbol as its head or is a (dis)equation between two formulas different from \(\pmb \top \), to an explicitly quantified formula. Then, the clausification algorithm is invoked on the formula to restore the clausal structure. Zipperposition features more dynamic clausification modes, but for simplicity we decided not to implement them in \(\lambda \)E.

The \(\lambda \)-superposition calculus for full higher-order logic [4] includes many rules that act on Boolean subterms, which are necessary for completeness. Other than Boolean simplification rules, which use simple tautologies such as \(\textsf{p} \pmb \wedge \pmb \top \pmb \leftrightarrow \textsf{p}\) to simplify terms, we have implemented none of the Boolean rules of this calculus in \(\lambda \)E. First, we have observed that complicated rules such as FluidBoolHoist and FluidLoobHoist are hardly ever useful in practice and usually only contribute to an uncontrolled increase in the proof state size. Second, simpler rules such as BoolHoist can usually be simulated by pragmatic rules that perform Boolean extensionality reasoning, described below.

To make up for excluding Boolean rules, we use an incomplete, but more easily controllable and intuitive rule, called primitive instantiation. This rule instantiates free predicate variables with approximations of formulas that are ground instances of this variable. We use the approximations described by Vukmirović and Nummelin [43, Sect. 3.3].

\(\lambda \)E’s handling of the Hilbert choice operator is inspired by Leo-III’s [35]. \(\lambda \)E recognizes clauses of the form \(\lnot \, P \, X \vee P \, (\textsf{f} \, P)\), which essentially denote that \(\textsf{f}\) is a choice symbol. Then, when a subterm \(\textsf{f} \, s\) is found during saturation, s is used to instantiate the choice axiom for \(\textsf{f}\). Similarly, Leibniz equality [43] is eliminated by recognizing clauses of the form \(\lnot \, P \, \textsf{a} \mathrel \vee P \, \textsf{b} \mathrel \vee C\). Such clauses are instantiated with \(P \mapsto \lambda x. \, x \approx \textsf{a}\) and \(P \mapsto \lambda x. \, x\not \approx \textsf{b}\), both of which result in \(\textsf{a} \approx \textsf{b} \vee C\).

Finally, \(\lambda \)E treats induction axioms specially. Like Zipperposition [40, Sect. 4], it abstracts literals from the goal clauses and instantiates induction axioms with these abstractions. Since Zipperposition supports dynamic calculus-level clausification, induction axioms are instantiated during saturation, when the axioms are processed. In \(\lambda \)E, this instantiation is performed immediately after clausification. After \(\lambda \)E has collected all the abstractions, it traverses the clauses and instantiates those that have an applied variable of the same type as an abstraction.

Extensionality. \(\lambda \)E takes a pragmatic approach to reasoning about functional and Boolean extensionality: It uses abstracting rules [5], which simulate basic superposition calculus rules but do not require unifiability of the partner terms in the inference. More precisely, assume a core inference needs to be performed between two \(\beta \)-reduced terms u and v that can be represented as \(u=C[s_1, \ldots , s_n]\) and \(v=C[t_1, \ldots , t_n]\), where C is the most general “green” [5] common context of u and v, not all of the \(s_i\) and \(t_i\) are free variables, and for at least one i, \(s_i \ne t_i\), neither \(s_i\) nor \(t_i\) is a possibly applied free variable, and both are of Boolean or function type. Then the conclusion is formed by taking the conclusion D of the core inference rule (which would be created if u and v were unifiable) and adding the literals \(s_1 \not \approx t_1 \mathrel \vee \cdots \mathrel \vee s_n \not \approx t_n\).

These rules are particularly useful because \(\lambda \)E has no rules that dynamically process Booleans in FOOL-like fashion, such as BoolHoist. For example, given the clauses \(\textsf{f} \, (\textsf{p} \pmb \wedge \textsf{q}) \approx \textsf{a}\) and \(\textsf{g} \, (\textsf{f} \, \textsf{p}) \not \approx \textsf{b}\), the abstracting version of the superposition rule would result in \(\textsf{g} \, \textsf{a} \not \approx \textsf{b} \mathrel \vee (\textsf{p} \pmb \wedge \textsf{q}) \not \approx \textsf{p}\). In this way, the Boolean structure bubbles up to the top level and is further processed by clausification. We noticed that this alleviates the need for the other Boolean rules in practice.

6 Evaluation

We now try to answer two questions about \(\lambda \)E: How does \(\lambda \)E compare against other higher-order provers (including Ehoh)? Does \(\lambda \)E introduce any overhead compared with Ehoh? To answer these questions, we ran provers on problems from the TPTP library [38] and on benchmarks generated by Sledgehammer (SH) [28]. The experiments were carried out on StarExec Miami [36] nodes equipped with an Intel Xeon E5-2620 v4 CPU clocked at 2.10 GHz. For the TPTP part, we used the CASC 2021 time limits: 120 s wall-clock and 960 s CPU. For the SH benchmarks and for the overhead question, we used Sledgehammer’s default time limit: 30 s wall-clock and CPU. The raw evaluation data is available online.

Comparison with Other Provers. To answer the first question, we let \(\lambda \)E compete with the top contenders in the higher-order division of CASC 2021: cvc5 0.0.7 [2], Ehoh 2.7 [42], Leo-III 1.6.6 [35], Vampire 4.6 [8], and Zipperposition 2.1 [40]. We also included Satallax 3.5 [10]. We used all 2899 higher-order theorems in TPTP 7.5.0 as well as 5000 SH higher-order benchmarks originating from the Seventeen benchmark suite [15]. On SH benchmarks, cvc5, Ehoh, \(\lambda \)E, Vampire, and Zipperposition were run using custom schedules provided by their developers, optimized for single-core usage and low timeouts. Otherwise, we used the corresponding CASC configurations.

Although it internally does not support \(\lambda \)-abstractions, Ehoh 2.7 can parse full higher-order logic using \(\lambda \)-lifting. We included two versions of Zipperposition: coop uses Ehoh 2.7 as a backend to finish proof attempts, whereas uncoop does not. Both Ehoh and \(\lambda \)E were run in the automatic scheduling mode. Compared with Ehoh, \(\lambda \)E features a redesigned module for automatic scheduling, it can exploit multiple CPU cores, and its heuristics have been more extensively trained on higher-order problems.

The results are shown in Figure 2. \(\lambda \)E dramatically improves E’s higher-order reasoning capabilities compared with Ehoh. It solves 20% more TPTP benchmarks and 7% more SH benchmarks. The reason for the higher performance increase for TPTP is likely that TPTP benchmarks tend to require more higher-order reasoning than SH benchmarks, which often have a large first-order component and for which Ehoh was already very successful.

\(\lambda \)E was envisioned as an efficient backend to proof assistants. As such, it excels on SH benchmarks, outperforming the competition. On TPTP, it outperforms all higher-order provers other than Zipperposition-coop. If Zipperposition’s Ehoh backend is disabled, \(\lambda \)E outperforms Zipperposition by a wide margin. This comparison is arguably fairer; after all, \(\lambda \)E does not use an older version of Zipperposition as a backend. These results suggest that \(\lambda \)E already implements most of the necessary features for a high-performance higher-order prover but could benefit from the kind of fine-tuning that Zipperposition underwent in the last four years.

Remarkably, the raw evaluation data reveals that \(\lambda \)E solves 181 SH problems and 24 TPTP problems that Zipperposition-coop does not. The lower number of uniquely solved TPTP problems is likely because Zipperposition was heavily optimized on the TPTP.

Comparison with the First-Order E. Both Ehoh and \(\lambda \)E can be compiled in a mode that disables most of the higher-order reasoning. This mode is designed for users who are interested only in E’s first-order capabilities and care about performance. To answer the second evaluation question, about \(\lambda \)E’s overhead, we chose the 1138 unique problems used at CASC from 2019 to 2021 in the first-order theorem division and ran Ehoh and \(\lambda \)E both in this first-order (FO) mode and in higher-order (HO) mode.

We fixed a single configuration of options, because Ehoh’s and \(\lambda \)E’s automatic scheduling methods could select different configurations and we would not be measuring the overhead but the quality of the chosen configurations. We chose the boa configuration [42, Sect. 7], which is the configuration most often used by E 2.2 in its automatic scheduling mode. The results are shown in Figure 3.

Counterintuitively, the higher-order versions of both provers outperform their first-order counterparts. However, the difference is so small that it can be attributed to changes in memory layout that affect the order in which clauses are chosen. Similar effects are visible when comparing the first-order versions.

CASC Results. \(\lambda \)E also took part in CASC 2022. In the TPTP higher-order division, \(\lambda \)E finished second, after Zipperposition, as expected. In the Sledgehammer division, \(\lambda \)E tied with Ehoh for first place, a disappointment. The likely explanation is that \(\lambda \)E used a wrong configuration in this division, as we found out afterwards. We expect better performance at CASC 2023.

Fig. 2. Comparison of higher-order provers

Fig. 3. Evaluation of \(\lambda \)E’s overhead

7 Discussion and Related Work

On the trajectory to \(\lambda \)E, we developed, together with colleagues, three superposition calculi: for \(\lambda \)-free higher-order logic [6], for a higher-order logic with \(\lambda \)-abstraction but no Booleans [5], and for full higher-order logic [4]. These milestones allowed us to carefully estimate how the increased reasoning capabilities of each calculus influence its performance.

Extending first-order provers with higher-order reasoning capabilities has been attempted by other researchers as well. Barbosa et al. extended the SMT solvers CVC4 (now cvc5) and veriT to higher-order logic in an incomplete way [3]. Bhayat and Reger first extended Vampire to higher-order logic using combinatory unification [8], an incomplete approach, before they designed and implemented a complete higher-order superposition calculus based on SKBCI combinators [7]. The advantage is that combinators can be supported as a thin layer on top of \(\lambda \)-free terms. This calculus is also implemented in Zipperposition. However, in informal experiments, we found that \(\lambda \)-superposition performs substantially better, corroborating the CASC results, so we decided to make a more profound change to Ehoh and implement \(\lambda \)-superposition.

Possibly the only actively maintained higher-order provers built from the bottom up as higher-order provers are Leo-III [35] and Satallax’s [10] successor Lash [11]. A further overview of other traditional higher-order provers and the calculi they are based on can be found in the paper about Ehoh [42, Sect. 9].

8 Conclusion

In 2019, the reviewers of our Ehoh paper [42] were skeptical that extending Ehoh with support for full higher-order logic would be feasible. One of them wrote:

A potential criticism could be that this step from E to Ehoh is just extending FOL by those aspects of HOL that are easily in reach with rather straightforward extensions (none of the extensions is indeed very complicated), and that the difficult challenges of fully supporting HOL have yet to be confronted.

We ended up addressing the theoretical “difficult challenges” in other work with colleagues. In this paper, we faced the practical challenges pertaining to the extension of Ehoh’s data structures and algorithms to support full higher-order logic and demonstrated that such an extension is possible. Our evaluation shows that this extension makes \(\lambda \)E the best higher-order prover on benchmarks coming from interactive theorem proving practice, which was our goal. \(\lambda \)E lags slightly behind Zipperposition on TPTP problems. One reason might be that Zipperposition does not assume a clausal structure and can perform subtle formula-level inferences. It would be useful to implement the same features in \(\lambda \)E. We have also only started tuning \(\lambda \)E’s heuristics on higher-order problems.