Extending a brainiac prover to lambda-free higher-order logic

Decades of work have gone into developing efficient proof calculi, data structures, algorithms, and heuristics for first-order automatic theorem proving. Higher-order provers lag behind in terms of efficiency. Instead of developing a new higher-order prover from the ground up, we propose to start with the state-of-the-art superposition prover E and gradually enrich it with higher-order features. We explain how to extend the prover's data structures, algorithms, and heuristics to λ-free higher-order logic, a formalism that supports partial application and applied variables. Our extension outperforms the traditional encoding and appears promising as a stepping stone toward full higher-order logic.


Introduction
Superposition provers such as E [45], SPASS [57], and Vampire [27] are among the most successful first-order reasoning systems. They serve as backends in various frameworks, including software verifiers (e.g., Why3 [23]), automatic higher-order theorem provers (e.g., Leo-III [46], Satallax [18]), and one-click "hammers" in proof assistants (e.g., HOLyHammer in HOL Light [25], Sledgehammer in Isabelle [36]). Decades of research have gone into refining calculi, devising efficient data structures and algorithms, and developing heuristics to guide proof search [44]. This work has mostly focused on first-order logic with equality. Research on higher-order automatic provers has resulted in systems such as LEO [11], Leo-II [13], and Leo-III [46], based on resolution and paramodulation, and Satallax [18], based on tableaux and SAT solving. They feature a "cooperative" architecture, pioneered by LEO: They are full-fledged higher-order provers that regularly invoke an external first-order prover with a low time limit as a terminal procedure, in an attempt to finish the proof quickly using only first-order reasoning. However, the first-order backend will succeed only if all the necessary higher-order reasoning has been performed, meaning that much of the first-order reasoning is carried out by the slower higher-order prover. As a result, this architecture leads to suboptimal performance on largely first-order problems, such as those that often arise in interactive verification [48]. For example, at the 2017 installment of the CADE ATP System Competition (CASC) [50], Leo-III, which uses E as a backend, proved 652 out of 2000 first-order problems in the Sledgehammer division, compared with 1185 for E on its own and 1433 for Vampire.
To obtain better performance, we propose to start with a competitive first-order prover and extend it to full higher-order logic one feature at a time. Our goal is a graceful extension, so that the system behaves as before on first-order problems, performs mostly like a first-order prover on typical, mildly higher-order problems, and scales up to arbitrary higher-order problems, in keeping with the zero-overhead principle: What you don't use, you don't pay for.

Logic
Our logic is a variant of the intensional λ-free Boolean-free higher-order logic (λfHOL) described by Bentkamp et al. [10, Sect. 2], which could also be called "applicative first-order logic." In the spirit of FOOL [26], we extend the syntax of this logic by erasing the distinction between terms and formulas, and its semantics by interpreting the Boolean type o as a domain of cardinality 2. Functional extensionality can be obtained by adding suitable axioms [10, Sect. 3.1].
A type is either an atomic type ι or a function type τ → υ, where τ and υ are types. Terms, ranged over by s, t, u, v, are either variables x, y, z, …, (function) symbols a, b, c, d, f, g, … (often called "constants" in the higher-order literature), binary applications s t, or Boolean terms ⊤, ⊥, ¬s, s ∧ t, s ∨ t, s → t, s ↔ t, ∀x. s, ∃x. s, s ≈ t. Boolean terms are also called formulas, and function symbols returning a Boolean value are also called predicate symbols. The typing rules are as for the simply typed λ-calculus. A term's arity is the number of extra arguments it can take. If f has type ι → ι → ι and a has type ι, then f is binary, f a is unary, and f a a is nullary. Subterms are defined in the usual way; for example, s t has all subterms of s and t as subterms, in addition to s t itself.
Non-Boolean terms have a unique flattened decomposition of the form ζ s₁ … sₘ, where ζ, the head, is a variable or symbol, and s₁, …, sₘ, the arguments, are arbitrary terms. We abbreviate tuples (a₁, …, aₘ) to āₘ or ā. Abusing notation, we write ζ s̄ₘ for ζ s₁ … sₘ. An equation s ≈ t corresponds to an unordered pair of terms. A literal L is an equation s ≈ t, where s and t have the same type, or its negation s ≉ t. Clauses C, D are finite multisets of literals, written L₁ ∨ ⋯ ∨ Lₙ. E and Ehoh clausify the input as a preprocessing step, producing a clause set in which the only proper Boolean subterms are variables, ⊤, and ⊥.
Substitutions σ are partial functions of finite domain from variables to terms, written {x₁ ↦ s₁, …, xₘ ↦ sₘ}, where each sᵢ has the same type as xᵢ. The substitution σ[x ↦ s] maps x to s and otherwise coincides with σ. Applying σ to a variable beyond σ's domain is the identity. Composition (σ′ ∘ σ)(t) is defined as σ′(σ(t)).
A well-known technique to support λfHOL is to use the applicative encoding: Every n-ary symbol is mapped to a nullary symbol, and application is represented by a distinguished binary symbol @. Thus, the λfHOL term f (x a) b is encoded as the first-order term @(@(f, @(x, a)), b). However, this representation is not graceful, since it also introduces @'s for terms within λfHOL's first-order fragment. By doubling the size and depth of terms, the encoding clutters data structures and slows down term traversals. In our empirical evaluation, we find that the applicative encoding can decrease the success rate by up to 15% (Sect. 9). For these and further reasons, it is not ideal (Sect. 10).
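The encoding can be illustrated with a minimal Python sketch. The names `Term`, `encode`, and `show` are hypothetical and chosen only for this illustration; terms are kept in flattened form (a head applied to a tuple of arguments).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Term:
    """Flattened λfHOL term: a head (symbol or variable name) with its arguments."""
    head: str
    args: Tuple["Term", ...] = ()


def encode(t: Term) -> Term:
    """Applicative encoding: heads become nullary, application becomes binary @."""
    result = Term(t.head)
    for arg in t.args:
        result = Term("@", (result, encode(arg)))
    return result


def show(t: Term) -> str:
    """Render a term in first-order syntax, e.g. @(f, a)."""
    if not t.args:
        return t.head
    return f"{t.head}({', '.join(show(a) for a in t.args)})"


# The λfHOL term f (x a) b from the text.
example = Term("f", (Term("x", (Term("a"),)), Term("b")))
print(show(encode(example)))  # @(@(f, @(x, a)), b)
```

Even the purely first-order subterms acquire one @ per argument, which is the source of the overhead discussed above.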

Types and terms
The term representation is a central concern when building a theorem prover. Delicate changes to E's representation were needed to support partial application and especially applied variables. In contrast, the introduction of a higher-order type system had a less dramatic impact on the prover's code.
Types For most of its history, E supported only untyped first-order logic. Cruanes implemented support for atomic types for E 2.0 [19, p. 117]. Symbols are declared with a type signature: f: τ₁ × ⋯ × τₘ → τ. Atomic types are represented by integers, leading to efficient type comparisons.
In λfHOL, a type signature is simply a type τ, in which the type constructor → can be nested, e.g., (ι → ι) → ι. A natural way to represent such types is to mimic their recursive structure using a tagged union. However, this leads to memory fragmentation; a simple operation such as querying the type of a function's ith argument would require dereferencing i pointers. We prefer a flattened representation, in which a type τ₁ → ⋯ → τₙ → ι is represented by a single node labeled with → and pointing to the array (τ₁, …, τₙ, ι).
Ehoh stores all types in a shared bank and implements perfect sharing, ensuring that types that are structurally the same are represented by the same object in memory. Type equality can then be implemented as a pointer comparison.
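As a rough illustration of a flattened, perfectly shared type representation, consider the following Python sketch. It is not Ehoh's data structure: the classes `Type` and `TypeBank` and their methods are assumptions made for this example, and atomic types are identified by name rather than by integer.

```python
from typing import Dict, Tuple


class Type:
    """Flattened type τ1 → ... → τn → ι: `args` holds τ1..τn, `result` names ι."""
    __slots__ = ("args", "result")

    def __init__(self, args: Tuple["Type", ...], result: str) -> None:
        self.args = args
        self.result = result


class TypeBank:
    """Hypothetical shared bank: structurally equal types map to one object."""

    def __init__(self) -> None:
        self._types: Dict[tuple, Type] = {}

    def _intern(self, args: Tuple[Type, ...], result: str) -> Type:
        key = (args, result)  # argument types are shared, so identity hashing suffices
        if key not in self._types:
            self._types[key] = Type(args, result)
        return self._types[key]

    def atom(self, name: str) -> Type:
        return self._intern((), name)

    def arrow(self, arg: Type, res: Type) -> Type:
        """Build arg → res, splicing res's arguments so the node stays flat."""
        return self._intern((arg,) + res.args, res.result)


bank = TypeBank()
i = bank.atom("i")
binary = bank.arrow(i, bank.arrow(i, i))  # ι → ι → ι, stored as one node
print(len(binary.args))                    # 2
print(bank.arrow(bank.arrow(i, i), i) is bank.arrow(bank.arrow(i, i), i))  # True
```

With this layout, the type of the ith argument is a single array access (`binary.args[i]`), and type equality reduces to the `is` comparison shown on the last line.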
Terms In E, terms are stored as perfectly shared directed acyclic graphs [30]. Each node, or cell, contains 11 fields, including f_code, an integer that identifies the term's head symbol (if ≥ 0) or variable (if < 0); arity, an integer corresponding to the number of arguments passed to the head; args, an array of size arity consisting of pointers to arguments; and binding, which may store a substitution for a variable (if f_code < 0), used for unification and matching.
In first-order logic, the arity of a variable is always 0, and the arity of a symbol is given by its type signature. In higher-order logic, variables may have function type and be applied, and symbols can be applied to fewer arguments than specified by their type signatures. A natural representation of λfHOL terms as tagged unions would distinguish between variables x, symbols f, and binary applications s t. However, this scheme suffers from memory fragmentation and linear-time access, as with the representation of types, affecting performance on purely or mostly first-order problems. Instead, we propose a flattened representation, as a generalization of E's existing data structures: Allow arguments to variables, for symbols let arity be the number of actual arguments, and rename the field num_args. This representation, often called "spine notation," is isomorphic to the standard definition of higher-order terms with binary application. It is employed in various higher-order reasoning systems, including Leo-III [46] and Zipperposition [9].
A side effect of the flattened representation is that prefix subterms are not shared. For example, the terms f a and f a b correspond to the flattened cells f(a) and f(a, b). The argument subterm a is shared, but not the prefix f a. Similarly, x and x b are represented by two distinct cells, x() and x(b), and there is no connection between the two occurrences of x. In particular, despite perfect sharing, their binding fields are unconnected, leading to inconsistencies.
A potential solution would be to systematically traverse a clause and set the binding fields of all cells of the form x(s) whenever a variable x is bound, but this would be inefficient and inelegant. Instead, we implemented a hybrid approach: Variables are applied by an explicit application operator @, to ensure that they are always perfectly shared. Thus, x b c is represented by the cell @(x, b, c), where x is a shared subcell. This is graceful, since variables never occur applied in first-order terms. The main drawback is that some normalization is necessary after substitution: Whenever a variable is instantiated by a symbol-headed term, the @ symbol must be eliminated. Applying the substitution {x ↦ f a} to the cell @(x, b, c) must produce f(a, b, c) and not @(f(a), b, c), for consistency with other occurrences of f a b c.
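The normalization step can be sketched in a few lines of Python. This is an illustration only: `Cell` and `apply_subst` are hypothetical names, substitutions are plain dictionaries rather than binding fields, and the terms in the substitution's range are assumed to be already normalized.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Cell:
    head: str                          # symbol name, variable name, or "@"
    args: Tuple["Cell", ...] = ()
    is_var: bool = False               # True only for bare variable cells


def apply_subst(t: Cell, sigma: Dict[str, Cell]) -> Cell:
    """Apply sigma to t and eliminate @ wherever its head stops being a variable."""
    if t.is_var:
        return sigma.get(t.head, t)
    args = tuple(apply_subst(a, sigma) for a in t.args)
    if t.head != "@":
        return Cell(t.head, args)
    head, rest = args[0], args[1:]
    if head.is_var:                    # still an applied variable: keep @
        return Cell("@", args)
    if head.head == "@":               # bound to an applied variable: merge spines
        return Cell("@", head.args + rest)
    return Cell(head.head, head.args + rest)   # symbol-headed: drop @


a, b, c = Cell("a"), Cell("b"), Cell("c")
x = Cell("x", is_var=True)
applied = Cell("@", (x, b, c))                       # x b c
print(apply_subst(applied, {"x": Cell("f", (a,))}))  # the cell f(a, b, c)
```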
There is one more complication related to the binding field. In E, it is easy and useful to traverse a term as if a substitution has been applied, by following all set binding fields. In Ehoh, this is not enough, because cells must also be normalized. To avoid repeatedly creating the same normalized cells, we introduced a binding_cache field that connects a @(x, s) cell with its substitution. However, this cache can easily become stale when x's binding pointer is updated. To detect this situation, we store x's binding value in the @(x, s) cell's binding field (which is otherwise unused). To find out whether the cache is valid, it suffices to check that the binding fields of x and @(x, s) are equal.
Term orders Superposition provers rely on term orders to prune the search space. The order must be a simplification order that is total on variable-free terms. E implements both the Knuth-Bendix order (KBO) and the lexicographic path order (LPO). KBO is widely regarded as the more robust option for superposition. In earlier work, Blanchette and colleagues have shown that only KBO can be generalized gracefully while preserving the necessary properties for superposition [7,16]. For this reason, we focus on KBO.
E implements Löchner's linear-time algorithm for KBO [29], which relies on the tupling method to store intermediate results. It is straightforward to generalize the algorithm to compute the graceful λfHOL version of KBO [7]. The main difference is that when comparing two terms f s̄ₘ and f t̄ₙ, because of partial application we may now have m ≠ n; this required changing the implementation to perform a length-lexicographic comparison of the tuples s̄ₘ and t̄ₙ.
Input and output syntax E implements the TPTP [51] formats FOF and TF0, corresponding to untyped and monomorphic first-order logic, for both input and output. In Ehoh, we added support for the λfHOL fragment of TPTP TH0, which provides monomorphic higher-order logic. Thanks to the use of a standard format, Ehoh's proofs can immediately be parsed by Sledgehammer [36], which reconstructs them using a variety of techniques. There is ongoing work on increasing the level of detail of E's proofs, to facilitate proof interchange and independent proof checking [38]; this will also benefit Ehoh.

Unification and matching
Syntactic unification of (Boolean-free) λfHOL terms has a first-order flavor. It is decidable, and most general unifiers (MGUs) are unique up to variable renaming. For example, the unification constraint f (y a) ≟ y (f a) has the MGU {y ↦ f}, whereas in full higher-order logic infinitely many independent solutions of the form {y ↦ λx. f (f (⋯ (f x) ⋯))} exist. Matching is a special case of unification where only the variables on the left-hand side can be instantiated.
An easy but inefficient way to implement unification and matching for λfHOL is to apply the applicative encoding (Sect. 2), perform first-order unification or matching, and decode the resulting substitution. To avoid the overhead, we generalize the first-order unification and matching procedures to operate directly on λfHOL terms.

Unification
We present our unification procedure as a nondeterministic transition system that generalizes Baader and Nipkow [5]. A unification problem consists of a finite set S of unification constraints sᵢ ≟ tᵢ, where sᵢ and tᵢ are of the same type. A problem is in solved form if it has the form {x₁ ≟ t₁, …, xₙ ≟ tₙ}, where the xᵢ's are distinct and do not occur in the tⱼ's. The corresponding unifier is {x₁ ↦ t₁, …, xₙ ↦ tₙ}. The transition rules attempt to bring the input constraints into solved form. They can be applied in any order and eventually reach a normal form, which is either an idempotent MGU expressed in solved form or the special value ⊥, denoting unsatisfiability of the constraints.
The first group of rules, the positive rules, consists of operations that focus on a single constraint and replace it with a new (possibly empty) set of constraints. The Delete, Decompose, and Eliminate rules are essentially as for first-order terms. The Orient rule is generalized to allow applied variables and complemented by a new OrientXY rule. DecomposeX, also a new rule, can be seen as a variant of Decompose that analyzes applied variables; the term u may be an application.
The rules belonging to the second group, the negative rules, detect unsolvable constraints.
E stores open constraints in a double-ended queue. Constraints are processed from the front. New constraints are added at the front if they involve complex terms that can be dealt with swiftly by Decompose or Clash, or to the back if one side is a variable. This delays instantiation of variables and allows E to detect structural clashes early.
During proof search, E repeatedly needs to test a term s for unifiability not only with some other term t but also with t's subterms. Prefix optimization speeds up this test: The subterms of t are traversed in a first-order fashion; for each such subterm ζ t̄ₙ, at most one prefix ζ t̄ₖ, with k ≤ n, is possibly unifiable with s, by virtue of their having the same arity. For first-order problems, we can only have k = n, since all functions are fully applied. Using this technique, Ehoh is virtually as efficient as E on first-order terms.
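A minimal Python sketch of a unifier in the spirit of the transition system follows. It is an illustration under stated assumptions, not Ehoh's code: `Term`, `drop_args`, `resolve`, `occurs`, `substitute`, and `unify` are hypothetical names; the side conditions attached to the rule names in the comments are this sketch's reading of the prose; types are flattened tuples (an atomic type is a string); bindings live in a dictionary instead of binding fields; prefix optimization is omitted; and the queue discipline is simplified. Both input terms are assumed to be well typed and of the same type.

```python
from collections import deque
from dataclasses import dataclass
from typing import Tuple


def drop_args(ty, k):
    """Type of a head of type `ty` once it has been applied to k arguments."""
    if k == 0:
        return ty
    rest = ty[k:]
    return rest[0] if len(rest) == 1 else rest


@dataclass(frozen=True)
class Term:
    head: str
    ty: object                         # type of the head
    args: Tuple["Term", ...] = ()
    is_var: bool = False               # True if the head is a variable

    def term_type(self):
        return drop_args(self.ty, len(self.args))


def resolve(t, sigma):
    """Instantiate the head of t while it is a bound variable (top level only)."""
    while t.is_var and t.head in sigma:
        b = sigma[t.head]
        t = Term(b.head, b.ty, b.args + t.args, b.is_var)
    return t


def occurs(x, t, sigma):
    t = resolve(t, sigma)
    return (t.is_var and t.head == x) or any(occurs(x, a, sigma) for a in t.args)


def substitute(t, sigma):
    t = resolve(t, sigma)
    return Term(t.head, t.ty, tuple(substitute(a, sigma) for a in t.args), t.is_var)


def unify(s, t):
    """Return a unifier as a dict from variable names to terms, or None."""
    sigma, queue = {}, deque([(s, t)])
    while queue:
        s, t = queue.popleft()
        s, t = resolve(s, sigma), resolve(t, sigma)
        if s == t:                                   # Delete
            continue
        if not s.is_var and not t.is_var:
            if s.head != t.head:                     # Clash
                return None
            queue.extendleft(zip(s.args, t.args))    # Decompose (front of the queue)
            continue
        if not s.is_var or (t.is_var and len(s.args) > len(t.args)):
            s, t = t, s                              # Orient / OrientXY
        m, n = len(s.args), len(t.args)
        if m > n:                                    # ClashLenXF
            return None
        prefix = Term(t.head, t.ty, t.args[:n - m], t.is_var)
        if s.ty != prefix.term_type():               # ClashTypeX
            return None
        if prefix != Term(s.head, s.ty, (), True):
            if occurs(s.head, prefix, sigma):        # OccursCheck
                return None
            sigma[s.head] = prefix                   # Eliminate, applied lazily by resolve
        queue.extend(zip(s.args, t.args[n - m:]))    # DecomposeX (back of the queue)
    return {x: substitute(u, sigma) for x, u in sigma.items()}


I, II = "i", ("i", "i")                              # ι and ι → ι
a = Term("a", I)
s = Term("f", II, (Term("y", II, (a,), True),))      # f (y a)
t = Term("y", II, (Term("f", II, (a,)),), True)      # y (f a)
print(unify(s, t))                                   # binds y to f
```

Storing the binding in `sigma` and instantiating heads on demand stands in for the binding fields described in Sect. 3; the final `substitute` pass makes the returned substitution idempotent.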
The transition system introduced above always terminates with a correct answer. Our proofs follow the lines of Baader and Nipkow. The metavariable R is used to range over constraint sets S and the special value ⊥. The set of all unifiers of S is denoted by U(S). Note that U(S ∪ S′) = U(S) ∩ U(S′). We let U(⊥) = ∅. The notation S ⇒! S′ indicates that S ⇒* S′ and S′ is a normal form (i.e., there exists no S″ such that S′ ⇒ S″). A variable x is solved in S if it occurs exactly once in S, in a constraint of the form x ≟ t.
Proof The rules Delete, Decompose, Orient, and Eliminate are proved as in Baader and Nipkow. OrientXY trivially preserves unifiers. For DecomposeX, the core of the argument is as follows: The proof of the problem's unsolvability if rule Clash or OccursCheck is applicable carries over from Baader and Nipkow. For ClashTypeX, the justification is that σ(x s̄ₘ) = σ(u t̄ₘ) is possible only if σ(x) = σ(u), which requires x and u to have the same type. Similarly, for ClashLenXF, if

Lemma 2 If S is a normal form, then S is in solved form.
Proof Consider an arbitrary unification constraint s ≟ t ∈ S. We show that in all but one case, a rule is applicable, contradicting the hypothesis that S is a normal form. In the remaining case, s is a solved variable in S.
Case s = x:
- Subcase t = x: Delete is applicable.
Case s = x s̄ₘ for m > 0:
- Subcase t = η t̄ₙ for n ≥ m: DecomposeX or ClashTypeX is applicable, depending on whether x and η t̄ₙ₋ₘ have the same type.
- Subcase t = y t̄ₙ for n < m: OrientXY is applicable.
- Subcase t = f t̄ₙ for n < m: ClashLenXF is applicable.
Case s = f s̄ₘ:
- Subcase t = x t̄ₙ: Orient is applicable.
- Subcase t = f t̄ₙ: Due to well-typedness, m = n. Decompose is applicable.
- Subcase t = g t̄ₙ: Clash is applicable.
Since each constraint is of the form x ≟ t where x is solved in S, the problem S is in solved form.

Lemma 3 If the constraint set S is in solved form, then the associated substitution is an idempotent MGU of S.
Proof This lemma corresponds to Lemma 4.6.3 of Baader and Nipkow. Their proof carries over to λfHOL.

Theorem 4 (Partial correctness) If S ⇒! ⊥, then S has no solutions. If S ⇒! S′, then S′ is in solved form and the associated substitution is an idempotent MGU of S.
Proof The first part follows from Lemma 1. The second part follows from Lemmas 1, 2, and 3.

Theorem 5 (Termination) The relation ⇒ is well founded.
Proof We define an auxiliary notion of weight: Well-foundedness is proved by exhibiting a measure function from constraint sets to quadruples of natural numbers (n₁, n₂, n₃, n₄), where n₁ is the number of unsolved variables in S; n₂ is the sum of all term weights, The following table shows that the application of each positive rule lexicographically decreases the quadruple: The negative rules, which produce the special value ⊥, cannot contribute to an infinite ⇒ chain.
A unification algorithm for λfHOL can be derived from the above transition system, by committing to a strategy for applying the rules. This algorithm closely follows the Ehoh implementation, abstracting away from complications such as prefix optimization. We assume a flattened representation of terms; as in Ehoh, each variable stores the term it is bound to in its binding field (Sect. 3). We also rely on an ApplySubst function, which applies the binding to the top-level variable. The algorithm assumes that the terms to be unified have the same type. The pseudocode is as follows:
Matching Given s and t, the matching problem consists of finding a substitution σ such that σ(s) = t. We then write that "t is an instance of s" or "s generalizes t." We are interested in most general generalizations (MGGs). Matching can be reduced to unification by treating variables in t as nullary symbols [5], but E implements it separately.
Matching can be specified abstractly as a transition system on matching constraints sᵢ ≲? tᵢ consisting of the unification rules Decompose, DecomposeX, Clash, ClashTypeX, ClashLenXF (with ≲? instead of ≟) and augmented with additional matching-specific rules. The matching relation is sound, complete, and well founded. Interestingly, a Delete rule would be unsound for matching. Consider the problem {x ≲? x, x ≲? g x}. Applying Delete to the first constraint would yield the solution {x ≲? g x}, even though the original problem is clearly unsolvable.
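The following Python sketch illustrates matching on flattened λfHOL terms, including the example that shows why Delete would be unsound. It is a simplified illustration rather than E's procedure: `Term`, `drop_args`, and `match` are hypothetical names, types are flattened tuples (an atomic type is a string), and the substitution is an explicit dictionary. Variables of the target term are never instantiated, so they behave like constants.

```python
from dataclasses import dataclass
from typing import Tuple


def drop_args(ty, k):
    """Type of a head of type `ty` once it has been applied to k arguments."""
    if k == 0:
        return ty
    rest = ty[k:]
    return rest[0] if len(rest) == 1 else rest


@dataclass(frozen=True)
class Term:
    head: str
    ty: object                         # type of the head
    args: Tuple["Term", ...] = ()
    is_var: bool = False               # True if the head is a variable


def match(s, t, sigma=None):
    """Extend sigma so that sigma(s) == t; return None if this is impossible."""
    sigma = {} if sigma is None else sigma
    m, n = len(s.args), len(t.args)
    if s.is_var:
        if m > n:                                  # too few arguments on the right
            return None
        prefix = Term(t.head, t.ty, t.args[:n - m], t.is_var)
        if s.ty != drop_args(prefix.ty, n - m):    # the variable and the prefix must agree on type
            return None
        bound = sigma.get(s.head)
        if bound is None:
            sigma[s.head] = prefix
        elif bound != prefix:                      # the variable is already bound differently
            return None
    elif t.is_var or s.head != t.head:             # a symbol only matches the same symbol
        return None
    for si, ti in zip(s.args, t.args[n - m:]):
        if match(si, ti, sigma) is None:
            return None
    return sigma


I, II, III = "i", ("i", "i"), ("i", "i", "i")
a, b = Term("a", I), Term("b", I)
pattern = Term("x", II, (b,), True)                # x b
target = Term("g", III, (a, b))                    # g a b
print(match(pattern, target))                      # binds x to the prefix g a

# The problem {x ≲? x, x ≲? g x}: the first constraint records x ↦ x,
# which the second constraint then contradicts.
x = Term("x", I, (), True)
sigma = match(x, x)
print(match(x, Term("g", II, (x,)), sigma))        # None
```

Because the first constraint is recorded as a binding instead of being deleted, the conflict with the second constraint is detected.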

Indexing data structures
Superposition provers like E work by saturation. Their main loop heuristically selects a clause and searches for potential inference partners among a possibly large set of other clauses. Mechanisms such as simplification and subsumption also require locating terms in a large clause set. For example, when E derives a new equation s ≈ t, if s is larger than t according to the term order, it will rewrite all instances σ (s) of s to σ (t) in existing clauses.
To avoid iterating over all terms (including subterms) in large clause sets, superposition provers store the potential inference partners in indexing data structures. A term index stores a set of terms S. Given a query term t, a query returns all terms s ∈ S that satisfy a given retrieval condition: σ (s) = σ (t) (s and t are unifiable), σ (s) = t (s generalizes t), or s = σ (t) (s is an instance of t), for some substitution σ. Perfect indices return exactly the subset of terms satisfying the retrieval condition. In contrast, imperfect indices return a superset of eligible terms, and the retrieval condition needs to be checked for each candidate.
E relies on two term indexing data structures, perfect discrimination trees [32] and fingerprint indices [42], that needed to be generalized to λfHOL. It also uses feature vector indices [43] to speed up subsumption and related techniques, but these require no changes to work with λfHOL.
Discrimination trees Discrimination trees [32] are tries in which every node is labeled with a symbol or a variable. A path from the root to a leaf node corresponds to a "serialized term," that is, a term expressed without parentheses and commas.
Consider two discrimination trees D₁ and D₂. Assuming a, b, x, y: ι, f: ι → ι, and g: ι² → ι, D₁ represents the term set {f(a), g(a, a), g(b, a), g(b, b)}, and D₂ represents the term set {f(x), g(a, a), g(y, a), g(y, x), x}. E uses perfect discrimination trees for finding generalizations of query terms. Thus, if the query term is g(a, a), it would follow the path g.a.a in D₁ and return {g(a, a)}. For D₂, it would also explore paths labeled with variables, binding them as it proceeds, and return {g(a, a), g(y, a), g(y, x), x}.
It is crucial for this data structure that distinct terms always give rise to distinct serialized terms. Conveniently, this property also holds for λfHOL terms. Suppose that two distinct λfHOL terms yield the same serialization. Clearly, they must disagree on parentheses; one will have the subterm s t u where the other has s (t u). However, these two subterms cannot both be well typed.
When generalizing the data structure to λfHOL, we face a complication due to partial application. First-order terms can only be stored in leaf nodes, but in Ehoh we must also be able to represent partially applied terms, such as f, g, or g a (assuming, as above, that f is unary and g is binary). Conceptually, this can be solved by storing a Boolean on each node indicating whether it is an accepting state. In the implementation, the change is more subtle, because several parts of E's code implicitly assume that only leaf nodes are accepting.
The main difficulty specific to λfHOL concerns applied variables. To enumerate all generalizing terms, E needs to backtrack from child to parent nodes. This is achieved using two stacks that store subterms of the query term: T stores the terms that must be matched in turn against the current subtree, and P stores, for each node from the root to the current subtree, the corresponding processed term.
Let [a₁, …, aₙ] denote an n-item stack with a₁ on top. Given a query term t, the matching procedure starts at the root with σ = ∅, T = [t], and P = []. The procedure advances by repeatedly moving to a suitable child node:
A. If the node is labeled with a symbol f and the top item t of T is of the form f(t̄ₙ), replace t by n new items t₁, …, tₙ, and push t onto P.
B. If the node is labeled with a variable x, there are two subcases. If x is already bound, check that σ(x) = t; otherwise, extend σ so that σ(x) = t. Next, pop the term t from T and push it onto P.
The goal is to reach an accepting node. If the query term and all the terms stored in the tree are first-order, T will then be empty, and the entire query term will have been matched. Backtracking works in reverse: Pop a term t from P; if the current node is labeled with an n-ary symbol, discard T's topmost n items; push t onto T. Undo any variable bindings.
As an example, looking up g(b, a) in the tree D₁ would result in the following succession of stack states, starting from the root along the path g.b.a:
     root         g            g.b            g.b.a
T    [g(b, a)]    [b, a]       [a]            []
P    []           [g(b, a)]    [b, g(b, a)]   [a, b, g(b, a)]
Backtracking amounts to moving leftward: To get back from g to the root, we pop g(b, a) from P, we discard two items from T, and we push g(b, a) onto T.
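Cases A and B and the backtracking step can be sketched for first-order terms as follows. This is a minimal illustration, not E's implementation: the names `Term`, `Node`, `insert`, and `generalizations` are hypothetical, recursion replaces the explicit stack P, and variable bindings are kept in a dictionary rather than in binding fields.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Term:
    head: str
    args: Tuple["Term", ...] = ()
    is_var: bool = False


@dataclass
class Node:
    children: Dict[Tuple[str, bool], "Node"] = field(default_factory=dict)
    value: Optional[Term] = None       # set on accepting nodes only


def preorder(t):
    """Serialize a term: its head followed by the serializations of its arguments."""
    yield (t.head, t.is_var)
    for arg in t.args:
        yield from preorder(arg)


def insert(root, t):
    node = root
    for label in preorder(t):
        node = node.children.setdefault(label, Node())
    node.value = t


def generalizations(node, todo, sigma):
    """Yield every stored term s with sigma(s) == query; `todo` plays the role of T."""
    if not todo:
        if node.value is not None:
            yield node.value
        return
    t, rest = todo[0], todo[1:]
    for (label, is_var), child in node.children.items():
        if is_var:                                         # case B
            if label in sigma:
                if sigma[label] == t:
                    yield from generalizations(child, rest, sigma)
            else:
                sigma[label] = t
                yield from generalizations(child, rest, sigma)
                del sigma[label]                           # undo the binding when backtracking
        elif not t.is_var and label == t.head:             # case A
            yield from generalizations(child, list(t.args) + rest, sigma)


a, b = Term("a"), Term("b")
x, y = Term("x", is_var=True), Term("y", is_var=True)
f = lambda s: Term("f", (s,))
g = lambda s, t: Term("g", (s, t))

d2 = Node()
for stored in [f(x), g(a, a), g(y, a), g(y, x), x]:
    insert(d2, stored)
found = set(generalizations(d2, [g(a, a)], {}))
print(len(found))  # 4
```

On the tree D₂, the query g(a, a) retrieves g(a, a), g(y, a), g(y, x), and x, in agreement with the example given earlier.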
To adapt the procedure to λfHOL, the key idea is that an applied variable is not very different from an applied symbol. A node labeled with an n-ary head ζ matches a prefix t′ of the k-ary term t popped from T and leaves n − k arguments ū to be pushed back, with t = t′ ū. If ζ is a variable, it must be bound to the prefix t′, assuming ζ and t′ are of the same type. Backtracking works analogously: Given the arity n of the node label ζ and the arity k of the term t popped from P, we discard the topmost n − k items ū from T.
To illustrate the procedure, we consider the tree D₂ but change y's type to ι → ι. This tree stores {f x, g a a, g (y a), g (y x), x}. Let g (g a b) be the query term. We have the following sequence of substitutions σ and stacks T, P:
     root             g                g.y                      g.y.x
σ    ∅                ∅                {y ↦ g a}                {y ↦ g a, x ↦ b}
T    [g (g a b)]      [g a b]          [b]                      []
P    []               [g (g a b)]      [g a b, g (g a b)]       [b, g a b, g (g a b)]
When backtracking from g.y to g, by comparing y's arity of n = 1 with g a b's arity of k = 0, we determine that one item must be discarded from T.
Finally, to avoid traversing twice as many subterms as in the first-order case, we can optimize prefixes: Given a query term ζ t̄ₙ, we can also match prefixes ζ t̄ₖ, where k < n, by allowing T to be nonempty when we reach an accepting node.
Similarly to matching, we present finding generalizations in a perfect discrimination tree as a transition system. States are quadruples Q = (t̄, b̄, D, σ), where t̄ is a list of terms, b̄ is a list of tuples storing backtracking information, D is a discrimination (sub)tree, and σ is a substitution.
Let D be a perfect discrimination tree. Term(D) denotes the set of terms stored in D. The function D| ζ returns the child of D labeled with ζ , if it exists. Child nodes are themselves perfect discrimination (sub)trees. Given any node D, if the node is accepting, then the value stored on that node is defined as val(D) = (s, d), where s is the accepted term and d is some arbitrary data; otherwise, val(D) is undefined.
Starting from an initial state ([t], [ ], D, ∅), where t is the query term and D is an entire discrimination tree, the following transitions are possible: if D| x is defined, x and s have the same type, and Above, · denotes prepending an element or a list to a list. Intuitively, AdvanceF and AdvanceX move deeper in the tree, generalizing cases A and B above to λfHOL terms. Backtrack can be used to return to a previous state. Success extracts the term t and data d stored in an accepting node.
The following derivation illustrates how to locate a generalization of g (g a b) in the tree D 2 : It is easy to show that Backtrack undoes an Advance transition: Proof For both Advance steps, we show that Backtrack restores the state properly. If AdvanceF was applied, we have We must show that t = f s m · t. Let k = arity(f) and l = arity(f s m ). By definition of k, we have m = k − l, as in Backtrack's side condition.
Again, we must show that t = s s m · t. Terms x and s must have the same type for AdvanceX to be applicable; therefore, they have the same arity. Then, we conclude m = arity(s) − arity(s s m ) = arity(x) − arity(s s m ), as in Backtrack's side condition. Thus, t = s s m · t.
thereby eliminating one Backtrack transition. By repeating this process, we can eliminate all applications of Backtrack.

Lemma 9 There exist no infinite chains of the form
Proof With each Advance transition, the height of the discrimination tree decreases by at least one.
Perfect discrimination trees match a single term against a set of terms. To prove them correct, we will connect them to the transition system ⇒ for matching (Sect. 4). This connection will help us show that whenever a discrimination tree stores a generalization of a query term, this generalization can be found. To express the refinement, we introduce an intermediate transition system, −→, that focuses on a single pair of terms (like ⇒) but that solves the constraints in a depth-first, left-to-right fashion and builds the substitution incrementally (like ). Its initial states are of the form ([s ? t], ∅). Its transitions are as follows: We need an auxiliary function to convert −→ states to ⇒ states. Let α( Moreover, let S range over states of the form (c, σ ) and R additionally range over special states of the form σ or ⊥. Case R = ⊥: All the −→ rules resulting in ⊥ except for Double have the same side conditions as the corresponding Hence, x ? u must be present in α(c, σ ). ⇒ DecomposeX will augment this set with x ? u, enabling ⇒ Double .
Proof First, we show that states S = (c , σ ) cannot be normal forms, by exhibiting transitions from such states. If c = [ ], the −→ Success rule would apply. Otherwise, let c = c 1 ·c and consider the matching problem {c 1 }∪α(σ ). If this problem is in solved form, c 1 is a constraint corresponding to a solved variable, and we can apply −→ DecomposeX to move the constraint into the substitution. Otherwise, some ⇒ rule can be applied. It necessarily focuses on c 1 , since the constraints from α(σ ) correspond to solved variables. In all cases except for ⇒ DecomposeX , a homologous −→ rule can be applied to S . If ⇒ DecomposeX would make ⇒ Double applicable, then we can apply −→ Double to S ; otherwise, −→ DecomposeX is applicable.

Lemma 12 The relation −→ is well founded.
Proof By Lemma 10, every −→ transition corresponds to zero or more ⇒ transitions. Since ⇒ is well founded, the only transitions that can violate well-foundedness of −→ are the ones that take idle ⇒ * transitions: −→ DecomposeX for m = 0 and −→ Success . The latter is terminal, so it cannot contribute to infinite chains. As for −→ DecomposeX , with m = 0, it decreases the following measure μ, which the other rules nonstrictly decrease, with respect to the multiset extension of We define t i , b i , and D i as follows, for i > 0. The list t i consists of the right-hand sides of the constraints c i , in the same order. Let hd be the function that extracts the head of a list. We If an accepting node storing s was reached in n steps, the serialization of s must be of the form ζ 1 . · · · .ζ n .
The sequence of states Q i forms a derivation: Success ((s, d),  ((s, d), σ ), then s ∈ Term(D) and σ is the MGG of s ? t.
Without loss of generality, by Lemma 8, we can assume that the derivation contains no Backtrack transitions.
Since s ∈ Term(D), the sequence D 0 , . . . , D n follows the preorder serialization of s: Next, we show that there exists a derivation of the form We define c i , for i > 0, as the list of constraints whose left-hand sides are the elements of u i and right-hand sides are the elements of t i , in the order they appear in the respective lists. By inspecting the definition of preord and the changes each Advance step makes to the head of t i , we can see that u i and t i have the same length. The sequence of states S i forms a derivation: The theorem tells us that given a term t, all generalizations s stored in the perfect discrimination tree can be found, but it does not exclude nondeterminism. Often, both AdvanceF and AdvanceX are applicable. To find all generalizations, we need to follow both transitions. But for some applications, it is enough to find a single generalization.
To cater for both types of applications, E provides iterators that store the state of a traversal. After an iterator is initialized with the root node D and the query term t, each call to FindNextVal will move the iterator to the next node that generalizes the query term and stores a value, indicating an accepting node. After all such nodes have been traversed, the iterator is set to point to Null.
The following definitions constitute the high-level interface for iterating through values incrementally or for obtaining all values of nodes that store generalizations of the query term in D.
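The idea of an iterator that can be advanced one accepting node at a time, or exhausted to collect all values, can be illustrated with a small Python generator. This is only a minimal sketch under simplifying assumptions: terms are first-order, prefix optimization and applied variables are not modeled, and the names `Node`, `insert`, and `generalizations` are hypothetical and do not correspond to E's actual interface. Terms are tuples `(head, arg1, ..., argn)`, and variables are assumed to start with an uppercase letter. As in the traversal strategy discussed here, variable-labeled children are visited before the child labeled with a function symbol.

```python
class Node:
    """Discrimination tree node; `values` is nonempty only at accepting nodes."""
    def __init__(self):
        self.children = {}
        self.values = []

def is_var(term):
    # Assumed convention: variable names start with an uppercase letter.
    return term[0][0].isupper()

def preorder(term):
    """Flatten a first-order term into the labels of its preorder traversal."""
    if is_var(term):
        return [term[0]]
    labels = [(term[0], len(term) - 1)]
    for arg in term[1:]:
        labels.extend(preorder(arg))
    return labels

def insert(root, term, value):
    node = root
    for label in preorder(term):
        node = node.children.setdefault(label, Node())
    node.values.append(value)

def generalizations(node, todo, subst=None):
    """Lazily yield (value, substitution) for each stored generalization."""
    subst = {} if subst is None else subst
    if not todo:
        for value in node.values:
            yield value, dict(subst)
        return
    head, rest = todo[0], todo[1:]
    # First follow the variable-labeled children (AdvanceX-like steps).
    for label, child in node.children.items():
        if not isinstance(label, str):
            continue
        if label not in subst:
            subst[label] = head
            yield from generalizations(child, rest, subst)
            del subst[label]          # undo the binding when backtracking
        elif subst[label] == head:
            yield from generalizations(child, rest, subst)
    # Then follow the child labeled with the head symbol (AdvanceF-like step).
    if not is_var(head):
        child = node.children.get((head[0], len(head) - 1))
        if child is not None:
            yield from generalizations(child, list(head[1:]) + rest, subst)

X, Y, a, b = ("X",), ("Y",), ("a",), ("b",)
root = Node()
for t in [("f", X, X), ("f", X, Y), ("f", a, b), ("g", X)]:
    insert(root, t, t)

it = generalizations(root, [("f", a, a)])
first = next(it)                         # stop after a single generalization
remaining = [value for value, _ in it]   # or exhaust the iterator to get all
```

Because the generator keeps its own traversal state, a caller that needs only one generalization pays only for the part of the tree explored so far.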
The core functionality is implemented in FindNextNode, presented below. This procedure moves the iterator to the next node that has not been explored in the search for generalization, or Null if the entire tree has been traversed. It first goes through all child nodes labeled with a variable before possibly visiting the child node labeled with a function symbol. We assume that we can iterate through the children of a node using a function NextVarChild that, given a tree node and iterator through children, advances the iterator to the child representing the next variable. Furthermore, we assume that the iterator can also be in the distinguished states Start and End. Start indicates that no child has been visited yet; End indicates that we have visited all children. Finally, the expression n.child(ζ) returns a child of the node n labeled ζ if such a child exists or Null otherwise.

The pseudocode uses a slightly different representation of backtracking tuples than the transition system −→. In the AdvanceX rule, σ changes only if the variable x was previously not bound. Instead of creating and storing substitutions explicitly, we simply remember whether the variable was bound in this step or not, in the var_unbound tuple component. Then we rely on the label x of the current node and its binding field to carry the substitutions. Similarly, since our strategy is to traverse the tree by first visiting the variable-labeled child nodes, we need to remember how far we have come with this traversal. We store this information in the c_iter tuple component.

Fingerprint indices
Fingerprint indices [42] trade perfect indexing for a compact memory representation and more flexible retrieval conditions. The basic idea is to compare terms by looking only at a few predefined sample positions. If we know that term s has symbol f at the head of the subterm at 2.1 and term t has g at the same position, we can immediately conclude that s and t are not unifiable.
Let A ("at a variable"), B ("below a variable"), and N ("nonexistent") be distinguished symbols not present in the signature, and let q < p denote that position q is a proper prefix of p (e.g., ε < 2 < 2.1). Given a term t and a position p, the fingerprint function Gfpf is defined as

$$\mathit{Gfpf}(t, p) = \begin{cases} f & \text{if } t|_p \text{ has a symbol head } f \\ \mathsf{A} & \text{if } t|_p \text{ is a variable} \\ \mathsf{B} & \text{if } t|_q \text{ is a variable for some } q < p \\ \mathsf{N} & \text{otherwise} \end{cases}$$

Based on a fixed tuple of positions p_1, . . . , p_n, the fingerprint of a term t is defined as Fp(t) = (Gfpf(t, p_1), . . . , Gfpf(t, p_n)).
To compare two terms s and t, it suffices to check that their fingerprints are componentwise compatible using the following unification and matching matrices, respectively: The rows and columns correspond to s and t, respectively. The metavariables f 1 and f 2 represent arbitrary distinct symbols. Incompatibility is indicated by ✗.
As an example, let (ε, 1, 2, 1.1, 1.2, 2.1, 2.2) be the tuple of sample positions. With these positions, the term f(g(x), g(a)) has the fingerprint f.g.g.A.N.a.N.

A fingerprint index is a trie that stores a term set T keyed by fingerprint. The term f(g(x), g(a)) above would be stored in the node addressed by f.g.g.A.N.a.N, together with other terms that share the same fingerprint. This scheme makes it possible to unify or match a query term s against all the terms T in one traversal. Once a node storing the terms U ⊆ T has been reached, due to overapproximation we must apply unification or matching on s and each u ∈ U.
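The first-order fingerprint computation and the unification check can be sketched in Python as follows. This is a minimal sketch, not E's implementation. Two assumptions are made: variables are written with an uppercase initial (function symbols are lowercase, so they cannot clash with the feature names A, B, and N), and the function `unif_compatible` encodes our reading of the unification matrix, in which B is compatible with everything, N is compatible only with N and B, and A is compatible with every symbol. This reading yields exactly eight incompatible cells for two distinct symbols f 1 and f 2 .

```python
def is_var(term):
    # Assumed convention: variable names start with an uppercase letter.
    return term[0][0].isupper()

def gfpf(term, pos):
    """Fingerprint feature of `term` at `pos` (a tuple of 1-based indices)."""
    for i in pos:
        if is_var(term):
            return "B"            # the position lies strictly below a variable
        args = term[1:]
        if i > len(args):
            return "N"            # the position does not exist
        term = args[i - 1]
    return "A" if is_var(term) else term[0]

def fingerprint(term, positions):
    return [gfpf(term, p) for p in positions]

def unif_compatible(u, v):
    """Assumed unification matrix (see the lead-in for the reading used)."""
    if "B" in (u, v):
        return True
    if "N" in (u, v):
        return u == v
    return "A" in (u, v) or u == v

def may_unify(s, t, positions):
    fs, ft = fingerprint(s, positions), fingerprint(t, positions)
    return all(unif_compatible(u, v) for u, v in zip(fs, ft))

POSITIONS = [(), (1,), (2,), (1, 1), (1, 2), (2, 1), (2, 2)]
X, Y, Z, a = ("X",), ("Y",), ("Z",), ("a",)
t = ("f", ("g", X), ("g", a))          # f(g(x), g(a))
print(".".join(fingerprint(t, POSITIONS)))   # f.g.g.A.N.a.N
```

A negative answer from `may_unify` is definitive, whereas a positive answer only identifies a candidate that must still be passed to the unification algorithm.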
When adapting this data structure to λfHOL, we must first choose a suitable notion of position in a term. Conventionally, higher-order positions are strings over {1, 2}, but this is not graceful. Instead, it is preferable to generalize the first-order notion to flattened λfHOL terms, e.g., x a b|_1 = a and x a b|_2 = b. However, this approach fails on applied variables. For example, although x b and f a b are unifiable (using {x ↦ f a}), sampling position 1 would yield a clash between b and a. To ensure that positions remain stable under substitution, we propose to number arguments in reverse: t|^ε = t and ζ t_n . . . t_1|^{i.p} = t_i|^p if 1 ≤ i ≤ n. We use a nonstandard notation, t|^p, for this nonstandard notion. The operation is undefined for out-of-bound indices.
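The effect of the reversed numbering can be checked on the example above with a few lines of Python. The sketch assumes flattened terms represented as tuples `(head, arg1, ..., argn)` and variables written with an uppercase initial; the function names are hypothetical.

```python
def subterm_fwd(term, pos):
    """Left-to-right argument numbering; returns None if the position is undefined."""
    for i in pos:
        args = term[1:]
        if not 1 <= i <= len(args):
            return None
        term = args[i - 1]
    return term

def subterm_rev(term, pos):
    """Right-to-left argument numbering; returns None if the position is undefined."""
    for i in pos:
        args = term[1:]
        if not 1 <= i <= len(args):
            return None
        term = args[len(args) - i]    # argument i counted from the right
    return term

a, b = ("a",), ("b",)
x_b = ("X", b)           # the applied variable x b
f_a_b = ("f", a, b)      # f a b, an instance of x b under {x -> f a}

# Forward numbering samples different arguments in the two terms.
print(subterm_fwd(x_b, (1,)), subterm_fwd(f_a_b, (1,)))   # ('b',) ('a',)
# Reversed numbering samples the same argument in both.
print(subterm_rev(x_b, (1,)), subterm_rev(f_a_b, (1,)))   # ('b',) ('b',)
```

Instantiating the head variable can only add arguments on the left, so the arguments counted from the right keep their indices.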

Lemma 17
Let s and t be unifiable terms, and let p be a position such that the subterms s|^p and t|^p are defined. Then s|^p and t|^p are unifiable.
Proof By structural induction on p. The case p = ε is trivial. Case p = q.i: Let s|^q = ζ s_m . . . s_1 and t|^q = η t_n . . . t_1.
Since p is defined in both s and t, we have s|^p = s_i and t|^p = t_i. By the induction hypothesis, s|^q and t|^q are unifiable, meaning that there exists a substitution σ such that σ(ζ s_m . . . s_1) = σ(η t_n . . . t_1). Since i ≤ min(m, n) and instantiating a head can only add arguments on the left, comparing the ith arguments from the right on both sides yields σ(s_i) = σ(t_i). Hence s_i and t_i are unifiable.
Let t⟨p⟩ denote the subterm t|^q such that q is the longest prefix of p for which t|^q is defined. The λfHOL version of the fingerprint function is defined as follows:

$$\mathit{Gfpf}'(t, p) = \begin{cases} f & \text{if } t|^p \text{ has a symbol head } f \\ \mathsf{A} & \text{if } t|^p \text{ has a variable head} \\ \mathsf{B} & \text{if } t|^p \text{ is undefined but } t\langle p\rangle \text{ has a variable head} \\ \mathsf{N} & \text{otherwise} \end{cases}$$

Except for the reversed numbering scheme, Gfpf′ coincides with Gfpf on first-order terms. The fingerprint Fp′(t) of a term t is defined analogously as before, and the same compatibility matrices can be used.

The key difference between Gfpf and Gfpf′ concerns

We can easily support prefix optimization for both terms s and t being compared: We simply add enough fresh variables as arguments to ensure that s and t are fully applied before computing their fingerprints.
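A small Python sketch of the λfHOL fingerprint function makes the role of applied variables concrete. It assumes flattened terms `(head, arg1, ..., argn)`, variables written with an uppercase initial, and the same reading of the unification matrix as for first-order fingerprints (B compatible with everything, N only with N and B, A with every symbol). The function names are hypothetical.

```python
def is_var_head(term):
    # Assumed convention: variable names start with an uppercase letter.
    return term[0][0].isupper()

def gfpf_ho(term, pos):
    """Fingerprint feature under reversed argument numbering."""
    for i in pos:
        args = term[1:]
        if not 1 <= i <= len(args):
            # The position is undefined; `term` is the longest defined prefix subterm.
            return "B" if is_var_head(term) else "N"
        term = args[len(args) - i]
    return "A" if is_var_head(term) else term[0]

def unif_compatible(u, v):
    """Assumed unification matrix (see the lead-in)."""
    if "B" in (u, v):
        return True
    if "N" in (u, v):
        return u == v
    return "A" in (u, v) or u == v

POSITIONS = [(), (1,), (2,)]
a, b = ("a",), ("b",)
x_b, f_a_b = ("X", b), ("f", a, b)

fp_s = [gfpf_ho(x_b, p) for p in POSITIONS]      # ['A', 'b', 'B']
fp_t = [gfpf_ho(f_a_b, p) for p in POSITIONS]    # ['f', 'b', 'a']
compatible = all(unif_compatible(u, v) for u, v in zip(fp_s, fp_t))
```

In this example the applied variable x b yields B at position 2, which is compatible with the symbol a found at the same position in f a b, so the unifiable pair is not filtered out.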

Lemma 18
If terms s and t are unifiable, then Gfpf′(s, p) and Gfpf′(t, p) are compatible according to the unification matrix. If s generalizes t, then Gfpf′(s, p) and Gfpf′(t, p) are compatible according to the matching matrix.
Proof We focus on the case of unification. By contraposition, it suffices to consider the eight blank cells in the unification matrix, where the rows correspond to Gfpf′(s, p) and the columns correspond to Gfpf′(t, p). Since unifiability is a symmetric relation, we can rule out four cases.
Case f_1-f_2: By definition of Gfpf′, s|^p and t|^p must be of the forms f_1 s̄ and f_2 t̄, respectively. Clearly, s|^p and t|^p are not unifiable. By Lemma 17, s and t are not unifiable.
for n < i. Since q.i is a legal position in s, s|^q has the form ζ s_m . . . s_1, with i ≤ m. A necessary condition for σ(s|^q) = σ(t|^q) is that σ(ζ s_m . . . s_{n+1}) = σ(g), but this is impossible because the left-hand side is an application (since n < m), whereas the right-hand side is the symbol g. By Lemma 17, s and t are not unifiable.

Corollary 19 (Overapproximation) If s and t are unifiable terms, then Fp′(s) and Fp′(t) are compatible according to the unification matrix. If s generalizes t, then Fp′(s) and Fp′(t) are compatible according to the matching matrix.

Feature vector indices
A clause C subsumes a clause D if there exists a substitution σ such that σ(C) ⊆ D. Subsumption is a crucial operation to prune the search space. Feature vector indices [43] are an imperfect indexing data structure that can be used to retrieve clauses that subsume a query clause or that are subsumed by the query clause. Unlike for discrimination trees and fingerprint indices, no changes were necessary to adapt feature vector indices to λfHOL. All the predefined features make sense in λfHOL.
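The filtering principle behind feature vectors can be sketched in Python. The sketch assumes multiset subsumption, so that literal counts and symbol occurrence counts can only grow from the subsuming clause to the subsumed one. The concrete features (literal counts and per-symbol occurrence counts, split by polarity) are illustrative choices and not necessarily the predefined features used by E. Literals are triples `(positive, lhs, rhs)`, and variables are assumed to start with an uppercase letter.

```python
from collections import Counter

def feature_vector(clause):
    """Count literals and function-symbol occurrences, separately per polarity."""
    fv = Counter()
    for positive, lhs, rhs in clause:
        sign = "+" if positive else "-"
        fv[("literals", sign)] += 1
        stack = [lhs, rhs]
        while stack:
            term = stack.pop()
            if not term[0][0].isupper():     # skip variable heads
                fv[(term[0], sign)] += 1
            stack.extend(term[1:])           # arguments of symbols and applied variables
    return fv

def may_subsume(c, d):
    """Necessary condition for c to subsume d: every feature of c is <= that of d."""
    fc, fd = feature_vector(c), feature_vector(d)
    return all(fd[key] >= count for key, count in fc.items())

TRUE = ("$true",)
X, a, b = ("X",), ("a",), ("b",)
c = [(True, ("p", X), TRUE)]                                  # p(X)
d = [(True, ("p", a), TRUE), (False, ("q", b), TRUE)]         # p(a) or not q(b)
print(may_subsume(c, d), may_subsume(d, c))                   # True False
```

Since applied variables contribute only through their arguments, the same counting works unchanged on λfHOL terms, which is consistent with the observation that no adaptation was needed.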

Inference rules
Saturating provers show the unsatisfiability of a clause set by systematically adding logical consequences, eventually deriving the empty clause as a witness of unsatisfiability. They implement two kinds of inference rules: Generating rules produce new clauses and are needed for completeness, whereas simplification rules delete existing clauses or replace them by simpler clauses. This simplification is crucial for success, and most modern provers spend a large part of their time on simplification.
E's main loop, which applies the rules, implements the given clause procedure [4]. The proof state is represented by two disjoint subsets of clauses, the set of processed clauses P and the set of unprocessed clauses U . Initially, all clauses are unprocessed. At each iteration of the loop, the prover heuristically selects a given clause from U , adds it to P, and performs all generating inferences between this clause and all clauses in P. Resulting new clauses are added to U . This maintains the invariant that all direct consequences between clauses in P have been performed. Simplification is performed on the given clause (using clauses in P as side premises), on clauses in P (using the given clause), and on newly generated clauses (again, using P).
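The structure of the given clause procedure can be sketched in Python on a deliberately simple calculus. In this sketch, propositional binary resolution stands in for the superposition rules, clause size stands in for E's clause selection heuristics, and simplification is limited to tautology deletion and forward subsumption of the given clause. All function names are hypothetical.

```python
import heapq

def resolvents(c, d):
    """Generating inferences between two propositional clauses (sets of nonzero ints)."""
    for lit in c:
        if -lit in d:
            yield frozenset((c - {lit}) | (d - {-lit}))

def is_tautology(clause):
    return any(-lit in clause for lit in clause)

def saturate(clauses, max_iterations=10_000):
    processed = set()                                  # the set P
    unprocessed = [(len(c), i, frozenset(c)) for i, c in enumerate(clauses)]
    heapq.heapify(unprocessed)                         # the set U, smallest clause first
    counter = len(unprocessed)                         # unique tie-breaker for the heap
    for _ in range(max_iterations):
        if not unprocessed:
            return "saturated"
        _, _, given = heapq.heappop(unprocessed)
        if not given:
            return "unsat"                             # the empty clause was derived
        # Simplify the given clause using P as side premises.
        if is_tautology(given) or any(p <= given for p in processed):
            continue
        processed.add(given)
        # Perform all generating inferences between the given clause and P.
        for partner in list(processed):
            for new in resolvents(given, partner):
                heapq.heappush(unprocessed, (len(new), counter, new))
                counter += 1
    return "unknown"

print(saturate([{1}, {-1, 2}, {-2}]))   # unsat
print(saturate([{1, 2}, {-1, 2}]))      # saturated
```

The loop maintains the invariant mentioned above: whenever a clause enters `processed`, all inferences between it and the clauses already in `processed` are performed and their conclusions are queued in `unprocessed`.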
Ehoh is based on the same logical calculus as E, except that it is generalized to λfHOL terms. The standard inference rules and completeness proof of superposition with respect to the intensional Boolean-free fragment of our logic can be reused verbatim; the only changes concern the basic definitions of terms and substitutions [10, Sect. 1]. Refutational completeness of superposition for λfHOL terms has been formally proved by Peltier [37] using Isabelle. We introduced support for first-class Boolean terms in Ehoh by extending the preprocessor, as explained in Sect. 8.

The generating rules
The superposition calculus consists of four core generating rules, whose conclusions are added to the proof state: equality resolution (ER), equality factoring (EF), and superposition into negative and positive literals (SN and SP). In each rule, σ denotes the MGU of s and s′. Not shown are various side conditions that restrict the rules' applicability. ER and EF are single-premise rules that work on the entire left- or right-hand side of a literal of the given clause. To generalize them, it suffices to disable prefix optimization for unification.
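For reference, the four rules are commonly stated in the superposition literature along the following lines. This is a sketch of the textbook formulation, with σ the MGU of s and s′ and all side conditions omitted; the exact presentation used for E and Ehoh may differ in details, for example in which literal EF retains in its conclusion.

```latex
\[
\frac{s \not\approx s' \lor C}{\sigma(C)}\;\mathrm{ER}
\qquad
\frac{s \approx t \lor s' \approx u \lor C}
     {\sigma(t \not\approx u \lor s \approx u \lor C)}\;\mathrm{EF}
\]
\[
\frac{s \approx t \lor C \qquad u[s'] \not\approx v \lor D}
     {\sigma(u[t] \not\approx v \lor C \lor D)}\;\mathrm{SN}
\qquad
\frac{s \approx t \lor C \qquad u[s'] \approx v \lor D}
     {\sigma(u[t] \approx v \lor C \lor D)}\;\mathrm{SP}
\]
```

In SN and SP, u[s′] denotes a term u with a distinguished subterm s′, which in λfHOL may also be a prefix subterm.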
The rules for superposition into negative and positive literals (SN and SP) are more complex. As two-premise rules, they require the prover to find a partner for the given clause. There are two cases to consider, depending on whether the given clause acts as the first or second premise in an inference. Moreover, since the rules operate on subterms s′ of a clause, the prover must be able to efficiently locate all relevant subterms, including λfHOL prefix subterms. To cover the case where the given clause acts as the left premise, the prover relies on a fingerprint index to compute a set of clauses containing terms possibly unifiable with a side s of a positive literal of the given clause. Thanks to our generalization of fingerprints, in Ehoh this candidate set is guaranteed to overapproximate the set of all possible inference partners. The unification algorithm is then applied to filter out unsuitable candidates. Thanks to prefix optimization, we can avoid polluting the index with all prefix subterms.
When the given clause is the right premise, the prover traverses its subterms s′ looking for inference partners in another fingerprint index, which contains only entire left- and right-hand sides of equalities. Like E, Ehoh traverses subterms in a first-order fashion. If prefix unification succeeds, Ehoh determines the unified prefix and applies the appropriate inference instance.

The simplifying rules
Unlike generating rules, simplifying rules do not necessarily add conclusions to the proof state; they can also remove premises. E implements over a dozen simplifying rules, with unconditional rewriting and clause subsumption as the most significant examples. Here, we restrict our attention to a single rule, which best illustrates the challenges of supporting λfHOL: Given an equation s ≈ t, equality subsumption (ES) removes a clause containing a literal whose two sides are equal except that an instance of s appears on one side where the corresponding instance of t appears on the other side. E maintains a perfect discrimination tree storing clauses of the form s ≈ t indexed by s and t. When applying ES, E considers each positive literal u ≈ v of the given clause in turn. It starts by taking the left-hand side u as a query term. If an equation s ≈ t (or t ≈ s) is found in the tree, with σ(s) = u, the prover checks whether σ′(t) = v for some (possibly nonstrict) extension σ′ of σ. If so, ES is applicable, with a second premise of the form σ′(s) ≈ σ′(t) ∨ C.
To consider nonempty contexts, the prover traverses the subterms u′ and v′ of u and v in lockstep, as long as they appear under identical contexts. Thanks to prefix optimization, when Ehoh is given a subterm u′, it can find an equation s ≈ t in the tree such that σ(s) is equal to some prefix of u′, with some arguments ū remaining unmatched. Checking for equality subsumption then amounts to checking that v′ = σ′(t) ū, for some extension σ′ of σ.
For example, let f (g a b) ≈ f (h g b) be the given clause, and suppose that x a ≈ h x is indexed. Under context f [ ], Ehoh considers the subterms g a b and h g b. It finds the prefix g a of g a b in the tree, with σ = {x ↦ g}. The prefix h g of h g b matches the indexed equation's right-hand side h x using the same substitution, and the remaining argument in both subterms, b, is identical. Ehoh concludes that the given clause is redundant.
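The example can be replayed with a small Python sketch of prefix matching. The sketch replaces the discrimination tree lookup by a direct loop over the prefixes of the query subterm, ignores types, and assumes that the literal being tested shares no variables with the indexed equation. Terms are flattened tuples `(head, arg1, ..., argn)`, variables are assumed to start with an uppercase letter, and the function names are hypothetical.

```python
def is_var(head):
    # Assumed convention: variable names start with an uppercase letter.
    return head[0].isupper()

def match(s, u, sigma):
    """Try to extend sigma so that sigma(s) == u; return sigma or None."""
    s_args, u_args = s[1:], u[1:]
    cut = len(u_args) - len(s_args)
    if cut < 0:
        return None
    if is_var(s[0]):
        binding = (u[0], *u_args[:cut])     # the head variable absorbs a prefix of u
        if sigma.setdefault(s[0], binding) != binding:
            return None
    elif s[0] != u[0] or cut != 0:
        return None
    for s_arg, u_arg in zip(s_args, u_args[cut:]):
        if match(s_arg, u_arg, sigma) is None:
            return None
    return sigma

def equality_subsumed(equation, u, v):
    """Check whether the literal u = v is subsumed by `equation` under a common context."""
    for lhs, rhs in (equation, equation[::-1]):
        for k in range(1, len(u) + 1):
            prefix, rest = u[:k], u[k:]     # `rest` holds the unmatched arguments
            split = len(v) - len(rest)
            sigma = match(lhs, prefix, {})
            if (sigma is not None and split >= 1 and v[split:] == rest
                    and match(rhs, v[:split], sigma) is not None):
                return True
    # Descend in lockstep while both sides share the same context.
    if u[0] == v[0] and len(u) == len(v):
        differing = [i for i in range(1, len(u)) if u[i] != v[i]]
        if len(differing) == 1:
            return equality_subsumed(equation, u[differing[0]], v[differing[0]])
    return False

a, b, g, X = ("a",), ("b",), ("g",), ("X",)
equation = (("X", a), ("h", X))                    # x a = h x
u = ("f", ("g", a, b))                             # f (g a b)
v = ("f", ("h", g, b))                             # f (h g b)
print(equality_subsumed(equation, u, v))           # True
```

For the subterms g a b and h g b, the loop finds the prefix g a with the binding of the variable to g, leaving b unmatched on both sides, which mirrors the steps described in the example.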

Pragmatic extensions
Since Ehoh is based on a monomorphic logic, the only way to support extensionality without changing the calculus is to add a set of extensionality axioms for every function type occurring in the problem [10, Sect. 3.1]. The evaluation by Bentkamp et al. of such an approach was discouraging [10, Sect. 6], so we decided to support extensionality via inference rules in Ehoh. We implemented two well-known incomplete rules we had experimented with in the context of Zipperposition.
The negative and positive extensionality (NE and PE) rules come with the following side conditions. For NE, x̄ contains all the variables occurring in s and t, the terms s and t are of function type, sk is a fresh Skolem symbol, and the literal s ≉ t is eligible for resolution [9, Sect. 5]. For PE, the variable x does not occur in s, t, or C, no literals are selected in C, and s x ≈ t x is a maximal literal.
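Given these side conditions, the two rules plausibly have the following shape. This is a reconstruction from the side conditions and from the usual formulation of such extensionality rules, and it should be read as an assumption rather than as the exact statement of the rules.

```latex
\[
\frac{s \not\approx t \lor C}
     {s\,(\mathsf{sk}\;\bar{x}) \not\approx t\,(\mathsf{sk}\;\bar{x}) \lor C}\;\mathrm{NE}
\qquad
\frac{s\,x \approx t\,x \lor C}
     {s \approx t \lor C}\;\mathrm{PE}
\]
```

NE replaces a disequation between functions by a disequation between their values at a Skolem witness, whereas PE lifts an equation that holds for an arbitrary argument x to an equation between the functions themselves.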
Finally, we introduced an injectivity recognition (IR) rule, which detects injectivity axioms and asserts the existence of the inverse function for injective function symbols: where sk is a fresh Skolem symbol, J is the largest subset of {1, . . . , n} such that x_j = y_j for every j ∈ J. We denote the subsequence of x̄_n indexed by J by x̄_J. Moreover, we require that x_i ≠ y_i, all variables in x̄_K · ȳ_K are distinct, where K = {1, . . . , n} \ J, and neither x̄_K nor ȳ_K shares variables with x̄_J. For example, given add a b ≈ add a b ∨ a ≈ b , IR can derive the existence of the inverse sk 1 characterized by sk 1 (add a b) a ≈ b.

Heuristics
E's heuristics are largely independent of the logic used and work unchanged for Ehoh. Yet, in preliminary experiments, we noticed that E proved some λfHOL benchmarks quickly using the applicative encoding (Sect. 1), whereas Ehoh timed out. There were enough such problems to prompt us to take a closer look. Based on these observations, we extended the heuristics to exploit λfHOL-specific features.

Term order generation
The superposition calculus is parameterized by a term order-typically an instance of KBO or LPO (Sect. 3). E can generate a symbol weight function (for KBO) and a symbol precedence (for KBO and LPO) based on criteria such as the symbols' frequencies, their arities, and whether they appear in the conjecture.
In preliminary experiments, we discovered that the presence of an explicit application operator @ can be beneficial for some problems. Let a: ι 1 , b: ι 2 , c: ι 3 , f: ι 1 → ι 2 → ι 3 , x: ι 2 → ι 3 , y: ι 2 , and z: ι 3 , and consider the clauses f a y ≈ c and x b ≈ z, where the first one is the negated conjecture. Their applicative encoding is @ ι 2 ,ι 3 (@ ι 1 ,ι 2 →ι 3 (f, a), y) ≈ c and @ ι 2 ,ι 3 (x, b) ≈ z, where @ τ,υ is a type-indexed family of symbols representing the application of a function of type τ → υ. With the applicative encoding, generation schemes can take the symbols @ τ,υ into account, thereby exploiting the type information carried by such symbols. Since @ ι 2 ,ι 3 is a conjecture symbol, some weight generation scheme could give it a low weight, which would also impact the second clause. By contrast, the native λfHOL clauses share no symbols; the connection between them is hidden in the types of variables and symbols, which are ignored by the heuristics.
To simulate the behavior observed on applicative problems, we introduced four generation schemes that extend E's existing symbol-frequency-based schemes by partitioning the symbols by type. To each symbol, the new schemes assign a frequency equal to the sum of all symbol frequencies for its class. Each new scheme is inspired by a similarly named type-agnostic scheme in E, without type in its name:
- typefreqcount assigns as each symbol's weight the number of occurrences of symbols of the same type.
- typefreqrank sorts the frequencies calculated by typefreqcount in increasing order and assigns each symbol a weight corresponding to its rank.
- invtypefreqcount is typefreqcount's inverse. If typefreqcount would assign a weight w to a symbol, it assigns M − w + 1, where M is the maximum symbol weight according to typefreqcount.
- invtypefreqrank is typefreqrank's inverse. It sorts the frequencies in decreasing order.
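The four schemes can be sketched in Python as follows. The input format (one dictionary mapping symbols to types and one mapping symbols to occurrence counts) is an assumption made for illustration, as is the treatment of ties in the rank-based schemes, where symbols whose types have equal frequency receive the same rank.

```python
from collections import defaultdict

def typefreqcount(types, freq):
    """Weight of a symbol = total number of occurrences of symbols of its type."""
    per_type = defaultdict(int)
    for sym, count in freq.items():
        per_type[types[sym]] += count
    return {sym: per_type[types[sym]] for sym in freq}

def _rank(counts, reverse):
    # Assumption: equal frequencies share a rank; ranks start at 1.
    order = sorted(set(counts.values()), reverse=reverse)
    ranks = {count: rank for rank, count in enumerate(order, start=1)}
    return {sym: ranks[count] for sym, count in counts.items()}

def typefreqrank(types, freq):
    return _rank(typefreqcount(types, freq), reverse=False)

def invtypefreqcount(types, freq):
    counts = typefreqcount(types, freq)
    maximum = max(counts.values())
    return {sym: maximum - w + 1 for sym, w in counts.items()}

def invtypefreqrank(types, freq):
    return _rank(typefreqcount(types, freq), reverse=True)

types = {"a": "i", "b": "i", "f": "i>i", "g": "i>i", "p": "i>o"}
freq = {"a": 5, "b": 2, "f": 3, "g": 1, "p": 4}
print(typefreqcount(types, freq))   # {'a': 7, 'b': 7, 'f': 4, 'g': 4, 'p': 4}
```

In this example, the rarely occurring symbol g receives the same weight as f because both have the same type, which is the kind of type-level connection that the applicative encoding exposes through the @ symbols.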
We designed four more schemes (whose names begin with comb instead of type) that combine E's type-agnostic and Ehoh's type-aware approaches using a linear equation.
To generate symbol precedences, E can sort symbols by weight and use the symbol's position in the sorted array as the basis for precedence. To reflect the type information introduced by the applicative encoding, we implemented four type-aware precedence generation schemes. Ties are broken by comparing the symbols' number of occurrences and, if necessary, the position of their first occurrence in the input.

Literal selection
The side conditions of the superposition rules SN and SP (Sect. 6) rely on a literal selection function to restrict the set of inference literals, thereby reducing the search space. Given a clause, a literal selection function returns a (possibly empty) subset of its literals. For completeness, any nonempty subset selected must contain at least one negative literal. If no literal is selected, all maximal literals become inference literals. The most widely used function is probably SelectMaxLComplexAvoidPosPred, which we abbreviate to SelectMLCAPP. It selects at most one negative literal, based on size, absence of variables, and maximality of the literal in the clause.
Intuitively, applied variables can potentially be unified with more terms than terms with rigid heads. This makes them prolific in terms of possible inference partners, a behavior we might want to avoid. On the other hand, shorter proofs might be found if we prefer selecting applied variables. To cover both scenarios, we implemented selection functions that prefer or defer selecting applied variables.
Let max(L) = 1 if L is a maximal literal of the clause it appears in; otherwise, max(L) = 0. Let appvar(L) = 1 if L is a literal where either side is an applied variable; otherwise, appvar(L) = 0. Based on these definitions, we devised the following selection functions, both of which rely on SelectMLCAPP to break ties:
- SelectMLCAPPAvoidAppVar selects a negative literal L with the maximal value of (max(L), 1 − appvar(L)) according to the lexicographic order.
- SelectMLCAPPPreferAppVar selects a negative literal L with the maximal value of (max(L), appvar(L)) according to the lexicographic order.
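The two selection functions differ only in the second component of the lexicographic key, as the following Python sketch shows. The `Literal` record and its fields are hypothetical, and the tie-breaking by smaller `weight` is merely a stand-in for SelectMLCAPP, whose actual criteria are not modeled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    text: str
    negative: bool
    maximal: bool      # max(L) = 1
    appvar: bool       # appvar(L) = 1
    weight: int = 1    # hypothetical stand-in for SelectMLCAPP tie-breaking

def _select(clause, key):
    negatives = [lit for lit in clause if lit.negative]
    if not negatives:
        return None    # nothing selected: maximal literals become inference literals
    return max(negatives, key=lambda lit: (*key(lit), -lit.weight))

def select_avoid_appvar(clause):
    return _select(clause, lambda lit: (int(lit.maximal), 1 - int(lit.appvar)))

def select_prefer_appvar(clause):
    return _select(clause, lambda lit: (int(lit.maximal), int(lit.appvar)))

clause = [
    Literal("X a != b", negative=True, maximal=True, appvar=True),
    Literal("f a != b", negative=True, maximal=True, appvar=False),
    Literal("g a = b", negative=False, maximal=False, appvar=False),
]
print(select_avoid_appvar(clause).text)    # f a != b
print(select_prefer_appvar(clause).text)   # X a != b
```

Both functions return at most one negative literal, which satisfies the completeness requirement that a nonempty selection contain a negative literal.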

Clause selection
Selection of the given clause is a critical choice point. E heuristically assigns clause priorities and clause weights to the candidates. The priorities provide a crude partition, whereas the weights order the clauses within a partition. E's main loop visits, in round-robin fashion, a set of priority queues. From a given queue, the clause with the highest priority and the smallest weight is selected. Typically, one of the queues will use the clauses' age as priority, to ensure fairness. E provides template weight functions that allow users to fine-tune parameters such as weights assigned to variables or function symbols. The most widely used template is ConjectureRelativeSymbolWeight, which we abbreviate to CRSWeight. It computes term and clause weights according to eight parameters, notably conj_mul, a multiplier applied to the weight of conjecture symbols. This template works well for some applicatively encoded problems. Let a: ι, f: ι → ι, x: ι, and y: ι → ι, and consider the clauses y x ≈ x and f a ≈ a, where the first one is the negated conjecture. Their encoding is @ ι,ι (y, x) ≈ x and @ ι,ι (f, a) ≈ a. The encoded clauses share @ ι,ι , whose weight will be multiplied by conj_mul, usually a factor in the interval (0, 1). By contrast, the native λfHOL clauses share no symbols, and the heuristic would fail to notice that f and y have the same type, giving a higher weight to the second clause. To mitigate this, we coded a new type-aware template, CRSTypeWeight, that applies the conj_mul multiplier to all symbols whose type occurs in the conjecture. For the example above, since ι → ι appears in the conjecture, it would notice the relation between the conjecture variable y and the symbol f and multiply f's weight by conj_mul.
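The difference between the symbol-based and the type-based criterion can be illustrated on the example above with a heavily simplified Python sketch. It models only a uniform base weight and the conj_mul multiplier; the remaining parameters of the templates are ignored, variables are assumed to start with an uppercase letter, and all function names are hypothetical.

```python
def clause_weight(symbols, is_conjecture_related, conj_mul=0.5, base=1.0):
    """Sum of symbol weights, scaled by conj_mul for conjecture-related symbols."""
    return sum(base * (conj_mul if is_conjecture_related(sym) else 1.0)
               for sym in symbols)

types = {"a": "i", "f": "i>i", "X": "i", "Y": "i>i"}
conjecture = ["Y", "X", "X"]     # occurrences in  y x = x  (negated conjecture)
other = ["f", "a", "a"]          # occurrences in  f a = a

# CRSWeight-like criterion: only function symbols occurring in the conjecture count.
conj_symbols = {sym for sym in conjecture if not sym[0].isupper()}
symbol_based = clause_weight(other, lambda sym: sym in conj_symbols)

# CRSTypeWeight-like criterion: symbols whose type occurs in the conjecture count.
conj_types = {types[sym] for sym in conjecture}
type_based = clause_weight(other, lambda sym: types[sym] in conj_types)

print(symbol_based, type_based)   # 3.0 1.5
```

Under the type-based criterion the second clause becomes lighter and is therefore selected earlier, which approximates the behavior obtained through the shared @ symbol in the applicative encoding.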
Natively supporting λfHOL allows the prover to recognize applied variables. It may make sense to extend clause weight templates to either penalize or promote clauses with such variables. To support this extension, we added the following parameter to CRSTypeWeight, as well as to some of E's other weight function templates: appv_mult is a multiplier applied to terms s = x t̄_n, where s is either side of the literal and n > 0. In addition, we implemented a new clause priority scheme, ByAppVarNum, that separates the clauses by the number of top-level applied variables occurring in the clause, favoring those containing fewer such variables.

Configurations and modes
A combination of parameters, including term order, literal selection, and clause selection, is called a configuration. For years, E has provided an auto mode that analyzes the input problem and chooses a configuration known to perform well on similar problems. More recently, E has been extended with an autoschedule mode that applies a portfolio of configurations in sequence on the given problem, restarting the prover for each configuration.
Configurations that are suitable for a wide range of problems have emerged over time. One of them is the configuration that is most often chosen by E's auto mode. We call it boa ("best of auto"): The clause selection scheme consists of five queues, each of which is specified by a weight function template. The prefixes n. next to the template names indicate that the queue will be visited n times in the round-robin scheme before moving to the next one. The first argument to each template is the clause priority scheme.

Preprocessing
E's preprocessor transforms first-order formulas into clausal normal form, before the main loop is started. Since literals of clauses are (dis)equations, E encodes nonequational literals such as even(n) as equations even(n) ≈ ⊤. Beyond turning the problem into a conjunction of disjunctive clauses, the preprocessor eliminates quantifiers, introducing Skolem symbols for essentially existential quantifiers. For first-order logic, skolemization preserves both satisfiability (unprovability) and unsatisfiability (provability). In contrast, for higher-order logics without the axiom of choice, naive skolemization is unsound, because it introduces symbols that can be used to instantiate higher-order variables. One solution proposed by Miller [34, Sect. 6] is to ensure that Skolem symbols are always applied to a minimum number of arguments. However, to keep the implementation simple, we have decided to ignore this issue and consider all arguments as optional, including those to Skolem symbols. We plan to extend Ehoh's logic to full higher-order logic with the axiom of choice, which will address the issue.
There is another transformation performed by preprocessing that is problematic, but for a different reason. Definition unfolding is the process of replacing equationally defined symbols with their definitions and removing the defining equations. A definition is a clause of the form f x̄_m ≈ t, where the variables x̄_m are distinct, f does not occur in the right-hand side t, and Var(t) ⊆ {x_1, . . . , x_m}. This transformation preserves unsatisfiability (provability) for first- and higher-order logic, but not for λfHOL, making Ehoh incomplete. The reason is that by removing the definitional clause, we also remove a symbol f that otherwise could be used to instantiate a higher-order quantifier. For example, the clause set {f x ≈ x, f (y a) ≈ a} is unsatisfiable, whereas {y a ≈ a} is satisfiable in λfHOL. (In full higher-order logic, the second clause set would be unsatisfiable thanks to the {y ↦ λx. x} instance and β-conversion.) For the moment, we have simply disabled definition unfolding in Ehoh. We will enable it again once we have added support for λ-terms.
Higher-order logic treats formulas as terms of Boolean type, erasing the distinction between terms and formulas. As a consequence, formulas might appear as arguments not only to logical connectives but also to function symbols or applied variables, e.g., p(a ∧ b), y(¬a). We call such formulas nested. Kotelnikov et al. [26] describe a modification to Vampire's clausification algorithm to support nested formulas. We adapt their approach to the clausification algorithm [35] used by E. Given a formula ϕ to clausify, the following procedure removes nested formulas: 1. Let χ = ϕ|_p be the leftmost outermost nested formula that is different from ⊤, ⊥, or a variable x, if one exists; otherwise, skip to step 2. Let p = q.r where q is the longest strict prefix of p such that ψ = ϕ|_q is a formula. As an example, consider the formula f x ≈ x → p (a ∧ b).
Step 1 moves the subterm a∧b outward, yielding f . This formula can be clausified further as usual.
Theorem 20 (Total correctness) The above procedure always terminates and produces a set of clauses that is equisatisfiable with the original formula ϕ in λfHOL with interpreted Booleans and that contains no nested formulas other than ⊤, ⊥, and variables.
Proof It is easy to see that steps 1, 3, and 5 produce equivalent formulas or clauses. Moreover, steps 1 and 3 remove all offending nested formulas (i.e., other than ⊤, ⊥, and variables). In conjunction with the standard clausification algorithm, which preserves and reflects satisfiability, our procedure gives correct results when it terminates.
To prove termination, we will use a measure function W to natural numbers that decreases with each application of step 1 or 3. Steps 2 and 4 rely on a terminating algorithm, whereas each application of step 5 decreases the size of a clause. We k is the number of offending outermost nested formulas in ζ s̄_n. We must show W(ψ) > W(ψ′). By definition, ψ is of the form ζ s̄_n, where ζ is not a logical connective. Thus, . Steps 1 and 3 substitute ⊤ or ⊥, of measure 0, for a nested formula χ (including χ's own nested formulas) in ψ. Clearly, the longer r is, the more W(ψ′) decreases. Taking |r| = 1, we get the upper bound

The output may contain ⊤, ⊥, or Boolean variables as nested formulas. Since E was first developed as an untyped prover, unification of a variable with a Boolean constant was disallowed to avoid unsoundness. We needed to undo this in Ehoh. Ehoh must also remove trivial literals ⊥ ≈ ⊤ and ⊤ ≉ ⊤ that emerge during proof search.

Evaluation
How useful are Ehoh's new heuristics? And how does Ehoh perform compared with E, used directly or in tandem with the applicative encoding, and compared with other provers? To answer the first question, we evaluated each new parameter independently. From the empirical results, we derived a new configuration optimized for λfHOL. For the second question, we compared Ehoh's success rate and speed on λfHOL problems with native higher-order provers and on applicatively encoded problems with E. We also included first-order benchmarks to measure Ehoh's overhead. We set a CPU time limit of 60 s per problem. This is more than allotted by interactive proof tools such as Sledgehammer, or by cooperative provers such as Leo-III and Satallax, but less than the 300 s of CASC [50]. The experiments were performed on StarExec [47] nodes equipped with Intel Xeon E5-2609 0 CPUs clocked at 2.40 GHz.

Heuristics tuning
We used the boa configuration as the basis to evaluate the new heuristic schemes. For each heuristic parameter we tuned, we changed only its value while keeping the other parameters the same as for boa. This gives an idea of how each parameter affects overall performance. All heuristic parameters were tested on a 5012 problem suite generated using Sledgehammer, consisting of four variants of the Judgment Day [17] suite. The problems were given in native λfHOL syntax. The experiments described in this subsection were carried out using an earlier E version (2.3).
Evaluating the new weight and precedence generation heuristics amounted to testing each possible combination of frequency-based schemes, including E's original type-agnostic schemes. Table 1 shows the number of solved (i.e., proved or disproved) problems for each combination. In this and the following figures, the underlined number is for boa, whereas bold singles out the best value. In the names of the generation schemes, we abbreviated inv to i, type to t, freq to f, comb to cm, count to cn, and rank to r. Table 1 indicates that including type information in the generation schemes results in a somewhat higher number of solved problems compared with E's type-agnostic schemes. Against our expectations, Ehoh's combined schemes appear to be less efficient than the type-aware schemes.
The literal selection function has little impact on performance: Ehoh solves 2379 problems with SelectMLCAPP or SelectMLCAPPAvoidAppVar, and 2378 problems with SelectMLCAPPPreferAppVar.
Clause selection is the heuristic component that we extended the most. We must assess the effect of a new heuristic weight function, a multiplier for the occurrence of top-level applied variables, and clause priority based on the number of top-level applied variables.
To test the effect of the new type-based weight function, we replaced boa's queue, which uses 4.CRSWeight(…), with the queue ordered by 4.CRSTypeWeight(…). We call the original heuristic W and the type-aware alternative TW. We chose nine values for testing the effect of the applied variable multiplier appv_mult. Table 2 summarizes the results of combining W or TW with the different appv_mult values. Applying a multiplier smaller than 1, which corresponds to preferring literals containing applied variables, can lose dozens of solutions. Overall, using the type-aware heuristic seems slightly detrimental.
The results presented above give an idea of how each parameter influences performance. We also evaluated their performance in combination, to derive an alternative to boa for λfHOL. For each category of parameters, we chose either the parameter's value in boa ("Old") or the best performing newly implemented parameter ("New"). Based on the results above, for term orders, we chose the combination of invtypefreqrank and invtypefreq; for clause selection, we chose CRSTypeWeight with ConstPrio priority and an appv_mult factor of 1.41; for literal selection, we chose SelectMLCAPPAvoidAppVar. Table 3 shows the number of solved problems for all combinations of these parameters. From the two configurations that solve 2397 problems, we selected the "New Old New" combination as our suggested "higher-order best of auto," or hoboa, configuration.

Main evaluation
We now present a more detailed evaluation of hoboa, along with other configurations, on a larger benchmark suite. Our raw data are publicly available at https://doi.org/10.5281/zenodo.4045452. The benchmarks are divided into four sets: (1) 1147 first-order TPTP [51] problems belonging to the FOF (untyped) and TF0 (monomorphic) categories, excluding arithmetic; (2) 5012 Sledgehammer-generated problems from the Judgment Day [17] suite, targeting the monomorphic first-order logic embodied by TPTP TF0; (3) all 955 monomorphic higher-order problems from the TH0 category of the TPTP belonging to our extension of λfHOL; (4) 5012 Judgment Day problems targeting the λfHOL fragment of TPTP TH0.
The TPTP includes benchmarks from various areas of computer science and mathematics. It is the de facto standard for evaluating automatic provers, but it has few higher-order problems. For the first group of benchmarks, we randomly selected 1000 FOF problems (out of 8172) and all monomorphic TFF problems that are parsable by E within 60 s (amounting to 147 out of 231 monomorphic TFF problems). Both groups of Sledgehammer problems include two subgroups of 2506 problems, generated to include 32 or 512 Isabelle lemmas (SH32 and SH512), to represent both small and large problems. Each subgroup consists of two sub-subgroups of 1253 problems, generated by using either λ-lifting or SK-style combinators to encode λ-expressions.
To ascertain the effectiveness of our approach, we evaluated Ehoh against E used on applicative encodings of problems (denoted by @+E). For reference, we also evaluated the latest versions of higher-order provers that competed in the THF division of the 2019 edition of CASC [52]: CVC4 1.8 prerelease [6], Leo-III 1.4 [46], Satallax 3.4 [18], Vampire 4.4 [14], and Zipperposition 1.6 [9]. As at CASC, we used different versions of Vampire for first-order and higher-order problems. Similarly, Zipperposition does not use E as a backend when it is run on first-order problems and uses different heuristics on first- and higher-order problems. The genuine higher-order provers have the unfair advantage that they can instantiate higher-order variables with λ-terms. Thus, some formulas that are provable by these systems may be nontheorems for @+E and Ehoh, or they may require tedious reasoning about λ-lifted functions or SK-style combinators. An example is the conjecture ∃f. ∀x y. f x y ≈ g y x, whose proof requires taking λx y. g y x as the witness for f.
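The role of the λ-term witness in this example can be made explicit in a proof assistant. The following Lean 4 statement is only an illustration: the type variables α, β, and γ are assumptions introduced here, and Lean's equality stands in for ≈. The proof supplies fun x y => g y x as the witness for f, after which the remaining goal holds by β-reduction.

```lean
-- Illustration only: α, β, γ are hypothetical types chosen for this sketch.
example {α β γ : Type} (g : β → α → γ) :
    ∃ f : α → β → γ, ∀ x y, f x y = g y x :=
  ⟨fun x y => g y x, fun _ _ => rfl⟩
```

A prover that cannot synthesize such a λ-term must either find an existing symbol that behaves like the witness or reason through λ-lifted definitions or combinators.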
We ran all provers except Satallax (which only supports THF) on first-order benchmarks to measure the overhead introduced by our extensions, as well as that entailed by the applicative encoding. Table 4 gives the number of problems each system proved. In each column, bold highlights the best E value and the best value overall. We considered the E modes auto (a) and autoschedule (as) and the configurations boa (b) and hoboa (hb).
We observe the following. First, comparing the Ehoh row with the E row, we see that Ehoh's overhead is barely noticeable: the difference is at most two problems. Second, Ehoh outperforms the applicative encoding on both first-order and higher-order problems. Nevertheless, the raw evaluation data reveal that there are quite a few higher-order problems that @+E proves faster than Ehoh. Third, it is advantageous to use the higher-order versions of the Sledgehammer problems, although the difference in success rate is small, especially for SH512. Fourth, the new hoboa outperforms boa on higher-order problems, suggesting that it could be worthwhile to re-train auto and autoschedule based on λfHOL benchmarks and to design further heuristics. Fifth, Ehoh cannot compete against the best higher-order systems, but this is no surprise, given that it does not yet support λ-expressions and higher-order unification.
Next to the success rate, the time in which a prover gives an answer is also an important consideration. Table 5 compares the average running times, in seconds, of the various systems on the problems that all of the applicable systems proved. Clearly, Ehoh incurs little overhead on first-order problems. The raw evaluation data reveal that for boa, it takes Ehoh 2747 s to prove all first-order problems that E, @+E, and Ehoh can all prove using this configuration, compared with

Discussion and related work
Our working hypothesis is that it is possible to extend first-order provers to higher-order logic without slowing them down unduly. Our research program is two-pronged: On the theoretical side, we are investigating higher-order extensions of superposition [9,10,56]; on the practical side, we are implementing such extensions in a state-of-the-art prover.
The work described in this article required modifying many parts of the E prover. The invariants that variables cannot be applied and that symbols are always passed the same number of arguments were entrenched in E's code, requiring hundreds of modifications. Nonetheless, we found the generalization manageable and are now in a position to add support for λ-terms and higher-order unification.
Traditionally, most higher-order provers were designed from the ground up to target higher-order logic. Two exceptions are Otter-λ by Beeson [8] and Zipperposition by Cruanes et al. [9,20]. Otter-λ adds λ-terms and second-order unification to the superposition prover Otter [31]. Zipperposition, also based on superposition, was extended to Boolean-free higher-order logic by Bentkamp et al. [9]. Its performance is a far cry from E's, but it is easier to modify. Vukmirović et al. also used it to test and evaluate higher-order unification procedures [54] and Boolean reasoning [56]. Zipperposition now includes Ehoh as a backend in a cooperative architecture. Finally, there is recent work by the developers of Vampire [14] and of the SMT (satisfiability modulo theories) solvers CVC4 and veriT [6] to extend their provers to higher-order logic.
Native higher-order reasoning was pioneered by Robinson [39], Andrews [1], and Huet [24]. Andrews [2] and Benzmüller and Miller [12] provide excellent surveys. TPS, by Andrews et al. [3], was based on expansion proofs and lets users specify proof outlines. The Leo family of systems, developed by Benzmüller and his colleagues, is based on resolution and paramodulation. LEO [11] supported extensionality on the calculus level and introduced the cooperative paradigm to integrate first-order provers. Leo-III [46] expands the cooperation with SMT solvers and introduces term orders in a pragmatic, incomplete way. Brown's Satallax [18] is based on a complete higher-order tableau calculus, guided by a SAT solver; later versions also cooperate with E and Ehoh. Another noteworthy system is Lindblad's agsy-HOL [28]. It is based on a focused sequent calculus driven by a generic narrowing engine.
An alternative to all of the above is to reduce higher-order logic to first-order logic via a translation. Robinson [40] outlined this approach decades before tools such as MizAR [53], Sledgehammer [36], HOLyHammer [25], and CoqHammer [22] popularized it in proof assistants. In addition to performing an applicative encoding, such translations must eliminate the λ-expressions [21,33] and encode the type information [15]. In practice, on problems with a large first-order component, translations perform very well compared with the existing native provers [48]. Largely thanks to Sledgehammer, Isabelle often came in a close second at CASC, even defeating Satallax in 2012 [49].
By removing the need for the applicative encoding, our work reduces the translation gap. The encoding buries the λfHOL terms' heads under layers of @ symbols. Terms double in size, cluttering the data structures, and twice as many subterm positions must be considered for inferences. Moreover, the encoding is incompatible with interpreted operators, notably for arithmetic. A common remedy is to introduce proxies to connect an uninterpreted nullary symbol with its interpreted counterpart (e.g., @(@(add, x), y) ≈ x + y), but this is clumsy. A further complication is that in a monomorphic logic, @ is not a single symbol but a family of symbols @ τ,υ , which must be correctly introduced and recognized. Finally, the encoding must be undone in the proofs. While it should be possible to base a higher-order prover on such an encoding, the prospect is aesthetically and technically unappealing, and performance would likely suffer.
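The size overhead of the applicative encoding can be illustrated with a short Python sketch. The Term class and the functions app_encode, size, and show are hypothetical names introduced for this illustration and do not correspond to E's or Ehoh's data structures; types, and hence the family of typed @ symbols, are ignored. Each argument edge in a term contributes one @ symbol, so a term with n symbol occurrences grows to 2n − 1 occurrences.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Term:
    """A term in flattened (spine) form: a head applied to zero or more arguments."""
    head: str
    args: Tuple["Term", ...] = ()

def app_encode(t: Term) -> Term:
    """Applicative encoding: every application becomes a binary '@' node."""
    result = Term(t.head)  # the head becomes a nullary symbol
    for arg in t.args:
        result = Term("@", (result, app_encode(arg)))
    return result

def size(t: Term) -> int:
    """Number of symbol occurrences in the term."""
    return 1 + sum(size(a) for a in t.args)

def show(t: Term) -> str:
    """Render a term in first-order syntax."""
    if not t.args:
        return t.head
    return f"{t.head}({', '.join(show(a) for a in t.args)})"

if __name__ == "__main__":
    x, y = Term("x"), Term("y")
    add_xy = Term("add", (x, y))
    encoded = app_encode(add_xy)
    print(show(add_xy), size(add_xy))
    print(show(encoded), size(encoded))
```

In the encoded term, the original head add sits two levels below the root, which is the sense in which the heads are buried under layers of @ symbols.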

Conclusion
Despite considerable progress since the 1970s, higher-order automated reasoning has not yet assimilated some of the most successful methods for first-order logic with equality, such as superposition. We presented a graceful extension of a state-of-the-art first-order theorem prover to a fragment of higher-order logic devoid of λ-terms. Our work covers both theoretical and practical aspects. Experiments show promising results on λ-free higher-order problems and very little overhead for first-order problems, as we would expect from a graceful generalization.
Despite its lack of support for λ-terms, Ehoh is already deployed as a backend in the leading higher-order provers Satallax and Zipperposition. Ehoh will also form the basis of our work toward stronger higher-order automation. Our aim is to turn it into a prover that excels on proof obligations arising in interactive verification, which tend to be large but only mildly higher-order [48]. The next steps will be to extend Ehoh's data structures with λ-expressions and implement the higher-order unification procedure by Vukmirović et al. [54]. These techniques are cornerstones of our prototype Zipperposition, which dominated the higher-order proving division of the 2020 edition of CASC.