Superposition for Full Higher-Order Logic

We recently designed two calculi as stepping stones towards superposition for full higher-order logic: Boolean-free λ-superposition and superposition for first-order logic with interpreted Booleans. Stepping on these stones, we finally reach a sound and refutationally complete calculus for higher-order logic with polymorphism, extensionality, Hilbert choice, and Henkin semantics. In addition to the complexity of combining the calculus's two predecessors, new challenges arise from the interplay between λ-terms and Booleans. Our implementation in Zipperposition outperforms all other higher-order theorem provers and is on a par with an earlier, pragmatic prototype of Booleans in Zipperposition.


Introduction
Superposition is a leading calculus for first-order logic with equality. We have been wondering for some years whether it would be possible to gracefully generalize it to extensional higher-order logic and use it as the basis of a strong higher-order automatic theorem prover. Towards this goal, we have, together with colleagues, designed superposition-like calculi for three intermediate logics between first-order and higher-order logic. Now we are finally ready to assemble a superposition calculus for full higher-order logic. The filiation of our new calculus from Bachmair and Ganzinger's standard first-order superposition is as follows:
Standard superposition, Bachmair and Ganzinger [2] (Sup)
Superposition with ← → and delayed CNF, Ganzinger and Stuber [16] (← →Sup)
Superposition with Booleans, Nummelin et al. [23] (oSup)
Boolean-free λ-free superposition, Bentkamp et al. [7] (λfSup)
Boolean-free λ-superposition, Bentkamp et al. [6] (λSup)
Boolean λ-superposition, this paper (oλSup)
Our goal was to devise an efficient calculus for higher-order logic. To achieve it, we pursued two objectives. First, the calculus should be refutationally complete. Second, the calculus should coincide as much as possible with its predecessors oSup and λSup on the respective fragments of higher-order logic (which in turn essentially coincide with Sup on first-order logic). Achieving these objectives is the main contribution of this paper. We made an effort to keep the calculus simple, but often the refutational completeness proof forced our hand to add conditions or special cases.
Like oSup, our calculus oλSup operates on clauses that can contain Boolean subterms, and it interleaves clausification with other inferences. Like λSup, oλSup eagerly βη-normalizes terms, employs full higher-order unification, and relies on a fluid subterm superposition rule (FLUIDSUP) to simulate superposition inferences below applied variables, i.e., terms of the form y t_1 ... t_n for n ≥ 1.
Because oSup contains several superposition-like inference rules for Boolean subterms, our completeness proof requires dedicated fluid Boolean subterm hoisting rules (FLUIDBOOLHOIST, FLUIDLOOBHOIST), which simulate Boolean inferences below applied variables, in addition to FLUIDSUP, which simulates superposition inferences.
Due to restrictions related to the term order that parameterizes superposition, it is difficult to handle variables bound by unclausified quantifiers if these variables occur applied or in arguments of applied variables. We solve the issue by replacing such quantified terms ∀y. t by equivalent terms (λy. t) ≈ (λy. ⊤) in a preprocessing step.
We implemented our calculus in the Zipperposition prover and evaluated it on TPTP and Sledgehammer benchmarks. The new Zipperposition outperforms all other higher-order provers and is on a par with an ad hoc implementation of Booleans in the same prover by Vukmirović and Nummelin [30]. We refer to the technical report [8] for the completeness proof and a more detailed account of the calculus and its evaluation.

Logic
Our logic is higher-order logic (simple type theory) with rank-1 polymorphism, Hilbert choice, and functional and Boolean extensionality. Its syntax mostly follows Gordon and Melham [17]. We use the notation ā_n or ā to stand for the tuple (a_1, ..., a_n) where n ≥ 0. Deviating from Gordon and Melham, type arguments are explicit, written as c⟨τ̄_m⟩ for a symbol c : Πᾱ_m. υ and types τ̄_m. In the type signature Σ_ty, we require the presence of a nullary Boolean type constructor o and a binary function type constructor →. In the term signature Σ, we require the presence of the logical symbols. The logical symbols are shown in bold to distinguish them from the notation used for clauses below. Moreover, we require the presence of the Hilbert choice operator ε ∈ Σ. Although ε is interpreted in our semantics, we do not consider it a logical symbol. Our calculus will enforce the semantics of ε by an axiom, whereas the semantics of the logical symbols will be enforced by inference rules. We write V for the set of (term) variables. We use Henkin semantics, in the style of Fitting [15], with respect to which we can prove our calculus refutationally complete. In summary, our logic essentially coincides with the TPTP TH1 format [20].
We generally view terms modulo αβη-equivalence. When defining operations that need to analyze the structure of terms, however, we use a custom normal form as the default representative of a βη-equivalence class: The βηQ_η-normal form t↓_{βηQη} of a term t is obtained by bringing the term into η-short β-normal form and finally applying the rewrite rule Q⟨τ⟩ s ⟶_{Qη} Q⟨τ⟩ (λx. s x) exhaustively whenever s is not a λ-expression. Here and elsewhere, Q stands for either ∀ or ∃.
On top of the standard higher-order terms, we install a clausal structure that allows us to formulate calculus rules in the style of first-order superposition. A literal s ≈̇ t is an equation s ≈ t or a disequation s ≉ t of terms s and t; both equations and disequations are unordered pairs. A clause L_1 ∨ ··· ∨ L_n is a finite multiset of literals L_j. The empty clause is written as ⊥. This clausal structure does not restrict the logic, because an arbitrary term t of Boolean type can be written as the clause t ≈ ⊤.
We considered excluding negative literals by encoding them as (s ≈ t) ≈ ⊥ [16]. However, this approach would make the conclusion of the equality factoring rule (EFACT) too large for our purposes. Regardless, the simplification machinery will allow us to reduce negative literals of the form t ≉ ⊥ and t ≉ ⊤ to t ≈ ⊤ and t ≈ ⊥, respectively, thereby eliminating redundant representations of nonequational literals.
We let CSU(s, t) denote an arbitrary (preferably, minimal) complete set of unifiers for two terms s and t on the set of free variables of the clauses in which s and t occur. To compute such sets, Huet-style preunification [18] is not sufficient, and we must resort to a full unification procedure [19,29]. To cope with the nontermination of such procedures, we use dovetailing as described by Vukmirović et al. [28, Sect. 5].
Some of the rules in our calculus introduce Skolem symbols, representing objects mandated by existential quantification. We assume that these symbols do not occur in the input problem. More formally, given a problem over a term signature Σ, our calculus operates on a Skolem-extended term signature Σ_sk that, in addition to all symbols from Σ, inductively contains symbols sk_{Πᾱ. ∀x̄. ∃z. t z} : Πᾱ. τ̄ → υ for all types υ, variables z : υ, and terms t : υ → o over Σ_sk, where ᾱ are the free type variables occurring in t and x̄ : τ̄ are the free term variables occurring in t, both in order of first occurrence.
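To illustrate the general idea of dovetailing, the following Python sketch fairly interleaves a possibly infinite family of possibly infinite streams, so that every element of every stream is eventually produced. It is a generic illustration only, not the procedure of Vukmirović et al.; the function name `dovetail` and the representation of streams as Python iterators are assumptions made for this example.

```python
from itertools import count, islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def dovetail(streams: Iterable[Iterator[T]]) -> Iterator[T]:
    """Fairly interleave the given streams (hypothetical helper).

    Each round admits at most one new stream and then takes one element
    from every stream admitted so far, so no stream is starved even if
    some streams never terminate.
    """
    pending = iter(streams)
    active = []
    while True:
        try:
            active.append(next(pending))   # admit one new stream per round
        except StopIteration:
            if not active:                 # nothing left to explore
                return
        still_active = []
        for stream in active:
            try:
                yield next(stream)         # one step of this stream
                still_active.append(stream)
            except StopIteration:
                pass                       # this stream is exhausted
        active = still_active


def pairs_from(k: int) -> Iterator[tuple]:
    """An infinite stream (k, 0), (k, 1), ... used only for the demo."""
    return ((k, i) for i in count())


# Infinitely many infinite streams, explored fairly.
first_six = list(islice(dovetail(pairs_from(k) for k in count()), 6))
```

In a prover setting, each stream could stand for the unifiers of one unification problem, and the consumer would draw a bounded number of results before returning to other work.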

The Calculus
The oλSup calculus closely resembles λSup, augmented with rules for Boolean reasoning that are inspired by oSup. As in λSup, superposition-like inferences are restricted to certain first-order-like subterms, the green subterms, which we define inductively as follows: Every term t is a green subterm of t, and for all symbols is a green subterm of u_i for some i, then t is a green subterm of f⟨τ̄⟩ ū. For example, the green subterms of f (g y a, and λx. h b. We write s⟨t⟩ to denote a term s with a green subterm t and call the first-order-like context s⟨ ⟩ a green context.
Following λSup, we call a term t fluid if (1) t↓_{βηQη} is of the form y ū_n where n ≥ 1, or (2) t↓_{βηQη} is a λ-expression and there exists a substitution σ such that tσ↓_{βηQη} is not a λ-expression (due to η-reduction). Intuitively, fluid terms are terms whose normal form can change radically as a result of instantiation.
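The inductive definition of green subterms can be sketched in Python as follows. The term representation (`Var`, `Sym`, `Lam`) is hypothetical, type arguments are omitted, and the sketch assumes that only symbol-headed terms other than quantifiers expose their arguments as green subterms; applied variables and λ-expressions are green subterms themselves but are not traversed.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:                      # free variable, possibly applied: y a
    name: str
    args: Tuple["Term", ...] = ()


@dataclass(frozen=True)
class Sym:                      # symbol applied to arguments: f t1 ... tn
    name: str
    args: Tuple["Term", ...] = ()


@dataclass(frozen=True)
class Lam:                      # λ-expression λx. body
    var: str
    body: "Term"


Term = Union[Var, Sym, Lam]

# Assumption of this sketch: quantifier heads do not expose their arguments.
QUANTIFIERS = {"forall", "exists"}


def green_subterms(t: Term) -> list:
    """Return the green subterms of t, outermost first."""
    found = [t]
    if isinstance(t, Sym) and t.name not in QUANTIFIERS:
        for arg in t.args:
            found.extend(green_subterms(arg))
    return found


a, b = Sym("a"), Sym("b")
# f (g a) (y a) (λx. h b)
term = Sym("f", (Sym("g", (a,)), Var("y", (a,)), Lam("x", Sym("h", (b,)))))
greens = green_subterms(term)
```

For this example term, the green subterms are the term itself, g a, a, y a, and λx. h b. The argument a of the applied variable y and the subterms of the λ-body are not green.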
We define deeply occurring variables as in λSup, but exclude λ-expressions directly below quantifiers: A variable occurs deeply in a clause C if it occurs inside an argument of an applied variable or inside a λ-expression that is not directly below a quantifier.
Preprocessing. Our completeness theorem requires that quantified variables do not appear in certain higher-order contexts. We use preprocessing to eliminate problematic occurrences of quantifiers. The rewrite rules where the rewritten occurrence of Q⟨τ⟩ is unapplied or has an argument of the form λx. v such that x occurs as a nongreen subterm of v. If either of these rewrite rules can be applied to a given term, the term is Q≈-reducible; otherwise, it is Q≈-normal. For example, the term λy.
; or a quantified variable occurs in the argument of a variable, either a free variable (e.g., or a variable bound above the quantifier (e.g., λy.
reducible ground instances of clauses will be considered redundant by the redundancy criterion. Thus, clauses whose ground instances are all Q≈-normal ground instances. Such clauses must be kept because the completeness proof relies on their
In principle, we could omit the side condition of the rewrite rules and eliminate all quantifiers. However, the calculus (especially, the redundancy criterion) performs better with quantifiers than with λ-expressions, which is why we restrict normalization as much as the completeness proof allows. Extending the preprocessing to eliminate all Boolean terms as in Kotelnikov et al. [21] does not work for higher-order logic because Boolean terms can contain variables bound by enclosing λ-expressions.
Term Order. The calculus is parameterized by a well-founded strict total order ≻ on ground terms satisfying these four criteria: (O1) compatibility with green contexts, i.e., s′ ≻ s implies t⟨s′⟩ ≻ t⟨s⟩; (O2) green subterm property, i.e., t⟨s⟩ ⪰ s where ⪰ is the reflexive closure of ≻; u for all types τ, terms t, and terms u such that Q⟨τ⟩ t and u are normal and the only Boolean green subterms of u are ⊤ and ⊥. The restriction to Q≈-normal terms ensures that term orders fulfilling the requirements exist, but it forces us to preprocess the input problem. We extend ≻ to literals and clauses via the multiset extensions in the standard way [2, Sect. 2.4].
For nonground terms, ≻ is required to be a strict partial order such that t ≻ s implies tθ ≻ sθ for all grounding substitutions θ. As in λSup, we also introduce a nonstrict variant ≿ for which we require that tθ ⪰ sθ for all grounding substitutions θ whenever t ≿ s, and similarly for literals and clauses.
To construct a concrete order fulfilling these requirements, we define an encoding into untyped first-order terms, and compare these using a variant of the Knuth-Bendix order. In a first step, denoted O, the encoding translates fluid terms t as fresh variables z_t; nonfluid λ-expressions λx : τ.
Bound variables are encoded as constants db_i corresponding to De Bruijn indices. In a second step, denoted P, the encoding replaces Q_1 by Q′_1 and variables z by z′ whenever they occur below lam. For example, ). The first-order terms can then be compared using a transfinite Knuth-Bendix order kb [22]. Let the weight of and the weights of all other symbols be less than ω. Let the precedence > be total and
Selection Functions. The calculus is also parameterized by a literal selection function and a Boolean subterm selection function. We define an element x of a multiset M to be ⊵-maximal for some relation ⊵ if for all y ∈ M with y ⊵ x, we have y = x. It is strictly ⊵-maximal if it is ⊵-maximal and occurs only once in M.
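The notions of maximal and strictly maximal elements of a multiset can be sketched directly in Python. The function names are hypothetical, the multiset is represented as a plain list, and the relation is passed as a binary predicate `rel`; the integers with `>=` merely serve as a stand-in for terms under a term order.

```python
from operator import ge


def is_maximal(x, multiset, rel) -> bool:
    """x is rel-maximal if every y in the multiset with rel(y, x) equals x."""
    return all(y == x for y in multiset if rel(y, x))


def is_strictly_maximal(x, multiset, rel) -> bool:
    """Strictly maximal: maximal and occurring only once in the multiset."""
    return is_maximal(x, multiset, rel) and list(multiset).count(x) == 1


m = [1, 3, 3]
maximal_but_not_strict = is_maximal(3, m, ge) and not is_strictly_maximal(3, m, ge)
```

In the multiset [1, 3, 3], the element 3 is maximal but not strictly maximal because it occurs twice, which mirrors the distinction used for eligibility of positive literals.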
The literal selection function HLitSel maps each clause to a subset of selected literals. A literal may not be selected if it is positive and neither side is ⊥. Moreover, a literal L⟨y⟩ may not be selected if y ū_n, with n ≥ 1, is a ⪰-maximal term of the clause.
The Boolean subterm selection function HBoolSel maps each clause C to a subset of selected subterms in C. Selected subterms must be green subterms of Boolean type. Moreover, a subterm s must not be selected if s is at the topmost position on either side of a positive literal, or if s contains a variable y as a green subterm, and y ū_n, with n ≥ 1, is a ⪰-maximal term of the clause.
Eligibility. A literal L is (strictly) eligible w.r.t. a substitution σ in C if it is selected in C or there are no selected literals and no selected Boolean subterms in C and Lσ is (strictly) ⪰-maximal in Cσ.
The eligible subterms of a clause C w.r.t. a substitution σ are inductively defined as follows: Any selected subterm is eligible. If a literal L = s ≈̇ t with sσ ⋠ tσ is either eligible and negative or strictly eligible and positive, then the subterm s is eligible. If a subterm t is eligible and the head of
The Core Inference Rules. The calculus consists of the following core inference rules. The first five rules stem from λSup, with minor adaptations concerning Booleans:
The following rules are concerned with Boolean reasoning and originate from oSup. They have been adapted to support polymorphism and applied variables.
σ is a type unifier of the type of u with the Boolean type o (i.e., the identity if u is Boolean or {α → o} if u is of type α for some type variable α); 2. the head of u is neither a variable nor a logical symbol; 3. u is eligible in C; 4. the occurrence of u is not at the top level of a positive literal.
EQHOIST, NEQHOIST, FORALLHOIST, EXISTSHOIST 1. x, y, and α are fresh variables; 3. u is eligible in C w.r.t. σ; 4. if the head of u is a variable, it must be applied and the affected literal must be of the form u ≈ ,
BOOLRW 1. σ ∈ CSU(t, u) and (t, t′) is one of the following pairs, where y is a fresh variable: u is not a variable; 3. u is eligible in C w.r.t. σ; 4. if the head of u is a variable, it must be applied and the affected literal must be of the form , respectively, where β is a fresh type variable, y is a fresh term variable, ᾱ are the free type variables and x̄ are the free term variables occurring in yσ in order of first occurrence; 2. u is not a variable; 3. u is eligible in C w.r.t. σ; 4. if the head of u is a variable, it must be applied and the affected literal must be of the form u where v is a variable-headed term; 5. for FORALLRW, the indicated occurrence of u is not in a literal u ≈ , and for EXISTSRW, the indicated occurrence of u is not in a literal u
Like SUP, the Boolean rules must also be simulated in fluid terms. The following rules are Boolean counterparts of FLUIDSUP:
FLUIDBOOLHOIST 1. u is fluid; 2. z and x are fresh variables; 3. σ ∈ CSU(z x, u);
In addition to the inference rules, our calculus relies on two axioms. Axiom (EXT), from λSup, embodies functional extensionality; the expression diff⟨α, β⟩ abbreviates sk_{Παβ. ∀z y. ∃x. z x ≉ y x}⟨α, β⟩. Axiom (CHOICE) characterizes the Hilbert choice operator ε.
Rationale for the Rules. Most of the calculus's rules are adapted from its precursors. SUP, ERES, and EFACT are already present in Sup, with slightly different side conditions. Notably, as in λfSup and λSup, SUP inferences are required only into green contexts. Other subterms are accessed indirectly via ARGCONG and (EXT). The rules BOOLHOIST, EQHOIST, NEQHOIST, FORALLHOIST, EXISTSHOIST, FALSEELIM, BOOLRW, FORALLRW, and EXISTSRW, concerned with Boolean reasoning, stem from oSup, which was inspired by ← →Sup. Except for BOOLHOIST and FALSEELIM, these rules have a condition stating that "if the head of u is a variable, it must be applied and the affected literal must be of the form u where v is a variable-headed term." The inferences at variable-headed terms permitted by this condition are our form of primitive substitution [1,18], a mechanism that blindly substitutes logical connectives and quantifiers for variables z with a Boolean result type.
Example 1. Our calculus can prove that Leibniz equality implies equality (i.e., if two values behave the same for all predicates, they are equal) as follows: The EQHOIST inference, applied on z b, illustrates how our calculus introduces logical symbols without a dedicated primitive substitution rule. Although ≈ does not appear in the premise, we still need to apply EQHOIST on z b with CSU(z b, Other calculi [1,9,18,26] would apply an explicit primitive substitution rule instead, yielding essentially However, in our approach this clause is subsumed and could be discarded immediately. By hoisting the equality to the clausal level, we bypass the redundancy criterion.
Next, BOOLRW can be applied to Then SUP is applicable with the unifier {w → λx_1 x_2 x_3. x_2} ∈ CSU(b, w a b b), and ERES derives the contradiction.
As in λSup, the FLUIDSUP rule is responsible for simulating superposition inferences below applied variables, other fluid terms, and deeply occurring variables. Complementarily, FLUIDBOOLHOIST and FLUIDLOOBHOIST simulate the various Boolean inference rules below fluid terms. Initially, we considered adding a fluid version of each rule that operates on Boolean subterms, but we discovered that FLUIDBOOLHOIST and FLUIDLOOBHOIST suffice to achieve refutational completeness.

Example 2. The clause set consisting of h (y b) and a ≉ b highlights the need for FLUIDBOOLHOIST and its companion. The set is unsatisfiable because the instantiation {y → λx. g , which is unsatisfiable in conjunction with a ≉ b. The literal selection function can select either literal in the first clause. ERES is applicable in either case, but the unifiers {y → λx. and {y → λx. g } do not lead to a contradiction. Instead, we need to apply FLUIDBOOLHOIST if the first literal is selected or FLUIDLOOBHOIST if the second literal is selected. In the first case, the derivation is as follows: The FLUIDBOOLHOIST inference uses the unifier {y → λu. z′ u (x′ u), z → λu. z′ b u, x → x′ b} ∈ CSU(z x, y b). We apply ERES to the first literal of the resulting clause, with unifier {z → λuv. . Next, we apply EQHOIST with the unifier {x → λu. The two sides of the interpreted equality in the first literal can then be unified, allowing us to apply BOOLRW with the unifier {y → a, x → λu. a} ∈ CSU(y Finally, applying ERES twice and FALSEELIM once yields the empty clause.
Remarkably, none of the provers that participated in the CASC-J10 competition can solve this two-clause problem within a minute.Satallax finds a proof after 72 s and LEO-II after over 7 minutes.Our new Zipperposition implementation solves it in 3 s.
The Redundancy Criterion.In first-order superposition, a clause is considered redundant if all its ground instances are entailed by ≺-smaller ground instances of other clauses.In essence, this will also be our definition, but we will use a different notion of ground instances and a different notion of entailment.
Given a clause C, let its ground instances G(C) be the set of all clauses of the form Cθ for some substitution θ such that Cθ is ground and Q≈-normal and for all variables x occurring in C, the only Boolean green subterms of xθ are ⊤ and ⊥. The rationale of this definition is to ensure that ground instances of the conclusion of FORALLHOIST, EXISTSHOIST, FORALLRW, and EXISTSRW inferences are smaller than the corresponding instances of their premise by property (O4).
The redundancy criterion's notion of entailment is defined via an encoding into a weaker logic, following λfSup and λSup. In this paper, the weaker logic is ground first-order logic with interpreted Booleans, i.e., the ground fragment of the logic of oSup. Its signature (Σ_ty, Σ_GF) is derived from our higher-order signature (Σ_ty, Σ) as follows. The type constructors Σ_ty are the same in both signatures, but → is an uninterpreted type constructor in first-order logic. For each ground instance f⟨ῡ⟩ : τ_1 → ··· → τ_n → τ of a symbol f ∈ Σ, we introduce a first-order symbol f^ῡ_j ∈ Σ_GF with argument types τ̄_j and result type τ_{j+1} → ··· → τ_n → τ, for each j. Moreover, for each ground term λx. t, we introduce a symbol lam_{λx. t} ∈ Σ_GF of the same type. The symbols 2 are identified with the corresponding first-order logical symbols.
We define an encoding F from Q≈-normal ground higher-order terms into this ground first-order logic recursively as follows: F(∀⟨τ⟩ (λx. t)) = ∀x. F(t) and F(∃⟨τ⟩ (λx. t)) = ∃x. F(t) for applied quantifiers; F(λx. t) = lam_{λx. t} for λ-expressions; and F(f⟨ῡ⟩ s̄_j) = f^ῡ_j(F(s̄_j)) for other terms. For quantified variables, we define F(x) = x. Here, Q≈-normality is crucial to ensure that bound variables do not occur applied or within λ-expressions. The definition of green subterms is devised such that green subterms correspond to first-order subterms via the encoding F, with the exception of first-order subterms below quantifiers. The encoding F is extended to clauses by mapping each literal and each side of a literal individually. From the entailment relation |= for the ground first-order logic, we derive an entailment relation |=_F. This relation is weaker than standard higher-order entailment; for example, {f ≈ g} ⊭_F {f a ≈ g a} (because of the subscripts added by F) and {p (λx.
For first-order superposition, an inference is considered redundant if for each of its ground instances, a premise is redundant or the conclusion is entailed by clauses smaller than the main premise. For most inference rules, our definition follows this idea, using |=_F for entailment; other rules need nonstandard notions of ground instances and redundancy. The definition of inference redundancy presented below is simpler than the more sophisticated notion in our technical report. Nonetheless, the redundant inferences below are a strict subset of the redundant inferences of our report and thus completeness also holds using the notion below. For the few prover optimizations based on inference redundancy that we know about (e.g., simultaneous superposition [4]), the following criterion suffices.
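The effect of the encoding F on symbol applications and λ-expressions can be sketched in Python as follows. The tuple-based term representation and the names `show` and `encode` are hypothetical, type arguments are omitted, and quantifiers are not handled; the sketch only illustrates how the number of arguments becomes part of the first-order symbol and how a λ-expression becomes a single opaque symbol.

```python
def show(t) -> str:
    """Print a ground higher-order term (hypothetical tuple representation)."""
    if t[0] == "lam":                       # ("lam", x, body)
        return f"λ{t[1]}. {show(t[2])}"
    if t[0] == "bvar":                      # ("bvar", x), only inside λ-bodies
        return t[1]
    _, name, args = t                       # ("sym", f, (s1, ..., sj))
    if not args:
        return name
    return "(" + " ".join([name] + [show(a) for a in args]) + ")"


def encode(t):
    """Sketch of F: returns a first-order term (symbol, encoded arguments)."""
    if t[0] == "lam":
        # The whole λ-expression becomes one first-order constant.
        return (f"lam[{show(t)}]", ())
    if t[0] == "sym":
        _, name, args = t
        # The subscript j records how many arguments f is applied to.
        return (f"{name}_{len(args)}", tuple(encode(a) for a in args))
    raise ValueError("unsupported term in this sketch")


f = ("sym", "f", ())
a = ("sym", "a", ())
f_a = ("sym", "f", (a,))
unapplied = encode(f)      # the symbol f with subscript 0
applied = encode(f_a)      # the symbol f with subscript 1, applied to a
```

Since the unapplied f and the applied f are encoded with different first-order symbols, an equation between unapplied functions says nothing, on the first-order level, about their applications. This matches the remark that the entailment relation derived from F is weaker than higher-order entailment.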
For SUP, ERES, EFACT, BOOLHOIST, FALSEELIM, EQHOIST, NEQHOIST, and BOOLRW, we define ground instances as usual: Ground instances are all inferences obtained by applying a grounding substitution to premises and conclusion such that the result adheres to the conditions of the given rule w.r.t. selection functions that select literals and subterms as in the original premise. For FLUIDSUP and FLUIDBOOLHOIST, we define ground instances in the same way except that we require that ground instances adhere to the conditions of SUP or BOOLHOIST, respectively. For FORALLRW, EXISTSRW, FORALLHOIST, EXISTSHOIST, which do not have ground instances in the sense above, we define a ground instance as any inference that is obtained by applying the unifier σ to the premise and then applying a grounding substitution to premise and conclusion, regardless of whether the resulting inference is an inference of our calculus.
For all rules except FLUIDLOOBHOIST and ARGCONG, we define an inference to be redundant w.r.t. a clause set N if for each ground instance ι, a premise of ι is redundant w.r.t. G(N) or the conclusion of ι is entailed w.r.t. |=_F by clauses from G(N) that are smaller than the main (i.e., rightmost) premise of ι. For the rules FLUIDLOOBHOIST and ARGCONG, as well as axioms (EXT) and (CHOICE), viewed as premiseless inferences, we define an inference to be redundant w.r.t. a clause set N if all ground instances of its conclusion are contained in G(N) or redundant w.r.t. G(N).
We denote the set of redundant inferences w.r.t. N by Red_I(N).
Simplification Rules. Our redundancy criterion is strong enough to support counterparts of most simplification rules implemented in Schulz's first-order E [25, Sect. 2.3.1 and 2.3.2]. Deletion of duplicated literals, deletion of resolved literals, syntactic tautology deletion, negative simplify-reflect, and clause subsumption adhere to our redundancy criterion. Positive simplify-reflect, equality subsumption, and rewriting (demodulation) of positive and negative literals are supported if they are applied on green subterms or on other subterms that are encoded into first-order subterms by G and F. Semantic tautology deletion can be applied as well, using |=_F; moreover, for positive literals, the rewriting clause must be smaller than the rewritten clause.
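Three of the simplifications named above (deletion of duplicated literals, deletion of resolved literals, and syntactic tautology deletion) can be sketched on a toy clause representation. The sketch assumes the usual first-order reading of these rules: a resolved literal has the form s ≉ s, and a syntactic tautology contains s ≈ s or a complementary pair s ≈ t, s ≉ t. The representation of literals as triples (s, t, positive) with strings for terms and the function names are hypothetical.

```python
def normalize(lit):
    """Literals are unordered pairs, so sort the two sides."""
    s, t, positive = lit
    left, right = sorted((s, t))
    return (left, right, positive)


def simplify(clause):
    """Return the simplified clause, or None if the clause is a tautology."""
    lits = []
    for lit in map(normalize, clause):
        s, t, positive = lit
        if not positive and s == t:
            continue                      # deletion of resolved literals
        if lit not in lits:
            lits.append(lit)              # deletion of duplicated literals
    for s, t, positive in lits:
        if positive and (s == t or (s, t, False) in lits):
            return None                   # syntactic tautology deletion
    return lits


# a ≈ b ∨ b ≈ a ∨ c ≉ c simplifies to a ≈ b
simplified = simplify([("a", "b", True), ("b", "a", True), ("c", "c", False)])
```

A real prover works on shared term structures and indexes rather than lists, but the logical content of these three rules is the same.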
Under some circumstances, inference rules can be applied as simplifications.The FALSEELIM and BOOLRW rules can be applied as a simplification if σ is the identity.
FORALLHOIST and FORALLRW can both be applied and, together, serve as one simplification rule.The same holds for EXISTSHOIST and EXISTSRW if the head of For all of these rules, the eligibility conditions can be ignored.
Clausification. Like oSup, our calculus does not require the input problem to be clausified during the preprocessing, and it supports higher-order analogues of the three inprocessing clausification methods introduced by Nummelin et al. Inner delayed clausification relies on our core calculus rules to destruct logical symbols. Outer delayed clausification adds the following clausification rules to the calculus: The double bars identify simplification rules (i.e., the conclusions make the premise redundant and can replace it). The first two rules require that s has a logical symbol as its head, whereas the last two require that s and t are Boolean terms other than ⊤ and ⊥. The function oc distributes the logical symbols over the clause C, e.g., oc(s It is easy to check that our redundancy criterion allows us to replace the premise of the OUTERCLAUS rules with their conclusion. Nonetheless, we apply EQOUTERCLAUS and NEQOUTERCLAUS as inferences because the premises might be useful in their original form.
Besides the two delayed clausification methods, a third inprocessing clausification method is immediate clausification.This clausifies the input problem's outer Boolean structure in one swoop, resulting in a set of higher-order clauses.If unclausified Boolean terms rise to the top during saturation, the same algorithm is run to clausify them.
Unlike delayed clausification, immediate clausification is a black box and is unaware of the proof state other than the Boolean term it is applied to.Delayed clausification, on the other hand, clausifies the term step by step, allowing us to interleave clausification with the strong simplification machinery of superposition provers.It is especially powerful in higher-order contexts: Examples such as y p q ≈ (p ) can be refuted directly by equality resolution, rather than via more explosive rules on the clausified form.

Refutational Completeness
Our calculus is dynamically refutationally complete for problems in Q≈-normal form. The full proof can be found in our technical report [8].
Theorem 3 (Dynamic refutational completeness). Let (N_i)_i be a derivation such that N_0 is Q≈-normal and N_0 |= ⊥. Moreover, assume that (N_i)_i is fair, i.e., all inferences from clauses in the limit inferior ⋃_i ⋂_{j≥i} N_j are contained in ⋃_i Red_I(N_i). Then we have ⊥ ∈ N_i for some i.
Following the completeness proof of λSup, our proof is structured in three levels of logics. For each, we define a calculus and show that it is refutationally complete: ground monomorphic first-order logic with an interpreted Boolean type (GF); the Q≈-normal ground fragment of higher-order logic (GH); and higher-order logic (H).
The logic of the GF level is the ground fragment of oSup's logic. The GF calculus is a ground version of oSup, which Nummelin et al. showed refutationally complete. It consists of ground first-order equivalents of our rules, excluding ARGCONG, FLUIDBOOLHOIST, and FLUIDLOOBHOIST, which are specific to higher-order logic. The counterparts to FORALLHOIST and EXISTSHOIST enumerate ground terms instead of producing free variables, to stay within the ground fragment. For compatibility with the nonground level, the conclusions of FORALLRW and EXISTSRW cannot contain concrete Skolem functions. Instead, the GF calculus is parameterized by a witness function that can assign an arbitrary term to each occurrence of a quantifier in a clause. This witness function is used to retrieve the Skolem terms in the GF equivalents of FORALLRW and EXISTSRW.
On the next level, the GH calculus includes inference rules isomorphic to the GF rules, transferred to higher-order logic via F^{-1}. Moreover, it contains an ARGCONG variant that enumerates ground terms instead of introducing fresh variables, as well as rules enumerating ground instances of axioms (EXT) and (CHOICE). We prove refutational completeness of the GH calculus by constructing a higher-order interpretation based on the model constructed for the completeness proof of the GF level. This proof step is analogous to the corresponding step in λSup's proof, but we must also consider Q≈-normality and the logical symbols.
To lift completeness to the H level, we use the saturation framework of Waldmann et al. [31]. The main proof obligation it leaves us to show is that nonredundant GH inferences can be lifted to corresponding nonground H inferences. For this lifting, we must choose a suitable GH witness function and appropriate GH selection functions for literals and Boolean subterms, given a saturated clause set at the H level and the H selection functions. Then the saturation framework guarantees static refutational completeness w.r.t. Herbrand entailment, which is the entailment relation induced by the grounding function G. We then show that this implies dynamic refutational completeness w.r.t.

Implementation
We implemented our calculus in the Zipperposition prover [14], whose OCaml source code makes it convenient to prototype calculus extensions. Except for the presence of axioms (EXT) and (CHOICE), the new code gracefully extends Zipperposition's implementation of oSup in the sense that oλSup coincides with oSup on first-order problems. The same cannot be said w.r.t. λSup on Boolean-free problems because of the FLUIDBOOLHOIST and FLUIDLOOBHOIST rules, which are triggered by any applied variable. From the implementation of λSup, we inherit the given clause procedure, which supports infinitely branching inferences, as well as calculus extensions and heuristics [28]. From the implementation of oSup, we inherit the simplification rule BOOLSIMP, a mainstay of our Boolean simplification machinery.
As in the implementation of λSup, we approximate fluid terms as terms that are either nonground λ-expressions or terms of the form x s̄_n with n > 0. Two slight, accidental discrepancies are that we also count variable occurrences below quantifiers as deep and perform EFACT inferences even if the maximal literal is selected. Since we expect FLUIDBOOLHOIST and FLUIDLOOBHOIST to be highly explosive, we penalize them and all of their offspring. In addition to various λSup extensions [6, Sect. 5], we also use all the rules for Boolean reasoning described by Vukmirović and Nummelin [30] except for the BOOLEF rules.
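The approximation of fluid terms described above can be sketched as a simple syntactic check. The sketch is written in Python over a hypothetical tuple representation of terms and is not taken from Zipperposition's OCaml code; the function names are assumptions made for this example.

```python
# Hypothetical term representation:
#   ("var", name, args)   free variable, possibly applied
#   ("bvar", name)        variable bound by an enclosing λ
#   ("sym", name, args)   symbol applied to arguments
#   ("lam", name, body)   λ-expression


def is_ground(t) -> bool:
    """A term is ground if it contains no free variables."""
    kind = t[0]
    if kind == "var":
        return False
    if kind == "bvar":
        return True
    if kind == "lam":
        return is_ground(t[2])
    return all(is_ground(arg) for arg in t[2])   # "sym"


def is_fluid_approx(t) -> bool:
    """Approximation: nonground λ-expression, or applied free variable."""
    if t[0] == "lam":
        return not is_ground(t)
    return t[0] == "var" and len(t[2]) > 0


a = ("sym", "a", ())
applied_var = ("var", "y", (a,))                                   # y a
ground_lam = ("lam", "x", ("sym", "f", (("bvar", "x"),)))          # λx. f x
nonground_lam = ("lam", "x", ("var", "y", (("bvar", "x"),)))       # λx. y x
```

Under this check, y a and λx. y x count as fluid, whereas a bare variable, the ground λ-expression λx. f x, and a symbol-headed term such as f (y a) do not.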

Evaluation
We evaluate the calculus implementation in Zipperposition and compare it with other higher-order provers. Our experiments were performed on StarExec Miami servers equipped with Intel Xeon E5-2620 v4 CPUs clocked at 2.10 GHz. We used all 2606 TH0 theorems from the TPTP 7.3.0 library [27] and 1253 "Judgment Day" problems [12] generated using Sledgehammer (SH) [24] as our benchmark set. An archive containing the benchmarks and the raw evaluation results is publicly available [5].
Calculus Evaluation.In this first part, we evaluate selected parameters of Zipperposition by varying only the studied parameter in a fixed well-performing configuration.This base configuration disables axioms (CHOICE) and (EXT) and the FLUID-rules.It uses the unification procedure of Vukmirović et al. [29] in its complete variant-i.e., the variant that produces a complete set of unifiers.It uses none of the early Boolean rules described by Vukmirović and Nummelin [30].The preprocessor All of the completeness-preserving simplification rules listed in Sect. 3 are enabled.The configuration uses immediate clausification.We set the CPU time limit to 30 s in all three experiments.
In the first experiment, we assess the overhead incurred by the FLUID-rules.These rules unify with a term whose head is a fresh variable.Thus, we expected that they needed to be tightly controlled to achieve good performance.To test our hypothesis, we simultaneously modified the parameters of these three rules.In Figure 1, the off mode simply disables the rules, the pragmatic mode uses a terminating incomplete unification algorithm (the pragmatic variant of Vukmirović et al. [29]), and the complete mode uses a complete unification algorithm.The results show that disabling FLUIDrules altogether achieves the best performance.However, on TPTP problems, complete finds 35 proofs not found by off, and pragmatic finds 22 proofs not found by off.On Sledgehammer benchmarks, this effect is much weaker, likely because the Sledgehammer benchmarks require less higher-order reasoning: complete finds only one new proof over off, and pragmatic finds only four.
In the second experiment, we explore the clausification methods introduced at the end of Sect. 3: inner delayed clausification, outer delayed clausification, and immediate clausification. The modes inner and outer employ oSup's RENAME rule, which renames Boolean terms headed by logical symbols using a Tseitin-like transformation if they occur at least four times in the proof state. Vukmirović and Nummelin [30] observed that outer clausification can greatly help prove higher-order problems, and we expected it to perform well for our calculus, too. The results, shown in Figure 2, confirm our hypothesis: The outer mode outperforms immediate on both TPTP and Sledgehammer benchmarks. The inner mode performs worst, but on Sledgehammer benchmarks, it proves 17 problems beyond the reach of the other two. Interestingly, several of these problems contain axioms of the form φ → ψ, and applying superposition and demodulation to these axioms is preferable to clausifying them.
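The occurrence threshold that triggers renaming can be illustrated with a small Python sketch. It is a simplification under stated assumptions: terms are encoded as nested tuples `(head, arg1, ...)` with strings as atoms, the set `LOGICAL_HEADS` and the fresh names `def0`, `def1`, ... are hypothetical, and the sketch only counts and replaces subterms. It does not attempt to reproduce oSup's actual RENAME rule, which operates on clauses.

```python
from collections import Counter

# Assumed names for the logical symbols; purely illustrative.
LOGICAL_HEADS = {"and", "or", "not", "implies", "iff"}
RENAME_THRESHOLD = 4  # "at least four times in the proof state"


def subterms(t):
    """Yield t and all of its subterms (tuples are applications)."""
    yield t
    if isinstance(t, tuple):
        for arg in t[1:]:
            yield from subterms(arg)


def rename_candidates(proof_state):
    """Count logically headed subterms and keep those above the threshold."""
    counts = Counter(
        s
        for formula in proof_state
        for s in subterms(formula)
        if isinstance(s, tuple) and s[0] in LOGICAL_HEADS
    )
    return {t: n for t, n in counts.items() if n >= RENAME_THRESHOLD}


def rename(proof_state):
    """Replace frequent logically headed subterms by fresh names.

    Returns the rewritten formulas and a map from each fresh name
    to the term it abbreviates.
    """
    candidates = sorted(rename_candidates(proof_state), key=repr)
    names = {t: f"def{i}" for i, t in enumerate(candidates)}

    def replace(t):
        if t in names:
            return names[t]
        if isinstance(t, tuple):
            return (t[0],) + tuple(replace(a) for a in t[1:])
        return t

    return [replace(f) for f in proof_state], {n: t for t, n in names.items()}
```

For instance, if the subterm `("and", "p", "q")` occurs in four formulas of the proof state, each occurrence is replaced by `def0`, and the returned map records that `def0` stands for `("and", "p", "q")`.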
In the third experiment, we investigate the effect of axiom (CHOICE), which is necessary to achieve refutational completeness. To evaluate (CHOICE), we either disabled it in a configuration labeled off or set the axiom's penalty p to different values. In Zipperposition, penalties are propagated through inference and simplification rules and are used to increase the heuristic weight of clauses, postponing the selection of penalized clauses. The results are shown in Figure 3. As expected, disabling (CHOICE), or at least penalizing it heavily, improves performance. Yet enabling (CHOICE) can be crucial: For 19 TPTP problems, the proofs are found when (CHOICE) is enabled and p = 4, but not when the rule is disabled. On Sledgehammer problems, this effect is weaker, with only two new problems proved for p = 4.
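The effect of penalties on clause selection can be sketched in Python as follows. The text does not specify how Zipperposition combines penalties with weights, so both formulas in the sketch are assumptions made for illustration: a conclusion inherits the largest penalty among its premises plus an optional rule-specific penalty, and the selection priority is the base weight multiplied by the penalty. The names `Clause`, `infer`, and `PassiveSet` are hypothetical.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Clause:
    name: str
    base_weight: int
    penalty: int = 1


def infer(name, base_weight, parents, rule_penalty=0):
    """Build a conclusion whose penalty is propagated from its premises.

    Assumed propagation scheme: largest premise penalty plus the
    penalty of the rule that produced the conclusion.
    """
    penalty = max(p.penalty for p in parents) + rule_penalty
    return Clause(name, base_weight, penalty)


class PassiveSet:
    """Priority queue that selects the clause with the smallest weight."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # insertion order breaks ties

    def add(self, clause):
        # Assumed heuristic weight: base weight scaled by the penalty.
        priority = clause.base_weight * clause.penalty
        heapq.heappush(self._heap, (priority, next(self._tie), clause))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

With an axiom of base weight 10 and penalty p = 4, an unpenalized clause of base weight 12, and a conclusion of base weight 8 derived from both, the sketch selects the unpenalized clause first, then the conclusion, and the penalized axiom last, which is the postponing behavior described above.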
Prover Comparison. In this second part, we compare Zipperposition's performance with other higher-order provers. Like at CASC-J10, the wall-clock time limit was 120 s, the CPU time limit was 960 s, and the provers were run on StarExec Miami. We used the following versions of all systems that took part in the THF division: CVC4 1.8 [3], Leo-III 1.5.2 [26], Satallax 3.5 [13], and Vampire 4.5 [11]. The developers of Vampire have informed us that its higher-order schedule is optimized for running on a single core. As a result, the prover suffers some degradation of performance when running on multiple cores. We evaluate both the version of Zipperposition that took part in CASC-J10 (Zip) and the updated version of Zipperposition that supports our new calculus (New Zip). Zip's portfolio of prover configurations is based on λSup and techniques described by Vukmirović and Nummelin [30]. New Zip's portfolio is specially designed for our new calculus and optimized for TPTP problems. To assess the performance of Boolean reasoning, we used Sledgehammer benchmarks generated both with native Booleans (SH) and with an encoding into Boolean-free higher-order logic (ofSH). For technical reasons, the encoding also performs λ-lifting, but this minor transformation should have little impact on results [6, Sect. 7].
The results are shown in Figure 4. The two versions of Zipperposition are ahead of all other provers on both benchmark sets. This shows that, with thorough parameter tuning, higher-order superposition outperforms tableaux, which had been the state of the art in higher-order reasoning for a decade. New Zip beats Zip on TPTP problems but lags behind Zip on Sledgehammer benchmarks, as we have yet to explore more general heuristics that work well with our new calculus. The Sledgehammer benchmarks fail to demonstrate the superiority of native Boolean reasoning compared with an encoding, and in fact CVC4 and Leo-III perform dramatically better on the encoded Boolean problems, suggesting that there is room for tuning.

Conclusion
We have created a superposition calculus for higher-order logic that is refutationally complete. Most of the key ideas have been developed in previous work by us and colleagues, but combining them in the right way has been challenging. A key idea was to Unlike earlier refutationally complete calculi for full higher-order logic based on resolution or paramodulation, our calculus employs a term order, which restricts the proof search, and a redundancy criterion, which can be used to add various simplification rules while keeping refutational completeness. These two mechanisms are undoubtedly major factors in the success of first-order superposition, and it is very fortunate that we could incorporate both in a higher-order calculus. An alternative calculus with the same two mechanisms could be achieved by combining oSup with Bhayat and Reger's combinatory superposition [10]. The article on λSup [6, Sect. 8] discusses related work in more detail.
The evaluation results show that our calculus is an excellent basis for higher-order theorem proving. In future work, we want to experiment further with the different parameters of the calculus (for example, with Boolean subterm selection heuristics) and implement it in a state-of-the-art prover such as E.
to the literal created by FLUIDBOOLHOIST, effectively performing a primitive substitution. The resulting clause can superpose into a ≈ b with the unifier {x ↦ λu. u} ∈ CSU(x b, b).
because of the lam symbols used by $\mathcal{F}$). Using $\models_{\mathcal{F}}$, we define a clause $C$ to be redundant w.r.t. a clause set $N$ if for every $D \in \mathcal{G}(C)$, we have $\{E \in \mathcal{G}(N) \mid E \prec D\} \models_{\mathcal{F}} D$ or there exists a clause $C' \in N$ such that $C \sqsupset C'$ and $D \in \mathcal{G}(C')$. The tiebreaker $\sqsupset$ can be an arbitrary well-founded partial order on clauses; in practice, we use a well-founded restriction of the ill-founded strict subsumption relation [6, Sect. 3.4]. We denote the set of redundant clauses w.r.t. a clause set $N$ by $\mathit{Red}_{\mathrm{C}}(N)$. Note that $\models_{\mathcal{F}}$ is weak enough to ensure that the ARGCONG inference rule and axiom (EXT) are not immediately redundant and can fulfill their purpose.