1 Introduction

While there are many automated theorem provers capable of proving theorems involving very large formulas and many lemmas, very few of them have formalized proofs of metatheoretical properties such as soundness and completeness. This leads to issues of trust: how do we know that the answers returned by automated theorem provers are actually correct? And do we know that our automated theorem provers will actually be able to prove what we want them to? Even those provers that can generate proof certificates to support their answers may not always be trustworthy, since some proof techniques lead to proofs that are very difficult for a human to follow, and thus difficult to check manually for correctness. Proof certificates can also be mechanically checked, but doing so means trusting another piece of software to be correct.

Formalizing the soundness and completeness of a prover provides two crucial benefits. With a soundness result, we know that the prover will not erroneously accept an invalid formula and output a wrong proof of the formula. Advanced features and optimizations thus cannot cause unforeseen flaws in the prover, since their correctness is formally proven. Completeness of the prover is especially useful in combination with the possibility of generating readable proof certificates. With formalized completeness, we can use the prover as a tool to generate step-by-step proofs of any valid formula, and the prover can thus also be used to gain understanding, e.g. by students trying to understand why a counter-intuitive formula is valid. While there are some systems with formalized metatheories, they rarely include executable provers, often cannot generate proof certificates, and are often quite limited in their expressive power (cf. Sect. 1.1).

In this paper, we present an automated theorem prover for first-order logic with functions based on sequent calculus. We formalize its soundness and completeness in Isabelle/HOL. We reuse the syntax and semantics of first-order logic from the Sequent Calculus Verifier (SeCaV) system [15] (Sect. 2.1). We state the soundness and completeness of the prover with respect to the SeCaV proof system, its semantics and a bounded semantics that we introduce here. The prover can generate human-readable and machine-verifiable SeCaV proofs for valid formulas.

Our formalization instantiates an abstract framework of coinductive proof trees by Blanchette et al. [11] (Sect. 2.2). When we instantiate the framework with concrete functions implementing our sequent calculus, it builds a prover for us (Sect. 3). When we discharge further proof obligations, the framework proves that any proof tree built by our prover is either finite or contains an infinite path with certain properties. We then build either a SeCaV proof from the finite tree (Sect. 4) or a countermodel from the failed proof attempt (Sect. 5). As far as we are aware, we are the first to use the framework to prove soundness and completeness of an executable prover (as opposed to simply a calculus).

Our prover is deterministic and fair and works on finite sequents. To handle the quantifiers, we must thus build our countermodel in a Herbrand universe that contains only the subset of terms that actually appear in the failed proof. This idea is inspired by Ben-Ari’s textbook proof [2], where terms are either variables or constants, and by Ridge and Margetson’s Isabelle proof [41], where only variables are considered. We are not aware of any previous formalization of this construction that handles functions. We consider all terms in our Herbrand universe, including those with free variables, yielding completeness for both open and closed formulas.

The prover is free software and the source code is available as supplementary material. This consists of around 3000 lines of Isabelle/HOL and 1300 lines of supplementary Haskell. The supplementary Haskell code is not involved in the proof procedure, but handles parsing of the input formula and conversion of proofs to both human-readable and machine-verifiable formats, and can in some cases shorten proofs after they have been found by merging proof steps. The user is thus only required to trust the Haskell code if the generated proof certificates are used without verifying them.

We summarize our main contributions:

  • A formally verified sound and complete automated theorem prover for full first-order logic with functions.

  • An analytic proof of completeness for both open and closed formulas for a deterministic prover via a bounded semantics.

  • A method of translating the prover-generated certificates of validity into human-readable and machine-verifiable proofs in SeCaV.

  • A concrete application of the abstract completeness framework and a demonstration of how to obtain soundness and completeness of an executable prover using the framework as a starting point.

We summarize the results and discuss the generated proofs, the challenges encountered during the verification, the prover limitations, and future work in Sect. 6, before concluding in Sect. 7.

This paper is a revised and extended version of a paper presented at ITP 2022 [17] with a number of improvements. We give a more thorough explanation of the SeCaV syntax, including the formalized definitions of substitution and all of the rules of its proof system. This has allowed us to give full listings throughout, where the previous paper contained only abridged definitions. Moreover, we have included and explained the relevant Isabelle code from the abstract completeness framework, elucidating the actual definitions underlying the prover. We also give a more thorough explanation of the proof search procedure, including previously absent details of the procedures for finding terms in a sequent and determining whether a sequent is an axiom. Using the added detail in the definitions, we explain additional technical and conceptual details in several proofs. This improvement to the presentation includes more explanation of Isabelle syntax, locales and library functionality throughout the paper. Finally, we have extended our discussion of possible optimizations and post-processing of proofs and included an additional example proof.

1.1 Related Work

The present paper is a much improved version of the work started in the second author’s Master’s thesis [22]. The Sequent Calculus Verifier (SeCaV) is an existing proof system, and both soundness and completeness have been proven for the system [18]. The system has been used to teach students in several courses at the Technical University of Denmark [19, 52]. An online tool called the SeCaV Unshortener has been developed to allow input of proofs in a simple format, which is then translated to an Isabelle proof [15].

Our prover is based on the abstract completeness framework by Blanchette et al. [7, 11]. The framework contains a simple example prover for propositional logic, and the original application of the framework was a technical result (see [7]) in the formalization of the metatheory of the Sledgehammer tool for automated theorem proving within Isabelle/HOL [8]. Blanchette et al. [11] have used the framework to formalize soundness and completeness of a calculus for first-order logic with equality and in negation normal form. Their search is nondeterministic and they do not generate an executable prover like we do. We thus extend their work by using the framework to prove soundness and completeness of an executable prover.

A number of other systems have formally verified metatheories. NaDeA (Natural Deduction Assistant) by Villadsen et al. [51] is a web application that allows users to prove formulas with natural deduction. The metatheory of a model of the system is formalized in Isabelle/HOL, and the application allows export of proofs for verification in Isabelle. The Incredible Proof Machine by Breitner [12] is a web application that allows users to create proofs using a specialized graphical interface. The proof system is as strong as natural deduction, and a model of the system is formalized in Isabelle using the abstract framework by Blanchette et al. [11]. Neither system includes automated theorem provers; they are essentially simple proof assistants designed to aid students in understanding logical systems.

THINKER by Pelletier [38] is a proof system and an attached automated theorem prover. THINKER is a natural deduction system designed to allow for what the author calls “direct proofs”, as opposed to proofs based on reduction to a resolution system. THINKER was perhaps the first automated theorem prover designed specifically with “naturality” in mind, as a reaction to the indirectness of resolution-based proof systems. MUSCADET by Pastre [37] is also an automated theorem prover based on natural deduction. The system distinguishes itself by also supporting usage of prior knowledge such as previously proven theorems through a Prolog knowledge base.

While there are many very advanced automated theorem provers such as Vampire [26], Zipperposition [3] and Z3 [13], their metatheory and implementations are rarely formalized. As a first step towards formally verifying modern provers, Schlichtkrull et al. [43] have formalized an ordered resolution prover for clausal first-order logic in Isabelle/HOL. Jensen et al. [23] formalized the soundness, but not the completeness, of a prover for first-order logic with equality in Isabelle/HOL. Villadsen et al. [53] verified a simple prover for first-order logic in Isabelle/HOL with the aim of allowing students to understand both the prover and the formalization. That work was based on an earlier formalization by Ridge and Margetson [41], but it simplified both the prover and the proofs to enable easier understanding by students. Neither of these two provers provides support for functions or generation of proof certificates.

Blanchette [9] gives an overview of a number of verification efforts including the metatheory of SAT solvers [10, 14, 33, 34, 46] and certificate checkers [28, 29], SMT solvers [30, 32, 48], the superposition calculus [39], resolution [40, 42, 45], a number of non-classical logics [20, 21, 44, 50, 54] and a wide range of proof systems for classical propositional logic [35, 36]. Some of these efforts are part of the IsaFoL project (Isabelle Formalization of Logic). Part of the goal is to develop “a methodology for formalizing modern research in automated reasoning”. Our work points in this direction too, by formally verifying a non-saturation-based prover.

2 Background

In this section, we briefly introduce the two existing projects we build on: the Sequent Calculus Verifier (SeCaV) system and the abstract framework by Blanchette et al. [11]. We have not modified these projects in any way for our use, and their designs thus significantly influence the design of our prover.

2.1 The Sequent Calculus Verifier

The SeCaV system is a one-sided sequent calculus for first-order logic with functions. Constants are encoded as functions with arity 0. Figure 1 gives the syntax of terms and formulas (denoted \( p, q, \dots \)) as Isabelle/HOL datatypes. Constructor names like Fun for function symbols and Var for variables are capitalized. Parameterized datatypes are written in postfix notation in Isabelle, e.g. the type tm list of lists containing terms. The system uses de Bruijn indices to identify variables, while functions and predicates are named by natural numbers. Besides predicates, the system includes implication, disjunction, conjunction, existential quantification, universal quantification and negation (in that order in Fig. 1). Predicates and functions take their arguments as ordered lists of terms, which may be empty. Sequents (denoted \( y, z, \dots \)) are ordered lists of formulas.

Fig. 1 The syntax of Sequent Calculus Verifier terms and formulas (parentheses added for clarity)
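Since the figure is not reproduced here, the following Haskell sketch approximates the datatypes of Fig. 1. The constructor names Fun and Var are taken from the paper; Pre for predicates and the remaining constructor names are assumptions based on the rule names in Fig. 3.

  -- Approximate Haskell rendering of the SeCaV syntax (a sketch, not the
  -- formalization itself)
  type Id = Int            -- functions and predicates are named by natural numbers

  data Tm = Fun Id [Tm]    -- function applied to a list of terms (constants have arity 0)
          | Var Int        -- variable as a de Bruijn index
    deriving (Eq, Show)

  data Fm = Pre Id [Tm]    -- predicate applied to a list of terms
          | Imp Fm Fm
          | Dis Fm Fm
          | Con Fm Fm
          | Exi Fm
          | Uni Fm
          | Neg Fm
    deriving (Eq, Show)

  type Sequent = [Fm]      -- sequents are ordered lists of formulas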

Fig. 2 The semantics of the Sequent Calculus Verifier (# separates head and tail of a list)

The semantics of a formula is due to Berghofer [4], who models the universe as a type variable, and we do the same for now (we will revisit this choice in Sect. 5.3.1). Besides this implicit universe, the interpretation consists of an environment e for variables, a function denotation f and a predicate denotation g. The semantics of the system is standard and defined using the three recursive functions in Fig. 2. The function semantics-term evaluates a term to a member of the universe under an environment and a function denotation, while semantics-list does this for a list of terms. The semantics function itself defines the meaning of the object-logical connectives using the connectives from the meta-logic in Isabelle/HOL. The shift function handles shifting de Bruijn indices when interpreting quantifiers, such that the environment maps 0 to the meta-quantified element of the domain. We say that a sequent is valid when, under all interpretations, some formula in the sequent is satisfied.
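As a rough guide to Fig. 2, the term semantics and the shift function can be sketched in Haskell as follows, building on the datatypes above; the exact argument order of shift is an assumption. The formula semantics itself maps each connective to the corresponding meta-connective and lets the quantifiers range over the whole universe type, which is why it is not directly executable; the bounded variant of Sect. 5.3.1 replaces the type by an explicit set.

  -- Term semantics under an environment e and function denotation f
  semanticsTerm :: (Int -> a) -> (Id -> [a] -> a) -> Tm -> a
  semanticsTerm e _ (Var n)    = e n
  semanticsTerm e f (Fun i ts) = f i (semanticsList e f ts)

  semanticsList :: (Int -> a) -> (Id -> [a] -> a) -> [Tm] -> [a]
  semanticsList e f = map (semanticsTerm e f)

  -- shift e v x: map index v to x and shift the remaining indices; used at
  -- v = 0 when a quantifier binds a new element of the domain
  shift :: (Int -> a) -> Int -> a -> Int -> a
  shift e v x n
    | n < v     = e n
    | n == v    = x
    | otherwise = e (n - 1)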

Fig. 3 Proof rules for the Sequent Calculus Verifier

The system has a number of proof rules, which are displayed in Fig. 3 (abusing set notation for the membership and inclusion relations on lists—see the formalization for their encodings as primitive recursive functions). The rules should be read from the bottom up, since we generally work backwards from the sequent we wish to prove. The rules are classified according to Smullyan’s uniform notation [47].

The first proof rule, Basic, terminates the branch and applies when the sequent contains both a formula and its negation. Isabelle/HOL allows pattern matching only on the head of a list, so to simplify the specification of this rule, the positive formula must come first.

The structural Ext rule can be applied to change the position of formulas in a sequent (permutation), duplicate an existing formula (contraction) and remove formulas that are not needed (weakening). This rule is crucial, since most rules in the system work only on the first formula in a sequent. Duplicating a formula is necessary if a quantified formula needs to be instantiated several times, since \( \gamma \)-rules (starting with Gamma) destroy the original formula.

The NegNeg rule removes a double negation from the first formula in a sequent. It can be considered an \(\alpha \)-rule, but we keep it separate from the others because it does not generate two formulas. The AlphaDis rule decomposes disjunctions (and similar for the AlphaImp and AlphaCon rules). The BetaDis rule decomposes negated disjunctions and requires that two sequents are proven separately, creating branches in the proof tree (and similar for the BetaCon and BetaImp rules). This essentially moves the connective into the proof tree itself, since both branches now need to be proven separately. The GammaExi rule instantiates an existential quantifier with any term \( t \) by substituting \( t \) for variable 0 in the quantified formula. Due to the use of de Bruijn indices, the definition of substitution is quite complicated. The definition of the substitution function sub can be seen in Fig. 4. The notation \(p\,[\textrm{Var}~0/t]\) should be interpreted as the function application \(\textrm{sub}~0~t~p\). The GammaUni rule is similar. The DeltaExi rule instantiates a negated existential quantifier in the first formula in a sequent with a fresh constant function, with fresh here meaning that the function identifier does not already occur anywhere in the sequent. The fresh constant cannot have any relationship to other terms in the sequent: it is arbitrary. Thus, we could have used any other term without affecting the validity of the formula, which is exactly what is needed to prove a universally quantified (“there does not exist”) formula. The freshness condition can easily be checked by recursively going through the formulas in the sequent (see the formalization for the primitive recursive functions implementing this). The DeltaUni rule is similar.

Fig. 4 The definition of substitution used in the Sequent Calculus Verifier
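For readers without Fig. 4 at hand, a de Bruijn substitution of this kind can be sketched in Haskell as follows; the helper incTm, which lifts the indices of the substituted term when passing under a binder, and the details of index adjustment are assumptions in the spirit of the figure, not the formalized definition.

  -- Increment all de Bruijn indices in a term (used under binders)
  incTm :: Tm -> Tm
  incTm (Var n)    = Var (n + 1)
  incTm (Fun i ts) = Fun i (map incTm ts)

  -- Substitute s for index v in a term; indices above v drop by one since
  -- the binder disappears
  subTm :: Int -> Tm -> Tm -> Tm
  subTm v s (Var n)
    | n < v     = Var n
    | n == v    = s
    | otherwise = Var (n - 1)
  subTm v s (Fun i ts) = Fun i (map (subTm v s) ts)

  -- sub v s p: substitute s for index v in the formula p
  sub :: Int -> Tm -> Fm -> Fm
  sub v s (Pre i ts) = Pre i (map (subTm v s) ts)
  sub v s (Imp p q)  = Imp (sub v s p) (sub v s q)
  sub v s (Dis p q)  = Dis (sub v s p) (sub v s q)
  sub v s (Con p q)  = Con (sub v s p) (sub v s q)
  sub v s (Exi p)    = Exi (sub (v + 1) (incTm s) p)
  sub v s (Uni p)    = Uni (sub (v + 1) (incTm s) p)
  sub v s (Neg p)    = Neg (sub v s p)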

The proof system in Fig. 3 has been formally verified to be sound and complete with respect to the semantics in Fig. 2 by From et al. [18]. We use these results to relate our prover to SeCaV.

2.2 Abstract Frameworks for Soundness and Completeness

Blanchette et al. [11] have formalized an abstract framework to facilitate soundness and completeness proofs by coinductive methods. In particular, they give abstract definitions that can be instantiated to a concrete sequent calculus or tableau prover. They facilitate proofs in the Beth-Hintikka style: the search “builds either a finite deduction tree yielding a proof ... or an infinite tree from which a countermodel ... can be extracted.” The framework consists of a number of Isabelle/HOL locales that must be instantiated and in return provide various definitions and proofs.

Locales [1, 24] allow the abstraction of definitions and proofs over given parameters. As an example, consider groups in algebra defined by a carrier set, a binary operation and the group axioms. With a locale, these can be specified abstractly and a number of operations and results can then be given for the abstraction. Later, we can instantiate the locale with a concrete group by providing the carrier set and binary operation and then proving that the group axioms are fulfilled. We then obtain instantiations of the results for our concrete group.

In this section, we give an overview of the locales provided by the abstract framework: what they require and what they provide. We reproduce their Isabelle code for completeness, but note that the code in this section was not written by us. The full listings can be found in the Archive of Formal Proofs entry by Blanchette et al. [6].

First, two coinductive datatypes are crucial: a tree is finitely branching but can be infinitely deep, while a stream has no branching but is decidedly infinite (a list with no end).

figure a

A tree is a Node consisting of a root element and a finite set (fset) of subtrees, which can be extracted with cont. Streams are built with the constructor SCons and have a head element (shd) and a tail stream (stl). The function sset takes a stream and returns its set of elements.
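In Haskell, where all datatypes are lazy, the two codatatypes can be sketched as follows; the finite set of subtrees is approximated by a list, and scycle is a hypothetical helper that we reuse when building the rule stream in Sect. 3.

  data Tree a   = Node a [Tree a]       -- root element and finitely many subtrees
  data Stream a = SCons a (Stream a)    -- a list with no end

  shd :: Stream a -> a                  -- head of a stream
  shd (SCons x _) = x

  stl :: Stream a -> Stream a           -- tail of a stream
  stl (SCons _ xs) = xs

  -- cycle a non-empty list into a stream
  scycle :: [a] -> Stream a
  scycle xs = go xs
    where go []       = go xs
          go (y : ys) = SCons y (go ys)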

Table 1 The RuleSystem locale with premises above the line and important conclusions below. Corresponding (abridged) Isabelle code below the dotted line
Table 2 The PersistentRuleSystem locale which extends RuleSystem from Table 1

Tables 1 and 2 cover the two locales RuleSystem and PersistentRuleSystem, which are central for proving completeness. The locale premises are given above each horizontal line and the (important) conclusions are given below. For reference, we include the relevant Isabelle code as well, after each dotted line, but reading this is optional. We need a few explanations to understand the Isabelle code. Locales in Isabelle are specified by giving a name followed by a number of fixed constants of given types and a number of assumptions about those constants, which we can rely on while working inside the locale. Types in Isabelle/HOL include type variables (figure d) and function arrows (figure e). Assumptions can be stated with the meta-universal quantifier (figure f) and meta-logical implication (figure g). The notation (figure h) is shorthand for (figure i). We can also extend another locale by using the (figure j) operation explicitly. In this case, it can be useful to give the type variables descriptive names by using the for keyword to explicitly specify the types of the constants. The relation (figure k) denotes membership of a finite set.

Moreover, SOME is Hilbert’s choice operator and picks an element satisfying the given predicate, when such an element exists. The expression the None is an undefined value, since the is only defined on the Some constructor. The combinators Not, alw, ev and holds implement linear temporal logic operators on streams. For instance, the function trim uses sdrop-while and Not to drop rules from a stream as long as they are not enabled in the current state. The predicate fair expresses that all rules occur infinitely often: no matter how far down the stream we go, it is always the case that every rule will eventually occur at some later point in the stream. The corecursive function mkTree uses fimage, which is simply the image of a function over a finite set; in this case it is used to build the subtrees. Its definition uses coinductive since it applies to the codatatype of potentially infinite trees.

Table 3 The Soundness locale
Table 4 The RuleSystem-Code locale

The locales in Table 1 require us to prove a number of things about three definitions, where two of them are given in the RuleSystem-Defs locale. First, the eff relation specifies the effect of applying a rule to a state in our proof search. By (proof) state we mean a sequent, potentially coupled with additional information. The nodes of our proof tree will be proof states in this sense. Second, rules is a stream of rules for the prover to attempt to apply. Third, S is a set of well formed states (in our case simply the set of all states).

For the RuleSystem locale, we must prove two things about these definitions. First, eff-S, that the set of well formed states S is closed under the eff relation on rules from the stream rules. Second, enabled-R, that no matter the proof state we have reached (in S), some rule in rules applies. In return we get the function mkTree which embodies our prover and a proof, wf-mkTree, that the tree produced by this prover is well formed (for fair rule streams). A tree is well formed (wf) when its children are well formed and the set of child states is eff-related to the node’s state and applied rule.

For the PersistentRuleSystem locale in Table 2, we must additionally prove the assumption per. This essentially states that rules do not interfere with each other: when we apply one rule, other rules that were applicable before are still applicable. In return we get a theorem called epath-completeness-Saturated. An escape path (epath) is an infinite path in a well formed proof tree. Such a path is saturated (Saturated) when any rule which is enabled at some point on the path is eventually applied. Thus, this theorem states a completeness property for the mkTree function (on well formed input): either it returns a well formed finite tree or a tree containing a saturated escape path (from which we can build a countermodel).

Table 3 covers the Soundness locale used to prove the soundness of resulting proof trees. Here, besides eff and rules, we must provide a set of models, structure, and a satisfaction predicate, sat, on sequents and models. The locale then turns a local soundness proof, local-soundness, that validity of a sequent follows from validity of its children, into a global result, soundness, that any finite, well formed tree has a valid root.

Finally, to generate code we need to instantiate the locale RuleSystem-Code in Table 4, where eff must now be a deterministic relation (i.e. a function) and rules is as before. In return we get an executable version of mkTree above, called i.mkTree and defined from the code lemmas specified in the table.

RuleSystem-Code provides no guarantees on its own, but we use the same underlying function in all four locales. We export this function to Haskell using Isabelle’s (only partly verified) code generation, code lemmas and a few (unverified) custom code-printing facilities. This step moves us from a verified prover inside Isabelle to a prover in Haskell which is based on a verified prover, but which is not itself verified.

3 Prover

In this section we explain the design of the proof search procedure driving our prover. The procedure does not use the proof system of SeCaV directly, but introduces a new set of similar proof rules that apply to entire sequents at once. That is, the prover rule corresponding to AlphaDis breaks down all formulas Dis p q in the sequent, not just the first one. This obviates the need for the structural Ext rule, which is therefore not present. It also removes the need for an explicit Idle rule, needed by Blanchette et al. [11] to prove that there always exists an enabled rule, since we can just let all rules be enabled at all times; sometimes they will simply not have an effect. In general, these multi-rules simplify the prover and its completeness proof by removing all concerns about the structure of sequents. If we need to apply a rule to a formula to find a proof, then the multi-rule will apply to that formula immediately, no matter where in the sequent it is. Additionally, we remove the Basic rule and let the prover close proof branches implicitly.

Before we can define what the rules do, we need a few auxiliary definitions. The functions listFunTm and listFunTms collect the function names that occur in a term and a list of terms, respectively:

figure n

This is used by generateNew to generate a function name that is fresh to a given list of terms:

figure o
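A Haskell sketch of these helpers, building on the syntax above; generating a fresh name as one more than the largest occurring name is an assumption about how freshness is achieved.

  listFunTm :: Tm -> [Id]
  listFunTm (Fun i ts) = i : listFunTms ts
  listFunTm (Var _)    = []

  listFunTms :: [Tm] -> [Id]
  listFunTms = concatMap listFunTm

  -- a name strictly larger than every occurring name is fresh
  generateNew :: [Tm] -> Id
  generateNew ts = 1 + foldr max 0 (listFunTms ts)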

The functions subtermTm and subtermFm compute the list of terms occurring in a term and a formula, respectively:

figure p
figure q

This is used by subtermFms to compute the list of all terms in a list of formulas (i.e. a sequent):

figure r

We define subterms as the list of all terms in a sequent, except that the list contains exactly Fun 0 [] when it would otherwise be empty. This ensures that we always have some term to instantiate \( \gamma \)-formulas with:

figure s
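These collectors can be sketched in Haskell as follows; nub plays the role of remdups, and note that variables count as subterms, which matters for the completeness result for open formulas.

  import Data.List (nub)

  subtermTm :: Tm -> [Tm]
  subtermTm t@(Var _)    = [t]
  subtermTm t@(Fun _ ts) = t : concatMap subtermTm ts

  subtermFm :: Fm -> [Tm]
  subtermFm (Pre _ ts) = concatMap subtermTm ts
  subtermFm (Imp p q)  = subtermFm p ++ subtermFm q
  subtermFm (Dis p q)  = subtermFm p ++ subtermFm q
  subtermFm (Con p q)  = subtermFm p ++ subtermFm q
  subtermFm (Exi p)    = subtermFm p
  subtermFm (Uni p)    = subtermFm p
  subtermFm (Neg p)    = subtermFm p

  subtermFms :: [Fm] -> [Tm]
  subtermFms = nub . concatMap subtermFm

  -- fall back to Fun 0 [] so that gamma-formulas can always be instantiated
  subterms :: Sequent -> [Tm]
  subterms z = case subtermFms z of
    [] -> [Fun 0 []]
    ts -> ts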

The function sub (defined in Fig. 4) implements substitution in a standard way using de Bruijn indices. See the formalization [16] or the original SeCaV work [18] for details. The function branchDone computes whether a sequent is an axiom, i.e. whether the sequent contains both a formula and its negation:

figure t

The prover uses this to determine when a branch of the proof tree is proven and can be closed. The disjunct (figure u) is not necessary for the prover, but makes the proofs easier later on.
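The check itself amounts to the following, sketched here as a simple characterization; the formalized version is primitive recursive and includes the extra disjunct mentioned above.

  -- Does the sequent contain both a formula and its negation?
  branchDone :: Sequent -> Bool
  branchDone z = any (\p -> Neg p `elem` z) z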

We first define which “parts” of a single formula f must be proven for a rule r to apply:

figure v

The parts of a formula under a rule is a list of lists of formulas with an implicit conjunction between lists and disjunction between inner formulas. For instance, the function states that for AlphaDis to prove Dis p q, we must prove either p or q. The definition takes a parameter A, which should be a list of terms present on the proof branch. For \( \delta \)-rules, a function which does not appear in A is generated to instantiate the quantifier (ensuring soundness), and for \( \gamma \)-rules, the quantifier is instantiated with every term in A (ensuring completeness). Note that if the rule and formula do not match, the result simply contains the original formula. This means that rules are always enabled, but that they do nothing to most formulas.
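The behavior just described can be sketched in Haskell as follows; the Rule enumeration is assembled from the rule names used in the paper, and details such as the order of the instances produced by the \( \gamma \)-cases are assumptions.

  data Rule
    = AlphaDis | AlphaImp | AlphaCon | NegNeg
    | BetaDis  | BetaImp  | BetaCon
    | DeltaUni | DeltaExi
    | GammaUni | GammaExi
    deriving (Eq, Show)

  -- parts A r f: the outer list is read conjunctively (one entry per branch),
  -- each inner list disjunctively (formulas placed in the same child sequent)
  parts :: [Tm] -> Rule -> Fm -> [[Fm]]
  parts a r f = case (r, f) of
    (AlphaDis, Dis p q)       -> [[p, q]]
    (AlphaImp, Imp p q)       -> [[Neg p, q]]
    (AlphaCon, Neg (Con p q)) -> [[Neg p, Neg q]]
    (NegNeg,   Neg (Neg p))   -> [[p]]
    (BetaDis,  Neg (Dis p q)) -> [[Neg p], [Neg q]]
    (BetaImp,  Neg (Imp p q)) -> [[p], [Neg q]]
    (BetaCon,  Con p q)       -> [[p], [q]]
    (DeltaUni, Uni p)         -> [[sub 0 (Fun (generateNew a) []) p]]
    (DeltaExi, Neg (Exi p))   -> [[Neg (sub 0 (Fun (generateNew a) []) p)]]
    (GammaExi, Exi p)         -> [[sub 0 t p | t <- a] ++ [Exi p]]
    (GammaUni, Neg (Uni p))   -> [[Neg (sub 0 t p) | t <- a] ++ [Neg (Uni p)]]
    _                         -> [[f]]   -- rule does not match: keep the formula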

To construct a proof tree, we need a function that computes the result of applying a rule to (all formulas in) a sequent. This is done by the following function (figure w appends two lists and remdups removes duplicates):

figure x

It first computes the effect of applying the rule to the first formula in the sequent (using the definition parts) and gives a name to the updated list of terms in the sequent (since \( \delta \)- and \( \gamma \)-rules may introduce new terms). The function then goes through the rest of the sequent recursively, combining the generated child branches with the function list-prod:

figure y

The type variable (figure z) in the type signature means that the function works on lists of lists containing any type of elements. The function list-prod behaves in the following way (similar to the Cartesian product):

figure aa

For \( \beta \)-rules, the end result is a list of \( 2^n \) child branches, where \( n \) is the number of \( \beta \)-formulas in the sequent. These branches are ordered such that they correspond to the branches one would have obtained by applying the corresponding SeCaV \( \beta \)-rule \( n \) times. For all other rules, the end result is a single child branch. The parameter A to children should again be a list of terms present on the proof branch. We should be clear that children does not apply rules recursively to sub-formulas, but only to the “top layer.” If the application of a rule reveals a formula that this rule applies to again, this formula is left alone and is only considered the next time children is applied to the sequent with that rule. For example, the result of calling children with the rule AlphaDis and the sequent containing only the formula \(\textrm{Dis}~(\textrm{Dis}~p~q)~r\) is \(\textrm{Dis}~p~q, r\) and not pqr.
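As a sketch, building on parts above (the ordering of branches and the exact way the term list is extended are assumptions):

  -- roughly a Cartesian product, merging the branches for the head formula
  -- with the branches for the rest of the sequent
  listProd :: [[a]] -> [[a]] -> [[a]]
  listProd hs ts = [h ++ t | h <- hs, t <- ts]

  children :: [Tm] -> Rule -> Sequent -> [Sequent]
  children _ _ []      = [[]]
  children a r (p : z) =
    let ps = parts a r p                                 -- branches for the head formula
        a' = nub (a ++ concatMap subtermFm (concat ps))  -- new terms from delta/gamma rules
    in listProd ps (children a' r z)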

The prover needs to ensure that bound variables are instantiated with all terms on the current branch when a \( \gamma \)-rule is applied. For this reason, we define the state in a proof tree node to be a pair consisting of a list of terms appearing on the branch and a sequent. The list of terms will be used to instantiate the parameter A in the definitions above.

We are now ready to define the effect of applying a proof rule to a proof state:

figure ab

To fit the types of the framework, the function returns a finite set (fset) instead of a list, and the function fimage is used to compute the image of a finite set under a given function. If the sequent is an axiom, the branch is proven, and the function returns an empty set of child nodes (denoted by figure ac for finite sets), closing the branch. Otherwise, the function converts the result of the children function to a finite set, and adds any new terms to the list of terms in each child node.
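A sketch of this step, with a list standing in for the finite set and with the exact combination of the state's term list and the sequent's subterms as an assumption:

  effect :: Rule -> ([Tm], Sequent) -> [([Tm], Sequent)]
  effect r (a, z)
    | branchDone z = []        -- axiom: no children, the branch is closed
    | otherwise    = [ (nub (a' ++ subtermFms z'), z') | z' <- children a' r z ]
    where a' = nub (a ++ subterms z)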

Having defined what rules do, we now need a stream of them (rules in Table 1). We, somewhat arbitrarily, define a list of rules in the order \( \alpha \), \( \delta \), \( \beta \), \( \gamma \) and cycle the list to obtain a stream. For efficiency, we could run, say, all \( \alpha \)- and \( \delta \)-rules to completion before branching with the \( \beta \)-rules, but this cannot be encoded in the simple stream of rules without further machinery: one could imagine having larger “meta-rules” corresponding to groups of SeCaV rules. This would give a notion of “phases” where we would first run all the rules in one group, then all the rules in the next group in the stream, etc. For simplicity (see Sect. 6.4) we apply single rules in a fixed order. This also trivially ensures fairness.
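In the Haskell sketch, the stream is obtained by cycling a fixed list, with NegNeg grouped with the \( \alpha \)-rules; the order within each group is an assumption.

  rules :: Stream Rule
  rules = scycle [ AlphaDis, AlphaImp, AlphaCon, NegNeg
                 , DeltaUni, DeltaExi
                 , BetaDis, BetaImp, BetaCon
                 , GammaUni, GammaExi ]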

3.1 Applying the Framework

We are now ready to apply the abstract completeness framework to obtain the actual proof search procedure (cf. Sect. 2.2). First, we define a relational version of the effect of a rule, called eff. To use the framework, we need to prove three properties: that the set of well formed proof states is closed under eff (eff-S), that it is always possible to apply some rule (enabled-R) and that the rules that can be applied are still possible to apply after applying other rules (per). We do not need to restrict the set of well formed proof states, so the first property is trivial. Since all of our rules can always be applied (they simply do nothing if they do not match the sequent), the other two properties are also trivial. We can thus instantiate the framework with our effect relation and stream of rules. This allows us to define the prover using the mkTree function from the framework:

figure ad

This function takes a list of terms and a sequent, and applies the rules in the stream in order to build a proof tree with the given sequent at the root, using our eff relation to determine the children of each node. The list of terms is used to collect the terms that occur in the sequents on each branch and should initially be empty (in the exported prover, the function is wrapped in another function to ensure that the list of terms is empty).
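Since every rule is always enabled, the framework's mkTree specializes to applying the head of the rule stream and recursing on the children with the rest of the stream, as in the following sketch; secavProver is a hypothetical name for the wrapper that starts from an empty term list.

  mkTree :: Stream Rule -> ([Tm], Sequent) -> Tree (([Tm], Sequent), Rule)
  mkTree rs s = Node (s, r) [ mkTree (stl rs) s' | s' <- effect r s ]
    where r = shd rs

  secavProver :: Sequent -> Tree (([Tm], Sequent), Rule)
  secavProver z = mkTree rules ([], z)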

We call the sequent at the root of this proof tree the root sequent:

figure ae

3.2 Making the Prover Executable

To actually make the prover executable, we need to specify that the stream of rules should be lazily evaluated, or the prover will never terminate. We do this by lazifying the stream type using the code-lazy-type command [31]. Additionally, we need to define the prover using the code interpretation of the framework to enable computation of some parts of the framework (cf. Table 4). After telling Isabelle how to translate operations on the option type to the Maybe type, this also allows us to export the prover to Haskell code.

We have implemented a few Haskell modules to drive the exported prover and translate found proofs into the proof system of SeCaV. These modules are not formally verified, but the proofs generated in this manner can be verified by Isabelle. We have written an automated test suite that tests the unverified code for soundness and completeness by applying the prover to a number of valid formulas, then calling Isabelle to verify the generated proofs, and by applying the prover to a number of invalid formulas and confirming that it does not generate a proof (within 10 seconds). While these tests do not give us absolute certainty that the exported code and the hand-written Haskell modules are correct, they provide a reasonable amount of certainty when combined with the formal proofs of correctness of the proof search procedure within Isabelle.

4 Soundness

We use the abstract soundness framework (cf. Sect. 2.2) to prove that any sequent with a well formed and finite proof tree can be proved in the proof system of SeCaV. It follows from the soundness of SeCaV that sequents for which the prover terminates are semantically valid.

In a well formed proof tree, the sequent in every node can be derived from the sequents in its child nodes. We will need an intermediate lemma stating that the proof trees produced using the children function are well formed. For this to hold, it must be the case that the terms occurring in the sequent we are trying to derive are contained in the list of existing terms given to the children function such that the \(\delta \)-rules are used with constants which are actually fresh. To state this condition, we need the following functions, which are similar to listFunTms and subtermFms but compute sets instead of lists:

figure af
figure ag
figure ah

The following lemma then comprises the core of the result:

Lemma 1

If for all sequents z’ in children A r z, we can derive (figure ai), and the term list A contains all parameters of pre and z, then we can derive (figure aj) itself:

figure ak

Proof

By induction on z for arbitrary pre and A.

For the empty sequent, we can immediately derive (figure al) from the assumption and the definition of children.

For the non-empty sequent with formula p as head and z as tail we have the following induction hypothesis (for any pre and A):

figure am

We abbreviate the term list that the prover actually recurses on as (figure an). From the first assumption and the definition of (figure ao) we then have a fact which we refer to as (*) below. The proof continues by examining the possible cases for (figure ap).

Take first the case where (figure aq) and (figure ar). Then (*) states that we can derive (figure as) for all (figure at) in (figure au). We apply the induction hypothesis at pre extended with q and r, which is allowed since they are subformulas of p. We then get the derivation (figure av). By the Ext and AlphaDis rules from SeCaV we obtain the desired derivation (figure aw).

The remaining \( \alpha \)- and \( \beta \)-cases are similar. In the \( \delta \)-cases we prove that the constant used by the prover is new to the sequent, as required by the SeCaV \( \delta \)-rules. To apply the induction hypothesis, we need that the constant is already included in our term list (figure ax), but this is guaranteed by constructing (figure ay) via (figure az).

In the \( \gamma \)-cases we get a derivation that includes both the \( \gamma \)-formula and all instantiations of it using terms from the list A. Here we induct on A to generalize each instantiation into the corresponding \( \gamma \)-formula and use Ext to contract this \( \gamma \)-formula with the existing occurrence.

When (figure ba) returns p, the thesis holds from (*) and the induction hypothesis. \(\square \)

We only need pre in the above lemma to make the induction hypothesis strong enough for the proof, so we can instantiate it afterwards.

Corollary 1

(Proof tree to SeCaV) We derive a sequent from derivations of its (figure bb):

figure bc

We obtain the following soundness theorem from the abstract soundness framework. Note that the derivation in SeCaV follows the steps in the well-formed proof tree, which means that the derivation corresponds exactly to the generated proof certificate.

Theorem 1

(Prover soundness wrt. SeCaV) The root sequent of any finite, well formed proof tree has a derivation in SeCaV:

figure bd

5 Completeness

The completeness proof is heavily based on the abstract completeness framework. As noted in Sect. 2.2, however, the framework only helps us with part of the proof. First, we duplicate the output of Table 2, since the (figure be) function is unhelpfully abstracted away by an existential quantifier. This could easily be changed in the framework and should be considered for the next release.

Lemma 2

(Prover cases) The proof tree generated by the prover is either finite and well formed or there exists a saturated escape path with our initial state as root:

figure bf

In the first case, the sequent has a proof (cf. Sect. 4). In the second case, we need to build a countermodel from the saturated escape path to contradict the validity of the sequent. The rest of this section does exactly that. Inspired by Ben-Ari [2] and Ridge and Margetson [41], we start off by giving a definition of Hintikka sets over a restricted set of terms (Sect. 5.1). We show that the set of formulas on saturated escape paths fulfills all Hintikka requirements when we take the set of terms to be the terms on the path (Sect. 5.2). We then define a countermodel for any formula in such a set using a new semantics that bounds quantifiers by an explicit set rather than by types alone (Sect. 5.3). Finally we tie these results together to show that the prover terminates for all sequents that are valid under our new semantics (Sect. 5.4). In Sect. 6.1 we use existing results to prove completeness of the prover wrt. the SeCaV semantics.

5.1 Hintikka

Fig. 5 Requirements for a set of formulas H to be a Hintikka set

First, by the terms of a set of formulas H we mean all the subterms of formulas in H, unless there are none, in which case we mean a designated singleton set:

figure bg

This set contains an arbitrary (but fixed) constant, Fun 0 [], when H itself contains no terms. Otherwise it contains all subterms of all formulas in H. This mirrors the definition of subterms in Sect. 3.

Figure 5 contains a definition of a Hintikka set H. Here, we use a locale slightly differently to the previous ones, in that we specify no conclusions, only premises: the formula set H and the requirements Basic, AlphaDis, etc. This use simply allows us to assume Hintikka H in a theorem and know that the set H then fulfills the stated requirements. Similarly, we can prove that a set H is Hintikka by proving that it fulfills the requirements. It is important to note that in the \( \gamma \)- and \( \delta \)-cases, the quantifiers only range over the terms of H.
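For orientation, a few representative requirements, paraphrased from Fig. 5 and the proofs in Sect. 5.2.3 (see the formalization for the precise statements):

  • Basic: for no predicate are both the predicate and its negation members of H.

  • AlphaDis: if \(\textrm{Dis}~p~q \in H\) then \(p \in H\) and \(q \in H\).

  • BetaDis: if \(\textrm{Neg}~(\textrm{Dis}~p~q) \in H\) then \(\textrm{Neg}~p \in H\) or \(\textrm{Neg}~q \in H\).

  • GammaExi: if \(\textrm{Exi}~p \in H\) then \(p\,[\textrm{Var}~0/t] \in H\) for every term \( t \) in the terms of H.

  • DeltaUni: if \(\textrm{Uni}~p \in H\) then \(p\,[\textrm{Var}~0/t] \in H\) for some term \( t \) in the terms of H.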

5.2 Saturated Escape Paths are Hintikka

The following definition forgets all structure of a path and reduces it to a set of formulas:

figure bh

The function pseq extracts the sequent from each step.

Given a saturated escape path steps, we want to prove that tree-fms steps is a Hintikka set. For instance, if Dis p q appears on the path, then both p and q should too. The prover is designed to make this property of its proof trees as evident as possible: formulas unaffected by a given rule are easily shown to be preserved by the application of that rule and any rule immediately applies to all its affected formulas, regardless of their position in the sequent.

We will need a number of intermediate results.

5.2.1 Unaffected Formulas

We define the predicate affects to hold for a rule and a formula when that rule does not preserve the formula (thus no rule affects a \( \gamma \)-formula, since the \( \gamma \)-rules of the prover, unlike those of SeCaV, preserve the original formula). For instance, (figure bi) holds while (figure bj) does not.

We then prove the following key preservation lemma:

Lemma 3

(effect preserves unaffected formulas)

Assume formula p occurs in sequent z and the rule r does not affect p. Then p also occurs in all children of z as given by effect:

figure bk

Proof

The function parts preserves unaffected formulas (proof by cases) so children does as well (proof by induction on the sequent) and thus effect does too. \(\square \)

We lift this to escape paths:

Lemma 4

(Escape paths preserve unaffected formulas) Assume formula p occurs in some sequent at the head of an escape path which consists of a prefix, pre, where none of the rules affect p, and a suffix, suf. Then p occurs at the head of suf (the operator @- prepends a list to a stream):

figure bl

Next, notice the following property of streams:

Lemma 5

(Eventual prefix) When a property P eventually holds of a stream, then the stream is comprised of a prefix of n (possibly zero) elements for which P does not hold and then a suffix that starts with an element for which P does hold:

figure bm

Saturation states that a rule is eventually applied and Lemmas 4 and 5 combine to state that any affected formulas are preserved until then.

5.2.2 Affected Formulas

Having established that formulas are preserved as desired, we also need to know that they are broken down as desired. The following lemma (proof omitted here) states this in general via parts:

Lemma 6

(Parts in effect) For any formula p in a sequent z, the effect of rule r on z includes some part of r’s effect on p:

figure bn

This is easier to understand when we specialize the rule and the formula:

Corollary 2

Effect of the NegNeg rule on a double-negated formula p:

figure bo

5.2.3 Hintikka Requirements

We then need to prove the following:

Theorem 2

(Hintikka escape paths) Saturated escape paths fulfill all Hintikka requirements:

figure bp

Proof

This boils down to proving each requirement of Fig. 5. We give a couple of examples and refer to the formalization for the full details.

For Basic, assume towards a contradiction that both a predicate and its negation appear on the branch. By preservation of formulas (Lemma 4), both appear in the same sequent at some point. But then branchDone holds for that sequent, so it has no children and the branch would terminate. This contradicts that escape paths are infinite, so Basic must hold.

For AlphaDis, assume that Dis p q appears on the branch. Then it appears at some step n. By saturation of the escape path, AlphaDis is eventually applied at some (earliest) step \( n + k \). By Lemma 4, Dis p q is preserved until then. So by the effect of rule AlphaDis, both p and q appear at step \( n + k + 1 \). The cases for the \( \beta \)- and \( \delta \)-requirements are very similar.

For GammaExi assume that Exi p occurs at step n. We need to show that it is instantiated with all terms that (eventually) appear on the branch. Fix an arbitrary such term t. There must be some step m where t appears in a sequent. Thus at every step after m, term t appears in the term list which is part of the proof state. By saturation, at some step greater than n + m + 1, rule GammaExi is applied. The formula Exi p is preserved until this stage (Lemma 4) and the term list only grows, so t is still in it at that point. Thus, at the next step, sub 0 t p occurs on the branch as desired. \(\square \)

5.3 Countermodel

We need to build a countermodel for any formula in a Hintikka set to contradict the validity of any formula on a saturated escape path. We do this in the usual term model with a (bounded) Herbrand interpretation. Unfortunately, we cannot build a countermodel in the original semantics where the universe is specified as a type, since we cannot form the type of terms in a given Hintikka set (we cannot use the typedef mechanism of Isabelle to parameterize the universe on the terms in the Hintikka set since the set depends on the formula we are attempting to prove). Instead, we introduce a custom bounded semantics.

5.3.1 Bounded Semantics

The bounded semantics is exactly like the usual semantics (cf. Fig. 2) except for an extra argument u, standing for the universe, which bounds the range of the quantifiers in the following cases:

figure bq

This leads to the following natural requirements on environments e and function denotations f, namely that they must stay inside u:

figure br

In general, we only consider environments and function denotations that satisfy these requirements and call them (and any model based on them) well formed. When u = UNIV (the universal set), the quantifiers are not actually bounded and the two semantics coincide.
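The bounded semantics can be sketched in Haskell by passing the universe as an explicit list (the formalization uses a set, which may be infinite); only the quantifier cases differ from the ordinary semantics, and well formedness amounts to e and f only producing elements of u.

  usemantics :: [a] -> (Int -> a) -> (Id -> [a] -> a) -> (Id -> [a] -> Bool) -> Fm -> Bool
  usemantics u e f g fm = case fm of
    Pre i ts -> g i (semanticsList e f ts)
    Imp p q  -> not (usemantics u e f g p) || usemantics u e f g q
    Dis p q  -> usemantics u e f g p || usemantics u e f g q
    Con p q  -> usemantics u e f g p && usemantics u e f g q
    Exi p    -> any (\x -> usemantics u (shift e 0 x) f g p) u
    Uni p    -> all (\x -> usemantics u (shift e 0 x) f g p) u
    Neg p    -> not (usemantics u e f g p)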

The SeCaV proof system (cf. Fig. 3) is sound for the bounded semantics too.

Theorem 3

(SeCaV is sound for the bounded semantics) Given a SeCaV derivation of sequent z and a well formed model, some formula p in z is satisfied in that model:

figure bs

Proof

The proof closely resembles the original soundness proof (cf. [18]). \(\square \)

We abbreviate validity of a sequent in the bounded semantics as uvalid:

figure bt

Namely, for all universes and well formed models, some formula in the sequent is satisfied in the bounded semantics at that universe by that model.

5.3.2 Model Construction

Our countermodel is given by a bounded Herbrand interpretation where terms are interpreted as themselves when they appear in the universe terms H and as an arbitrary term otherwise.

Definition 1

(Countermodel induced by Hintikka set S)

We abbreviate the model as M S:

figure bu

The definition of G is what makes this a countermodel rather than a model: a predicate is satisfied exactly when its negation is present in the Hintikka set.
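The construction can be sketched as follows, with the Hintikka set approximated by a list of formulas; choosing Fun 0 [] as the arbitrary fallback term is an assumption, and the sketch builds on the subterm functions from Sect. 3.

  termsOf :: [Fm] -> [Tm]
  termsOf s = case nub (concatMap subtermFm s) of
    [] -> [Fun 0 []]
    ts -> ts

  counterE :: [Fm] -> Int -> Tm            -- environment: variables as themselves
  counterE s n = if Var n `elem` termsOf s then Var n else Fun 0 []

  counterF :: [Fm] -> Id -> [Tm] -> Tm     -- function denotation
  counterF s i ts = if Fun i ts `elem` termsOf s then Fun i ts else Fun 0 []

  counterG :: [Fm] -> Id -> [Tm] -> Bool   -- a predicate holds iff its negation is in S
  counterG s i ts = Neg (Pre i ts) `elem` s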

Importantly, these definitions are well formed:

Lemma 7

(Well formed countermodel)

Definition 1 is well formed:

figure bv

Proof

By the construction of E and F and the nonemptiness of terms S. \(\square \)

Theorem 4

(Model existence)

The given model falsifies any formula p in Hintikka set S:

figure bw

Proof

By induction on the size of the formula p (substitution instances are smaller than the quantified formulas they arise from since Isabelle’s notion of size is the constructor depth of the formula). The second part of the thesis is needed when the Hintikka requirements concern negated formulas. We show a few cases here and refer to the formalization for the full details. The cases omitted here are similar to those shown.

Assume (figure bx) occurs in S. We need to show that the given model falsifies p. Since (figure by) is downwards closed by construction, ts is interpreted as itself by the bounded Herbrand interpretation. Moreover, by the Basic requirement, we know that (figure bz) is not in S and is therefore satisfied. Thus, p is falsified.

Assume (figure ca) occurs negated in S. Then by the BetaDis requirement, either (figure cb) or (figure cc) occurs in S. The induction hypothesis applies to these, so p is satisfied as desired.

Assume (figure cd) occurs in S. By the DeltaUni requirement, so does some instance (figure ce) for a term t in terms S. By the induction hypothesis, this is falsified by M S, and because t came from terms S, it is interpreted as itself. Thus, we have a counterexample that falsifies p.

Assume p = Exi q occurs in S. By the GammaExi requirement, so do all instantiations using terms from S. Thus, these are all falsified by the model. These terms from S are interpreted as themselves by definition, so we have no witness for p in terms S and M S falsifies it. \(\square \)

We note that the above proof works for open and closed formulas alike because we consider both bound and free variables to be subterms of a formula.

5.4 Result

We start off by proving completeness for uvalid sequents. We need to relate these to saturated escape paths.

Lemma 8

(Saturated escape paths contradict uvalidity) A sequent z on a saturated escape path, steps, cannot be uvalid:

figure cf

Proof

Assume towards a contradiction that z is uvalid. By Theorem 2 the formulas on steps form a Hintikka set S. Every formula p in z also occurs in S, so by Theorem 4, the well formed model M S (Lemma 7) falsifies all of them. This contradicts the uvalidity of z. \(\square \)

This leads to completeness for uvalid sequents:

Theorem 5

(Completeness wrt. uvalid) The prover terminates for uvalid sequents:

figure cg

Proof

From the abstract framework (Lemma 2), either the thesis holds or a saturated escape path exists for our sequent, but assumed uvalidity and Lemma 8 contradict the latter. \(\square \)

Corollary 3

(Completeness wrt. SeCaV) Termination for sequents derivable in SeCaV:

figure ch

Proof

By the soundness of SeCaV (Theorem 3) and Theorem 5 for uvalid sequents. \(\square \)

6 Results and Discussion

We have presented an automated theorem prover for the Sequent Calculus Verifier system. The prover is capable of proving a number of selected exercise formulas very quickly, including formulas which are quite difficult for humans to prove. The prover does have some limitations, mostly related to performance and length of the generated proofs, since our proof search procedure is not very optimized for either of these metrics. In particular, our prover always instantiates quantified formulas with all terms in the sequent and breaks down all formulas as much as possible, even when some formulas are “obviously” irrelevant to the proof.

6.1 Summary of Theorems

We have proven soundness and completeness of the proof search procedure with respect to the proof system of SeCaV (see Fig. 3). For soundness, this was done directly (in Theorem 1), while we took a detour through our notion of a bounded semantics to prove completeness (in Theorems 3 and 5, which led to Corollary 3). To justify the introduction of our bounded semantics, we can use the existing soundness and completeness theorems of the SeCaV proof system [18] and our results to prove that validity in the two semantics coincides. Additionally, a number of easy corollaries further linking the prover, the proof system and the two semantics follow from our results, and have been collected in Fig. 6 (refer to the formalization for the proofs). In the figure, the interpretations are implicitly universally quantified and for the bounded semantics we only consider well formed interpretations.

Fig. 6 Overview of our results. Solid arrows represent our main contributions, squiggly arrows represent theorems of the existing SeCaV system, and dashed arrows represent easy corollaries

6.2 Example Proofs

As Knuth famously remarked [25], we must beware of a program that has only been proven correct, but not tested. To demonstrate that the automated theorem prover works, we examine some simple generated proofs. The prover generates proofs in the SeCaV Unshortener format [15]: first comes the formula to be proven, then the names of proof rules to apply and the resulting sequent after each application, with each formula in a sequent on its own line. Arguments to predicates and functions are given in square brackets and parentheses are used to disambiguate formulas.

Fig. 7 Proofs generated by the prover in SeCaV Unshortener format. Note that the proof on the right spans two columns

Fig. 8 Proof of the formula \(\forall x. A(f(x)) \longrightarrow \exists x. A(x)\) generated by the prover in SeCaV Unshortener format. Note that the prover applies a series of \(\gamma \)-rules to instantiate quantifiers with all possible terms and that these terms include free variables

We start with perhaps the simplest possible classical example, that \( \lnot p \vee p \). Figure 7 shows the proof generated by the prover on the left. This is the shortest possible proof of the formula in the SeCaV system, and the prover is thus on par with a human in this very simple case.

The next example is \(\lnot p(a) \vee \exists x. p(x) \). Figure 7 contains the generated proof on the right. It can be shortened since the quantified formula only needs to be instantiated once, by \( a \). However, the prover always duplicates a \( \gamma \)-formula before instantiating it with all terms on the branch.

Finally, Fig. 8 shows a proof of \(\forall x. A(f(x)) \longrightarrow \exists x. A(x)\) generated by the prover. This formula contains a function \( f \), which the prover needs to use to instantiate the existentially quantified variable. Proving the formula also requires instantiating variables occurring as arguments of an application of f. This proof can be shortened significantly, since the prover instantiates both quantifiers with both \( x \) (i.e. de Bruijn index 0) and \( f(x) \), even though only the instantiations leading to formulas involving \( A(f(x)) \) are needed for the proof. Note that the prover is able to instantiate formulas with free variables (i.e. de Bruijn indices with no corresponding quantifier) to overcome the lack of concrete constant names.

6.3 Verification Challenges

While verifying the prover, we discovered that our initial version was unsound due to a missing update of the term list when applying (multiple) \( \delta \)-rules to a sequent. The attempted soundness proof failed in exactly this case, pointing us directly to the issue. Thus, the formal verification caught a critical flaw that we had missed in our testing and helped us fix it.

We have designed the prover to be easily verified and it mostly was. Especially the abstract framework worked well for our novel case with a deterministic prover for first-order logic. One obstacle, however, was in using a type to represent the domain in the SeCaV semantics (cf. Fig. 2). To build the countermodel, we need the domain to contain only the terms on the saturated escape path, but we cannot form this type, which depends on a local variable, in Isabelle/HOL. Here we would benefit from Isabelle integration of the work by Kunčar and Popescu [27] which adds exactly this capability to higher-order logic. Instead we introduced the bounded semantics (“the set-based relativization” in their terminology [27]) and proved a new soundness result for it (cf. Sect. 5.3.1). Otherwise the largest issue was dealing with substitutions using de Bruijn indices. We are excited to see how recent work by Blanchette et al. [5] for reasoning about syntax with bindings improves matters in this area.

6.4 Limitations and Future Work

There are a number of limitations and possibilities for optimization in the proof search itself. Most importantly, the focus of the procedure is on completeness, not performance. Our prover is much slower than state-of-the-art provers such as Vampire [26], and in particular, our prover is not fast enough to prove any but the easiest problems in the TPTP database [49]. Our goal was not to compete on speed, but simply to show that formal verification of provers with advanced features such as generation of proof certificates and support for functions is possible. The prover also cannot output counterexamples, even though these could be detected in some cases: our prover simply never terminates on invalid formulas.

We believe that the approach used for our prover is extendable to more sophisticated and optimized proof search procedures, albeit with considerably more work needed to formally verify them. The most obvious opportunity for optimization is controlling the order of proof rules. In systems with unordered sequents, it is generally better to apply as many \( \alpha \)-rules as possible before applying \( \beta \)-rules to avoid duplicating work, but the prover simply applies rules in a fixed order. As mentioned in Sect. 3, this optimization can be done by working with “meta-rules” corresponding to groups of SeCaV rules such that a meta-rule e.g. applies as many \( \alpha \)-rules as possible before continuing to the next “phase” of the proof. We have attempted to implement this, but found that it complicates the proofs considerably since this idea makes it much harder to determine when a proof rule is actually applied. In the proof of fairness and the proof that the formulas on saturated escape paths form Hintikka sets, we need to know that certain formulas are preserved until proof rules are eventually applied to them. By introducing phases in the proof, proving this becomes much more difficult, since we then need to prove that each phase actually ends (requiring some measure which depends on the specific sequents in question), and to locate each rule within the meta-rule it is part of. We thus leave optimizations in this vein as future work. We note that, since the SeCaV system requires application of the Ext rule to permute sequents, and proof rules only apply to the first formula in a sequent, the optimization described above may not always reduce the number of SeCaV proof steps needed to prove a formula, and some heuristics would probably be needed to produce reasonably short proofs in all cases.

Instead of introducing phases by working with meta-rules, we could also imagine allowing the proof strategy to depend on the actual formula being proven more directly, by choosing a rule based on the current sequent or possibly even the entire proof tree. The framework used for the prover does not allow this (since the rules used for the proof search must be given as a fixed stream of rules) and a new framework would have to be developed to implement optimizations in this vein. Such a framework would also make it possible to avoid useless rule applications and to easily implement proof strategies based on heuristics. It is unclear how the additional flexibility of allowing rules to depend on formulas would impact the difficulty of the proofs of soundness and completeness, and we thus leave developing such a framework as future work.

Another optimization could be to only support closed formulas and thus reduce the number of subterms of a given formula. For our current Herbrand interpretation, we need variables to be subterms, but if we only considered closed terms, we could do away with this. This would however require a change to the proof strategy to make the prover invent new names instead of using free variables in cases such as the proof in Fig. 8.

The length of proofs could also be optimized by performing more post-processing of the found proofs, for example by removing unnecessary instantiations or rule applications that do not contribute to proving a branch. This would not improve the performance in the sense that the prover would still spend the same amount of time finding the proof, but it could reduce the length of some proofs significantly. The proof trees generated by the prover already require some (unverified) post-processing to obtain proofs in the SeCaV system. The existing post-processing also removes rule applications that do nothing to the sequent, and merges consecutive Ext rules, of which there may be many after removing other rule applications. The current post-processing does not, however, consider the usefulness of a rule in the context of the larger proof, and this could be implemented to eliminate entire lines of reasoning that end up amounting to nothing. This analysis can of course only be performed after seeing the entire proof, and would thus not be able to increase the speed of the prover. It would also be interesting to move these steps from Haskell into Isabelle/HOL and extend the proofs to cover them.

Another way to shorten the generated proofs would be to disable rules that do not actually change the sequent. For instance, an AlphaDis rule can affect a dozen formulas in a sequent at once, or none at all when there are no disjunctions, but the rule is applied regardless. This can lead to a lot of useless rule applications in the final proof, which are removed by the above-mentioned post-processing. In terms of how much work the prover does, we have to check the sequent anyway, to see if a given rule applies to it, so there is no immediate benefit compared to simply applying the rule with no effect. Moreover, the current design allows us, as mentioned, to forgo an always-enabled Idle rule, otherwise needed to satisfy the framework, since all rules are always enabled. It could, however, have a positive impact on the memory usage of the prover, since the proof tree would not grow as much, taking up unnecessary space. This should be investigated in future work.

7 Conclusion

We have designed, implemented and verified an automated theorem prover for first-order logic with functions in Isabelle/HOL. We have used an existing framework in a novel way to get us part of the way towards completeness and we have extended existing techniques on countermodels over restricted domains to reach our destination. We build on the existing SeCaV system and contribute an automatic way of finding derivations to the project. Thus, we have demonstrated the utility of Isabelle/HOL for implementing and verifying executable software and the strength of its libraries in doing so. Our prover handles the full syntax of first-order logic with functions and constructs human-readable proof certificates in a sequent calculus. We hope our work inspires others to verify more sophisticated provers in the same vein.