1 Introduction

Mathematics is built in a carefully structured way, with many disciplines and subdisciplines. These are characterized by concepts, definitions, axioms, theorems, lemmas, and so forth. There is no doubt that this inherent structure of mathematics is part of the discipline’s long-lasting success.

Research into Automated Theorem Proving (ATP) to date has taken little notice of the information provided by this structure. Even state-of-the-art ATP systems ingest a conjecture together with pertinent definitions and axioms in a way completely agnostic to their place in the mathematical structure. A comparatively small but nevertheless important part of the structure of mathematics is the identification and application of lemmas. It is this aspect which is the focus of the work presented here.

The purpose of lemmas in mathematics is at least threefold. First, and perhaps most importantly, lemmas support the search for proofs of assertions. If some lemma applies to a given problem, a proof may be found more easily. Second, it is often the case that a lemma may be applied more than once. If this happens, its use will shorten the length of the overall proof since the proof of the lemma need only be carried out once, not repeatedly for every application. Third, the structuring effect of proofs by the use of lemmas is an important feature for human comprehension of proofs. In our work we are motivated primarily by the first two of these three aspects.

These considerations give rise to the crucial question: how can we find useful lemmas for proving a given problem? Here we mean useful in the sense of the two aforementioned aspects: lemmas should be applicable to the problem at hand, preferably many times. In full generality this is a difficult question indeed, which will require much further research. In this first step we restrict the question to a narrow range of problems, known in the literature as condensed detachment (CD) problems [41]. Proofs of CD problems can be represented in a simple and accessible form as proof structure terms, enabling structure enumeration to enhance proof search and lemma maintenance, as well as feature extraction for learning. Our investigation thus focuses on the question of how ATP performance may be improved for CD problems by the generation and selection of useful lemmas before search begins.

CD problems are of the form “axiom(s) and Det imply a goal” where Det represents the well-known modus ponens rule, or condensed detachment. They have a single unary predicate. A typical application is the investigation of an axiomatization of some propositional logic, whose connectives are then represented by function symbols. In order to support this study experimentally, we have built a combined system for dealing with these problems. It features SGCD  [74] as prover and lemma generator along with a learning module based on either an easily-interpreted linear model over hand-engineered features, or a graph neural network supporting end-to-end learning directly from lemmas.

Our work results in a number of interrelated contributions:

  1. Incorporation of proof structure terms into ATP with Machine Learning (ML). Consideration of features of the proof structure terms, explicitly in linear-model ML or implicitly in a neural ML model. A novel ATP/ML dataflow that is centered around proof structure terms.

  2. Experimentally validated general insights into the use of learned lemmas for provers of different paradigms, with different ways to incorporate lemmas, and based on two alternative ML models, while at the same time pushing forward the state of the art in proving CD problems. Insights include: SGCD is competitive with leading first-order provers; learned lemmas significantly extend the set of problems provable by the leading first-order prover Vampire; provers without internal lemma maintenance, such as Connection Method (CM) [6, 7, 8] systems, are drastically improved; Vampire and SGCD can handle a few hundred supplied lemmas; and learning based on manual features performs similarly to learning based on automatic feature extraction.

  3. An automatic proof of the Meredith single axiom theorem LCL073-1, which has carried a TPTP rating of 1.00 since 1997. The first and only system to succeed so far was OTTER [39], after intensive massaging by Wos [84]. SGCD proves it in a novel, systematic way.

  4. An implemented framework with the new techniques for generation, selection and application of lemmas.

Structure of the Paper. Section 2 presents condensed detachment and its embedding into the CM by way of so-called D-terms, as well as background material on lemmas and machine learning in ATP. Section 3 introduces a method for generating and selecting useful lemmas and presents experimental results with it, leading up to the proof of LCL073-1 in Sect. 4. We conclude with a summary and outlook for further work in this area in Sect. 5.

Supplementary material is provided in the appendix of the preprint version [54]. All experiments are fully reproducible and the artifacts are available at https://github.com/zsoltzombori/lemma, commit df2faaa. We use CD Tools  [74] and PIE [71, 72], implemented in SWI-Prolog [77], for reasoning tasks and PyTorch [47] for learning.

2 Background and Related Work

In a very general sense, lemmas in ATP factorize duplication. This may be between different proofs that make use of the same lemma, or within a single proof where a lemma is used multiple times. It may not even be a particular formula that is shared, but a pattern, such as a resonator [81]. In the presence of machine learning, we may think of even more abstract entities that are factorized: the principles by which proofs are written, repeated in different proofs or contexts.

Depending on the proving method, lemmas in ATP play different roles. Provers based on saturation, typically resolution/superposition (RS) systems [3], inherently operate by generating lemmas: a resolvent is itself a lemma derived from its parents. Nevertheless, one may ask for more meaningful lemmas than the clauses of the proof. This is addressed with cut introduction [14, 20, 78], which studies methods to obtain complex lemmas from resolution proofs. Such lemmas provide insight about the high-level structure of proofs, extract interesting concepts and support research into the correspondence between natural mathematical notions and possible proof compressions. Other approaches to interesting theorems or lemmas are described for example in [52, 65].

Another question concerning lemmas and ATP systems is whether performance can be improved by supplementing the input with lemmas. This is particularly applicable if the lemmas are obtained with methods different from those of the prover; otherwise, the prover may well have obtained them by itself.Footnote 1 As we will see, leading ATP systems such as Vampire and E [59] can indeed be improved in this way. Different methods do not necessarily mean different systems: it is possible to use different configurations of the same system for lemma generation and proving, as well as for intermediate operations. This was the workflow used by Larry Wos to prove the challenge problem LCL073-1 with OTTER [84]. Our SGCD system also supports such a workflow, which played a major role in its ability to prove the aforementioned challenge problem.

Lemmas play a quite different role for a family of provers which we call CM-CT for Connection Method/Clausal Tableaux, exemplified by PTTP  [61], SETHEO [33], and leanCoP [45, 46]. Underlying conceptual models are model elimination [35], clausal tableaux [31] and the CM. They enumerate proof structures while propagating variable bindings initialized by the goal through unification, and hence proceed in an inherently goal-driven way. While they are good at problems that benefit from goal direction, in general they are much weaker than RS provers and have not been among the top provers at CASC for about two decades. This is attributed to the fact that they do not re-use the proof of one subgoal as the solution of another: they do not use lemmas internally.

The lack of lemmas was identified early as a weakness of CM-CT [15], and various remedies have been proposed [2, 15, 17, 19, 32, 45, 60, 62]. Despite some insight and success, these have not yet elevated CM-CT to the level of the best RS systems. Nevertheless, the expectation remains that CM-CT provers would benefit from lemmas supplied as additional input. Hence, we included two CM-CT systems in our experiments, leanCoP and CMProver [12, 71, 72]; our results confirm this expectation emphatically. Two other systems considered here, SGCD and CCS [73], can be viewed as CM-CT systems extended to support specific forms of lemma generation and application.

Lemmas can be maintained within the prover as an inherent part of the method, as in saturation. They may also be created and applied by different systems, or different instances of the same system [13, 55]. Larry Wos calls this lemma adjunction [83]. Lemmas created by one system are passed to a second system in two principal ways. First, they can be passed as additional axioms, in the hope that the second system finds a shorter proof in the wider but shallower search space. Second, external lemmas can be used to replace search. The second system then starts with the given lemmas as if they were the cached result of its previous computation. Moreover, the provided lemmas can be restricted in advance by heuristic methods, such as by a machine-learned model. SGCD supports this replacing lemma incorporation. The basic distinction between augmenting and replacing search with lemmas was already observed by Owen L. Astrachan and Mark E. Stickel [2] in the context of improving CM-CT provers.

2.1 Machine Learning for ATP

The past decade has seen numerous attempts to leverage machine learning in the automated theorem proving effort. Early systems mostly focused on premise selection, e.g. [1, 68, 70], aiming to reduce the number of axioms supplied as input to the prover, or on selection of heuristics, e.g. [11]. Other works provide internal guidance directly at the level of inferences during search, e.g. [18, 24, 25, 27, 34, 53, 85]. The emergence of generative language models has also led to some initial attempts at directly generating next proof steps, e.g. [48, 49, 67], moving the emphasis away from search.

In contrast to these lines of work, our focus is on learning the utility of lemmas. Closest to our aims are [26, 28], which try to identify globally useful lemmas in a collection of millions of proofs in HOL Light. Besides differences in the formal system, what distinguishes our work is that we learn a much more focused model: we put great emphasis on evaluating lemmas in the context of a particular goal and axiom set; in fact, our entire system is designed around the question of whether a given lemma moves the goal closer to the axioms. We argue that the D-term representation of all involved components (goal, lemma, axioms, proof) makes our framework particularly suitable for the lemma selection task.

We employ an iterative improvement approach first used in MaLARea [68]: in each iteration, we run proof search guided by a learned model, extract training data from proving attempts, and fit a new model to the new data. These steps can be repeated profitably until performance saturates.

2.2 Condensed Detachment: Proofs as Terms

Condensed detachment (CD) was developed in the mid-1950s by Carew A. Meredith as an evolution of substitution and detachment [30, 43, 50, 51]. Reasoning steps are by detachment, or modus ponens, under implicit substitution by most general unifiers. Its primary application is the investigation of axiomatizations of propositional logics at a first-order meta-level. CD also provides a technical approach to the Curry-Howard correspondence, “formulas as types” [22, 23], and is considered in witness theory [57]. Many early successes in ATP were on CD problems [40, 66], but success was also found in the reverse direction. Refinements of the OTTER prover in the 1990s, some of which have found their way into modern RS provers, were originally conceived and explored in the setting of CD [16, 40, 69, 79, 80, 81, 82, 84].

From a first-order ATP perspective, a CD problem consists of axioms, i.e. positive unit clauses; a goal theorem, i.e. a single negative ground unit clause representing a universally-quantified atomic goal theorem after Skolemization; and the following ternary Horn clause that models detachment.

$$\begin{aligned} \textit{Det} \; {\mathop {=}\limits ^{\text {def}}}\; \textsf{P}(\textsf{i}(x,y)) \wedge \textsf{P}(x) \rightarrow \textsf{P}(y). \end{aligned}$$

The premises of Det are called the major and minor premise, respectively. All atoms in the problem have the same predicate \(\textsf{P}\), which is unary and stands for something like provable. The formulas of the investigated propositional logic are expressed as terms, where the binary function symbol \(\textsf{i}\) stands for implies.

CD may be seen as an inference rule. From an ATP perspective, a CD inference step can be described as a hyperresolution from Det and two positive unit clauses to a third positive unit clause. A CD proof is a proof of a CD problem constructed with the CD inference rule. CD proofs can be contrasted with other types of proof, such as a proof with binary resolution steps yielding non-unit clauses. Prover9 [38] chooses positive hyperresolution by default as its only inference rule for CD problems and thus produces CD proofs for these.

It is, however, another aspect of CD that makes it of particular interest for developing new ATP methods, which only recently came to our attention in the ATP context [75]: the structure of CD proofs can be represented in a very simple and convenient way as full binary trees, or as terms. In ATP we find this aspect in the CM, where the proof structure as a whole is in focus, in contrast to extending a set of formulas by deduction [9]. This view of CD is made precise and elaborated upon in [76], on which the subsequent informal presentation is based. We call the structure representations of CD proofs D-terms. A D-term is a term recursively built from numeral constants and the binary function symbol \(\textsf{D}\) whose arguments are D-terms. In other words, it is a full binary tree where the leaf nodes are labeled with constants. Four examples of D-terms are

$$\begin{aligned} 1,\;\;\; 2,\;\;\; \textsf{D}(1,1),\;\;\; \textsf{D}(\textsf{D}(2,1),\textsf{D}(1,\textsf{D}(2,1))). \end{aligned}$$

A D-term represents the structure of a proof. A proof in full is represented by a D-term together with a mapping of constant D-terms to axioms. Conversion between CD proofs and D-terms is straightforward: the use of an axiom corresponds to a constant D-term, while an inference step corresponds to a D-term \(\textsf{D}(d_1,d_2)\) where \(d_1\) is the D-term that proves the major premise and \(d_2\) the minor.

Through first-order unification, constrained by axioms for the leaf nodes and the requirements of Det for inner nodes, it is possible to obtain a most general formula proven by a D-term [76]. We call it the most general theorem (MGT) of the D-term with respect to the axioms, unique up to renaming of variables. For a given axiom map, not all D-terms necessarily have an MGT: if unification fails, we say the D-term has no MGT. It is also possible that different D-terms have the same MGT, or that the MGT of one is subsumed by the MGT of another. A D-term is a proof of the problem if its MGT subsumes the goal theorem.

As an example, let the constant D-term 1 be mapped to \(\textsf{P}(\textsf{i}(x,\textsf{i}(x,x)))\), known as Mingle [66]. Then, the MGT of the D-term 1 is just this axiom. The MGT of the D-term \(\textsf{D}(1,1)\) is \(\textsf{P}(\textsf{i}(\textsf{i}(x,\textsf{i}(x,x)),\textsf{i}(x,\textsf{i}(x,x))))\), that is, after renaming of variables, \(\textsf{P}(y)\sigma \) where \(\sigma \) is the most general unifier of the set of pairs \(\{\{\textsf{P}(\textsf{i}(x,y)),\, \textsf{P}(\textsf{i}(x',\textsf{i}(x',x')))\},\; \{\textsf{P}(x),\, \textsf{P}(\textsf{i}(x'',\textsf{i}(x'',x'')))\}\}\).
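
To make the MGT construction concrete, the following is a minimal Python sketch (our own illustration, not part of CD Tools): D-terms are nested tuples with integer leaves, formulas are nested tuples with string variables, and the unary predicate \(\textsf{P}\) is left implicit, so an MGT is just the proven \(\textsf{i}\)-term. mgt unifies, at each inner node, the subproofs' MGTs with the premises of Det.

```python
import itertools

_fresh = itertools.count()

def fresh_var():
    return f"v{next(_fresh)}"

def is_var(t):
    return isinstance(t, str)

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = walk(t, s)
    if t == v:
        return True
    return not is_var(t) and any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s):
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return unify(b, a, s)
    if a[0] != b[0] or len(a) != len(b):
        return None
    for x, y in zip(a[1:], b[1:]):
        s = unify(x, y, s)
        if s is None:
            return None
    return s

def resolve(t, s):
    t = walk(t, s)
    return t if is_var(t) else (t[0],) + tuple(resolve(a, s) for a in t[1:])

def rename(t, m):
    if is_var(t):
        return m.setdefault(t, fresh_var())
    return (t[0],) + tuple(rename(a, m) for a in t[1:])

def mgt(d, axioms):
    """MGT of D-term d: an int (axiom label) or a triple ('D', d1, d2)."""
    if isinstance(d, int):
        return rename(axioms[d], {})               # fresh variant of the axiom
    major, minor = mgt(d[1], axioms), mgt(d[2], axioms)
    if major is None or minor is None:
        return None                                # a subproof has no MGT
    x, y = fresh_var(), fresh_var()
    s = unify(major, ('i', x, y), {})              # major premise: P(i(x,y))
    if s is not None:
        s = unify(minor, x, s)                     # minor premise: P(x)
    return None if s is None else resolve(y, s)    # conclusion:    P(y)

# Axiom 1 is Mingle, P(i(x, i(x, x))):
mingle = ('i', 'x', ('i', 'x', 'x'))
print(mgt(('D', 1, 1), {1: mingle}))
# -> the MGT P(i(i(x,i(x,x)), i(x,i(x,x)))), up to variable naming
```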

D-terms, as full binary trees, facilitate characterizing and investigating structural properties of proofs. While, for a variety of reasons, it is far from obvious how to measure the size of proofs obtained from ATP systems in general, for D-terms there are at least three straightforward size measures:

  • The tree size of a D-term is the number of its inner nodes.

  • The height of a D-term is the length of the longest root-leaf path.

  • The compacted size of a D-term is the number of distinct compound subterms, or, in other words, the number of inner nodes of its minimal DAG.

Alternative names in the literature are length for compacted size, level for height and CDcount [69] for tree size. The D-term \(\textsf{D}(\textsf{D}(1, \textsf{D}(1, 1)), \textsf{D}(\textsf{D}(1, 1), 1))\), for example, has tree size 5, compacted size 4 and height 3. Factor equations provide a compact way of writing D-terms: distinct subproofs with multiple incoming edges in the DAG receive numeric labels, by which they are referenced. The D-term \(\textsf{D}(\textsf{D}(1, 1), \textsf{D}(\textsf{D}(1, \textsf{D}(1, 1)), \textsf{D}(1, \textsf{D}(1, 1))))\), for example, can be written as \(2 = \textsf{D}(1, 1),\; 3 = \textsf{D}(1, 2),\; 4 = \textsf{D}(2, \textsf{D}(3, 3))\).
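
For illustration, here is a small Python sketch (our own, not part of the authors' tooling) of the three measures, with D-terms encoded as nested tuples \(\textsf{(’D’, d}_1\textsf{, d}_2\textsf{)}\) and integer leaves:

```python
def tree_size(d):
    """Tree size: number of inner nodes of the D-term."""
    return 0 if isinstance(d, int) else 1 + tree_size(d[1]) + tree_size(d[2])

def height(d):
    """Height: length of the longest root-leaf path."""
    return 0 if isinstance(d, int) else 1 + max(height(d[1]), height(d[2]))

def compacted_size(d):
    """Compacted size: number of distinct compound subterms,
    i.e., the number of inner nodes of the minimal DAG."""
    seen = set()
    def visit(t):
        if isinstance(t, int) or t in seen:
            return
        seen.add(t)
        visit(t[1])
        visit(t[2])
    visit(d)
    return len(seen)

d = ('D', ('D', 1, ('D', 1, 1)), ('D', ('D', 1, 1), 1))
print(tree_size(d), compacted_size(d), height(d))   # 5 4 3
```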

CD problems have core characteristics of first-order ATP problems: first-order variables, at least one binary function symbol and cyclic predicate dependency. But they are restricted: positive unit clauses, one negative ground clause, and one ternary Horn clause. Equality is not explicitly considered. The generalization of CD to arbitrary Horn problems is, however, not difficult [73].

2.3 Condensed Detachment for ATP and Lemmas

From an ATP point of view, D-terms provide access to proofs as a whole. This exposes properties of proofs that are not merely local to an inference step, but spread across the whole proof. It suggests a shift in the role of the calculus from providing a recipe for building the structure towards an inductive structure specification. Moreover, D-terms as objects provide insight into all proofs: for example, growth rates of the number of binary trees for tree size, height and compacted size are well-known with entries in The On-Line Encyclopedia of Integer Sequences [44] and provide upper bounds for the number of proofs [76]. A practical consequence for ATP is the justification of proof structure enumeration techniques where each structure appears at most once.

CD proofs suggest and allow for a specific form of lemmas, which we call unit or subtree lemmas, reflecting two views on them. As formulas, they are positive unit clauses, which can be re-used in different CD inference steps. In the structural view, they are subterms, or subtrees, of the overall D-term. If they occur multiple times there, they are factored in the minimal DAG of the overall D-term. The views are linked in that the formula of a lemma is the MGT of its D-term. The compacted size measure specified above takes into account the compression achievable by unit/subtree lemmas. From the perspective of proof structure compression methods, unit/subtree lemmas have the property that the compression target is unique, because each tree is represented by a unique minimal DAG. CM-CT provers do not support such lemmas, which is the main reason for their notorious weakness on CD problems.

2.4 SGCD—Structure Generating Theorem Proving

SGCD (Structure Generating Theorem Proving for Condensed Detachment) [74] is the central system used in our experiments, both as prover and as lemma generator. It realizes an approach to first-order theorem proving, not fully recognized before, that combines techniques known from the CM and RS. It generalizes (for CD problems) bottom-up preprocessing for and with CM-CT provers [60] and hypertableaux [4]. SGCD works by enumerating proof structures together with unification of associated formulas, which is also the core method of the CM-CT provers. Structures for which unification fails are excluded. Each structure appears at most once in the enumeration.

Let the proof structures be D-terms. Partition the set of all D-terms into levels such that the strict subterms of a D-term appear only at lower levels. Tree size and height are examples of such level measures. Let

$$\begin{aligned} \texttt {enum\_dterm\_mgt\_pairs(}\textit{+Level}\texttt {,}~\,\textit{?DTerm}\texttt {,}~\,\textit{?Formula}\texttt {)} \end{aligned}$$

be a PrologFootnote 2 predicate enumerating D-terms and corresponding MGTs at a certain level, with respect to given axioms that do not explicitly appear as parameter. We say that the predicate generates these pairs in an axiom-driven way. If the predicate is invoked with the formula argument instantiated by a ground formula, it enumerates D-terms that prove the formula at the specified level. The predicate is then used goal-driven, like a CM-CT prover. Invoking it for increasing level values realizes iterative deepening. There are further instantiation possibilities: if only the D-term is instantiated and the level is that of the D-term, its MGT is computed. If both D-term and formula are instantiated, the predicate acts as verifier.

The implementation includes several generators, concrete variants of the enum_dterm_mgt_pairs predicate for specific level characterizations. SGCD maintains a cache of \(\langle \textit{level}, \textit{D-term}, \textit{formula} \rangle \) triples used to obtain solutions for subproblems in levels below the calling level. This cache is highly configurable. In particular, the number of entries can be limited, where only the best triples according to specified criteria are kept. Typical criteria are height or size of the formula, a heuristic shared with RS provers. Subsumed entries can be deleted, another feature in common with RS. Novel criteria are also supported, some of which relate the formula to the goal. Most criteria are based on the formula component of the triples, the MGT. Due to rigid variables [21], MGTs are not usually available in CM-CT provers [76] and cannot be used as a basis for heuristics.

When lemmas are provided to SGCD, they are used to initialize the cache, replacing search at levels lower than the calling level.Footnote 3 SGCD further maintains a set of abandoned \(\langle \textit{level}, \textit{D-term}, \textit{formula} \rangle \) triples, those that are generated but do not qualify for entering the cache or were removed from the cache. These are kept as a source for heuristic evaluation of other triples and for lemma generation.

For theorem proving, SGCD proceeds as shown in Fig. 1. Input parameter g is the goal formula, while parameters \( maxLevel \) and \( preAddMaxLevel \) are configurable. enum_dterm_mgt_pairs represents a particular generator that is also configurable. It enumerates argument bindings nondeterministically: if it succeeds in the inner loop, the D-term d is returned via an exception. C is the cache. The procedure merge_news_into_cache(N, C) merges newly generated \(\langle \textit{level}, \textit{D-term}, \textit{formula}\rangle \) triples N into the cache C. If \( maxLevel \) is configured as 0, the method proceeds in purely goal-driven mode with the inner loop performing iterative deepening on the level m. Similarity to CM-CT provers can be shown empirically by comparing the sets of solved TPTP problems [74]. Generally successful configurations of \( preAddMaxLevel \) typically have values 0–3.

Fig. 1. The nested loops of the SGCD theorem proving method.
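
Since the figure itself is not reproduced here, the following Python-style sketch is a rough reconstruction from the textual description only; the exact level indexing and parameter handling of the original Fig. 1 may differ, and enum_dterm_mgt_pairs and merge_news_into_cache are assumed interfaces mirroring the generator and cache-merge operations named above.

```python
from itertools import count

def sgcd(g, max_level, pre_add_max_level, enum_dterm_mgt_pairs,
         merge_news_into_cache):
    # C: cache of <level, D-term, formula> triples, used by the generator
    # to obtain solutions of subproblems at levels below the calling level.
    C = []
    for level in count(0):                    # levels already present in C
        # Goal-driven inner loop: try to prove g at levels just above the
        # cache. With max_level == 0 this is plain iterative deepening on m.
        m = level + 1
        while max_level == 0 or m <= level + 1 + pre_add_max_level:
            for d, _formula in enum_dterm_mgt_pairs(m, formula=g, cache=C):
                return d                      # proof found (Fig. 1 exits here)
            m += 1
        if level >= max_level:
            return None                       # no proof within max_level
        # Axiom-driven phase: generate <level+1, D-term, MGT> triples and
        # merge them into the cache, keeping only the best entries.
        N = [(level + 1, d, f)
             for d, f in enum_dterm_mgt_pairs(level + 1, cache=C)]
        merge_news_into_cache(N, C)
```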

3 Improving a Prover via Learned Lemma Selection

We employ machine learning to identify lemmas that can enhance proof search. Unlike the standard supervised scenario in which we learn from some training problems and evaluate performance on separate test problems, we take a reinforcement learning approach of self-improvement that has already been successfully applied in several theorem proving projects since [68]. In this approach, we perform proof search with a base prover on our entire problem set and learn from the proof attempts.Footnote 4 The learning-assisted prover is then evaluated on the same problem set to see whether it can solve more or different problems. If there is improvement, the process can be repeated until performance saturates. In more detail, our system has the following components.

  1. Base Prover: Performs proof search; its main role is to provide training data for the utility model.

  2. Utility Model: The model takes \(\langle \)conjecture, lemma, axioms\(\rangle \) triples and outputs a utility score, i.e., some measure of how useful the lemma is for proving the conjecture from the axioms. The utility model is trained from the D-terms emitted by the base prover.

  3. Lemma Generator: Produces a large set of candidate lemmas for each problem separately. All candidates are derivable from the axioms.

  4. Evaluated Prover: For each problem, we evaluate the candidate set with the utility model and select the best lemmas. These are provided to the evaluated prover, which performs proof search on the problem set. The evaluated prover can be identical to or different from the base prover.

Base Prover. Any prover that emits proofs as D-terms is suitable as a base prover. Given a D-term proof tree P of some formula C from axiom set As, any connected subgraph S of P can be considered as the proof of a lemma L. If S is a full tree, it proves a unit lemma, which is the formula associated with its root. Otherwise, it proves a Horn clause, whose head is the root formula of S and whose body corresponds to the open leaves of S. We currently focus on unit lemmas and leave more general subgraphs for future work. To approximate the utility of lemma L for proving C from As, there are several easy-to-compute candidates, such as the reduction in tree size, height or compacted size. A more refined measure is obtained if we reprove C with the lemma L added to the axioms As and observe how the number of inference steps changes.Footnote 5 This is slower to compute, but it takes into account the particularities of the base prover and hence provides more focused guidance. In our experiments, we find that the best performance is obtained by reproving and then computing utility U as the inference step reduction normalized into \([-1, 1]\), where \(-1\) means that the problem could not be solved within the original inference limit and 1 is assigned to the lemma that yields the greatest speedup. We end up with tuples \(\langle C, As, L, U\rangle \) to learn from.
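
A minimal sketch of one way to realize this normalization (our reading of the description above, not the authors' exact code): each lemma's inference-step reduction on reproving is scaled by the best reduction observed for the problem, and failures within the original limit receive \(-1\).

```python
def lemma_utilities(baseline_steps, steps_with_lemma):
    """baseline_steps: inference steps of the original proof (also the limit).
    steps_with_lemma: dict lemma -> inference steps when reproving with the
    lemma added to the axioms, or None if unsolved within the limit."""
    reductions = {lem: (None if s is None else baseline_steps - s)
                  for lem, s in steps_with_lemma.items()}
    best = max((r for r in reductions.values() if r is not None), default=0)
    utilities = {}
    for lem, r in reductions.items():
        if r is None:
            utilities[lem] = -1.0                       # unsolved within limit
        elif best <= 0:
            utilities[lem] = 0.0                        # no lemma speeds it up
        else:
            utilities[lem] = max(-1.0, min(1.0, r / best))  # best lemma -> 1.0
    return utilities
```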

Utility Model Training. We experiment with gradient-descent optimization for two classes of functions: linear models and graph neural networks (GNNs). Our linear model is based on 51 manually-identified features, some of them novel, described in [54, App. A]. For each feature \(f_i\) there is an associated weight parameter \(w_i\) to produce the final predicted utility

$$ U(\boldsymbol{f}; \boldsymbol{w}) = \sum _i f_i w_i$$

The second, more involved model is a GNN. Describing this model is beyond the scope of this paper: see e.g. [58] for a gentle introduction. What is crucial for our purposes is that no manual feature extraction is involved: a specialized neural network processes the D-terms of involved formulas directly and learns to extract useful features during optimization. As input, the model is given a graph, losslessly encoding D-terms of the lemma to be evaluated, the conjecture and the axioms. The precise network architecture is provided in [54, App. B].
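
For the linear model, a minimal PyTorch sketch follows; only the functional form \(U(\boldsymbol{f}; \boldsymbol{w}) = \sum _i f_i w_i\) over the 51 features is taken from the text, while the loss, optimizer and training-loop details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearUtilityModel(nn.Module):
    def __init__(self, num_features: int = 51):
        super().__init__()
        # U(f; w) = sum_i f_i * w_i, one weight per hand-engineered feature
        self.w = nn.Linear(num_features, 1, bias=False)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.w(features).squeeze(-1)

# features: one row of 51 feature values per <conjecture, lemma, axioms>
# triple; targets: the utilities U in [-1, 1] obtained from reproving.
model = LinearUtilityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```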

Candidate Lemma Generation. Candidate lemmas are generated separately for each problem via the structure enumeration mechanism of SGCD, as outlined in Fig. 1. The goal g is provided and \( preAddMaxLevel \) is set to 0, making SGCD proceed axiom-driven, generating lemmas level by level. However, it still intersperses the goal-driven inner loop, which only tries to prove the goal at the level directly above the last cached level. SGCD may terminate with a proof, in which case further lemma generation is pointless. Otherwise it terminates when \( maxLevel \) is reached, generation of new levels is exhausted, or a time limit expires. We then use the cache C and the abandoned triples as the generated output lemmas. There are many ways to configure SGCD. We obtained the best results generating by tree size and by PSP-level (explained below), combined with known good heuristic restrictions: in particular, we restrict the size of the lemma formulas to the maximum of the sizes of the axioms and the goal, multiplied by some factor (usually 2–5), and we restrict the number of elements in the cache, typically to 1,000. The lemmas are sorted by formula size measures, smaller preferred, to determine which are retained in the cache.
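
A small sketch of these two heuristic restrictions (the size bound and cache limit are stated in the text; the tuple-based term encoding and the choice of size measure, occurrences of non-constant function symbols, are our own assumptions):

```python
def formula_size(t):
    """Occurrences of non-constant function symbols in a term encoded as
    nested tuples, with variables and constants as strings."""
    return 0 if isinstance(t, str) else 1 + sum(formula_size(a) for a in t[1:])

def restrict_candidates(lemmas, axioms, goal, factor=3, cache_limit=1000):
    bound = factor * max(formula_size(f) for f in list(axioms) + [goal])
    kept = [lem for lem in lemmas if formula_size(lem) <= bound]
    kept.sort(key=formula_size)               # smaller formulas preferred
    return kept[:cache_limit]
```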

Proof structure generation by PSP-level is a novel technique introduced in [74, 76], based on an observation by Łukasiewicz and Meredith. In a detachment step, often the D-term that proves one premise is a subterm of the D-term that proves the other. We turn this relationship into a proof structure enumeration method: structures in level \(n+1\) are D-terms where one argument D-term is at level n and the other argument is a subterm of that D-term. The method is incomplete, but combines features of DAG enumeration while being compatible with a simple global lemma maintenance as realized with SGCD ’s cache [76].
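
A minimal sketch of the enumeration principle (structure only; unification, MGT filtering and cross-level duplicate handling are omitted, and the level-0 convention is illustrative):

```python
def subterms(d):
    yield d
    if not isinstance(d, int):
        yield from subterms(d[1])
        yield from subterms(d[2])

def psp_level(n, axioms=(1,)):
    """D-terms at PSP-level n: level 0 holds the axiom leaves; at level n+1,
    one argument is a level-n D-term d and the other is a subterm of d."""
    if n == 0:
        return list(axioms)
    out = []
    for d in psp_level(n - 1, axioms):
        for s in subterms(d):
            for t in (('D', d, s), ('D', s, d)):
                if t not in out:
                    out.append(t)
    return out

print(psp_level(2))   # PSP-level-2 D-terms over the single axiom 1
```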

Table 1. Features of the considered provers: whether their proofs are available as D-terms (possibly after some conversion), whether they were used with replacing lemma incorporation (Sect. 2), whether they operate goal-driven, and the underlying method.

Evaluated Prover. For each problem, we evaluate the candidate set with the utility model and select the k lemmas with the highest predicted utility, where k is a hyperparameter. The evaluated prover then tries to solve the problems with the help of the selected lemmas. The lemmas can either be treated as additional axioms, which is applicable to any prover, or receive specialized treatment if the prover provides for it: in particular, SGCD and CCS-Vanilla use the lemmas to replace inner lemma enumeration.Footnote 6 Since treating lemmas as additional axioms requires no specialized support, any prover can serve as evaluated prover. If, however, it is the base prover or any other system that emits proofs as D-terms, the learning procedure can be iterated as long as new problems are solved.
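
The selection step itself is straightforward; a sketch (utility_model is an assumed interface returning the predicted utility of a lemma for the given conjecture and axioms):

```python
def select_lemmas(candidates, conjecture, axioms, utility_model, k=200):
    """Keep the k candidate lemmas with the highest predicted utility."""
    scored = [(utility_model(conjecture, lem, axioms), lem) for lem in candidates]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [lem for _, lem in scored[:k]]
```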

3.1 Learning-Based Experiments

We experiment with a total of 312 CD problems, including all 196 pure CD problems from TPTP 8.1.2 [64], enriched with single-axiom versions of all the problems to which a technique by Tarski [37], as specified by Rezuş [56], was applicable. We test several representative ATP systems, including state-of-the-art systems for both general first-order reasoning and for CD problems.

Table 1 gives an overview of the considered provers. CCS-Vanilla is CCS  [73] in a restricted configuration to find only those CD proofs with minimal compacted size, identifying problems that can clearly be solved with exhaustive search. It operates goal-driven, like the CM-CT provers, but by enumerating DAGs instead of trees through a local lemma maintenance mechanism. Vampire and E represent the state of the art of first-order ATP. Provers that produce D-terms as proofs (SGCD, Prover9, CMProver, CCS) can serve as base provers. We always rely on SGCD for lemma candidate generation. All provers are recent public versions: Vampire 4.5.1, E  2.6, leanCoP 2.1. We provide results in terms of time limits, although for the Prolog provers SGCD, CMProver and CCS-Vanilla we used a roughly-equivalent inference limit to avoid fluctuations due to server workload.

Improving the Base Prover. In our first experiment, we evaluate base provers after learning from their own proof attempts. The provers are given the \(k=200\) best lemmas according to the linear utility model. Table 2Footnote 7 shows the problems solved by four base provers without lemmas (Base case) and with two iterations of learning. The Total row gives the number of theorems proved in any of the three iterations shown. The stronger the base prover, the harder it is to improve. CMProver and CCS-Vanilla are purely goal-driven and benefit greatly, reaching over 37% improvement for larger time limits. SGCD and Prover9 improve by over 5% for shorter time limits, but this effect gradually vanishes as the time limit is increased.

Table 2. Number of problems solved over 2 iterations of training a linear model.

An analysis, provided in [54, App. D], reveals that in the proofs that were not found during lemma generation but were found by SGCD once lemmas were provided, 63–96% of the distinct subterms originate from the lemmas, i.e., a substantial portion of these proofs is built up from the provided lemmas.

Learned Lemmas to Enhance Other Provers. Next, we fix SGCD as base prover and evaluate other provers, namely Vampire, E, Prover9 and leanCoP. Again, the provers are given the \(k=200\) best lemmas according to the linear utility model. Table 3 shows that the greatest boost is for the purely goal-driven leanCoP, with over 40% improvement for all time limits. Second is Vampire with 8–15% improvement, followed by Prover9 and E with around 3% improvement. Interestingly, E does not solve more problems with the lemmas, but it solves different ones, hence the improvement. These results suggest that the benefits of the lemma selector transfer well across provers.

Table 3. Number of problems solved by Vampire (casc), E (autoschedule), Prover9 and leanCoP without and with additional lemmas using various time limits.

Changing the Number of Lemmas Added. Adding lemmas has the potential to shorten proofs, but it also widens the search space, so it is not obvious how many lemmas are beneficial. In the next experiment, we again fix SGCD as base prover and evaluate SGCD and Vampire with different numbers of selected lemmas. Table 4 shows that as few as 25 added lemmas yield substantial improvement, 7% for Vampire and 4% for SGCD, and performance does not drop as we add more lemmas: even at 500 we see no negative effect of the expanded search space.

Table 4. Number of problems solved by Vampire (casc) and SGCD as we alter the number k of supplemented lemmas. We use a time limit of 100 s.

Linear vs GNN Model. The preceding experiments suggest that even a simple linear model can provide useful guidance when features are carefully selected. Table 5 shows that the GNN, which processes the formulas directly and has no access to expert-designed features, also successfully learns to identify useful lemmas for SGCD and even slightly surpasses the linear model. LCL125-1 is solved only by the GNN-assisted prover; no other configuration solves it, even at extremely large time limits.

Table 5. Number of problems solved by SGCD over 2 iterations of training both a linear and a graph neural network model, for time limits 50 s, 100 s, 500 s and 30 min.

3.2 Discussion of Learning-Based Experiments

When enhanced by learning-based lemma selection, SGCD solves 287 of the 312 problems. These include 28 problems not solved by the leading first-order prover Vampire  [29], which solves 263 problems in its CASC [63] portfolio mode. Supplemented with our lemmas, Vampire is boosted to 284 solved problems. In combination, boosted SGCD and Vampire give 293 solved problems. Taking into account the solutions obtained by further provers with our lemmas, we obtain a total of 297. For detailed results see [54, App. E] and http://cs.christophwernhard.com/cdtools/exp-lemmas/lemmas.html.

A notable observation is that all systems, with the exception of E, improve when provided with selected lemmas. We argue that our framework addresses fundamental weaknesses both of purely goal-driven systems such as CMProver, leanCoP and CCS-Vanilla, and of saturation-style systems such as Vampire and E. For the former, it is their inability to generate lemmas, which results in unduly long proofs. For the latter, it is their unrestricted expansion of the branching of the search space. We find that goal-driven systems demonstrate huge improvement when lemmas are added: usually 20–40%, depending on the configuration. The improvement is much more modest for saturation-style systems, partly because their baselines are already stronger and partly because learned lemma selection still leaves much room for improvement; the latter is the focus of our immediate future work. SGCD already provides a balance between goal-driven search and axiom-driven lemma generation, and we only see significant improvement from lemmas when the time limit on proof search is small. Our linear model based on manual features allows for exploiting expert knowledge. However, we see more potential in automated feature extraction via GNNs. The fact that the two models perform similarly suggests that we are not providing enough training data for the GNN to manifest its full capabilities.

4 Proving LCL073-1

LCL073-1 was proven by Meredith in the early 1950s with substitution and detachment [42], but it remains outstandingly hard for ATP, where it came to attention in 1992 [40]; the TPTP has listed it with rating 1.0 and status Unknown since 1997. Only Wos proved it, in the year 2000, with several invocations of OTTER [84], transferring output and insight between runs. The problem has a single axiom,

$$\begin{aligned} \textsf{P}(\textsf{i}(\textsf{i}(\textsf{i}(\textsf{i}(\textsf{i}(x,y),\textsf{i}(\textsf{n}(z),\textsf{n}(u))),z),v),\textsf{i}(\textsf{i}(v,x),\textsf{i}(u,x)))), \end{aligned}$$

and the goal \(\textsf{P}(\textsf{i}(\textsf{i}(\textsf{a},\textsf{b}),\textsf{i}(\textsf{i}(\textsf{b},\textsf{c}),\textsf{i}(\textsf{a},\textsf{c}))))\), known as Syll [66]. The wider context is showing that a single axiom entails the elements of a known axiomatization of a propositional logic. Experiments with SGCD in our workflow led to a proof of LCL073-1 (Fig. 2, also [54, App. F]) surprisingly quickly. Its compacted size is 46, between that of Meredith (40, reconstructed with CD in [84]) and that of Wos (74). Our workflow is much simpler than Wos’, basically the same as our other experiments but restricted to one phase of lemma generation and incorporation, with only heuristic lemma selection, no learning. Nevertheless, success is fragile with respect to configuration, where reasons for failure or success are not obvious.

Fig. 2. The D-term of our proof of LCL073-1 represented by factor equations.

Our configuration parameters are not problem specific, although we started out with lemma generation by PSP-level because it led earlier to a short proof of LCL038-1 [74, 76]. We first call SGCD to generate lemmas by PSP-level enumeration, configured with a cache size of 5,000, terminating after 60 s with exhaustion of the search space.Footnote 8 Lemma features are computed for the 98,198 generated lemmas and written to disk, taking another 120 s. Lemmas are then ordered lexicographically according to five features relating to sharing of symbols and subterms with the goal, and to formula dimensions, taking a further 70 s. These five features are lf_h_height, lf_h_excluded_goal_subterms, lf_h_tsize, lf_h_distinct_vars, dcterm_hash, see [54, App. A] for their specification. We now call SGCD again, configured such that it performs PSP-level enumeration for axiom-driven phases, interleaved with level enumeration by height for goal-driven phases with 0 as \( preAddMaxLevel \). It incorporates the first 2,900 ordered lemmasFootnote 9 as input by replacement (Sect. 2). The cache size limit is set to 1,500, a value used in other generally successful configurations. Formulas occurring as subformulas of an earlier-proven formula are excluded, a variation of the organic property [37, 76]. The proof is then found in 20 s, total time elapsed about 270 s.
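
A small sketch of the ordering step (the five feature functions are specified in [54, App. A] and are only assumed interfaces here; smaller feature tuples come first):

```python
def order_and_truncate(lemmas, feature_fns, k=2900):
    """Sort lemmas lexicographically by the given feature functions, e.g.
    [lf_h_height, lf_h_excluded_goal_subterms, lf_h_tsize,
     lf_h_distinct_vars, dcterm_hash], and keep the first k."""
    return sorted(lemmas, key=lambda lem: tuple(f(lem) for f in feature_fns))[:k]
```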

The D-term dimensions \(\langle compacted\ size , tree\ size , height \rangle \) are \(\langle 46, 3276, 40\rangle \), compared to Meredith’s \(\langle 40, 6172, 30 \rangle \)Footnote 10 and Wos’ \(\langle 74, 9207, 48\rangle \). The maximal size (occurrences of non-constant function symbols) of a lemma formula (MGT of a subproof) in the proof is 19, the maximal height (tree height, disregarding the predicate symbol) 9, and the maximal number of variables 7. Of the 46 lemmas in the proof 12 are present in the 2,900 input lemmas. Among the 46 lemma formulas 35 are weakly organic [76] and 4 involve double negation. N-simplification [76] applies to 65 occurrences but does not effect a size reduction. The proof is S- and C-regular [76]. Certain configurations of SGCD for the proving phase also yield further proofs. In experiments so far, these are enumerated after the presented proof and have larger compacted size.

Proof structure enumeration by PSP-level [76] is the main key to finding our proof of LCL073-1. It is used for lemma generation and for axiom-driven proof search, whereas goal-driven phases use height instead. The structure of the proof reflects this: all steps with the exception of the root can be considered PSP steps, i.e. one premise is a subproof of the other. The particular challenge of the problem lies in the fact that it was solved by a human (Meredith). Unlike in recent ATP successes for Boolos’ curious inference [5, 10], where the key is two particular second-order lemmas, the key here is a proof-structural principle for building-up proofs by lemmas. Intuitively it might express a form of economy, building proofs from proofs at hand, that belonged to Meredith’s repertoire.

5 Conclusion

We presented encouraging results on the use of lemmas in proof search. Provers are provided with lemmas generated via structure enumeration, a feature of the CM, and filtered with either learned guidance or manual heuristics. As a first step with this new methodology, we focused on the class of CD problems, where we obtained strong results with our own system and substantial improvement of general first-order provers based on different paradigms, including the long-time competition leader Vampire. Moreover, our approach has led to what is, in a sense, the first automatic proof of the well-known Meredith single axiom problem with TPTP difficulty rating 1.0.

An important and novel aspect in our work was the explicit consideration of proof structures, which for CD have a particularly simple form in D-terms. Proof structures of the CM have a direct correspondence to these [76], such that the CM may guide the way to generalizations for more expressive logics. Another course of generalization is to move from unit lemmas, i.e. sharing of subtrees of D-terms, to more powerful lemmas. Preliminary work shows a correspondence between Horn clause lemmas, D-terms with variables, proofs in the connection structure calculus [15], and combinatory compression [73].

The learning-based experiments show little difference in performance between using a simple linear model and a more sophisticated graph neural network. We believe this is due to the small problem corpus, which yields a limited training signal. Hence, we plan to scale the system up to larger problem sets.

Our work also sheds new light on perspectives for the CM. It is well-known that the lack of inherent lemma maintenance is a disadvantage of the CM compared to resolution, which can be overcome with the connection structure calculus [15], a generalization of the CM. Here we see in experiments a drastic improvement of the CM-CT provers by supplementing their input with externally generated lemmas. SGCD, which grew out of the CM-CT approach and integrates repeated lemma generation into the proving process, keeps up with RS provers on CD problems, and can even be applied to improve these by supplying its lemmas as additional input.