# Internal Guidance for Satallax

## Abstract

We propose a new internal guidance method for automated theorem provers based on the given-clause algorithm. Our method influences the choice of unprocessed clauses using positive and negative examples from previous proofs. To this end, we present an efficient scheme for Naive Bayesian classification by generalising label occurrences to types with monoid structure. This makes it possible to extend existing fast classifiers, which consider only positive examples, with negative ones. We implement the method in the higher-order logic prover Satallax, where we modify the delay with which propositions are processed. We evaluated our method on a simply-typed higher-order logic version of the Flyspeck project, where it solves 26 % more problems than Satallax without internal guidance.

## 1 Introduction

Experience can be described as knowing which methods to apply in which context. It is a result of experiments, which can show a method to either fail or succeed in a certain situation. Mathematicians solve problems by experience. When solving a problem, mathematicians gain experience, which in the future can help them to solve harder problems that they would not have been able to solve without the experience gained before.

- **Premise selection**: Preselecting a set of axioms for a problem can be done as a preprocessing step or inside the ATP at the beginning of proof search. Examples of this technique are the Sumo INference Engine (SInE) [HV11] and E.T. [KSUV15].
- **Internal guidance**: Unlike premise selection, internal guidance influences choices made during the proof search. The *hints* technique [Ver96] was among the earliest attempts to directly influence proof search by learning from previous proofs. Other systems are E/TSM [Sch00], an extension of E [Sch13] with term space maps, as well as MaLeCoP [UVŠ11] and FEMaLeCoP [KU15], which are versions of leanCoP [Ott08] extended with Naive Bayesian learning.
- **Learning of strategies**: Finding good ATP settings automatically has been researched, for example, in the Blind Strategymaker (BliStr) project [Urb15].
- **Learning of strategy choice**: Once good ATP strategies for different sets of problems have been found, it is not directly clear which strategies to apply, and for how long, when encountering a new problem. This problem was treated in the Machine Learning of Strategies (MaLeS) system [Kü14].

In this paper, we present an internal guidance algorithm for ATPs that use (variations of) the given-clause algorithm. Specifically, we study a Naive Bayesian classification method, introduced for the connection calculus in FEMaLeCoP, and generalise it by measuring label occurrences with an arbitrary type having monoid structure in place of a single number. This generalisation has the benefit that it can handle both positive and negative occurrences. As a proof of concept, we implement the algorithm in the ATP Satallax [Bro12], using no features at all; even so, the guided prover solves 26 % more problems given the same amount of time, and solves about as many problems in 1 s as the unguided prover does in 2 s.

## 2 Naive Bayesian Classifier with Monoids

### 2.1 Motivation

Many automated theorem provers have a proof state in which they make decisions, by ranking available choices (e.g. which proposition to process) and choosing the best one. This is related to the classification problem in machine learning, which takes data about previous decisions, i.e. which situation has led to which choice, and then orders choices by usefulness for the current situation.

In other proof searches, processing Eq. 2 in a certain prover state will not contribute towards the final proof. We call such situations negative examples.

Intuitively, we would like to apply propositions in situations that are similar to those in which the propositions were useful, and avoid processing propositions in situations similar to those where the propositions were useless. In general, examples (positive and negative) can be characterised by a prover state \(F\) and a proposition \(l\) that was processed in state \(F\). This makes it possible to treat the choice of propositions as a classification problem. In the next section, we show how to rank choices based on previous experience.

### 2.2 Classifiers with Positive Examples

A classifier takes pairs \((F, l)\), relating a set of features \(F\) with a label \(l\), and produces a function that, given a set of features, predicts a label. Classifiers can be characterised by a function \(r(l, F)\), which represents the relevance of a label with respect to a set of features. For internal guidance, we use \(r\) to estimate the relevance of a clause \(l\) to process in the current prover state \(F\).

The probabilities occurring in \(r(l, F)\) are approximated following the Naive Bayesian scheme of FEMaLeCoP.^{1}

### 2.3 Generalised Classifiers

In our experiments, we found negative training examples to be crucial for internal guidance. Therefore, we generalised the classifier to represent the type of occurrences as a *commutative monoid*.

### Definition 1

A pair \((M, +)\) is a *monoid* if \(+\) is associative and has a neutral element \(0 \in M\); that is, for all \(x, y, z \in M\), \((x + y) + z = x + (y + z)\) and \(x + 0 = 0 + x = x\). If furthermore \(x + y = y + x\) for all \(x, y \in M\), then the monoid is *commutative*.

The generalised classifier is instantiated with a commutative monoid \((M, +)\) and reads triples \((F, l, o)\), which in addition to features and label now store the label occurrence \(o \in M\). For example, if the classifier is to support positive and negative examples, then one can use the monoid \((\mathbb {N} \times \mathbb {N}, +_2)\), where the first and second elements of the pair represent the numbers of positive and negative occurrences, respectively, the \(+_2\) operation is pairwise addition, and the neutral element is \((0,0)\). A triple learnt by this classifier could be \((F, l, (1, 2))\), meaning that \(l\) occurs with \(F\) once in a positive and twice in a negative way. Commutativity ensures that the order in which the classifier is trained does not matter.
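As an illustration, the \((\mathbb {N} \times \mathbb {N}, +_2)\) instantiation and the accumulation of triples can be sketched as follows. All names here are ours for exposition; this is not the Satallax or FEMaLeCoP source:

```python
# Sketch of the generalised classifier's training store, using the
# (N x N, +2) monoid: (positive occurrences, negative occurrences).
from collections import defaultdict

ZERO = (0, 0)  # neutral element of the monoid

def madd(a, b):
    """Pairwise addition: the commutative monoid operation +2."""
    return (a[0] + b[0], a[1] + b[1])

class MonoidClassifier:
    def __init__(self):
        # D[l] accumulates the occurrence of label l over all examples;
        # Df[(l, f)] does the same per label-feature pair.
        self.D = defaultdict(lambda: ZERO)
        self.Df = defaultdict(lambda: ZERO)

    def train(self, features, label, occurrence):
        """Add one triple (F, l, o). Commutativity of madd guarantees
        that the training order does not matter."""
        self.D[label] = madd(self.D[label], occurrence)
        for f in features:
            self.Df[(label, f)] = madd(self.Df[(label, f)], occurrence)

clf = MonoidClassifier()
clf.train({"f1", "f2"}, "l", (1, 0))  # one positive example
clf.train({"f1"}, "l", (0, 2))        # two negative examples
print(clf.D["l"])                     # -> (1, 2)
```

Because updates are plain monoid additions into hash tables, new training examples can be merged into existing classification data in constant time per label-feature pair.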

This factor is called *confidence* and models our intuition that labels which always appear in the same role (say, as positive examples) should have a greater influence than more ambivalent labels. For example, if a label occurs about equally often as a positive and as a negative example, its confidence is approximately 0, whereas if a label is almost exclusively positive or negative, its confidence is 1.
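The exact confidence formula is not reproduced in this excerpt; one definition consistent with the behaviour just described, for an occurrence \((p, n)\), is \(|p - n| / (p + n)\). The sketch below uses this assumed formula, which may differ from the paper's actual definition:

```python
def confidence(occurrence):
    """Hypothetical confidence of a label occurrence (p, n): 1 when the
    label is purely positive or purely negative, near 0 when balanced.
    One formula matching the behaviour described in the text; the
    paper's actual definition may differ."""
    p, n = occurrence
    if p + n == 0:
        return 0.0
    return abs(p - n) / (p + n)

print(confidence((10, 0)))  # purely positive  -> 1.0
print(confidence((5, 5)))   # ambivalent       -> 0.0
print(confidence((9, 1)))   # mostly positive  -> 0.8
```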

We call \(D_l\), \(D_{l,f}\), and \(\text {idf}\) classification data. They are precalculated to allow fast classification. Furthermore, new training examples can be added to existing classification data efficiently, similarly to [KU15].

## 3 Learning Scenarios

In this section, we consider ATPs as black boxes, which take as input a problem and classification data for internal guidance, and return as output training data (empty if the ATP did not find a proof).

**On-line learning**: We run the ATP on every problem with classification data. For every problem the ATP solves, we update the classifier with the training data from the ATP proof.

**Off-line learning**: We first run the ATP on all problems without classification data, saving training data for every problem solved. We then create classification data from the training data and rerun the ATP with the classifier on all problems.
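Both scenarios can be sketched as driver loops around the black-box prover. Here `prove` and `update` are hypothetical interfaces standing for the ATP run and the classifier update; neither name comes from the paper:

```python
def online_learning(problems, prove, update):
    """Run with accumulated classification data; learn after each proof.
    `prove(problem, data)` returns a training datum or None on failure."""
    classification_data = {}
    solved = []
    for p in problems:
        training_datum = prove(p, classification_data)
        if training_datum is not None:
            solved.append(p)
            update(classification_data, training_datum)
    return solved

def offline_learning(problems, prove, update):
    """First pass without guidance to collect training data, then a
    second pass over all problems with the resulting classifier."""
    classification_data = {}
    for p in problems:
        training_datum = prove(p, {})
        if training_datum is not None:
            update(classification_data, training_datum)
    return [p for p in problems
            if prove(p, classification_data) is not None]
```

Note that on-line learning can already exploit experience from earlier problems in the same run, whereas off-line learning pays for a full unguided pass first.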

## 4 Internal Guidance for Given-Clause Provers

Variants of the given-clause algorithm are commonly used in refutation-based ATPs, such as Vampire [KV13] or E [Sch13].^{2} We introduce a simple version of the algorithm: Given an initial set of clauses to refute, the set of *unprocessed* clauses is initialised with these clauses, and the set of *processed* clauses is empty. At every iteration of the algorithm, a *given clause* is selected from the unprocessed clauses and moved to the processed clauses, possibly generating new clauses which are moved to the unprocessed clauses. The algorithm terminates as soon as either the set of unprocessed clauses is empty or the empty clause has been generated.
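The loop just described can be sketched as follows, with `select` as the choice point that internal guidance influences and `generate` standing in for the prover's inference rules (both are parameters here, not part of the paper):

```python
def given_clause(initial_clauses, select, generate):
    """Skeleton of the simple given-clause loop described in the text.
    Clauses are frozensets of literals; frozenset() is the empty clause."""
    EMPTY = frozenset()
    unprocessed = set(initial_clauses)
    processed = set()
    while unprocessed:
        # the choice point that internal guidance influences
        given = select(unprocessed)
        unprocessed.remove(given)
        processed.add(given)
        new = generate(given, processed)
        if EMPTY in new:        # the empty clause was generated: refutation
            return "refuted"
        unprocessed |= new - processed
    return "saturated"
```

With `generate` instantiated to a toy binary-resolution rule, the loop refutes \(\{p\}, \{\lnot p\}\) and saturates on satisfiable inputs.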

The integration of our internal guidance method into an ATP with given-clause algorithm involves two tasks: The recording of training data, and the ranking of unprocessed clauses, which influences the choice of the given clause. To reduce the amount of data an ATP has to load for internal guidance, we process training data and transform it into classification data outside of the ATP. We describe these tasks below in the order they are executed when no internal guidance data is present yet.

### 4.1 Recording Training Data

- **In situ**: Information about clause usage is recorded every time an unprocessed clause gets processed. This method allows for a more expressive prover state characterisation; on the other hand, we found it to decrease the proof success rate, as the recording of proof data slows down inference.
- **Post mortem**: Information about clause usage is reconstructed only when a proof was found. As this method does not place any overhead on the proof search, we resorted to post-mortem recording, which is still sufficiently expressive for our purposes.

For every proof, we save: conjecture (if one was given), axioms \(A\) (premises given in the problem), processed clauses \(C\), and clauses \(C_+\) that were used in the final proof (\(C_+ \subseteq C\)). We call such information for a single proof a *training datum*. We ignore unprocessed clauses, as we cannot easily estimate whether they might have contributed to a proof.

### 4.2 Postprocessing Training Data

- **Skolem filtering**: We discard clauses containing any Skolem constants.
- **Consistent Skolemisation**: We normalise Skolem constants inside all clauses, similarly to [UVŠ11]. That is, a clause \(P(x,y,x)\), where \(x\) and \(y\) are Skolem constants, becomes \(P(c_1,c_2,c_1)\).
- **Consistent normalisation**: Similarly to consistent Skolemisation, we normalise *all* symbols of a clause. That is, \(P(x, y, x)\) as above becomes \(c_1(c_2, c_3, c_2)\). This allows the ATP to discover similar groups of clauses; for example, \(a + b = b + a\) and \(a * b = b * a\) both map to \(c_1(c_2, c_3) = c_1(c_3, c_2)\). On the other hand, this may also map different clauses such as \(P(x)\) and \(Q(z)\) to the same clause. Still, in problem collections which do not share a common set of function constants (such as TPTP), this method is suitable.
- **Inference filtering**: An interesting experiment is to discard all clauses generated during proof search that are not part of the initial clauses.

We denote the consistent Skolemisation/normalisation of a clause \(c\) described above as \(\mathcal {N}(c)\).
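A minimal sketch of consistent normalisation, on an illustrative term representation of nested tuples `(symbol, arg1, ..., argn)`. The representation and function name are ours, not Satallax's:

```python
def normalise(term, mapping=None):
    """Consistently rename all symbols of a term to c1, c2, ... in
    order of first occurrence, so that structurally similar clauses
    map to the same normal form."""
    if mapping is None:
        mapping = {}
    head, *args = term
    if head not in mapping:
        mapping[head] = "c%d" % (len(mapping) + 1)
    return (mapping[head],) + tuple(normalise(a, mapping) for a in args)

# P(x, y, x) with all symbols renamed in order of first occurrence:
print(normalise(("P", ("x",), ("y",), ("x",))))
# -> ('c1', ('c2',), ('c3',), ('c2',))
```

On this representation, the commutativity instances \(a + b = b + a\) and \(a * b = b * a\) indeed normalise to the same term, as described in the text.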

### 4.3 Transforming Training Data to Classification Data

### 4.4 Clause Ranking

This section describes how our internal guidance method influences the choice of unprocessed clauses using a previously constructed classifier.

At the beginning of proof search, the ATP loads the classifier. Some learning ATPs, such as E/TSM [Sch00], select and prepare knowledge relevant to the current problem before the proof search. However, as we store classifier data in a hash table, filtering irrelevant knowledge to the problem at hand would require a relatively slow traversal of the whole table, whereas lookup of knowledge is fast even in the presence of a large number of irrelevant facts. For this reason we do not filter the classification data per problem.

The rank of an unprocessed clause \(c\) is computed from the following components:

- \(r_{\text {ATP}}(c)\) is an ATP function that calculates the relevance of a clause by traditional means (such as weight, age, ...),
- \(F\) is the current prover state,
- \(r(c, F)\) is the Naive Bayesian ranking function as shown in Sect. 2, and
- \(\mathcal {N}(c)\) is the normalisation function as introduced in Subsect. 4.2.
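The combining formula itself is not reproduced in this excerpt; a plausible combination of the listed components, with a hypothetical guidance weight `c` (as tuned in Sect. 5), is the weighted sum below. This is an assumption, not the paper's actual formula:

```python
def guided_relevance(clause, prover_state, r_atp, r_bayes, normalise, c=1.0):
    """Hypothetical combination of the components listed above: the
    ATP's own relevance plus the weighted Naive Bayes rank of the
    normalised clause in the current prover state."""
    return r_atp(clause) + c * r_bayes(normalise(clause), prover_state)
```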

## 5 Tuning of Guidance Parameters

We employed two different methods to automatically find good parameters for internal guidance, such as \(c\), \(c_p\), and \(c_n\) from Sect. 2.

### 5.1 Off-Line Tuning

In the end, we sum up the results of the formula above for all training data, and take the guidance parameters which minimise that sum.

### 5.2 Particle Swarm Optimisation

In particle swarm optimisation (PSO) [KE95], a *particle* is defined by a location \({\varvec{x}}\) (a candidate solution for the optimisation problem) and a velocity \({\varvec{v}}\). Initially, \(p\) particles are created with random locations and velocities. Then, at every iteration of the algorithm, a new velocity is calculated for every particle and the particle is moved by that amount. The new velocity of a particle is
\[{\varvec{v}}(t+1) = \omega {\varvec{v}}(t) + \phi _p {\varvec{r}}_p \circ ({\varvec{b}}_p(t) - {\varvec{x}}(t)) + \phi _g {\varvec{r}}_g \circ ({\varvec{b}}_g(t) - {\varvec{x}}(t)),\]
where \(\circ\) denotes componentwise multiplication and

\({\varvec{v}}(t)\) is the old velocity of the particle,

\({\varvec{b}}_p(t)\) is the location of the best previously found solution of the particle,

\({\varvec{b}}_g(t)\) is the location of the best previously found solution among all particles,

\({\varvec{r}}_p\) and \({\varvec{r}}_g\) are random vectors generated at every evaluation of the formula, and

\(\omega = 0.4\), \(\phi _p = 0.4\), and \(\phi _g = 3.6\) are constants.

We apply PSO to optimise the performance of an ATP on a problem set \(S\). For this, we define \(f({\varvec{x}})\) to be the number of problems in \(S\) that the ATP can solve with its flags set to \({\varvec{x}}\) and with timeout \(t\). We then run PSO and take the best global solution obtained after \(n\) iterations. We fixed \(t = 1\,\text{s}\), \(p = 300\), and \(|S| = 1000\). The algorithm has worst-case execution time \(t \cdot p \cdot n \cdot |S|\).
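A compact sketch of the PSO loop with the stated constants, here written as a minimiser over a continuous box; the objective `f` is a stand-in for the (negated) solved-problem count, and all details beyond the constants \(\omega\), \(\phi_p\), \(\phi_g\) are our assumptions:

```python
import random

OMEGA, PHI_P, PHI_G = 0.4, 0.4, 3.6  # constants from the text

def pso(f, dim, n_particles, n_iters, lo=-1.0, hi=1.0, seed=0):
    """Minimise f over [lo, hi]^dim with particle swarm optimisation."""
    rnd = random.Random(seed)
    xs = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rnd.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    best_x = [list(x) for x in xs]          # per-particle best locations b_p
    best_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: best_f[i])
    g_x, g_f = list(best_x[g]), best_f[g]   # global best b_g
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                rp, rg = rnd.random(), rnd.random()  # random factors r_p, r_g
                vs[i][d] = (OMEGA * vs[i][d]
                            + PHI_P * rp * (best_x[i][d] - xs[i][d])
                            + PHI_G * rg * (g_x[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            fi = f(xs[i])
            if fi < best_f[i]:
                best_x[i], best_f[i] = list(xs[i]), fi
                if fi < g_f:
                    g_x, g_f = list(xs[i]), fi
    return g_x, g_f
```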

## 6 Implementation

We implement our internal guidance in Satallax version 2.8. Satallax is an automated theorem prover for higher-order logic, based on a tableaux calculus with extensionality and choice. It is written in OCaml by Brown [Bro12]. Satallax implements a priority queue, on which it places several kinds of proof search commands: among the 11 different commands in Satallax 2.8 are, for example, proposition processing, mating, and confrontation. Proof search works by processing the commands on the priority queue by descending priority until a proof is found or a timeout is reached. The priorities assigned to these commands are determined by *flags*, which are the settings Satallax uses for proof search. A set of flag settings is called a *mode* (in other ATPs frequently called a *strategy*) and can be chosen by the user upon the start of Satallax. Similarly to other modern ATPs such as Vampire [KV13] or E [Sch13], Satallax also supports timeslicing via *strategies* (in other ATPs frequently called *schedules*), which define a set of modes together with the time for which Satallax runs each mode. Formally, a strategy is a sequence \([(m_1, t_1), \dots , (m_n, t_n)]\), where \(m_i\) is a mode and \(t_i\) the time to run the mode for. The total time of the strategy is the sum of its times, i.e. \(t_{\varSigma }(S) = \sum _{(m, t) \in S} t\).

As a side-effect of this work, we have extended Satallax with the capability of loading user-defined strategies, which was previously not possible as strategies were hard-coded into the program. Furthermore, we implemented modifying flags via the command line, which is useful e.g. to change a flag in all modes of a strategy without editing every mode file of the strategy. We used this extensively in the automatic evaluation of flag settings via PSO, as shown in Subsect. 5.2.

When running Satallax with a strategy \(S\) and a timeout \(t_{max}\), then all the times of the strategy are multiplied by \(\frac{t_{max}}{t_{\varSigma }(S)}\) if \(t_{max} > t_{\varSigma }(S)\), to divide the time between modes appropriately when running Satallax for longer than what the strategy \(S\) specifies. Then, every mode \(m_i\) in the strategy is run sequentially for time \(t_i\) until a proof is found or the timeout \(t_{max}\) is hit.
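The rescaling just described amounts to a simple proportional stretch of the time slices (a sketch; the function and mode names are illustrative):

```python
def schedule(strategy, t_max):
    """Scale a strategy [(mode, time), ...] to a global timeout:
    if t_max exceeds the strategy's total time, stretch all slices
    by t_max / t_total; otherwise the modes keep their given times
    and run until t_max is exhausted."""
    t_total = sum(t for _, t in strategy)
    if t_max > t_total:
        factor = t_max / t_total
        return [(m, t * factor) for m, t in strategy]
    return strategy

print(schedule([("mode1", 1.0), ("mode2", 3.0)], 8.0))
# -> [('mode1', 2.0), ('mode2', 6.0)]
```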

An analysis of several proof searches showed that, on average, more than 90 % of the commands put onto the priority queue of Satallax are proposition processing commands, which correspond to processing a clause from the set of unprocessed clauses in given-clause provers. For that reason, we decided to influence the priority of proposition processing commands, giving propositions with a high probability of being useful a higher priority. The procedure follows the one described in Subsect. 4.4, but the ranking of a proposition is performed when the proposition processing command is put onto the priority queue, and the Naive Bayes rank is added to the priority that Satallax without internal guidance would have assigned to the command. As other types of commands are in the priority queue as well, we pay attention not to influence the priority of proposition processing commands too much (by choosing too large guidance parameters), as this can lead to a disproportionate displacement of the other commands.

To record training data, we use the terms from the proof search that contributed to the final proof. For this, Satallax uses picomus [Bie08] to construct a minimal unsatisfiable core.

**Symbols of processed terms**: We collect the symbols of all processed propositions at the time a proposition is inserted into the priority queue and call these symbols the features of the proposition. However, this experimentally turned out to be a bad choice, because the set of features for each proposition grows quite rapidly, as the set of processed propositions grows monotonically.

**Axioms of the problem**: We associate every proposition processed in a proof search with all the axioms of the problem. In contrast to the method above, this associates the same features to all propositions processed during the proof search for a problem, and is thus more a characterisation of the problem (similar to TPTP characteristics [SB10]) than of the prover state.

In our experiments, just calculating the influence of these features, without them actually influencing the priority, makes Satallax prove fewer problems (due to the additional calculation time), and the positive impact of the features on the proof search does not compensate for this initial loss of problems. Therefore, we currently do not use features at all and associate the empty set of features with all labels, i.e. \(\mathcal {F}(c) = \{\}\). However, it turns out that even without features, learning from previous proofs can be quite effective, as shown in the next section.

## 7 Evaluation

To evaluate the performance of our internal guidance method in Satallax, we used a THF0 [SB10] version (simply-typed higher-order logic) of the top-level theorems of the Flyspeck [HAB+15] project, as generated by Kaliszyk and Urban [KU14]. The test set consists of 14185 problems from topology, geometry, integration, and other fields. The premises of each problem are the actual premises that were used in the Flyspeck proofs, amounting to an average of 84.3 premises per problem.^{3} We used an Intel Core i3-5010U CPU (2.1 GHz Dual Core, 3 MB Cache) and ran at most one instance of Satallax at a time.

To evaluate the performance of the off-line learning scenario described in Sect. 3, we run Satallax on all Flyspeck problems, generating training data whenever Satallax finds a proof. We use the Satallax 2.5 strategy (abbreviated as “S2.5”), because the newest strategy in Satallax 2.8 cannot always retrieve the terms that were used in the final proof, which is important for obtaining training data.

As the off-line learning scenario involves evaluating every problem twice (once to generate training data and once to prove the problem with internal guidance), it doubles the runtime in the worst case, i.e. when no problem is solved. Therefore, one might like to compare its performance to simply running the ATP with a doubled timeout: when increasing the timeout from 1 s to 2 s, the number of solved problems increases from 2717 to 3394. However, this is mostly due to the fact that Satallax tries more modes, so to measure the gain in solved problems more fairly, we create a strategy “S2.5_1s” which contains only those modes that were already used during the 1 s run, and let each of them run for about double the time. This strategy proves 2845 problems in 2 s.

Comparison of postprocessing options:

| Postprocessing | Solved | Lost | Gained |
|---|---|---|---|
| Consistent normalisation | 1911 | 920 | 114 |
| Consistent Skolemisation | 1939 | 885 | 107 |
| None | 2166 | 688 | 137 |
| Skolem filtering | 3395 | 98 | 776 |
| Inference filtering | 3428 | 75 | 786 |

To evaluate on-line learning, we run Satallax on all Flyspeck problems in ascending order, accumulating training data and using it for all subsequent proof searches. We filter away terms in the training data that contain Skolem constants. As a result, Satallax with on-line learning, running 1 s per problem, solves 3374 problems (59 lost, 716 gained), an improvement of 24 %.

## 8 Conclusion

We have shown how to integrate internal guidance into ATPs based on the given-clause algorithm, using positive as well as negative examples. We have demonstrated the usefulness of this method experimentally, showing that on a given test set, we can solve up to 26 % more problems. ATPs with internal guidance could be integrated into hammer systems such as Sledgehammer (which can already reconstruct Satallax proofs [SBP13]) or HOL(y)Hammer [KU14], continually improving their success rate with minimal overhead. It could also be interesting to learn internal guidance for ATPs from subgoals given by the user in previous proofs. Currently, we learn only from problems we could find a proof for, but in the future, we could benefit from also considering proof searches that did not yield proofs. Furthermore, it would be interesting to see the effect of negative examples on existing ATPs with internal guidance, such as FEMaLeCoP. We believe that finding good features that characterise the prover state is important to further improve the learning results.

## Footnotes

- 1. We omitted several constant factors. Furthermore, FEMaLeCoP also considers features of training examples that are *not* part of the features \(F\), although this is a further deviation from the theoretical model.
- 2. Technically, our reference prover Satallax does not implement a given-clause algorithm, as Satallax treats terms instead of clauses, and it interleaves the choice of unprocessed terms with other commands. However, for the sake of internal guidance, we can consider Satallax to implement a version of the given-clause algorithm. We describe the differences in more detail in Sect. 6.

- 3.
The test set, as well as our modified version of Satallax and instructions to recreate our evaluation, can be found under: http://cl-informatik.uibk.ac.at/~mfaerber/satallax.html.

## Notes

### Acknowledgements

We would like to thank Sebastian Joosten and Cezary Kaliszyk for reading initial drafts of the paper, and especially Josef Urban for inspiring discussions and inviting the authors to Prague. Furthermore, we would like to thank the anonymous IJCAR referees for their valuable comments.

This work has been supported by the Austrian Science Fund (FWF) grant P26201 as well as by the European Research Council (ERC) grant AI4REASON.

### References

- [Bie08] Biere, A.: PicoSAT essentials. JSAT **4**(2–4), 75–97 (2008)
- [Bro12] Brown, C.E.: Satallax: an automatic higher-order prover. In: Gramlich, B., Miller, D., Sattler, U. (eds.) IJCAR 2012. LNCS, vol. 7364, pp. 111–117. Springer, Heidelberg (2012)
- [HAB+15] Hales, T.C., Adams, M., Bauer, G., Dang, D.T., Harrison, J., Le Hoang, T., Kaliszyk, C., Magron, V., McLaughlin, S., Nguyen, T.T., Nguyen, T.Q., Nipkow, T., Obua, S., Pleso, J., Rute, J., Solovyev, A., Ta, A.H.T., Tran, T.N., Trieu, D.T., Urban, J., Vu, K.K., Zumkeller, R.: A formal proof of the Kepler conjecture. CoRR, abs/1501.02155 (2015)
- [HV11] Hoder, K., Voronkov, A.: Sine qua non for large theory reasoning. In: Bjørner, N., Sofronie-Stokkermans, V. (eds.) CADE 2011. LNCS, vol. 6803, pp. 299–314. Springer, Heidelberg (2011)
- [KE95] Kennedy, J., Eberhart, R.: Particle swarm optimization. In: IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948, November 1995
- [KSUV15] Kaliszyk, C., Schulz, S., Urban, J., Vyskocil, J.: System description: E.T. 0.1. In: Felty, A.P., Middeldorp, A. (eds.) CADE-25. LNCS (LNAI), vol. 9195, pp. 389–398. Springer, Heidelberg (2015)
- [KU14] Kaliszyk, C., Urban, J.: Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning **53**(2), 173–213 (2014)
- [KU15] Kaliszyk, C., Urban, J.: FEMaLeCoP: fairly efficient machine learning connection prover. In: Davis, M., et al. (eds.) LPAR-20 2015. LNCS, vol. 9450, pp. 88–96. Springer, Heidelberg (2015). doi:10.1007/978-3-662-48899-7_7
- [KV13] Kovács, L., Voronkov, A.: First-order theorem proving and Vampire. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 1–35. Springer, Heidelberg (2013)
- [Kü14] Kühlwein, D.: Machine learning for automated reasoning. Ph.D. thesis, Radboud Universiteit Nijmegen, April 2014
- [Ott08] Otten, J.: \(\sf leanCoP 2.0\) and \(\sf ileanCoP 1.2\): high performance lean theorem proving in classical and intuitionistic logic (system descriptions). In: Armando, A., Baumgartner, P., Dowek, G. (eds.) IJCAR 2008. LNCS (LNAI), vol. 5195, pp. 283–291. Springer, Heidelberg (2008)
- [SB10] Sutcliffe, G., Benzmüller, C.: Automated reasoning in higher-order logic using the TPTP THF infrastructure. J. Formalized Reasoning **3**(1), 1–27 (2010)
- [SBP13] Sultana, N., Blanchette, J.C., Paulson, L.C.: LEO-II and Satallax on the Sledgehammer test bench. J. Appl. Logic **11**(1), 91–102 (2013)
- [Sch00] Schulz, S.: Learning Search Control Knowledge for Equational Deduction. DISKI, vol. 230. Akademische Verlagsgesellschaft Aka, Berlin (2000)
- [Sch13] Schulz, S.: System description: E 1.8. In: McMillan, K., Middeldorp, A., Voronkov, A. (eds.) LPAR-19 2013. LNCS, vol. 8312, pp. 735–743. Springer, Heidelberg (2013)
- [Urb15] Urban, J.: BliStr: the blind strategy maker. In: Gottlob, G., Sutcliffe, G., Voronkov, A. (eds.) GCAI 2015, Global Conference on Artificial Intelligence. EPiC Series in Computing, vol. 36, pp. 312–319. EasyChair (2015)
- [UVŠ11] Urban, J., Vyskočil, J., Štěpánek, P.: MaLeCoP: machine learning connection prover. In: Brünnler, K., Metcalfe, G. (eds.) TABLEAUX 2011. LNCS, vol. 6793, pp. 263–277. Springer, Heidelberg (2011)
- [Ver96] Veroff, R.: Using hints to increase the effectiveness of an automated reasoning program: case studies. J. Autom. Reasoning **16**(3), 223–239 (1996)