# Testing a Saturation-Based Theorem Prover: Experiences and Challenges

## Abstract

This paper attempts to address the question of how best to assure the correctness of saturation-based automated theorem provers using our experience with developing the theorem prover Vampire. We describe the techniques we currently employ to ensure that Vampire is correct and use this to motivate future challenges that need to be addressed to make this process more straightforward and to achieve better correctness guarantees.

## 1 Introduction

This paper considers the problem of checking that a saturation-based automated theorem prover is *correct*. We consider this question within the context of the Vampire theorem prover [14], but many of our discussions generalise to similar theorem provers such as E [22], SPASS [26], and iProver [13]. We discuss what we mean precisely by correctness, describe how we detect bugs and, as our main contribution, outline the challenges that need to be addressed.

Automated theorem provers (ATPs) are often used as *black boxes* in other techniques (e.g. program verification), and those techniques rely on the results of the theorem prover for the correctness of their own results. Another area that makes use of ATPs is the application of so-called *hammers* [12, 15] in interactive theorem proving. These combinations usually provide functionality to reconstruct the proofs of the ATP using their own trusted kernels, although they also offer users the option to skip such steps.

It is clear that correctness is important here, so how are we doing? Most theorem provers seem to be generally correct. However, cases of unsoundness are not uncommon. In SMT-COMP 2016 there were 603 conflicts (solvers returning different results) on 73 benchmarks caused by three solvers giving incorrect results for various reasons.^{1} In the CASC competition [25], there is a period of testing where soundness is checked and resolved, and a number of solvers have later been disqualified from the competition due to unsoundness. In our experience, adding a new feature to a theorem prover is a highly complex task and it is easy to introduce unsoundness, or general incorrectness, especially in areas of the code that are encountered infrequently during proof search.

This paper begins by describing what we mean by correctness with respect to saturation-based theorem provers (Sect. 2) and the approach we take to finding and fixing bugs (Sect. 3). This provides sufficient context to present a set of challenges that need to be addressed to produce a better solution to this problem (Sect. 4). Addressing these challenges is part of our current ongoing research. An extended version of this paper containing examples of bugs found in Vampire is available online [20].

## 2 What Does Correctness Mean for Us?

Broadly there are two ways in which a theorem prover such as Vampire can be incorrect: either it *returns the wrong result*, or it *violates a contract of proper behaviour*.

### 2.1 Incorrect Result

To understand what a correct and incorrect result mean to Vampire, we need to introduce some of the theoretical foundations of the underlying technique. We note that the approach used by Vampire is the same as that taken by other first-order theorem provers, so these discussions, and the challenges outlined later, generalise beyond Vampire.

For a given problem Vampire can return one of three results: *Theorem*, *Non-Theorem*, or *Unknown*. Providing one of the first two results when that result does not hold is clearly incorrect. Providing *Unknown* as the result is incorrect in the weaker sense that there is a known answer but, due to the undecidability of first-order logic and the general hardness of the problem, it is often unavoidable. However, as discussed below, we should understand the different ways in which *Unknown* can be produced as a result. Note that *Unknown* will also be returned if Vampire exceeds either the time or memory allotted to it.

Vampire establishes the *validity* of problems in the form (1) by detecting the *unsatisfiability* of its negation: the negated problem is transformed into a set of *clauses* \({{\mathcal {S}}}\), and Vampire then repeatedly adds consequences of \({{\mathcal {S}}}\) until the contradiction \( false \) is derived or all possible consequences have been added. This process is called *saturation* and may not terminate in general for a satisfiable set \({{\mathcal {S}}}\).
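The saturation process just described can be sketched, for the propositional case, as a simple given-clause loop. This is an illustrative toy (binary resolution on sets of integer literals, with FIFO clause selection), not Vampire's implementation:

```python
def resolve(c1, c2):
    """All binary resolvents of two clauses (frozensets of non-zero ints,
    where -n denotes the negation of n)."""
    return [(c1 - {lit}) | (c2 - {-lit}) for lit in c1 if -lit in c2]

def saturate(clauses, max_steps=10_000):
    """Given-clause saturation: 'Theorem' if the empty clause (false) is
    derived, 'Non-Theorem' if the set saturates, 'Unknown' on resource-out."""
    passive = [frozenset(c) for c in clauses]
    active = []
    for _ in range(max_steps):
        if not passive:
            return "Non-Theorem"        # saturated without a contradiction
        given = passive.pop(0)          # FIFO selection: trivially fair
        if not given:
            return "Theorem"            # empty clause = contradiction derived
        active.append(given)
        for other in active:
            for res in resolve(given, other):
                if res not in active and res not in passive:
                    passive.append(res)
    return "Unknown"                    # resource limit reached
```

For example, the unsatisfiable set \(\{p,\ \lnot p \vee q,\ \lnot q\}\) yields *Theorem*, while a satisfiable set saturates and yields *Non-Theorem*.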

If Vampire derives a contradiction then it has shown that the problem (1) is *valid*, i.e. a theorem. Deriving a contradiction when the problem in (1) is not valid is *unsound* and an *incorrect result*.

If Vampire fails to derive a contradiction and *saturates* the set \({{\mathcal {S}}}\) in finitely many steps then there is a result [2] telling us that under certain conditions we can conclude that \( false \) cannot be a consequence of \({{\mathcal {S}}}\) and therefore problem (1) is a non-theorem. These conditions capture the *completeness* of the underlying inference system and generally require that all possible *non-redundant* inferences have been performed.

However, there are many things that Vampire does to heuristically improve proof search that break the completeness conditions. For example, (i) certain well-performing selection functions [10] might prevent inferences that need to be performed for completeness conditions to hold; and (ii) some preprocessing steps and proof search strategies explicitly remove clauses from the search space in an attempt to mitigate search space explosion [11, 21]. If the completeness conditions do not hold then upon saturation the result is *Unknown*. Sometimes it is easy to detect when these conditions hold, sometimes it is non-trivial, and sometimes they are erroneously broken. In this last case (when we think the conditions hold but they do not) this will lead to incorrectly reporting non-theorem i.e. this *completeness issue* is another kind of *incorrect result*.

To ensure the requirement that all possible non-redundant inferences will in the end be performed, we impose certain *fairness* criteria on the saturation process. More concretely, we require that no such inference is postponed indefinitely. Notice that this is by nature a tricky condition to deal with as it cannot be seen to have been violated after finitely many steps while the prover is running. And since, due to the semi-decidability of first-order logic, there is no upper bound on the length of the computation required to derive \( false \), a non-fair implementation might in certain cases never be able to return *Theorem*, even if it is the correct answer and instead keep computing indefinitely. Thus, this *fairness issue* does not lead to an incorrect result per se, but rather just negatively influences performance. As such it may be extremely hard to detect and deal with.
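One standard way to meet such a fairness criterion, which Vampire-style provers realise via an *age-weight ratio*, is to interleave picks of the "best" (e.g. lightest) clause with picks of the oldest clause, so that no clause is postponed indefinitely. A minimal sketch (the class and its details are illustrative, not Vampire's actual data structures):

```python
import heapq

class PassiveSet:
    """Fair clause selection sketch: mostly pick the lightest clause, but
    every `ratio`-th pick take the oldest, so no clause waits forever."""
    def __init__(self, ratio=5):
        self.by_age, self.by_weight = [], []
        self.ratio, self.ticks, self.age = ratio, 0, 0
        self.done = set()   # clauses already selected (stale heap entries remain)

    def add(self, clause):
        heapq.heappush(self.by_age, (self.age, clause))
        heapq.heappush(self.by_weight, (len(clause), self.age, clause))
        self.age += 1

    def pop(self):
        self.ticks += 1
        queue = self.by_age if self.ticks % self.ratio == 0 else self.by_weight
        while queue:
            *_, clause = heapq.heappop(queue)
            if clause not in self.done:
                self.done.add(clause)
                return clause
        return None
```

With `ratio=2`, a heavy old clause is guaranteed to be selected on the second pick even if lighter clauses keep arriving.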

### 2.2 Violating the Contract of Proper Behaviour

*Program crash.* A program crash is where Vampire terminates unexpectedly, usually due to an unhandled exception, floating point error (SIGFPE), or segmentation fault (SIGSEGV). Unhandled exceptions are bugs, as we should handle them. In general, Vampire handles all known classes of exceptions at the top level, but we have recently had issues with integrated tools (MiniSAT and Z3) producing exceptions that we did not handle. Floating point errors and segmentation faults are typical software bugs that should be detected and removed.

*Assertion violation.* Vampire is developed defensively with frequent use of *assertions*. For example, these are inserted wherever a function makes some assumptions about its input or the results of a nested function call, and wherever we believe a certain line to be unreachable. Vampire consists of roughly 194,000 lines of C++ code with roughly 2,500 assertions, i.e. roughly one assertion per 77 lines. The majority of potential errors are detected early as assertion violations.

## 3 Finding Bugs

In this section we briefly describe how we detect and investigate bugs in Vampire; these two steps can be equally difficult. The search space for Vampire is vast, and finding the combination of inputs that triggers a bug is very difficult. Some bugs are incredibly subtle, particularly soundness bugs or those involving memory errors, and tracking them down can involve hunting through thousands of lines of output.

### 3.1 The Input Search Space

The two inputs to Vampire are the input problem and a strategy capturing proof search parameters. The space of possible input problems is infinite. However, we do not currently explore this space systematically. Instead we sample from sets of representative benchmarks, e.g. TPTP [24] (\(\sim \)20k problems) and SMT-LIB [4] (\(\sim \)46k relevant problems). Vampire currently uses roughly 75 proof search parameters with more than half of these having more than two possible values and some taking arbitrary numeric values (although in testing we fix these to a predefined sensible set). Therefore, the search space is significantly larger than \(2^{75}\), i.e. too large to explore systematically.

### 3.2 The Debug Process

Users of the Vampire system may report bugs to us. Currently this is an informal process carried out by personal email. Sometimes these bugs are actually feature requests, and other times they can be due to a misuse of Vampire.

More commonly, bugs are found by randomly sampling the parameter space and the sets of available problems (ensuring reasonable diversity in terms of features and status, e.g. theorems and non-theorems). We use a cluster^{2} that enables us to carry out around a million checks a day (using varying short time limits).
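This random testing loop can be sketched as follows. The parameter names and values below are simplified stand-ins for Vampire's actual options, and `run_prover` abstracts the call to the prover binary:

```python
import random

# Illustrative parameter space: each option with its possible values.
PARAMETERS = {
    "saturation_algorithm": ["lrs", "otter", "discount"],
    "selection": [0, 1, 2, 10],
    "age_weight_ratio": ["1:1", "1:5", "1:10"],
}

def random_strategy(rng):
    """Sample one value per proof search parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in PARAMETERS.items()}

def run_tests(problems, run_prover, n_runs, seed=0):
    """Run randomly sampled strategies on randomly sampled problems and
    collect any result contradicting the problem's known status.
    `problems` is a list of (name, known status) pairs; `run_prover`
    stands in for invoking the actual prover with a strategy."""
    rng = random.Random(seed)
    conflicts = []
    for _ in range(n_runs):
        name, status = rng.choice(problems)
        strategy = random_strategy(rng)
        result = run_prover(name, strategy)
        if result not in (status, "Unknown"):
            conflicts.append((name, strategy, result))
    return conflicts
```

A prover that always answers *Theorem* would be flagged as conflicting on every sampled non-theorem, while *Unknown* is never treated as a conflict.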

Once an error is detected, we must diagnose and fix the fault. Below we describe some of our methods for doing this.

*Tracing.* Vampire has its own library for tracing function calls. A macro is manually inserted at the start of each significant function. This macro enables the tracing library to maintain the current call stack, which is then printed on an assertion violation or during signal handling, along with the number of such call points passed so far. This second piece of information can be used to explicitly log function calls for some range of call points, e.g. those just before the erroneous point. This feature is invaluable in quickly locating the cause of an assertion violation.

*Memory Checking.* Vampire implements its own memory management library, allowing fine-grained control of memory allocation and deallocation and enforcement of soft memory limits. In debug mode, Vampire keeps track of each allocated piece of memory and checks that the corresponding deallocation is as expected. Vampire also reports memory leaks, i.e. memory still allocated at the end of proof search.

*Segmentation Faults and Silent Memory Issues.* The most difficult bugs to debug are a rogue pointer or a piece of uninitialised memory. We find that a first step of applying Valgrind^{3} will often detect the more straightforward issues. However, such bugs are often only noticed via incorrect results and fixed with much manual effort.

*Proof Checking.* To detect unsoundness we employ proof checking, which we discuss further below. We do not currently have a corresponding method for checking that a saturated set complies with the necessary completeness conditions.
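The tracing mechanism described above can be mimicked in a few lines. This Python sketch stands in for Vampire's C++ macro: it maintains the current call stack and a running call-point counter, and optionally logs the stack for a chosen range of call points:

```python
import functools

CALL_STACK = []      # current stack of traced function names
CALL_POINTS = [0]    # running counter of call points passed so far
LOG_RANGE = None     # optional (start, end) range of call points to log
LOG = []             # (call point, snapshot of the stack) entries

def traced(fn):
    """Maintain CALL_STACK across entry/exit of a traced function; the
    stack can then be printed when an assertion violation is detected."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        CALL_POINTS[0] += 1
        CALL_STACK.append(fn.__name__)
        if LOG_RANGE and LOG_RANGE[0] <= CALL_POINTS[0] <= LOG_RANGE[1]:
            LOG.append((CALL_POINTS[0], list(CALL_STACK)))
        try:
            return fn(*args, **kwargs)
        finally:
            CALL_STACK.pop()
    return wrapper

# Illustrative traced functions (names are hypothetical):
@traced
def simplify(clause):
    return list(CALL_STACK)   # snapshot of the stack at this point

@traced
def process(clause):
    return simplify(clause)
```

Calling `process` records the stack `["process", "simplify"]` inside the nested call, and the stack unwinds cleanly afterwards.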

### 3.3 Proof Checking

The easiest way to confirm a result indicating that the input formula is a theorem is to check that the associated proof performs only sound inference steps. This process is called proof checking, and here we briefly describe the capabilities and limitations of the proof checking technique as currently realised in Vampire.^{4}

A proof is a directed acyclic graph printed in a linear form, where nodes that have no incoming edges are either input formulas or axioms introduced by Vampire, and the single node with no outgoing edges contains the contradiction. In such a proof, each derived clause is labelled with the name of the inference that produced it and the lines of its premises.

We can pass each derivation step directly to an independent theorem prover^{5}, and if a step cannot be independently verified then it should be investigated.
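Generating these per-step problems is mechanical. A sketch follows, using a deliberately simplified `fof` encoding (the `proof` data structure and step ids are illustrative): each step becomes a standalone problem whose premises are axioms and whose conclusion is the conjecture, so the step is sound exactly when an independent prover reports *Theorem*.

```python
def step_obligation(proof, step_id):
    """Render one derivation step as a standalone TPTP-style problem.
    `proof` maps a step id to (formula string, list of premise ids)."""
    formula, premises = proof[step_id]
    lines = [f"fof(p{p},axiom,{proof[p][0]})." for p in premises]
    lines.append(f"fof(c{step_id},conjecture,{formula}).")
    return "\n".join(lines)
```

For a step deriving `q(a)` from premises `p(a)` and `p(a) => q(a)`, the obligation lists both premises as axioms and `q(a)` as the conjecture.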

## 4 Challenges

We now present what we have identified as the main challenges left to be solved, or at least addressed, given in order of importance as we perceive it.

### 4.1 Full and Automated Proof Checking

As described in Sect. 3.3, there is already reasonable support for independently checking the correctness of proofs. However, this situation could still be improved.

*Missing Features.* There are parts of proofs that cannot currently be proof checked; the two main ones are:

*Symbol Introducing Preprocessing.* Certain inference steps of the clausification phase, e.g. Skolemization and formula naming [19], introduce new symbols and as such do not preserve logical equivalence. This means the conclusion of the inference does not logically follow from its premises; what these steps preserve is the global satisfiability of the clause set they modify. One necessary condition for correctness is that the introduced symbols be *fresh*, i.e. not appearing elsewhere in the input. Checking this requires a non-trivial extension to the described approach.

*SAT and SMT Solving.* Vampire makes use of SAT and SMT solvers in various ways (see [18]). This means that some inferences in Vampire are of the form \(P_1 \wedge \ldots \wedge P_n \rightarrow C\) *by SAT/SMT*, or even claim that some abstraction or grounding of the premises leads to \(C\) by SAT or SMT solving. To handle such proof steps we need to collect the premises together (potentially applying the necessary abstraction or grounding) and run a SAT or SMT solver as appropriate.

Extra information may need to be added to proofs to support these checks.
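For a purely propositional *by SAT* step, the check amounts to a propositional entailment test. The brute-force sketch below only illustrates the obligation (in practice one would hand it to a SAT solver); clauses are sets of non-zero integers with `-n` denoting the negation of `n`:

```python
from itertools import product

def entails(premises, conclusion):
    """Check a propositional 'by SAT' step by brute force: every truth
    assignment satisfying all premise clauses must satisfy the
    conclusion clause."""
    vars_ = sorted({abs(l) for c in premises + [conclusion] for l in c})
    for bits in product([False, True], repeat=len(vars_)):
        val = dict(zip(vars_, bits))
        sat = lambda clause: any(val[abs(l)] == (l > 0) for l in clause)
        if all(sat(c) for c in premises) and not sat(conclusion):
            return False    # counter-assignment found: the step is unsound
    return True
```

For example, `{p}` and `{¬p ∨ q}` together entail `{q}`, whereas `{p}` alone does not entail `{q}`.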

*Automating Proof Checking.* Having tools able to check the correctness of proofs is irrelevant if those tools are not used. Ideally, theorem provers should provide the functionality to check the proofs that they produce automatically. As the problems produced during proof checking are often easy to solve, one could imagine a situation where, in a certain mode, a theorem prover applied proof checking to its proof output.

*Independence.* It might not be possible to find an independent solver able to handle the problems produced by proof checking. A solver might not be able to check an individual step, because it is too hard, or not be able to handle the language features the problem contains. A weaker independence could be achieved by making use of a previous version of the original theorem prover that we are more confident in.

### 4.2 Analysability of Unsound Proofs

Checking whether a proof is correct or not is essential. However, knowing that a proof is incorrect is not, in itself, very useful. Another missing piece of this puzzle is tooling that can analyse proofs and extract, summarise or explain the *reason* the proof is incorrect. The proof checking process will reveal the proof step that fails to hold, but detecting the underlying reason for that proof step having occurred is non-trivial.

One step in this direction is the application of *delta-debugging* [27] to reduce the input to a simpler form to aid debugging efforts. This approach has been explored for SAT/QBF solvers [1, 5] applied to both the input problem and the parameter space.
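A simplified variant of delta-debugging minimisation, in the spirit of [27], can be sketched as follows: repeatedly try deleting chunks of the failing input, keeping any smaller variant on which the bug still manifests, and halving the chunk size when no deletion succeeds.

```python
def ddmin(failing_input, still_fails):
    """Greedy, simplified delta debugging over a list-shaped input.
    `still_fails(candidate)` reruns the prover and reports whether the
    bug still shows on the reduced input."""
    chunk = max(1, len(failing_input) // 2)
    while chunk >= 1:
        i, reduced = 0, False
        while i < len(failing_input):
            candidate = failing_input[:i] + failing_input[i + chunk:]
            if candidate and still_fails(candidate):
                failing_input, reduced = candidate, True   # keep smaller input
            else:
                i += chunk                                  # deletion failed, move on
        chunk = chunk if reduced else chunk // 2
    return failing_input
```

For instance, if a bug is triggered by the joint presence of two input lines, minimisation shrinks a ten-line input down to exactly those two lines.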

### 4.3 Handling Non-theorem Results

So far we have ignored the incorrect result of reporting a problem to be satisfiable when it is not. It is not clear how to practically check whether a saturated set is indeed saturated, as the notion of saturation depends on the calculus used and its instantiation with parameters such as the term ordering and literal selection methods.

*Non-redundant Inferences.* A necessary condition for completeness is that proof search never deletes anything that is not redundant. Checking this is significantly more complex than proof checking. In proof checking we must check that each inference of the proof is sound, i.e. that we were allowed to perform those inferences to derive a contradiction. If we have a saturated set then we should additionally check that every inference that we chose not to perform was redundant; currently we often have to do this manually, guided by some intuition about what such inferences might be. The number of such inferences is typically a few orders of magnitude larger than the length of a typical proof.

*Monitoring Fairness.* To avoid missing a saturated set we need to satisfy the fairness criteria discussed in Sect. 2.1. This is not *monitorable* in a formal sense [8, 9], as it cannot be satisfied or violated based on a finite number of observations. However, if we were to introduce a *stronger* property of *bounded fairness* [7], e.g. that a clause of age \(A\) will be processed within \(kA\) iterations for some constant \(k\), then this property becomes monitorable (it is now a *response* property).
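Such a bounded-fairness monitor is easy to state over a finite trace of prover events. The event encoding below is illustrative: a clause first added at iteration \(a\) must be processed by iteration \(k \cdot a\), and the monitor reports the first clause that violates this bound.

```python
def check_bounded_fairness(trace, k):
    """Monitor a finite trace of ('add', clause) / ('process', clause)
    events, numbered from 1. A clause added at iteration a must be
    processed by iteration k * a. Returns the first violating clause,
    or None if the trace satisfies the bound so far."""
    added, processed = {}, set()
    for i, (event, clause) in enumerate(trace, start=1):
        if event == "add" and clause not in added:
            added[clause] = i
        elif event == "process":
            processed.add(clause)
        for c, a in added.items():
            if c not in processed and i > k * a:
                return c    # clause overdue: bounded fairness violated
    return None
```

Unlike plain fairness, a violation of this property is observable after finitely many events, which makes it suitable for runtime verification.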

### 4.4 Achieving Better Coverage with Random Testing

As previously discussed, due to the enormous variability in proof search parameters and possible problem inputs, the best approach to detecting errors and incorrect results is through random search. However, the current approaches to random search are not optimal. Here we briefly outline areas for improvement.

*Code Coverage.* Our current approach makes no attempt to ensure that testing covers all lines in the code. Even though this is a very weak notion of coverage, it could be used to detect areas of code that should be tested, or removed if never used.

*Coverage of the Parameter Space.* Whilst random sampling of the parameter space can be effective at discovering bugs, it is not clear that all areas of the parameter space are of equal interest. Clearly, combinations of features that have not been tested together should have priority, and features added more recently should be tested more thoroughly. In this vein we could borrow from T-wise test case generation strategies for Software Product Lines [16], which aim to test all T-combinations of features.
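As a first step towards T-wise coverage, one can at least measure which parameter-value pairs an existing test suite never exercises. A sketch for T = 2 (the parameter dictionary shape is illustrative):

```python
from itertools import combinations, product

def uncovered_pairs(parameters, suite):
    """Measure 2-wise coverage: return the set of
    (parameter, value, parameter, value) pairs exercised by no strategy
    in the test suite. `parameters` maps names to value lists; `suite`
    is a list of strategies (dicts from parameter name to value)."""
    names = sorted(parameters)
    missing = set()
    for p, q in combinations(names, 2):
        for vp, vq in product(parameters[p], parameters[q]):
            if not any(s[p] == vp and s[q] == vq for s in suite):
                missing.add((p, vp, q, vq))
    return missing
```

The uncovered pairs then directly suggest which strategies to add to the suite next.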

*Coverage of the Problem Space.* This is an area where relatively little has been done (in the first-order setting). We currently use libraries of existing problems as possible inputs to the testing process. However, if we do not have a problem that exercises a certain feature sufficiently, we are unlikely to detect bugs related to that feature. For example, the TPTP language contains features that are very rarely used within the TPTP library. This issue is not confined to language features. Proof search is dependent on particular dimensions of the input problem (e.g. size, signature) that are difficult to quantify. If the input problems do not cover these dimensions sufficiently then certain parts of Vampire will not be tested effectively. A useful area of research would be the automatic generation of problems, or *fuzzing* of existing problems, to cover such dimensions. In this direction we could borrow from successful results in SAT/QBF solving [5, 6].
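A toy fuzzer along these lines, generating random propositional problems in TPTP `cnf` syntax with explicitly controlled dimensions (signature size, clause count, clause length); a realistic fuzzer would of course also vary first-order structure:

```python
import random

def fuzz_problem(n_vars, n_clauses, max_len, seed):
    """Generate a random propositional problem in TPTP cnf syntax.
    Dimensions are controlled explicitly so that generated problems can
    cover regions the existing benchmark libraries miss."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_clauses):
        lits = []
        for _ in range(rng.randint(1, max_len)):
            v = f"p{rng.randint(1, n_vars)}"
            lits.append(v if rng.random() < 0.5 else f"~{v}")
        lines.append(f"cnf(c{i},axiom,({' | '.join(lits)})).")
    return "\n".join(lines)
```

Fixing the seed makes each generated problem reproducible, which matters when a fuzzed input triggers a bug that must later be minimised.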

## 5 Conclusion

This paper describes our experience testing the Vampire theorem prover and what we see as the challenges to overcome to help us improve this effort. The ideas we discuss generalise to other theorem provers and some efforts, such as proof checking techniques and better problem coverage, would be widely beneficial. Addressing the challenges set out in this paper is part of our current research and we plan to provide a proof checking tool that can fully and automatically check proofs produced by Vampire.

## References

- 1. Artho, C., Biere, A., Seidl, M.: Model-based testing for verification back-ends. In: Veanes, M., Viganò, L. (eds.) TAP 2013. LNCS, vol. 7942, pp. 39–55. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38916-0_3
- 2. Bachmair, L., Ganzinger, H.: Resolution theorem proving. In: Handbook of Automated Reasoning, vol. 1, chap. 2, pp. 19–99. Elsevier Science (2001)
- 3. Barrett, C., Conway, C., Deters, M., Hadarean, L., Jovanovic, D., King, T., Reynolds, A., Tinelli, C.: CVC4. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 171–177. Springer, Heidelberg (2011). doi:10.1007/978-3-642-22110-1_14
- 4. Barrett, C., Stump, A., Tinelli, C.: The Satisfiability Modulo Theories Library (SMT-LIB) (2010). www.SMT-LIB.org
- 5. Brummayer, R., Lonsing, F., Biere, A.: Automated testing and debugging of SAT and QBF solvers. In: Strichman, O., Szeider, S. (eds.) SAT 2010. LNCS, vol. 6175, pp. 44–57. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14186-7_6
- 6. Creignou, N., Egly, U., Seidl, M.: A framework for the specification of random SAT and QSAT formulas. In: Brucker, A.D., Julliand, J. (eds.) TAP 2012. LNCS, vol. 7305, pp. 163–168. Springer, Heidelberg (2012). doi:10.1007/978-3-642-30473-6_14
- 7. Dershowitz, N., Jayasimha, D.N., Park, S.: Bounded fairness. In: Dershowitz, N. (ed.) Verification: Theory and Practice. LNCS, vol. 2772, pp. 304–317. Springer, Heidelberg (2003). doi:10.1007/978-3-540-39910-0_14
- 8. Diekert, V., Leucker, M.: Topology, monitorable properties and runtime verification. Theor. Comput. Sci. 537, 29–41 (2014)
- 9. Falcone, Y., Fernandez, J.C., Mounier, L.: Runtime verification of safety-progress properties. In: Bensalem, S., Peled, D.A. (eds.) RV 2009. LNCS, vol. 5779, pp. 40–59. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04694-0_4
- 10. Hoder, K., Reger, G., Suda, M., Voronkov, A.: Selecting the selection. In: Olivetti, N., Tiwari, A. (eds.) IJCAR 2016. LNCS, vol. 9706, pp. 313–329. Springer, Cham (2016). doi:10.1007/978-3-319-40229-1_22
- 11. Hoder, K., Voronkov, A.: Sine qua non for large theory reasoning. In: Bjørner, N., Sofronie-Stokkermans, V. (eds.) CADE 2011. LNCS, vol. 6803, pp. 299–314. Springer, Heidelberg (2011). doi:10.1007/978-3-642-22438-6_23
- 12. Kaliszyk, C., Urban, J.: HOL(y)Hammer: online ATP service for HOL Light. Math. Comput. Sci. 9(1), 5–22 (2015)
- 13. Korovin, K.: iProver - an instantiation-based theorem prover for first-order logic (system description). In: Armando, A., Baumgartner, P., Dowek, G. (eds.) IJCAR 2008. LNCS, vol. 5195, pp. 292–298. Springer, Heidelberg (2008). doi:10.1007/978-3-540-71070-7_24
- 14. Kovács, L., Voronkov, A.: First-order theorem proving and Vampire. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 1–35. Springer, Heidelberg (2013). doi:10.1007/978-3-642-39799-8_1
- 15. Paulson, L.C., Blanchette, J.C.: Three years of experience with Sledgehammer, a practical link between automatic and interactive theorem provers. In: The 8th International Workshop on the Implementation of Logics, IWIL 2010. EPiC Series in Computing, vol. 2, pp. 1–11. EasyChair (2012)
- 16. Perrouin, G., Sen, S., Klein, J., Baudry, B., Le Traon, Y.: Automated and scalable t-wise test case generation strategies for software product lines. In: Proceedings of the 2010 Third International Conference on Software Testing, Verification and Validation, ICST 2010, pp. 459–468. IEEE Computer Society (2010)
- 17. Reger, G.: Better proof output for Vampire. In: Proceedings of the 3rd Vampire Workshop, Vampire 2016. EPiC Series in Computing, vol. 44, pp. 46–60. EasyChair (2017)
- 18. Reger, G., Suda, M.: The uses of SAT solvers in Vampire. In: Proceedings of the 1st and 2nd Vampire Workshops. EPiC Series in Computing, vol. 38, pp. 63–69. EasyChair (2016)
- 19. Reger, G., Suda, M., Voronkov, A.: New techniques in clausal form generation. In: 2nd Global Conference on Artificial Intelligence, GCAI 2016. EPiC Series in Computing, vol. 41, pp. 11–23. EasyChair (2016)
- 20. Reger, G., Suda, M., Voronkov, A.: Testing a saturation-based theorem prover: experiences and challenges (extended version). ArXiv e-prints (2017)
- 21. Riazanov, A., Voronkov, A.: Limited resource strategy in resolution theorem proving. J. Symb. Comput. 36(1–2), 101–115 (2003)
- 22. Schulz, S.: E - a brainiac theorem prover. AI Commun. 15(2–3), 111–126 (2002)
- 23. Sutcliffe, G.: Semantic derivation verification: techniques and implementation. Int. J. Artif. Intell. Tools 15(6), 1053–1070 (2006)
- 24. Sutcliffe, G.: The TPTP problem library and associated infrastructure. J. Autom. Reason. 43(4), 337–362 (2009)
- 25. Sutcliffe, G.: The CADE ATP system competition - CASC. AI Mag. 37(2), 99–101 (2016)
- 26. Weidenbach, C., Dimova, D., Fietzke, A., Kumar, R., Suda, M., Wischnewski, P.: SPASS version 3.5. In: Schmidt, R.A. (ed.) CADE 2009. LNCS, vol. 5663, pp. 140–145. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02959-2_10
- 27. Zeller, A.: Yesterday, my program worked. Today, it does not. Why? In: Nierstrasz, O., Lemoine, M. (eds.) ESEC/FSE 1999. LNCS, vol. 1687, pp. 253–267. Springer, Heidelberg (1999). doi:10.1007/3-540-48166-4_16