
1 Introduction

Saturation-based proof search is known to be fragile. Even seemingly insignificant changes in the search procedure, such as shuffling the order in which input formulas are presented to the prover, can have a huge impact on the prover’s running time and thus on the ability to find a proof within a given time limit.

This chaotic aspect of the prover behaviour is relatively poorly understood, yet has obvious consequences for evaluation. A typical experimental evaluation of a new technique T compares the number of problems solved by a baseline run with the number solved by a run enhanced by T (over an established benchmark and with a fixed timeout). While a higher number of problems solved by the run enhanced by T indicates a benefit of the new technique, it is hard to claim that a certain problem P is getting solved thanks to T. It might be that T just helps the prover get lucky on P through a complicated chain of cause and effect that is not related, in any reasonable sense, to the technique T or to the original idea behind it.

We propose to expose and counter the effect of chaotic behaviours by deliberately injecting randomness into the prover and observing the results of many independently seeded runs. Although computationally more costly than standard evaluation, such an approach promises to bring new insights. We gain the ability to apply the tools of probability theory and statistics to analyze the results, assign confidences, and single out those problems that robustly benefit from the evaluated technique. At the same time, by observing the changes in the corresponding runtime distributions we can even meaningfully establish the effect of the new technique on a single problem in isolation, something that is normally inconclusive due to the threat of chaotic fluctuations.

In this paper, we report on several experiments with a randomized version of the ATP Vampire [9]. After explaining the method in more detail (Sect. 2), we first demonstrate the extent to which the success of a typical Vampire proof search strategy can be ascribed to chance (Sect. 3). Next, we use the collected data to highlight the specifics of comparing two strategies probabilistically (Sect. 4). Finally, we focus on a single problem to see a chaotic behaviour smoothed into a distribution with a high variance (Sect. 5). The paper ends with an overview of related work (Sect. 6) and a discussion (Sect. 7).

2 Randomizing Out Chaos

Any developer of a saturation-based prover will confirm that the behaviour of a specific proving strategy on a specific problem is extremely hard to predict, that a typical experimental evaluation of a new technique (such as the one described earlier) invariably leads to both gains and losses in terms of the solved problems, and that a closer look at any of the “lost” problems often reveals just a complicated chain of cause and effect that steers the prover away from the original path (rather than a simple opportunity to improve the technique further).

These observations bring indirect evidence that the prover’s behaviour is chaotic: A specific prover run can be likened to a single bead falling down through the pegs of the famous Galton board. The bead follows a deterministic trajectory, but only because the code fixes every single detail of the execution, including many which the programmer did not care about and which were left as they are merely by coincidence. We put forward here that any such fixed detail (which does not contribute to an officially implemented heuristic) represents a candidate location for randomization, since a different programmer could have fixed the detail differently and we would still call the code essentially the same.

Implementation: We implemented randomization on top of Vampire version 4.6.1; the code is available as a separate git branch. We divided the randomization opportunities into three groups (governed by three new Vampire options).

Shuffling the input (-si on) randomly reorders the input formulas and, recursively, sub-formulas under commutative logical operations. This is done several times throughout the preprocessing pipeline, at the end of which a finished clause normal form is produced. Randomizing traversals (-rtra on) happens during saturation and consists of several randomized reorderings, including: reordering literals in a newly generated clause and in each given clause before activation, and shuffling the order in which generated clauses are put into the passive set. It also (partially) randomizes term ids, which are used as tiebreakers in various term indexing operations and determine the default orientation of equational literals in the term sharing structure. Finally, “randomized age-weight ratio” (-rawr on) replaces the default, deterministic mechanism for choosing the queue from which to select the next given clause [13] with a randomized one (which respects the age-weight ratio only probabilistically).
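As an illustration of the last point, the following Python sketch (purely hypothetical, not Vampire’s actual C++ implementation) shows one way a clause selector can respect an age-weight ratio only probabilistically: each selection draws the queue at random, with a probability proportional to the corresponding component of the ratio.

import random

def pick_next_clause(age_queue, weight_queue, age, weight, rng=random):
    # If one queue is empty, fall back to the other.
    if not age_queue:
        return weight_queue.pop(0)
    if not weight_queue:
        return age_queue.pop(0)
    # Draw the queue at random: the age-ordered queue is chosen with
    # probability age/(age+weight), so an age-weight ratio such as 1:10
    # is respected only in expectation, not as a fixed repeating pattern.
    if rng.random() < age / (age + weight):
        return age_queue.pop(0)     # oldest clause first
    return weight_queue.pop(0)      # lightest clause first

With age:weight set to 1:10, the age-ordered queue is then consulted in roughly one out of eleven selections on average, rather than exactly every eleventh time.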

All three options were active by default during our experiments.

Fig. 1. Blue: first-order TPTP problems ordered by the decreasing probability of being solved by the dis10 strategy within a 50 billion instruction limit. Red: a cactus plot for the same strategy, showing the dependence between a given instruction budget (y-axis) and the number of problems on average solved within that budget (x-axis). (Color figure online)

3 Experiment 1: A Single-Strategy View

First, we set out to establish to what degree the performance of a Vampire strategy can be affected by randomization. We chose the default strategy of the prover, except for the saturation algorithm, which we set to Discount, and the age-weight ratio, which we set to 1:10 (calling the resulting strategy dis10). We ran our experiment on the first-order problems from the TPTP library [15] version 7.5.0.

To collect our data, we repeatedly (with different seeds) ran the prover on the problems, performing full randomization. We measured the executed instructions needed to successfully solve a problem and used a limit of 50 billion instructions (which roughly corresponds to 15 s of running time on our machine), after which a run was declared unsuccessful. We ran the prover 10 times on each problem and additionally as many times as required to observe the instruction count average (over both successful and unsuccessful runs) stabilize within 1% of any of its 10 previously recorded values.
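The stopping criterion can be sketched as follows; this is one possible reading of it, with run_prover standing in as a placeholder for the instrumented prover, returning the executed-instruction count of a single seeded run.

def measure_problem(run_prover, problem, min_runs=10, rel_tol=0.01):
    # Repeat independently seeded runs; keep a history of the running
    # average of instruction counts (over successful and unsuccessful
    # runs alike) and stop once the current average lies within 1% of
    # any of its 10 previously recorded values.
    counts, history = [], []
    seed = 0
    while True:
        counts.append(run_prover(problem, seed))  # executed instructions
        seed += 1
        avg = sum(counts) / len(counts)
        if len(counts) > min_runs and any(
                abs(avg - prev) <= rel_tol * prev for prev in history[-10:]):
            return counts
        history.append(avg)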

A summary view of the experiment is given by Fig. 1. The most important thing to notice is the shaded region, which spans 965 problems that were solved by dis10 at least once but not by every run. In other words, these problems are solved with a probability p satisfying \(0 < p < 1\). This is a relatively large number and can be compared to the 8720 “easy” problems solved by every run. The collected data implies that 9319.1 problems are solved on average (marked by the left-most dashed line in Fig. 1), with a standard deviation \(\sigma = 11.7\). The latter should be an interesting indicator for prover developers: beating a baseline by only 12 TPTP problems can easily be ascribed just to chance.
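One natural way to arrive at these two figures is the following: writing \( p_P \) for the estimated probability that problem \( P \) gets solved within the limit, and assuming the runs on distinct problems to be independent, the number \( N \) of solved problems is a sum of independent Bernoulli variables, and hence
\[ \operatorname{E}[N] = \sum_P p_P \qquad \text{and} \qquad \sigma^2 = \operatorname{Var}[N] = \sum_P p_P\,(1 - p_P). \]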

Figure 1 also contains the obligatory “cactus plot” (explained in the caption), which, thanks to the collected data, can be constructed with the “on average” qualifier. By definition, the plot reaches the left-most dashed line at the full instruction budget of 50 billion. The subsequent dashed lines mark the number of problems we would on average expect to solve by running the prover (independently) on each problem two, three, four, and five times. This information is relevant for strategy scheduling: e.g., one can expect to solve a whole 137 additional problems by running randomized dis10 a second time.
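Under the same independence assumption, the expected coverage of running the prover k times with fresh seeds can be read off directly from the per-problem probabilities; a minimal Python sketch, with probs a hypothetical dictionary mapping each problem to its estimated solving probability:

def expected_coverage(probs, k):
    # Expected number of problems solved at least once when the prover is
    # run k times independently on every problem: a problem with solving
    # probability p escapes all k runs with probability (1 - p)**k.
    return sum(1.0 - (1.0 - p) ** k for p in probs.values())

# For instance, the average gain of a second randomized run:
# expected_coverage(probs, 2) - expected_coverage(probs, 1)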

Fig. 2. The effect of turning AVATAR off in the dis10 strategy (cf. Fig. 1).

Not every strategy exhibits the same degree of variability under randomization. Observe Fig. 2, with a plot analogous to Fig. 1 but for dis10 in which AVATAR [16] has been turned off. The shaded area is now much smaller (and only spans 448 problems). The powerful AVATAR architecture is thus convicted of making proof search more fragile and the prover less robust.

Remark. Randomization incurs a small but measurable computational overhead. On a single run of dis10 over the first-order TPTP (filtering out cases that took less than 1 s to finish, to prevent distortion by rounding errors), the observed median relative time spent randomizing on a single problem was 0.47%, the average 0.59%, and the worst 13.86%. Without randomization, the dis10 strategy solved 9335 TPTP problems under the 50 billion instruction limit, i.e., 16 problems more than the average reported above. Such is the price we pay for turning our prover into a Las Vegas randomized algorithm.

4 Experiment 2: Comparing Two Strategies

Once randomized performance profiles of multiple strategies are collected, it is interesting to look at two at a time. Figure 3 shows two very different scatter plots, each comparing our baseline dis10 to its modified version in terms of the probabilities of solving individual problems.

Fig. 3. Scatter plots comparing probabilities of solving a TPTP problem by the baseline dis10 strategy and 1) dis10 with AVATAR turned off (left), and 2) dis10 with blocked clause elimination turned on (right). On problems marked red the respective technique could not be applied (no splittable clauses derived / no blocked clauses eliminated).

On the left we see the effect of turning AVATAR off. The technique affects the proving landscape quite a lot and most problems have their mark along the edges of the plot, where at least one of the two probabilities has the extreme value of either 0 or 1. What the plot does not show well is how many marks end up at the extreme corners. These are: 7896 problems easy for both, 661 easy with AVATAR and hard without it, and 135 hard with AVATAR and easy without it.

Such “purified”, one-sided gains and losses constitute a new interesting indicator of the impact of a given technique. They should be the first problems to look at, e.g., during debugging, as they represent the most extreme but robust examples of how the new technique changes the capabilities of the prover.

The right plot is an analogous view, now showing the effect of turning on blocked clause elimination (BCE). This is a preprocessing technique coming from the context of propositional satisfiability [7], extended to first-order logic [8]. We see that here most of the visible problems show up as marks along the plot’s main diagonal, suggesting a (mostly) negligible effect of the technique. The extreme corners hide: 8648 problems easy for both, 17 easy only with BCE (11 satisfiable and 6 unsatisfiable), and 2 easy only without BCE (1 satisfiable and 1 unsatisfiable).
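For completeness, extracting such corner counts (and thus the “purified” one-sided gains and losses discussed above) from two performance profiles is straightforward; a minimal sketch, assuming hypothetical dictionaries p_base and p_new mapping problems to their estimated solving probabilities:

def corner_counts(p_base, p_new, problems):
    # Classify problems by the extreme corners of the scatter plot:
    # always solved by both strategies, only by the baseline, or only
    # by the modified strategy (solving probabilities exactly 1 or 0).
    easy_both = sum(1 for P in problems if p_base[P] == 1 and p_new[P] == 1)
    only_base = sum(1 for P in problems if p_base[P] == 1 and p_new[P] == 0)
    only_new  = sum(1 for P in problems if p_base[P] == 0 and p_new[P] == 1)
    return easy_both, only_base, only_new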

5 Experiment 3: Looking at One Problem at a Time

In their paper on age/weight shapes [13, Fig. 2], Rawson and Reger plot the number of given-clause loops required by Vampire to solve the TPTP problem PRO017+2 as a function of the age/weight ratio (\( awr \)), a ratio specifying how often the prover selects the next clause to activate from its age-ordered and weight-ordered queues, respectively. The curve they obtain is quite “jiggly”, indicating a fragile (discontinuous) dependence. Randomization allows us to smooth the picture and reveal new, previously hidden, (probabilistic) patterns.

Fig. 4. 2D-histograms for the relative frequencies (color-scale) of how often, given a specific \( awr \) (x-axis), solving PRO017+2 required the shown number of instructions (y-axis). The curves in pink highlight the mean y-value for every x. The performance of dis10 (left) and the same strategy enhanced by a goal-directed heuristic (right). (Color figure online)

The 2D-histogram in Fig. 4 (left) was obtained from 100 independently seeded runs for each of 1200 distinct values of \( awr \) between \(\text{1:1024}=2^{-10}\) and \(\text{4:1}=2^2\). We can confirm Rawson and Reger’s observation that the best \( awr \) for PRO017+2 lies at around 1:2. However, we can now also attempt to explain the “jiggly-ness” of their curve: with a fragile proof search, even a slight change in \( awr \) effectively corresponds to an independent sample from the prover’s execution resource distribution, which, although changing continuously with \( awr \), is of high variance for our problem (note the log scale of the y-axis).
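A relative-frequency view of this kind can be assembled from the raw samples along the following lines (a sketch only: samples stands for the collected (awr, instruction count) pairs, the bin counts are illustrative, and normalizing each \( awr \) column to sum to one is just one reading of “relative frequencies”):

import numpy as np

def awr_histogram(samples, x_bins=120, y_bins=60):
    # Build a 2D histogram from (awr, instruction_count) pairs:
    # awr on the x-axis (log2 scale), instruction counts on the
    # y-axis (log10 scale), with every awr column normalized to
    # relative frequencies.
    awr, instrs = np.asarray(samples, dtype=float).T
    hist, xedges, yedges = np.histogram2d(
        np.log2(awr), np.log10(instrs), bins=(x_bins, y_bins))
    col_sums = hist.sum(axis=1, keepdims=True)
    return hist / np.where(col_sums == 0, 1, col_sums), xedges, yedges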

The distribution has another interesting property: at least for certain values of \( awr \), it is distinctly multi-modal. It is as if the prover can either find a proof quickly (after a lucky event?) or only after a much harder effort later, with almost nothing in between. Shedding more light on this phenomenon is left for further research.

It is also very interesting to observe how such a 2D-histogram changes when we modify the proof search strategy. Figure 4 (right) shows the effect of turning on SInE-level split queues [3], a goal-directed clause selection heuristic (Vampire option -slsq on). We can see that the mean instruction count gets worse (for every tried \( awr \) value) and that the variance of the distribution distinctly increases. A curious effect of this is that we observe the shortest successful runs with -slsq on, while we still could not recommend this heuristic (in the case of PRO017+2) to the user. The probabilistic view makes us realize that there are competing criteria of prover performance for which one might want to optimize.

6 Related Work

The idea of randomizing a theorem prover is not new. Ertel [2] studied the speedup potential of running independently seeded instances of the connection prover SETHEO [10]. The dashed lines in our Figs. 1 and 2 capture an analogous notion in terms of “additional problems covered” for levels of parallelism 1 to 5. randoCoP [12] is a randomized version of another connection prover, leanCoP 2.0 [11]: especially in its incomplete setup, several restarts with different seeds helped randoCoP improve over leanCoP in terms of the number of solved problems.

Gomes et al. [4] notice that randomized complete backtracking algorithms for propositional satisfiability (SAT) lead to heavy-tailed runtime distributions on satisfiable instances. While we have not yet analyzed the runtime distributions coming from saturation-based first-order proof search in detail, we have definitely observed high variance also for unsatisfiable problems. Also in the domain of SAT, Brglez et al. [1] proposed input shuffling as a way of turning a solver’s runtime into a random variable and studied the corresponding distributions.

An interesting view on the trade-offs between expected performance of a randomized solver and the risk associated with waiting for an especially long run to finish is given by Huberman et al. [6]. This is related to the last remark of the previous section.

Finally, in the satisfiability modulo theories (SMT) community, input shuffling, or scrambling, has been discussed as an obfuscation measure in competitions [17], where it should prevent solvers from simply looking up a precomputed answer upon recognising a previously seen problem. Notable is also the use of randomization in solver debugging via fuzz testing [14, 18].

7 Discussion

As we have seen, the behaviour of a state-of-the-art saturation-based theorem prover is to a considerable degree chaotic, and on many problems a mere perturbation of seemingly unimportant execution details decides the success or failure of the corresponding run. While this may be seen as a sign of our as-yet imperfect grasp of the technology, the author believes that an equally plausible view is that some form of chaos is inherent and originates from the complexity of the theorem proving task itself. (Proof search in higher-order logic is expected to exhibit an even higher degree of fragility.)

This paper has proposed randomization as a key ingredient of a prover evaluation method that takes the chaotic nature of proof search into account. The extra cost required by the repeated runs, in itself not unreasonable to pay on contemporary parallel hardware, seems more than compensated for by the new insights coming from the probabilistic picture that emerges. Moreover, other uses of randomization are easy to imagine, such as data augmentation for machine learning approaches or the construction of more robust strategy schedules. It feels as though we have only scratched the surface of the possibilities this opens up. More research will be needed to fully harness the potential of this perspective.