5.1 Implementation
We implemented our integer induction rules \(\texttt {IntInd}_{\ge }\), \(\texttt {IntInd}_{>}\), \(\texttt {IntInd}_{\le }\), \(\texttt {IntInd}_{<}\) as well as \(\texttt {IntInd}_{[\ge ]}\) and the other corresponding interval induction rules in Vampire. Further, we also implemented a more general induction rule IntInd that does not require bounds to be in the search space and uses 0 as the lower or the upper bound. Our implementation in Vampire, consisting of approximately 1,200 lines of new C++ code, is available at https://github.com/vprover/vampire. The size of this additional code is relatively small because Vampire has libraries for indexing and chaining inference rules that could be used off the shelf.
Our (interval) downward/upward induction rules described in Section 4 can be applied when either (i) the comparison literal (e.g., \(t\ge b\) for the \(\texttt {IntInd}_\ge \) rule) is selected and the corresponding clause \(\lnot L[t] \vee C\) was already selected as an induction candidate before, or (ii) \(\lnot L[t] \vee C\) is selected as an induction candidate and the corresponding comparison literal was already selected before. To implement these rules efficiently, we need to retrieve comparison literals and literals selected for induction quickly. To do so, we extended the indexing mechanism of Vampire to index such literals. We do not apply induction when the induction formula L[x] is a comparison having x as a top-level argument, for example, \(x \le t\); we allow induction on all other induction formulas deemed suitable by other user-specified options.
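The pairing of cases (i) and (ii) can be illustrated with a small sketch. It is a toy model under strong assumptions, not Vampire's indexing code: terms, literals and clauses are plain strings, lookup is by exact equality rather than by term indexing, and the names `PairingIndex`, `select_comparison` and `select_candidate` are hypothetical.

```python
from collections import defaultdict


class PairingIndex:
    """Toy stand-in for the two indices: each maps a term to items stored under it."""

    def __init__(self):
        self.comparisons = defaultdict(list)  # term t -> comparison literals, e.g. "t >= b"
        self.candidates = defaultdict(list)   # term t -> clauses "~L[t] | C"

    def select_comparison(self, term, literal):
        """Case (i): a comparison literal on `term` is selected."""
        self.comparisons[term].append(literal)
        # Pair it with every induction candidate selected earlier for the same term.
        return [(literal, clause) for clause in self.candidates.get(term, [])]

    def select_candidate(self, term, clause):
        """Case (ii): a clause is selected as an induction candidate on `term`."""
        self.candidates[term].append(clause)
        # Pair it with every comparison literal selected earlier for the same term.
        return [(literal, clause) for literal in self.comparisons.get(term, [])]


index = PairingIndex()
first = index.select_candidate("t", "~L[t] | C")   # no bound known yet
second = index.select_comparison("t", "t >= b")    # triggers one pairing
print(first, second)
```

Whichever of the two items is selected second triggers the lookup, so in this model each (comparison literal, candidate) pair is produced exactly once.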
Our (interval) downward/upward induction rules in Vampire are enabled by the new option --induction int. The options --int_induction_interval infinite and --int_induction_interval finite limit the enabled rules to downward/upward only, and interval downward/upward only, respectively. Further, --int_induction_default_bound on enables the more general rule, which does not require bounds to be in the search space. Our new induction rules can also be controlled by other Vampire options for well-founded/structural induction, such as --induction_on_complex_terms on, which enables applying induction on any ground complex term. To improve Vampire’s performance for integer induction, we combined our new induction rules with --induction_on_complex_terms on and also with other options not specific to induction. We extended Vampire with a new mode scheduling various option configurations for integer induction, switched on by the option --mode portfolio --schedule integer_induction. Additionally, we introduced the option --schedule induction, which uses either the integer induction configurations as for --schedule integer_induction, or structural induction configurations, or both, depending on the data types used in the problem/property to be proved.
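As a rough illustration of how these options combine, the sketch below assembles command lines from the flags named above. The executable name `vampire`, the placement of the problem file as the last argument, and the helper names `integer_induction_args` and `portfolio_args` are assumptions made for this example only.

```python
def integer_induction_args(problem, interval=None, default_bound=False,
                           complex_terms=False, executable="vampire"):
    """Assemble an argument list for a single integer induction configuration."""
    if interval not in (None, "infinite", "finite"):
        raise ValueError("interval must be None, 'infinite' or 'finite'")
    args = [executable, "--induction", "int"]
    if interval is not None:
        # 'infinite': downward/upward rules only; 'finite': interval rules only.
        args += ["--int_induction_interval", interval]
    if default_bound:
        # Enables the more general rule that uses 0 as the bound.
        args += ["--int_induction_default_bound", "on"]
    if complex_terms:
        args += ["--induction_on_complex_terms", "on"]
    return args + [problem]


def portfolio_args(problem, schedule="integer_induction", executable="vampire"):
    """Assemble an argument list for one of the two schedules named in the text."""
    if schedule not in ("integer_induction", "induction"):
        raise ValueError("unknown schedule")
    return [executable, "--mode", "portfolio", "--schedule", schedule, problem]


# "problem.smt2" is a placeholder file name.
print(" ".join(integer_induction_args("problem.smt2", interval="finite",
                                      complex_terms=True)))
print(" ".join(portfolio_args("problem.smt2")))
```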
5.2 Benchmarks
We used two sets of examples: (i) benchmark sets LIA and UFLIA from the SMT-LIB collection [2], consisting of, respectively, 607 and 10,137 examples, and (ii) 120 new benchmarks similar to our motivating examples from Section 2.
To the best of our knowledge, state-of-the-art systems implementing inductive reasoning have not yet considered inductive reasoning over integers, with two exceptions: [17], which mainly focuses on induction over inductively defined data types but mentions induction on non-negative integers, and [11], which supports inductive reasoning using recursive function definitions without any special treatment for integers.
Since integer induction has not yet attracted enough attention in theorem proving, there is no significant collection of benchmarks for integer induction. To properly carry out experiments, we therefore created a set of 120 new benchmarks based on variations of our motivating examples from Section 2 and on properties of computing integer powers. One example is the function correctness of the program of Figure 2, which is formalized as follows:
$$\begin{aligned} \text {axioms: }\quad&\forall x \in \mathbb Z. (\mathtt {power}(x, 1) = x)\\ \quad&\forall x, e \in \mathbb Z. (2\le e \rightarrow \mathtt {power}(x, e) = x\cdot \mathtt {power}(x, e-1)) \\ \text {conjecture: }\quad&\forall x, y, e.(1 \le e \rightarrow \mathtt {power}(x\cdot y, e) = \mathtt {power}(x, e)\cdot \mathtt {power}(y, e)) \end{aligned} \tag{12}$$
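To make the formalization concrete, the following sketch reads the two axioms of (12) as a recursive definition and tests the conjecture on a small range of integers. It is only a bounded sanity check, not a proof (the general statement needs induction on e), and the names `conjecture_holds` and `check_small_range` as well as the chosen ranges are ours.

```python
from itertools import product


def power(x, e):
    """Direct reading of the axioms of (12); undefined for e < 1."""
    if e == 1:
        return x                      # power(x, 1) = x
    if e >= 2:
        return x * power(x, e - 1)    # power(x, e) = x * power(x, e - 1)
    raise ValueError("the axioms do not specify power(x, e) for e < 1")


def conjecture_holds(x, y, e):
    """Instance of the conjecture for concrete integers x, y and e >= 1."""
    return power(x * y, e) == power(x, e) * power(y, e)


def check_small_range(limit=5, max_exp=6):
    """Test the conjecture for all x, y in [-limit, limit] and 1 <= e <= max_exp."""
    values = range(-limit, limit + 1)
    return all(conjecture_holds(x, y, e)
               for x, y in product(values, repeat=2)
               for e in range(1, max_exp + 1))


print(check_small_range())
```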
Our set of 120 new benchmarks is described in Table 1 and available online at:
Table 1. Description of our benchmark set of 120 new examples.
To confirm that our new benchmarks require the use of inductive reasoning, we tested them on the SMT solver Z3 [6], which does not support induction. Z3 could not solve any of the 120 problems from our benchmark set. Names of subsets of our new benchmarks are constructed by joining variant tags described in Table 1. For example, problem (6) belongs to the category declared_unint_ax-fin_conj-fin of the set val. The following benchmark:
$$\begin{aligned} \text {axiom: }\quad&\forall x\in \mathbb Z. (\mathtt {val}(x) = \mathtt {val}(x+1)) \\ \text {conjecture: }\quad&\forall x, y \in \mathbb Z. (\mathtt {val}(x) = \mathtt {val}(y)) \end{aligned} \tag{13}$$
belongs to declared_unint_ax-all_conj-all of val, and the following example is from defined_inter_ax-geq_conj-geq of val:
$$\begin{aligned} \text {axioms: }\quad&\forall x\in \mathbb Z. (x \le 0 \rightarrow \mathtt {val}(x) = 0) \\ \quad&\forall x\in \mathbb Z. (0 < x \rightarrow \mathtt {val}(x) = \mathtt {val}(x-1)) \\ \text {conjecture: }\quad&\forall x \in \mathbb Z. ( 0\le x \rightarrow \mathtt {val}(x) = \mathtt {val}(0)) \end{aligned} \tag{14}$$
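As an illustration of why (14) calls for upward induction, consider the standard upward induction schema with the interpreted lower bound 0. We assume here that it matches the shape of the upward rule of Section 4; the abbreviation \(L[x]\) for \(\mathtt {val}(x) = \mathtt {val}(0)\) is introduced only for this sketch.

$$\begin{aligned} \text {schema: }\quad&\big (L[0] \wedge \forall x \in \mathbb Z. (0 \le x \wedge L[x] \rightarrow L[x+1])\big ) \rightarrow \forall y \in \mathbb Z. (0 \le y \rightarrow L[y]) \\ \text {base case: }\quad&\mathtt {val}(0) = \mathtt {val}(0) \\ \text {step case: }\quad&0 \le x \;\Rightarrow \; 0 < x+1 \;\Rightarrow \; \mathtt {val}(x+1) = \mathtt {val}((x+1)-1) = \mathtt {val}(x) = \mathtt {val}(0) \end{aligned}$$

The base case holds by reflexivity, and the step case uses the second axiom of (14) followed by the induction hypothesis \(L[x]\). The conclusion of the schema is exactly the conjecture.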
While 9 of the benchmarks (all in val) use finite intervals in both the assertion and the invariant (ax-fin_conj-fin), the remaining 111 benchmarks require inductive reasoning over infinite intervals.
5.3 Experimental Setup
We ran our experiments on computers with 32 cores (AMD Epyc 7502, 2.5 GHz) and 1 TB RAM. In all experiments we used a memory limit of 16 GB per problem. For the new benchmarks we used a 300-second time limit. For the experiments on the larger LIA and UFLIA sets we used a 10-second time limit.
In what follows, Vampire refers to the (default) version of Vampire, as in [10, 16]. By Vampire-I we denote our new version of Vampire, using integer induction rules (--induction int). Vampire-I* refers to the portfolio mode of Vampire-I, scheduling various option configurations for integer induction (--mode portfolio --schedule induction).
For experiments with the new benchmarks, we note that Vampire without integer induction cannot solve any of the problems. In this set of experiments, we therefore compared Vampire-I to the provers Cvc4 [17] and Acl2 [11], which are, to the best of our knowledge, the only two automated solvers supporting inductive reasoning with integers in addition to reasoning with theories and quantifiers. For Cvc4, we used the ig configuration from [17]: --quant-ind --quant-cf --conjecture-gen --conjecture-gen-per-round=3 --full-saturate-quant. For Acl2, we used its default configuration and translated our new problem set into the functional program encoding syntax of Acl2. In the experiments with the LIA and UFLIA benchmark sets of SMT-LIB, we also used Z3 [6] in the default configuration.
We ran Cvc4, Z3, Vampire and Vampire-I on problems encoded in the SMT-LIB2 syntax [2]; only Acl2 was run on the translated encoding of the new benchmarks.
Table 2. Comparison of solvers on SMT-LIB benchmarks.
5.4 Experimental Results
SMT-LIB Benchmarks. First, we evaluated the improvements of integer induction in Vampire-I when compared to Vampire, Cvc4 and Z3 on the LIA and UFLIA sets of SMT-LIB [2]. We aimed to verify that Vampire-I’s performance does not deteriorate due to adding integer induction, to check whether Vampire-I can solve problems that could not be solved automatically before, and to identify the best values for options related to integer induction. To this end, we picked five different strategies (e.g., using different saturation algorithms and selection functions) and used different combinations of induction options. Table 2 summarizes our results, showing that integer induction enabled Vampire-I to solve over 100 new problems that Vampire could not solve before (second-to-last column of Table 2). Moreover, 45 of these problems were also new compared to Cvc4 and Z3 (last column of Table 2), which most likely means that no theorem prover was able to prove them before.
In problems solved using integer induction, the integer induction rules were applied often during proof search: at least one of the interval induction rules was used in nearly 99% of problems, while one of the induction rules with one bound was used in nearly all problems. The interval induction rules and the induction rules with one bound were used on average 4559 and 1191 times, respectively. Among the proofs found, 89% employed interval induction (67% upward, 29% downward), while 27% used induction with one bound (22% upward, 8% downward). Additionally, over 64% of proofs only required one application of any induction rule.
Table 3. Experiments with our new benchmarks from Table 1.
Experiments with 120 New Benchmarks. Comparison results for Vampire-I, Acl2 and Cvc4 on our new benchmarks are displayed in Table 3, aggregated by benchmark subsets, as described in Table 1. We do not show Vampire in the table, since without integer induction it cannot solve any of the problems.
The results show that in some cases Acl2 can perform upward and downward induction on integers, but only when using interpreted constants as a base case (that is, it cannot handle symbolic bounds). However, it can only do so if it also proves termination of the recursively defined function. It also has issues with reasoning about multiplication.
Cvc4 has limited support for integer induction: it can apply upward induction but only when the base case is an interpreted constant. Since some problems seem to require induction with symbolic bounds, Cvc4 is mostly able to either solve all problems in a subset, or none of them. The only exception is the subset declared_mixed_ax-fin_conj-fin, in which Cvc4 solves one problem, which can be solved using upward induction with an interpreted constant as the base case.
Vampire-I* does not have any conceptual problems with solving the benchmarks. However, since it uses axioms and inference rules rather than dedicated decision procedures for handling integers, it sometimes has issues with solving problems with large integer values. For example, for the infinite interval subset of the val benchmark set, the only problems Vampire-I* did not solve were those containing the interpreted constant 100 or -100. Similarly, in the power benchmark set, the unsolved problems contained large numbers. Finally, in the declared_mixed_ax-fin_conj-fin subset, the two problems Vampire-I* did not solve also required more sophisticated arithmetic reasoning. However, the inability to deal efficiently with large numbers is not an intrinsic problem of superposition theorem provers. Reasoning with quantifiers and theories is still in its infancy and major improvements are underway. For example, there are recent parallel developments in superposition and linear arithmetic [15] that should improve this kind of reasoning in Vampire.