Learning-Assisted Automated Reasoning with Flyspeck

The considerable mathematical knowledge encoded by the Flyspeck project is combined with external automated theorem provers (ATPs) and machine-learning premise selection methods trained on the proofs, producing an AI system capable of answering a wide range of mathematical queries automatically. The performance of this architecture is evaluated in a bootstrapping scenario emulating the development of Flyspeck from axioms to the last theorem, each time using only the previous theorems and proofs. It is shown that 39% of the 14185 theorems could be proved in a push-button mode (without any high-level advice and user interaction) in 30 seconds of real time on a fourteen-CPU workstation. The necessary work involves: (i) an implementation of sound translations of the HOL Light logic to ATP formalisms: untyped first-order, polymorphic typed first-order, and typed higher-order, (ii) export of the dependency information from HOL Light and ATP proofs for the machine learners, and (iii) choice of suitable representations and methods for learning from previous proofs, and their integration as advisors with HOL Light. This work is described and discussed here, and an initial analysis of the body of proofs that were found fully automatically is provided.


Introduction and Motivation
"It is the view of some of us that many people who could have easily contributed to project QED have been distracted away by the enticing lure of AI or AR." -The QED Manifesto "So it will take 140 man-years to create a good basic library for formal mathematics." -Freek Wiedijk [86] "We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris." -Larry Wall, Programming Perl [83] "And in demonstration itself logic is not all. The true mathematical reasoning is a real induction [...]" -Henri Poincaré, Science and Method [63] 1

Large-Theory Automated Reasoning and HOL Light
Use of external first-order automated theorem provers (ATPs) like Vampire [46], E [67], SPASS [85], and recently also SMT (satisfiability modulo theories) solvers like Z3 [24] for (large-theory) formalization has been developed considerably over the last decade. Particularly in the Isabelle community, the Sledgehammer [13,15] bridge to such external tools is becoming increasingly popular. This helps to further develop various parts of the technology involved. ATPs have recently gained the ability to quickly load large theories over large signatures and work with them [38]. Methods for automated selection of relevant knowledge and for proof guidance are actively developed [76], together with specialized automated systems targeted at particular mathematical domains [2,7,64]. Formats and translation methods handling more formalization-friendly foundations are being defined [16,27,70], and metasystems that decide which ATP, translation method, strategy, parallelization, and premises to use to solve a given problem with limited resources are being designed [59,80]. Cooperation of humans and computers over large corpora of formal knowledge is an interesting field, allowing exploration of new AI systems and combinations of different AI techniques that can attempt to encode concepts like analogy and intuition, and rigorously evaluate their usefulness. Perhaps not only Hilbert and Turing, but also the formality-opposing and intuition-oriented Poincaré [63] would have been interested to learn about the new "semantic AI paradise" of such large corpora of fully computer-understandable mathematics (from which we do not intend to be expelled).
The HOL Light [34] system is probably the first among the existing well-known interactive theorem provers (ITPs) which has integrated and extensively used a general ATP procedure, the MESON tactic [36]. Hurd has developed and benchmarked early bridges [39,40] between HOL and external systems, and his Metis system [41] has also become a significant part of the Isabelle/Sledgehammer bridge to ATPs [60]. Using the very detailed Otter/Ivy [53] proof objects, Harrison also later implemented a bridge from HOL Light to Prover9 [52]. HOL Light however does not yet have a general bridge to large-theory ATP/AI ("hammer") methods, similar to Isabelle/Sledgehammer or MizAR [78,79], which would attempt to automatically solve a new goal by selecting relevant knowledge from the large library and running (possibly customized/trained) external ATPs on such premise selections. HOL Light seems to be a natural candidate for adopting such methods, because of the amount of work already done in this direction mentioned above, and also thanks to HOL Light's foundational closeness to Isabelle/HOL. Also, thanks to the Flyspeck project [31], HOL Light is becoming less of a "single, very knowledgeable formalizer" tool, and is increasingly used as a "tool for interested mathematicians" (such as the Flyspeck team in Hanoi) who may know the large libraries much less and have less experience with crafting their own proof tactics. For such ITP users it is good to provide a small number of strong methods that allow fast progress, which can perhaps also complement the declarative modes [87] pioneered by HOL Light [35] in the LCF world.

Flyspeck as an Interesting Corpus for Semantic AI Methods
The purpose of the Flyspeck project is to produce a formal proof of the Kepler Conjecture [33,45]. The Flyspeck development (which in this paper always means also the required parts of the HOL Light library) is an interesting corpus for a number of reasons. First, it formalizes considerable parts of standard mathematics, and thus exposes a large body of interconnected mathematical reasoning to all kinds of semantic AI methods and experiments. Second, the formalization is done in a relatively directed way, with the final goal of the Kepler conjecture in mind. For example, in the Mizar library (and even more in other collections like the Coq contribs), articles may be contributed as isolated developments, and only much later (or never) re-factored into a form that makes them work well with related developments. Such refactoring is often a nontrivial process [66]. In a directed development like Flyspeck, such integrity is a concern from the very beginning, and this concern should result in the theorems working better together to justify new conjectures that combine the areas covered by the development. Third, the language of HOL Light is in a certain sense simpler than the language of Mizar and Coq (and to a lesser extent also than Isabelle/HOL), where one typically first needs to set up the right syntactic/type-automation environment to be able to formulate new conjectures in advanced areas. This greater simplicity (which may come at a cost) makes it possible to write direct (yet advanced) queries to the AI/ATP ("hammer") system in the original language, without much additional need for specifying the context of the query. This could make such a "hammer" easier to try for interested mathematicians, and allow them to explore formal mathematics and Flyspeck. And fourth, Flyspeck is accompanied by an informal (LaTeX) text that is often cross-linked to the formal concepts and theorems developed in HOL Light.
With sufficiently strong automated reasoning over the library, this cross-linking opens the way to experiments with alignment (and eventual semi-automated translation) between the informal and formal Flyspeck texts, using corpus-driven methods for language translation, assisted by such an AI/ATP "hammer" as an additional semantic filter/advisor.

The Rest of the Paper
The work reported here makes several steps towards the above goals:
1. Sound and efficient translations of the HOL Light formulas to several ATP (TPTP) formalisms are implemented (Section 2). This includes the untyped first-order (FOF) format [71], the polymorphic typed first-order (TFF1) format [16], and the typed higher-order (THF) format [27,69].
2. Dependency information is exported from the Flyspeck proofs (Section 3). This allows experiments with re-proving of theorems by 17 different ATPs/SMTs from their HOL Light dependencies, and provides an initial dataset for machine learning of premise selection from previous proofs.
3. Several feature representations characterizing HOL Light formulas are proposed, implemented, and used for machine learning of premise selection. Several preprocessing methods are developed for the dependency data that are used for learning. The trained premise-selection systems are integrated as external advisors for HOL Light. A prototype system answering real-time mathematical queries by running various parallel combinations of the premise selectors and ATPs is built and made available as an online service. See Section 4.
The methods are evaluated in Section 5, and it is shown that by running in parallel the most complementary proof-producing methods on a 14-CPU workstation, one now has a 39% chance to prove the next Flyspeck theorem within 30 seconds in a fully automated push-button mode (without any high-level advice). 50% of the Flyspeck theorems can be re-proved within 30 seconds by a collection of 7 ATP methods (run in parallel) if the HOL Light proof dependencies are used. 56% of the theorems could be proved by the union of all methods tried in the evaluation. An initial analysis of these sets of proofs is given in Section 6. It is shown that the proofs produced by the learning-advised ATPs can occasionally develop ideas that are very different from the original HOL Light proofs, and that the learning-advised ATPs can sometimes produce simpler proofs and discover duplications in the library. Section 7 discusses related work and Section 8 suggests future directions.

Translation of HOL Light Formulas to ATP formats
The HOL logic differs from the formalisms used by most of the existing ATP and SMT systems. The main differences to first-order logic are the use of the polymorphic type system, and higher-order features (guarded by the type system) such as quantification (abstraction) over higher-order objects and currying. On the other hand, the logic is made classical and comes with a straightforward intended interpretation in ZFC. Translation of this logic (and its type-class extension used by Isabelle/HOL) to ATP formalisms has been an active research topic started already in the 90s. Prominent techniques, such as lambda lifting, suitable type system translation methods, etc., have been described several times [12,36,39,40,56]. Therefore this section assumes familiarity with these techniques, and only briefly summarizes the logic and the translation approaches considered, and their particular suitability for the experiments over the HOL Light corpora. For a comprehensive recent overview and discussion of this topic and the issues related to the translation see Blanchette's thesis [12]. In particular, it contains the arguments about the soundness and (in)completeness of the translation methods that we eventually chose.

Summary of the HOL Logic
HOL Light uses the HOL logic [62]: an extended variant of Church's simple type theory [21]. Type variables (implicitly universally quantified) are explicitly added to the language (providing polymorphism), together with arbitrary type operators (constructors of compound types like 'int list' and 'a set'). In the HOL logic, the terms and types are intended to have a standard set-theoretical interpretation in HOL universes. A HOL universe U is a set of non-empty sets, such that U is closed under non-empty subsets, finite products and powersets, an infinite set I ∈ U exists, and a choice function ch over U exists (i.e., ∀X ∈ U : ch(X) ∈ X). The subsets, products, and powersets together also yield function spaces. A frequently considered example of a HOL universe is the set V_{ω+ω} \ {0}, with ch being its (ZFC-guaranteed) selector, and I = ω. The standard U-interpretation of a monomorphic (i.e., free of type variables) type σ is a set σ ∈ U, a polymorphic (i.e., containing type variables) type σ with n type variables is interpreted as a function σ : U^n → U, and the arrow operator observes the standard function-space behavior (lifted to appropriate mappings for polymorphic types) on the type interpretations. The standard interpretation of a closed monomorphic term t : σ is an element of the set σ ∈ U, and a closed polymorphic term (with n type variables) t : σ with σ : U^n → U is interpreted as a (dependently typed) function assigning to each n-tuple [X_1, ..., X_n] ∈ U^n an element of σ([X_1, ..., X_n]). The HOL logic's type signature starts with the built-in nullary type constants ind, interpreted as the infinite set I, and bool (type of propositions), interpreted as a chosen two-element set in U (its existence follows from the properties of a HOL universe). The term signature initially contains the polymorphic constants =_{α→α→bool} and ε_{(α→bool)→α}, interpreted as the equality and selector on each set in U.
The inference mechanisms start with a set of standard primitive inference rules, later adding the axioms of functional extensionality, choice (implying the excluded middle in the HOL setting), and infinity. New type and term constructors can be introduced by simple definitional extension mechanisms, which are in HOL Light also used to introduce the standard logical connectives and quantifiers. The result is a classical logic system that is in practice quite close to set theory, differing from it mainly by the built-in type discipline (allowing also complete automation of abstraction) and by more frequent use of total functions to model mathematical objects. For example, predicates are modelled as total functions to bool on types, and sets are in HOL Light identified with (unary) predicates. The main issues for translation are the type system and the automated reification (abstraction) mechanisms that are not immediately available in first-order logic and may be encoded in more or less efficient and complete ways.

The MESON Translation
An obvious first idea for generating FOL ATP problems from HOL Light problems was to re-use parts of the already implemented MESON tactic. This tactic tries to justify a given goal G with a supplied list of premises P_1, ..., P_n by calling a customized first-order ATP implemented in HOL Light, which is based on the model elimination method invented by Loveland [49], later combined with a Prolog-like search tree [50]. The implementation of the MESON tactic in HOL Light first applies a number of standard translation techniques (such as β-reduction followed by lambda lifting, skolemization, introduction of the apply functor, etc.) that transform the HOL goal (together with the supplied premises) to a clausal FOL goal (or multiple goals). An interesting (and MESON-specific) part of the transformation is a rather exhaustive and heuristic instantiation of the (often polymorphic) premises (called POLY_ASSUME_TAC), described below. The clausal FOL goal is then passed to the core ATP. If the core ATP succeeds, it returns a proof, which is then translated into HOL Light proof steps. The transformation from HOL to FOL is heuristic, incomplete, and tuned for relatively small problems. An interesting feature of MESON is that the core ATP does not treat equality specially (as is quite common in tableau provers), which in turn allows using multiple instantiated versions of equality (e.g., on lists and on real numbers) inside one problem. Such equational separation, when combined with the heuristic instantiation of other polymorphic constants done by MESON, then prevents the core ATP from doing ill-typed inferences without the necessity for any additional type guards. The most interesting part of the translation heuristically instantiates the (possibly polymorphic) premises P_1, ..., P_n and adds them to the goal G. This is done iteratively, building a new temporary goal G_i (where G_0 = G) from each premise P_i and the previous goal G_{i−1} as follows.
All (possibly polymorphic) constants are collected from P_i and G_{i−1}, and the set of all their pairs is created. When such a pair {c_P, c_G} consists of two (symbolically) equal constants, the type of c_P is matched to the type of c_G, and if a substitution σ exists (i.e., Type(c_P)σ = Type(c_G)), it is added to the resulting set Σ_i of type substitutions. Each type substitution from Σ_i is then applied to P_i, and all such resulting instances of P_i are added as assumptions to G_{i−1}, yielding G_i. The set of assumptions of the goal G_i is thus typically greater than that of G_{i−1}, and the same typically holds for the set of constants in G_i, which will be in turn used to instantiate P_{i+1}.
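The iteration just described can be illustrated by a small Python sketch. It is not the HOL Light implementation of POLY_ASSUME_TAC: the tuple encoding of types, the function names, and the toy premises P1 and P2 are assumptions made only for illustration, and a premise is reduced to the list of typed constants it mentions.

```python
# Types: a string starting with "?" is a type variable; a tuple is
# (type_operator, arg1, ..., argN), e.g. ("list", ("real",)).

def match_type(pattern, target, subst=None):
    """One-way matching: return s with pattern*s == target, or None."""
    subst = dict(subst or {})
    if isinstance(pattern, str):                 # type variable
        if pattern in subst:
            return subst if subst[pattern] == target else None
        subst[pattern] = target
        return subst
    if (isinstance(target, str) or pattern[0] != target[0]
            or len(pattern) != len(target)):
        return None
    for p_arg, t_arg in zip(pattern[1:], target[1:]):
        subst = match_type(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst

def apply_subst(ty, subst):
    """Apply a type substitution to a type."""
    if isinstance(ty, str):
        return subst.get(ty, ty)
    return (ty[0],) + tuple(apply_subst(a, subst) for a in ty[1:])

def instantiate_premises(goal_consts, premises):
    """Instantiate each premise against the constants seen so far.

    The constant set grows with every added instance, so later
    premises can be instantiated thanks to earlier ones.
    """
    consts = set(goal_consts)
    instances = []
    for name, p_consts in premises:
        substs = []
        for c_name, c_type in p_consts:
            for g_name, g_type in sorted(consts, key=repr):
                if c_name == g_name:
                    s = match_type(c_type, g_type)
                    if s is not None and s not in substs:
                        substs.append(s)
        for s in substs:
            inst = [(c, apply_subst(t, s)) for c, t in p_consts]
            instances.append((name, inst))
            consts.update(inst)
    return instances, consts

# Hypothetical example: the goal mentions only `length` on real lists.
REAL, NUM = ("real",), ("num",)
goal_consts = {("length", ("fun", ("list", REAL), NUM))}
premises = [
    ("P1", [("length", ("fun", ("list", "?a"), NUM)),
            ("hd", ("fun", ("list", "?a"), "?a"))]),
    ("P2", [("hd", ("fun", ("list", "?b"), "?b"))]),
]
instances, consts = instantiate_premises(goal_consts, premises)
```

In this toy run, P2 can only be instantiated because the instance of P1 has added `hd` on real lists to the constant set. With many premises and many polymorphic constants, the same growth makes the set of substitutions large.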
This procedure is quite effective for the small problems that MESON normally handles. However, for problems with many premises and many polymorphic constants this turns out to be very inefficient. While re-using MESON allowed the quick initial exploration of using external ATPs and advisors described in [44], this inefficiency practically excluded the (seemingly straightforward) use of the unmodified MESON procedure as an (at least basic) translation method for generating ATP problems with many premises. This is why the experiments presented here use different translations, described below.
Completeness of the translation from HOL to FOL is in general hard to achieve in an efficient way. The MESON translation is incomplete in several ways. The goal's proper assumptions are not monomorphised, and the free variables of polymorphic types are not used in the same way as the polymorphic constants. For example, given the premise: and a goal that does not mention α, the premise will never be instantiated to the type present in the goal, and thus will not be usable for MESON.

Translation to the TFF1 and FOF Formats
There is a "simple" solution to the instantiation blow-up experienced with the MESON translation: avoid heuristic instantiation as a pre-processing step, and instead let the ATPs handle it as a part of the ATP problems. This technique is used in the Mizar/MPTP translation [72,73,75], where the (dependent and undecidable) soft type system cannot be separated from the core predicate logic. The relevant heuristics can instead be developed (and experimented with) on the level of ATPs. Indeed, for example the SPASS system includes a number of ATP techniques for both complete and incomplete work with (auto-detected) types [17,84]. This approach has been facilitated in recent years by the development of type-aware TPTP standards such as TFF0, TFF1, and THF, which (unlike related type-aware efforts like DFG [30] and KIF [28]) seem to be more successful in being adopted by ATP and tool developers. In the case of the recent TFF1 standard [16], which adds HOL-like polymorphic types to first-order logic, a translation tool to the FOF and SMT formats was developed in 2012 by Andrei Paskevich as part of the Why3 system [25], simplifying the first experiments with the non-instantiating translation.
The translation to TFF1 proceeds similarly to the MESON translation, but without applying the POLY_ASSUME_TAC. The problem formulas are β-reduced, the remaining lambda abstractions are again removed using lambda lifting, and the apply functor is heuristically introduced. The particular heuristic for this is the one used by Meng and Paulson, i.e., for each higher-order constant c the minimum arity n_c with which it appears in a problem is computed, and the first n_c arguments are always passed to c directly inside the problem. If the constant is also used with more arguments in the problem, apply is used. Blanchette [12] reports that this optimization works fairly well for Isabelle/Sledgehammer, and gives a simple example when it introduces incompleteness. As an example of the translation to the TFF1 format, consider the re-proving problem for the theorem Float.REAL_EQ_INV proved as part of the Jordan curve theorem formalization, whose HOL Light proof is as follows: The dependency tracking (see Section 3.1.2) has found the following dependencies of the theorem: AND_DEF FORALL_DEF IMP_DEF REAL_INV_INV REFL_CLAUSE TRUTH Tactics_jordan.unify_exists_tac_example From these dependencies, only REAL_INV_INV has nontrivial first-order content (a list of the trivial facts has been collected and is used for such filtering). The problem creation additionally adds three facts encoding properties of (HOL) booleans, and also the functional extensionality axiom (EQ_EXT). In the original HOL Light syntax the re-proving problem looks as follows: After applying β-reduction, lambda lifting (none in the example), and introducing the apply functor (called here happ), this is transformed (still as HOL terms) into the following: The application functor happ was only used for the function variables in the extensionality axiom (EQ_EXT). The function inv is always used with one argument in the problem, so it is never wrapped with happ.
Finally, the TFF1 TPTP export declares the signature of the symbols and type operators, and adds the corresponding guarded quantifications to the formulas. The apply functor is called i in the TFF1 export (for concise output in case of many applications in a goal), and it explicitly takes also the type arguments (A and B in aEQu_EXT). This (making the implicit type variables explicit) is in TFF1 done for any symbol that remains polymorphic. We reserve the predicate p for translation between Boolean terms and formulas. This is done in the same way as in [56].
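The minimum-arity heuristic for introducing the apply functor can be sketched in Python as follows. This is a simplified illustration rather than the actual exporter: terms are assumed to be pairs (head, list of arguments), names starting with an uppercase letter are assumed to be variables (which always get arity 0), and the functor is printed as happ.

```python
def min_arities(terms):
    """Minimum number of arguments each head symbol is applied to."""
    arity = {}

    def visit(term):
        head, args = term
        # Assumption: uppercase heads are variables, never applied directly.
        n = 0 if head[0].isupper() else len(args)
        arity[head] = min(arity.get(head, n), n)
        for a in args:
            visit(a)

    for t in terms:
        visit(t)
    return arity


def with_happ(term, arity):
    """Pass the first n_c arguments directly, the rest through happ."""
    head, args = term
    args = [with_happ(a, arity) for a in args]
    n = arity[head]
    out = head if n == 0 else f"{head}({','.join(args[:n])})"
    for extra in args[n:]:
        out = f"happ({out},{extra})"
    return out


# Toy problem: inv is always unary; F and G also occur unapplied.
X = ("X", [])
terms = [
    ("inv", [X]),
    ("=", [("F", [X]), ("G", [X])]),
    ("=", [("F", []), ("G", [])]),
]
arity = min_arities(terms)
print(with_happ(("inv", [X]), arity))   # inv(X)
print(with_happ(("F", [X]), arity))     # happ(F,X)
```

As in the example above, a function that is always used with one argument is never wrapped, while function variables go through the apply functor.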
Problems in this format can be already given to the Why3 tool, which can translate them for various SMT solvers and ATP systems, and call the systems on the translated form. This was initially used both for ATPs working with the FOF format and for the SMTs. Currently, we only use Why3 for preparing problems for Yices, CVC3, and AltErgo. The translation to the FOF format was later implemented independently of Why3, to avoid an additional translation layer for the strongest tools, and in particular to be able to run the ATPs with different parameters and in a proof-producing mode. The procedure is however the same as in Why3, and the resulting FOF form will be as follows.
(~(p(s(bool,f))) <=> p(s(bool,t)))).
fof(aREALu_INVu_INV, axiom, ![X]: s(real,realu_inv(s(real,realu_inv(s(real,X))))) = s(real,X)).
fof(aTRUTH, axiom, p(s(bool,t))).
fof(conjecture, conjecture, ![X, Y]: (s(real,X) = s(real,Y) <=> s(real,realu_inv(s(real,X))) = s(real,realu_inv(s(real,Y))))).
This translation uses the (possibly quadratic) tagging of terms with their types (with "s" as the tagging functor), used, e.g., in Hurd's work.
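A minimal Python sketch of such type tagging is given below. The term encoding (name, type, arguments) and the function name tag are assumptions for illustration; only the tagging functor s and the symbol names are taken from the FOF output shown above.

```python
def tag(term):
    """Wrap every subterm in s(type, ...), innermost subterms first.

    term = (name, type_name, [argument terms])
    """
    name, ty, args = term
    inner = name if not args else f"{name}({','.join(tag(a) for a in args)})"
    return f"s({ty},{inner})"


# Left-hand side of the REAL_INV_INV axiom: inv (inv X) on the reals.
X = ("X", "real", [])
lhs = ("realu_inv", "real", [("realu_inv", "real", [X])])
print(tag(lhs))
```

Since the type is repeated at every subterm, terms with large (compound) types can grow considerably, which is where the possibly quadratic size comes from.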

Translations to Higher-Order Formats
The recently developed TPTP THF standard can be used to encode problems in monomorphic higher-order logic. This allows experimenting with higher-order ATPs like LEO2 [9] and Satallax [19], in addition to the standard ATPs working in the first-order formalism. The translation to THF needs to perform only one step: monomorphisation. As explained in Section 2.2, this is however a nontrivial task, and the MESON tactic approach is already in practice too exhaustive for problems with many premises. After developing the TFF1 and FOF translation, some initial experiments were done to produce a monomorphisation heuristic that behaves reasonably on problems with many premises. This heuristic is now as follows. The constants that can be used to instantiate the premises are extracted only once from the goal at the start of the procedure. Every premise can be instantiated using these goal constants, but the premises themselves are not further used to grow this set. This means that the procedure is even less complete than MESON; however, it is linear in the number of premises, and it is therefore possible to use it even with large numbers of advised premises. In practice, it is rarely the case that a premise can be instantiated in more than one way. A simple example when this happens is in the THF problem created for the theorem I_O_ID, where the particular goal and premise (both properties of the identity function) are as follows: The exact types inferred by the standard HOL (Hindley-Milner [37]) type inference for the goal are as follows: Since the identity function appears in the goal both with the type A→A and with the type B→B, the following two instances of the premise I_THM are created by the THF translation:
% TYPES
thf(ta, type, a : $tType).
thf(tb, type, b : $tType).
thf(ci0, type, i0 : (a > a)).
thf(ci, type, i : (b > b)).
% AXIOMS
thf(aIu_THMu_monomorphized0, axiom, ![X:a]:((i0 @ X) = X)).
thf(aIu_THMu_monomorphized1, axiom, ![X:b]:((i @ X) = X)).
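The goal-driven monomorphisation heuristic can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual THF exporter: types are strings (atomic types, or type variables starting with "?") or tuples for compound types, a premise is reduced to the typed constants it mentions, and the names match and monomorphise are hypothetical.

```python
def match(pattern, target, subst):
    """One-way matching of a premise type against a goal type.

    Extends `subst` in place; the caller discards it on failure.
    """
    if isinstance(pattern, str) and pattern.startswith("?"):
        return subst.setdefault(pattern, target) == target
    if isinstance(pattern, str) or isinstance(target, str):
        return pattern == target
    return len(pattern) == len(target) and all(
        match(p, t, subst) for p, t in zip(pattern, target))


def monomorphise(goal_consts, premises):
    """Collect type substitutions for each premise from goal constants only.

    The goal constants are fixed up front and never grown, so the work
    per premise does not depend on the other premises.
    """
    result = {}
    for name, p_consts in premises:
        substs = []
        for c_name, c_type in p_consts:
            for g_name, g_type in goal_consts:
                s = {}
                if c_name == g_name and match(c_type, g_type, s) \
                        and s not in substs:
                    substs.append(s)
        result[name] = substs
    return result


# The identity function occurs in the goal at types a > a and b > b.
goal_consts = [("i", (">", "a", "a")), ("i", (">", "b", "b"))]
premises = [("I_THM", [("i", (">", "?T", "?T"))])]
result = monomorphise(goal_consts, premises)
print(result)
```

For the I_THM premise this yields two substitutions, one per type of the identity function in the goal, which corresponds to the two monomorphized axioms listed above.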
Finally, while there is no TPTP standard yet for the polymorphic HOL logic, this logic is shared by a number of systems in the HOL family of ITPs. For the experiments described in Section 5.1 Isabelle is used in its CASC 2012 THF mode, but it should be possible to pass the problems to Isabelle directly in some (not necessarily TPTP) polymorphic HOL encoding. This has been tried only to a small extent, and is still future work.

Exporting Theorem Problems for Re-proving with ATPs
In our earlier initial experiments [44], it was found that the ATP problems created from the calls to the MESON tactic in the HOL Light and Flyspeck libraries are very easy for the state-of-the-art ATPs. Some of this easiness might have been caused by the (generally unsound) merging of different polymorphic versions of equality used by MESON into just one standard first-order equality. However, after a manual random inspection it still seemed that the ratio of such unsound proofs is low, and the MESON problems are just too easy. That is why only the set of problems on the theorem level is considered for experiments here. The theorem level seems to be quite similar in the major ITPs: a theorem typically does not correspond to what mathematicians call a theorem, but is rather a self-sufficient lemma with a formal proof of several to dozens (exceptionally hundreds) of lines that can be useful in other formal proofs and hence should be named and exported. Since the ITP proofs can be longer (i.e., they can contain a number of MESON and other subproblems), proving such theorems fully automatically is typically a challenge, which makes such problems suitable for ATP benchmarks, challenges, and competitions.

Collecting Theorems and their Dependency Tracking
In Mizar/MPTP and in Isabelle (done by Blanchette in so far unpublished work) the ATP problems corresponding to theorems can be produced by collecting the dependencies (premises) from the proofs (by suitable tracking mechanisms), and then translating the Premises ⊢ Theorem problem using the methods described in Section 2. The recent work by Adams in exporting HOL Light to HOL Zero [1] (with cross-verification as the main motivation) was initially used to obtain the theorem dependencies for the first experiments with HOL Light in [44], and after that custom theorem-exporting and dependency-tracking mechanisms were implemented as described below.

Collecting and Naming of Theorems
The first issue in implementing such a mechanism is to decide what is considered to be a relevant theorem, and what should be its canonical name. In some ITPs, important statements have labels like lemma, theorem, corollary, etc. This is not the case in HOL Light, which is implemented in the OCaml toplevel. This means that every theorem or tactic is just an OCaml value. Some of those values are assigned names, while some are only created on the fly and immediately forgotten. In the relevant exporting work of Obua [57], every occurrence of the HOL Light command prove is replaced with a command that additionally records the name of the stored object. This strategy was used first, and extended to work with the whole Flyspeck library by also recording the names for the following commands: prove_by_refinement, new_definition, new_recursive_definition, new_specification, new_inductive_definition, define_type, and lift_theorem. This purely syntactic replacement method however turned out to be insufficient for a number of reasons. First, this method does not provide information about the scope of names with respect to the OCaml modules. Second, it does not provide the information whether a name given to a theorem has been declared on the top level, or inline inside a function (which makes such a theorem unusable for proving other theorems on the top level), or even within a function called multiple times with different arguments (in which case the same name would be assigned to a number of different theorems). Finally, certain theorems accessible on the top level are created using other OCaml mechanisms, for example mapping a decision procedure over a list of terms. Syntactically recognizing theorems created in this way turned out to be impractical.
That is why a more robust method has been eventually used, based on the update_database recording functionality by Harrison and Zumkeller. This code accesses the basic OCaml data structures and makes it possible to record the name-value pairs for all OCaml values of the type theorem in a given OCaml state. Thus, it is sufficient to load the whole Flyspeck development, and then invoke this recording functionality.
After some initial experiments with ATP re-proving of the translated problems, this method was however further modified to be able to keep finer track of the use of theorems that are conjunctions of multiple facts. Such (often large) conjunctive theorems are used quite frequently in HOL Light, typically to package together facts that are likely to jointly provide a useful method for dealing with certain concepts or certain kinds of problems. For example the theorem ARITH_EQ packages together ten facts about the equality of numerals as follows: An even more extreme example is the ARITH theorem which conjoins together all the basic arithmetic facts (there are 108 of them in the current version of HOL Light). The conjuncts of such theorems are now also named (using a serial numbering of the form ARITH_EQ_conjunctN), so that the dependency tracking can later precisely record which of the conjuncts were used in a particular proof. This significantly prunes the search space for ATP re-proving of the theorems that previously depended on the large conjunctive dependencies, and also makes the learning data extracted from dependencies of such theorems more precise.
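The naming of conjuncts can be illustrated by the following Python sketch. The formula encoding (nested ("and", left, right) tuples with string atoms) and the function names are assumptions, and so is the choice to start the serial numbering at 0: the text above only fixes the form NAME_conjunctN.

```python
def conjuncts(formula):
    """Flatten a nested conjunction into the list of its conjuncts."""
    if isinstance(formula, tuple) and formula[0] == "and":
        return conjuncts(formula[1]) + conjuncts(formula[2])
    return [formula]


def name_conjuncts(name, formula):
    """Give each conjunct of a conjunctive theorem its own serial name."""
    parts = conjuncts(formula)
    if len(parts) == 1:          # not a conjunction: keep the plain name
        return {name: formula}
    return {f"{name}_conjunct{i}": p for i, p in enumerate(parts)}


# Toy theorem with three conjuncts (placeholders, not the real ARITH_EQ).
named = name_conjuncts("ARITH_EQ", ("and", "p", ("and", "q", "r")))
print(named)
```

A proof that uses only one conjunct can then be recorded as depending on that single named fact instead of on the whole conjunction.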
This method can however result in the introduction of multiple names for a single theorem (which is just a HOL Light term of type theorem). If that happens (for this or other reasons), the first name that was associated with the theorem during the Flyspeck processing is always consistently used, and the other alternative names are never used. Such consistency is important for the performance of the machine learning on the recorded proof data. The list of all theorems and their names obtained in this way is saved in a file, and subsequently used in the dependency extraction and problem creation passes.

Dependency Recording
After the detection and naming of theorems, the recording of proof dependencies is performed, by processing the whole library again with a patched version of the HOL Light kernel. This patched version is the proof-recording component of the new HOL-Import [43], a mechanism designed to transfer proofs from HOL Light to Isabelle/HOL in an efficient way allowing the export of big repositories like Flyspeck. The code for every HOL inference step is patched, to record the newly created theorems. Each theorem is assigned a unique integer counter, and for every new theorem its dependencies on other theorems (integers) are recorded and exported to a file. For every processed theorem it is also checked if it is one of the theorems named in the previous theorem-naming pass. If so, the association of this theorem's name to its number is recorded, and again exported to a file.
After this dependency-recording pass, the recorded information is further processed by an offline program to eliminate all unnamed dependencies (originating for example from having multiple names for a single theorem). For every named theorem its dependencies are inspected, and if a dependency D does not have a name, it is replaced by its own dependencies (there is no unnamed dependency that could not be further expanded). This is done recursively, until all unnamed dependencies are removed. This produces for each named theorem T a minimal (wrt. the original HOL Light proof) list of named theorems that are sufficient to prove T.
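The recursive elimination of unnamed dependencies can be illustrated with a minimal sketch. The data layout (a dictionary from theorem numbers to dependency lists, plus a set of named theorem numbers) and the function name `expand_to_named` are assumptions made for illustration; they are not the format used by the actual offline program.

```python
def expand_to_named(deps, names):
    """Replace unnamed dependencies by their own dependencies, recursively.

    deps:  dict mapping a theorem number to the numbers it depends on
           (hypothetical layout; dependencies always have smaller numbers).
    names: set of theorem numbers that have a name.
    Returns a dict mapping each named theorem to its sorted named dependencies.
    """
    cache = {}

    def named_closure(n):
        # Named dependencies of theorem n, looking through unnamed ones.
        if n in cache:
            return cache[n]
        result = set()
        for d in deps.get(n, []):
            if d in names:
                result.add(d)
            else:
                result |= named_closure(d)
        cache[n] = result
        return result

    return {n: sorted(named_closure(n)) for n in deps if n in names}


# Theorems 3 and 4 are unnamed intermediate results.
deps = {1: [], 2: [1], 3: [1, 2], 4: [3], 5: [4, 2]}
names = {1, 2, 5}
named_deps = expand_to_named(deps, names)
```

For a library of the size of Flyspeck an iterative traversal may be preferable to plain recursion, since long chains of unnamed steps could exceed Python's default recursion limit.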
The numbering of theorems respects the order in which the theorems are processed in the Flyspeck development. This total ordering is compatible with (extends) the partial ordering induced by proof dependencies, and for the experiments conducted here it is assumed to be the chronological order in which the library was developed. The dependency information given in this chronological order for all 16082 named theorems (of which 1897 are (type) definitions, axioms, or their parts, and their dependencies are not exported) obtained by processing the Flyspeck library 18 (and its HOL Light pre-requisites) is available online. 19 Together with suitably chosen characterizations of the theorems (see Section 4.1), this constitutes an interesting new dataset for machine-learning techniques that attempt to predict the most useful premises from the formal library for proving the next conjecture.
17 For example, the MaLARea system does such de-duplication as a useful preprocessing step before learning and theorem-proving is started on a large number of related problems.

The Data Set of ATP Re-proving Problems
Analogously to Mizar and Isabelle, the re-proving ATP problems for the collected named theorems are finally produced by translating the Dependencies ⊢ Theorem problem to the ATP formalisms using the methods described in Section 2, together with basic filtering of dependencies that have trivial first-order content. 1897 of the 16082 named theorems do not have a proof (those are definitions and axioms). For all the remaining 14185 named theorems the corresponding re-proving ATP problems were created, and are available online 20 in the FOF, THF, and TFF1 formats. These problems are used for the ATP re-proving experiments described in Section 5.1. Smaller meaningful datasets will likely be created from this large dataset for ATP/AI competitions such as CASC LTB and Mizar@Turing 21 , analogously to the smaller MPTP2078 22 [4] ATP benchmark created from the ATP-translated Mizar library (MML), and the Judgement Day benchmark [18] created by ATP translation of a subset of the Isabelle/HOL library.
The average, minimum, and maximum sizes of problems in these datasets are shown in Table 1, together with the corresponding statistics for the problems expressed in the original HOL formalism. It can be seen that the number of formulas in the translated problems is typically at most twice the number of the original HOL formulas, i.e., the translations are indeed efficient for all the problems. This was not the case (and became a bottleneck) in the initial experiments using the more prolific MESON translation. There is no increase in the number of formulas when translating from the original HOL-formulated problem to the FOF translation. For the TFF1 and THF translation, formulas declaring the types of the symbols appearing in the problems are added, and for the THF translation multiple instances of the premises can additionally appear. Table 2 shows the total times needed for the various exporting phases run over the whole Flyspeck as explained above. For completeness, the time needed to export characterizations of the theorems for external (e.g., machine-learning) tools is also included (see Section 4.1 for the description of the characterizations that are used). See [48,76] for recent overviews of such methods. ATP problems of this kind are created for Mizar/MML by consistent translation of the whole MML to TPTP, and then letting external premise selection algorithms find the most relevant premises for a given theorem t from the large set of t-allowed premises (typically those theorems and definitions that were already available when t was being proved, expressed, e.g., as TPTP include files). For Isabelle/Sledgehammer, the default premise selection algorithm is implemented inside Isabelle, i.e., it is working on the native Isabelle symbols. Only after the Sledgehammer premise selection chooses the suitable set of premises, the problem is translated to a given ATP formalism using one of the several implemented translation methods. 
In general, the symbol naming is in Isabelle consistent only before the translation is applied, and a particular symbol in two translated problems can have different meanings.
Both the Mizar and the Isabelle approach have some advantages and disadvantages. Optimizing the translation (or using multiple translations) as done in Isabelle can improve the ultimate ATP performance once the premises have been selected. On the other hand, if the whole library is not translated in a consistent manner to a common ATP format such as TPTP, ATP-oriented external premise selection tools like SInE cannot be directly used on the whole library. It could be argued that the SInE algorithm is relatively close to the Sledgehammer premise selection algorithm, and can be easily implemented inside Isabelle. However there are useful premise selection methods for which this is not so straightforward. For example in the MaLARea system, evaluation of premises in a large common pool of finite first-order models is an additional semantic premise-characterization method that improves the overall precision quite significantly [80]. 23 For such a pool of first-order models to be useful, the premises have to use symbols consistently also after the translation to first-order logic. Although various techniques can again be developed to lift this method to the current Sledgehammer translation setting, they seem less straightforward than for example a direct Isabelle implementation of SInE. This discussion currently applies also to the HOL Light ATP translations described in Section 2. For example, the problem-specific optimization of the arity of symbols described in 2.3 will in general cause inconsistency on the symbol level between the FOF translations of two different HOL Light problems.
The procedure implemented for HOL Light is currently a combination of the external, internal, learning, and non-learning premise-selection approaches. This procedure assumes the common ITP situation of a large library of (also definitional) theorems $T_i$ and their proofs $P_i$ (for definitions the proof is empty), over which a new conjecture has to be proved. The proofs refer to other theorems, giving rise to a partial dependency ordering of the theorems extended into their total chronological ordering as described for Flyspeck in 3.1.2. For the experiments it will be assumed that the library was developed in this order. An overview of the procedure is as follows, and its details are explained in the following subsections.
1. Suitable characterizations (see Section 4.1) of the theorems and their proof dependencies are exported from HOL Light in a simple format.
2. Additional dependency data are obtained by running ATPs on the ATP problems created from the HOL Light proof dependencies, i.e., the ATPs are run in the re-proving mode. Such data are often smaller and preferable [47]. These data are again exported using the same format as in (1).
3. The (global, first-stage) external premise selectors preprocess (typically train on) the theorem characterizations and the proof dependencies. Multiple characterizations and proof dependencies may be used.
4. When a new conjecture is stated in HOL Light, its characterization is extracted and sent to the (pre-trained) first-stage premise selectors.
5. The first-stage premise selectors work as rankers. For a given conjecture characterization they produce a ranking of the available theorems (premises) according to their (assumed) relevance for the conjecture.
6. The best-ranked premises are used inside HOL Light to produce ATP (FOF, TFF1, THF) problems. Typically several thresholds (8, 32, 128, 512, etc.) on the number of included premises are used, resulting in multiple versions of the ATP problems.
7. The ATPs are called on the problems. Some of the best ATPs run in a strategy-scheduling mode combining multiple strategies. Some of the strategies always use the SInE (i.e., local, second-stage) premise selection (with different parameters), and some other strategies may decide to use SInE when the ATP problem is sufficiently large.
Loop to improve (2) and (3): It is not an uncommon phenomenon that in the data-improving step (2) (ATP re-proving from the HOL Light proof dependencies) an ATP proof could not be found for some theorem $T_i$, but an alternative proof of $T_i$ can be found from some other theorems preceding $T_i$ in the chronological order (which guards such alternative proofs against cycles). To achieve this, the trained premise selectors can be used also on all theorems that are already in the library, and the whole ATP/training process can be iterated several times to obtain as many ATP proofs as possible, and better (and differently) trained premise selectors for step (3). This is the same loop as in MaLARea.
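Step 6 of the procedure above (cutting one ranking into several premise sets of increasing size) can be sketched as follows. The function name `premise_slices` and the list-of-names representation of a ranking are illustrative assumptions; only the thresholds 8, 32, 128, and 512 come from the text.

```python
THRESHOLDS = (8, 32, 128, 512)  # thresholds named in the text


def premise_slices(ranking, thresholds=THRESHOLDS):
    """Return, for each threshold n, the n best-ranked premises.

    ranking: list of premise names, best first (hypothetical representation).
    If fewer than n premises are available, the slice is simply shorter.
    """
    return {n: ranking[:n] for n in thresholds}


# A toy ranking of 100 available premises.
ranking = [f"thm{i}" for i in range(100)]
slices = premise_slices(ranking)
```

Each slice would then be translated into one version of the ATP problem, so a single conjecture yields one problem per threshold and per target format.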

Formula Characterizations Used for Learning
Given a new conjecture C, how do mathematicians decide that certain previous knowledge will be relevant for proving C? The approach taken in practically all existing premise-selection methods is to extract from such C a number of suitably defined features, and use them as the input to the premise selection for C. The most obvious characterization that already works well in large libraries is the (multi)set of symbols appearing in the conjecture. This can be further extended in many interesting ways, using various methods developed, e.g., in statistical machine translation and web search, but also by methods specific to the formal mathematical domain. In this work, characterization of HOL formulas by all their subterms (found useful in MaLARea) was used, and adapted to the typed HOL logic. For example, the latest version of the characterization algorithm would describe the HOL theorem DISCRETE_IMP_CLOSED: 24 by the following set of strings: "real", "num", "fun", "cart", "bool", "vector_sub", "vector_norm", "real_of_num", "real_lt", "closed", "_0", "NUMERAL", "IN", "=", "&0", "&0 < Areal", "0", "Areal", "Areal^A", "Areal^A -Areal^A", "Areal^A IN Areal^A->bool", " Areal^A->bool", "_0", "closed Areal^A->bool", "norm (Areal^A -Areal^A)", "norm (Areal^A -Areal^A) < Areal" This characterization is obtained by: In the above example, real is a type constant, IN is a term constructor, Areal^A->bool is a normalized type, Areal^A its component type, norm (Areal^A -Areal^A) < Areal is an atomic formula, and Areal^A -Areal^A is its normalized subterm. The normalization of variable names is an interesting topic. It is good if the premise selectors can notice some similarity between two terms with variables, 25 which is hard (when using strings) if the variables have different names. On the other hand, total normalization to just one generic variable name removes also the information that the variables in a particular subterm were (not) equal. 
Also, terms with differently typed variables should be more distant from each other than those with the same variable types. In total, four versions of variable normalization were tested:
syms0: All free and bound variables are given the same name A0. This encoding is the most liberal, i.e., the resulting equality relation on the features is the coarsest one, allowing the premise selectors to see many similarities.
syms: First the free variables are numbered consecutively (A0, A1, etc). Then the bound variables are named with the subsequent numbers. This results in a finer notion of similarity than in syms0.
symst: Every variable is renamed to a textual representation of its type, for example Anum or Areal. This is again finer than syms0, but different from syms. This normalization is used in the above example, and also for most of the premise selection trainings.
symsd: In one symst implementation, the internal HOL Light type variable numbering was accidentally used, thus making most of such term features disjoint between different theorems. The performance was lower, but the method produces some unique solutions and is included in the evaluation.
24 http://mws.cs.ru.nl/~mptp/hol-flyspeck/trunk/Multivariate/topology.html#DISCRETE_IMP_CLOSED
25 One could require the similarity to also handle matching, etc. A simple way how to do it is to generate even more features. This is again left to further general research in this area.
In addition to that, several feature exports included also logic symbols. Various feature characterizations can have different performance on different datasets, and such characterizations can be also combined together in interesting ways. This is a rather large research topic that is left as future work for this newly developed dataset, along with other large-theory datasets. Just including the features encoding the validity in finite models will be interesting.
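The effect of variable normalization on subterm features can be illustrated with a small sketch. The term representation (`Var`, `Const`, `App`) and the string rendering are assumptions made for this example and do not reproduce HOL Light's printing or the actual export code; only the syms0 and symst modes are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str
    ty: str


@dataclass(frozen=True)
class Const:
    name: str


@dataclass(frozen=True)
class App:
    head: str      # name of the applied constant
    args: tuple    # argument terms


def render(t, mode):
    """Render a term as a string with normalized variable names."""
    if isinstance(t, Var):
        # syms0: one generic name; symst: name derived from the type.
        return "A0" if mode == "syms0" else "A" + t.ty
    if isinstance(t, Const):
        return t.name
    return t.head + "(" + ",".join(render(a, mode) for a in t.args) + ")"


def features(t, mode="symst"):
    """Collect symbols and all normalized subterms of t as string features."""
    feats = set()

    def walk(u):
        feats.add(render(u, mode))
        if isinstance(u, App):
            feats.add(u.head)
            for a in u.args:
                walk(a)

    walk(t)
    return feats


lt_real = App("real_lt", (Const("&0"), Var("x", "real")))
lt_num = App("real_lt", (Const("&0"), Var("n", "num")))
```

In this toy setting the two terms share all features under syms0 but differ under symst, which is the trade-off between coarse and fine similarity discussed above.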

Machine Learning of Premise Selection
All the currently used first-stage premise selectors are machine learning algorithms trained in various ways on previous proofs. A number of machine learning algorithms can be experimented with today, and in particular kernel-based methods [4] and ensemble methods [48] have recently shown quite good performance on smaller datasets such as MPTP2078. However, scaling and tuning such methods to a large corpus like Flyspeck and to quite a large number of incremental training and testing experiments is not straightforward. 26 That is why this work so far uses mostly the sparse implementation of a multiclass naive Bayes classifier provided by the SNoW system [20]. SNoW can incrementally train and produce predictions on the whole Flyspeck library presented in the chronological order in an hour (and often considerably faster on minimized data). I.e., one new prediction takes a fraction of a second. In addition to that, several other fast incremental (online) learning algorithms were briefly tried: the Perceptron and Winnow algorithms provided also by SNoW, and a custom implementation of the k-nearest neighbor (k-NN) algorithm. Only k-NN however produced enough additional prediction power. As already mentioned, the first-stage algorithms are often complemented by SInE as a second-stage premise selector when the ATP problem is written, and that is why some (in particular SInE-like) algorithms might look mostly redundant (in the overall ATP evaluation) when used at the first stage. This is obviously a consequence of the particular setup used here. Two kinds of evaluation are possible in this setting and have been used several times for the Mizar data: a pure machine-learning evaluation comparing the predicted premises with the set of known sufficient premises, and an evaluation that actually runs an ATP on the predicted premises. While data are available also for the former, in this paper only the second evaluation is presented, see Section 5.2. 
The main reason for this is that alternative proofs are quite common in large libraries, and they often obfuscate the link between the pure machine-learning performance and the final ATP performance. Measuring the final ATP performance is more costly; however, this has practically stopped being a problem with the recent arrival of low-cost workstations with dozens of CPUs.
At a given point during the library development, the training data available to the machine learners are the proofs of the previously proved theorems in the library. A frequently used approach to training premise selection is to characterize each proof $P_i$ of theorem $T_i$ as a (multi)set of theorems $\{T_{i_1}, \ldots, T_{i_m} \mid T_{i_j} \text{ used in } P_i\}$. The training example will then consist of the input characterization (features) of $T_i$ (see 4.1), and the output characterization (called also output targets, classes, or labels) will be the (multi)set $\{T_i\} \cup \{T_{i_1}, \ldots, T_{i_m} \mid T_{i_j} \text{ used in } P_i\}$. Such training examples can be tuned in various ways. For example the output theorems may be further recursively expanded with their own dependencies, the input features could be expanded with the features of their definitions, various weighting schemes and similarity clusterings can be tried, etc. This is also mostly left to future general research in premise-selection learning. Once the machine learner is trained on a particular development state of the library, it is tested on the next theorem T in the chronological order. The input features are extracted from T and given to the trained learner which then answers with a ranking of the available theorems. This ranking is given to HOL Light, which uses it to produce ATP problems for T with varied numbers of the best-ranked premises.
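The construction of training examples described above can be sketched in a few lines. The dictionaries `features` and `proof_deps` and the function `build_examples` are hypothetical names for illustration; sets are used instead of multisets for simplicity.

```python
def build_examples(order, features, proof_deps):
    """Build one training example per theorem, in chronological order.

    order:      theorem names in chronological order.
    features:   dict name -> iterable of feature strings (input side).
    proof_deps: dict name -> iterable of theorems used in its proof.
    Each example is (name, input features, labels), where the labels are
    the theorem itself together with the theorems used in its proof.
    """
    examples = []
    for name in order:
        labels = {name} | set(proof_deps.get(name, ()))
        examples.append((name, sorted(features[name]), sorted(labels)))
    return examples


order = ["REAL_LT_REFL", "NORM_POS", "DISCRETE_LEMMA"]
features = {
    "REAL_LT_REFL": {"real_lt", "Areal"},
    "NORM_POS": {"vector_norm", "real_lt", "&0"},
    "DISCRETE_LEMMA": {"vector_norm", "real_lt", "closed"},
}
proof_deps = {"DISCRETE_LEMMA": ["NORM_POS", "REAL_LT_REFL"]}
examples = build_examples(order, features, proof_deps)
```

The theorem names and features in the usage example are invented; the tuning options mentioned in the text (recursive expansion of labels, weighting) would change how `labels` is computed.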

Proof Data Used for Learning
An interesting problem is getting the most useful proof dependencies for learning. Many of the original Flyspeck dependencies are clearly unnecessary for first-order theorem provers. For example the definition of the ∧ connective (AND_DEF) is a dependency of 14122 theorems. Other examples are proofs done by decision procedures, which typically first apply some normalization steps justified by some lemmas, and then may perform some standard algorithm, again based on a number of lemmas. Often only a few of such dependencies are needed (i.e., the proofs found by decision procedures are often unnecessarily "complicated").
Some obviously unhelpful dependencies were filtered manually, and this was complemented by using also the data obtained from ATP re-proving (see Section 5.1). Vampire, E, and Z3 can together re-prove 43.2% of the Flyspeck theorems (see Table 5), which is quite a high number, useful for trying to post-process automatically the remaining dependency data or even to completely disregard them. The following approaches to combining such ATP and HOL Light dependencies were initially tried, and combined with the various characterization methods to get the training data:
minweight (default): Always prefer the minimal ATP proof if available. On the ATP re-proved theorems collect the statistics about how likely a dependency in the HOL Light proof is really going to be used by the ATP proof, and use this likelihood as a weight when ATP proof is not available. When the weight is 0, use (cautiously) a minimal weight (0.001 or 0.000001) instead.
nominweight: As minweight, but without a minimal weight. Totally ignore ATP-irrelevant HOL Light dependencies.
v_pref (e_pref, z_pref): Instead of using the minimal ATP proof, always prefer the Vampire (E, Z3) proof. Can be combined with both weighting methods.
symsonly: Ignore all proofs. Learn only on examples saying that a theorem is good for proving itself, i.e., for its feature characterization. The trained system will thus recommend theorems similar to the conjecture, but not the dependencies of such theorems.
atponly: Use only the (minimal) ATP proofs for learning. Ignore the HOL Light proofs completely, and construct only the symsonly training examples for theorems that have no ATP proof. Can be combined with v_pref, e_pref, z_pref.
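One possible reading of the minweight scheme is sketched below. The exact statistic is not spelled out in the text, so the estimate used here (the fraction of re-proved theorems in which a HOL Light dependency also occurs in the ATP proof) is an assumption, as is the choice to give the minimal weight to dependencies never seen in any re-proved theorem. All names are hypothetical.

```python
from collections import Counter


def minweight_labels(hol_deps, atp_deps, min_weight=0.001):
    """Weighted dependency labels in the spirit of the minweight approach.

    hol_deps: dict theorem -> list of HOL Light proof dependencies.
    atp_deps: dict theorem -> set of premises used in its (minimal) ATP proof.
    Returns dict theorem -> {dependency: weight}.
    """
    seen, used = Counter(), Counter()
    for thm, atp in atp_deps.items():
        for d in hol_deps.get(thm, ()):
            seen[d] += 1            # d was offered to the ATP
            if d in atp:
                used[d] += 1        # d was actually used by the ATP proof

    labels = {}
    for thm, hol in hol_deps.items():
        if thm in atp_deps:
            # An ATP proof exists: prefer it, with full weight.
            labels[thm] = {d: 1.0 for d in atp_deps[thm]}
        else:
            # No ATP proof: weight HOL dependencies by estimated usefulness.
            labels[thm] = {
                d: max(used[d] / seen[d] if seen[d] else 0.0, min_weight)
                for d in hol
            }
    return labels


hol_deps = {
    "T1": ["AND_DEF", "A"],
    "T2": ["AND_DEF", "A", "B"],
    "T3": ["AND_DEF", "A", "C"],
}
atp_deps = {"T1": {"A"}, "T2": {"B"}}
labels = minweight_labels(hol_deps, atp_deps)
```

Calling the function with `min_weight=0.0` leaves ATP-irrelevant dependencies at weight zero, which roughly corresponds to the nominweight variant.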
At some point, a pseudo-minimization procedure started to be applied first to the ATP proofs: each proof is re-run only with the premises needed for the proof, and if the number of needed premises decreases, this is repeated until the premise count stabilizes. Often this further removes unnecessary premises that appeared in the ATP proof, e.g., by performing unnecessary rewriting steps. This was later followed by adding cross-minimization: Each proof is re-run (pseudo-minimized) not just by the ATP that found the proof, but by all ATPs. This can further improve the training data, and also raise the number of proofs found by a particular ATP quite considerably, which in turn helps when proofs by a particular ATP are preferable for learning (see the v_pref approach above). Finally, the learning and proving can boost each other's performance: the proofs obtained by using the advice of the first-generation premise selectors can be added to the training data obtained from re-proving, and used to train the second generation of premise selectors. This process can be iterated, but only one iteration was done so far (using two different prediction methods).
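The pseudo-minimization loop can be sketched as follows. The `prove` callback stands in for an external ATP call and is an assumption: it takes a conjecture and a premise set and returns the premises used in the proof found, or `None` on failure. The toy prover exists only to make the example runnable.

```python
def pseudo_minimize(conjecture, premises, prove):
    """Re-run a prover on the premises it used until their number stabilizes.

    premises: premises used in an already known proof of the conjecture.
    prove:    hypothetical callback (conjecture, premise set) -> set of
              premises used in the new proof, or None if no proof is found.
    """
    current = set(premises)
    while True:
        used = prove(conjecture, current)
        # Stop when the re-run fails or no longer reduces the premise count.
        if used is None or len(used) >= len(current):
            return current
        current = set(used)


def toy_prove(conjecture, premises):
    """Toy prover: needs A and B, but also touches half of the extra premises."""
    core = {"A", "B"}
    if not core <= premises:
        return None
    extras = sorted(premises - core)
    return core | set(extras[: len(extras) // 2])


minimal = pseudo_minimize("goal", {"A", "B", "C", "D", "E", "F"}, toy_prove)
```

Cross-minimization would run the same loop with each available ATP as the `prove` callback, starting from a proof found by any of them.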
The summary of the training data obtained by these procedures from the proofs is given in Table 3. Each of the ATP-obtained dependency sets (2-6 in Table 3) could be complemented by the HOL Light dependencies (1) as described above, producing differently trained advisors. For example, the best advising method based on (4) was only using the ATP proofs for training (no adding of HOL Light dependencies when the ATP proof was missing), preferring proofs by E (e_pref), using the symst (types instead of variables) characterizations, and choosing the best 128 premises. The new ATP proofs found using this method were added to (4), resulting in the dataset (5). The next most complementary advising method to that (measured before (5) became available) was combining the ATP dependencies from (2) with the HOL Light dependencies (1) using the nominweight and v_pref techniques, also using the symst features, and choosing the best 512 premises. The new ATP proofs found using this method were added to (5), resulting in (6). The performance of various premise selection methods is discussed in Section 5.

Communication with the Premise Advisors
[Table 3 legend: thm: number of theorems proved by the given prover; dep ø: the average number of proof dependencies in the proofs found by the prover; (1) HOL deps: dependencies exported from HOL Light. Note fragment: ... Table 5. This is due to an incorrect (cyclic) dependency export for about 60 early HOL Light theorems used for (2)-(4). For training premise selection the effect of this error is negligible.]
There are several modes in which external premise selectors can be used. The main mode used here for experiments is the offline (batch) mode. In this mode, the premise selectors are incrementally trained and tested on the whole library dependencies presented as one file in the chronological order. Incremental training/testing means that the learning system reads the examples from the file one by one, for each theorem first producing an advice (ranking of previous theorems) based only on its features, and only after that learning on the theorem's dependencies and proceeding to the next example. The rankings are then used in HOL Light to produce all ATP problems in batch mode. This mode is good for experiments, because the learning data can be analyzed and pre-processed in various ways described above. All communication is fully file-based. Another mode is used for static online advice. In this mode an (offline) pretrained premise selector receives conjecture characterizations from HOL Light over a TCP socket, replies in real time with a ranking of theorems from which HOL Light produces the ATP problems and calls the ATPs to solve them. This mode has been initially implemented as a simple online service and can already be experimented with by interested readers. 27 Finally, in a dynamic online mode the premise selector receives also training data in real time, and updates itself. The currently used learning systems support this dynamic mode, however, in an online service this mode of interaction requires some implementation of access rights, user limits, cloning of developments and services, etc. This is still future work, close to the recent work on formal mathematical editors and wikis [3,42].
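The incremental training/testing used in the offline (batch) mode can be sketched with a deliberately simple ranker. `OverlapRanker` is a toy stand-in invented for this example; it is not the naive Bayes learner of SNoW and not the custom k-NN implementation. What matters is the order of operations: advice first, learning second.

```python
from collections import defaultdict


class OverlapRanker:
    """Toy learner: scores a label by feature overlap with past examples."""

    def __init__(self):
        self.examples = []  # list of (feature set, label set)

    def advise(self, feats):
        scores = defaultdict(float)
        for ex_feats, labels in self.examples:
            overlap = len(feats & ex_feats)
            for lab in labels:
                scores[lab] += overlap
        # Best score first; ties broken alphabetically for determinism.
        return sorted(scores, key=lambda lab: (-scores[lab], lab))

    def learn(self, feats, labels):
        self.examples.append((feats, labels))


def run_incremental(order, features, deps, ranker):
    """For each theorem in chronological order: advise first, then learn."""
    rankings = {}
    for thm in order:
        feats = set(features[thm])
        rankings[thm] = ranker.advise(feats)          # uses earlier theorems only
        ranker.learn(feats, {thm} | set(deps[thm]))   # now reveal the proof
    return rankings


order = ["A", "B", "C"]
features = {"A": {"real", "lt"}, "B": {"num", "lt"}, "C": {"real", "lt", "norm"}}
deps = {"A": [], "B": [], "C": ["A"]}
rankings = run_incremental(order, features, deps, OverlapRanker())
```

Because learning happens only after the advice is produced, a theorem can never be recommended for its own proof, which mirrors the chronological evaluation used in the experiments.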

Experiments
The main testing set in all scenarios is constructed from the 14185 Flyspeck theorems. To be able to explore as many approaches as possible, a smaller subset of 1419 theorems is often used. This subset is stable for all such experiments, and it is constructed by taking every tenth theorem (starting with 0-th) in the chronological ordering.
27 The service runs at colo12-c703.uibk.ac.at on port 8080, example queries are:
A number of first- and higher-order ATPs and SMT solvers were tried on the problems. The most extensively used are Vampire 2.6 (called also V below), modified E 1.6, and Z3 4.0. Proofs are important in the ITP/learning scenarios, so Z3 and E are (unless otherwise noted) run in a proof-producing mode. In particular for Z3 this costs some performance. E 1.6 is not run in its standard auto-mode, but in a strategy-scheduling wrapper 28 used by the MaLARea system in the Mizar@Turing large-theory competition.
This wrapper (called either Epar or just E in the tables below) successively tries 14 strategies provided by the second author. These strategies were developed on the 1000 problems allowed for training large-theory AI/ATP systems before the Mizar@Turing competition [77]. Some of these strategies became available in E 1.6 when it was released after CASC 2012, but E's auto mode is still tuned for the TPTP library, and by default it always uses only one "best" (heuristically chosen) strategy on a problem, shunning so far the temptations of strategy scheduling. Epar outperforms the old auto-mode of E 1.4 by over 20% on the Mizar@Turing training problems, and seems to be competitive with Vampire 2.6. Fifteen more systems (or their versions) were tried to a lesser extent (typically on the 1419-problem subset) in the experiments. Some of these systems perform very well, and might be used more extensively later. Sometimes an additional effort is needed to make systems really useful; for example, proof/premise output might be missing, or additional mapping to a system's constructs needs to be done to take full advantage of the system's features. In this work such customizations are avoided. All the systems used are alphabetically listed in Table 4. Unless specified otherwise, all systems are run with a 30s time limit on a 48-core server with AMD Opteron 6174 2.2 GHz CPUs, 320 GB RAM, and 0.5 MB L2 cache per CPU. Each problem is always assigned one CPU. In the tables below, basic statistics are often computed about the population of the methods used: Unique solutions found by each method, and its State of the art contribution (SOTAC) as defined by CASC. 30 A system's CASC-defined SOTAC will be highest even if the system solved only one problem (which no other system solved). That is why also the Σ-SOTAC value is used: the sum of a system's SOTAC over all problems attempted.
These metrics often indicate how productive it is to add a particular system or its version to a population of systems. Often it is interesting to know the best joint performance when running N methods in parallel. Finding such a best combination is however an instance of the classical NP-hard Maximum Coverage problem. While it is often possible to use SAT solvers to get an optimal solution, a greedy algorithm is always consistently used to avoid problems when scaling to larger datasets. This also allows us to present this joint performance as a simple greedy (covering) sequence, i.e., a sequence that starts with the best system, and each next system in such sequence is the system that greedily adds most solutions to the union of solutions of the previous systems in the sequence. Table 5 shows the results of running Vampire, Epar, Z3, and Paradox on the 14185 FOF problems constructed from the HOL Light proof dependencies. The ATP success rate measured on such problems is useful as an upper estimate for the ATP success rate on the (potentially much larger) problems where all premises from the whole previous library are allowed to be used. This success rate can be used later to evaluate the performance of the algorithms that select a smaller number of the most relevant premises. The time limit for Vampire, Epar and Z3 was relatively high (900s), because particularly Vampire benefits from higher time limits (compare with Table 7) and the ATP proofs found by re-proving turn out to be more useful for training premise selectors than the original HOL Light dependencies (see Section 5.2). Paradox was run for 30 seconds to get some measure of the incompleteness of the FOF translation. The systems in Table 5 are already ordered using the greedy covering sequence, i.e., the joint performance of the top two systems is 41.9%, etc. 
The counter-satisfiability detected by Paradox is not by default included in the greedy sequence, since its goal is to find the strongest combination of proof-finding systems. The Paradox results are however included in the SOTAC and Unique columns. Table 6 shows these results restricted to the 1419-problem subset. This provides some measure of the statistical error encountered when testing systems on the smaller problem set, and also a comparison of the systems' performance under high and low time limits used in Table 7. Table 7 shows all tested systems on the 1419-problem subset, ordered by their absolute performance, and Table 8 shows the corresponding greedy ordering. The tested systems include also SMT solvers that use the TFF1 encoding and higher-order provers using the THF encoding. This is why it is no longer possible to aggregate the counter-satisfiability results (particularly found by Paradox) with the theoremhood results, and all the derived statistics are only computed using the Theorem column.
29 For the experiments that produce proof dependencies (useful, e.g., for learning), we have used the (so far experimental) version of LEO2 that fully reconstructs the original dependencies (the "po 2" option). For the rest of the experiments (where proofs are not needed), the standard version of LEO2 is used. This version is also proof-producing, but some additional work is still needed to extract the original proof dependencies. This work is currently being done by the LEO2 developers. The "po 1" version outperforms the "po 2" version.
30 For each problem solved by a system, its SOTAC for the problem is the inverse of the number of systems that solved the problem. A system's overall SOTAC is the average SOTAC over the problems it solves.
While Vampire does well with high time limits in Table 5, it is outperformed by Z3 and E-based systems (Epar, E 1.6, LEO2-po1) when using only 30 seconds (which seems more appropriate for interactive tools than 300 or even 900 seconds). This suggests that the strategy scheduling in Vampire might benefit from further tuning on the Flyspeck data. Z3 is not run in the proof-producing mode in this experiment, which improves its performance considerably. It is not very surprising (but still evidence of solid integration work) that Isabelle performs best, as it already combines a number of other systems; see its CASC 2012 description 31 for details. An initial glimpse at Isabelle's unique solutions also shows that 75% of them are found by the recent Isabelle-specific additions (such as hard sorts) to SPASS [17] and its tighter integration with Isabelle. This is evidence that pushing such domain knowledge inside ATPs (as done recently also with the MaLeCoP prototype [82]) might be quite rewarding. The joint performance of all systems tested is 50.2% when Isabelle is included, and 47.4% when only the base systems are allowed. This is quite encouraging, and for example the counter-satisfiability results suggest that additional performance could be gained by further (possibly heuristic/learning) work on alternative translations. Pragmatically, the joint re-proving performance also tells us that when used in the MESON-tactic mode with premises explicitly provided by the users, a parallel 9-CPU machine running the nine systems from Table 8 will within 30 seconds (of real time) prove half of the Flyspeck theorems without any further interaction.
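The statistics used in the tables of this section (the greedy covering sequence, SOTAC, and Σ-SOTAC) can be computed as in the following sketch. The input layout (a dictionary from system names to sets of solved problems) and the function names are assumptions for illustration; the definitions follow the descriptions given in the text and in footnote 30.

```python
def greedy_cover(solved):
    """Greedy covering sequence: [(system, cumulative number of solutions)].

    solved: dict system -> set of problems it solved (hypothetical layout).
    Starts with the best system; each next system adds the most new solutions.
    """
    covered, sequence = set(), []
    remaining = dict(solved)
    while remaining:
        # Ties are broken alphabetically (max keeps the first maximal item).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining system adds anything new
        covered |= remaining.pop(best)
        sequence.append((best, len(covered)))
    return sequence


def sotac(solved):
    """Return dict system -> (SOTAC, sigma-SOTAC).

    Per-problem SOTAC is 1 / (number of systems solving the problem);
    SOTAC is its average over the problems a system solves, and
    sigma-SOTAC is its sum.
    """
    solvers = {}
    for probs in solved.values():
        for p in probs:
            solvers[p] = solvers.get(p, 0) + 1
    result = {}
    for system, probs in solved.items():
        total = sum(1.0 / solvers[p] for p in probs)
        result[system] = (total / len(probs) if probs else 0.0, total)
    return result


solved = {"V": {1, 2, 3, 4}, "E": {3, 4, 5}, "Z": {6}}
sequence = greedy_cover(solved)
contrib = sotac(solved)
```

In this invented example the system Z solves a single problem that nobody else solves and therefore gets the highest SOTAC, while V has the highest Σ-SOTAC; this is the reason both values are reported.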

Using External ATPs to Prove Theorems with Premise Selection
As described in Section 4, a number of approaches and parameters influence the training of the premise selectors. These parameters were gradually (but not exhaustively) explored, typically on the 1419-problem subset. Several times the underlying training data changed quite significantly as a result of the data-improving passes described in Section 4.3. Some of these passes evaluated the best prediction methods developed so far on all 14185 problems.
All the experiments were limited to Vampire, Epar, and Z3. For most of the experiments (and unless otherwise noted) the first-stage premise selection is used to create problems with 8, 32, 128, and 512 premises. This slicing (i.e., taking the first N premises) can be later fine-tuned, as done below in Section 5.2.2 for the best premise selection method. Table 9 shows an initial evaluation of 16 different learning combinations trained on ATP proofs obtained in 300s, complemented by the HOL Light proof dependencies (the second and first pass in Table 3). The two exceptions are the symst+symsonly combination, which ignores all proofs, and the syms+old+000001 combination, which uses older ATP-re-proving data obtained by running each ATP only for 30s (about 700 fewer ATP proofs). Each row in Table 9 is a union of twelve 30s ATP runs: Vampire, Epar, and Z3 used on the 8, 32, 128, and 512 slices. After this initial evaluation, the symst (types instead of variables) characterization was preferred, trivial symbols were always pruned out, and Winnow and Perceptron were left behind. It is of course possible that some of these methods are useful as a complement to better methods. Preferring Vampire proofs helps the learning a bit, for reasons that are not yet understood. To get the joint 39.5% performance, in general 192-fold (= 16 × 12) parallelization is needed. This number could be reduced, but first better training methods were considered.
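The slicing scheme and the resulting run counts described above can be sketched as follows. This is only an illustration under stated assumptions: the names `slice_premises` and `atp_jobs`, the job tuple layout, and the placeholder premise names are hypothetical and not taken from the actual experimental scripts.

```python
from itertools import product

SLICES = (8, 32, 128, 512)        # premise counts used for most experiments
ATPS = ("Vampire", "Epar", "Z3")  # the three ATPs used in this section


def slice_premises(ranked_premises, n):
    """Take the first n premises of a ranking (best-ranked first)."""
    return ranked_premises[:n]


def atp_jobs(ranked_premises, atps=ATPS, slices=SLICES):
    """Build one (atp, slice size, premises) job per ATP/slice pair."""
    return [
        (atp, n, slice_premises(ranked_premises, n))
        for atp, n in product(atps, slices)
    ]


# Hypothetical ranking produced by one premise selection method.
ranking = [f"thm_{i}" for i in range(1000)]
jobs = atp_jobs(ranking)

runs_per_method = len(jobs)         # 3 ATPs x 4 slices
total_runs = 16 * runs_per_method   # 16 learning combinations
```

With three ATPs and four slices each learning combination needs twelve runs, so the 16 combinations of Table 9 account for the 192-fold parallelization mentioned above.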

Further Premise Selection Improvements
Complementing the ATP dependency data with the (possibly discounted) HOL Light dependencies seems to be a plausible method. Even if the HOL Light dependencies are very redundant, the redundancies should be weighted down by the information learned from the large number of ATP proofs, and the remaining HOL Light dependencies should be in general more useful than no information at all. A possible explanation of why this approach might still be quite suboptimal (in the ATP setting) is that the HOL Light proofs are often not a good guidance for the ATP proofs, and may push the machine learners in a direction that is ATP-infeasible. A small hint that this might be the case is the good performance of the nominweight method in Table 9. This method completely ignores all HOL Light dependencies that were never used in previous ATP proofs. This suggested testing the more radical atponly approach, in which only the ATP proofs are used for training. This approach improved the best method from 29.4% to 31.9%, and added 25 newly solved problems (1.8%) to those solved by the 16 methods in Table 9. These results motivated further work on getting as many (and as minimal) ATP proofs as possible, producing the methods tagged as m10, m10u and m10u2 in the tables below. These methods were trained on the proofs obtained by 10-minute (hence m10) ATP runs that were further upgraded by the advised proofs as described in Section 4.3. The best m10u2 method raised the performance by a further 0.5%, and the learning on the advised proofs in different ways made these methods again quite orthogonal to the previous ones. Even though Winnow and Perceptron performed poorly (as expected from earlier unpublished experiments with MML), they added some new solutions.
This motivated one simple additional experiment with the classic k-nearest neighbor (k-NN) learner, which computes for a new example (conjecture) the k nearest (in a given feature distance) previous examples and ranks premises by their frequency in these examples. This is a fast ("lazy" and trivially incremental) learning method that can be easily parameterized and might for some parameters behave quite differently from naive Bayes. For large datasets a basic implementation gets slow in the evaluation phase, but on the Flyspeck dataset this was not yet a problem and full training/evaluation processing took about the same time as naive Bayes. Table 10 shows the performance of three differently parameterized k-NN instances, and Table 11 shows 8 different k-NN-based methods that together prove 29% of the problems. As expected, k-NN performs worse than naive Bayes, but much better than Winnow and Perceptron. The 160-NN and 40-NN methods indeed produce somewhat different solutions, and they are sufficiently orthogonal to the previous methods and both contribute to the performance of the final best mix of 14 prediction/ATP methods.
Fig. 1 shows how the ATP performance changes when using different numbers of the best-ranked premises. This is again evaluated in 30 seconds on the 1419-problem subset, i.e., Vampire's performance is likely to be better (compared to [...] premises. Thus, 512 premises seems to be the current "margin of error" for the first-stage premise selection that can be (at least to some extent) offset by using SInE at the second stage. Table 12 shows for this premise selection method the joint performance (in greedy steps) of all premise slicings, when for each slice the union of all ATPs' solutions is taken. Only 17 slices are necessary (when using the greedy approach); the remaining 8 slices do not contribute more solutions.
In general, this union would take 3 × 17 = 51 ATP runs; however, only 28 ATP runs are actually required to achieve the maximum 36.4% performance. These runs are not shown in full here, and instead only the first 14 runs that yield 35% are shown in Table 13. Assuming a 14-CPU server, 35% is thus the 30-second performance when using only one (the best) premise selection method.
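The k-NN premise ranking described earlier in this section (find the k nearest previous examples, then rank premises by how often they occur in them) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: features are modeled as plain sets of symbols, the Jaccard distance stands in for the unspecified "given feature distance", ties are broken alphabetically, and all theorem and feature names are hypothetical. It is not the implementation evaluated in Tables 10 and 11.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Feature distance between two feature sets (an assumed choice)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_rank_premises(conjecture_features, examples, k):
    """Rank premises for a conjecture by their frequency among the
    k nearest previously proved examples.

    examples is a list of (feature set, premises used in the proof).
    """
    # The k previous examples closest to the conjecture
    # (sorted() is stable, so ties keep the library order).
    nearest = sorted(
        examples,
        key=lambda ex: jaccard_distance(conjecture_features, ex[0]),
    )[:k]

    # Count in how many of the neighbours each premise was used.
    counts = Counter()
    for _, premises in nearest:
        counts.update(set(premises))

    # Most frequent first; alphabetical order breaks ties deterministically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [premise for premise, _ in ordered]


# Hypothetical toy library of earlier theorems.
examples = [
    ({"real", "vector", "norm"}, ["lemma_norm", "lemma_add"]),
    ({"real", "norm", "abs"}, ["lemma_norm", "lemma_abs"]),
    ({"set", "finite"}, ["lemma_finite"]),
]
ranking = knn_rank_premises({"real", "norm"}, examples, k=2)
```

Because nothing is precomputed, adding a newly proved theorem only means appending one more pair to `examples`, which is the "lazy" and trivially incremental behavior noted above. The cost is paid at query time, where a basic implementation like this one scans the whole library.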

The Final Combination and Higher Time Limits
It is clear that the whole learning/ATP AI system can (and will) be (self-)improved in various interesting ways and for a long time. 32 When the number of small-scale evaluations reached several hundred and the main initial issues seemed corrected, an overall evaluation of the (greedily) best combination of 14 methods was done on the whole set of 14185 Flyspeck problems using a 300s time limit. These 14 methods together prove 39% of the theorems when given 30 seconds in parallel (see Table 14), which is also how they are run in the online service. The large scale evaluation is shown in Table 15 and Table 16. Table 15 sorts the methods by their 300s performance, and Table 16 computes the corresponding greedy covering sequence. Comparison with Table 14 shows that raising the CPU time to 300s helps the individual methods (2.7% for the best one), but not so much the final combination (only 0.3% improvement).
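The greedy covering sequences used in several of the tables can be illustrated by the following Python sketch. It shows the standard greedy set-cover heuristic: repeatedly pick the method that adds the most not-yet-solved problems. The function name, the alphabetical tie-breaking, and the toy data are assumptions for this illustration; the tables were not necessarily produced by this exact procedure.

```python
def greedy_cover(solved_by):
    """Compute a greedy covering sequence.

    solved_by maps a method name to the set of problems it solves.
    Returns a list of (method, newly solved, total solved so far),
    stopping when no remaining method adds a new solution.
    """
    remaining = dict(solved_by)
    covered = set()
    sequence = []
    while remaining:
        # Method with the largest gain; iterating over sorted names
        # makes max() break ties alphabetically.
        best = max(sorted(remaining), key=lambda m: len(remaining[m] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # the remaining methods contribute nothing new
        covered |= remaining.pop(best)
        sequence.append((best, gain, len(covered)))
    return sequence


# Hypothetical toy data: four methods over six problems.
toy = {
    "m1": {1, 2, 3, 4},
    "m2": {3, 4, 5},
    "m3": {1, 2},
    "m4": {5, 6},
}
steps = greedy_cover(toy)
```

In the toy data, m2 is individually stronger than m4 but is passed over, because after m1 is chosen it adds only one new problem while m4 adds two. This is the same effect that lets a small set of orthogonal methods achieve most of the joint performance.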

Union of Everything
Table 17 shows the "union of everything", i.e., the union of problems (limited to the 1419-problem subset) that could either be proved by an ATP from the HOL Light dependencies or by the premise selection methods. Together with Table 7 and Table 8 this also shows how much the ATP proofs obtained by premise selection methods complement the ATP proofs based only on the HOL Light dependencies. The methods' running times are not comparable: the re-proving used 30s for each system, while the data for advised methods are aggregated across E, Vampire and Z3, and across the four premise slicing methods. This means that they in general run in 12 × 30 seconds (although typically only one or two slices are needed for the final joint performance). The number of Flyspeck theorems that were proved by any of the many experiments conducted is thus 56.5% when Isabelle is considered, and 54.7% otherwise.
There are 6162 theorems that can be proved by either Vampire, Z3 or E from the original HOL dependencies. Their collection is denoted as Original. There are 5580 theorems (denoted as Advised) that can be proved by these ATPs from the premises advised automatically. It is interesting to see how these two sets of ATP proofs compare. In this section, a basic comparison in terms of the number of premises used for the ATP proofs is provided. A more involved comparison and research of the proofs using the proof-complexity metrics developed for MML in [5] is left as interesting future work. The intersection of Original and Advised contains 4694 theorems. Both sets of proofs are already minimized as described in Section 4.3. The proof dependencies were extracted 33 and the number of dependencies was compared. The complete results of this comparison are available online, 34 sorted by the difference between the length of the Original proof and the Advised proof.
To make it easier to explore the differences described in the next subsections, the Flyspeck and HOL Light Subversion repositories were merged into one (git) repository, and (quite imperfectly) HTML-ized 35 by a simple heuristic Perl script. A simple CGI script 36 can be used to compare the dependencies needed for the (minimized) advised ATP proof with the dependencies needed for the ATP proof from the original HOL Light premises, and also with the actual HOL Light proof.

Theorems Proved Only with Advice
The list of 885 theorems proved only with advice is available online 37 sorted by the number of necessary premises. The last theorem in this order (CROSS_BASIS_NONZERO) 38 used 34 premises for the advised ATP proof, while its HOL Light proof is just a single invocation of the VEC3_TAC tactic 39 (which however brings in 121 HOL Light dependencies, making re-proving difficult). The following two short examples show how the advice can sometimes get simpler proofs.
1. Theorem FACE_OF_POLYHEDRON_POLYHEDRON states that a face of a polyhedron (defined in HOL Light generally as a finite intersection of half-spaces) is again a polyhedron:
∀s:real^N→bool c. polyhedron s ∧ c face_of s ⟹ polyhedron c
The HOL Light proof 40 takes 23 lines and could not be re-played by ATPs, but a much simpler proof was found by the AI/ATP automation, based on (a part of) the FACE_OF_STILLCONVEX theorem: a face t of any convex set s is equal to the intersection of s with the affine hull of t. To finish the proof, one needs just three "obvious" facts: Every polyhedron is convex (POLYHEDRON_IMP_CONVEX), the intersection of two polyhedra is again a polyhedron (POLYHEDRON_INTER), and affine hull is always a polyhedron (POLYHEDRON_AFFINE_HULL):
1. Theorems COMPLEX_MUL_CNJ 42 and COMPLEX_NORM_POW_2 stating the equality of squared norm to multiplication with a complex conjugate follow easily from each other (together with the commutativity of complex multiplication COMPLEX_MUL_SYM). The proof of COMPLEX_MUL_CNJ in HOL Light (below) re-uses the longer proof of COMPLEX_NORM_POW_2. The advised ATP proof directly uses COMPLEX_NORM_POW_2, but (likely because COMPLEX_MUL_SYM was never used before) first unfolds the definition of complex conjugate and then applies commutativity of real multiplication.
let COMPLEX_MUL_CNJ = prove ('∀z. cnj z * z = Cx(norm(z)) pow 2 ∧ z * cnj z = Cx(norm(z)) pow 2', GEN_TAC
3. Theorem NEGLIGIBLE_CONVEX_HULL_3 44 states that the convex hull of three points in R^3 is a negligible set. In HOL Light this is proved from the general theorem NEGLIGIBLE_CONVEX_HULL stating this property for any finite set of points in R^n with cardinality less or equal to n. Instead of justifying this precondition, a shorter proof is found by the advised ATP that saw an analogous theorem about the affine hull, the inclusion of the convex hull in the affine hull, and the preservation of negligibility under inclusion.
5. An example of the reverse phenomenon (i.e., the advised proof is more complicated than the original) is theorem BOUNDED_CLOSURE_EQ 46 saying that a set in R^n is bounded iff its closure is bounded. The harder direction of the equivalence was already available as theorem BOUNDED_CLOSURE, and was used both by the HOL Light and the advised proof. The easier direction was in HOL Light proved by theorems CLOSURE_SUBSET and BOUNDED_SUBSET saying that any set is a subset of its closure and any subset of a bounded set is bounded. The advised proof instead went through a longer path based on theorems CLOSURE_APPROACHABLE, IN_BALL and CENTRE_IN_BALL to show that every element in a set is also in its closure, and then unfolded the definition of bounded and showed that the bound on the norms of closure elements can be used also for the original set.
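For comparison with the longer advised proof, the two-lemma HOL Light argument for the easier direction of BOUNDED_CLOSURE_EQ can be written out as the following chain. This is a sketch in informal notation: the lemma statements are paraphrased from their descriptions above rather than copied from the HOL Light sources, and the variable names s and t are chosen here for illustration.

```latex
\begin{align*}
  & s \subseteq \mathrm{closure}(s)
      && \text{(CLOSURE\_SUBSET)} \\
  & \mathrm{bounded}(t) \land s \subseteq t \implies \mathrm{bounded}(s)
      && \text{(BOUNDED\_SUBSET)} \\
  & \mathrm{bounded}(\mathrm{closure}(s)) \implies \mathrm{bounded}(s)
      && \text{(instantiate } t := \mathrm{closure}(s)\text{)}
\end{align*}
```

The advised proof reaches the same conclusion but re-derives the first line from CLOSURE_APPROACHABLE, IN_BALL and CENTRE_IN_BALL and unfolds the definition of bounded instead of using BOUNDED_SUBSET.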

Remarks
The average number of HOL Light proof dependencies restricted to the set of theorems re-proved by ATPs is 34.54, i.e., there are on average about nine times more dependencies in a HOL Light proof than in the corresponding ATP proof (see Table 3). This perhaps casts some light on how learning-assisted ATP currently achieves its performance. A large human-constructed library like Flyspeck is often dense/redundant enough 47 to allow short proofs under the assumption of perfect (and thus inhuman) premise selection. Such short proofs can be found even by the quite exhaustive methods employed by most of the existing ATPs. The smarter the premise selection and the stronger the search inside the ATPs, the greater the chance that such proofs will end up inside the ATP's time-limited search envelope. The outcome of using such advisors extensively could be "better-informed" mathematics that has shorter proofs which use a variety of lemmas much more than the basic definitions and theorems. Whether such mathematics is easier for human consumption is not clear. Already now mathematical texts sometimes optimize proof length by lemma re-use to an extent that may make the underlying ideas less visible. Perhaps this is just another case where the strong automation tools will eventually help to understand how human cognition works.
46 http://mws.cs.ru.nl/~mptp/cgi-bin/browseproofs.cgi?refs=BOUNDED_CLOSURE_EQ
47 As long as such libraries are human-constructed, they will remain imperfectly organized and redundant. No "software engineering" or other approach can prevent new shortcuts from being found in mathematics, unless an exhaustive (and infeasible) proof minimization is applied.
The ATP search is quite unlike the much less exhaustive search done by decision procedures, and also unlike the human proofs, where the global economy of dependencies is not so crucial once a fuzzy high-level path to the goal gets some credibility. Both the human and the decision-procedure proofs result in more redundant ("sloppier") proofs, which can however be more involved (complicated) than what the ATPs can achieve even with optimal premise selection. Learning such (precise or fuzzy) high-level pathfinding is an interesting next challenge for large-theory AI/ATP systems. With the number of proofs and theory developments to learn from available now in the HOL/Flyspeck, Mizar/MML, and Isabelle corpora, and the already relatively strong performance of the "basic" AI/ATP methods that are presented in this paper, these next steps seem to be worth a try.
[...] developed recently and evaluated over large-theory benchmarks and competitions like CASC LTB and Mizar@Turing. A comprehensive comparison of ATP and Mizar proofs was recently done in [5]. As here, the average number of Mizar proof dependencies is higher than the number of ATP dependencies; however, the difference is not as striking as for HOL Light (a very different method is used to get the Mizar dependencies).
The work described here adds HOL Light and Flyspeck to the pool of systems and corpora accessible to large-theory AI/ATP methods and experiments. A number of large-theory techniques are re-used, sometimes the Mizar, Isabelle and CASC LTB approaches are combined and adapted to the HOL Light setting, and some of the techniques are taken further. The theorem naming, dependency export, problem creation, and advising required newly implemented HOL Light functions. The machine learning adds k-nearest neighbor, and the feature characterization was improved by replacing variables in terms with their HOL types. A MaLARea-like pass interleaving ATP with learning was used to obtain as many ATP proofs as possible, and the proofs were postprocessed by pseudo- and cross-minimization. Unlike in MaLARea, this was done in a scenario that emulates the growth of the library, i.e., no information about the proofs of later theorems was used to train premise selection for earlier theorems. Motivated by the recent experiments over the MPTP2078 benchmark, the machine learning was complemented by various SInE strategies used by E and Vampire. The strategy-scheduling version of E using the strategies developed for Mizar@Turing was tested for the first time in such a large evaluation. A significant effort was spent to find the most orthogonal ingredients of the final mix of premise selectors and ATPs: in total 435 different combinations were tested. The resulting 39% chance of proving the next theorem without any user advice is a landmark for a library of this size. While a similar number was achieved in [48] on the much smaller MPTP2078 benchmark with a lower time limit, only an 18% success rate was recently reported in [5] for the whole MML in this fully push-button mode. 49 None of those evaluations however combined as many methods as here.
The improvement over the best method (proving 24.1% of the theorems in 30s and 26.8% in 300s) shows that such combinations significantly improve the usability of large-theory ATP methods for the end users.

Future Work
Stronger machine learning (kernel/ensemble, etc.) methods and more suitable characterizations (e.g., addition of model-evaluation features and more abstract features) are likely to further improve the performance. The prototype online service could be made customizable by learning from users' own proofs. So far only three ATPs are used by the service, but many other systems can eventually be added, possibly with various custom mappings to their logics. The translation methods can be further experimented with: either to get a symbol-consistent first-order translation (to allow, e.g., the model-evaluation features), or to get less incomplete translations. Proof reconstruction is currently work in progress. A simple and obvious approach is to try MESON with the minimized set of dependencies.
When it is ready, unsound translations can be added to the pool of methods as was originally done in Isabelle [56]. Training ATP-internal guidance on the corpus for prototype learning/ATP systems like MaLeCoP will be interesting, and perhaps also further tuning of ATP strategies for systems like E.
The power of the combined system probably already now makes it interesting as a complementary semantic aid/filter for first experiments with statistical translation methods between the informal Flyspeck text and the Flyspeck formalization. The cases of machine translation (as in Google Translate) and natural-language query answering (as in IBM Watson) have recently demonstrated the power of large-corpus-driven methods to automatically learn such translation/understanding layers from uncurated imperfect resources such as Wikipedia. In other words, large bodies of mathematics (and exact science) such as arXiv.org are unlikely to become computer-understandable by the current painstaking human encoding efforts and additions of further and further logic complexity layers that increase the formalization barrier both for humans and AI systems. Large-scale (world-knowledge-scale) formalization for (mathematical) masses is hard to imagine as one large "perfectly engineered" knowledge base in which everyone will know perfectly well where their knowledge fits. Such attempts seem to be as doomed as the initial attempts (in the Stone Age of the Internet) to manually organize the World Wide Web in one concise directory. Gradual world-scale formalization seems more likely to happen through simpler logics that can be reasonably crowd-sourced (e.g., as Wikipedia was), assisted by AI (learning/ATP) methods continuously training and self-improving on cross-linked formal/semiformal/informal corpora expressed in simple formalisms that can be reasonably explained to such automated/AI methods.

Acknowledgments
Tom Hales helped to start the MESON exporting work at CICM 2011, and his interest as a leader of Flyspeck has motivated us. Thanks to John Harrison for discussing MESON and related topics. Mark Adams gave us his HOL Light proof export data for HOL Zero, which made it easy to start the initial re-proving and learning experiments. Piotr Rudnicki has made his Mizar workstation available for a number of experiments (and his enthusiasm and support will be sorely missed). Andrei Paskevich has implemented the Why3 bridge to the TFF1 format just in time for our experiments. Finally, this work stands on the shoulders of many ATP and ITP (in particular HOL and Isabelle) developers, and tireless ATP competition organizers and standards producers. We are thankful to the JAR referees, PAAR 2012 referees and also to Jasmin Blanchette for their extensive comments on the early versions of this paper.