1 Introduction

Formalizing mathematics in proof assistants is an ambitious and demanding undertaking. A major challenge in constructing formal proofs of theorems that depend on multiple other results is the need for good familiarity with the structure and contents of the library. Tools for helping users search through formal libraries are currently limited.

In the case of the Lean proof assistant [13], users may look for relevant lemmas in its formal library, mathlib [5], by (1) using general textual search tools and keywords, (2) browsing the related source files manually, or (3) using mathlib’s built-in lemma-suggestion tactics.

Approaches (1) and (2) are often slow and tedious. The limitation of approach (3) is that these tactics only propose lemmas that strictly match the goal at the current proof state. This is often very useful, but it also means that they frequently fail to direct the user to relevant lemmas that do not match the current goal exactly. They may also suggest too many trivial lemmas if the goal is simple.

The aim of this project is to make progress towards improving the situation of a Lean user looking for relevant lemmas while building proofs. We develop a new tool that efficiently computes a ranking of potentially useful lemmas selected by a machine learning (ML) model trained on data extracted from mathlib. This ranking can be accessed and used interactively via a dedicated tactic.

The project described here belongs to an already quite broad body of work dealing with the problem of fact selection for theorem proving [1, 7, 9, 11, 12, 15, 16]. This problem, commonly referred to as the premise selection problem, is crucial when performing automated reasoning in large formal libraries – both in the context of automated (ATP) and interactive (ITP) theorem proving, and regardless of the underlying logical calculus. Most of the existing work on premise selection focuses on the ATP context. Our main contribution is the development of a premise selection tool that is practically usable in a proof assistant (Lean in this case), tightly integrated with it, lightweight, extendable, and equipped with a convenient interface. The tool is available in a public GitHub repository: https://github.com/BartoszPiotrowski/lean-premise-selection.

2 Dataset Collection

A crucial requirement of a useful ML model is a high-quality dataset of training examples. It should represent the learning task well and be suitable for the ML architecture being applied.

In this work, we use simple ML architectures that cannot process raw theorem statements and require featurization as a preprocessing step. The features need to be meaningful yet simple so that the model can use them appropriately. Our approach is described in Sect. 2.1. The notion of relevant premise may be understood differently depending on the context. In Sect. 2.2, we describe the different specifications of this notion that we used in our experiments.

The tool developed in this work is implemented in and meant to be used with Lean 4 and mathlib 4. However, since, at the time of writing, Lean 4’s version of the library is still being ported from Lean 3, we use mathlib3port as our main data source.

2.1 Features

The features, similar to those used in [8, 15], consist of the symbols used in the theorem statement with different degrees of structure. In particular, three types of features are used: names, bigrams and trigrams.

As an illustration, take this theorem about groups with zero:

(Listing omitted: the statement of the example theorem about groups with zero.)

This statement comes from one of the source files of mathlib. When producing the features for it, we do not use it directly as printed above; rather, we take its elaborated counterpart – a much more detailed version in which all the hidden assumptions are made explicit by Lean’s elaborator, so that the expression precisely conforms to Lean’s dependent type theory.

The most basic form of featurization is the bag-of-words model, where we simply collect all the names (and numerical constants) involved in the theorem.

Following this definition, we obtain the names visible in the source version of the statement, plus many more hidden names appearing only in the elaborated expression, e.g., ones related to interpreting numerical literals as natural numbers.

During the featurization we distinguish features coming from the conclusion and from the hypotheses (assumptions) of the theorem, and we mark them by prepending a distinguishing prefix to each feature.

For our running example, all this results in the following list of names:

(Listing omitted: the name features of the example theorem.)

It would be desirable, however, to keep track of which symbols appear next to each other in the syntactic trees of the theorem’s hypotheses and conclusion. Thus, we extract bigrams formed by a head symbol and each of its arguments (separated by / below).

(Listing omitted: the bigram features of the example theorem.)

Similarly, we also consider trigrams, taking all paths of length 3 from the syntactic tree of the expression.

(Listing omitted: the trigram features of the example theorem.)
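
To make the featurization concrete, here is a minimal sketch in Python (the actual tool operates on Lean’s elaborated expressions in Lean 4, not on this toy representation). An expression is modelled as a nested tuple (head symbol, arguments...); the H:/C: prefixes for hypothesis and conclusion features, as well as the example statement, are illustrative placeholders rather than the tool’s actual markers.

```python
# Sketch of name/bigram/trigram featurization on a toy expression tree.
# An expression is either a string (a symbol) or a tuple (head, child, child, ...).
# The "H:"/"C:" prefixes and the example statement below are illustrative only.

def names(expr):
    """All symbols occurring in the expression (bag-of-words features)."""
    if isinstance(expr, str):
        return [expr]
    head, *args = expr
    return [head] + [n for a in args for n in names(a)]

def bigrams(expr):
    """Head/argument-head pairs for every application node."""
    if isinstance(expr, str):
        return []
    head, *args = expr
    pairs = [f"{head}/{a if isinstance(a, str) else a[0]}" for a in args]
    return pairs + [b for a in args for b in bigrams(a)]

def trigrams(expr):
    """All paths of length 3 (grandparent/parent/child) in the syntax tree."""
    if isinstance(expr, str):
        return []
    head, *args = expr
    out = []
    for a in args:
        if not isinstance(a, str):
            mid, *sub = a
            out += [f"{head}/{mid}/{s if isinstance(s, str) else s[0]}" for s in sub]
        out += trigrams(a)
    return out

def featurize(hypotheses, conclusion):
    """Collect all features, marking hypothesis vs. conclusion origin."""
    feats = []
    for h in hypotheses:
        feats += ["H:" + f for f in names(h) + bigrams(h) + trigrams(h)]
    feats += ["C:" + f for f in names(conclusion) + bigrams(conclusion) + trigrams(conclusion)]
    return feats

# Made-up statement `Eq (mul a zero) zero` under a hypothesis `GroupWithZero G`:
print(featurize([("GroupWithZero", "G")], ("Eq", ("mul", "a", "zero"), "zero")))
```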

2.2 Relevant Premises

To obtain the list of all the premises used in a proof of a given theorem, it suffices to traverse the theorem’s proof term and keep track of all the constants whose type is a proposition. For instance, the raw list of premises appearing in the proof of our example theorem is:

(Listing omitted: the raw list of premises used in the proof of the example theorem.)

For more complicated examples, this approach results in a large number of premises, including lemmas used implicitly by tactics (for instance, those picked by the simplifying tactic simp) or simple facts that a user would rarely write explicitly. Three different filters are applied to mitigate this issue: all, source, and math. They are described below, their overall effect is shown in Table 1, and a schematic sketch of the filtering logic follows the list.

Table 1. Filters’ statistics. An example is a theorem with a non-empty list of premises. Because applying the source or math filter may result in an empty set of premises, the numbers of obtained training examples differ across the filters.
1. The all filter preserves almost all premises from the original, raw list, removing only those that were generated automatically by Lean; such premises contain a leading underscore in their names. In our example, there are no such premises. Examples from this filter are not appropriate for training an ML advisor for interactive use, as the suggestions would contain many lemmas used implicitly by tactics. Yet, such an advisor could be used for automated ITP approaches such as hammers [3].

2. The source filter leaves only those premises that appear in the proof’s source code. The idea is to model the explicit usage of premises by a user. Following our example, we would take the following proof as a string and list only the three premises appearing there:

   (Listing omitted: the proof source of the example theorem with its three explicitly used premises.)

3. The math filter preserves only lemmas that are clearly of a mathematical nature, discarding basic, technical ones. The names of all theorems and definitions from mathlib are extracted and used as a whitelist. In particular, this means that many basic lemmas from Lean’s core library (including some of those appearing in our example) are filtered out.
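
As a rough illustration of the three filters, the sketch below operates on plain premise-name lists. The inputs (the raw premise list, the proof source as a string, and the mathlib name whitelist) and the helper names are assumptions made for this example; the actual implementation works on Lean declarations.

```python
# Illustrative sketch of the three premise filters (all / source / math).
# `raw_premises` is the list collected from the proof term, `proof_source` is the
# proof as a string, and `mathlib_names` is the whitelist of mathlib declarations.

def filter_all(raw_premises):
    # Drop premises generated automatically by Lean (a leading underscore in the name).
    return [p for p in raw_premises
            if not any(part.startswith("_") for part in p.split("."))]

def filter_source(raw_premises, proof_source):
    # Keep only premises written explicitly in the proof's source code.
    return [p for p in raw_premises if p in proof_source]

def filter_math(raw_premises, mathlib_names):
    # Keep only premises whose names appear in the mathlib whitelist.
    return [p for p in raw_premises if p in mathlib_names]
```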

In addition to our base datasets containing one data point per theorem, we also created a dataset (labeled as intermediate) representing intermediate proof states. In the base datasets, we record the features of the initial proof state (the hypotheses and the conclusion of the theorem to be proved) and the premises used in the full proof. In the intermediate dataset, we instead record the features of a proof state encountered while constructing a proof and the premises used in the next proof step only.

To this end, we used LeanInk, a helper tool for Alectryon [17] – a toolkit that aids exploration of tactical proof scripts without running the proof assistant. Given a Lean file, LeanInk generates all the states that a user would be able to see in the infoview (a panel in Lean that displays goal states and other information about the prover’s state) by clicking through the file. The file is split into fragments, each containing a string of Lean code, represented by a list of tokens, together with the proof states before and after it. In this way, the file can be loaded statically, simulating the effect of running Lean. Furthermore, LeanInk can be configured to keep track of typing information, which is key to detecting which tokens are premises. We modified LeanInk so that every fragment that appears inside a proof is treated as its own theorem by our extractor. We gather all the premises found in the list of tokens and featurize the hypotheses and goals in the “before” proof state.

This dataset consists of 91 292 examples and 143 165 premises, which gives an average of around 1.57 premises per example. It represents a more fine-grained use of the premises, which does not exactly correspond to our main objective of providing rankings of premises at the level of theorem statements. We treat it as an auxiliary dataset potentially useful for augmenting our base datasets.

3 Machine Learning Models

The task modelled here with ML is predicting a ranking of likely useful premises (lemmas and theorems) conditioned on the features of the statement of the theorem being proved by a user. The nature of this problem differs from common applications of classical ML: both the number of features and the number of labels (premises) to predict are large, and the training examples are sparse in the feature space. Thus, we could not directly rely on traditional implementations of ML algorithms, and using custom-built versions was necessary. As one of our design requirements was tight integration with the proof assistant, we implemented the ML algorithms directly in Lean 4, without needing to call external tools. This also served as a test of the maturity and efficiency of Lean 4 as a programming language.

In Sects. 3.1 and 3.2 we describe two machine learning algorithms implemented in this work: k-nearest neighbours (k-NN) and random forest.

3.1 k-Nearest Neighbours

This is a classical and conceptually simple ML algorithm [6], which has already been used multiple times for premise selection [2, 9, 10]. It belongs to the lazy learning category, meaning that no prediction model is trained beforehand; instead, the dataset itself is an input to the algorithm at prediction time.

Given an unlabeled example, k-NN produces a prediction by extracting the labels of the k most similar examples in the dataset and returning an averaged (or most frequent) label. In our case, the labels are lists of premises. We compose multiple labels into a ranking of premises according to the frequency of appearance in the concatenated labels.

The similarity measure in the feature space calculates how many features are shared between the two data points, but additionally puts more weight on those features that are rarer in the whole training dataset \(\mathcal {D}\). The formula for the similarity of the two examples \(x_1\) and \(x_2\) associated with sets of features \(f_1\) and \(f_2\), respectively, is given below.

$$ M(x_1, x_2) = \frac{ \sum _{f \in f_1 \cap f_2} t(f) }{ \sum _{f \in f_1} t(f) + \sum _{f \in f_2} t(f) - \sum _{f \in f_1 \cap f_2} t(f) },\,\,\,\,\, t(f) = \log \left( \frac{|\mathcal {D}|}{|\mathcal {D}_f|}\right) ^2, $$

where \(\mathcal {D}_f\) are those training examples that contain the feature f.
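
For illustration, here is a compact Python sketch of the weight t(f) (read as the square of the logarithm), the similarity M, and the construction of a premise ranking from the k nearest neighbours. The actual implementation is written in Lean 4; the helper names below are ours.

```python
import math
from collections import Counter

def idf_weights(train_features):
    """t(f) = (log(|D| / |D_f|))^2, where D_f are the training examples containing f."""
    n = len(train_features)
    df = Counter(f for feats in train_features for f in set(feats))
    return {f: math.log(n / d) ** 2 for f, d in df.items()}

def similarity(feats1, feats2, t):
    """M(x1, x2): weighted overlap of the two feature sets, as in the formula above."""
    f1, f2 = set(feats1), set(feats2)
    shared = sum(t.get(f, 0.0) for f in f1 & f2)
    total = sum(t.get(f, 0.0) for f in f1) + sum(t.get(f, 0.0) for f in f2) - shared
    return shared / total if total > 0 else 0.0

def knn_rank(query_feats, train_features, train_premises, t, k=100):
    """Rank premises by frequency among the labels of the k most similar examples."""
    nearest = sorted(range(len(train_features)),
                     key=lambda i: similarity(query_feats, train_features[i], t),
                     reverse=True)[:k]
    votes = Counter(p for i in nearest for p in train_premises[i])
    return [p for p, _ in votes.most_common()]
```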

The advantages of k-NN are its simplicity and the lack of training. A disadvantage, however, is the need to traverse the whole training dataset in order to produce a single prediction (a ranking). This may be slow, and thus not optimal for interactive usage in proof assistants.

3.2 Random Forest

As an alternative to k-NN, we use random forest [4] – an ML algorithm from the eager learning category, with a separate training phase resulting in a prediction model consisting of a collection of decision trees. The leaves of the trees contain labels, and their nodes contain decision rules based on the features. In our case, the labels are sets of premises, and the rules are simple tests that check if a given feature appears in an example.

When predicting, unlabeled examples are passed down the trees to the leaves, the reached labels are recorded, and the final prediction is averaged across the trees via voting. The trees are trained in such a way as to avoid correlations between them, and the averaged prediction from them is of better quality than the prediction from a single tree.
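
Schematically, prediction with such a forest can be illustrated as follows (a Python sketch with a toy tree representation, not the tool’s actual data structures): internal nodes test the presence of a single feature, leaves store sets of premises, and votes are aggregated across the trees.

```python
from collections import Counter

# A decision tree is either a leaf (a set of premise names) or an internal node
# ("feature", left_subtree, right_subtree): go right iff the feature is present.

def predict_tree(tree, features):
    while isinstance(tree, tuple):
        feature, left, right = tree
        tree = right if feature in features else left
    return tree  # the set of premises stored in the reached leaf

def predict_forest(forest, features):
    votes = Counter(p for tree in forest for p in predict_tree(tree, features))
    return [p for p, _ in votes.most_common()]  # ranking by number of votes
```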

Our version of random forest, adapted to deal with sparse binary features and a large number of labels, is similar to the one used in [19], where the task was to predict the next tactic progressing a proof in the Coq proof assistant. There, the features were also sparse; the difference is that here we need to predict sets of labels (premises) rather than a single label (the next tactic).

Our random forest is trained in an online manner, i.e., it is updated sequentially with single training examples – not with the entire training dataset at once, as is typically done. The rationale for this is to make it easy to update the model with data coming from new theorems proved by a user. This allows the model to immediately provide suggestions that take these recently added theorems into account.

Fig. 1. A schematic example of a decision tree from a trained random forest. Lowercase letters (a, b, c, ...) designate features of theorem statements, whereas uppercase letters (P, Q, R, ...) designate names of premises. The input (a featurized theorem statement) is passed down the tree (along the green arrows): each node tests for the presence of a single feature and passes the input example to the left (or right) subtree in the negative (or positive) case. The output is the set of premises in the reached leaf.

(Algorithm 1 omitted: pseudocode for updating a single decision tree with one training example.)

Algorithm 1 provides a sketch of how a training example updates a tree – for all the details, see the actual implementation in our public GitHub repository. A crucial part of the algorithm is the MakeSplitRule function creating the node-splitting rules. Searching for rules resulting in optimal splits would be costly; thus, this function relies on heuristics.
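
The following is a generic Python illustration of the online-update idea, not a transcription of Algorithm 1: leaves buffer their training examples and are split once they collect too many of them, with the split feature chosen by a simple heuristic (here, the feature that divides the buffered examples most evenly). The representation differs from the prediction sketch above because leaves here need to store examples; all names and thresholds are assumptions made for the example.

```python
from collections import Counter

class Leaf:
    def __init__(self, examples=None):
        self.examples = examples or []  # buffered (features, premises) pairs

class Node:
    def __init__(self, feature, left, right):
        self.feature, self.left, self.right = feature, left, right

def make_split_rule(examples):
    # Heuristic split rule: the feature whose presence divides the buffer most evenly.
    counts = Counter(f for feats, _ in examples for f in set(feats))
    half = len(examples) / 2
    return min(counts, key=lambda f: abs(counts[f] - half))

def update(tree, features, premises, max_leaf=16):
    """Route one training example to a leaf; split the leaf if it grows too large."""
    if isinstance(tree, Node):
        branch = "right" if tree.feature in features else "left"
        setattr(tree, branch, update(getattr(tree, branch), features, premises, max_leaf))
        return tree
    tree.examples.append((set(features), set(premises)))
    if len(tree.examples) <= max_leaf:
        return tree
    feature = make_split_rule(tree.examples)
    left = Leaf([e for e in tree.examples if feature not in e[0]])
    right = Leaf([e for e in tree.examples if feature in e[0]])
    return Node(feature, left, right)
```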

Figure 1 schematically depicts how a simple decision tree from a trained random forest predicts a set of premises for an input example.

4 Evaluation Setup and Results

To assess the performance of the ML algorithms, the data points extracted from mathlib were split into training and testing sets. The testing examples come from the modules that are not dependencies of any other modules (there are 592 of them). This simulates a realistic scenario in which a user utilizing the suggestion tool develops a new mathlib module. The rest of the modules (2436) served as the source of training examples.
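
Assuming the module import graph is available as a dictionary, this split can be sketched as follows (the input format and function name are our assumptions): the test modules are exactly those that no other module imports.

```python
# Illustrative module-level split: test modules are the "leaves" of the reverse
# dependency graph. `imports` maps each module name to the set of modules it imports.

def split_modules(imports):
    imported_somewhere = {m for deps in imports.values() for m in deps}
    test = {m for m in imports if m not in imported_somewhere}
    train = set(imports) - test
    return train, test
```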

Two measures of the quality of the rankings produced by ML are defined: Cover and Cover\(_+\). Assuming a theorem T depends on the set of premises P of size n, and R is the ranking of premises predicted by the ML advisor for T, these measures are defined as follows:

$$ \text {Cover}(T) = \frac{\big |P \cap R\texttt {[:} n\texttt {]}\big |}{n}, \qquad \text {Cover}_+(T) = \frac{\big |P \cap R\texttt {[:} n+10\texttt {]}\big |}{n}, $$

where \(R\texttt {[:} k\texttt {]}\) is the set of the first k premises from ranking R. Both Cover and Cover\(_+\) return values in [0, 1]. Cover gives the score of 1 only for a “perfect” prediction where the premises actually used in the proof form an initial segment of the ranking. Cover\(_+\) may also give a perfect score to less precise predictions. The rationale for Cover\(_+\) is that in practice the user may look through 10 or more suggested premises. This is often more than the n premises actually used in the proof, so Cover\(_+\) considers initial segments of length \(n + 10\).
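
Both measures translate directly into code. In the minimal sketch below (function and argument names are ours), Cover corresponds to extra=0 and Cover\(_+\) to extra=10.

```python
def cover(used_premises, ranking, extra=0):
    """Fraction of the premises used in the proof found among the top n+extra suggestions."""
    n = len(used_premises)
    top = set(ranking[:n + extra])
    return len(set(used_premises) & top) / n

# Cover:   cover(P, R)
# Cover_+: cover(P, R, extra=10)
```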

Both k-NN and random forest are evaluated on data subject to all three premise filters described in Sect. 2.2. For each of these variants of the data, three combinations of features are tested: (1) names only, (2) names and bigrams, (3) names, bigrams, and trigrams. The hyperparameters of the ML algorithms were selected in an experiment on a smaller dataset. For k-NN, the number of neighbours was fixed to 100. For random forest, the number of trees was set to 300, each example was used for training a particular decision tree with probability 0.3, and the training algorithm passed through the whole training data 3 times.

Table 2 shows the results of the experiment. In terms of the Cover metric, random forest performed better than k-NN for all data configurations. However, in terms of the Cover\(_+\) metric, k-NN surpassed random forest for the math filter.

It turned out that the union of names and bigrams constitutes the best feature set for all the filters and both ML algorithms. This likely means that the more complex trigrams did not help the algorithms generalize but rather caused over-fitting on the training set.

The results for the all filter appear to be much higher than for the other two filters. However, this is because applying the all filter results in many simple examples containing just a few common, basic premises (e.g., just a single rfl lemma), which inflate the average score.

Overall, random forest with names\(\, + \,\)bigrams (n+b) features gives the best results. An additional practical advantage of this model over k-NN is the speed of outputting predictions. For instance, for the source filter and n+b features, the average times of predicting a ranking of premises per theorem were 0.28 s and 5.65 s for random forest and k-NN, respectively.

Additionally, we evaluated the ML models on the intermediate dataset, using n+b features. The random forest achieved Cover = 0.09 and Cover\(_+\) = 0.24, whereas k-NN resulted in Cover = 0.08 and Cover\(_+\) = 0.21 on the testing part of the data. Then, we used the intermediate dataset in an attempt to improve the testing results on the base dataset with the source filter (as intermediate only contains premises exposed in the source files). We used the intermediate data as a pre-training dataset, first training a random forest on it and only later on the base data. We also used intermediate to augment the base data, mixing the two together. However, neither the pre-training mode nor the augmentation mode yielded statistically significant improvements in testing performance. It is possible that the prediction quality actually improved from a practical perspective, being more proof-state-dependent rather than only theorem-dependent, but this did not manifest in our theorem-level evaluation.

The evaluation may be reproduced by following the instructions in the linked source code repository.

Table 2. Average performance of random forest and k-NN on the testing data, for three premise filters and three kinds of features. The type of features is indicated by a one-letter abbreviation: n = names, b = bigrams, t = trigrams. For each configuration, the Cover and Cover\(_+\) measures are reported (the latter in brackets). In each row, the best Cover result is bolded.

5 Interactive Tool

The ML predictor is wrapped in an interactive tactic that users can type into their proof script. It invokes the predictor and produces a list of suggestions, which is displayed in the infoview. The display makes use of the new remote-procedure-call (RPC) feature in Lean 4 [14] to asynchronously run various tactics for each suggestion. Given a suggested premise p, the system attempts several standard tactic applications with p as an argument and returns the first one that advances the proof state. The result is then displayed to the user as shown in Fig. 2, and the user can select the resulting tactic to insert it into the proof script. Thanks to the asynchronous approach, results are displayed rapidly, without waiting for a slow tactic search to complete.
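
As a language-agnostic illustration of this strategy (the actual front end uses Lean 4’s RPC mechanism and runs inside the editor), the sketch below checks a few candidate tactic strings per suggestion concurrently and keeps the first successful one. The try_tactic callback is hypothetical, and the exact tactic forms tried by the tool are not reproduced here; exact, apply, and simp are used only as plausible stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustration of the asynchronous checking strategy: for each suggested premise,
# try a few candidate tactic strings and keep the first one that succeeds.
# `try_tactic(goal, tac) -> bool` is a hypothetical callback asking the prover
# whether the tactic advances the goal; the candidate tactics are stand-ins.

def check_suggestion(goal, premise, try_tactic):
    for tac in (f"exact {premise}", f"apply {premise}", f"simp [{premise}]"):
        if try_tactic(goal, tac):
            return tac
    return None  # shown as a plain suggestion if no tactic application succeeds

def annotate_suggestions(goal, premises, try_tactic, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        checks = pool.map(lambda p: check_suggestion(goal, p, try_tactic), premises)
        return list(zip(premises, checks))  # (premise, successful tactic or None)
```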

Fig. 2. The interactive tool in Visual Studio Code. The left pane shows the source file with the cursor over a tactic. The right pane shows the goal state at the cursor position and, below, the suggested lemmas to solve the goal. Suggestions annotated with a checkbox advance the goal state; suggestions annotated with confetti close the current goal. Clicking on a suggested tactic automatically appends it to the proof script on the left.

6 Future Work

There are several directions in which the current work may be developed.

The results may be improved by augmenting the dataset with, for instance, synthetic theorems, as well as by developing better features that utilize the well-defined structure of Lean expressions.

The evaluation may be extended to assess proof-state-level performance and to compare with Lean’s standard suggestion tactics. It could be beneficial to combine these tactics – which use strict matching – with our tool based on statistical matching.

Applying modern neural architectures in place of the simpler ML algorithms used here is a promising path [7, 12, 16, 18]. It would depart from our philosophy of a lightweight, self-contained approach, as the suggestions would come from an external tool, possibly placed on a remote server. However, given the strength of current neural networks, we could hope for higher-quality predictions. Moreover, neural models do not require hand-engineered features. The results presented here could serve as a baseline for comparison.

Finally, premise selection is an important component of ITP hammer systems [3]. The presented tool may be readily used for a hammer in Lean, which has not yet been developed.