MaLeS: A Framework for Automatic Tuning of Automated Theorem Provers

MaLeS is an automatic tuning framework for automated theorem provers. It provides solutions for both the strategy finding as well as the strategy scheduling problem. This paper describes the tool and the methods used in it, and evaluates its performance on three automated theorem provers: E, LEO-II and Satallax. An evaluation on a subset of the TPTP library problems shows that on average a MaLeS-tuned prover solves 8.67% more problems than the prover with its default settings.


Introduction
Automated theorem proving is a search problem. Many different approaches exist, and most of them have parameters that can be tuned. Examples of such parameterizations are clause weighting and selection schemes, term orderings, and sets of inference and reduction rules used. For a given ATP A, its implemented parameters form A's parameter space. A specific choice of parameters is called a search strategy, i.e. strategies are elements of the parameter space (Fig. 1). The choice of a strategy can often make the difference between finding a proof in a few milliseconds or not at all (within a reasonable time limit). This naturally leads to the question: Given a new problem, which search strategy should be used?
Considerable attention has already been paid to this problem. Gandalf [24] pioneered strategy scheduling: Instead of running a single strategy for the whole user-defined time limit, run several search strategies sequentially for shorter times. This method is used in most current ATPs, most prominently Vampire [13]. In the SETHEO project [27], a local search algorithm was used to find better strategy schedules. Fuchs [5] employed a nearest neighbor algorithm to determine which strategy or strategies to run. Bridge's thesis [3] is about machine learning for search heuristic selection in ATPs, with a particular focus on problem features and feature selection. In the SAT community, Satzilla [28] very successfully used machine learning to decide when to run which SAT solver. ParamILS [7] is a general tuning framework that searches for good parameter settings with a randomized hill climbing algorithm. BliStr [25] uses ParamILS to develop strategies for E [15] on a large set of interrelated problems.
Despite all this work, most ATPs do not harness the methods available. Search strategies are often manually defined by the developer of the ATP and strategy schedules are created by a greedy algorithm or very simple clustering. This paper introduces MaLeS (Machine Learning (of) Strategies), a learning-based framework for automatic tuning and configuration of ATPs. It is based on and supersedes E-MaLeS 1.0 [9] and E-MaLeS 1.1 [10]. The goal of MaLeS is to help ATP users to fine-tune an ATP to their problems and give developers a simple tool for finding good search strategies and creating strategy schedules. MaLeS is implemented in Python and has been tested with the ATPs E, LEO-II [1] and Satallax [4]. The source code is freely available at https://code.google.com/p/males/.
Figure 1 gives an informal overview of the strategy selection problem. Given a problem p ∈ P, find a strategy s (or several strategies) in the parameter space 𝒮 that can quickly solve this problem. First, we note that parameter spaces can be very big. For example, the ATP E supports over 10^17 different search strategies. Hence, to simplify the strategy selection problem, strategy selection algorithms usually only consider a small number of preselected strategies S. Defining S is the first challenge. There are different criteria to determine which strategies should be selected. The most common ones are to pick strategies that solve a lot of problems, or are very good for a particular kind of problem.

The Strategy Selection Problem
As a second step we need a way to characterize problems. This is usually done by defining a set of features F. The features must strike a balance between being fast to compute (via a feature function ϕ) and being expressive enough so that the ATP behaves similarly on problems with similar features. Once we have defined the features, we still need a way to predict how well each preselected strategy performs on a given set of features. Finally, one needs to combine the predictions to create a strategy schedule. Hence, the strategy selection problem consists of three subproblems:
• Finding a good set of preselected strategies S.
• Defining features F which are easy to compute, but also expressive enough to distinguish different types of problems.
• Determining a method which, given the features of a problem, creates a strategy schedule.
Fig. 1: Overview of the strategy selection problem for ATPs.

Overview
The rest of the paper is organized as follows: Section 2 explains how MaLeS defines the preselected strategies S. The features and the algorithm that creates the strategy schedule are presented in Section 3. MaLeS is evaluated against the default installations of E 1.7, LEO-II 1.6.0 and Satallax 2.7 in Section 4. The experiments compare the performance of running an ATP in default mode versus running the ATP with strategy scheduling provided by MaLeS. Future work is considered in Section 5, and the paper concludes with Section 6. The appendix shows how to install the MaLeS-tuned versions of the ATPs mentioned above: E-MaLeS, LEO-MaLeS and Satallax-MaLeS, how to tune any of those systems for new problems, and how to use MaLeS with different ATPs. It also includes an overview of the CASC results.

Finding Good Search Strategies with MaLeS
Choosing a good strategy for a problem requires prior information on how the different strategies behave on different kinds of problems. Getting this information for all strategies is often infeasible due to constraints on the CPU power available and the number of possible strategies. Hence, one has to decide which strategies one wishes to evaluate. ATP developers often manually define such a set of strategies based on their intuition and experience. This option is, however, not available when one lacks in-depth knowledge of the internal workings of the ATP. A local search algorithm can help in these cases, and can even be combined with the manual approach by taking the predefined strategies as starting points of the search.
Algorithm 1 find_strategies: For each problem, search for the strategy that solves it in the least amount of time.
 1: procedure find_strategies(Problems, tol, t_max, nS, nC)
 2:   initialize Queue Q
 3:   initialize dictionary bestTime with t_max for all problems
 4:   initialize dictionary bestStrategy as empty
 5:   while Q not empty do
 6:     s ← pop(Q)
 7:     for p ∈ Problems do
 8:       oldBestTime ← bestTime[p]
 9:       proofFound, timeNeeded ← run_strategy(s, p, t_max)
10:       if proofFound and timeNeeded < bestTime
    ...
    return bestStrategy
30: end procedure
The initialization of Q in Line 2 is either done by randomly creating some strategies, or by manually defining which strategies to use. Variable tol defines the tolerance of the algorithm, t_max is the maximal time that may be used by the strategy. nS determines the number of strategies generated in the create_random_strategies sub-procedure, nC is an upper limit to how much these new strategies differ from the old one. bestStrategy is a dictionary that for each problem stores the strategy that solved it in the least amount of time.
MaLeS employs a basic stochastic local search algorithm labeled find_strategies (Algorithm 1) for ATPs. The strategies returned by find_strategies define the preselected strategies S. The difference from existing parameter selection frameworks like ParamILS and BliStr is that find_strategies searches, for each problem, for the fastest strategy, whereas ParamILS tries to find the best strategy for all problems (i.e. the strategy that solves the most problems within some time limit). BliStr searches for the best strategy for sets of similar problems.
find_strategies takes a list of problems as input. A queue of start strategies is initialized, either with random or predefined strategies. Each strategy in the queue is then tried on all problems. If the strategy solves a problem faster than any of the tried strategies (within some tolerance, see Line 14), a local search is performed. If the search yields faster strategies, the fastest newly found search strategy is appended to the queue. In the end, find_strategies returns the strategies that were the fastest strategy on at least one problem.
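The search loop described above can be sketched in Python. This is only an illustration under stated assumptions, not the MaLeS implementation: the callables `run_strategy` and `neighbours` are hypothetical stand-ins for running the ATP and for the local search step, and the tolerance test (a run is promising if it is within `tol` of the previously best time) is one possible reading of "within some tolerance".

```python
from collections import deque


def find_strategies(problems, start_strategies, run_strategy, neighbours, tol, t_max):
    """Return the strategies that were fastest on at least one problem.

    run_strategy(s, p, t_max) -> (proof_found, time_needed) and
    neighbours(s) -> list of similar strategies are supplied by the caller.
    """
    queue = deque(start_strategies)
    best_time = {p: t_max for p in problems}
    best_strategy = {}
    while queue:
        s = queue.popleft()
        for p in problems:
            old_best = best_time[p]
            found, needed = run_strategy(s, p, t_max)
            if found and needed < best_time[p]:
                best_time[p], best_strategy[p] = needed, s
            # Assumption: a run counts as promising if it is within `tol`
            # of the best time known before this run.
            if found and needed < old_best + tol:
                fastest_new = None
                for candidate in neighbours(s):
                    c_found, c_needed = run_strategy(candidate, p, t_max)
                    if c_found and c_needed < best_time[p]:
                        best_time[p], best_strategy[p] = c_needed, candidate
                        fastest_new = candidate
                if fastest_new is not None:
                    queue.append(fastest_new)
    return set(best_strategy.values())


# Toy setting: strategies and problems are integers, and a strategy is
# faster the closer it is to the problem's number.
def toy_run(s, p, t_max):
    needed = abs(s - p) + 1
    return needed < t_max, needed


def toy_neighbours(s):
    return [s - 1, s + 1]


result = find_strategies([3, 7], [5], toy_run, toy_neighbours, tol=0.5, t_max=10)
```

In the toy run, the search starts from strategy 5 and walks towards the two problems, so the returned set contains one specialised strategy per problem rather than a single compromise strategy.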
Algorithm 2 create_random_strategies: Returns slight variations of the input strategy.
procedure create_random_strategies(Strategy, nS, nC)
  newStrategies is an empty list
  for i in range(nS) do
    newStrategy is a copy of Strategy
    for j in range(nC) do
      newStrategy = change_random_parameter(newStrategy)
    end for
    newStrategies.append(newStrategy)
  end for
  return newStrategies
end procedure
nS determines the number of new strategies, nC is the upper limit for the number of changed parameters.
The local search part is defined in Algorithm 2 (create_random_strategies). It returns a predefined number of strategies similar to the input strategy. The new strategies are created by randomly changing the parameters of the input strategy. How many parameters are changed is determined in MaLeS' configuration file (parameter WalkLength in Table 2).
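A minimal Python sketch of this local search step is given below. It assumes that a strategy can be represented as a dictionary from parameter names to values and that the admissible values of each parameter are known. The parameter names and values in the example are made up and do not correspond to options of any real prover.

```python
import random


def change_random_parameter(strategy, parameter_space, rng):
    """Return a copy of `strategy` with one randomly chosen parameter re-drawn."""
    changed = dict(strategy)
    name = rng.choice(sorted(parameter_space))
    changed[name] = rng.choice(parameter_space[name])
    return changed


def create_random_strategies(strategy, n_s, n_c, parameter_space, rng=random):
    """Return n_s variations of `strategy`, each differing in at most n_c parameters."""
    new_strategies = []
    for _ in range(n_s):
        new_strategy = dict(strategy)
        for _ in range(n_c):
            new_strategy = change_random_parameter(new_strategy, parameter_space, rng)
        new_strategies.append(new_strategy)
    return new_strategies


# Hypothetical parameter space; the names and values are made up.
space = {
    "ordering": ["A", "B", "C"],
    "selection": ["first", "largest", "smallest"],
    "weight": [1, 2, 5, 10],
}
base = {"ordering": "A", "selection": "first", "weight": 1}
variants = create_random_strategies(base, n_s=3, n_c=2, parameter_space=space,
                                    rng=random.Random(0))
```

Because a re-drawn parameter may receive its old value again, nC is only an upper limit on the number of changed parameters, which matches the description of Algorithm 2.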

Strategy Scheduling with MaLeS
As mentioned previously, most automated theorem provers, independent of the parameters used, solve problems either very fast, or not at all (within a reasonable time limit). Instead of trying only a single strategy for a long time, it is often beneficial to run several search strategies for a shorter time. This approach is called strategy scheduling.
Many current ATPs use strategy scheduling to define their default configuration. Some use a single schedule for every problem (e.g. Satallax 2.7). Others define classes of similar problems and use different schedules for different classes (e.g. E 1.7, LEO-II 1.6.0). MaLeS creates an individual strategy schedule for each problem, depending on the problem's features.

Notation
We shall use the following notation:
• p is an ATP problem. P denotes a set of problems.
• P_train ⊆ P is a set of training problems that is used to tune the learning algorithm.
• F is the feature space. We assume that F is a subset of R^n for some n ∈ N.
• ϕ : P → F is the feature function. ϕ(p) is the feature vector of a problem.
• 𝒮 is the parameter space, S is the set of preselected strategies.
• The time the ATP running strategy s needs to solve a problem p is denoted by τ(p, s). If s is obvious from the context or irrelevant, we also use τ(p).
• For a strategy s, ρ_s : P → R is the runtime prediction function.
(Footnote 3: Parameter WalkLength in Table 2.)
For each strategy s in the preselected strategies S, MaLeS defines a runtime prediction function ρ_s : P → R. The prediction function ρ_s uses the features of a problem to predict the time the ATP running strategy s needs to solve the problem. The strategy schedule for the problem is created from these predictions.

Features
Features give an abstract description of a problem. Optimally, the features should be designed in such a way that the ATP behaves similarly on problems with similar features, i.e. if two problems p, q have similar features ϕ(p) ∼ ϕ(q), then for each strategy s the runtimes should be similar: τ(p, s) ∼ τ(q, s). The similarity function (e.g. cosine distance between the feature vectors) and the set of features heavily influence the quality of the prediction functions. Indeed, feature selection is an entire subfield of machine learning [6,11].
Currently, MaLeS supports two different feature spaces: Schulz's E features are used for first-order (FOF) problems. The TPTP features designed by Sutcliffe are used for higher-order (THF) problems [22]. Note that the main reason for using these features was that they were easily available. Evaluating different feature sets and/or introducing new features is beyond the scope of this paper.

The E Features
Schulz designed a set of features for clause-normal-form and first-order problems. They are used in the strategy selection process in his theorem prover E [10]. Table 1 shows the features together with a short description. MaLeS uses the same features for first-order problems. A clause is called negative if it only has negative literals. It is called positive if it only has positive literals. A ground clause is a clause that contains no variables. In this setting, we refer to all negative clauses as "goals", and to all other clauses as "axioms". Clauses can be unit (having only a single literal), Horn (having at most one positive literal), or general (no constraints on the form). All unit clauses are Horn, and all Horn clauses are general. The features are computed by running Schulz's classify_problem program, which is distributed with MaLeS.
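The clause terminology above can be made concrete with a small Python sketch. It is not Schulz's classify_problem program. The `Literal` type and the function `describe_clause` are hypothetical names introduced only for this illustration, and clauses are assumed to be non-empty lists of literals.

```python
from typing import NamedTuple, Tuple


class Literal(NamedTuple):
    positive: bool                    # False for a negated atom
    atom: str                         # textual form of the atom, e.g. "p(X)"
    variables: Tuple[str, ...] = ()   # variables occurring in the atom


def describe_clause(clause):
    """Classify a non-empty clause according to the definitions in the text."""
    n_pos = sum(1 for lit in clause if lit.positive)
    n_neg = len(clause) - n_pos
    if len(clause) == 1:
        shape = "unit"
    elif n_pos <= 1:
        shape = "Horn"
    else:
        shape = "general"
    return {
        "role": "goal" if n_pos == 0 else "axiom",  # negative clauses are goals
        "positive": n_neg == 0,
        "shape": shape,                             # most specific shape only
        "ground": all(not lit.variables for lit in clause),
    }


goal = [Literal(False, "p(a)")]
horn = [Literal(True, "q(X)", ("X",)), Literal(False, "p(X)", ("X",))]
general = [Literal(True, "p(a)"), Literal(True, "q(b)")]
descriptions = [describe_clause(c) for c in (goal, horn, general)]
```

The function reports only the most specific shape. Since all unit clauses are Horn and all Horn clauses are general, the less specific classes follow from it.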

The TPTP Features
The TPTP problem library [17] provides a syntactical description of every problem which can be used as problem features. Figure 2 shows an example. Before normalization, the feature vector corresponding to the example is [145, 5, 47, 31, 1106, . . . , 147, 0, 0, 0, 0]. Sutcliffe's MakeListStats computes these features and is publicly available as part of the TPTP infrastructure. A modified version which outputs only the numbers without any text is also distributed with MaLeS.

Normalization
In the initial form, there can be great differences between the values of different features. In the THF example (Figure 2), the number of atoms (1106) is of a different order of magnitude than e.g. the maximal formula depth (7). Since our machine learning method (like many others) computes the Euclidean distance between data points, these differences can render smaller valued features irrelevant. Hence, normalization is used to scale all features to have values between 0 and 1. First we compute the features for each p ∈ P_train. Then the maximal and minimal value of each feature f is determined. These values are then used to rescale the feature vectors for each problem p via
$$\varphi(p)_f := \frac{\varphi(p)_f - \min_f}{\max_f - \min_f}$$
where ϕ(p)_f is the value of feature f for problem p, min_f is the minimal and max_f is the maximal value for f among the problems in P_train.
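The rescaling step can be sketched in a few lines of Python. The feature vectors below are made up for illustration. Mapping a feature that is constant on the training set to 0 is an assumption of this sketch, since the text does not say how that case is handled.

```python
def fit_min_max(train_vectors):
    """Return the per-feature minima and maxima over the training problems."""
    columns = list(zip(*train_vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(vector, mins, maxs):
    """Rescale each feature with the training minimum and maximum."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0  # assumption for constant features
        for v, lo, hi in zip(vector, mins, maxs)
    ]


# Three made-up training problems with three features each.
train = [[145, 5, 7], [1106, 5, 3], [500, 5, 5]]
mins, maxs = fit_min_max(train)
scaled = [normalize(v, mins, maxs) for v in train]
```

Note that a new problem whose feature values lie outside the range seen in training is mapped outside the interval from 0 to 1, because the minima and maxima are taken from P_train only.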

Runtime Prediction Functions
Predicting the runtime of an ATP is a classic regression problem [2]. For each strategy s in the preselected strategies S, we are searching for a function ρ_s : P → R such that for all problems p ∈ P the predicted values are close to the actual runtimes: ρ_s(p) ∼ τ(p, s). This section explains the learning method employed by MaLeS as well as the data preparation techniques used.

Timeouts
The prediction functions are learned from the behavior of the preselected strategies on the training problems P_train. Each preselected strategy is run on all training problems with a timeout t. Often, strategies will not solve all problems within the timeout. This leads to the question of how one should treat unsolved problems. Setting the time value of an unsolved problem-strategy pair (p, s) to the timeout, i.e. τ(p, s) = t, is one possible solution. Another possibility, which is used in MaLeS, is to only learn on problems that can be solved. While ignoring unsolved problems introduces a bias towards shorter runtimes, it also simplifies the computation of the prediction functions and allows us to update the prediction functions at runtime (Section 3.5).

Kernel Methods
MaLeS uses kernels to learn the runtime prediction function. Kernels are a very popular machine learning method that has successfully been applied in many domains [16]. A kernel can be seen as a similarity function between feature vectors. Kernels allow the usage of nonlinear features while keeping the learning problem itself linear. The basic principles will be covered on the next pages. More information about kernel-based machine learning can be found in [16].
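Definition 1 (Gaussian Kernel) The Gaussian kernel k with parameter σ of two problems p, q ∈ P with feature vectors ϕ(p), ϕ(q) ∈ F ⊆ R^n for some n ∈ N is defined as
$$k(p, q) := \exp\left(-\frac{\varphi(p)^T\varphi(p) - 2\,\varphi(p)^T\varphi(q) + \varphi(q)^T\varphi(q)}{\sigma^2}\right)$$
where ϕ(p)^T ϕ(q) is the dot product between ϕ(p) and ϕ(q) in R^n.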
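A direct Python transcription of a Gaussian kernel on feature vectors is given below as a sketch. It assumes the kernel width enters as σ² in the denominator of the exponent. Other common conventions use 2σ², so the scaling should be checked against the implementation at hand.

```python
import math


def dot(u, v):
    """Dot product of two feature vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, y, sigma):
    """Similarity of two feature vectors; 1 for identical vectors, near 0 for distant ones."""
    squared_distance = dot(x, x) - 2 * dot(x, y) + dot(y, y)
    return math.exp(-squared_distance / sigma ** 2)


k_same = gaussian_kernel([0.2, 0.7], [0.2, 0.7], sigma=1.0)
k_near = gaussian_kernel([0.0, 0.0], [1.0, 0.0], sigma=1.0)
k_far = gaussian_kernel([0.0, 0.0], [3.0, 0.0], sigma=1.0)
```

The three dot products in the exponent add up to the squared Euclidean distance between the two feature vectors, which is why unnormalized features with large values would dominate the similarity.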
In order to apply machine learning, we first need some data to learn from. Let t ∈ R be a time limit. For each preselected strategy s ∈ S, the ATP is run with strategy s and time limit t on each problem in P_train. Note that the same t is used for all problems. For each strategy s, P^s_train ⊆ P_train is the set of problems that the ATP can solve within the time limit t with strategy s.

Definition 2 (The Prediction Function)
In kernel-based machine learning, the prediction function ρ_s has the form
$$\rho_s(p) = \sum_{q \in P^s_{train}} \alpha^s_q \, k(p, q)$$
for some α^s_q ∈ R. The α^s_q are called weights and are the result of the learning. To define how exactly this is done, some more notation is needed.

Definition 3 (Kernel Matrix, Times Matrix and Weights Matrix)
For every strategy s ∈ S, let m be the number of problems in P^s_train and (p_i)_{1≤i≤m} be an enumeration of the problems in P^s_train. The kernel matrix K^s ∈ R^{m×m} is defined as
$$K^s_{i,j} := k(p_i, p_j).$$
The times matrix Y^s ∈ R^{m×1} is defined as
$$Y^s_i := \tau(p_i, s).$$
Finally, we set the weight matrix A^s ∈ R^{m×1} as
$$A^s_i := \alpha^s_{p_i}.$$
If it is obvious which strategy is meant, or the statement is independent of the strategy, we omit the s in K^s, Y^s and A^s.
A simple way to define values for the weights α^s_{p_i} would be to solve KA = Y. Such a solution (if it exists) would likely perform very well on known data but poorly on new data, a behavior called overfitting. As a measure against overfitting, a regularization parameter λ ∈ R is added and least-squares regression is used to minimize the difference between the predicted times and the actual times [14]. That means we want
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{m \times 1}} \left( (Y - KA)^T (Y - KA) + \lambda A^T K A \right)$$
where (Y − KA)^T (Y − KA) is the square loss between the predicted values and the actual time needed. λ A^T K A is the regularization term. A^T K A is a measure of how complex, in terms of VC dimension [26], our prediction function is. The bigger λ, the more complex functions are penalized. For very high values of λ, we force A to be almost equal to the 0 matrix. This approach can be seen as a kind of Occam's razor for prediction functions. A is the matrix that best fits the training data while staying as simple as possible.
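Theorem 1 (Weight Matrix for a Strategy) For λ > 0, the optimal weights for a strategy s are given by
$$A = (K + \lambda I)^{-1} Y.$$
It can be shown that K is a positive semi-definite symmetric matrix and therefore (K + λI) is invertible for λ > 0. To find a minimum, we set the derivative to zero and solve with respect to A.
$$\frac{\partial}{\partial A}\left( (Y - KA)^T (Y - KA) + \lambda A^T K A \right) = -2K(Y - KA) + 2\lambda K A = -2KY + 2K(K + \lambda I)A$$
Setting this to zero gives
$$K(K + \lambda I)A = KY$$
and hence the loss is minimized by
$$A = (K + \lambda I)^{-1} Y.$$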
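The closed-form solution of the regularized least-squares problem can be tried out with NumPy. The following is a sketch rather than the MaLeS code: the function names are hypothetical, the kernel is assumed to use σ² in the denominator, and the feature vectors and runtimes are made up.

```python
import numpy as np


def gaussian_kernel_matrix(X, Z, sigma):
    """Kernel values between all rows of X and all rows of Z."""
    squared = (X ** 2).sum(axis=1)[:, None] - 2 * X @ Z.T + (Z ** 2).sum(axis=1)[None, :]
    return np.exp(-squared / sigma ** 2)


def fit_weights(X_train, times, sigma, lam):
    """Solve (K + lam * I) A = Y for the weight vector A."""
    K = gaussian_kernel_matrix(X_train, X_train, sigma)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(K + lam * np.eye(len(K)), times)


def predict(X_train, weights, X_new, sigma):
    """Predicted runtime: weighted sum of kernel values to the training problems."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ weights


# Made-up normalized features and runtimes (in seconds) of three solved problems.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y_demo = np.array([0.5, 2.0, 10.0])
weights = fit_weights(X_demo, Y_demo, sigma=1.0, lam=0.1)
predicted = predict(X_demo, weights, np.array([[0.9, 0.1]]), sigma=1.0)
```

With a very small λ the predictions on the training problems reproduce the observed runtimes almost exactly, which is the overfitting behavior that the regularization term is meant to dampen.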

Crossvalidation
Finally, the values for the regularization constant λ and the kernel width σ need to be determined. This is done via 10-fold cross-validation on the training problems, a standard machine learning method for such tasks [8]. Cross-validation simulates the effect of not knowing the data and picks the values that perform, in general, best on unknown problems. First a finite number of possible values for λ and σ is defined. Then, the training set P^s_train is split into 10 disjoint, equally sized subsets P_1, ..., P_10. For all 1 ≤ i ≤ 10, each possible combination of values for λ and σ is trained on P^s_train − P_i and evaluated on P_i. The evaluation is done by computing the square loss between the predicted runtimes and the actual runtimes. The combination with the least average square loss is used.
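The parameter selection can be sketched as a plain grid search with NumPy. The candidate grids, the synthetic data and all function names below are assumptions made for this illustration. The folds are taken in the given order without shuffling, and the kernel again uses σ² in the denominator.

```python
import itertools

import numpy as np


def kernel(X, Z, sigma):
    squared = (X ** 2).sum(axis=1)[:, None] - 2 * X @ Z.T + (Z ** 2).sum(axis=1)[None, :]
    return np.exp(-squared / sigma ** 2)


def cv_loss(X, y, sigma, lam, folds=10):
    """Average square loss over the folds for one (sigma, lam) combination."""
    indices = np.arange(len(y))
    losses = []
    for test in np.array_split(indices, folds):
        train = np.setdiff1d(indices, test)
        K = kernel(X[train], X[train], sigma)
        weights = np.linalg.solve(K + lam * np.eye(len(train)), y[train])
        predicted = kernel(X[test], X[train], sigma) @ weights
        losses.append(np.mean((predicted - y[test]) ** 2))
    return float(np.mean(losses))


def select_parameters(X, y, sigmas, lams, folds=10):
    """Return the (sigma, lam) pair with the least average square loss."""
    return min(itertools.product(sigmas, lams),
               key=lambda pair: cv_loss(X, y, pair[0], pair[1], folds))


# Synthetic data standing in for normalized features and runtimes.
rng = np.random.default_rng(0)
X_cv = rng.random((40, 3))
y_cv = np.sin(3 * X_cv[:, 0]) + 0.1 * rng.standard_normal(40)
sigma_grid, lam_grid = [0.1, 1.0, 10.0], [0.01, 1.0]
best_sigma, best_lam = select_parameters(X_cv, y_cv, sigma_grid, lam_grid)
```

The cost of this search grows with the product of the grid sizes and the number of folds, and each evaluation solves one linear system per fold.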

Creating Schedules from Prediction Functions
MaLeS uses the knowledge of how different strategies perform on a set of training problems to estimate how these strategies will behave on a new problem. This is done by learning runtime prediction functions as described above using the data gathered with Algorithm 1. With the runtime prediction functions we can create individual strategy schedules for new problems, i.e. compute a strategy schedule for every set of features.
Given a new problem, MaLeS iterates between computing the predicted runtimes for each strategy, running the predicted best strategy and updating the prediction models. Algorithm 3 shows the details.
In line 2 the algorithm starts by running some predefined start strategies. The goal of running these start strategies first is to filter out simple problems, which allows the learning algorithm to focus on the harder problems. The start strategies are picked greedily. First the strategy that solves most problems (within some time limit) is chosen. Then the strategy that solves most of the problems that were not solved by the first picked strategy (within some time limit) is picked, etc. The number of start strategies and their runtime are determined via their respective parameters in the setup.ini file (Table 2). Training problems that are solved by the start strategies are deleted from the training set. For example, let s_1, ..., s_n be the starting strategies, all with a runtime of 1 second. Then for all s ∈ S we can set
$$P^s_{train} := \{ p \in P^s_{train} \mid \forall\, 1 \le i \le n : \tau(p, s_i) > 1 \}$$
and train ρ_s on the updated P^s_train.
Algorithm 3 males: Tries to solve the input problem within the time limit. Creates and runs a strategy schedule for the problem.
 1: procedure males(problem, time)
 2:   proofFound, timeUsed ← run_start_strategies(problem, time)
 3:   if proofFound then
 4:     return timeUsed
 5:   end if
 6:   while timeUsed < time do
 7:     Set times as an empty list
 8:     for s ∈ S do
 9:       t_s ← ρ_s(problem)
10:       times.append([t_s, s])
11:     end for
12:     (t_{s′}, s′) ← choose_best_strategy(times)
13:     proofFound, timeNeeded ← run_strategy(s′, problem, t_{s′})
14:     timeUsed += timeNeeded
15:     if proofFound then
16:       return timeUsed
17:     end if
18:     for s ∈ S do
19:       timeUsed += update_prediction_function(ρ_s, s′, t_{s′})
20:     end for
21:   end while
22:   return timeUsed
23: end procedure
The subprocedure choose_best_strategy in line 12 picks the strategy with the minimum predicted runtime among those that have not been run with a bigger or equal runtime before. run_strategy runs the ATP with strategy s′ and time limit t_{s′} on the problem. If the ATP cannot solve the problem within the time limit, this information is used to improve the prediction functions in update_prediction_function (Line 19).
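The greedy choice of start strategies described in this section can be sketched as follows. The function name and the input format (a dictionary from strategy names to the set of training problems each strategy solves within the start time limit) are assumptions of this sketch, and the data is made up.

```python
def pick_start_strategies(solved_by, n_start):
    """Greedily pick up to n_start strategies that together solve many problems.

    solved_by maps a strategy name to the set of problems it solves
    within the time limit used for start strategies.
    """
    remaining = set().union(*solved_by.values())
    chosen = []
    while remaining and len(chosen) < n_start:
        candidates = [s for s in solved_by if s not in chosen]
        if not candidates:
            break
        # Pick the strategy that solves most of the still unsolved problems.
        best = max(candidates, key=lambda s: len(solved_by[s] & remaining))
        chosen.append(best)
        remaining -= solved_by[best]
    return chosen


# Made-up results for four strategies on six problems.
solved_by = {
    "s1": {1, 2, 3, 4},
    "s2": {3, 4, 5},
    "s3": {5, 6},
    "s4": {1, 2},
}
start = pick_start_strategies(solved_by, n_start=3)
solved_by_start = set().union(*(solved_by[s] for s in start))
```

In the example, the second pick is s3 rather than s2, although s2 solves more problems in total, because only problems not yet covered by s1 count. The problems in `solved_by_start` are the ones that would be removed from the training set before the prediction functions are learned.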
For this, all the training problems that are solved by the picked strategy s′ within the predicted runtime t_{s′} are deleted from the training set P_train, i.e. for all s ∈ S
$$P^s_{train} := \{ p \in P^s_{train} \mid \tau(p, s') > t_{s'} \}$$
Table 1. For the numeric features, threshold values have originally been selected to split the TPTP into 3 or 4 approximately equal subsets on each feature. Over time, these have been manually adapted using trial and error.
Once the classification is fixed, a Python program assigns to each class one of the strategies that solves the most examples in this class. For large classes (arbitrarily defined as having more than 200 problems), it picks the strategy that also is on average the fastest on that class. For small classes, it picks the globally best strategy among those that solve the maximum number of problems. A class with zero solutions by all strategies is assigned the overall best strategy.
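The assignment rule can be illustrated with a short Python sketch. It is not the program shipped with E. The data layout is hypothetical, and "globally best" is read here as "solves the most problems over all classes", which is an assumption. Strategies that solve no problem at all do not appear in the input and are therefore never assigned.

```python
def assign_strategies(classes, runtimes, large=200):
    """Assign one strategy to every problem class.

    classes maps a class name to its list of problems; runtimes maps
    (strategy, problem) to the runtime and has no entry for unsolved pairs.
    """
    strategies = sorted({s for s, _ in runtimes})

    def solved(s, problems):
        return [p for p in problems if (s, p) in runtimes]

    all_problems = [p for problems in classes.values() for p in problems]
    global_score = {s: len(solved(s, all_problems)) for s in strategies}
    overall_best = max(strategies, key=global_score.get)

    assignment = {}
    for name, problems in classes.items():
        counts = {s: len(solved(s, problems)) for s in strategies}
        top = max(counts.values())
        if top == 0:
            assignment[name] = overall_best  # nothing solved in this class
            continue
        candidates = [s for s in strategies if counts[s] == top]
        if len(problems) > large:
            def mean_time(s):
                times = [runtimes[(s, p)] for p in solved(s, problems)]
                return sum(times) / len(times)
            assignment[name] = min(candidates, key=mean_time)
        else:
            assignment[name] = max(candidates, key=global_score.get)
    return assignment


# Made-up data; `large` is lowered to 2 so that a small example suffices.
classes = {"big": ["p1", "p2", "p3"], "small": ["p4"], "unsolved": ["p5"]}
runtimes = {
    ("a", "p1"): 0.5, ("a", "p2"): 0.5,
    ("b", "p1"): 1.0, ("b", "p2"): 3.0, ("b", "p4"): 2.0,
    ("c", "p4"): 1.0,
}
assignment = assign_strategies(classes, runtimes, large=2)
```

In the example, the large class receives the strategy that is fastest on average among the best solvers, whereas the small class receives strategy b even though c is faster on its only problem, because b solves more problems overall.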

The Training Data
The problems from the FOF divisions of CASC-22 [18], CASC-J5 [19], CASC-23 [20] and CASC-J6 and CASC@Turing [21] were used as training problems. Several problems appeared in more than one CASC. There are also a few problems from earlier CASCs that are not part of the TPTP version used in the experiments, TPTP-v5.4.0. Deleting duplicates and missing problems leaves 1112 problems that were used to train E-MaLeS. The strategy search for the set of preselected strategies took three weeks on a 64 core server. The majority of the time was spent running promising strategies with a 300 seconds time limit. Over 2 million strategies were considered. Of those, 109 were selected to be used in E-MaLeS. E-MaLeS runs 10 start strategies, each with a 1 second time limit. E 1.7 (running the automatic mode) and E-MaLeS were evaluated on all training problems with a 300 second time limit. The results can be seen in Figure 3.

The Test Data
Similar to the way the problems for CASC are chosen, 1000 random FOF problems of TPTP-v5.4.0 with a difficulty rating [23] between 0.2 and (including) 1.0 were chosen for the test dataset. 165 of the test problems are also part of the training dataset.
The results are similar to the results on the training problems and can be seen in Figure 4. In the first three seconds, E solves more problems than E-MaLeS. Afterwards, E-MaLeS overtakes E. After 300 seconds, E-MaLeS solves 573 of the problems (57.3%) and E 1.7 511 (51.1%), an increase of 12.4%. Figure 5 shows the results for only the 835 problems that are not part of the training problems.

Satallax-MaLeS
In order to show that MaLeS works for arbitrary ATPs, we picked a very different ATP for the next experiment: Satallax. Satallax is a higher order theorem prover that has a reputation of being highly tuned. The built-in strategy schedule of Satallax solves 95.3% of all solvable problems in the training dataset and, with the right parameters, 91.3% (525) of the training problems can be solved in less than 1 second. The strategy search for the set of preselected strategies was done on a 32 core Intel Xeon with 2.6GHz per CPU and 256 GB of RAM. The evaluations were done on a 64 core AMD Opteron Processor 6276 with 1.4GHz per CPU and 256 GB of RAM.

Satallax's Automatic Mode
Satallax employs a hard-coded strategy schedule that defines a sequence of strategies together with their runtimes. The same schedule is used for all problems. It is defined in the file satallaxmain.ml in the src directory of the Satallax installation. Many modes are only run for a very short time (0.2 seconds). This can cause problems if Satallax is run on CPUs that are slower than the one(s) used to create this schedule.

The Training Data
The problems from the THF divisions of CASC-J5 [19], CASC-23 [20] and CASC-J6 [21] were used as training problems. The THF division of CASC-J5 contained 200 problems, of CASC-23 300 problems, and of CASC-J6 also 200 problems. After deleting duplicates and problems that are not available in TPTP-v5.4.0, 573 problems remain. The strategy search took approximately 3 weeks. In the end, 111 strategies were selected to be used in Satallax-MaLeS. Satallax-MaLeS runs 20 start strategies, each with a 0.5 second time limit. 533 of the 573 problems are solvable with the appropriate strategy. Satallax and Satallax-MaLeS were evaluated on all training problems with a 300 second time limit. Satallax solves 508 of the problems (88.7%). Satallax-MaLeS solves 1.6% more problems for a total of 516 solved problems (90.1%). Figure 6 shows a log-scaled time plot of the results. For low time limits, Satallax-MaLeS solves significantly more problems than Satallax. Satallax's automatic mode appears to be far from optimal for low time limits, which might be a result of focusing only on the number of problems solved after 300 seconds. Best Strategy shows the best possible result, i.e. the number of problems solved if for each problem the strategy that solves it in the least amount of time was picked.

The Test Data
Similar to the E-MaLeS evaluation, the test dataset consists of 1000 randomly selected THF problems of TPTP-v5.4.0 with a difficulty rating between 0.2 and (including) 1.0. 301 of the test problems are also part of the training dataset. The results are similar to the results on the training problems and can be seen in Figure 7. While the end results are almost the same, with Satallax-MaLeS solving 590 (59.0%) and Satallax solving 587 (58.7%) of the problems, Satallax-MaLeS significantly outperforms Satallax for lower time limits. Figure 8 shows the results for only the 699 problems that are not part of the training problems. Here, Satallax-MaLeS solves more problems than Satallax in the beginning, but fewer for longer time limits. After 300 seconds, Satallax solves 344 and Satallax-MaLeS 336 problems.

LEO-MaLeS
LEO-MaLeS is the latest addition to the MaLeS family. LEO-II is a resolution-based higher-order theorem prover designed for fruitful cooperation with specialist provers for natural fragments of higher-order logic. The strategy search for the set of preselected strategies and all evaluations were done on a 32 core Intel Xeon with 2.6GHz per CPU and 256 GB of RAM.

LEO-II's Automatic Mode
LEO-II's automatic mode is a combination of E's and Satallax's automatic modes. The problem space is split into disjoint subspaces and a different strategy schedule is used for each subspace. The automatic mode is defined in the file strategy_scheduling.ml in the src/interfaces directory of the LEO-II installation.

The Training and Test Datasets
The same training and test problems as for the Satallax evaluation were used. The strategy search took 2 weeks. 89 strategies were selected. LEO-II and LEO-MaLeS were run with a 300 second time limit per problem.
Of the 573 training problems 472 can be solved by LEO-II if the correct strategy is picked. LEO-MaLeS runs 5 start strategies, each with a 1 second time limit. Using more start strategies only marginally increases the number of solved problems by the start strategies. LEO-II's default mode solves 415 of the training problems (72.4%), and 367 of the test problems (36.7%). LEO-MaLeS improves this to 441 (77.0%) and 417 (41.7%) solved problems respectively. Figure 9 and Figure 10 show the graphs. Figure 11 shows the results for only the 699 problems that are not part of the training problems.

Further Remarks
There are a few things to note that are independent of the underlying prover.
Multi-core Evaluations: All the evaluations were done on multi-core machines, a 64 core AMD Opteron Processor 6276 with 1.4GHz per CPU and 256 GB of RAM and a 32 core Intel Xeon with 2.6GHz per CPU and 256 GB of RAM. All runtimes were measured in wall-clock time. During the evaluation we noticed irregularities in the runtime of the ATPs. When running a single instance of an ATP, the time needed to solve a problem often differed from the result we got when running several instances in parallel, even when using fewer than the maximum number of cores. It turns out that the number of cores used during the evaluation heavily influences the performance. The more cores were used, the worse the ATPs performed. We were not able to completely determine the cause, but the speed of the hard disk drive, shared cache and process swapping are all possible explanations. Reducing the hard disk drive load by changing the behavior of MaLeS from loading all models at the very beginning to loading them only when they are needed did lead to more (and faster) solved problems. Eventually, all evaluation experiments (apart from the strategy searches for the sets of preselected strategies) were redone using only 20 out of 64 / 14 out of 32 cores and the results reported here are based on those runs.
How Good are the Predictions? Apart from the total number of solved problems, the quality of the predictions is also of interest. In short, they are not very good. The predictions of MaLeS are already heavily biased because the unsolvable problems are ignored (Section 3.3.1). Reducing the number of training problems during the update phase makes the predictions even less reliable. For some strategies, the average difference between the actual and predicted runtimes exceeds 40 seconds. Two heuristics were added to help MaLeS to deal with this uncertainty. First, the predicted runtime must always exceed the minimal runtime of the training data. This prevents unreasonably low (in particular negative) predictions. Second, if the number of training problems is less than a predefined minimum (set to 5) then the predicted runtime is the maximum runtime of the training data. That MaLeS nevertheless gives good results is likely due to the fact that the tested ATPs all utilize either no or very basic strategy scheduling.
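The two heuristics can be expressed as a small guard around the raw prediction. The function name is hypothetical, the training runtimes are assumed to be given as a non-empty list, and the lower bound is applied as "at least the minimal runtime", which is one reading of the first heuristic.

```python
def guarded_prediction(raw_prediction, training_times, min_problems=5):
    """Apply the two safety heuristics to a predicted runtime.

    training_times holds the runtimes of the training problems that are
    currently used for this strategy; it must not be empty.
    """
    if len(training_times) < min_problems:
        # Too little data: fall back to the longest runtime seen.
        return max(training_times)
    # Never predict less than the shortest runtime seen in training.
    return max(raw_prediction, min(training_times))


times_seen = [0.4, 2.0, 7.0, 1.0, 9.0, 12.0]
clamped = guarded_prediction(-3.0, times_seen)
unchanged = guarded_prediction(5.0, times_seen)
fallback = guarded_prediction(5.0, [0.4, 2.0])
```

The first call shows how a negative raw prediction is replaced by the shortest observed runtime, and the last call shows the fallback that is used once the update phase has removed most of the training problems.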
The Impact of the Learning Parameters: Table 2 shows the learning parameters of MaLeS. Tolerance, StartStrategies and StartStrategiesTime had the greatest impact in our experiments. Tolerance influences the number of strategies used in MaLeS. A low value means more strategies, a high value fewer. For E and LEO, higher values (1.0 − 15.0 seconds) gave better results since fewer irrelevant strategies were run. Satallax performed slightly better with a low tolerance, which is probably due to the fact that it can solve almost every problem in less than a second. The values for StartStrategies and StartStrategiesTime determine how many problems are left for learning. 10 StartStrategies with a 1 second StartStrategiesTime are good default values for the provers tested. For LEO-II we found that the number of solved problems barely increased after 5 seconds, and hence changed the number of StartStrategies to 5.

Future Work
Apart from simplifying the installation and set up, there are several other ways to improve MaLeS. We present the most promising ones.
Automated Parameter Configuration: Parameters like Tolerance, StartStrategies and StartStrategiesTime could and should be set automatically. We hope to implement this in the next version of MaLeS.
Features: The quality of the runtime prediction function is limited by the quality of the features. Adding new features and/or integrating feature selection algorithms could increase the prediction capabilities of MaLeS.
Strategy Finding: As an alternative to randomized hill climbing, different search algorithms should be supported. In particular simulated annealing and genetic algorithms seem promising. The biggest problem of the current implementation, the time it needs to find good strategies, could be improved by using a clusterized local search principle similar to the one employed in BliStr [25].
Strategy Prediction: The runtime prediction functions are the heart of MaLeS. Machine learning offers dozens of different regression methods which could be used instead of the kernel methods of MaLeS. A big drawback of the current approach is that it scales badly due to the need to invert a new matrix after every tried strategy. One possible solution for eliminating the need for matrix computations, and also the dependency on Numpy and Scipy, would be a nearest neighbor algorithm.
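To illustrate why a nearest neighbor approach avoids matrix computations, the following is a minimal sketch of a k-nearest-neighbor runtime predictor that uses only the Python standard library. It is not part of MaLeS. The function name, the choice of Euclidean distance and the default k = 3 are assumptions made for the example.

```python
import math


def knn_predict_runtime(features, training_data, k=3):
    """Predict a runtime as the mean runtime of the k nearest training problems.

    features      -- feature vector of the new problem (sequence of floats)
    training_data -- list of (feature_vector, runtime) pairs for one strategy
    k             -- number of neighbors to average over (assumed default)
    """
    if not training_data:
        raise ValueError("need at least one training example")
    # Sort the training problems by Euclidean distance to the new problem.
    by_distance = sorted(training_data,
                         key=lambda pair: math.dist(features, pair[0]))
    nearest = by_distance[:k]
    # Average the runtimes of the nearest neighbors.
    return sum(runtime for _, runtime in nearest) / len(nearest)
```

Updating such a model after a tried strategy only requires appending the new (features, runtime) pair to the training data. No matrix has to be inverted, at the cost of a linear scan over the training data for every prediction.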

Conclusion
Finding the best parameter settings and strategy schedules for an ATP is a time consuming task that often requires in-depth knowledge of how the ATP works. MaLeS is an automatic tuning framework for ATPs that, given the possible parameter settings of an ATP and a set of problems, finds good search strategies and creates individual strategy schedules. MaLeS currently supports E, LEO-II and Satallax and can easily be extended to work with other provers.
Experiments with the ATPs E, LEO-II and Satallax showed that the MaLeS version performs at least comparably to the respective default strategy selection algorithm. In some cases, the MaLeS-optimized version solves considerably more problems than the untuned ATP.
MaLeS aims to simplify the workflow for both ATP users and developers. It allows ATP users to fine-tune ATPs to their specific problems and helps ATP developers to focus on actual improvements instead of time-consuming parameter tuning.
We would like to thank the anonymous reviewers, Jasmin Christian Blanchette and Michael Nahas for their comments on earlier versions of this paper.

Tuning E, LEO-II or Satallax for a New Set of Problems
Tuning an ATP for a particular set of problems involves finding good search strategies and learning prediction models. The search behavior is defined in the file setup.ini in the main directory. Using the default search behavior, E, LEO-II and Satallax can be tuned for new problems as follows:

If True, a log file is created.

LogFile
Name of the log file.

Search Parameter Description

Time
Maximal runtime during search.

Problems
File with the absolute pathnames of the problems.

FullTime
If True, the ATP is run for the value of Time. If False, it is run for the rounded minimal time required to solve the problem.

TryWithNewDefaultTime
If True, findStrategies uses the best strategies from TmpResultsDir and TmpResultsPickle as start strategies for a new search.

Walks
How many different strategies are tried in the local search step.

WalkLength
Up to this many parameters are changed for each strategy in the local search step.
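The roles of Walks and WalkLength in the local search step can be sketched as follows. This is an illustrative reading of the two parameter descriptions above, not the MaLeS implementation: the function name, the dictionary representation of strategies and the parameter names in the usage note are all hypothetical.

```python
import random


def local_search_candidates(strategy, parameter_space, walks, walk_length, rng=None):
    """Generate `walks` new strategies from `strategy`.

    strategy        -- dict mapping parameter name to its current value
    parameter_space -- dict mapping parameter name to a list of allowed values
    walks           -- how many different strategies are tried
    walk_length     -- up to this many parameters are changed per strategy
    """
    if walk_length < 1 or not parameter_space:
        raise ValueError("walk_length must be >= 1 and the parameter space non-empty")
    rng = rng or random.Random()
    candidates = []
    for _ in range(walks):
        candidate = dict(strategy)  # copy, so the start strategy stays intact
        # Pick how many parameters to touch in this walk (at most walk_length).
        n_changes = rng.randint(1, min(walk_length, len(parameter_space)))
        for name in rng.sample(sorted(parameter_space), n_changes):
            # A re-drawn value may equal the old one, so fewer than
            # n_changes parameters may actually differ afterwards.
            candidate[name] = rng.choice(parameter_space[name])
        candidates.append(candidate)
    return candidates
```

For example, with a made-up parameter space such as `{"ordering": ["KBO", "LPO"], "weight": [1, 2, 3]}`, calling the function with `walks=4` and `walk_length=2` yields four candidate strategies, each differing from the start strategy in at most two parameters.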

Learn Parameter Description

Features
Which features to use. Possible values are E for the E features and TPTP for the TPTP features.

FeaturesFile
Location of the feature file.

StrategiesFile
Location of the strategies file.

KernelFile
Location of the file containing the kernel matrices.

CrossValidate
If False, no crossvalidation is done during learning. Instead the first values in RegularizationGrid and KernelGrid are used.

CrossValidationFolds
How many folds to use during crossvalidation.

StartStrategies
Number of start strategies.

StartStrategiesTime
Runtime of each start strategy.

CPU Bias
This value is added to each runtime before learning. Serves as a buffer against runtime irregularities.

Tolerance
For a strategy s to be considered as a good strategy, there must be at least one problem where the difference of the best runtime of any strategy and the runtime of s is at most this value.

Run Parameter Description

CPUSpeedRatio
Predicted runtimes are multiplied with this value. Useful if the training was done on a different machine.

MinRunTime
Minimal time a strategy is run.

Features
Either TPTP for higher order features or E for first order features.

StrategiesFile
Location of the strategies file.

FeaturesFile
Location of the feature file.

OutputFile
If not None, the output of MaLeS is stored in this file.

Strategies defined in strategies.ini are used to initialize the strategy queue during the strategy searching for the set of preselected strategies. The default ini format is used. Each strategy is its own section with each parameter on a separate line. At least one strategy must be defined. After the ini files are adapted, the new ATP can be tuned and run using the procedure defined in the last two sections.
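As a hedged illustration of the ini layout described for strategies.ini (one section per strategy, one parameter per line), the following sketch parses such a file with Python's standard configparser module. The section names and parameter names in `EXAMPLE_INI` are invented for this example and do not correspond to the actual options of any prover, and `load_strategies` is a hypothetical helper, not a MaLeS function.

```python
import configparser

# Hypothetical file content: section and parameter names are made up.
EXAMPLE_INI = """
[Strategy0]
term_ordering = KBO
clause_selection = weight_first

[Strategy1]
term_ordering = LPO
clause_selection = age_first
"""


def load_strategies(ini_text):
    """Return a dict mapping each strategy (section) name to its parameters."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(ini_text)
    strategies = {name: dict(parser[name]) for name in parser.sections()}
    # Mirror the requirement that at least one strategy must be defined.
    if not strategies:
        raise ValueError("at least one strategy must be defined")
    return strategies
```

Calling `load_strategies(EXAMPLE_INI)` returns one dictionary of parameter settings per section, which could then be used to seed a strategy queue.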

CASC Results
MaLeS 1.2 is the third iteration of the MaLeS framework. E-MaLeS 1.0 competed at CASC-23, E-MaLeS 1.1 at CASC@Turing and CASC-J6, and E-MaLeS 1.2 at CASC-24. Satallax-MaLeS competed for the first time at CASC-24. We give an overview of the older versions, the CASC performance and the changes over the years.

CASC-23
E-MaLeS 1.0 [9] was the first MaLeS version to compete at CASC. Stephan Schulz provided us with a set of strategies and information about their performance on all TPTP problems. This data was used to train a kernel-based classification model for each strategy. Given the features of a problem p, the classification models predict whether or not a strategy can solve p. Altogether, three strategies were run. First E's auto mode for 60 seconds, then the strategy with the highest probability of solving the problem as predicted by a Gaussian kernel classifier for 120 seconds. Finally the strategy with the highest probability of solving the problem as predicted by a linear (dot-product) kernel classifier was run for the remainder of the available time. E-MaLeS 1.0 won third place in the FOF division. Table 5 shows the results.

CASC@Turing and CASC-J6
E-MaLeS 1.1 [10] changed the learning from classification to regression. Like E-MaLeS 1.0, E-MaLeS 1.1 learned from (an updated version of) Schulz's data. Instead of predicting which strategy to run, E-MaLeS 1.1 learned runtime prediction functions. The learning method is the same as the one presented in this chapter, without the updating of the prediction functions. E-MaLeS 1.1 first ran E's auto mode for 60 seconds. Afterwards, each strategy was run for its predicted runtime, starting with the strategy with the lowest predicted runtime. E-MaLeS 1.1 won second place in the FOF divisions of both CASC@Turing (Table 6) and CASC-J6 (Table 7). It also came fourth in the LTB division of CASC-J6. Tables 8 and 9.
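The scheduling scheme of E-MaLeS 1.1 described above can be sketched as follows. This is a simplified reconstruction from the description, not the competition code: the function name is hypothetical, the schedule is computed up front instead of stopping when a proof is found, and cutting the last slot to the remaining time is an assumption.

```python
def build_schedule(predicted_runtimes, time_limit, auto_mode_time=60.0):
    """Return a list of (strategy, seconds) pairs.

    predicted_runtimes -- dict mapping strategy name to predicted runtime (seconds)
    time_limit         -- overall time limit for the problem (seconds)
    auto_mode_time     -- time given to E's auto mode first (60 s in the text)
    """
    schedule = []
    remaining = time_limit
    # Step 1: run E's auto mode first.
    first = min(auto_mode_time, remaining)
    if first > 0:
        schedule.append(("auto", first))
        remaining -= first
    # Step 2: run the strategies in order of increasing predicted runtime,
    # each for its predicted runtime, until the time limit is used up.
    for strategy, predicted in sorted(predicted_runtimes.items(),
                                      key=lambda item: item[1]):
        if remaining <= 0:
            break
        slot = min(predicted, remaining)  # assumption: truncate the last slot
        schedule.append((strategy, slot))
        remaining -= slot
    return schedule
```

For a 300 second limit and predicted runtimes of 30, 10 and 200 seconds for three strategies, the sketch schedules auto mode for 60 seconds, then the strategies in the order 10, 30, 200 seconds.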