1 Introduction

The performance of algorithms and solvers varies greatly depending on the settings of the parameters controlling the behavior of the approach [1]. In particular, parameter settings that work well for one dataset of instances may work poorly on a different dataset, especially when the datasets differ in problem structure or instance size. To ensure good performance of an algorithm, either in terms of runtime or solution quality, it is critical that algorithm parameters be configured or tuned for the types of instances the algorithm is expected to solve in practice.

Searching for high-quality parameter settings by hand is a time-consuming endeavor, hence several tools have been developed to automatically determine good parameter settings for a solver or algorithm given a dataset of instances. These tools use a variety of methods, including fractional factorial design [2], local search [3, 4], genetic algorithms [5,6,7], Bayesian optimization [8, 9], and racing [10] (see [1] for a full overview). Most algorithm configurators support configuring for one or both of the following settings: (1) minimization of target algorithm runtime or (2) maximization of solution quality. Some target algorithms, such as mixed-integer programming solvers, require a mixture of configuring for runtime and solution quality to be tuned effectively for a given dataset [6].

Mixed-integer programming (MIP) solvers can tackle a wide range of problem types and thus ought to be configured for the instance set they are meant to solve. Indeed, with this in mind, IBM CPLEX, Gurobi, and FICO Xpress, three of the most well-known general MIP solvers, have built-in parameter tuning capabilities [11,12,13]. Moreover, MIP solvers have been a focus of the algorithm configuration (AC) community for some time, with early results providing speed-ups of up to 52x on the CPLEX solver [14] and recent results showing there are still performance gains to be had in tuning these approaches [15, 16].

Most of the successes of AC methods on MIP have involved small-scale problems; however, in industry, problems with tens of thousands or even millions of variables must be solved on a regular basis. These extremely large problems pose a challenge to configurators. On the one hand, when tuning for runtime, many instances will likely not finish within the given timeout, leading to wasted executions and poor performance of the configurator. On the other hand, when tuning for solution quality, many MIP runs may not find feasible solutions, meaning a mechanism for comparing these failed executions is required to provide the configurator with a search trajectory.

This paper introduces the OPTANO Algorithm Tuner (OAT), a general algorithm configurator that has a special focus on addressing large-scale MIP instances. The contributions are as follows:

  • We describe OAT, an AC tool based on the GGA algorithm.

  • We investigate a dominance racing mechanism to shorten configuration times on MIP without sacrificing overall performance.

  • We further show on a real world problem that configuring smaller copies of large instances (i.e., instances of reduced size, but similar structures to large instances) is effective for finding good configurations for the large instances.

We make OAT freely available under the MIT license at https://github.com/OPTANO/optano.algorithm.tuner.

This paper is organized as follows: We discuss the current state-of-the-art for configuring MIP solvers in Section 2. In Section 3, we describe OAT, which forms the experimental basis for this work followed by the extensions of OAT specifically for configuring MIP solvers. We evaluate the extensions computationally in Section 4 and show that OAT can find high-quality configurations for a large-scale, real-world MIP dataset. Finally, we discuss future work and conclude in Section 5.

2 Related Work and Background Information

We provide a general overview of AC, including offline and realtime AC, and discuss its application to configure MIP solvers. For further details about AC and related problem settings, we refer interested readers to [1].

2.1 Offline Automated AC

We first formalize the offline AC problem and adopt the notation in [3]. The goal of AC is to optimize the performance of a parameterized algorithm \(\mathbb {A}\). To achieve this, the configurator searches for high-quality parameter configurations \(\theta \) in the space of all possible configurations \(\Theta \) of \(\mathbb {A}\). The quality of a configuration is measured by a performance metric m on a set of problem instances \(\Pi \subseteq \hat{\Pi }\), where \(\hat{\Pi }\) represents the full distribution of problem instances and \(\Pi \) the sample the AC approach is provided, such that \(m: \hat{\Pi } \times \Theta \rightarrow \mathbb {R}\). The general process of algorithm configuration is depicted in Fig. 1.

Fig. 1: The information flow of offline automated AC

Offline AC aims at finding a high-quality configuration \(\theta ^*\) that performs well over any possible set \(\Pi \) drawn from \(\hat{\Pi }\). To this end, a set of problem instances \(\Pi \), called the training set, is drawn from \(\hat{\Pi }\) in such a way that it is representative of \(\hat{\Pi }\) and is provided to the AC method. The configuration space \(\Theta \) is searched for high-quality configurations \(\theta _i\) on the training set, where the aim is to minimize \(\sum _{\pi \in \hat{\Pi }} m(\pi , \theta )\) in the runtime scenario, whereas in the solution quality scenario, this term is maximized.
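For illustration only, the aggregation of the performance metric over the training sample can be written as the following minimal sketch (the names `empirical_cost` and `configure` are hypothetical and not part of any configurator's API; a real configurator searches \(\Theta \) rather than enumerating a fixed candidate list):

```python
def empirical_cost(m, theta, training_set):
    """Aggregate the performance metric m over the training sample Pi."""
    return sum(m(pi, theta) for pi in training_set)

def configure(candidate_configs, m, training_set, minimize=True):
    """Pick the candidate configuration with the best aggregated metric:
    minimized in the runtime scenario, maximized for solution quality."""
    key = lambda theta: empirical_cost(m, theta, training_set)
    return min(candidate_configs, key=key) if minimize else max(candidate_configs, key=key)
```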

Several well-known approaches have been developed for offline AC, using both model-based (i.e., using machine-learned models to predict/evaluate configurations) and non-model-based techniques. ParamILS [3], a non-model-based approach, employs an iterated local search combined with an adaptive capping mechanism to avoid wasting CPU time on poorly performing configurations. The AC method on which our approach in this paper is based, GGA [5, 7], uses a genetic algorithm with a tournament-based racing mechanism, while irace [10] also uses racing, but in a statistical fashion. GPS [4] exploits parameter configuration landscape structures and examines parameters in a semi-independent way. Model-based configurators include SMAC [8], which is based on a Bayesian optimization paradigm that uses a random forest to predict the performance metric of a given configuration, and GGA++ [6], which uses a random forest with a modified tree-building mechanism to predict configuration performance.

2.2 Algorithm Configuration for MIP

The AC community has long targeted the MIP setting due to the long runtimes of solving MIP instances and the industrial relevance of MIPs. Several commercial MIP solvers include configuration procedures, such as CPLEX [11], Gurobi [12], and FICO Xpress [13], although we note that these have not been shown to be more effective than any AC method in the literature.

ParamILS is used to configure the solvers CPLEX, Gurobi, and LpSolve in [14], resulting in significant speedups on seven different instance sets. Several MIP settings are included in AClib [17], allowing developers of AC methods to test on standard benchmarks. AC methods have also been used to configure MIPs in instance-specific settings, i.e., a specific configuration \(\theta \) is assigned to each instance in \(\hat{\Pi }\), e.g., in [18] using the Hydra method [19] and in [20] using the instance-specific algorithm configuration (ISAC) approach. We further note that online/dynamic approaches for configuring MIP parameters exist, e.g., DASH [21] (see also dynamic AC (DAC) [22]).

3 OPTANO Algorithm Tuner

OAT is a general algorithm configurator distributed as a .NET NuGet package that can be used as a standalone configurator or integrated directly into solvers or algorithms written in .NET. The goal of OAT is to provide a configurator that has state-of-the-art performance combined with the reliability expected of software running in production. While OAT was originally based on GGA [5] and GGA++ [6], it has since been extended to include search strategies based on JADE [23] and active CMA-ES [24]. OAT is inherently distributed and can run its target algorithm in parallel across multiple machines to reduce the overall wall-clock time of the configuration process. In addition, OAT supports configuring in multiple sessions, allowing the configuration process to be restarted should it be interrupted by, e.g., a system failure or reaching a resource limit. Finally, OAT includes numerous ideas from the literature, including parameter tree customization (from GGA), non-numeric evaluation metrics (GGA++), and adaptive capping strategies (ParamILS). We first describe how OAT works and how it distributes jobs across cores, which differs from previous distributed versions of GGA. Then, we introduce its MIP-specific enhancements, namely the new evaluation metric and short-circuit domination rule.

3.1 OAT’s Configuration Process

Methodologically, OAT is based on the genetic algorithm-based GGA and GGA++ configurators, although its engineering is independent. The operation of OAT consists of three phases: (1) initialization, (2) the main loop, and (3) convergence/termination. The main loop iterates until OAT reaches the maximum number of generations (as specified by the user), reaches a maximum number of evaluations of the target algorithm, or runs out of time. Figure 2 provides an overview of OAT's operation, and we refer readers to [5] and [6] for further details.

Fig. 2: Overview of the GGA approach [5] used in OAT

Initialization

Considering the previously introduced formalization of algorithm configuration, OAT needs the following four inputs to start its search. First, it needs a list of instances, I, that will be investigated, potentially associated with random seeds. Second, OAT must be told how to invoke the target algorithm, either on the command line or through an interface into other .NET code. Third, the target algorithm parameters to be configured must be specified. OAT takes a structured view of parameters as in GGA, accepting a parameter tree defining relations between parameters (see part 1(c) of Fig. 2). For example, Gurobi [12] contains several parameters relating to the presolver that can be adjusted to change its behavior. Another parameter turns the presolver on and off, meaning that the presolver-related parameters depend on this on/off parameter. This information is used during search to generate new configurations: the dependent parameters are placed below the presolver on/off parameter in the tree, and the recombination procedure takes this into account when creating new individuals. Finally, OAT's own internal parameters, which control how its search strategy functions, can be changed from their default values.
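As an illustration of such a dependency, a small parameter tree could be declared as follows. This is a minimal sketch, not OAT's actual input format; the value domains are illustrative of Gurobi's presolve-related parameters.

```python
# Illustrative parameter tree: children of the presolve switch are only
# meaningful when presolving is enabled, so recombination treats the
# subtree as a unit. Domains shown are illustrative, not exhaustive.
parameter_tree = {
    "Presolve": {                                   # parent on/off/intensity switch
        "domain": [-1, 0, 1, 2],                    # automatic / off / conservative / aggressive
        "children": {
            "PrePasses":   {"domain": list(range(-1, 21)), "children": {}},
            "PreSparsify": {"domain": [-1, 0, 1], "children": {}},
        },
    },
    "MIPFocus": {"domain": [0, 1, 2, 3], "children": {}},
}
```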

Given the inputs outlined above, OAT initializes a population consisting of the default configuration(s) and randomly generated configurations, partitioned into two groups: competitive (C) configurations, which will be run on the target algorithm, and non-competitive (N) configurations, which act as a diversity store.
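A minimal sketch of this initialization step is given below; the function names and the even split are illustrative assumptions, not OAT's exact behavior.

```python
import random

def initialize_population(default_configs, sample_random_config, pop_size):
    """Seed the population with the default configuration(s) plus random ones,
    then split it into a competitive group (C) and a non-competitive group (N)."""
    population = list(default_configs)
    while len(population) < pop_size:
        population.append(sample_random_config())
    random.shuffle(population)
    half = pop_size // 2
    competitive, non_competitive = population[:half], population[half:]
    return competitive, non_competitive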

Main Loop

This phase consists of up to n generations, in which configurations from the C population are assessed in races and the winners are recombined with non-competitive configurations. At the beginning of each generation, a subset of instances is sampled from the instance pool. This subset grows linearly with each generation until either all instances are used or a user-specified maximum is reached. All configurations of the competitive population must be evaluated on the currently active subset of instances. We note that some configurations may have already been evaluated on some of the instances in previous generations; these configurations need not be evaluated on the same instances again. If the size of the competitive population is larger than the number of available CPUs, the configurations are split into mini-tournaments whose size equals the number of available CPUs. In the pure runtime setting, mini-tournaments are executed until a fixed percentage of the configurations have solved all instances; in the case of Fig. 2, only one configuration can win the race. The remaining configurations are terminated once they have used the same amount of CPU time as the winning configuration(s). The mechanism by which OAT distributes mini-tournaments is described in more detail below. In the case of MIP, we slightly modify this procedure, as described in Sections 3.2 and 3.3.
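The following sketch illustrates the growing instance subset and the split into mini-tournaments; the linear schedule shown is an assumption for illustration, not OAT's exact formula.

```python
def instance_subset_size(generation, num_generations, num_instances, max_subset=None):
    """Linearly grow the per-generation instance subset, capped by the full
    instance pool or a user-specified maximum (illustrative schedule)."""
    cap = max_subset if max_subset is not None else num_instances
    size = max(1, round(num_instances * (generation + 1) / num_generations))
    return min(size, cap)

def split_into_mini_tournaments(competitive, num_cpus):
    """Split the competitive population into mini-tournaments of at most num_cpus configurations."""
    return [competitive[i:i + num_cpus] for i in range(0, len(competitive), num_cpus)]
```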

After all planned evaluations for the current generation are completed, the population is updated based on the performance of the configurations. Several options are available to do this, such as the crossover mechanism of GGA, the genetic engineering of GGA++, as well as approaches based on JADE and active CMA-ES. Figure 2 shows the GGA crossover mechanism, in which the winners of the mini-tournaments are recombined with randomly chosen members of the non-competitive population. The crossover procedure constructs a new configuration by randomly choosing components from the two parents. We refer to [5] for the full details of this algorithm and of the subsequent mutation operator. A specified percentage of the population (usually one third) is replaced every generation through the recombination procedure in the hope of generating high-quality configurations. Model-based recombination is also possible in OAT using the GGA++ recombination strategy; see [6] for details.
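A flat-genome sketch of this recombination idea is shown below; GGA's actual operator additionally respects the parameter tree structure and is followed by mutation, which is omitted here.

```python
import random

def crossover(competitive_parent, non_competitive_parent):
    """Build a child configuration by picking each parameter value from one of the
    two parents at random (simplified; tree-aware selection is not modeled)."""
    child = {}
    for name in competitive_parent:
        donor = random.choice((competitive_parent, non_competitive_parent))
        child[name] = donor[name]
    return child
```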

Termination

The main loop of OAT runs until one of three conditions is met. The first is that the maximum number of generations is reached (usually 75 or 100). The second is that a maximum number of evaluations of the target algorithm is exceeded. Finally, the third is that the maximum wall-clock time of the configurator is exceeded. Note that GGA supports a convergence criterion that checks whether the population is still improving, but this is not yet implemented in OAT.

Increasing Mini-Tournament CPU Utilization

The mini-tournaments as described above must be efficiently distributed across the available CPU resources. A key engineering advancement of OAT over previous GGA configurators is that it attempts to maximally utilize available CPU resources. While OAT runs mini-tournaments to race competitive configurations, it distributes mini-tournaments across multiple nodes according to a priority queue of configuration-instance-seed tuples that must still be run, leading to less wasted CPU capacity than, e.g., GGA and GGA++. OAT prioritizes configurations that it believes are likely to finish first so that the finishing time can be used in the short-circuit evaluation of other configurations according to the formula

$$\begin{aligned} \textit{priority}(c) = 100 \left( \frac{\textit{timeouts}(c)}{|I_g|}\right) + 10 \left( \frac{\textit{running}(c)}{|I_g|}\right) + \frac{\textit{runtime}(c)}{\kappa |I_g|}, \end{aligned}$$

where c is a configuration being evaluated in the current generation, g, \(\textit{timeouts}(c)\) provides the number of timeouts the configuration c has had in the current generation so far, \(I_g\) is the instance subset being considered in the current generation, \(\textit{running}(c)\) describes the number of instance-seed pairs c is currently running on, \(\textit{runtime}(c)\) gives the total runtime of c so far in the current generation, and \(\kappa \) is the timeout as previously defined.

The proposed mechanism runs configuration-instance-seed tuples with a low priority value first. The intuition is that a low priority score corresponds first to configurations with few timeouts, then to configurations with few evaluations currently running, and finally to configurations whose total runtime is a small fraction of the CPU-time allotment. In this way, configurations that are likely to finish their mini-tournaments first are preferred, allowing poorly performing configurations to be dominated before they waste CPU resources (see Section 3.3).
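For reference, the priority score above translates directly into the following sketch (a lower value means the tuple is scheduled earlier):

```python
def priority(timeouts, running, runtime, num_instances, kappa):
    """Priority score from the formula above; lower values are scheduled first.
    timeouts, running, and runtime refer to configuration c in the current
    generation, num_instances is |I_g|, and kappa is the per-run timeout."""
    return (100.0 * timeouts / num_instances
            + 10.0 * running / num_instances
            + runtime / (kappa * num_instances))
```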

3.2 MIP Evaluation Metric

The evaluation metric tells OAT how to interpret and aggregate the performance of the target algorithm on a subset of the training instances. One of the main considerations when developing a runtime evaluation metric is how to deal with timeouts. While many configurators simply use the average performance of a configuration over a set of instances, this does not significantly discourage timeouts from occurring. Hence, many configurators also support the so-called PAR10 score, which counts each timed-out run as 10 times the timeout when computing the average.
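As a reference point, a minimal sketch of the PAR10 computation is given below, assuming that each observed runtime is capped at the timeout kappa.

```python
def par10(runtimes, kappa):
    """Penalized average runtime: runs that hit the timeout kappa count as 10 * kappa."""
    return sum(10 * kappa if t >= kappa else t for t in runtimes) / len(runtimes)
```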

While PAR10 effectively penalizes timeouts, when an instance set contains many difficult instances, it often does not offer effective search guidance. To improve on this, the gray-box configuration schemes of [25] and [16] analyze intermediate output of the target algorithm to assist in ranking or otherwise scoring timeouts. In the case of CPPL, ties between configurations that do not finish are broken using the quality of the feasible solution found (if one was found).

In contrast to realtime configuration, where only a single instance is solved per iteration, breaking ties in offline configuration is somewhat more complicated. Especially in the first few iterations of configuration, timeouts are very likely as the search process has not yet identified good configurations. Hence, it is critical to have an effective mechanism for comparing configurations even if none find optimal solutions to the instances being solved. Thus, to minimize the runtime of solving MIPs, we propose the following simple ranking scheme. Assume we are given two configurations A and B that are run on n instances; the following rules are applied in order:

  1. If A finds more feasible solutions than B, A is better.

  2. Otherwise, if A has fewer timeouts than B, A is better.

  3. Otherwise, if A has a lower average MIP gap over the timed-out runs than B, A is better.

  4. Otherwise, if A has a lower average runtime than B, A is better.

Since the runtime is a floating point value, and there is generally some noise in its measurement, this ranking is all but guaranteed to return a total order over the available configurations. Note that the focus of the ranking is on feasibility and not optimality. The reason for this is that companies solving MIPs in practice would much rather have feasible solutions for all of the instances they are investigating than optimal solutions on a few and no solution at all on the rest. However, while our motivation for this rule set is a practical one, we show later that there is also a computational benefit, as these rules help guide the configurator's search towards areas of the search space with configurations effective at finding optimal solutions.
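For concreteness, the four rules can be encoded as a lexicographic sort key, as in the following sketch; the dictionary field names are illustrative and not OAT's actual data structures.

```python
def rank_key(result):
    """Lexicographic key for the ranking rules above; sorting ascending puts
    the better configuration first."""
    return (
        -result["feasible"],         # 1. more feasible solutions is better
        result["timeouts"],          # 2. fewer timeouts is better
        result["avg_gap_timeouts"],  # 3. lower average MIP gap on timed-out runs is better
        result["avg_runtime"],       # 4. lower average runtime is better
    )

# Example usage: configurations sorted from best to worst
# ranked = sorted(results_per_config, key=rank_key)
```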

3.3 Dominance Racing

Running MIPs is computationally expensive; thus, if we detect that a particular configuration is dominated, we can stop running it and use the available resources to run something else. GGA and GGA++ accomplish this in the average or PAR10 runtime setting through their racing mechanism, which ensures that configurations that are dominated are stopped before wasting CPU resources. However, when using a ranking mechanism for MIP, we need to adjust the domination criteria to avoid wasting CPU time.

The goal of our short-circuit evaluation is to ensure that configurations with no chance of winning their mini-tournament are stopped as soon as this is detected. Given a mini-tournament, once one of the configurations finishes an instance, we can check whether any other configuration in the mini-tournament is dominated. Let the current best configuration of the tournament be A, and without loss of generality, let another configuration in the mini-tournament that is not yet finished be B. Let all unfinished instances of A be considered timeouts for the purpose of the domination, and let all unfinished instances of B be treated as optimal solutions found immediately. Then, using our ranking mechanism, rank A and B. If B is ranked worse than A, we know that B will never be better than A and can be eliminated from consideration.
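A minimal sketch of this domination check is given below, reusing the ranking key sketched in Section 3.2; the helper `complete_optimistically` is a hypothetical function that fills B's unfinished runs as if each found an optimal solution immediately.

```python
def is_dominated(leader_result, candidate_partial, complete_optimistically, rank_key):
    """Return True if a still-running configuration (B) can no longer beat the current
    tournament leader (A). leader_result must already count A's unfinished runs as
    timeouts; candidate_partial holds B's finished runs only."""
    optimistic_candidate = complete_optimistically(candidate_partial)
    return rank_key(optimistic_candidate) > rank_key(leader_result)
```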

4 Experimental Results

We evaluate OAT on a set of synthetic MIP instances that model the frequency assignment problem from [26] followed by a real-world instance set modeling a strategic network planning problem for a customer of OPTANO GmbH. We address the following two research questions:

  • RQ1: Does the dominance racing allow OAT to find the same or better configurations in less wall-clock time than without dominance racing?

  • RQ2: Can OAT find high-quality configurations for small datasets of long-running MIPs?

In the following, we run OAT with the GGA++ search strategy to configure the target algorithm Gurobi 8.1 on two Intel Xeon E5-2680 processors with 12 cores each running at 2.5 GHz and 256 GB of RAM.

4.1 RQ1: Effectiveness of Dominance Racing

We examine the effectiveness of the dominance racing mechanism using the proposed ranking method and the standard PAR10 metric on a dataset of synthetic instances representing the frequency assignment problem [26]. We configure OAT for 100 generations and increase the number of instances in each generation linearly until generation 75, after which all instances are run in each generation. We allow Gurobi to use a single thread. The dataset of instances is split into 25 training instances with 2 seeds each and a test set of 25 instances with 10 seeds each, the latter to prevent the erraticism/variability of MIP solving [27] from influencing the results. The timeout for Gurobi is set to 300 s. We repeat this experiment three times.

Table 1 Average target algorithm evaluations and runtime of OAT, SMAC and grbtune over three executions of OAT and 72 executions of SMAC and grbtune (leading to an equivalent total CPU-time allotment), along with the resulting performance of the configurations on Gurobi on average over these executions. The evaluation time of Gurobi with its default parameters on the instance set is also provided

Table 1 shows the average results over three runs of OAT tuning the synthetic frequency assignment problem instances, using PAR10 with and without the dominance racing mechanism, and the ranking metric with it. RR stands for runtime racing, the type of racing used by GGA++ in which a tournament is stopped once enough configurations finish. By definition, this is a special case of dominance racing (DR). While dominance racing is more aggressive than runtime racing, it never eliminates a run of a potential tournament winner. Since the PAR10 results already give a clear picture and these experiments are computationally intensive, we refrain from combining ranking with runtime racing and focus on ranking with dominance racing. Dominance racing cuts the overall wall-clock time to configure with OAT by nearly half, without sacrificing any performance on the test set using both PAR10 and the ranking metric after 100 generations. If OAT is stopped after 72 h, the performance of DR is superior to not using it on both the training and test sets. It is thus clear that many of the target algorithm executions are actually avoidable. Combining the ranking metric with DR leads to even faster configuration than using PAR10; however, the test performance both after 72 h and after 100 generations shows signs of overfitting. We note that DR itself cannot result in overfitting, so this is potentially due to the ranking mechanism being overly aggressive. Finally, we note that regardless of the configuration of OAT, large gains over the default parameters are visible.

OAT performs well compared to SMAC and grbtune thanks to the addition of RR and DR, which we are unable to add to these configurators. We note that grbtune, the built-in parameter tuner of Gurobi, actually finds configurations that perform worse than the default configuration on average. In about half the runs of grbtune, the default configuration was returned. However, in the other half, poorly performing configurations were returned that did not generalize to the test set. As the implementation of grbtune is proprietary, we are unable to suggest why this is the case.

The three configurations found by OAT agree with the default parameters of Gurobi roughly one third of the time. Note, however, that which parameters match the defaults varies greatly between the configurations. The agreement between the three configurations ranges from 26% to 40% of the parameters. Note that there is only one floating point parameter, and it is different for all configurations. Furthermore, the parameters with large integer ranges tend to differ significantly from each other. The key insight from this analysis is that significantly different configurations can have similar performance. While Gurobi includes many parameters that can be set to "automatic" values (i.e., Gurobi uses a heuristic to decide the value in an instance-specific way), our configurations only use this option between 21% and 28% of the time.

4.2 RQ2: Configuring Large-Scale MIPs

We now configure a small dataset of large-scale, real MIP instances from a project at OPTANO GmbH. The instances represent a strategic network planning problem for production and delivery for a customer who cannot be named. The instances contain over 470,000 variables, over 50,000 of which are integer. Furthermore, the instances have over 2.5 million non-zeros, making them practically impossible to solve to optimality even given days of computation time. Due to the strategic nature of the problem, there are only three instances available, and they are extremely hard to solve. The goal is to find a good configuration to solve these three instances as well as future instances that the customer may need solved.

To configure this dataset, we first create three small-scale copies of the three instances. While we are unable to propose a general method for doing this, for many problems it is possible to create smaller copies that maintain the original structure of the instance. For the network planning problem at hand, we generate smaller instances by restricting the degrees of freedom in the optimization model. For example, we decide a priori which demands will be served and which machines will be used for those demands. We then configure on the small instances and apply the resulting configurations to the large instances.
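As a purely generic illustration of the idea of restricting degrees of freedom (not the model-level reduction we actually perform, which relies on domain knowledge to keep the instances feasible and structurally representative), one could fix a subset of binary decision variables of a MIP file using gurobipy; the function name and the fraction kept are hypothetical.

```python
import random

import gurobipy as gp
from gurobipy import GRB

def shrink_instance(path_in, path_out, keep_fraction=0.2, seed=0):
    """Illustrative sketch: fix most binary variables to 0 to obtain a smaller copy.
    In practice, which variables to fix must be chosen with domain knowledge,
    since random fixing can easily make the instance infeasible."""
    random.seed(seed)
    model = gp.read(path_in)
    for v in model.getVars():
        if v.VType == GRB.BINARY and random.random() > keep_fraction:
            v.LB = v.UB = 0.0   # remove this degree of freedom
    model.update()
    model.write(path_out)
```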

Due to the small size of the dataset, we attempt to mitigate the effect of randomness on our results so that we can draw some limited conclusions about tuning small-scale copies of MIPs. To this end, we use leave-one-out (LOO) cross-validation and perform four configuration runs on the small instances: one with all three instances and three more with each combination of two out of the three instances. Validation is performed on the large instances; in this way, we simulate the situation where a new large instance is encountered after configuration is completed. We assign four random seeds to each training instance and 10 seeds to each large instance in the test set. We again configure Gurobi 8.1, but this time allow Gurobi to use two threads, as this is how the customer's environment is set up. We configure for 50 generations and use the entire set of instance-seed pairs in all generations. The timeout is set to 30 s.
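The four training sets can be enumerated as in the following sketch; the instance names are hypothetical placeholders for the three small-scale copies, and validation always uses the three large instances.

```python
from itertools import combinations

small_instances = ["small_copy_1", "small_copy_2", "small_copy_3"]  # hypothetical names
training_sets = [list(small_instances)] + [list(pair) for pair in combinations(small_instances, 2)]
# -> one run on all three small copies and three leave-one-out runs, each on a pair
```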

Table 2 Runtime until the first solution is found in minutes on the large instance test set

We configure and evaluate the four settings as discussed and aggregate the results in Table 2, which shows the average time in minutes until the first solution is found on the test set. We note that the default parameters were not particularly robust, with two out of thirty instance-seed pairs requiring almost 8 h to find a first solution, which is unacceptable for implementing the model for the customer. The configurations found by OAT are significantly more robust, with the maximum time needed to find a first solution being only 4.0 min (all instances) and 9.0 min (LOO), respectively.

Fig. 3: Visualizations over the course of solving the instances of the test set

Figure 3a and b provide information over the course of the execution on the test set instances. In Fig. 3a, the configuration tuned on all three instances lowers the MIP gap the fastest, but then levels out. Surprisingly, the LOO configurations catch up before the three-hour mark and lower the MIP gap further, although we note this is very likely just noise. Nonetheless, the fact that configuring on two out of three instances performs similarly to configuring on all three instances is promising for the many real-world domains where often very few instances are available. Furthermore, Fig. 3b shows that all configurations found generate more solutions than the default Gurobi configuration. While the number of solutions alone is not an indication of quality (they could all be bad), when solving MIPs, finding intermediate solutions is generally valuable for users, who receive feedback at multiple points during the solve.

5 Conclusion and Future Work

We introduced OAT, a general-purpose algorithm configuration tool with special mechanisms for configuring MIP solvers. We showed that using the proposed ranking mechanism and short-circuit domination rule significantly reduces the time required to configure MIPs on a synthetic dataset. Furthermore, we confirmed that OAT is capable of configuring a solver for a real-world strategic network planning problem, a setup that has since been deployed at a customer of OPTANO. There are still many open research questions for future work, such as how to automatically reduce the size of large MIP instances so that they can be effectively tuned, or how to further avoid wasting time on unpromising configurations by, e.g., applying machine learning models.