1 Introduction

Looking at algorithmic problem classes such as Boolean satisfiability (SAT) (Xu et al., 2007, 2011), the traveling salesman problem (TSP) (Pihera & Musliu, 2014), or constraint satisfaction (CSP) (Lobjois et al., 1998), practical experience suggests that algorithms perform differently on different problem instances: while algorithm A might be better than B on a specific instance (e.g., a specific TSP), B may outperform A on another instance (e.g., another TSP). This is not very surprising and completely in line with theoretical results proving that there is “no free lunch”, i.e., excluding that one algorithm uniformly dominates all others (Wolpert et al., 1997). The following task thus appears to be meaningful from a practical point of view: Given a problem class and a pool of algorithms to choose from, find a rule that automatically assigns a (presumably) most suitable algorithm to each possible problem instance. This task is called (instance-specific) algorithm selection (AS) in the literature (Rice, 1976). Here, suitability may refer to different performance criteria, such as runtime (Tornede et al., 2020c) or a measure of solution quality (Wever et al., 2021).

The problem of algorithm selection has received considerable attention over the past decade, resulting in a large set of heterogeneous algorithm selection approaches. Many of these approaches rely on machine learning, which essentially means that a rule assigning algorithms to problem instances is learned from suitable training data, for example, the performance observed in the past when running specific algorithms on specific instances. Given a new instance, a machine learning algorithm leverages such data to predict the performance of the candidate algorithms, or to predict the presumably best algorithm directly. AS approaches of that kind achieve state-of-the-art performance and typically outperform the best stand-alone algorithm, also referred to as “single best solver” (SBS) in the following, by several orders of magnitude (Kerschke et al., 2019).

Interestingly, because an algorithm selector is again an algorithm (taking an instance as input and returning a presumably best algorithm as output), the very same task of algorithm selection can also be considered on a meta level, giving rise to the following question: Given a problem instance and a set of algorithm selectors, which one should be used to predict the best algorithm? This question could be answered by an algorithm selector on the meta level, that is, by an “algorithm selector selector”, which does not choose among the algorithms (or “base algorithms”, to distinguish them from the AS algorithms), but among the algorithm selectors, which in turn are responsible for selecting an algorithm. Indeed, a certain complementarity among AS approaches can be observed (e.g. Tornede et al., 2020c) and the resulting meta-AS problem was first mentioned by Lindauer et al. (2019) and Kerschke et al. (2019), though without pursuing it further.

Having the choice between a set of candidate algorithm selectors, limiting oneself to choosing only a single one of them (which in turn chooses the final algorithm) might actually seem unnecessarily restrictive. In fact, leveraging a composition of selectors, which then choose the final algorithm jointly, might be a better idea. This naturally leads to ensemble learning (Dietterich, 2000), which is a common approach in machine learning to combine several predictors into stronger compositions. Thus, instead of using a single algorithm selector to choose an algorithm, a set of selectors is asked to evaluate the available algorithms. Subsequently, these evaluations are aggregated into a joint decision. Somewhat surprisingly, building ensembles of algorithm selectors has hardly been considered in the AS literature so far (see Sect. 7), although ensemble learning is well known to improve predictive accuracy in standard machine learning problems such as classification and regression. One reason could be that querying multiple models obviously takes more time than querying only a single one, so that ensembling may appear counterintuitive in scenarios where runtime is considered as the target measure.

In this paper, we formalize the problem of meta algorithm selection and propose algorithmic solutions. Furthermore, we investigate their potential to make better decisions with respect to the selection of algorithms. In an extensive empirical study, we find that trying to learn the best algorithm selector, i.e., to predict which algorithm selector will pick the best algorithm for a given query, does not lead to better algorithm selection performance. On the other hand, ensembling algorithm selectors helps to improve efficacy, while the additional runtime consumed for querying multiple algorithm selectors remains negligible. Of course, the improved performance comes at the cost of building the ensemble algorithm selector, because multiple base algorithm selectors need to be fitted for one ensemble. However, this does not pose a problem in practice, because algorithm selectors are in general built in an offline phase prior to the actual selection process.

The remainder of the paper is structured as follows. First, we give a formal introduction to the algorithm selection problem in Sect. 2, followed by a definition of the meta AS problem in Sect. 3 and a first (still quite limited) solution to the problem in Sect. 4. As a more advanced solution, we present algorithm selection ensembles in Sect. 5. Subsequently, we present and discuss the results of our empirical evaluation in Sect. 6. Related work is discussed in Sect. 7, prior to concluding our paper in Sect. 8.

2 Algorithm selection

In the per-instance algorithm selection problem, first formalized by Rice (1976), we are faced with a space of instances \(\mathcal {I}\) of an algorithmic problem class (such as SAT, where every instance is a logical formula) and a finite set of algorithms \(\mathcal {A}\), which solve such instances. The goal is to find a map \(s: \mathcal {I} \longrightarrow \mathcal {A}\), called algorithm selector, which assigns algorithms to instances. An assignment \(a = s(i)\) is interpreted as a recommendation, suggesting that algorithm \(a \in \mathcal {A}\) will perform strongly, or perhaps even best among all algorithms, on problem instance \(i \in \mathcal {I}\). More formally, the goal is to optimize (expected) performance in terms of a measure \(m: \mathcal {I} \times \mathcal {A} \longrightarrow \mathbb {R}\), which is also part of the AS problem specification. Hence, the optimal algorithm selector for all instances \(i \in \mathcal {I}\), also known as the oracle or virtual best solver (VBS), is defined as

$$\begin{aligned} s^*(i) = \arg \min _{a \in \mathcal {A}} \mathbb {E}\left[ m(i,a) \right] \, , \end{aligned}$$
(1)

where the expectation accounts for the potential randomness imposed by the algorithm. We denote the algorithm that is best on average (in expectation) on a predefined set of instances as the single-best solver (SBS). It constitutes the default baseline in algorithm selection.

Observe that an exhaustive evaluation of all algorithms for computing the VBS is not deemed a solution, because m is usually costly to evaluate and often even requires running the respective algorithm. For example, if runtime is the measure of interest, a single evaluation already results in a solved instance, rendering all other evaluations unnecessary. Hence, instead of performing evaluations at query time, the algorithm selector should make use of gathered knowledge to come to a decision.

2.1 Algorithm selection methods

The majority of AS approaches leverage machine learning techniques to learn (in one way or another) a surrogate performance measure \(\widehat{m}: \mathcal {I} \times \mathcal {A} \longrightarrow \mathbb {R}\) that approximates m while being cheap to evaluate. With such a surrogate performance measure at hand, an exhaustive enumeration over the candidate algorithms, previously excluded for the reasons explained above, does become possible and yields the canonical algorithm selector

$$\begin{aligned} s(i) :=\arg \min _{a \in \mathcal {A}} \widehat{m}(i,a) \,. \end{aligned}$$
(2)

For the purpose of inferring such a surrogate, the setting is usually assumed to contain a set of training instances \(\mathcal {I}_D \subset \mathcal {I}\) on which some (but not necessarily all) of the algorithms in \(\mathcal {A}\) have been evaluated, so that performance evaluations \(m(i,a)\) are available. Note that the corresponding training performance matrix spanned by \(\mathcal {I}_D\) and \(\mathcal {A}\) is usually assumed to contain (sometimes many) missing values. Furthermore, instances are assumed to be representable by a set of d features generated by a feature map \(f: \mathcal {I} \longrightarrow \mathbb {R}^d\). In many cases, such features are available or can be defined in a quite natural way. In the case of SAT, for example, common features include the length of a formula, the number of clauses or variables, etc. In general, the computation of features does not come for free and requires time. This should be taken into account, especially when runtime is chosen as the performance measure to be optimized.

One of the most straightforward instantiations of the framework described above, in this paper denoted by PerAlgo, was proposed by Xu et al. (2007), where one performance surrogate \(\widehat{m}_a: \mathcal {I} \longrightarrow \mathbb {R}\) is learned for each algorithm \(a \in \mathcal {A}\) separately. The joint surrogate can then be defined as \(\widehat{m}(i,a) = \widehat{m}_a(i)\) for all instances \(i \in \mathcal {I}\).
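To make this concrete, the following minimal sketch (our own illustration, not the original implementation) fits one regression surrogate per algorithm and selects the algorithm minimizing the predicted performance, as in (2); the random-forest regressor and the toy data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data (illustrative only): 100 training instances with d = 5 features,
# and observed performance values m(i, a) of 3 candidate algorithms.
X_train = rng.random((100, 5))                  # feature vectors f(i), i in I_D
performances = rng.exponential(10.0, (100, 3))  # m(i, a)

# PerAlgo: fit one surrogate m_hat_a per algorithm a.
surrogates = []
for a in range(performances.shape[1]):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, performances[:, a])
    surrogates.append(model)

def select(features):
    """Canonical selector (2): return argmin_a m_hat_a(i)."""
    predictions = [m.predict(features.reshape(1, -1))[0] for m in surrogates]
    return int(np.argmin(predictions))

print(select(rng.random(5)))  # index of the selected algorithm
```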

Alternatively, the problem can be formalized as a multi-class classification problem, where each algorithm corresponds to a class, so that a multi-class classifier (here called Multiclass) of the form \(s: \mathcal {I} \longrightarrow \mathcal {A}\) can be learned directly. A well-known example from this category is SATzilla’11 (Xu et al., 2011), which employs an all-pairs decomposition approach, learning a cost-sensitive classifier for each pair of algorithms and determining the selected algorithm by majority voting. Building upon the idea of pairwise comparisons of algorithms, Hanselle et al. (2020) suggest learning selectors via a combined ranking and regression approach. Similarly, Kotthoff (2012) suggests employing a stacking approach, using regression models to predict the performance of each algorithm, which is used as an additional input for a meta-learner selecting the final algorithm.

Focusing on so-called censored information present in algorithm selection data, Tornede et al. (2020c) propose a decision-theoretic approach (R2S-PAR10 and R2S-EXP), leveraging techniques from survival analysis to effectively learn from such censored information. Similarly, Hanselle et al. (2021) consider the censored information present in the data within the framework of superset learning (Hüllermeier, 2014).

Furthermore, instance-based approaches, such as SUNNY (Amadini et al., 2014) or ISAC (Kadioglu et al., 2010), have proven to successfully perform algorithm selection by exploiting performances recorded on similar instances in the training data. To this end, they employ k-nearest neighbor or clustering techniques in order to estimate the performance of an algorithm on an unseen instance.

Finally, Tornede et al. (2019, 2020a) propose the setting of “extreme algorithm selection”, in which the pool of algorithms to choose from can be extremely large. They show that, by leveraging a feature representation not only for problem instances but also for algorithms, convincing selection performance can be achieved even in this setting.

2.2 Loss functions

One of the most natural and interesting performance measures to consider for satisfaction problems is the time until the instance is solved, i.e., the algorithm runtime. Unfortunately, combinatorial problems often feature skewed runtime distributions, such that some algorithms run extremely long on some instances (Gomes et al., 1997). As a consequence, algorithms are generally executed with an upper bound C on their runtime. If an algorithm does not terminate within this bound, called cutoff, the instance is considered unsolved and the algorithm is forcefully terminated; see Fig. 1 for an illustration. As choosing an algorithm running into a cutoff leads to an unsolved instance, such a choice should be avoided by all means. One of the most common loss functions in AS, called the penalized average runtime (PAR10), considers this by explicitly penalizing such timeouts. The PAR10 over a set of instances \(\mathcal {I}' \subset \mathcal {I}\), called scenario, is defined as follows, where \(m(i,s(i))\) corresponds to the runtime of the algorithm s(i) chosen by the algorithm selector s (and potentially the time required to compute the corresponding instance features) on instance i:

$$\begin{aligned} \begin{aligned} PAR10 (s, \mathcal {I}')&= \frac{1}{\vert \mathcal {I}' \vert } \sum \limits _{i \in \mathcal {I}'} PR10(s,i) \\ PR10 (s, i)&= {\left\{ \begin{array}{ll} m(i,s(i)) &{} \text {if } m(i,s(i)) \le C \\ 10 \cdot C &{} \text {else} \end{array}\right. } \end{aligned} \end{aligned}$$
(3)

Naturally, PAR10 scores can vary drastically across scenarios, making them incomparable. To alleviate this situation, one often falls back to the normalized PAR10 score of an algorithm selector s, defined as

$$\begin{aligned} nPAR10 (s, \mathcal {I}') = \frac{ PAR10 (s, \mathcal {I}') - PAR10 ( oracle , \mathcal {I}')}{ PAR10 ( SBS ,\mathcal {I}') - PAR10 ( oracle ,\mathcal {I}')} \,. \end{aligned}$$
(4)

An nPAR10 score of 0 corresponds to the oracle performance, a score of 1 corresponds to a performance on a par with the SBS, whereas scores above 1 indicate a deterioration in comparison to the SBS. Therefore, lower nPAR10 scores indicate better performance, and a successful algorithm selector should definitely have a score of less than 1.
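As an illustration of (3) and (4), the following sketch computes PAR10 and nPAR10 from vectors of observed runtimes; the variable names and toy numbers are our own.

```python
import numpy as np

def par10(runtimes, cutoff):
    """PAR10 over a scenario: runs exceeding the cutoff C count as 10 * C."""
    runtimes = np.asarray(runtimes, dtype=float)
    penalized = np.where(runtimes <= cutoff, runtimes, 10.0 * cutoff)
    return penalized.mean()

def npar10(selector_runtimes, oracle_runtimes, sbs_runtimes, cutoff):
    """Normalized PAR10: 0 = oracle, 1 = SBS, values above 1 = worse than the SBS."""
    s, o, b = (par10(r, cutoff) for r in
               (selector_runtimes, oracle_runtimes, sbs_runtimes))
    return (s - o) / (b - o)

# Tiny example with cutoff C = 100 (illustrative numbers only).
print(npar10([12, 101, 30], [10, 20, 30], [50, 101, 70], cutoff=100))
```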

Fig. 1: This figure depicts the process of running multiple algorithms on an instance (e.g. for training data generation). If an algorithm requires longer than C to solve an instance, it is forcefully terminated and a selection of the corresponding algorithm will be punished

3 Meta algorithm selection

Similar to the algorithms actually solving the problem instances, the algorithm selectors also exhibit the phenomenon of performance complementarity, as mentioned earlier. This gives rise to the question of whether choosing between different algorithm selectors might be beneficial. In fact, by moving to the meta level, i.e., from the level of choosing among algorithms to the level of choosing among algorithm selectors, we gain more freedom: we may even select multiple selectors instead of only a single one, as long as the selections made by these selectors are aggregated such that a single algorithm is returned in the end. Thus, the problem of per-instance meta algorithm selection (meta AS) is to select, for a given instance of an algorithmic problem class, one or multiple algorithm selectors together with an aggregation. Each of the selected algorithm selectors then in turn selects an algorithm for solving the problem, and these selections are aggregated such that only a single algorithm is returned. Hence, instead of directly choosing an algorithm to solve a problem instance, we take a detour by selecting one or multiple algorithm selectors and aggregating their decisions.

Formally, in the meta AS problem, we are given a set of algorithm selectors \(\mathcal {S}\subseteq \{ s \vert s: \mathcal {I} \longrightarrow \mathcal {A}\}\), which is a subset of all possible selection functions, in addition to the instance space \(\mathcal {I}\), the set of algorithms \(\mathcal {A}\) and the performance measure m known from the AS problem. We then seek to find a mapping

$$\begin{aligned} ass : \mathcal {I} \longrightarrow 2^{\mathcal {S}} \,, \end{aligned}$$
(5)

called algorithm selector selector (ASS), and an aggregation function

$$\begin{aligned} agg : \mathcal {I} \times 2^{\mathcal {S}} \longrightarrow \mathcal {A}\, , \end{aligned}$$
(6)

such that the algorithm resulting from the aggregation optimizes the original performance measure m. Accordingly, we seek to find the best pair \(( agg , ass )\) of aggregation function \(agg\) and algorithm selector selector \(ass\), such that for all instances \(i \in \mathcal {I}\) the best algorithm is returned, i.e.,

$$\begin{aligned} agg (i, ass (i)) \in \arg \min _{a \in \mathcal {A}} \mathbb {E}\left[ m(i,a)\right] \,\,\,. \end{aligned}$$
(7)

Observe that we in principle allow the concrete aggregation to depend on the instance, which makes it possible to learn instance-specific aggregation functions.

Figure 2 illustrates the relation between algorithms, algorithm selectors and algorithm selector selectors. In the following, we present several instantiations of this framework.

Fig. 2: Illustration of the connection between algorithms (\(\mathcal {A}\)), algorithm selectors (\(\mathcal {S}\)) and algorithm selector selectors. Algorithms solve instances of an algorithmic problem, whereas algorithm selectors are mappings from an instance to a single algorithm from \(\mathcal {A}\). Algorithm selector selectors select one or multiple algorithm selectors, which in turn each select an algorithm. These selections are then aggregated using an aggregation function (not displayed here)

4 Selecting single algorithm selectors through meta learning

The arguably simplest solution to the meta AS problem is achieved through meta learning (Vanschoren, 2018; Brazdil et al., 2008; Vilalta et al., 2009), namely to learn which algorithm selector takes the best decision for a given instance. More formally, one could seek to learn a map

$$\begin{aligned} s_{meta}: \mathcal {I} \longrightarrow \mathcal {S}\, , \end{aligned}$$
(8)

such that the chosen selector returns the most suitable algorithm for a given instance i, i.e.,

$$\begin{aligned} \left( s_{meta}(i)\right) (i) \in \arg \min _{a \in \mathcal {A}} \mathbb {E}\left[ m(i,a)\right] \,\,\,. \end{aligned}$$
(9)

In this case, the co-domain of the function ass in (5) is effectively restricted to singleton sets \(ass(i) = \{ s \}\) consisting of only a single algorithm selector \(s \in \mathcal {S}\); we shall discuss the consequences of this self-imposed restriction in Sect. 4.1. Moreover, the aggregation agg in (6) is the identity, or, stated differently, there is actually no need for learning an aggregation function. Lastly, the instance features computed by the feature map f for the standard AS problem are also used on the meta level and thus constitute what is known as meta features in the context of meta learning. Likewise, as (8) indicates, the set of selectors \(\mathcal {S}\) corresponds to the set of meta targets.

Observe that this approach is essentially a special case of the standard AS problem itself, with a very specific set of algorithms to choose from, namely algorithm selectors. Hence, standard AS methods (see Sect. 2.1) can in principle be applied. It is important to note that algorithm selection approaches not relying on a feature representation of instances do not necessarily have an advantage in terms of runtime anymore, because they may select an algorithm selector which in turn requires the feature representation. If the feature computation has to be performed either on the meta or on the base level, its time has to be taken into account as well. However, there is no need to perform the computation twice, if both the algorithm selector selector and the algorithm selector require it, because the resulting features can be shared.
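One way to set this up in practice is sketched below with our own (hypothetical) naming: the meta-level training signal pairs each training instance with the performance of the algorithm each selector would have picked, after which any standard AS method from Sect. 2.1 can be trained on it.

```python
import numpy as np

def meta_performance_matrix(performances, selector_choices):
    """Build the meta-level training signal M[i, s] = m(i, s(i)).

    performances[i, a] holds m(i, a); selector_choices[i, s] holds the index of
    the algorithm chosen by selector s on instance i (e.g. obtained via
    cross-validation on the training data)."""
    n_instances, _ = selector_choices.shape
    rows = np.arange(n_instances)[:, None]
    return performances[rows, selector_choices]

perf = np.array([[3.0, 9.0], [8.0, 2.0], [5.0, 4.0]])  # 3 instances, 2 algorithms
choices = np.array([[0, 1], [1, 1], [0, 1]])           # picks of 2 selectors
print(meta_performance_matrix(perf, choices))          # -> [[3. 9.] [2. 2.] [5. 4.]]
```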

4.1 Limits of learned algorithm selector selection

Limiting ourselves to choosing only a single algorithm selector for a given instance instead of leveraging multiple ones obviously has consequences in terms of achievable algorithm selection performance. To elaborate on these consequences, let us define an algorithm selector oracle (AS-oracle) as

$$\begin{aligned} ass ^*(i) \in \arg \min _{s \in \mathcal {S}} \mathbb {E}\left[ m(i,s(i))\right] \,\,\, . \end{aligned}$$
(10)

It is important to note that the AS-oracle is in general not identical to the oracle on the base level, as the set of algorithms to choose from may change. For a better understanding, consider an example with two algorithms \(a_1\) and \(a_2\) and two algorithm selectors \(s_1\) and \(s_2\), where both always select algorithm \(a_1\). Furthermore, assume there exists an instance for which \(a_2\) performs better than \(a_1\), and hence the oracle would select \(a_2\). However, the AS-oracle can only select \(s_1\) or \(s_2\), which in turn both select \(a_1\), resulting in a decrease in oracle performance.

Generally speaking, in order to preserve the original oracle, it is necessary that, for each instance, at least one algorithm selector exists that selects the best algorithm for that instance. Otherwise, the AS-oracle performance may degrade compared to the oracle. In practice, this requirement is violated for at least one instance most of the time, and hence an important question is how much the oracle performance degrades. As we show in our experimental evaluation, the degradation strongly depends on the scenario at hand, and ranges from less than \(1\%\) to over \(116\%\).

Similar to the oracle, the counterpart of the SBS changes on the meta level as well: the single best algorithm selector (SBAS), i.e., the algorithm selector that is best on average, is itself an algorithm selector rather than a fixed algorithm, which makes it a much stronger baseline than the single best solver. Hence, while the SBS selects the actual problem-solving algorithm that is best on average and accordingly does not depend on instance features, the SBAS does in fact depend on such features as long as it is not identical to the SBS. Observe that this results in a significant disadvantage for the SBAS in terms of achievable PAR10 scores, due to the time required to compute these instance features.

Obviously, these implications also influence the performance gains that can be achieved by algorithm selector selectors of the form (8) in comparison to algorithm selectors. As the oracle performance most likely degrades, while the SBS performance most likely improves, the gap between the two also decreases, offering less potential for algorithm selection approaches to close this gap.

5 Constructing ensembles of algorithm selectors

As mentioned earlier, the restriction to choose only a single algorithm selector seems like an unnecessary constraint and may even lead to a potential loss in achievable algorithm selection performance. Accordingly, one may think about using a composition of algorithm selectors, which can play to their strengths on some instances while compensating for each other’s weaknesses on other instances. This idea motivates us to construct a mapping of the form (5) through ensemble learning.

Ensemble learning (Dietterich, 2000) presumably constitutes the most natural technique to combine several machine learning approaches into a joint one, with the goal to improve in performance. In algorithm selection, an ensemble can be thought of as a set of algorithm selectors \(\mathcal {S}\), called base algorithm selectors, which are either trained independently or dependently on each other. At prediction time, each selector is queried for the given instance i, and the algorithm choices are aggregated into a final choice using an aggregation function as defined in (6). The concrete strategy used to make the selectors cooperate depends on the ensemble technique being used. Figure 3 depicts the general process of predicting / selecting an algorithm for a given instance through a trained ensemble of algorithm selectors.

Fig. 3: This figure depicts the general process of predicting / selecting an algorithm for a given instance through a trained ensemble of algorithm selectors \(s_1, s_2, s_3\)

As mentioned earlier, allowing for the selection of multiple algorithm selectors also requires the definition of an aggregation function in order to finally return a single algorithm. In principle, the aggregation functions can either depend on the instance, i.e., are instance-specific, or can be fixed across instances. Similarly, they can either be learned or be predefined.

In general, to be successful, ensembles require a certain degree of heterogeneity of the predictions. Therefore, the different algorithm selectors should not always coincide in their selections. Otherwise, it can easily happen that the majority of predictions made by the base selectors are identical. Hence, in such a situation, the prevalent selector (maybe with slight but negligible variations) dominates the predictions of the entire ensemble, only yielding a computationally more expensive variant of the respective dominating selector. To avoid this problem, most ensemble methods strive for a heterogeneous set of base selectors. This can be achieved through a suitable choice of base selectors given to the method, like for example in voting. Alternatively, in the case of methods such as bagging, which only work with a single base selector, different variants of the same selector can be trained on different data sets.

Admittedly, training and querying more than one selector might seem counterintuitive in settings where runtime is optimized, as it automatically results in longer runtimes. In this regard, it is important to note that the majority of the runtime is required for training the selectors in the ensemble. In AS, we can assume this training to be performed offline, i.e., prior to the actual selection of algorithms. Hence, longer training times do not constitute a real disadvantage, as long as prediction (querying the ensemble members) remains fast. This is the case here: most selectors are extremely fast to query, so even compositions of them, while slower than a single selector, remain fast.

In the following, we first elaborate on different aggregation strategies. Although some of these aggregation functions include learnable components, they are fixed across instances, i.e., the aggregation of predictions does not depend on the query instance. Then, we present several ensemble techniques for creating a pool of algorithm selectors, in particular voting (Dietterich, 2000), bagging (Breiman, 1996), and boosting (Schapire, 1990). We continue with a discussion of stacking (Wolpert, 1992), which can be seen as a learned, instance-specific aggregation method. As such, it is somehow positioned in-between ensemble and meta learning. Finally, we close this section with a methodological comparison of the presented approaches.

5.1 Aggregation strategies

One of the most natural forms of aggregation in our context is (weighted) majority aggregation. As the name suggests, it aggregates the algorithm choices by selecting the algorithm that was selected most frequently, potentially weighting the choices of the selectors differently. This is motivated by the idea that selectors with a strong performance should potentially be trusted more than weaker ones. More formally, weighted majority aggregation can be defined as

$$\begin{aligned} agg_{(w)maj}(i, \mathcal {S}) = \arg \max _{a \in \mathcal {A}} \sum \limits _{s \in \mathcal {S}} w_s \cdot \llbracket s(i) = a \rrbracket \, , \end{aligned}$$
(11)

where \(w_s \in \mathbb {R}^+\) denotes the weight associated with selector s. With \(w_s = 1\) for all \(s \in \mathcal {S}\), we recover standard majority voting. To obtain proper weights, a plethora of methods are applicable in principle. However, we simply consider the nPAR10 score of the different base algorithm selectors on the training data in order to determine corresponding weights — conducting a cross-validation on the training data for the same purpose turned out to result in similar performance while being computationally more expensive.
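A minimal sketch of (weighted) majority aggregation as defined in (11); the selector names and the mapping from training nPAR10 scores to weights are illustrative assumptions.

```python
from collections import defaultdict

def weighted_majority(selections, weights):
    """agg_(w)maj: selections maps selector name -> chosen algorithm,
    weights maps selector name -> w_s (use 1.0 everywhere for plain majority voting)."""
    votes = defaultdict(float)
    for selector, algorithm in selections.items():
        votes[algorithm] += weights.get(selector, 1.0)
    return max(votes, key=votes.get)

# Example: three selectors, two of which agree on algorithm "a2".
selections = {"SUNNY": "a1", "ISAC": "a2", "PerAlgo": "a2"}
weights = {"SUNNY": 0.8, "ISAC": 0.5, "PerAlgo": 0.6}  # e.g. derived from training nPAR10
print(weighted_majority(selections, weights))          # -> "a2"
```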

Up to now, we assumed that an algorithm selector only returns a single algorithm. While this is typically true in practice, the majority of approaches internally feature more nuanced predictions, often constituting some kind of loss (or score) for each algorithm in \(\mathcal {A}\). Accordingly, instead of using only a concrete algorithm choice as the output of the algorithm selectors, we adapted them to return such nuanced predictions where possible.

More formally, let us assume that each trained algorithm selector \(s \in \mathcal {S}\) cannot only be evaluated on \(i \in \mathcal {I}\), but that it also allows access to \(\widehat{m}_s(i,a)\), i.e., to the corresponding internal score of each algorithm \(a \in \mathcal {A}\). For those approaches where such a score cannot be extracted explicitly, e.g., multi-class algorithm selectors, we define dummy losses as

$$\begin{aligned} \widehat{m}_s(i,a) = {\left\{ \begin{array}{ll} 0 &{} \text {if } s(i) = a \\ 1 &{} \text {else} \end{array}\right. } \end{aligned}$$
(12)

for all instances \(i \in \mathcal {I}\) and algorithms \(a \in \mathcal {A}\), such that all approaches can be assumed to work as defined in (2).

With this consideration, aggregations on this more nuanced level of scores instead of the level of final choices can be made. The most straight-forward aggregation function on this level is the arithmetic mean, i.e.,

$$\begin{aligned} agg_{avg}(i, \mathcal {S}) = \arg \min _{a \in \mathcal {A}} \frac{1}{\vert \mathcal {S}\vert }\sum \limits _{s \in \mathcal {S}} \widehat{m}_s(i,a) \,. \end{aligned}$$
(13)

While conceptually simple, it requires the performance surrogates of the different selectors to approximate the same function. Otherwise, the predictions are incomparable, and averaging is not a meaningful operation. For example, combining the output of a ranking loss function optimized by one selector with the estimated average PAR10 scores of another does not make any sense. In principle, the arithmetic mean can also be turned into a weighted version as done in (11).

In order to be able to aggregate on this more nuanced level while overcoming the weakness of the arithmetic mean, we propose to aggregate rankings (rank aggregation) of algorithms constructed from the algorithm scores obtained from the selectors. More precisely, we can assume that each selector s returns a ranking over the algorithms in \(\mathcal {A}\) by sorting them in increasing order w.r.t. \(\widehat{m}_s(i, \cdot )\), such that the presumably best algorithm is put on the first position in the ranking, the second-best on the second position, etc. Having obtained such a ranking over the algorithms for each selector, they need to be aggregated in order to draw a conclusion and eventually return a single algorithm as the final choice.

A very simple method for rank aggregation is called Borda count (Borda, 1784). Given a ranking of n items, it assigns n points to the top item, \(n-1\) points to the second-best, and so forth. This is done for each ranking to be aggregated, and the consensus ranking is obtained by sorting the items (algorithms in our case) in descending order according to their total sum of points. As pointed out by Dwork et al. (2001), the Borda count has a number of less appealing properties, at least from a theoretical point of view. On the other hand, its linear time complexity makes it fast to compute. This is in sharp contrast to other rank aggregation techniques that involve intractable optimization problems (Dwork et al., 2001). Besides, Borda comes with provable approximation guarantees for several other aggregation techniques (Coppersmith et al., 2006). Overall, it seems to be a good compromise for the case of algorithm selection, where predictions are performed under tight time constraints.

Formally, we can use Borda count as an aggregation function for our setting as follows, where \(rank : \mathcal {I} \times \mathcal {S}\times \mathcal {A}\rightarrow \mathbb {N}\) returns the rank of an algorithm a in the ranking returned by a selector s on an instance i:

$$\begin{aligned} agg_{borda}(i, \mathcal {S}) = \arg \min _{a \in \mathcal {A}} \sum \limits _{s \in \mathcal {S}} rank (i, s, a) \end{aligned}$$
(14)

Ties are handled by assigning to all tied algorithms the average of the block of ranks they occupy (Saari, 2000). In practice, ties can only be caused through the dummy scores introduced in (12). Therefore, they always occur at the end of the rankings. Theoretically, identical scores of \(\widehat{m}(i, \cdot )\) could also result in ties, but this never happened in practice.
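The following sketch implements the Borda aggregation (14), ranking the algorithms by each selector's scores \(\widehat{m}_s(i,\cdot)\) and averaging the ranks of tied algorithms; it assumes scipy is available, and the score values are made up for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def borda_aggregate(score_matrix):
    """score_matrix[s, a] = m_hat_s(i, a) for a single instance i (lower is better).
    Returns the index of the algorithm with the smallest summed rank."""
    # rankdata assigns rank 1 to the smallest score and averages ranks of ties.
    ranks = np.vstack([rankdata(row, method="average") for row in score_matrix])
    return int(np.argmin(ranks.sum(axis=0)))

# Example: two selectors with real-valued scores and one multi-class selector
# whose dummy losses (12) are 0 for its chosen algorithm and 1 otherwise.
scores = np.array([
    [3.2, 1.1, 7.5],   # selector 1
    [0.9, 2.4, 2.0],   # selector 2
    [1.0, 1.0, 0.0],   # multi-class selector choosing the third algorithm
])
print(borda_aggregate(scores))  # -> 0 (the first algorithm wins overall)
```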

While the aggregation techniques outlined above appear to be meaningful in the context of the algorithm selection task, we would like to point out that other aggregation techniques are of course conceivable and could be used instead.

5.2 Voting

Voting ensembles are presumably the easiest form of ensemble learning: Each algorithm selector in a set \(\mathcal {S}' \subseteq \mathcal {S}\) is trained independently of the others on the same training data \(\mathcal {I}_D\). At prediction time, all algorithm selectors in \(\mathcal {S}'\) are queried, and the predictions are aggregated using one of the previously described aggregation strategies. Figure 4 depicts the training process of a voting ensemble.

Fig. 4: This figure depicts the training process of a voting ensemble, where each base algorithm selector is trained with the same training instances. Ensemble heterogeneity is achieved by choosing a heterogeneous set of algorithm selectors in advance

As we demonstrate empirically, it is important to optimize the ensemble composition, i.e., the set of base algorithm selectors \(\mathcal {S}' \subseteq \mathcal {S}\) specifying the ensemble, because the performance of a voting ensemble solely depends on this configurable parameter. Intuitively, a complete evaluation of each possible composition might seem intractable due to the exponential (in \(| \mathcal {S}|\)) number of compositions. In practice, however, it can be a viable option under certain circumstances. To this end, we hold back a portion of the training data \(\mathcal {I}_D\) as validation data \(\mathcal {I}'_D \subset \mathcal {I}_D\). All base algorithm selectors are then trained once on the reduced training data \(\mathcal {I}_D \setminus \mathcal {I}'_D\), so that, in order to estimate the performance of an ensemble composition, only the predictions of the involved selectors on the validation data \(\mathcal {I}'_D\) need to be obtained and aggregated. As the training of the selectors has to be performed only once at the beginning, and the computation of both the predictions and the aggregation takes a negligible amount of time, the evaluation of all possible compositions is feasible as long as the set of algorithm selectors remains of moderate size. For example, computing the training performance of each possible voting ensemble composed of up to 7 algorithm selectors required less than 5 minutes for all scenarios presented in Sect. 6. We stress, however, that this approach still scales exponentially in the number of algorithm selectors, even if the individual predictions can be obtained quickly. Thus, if the number of algorithm selectors becomes larger, more sophisticated ensemble pruning methods, such as those of Rokach (2009), Lazarevic and Obradovic (2001), and Hernández-Lobato et al. (2009), can be used to find good compositions.
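A sketch of this exhaustive composition search under our own simplifications: the selections of each trained base selector on the validation instances are cached once, and every non-empty subset is then scored by aggregating the cached selections with a plain majority vote. The selector names, validation runtimes and tie-breaking are illustrative assumptions.

```python
from itertools import combinations

# Cached selections of each trained base selector on three validation instances
# (illustrative data: selector name -> chosen algorithm per instance).
cached = {
    "SUNNY":   ["a1", "a2", "a1"],
    "ISAC":    ["a2", "a2", "a2"],
    "PerAlgo": ["a1", "a1", "a2"],
}
# Hypothetical validation runtimes m(i, a) per instance.
runtime = [{"a1": 5.0, "a2": 9.0, "a3": 30.0},
           {"a1": 12.0, "a2": 3.0, "a3": 3.5},
           {"a1": 7.0, "a2": 2.0, "a3": 20.0}]

def majority(choices):
    # Plain majority vote; ties are broken arbitrarily in this toy example.
    return max(set(choices), key=choices.count)

def score(composition):
    """Average validation runtime of the majority-vote ensemble."""
    total = 0.0
    for idx, costs in enumerate(runtime):
        chosen = majority([cached[s][idx] for s in composition])
        total += costs[chosen]
    return total / len(runtime)

best = min((c for r in range(1, len(cached) + 1)
            for c in combinations(cached, r)), key=score)
print(best, score(best))
```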

5.3 Bagging

In contrast to voting, bagging (Breiman, 1996) only leverages a single kind of algorithm (selector). Therefore, heterogeneity between the ensemble members has to be achieved through data manipulation techniques. To this end, bagging leverages a data resampling technique from statistics called bootstrapping, which works as follows. Given a set of training instances \(\mathcal {I}_D\) of size \(N = \vert \mathcal {I}_D \vert\), it creates a new training instance set by sampling N times from \(\mathcal {I}_D\) with replacement. The actual ensemble is constructed by sampling k such new training instance sets \(\mathcal {I}_D^{(1)}, \ldots , \mathcal {I}_D^{(k)}\) and training one instantiation of the provided algorithm selector on each of the k training sets. Thus, the ensemble eventually consists of k algorithm selector instances. At prediction time, one of the previously discussed aggregation functions can be used to aggregate the predictions (selections) of the different selectors. Figure 5 depicts the training process of a bagging ensemble.

Fig. 5: This figure depicts the training process of a bagging ensemble consisting of several instantiations of the same base algorithm selector trained on bootstrapped versions of the original training data

We would like to point out that we bootstrap on the level of the problem instances and not on the level of the actual training data points ((instance/algorithm)-pairs or (instance/algorithm performance)-pairs). This is done in order to allow the selection algorithms themselves to construct their training data points. In principle, this may lead to training data sets of different sizes for the corresponding base algorithm selectors if the number of training performance values \(m(i,\cdot )\) varies across instances. However, we assume that either \(m(i,a)\) is available or we know at least that \(m(i,a) > C\) for all \(i \in \mathcal {I}_D, a \in \mathcal {A}\), and hence can reasonably impute these missing values, thereby solving the problem of differently sized training data sets.
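A small sketch of instance-level bootstrapping, assuming the training data is indexed by instance: we resample whole instances (with replacement) and train one copy of the base selector per sample; `BaseSelector`, `features` and `performances` are hypothetical placeholders.

```python
import numpy as np

def bootstrap_instances(instance_ids, k, seed=0):
    """Draw k bootstrap samples of the training instances (with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(instance_ids)
    return [rng.choice(instance_ids, size=n, replace=True) for _ in range(k)]

instance_ids = np.arange(200)  # indices of the training instances in I_D
samples = bootstrap_instances(instance_ids, k=10)

# Hypothetical training loop; each ensemble member sees a resampled instance set:
# ensemble = [BaseSelector().fit(features[s], performances[s]) for s in samples]
print(len(samples), len(samples[0]))  # -> 10 200
```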

5.4 Boosting

While both voting and bagging fit their ensemble members independently of each other (except for (partially) identical training data), boosting trains its members successively, each time re-weighting the training instances (Schapire, 1990). After each iteration, i.e., each trained selector, the error of the previous selectors is determined, and more weight is put onto those instances for which a wrong algorithm selection has been made, while the weight of correctly judged instances is reduced. Similar to bagging, boosting uses only a single selector as a basis, of which it trains several instantiations on differently weighted versions of the same training instance set in order to achieve diversity among its ensemble members. At prediction time, the predictions of each of the trained selectors are obtained and combined into a joint prediction using a weighted aggregation, with the weights determined as part of the boosting algorithm during the training phase. Figure 6 illustrates the training process of a boosting ensemble.

Fig. 6: This figure depicts the training process of a boosting ensemble. Similar to bagging, the ensemble consists of several instances of the same base algorithm selector. These are subsequently trained on differently weighted versions of the training data

In boosting algorithms for multi-class classification, such as SAMME (Hastie et al., 2009), and regression problems, such as AdaBoost.R2 (Drucker, 1997), one would naturally consider multi-class classification errors and regression losses, respectively, for re-weighting training instances. However, due to the inferior performance of AdaBoost.R2 in preliminary experiments, we focus on SAMME for the remainder of this paper.
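For concreteness, the sketch below shows one SAMME-style weight update, treating an instance as misjudged if the selector trained in the current round picks a suboptimal algorithm, together with the weighted resampling we use for selectors without native support for instance weights (cf. Sect. 6.5); the toy data and function names are our own.

```python
import numpy as np

def samme_update(weights, is_wrong, n_classes):
    """One SAMME weight update (Hastie et al., 2009)."""
    err = np.clip(np.average(is_wrong, weights=weights), 1e-10, 1 - 1e-10)
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)
    new_weights = weights * np.exp(alpha * is_wrong)
    return new_weights / new_weights.sum(), alpha

# Toy example: 5 training instances, 3 algorithms (classes);
# in this round the selector picked a suboptimal algorithm for instances 1 and 3.
weights = np.full(5, 1 / 5)
is_wrong = np.array([False, True, False, True, False])
weights, alpha = samme_update(weights, is_wrong, n_classes=3)

# Instance weighting via resampling: misjudged instances are now sampled more often.
resampled_ids = np.random.default_rng(0).choice(5, size=5, replace=True, p=weights)
print(weights, alpha, resampled_ids)
```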

5.5 Stacking

In the previous ensemble techniques, the aggregation strategy is always fixed from the beginning and independent of the actual instance at hand. The idea of stacking is to learn the aggregation, i.e., how to best aggregate the predictions of the base algorithm selectors for a given instance. Therefore, a meta-learner

$$\begin{aligned} \mathbf {h}_{ agg }: \mathcal {I} \times \mathbb {R}^{\vert \mathcal {S}\vert \times \vert \mathcal {A}\vert } \rightarrow \mathcal {A}\end{aligned}$$
(15)

is fitted and used to aggregate the predicted performances \(\widehat{m}_s(i,a)\) of each algorithm selector \(s \in \mathcal {S}\) for a given instance \(i \in \mathcal {I}\) and each algorithm \(a \in \mathcal {A}\) into a joint decision. To avoid any bias in the training data for the meta-learner, it needs to be ensured that this data is disjoint from the training data of the base algorithm selectors. Therefore, the set of training instances \(\mathcal {I}_D\) is normally split into a set of base algorithm selector training instances \(\mathcal {I}'_D \subset \mathcal {I}_D\) and a set of meta-learner training instances \(\mathcal {I}''_D \subset \mathcal {I}_D\) such that \(\mathcal {I}'_D \cap \mathcal {I}''_D = \emptyset\). Since all base algorithm selectors are used, each of them can, as a first step, be trained independently on the same subset of training instances \(\mathcal {I}'_D\), so that the training data for the meta-learner can be built. Then, the meta-learner is trained on the features \(f(i) \in \mathbb {R}^d\) of each training instance \(i \in \mathcal {I}''_D\), extended by the predictions \(\widehat{m}_s(i,\cdot )\) of all base algorithm selectors \(s \in \mathcal {S}\) on these instances. At prediction time, each base algorithm selector \(s \in \mathcal {S}\) is queried, its predictions \(\widehat{m}_s(i,\cdot )\) are concatenated and attached to the instance features \(f(i) \in \mathbb {R}^d\) of instance i, based on which the meta-learner predicts which algorithm to choose. As the meta-learner is an algorithm selector itself, any of the base algorithm selectors can be used for this role. Figure 7 depicts the general idea of a stacking ensemble.
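A minimal sketch of how the meta-learner's input in (15) can be assembled: the instance features f(i) are concatenated with the flattened \(\vert \mathcal {S}\vert \times \vert \mathcal {A}\vert\) matrix of base selector scores. The dimensions and the random scores are illustrative assumptions.

```python
import numpy as np

def stacking_features(instance_features, base_predictions):
    """Concatenate f(i) with the |S| x |A| score matrix, flattened row by row."""
    return np.concatenate([instance_features, base_predictions.ravel()])

d, n_selectors, n_algorithms = 5, 3, 4
f_i = np.random.default_rng(0).random(d)                               # f(i)
scores = np.random.default_rng(1).random((n_selectors, n_algorithms))  # m_hat_s(i, a)

x_meta = stacking_features(f_i, scores)
print(x_meta.shape)  # -> (17,) = (d + |S| * |A|,)
```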

Fig. 7: This figure depicts the general idea behind a stacking ensemble. Each ensemble member is trained with the same subset of training instances, and the remaining instances are augmented with the corresponding predictions of the trained selectors. Then, a meta-learner \(h_{agg}\), i.e. an additional algorithm selector, is trained on this augmented data and decides on the algorithm to select

Since stacking works on an (extended) feature representation, standard feature selection techniques can be used to reduce the number of features and help the meta-learner achieve better prediction performance. Thus, the ensemble composition does not require any optimization upfront. For an overview of feature selection methods, we refer to Guyon and Elisseeff (2003).

5.6 Comparison of the approaches

To put the approaches presented so far into the broader context of meta AS, we close this section by revisiting them w.r.t. their most important properties. Figure 8 provides an overview and illustrates how the approaches relate to each other. It clarifies what kind of mapping these approaches model, how this mapping is constructed, and how the required aggregation function is constructed.

Fig. 8: Illustration of the different approaches w.r.t. the kind of mapping they model, how this mapping is constructed, and how the required aggregation is obtained

As an important observation, note that some approaches involve learning on the meta level while others do not. The former most obviously holds for learning an algorithm selector selector (cf. Sect. 4), where the modeled mapping is learned directly. On the other hand, most ensemble approaches (cf. Sect. 5) do not require any learning on the meta level, because their mapping is essentially predefined. Stacking lies somewhere in-between these two groups: the mapping itself is predefined, but the aggregation function is learned on the meta level.

6 Experimental evaluation

In this section, we provide an empirical evaluation of the ideas presented in the preceding sections. It is organized into four main parts. First, we introduce our experiment setup. Second, we investigate the chance for performance improvements when learning algorithm selector selectors and evaluate the performance of standard algorithm selectors working as algorithm selector selectors. Third, we evaluate the performance of the different ensemble methods presented earlier and discuss the results. We end this section by drawing a broader conclusion from these results.

6.1 Experiment setup

All evaluations are run on a subset of the scenarios from the ASlib v4.0 benchmark suite (Bischl et al., 2016) with a 10-fold cross-validation, where the folds are provided by the benchmark. ASlib is a curated collection of algorithm selection problems spanning a variety of different problems, such as the Boolean satisfiability problem (SAT), the quantified Boolean formula problem (QBF) and others. Most of these problems focus on runtime as the measure of interest, as such problems often exhibit the performance complementarity motivating the AS problem. Table 1 shows the scenarios used with their corresponding characteristics. Details regarding each scenario can be found in the corresponding README file of the ASlib GitHub repository and in Bischl et al. (2016).

Table 1 Overview of examined ASlib scenarios including their number of instances (#I), unsolved instances (#U), algorithms (#A), provided features (#F), and the cutoffs (C)

The performance of the approaches is measured in terms of the normalized penalized average runtime (nPAR10) metric as defined in (4) if not mentioned otherwise. Recall that a value of 0 indicates oracle performance, values below 1 an improvement over the SBS, and values above 1 a degradation compared to the SBS. To allow for a better visual interpretation, we sometimes illustrate results aggregated over all scenarios. Needless to say, such aggregations have to be treated with care, because (differences between) performance degrees are not easily comparable across scenarios.

The set of algorithm selectors used for the evaluation consists of \(\mathcal {S}= \{\)PerAlgo, SATzilla’11, R2S-Exp, R2S-PAR10, SUNNY, ISAC, Multiclass\(\}\), all of which have been described in Sect. 2. These are used both as meta-learners and as base algorithm selectors for the ensembles. Furthermore, we compare all ensemble variants against the single best algorithm selector (SBAS), i.e., the algorithm selector which performs best across all scenarios in terms of average or median nPAR10 performance. Lastly, we note that in general we leave out instances from the test sets for which all algorithms run into the cutoff, since no sensible selection is possible for those. However, we do include these instances for the meta learning experiments in Sect. 6.2, as the set of instances in the test sets would otherwise vary between the base level and the meta level, yielding incomparable results. This is because we would potentially need to leave out an instance on the meta level (if none of the algorithm selectors chose an algorithm solving it before the cutoff) which we might have included on the base level (since there exists an algorithm solving it before the cutoff time). This problem is closely related to the degradation in oracle performance discussed previously.

All experiments were run on machines featuring Intel Xeon E5-2695v4@2.1GHz CPUs with 16 cores and 64GB RAM. In the interest of reproducibility of our results, all code, including detailed documentation of the experiments and execution instructions, is available on GitHub.

6.2 Meta learning for selecting an algorithm selector

Figure 9 shows the PAR10 scores of the oracle, AS-oracle, SBS and SBAS on a subset of the ASlib v4.0 benchmark scenarios. As one can see, several of the implications noted in Sect. 4.1 can be validated empirically. First and most importantly, although the SBS/oracle gaps are a lot larger than the SBAS/AS-oracle gaps, the latter are non-negligible, and hence constructing an algorithm selector selector can in principle make sense. For example, consider the scenarios BNSL-2016 or CPMP-2015 with large SBAS/AS-oracle gaps.

Fig. 9: This figure shows the PAR10 scores of the oracle, AS-oracle, SBS and SBAS on a subset of the ASlib v4.0 benchmark scenarios as bar charts

As we noted earlier, the reason why these gaps become smaller is that the oracle performance degrades when moving to the meta level for all scenarios, whereas the SBS performance tends to improve, because the SBAS is essentially an algorithm selector. While the degradation in oracle performance is moderate for the majority of scenarios (less than \(10\%\)), the improvement of the SBAS over the SBS is non-negligible, as the more successful the algorithm selectors considered by the algorithm selector selectors are, the larger this performance gain is.

Table 2 shows the nPAR10 scores of all algorithm selectors and the corresponding algorithm selector selectors of form (8). Moreover, for the algorithm selector selectors, the values in brackets (a/b) indicate that the approach performs better than or equal to a of the base approaches and worse than b of them.

Table 2 PAR10 scores of all base algorithm selectors and algorithm selector selectors, normalized w.r.t. the standard oracle and SBS. The result of the best approach is marked in bold for each scenario. Moreover, for the algorithm selector selectors, the values in brackets (a/b) indicate that the approach performs better than or equal to a of the base approaches and worse than b of them

Unsurprisingly, most algorithm selector selectors are able to consistently improve over the SBS. However, moving to the meta level proves beneficial for only seven scenarios, and these improvements are spread across different algorithm selector selectors. To explain this moderate result, we speculate that the considered AS approaches are not able to unleash their full potential on the meta level, although considerable SBAS/AS-oracle gaps exist, as we have seen previously. Nevertheless, the win/loss scores in brackets indicate that moving to the meta level is beneficial in the sense that a more robust performance across several scenarios can be achieved.

6.3 Voting ensembles

Figure 10 shows the average/median performance in terms of nPAR10 (over all scenarios) of all possible voting ensemble compositions as violin plots grouped by the aggregation strategy being used. The dashed line indicates the performance of the SBAS, the black dot indicates the performance of the best composition w.r.t. the training performance, whereas the red dot indicates the performance of the ensemble with all base algorithm selectors.

Fig. 10: Mean/median performance in terms of nPAR10 (over all scenarios) of all possible voting ensemble compositions as violin plots, grouped by the aggregation strategy being used. The dashed line indicates the performance of the SBAS, the black dot indicates the performance of the best composition w.r.t. the training performance, whereas the red dot indicates the performance of the ensemble with all base algorithm selectors

First of all, it is important to note that voting ensembles offer a lot of optimization potential in terms of both mean and median performance in comparison to the SBAS. While explicitly optimizing the ensemble composition (black dots) does not seem to be beneficial, simply using all base algorithm selectors as ensemble members often comes close to the best performance achievable by any voting ensemble composition. Independent of the aggregation strategy, a voting ensemble with all base algorithm selectors is always able to improve over the best single algorithm selector, sometimes even drastically (e.g., Borda aggregation in terms of median performance). Overall, the weighted majority and the Borda aggregation seem to be on a par in terms of mean nPAR10 score, while Borda is superior in the median case.

It is important to understand the scope of the improvement depicted here. Although R2S-PAR10 already offers a remarkable performance and represents the state of the art in algorithm selection, it is beaten by around \(15\%\) (mean) and \(32\%\) (median), which constitute tremendous improvements.

6.4 Bagging ensembles

Figure 11 shows the average / median nPAR10 performance over all scenarios of each bagging ensemble with 10 instantiations of the corresponding base algorithm selector and different aggregation functions. Moreover, the performance of the corresponding base algorithm selector is shown. Once again, the dashed line indicates the performance of the SBAS.

Fig. 11: Average/median nPAR10 performance over all scenarios of each bagging ensemble with 10 instantiations of the corresponding base algorithm selector and different aggregation functions. Moreover, the performance of the corresponding base algorithm selector is shown. Once again, the dashed line indicates the performance of the SBAS

While the bagging ensembles equipped with ISAC or Multiclass as base algorithm selector deteriorate in performance compared to the SBAS, the ensembles based on SUNNY, SATzilla’11, and PerAlgo are able to improve both in terms of mean and median performance if the right aggregation is chosen. Surprisingly, none of the aggregation functions seems to dominate the others. Furthermore, it can be seen that bagging improves the performance of SUNNY, SATzilla’11 and PerAlgo, but mostly worsens the performance of ISAC and yields mixed results for Multiclass.

In light of the general experience with bagging in machine learning, the performance deterioration of the ISAC ensemble in comparison to its base selector may appear surprising. We conjecture that the negative effect of ensembling is due to the specific characteristics of this method. ISAC applies a clustering technique in order to form clusters over the training instances and computes a threshold t based on the average distance of all instances to their corresponding cluster centroid and the standard deviation over these values. At prediction time, ISAC finds the centroid closest to the new instance and returns the algorithm performing best on the corresponding cluster, if the distance to the centroid is below the aforementioned threshold. If this is not the case, the SBS is returned. Thus, the threshold can be seen as a fail-safe in case ISAC considers the closest cluster to be too different to draw any reasonable conclusions. After careful investigation, we found that the threshold t decreases for the ensemble members trained on bootstrapped training instance sets, as both the average distance and the standard deviation decrease. As a result, the ensemble members mostly degenerate to the SBS and suggest the SBS on a majority of the instances. This explains the decrease in performance and the similar results of the different aggregation strategies.

We note that Run2Survive was left out as a base algorithm selector for bagging as it cannot easily be trained with bootstrapped instance training sets on scenarios with many censored samples. In such cases, bootstrapping often leads to training data sets consisting of censored samples only, which the approach cannot handle.

6.5 Boosting ensembles

Figure 12 shows the average / median nPAR10 performance over all scenarios of each boosting ensemble with 20 iterations and different aggregation functions.

Fig. 12: Average/median nPAR10 performance over all scenarios of each boosting ensemble with 20 iterations and different aggregation functions. Moreover, the performance of the corresponding base algorithm selector is shown. Once again, the dashed line indicates the performance of the SBAS

While the performance of the PerAlgo, Multiclass and SATzilla’11 algorithm selectors improves through boosting, the performance of SUNNY and ISAC degrades. Once again, the degradation of ISAC can be explained by the same phenomenon as in the case of bagging: the instance weighting required by boosting was implemented through data sampling, whence ISAC mostly degenerates to the SBS. We chose this implementation since not all of the base algorithm selectors inherently support instance weights, but we wanted to investigate boosting variants powered by as many base algorithm selectors as possible. The degradation of the performance of SUNNY can be explained in a similar fashion. Recall that SUNNY is essentially a k-nearest neighbor algorithm, which, given a new instance, returns the algorithm that performs best in terms of PAR10 performance on the k nearest instances in the training data. However, this training data mostly consists of instances with a high weight, as all others have a lower chance of being sampled. As a consequence, SUNNY will return the algorithm performing best on average on exactly these instances, while completely ignoring all other instances. This results in degenerate boosting learning curves as depicted in Fig. 13. The problem is less dominant for selectors that generalize in a more sophisticated way across the features, such as PerAlgo or Multiclass. For instance-based approaches such as SUNNY or ISAC, different forms of boosting specialized for k-NN approaches (García-Pedrajas & Ortiz-Boyer, 2009) or clustering (Frossyniotis et al., 2004) might be more promising and should be investigated in future work.

Fig. 13 Learning curves featuring training (orange) and testing (blue) nPAR10 scores of the SAMME boosting algorithm with SUNNY (top two) and ISAC (bottom two) as a base selector on two instances

6.6 Stacking

Figure 14 shows the average nPAR10 performance of stacking variants, where the meta-learner \(h_{agg}\) is instantiated through different algorithm selectors with and without a variance threshold feature selection approach. Each variant uses all base algorithm selectors to generate additional features. The variance threshold method selects all features with a variance larger than a given threshold, which was set to 0.16 for these experiments. The dotted line indicates the average performance of the SBAS.

Fig. 14 Average nPAR10 performance of stacking variants where \(h_{agg}\), i.e. the meta-learner, is instantiated through different algorithm selectors with and without a variance threshold feature selection approach
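As an illustration of the feature selection step, the following sketch (assuming scikit-learn; arrays and dimensions are placeholders, not our experimental setup) applies a variance threshold of 0.16 to a stacked representation consisting of instance features and base-selector predictions.

# Sketch of variance-threshold feature selection on a stacked feature set.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
instance_features = rng.normal(size=(300, 20))        # surrogate instance features
base_predictions = rng.integers(0, 5, size=(300, 4))  # predicted algorithm per base selector
stacked = np.hstack([instance_features, base_predictions])

selector = VarianceThreshold(threshold=0.16)  # keep features with variance > 0.16
reduced = selector.fit_transform(stacked)
print(stacked.shape, "->", reduced.shape)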

Firstly, we note that no general recommendation on the use of feature selection can be made, as its effect seems to depend very much on the meta-learner. However, while none of the stacking ensemble variants improves over the best single algorithm selector, the variants deploying SATzilla'11 and Multiclass as a meta-learner at least achieve a slight improvement in performance. We find this quite disappointing, because the additional features provided to the meta-learner seem to carry valuable information. This is confirmed by the feature importance analysis portrayed in Fig. 15, which shows a ranking of the features w.r.t. their feature importance values, extracted from the multi-class classification meta-learner (instantiated with a random forest classifier) for the QBF-2011 scenario. Clearly, the additional features in the form of the predictions of the ensemble members carry the biggest part of the information contained in the data. A sketch of how such importance values can be obtained is given below.

Fig. 15 Ranking of the features w.r.t. their feature importance values extracted from the multi-class classification meta-learner (instantiated with a one-vs-all decomposition equipped with a random forest classifier) for the QBF-2011 scenario
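The importance values underlying such a ranking can be extracted roughly as follows; this is a hedged sketch on placeholder data with a plain random forest classifier, not the exact one-vs-all setup used in the experiments.

# Sketch of extracting and ranking feature importances from a random forest
# meta-learner trained on instance features plus base-selector predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_meta = rng.normal(size=(400, 24))  # surrogate meta-level features (20 + 4 columns)
y = rng.integers(0, 5, size=400)     # surrogate label: best algorithm per instance
names = [f"instance_feat_{i}" for i in range(20)] + [f"selector_pred_{i}" for i in range(4)]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_meta, y)
ranking = sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1])
for name, importance in ranking[:5]:
    print(f"{name}: {importance:.3f}")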

6.6.1 Overall comparison

Table 3 displays nPAR10 values of a subset of all evaluated ensemble variants and of all base algorithm selectors, broken down by scenario. The best result for each scenario is marked in bold, and a line above a result of an ensemble approach indicates that it is better than the result of the best base algorithm selector on the corresponding scenario.

Overall, ensembles of algorithm selectors achieve a performance superior to single algorithm selectors. There are only two scenarios (ASP-POTASSCO, MAXSAT-WPMS-2016) for which none of the selected ensemble variants was able to improve over the base algorithm selector performing best on that particular scenario, and another three scenarios where merely a competitive performance was achieved (MAXSAT15-PMS-INDU, SAT11-HAND, SAT12-HAND). For all other scenarios, at least one of the ensemble variants achieves a new state-of-the-art performance. While some of these improvements are rather small (e.g., CSP-MZN-2013, where an improvement from 0.11 to 0.10 is recorded), there are also various scenarios with a more than \(1.5\)-fold improvement (e.g., CSP-Minizinc-Time-2016, SAT03-16_INDU, QBF-2011). This is especially remarkable, as only very few improvements have been made in the last two years.

In terms of median and average rank performance across all scenarios, the Borda voting ensemble variant achieves the best result and improves over the previous state of the art by more than \(32\%\) (median performance). Thus, it demonstrates a very robust performance across all scenarios. The voting ensemble with Borda aggregation (13), the bagging ensemble with the PerAlgo base selector and Borda aggregation (11), and the boosting ensemble with the PerAlgo base selector and weighted majority aggregation (13) consistently outperform the best single algorithm selector on 11 to 13 of the 25 scenarios and thus achieve an impressive performance. A minimal sketch of the Borda aggregation is given below.
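For reference, the following minimal sketch shows one plausible reading of a Borda aggregation over the rankings returned by the individual selectors: each selector ranks the algorithms, ranks are converted into points, and the algorithm with the most points is selected. The function name and example rankings are illustrative, not taken from the paper.

# Sketch of Borda aggregation over selector rankings (0 = best rank).
import numpy as np

def borda_aggregate(rankings):
    """rankings: (n_selectors, n_algorithms); entry = rank assigned to an algorithm."""
    rankings = np.asarray(rankings)
    n_algorithms = rankings.shape[1]
    points = (n_algorithms - 1 - rankings).sum(axis=0)  # points per algorithm
    return int(points.argmax())

# three selectors ranking four algorithms
example = [[0, 1, 2, 3],
           [1, 0, 2, 3],
           [0, 2, 1, 3]]
print("chosen algorithm:", borda_aggregate(example))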

Table 3 nPAR10 values of the best ensemble variants and all base algorithm selectors, broken down by scenario. The best result for each scenario is marked in bold, and a line above a result indicates beating all base algorithm selectors

6.7 Discussion of results

We end the experimental evaluation by discussing both the scope of the presented results and the hardness of meta learning.

6.7.1 Scope of results

As the composition of the ASlib benchmark and the existing literature show, most algorithm selection research is centered around constraint satisfaction problems, where the measure to optimize is algorithm runtime or a penalized version thereof, such as the PAR10 score. This has several reasons: First, constraint satisfaction problems play a very important role in industry while, despite the large amount of research devoted to these kinds of problems over the last century, they remain hard to solve in general. Second, the phenomenon of performance complementarity among algorithms, which is the main motivation for the AS problem as discussed earlier, is very pronounced for these problems. More precisely, algorithms for solving constraint satisfaction problems are known to exhibit heavy-tailed runtime distributions (Gomes et al., 1997), i.e., they may take very long to solve some instances, while other algorithms solve the same instances much faster. Overall, the potential for algorithm selection is very large for these kinds of problems, while it is sometimes lower for other problems, such as selecting machine learning algorithms for a dataset. For that particular example, both random forests and gradient boosting often show strong performance and thus constitute a strong SBS, which can in principle be improved upon, as shown for example by Thornton et al. (2013), but often only to a smaller degree.

Correspondingly, as is common in the AS literature, the results presented so far focus on scenarios optimizing algorithm runtime. However, in order to at least give an idea of the applicability of the proposed framework to other algorithmic problem classes, we also present results on two further scenarios from ASlib, which focus on optimizing solution quality instead of algorithm runtime. In particular, we present results on the OPENML-WEKA-2017 and TTP-2016 scenarios. While the former is concerned with the selection of machine learning algorithms for different datasets, the latter deals with selecting algorithms for instances of the traveling thief problem (Bonyadi et al., 2013). As the Run2Survive models are specifically tailored towards AS w.r.t. algorithm runtime rather than solution quality, we leave them out of this comparison.

Table 4 shows the results for the ensemble methods and the base algorithm selectors, including both the SBS and the oracle as reference points. These reference points are included because this table does not show nPAR10 scores, but a performance score in the unit interval where 1 is the optimum, since the scenarios are concerned with solution quality optimization, as noted earlier. While the base algorithm selectors achieve a slight improvement over the SBS on the TTP-2016 scenario, none of them beats the SBS on the OPENML-WEKA-2017 scenario. Similarly, none of the ensemble approaches is able to improve over a base selector on the two scenarios. Hence, the empirical results corroborate the essence of the discussion above: on both scenarios the SBS is a strong baseline, which is quite close to the oracle in terms of performance, and hence there is hardly any potential for algorithm selection in general, let alone on the meta level.

Overall, algorithm selection on the meta level is only sensible in cases where (a) a considerable gap exists between the performance of standard algorithm selection approaches and the oracle, and (b) performance complementarity among the algorithm selectors can be exploited. In contrast to the scenarios concerned with runtime, at least the first condition is not met for the two additional scenarios considered here, making an application not worthwhile in these cases.

Table 4 Performance values (OPENML-WEKA-2017: accuracy, TTP-2016: TTP objective function; Wagner et al., 2018) of the best ensemble variants and all base algorithm selectors, broken down by scenario. The best result for each scenario is marked in bold, and a line above a result indicates beating all base algorithm selectors

6.7.2 Is meta learning harder than learning?

Recall our taxonomy of the approaches presented in Fig. 8, regarding which kind of mapping they model, how this mapping is constructed, and how the required aggregation function is obtained. Drawing an overall conclusion from the results presented in this work, we cautiously conclude that the presumably simpler problem of learning a mapping (8) from the instances to the set of algorithm selectors yields worse results than solving the presumably more complicated problem of finding both a mapping from instances to a set of selectors and a corresponding aggregation function. While we observed remarkable performance improvements for all ensemble approaches, the meta learning approach achieved essentially no improvement. Although ensembles are known to often yield better results than single approaches, and thus some improvement is to be expected, we believe that the degree of improvement in a well-researched field such as algorithm selection is truly remarkable. Moreover, it is surprising that the meta learning approach essentially fails and hence that classic AS approaches cannot exploit performance complementarity on the meta level.

As a possible reason, note that the meta learning approach heavily relies on the instance features, which are required for learning on the meta level. In contrast, ensembles of algorithm selectors do not use these features on the meta level directly (except for stacking), but only aggregate the predictions of multiple selectors. Thus, we speculate that the information contained in the features does not allow for an improvement in performance by moving to the meta level, while the predictions of the selectors do carry enough information to do so. This hypothesis is corroborated by the feature analysis conducted as part of the experiments on stacking (cf. Fig. 15), which indicates that much more information is present in the predictions of the base selectors than in the original instance features. We attribute stacking's ability to perform successful learning on the meta level (aggregation) to the same reason. Yet, while stacking was able to achieve improvements, the arguably simplest ensemble approach in the form of voting, which involves no learning on the meta level at all, achieved by far the best results. Overall, learning on the meta level appears to be a very hard problem.

7 Related work

In the following, we give an overview of the work most closely related to ours regarding the use of ensemble methods in algorithm selection. As mentioned earlier, such work is surprisingly sparse. For a general overview of work on algorithm selection, we refer to Kerschke et al. (2019).

We presented a preliminary version of the meta AS problem in a preprint (Tornede et al., 2020b), which aimed at constructing a more effective algorithm selector by leveraging multiple existing selectors. The idea presented there is identical to the idea presented here in Sect. 4. In this work, we define the problem in a more general fashion, present a framework for solving this problem and show several instantiations of this framework. Accordingly, the work presented in the preprint is subsumed by this work.

In algorithm selection, it is normally assumed that the set of algorithms \(\mathcal {A}\) to choose from is predefined, although the composition of this set can have an influence on the selectors. Therefore, Kordík et al. (2018) propose not to simply use all available algorithms as a basis to choose from, but to employ ensemble techniques in order to construct the algorithms constituting this set. Thus, Kordík et al. (2018) build ensembles on the level of algorithms, whereas we build ensembles on the level of selectors with the goal of creating a better combined algorithm selector.

Last but not least, and perhaps most closely related, both Malone et al. (2017) and Kotthoff (2012) suggest a stacking approach: First, a regression model is learned per algorithm to estimate its performance on a given instance; second, the estimated performances are used as input for a multi-class classification model that eventually selects the algorithm. While Kotthoff (2012) only uses the outputs of the performance estimators as input to the meta-learner, Malone et al. (2017) use them in addition to the original features. Moreover, Malone et al. (2017) suggest to also include uncertainty information obtained from the performance estimators as input for the meta-learner. Both variants are very specific instantiations of the general idea presented in this paper, using stacking as the ensemble technique and a specific selector as the base algorithm selector. While the approach presented by Malone et al. (2017) finished in the last spot of the open algorithm selection competition of 2017 (Lindauer et al., 2019), Kotthoff (2012) considered a setting where the goal was to select the best machine learning algorithm for a dataset; he showed that stacking a classifier on top of the pure performance estimation does indeed yield an improvement over choosing the algorithm based on the performance estimates alone in most cases.
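For concreteness, a rough sketch of this two-level stacking idea is given below, assuming scikit-learn; the data, model choices, and the in-sample training of the meta-learner are placeholder simplifications, not the setups of Malone et al. (2017) or Kotthoff (2012).

# Sketch: per-algorithm performance regressors feeding a multi-class meta-learner.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))         # surrogate instance features
perf = rng.exponential(size=(300, 4))  # surrogate performance of 4 algorithms (lower is better)
best = perf.argmin(axis=1)             # label: best algorithm per instance

# level 1: one performance estimator per algorithm
regressors = [RandomForestRegressor(n_estimators=50, random_state=0).fit(X, perf[:, a])
              for a in range(perf.shape[1])]
estimates = np.column_stack([r.predict(X) for r in regressors])

# level 2: a classifier mapping the estimated performances to a selection
# (a real setup would use held-out predictions here to avoid overfitting)
meta = RandomForestClassifier(n_estimators=50, random_state=0).fit(estimates, best)
print("selected algorithm for the first instance:", meta.predict(estimates[:1])[0])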

8 Conclusion

In this paper, we revisited the problem of algorithm selection from a meta perspective. We defined the problem of meta algorithm selection and proposed a general methodological framework for this problem. Moreover, we considered several concrete learning methods as instantiations of this framework and compared them conceptually and empirically. In an extensive experimental study on an established benchmark for algorithm selection, we have shown that the meta algorithm selection problem can be solved efficiently, and that solutions can provide remarkable improvements in performance, often significantly better than the hitherto state of the art. Finally, we set the results into a broader context, concluding that learning algorithm selector selectors seems to be harder and less promising than defining them through well-established concepts from ensemble learning.

In future work, more effort should be invested in understanding why learning algorithm selector selectors appears to be a hard problem, while manually defined algorithm selection ensembles can achieve good performance. In particular, investigations of this phenomenon on a theoretical level would be of interest. Another possible direction for future work might be to focus more on learning instance-specific aggregation functions (Melnikov & Hüllermeier, 2016) to be used inside the ensembles, because this would allow one to leverage the information of which algorithm did indeed perform best on a given instance, instead of using an a priori fixed aggregation function. As seen with stacking, this works at least in principle. Yet another direction for future work is to adapt the idea of ensembles to the field of algorithm scheduling, where the recommendation target is no longer a single algorithm, but a complete algorithm schedule. One of the main challenges here is the aggregation of schedules.