Abstract
Algorithm selection (AS) is dedicated to finding the optimal algorithm for an unseen problem instance. With knowledge of the problem instances’ meta-features and the algorithms’ landmark performances, Machine Learning (ML) approaches are applied to solve AS problems. However, the standard training process of benchmark ML approaches in AS either needs to train a model specifically for every algorithm or relies on a sparse one-hot encoding as the algorithms’ representation. To skip these intermediate steps and form the mapping function directly, we borrow the learning-to-rank framework from Recommender Systems (RS) and embed bilinear factorization to model the algorithms’ performances in AS. This Bilinear Learning to Rank (BLR) proves competitive in some AS scenarios and is therefore also proposed as a benchmark approach. From the evaluation perspective of modern AS challenges, precisely predicting the performance is usually the measuring goal. Although the inference time of an approach should be counted in the running-time cost, it is usually overlooked in the evaluation process. The multi-objective evaluation metric Adjusted Ratio of Root Ratios (A3R) is therefore advocated in this paper to balance the trade-off between accuracy and inference time in AS. Under A3R, BLR outperforms the other benchmarks when the candidate range is expanded to TOP-3. The benefit of this candidate expansion stems from the cumulative optimum performance during the AS process. We take a further step in the experiments to demonstrate the advantage of such a TOP-K expansion and illustrate that it can serve as a supplement to the convention of TOP-1 selection during evaluation.
Introduction
In the Algorithm Selection domain, for scenarios such as computational complexity and machine learning, the number of problem instances can be infinite, while new algorithms for solving them are created every year. In a specific scenario, the performance of an algorithm varies a lot across problem instances, and thus correctly foretelling the performances of algorithms on problem instances is critical for finding a good algorithm. The research problem of how to effectively select a good algorithm for a specific problem instance was raised as early as 1975 by Rice [1]. A per-instance Algorithm Selection (AS) problem can be formulated as \({\mathbf {A}} \times I \rightarrow {{\mathbb {R}}}\), where \({\mathbf {A}} = \left\{ A_1, A_2, \ldots , A_n \right\} \) represents the set of all available algorithms in a scenario, and I denotes a specific problem instance in this scenario. The performances that algorithms achieve on a problem instance are embedded in the space \({{\mathbb {R}}}\). Using brute force to traverse all the algorithms tells the exact performances and selects the best algorithm precisely, but it is often time-consuming. To speed up the algorithm selection process, AS approaches must sacrifice the guarantee of returning the absolutely perfect algorithm and instead strive to find algorithms whose performances come as close as possible to those of the perfect algorithms on the instance set I [2].
Single Best or Average Rank makes use of landmark features, i.e., the performance values of algorithms on the problem instances [3, 4]. They pick only one well-performing algorithm as the suggestion for all the problem instances in a scenario. However, choosing a single algorithm for all unseen problem instances is not always a good strategy; it is possible that one algorithm performs better on many problem instances but dramatically worse on a minority of others [5]. To increase the coverage of well-solved instances, per-instance algorithm selection has been proposed. It creates the possibility that every problem instance is treated individually and obtains its own optimal algorithm, thereby improving the selection effect. For example, the Propositional Satisfiability Problem (SAT) has commonly been tackled with Machine Learning (ML) approaches. It is one of the most fundamental problems in computer science, and many other NP-complete problems can be converted into SAT and solved by SAT solvers [5]. Thus algorithm competitions on SAT problems are held every year in the community^{Footnote 1}. The frequent winner SATzilla uses ML to build an empirical hardness model that serves as the basis for an algorithm portfolio. The model forms a computationally inexpensive predictor based on the features of the instance and the algorithms’ past performances [5, 6]. The strategy of running the algorithm portfolio can be sequential, parallel, or a combination of the two [5, 7]. Though running time is sacrificed, especially in the sequential cases, the solved ratio has been increased.
When formulating the AS problem from the view of ML, it can be abstracted into different models [8,9,10]. More specifically: \(SATzilla^{*}\) applies pairwise performance prediction with random forest classifiers [5, 6]; LLAMA creates a multi-class classification model to attribute a problem instance with meta-features to an algorithm class [11]; ISAC aggregates similar training instances into a subset via clustering or kNN and finds the best algorithm on a per-set basis for a new problem instance [12, 13]. Facing multiple ML-based AS approaches, AutoFolio automates the process of locating the best AS approach in the combined search space [14, 15]. Similarly, in ML scenarios, automatically selecting a proper algorithm and its hyperparameter configuration for a specific dataset is the main purpose. AutoML tools like Auto-WEKA [16] and Auto-sklearn [17] are quite popular for searching the algorithm and hyperparameter space.
A typical AS problem can be represented as in Fig. 1. Meta-features of problem instances are fully given as the complete matrix on the left-hand side, while the performances of solvers (algorithms) applied to known problem instances form the performance matrix on the right-hand side. The mapping function from meta-features to performances is expected to be learned. Given a new problem instance, the performance prediction fully relies on its meta-feature vector. This full reliance makes the prediction task in AS similar to the cold-start condition in Recommender Systems (RS). Looking at the blocks with stars in Fig. 1 from the RS perspective, the problem instance meta-feature input can be understood as the usual user profiling features like age, working field, preference category, etc., and the performance matrix can be associated with the user rating or implicit feedback matrix in RS. Therefore, approaches used in RS are also applicable to AS problems. The terminologies used in AS, ML and RS occasionally overlap, and we distinguish them in Table 5 in “Appendix A.1” to avoid misunderstanding. The examples in the table mainly come from the definitions in the work on TSP solvers by Bao et al. [18].
When applying RS approaches to AS problems, we need to note that the recorded performances of algorithms on problem instances are usually much smaller in size. Hence the state-of-the-art deep learning and transaction embedding techniques from large-scale session-based RS [19, 20] are not suitable for AS scenarios. On the contrary, shallow ML approaches from RS are more adaptable. Since 2010, Stern et al. have applied Bilinear Matrix Factorization (originally designed for RS) in AS scenarios with good results [21, 22]. Thereafter, many researchers have tried RS approaches on AS tasks. Misir and Sebag created the Alors AS system, which utilizes random forests to map the meta-features of problem instances onto a latent feature space; based on these latent features, their Collaborative Filtering (CF) makes the algorithm recommendation [9, 23]. Yang et al. proposed using Principal Component Analysis to actively decompose the performance matrix, addressing the sparse-performance-entries problem for new problem instances [24].
Learning to Rank (L2R), a well-known RS framework, has been proposed to learn the prediction model from the ranking of the recommended list [25,26,27] and is also applicable in AS. As summarized in [28], L2R methods are usually divided into three groups: pointwise, pairwise and listwise. Pointwise L2R is designed for labeled ranks, so multi-class classification ML models can be used. Pairwise L2R works well for recommendation with a large number of candidate items: owing to pair sampling from the lengthy candidate list, time can be saved during learning. Listwise L2R creates the loss function through the cross entropy between the ground truth list and the predicted list. In [29], the authors utilized the sigmoid function as a ranking surrogate to tell the algorithms’ pairwise performance order; the surrogate embeds a polynomial scoring model function to produce the probability. However, pairwise L2R incurs extra cost during the pair-sampling phase, and listwise L2R is preferable for shorter candidate lists. To model the uncertainty of the performance ranking, we apply the listwise L2R framework in the proposed model for solving AS problems.
In exchange for speeding up the algorithm selection process, AS approaches need to sacrifice performance prediction accuracy to some extent. For every AS scenario, an Oracle or Virtual Best Solver (VBS) is assumed to know the best-performing algorithm for all instances. Reducing the gap between a proposed AS approach and the VBS is one of the goals when assessing a new AS approach. In this paper, we mainly deal with the AS problem in computational complexity scenarios like SAT, the Maximum Satisfiability Problem (MAXSAT), Constraint Satisfaction Problems (CSP), Quantified Boolean Formula (QBF) and Answer Set Programming (ASP) [30,31,32,33]. In these scenarios, runtime is the performance indicator for all candidate algorithms. The additional runtime cost and the solved ratio of the predicted optimal algorithm are the main effect measurements for AS approaches [34,35,36,37]. Aside from the accuracy-oriented evaluation metrics, the inference time of AS approaches can span many orders of magnitude and thus also needs to be taken as a trade-off factor in the evaluation. Nevertheless, inference time is usually overlooked in algorithm evaluation.
From the view of modeling, evaluation and candidate selection when applying RS approaches to AS problems, several research questions remain open: (1) If both problem meta-features and algorithm performance information are utilized for modeling, multi-model training or one-hot encoding is usually unavoidable in benchmark approaches; can a model skip these intermediate steps and create the mapping directly? (2) During the evaluation process, the inference time of a specific AS approach is usually ignored; when both prediction accuracy and inference time are taken into account, how should the AS effect be balanced? (3) In most AS challenges [37, 38], only the predicted optimal algorithm is chosen for evaluation, which narrows the candidate set and reduces the chance of finding the actual optimal algorithm; can a proper expansion of the candidate set benefit the AS effect through the cumulative optimal algorithm? To address these research questions, we conduct the following studies in this paper:

(1)
We propose Bilinear Learning to Rank (BLR) to include both problem instance meta-features and the performance matrix in one L2R framework. The mapping matrices \(\mathbf {W}\) and \(\mathbf {V}\) in the model create the mapping from meta-features to the performance matrix in a straightforward way, avoiding the multi-model training or one-hot algorithm encoding that other benchmark approaches require. The probabilistic assumption on the ranking models the randomness of the performance values in the algorithm–problem-instance interaction matrix. We illustrate the good performance of BLR compared with other benchmark AS approaches in the experiments.

(2)
Adjusted Ratio of Root Ratios (A3R) was proposed as a ranking measure for algorithms in ML meta-learning; it incorporates both an accuracy-oriented metric and a time-cost metric into one evaluation measurement. We apply A3R as the evaluation metric for general AS tasks in order to balance accuracy and inference time when measuring AS approaches. Measured with A3R, BLR outperforms the other approaches in this trade-off.

(3)
Observing the cumulative optimal performance, we find that AS approaches usually converge to a good performance when K goes from 1 to 3 or 5. Though TOP-1 candidate selection is still used in many AS challenges, we advocate expanding the candidate selection spectrum from TOP-1 to TOP-K (where K depends on the available computational power). The error decrease observed in the experiments confirms the benefits of such an expansion.
The rest of the paper is structured as follows: basic methodologies, benchmark approaches and the concrete modeling steps of BLR are introduced in Sect. 2. In Sect. 3, we first list the evaluation metrics frequently used in AS tasks and then introduce A3R as the trade-off metric between accuracy and inference time. Section 4 presents the experiment design and the results. Finally, Sect. 5 draws the conclusion and gives an outlook on future work.
Methodologies
In AS, the prediction targets for one problem instance are the performances of multiple algorithms, instead of a single label or numerical value. To solve this multi-target prediction task, there are three ways to design AS approaches: (1) relying on statistics of the algorithms’ historical performances; (2) algorithm-wise performance separation: building a prediction model for each algorithm individually and running all fitted models during inference; (3) one-hot conversion of algorithm indicators: horizontally concatenating the problem instance meta-feature matrix and the algorithm appearance one-hot matrix to form the input for a single prediction function. In this section, we first introduce the benchmark approaches that follow these three designs. Subsequently, we propose our own approach, Bilinear Learning to Rank (BLR), which needs neither multi-model training nor one-hot conversion to complete the AS model creation.
Benchmark approaches
Targeting diverse AS scenarios, some well-performing benchmark approaches have already been proposed.^{Footnote 2} We separate these benchmark approaches into three groups according to the three design strategies mentioned above.
Performances’ statistics
Virtual Best Selector and Single Best are two traditional benchmark approaches in AS. They do not rely on any Machine Learning (ML) model over meta-features but come from performance statistics instead.

Virtual Best Selector (VBS) is the ground truth of the algorithms’ performances. The ranking of algorithms in the VBS is the true rank against which the predicted list is compared. The evaluation of the VBS list is the upper bound for all other AS approaches.

Single Best is the most classical algorithm selection baseline. It selects the algorithm whose mean performance is the best across all the problem instances in the training set.
Algorithm-based separated learning
The algorithm-based separated learning process is illustrated in Fig. 2. For each algorithm, a single prediction model is trained based on the problem instances’ meta-features and the algorithm’s performances. When a new problem instance shows up, N prediction models are used to infer the performances of the N algorithms separately. The following AS approaches adopt this process. Despite the model specificity of this group of approaches, long inference time is their main disadvantage.

Separated Linear Regressors train linear regressors for the candidate algorithms separately. When a new problem instance must be handled, the performance prediction for all algorithms depends on all the fitted linear models.

Separated Random Forest Regressors fit one Random Forest (RF) model per designated algorithm. During the inference phase, the N RF models are called separately to generate predictions for the N algorithms individually.

Separated Gradient Boosted Regression Trees (XGBoost) use gradient-boosted trees to learn the performance predictor; every algorithm owns an XGBoost model and infers new performance values from its own model.
Algorithms’ one-hot conversion
Another group of AS approaches applies one-hot conversion of the algorithm appearance indicator to form the new AS input. For a new problem instance, the concatenation of the problem instance meta-feature vector and the algorithm indicator vector forms the input for the prediction model. Figure 3 represents the conversion process. Though a single model brings simplicity, one-hot conversion introduces extra sparsity into the data. The AS approaches following this conversion rule include:

One-hot Linear Regressor trains one linear prediction model on the flattened representation combining problem instance meta-features and algorithm appearance indicators. Only one linear model is applied during inference for new problem instances.

One-hot RF Regressor treats each entry in the performance matrix as the regression target; with the \(L + N\)-dimensional features, only one RF is needed to fit the model. The model can infer any algorithm’s performance from its one-hot encoded appearance indicator.

One-hot XGBoost fits a single XGBoost model on \(M \times N\) training samples; this model is applicable for performance inference for all algorithms.
Bilinear L2R
There are two matrices with known entries in AS scenarios: the problem instance meta-feature matrix X and the algorithm–problem-instance performance matrix S. The benchmark approaches mentioned above solve the mapping from X to S via either multi-model training (time-consuming) or one-hot conversion of algorithm indicators (which sparsifies the dataset). To avoid both, we propose Bilinear Learning to Rank (BLR) to create the AS strategies. Given the bilinear assumption, the factorization of the mapping from X to S is represented in Fig. 4, and the performance inference on new problem instances is depicted in Fig. 5. With the help of the two latent mapping matrices W and V, an entry of the performance matrix \(s_{m,n}\) can be calculated as \(\mathbf {X}_{m,:} \cdot W \cdot V_{:,n}\). Therefore, the model parameters to be learned are the matrices W and V, and there is no need to train specific models individually for different algorithms. Owing to the exact index mapping, the dense latent matrices directly produce the entries of the performance matrix, so sparse one-hot encoding is not needed at inference time.
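As a concrete illustration, the bilinear prediction above can be sketched in a few lines of NumPy; the dimensions and random matrices here are made up for the example, not learned values.

```python
import numpy as np

# Illustrative dimensions: M problem instances, L meta-features,
# K latent factors, N algorithms.
M, L, K, N = 4, 6, 3, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(M, L))    # meta-feature matrix
W = rng.normal(size=(L, K))    # latent mapping matrix W
V = rng.normal(size=(K, N))    # latent mapping matrix V

# Full predicted performance matrix via the bilinear form X * W * V.
S_hat = X @ W @ V

# A single entry s_{m,n} = X[m, :] . W . V[:, n], as in the text.
m, n = 1, 2
s_mn = X[m, :] @ W @ V[:, n]
```

Note that W and V together hold only (L + N) * K parameters, which is how the model avoids both per-algorithm models and one-hot inputs.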
Uncertainty always exists in algorithms’ performances. For computational complexity problems like SAT and the Traveling Salesman Problem (TSP), an algorithm’s runtime can differ when the specific running environment changes. For ML-based problems, the measured accuracy can also differ when the cross-validation setting changes. Under the bilinear factorization assumption, we therefore model the ranking of algorithms w.r.t. a specific problem instance in a probabilistic fashion. We assume the probability that an algorithm is ranked TOP-1 for a problem instance is proportional to its performance (or predicted performance) among all algorithms. The cross entropy between the ground truth TOP-1 probability vector \(P_{\mathbf {r}_m}(r_{m,n})\) and the predicted TOP-1 probability vector \(P_{{{\hat{\mathbf {r}}}}_m}({{\hat{r}}}_{m,n})\) (where r is the converted value of a performance value s) defines the loss and drives the optimization strategy.
Embedding bilinear factorization in the L2R framework is the full idea of BLR. We refine the notations for BLR in Table 6 in “Appendix A.2”. The modeling and learning of BLR are structured in four steps: (1) the performance scoring model function and the corresponding rating conversion function; (2) the loss function considering the ranking loss; (3) the gradient function for the corresponding weights; and (4) the update rule of the weights according to the chosen optimization approach. The first two steps are introduced in this section, while the gradient function and update rules are explained in “Appendix A.3.1 and A.3.2”, respectively.
Model function
In BLR, given problem instance m and algorithm n, we predict the performance score \({{\hat{s}}}_{m,n}\) as in Eq. (1). The preferred sorting order on performance values depends on the choice of target performance: if runtime is the performance metric, a lower value is better, whereas if accuracy is the target metric, a higher value is preferred. For simplicity in calculating the listwise ranking loss, we define a converting function \(r = f(s)\) that makes descending order preferable for all rating values r. The converted rating value r is the optimization unit in the ranking loss function. In this paper, we simply define f(s) as in Eq. (2).
Listwise loss function
Assuming that the performance scores of all algorithms on a specific problem instance carry measurement noise, we model the probability that an algorithm is ranked top-one as proportional to its normalized measured performance value. This normalized top-one probability representation has been proposed in the L2R domain to model the listwise ranking loss [28]. For a single problem instance, the top-one probability of the same algorithm differs between the ground truth performance list and the predicted performance list. As defined in Eq. (3), for a problem instance m with rating vector \(\mathbf {r}_{m}\) (the converted version of the performance vector), the top-one probability of each algorithm n is normalized in the form of \(P_{\mathbf {r}_{m}}\).
To concentrate the probability mass around the positions of the largest input values, the exponential function is applied as the concrete form of the monotonically increasing function \(\varphi \) in Eq. (3). Thus \(P_{\mathbf {r}_{m}}\) can be represented as Eq. (4), which has the same shape as the Softmax function.
To represent the listwise ranking loss per problem instance, the cross entropy is calculated between the top-one probabilities of the predicted rating list \({{\hat{\mathbf {r}}}}_m\) and the ground truth rating list \(\mathbf {r}_{m}\). For each problem instance m, the pointwise loss for algorithm n is formulated as Eq. (5). Since the probability normalization is computed on the same scale for a problem instance m, the per-instance listwise loss \(L_{m}\) is defined as the summation of the pointwise losses inside the list, as shown in Eq. (6). Here \(L_{m}\) is the listwise ranking loss between the ground truth list and the predicted list. The total loss over all M problem instances is defined in Eq. (7), where L2 regularization is applied to avoid overfitting.
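The top-one probabilities and the per-instance cross-entropy loss described above can be sketched as follows; the rating values are illustrative toy numbers and the function names are our own, not the paper's.

```python
import numpy as np

def top_one_prob(r):
    """Top-one probability of each algorithm from a rating vector r
    (higher rating = better): the Softmax-shaped form of Eq. (4)."""
    e = np.exp(r - r.max())        # shift for numerical stability
    return e / e.sum()

def listwise_loss(r_true, r_pred):
    """Per-instance listwise loss: cross entropy between the ground
    truth and predicted top-one distributions (Eqs. (5)-(6))."""
    p, q = top_one_prob(r_true), top_one_prob(r_pred)
    return float(-np.sum(p * np.log(q)))

# Toy ratings for three algorithms, already converted so that a
# higher value is better (illustrative values only).
r_true = np.array([2.0, 0.5, -1.0])
loss_match = listwise_loss(r_true, r_true)            # minimal loss
loss_reversed = listwise_loss(r_true, r_true[::-1])   # wrong order
```

The loss is minimized when the predicted top-one distribution matches the ground truth one, which is what drives W and V during optimization.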
The concrete gradient calculation for this loss definition and the gradient-based update rule can be found in “Appendix A.3.1 and A.3.2”, respectively.
Evaluation metrics
We measure the AS effect of different approaches with the evaluation metrics Success Rate (SUCC), Mis-Classification Penalty (MCP), Penalized Average Runtime Score (PAR10) and Mean Average Precision (MAP). In addition to these accuracy-oriented metrics, A3R is applied to handle the trade-off between prediction effect and inference time.
Accuracyoriented evaluation metrics
SUCC, PAR10 and MCP are the standard evaluation metrics of the AS community. SUCC only cares whether the selected algorithm solves the instance; how close the predicted best algorithm performs to the actual best algorithm is the main concern of PAR10 and MCP. Additionally, MAP is included as a representative ranking measurement. Following the conventional candidate selection criterion, the selection range of algorithms is limited to the TOP-1 of the predicted list, so the chance of finding the optimal algorithm is limited to this specific choice. In this paper, we propose expanding the algorithm candidate selection range to the TOP-K of the predicted list to gain an evaluation bonus. The four evaluation metrics and their TOP-K interpretations are explained below.
 SUCC:

stands for the average solved ratio of the selected algorithm per problem instance across the test set. For the TOP-1 selection criterion, the solved ratio is calculated only w.r.t. the algorithm with the best predicted performance. For SUCC@K, the average is taken over the best K algorithms.
 PAR10:

is the penalized version of the actual runtime of the selected algorithm. If the selected algorithm actually times out, its runtime is penalized by multiplying the timeout runtime by 10; otherwise, the actual runtime is used directly. With the TOP-1 selection criterion, the penalty is only applied to the best-ranked algorithm in the predicted list. For PAR10@K, the penalty is applied to the algorithm with the shortest actual runtime among the TOP-K algorithms of the predicted list.
 MCP:

compares the time cost difference between the actual runtime of the predicted best algorithm and that of the VBS. The algorithm with the lowest actual runtime in the TOP-K predicted list is compared with the runtime of the VBS. The algorithm selected by the VBS always has an MCP value of zero.
 MAP:

measures the mean average precision of the TOP-K predicted algorithms vs. the TOP-K algorithms ranked by ground truth performance. MAP for the TOP-K algorithms in the predicted list is calculated in the same way as MAP@K (the average of the precision rates with a hit indicator).
Among the above evaluation metrics, the accuracy-oriented ones SUCC and MAP comply with the rule the higher the better, while for the time-cost-oriented metrics MCP and PAR10, the lower the better.
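Under the TOP-K reading above, PAR10@K and MCP@K for a single problem instance can be sketched as follows; the variable names and toy values are illustrative, not taken from any ASlib scenario.

```python
# Hedged sketch of the TOP-K metrics for one problem instance:
# `runtimes` maps each algorithm to its true runtime, `pred_order`
# is the algorithm index list sorted by predicted performance, and
# `cutoff` is the scenario timeout.

def par10_at_k(runtimes, pred_order, k, cutoff):
    """Best true runtime among the TOP-K predicted algorithms,
    replaced by 10 * cutoff when even that algorithm times out."""
    best = min(runtimes[a] for a in pred_order[:k])
    return 10 * cutoff if best >= cutoff else best

def mcp_at_k(runtimes, pred_order, k):
    """Gap between the best TOP-K true runtime and the VBS runtime."""
    best = min(runtimes[a] for a in pred_order[:k])
    return best - min(runtimes.values())

runtimes = {0: 120.0, 1: 3600.0, 2: 15.0}   # algorithm -> true runtime
pred_order = [1, 0, 2]                      # predicted ranking
cutoff = 3600.0                             # scenario timeout
```

With these toy values, a TOP-1 selection picks algorithm 1 (a timeout), while expanding to TOP-3 recovers the true best algorithm 2 and drives MCP to zero.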
Multiobjective evaluation metrics
The standard AS evaluation metrics are aimed at the accuracy of the performance prediction. However, inference time on unknown problem instances also deserves attention. The multi-objective evaluation metric Adjusted Ratio of Root Ratios (A3R) incorporates both accuracy and inference time into the evaluation and trades off the two factors.
Abdulrahman et al. introduced A3R in AutoML [39, 40], where A3R serves as the ranking basis for algorithms w.r.t. a dataset in the AutoML scenario. A3R balances the precision and the runtime of the selected algorithm. As Eq. (8) shows, when applying algorithm \(a_p\) on dataset \(d_i\), \(SR^{d_i}_{a_p}\) stands for the success rate and \(T^{d_i}_{a_p}\) represents the time cost. A reference algorithm \(a_q\) is chosen to standardize the success rate across all algorithms as the ratio \(SR^{d_i}_{a_p} / SR^{d_i}_{a_q}\); the equivalent ratio for time cost is \(T^{d_i}_{a_p}/T^{d_i}_{a_q}\). The combined metric takes the success rate ratio as the advantage and the time ratio as the disadvantage. Since the time cost ratio ranges across more orders of magnitude than the success rate does, the \(N_{th}\) root in the denominator of Eq. (8) rescales the running time ratio and keeps A3R in a reasonable value range. A3R measures the overall quality of running an algorithm on a dataset.
In this paper, we borrow the idea of A3R from AutoML and apply it as the ranking basis for the approaches in an AS scenario. We replace \(d_i\) with \(s_i\) (the \(i_{th}\) scenario) and keep a, but interpret it as an approach in Eq. (11). For accuracy-based metrics like SUCC and MAP, we substitute their values ACC for SR in Eq. (8). For time-cost-based metrics TC, lower values denote higher accuracy, so the inverse ratio \({TC^{s_i}_{a_q} / TC^{s_i}_{a_p}}\) is used in the numerator instead. Since the run time cost spans several orders of magnitude, the \(M_{th}\) root is applied to the numerator for rescaling. In the following experiments, we utilize Eqs. (11) and (12) to evaluate the combined AS effect.
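The two A3R variants described above can be sketched as simple ratio functions; this is a minimal reading of the ratio descriptions in the text (function and argument names are our own, and the exact forms of Eqs. (11) and (12) live in the paper).

```python
def a3r_acc(acc_p, acc_q, t_p, t_q, n_root):
    """A3R for accuracy-type metrics (SUCC, MAP): the accuracy ratio
    of approach p over reference q, divided by the N-th root of
    their inference-time ratio (the shape of Eq. (8))."""
    return (acc_p / acc_q) / (t_p / t_q) ** (1.0 / n_root)

def a3r_cost(tc_p, tc_q, t_p, t_q, m_root, n_root):
    """Variant for time-cost metrics (MCP, PAR10; lower is better):
    the inverse cost ratio, M-th-rooted, replaces the accuracy ratio."""
    return (tc_q / tc_p) ** (1.0 / m_root) / (t_p / t_q) ** (1.0 / n_root)
```

By construction the reference approach always scores 1; an approach with the same accuracy but shorter inference time, or a lower time cost, scores above 1.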
Experiments
We design the experiments to study: (1) The algorithm selection effect of the proposed BLR approach compared with other benchmark approaches; (2) AS effect when taking both accuracy and inference time into consideration; (3) the benefits of expanding the candidates set selection range.
Datasets
In this paper, we focus on typical AS problems in the computational complexity domain. The Algorithm Selection Library (ASlib), released by the COnfiguration and SElection of ALgorithms (COSEAL)^{Footnote 3} research group, provides the most complete and standardized datasets for such tasks. In our experiments, we fetch the following scenarios from ASlib: ASPPOTASSCO, BNSL2016, CPMP2015, CSP2010, CSPMZN2013, CSPMinizincObj2016, GRAPHS2015, MAXSAT12PMS, MAXSAT15PMSINDU, PROTEUS2014, QBF2011, QBF2014, SAT11HAND, SAT11INDU, SAT11RAND, SAT12ALL, SAT12HAND, SAT12INDU, SAT12RAND, SAT15INDU and TSPLION2015. In all of these computationally complex AS scenarios, runtime is the main performance metric. In each scenario, the dataset comprises the algorithms’ performance values on the problem instances, together with the problem instances’ meta-feature values and run statuses. The standardized datasets make the experimental evaluation results comparable across many scenarios.
In each AS scenario from ASlib, we split the dataset into 10 folds and apply cross-validation on 9 of them to find the best hyperparameter setting for each approach. With the best hyperparameters selected, all approaches are trained again on the whole 9-fold dataset to obtain the fitted models. These models then perform inference on the remaining fold (the test set) for evaluation.
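A minimal sketch of this outer split (index handling only; the function name, seed and fold choice are illustrative, and the inner hyperparameter search over the nine folds is omitted):

```python
import numpy as np

def outer_split(n_instances, n_folds=10, seed=0):
    """Shuffle the instance indices, hold out one fold as the test
    set and keep the remaining nine folds for hyperparameter search
    and the final refit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_instances)
    folds = np.array_split(idx, n_folds)
    test = folds[-1]
    train = np.concatenate(folds[:-1])
    return train, test

train_idx, test_idx = outer_split(100)
```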
Performance of Bilinear L2R approach
We compare the AS effect of BLR with the other benchmark approaches under the four evaluation metrics introduced in the last section. For the BLR model, the latent dimension K, learning rate \(\eta \) and regularizer \(\lambda \) are the hyperparameters tuned during cross-validation. Since the optimization target of the BLR decomposition is not convex, the trained model is sensitive to the initialization of the entries of the latent matrices; thus the best initialization state is also determined in the cross-validation phase. To speed up the convergence of BLR, we use Stochastic Gradient Descent instead of Gradient Descent as the optimization method. Given the fluctuating loss values under Stochastic Gradient Descent, we declare convergence of the BLR model when at least 5 successive increases of the loss are detected during optimization.
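The convergence check described above can be sketched as a small helper; the function name and `patience` parameter are our own, a minimal reading of the "5 successive loss increases" heuristic.

```python
def converged(losses, patience=5):
    """Declare convergence once the (noisy SGD) loss has increased
    `patience` times in a row over the recorded loss history."""
    if len(losses) <= patience:
        return False
    tail = losses[-(patience + 1):]          # last patience+1 values
    return all(b > a for a, b in zip(tail, tail[1:]))
```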
BLR performance with TOP-1 candidate selection
First we apply the conventional TOP-1 candidate selection in the evaluation and observe under which circumstances BLR performs better. In Table 1, scenario and evaluation metric combinations are listed per row; these are the cases in which BLR ranks among the best 3 benchmark approaches. More specifically, BLR ranks first in CSPMinizincObj2016 and SAT15INDU regarding success rate, in PROTEUS2014 concerning MCP and PAR10, and in TSPLION2015 in terms of MAP. These competitive performances verify that BLR can also be considered a benchmark approach in some AS scenarios.
Cumulative performance in TOP-K expansion
If parallel processing of the candidate algorithms is considered, we can broaden the candidate selection range to increase the chance of finding the best algorithm without extra time consumption. Thus, if the cumulative best performances of the approaches decrease drastically within the first few predicted positions, it is sensible to consider a TOP-K expansion of the predicted list. We first observe the cumulative best performance as the TOP-K position increases in some scenarios. For SAT11HAND, PROTEUS2014 and MAXSAT12PMS, we visualize the cumulative minimum mean runtime of all approaches’ predicted lists in Fig. 6. In scenario SAT11HAND (left), though BLR (plotted with the bold green-yellow line) gives the worst recommendation at the TOP-1 position, it reaches the same optimal performance as one-hot random forest at position 4. Conversely, in scenario PROTEUS2014 (middle subplot), BLR finds the algorithm with the shortest runtime at position 4 and beats all other approaches, while it gradually loses its dominant role from position 3, where approaches like single best, separated XGBoost and separated random forest take over the dominant positions. In scenario MAXSAT12PMS, similar to SAT11HAND, the recommendation from BLR becomes the best at position 3, in spite of the worst average runtime of its predicted algorithm list at position 1.
BLR performance with expanded candidates selection
The cumulative best performance varies a lot even for a single AS approach, and thus the ranking of approaches also changes under different expansion degrees. For BLR, aside from the conventional TOP1 candidate selection criterion, we observe its rankings under TOP3 selection. In Table 2, we list the conditions (combinations of scenario and evaluation metric) under which BLR is evaluated as competitive (ranked in the top 3). BLR can still perform well in some specific scenarios. Compared with Table 1, only in scenarios GRAPHS2015 and TSP-LION2015 does BLR play a competitive role under both TOP1 and TOP3. In most scenarios, the advantage of BLR does not hold consistently between TOP1 and TOP3 candidate selection.
Accuracy and inference time tradeoff
With the evaluation metrics SUCC, MAP, MCP and PAR10, the accuracy of AS approaches can be assessed. Nevertheless, a shorter inference time is also preferred for an AS approach. As introduced in Sect. 3, A3R is a suitable metric for measuring the combined effect of accuracy and time, and we use it for this combined evaluation of the AS approaches in this experiment. To make the accuracy/time ratio comparable across all scenarios, the one-hot random forest regressor (the approach that wins in most scenarios) is taken as the reference approach (\(a_q\)) in the evaluation equation. It is drawn as the pink bar in the following figures, and the A3R value of this reference approach is always 1. All the accuracy metric values come from the TOP3 candidates setting.
For the precision-oriented accuracy metrics (SUCC and MAP), the accuracy ratio is proportional to the metric value of the selected approach. Thus the ACC value of \(a_q\) (the reference approach) is set as the denominator in the ratio \(ACC^{s_i}_{a_p}/ACC^{s_i}_{a_q}\) in Eq. (11). Considering that the inference times of different AS approaches span three to four orders of magnitude, the root parameter N is set to 30 in the experiment to keep A3R in a reasonable range. As Fig. 7 shows, when evaluating the approaches regarding both MAP and inference time using A3R, BLR (the light blue bar) outperforms all other benchmark approaches. BLR thus strikes a balance between model complexity and inference simplicity.
For the time-cost-oriented accuracy metrics (MCP and PAR10), the values are negatively correlated with prediction accuracy. The accuracy ratio \({TC^{s_i}_{a_q} / TC^{s_i}_{a_p}}\) therefore takes the metric value \(TC^{s_i}_{a_p}\) as the denominator. In addition, since the MCP and PAR10 values vary among approaches by orders of magnitude, a root parameter M is applied to this accuracy ratio as well to transform it into a readable range. As Fig. 8 shows, with the setting \(M=3\) and \(N=30\), BLR (the light blue bar) again beats the other benchmark approaches.
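The two A3R variants described above can be sketched as follows. This is our own reading of the ratio structure (accuracy or cost ratio divided by the N-th-rooted inference-time ratio); the function names are ours, and the exact formula is given by Eq. (11) in the paper:

```python
def a3r_precision(acc_p, acc_q, time_p, time_q, n=30):
    """A3R for precision-oriented metrics (SUCC, MAP): higher accuracy is
    better, so the candidate's accuracy sits in the numerator of the
    accuracy ratio. The N-th root damps inference-time ratios that span
    several orders of magnitude."""
    return (acc_p / acc_q) / (time_p / time_q) ** (1.0 / n)

def a3r_cost(tc_p, tc_q, time_p, time_q, m=3, n=30):
    """A3R for time-cost-oriented metrics (MCP, PAR10): lower cost is
    better, so the reference cost sits in the numerator, and an extra
    M-th root tames the wide spread of cost values."""
    return (tc_q / tc_p) ** (1.0 / m) / (time_p / time_q) ** (1.0 / n)

# The reference approach always scores A3R = 1 against itself.
print(a3r_precision(0.8, 0.8, 5.0, 5.0))  # 1.0
```

By construction, an approach that matches the reference in accuracy but infers faster scores above 1, which is the intended trade-off behavior.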
This excellent performance on A3R, which accounts for both precision-oriented and time-cost-oriented accuracy metrics, verifies that BLR is a good option when the balance between accuracy and inference time needs to be taken into account.
Benefit of expanding the candidate selection range from TOP1 to TOPK
As discussed in the former subsections, if we enlarge the range of algorithm candidates from TOP1 to TOPK, we can expect the algorithm selected from the wider spectrum to be closer to the optimum. In this experiment, we tentatively set \(K = 3\) and observe the difference in the cumulative evaluation result between the conditions \(K = 1\) and \(K = 3\). For every AS scenario, we list the approach with the largest performance difference between the TOP1 and TOP3 selection criteria and thus illustrate the benefit of the TOPK expansion. We choose the time-cost-oriented metrics MCP and PAR10 to represent the performance difference, considering their straightforward cumulative performance decrease along the TOPK positions.
Misclassification Penalty (MCP) calculates the time cost difference between the selected algorithm and the actual best algorithm; the lower the MCP value, the better the AS approach. As seen from Table 3, the TOP3 selection criterion decreases MCP significantly. We highlight decrease percentages higher than 90.00% in bold boxes in the table. The decrease percentage ranges from 55.78% to 100%, demonstrating that enlarging the TOPK candidate selection range helps find algorithms whose runtime is closer to the ground-truth best.
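The MCP definition above, evaluated under TOP-K selection, can be sketched in a few lines. The function names and the toy runtimes are ours for illustration:

```python
def mcp_topk(true_runtimes, predicted_order, k):
    """MCP under TOP-K selection: the candidate actually charged is the
    best algorithm among the first k predicted positions, and the penalty
    is its runtime minus the ground-truth best runtime."""
    selected = min(true_runtimes[i] for i in predicted_order[:k])
    return selected - min(true_runtimes)

runtimes = [30.0, 5.0, 100.0]   # ground-truth best runtime is 5.0
order = [2, 1, 0]               # a hypothetical predicted ranking
print(mcp_topk(runtimes, order, 1))  # 95.0 (TOP1 picks index 2)
print(mcp_topk(runtimes, order, 3))  # 0.0  (TOP3 reaches the optimum)
```

The toy example mirrors the effect reported in Table 3: widening k can only shrink MCP, since the minimum over a superset never increases.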
The evaluation metric PAR10 imposes a tenfold penalty when the recommended algorithm actually times out. We list the decrease percentages caused by the TOP3 candidate expansion in Table 4; they fall in the interval from 19.47% to 95.72%. The cases in which the percentage is higher than 90.00% are highlighted in bold boxes. This decrease indicates a reduced possibility that the selected algorithm runs into a timeout.
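The standard PAR10 score can be sketched as follows (the function name and the cutoff value are ours for illustration):

```python
def par10(runtime, cutoff):
    """PAR10: the runtime itself if the algorithm finishes within the
    cutoff, otherwise ten times the cutoff as a timeout penalty."""
    return runtime if runtime <= cutoff else 10 * cutoff

print(par10(500.0, 3600.0))   # 500.0  (finished within the cutoff)
print(par10(3700.0, 3600.0))  # 36000.0 (timed out: 10x cutoff)
```

Because a single timeout contributes ten cutoffs to the score, avoiding even one timeout via TOP-K expansion can produce the large percentage drops reported in Table 4.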
The significant decrease in the time cost metrics MCP and PAR10 when expanding the TOP1 candidate set to TOP3 confirms the benefit of the expansion. In AS, under a parallel testing environment, the test on the TOPK candidates stops at the runtime of the optimal algorithm in the candidate set, so test time is also saved owing to the expansion. The selection of K depends on the computational power and environmental limits. Though the TOP1 setting is required in most AS challenges, we would like to suggest expanding this candidate selection range.
Discussion
The experiments in this section unveil several interesting points: (1) BLR can outperform the other benchmark AS approaches in some scenarios; (2) on the evaluation metric A3R, BLR shows its power in balancing prediction accuracy and inference time; (3) TOPK expansion of the candidate set brings benefits for finding the optimal algorithm.
Conclusion and future work
In this paper, we propose Bilinear Learning to Rank (BLR) to solve the AS problem. BLR is inspired by collaborative filtering in RS. With the listwise top-one probability assumption, it models the uncertainty in algorithm performance. The learning process of BLR averts problems such as multi-model training and the algorithms' one-hot conversion in traditional AS benchmark approaches. Compared with the benchmark approaches, the selection effect of BLR has proven to perform well in some AS scenarios. Considering the trade-off between accuracy and inference time in the evaluation, we propose using A3R as the evaluation protocol; BLR performs especially well on this trade-off metric. Finally, we affirm the benefit of expanding the selection range of candidate approaches from TOP1 to TOPK regarding the cumulative optimal demand of AS evaluation.
Given the work so far, there is much to do in the future. For BLR, since it is a model with a non-convex loss, the convergence criteria can be adjusted to tune better parameter settings. In the current experimental setting, we investigate only 21 AS scenarios; extending the experiments to additional scenarios would strengthen confidence in the experimental results. Though we set K in the TOPK expansion to 3 and illustrate the expansion benefit, a more thorough study can be done on how to choose K to balance performance gain and computational power.
References
 1.
Rice, J.R.: The Algorithm Selection Problem. Volume 15 of Advances in Computers, pp. 65–118. Elsevier, Amsterdam (1976)
 2.
Kerschke, P., Hoos, H.H., Neumann, F., Trautmann, H.: Automated algorithm selection: survey and perspectives. Evolut. Comput. 27, 3–45 (2019)
 3.
Brazdil, P.B., Soares, C.: A comparison of ranking methods for classification algorithm selection. In: Machine Learning: ECML, pp. 63–75, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg (2000)
 4.
Abdulrahman, S.M., Brazdil, P., van Rijn, J.N., Vanschoren, J.: Speeding up algorithm selection using average ranking and active testing by introducing runtime. Mach. Learn. 107, 79–108 (2018)
 5.
Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: portfolio-based algorithm selection for SAT. Computing Research Repository (CoRR), arXiv:1111.2249 (2011)
 6.
Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: portfolio-based algorithm selection for SAT. J. Artif. Intell. Res. 32, 565–606 (2008)
 7.
Gonard, F., Schoenauer, M., Sebag, M.: Asap.v2 and asap.v3: sequential optimization of an algorithm selector and a scheduler. In: Lindauer, M., van Rijn, J.N., Kotthoff, L. (eds.), Proceedings of the Open Algorithm Selection Challenge, volume 79 of Proceedings of Machine Learning Research, pages 8–11, Brussels, Belgium, (2017). PMLR
 8.
Doan, T., Kalita, J.: Algorithm selection using performance and run time behavior. In: Dichev, C., Agre, G. (eds.) Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2016), pp. 3–13, Cham, Switzerland (2016). Springer International Publishing
 9.
Misir, M., Sebag, M.: Algorithm selection as a collaborative filtering problem. Research report, December (2013)
 10.
Liu, J.H., Zhou, T., Zhang, Z.K., Yang, Z., Liu, C., Li, W.M.: Promoting cold-start items in recommender systems. PLoS ONE 9, 1–13 (2014)
 11.
Kotthoff, L.: LLAMA: leveraging learning to automatically manage algorithms. Computing Research Repository (CoRR) arXiv:1306.1031 (2013)
 12.
Kadioglu, S., Malitsky, Y., Sellmann, M., Tierney, K.: ISAC—instance-specific algorithm configuration. In: Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence, pp. 751–756, Amsterdam, Netherlands (2010). IOS Press
 13.
Kadioglu, S., Malitsky, Y., Sabharwal, A., Samulowitz, H., Sellmann, M.: Algorithm selection and scheduling. In: Lee, J. (ed), Principles and Practice of Constraint Programming—CP 2011, pages 454–469, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg
 14.
Lindauer, M.T., Hoos, H.H., Hutter, F., Schaub, T.: AutoFolio: an automatically configured algorithm selector. J. Artif. Intell. Res. 53, 745–778 (2015)
 15.
Lindauer, M., Hutter, F., Hoos, H.H., Schaub, T.: AutoFolio: an automatically configured algorithm selector (extended abstract). In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19–25, 2017, pp. 5025–5029 (2017)
 16.
Kotthoff, L., Thornton, C., Hoos, H.H., Hutter, F., Leyton-Brown, K.: Auto-WEKA 2.0: automatic model selection and hyperparameter optimization in WEKA. J. Mach. Learn. Res. (JMLR) 18, 826–830 (2017)
 17.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., (2015)
 18.
Lin, B.L., Sun, X., Salous, S.: Solving travelling salesman problem with an improved hybrid genetic algorithm. J. Comput. Commun. 4, 98–106 (2016)
 19.
Wang, S., Hu, L., Cao, L.: Perceiving the next choice with comprehensive transaction embeddings for online recommendation. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 285–302, Cham, Switzerland (2017). Springer International Publishing
 20.
Wang, S., Hu, L., Wang, Y., Sheng, Q.Z., Orgun, M., Cao, L.: Modeling multi-purpose sessions for next-item recommendations via mixture-channel purpose routing networks. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3771–3777. International Joint Conferences on Artificial Intelligence Organization, July (2019)
 21.
Stern, D.H., Herbrich, R., Graepel, T.: Matchbox: large scale online Bayesian recommendations. In: Proceedings of the 18th International Conference on World Wide Web, WWW '09, pp. 111–120, New York, NY, USA (2009). ACM
 22.
Stern, D., Herbrich, R., Graepel, T., Samulowitz, H., Pulina, L., Tacchella, A.: Collaborative expert portfolio management. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI'10, pp. 179–184. AAAI Press (2010)
 23.
Misir, M., Sebag, M.: Alors: an algorithm recommender system. Artif. Intell. 244, 291–314 (2017)
 24.
Yang, C., Akimoto, Y., Kim, D.W., Udell, M.: OBOE: collaborative filtering for AutoML model selection. In: Proceedings of the Twenty-Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1173–1183. Association for Computing Machinery (2019)
 25.
Wang, X., Bendersky, M., Metzler, D., Najork, M.: Learning to rank with selection bias in personal search. In: Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124 (2016)
 26.
Joachims, T., Swaminathan, A., Schnabel, T.: Unbiased learning-to-rank with biased feedback. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, pp. 781–789, New York, NY, USA (2017). ACM
 27.
Abdollahpouri, H., Burke, R., Mobasher, B.: Controlling popularity bias in learning-to-rank recommendation. In: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17, pp. 42–46, New York, NY, USA (2017). ACM
 28.
Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th International Conference on Machine Learning, ICML '07, pp. 129–136, New York, NY, USA (2007). ACM
 29.
Oentaryo, R.J., Handoko, S.D., Lau, H.C.: Algorithm selection via ranking. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pp. 1826–1832. AAAI Press (2015)
 30.
Luo, M., Li, C.M., Xiao, F., Manyà, F., Lü, Z.: An effective learnt clause minimization approach for CDCL SAT solvers. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 703–711 (2017)
 31.
Lee, N.Z., Wang, Y.S., Jiang, J.H.R.: Solving stochastic Boolean satisfiability under random-exist quantification. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 688–694 (2017)
 32.
Mordido, A., Caleiro, C., Casal, F.: Classical generalized probabilistic satisfiability. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 908–914 (2017)
 33.
Bacchus, F., Hyttinen, A., Järvisalo, M., Saikko, P.: Reduced cost fixing in MaxSAT. In: Principles and Practice of Constraint Programming—23rd International Conference, CP 2017, Melbourne, VIC, Australia, pp. 641–651 (2017)
 34.
Gomes, C.P., Selman, B.: Algorithm portfolios. Artif. Intell. 126, 43–62 (2001)
 35.
Gomes, C.P., Selman, B.: Algorithm portfolio design: theory vs. practice. Computing Research Repository (CoRR), arXiv:1302.1541 (2013)
 36.
Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y.: A portfolio approach to algorithm selection. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI'03, pp. 1542–1543, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc (2003)
 37.
Kotthoff, L., Hurley, B., O’Sullivan, B.: The ICON challenge on algorithm selection. AI Mag. 38, 91–93 (2017)
 38.
Bischl, B., Kerschke, P., Kotthoff, L., Lindauer, M.T., Malitsky, Y., Fréchette, A., Hoos, H.H., Hutter, F., Leyton-Brown, K., Tierney, K., Vanschoren, J.: ASlib: a benchmark library for algorithm selection. Computing Research Repository (CoRR), arXiv:1506.02465 (2015)
 39.
Abdulrahman, S., Brazdil, P.: Measures for combining accuracy and time for metalearning. CEUR Workshop Proc. 1201, 49–50 (2014)
 40.
Abdulrahman, S.M., Brazdil, P., Wan Zainon, W.M.N., Adamu, A.: Simplifying the algorithm selection using reduction of ranking of classification algorithms. In: Proceedings of the 2019 8th International Conference on Software and Computer Applications, pp. 140–148. ACM (2019)
Acknowledgements
This work is supported in part by the German Federal Ministry of Education and Research (BMBF) under the grant number 01IS16046.
Funding
Open Access funding provided by Projekt DEAL.
Author information
Affiliations
Corresponding author
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Appendix
Terminologies declaration
See Table 5.
Notations of Bilinear L2R
See Table 6.
Gradient and updating rules in Bilinear L2R
In the BLR model, to approximate the weighting matrix \(\mathbf {W}\) and the latent matrix \(\mathbf {V}\) that minimize the loss function defined in Sect. 2, we calculate the gradient of the loss function and use the updating rules described in the following subsections.
Gradient calculation
Knowing the loss function L, to use gradient descent as the optimizer, the gradients with respect to the metafeature mapping weight matrix \(\mathbf {W}\) and the algorithm latent vector matrix \(\mathbf {V}\) must be provided. Since the loss function is defined layer by layer through the model function, converter function, top-one probability function and cross-entropy function, we apply the chain rule to calculate the gradient. For L, its partial derivatives with respect to \(w_{l,k}\) and \(v_{n,k}\) can be factorized as in Eqs. (13) and (14), respectively.
For each \(L_{m,n}\), the intermediate calculation steps of the derivative according to the chain rule can be derived as follows:
For the last step, which returns the partial derivative of \({{\hat{s}}}_{m,n}\) with respect to \(w_{l,k}\) and \(v_{n,k}\), we can express it in vectorized form as in Eqs. (18) and (19):
Updating rule
Knowing the partial derivatives from the chain rule, we can update the weight matrix \(\mathbf {W}\) and the algorithm latent matrix \(\mathbf {V}\) by the updating rules in Eqs. (20) and (21), where \(\eta \) is the learning rate.
Since the loss is listwise, for each problem instance there is a loss term based on the corresponding top-one probability. If we would like to update the weights in a stochastic way, the updating unit should be the list belonging to problem instance m, rather than each individual rating point. The stochastic updating rules are therefore given by Eqs. (22) and (23):
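The listwise stochastic update described above can be sketched in NumPy. This is our own minimal reading of the scheme, with toy dimensions and random data: scores come from the bilinear model \(\hat{S} = X\mathbf{W}\mathbf{V}^\top\), the converter is taken to be the exponential (so the top-one probabilities form a softmax), and one problem instance is processed per gradient step, as in Eqs. (22) and (23):

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, N, K = 6, 4, 5, 3            # instances, metafeatures, algorithms, latent dim
X = rng.normal(size=(M, L))        # metafeature matrix (toy data)
S = rng.normal(size=(M, N))        # observed performance scores (toy data)
W = rng.normal(scale=0.1, size=(L, K))
V = rng.normal(scale=0.1, size=(N, K))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def listwise_loss(X, S, W, V):
    # cross entropy between target and predicted top-one probabilities
    p_hat = softmax(X @ W @ V.T)
    p = softmax(S)
    return float(-(p * np.log(p_hat + 1e-12)).sum())

eta = 0.05                          # learning rate, eta in Eqs. (20)-(23)
loss_before = listwise_loss(X, S, W, V)
for _ in range(200):
    for m in range(M):              # listwise SGD: one problem instance per step
        h = X[m] @ W                # instance embedding
        g = softmax(h @ V.T) - softmax(S[m])  # d(cross entropy)/d(scores)
        grad_W = np.outer(X[m], g @ V)        # chain rule through the bilinear form
        grad_V = np.outer(g, h)
        W -= eta * grad_W
        V -= eta * grad_V
loss_after = listwise_loss(X, S, W, V)
```

The gradient uses the standard softmax/cross-entropy identity (predicted minus target probabilities), which is why the chain through converter, top-one probability and cross entropy collapses to a single term per score.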
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yuan, J., Geissler, C., Shao, W. et al. When algorithm selection meets Bilinear Learning to Rank: accuracy and inference time trade off with candidates expansion. Int J Data Sci Anal (2020). https://doi.org/10.1007/s41060-020-00229-x
Received:
Accepted:
Published:
Keywords
 Algorithm selection
 Bilinear Learning to Rank
 Multi-objective evaluation
 Candidates expansion