When algorithm selection meets Bi-linear Learning to Rank: accuracy and inference time trade-off with candidate expansion

Algorithm selection (AS) tasks are dedicated to finding the optimal algorithm for an unseen problem instance. With the knowledge of problem instances' meta-features and algorithms' landmark performances, Machine Learning (ML) approaches are applied to solve AS problems. However, the standard training process of benchmark ML approaches in AS either needs to train a model specifically for every algorithm or relies on a sparse one-hot encoding as the algorithms' representation. To skip these intermediate steps and form the mapping function directly, we borrow the learning to rank framework from the Recommender System (RS) domain and embed bi-linear factorization to model the algorithms' performances in AS. This Bi-linear Learning to Rank (BLR) proves to work competently in some AS scenarios and is therefore also proposed as a benchmark approach. From the evaluation perspective, precisely predicting the performance is usually the measuring goal in modern AS challenges. Though an approach's inference time also counts toward the running time cost, it is usually overlooked in the evaluation process. The multi-objective evaluation metric Adjusted Ratio of Root Ratios (A3R) is therefore advocated in this paper to balance the trade-off between accuracy and inference time in AS. Concerning A3R, BLR outperforms other benchmarks when expanding the candidate range to TOP3. The benefit of this candidate expansion results from the cumulative optimum performance during the AS process. We take a further step in the experimentation to demonstrate the advantage of such TOPK expansion, and illustrate that it can be considered a supplement to the convention of TOP1 selection during the evaluation process.


Introduction
In the Algorithm Selection domain, for scenarios like computational complexity and machine learning, the number of problem instances can be infinite, while a bunch of new algorithms are created for solving problem instances every year. In a specific scenario, the performance of an algorithm varies a lot across problem instances, and thus correctly foretelling the performances of algorithms on problem instances is critical for finding a good algorithm. The research problem of how to effectively select a good algorithm for a specific problem instance was raised in 1975 by Rice [1]. A per-instance Algorithm Selection (AS) problem can be formulated as A × I → R, where A = {A_1, A_2, ..., A_n} represents the set of all available algorithms in a scenario, and I denotes the set of problem instances in this scenario. The performances that algorithms achieve on the problem instances are embedded in the space R. Using brute force to traverse all the algorithms tells the exact performance and helps select the best algorithm precisely, but it is often time-consuming. In order to speed up the algorithm selection process in the formulated problem, AS approaches need to sacrifice the guarantee of returning the absolute perfect algorithm and instead strive to find algorithms that perform as close as possible to the perfect ones on the instance set I [2].
Single Best or Average Rank makes use of the landmark features, i.e., performance values of algorithms on the problem instances [3,4]. They pick only one well-performing algorithm as the suggestion for all the problem instances in a scenario. However, choosing a single algorithm for all the unseen problem instances is not always a good way; it is possible that one algorithm performs better on many problem instances, but dramatically worse on a minority of others [5]. For the sake of increasing the coverage of well-solved instances, per-instance algorithm selection has been proposed. It creates the possibility that every problem instance is treated individually and obtains its own optimal algorithm, thereby increasing the selection effect. For example, the Propositional Satisfiability Problem (SAT) has commonly been solved by Machine Learning (ML) approaches. It is one of the most fundamental problems in computer science, and many other NP-complete problems can be converted into SAT and solved by SAT solvers [5]. Thus algorithm competitions for SAT problems are held every year in the community 1. The frequent competition winner SATzilla uses ML to build an empirical hardness model to serve as the basis for an algorithm portfolio. The model forms a computationally inexpensive predictor based on the features of the instance and the algorithms' past performances [5,6]. The strategy of running the algorithm portfolio can be sequential, parallel, or a combination of the two [5,7]. Though running time is sacrificed, especially in the sequential case, the solved ratio is increased.
When formulating the AS problem from the view of ML, it can be abstracted into different models [8][9][10]. More specifically: SATzilla* applies pair-wise performance prediction with random forest classifiers [5,6]; LLAMA creates a multi-class classification model to attribute a problem instance with meta-features to an algorithm class [11]; ISAC aggregates similar training instances into subsets via clustering or k-NN and finds the best algorithm on a per-set basis for a new problem instance [12,13]. Facing multiple ML-based AS approaches, AutoFolio realizes the process of locating the best AS approach in the combined search space [14,15]. Similarly, in ML scenarios, automatically selecting a proper algorithm and its hyper-parameter configuration for a specific dataset is the main purpose. AutoML tools like AutoWeka [16] and AutoSklearn [17] are quite popular for the algorithm and hyper-parameter space search.
A typical AS problem can be represented as in Fig. 1. Meta-features of problem instances are fully given as the matrix on the left-hand side, while performances of solvers (algorithms) applied to known problem instances form the performance matrix on the right-hand side. The mapping function from meta-features to performances is expected to be learned. Given a new problem instance, the performance prediction relies fully on the meta-feature vector. This full reliance makes the prediction task in AS similar to the cold-start condition in a Recommender System (RS). Looking at the blocks with stars in Fig. 1 from the view of RS, the problem instance meta-feature input can be understood as the usual user profiling features like age, working field, preference category, etc., and the performance matrix can be associated with the user rating or implicit feedback matrix in RS. Therefore, the approaches used in RS are also applicable to the AS problem. The terminologies used in AS, ML and RS occasionally overlap, so we distinguish them in Table 5 in "Appendix A.1" to avoid misunderstanding. The examples inside the table mainly come from the definitions in the work on TSP solvers by Bao et al. [18].
When applying RS approaches to AS problems, we need to note that the recorded algorithm performances on problem instances are usually much smaller in size. Hence the state-of-the-art deep learning and transaction embedding techniques in large-scale session-based RS [19,20] are not suitable for AS scenarios. On the contrary, shallow ML approaches from RS are more adaptable. Since 2010, Stern et al. have applied Bi-linear Matrix Factorization (originally designed for RS) in AS scenarios and obtained good results [21,22]. Thereafter, many researchers tried approaches from RS to solve AS tasks. Misir and Sebag created the Alors AS system, which utilizes a random forest to map meta-features of problem instances onto a latent feature space; based on these latent features, its Collaborative Filtering (CF) makes the algorithm recommendation [9,23]. Yang et al. proposed Principal Component Analysis to actively decompose the performance matrix to solve the sparse performance entries problem for new problem instances [24].
Learning to Rank (L2R), a famous RS framework, has been proposed to learn the prediction model from the ranking of the recommended list [25][26][27] and is also applicable in AS. As summarized in [28], L2R methods are usually divided into three groups: point-wise, pair-wise and list-wise. Point-wise L2R is designed for labeled ranks, so multi-class classification ML models can be used. Pair-wise L2R works well for recommendation with a large number of candidate items: owing to pair sampling from the lengthy candidate list, time cost can be saved during learning. List-wise L2R creates the loss function through the cross entropy between the ground truth list and the predicted list. In [29], the authors utilized the sigmoid function as a ranking surrogate to tell the algorithms' pair-wise performance order; the surrogate embeds a polynomial scoring model function to produce the probability. However, pair-wise L2R costs extra during the pair-wise sampling phase, and list-wise L2R is preferable for shorter candidate lists. To model the uncertainty of the performance ranking, we apply the list-wise L2R framework to the proposed model solving AS problems.

Fig. 1 Algorithm Selection as Recommender System: we need to learn a model which maps the given meta-feature matrix (of size M × L) to the performance matrix (of size M × N). The inference for a newly arriving problem instance in the same scenario is thus like making a recommendation for a user in the cold-start phase of a Recommender System (RS)
In exchange for speeding up the algorithm selection process, AS approaches sacrifice performance prediction accuracy to some extent. For every AS scenario, an Oracle or Virtual Best Solver (VBS) is assumed to know the best-performing algorithm for every instance. Reducing the gap between a proposed AS approach and the VBS is one of the goals when assessing a new AS approach. In this paper, we mainly deal with AS problems in computational complexity scenarios like SAT, the Maximum Satisfiability Problem (MAXSAT), Constraint Satisfaction Problems (CSP), Quantified Boolean Formula (QBF) and Answer Set Programming (ASP) [30][31][32][33]. In these scenarios, runtime is the performance indicator for all candidate algorithms. The additional runtime cost and the solved ratio of the predicted optimal algorithm are the main effect measurements for AS approaches [34][35][36][37]. Aside from accuracy-oriented evaluation metrics, the inference time of AS approaches can span many orders of magnitude and thus also needs to be taken as a trade-off factor in the evaluation. Nevertheless, inference time is usually overlooked in algorithm evaluation.
From the view of modeling, evaluation, and candidate selection while applying RS approaches to AS problems, there are still some open research questions: (1) If both problem meta-features and algorithm performance information are utilized for modeling, multi-model training or one-hot encoding is usually unavoidable in benchmark approaches; can a model skip these intermediate steps and create the mapping directly? (2) During the evaluation process, the inference time of a specific AS approach is usually ignored. When both prediction accuracy and inference time are taken into account, how should the AS effect be balanced? (3) In most AS challenges [37,38], only the predicted optimal algorithm is chosen for the evaluation. This narrows the candidate set and reduces the chance of finding the actual optimal algorithm; can a proper expansion of the candidate set benefit the AS effect through the cumulative optimal algorithm? In order to address these research problems, we construct the following studies in this paper: (1) We propose Bi-linear Learning to Rank (BLR) to include both problem instance meta-features and the performance matrix in one L2R framework. The mapping matrices W and V in the model create the mapping from meta-features to the performance matrix in a straightforward way, avoiding the multi-model training or algorithm one-hot encoding that other benchmark approaches rely on. The probabilistic assumption on the ranking models the randomness of the performance values in the algorithm-problem instance interaction matrix. We illustrate the good performance of BLR compared with other benchmark AS approaches in the experiments. (2) Adjusted Ratio of Root Ratios (A3R) was proposed as a ranking measure for algorithms in ML meta-learning; it incorporates both an accuracy-oriented metric and a time cost metric into one evaluation measurement.
We apply A3R as the evaluation metric for general AS tasks, in order to balance accuracy and inference time when measuring AS approaches. Measured with A3R, BLR outperforms other approaches in terms of this trade-off. (3) While observing the cumulative optimal performance, we find that AS approaches usually converge to a good performance when the K setting goes from 1 to 3 or 5.
Though TOP1 candidate selection is still used in many AS challenges, we advocate expanding this candidate selection spectrum from TOP1 to TOPK (K depends on the available computational power). The error decrease detected in the experiments confirms the benefits of such expansion.
The rest of the paper is structured as follows: basic methodologies, benchmark approaches and the concrete modeling steps of BLR are introduced in Sect. 2. In Sect. 3, we first list the evaluation metrics frequently used in AS tasks and then introduce A3R as the trade-off metric for accuracy and inference time. Section 4 presents the experiment design and the results. Finally, Sect. 5 draws the conclusion and gives an outlook on future work.

Methodologies
In AS, regarding one problem instance, the prediction targets are the performances of multiple algorithms, instead of a single label or numerical value. To solve this multi-target prediction task, there are three ways to design AS approaches: (1) relying on statistics of algorithms' historical performances; (2) algorithm performance separation: building a predicting model for each algorithm individually and running all the fitted models during inference; (3) algorithm indicators' one-hot conversion: horizontally concatenating the problem instance meta-feature matrix and the algorithm appearance one-hot matrix to form the input for a general prediction function. In this section, we first introduce the benchmark approaches which follow these three ways of design. Subsequently, we propose our own approach, Bi-linear Learning to Rank (BLR), which needs neither multi-model training nor one-hot conversion to complete the AS model creation.

Benchmark approaches
Targeting diverse AS scenarios, some well-performing benchmark approaches have already been proposed. 2 We separate these benchmark approaches into three groups according to the data transformation ways mentioned above.

Performances' statistics
Virtual Best Selector and Single Best are two traditional benchmark approaches in AS. They don't rely on any Machine Learning (ML) model assumption over meta-features, but come from performance statistics instead.

- Virtual Best Selector is the ground truth of the algorithm performances. The ranking of algorithms in VBS is the true rank used to compare with the predicted list. The evaluation of the VBS list is the upper bound for all other AS approaches.
- Single Best is the most classical algorithm selection baseline. It selects the algorithm whose mean performance is the best across all the problem instances in the training set.
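The two statistics-based baselines above can be sketched in a few lines. The performance matrix below is a hypothetical toy example (runtimes, lower is better), not data from any ASLib scenario:

```python
import numpy as np

# Hypothetical performance matrix: rows = problem instances, columns = algorithms.
# Entries are runtimes, so lower is better.
perf = np.array([
    [1.2, 0.8, 3.0],
    [0.5, 2.0, 0.7],
    [4.0, 1.0, 2.5],
])

# Virtual Best Selector: the per-instance ground-truth optimum (an oracle, not a model).
vbs_choice = perf.argmin(axis=1)     # best algorithm index per instance
vbs_runtime = perf.min(axis=1)       # oracle runtime per instance

# Single Best: the one algorithm with the best mean performance over the training set,
# then suggested for every instance.
single_best = perf.mean(axis=0).argmin()
```

VBS serves as the evaluation upper bound, while Single Best is the floor any learned selector should beat.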

Algorithm-based separated learning
The algorithm-based separated learning process is explained in Fig. 2. For each algorithm, a single prediction model is trained based on the problem instances' meta-features and that algorithm's performances. When a new problem instance shows up, N prediction models are used to infer the performances of the N algorithms separately. The following AS approaches adopt the algorithm-based separated learning process. Despite the model specialization in this group of approaches, long inference time is its main disadvantage.
- Separated Linear Regressors train linear regressors for the candidate algorithms separately. When a new problem instance must be handled, the performance prediction for all algorithms depends on all the fitted linear models.
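A minimal sketch of this per-algorithm training scheme, using randomly generated toy matrices (the sizes M, L, N are assumptions for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
M, L, N = 50, 5, 3              # toy sizes: instances, meta-features, algorithms
X = rng.normal(size=(M, L))     # meta-feature matrix
S = rng.normal(size=(M, N))     # performance matrix (e.g. runtimes)

# One regressor per algorithm: column n of S is the target of model n.
models = [LinearRegression().fit(X, S[:, n]) for n in range(N)]

# Inference for a new instance requires running all N fitted models,
# which is the main time cost of this design.
x_new = rng.normal(size=(1, L))
preds = np.array([m.predict(x_new)[0] for m in models])
best = preds.argmin()           # lower predicted runtime is better
```

With hundreds of candidate algorithms, both training and inference scale linearly in N, which motivates the single-model alternatives below.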

Algorithms one-hot conversion
Another group of AS approaches applies a one-hot conversion of the algorithm appearance indicator to form the new AS input. For a new problem instance, the concatenation of the problem instance meta-feature vector and the algorithm indicator vector forms the input for the prediction model. Figure 3 represents the conversion process. Though a single model brings simplicity, the one-hot conversion creates extra sparsity in the data. The AS approaches following this conversion rule include:

- One-hot Linear Regressor trains one linear predicting model with the flattened representation formed from the combination of problem instance meta-features and algorithm appearance indicators. Only one linear model is applied during inference for new problem instances.
- One-hot RF Regressor takes each entry in the performance matrix as the regression target; with the L + N dimensional features, only one RF model is needed. The model can infer any algorithm's performance given its one-hot encoded appearance indicator.
- One-hot XGBoost fits a single XGBoost model with M × N training samples; this XGBoost model is applicable for performance inference for all the algorithms.
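The data reshaping shared by all three one-hot variants can be sketched as follows; the tiny X and S matrices are invented purely to make the shapes visible:

```python
import numpy as np

M, L, N = 4, 3, 2                # toy sizes: instances, meta-features, algorithms
X = np.arange(M * L, dtype=float).reshape(M, L)   # meta-feature matrix
S = np.ones((M, N))                               # performance matrix

# Each (instance, algorithm) pair becomes one training row:
# [meta-features | one-hot algorithm indicator] -> one performance entry.
eye = np.eye(N)
rows = [np.concatenate([X[m], eye[n]]) for m in range(M) for n in range(N)]
X_onehot = np.stack(rows)        # shape (M*N, L+N): M*N samples, L+N features
y = S.reshape(-1)                # shape (M*N,): regression targets
```

Any single regressor (linear, RF, XGBoost) can then be fit on `(X_onehot, y)`; the price is N sparse indicator columns appended to every sample.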

Bi-linear L2R
There are two matrices with known entries in AS scenarios: the problem instance meta-feature matrix X and the algorithm-problem instance performance matrix S. The benchmark approaches mentioned above solve the mapping from X to S via either multi-model training (time consuming) or one-hot conversion of the algorithm indicators (which sparsifies the dataset). To avoid multi-model training and one-hot feature conversion, we propose Bi-linear Learning to Rank (BLR) to create the AS strategies. Given the bi-linear assumption, the factorization of the mapping from X to S is represented in Fig. 4, and the performance inference on new problem instances is depicted in Fig. 5. With the help of the two mapping latent matrices W and V, an entry of the performance matrix s_{m,n} can be calculated as X_{m,:} · W · V_{:,n}. Therefore, the model parameters to be learned are the matrices W and V; there is no need to train specific models individually for different algorithms. Owing to the exact index mapping, the dense latent matrices directly contribute to the entries of the performance matrix, so the sparse one-hot encoding is not needed at inference time.
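The bi-linear prediction step reduces to two matrix products. The sketch below uses random matrices standing in for the learned W and V (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, L, N, K = 6, 4, 3, 2         # instances, meta-features, algorithms, latent dims
X = rng.normal(size=(M, L))     # known meta-feature matrix
W = rng.normal(size=(L, K))     # learned mapping: meta-features -> latent space
V = rng.normal(size=(K, N))     # learned latent representation of the algorithms

# Predicted performance matrix: S_hat = X @ W @ V.
# A single entry s_hat[m, n] is X[m, :] @ W @ V[:, n].
S_hat = X @ W @ V

# Cold-start inference: a new instance needs only its meta-feature vector;
# one dot-product chain yields predictions for all N algorithms at once.
x_new = rng.normal(size=L)
pred_all_algorithms = x_new @ W @ V
```

Note how inference costs a single (L × K)(K × N) chain regardless of N, in contrast to the N separate models of the previous subsection.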
Uncertainty always exists in algorithms' performances. For computational complexity problems like SAT and the Traveling Salesman Problem (TSP), an algorithm's runtime performance can differ when the specific running environment changes. For ML-based problems, the measured accuracy can also differ when the cross-validation setting changes. Under the bi-linear performance factorization assumption, we model the ranking of algorithms w.r.t. a specific problem instance in a probabilistic fashion. We assume the probability that an algorithm is ranked TOP1 for a problem instance is proportional to its performance (or predicted performance) among all the algorithms.

Fig. 4 Bi-linear factorization graph given the known matrices (problem instance meta-feature matrix X and performance matrix S, in blue). W (in yellow) is assumed to be the weighted mapping matrix for the input X, projecting X onto the intermediate left latent matrix U with K latent dimensions for M problem instances. The dot product of the intermediate left latent matrix U and the right latent matrix V (in yellow) yields the performance matrix S (in blue, known entries in the training set). Aside from the known matrices and intermediate matrices, the unknown matrices W and V in yellow are what is estimated during the training process

Fig. 5 Algorithm selection as a cold-start problem under bi-linear decomposition. W and V are the decomposed matrices after bi-linear factorization of the problem instance meta-feature matrix and the performance matrix. A problem instance is introduced into a scenario with only its own meta-feature vector (on the left, in blue), without any algorithm performance record. The continuous dot product of this meta-feature vector and the learned matrices W, V yields the full performance vector (on the right, in green) for this new problem instance
The cross entropy between the ground truth TOP1 probability vector P_{r_m}(r_{m,n}) and the predicted TOP1 probability vector P_{\hat{r}_m}(\hat{r}_{m,n}) (where r is the converted value of a performance value s) defines the loss and drives the optimization strategy.
Embedding bi-linear factorization in the L2R framework is the full idea of BLR. We define the notation for BLR in Table 6 in "Appendix A.2". The modeling and learning of BLR is structured in four steps: (1) the performance scoring model function and the corresponding rating converting function; (2) the loss function considering the ranking loss; (3) the gradient function for the corresponding weights; and (4) the updating rule of the weights according to the specific optimization approach. The first two steps are introduced in this section, while the gradient function and updating rules are explained in "Appendix A.3.1 and A.3.2" respectively.

Model function
In BLR, given the problem instance m and algorithm n, we predict the performance score as ŝ_{m,n} in Eq. (1):

ŝ_{m,n} = X_{m,:} · W · V_{:,n}    (1)

The preferred sorting order of performance values depends on the choice of target performance. For example, if runtime is the performance metric, a lower value is better; if accuracy is the target performance metric, a higher value is preferred. For simplicity of calculating the list-wise ranking loss, we set a converting function r = f(s) to make descending order preferable for all rating values r. The converted rating value r is the optimization unit in the ranking loss function. In this paper, we simply define f(s) as Eq. (2):

f(s) = { s,   if a higher performance value is preferred
       { −s,  if a lower performance value is preferred    (2)

List-wise loss function
Assuming that the performance scores of all algorithms on a specific problem instance carry measuring noise, we model the probability of an algorithm being ranked top-one as proportional to its normalized measured performance value. This normalized top-one probability representation has been proposed in the L2R domain to model the list-wise ranking loss [28]. For a single problem instance, the top-one probability of the same algorithm differs between the ground truth performance list and the predicted performance list. As defined in Eq. (3), for a problem instance m with rating vector r_m (the converted version of the performance vector), the top-one probability of each algorithm n is normalized in the form P_{r_m}.
To make the probability distribution concentrate around the positions of the largest input values, the exponential function is applied as the concrete form of the monotonically increasing function ϕ in Eq. (3). Thus P_{r_m} can be represented as Eq. (4), which has the same shape as the Softmax function.
To represent the list-wise ranking loss per problem instance, the cross entropy is calculated between the top-one probabilities from the predicted rating list r̂_m and the ground truth rating list r_m. For each problem instance m, the point-wise loss for algorithm n is formulated as Eq. (5). Since the probability normalization is calculated on the same scale for a problem instance m, the per-instance list-wise loss L_m is defined as the summation of the point-wise losses inside this list, as shown in Eq. (6). Here L_m is the list-wise ranking loss between the ground truth list and the predicted list. The total loss over all M problem instances is defined in Eq. (7), in which L2 regularization is applied to avoid over-fitting.
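Since the displayed equations did not survive extraction, the following is a reconstruction of Eqs. (3)-(7), consistent with the ListNet-style top-one probability loss described above [28]. Symbols follow the text: r_m is the converted rating vector of instance m, r̂_m its prediction, N the number of algorithms, λ the L2 regularizer; the exact regularization form is an assumption.

```latex
P_{r_m}(r_{m,n}) = \frac{\phi(r_{m,n})}{\sum_{k=1}^{N}\phi(r_{m,k})} \quad (3)
\qquad
P_{r_m}(r_{m,n}) = \frac{\exp(r_{m,n})}{\sum_{k=1}^{N}\exp(r_{m,k})} \quad (4)

\ell_{m,n} = -\,P_{r_m}(r_{m,n})\,\log P_{\hat{r}_m}(\hat{r}_{m,n}) \quad (5)
\qquad
L_m = \sum_{n=1}^{N} \ell_{m,n} \quad (6)

L = \sum_{m=1}^{M} L_m + \lambda\left(\lVert W\rVert_F^2 + \lVert V\rVert_F^2\right) \quad (7)
```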
The concrete gradient calculation for this loss definition and the gradient-based updating rule can be found in "Appendix A.3.1 and A.3.2" respectively.

Evaluation metrics
We measure the AS effect of different approaches with the evaluation metrics Success Rate (SUCC), Mis-Classification Penalty (MCP), Penalized Average Runtime Score (PAR10) and Mean Average Precision (MAP). In addition to these accuracy-oriented metrics, A3R is applied to handle the trade-off between prediction effect and inference time.

Accuracy-oriented evaluation metrics
SUCC, PAR10 and MCP are standard evaluation metrics from the AS community. SUCC only cares whether the selected algorithm solves the instance, while the question of how close the predicted best algorithm performs to the actual best algorithm is the main concern of PAR10 and MCP. Additionally, MAP is included as a representative ranking measurement. Following the conventional candidate selection criterion, the selection range of algorithms is limited to TOP1 of the predicted list; the chance of finding the optimal algorithm is thereby limited to this specific choice. In this paper, we propose expanding the algorithm candidate selection range to TOPK of the predicted list to gain the evaluation bonus. The four evaluation metrics with their TOPK interpretations are explained below.
SUCC stands for the average solved ratio of the selected algorithm per problem instance across the test set.
Under the TOP1 selection criterion, the solved ratio is only calculated w.r.t. the algorithm with the best predicted performance; for SUCC@K, the average is taken over the best K algorithms. PAR10 is the penalized version of the actual runtime of the selected algorithm: if the selected algorithm actually times out, its runtime is penalized by multiplying the timeout by 10; otherwise, the actual runtime is used directly. Under the TOP1 selection criterion, the penalty is only applied to the best-ranked algorithm in the predicted list; for PAR10@K, it is applied to the algorithm with the shortest actual runtime among the TOPK algorithms of the predicted list. MCP measures the time cost difference between the actual runtime of the predicted best algorithm and that of the VBS; with TOPK, the algorithm with the lowest actual runtime in the TOPK predicted list is compared with the runtime of the VBS. The algorithm selected by the VBS always has an MCP value of zero. MAP measures the mean average precision of the TOPK predicted algorithms vs. the TOPK algorithms ranked by ground truth performance; MAP for the TOPK algorithms in the predicted list is calculated in the same way as MAP@K (the average of the precision at each hit position).
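The TOPK variants of the runtime-based metrics can be sketched as below. The runtimes, predicted ranking and 100s cutoff are invented toy values, not taken from any scenario:

```python
import numpy as np

cutoff = 100.0                                # assumed timeout cutoff in seconds
runtimes = np.array([3.0, 100.0, 7.0, 1.0])   # actual runtimes of 4 algorithms
pred_rank = np.array([1, 0, 2, 3])            # predicted order (best first)

def par10_at_k(runtimes, pred_rank, k, cutoff):
    """PAR10@K: best actual runtime among the TOP-K predictions, 10x cutoff if all time out."""
    best = runtimes[pred_rank[:k]].min()
    return 10 * cutoff if best >= cutoff else best

def mcp_at_k(runtimes, pred_rank, k):
    """MCP@K: gap between the best TOP-K prediction and the VBS (oracle) runtime."""
    return runtimes[pred_rank[:k]].min() - runtimes.min()

print(par10_at_k(runtimes, pred_rank, 1, cutoff))  # TOP1 picks algorithm 1 -> timeout -> 1000.0
print(mcp_at_k(runtimes, pred_rank, 3))            # best of TOP3 is 3.0, VBS is 1.0 -> 2.0
```

The example shows the intended effect of the expansion: moving from TOP1 to TOP3 rescues the timed-out first prediction and shrinks the gap to the VBS.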
Among the above evaluation metrics, the accuracy-oriented ones SUCC and MAP comply with the rule "the higher the better", while for the time cost-oriented metrics MCP and PAR10, the lower the better.

Multi-objective evaluation metrics
The standard AS evaluation metrics aim at the accuracy of the performance prediction. However, inference time on unknown problem instances also deserves attention. The multi-objective evaluation metric Adjusted Ratio of Root Ratios (A3R) brings both accuracy and inference time into the evaluation and embodies the trade-off between the two factors.
Abdulrahman, Salisu et al. introduced A3R in AutoML [39,40], where it is treated as the ranking basis for algorithms w.r.t. a dataset in an AutoML scenario. A3R balances the precision and the runtime of the selected algorithm. As Eq. (8) shows, when applying algorithm a_p on dataset d_i, SR^{d_i}_{a_p} stands for the success rate and T^{d_i}_{a_p} represents the time cost. A reference algorithm a_q is chosen to standardize the success rate across all algorithms as the ratio SR^{d_i}_{a_p}/SR^{d_i}_{a_q}; the equivalent ratio for the time cost is T^{d_i}_{a_p}/T^{d_i}_{a_q}. The combined metric takes the success rate ratio as the advantage and the time ratio as the disadvantage. Since the time cost ratio ranges across more orders of magnitude than the success rate ratio does, the Nth root on the denominator of Eq. (8) re-scales the running time ratio and keeps A3R in a reasonable value range. A3R thus measures the comprehensive quality of running an algorithm on a dataset.

A3R(ACC)^{s_i}_{a_p,a_q} = (ACC^{s_i}_{a_p} / ACC^{s_i}_{a_q}) / (T^{s_i}_{a_p} / T^{s_i}_{a_q})^{1/N}    (11)

A3R(TC)^{s_i}_{a_p,a_q} = (TC^{s_i}_{a_q} / TC^{s_i}_{a_p})^{1/M} / (T^{s_i}_{a_p} / T^{s_i}_{a_q})^{1/N}    (12)
In this paper, we borrow the idea of A3R from AutoML and apply it as the ranking basis for the approaches in an AS scenario. We replace d_i with s_i (the i-th scenario) and keep a, now denoting an approach, in Eq. (11). For accuracy-based metrics like SUCC and MAP, we substitute their values ACC for SR in Eq. (8). For time cost based metrics TC, lower values denote better performance, so the inverse ratio TC^{s_i}_{a_q}/TC^{s_i}_{a_p} is used in the numerator instead; since the time cost spans several orders of magnitude, an Mth root is applied to the numerator for re-scaling. In the following experiments, we utilize Eqs. (11) and (12) to evaluate the combined AS effect.
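The two adapted A3R variants can be sketched directly from the descriptions above; the function names and the sample values are illustrative assumptions:

```python
def a3r_acc(acc_p, acc_q, t_p, t_q, n_root):
    """Eq. (11)-style A3R for accuracy metrics (higher ACC is better):
    accuracy ratio in the numerator, root-damped inference-time ratio below."""
    return (acc_p / acc_q) / (t_p / t_q) ** (1.0 / n_root)

def a3r_tc(tc_p, tc_q, t_p, t_q, m_root, n_root):
    """Eq. (12)-style A3R for time-cost metrics (lower TC is better):
    inverse, root-damped TC ratio in the numerator."""
    return (tc_q / tc_p) ** (1.0 / m_root) / (t_p / t_q) ** (1.0 / n_root)

# The reference approach a_q always scores 1 against itself.
assert a3r_acc(0.9, 0.9, 5.0, 5.0, 30) == 1.0

# An approach with equal accuracy but 1000x the inference time is only mildly
# penalized thanks to the 30th root: 1000 ** (1/30) is roughly 1.26.
print(a3r_acc(0.9, 0.9, 5000.0, 5.0, 30))
```

The root parameter controls how much a slow approach is punished; with N = 30, even a three-orders-of-magnitude slowdown costs only about 21% of the score.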

Experiments
We design the experiments to study: (1) the algorithm selection effect of the proposed BLR approach compared with other benchmark approaches; (2) the AS effect when taking both accuracy and inference time into consideration; (3) the benefits of expanding the candidate set selection range.

Datasets
In this paper, we focus on typical AS problems in the computational complexity domain. The Algorithm Selection Library (ASLib) released by the COnfiguration and SElection of ALgorithms (COSEAL) group 3 serves as the data source. In all of these computationally complex AS scenarios, runtime is the main performance metric. In each scenario, the dataset comprises algorithms' performances on problem instances, problem instances' meta-features, run status, and feature values. The standardized datasets make the experimental evaluation results comparable across many scenarios.
In each AS scenario from ASLib, we split the dataset into 10 folds and apply cross-validation on 9 of them to find the best hyper-parameter setting for each approach. With the best hyper-parameters selected, every approach is trained again on the whole 9-fold dataset to obtain the fitted models. These models then perform inference on the remaining fold (the test set) for evaluation.
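The split protocol above can be sketched as a nested cross-validation skeleton. ASLib loading is omitted; `M`, the shuffling and the use of scikit-learn's `KFold` are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

M = 100                                        # assumed number of instances in a scenario
outer = KFold(n_splits=10, shuffle=True, random_state=0)
# One outer split: 9 folds for hyper-parameter search, 1 held-out test fold.
train_idx, test_idx = next(iter(outer.split(np.arange(M))))

# Inner cross-validation over the 9 training folds tunes hyper-parameters;
# the winning setting is then refit on all of train_idx and scored once on test_idx.
inner = KFold(n_splits=9)
for tr, val in inner.split(train_idx):
    pass  # fit each hyper-parameter candidate on train_idx[tr], score on train_idx[val]
```

Keeping the test fold outside the hyper-parameter search is what makes the final evaluation unbiased.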

Performance of Bi-linear L2R approach
We compare the AS effect of BLR with the other benchmark approaches under the four evaluation metrics introduced in the last section. For the BLR model, the latent dimension K, learning rate η and regularizer λ are the hyper-parameters tuned during cross-validation. Since the optimization target of the BLR decomposition is not convex, the trained model is sensitive to the initialization of the entries in the latent matrices; thus the best initialization state is also determined in the cross-validation phase. To speed up the convergence of BLR, we use Stochastic Gradient Descent instead of Gradient Descent as the optimization method. Given the fluctuating loss values under Stochastic Gradient Descent, we declare convergence of the BLR model once at least 5 successive increases of the loss are detected during the optimization.
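The stopping rule described above can be sketched as a small helper; the function name and the toy loss traces are assumptions for illustration:

```python
def converged(losses, patience=5):
    """Declare convergence once the (noisy, SGD) loss has increased for
    `patience` successive evaluations, as described above."""
    if len(losses) <= patience:
        return False
    tail = losses[-(patience + 1):]       # last patience+1 loss values
    return all(b > a for a, b in zip(tail, tail[1:]))

# Still decreasing overall -> keep training.
assert not converged([5.0, 4.0, 3.0, 2.0, 1.0, 0.9])
# Five successive increases after the minimum -> stop.
assert converged([3.0, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
```

Requiring several successive increases, rather than one, keeps the rule robust against the single-step noise inherent to stochastic updates.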

BLR performance with TOP1 candidates selection
First we apply the conventional TOP1 candidate selection in the evaluation and observe under what circumstances BLR performs better. In Table 1, each row lists an AS scenario and evaluation metric combination for which BLR ranks among the best 3 compared with the other benchmark approaches. More specifically, BLR ranks TOP1 in CSP-Minizinc-Obj-2016 and SAT15-INDU regarding success rate, in PROTEUS-2014 concerning MCP and PAR10, and in TSP-LION2015 in terms of MAP. These competitive performances verify that BLR can be considered a benchmark approach in some AS scenarios.

Cumulative performance in TOPK expansion
If parallel processing of the candidate algorithms is considered, we can broaden the range of candidate selection to increase the chance of finding the best algorithm without extra time consumption. Thus, if the cumulative best performances of the approaches decrease drastically over the first several predicted positions, it is reasonable to consider a TOPK expansion of the predicted list. We first observe the cumulative best performance along the predicted positions.
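The cumulative best performance along the predicted positions is simply a running minimum over the runtimes in predicted ranking order, as in this short sketch:

```python
import numpy as np

def cumulative_best(runtimes_in_predicted_order):
    """Cumulative best (minimum) runtime along the predicted ranking:
    position k holds the best runtime among the TOP-(k+1) candidates.
    Under parallel execution this is the cost actually paid when the
    candidate set is expanded to TOPK."""
    return np.minimum.accumulate(np.asarray(runtimes_in_predicted_order, float))
```

A steep drop over the first few positions of this curve is exactly the signal that a TOPK expansion is worthwhile.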

BLR performance with expanded candidates selection
The cumulative best performance varies a lot even for a single AS approach, and thus the ranking of the approaches also changes under different expansion degrees. For BLR, aside from the conventional TOP1 candidate selection criterion, we observe its rankings under TOP3 selection.
In Table 2, we list the conditions (combinations of scenario and evaluation metric) under which BLR is evaluated as competitive (ranked in the top 3) when compared with the other benchmark approaches. BLR can still perform well in some specific scenarios.

Accuracy and inference time trade-off
With the evaluation metrics SUCC, MAP, MCP and PAR10, the accuracy of AS approaches can be assessed. Nevertheless, a shorter inference time is also preferred for an AS approach. As introduced in Sect. 3, A3R is a good metric for measuring the combined effect of accuracy and time, and we adopt it for this purpose in this experiment. To make the accuracy/time ratio comparable across all scenarios, the one-hot random forest regressor (the approach that wins in most scenarios) is taken as the reference approach (a_q) in the evaluation equation. It is drawn as the pink bar in the following figures, and the A3R value of this reference approach is always 1. All the accuracy metric values are taken from the TOP3 candidate setting.

$$A3R(ACC)^{s_i}_{a_p a_q} = \frac{ACC^{s_i}_{a_p} \,/\, ACC^{s_i}_{a_q}}{\left(T^{s_i}_{a_p} \,/\, T^{s_i}_{a_q}\right)^{1/N}} \qquad (11)$$

where $T^{s_i}_{a_p}$ and $T^{s_i}_{a_q}$ denote the inference times of approaches $a_p$ and $a_q$ on scenario $s_i$.
For precision-oriented accuracy metrics (SUCC and MAP), the accuracy ratio is proportional to the metric value of the selected approach. Thus the ACC value of $a_q$ (the reference approach) is set as the denominator in the ratio $ACC^{s_i}_{a_p}/ACC^{s_i}_{a_q}$ in Eq. (11). Considering that the inference times of different AS approaches span 3 to 4 orders of magnitude, the root parameter N is set to 30 in the experiment to keep A3R in a reasonable range. As Fig. 7 shows, when evaluating the approaches regarding both MAP and inference time using A3R, BLR (the light blue bar) outperforms all other benchmark approaches. Thus BLR reaches a balance of model complexity and inference simplicity.
For time-cost-oriented accuracy metrics (MCP and PAR10), the values are negatively correlated with prediction accuracy. The accuracy ratio $TC^{s_i}_{a_q}/TC^{s_i}_{a_p}$ therefore takes the metric value $TC^{s_i}_{a_p}$ as the denominator. In addition, since the MCP and PAR10 values vary among the approaches by orders of magnitude, a root parameter M is applied to this accuracy ratio as well to transform it into a readable range. As Fig. 8 shows, with the setting M = 3 and N = 30, BLR (again the light blue bar) beats the other benchmark approaches.

[Fig. 7: Approaches' average A3R score across all scenarios, in terms of MAP and inference time]
[Fig. 8: Approaches' average A3R score across all scenarios, in terms of MCP and inference time]
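The two A3R variants described above can be written out directly. This is a sketch following the textual description of Eq. (11); the function names are ours.

```python
def a3r_precision(acc_p, acc_q, time_p, time_q, N=30):
    """Precision-oriented A3R (SUCC, MAP): candidate accuracy acc_p in the
    numerator over the reference acc_q, divided by the N-th root of the
    inference-time ratio. N = 30 compresses time ratios spanning 3-4
    orders of magnitude."""
    return (acc_p / acc_q) / (time_p / time_q) ** (1.0 / N)

def a3r_timecost(tc_p, tc_q, time_p, time_q, M=3, N=30):
    """Time-cost-oriented A3R (MCP, PAR10): lower metric values are better,
    so the reference cost tc_q goes in the numerator; the M-th root tames
    the large spread of MCP/PAR10 values."""
    return (tc_q / tc_p) ** (1.0 / M) / (time_p / time_q) ** (1.0 / N)
```

By construction, the reference approach compared against itself scores exactly 1, which matches the constant pink bar in Figs. 7 and 8.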
The excellent performance on A3R, which accounts for both precision-oriented and time-cost-oriented accuracy metrics, verifies that BLR is a good option when the balance between accuracy and inference time needs to be taken into account.

Benefit of expanding the candidate selection range from TOP1 to TOPK
As discussed in the former subsections, if we enlarge the range of algorithm candidates from TOP1 to TOPK, we can expect the algorithm selected from the wider spectrum to be closer to the optimal one. In this experiment, we tentatively set K = 3 and observe the difference in the cumulative evaluation results between the conditions K = 1 and K = 3. For every AS scenario, we list the approach with the largest performance difference between the TOP1 and TOP3 selection criteria, and thus illustrate the benefit of the TOPK expansion. We choose the time-cost-oriented metrics MCP and PAR10 to represent the performance difference, considering their straightforward cumulative performance decrease along the TOPK positions.
Mis-Classification Penalty (MCP) measures the time-cost difference between the selected algorithm and the actual best algorithm. The lower the MCP value, the better the AS approach performs. As seen from Table 3, the TOP3 selection criterion decreases MCP significantly. We highlight decrease percentages higher than 90.00% in bold boxes in the table. The decrease percentage ranges from 55.78% to 100%. This demonstrates that enlarging the TOPK candidate selection range can help find an algorithm whose runtime is closer to the ground-truth best.
The evaluation metric PAR10 applies a tenfold penalty when the most recommended algorithm actually times out. We list the decrease percentages caused by the TOP3 candidate expansion in Table 4. These percentages fall in the interval from 19.47% to 95.72%. The cases in which the percentage is higher than 90.00% are highlighted in bold boxes. This decrease indicates a reduction in the probability that the selected algorithm runs into a timeout.
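The two time-cost metrics and the reported decrease percentage follow their standard definitions, which can be sketched as:

```python
def mcp(chosen_runtime, best_runtime):
    """Mis-Classification Penalty: runtime gap between the selected
    algorithm and the actual best algorithm (0 for a perfect pick)."""
    return chosen_runtime - best_runtime

def par10(runtime, cutoff):
    """PAR10: the runtime itself if the run finishes within the cutoff,
    otherwise 10 times the cutoff as a timeout penalty."""
    return runtime if runtime <= cutoff else 10 * cutoff

def decrease_pct(top1_value, top3_value):
    """Relative decrease of a time-cost metric when expanding TOP1 -> TOP3,
    as reported in Tables 3 and 4."""
    return 100.0 * (top1_value - top3_value) / top1_value
```

For instance, a TOP1 MCP of 20 s dropping to 2 s under TOP3 corresponds to the kind of 90% decrease highlighted in Table 3.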
Expanding the TOP1 candidate set to TOP3, the observed significant decrease in the time-cost metrics MCP and PAR10 confirms the benefit of the expansion. In AS, under a parallel testing environment, the test on the TOPK candidates stops at the runtime of the optimal algorithm in the candidate set; thus test time is also saved owing to the expansion. The selection of K depends on the computational power and environmental limits. Though the TOP1 setting is required in most AS challenges, we would like to suggest expanding this candidate selection range.

Discussion
The experiments in this section unveil several interesting points: (1) BLR has the potential to outperform the other benchmark AS approaches in some scenarios; (2) on the evaluation metric A3R, BLR shows its power in balancing prediction accuracy and inference time; (3) the TOPK expansion of the candidate set brings benefits for finding the optimal algorithm.

Conclusion and future work
In this paper, we propose Bi-linear Learning to Rank (BLR) to solve the AS problem. BLR is inspired by collaborative filtering in RS. With the list-wise TOP1 probability assumption, it models the uncertainty in algorithm performance. The learning process of BLR averts problems such as multi-model training and the algorithms' one-hot conversion found in traditional AS benchmark approaches. Compared with the benchmark approaches, BLR has proven to perform well in some AS scenarios. Considering the trade-off between accuracy and inference time in the evaluation, we propose using A3R as the evaluation protocol. BLR performs especially well on this new trade-off metric. Finally, we affirm the benefit of expanding the selection range of candidate algorithms from TOP1 to TOPK regarding the cumulative optimal demand of AS evaluation. Given the work so far, there is much to do in the future. For BLR, since its loss function is non-convex, the convergence criteria can be adjusted to tune better parameter settings. In the current experimental setting, we only investigate 21 AS scenarios; extending the experiments to additional scenarios can give stronger confidence in the experimental results. Though we set K in the TOPK expansion to 3 and illustrate the benefit of the expansion, a more thorough study can be done on how to choose K to balance performance gain and computational power.

Algorithm
Algorithm or heuristic (e.g., a genetic algorithm for TSP) that can successfully solve some of the problem instances in the designated scenario

Solver
An alias for algorithm in some problem domains such as SAT

Solution
The result produced by an algorithm (or solver) when run on a problem instance

Approach
The method used to select the candidate set of potentially optimal algorithms for problem instances in a specific scenario

Predictor
The method that predicts the performance of algorithms on a problem instance

Selector
The method used to select the potential optimal algorithms based on their predicted performances

Algorithm Candidate Set
A set of algorithms selected as the most likely optimal algorithms for a specific problem instance, inferred from an approach/selector

Performance
The measurement representing how well an algorithm solves a problem instance, e.g., runtime

Evaluation Metric
The evaluation criteria to measure the selection effect of an approach in a scenario, e.g., SUCC, MCP

A.2 Notations of Bi-linear L2R
See Table 6. In particular, two top-one probability notations are used: $P_{r_m}(r_{m,n})$ denotes, given the actual performance rating vector $r_m$, the probability that algorithm $n$ is ranked at top 1 for the $m$-th problem instance; $P_{\hat{r}_m}(\hat{r}_{m,n})$ denotes, given the estimated rating vector $\hat{r}_m$, the probability that algorithm $n$ is ranked at top 1 for the $m$-th problem instance.

A.3 Gradient and updating rules in Bi-linear L2R
In the BLR model, to approximate the weighting matrix W and the latent matrix V that minimize the loss function defined in Sect. 2, we calculate the gradient of the loss function and use the updating rules described in the following subsections.

A.3.1 Gradient calculation
Knowing the loss function L, to use gradient descent as the optimizer, the gradients with respect to the meta-feature mapping weight matrix W and the algorithm latent vector matrix V must be provided. Since the loss function is defined layer by layer through the model function, the converter function, the top-one probability function, and the cross-entropy function, we use the chain rule to calculate the gradient accordingly. For L, its partial derivatives over $w_{l,k}$ and $v_{n,k}$ can be factorized in a similar way, as in Eqs. (13) and (14) respectively.
For the last step, which returns the partial derivatives of the estimated score $\hat{s}_{m,n}$ over $w_{l,k}$ and $v_{n,k}$, we can broadcast them in a vectorized way, as in Eqs. (18) and (19):

$$\frac{\partial \hat{s}_{m,n}}{\partial W} = x_m^{\top} v_n \qquad (18)$$

$$\frac{\partial \hat{s}_{m,n}}{\partial v_n} = x_m W \qquad (19)$$
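Assuming the bilinear score model $\hat{s}_{m,n} = x_m W v_n^{\top}$ (with $x_m$ the instance meta-feature vector, W the L x K weight matrix, and $v_n$ the algorithm latent vector), the vectorized gradients above can be checked numerically with finite differences:

```python
import numpy as np

# Assumed bilinear score: s_hat = x_m @ W @ v_n, mirroring Eqs. (18)-(19).
rng = np.random.default_rng(0)
L, K = 4, 3
x_m = rng.normal(size=L)
W = rng.normal(size=(L, K))
v_n = rng.normal(size=K)

score = lambda W, v: x_m @ W @ v

grad_W = np.outer(x_m, v_n)           # analytic d s_hat / d W  (Eq. 18)
grad_v = x_m @ W                      # analytic d s_hat / d v_n (Eq. 19)

# Finite-difference check of the v_n gradient
eps = 1e-6
num_grad_v = np.array([
    (score(W, v_n + eps * np.eye(K)[k]) - score(W, v_n - eps * np.eye(K)[k]))
    / (2 * eps)
    for k in range(K)
])
assert np.allclose(grad_v, num_grad_v, atol=1e-6)
```

Since the score is linear in $v_n$, the central difference agrees with the analytic gradient up to rounding error.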

A.3.2 Updating rule
Having obtained the partial derivatives via the chain rule, we can update the weight matrix W and the algorithm latent matrix V by the following updating rules, Eqs. (20) and (21), where η is the learning rate:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W} \qquad (20)$$

$$V \leftarrow V - \eta \frac{\partial L}{\partial V} \qquad (21)$$
Since the loss is list-wise, there is one loss term per problem instance, based on the corresponding top-one probability. To update the weights in a stochastic way, the updating unit should therefore be the list associated with problem instance m, rather than each individual rating point. The stochastic updating rules are then as in Eqs. (22) and (23):

$$W \leftarrow W - \eta \frac{\partial L_m}{\partial W} \qquad (22)$$

$$V \leftarrow V - \eta \frac{\partial L_m}{\partial V} \qquad (23)$$

where $L_m$ denotes the loss term of problem instance $m$.
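One such list-wise stochastic step can be sketched as follows. This assumes the bilinear score $\hat{s}_{m,n} = x_m W v_n^{\top}$, softmax top-one probabilities, and a cross-entropy loss per instance; it is an illustrative sketch, not the paper's exact implementation.

```python
import numpy as np

def listwise_sgd_step(x_m, r_m, W, V, eta=0.05):
    """One stochastic update for problem instance m: the update unit is the
    whole list of algorithm ratings for m, not a single rating point.
    Top-one probabilities are the softmax of the ratings, and the loss is
    the cross entropy between true and estimated top-one distributions."""
    s_hat = x_m @ W @ V.T                        # estimated ratings, shape (n,)
    p_true = np.exp(r_m) / np.exp(r_m).sum()     # top-one prob. of true ratings
    p_hat = np.exp(s_hat) / np.exp(s_hat).sum()  # top-one prob. of estimates
    d_s = p_hat - p_true                         # d L_m / d s_hat (softmax CE)
    grad_W = np.outer(x_m, d_s @ V)              # d L_m / d W
    grad_V = np.outer(d_s, x_m @ W)              # d L_m / d V
    W -= eta * grad_W                            # Eq. (22)-style step
    V -= eta * grad_V                            # Eq. (23)-style step
    return -np.sum(p_true * np.log(p_hat))       # list-wise loss for instance m
```

Iterating this step over randomly drawn problem instances gives the stochastic training loop whose convergence criterion was described earlier.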