Pairwise meta-rules for better meta-learning-based algorithm ranking
Abstract
In this paper, we present a novel meta-feature generation method in the context of meta-learning, which is based on rules that compare the performance of individual base learners in a one-against-one manner. In addition to these new meta-features, we also introduce a new meta-learner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several state-of-the-art meta-learners. Our experimental results are based on a large collection of datasets and show that the proposed new techniques can significantly improve the overall performance of meta-learning for algorithm ranking. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.
Keywords
Meta-learning · Algorithm ranking · Ranking trees · Ensemble learning
1 Introduction
Training a good model for a given dataset is one of the common tasks of a data analyst. The straightforward approach of simply applying and optimising all known learning algorithms is usually not feasible. Thus an experienced data analyst will generally perform some form of preliminary analysis and then focus on a few promising algorithms, a selection guided by the analyst's prior experience. Meta-learning tries to support and automate this process: it tries to predict which algorithms will perform best, or close to best, on a given dataset, thus considerably reducing the amount of training and optimisation time needed for finding a good model. This reduction in resource consumption should be accompanied by no, or only a small, loss in predictive performance when compared to the best possible result (Brazdil et al. 2003, 2009). Meta-learning for algorithm ranking uses a general machine learning approach to generate meta-knowledge mapping the characteristics of a dataset, captured by meta-features, to the relative performances of the available algorithms.
The advantage of the meta-learning approach is that high-quality algorithm ranking can be done on the fly, i.e., in seconds, which is particularly important for business domains that require rapid deployment of analytical techniques. In machine learning research, meta-learning has also been used for boosting the performance of evolutionary-algorithm-based model selection techniques (Reif et al. 2012). The work by Smith-Miles (2009) discusses cross-disciplinary perspectives on meta-learning. Theoretical motivation for meta-learning research comes from the No Free Lunch (NFL) theorem (Wolpert and Macready 1997) and the Law of Conservation for Generalization Performance (LCG) (Schaffer 1994). Recent research suggests that meta-learning offers a viable mechanism for building a "general-purpose" learning algorithm (Giraud-Carrier 2008). For a comprehensive review of meta-learning research and its applications, we refer the reader to Giraud-Carrier (2008); Brazdil et al. (2009); Jankowski et al. (2011); Serban et al. (2013).
This paper makes three contributions for improving the overall performance of meta-learning for algorithm ranking: a novel meta-feature generator, a new meta-learner, and a more appropriate experimental configuration.
2 Background
For algorithm recommendation, our goal is not to predict the absolute expected performance of any algorithm, but rather the relative performance between algorithms. Thus, the meta-dataset can be transformed to represent the rankings of the algorithms. Figure 1 (from left-hand side to right-hand side table) shows an example of how a "raw" meta-dataset is transformed. In this case, ranking is a special case of the general multi-target regression setting. Then, a meta-learner, which we will call a "ranker" here, takes this n×m matrix as its training input to learn a ranking model. Given a new dataset, we first calculate its meta-features and use them (e.g., the f_1, f_2, f_3 values for the example in Fig. 1) as input to the ranker. The ranker finally returns a ranked list of all algorithms. As is true in general for machine learning, the performance of meta-learning depends crucially on the quality of both the meta-features and the meta-learners available. In addition, the performance of the base-level learners must also be estimated to a high quality to enable successful meta-learning.
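The transformation from a "raw" meta-dataset of performance scores to per-dataset rankings can be sketched as follows; this is a minimal illustration where the performance numbers are invented and ties are ignored:

```python
import numpy as np

def performance_to_ranks(perf):
    """Convert an n x t matrix of performance scores (higher is better)
    into an n x t matrix of ranks (1 = best algorithm on that dataset).
    Ties are not handled specially here."""
    order = np.argsort(-perf, axis=1)              # algorithm indices, best first
    ranks = np.empty_like(order)
    rows = np.arange(perf.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, perf.shape[1] + 1)
    return ranks

# toy "raw" meta-dataset: 2 datasets, 3 algorithms
perf = np.array([[0.81, 0.93, 0.70],
                 [0.65, 0.60, 0.88]])
ranks = performance_to_ranks(perf)
```

For the first toy dataset the second algorithm scores highest, so the resulting ranking is [2, 1, 3], matching the form of the rankings in Fig. 1.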
Existing meta-learning systems are mainly based on three types of meta-features: statistical, information-theoretic and landmarking-based meta-features, or SIL for short. Thorough reviews of these meta-features can be found in Brazdil et al. (1994); Pfahringer et al. (2000); Kalousis (2002); Soares (2004). Recently, more meta-feature sets have been developed, for example, learning-curve-based meta-features (Leite and Brazdil 2005), tree-structure-information-based meta-features (Bensusan et al. 2000) and histogram-based meta-features (Kalousis 2002).
In this paper, we propose a novel method for meta-learning that generates "meta-level" meta-features from the "base-level" meta-features via rule induction that tries to predict the better algorithm for each pair of algorithms. Here, the "base-level" meta-features can come from any of the meta-feature sets mentioned above. Meta-features generated by the proposed method (described in Sect. 3) contain pairwise information between algorithms. Adding them to the original feature space can help to improve the performance of many meta-learners.
Next, we will discuss several ranking approaches from an algorithmic perspective, as the underlying multi-target or ranking problem is studied in various fields, including the social sciences, economics and the mathematical sciences. Although the terminology and presentation of the same models can differ considerably between fields, the algorithmic properties are similar in concept and relatively easy to describe.
2.1 The k-nearest neighbors approach
The kNN ranking approach has two steps: the nearest neighbor search step and the ranking generation step. In the first step, given a new dataset, we first calculate its meta-features to construct an instance as a query (an m_f-value array). Then, we select a set of instances (nearest neighbors) in the training set (the n×m_f data matrix) that are similar to the query instance. The similarity between instances is usually based on a distance function, e.g., Euclidean distance. In the second step, we combine the rankings of the nearest neighbors to generate an aggregated algorithm ranking for the new dataset. For our experiments, we use the average ranks method described in Brazdil et al. (2009). Let R_{i,j} be the rank of algorithm T_j, j=1,…,t, on dataset i, where t is the number of algorithms. The average rank for each algorithm T_j is defined as: \(\bar{R_{j}} = (\sum_{i=1}^{k} R_{i,j}) / k\), where k is the number of nearest neighbors. The kNN ranker's performance is related to the value of k; appropriate values for k are usually determined by cross-validation. The kNN ranker is often used as a benchmark learner for testing the performance of different meta-feature sets.
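A minimal sketch of this two-step ranker, using Euclidean distance and the average ranks method described above; the meta-feature values and training rankings are invented for illustration:

```python
import numpy as np

def knn_average_ranks(meta_features, rank_matrix, query, k):
    """k-NN ranker: find the k training datasets closest to the query
    (Euclidean distance on meta-features) and average their algorithm ranks."""
    dists = np.linalg.norm(meta_features - query, axis=1)
    nn = np.argsort(dists)[:k]              # indices of the k nearest datasets
    avg = rank_matrix[nn].mean(axis=0)      # average rank per algorithm
    return np.argsort(np.argsort(avg)) + 1  # convert the averages back to a ranking

# toy training data: 3 datasets with 2 meta-features each, 3 algorithms
meta = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]])
ranks = np.array([[1, 2, 3], [3, 2, 1], [2, 3, 1]])
pred = knn_average_ranks(meta, ranks, query=np.array([0.15, 0.15]), k=2)
```

The two datasets nearest the query have average ranks [1.5, 2.5, 2.0], which re-ranks to the prediction [1, 3, 2].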
2.2 The binary pairwise classification approach
Given the n×m data matrix as the training data, multiple binary (pairwise) classification models can be used to construct a ranking model. For example, if there were three meta-features and three algorithms in the training set (e.g., Fig. 1, right-hand side table), one could build three binary classification models, one for each pair of algorithms: Algorithm 1 vs. Algorithm 2, Algorithm 1 vs. Algorithm 3, and Algorithm 2 vs. Algorithm 3. The training data for the three binary classification models are the same, namely the n×m_f data matrix. Given a new dataset, we first calculate its meta-features, again an m_f-value array, as a query. Then, we use the three binary classification models to classify the query. The final algorithm ranking list for the new dataset is computed based on how many times each algorithm has been predicted as "is better". If there were more than three algorithms to rank, then there might be ties in the list. Several tie-breaking techniques have been examined in the literature (Kalousis 2002; Brazdil et al. 2009), but usually this choice does not have a strong influence on the performance of a meta-learning system.
The advantage of the binary classification ranking approach is that existing binary classification algorithms can be employed directly. However, if there are many algorithms to rank, then the number of binary models required can be large and hard to manage in practice, e.g., for 20 algorithms, this would require \(\frac{T \times (T-1)}{2} = \frac{20 \times 19}{2} = 190\) binary classification models to be built, which could be costly if one also considered using different, and fine-tuned, classification algorithms for each of the 190 binary classification problems. The binary pairwise classification approach has also been studied in label ranking (Hüllermeier et al. 2008).
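The win-counting aggregation can be sketched as follows; the pairwise model predictions are hypothetical, and ties are broken by algorithm index here, one of several options mentioned above:

```python
from itertools import combinations

def pairwise_vote_ranking(pair_predictions, t):
    """Aggregate pairwise 'i is better than j' predictions by counting wins.
    pair_predictions maps each pair (i, j), i < j, to the winning algorithm index."""
    wins = [0] * t
    for (i, j), winner in pair_predictions.items():
        wins[winner] += 1
    # more wins -> better rank; ties broken by algorithm index for simplicity
    order = sorted(range(t), key=lambda a: (-wins[a], a))
    ranking = [0] * t
    for pos, a in enumerate(order):
        ranking[a] = pos + 1
    return ranking

# three algorithms -> three pairwise models; here algorithm 1 beats both others
preds = {(0, 1): 1, (0, 2): 0, (1, 2): 1}
ranking = pairwise_vote_ranking(preds, 3)

# for 20 algorithms the number of required pairwise models is 190
n_models = len(list(combinations(range(20), 2)))
```

With these toy predictions, algorithm 1 collects two wins, algorithm 0 one, algorithm 2 none, giving the ranking [2, 1, 3].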
2.3 The learning to rank approach
From the end-user's perspective, algorithm ranking is similar to the ranking problem in a search engine scenario, where a ranked list is returned in response to a query. The search engine scenario has been extensively studied in learning to rank and information retrieval (IR). One obvious candidate for rank prediction is the AdaRank algorithm proposed in Xu and Li (2007). AdaRank is a boosting algorithm that minimizes a loss function directly defined on the ranking performance measure. The algorithm repeatedly constructs "weak rankers" on the basis of reweighted training data and finally linearly combines the weak rankers for making ranking predictions. In addition to supplying algorithms, the IR literature is also an excellent source for ranking evaluation metrics. Similar to search engine users, meta-learning users are usually mainly interested in the top candidates, be it websites or algorithms. An IR measure which captures this bias towards the top-ranked items well is the normalized discounted cumulative gain (NDCG) metric (Järvelin and Kekäläinen 2002; Li 2011), which has not previously been used in meta-learning evaluation.
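NDCG@k itself is straightforward to compute; the sketch below uses one common formulation (logarithmic discounts with linear gains — other variants use exponential gains), and the relevance values are invented:

```python
import math

def ndcg_at_k(relevance_in_predicted_order, k):
    """NDCG@k: discounted cumulative gain of the predicted order, normalised
    by the DCG of the ideal (relevance-sorted) order. Linear-gain variant."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(relevance_in_predicted_order, reverse=True)
    return dcg(relevance_in_predicted_order) / dcg(ideal)

# relevance of each item, listed in the order the ranker predicted them;
# swapping items 2 and 3 near the top costs only a little NDCG
score = ndcg_at_k([3, 2, 3, 0, 1], k=3)
```

Because the discount shrinks with position, mistakes near the top of the list are penalised more heavily than mistakes further down, which is exactly the bias towards top candidates discussed above.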
2.4 The label ranking approach
Label ranking can be seen as an extension of the conventional setting of classification: the former can be obtained from the latter by replacing single class labels with complete label rankings (Cheng and Hüllermeier 2008; Cheng et al. 2009). From an algorithmic viewpoint, one type of label ranking approach, such as the ranking by pairwise comparison (RPC) algorithm (Hüllermeier et al. 2008), extends the pairwise binary classification approach with more sophisticated ranking aggregation methods. Another type, such as the label ranking trees (LRT) algorithm (Cheng et al. 2009), applies probabilistic models in the label ranking setting. Meta-learning for algorithm ranking using the multi-target regression setting can also be transformed into a label ranking problem, so that label ranking algorithms can be used directly.
2.5 The multi-target regression approach
3 Pairwise meta-rules
In this section, we introduce a novel meta-feature generation method in the context of meta-learning. The main motivation comes from our observation that existing meta-feature sets ignore one potential source of information: the logical pairwise relationships between each pair of the target algorithms to rank. Explicitly adding this information to the meta-feature space might improve a meta-learner's predictive accuracy.
Of course, when predicting for a new dataset, this information will not be available, as otherwise we would not need to predict rankings in the first place. Therefore, we propose to first learn pairwise rules with a rule learner, and then use these rules as new meta-features. Two steps are involved: the first step is similar to the binary classification ranking approach, where \(\frac{T \times (T-1)}{2}\) (T is the number of algorithms to rank) binary classification training datasets are constructed from the original n×m data matrix. Each binary dataset has two class labels: {"algorithm t_i is better", "algorithm t_j is better"}. Whether an algorithm is better than the other is determined by their ranking positions in the ranked list for a particular dataset.
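The first step, deriving the binary class labels for each algorithm pair from the rank matrix, might look like this (using the example rankings from Fig. 1):

```python
from itertools import combinations

def pairwise_binary_labels(rank_matrix):
    """Build the label column of each pairwise binary training set from an
    n x t matrix of ranks (1 = best). For pair (i, j), the label is True
    when algorithm i outranks algorithm j on that dataset."""
    t = len(rank_matrix[0])
    labels = {}
    for i, j in combinations(range(t), 2):
        labels[(i, j)] = [row[i] < row[j] for row in rank_matrix]
    return labels

# rankings of three algorithms over three datasets, as in Fig. 1
ranks = [[2, 1, 3], [2, 3, 1], [3, 2, 1]]
labels = pairwise_binary_labels(ranks)
```

Each entry of `labels` would then be paired with the corresponding base-level meta-feature rows to form one binary training dataset for the rule learner.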
Method 1 turns each individual rule into one Boolean meta-feature. For example, we may have a pairwise meta-rule:
If BaseMetaFeatureX ≤ 0.5 AND BaseMetaFeatureY ≥ 0, Then Algorithm A is better than Algorithm B.
The value of the new meta-feature constructed from this meta-rule is determined by evaluating the rule's condition on the (base-level) meta-feature values of a new dataset. For a new dataset, the meta-rule-based meta-feature value is set to true if the rule condition "BaseMetaFeatureX ≤ 0.5 AND BaseMetaFeatureY ≥ 0" is met, and to false otherwise. For our meta-learning problem, the RIPPER algorithm returns on average about two individual rules for each of the 190 algorithm pairs.
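As an illustration, the example meta-rule above can be evaluated as a Boolean meta-feature; the feature names and thresholds are the hypothetical ones from the example rule, not rules learned by RIPPER:

```python
def metarule_feature(base_meta_features):
    """Method 1 sketch: one Boolean meta-feature per learned rule.
    Example rule: IF X <= 0.5 AND Y >= 0 THEN 'Algorithm A is better than B'.
    The returned truth value is the new meta-feature value."""
    x, y = base_meta_features["X"], base_meta_features["Y"]
    return x <= 0.5 and y >= 0

f1 = metarule_feature({"X": 0.3, "Y": 2.0})   # condition holds
f2 = metarule_feature({"X": 0.9, "Y": 2.0})   # condition fails
```

In Method 1, every individual rule learned for every algorithm pair contributes one such Boolean column to the augmented meta-dataset.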
Method 2 creates just one Boolean meta-feature representing the outcome of applying the full ruleset. For a new dataset, the meta-rule-based meta-feature value is set to true if the default catch-all rule of a ruleset applies, and to false otherwise. For our meta-learning problem, the RIPPER algorithm returns one ruleset for each of the 190 algorithm pairs, so in Method 2, 190 meta-rule-based meta-features are added to the original feature space.
The pairwise meta-rules method differs from standard stacking (Wolpert 1992): we do not use the predicted complete rankings of base models; instead, we use the RIPPER rule sets to construct new meta-features. A meta-learner then uses all meta-features (both base-level and meta-rule-based) for building the final model, which also differs from stacking.
In the experiments reported below we compare three different meta-feature sets:
SIL-only: 80 different SIL meta-features only;
SIL+MetaRules-1: the 80 SIL meta-features plus meta-features generated by Method 1;
SIL+MetaRules-2: the 80 SIL meta-features plus meta-features generated by Method 2.
Next, we propose a novel meta-learner (ranker), specifically designed for the meta-learning-based algorithm ranking problem.
4 Approximate ranking tree forests for meta-learning
 1.
as more and more meta-features are being developed and added to the feature space, we believe that a relatively fast algorithm (e.g., decision-tree-based learners) with "built-in" feature selection capacity would be useful;
 2.
the meta-learner should not be constrained by the restriction that the number of training examples must be much greater than the number of parameters to estimate, because for the meta-learning problem the number of datasets is usually not very large;
 3.
recent theoretical work (Biau 2012) suggests that irrelevant features do not significantly decrease a random forest's predictive performance, so ideally the meta-learner should be capable of using as many meta-features as possible;
 4.
ensemble algorithms usually outperform base algorithms. Statistics and models aggregated from bootstrap samples are non-parametric estimates (Efron and Tibshirani 1993), so we do not have to make parametric assumptions about the form of the underlying population, which provides an automatic approximation layer to our new algorithm. In the literature, an empirical study by Kalousis and Hilario (2001) showed that a boosting-based meta-learner outperformed the kNN-based meta-learner in the pairwise binary classification setting (Sect. 2.2) for meta-learning.
4.1 Approximate Ranking Trees (ART)
For example, the targets part of Fig. 1, right-hand side table, consists of three rankings: y^{〈1〉} = [2,1,3]′, y^{〈2〉} = [2,3,1]′, y^{〈3〉} = [3,2,1]′.
4.2 ART’s splitting criterion
When d is Spearman's distance (Eq. 7), then, based on Theorem 2.2 of Marden (1995), the average ranking averageRanking(\(\bar{y}\)) belongs to the set of the sample central rankings, which provides a theoretical foundation for ART.
4.3 ART’s stopping criterion
Another stopping criterion is the minimum number of examples at a leaf node, γ. Trees grown with a small γ usually outperform trees using larger γ values when the tree algorithm is used as the base learner for random forests (Breiman 2001). ART uses both of these criteria to limit tree size.
4.4 ART forests and rank aggregation
 1.
The training set is a bootstrap sample from the original training set;
 2.
An integer u is set by the user. At each node, u features are selected at random and the node is split on the best feature among the selected u;
For prediction, when a test feature vector x is put down each tree, it is assigned the average ranking of the rankings at the node it stops at. The average ranking of these over all approximate ranking trees in the forest is the predicted ranking for x. Alternative rank aggregation methods, such as other types of Borda count, graph theory models, or binary linear programs, might perform better, at the expense of higher computational cost, and will be investigated in future research.
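The forest-level aggregation can be sketched as follows; the per-tree leaf rankings are invented, and argsort-of-argsort converts the averaged ranks back into a ranking, the simple Borda-style average described above:

```python
import numpy as np

def forest_predicted_ranking(leaf_rankings):
    """Aggregate the average rankings returned by each tree's leaf into the
    forest's predicted ranking: average the ranks per algorithm, then re-rank."""
    avg = np.mean(leaf_rankings, axis=0)       # mean rank per algorithm
    return np.argsort(np.argsort(avg)) + 1     # re-rank the averaged ranks

# leaf average-rankings from three hypothetical approximate ranking trees
leaves = np.array([[1.0, 2.0, 3.0],
                   [1.0, 3.0, 2.0],
                   [2.0, 1.0, 3.0]])
pred = forest_predicted_ranking(leaves)
```

The averaged ranks here are [1.33, 2.0, 2.67], so the forest predicts the ranking [1, 2, 3] even though the individual trees disagree.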
5 Experiment setup and results

The first question concerns the new meta-feature generation method: can it consistently improve the performance of known meta-learners?

The second question concerns the new algorithm: are ART Forests competitive with current state-of-the-art rankers?
In this paper, we focus on the performance of meta-learning for algorithm ranking on binary classification datasets only. Previous studies were limited by the small number of available datasets; usually fewer than 100 datasets were used in reported meta-learning experiments. To be able to draw statistically significant conclusions, we chose to use as many datasets as possible from various public data sources, including the UCI,^{1} StatLib,^{2} KDD^{3} and WEKA^{4} repositories. In total, slightly more than 1,000 classification and regression datasets were collected. However, due to the varying quality of the public datasets, not all of them could be used directly. After removing duplicates and very small datasets (fewer than 10 instances), and converting multi-class classification data (by keeping only the top two majority classes) and regression data (by using the mean as a binary splitting point to transform the numeric target into a binary target), a total of 466 datasets was obtained for experimentation. Also, to speed up our experiments, we capped the number of instances in larger datasets to a maximum of 5,000, using stratified random sampling.
5.1 EA-based performance estimation
For simplicity, and in order to speed up experiments, many previous meta-learning experiments have estimated algorithm performance using some pre-specified default parameter settings across all base-level learners. We claim that this approach is bound to be suboptimal in practice because, to be able to make useful predictions, most algorithms have to be optimised separately for each specific dataset. Technically, predicting the full combination of algorithm plus optimal parameter settings is not feasible. We therefore propose an intermediate approach, where we assume that a procedure is available for optimising each algorithm for each dataset, and then predict the ranking of the optimised algorithms. Given a new dataset, the recommended ranking is based on the assumption of using optimal parameter settings. These actual optimal parameter settings would have to be computed for each top-ranked algorithm through the exact same optimisation process that was used for meta-learning.
The parameter optimisation procedure used here is based on evolutionary algorithms. Grid search would be an alternative, but based on recent research (Reif et al. 2012), EA-based techniques seem more efficient. Specifically, we employ the particle swarm optimisation (PSO) based parameter selection technique described in Escalante et al. (2009); Sun et al. (2012). Although EA-based performance estimation is more appropriate, it is time-consuming, especially for large datasets. Therefore, as mentioned above, large datasets were downsampled to 5,000 instances. In this paper, meta-learning is used to rank 20 supervised machine learning algorithms, all of which are implemented in WEKA (Hall et al. 2009). When generating the meta-dataset from the 466 datasets, for each of the 20 algorithms, we manually specify the parameters and their respective value ranges for PSO to optimise. Taking the support vector machine algorithm as an example, we set PSO to optimise the kernel type, all relevant kernel parameters, and the complexity constant.
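For illustration, a minimal PSO loop of the kind used for per-dataset parameter optimisation might look like this; the objective function stands in for cross-validated accuracy, and all constants are illustrative, not the settings of Escalante et al. (2009):

```python
import random

def pso_maximise(objective, bounds, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimiser over a box-constrained parameter space.
    Each particle tracks its personal best; the swarm tracks a global best."""
    random.seed(0)
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # clamp the new position to the parameter's value range
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# toy stand-in for "CV accuracy as a function of (C, gamma)", peaking at (1.0, 0.1)
acc = lambda p: 1.0 - (p[0] - 1.0) ** 2 - (p[1] - 0.1) ** 2
best, best_val = pso_maximise(acc, bounds=[(0.0, 10.0), (0.0, 1.0)])
```

In the real setup the objective would be an expensive cross-validation run of one WEKA base learner, which is why this estimation step dominates the cost of building the meta-dataset.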
5.2 Meta-learners (rankers) in comparison
We test the new pairwise meta-rule based meta-features with different meta-learners. In total, 7 rankers are used in our experiments.
DefaultRanker—The default ranker follows a very simple yet powerful philosophy: if an algorithm has worked well on previous datasets, then it is a good candidate to try first on a new dataset. The default ranker used in our experiments simply uses the average rank of each algorithm over all the training data. Thus the ranking predicted by the default ranker is always the same for every test dataset, namely the average ranking of the training examples. This ranker can also be seen as a special case of an ART model with only one node. One distinguishing feature of the algorithm ranking problem is that this average-ranking-based default ranker is relatively strong compared to other common preference learning problems, such as movie or book recommendation, or survey data involving human subjects.
kNN—The kNN ranker, as described in Sect. 2, uses standard Euclidean distance; we set k = 15 (we will show later that 15 is a relatively good value for our problem).
LRT—The label ranking trees ranker; we use the WEKA-LR^{5} implementation with default parameters. LRT is based on the Mallows model for rank data.
RPC—The RPC ranker, which uses the ranking by pairwise comparison algorithm proposed in Hüllermeier et al. (2008). We use the default setting of the RPC implementation in WEKA-LR, in which the logistic regression algorithm is used as the base learner.
PCTR—The predictive clustering trees for ranking ranker (Todorovski et al. 2002). The minimum number of instances at a leaf node is set to 15.
AdaRank—The AdaRank ranker uses 100 PCTR rankers as its base models, and the minimum number of instances at a leaf node is set to 30 (to simulate a relatively weak learner, as in Xu and Li (2007); Li (2011)). Note that boosting algorithms can stop early, so the final AdaRank ensemble might not have exactly 100 base models.
ARTForests—The ART Forests ranker, where we use 100 approximate ranking trees for ART Forests, and the minimum number of instances γ at a leaf node is set to 1 for each ART (using a small γ such as 1 leads to a good bagging ensemble, as suggested in Breiman (2001)). The "pruning" parameter θ is set to 0.95.
5.3 Evaluation of ranking accuracy
We assess ranking accuracy by comparing the rankings predicted by a ranker for a given dataset with the corresponding target rankings. Given two sets of m-value rankings: T = [T_1, T_2, …, T_{m−1}, T_m] and P = [P_1, P_2, …, P_{m−1}, P_m], which are targets and predictions, respectively, and letting \(d_{i}^{2} = (T_{i} - P_{i})^{2}\), the following ranking evaluation metrics and functions are used in our experiments.
Loose Accuracy—The loose accuracy (LA@X) measurement considers the ranking accuracy of the top X candidates only. LA@1, also called the restricted accuracy metric, returns a count of 1.0 if the top one prediction is correct, and 0 otherwise. Similarly, LA@3 returns a count of 1.0 if one of the predicted top three candidates matches the true top one, and again 0 otherwise. LA has been used for meta-learning in Kalousis (2002). We report results for LA@1, LA@3 and LA@5.
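LA@X can be computed directly from the predicted and target rank vectors; a small sketch with invented rank vectors:

```python
def loose_accuracy(predicted, target, x):
    """LA@X: 1.0 if the true best algorithm appears among the predicted
    top-X candidates, else 0.0. Both arguments are rank vectors (1 = best)."""
    true_best = target.index(min(target))          # index of the truly best algorithm
    return 1.0 if predicted[true_best] <= x else 0.0

# the ranker places the true best algorithm at predicted position 2
la1 = loose_accuracy([2, 1, 3], [1, 2, 3], x=1)   # miss at X=1
la3 = loose_accuracy([2, 1, 3], [1, 2, 3], x=3)   # hit at X=3
```

This makes the metric's tolerance explicit: the same prediction scores 0 under LA@1 but 1.0 under LA@3.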
The actual evaluation scores for the above ranking evaluation functions of each ranker were estimated based on multiple runs of train/test split evaluations. We use the average scores obtained from 10 runs of 90 % vs. 10 % train/test evaluation for result visualization. For each run, 419 (90 % of 466) datasets were randomly selected for building a metalearning system using the corresponding ranker, and the remaining 47 datasets were used for testing.
To avoid information leakage, we make sure that when meta-rule based meta-features are used by a ranker, the rules are generated using only the corresponding training meta-dataset of each run.
5.4 Experimental results
We present and analyse two sets of experimental results. One is a comparison of meta-feature sets based on kNN performance curves; the other is a comparison of the ranking performances of multiple rankers on two meta-feature sets.
Summary of the best ranker performances under different ranking evaluation metrics and functions. Used ‡: whether meta-rules (the SIL+MetaRule set) are used. Gain §: performance gain of the best ranker over the default ranker. Gain †: performance gain of the best ranker using the SIL+MetaRule set over the same ranker using the SIL-only set.
Evaluation  Best Ranker  Used ‡  Gain §  Gain † 

SRCC  ART Forests  Yes  32.41 %  1.67 % 
WRC  ART Forests  Yes  32.83 %  1.77 % 
LA@1  kNN  Yes  93.82 %  31.1 % 
LA@3  ART Forests  Yes  21.85 %  2.00 % 
LA@5  ART Forests  Yes  33.23 %  7.94 % 
NDCG@1  ART Forests  Yes  37.60 %  5.97 % 
NDCG@3  ART Forests  Yes  27.06 %  3.16 % 
NDCG@5  ART Forests  Yes  24.50 %  2.84 % 
6 Discussions and future work
In this section, we discuss the limitations of the proposed techniques and future work for extending and improving the system.
For the pairwise meta-rule method, although the RIPPER algorithm works well, we believe that alternative rule learners are still worth investigating. An important direction for future work is the design of an efficient rule learner for generating small numbers of high-quality pairwise meta-rules, as the number of pairwise rule models required increases quadratically with the number of algorithms to rank. Therefore, the training and output complexity of the chosen rule learner is critical to the scalability of the proposed method. If z is the average number of rules returned by the rule learner for one pair of algorithms, then the total number of pairwise meta-rule based meta-features generated for m algorithms is \(\frac{z m (m-1)}{2}\). Consequently, when using the pairwise meta-rule method, the number of meta-features grows quadratically with the number of algorithms to rank. Some algorithms, like the novel ART Forests algorithm proposed here, scale well with larger numbers of features, whereas for some of the alternative rankers both runtime and ranking performance might suffer from a larger number of features, especially if there are strong correlations present among these features. To overcome this problem, in future work, we will investigate: (1) rule pruning techniques for reducing the total number of pairwise meta-rules; (2) rule set fusion techniques, such as group-wise meta-rules. Another interesting future direction would be using the predicted rank differences, or at least the actual posterior probabilities of the binary problem, instead of the purely Boolean pairwise meta-features, for improved ranking performance.
Regarding the details of the ART algorithm, in this paper we only considered and evaluated using the median value of a meta-feature's dynamic range as a binary split point. A future work direction would be to compare this to alternative split point selection and multi-way splitting strategies.
In this paper, we employed the EA-based parameter selection technique, which is a relatively expensive approach for generating the meta-dataset. In the literature, there are some attempts to use meta-learning itself for the task of parameter selection. Soares et al. (2004) and Ali and Smith-Miles (2006) have shown promising results using meta-learning for selecting the kernel type and kernel parameters for support vector machines. There is also previous work proposing hybrid systems that combine meta-learning and optimisation techniques (Reif et al. 2012; Gomes et al. 2012; de Miranda et al. 2012). These systems could be used as alternative parameter selection procedures for the meta-learning experiment setup proposed in this paper. In Sect. 5.1, we mentioned that predicting the full combination of algorithm plus optimal parameter settings is not feasible. While this is technically true, Leite et al. (2012a, 2012b) have introduced an active-testing-based meta-learning framework that is able to return good parameter settings together with the recommended algorithm. In their experiments, a set of 292 algorithm-parameter combinations was evaluated. The techniques described in the current paper could probably also be applied to meta-learning for parameter recommendation for single learning algorithms, but this needs to be verified in a future experimental study.
7 Conclusions
We have introduced a new approach for generating meta-features for meta-learning. The main difference between our method and stacking is given in Sect. 3. We have also introduced a specialised meta-learning algorithm, based on the random forests framework (Breiman 2001), for predicting algorithm rankings, and provided both a theoretical and an empirical analysis. Unlike previous work, the experiments in this paper were based on a much larger collection of publicly available datasets and a novel, more appropriate meta-learning experiment configuration, which systematically explores the space of parameter values of the algorithms to be ranked.
Our experimental results indicate that the pairwise meta-rule generation method (Method 1) consistently improves the performance of different ranking approaches to meta-learning. The new ART Forests ranker is always among the top rankers across all ranking metrics and functions we have tested. The success of the proposed methods and algorithms on the meta-learning problem studied in this paper suggests the applicability of the new techniques to a wider range of meta-learning problems. Additional materials can be found at http://www.cs.waikato.ac.nz/~qs12/ml/meta/.
Footnotes
References
Ali, S., & Smith-Miles, K. A. (2006). A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1–3), 173–186.
Alvo, M., Cabilio, P., & Feigin, P. D. (1982). Asymptotic theory for measures of concordance with special reference to average Kendall tau. The Annals of Statistics, 10(4), 1269–1276.
Bensusan, H., Giraud-Carrier, C., & Kennedy, C. (2000). A higher-order approach to meta-learning (Technical report). University of Bristol.
Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research, 13, 1063–1095.
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In Proceedings of the fifteenth international conference on machine learning. San Mateo: Morgan Kaufmann.
Brazdil, P., Gama, J., & Henery, B. (1994). Characterizing the applicability of classification algorithms using meta-level learning. In Proceedings of the European conference on machine learning.
Brazdil, P., Soares, C., & Da Costa, J. P. (2003). Ranking learning algorithms: using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3), 251–277.
Brazdil, P., Giraud-Carrier, C., Soares, C., & Vilalta, R. (2009). Metalearning: applications to data mining. Berlin: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Cheng, W., & Hüllermeier, E. (2008). Instance-based label ranking using the Mallows model. In Workshop proceedings of preference learning, Antwerp, Belgium.
Cheng, W., Hühn, J., & Hüllermeier, E. (2009). Decision tree and instance-based learning for label ranking. In Proceedings of the 26th international conference on machine learning (ICML-09), Montreal, Canada (pp. 161–168).
Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the 12th international conference on machine learning. San Mateo: Morgan Kaufmann.
de Miranda, P., Prudencio, R., Carvalho, A., & Soares, C. (2012). Combining a multi-objective optimization approach with meta-learning for SVM parameter selection. In 2012 IEEE international conference on systems, man, and cybernetics (pp. 2909–2914).
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Escalante, H. J., Montes, M., & Sucar, L. E. (2009). Particle swarm model selection. Journal of Machine Learning Research, 10, 405–440.
Giraud-Carrier, C. (2008). Metalearning—a tutorial. In Proceedings of the 7th international conference on machine learning and applications. San Mateo: Morgan Kaufmann.
Gomes, T. A., Prudêncio, R. B., Soares, C., Rossi, A. L., & Carvalho, A. (2012). Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1), 3–13.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16–17), 1897–1916.
Jankowski, N., Duch, W., & Grabczewski, K. (Eds.) (2011). Studies in computational intelligence: Vol. 358. Meta-learning in computational intelligence. Berlin: Springer.
 Järvelin, K., & Kekäläinen, J. (2002). Cumulated gainbased evaluation of ir techniques. ACM Transactions on Information Systems, 20(4), 422–446. CrossRefGoogle Scholar
 Kalousis, A. (2002). Algorithm selection via metalearning. PhD thesis, Department of Computer Science, University of Geneva. Google Scholar
 Kalousis, A., & Hilario, M. (2001). Model selection via metalearning: a comparative study. International Journal on Artificial Intelligence Tools, 10(04), 525–554. CrossRefGoogle Scholar
 Kendall, M. G. (1970). Rank correlation methods. London: Griffin. zbMATHGoogle Scholar
 Leite, R., & Brazdil, P. (2005). Predicting relative performance of classifiers from samples. In Proceedings of the 22nd international conference on machine learning. Google Scholar
 Leite, R., Brazdil, P., & Vanschoren, J. (2012a). Selecting classification algorithm with active testing on similar datasets. In Proceedings of the 5th international workshop on planning to learn. Google Scholar
 Leite, R., Brazdil, P., & Vanschoren, J. (2012b). Selecting classification algorithms with active testing. In P. Perner (Ed.), Lecture notes in computer science: Vol. 7376. Machine learning and data mining in pattern recognition (pp. 117–131). Berlin Heidelberg: Springer. CrossRefGoogle Scholar
 Li, H. (2011). Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies, 4(1), 1–113. CrossRefGoogle Scholar
 Marden, J. I. (1995). Analyzing and modeling rank data. London: Chapman & Hall. zbMATHGoogle Scholar
 Pfahringer, B., Bensusan, H., & GiraudCarrier, C. (2000). Metalearning by landmarking various learning algorithms. In Proceedings of the 17th international conference on machine learning. Google Scholar
 Pinto da Costa, J., & Soares, C. (2005). A weighted rank measure of correlation. Australian & New Zealand Journal of Statistics, 47(4), 515–529. MathSciNetzbMATHCrossRefGoogle Scholar
 Reif, M., Shafait, F., & Dengel, A. (2012). Metalearning for evolutionary parameter optimization of classifiers. Machine Learning, 87, 357–380. MathSciNetCrossRefGoogle Scholar
 Schaffer, C. (1994). A conservation law for generalization performance. In Proceedings of the 11th international conference on machine learning (pp. 259–265). San Mateo: Morgan Kaufmann. Google Scholar
 Serban, F., Vanschoren, J., Kietz, J.U., & Bernstein, A. (2013). A survey of intelligent assistants for data analysis. ACM Computing Surveys. doi: 10.5167/uzh73010. Google Scholar
 SmithMiles, K. A. (2009). Crossdisciplinary perspectives on metalearning for algorithm selection. ACM Computing Surveys, 41(1), 6:1–6:25. Google Scholar
 Soares, C. (2004). Learning ranking of learning algorithms. PhD thesis, Department of Computer Science, University of Porto. Google Scholar
 Soares, C., Brazdil, P. B., & Kuba, P. (2004). A metalearning method to select the kernel width in support vector regression. Machine Learning, 54(3), 195–209. zbMATHCrossRefGoogle Scholar
 Sun, Q., Pfahringer, B., & Mayo, M. (2012). Full model selection in the space of data mining operators. In Proceedings of the 14th international conference on genetic and evolutionary computation conference companion. Google Scholar
 Todorovski, L., Blockeel, H., & Dzeroski, S. (2002). Ranking with predictive clustering trees. In Proceedings of the 13th European conference on machine learning. Berlin: Springer. Google Scholar
 Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259. CrossRefGoogle Scholar
 Wolpert, D., & Macready, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. CrossRefGoogle Scholar
 Xu, J., & Li, H. (2007). Adarank: a boosting algorithm for information retrieval. In Proceedings of the 30th international conference on research and development in information retrieval. New York: ACM. Google Scholar