Abstract
In this paper, we present a novel metafeature generation method in the context of metalearning, which is based on rules that compare the performance of individual base learners in a oneagainstone manner. In addition to these new metafeatures, we also introduce a new metalearner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several stateoftheart metalearners. Our experimental results are based on a large collection of datasets and show that the proposed new techniques can improve the overall performance of metalearning for algorithm ranking significantly. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.
Introduction
Training a good model for a given dataset is one of the common tasks of a data analyst. The straightforward approach of simply applying and optimising all known learning algorithms is usually not feasible. Thus an experienced data analyst will generally perform some form of preliminary analysis and then focus on a few promising algorithms. This selection is guided by the prior experience of the analyst. Metalearning tries to support and automate this process. It tries to predict the probably best or close to best algorithms, thus considerably reducing the amount of training and optimisation time needed for finding a good model on a given dataset. This reduction in resource consumption should be accompanied by no or only a small loss in predictive performance when compared to the best possible result (Brazdil et al. 2003, 2009). Metalearning for algorithm ranking uses a general machine learning approach to generate metaknowledge mapping the characteristics of a dataset, captured by metafeatures, to the relative performances of the available algorithms.
The advantage of the metalearning approach is that high quality algorithm ranking can be done on the fly, i.e., in seconds, which is particularly important for business domains that require rapid deployment of analytical techniques. In machine learning research, metalearning has also been used for boosting the performance of evolutionaryalgorithmbased model selection techniques (Reif et al. 2012). The work by SmithMiles (2009) has discussed crossdisciplinary perspectives on metalearning. Theoretical motivations on metalearning research are due to the No Free Lunch (NFL) theorem (Wolpert and Macready 1997) and the Law of Conservation for Generalization Performance (LCG) (Schaffer 1994). Recent research seems to suggest that it is possible to show that metalearning offers a viable mechanism for building a “generalpurpose” learning algorithm (GiraudCarrier 2008). For a comprehensive review of metalearning research and its applications, we refer the reader to GiraudCarrier (2008); Brazdil et al. (2009); Jankowski et al. (2011); Serban et al. (2013).
This paper makes three contributions: a novel metafeature generator for metalearning, a new metalearner, and a more appropriate experimental configuration, for improving the overall performance of metalearning for algorithm ranking.
Background
“Metalearning” is a broad field; in this paper, for simplicity, the term metalearning is used in the sense of “metalearning for algorithm ranking or recommendation”. We first introduce the mechanics of metalearning. The basic steps of a metalearning task are as follows. Firstly, a set of datasets is collected; secondly, we need to define some metafeatures as the characteristics of each dataset, e.g., the number of instances, the number of numeric or categorical features, and many more; for more details on metafeatures please refer to Sect. 3. Thirdly, we estimate the predictive performance of the available algorithms, e.g., using crossvalidation, for every dataset in the dataset collection. Thus, for each dataset we get a list of available algorithms with their performance estimates. Given the above information, we can construct a metadataset, which is a n×m data matrix, where m=m _{ f }+m _{ t }. Here, m is the sum of the number of metafeatures m _{ f } and the number of algorithms m _{ t }, and n is the number of datasets. Figure 1 (lefthand side table) shows an example of a metadataset in a general multitarget regression setting.
For algorithm recommendation, our goal is not to predict the absolute expected performance of any algorithm, but rather the relative performance between algorithms. Thus, the metadataset can be transformed to represent the rankings of the algorithms. Figure 1 (from lefthand side to righthand side table) shows an example of how a “raw” metadataset is transformed. In this case, ranking is a special case of the general multitarget regression setting. Then, a metalearner, which we will call it a “ranker” here, will take this n×m matrix as its training input to learn a ranking model. Given a new dataset, we first calculate its metafeatures and use the metafeatures (e.g., the f _{1},f _{2},f _{3} values for the example in Fig. 1) as input to the ranker. The ranker finally returns a ranked list of all algorithms. As is true in general for machine learning, the performance of metalearning depends crucially on the quality of both the metafeatures and the metalearners available. In addition, the performance of the baselevel learners must also be estimated to a high quality to enable successful metalearning.
Existing metalearning systems are mainly based on three types of metafeatures: statistical, informationtheoretic and landmarkingbased metafeatures, or SIL for short. Thorough reviews of these metafeatures can be found in Brazdil et al. (1994); Pfahringer et al. (2000); Kalousis (2002); Soares (2004). Recently, more metafeature sets have been developed, for example, the learningcurvebased metafeatures (Leite and Brazdil 2005), treestructureinformationbased (Bensusan et al. 2000) and histogrambased (Kalousis 2002).
In this paper, we propose a novel method for metalearning that generates “metalevel” metafeatures from the “baselevel” metafeatures via rule induction that tries to predict the better algorithm for each pair of algorithms. Here the “baselevel” metafeatures can come from any of the metafeature sets mentioned above. Metafeatures generated by the proposed method (described in Sect. 3) contain pairwise information between algorithms. Adding them to the original feature space can help to improve the performance of many metalearners.
Next, we will discuss several ranking approaches from an algorithmic perspective as the underlying multitarget or ranking problem is being studied in various fields, including social sciences, economics and mathematical sciences. Although the terminologies and presentations of the same models can be quite different for different fields, the algorithmic properties are similar in concept and relatively easy to describe.
The knearest neighbors approach
The kNN ranking approach has two steps: the nearest neighbor search step and the ranking generation step. In the first step, given a new dataset, we first calculate its metafeatures to construct an instance as a query (an m _{ f }value array). Then, we select a set of instances (nearest neighbors) in the training set (the n×m _{ f } data matrix) that are similar to the query instance. The similarity between instances is usually based on a distance function, e.g., Euclidean distance. In the second step, we combine the rankings of the nearest neighbors to generate an aggregated algorithm ranking for the new dataset. For our experiments, we use the average ranks method described in Brazdil et al. (2009). Let R _{ i,j } be the rank of algorithm T _{ j },j=1,…,t on dataset i, where t is the number of algorithms. The average rank for each algorithm T _{ j } is defined as: \(\bar{R_{j}} = (\sum_{i=1}^{k} R_{i,j}) / k\), where k is the number of nearest neighbors. The kNN ranker’s performance is related to the value of k, appropriate values for k are usually determined by crossvalidation. The kNN ranker is often used as a benchmark learner for testing the performance of different metafeature sets.
The binary pairwise classification approach
Given the n×m data matrix as the training data, multiple binary (pairwise) classification models can be used to construct a ranking model. For example, if there were three metafeatures and three algorithms in the training set (e.g., Fig. 1 righthand side table), one could build three binary classification models for each pair of algorithms: Algorithm1 vs. Algorithm2, Algorithm1 vs. Algorithm3, and Algorithm2 vs. Algorithm3. The training data for the three binary classification models are the same, which is the n×m _{ f } data matrix. Given a new dataset, we first calculate its metafeatures, again an m _{ f }value array as a query. Then, we use the three binary classification models to classify the query. The final algorithm ranking list for the new dataset is computed based on how many times each algorithm has been predicted as “is better”. If there were more than three algorithms to rank, then there might be ties in the list. Several tie breaking techniques have been examined in the literature (Kalousis 2002; Brazdil et al. 2009), but usually this choice does not have a strong influence on the performance of a metalearning system.
The advantage of the binary classification ranking approach is that existing binary classification algorithms can be employed directly. However, if there are many algorithms to rank, then the number of binary models required can be large and hard to manage in practice, e.g., for 20 algorithms, this would require \(\frac{T \times (T1) }{2} = \frac{20 \times 19 }{2} = 190\) binary classification models to be built, which could be costly if one also considered using different—and fine tuned—classification algorithms for each of the 190 binary classification problems. The binary pairwise classification approach has also been studied in label ranking (Hüllermeier et al. 2008).
The learning to rank approach
From the enduser’s perspective, algorithm ranking is similar to the ranking problem in a search engine scenario, where a ranked list is returned for responding to a query. The search engine scenario has been extensively studied in learning to rank and information retrieval (IR). One obvious candidate for rank prediction is the AdaRank algorithm proposed in Xu and Li (2007). AdaRank is a boosting algorithm that minimizes a loss function directly defined on the ranking performance measure. The algorithm repeatedly constructs “weak rankers” on the basis of reweighted training data and finally linearly combines the weak rankers for making ranking predictions. In addition to supplying algorithms, the IR literature is also an excellent source for ranking evaluation metrics. Similar to search engine users, metalearning users are usually mainly interested in the top candidates, be it websites or algorithms. An IR measure which captures this bias towards the top ranked items well is the normalized discounted cumulative gain (NDCG) metric (Järvelin and Kekäläinen 2002; Li 2011), which has not been used in metalearning evaluation previously.
The label ranking approach
Label ranking can be seen as an extension of the conventional setting of classification. The former can be obtained from the latter by replacing single class labels by complete label rankings (Cheng and Hüllermeier 2008; Cheng et al. 2009). From an algorithmic view point, one type of label ranking approach, such as the ranking by pairwise comparison (RPC) algorithm (Hüllermeier et al. 2008), is based on extending the pairwise binary classification approach by using more sophisticated ranking aggregation methods. Another type of label ranking algorithm, such as the label ranking trees (LRT) algorithm (Cheng et al. 2009), tries to apply probabilistic models under the label ranking setting. Metalearning for algorithm ranking using the multitarget regression setting can also be transformed to a label ranking problem, so that label ranking algorithms can be used directly.
The multitarget regression approach
As we can see in Fig. 1, the algorithm ranking problem can also be seen as a multitarget (multiresponse) regression problem, where the rank position values of each algorithm are the targets in the multitarget regression setting. In multitarget regression, the task is to estimate several target, or response variables using a common set of input variables or features. Two approaches are usually employed. One is to build separate singletarget regression models for each target variable; the other is to use a single model to estimate all the targets simultaneously. For metalearning, the former approach is similar to the binary classification ranking approach but requiring fewer models to be built, e.g., for 20 algorithms, only 20 singletarget regression models are needed. In this section, we focus on the latter approach and start from the linear model. A linear multitarget regression model can be expressed as:
where W ^{∗} is the regression coefficient matrix (an n×m _{ f } matrix), X is the data (feature value) matrix, Y is the target matrix (which can be rankings) and E is the noise term matrix. The problem is usually formulated as a single convex optimisation problem of the form:
where f(W) is a loss function, e.g., squared loss, g(W) is the regularization term, e.g., l ^{2}normbased, and r is a constant. Although this kind of formulation is mathematically straightforward, efficient numerical techniques and relatively easytoimplement algorithms are not readily available for sophisticated rankingbased loss/regularization functions. Also, algorithms in this category are usually designed for relatively lowdimensional problems or it is assumed that the number of training instances is much greater than the number of features, or the number of model parameters to estimate. In addition, we know that the various kinds of metafeatures are “logically” related, so a nonlinear model might be more appropriate. Figure 2 shows two model output examples from a decision tree learner and a propositional rule learner for a binaryclassificationbased metalearning problem, where only two algorithms are considered. In this example, three types of metafeatures (SIL) are used. We can see that whether an algorithm is preferred depends on some logical relationships between metafeatures. For example, the second rule in Fig. 2(b) suggests that when there are no strong features (indicated by the “weak” treebased landmarkers’ AUCs being low) then SGD (a WEKA implementation of an algorithm based on stochastic gradient descent) is likely to work better than Naive Bayes. For these reasons we leave the linear, singlemodel multitarget approach for future research.
Here, we focus on a special class of algorithms that are based on a nonlinear multitarget model:
where z is a decisiontreebased function. A wellknown algorithm of this kind is called predictive clustering trees (PCT) (Blockeel et al. 1998), which have been adapted for ranking problems in Todorovski et al. (2002). There are several advantages in using decisiontreebased rankers, including: a decision tree naturally identifies partitions of the data, where “similar” individuals are supposed to be in the same “cluster”. This leads to a relatively natural interpretation for the metalearning mechanism at the conceptual level; decision tree algorithms are reasonably fast as their training complexity is nlogn, where n is the number of training instances. In Sect. 4, we introduce a new ranker, which is an ensemble of decisiontreebased rankers.
Pairwise MetaRules
In this section, we introduce a novel metafeature generation method in the context of metalearning. The main motivation comes from our observation that existing metafeature sets have ignored one potential source of information: the logical pairwise relationships between each pair of the target algorithms to rank. Explicitly adding this information to the metafeature space might improve a metalearner’s predictive accuracy.
Of course, when predicting for a new dataset, this information will not be available, as otherwise we would not try to predict rankings in the first place. Therefore we propose to use a rule learner to learn pairwise rules first, and then use these rules as new metafeatures. Two steps are involved: the first step is similar to the binary classification ranking approach, where \(\frac{T \times (T1) }{2}\) (T is the number of algorithms to rank) binary classification training datasets are constructed from the original n×m data matrix. Each binary dataset has two class labels: {“algorithm t _{ i } is better”, “algorithm t _{ j } is better”}. Whether an algorithm is better than the other is determined by their ranking position in the ranked list of a particular dataset.
Taking the dataset illustrated in the righthand side table of Fig. 1 as an example, we need to construct \(\frac{3 \times (31) }{2} = 3\) binary datasets as shown in Fig. 3. Then, we build 3 rulebased binary classification models based on the 3 binary classification metadatasets. In principle any rule learner could be used. We choose the RIPPER algorithm (Cohen 1995), because it is relatively fast and its rule models are generally very compact compared to other rule learners. Figure 2(b) shows a RIPPER rule model. For each pair of algorithms, we compute a ruleset describing in which situation an algorithm is to be preferred over another. We call these rules the Pairwise MetaRules. Next, we discuss how these rules can be used as new metafeatures. Using Fig. 2(b) as an example, this RIPPER ruleset comprises three rules. The first two are individual rules, whereas the third one is a default catchall rule. We generate two different sets of metafeatures from Pairwise MetaRules.
Method 1 turns each individual rule into one boolean metafeature. For example, we may have a Pairwise MetaRule:
If BaseMetaFeatureX ≤ 0.5 AND BaseMetaFeatureY ≥ 0, Then Algorithm A is better than Algorithm B.
The value of the new metafeature constructed from this MetaRule will be determined by looking at the (baselevel) metafeature values of a new dataset defined by a MetaRule. For a new dataset, the MetaRulebased metafeature value is set to true if the rule condition “BaseMetaFeatureX ≤ 0.5 AND BaseMetaFeatureY ≥ 0” is met, otherwise false. For the metalearning problem, the RIPPER algorithm returns on average about two individual rules for each of the 190 algorithm pairs.
Method 2 creates just one boolean metafeature representing the outcome of applying the full ruleset. For a new dataset, the MetaRulebased metafeature value is set to true if the default catchall rule of a ruleset is true, otherwise false. For the metalearning problem, the RIPPER algorithm returns 190 rulesets for each of the 190 algorithm pairs, so in Method 2, 190 MetaRulebased metafeatures are added to the original feature space.
The Pairwise MetaRules method is different from the standard stacking (Wolpert 1992) method, where we do not use the predicted complete ranking of base models instead we use the RIPPER rule sets to construct new metafeatures. A metalearner will use all metafeatures (including both baselevel and MetaRulebased metafeatures) for building the final model, which is also different from stacking.
In the experiments reported below we compare three different metafeature sets:
SILOnly: 80 different SIL metafeatures only;
SIL+MetaRules1: the 80 SIL metafeatures plus metafeatures generated by Method 1;
SIL+MetaRules2: the 80 SIL metafeatures plus metafeatures generated by Method 2.
Next, we propose a novel metalearner (ranker), specifically designed for the metalearningbased algorithm ranking problem.
Approximate ranking tree forests for metalearning
In preliminary experiments we found that the predictive performances of the predictive clustering trees for ranking (PCTR) (Todorovski et al. 2002) and the label ranking trees (LRT) (Cheng et al. 2009) algorithms usually are worse than the optimised kNN ranker for our metalearning problem, where the number of objects to rank is relatively large (20 algorithms). Inspired by the bootstrap aggregating (bagging) (Efron and Tibshirani 1993; Breiman 1996) strategy and the random forests framework (Breiman 2001), we here propose a new ranker for the metalearning problem, called Approximate Ranking Tree Forests (ART Forests), which is a random forest ensemble of approximate ranking trees. The motivations for proposing ART Forests include:

1.
as more and more metafeatures are being developed and added to the feature space, we believe that a relatively fast algorithm (e.g., decisiontreebased learners) that has “builtin” feature selection capacity would be useful;

2.
the metalearner should not be constrained by the “number of training examples must be much greater than the number of parameters to estimate” restriction, because for the metalearning problem the number of datasets is usually not very large;

3.
recent theoretical work (Biau 2012) seems to suggest that irrelevant features do not significantly decrease random forest’s predictive performance, so ideally the metalearner should be capable of using as many metafeatures as possible;

4.
ensemble algorithms usually outperform base algorithms. Statistics and models aggregated from bootstrap samples are nonparametric estimates (Efron and Tibshirani 1993), so we do not have to make parametric assumptions about the form of the underlying population, which provides an automatic approximation layer to our new algorithm. In the literature, an empirical study by Kalousis and Hilario (2001) has shown that boostingbased metalearner outperformed the kNN based metalearner in the pairwise binary classification setting (Sect. 2.2) for metalearning.
Approximate Ranking Trees (ART)
In this section, we propose the ART algorithm which is used as the base learner for ART Forests. The pseudocode is given in Algorithm 1. ART is a recursive partitioning algorithm that splits data into smaller subpartitions which are increasingly and relatively more similar in terms of a homogeneous test statistic for examples in different partitions.
Next, we introduce the rationale underlying ART. We first review some basic concepts from permutation theory that will be used in this section, adopting the notation of Marden (1995) for analysing rank data in the multitarget setting. The basic unit of analysis consists of n datasets as judges to rank a set of m algorithms. This set of algorithms can be denoted by \(\mathcal{O} = \{O_{1}, O_{2},\ldots,O_{m}\}\). For the metalearning problem, we have a complete ranking of algorithms for each dataset: there is a best algorithm, second best, … , and finally a worst one, with ranks of 1,2,…,m, respectively. We here use \(\mathcal{S}_{m}\) to denote the permutations of m ranks:
so the multitarget part of our metadataset can be expressed as a sample of complete ranking vectors:
For example, the targets part of Fig. 1, righthand side table, consists of three rankings: y ^{〈1〉}=[2,1,3]′,y ^{〈2〉}=[2,3,1]′,y ^{〈3〉}=[3,2,1]′.
ART’s splitting criterion
The splitting criterion is a measure that quantifies the quality of a given split point. In general, a local model is fitted into different partitions. In regression trees, the local model is usually the mean target value of the examples in a partition. This approach is fast and relatively stable, especially when regression trees are used as base models for a tree ensemble, since the mean is an optimal estimator for i.i.d. examples. The analogy in ranking trees is that we need to define or estimate a central ranking \(\hat{z}\) to be used as the local model so that \(\hat{z}\) minimizes a rankingbased dispersion measure, such as the average distance corresponding to a distance function d(.,.) of ranking vectors:
When n is large, given a proper distance function, the sample central ranking could be estimated using maximumlikelihood estimation (MLE) or Bayesian methods for certain ranking models, such as the Mallows models (Marden 1995), but these methods are usually computationally heavy, hence not suitable for ensembling. For ART, we use the average ranking calculated from the average rank vector as an approximation to the central ranking of rankings in a partition: \(\hat{z} = \mathit{averageRanking}(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y^{ \langle i \rangle} )\), in terms of the Spearman’s distance:
When d is the Spearman’s distance (Eq. 7), based on Theorem 2.2 of Marden (1995), the average ranking averageRanking(\(\bar{y} \)) belongs to the set of the sample central rankings, which provides a theoretical foundation for ART.
Alternative distance functions could also be used but we will show later (Eq. 11) that the Spearman’s distance has some convenient mathematical properties. To get a feeling about what an average ranking is in our metalearning problem, Fig. 4 shows an example of the average ranks of 20 algorithms over 466 datasets.
In the ART algorithm, we use the median value of a metafeature’s range as the binary split point to split the data D, the current partition, into two subpartitions D ^{+} and D ^{−}. The best split point is determined to be the one that maximises the R ^{2} statistic:
where L is the number of partitions, and n ^{(l)} is the number of examples in partition l. R ^{2} is originally designed to measure the proportion of the spread explained by the differences between the two partitions (Marden 1995). There are two special cases: if the sample central rankings of different partitions are equal, such as
then R ^{2}=0 implies that partitioning is not necessary; if there is only one distinct ranking within each partition, and they are not all the same between partitions, then R ^{2}=1. Here we estimate R ^{2} and employ it as a heuristic for ART induction. Next, we will derive an “onestep” formula for calculating R ^{2} without calculating the distance between the central ranking and the other rankings in a partition.
At inner nodes, ART tests two partitions D ^{+} and D ^{−} so we have L=2, and we can rewrite Eq. (8) to:
where D=D ^{+}∪D ^{−}.
For rank data the average distance is also a proper measure of spread \(\hat{\alpha}\): \(\hat{\alpha}_{Spearman} \equiv \bar{d}_{Spearman}\) (Marden 1995). Alvo et al. (1982) and Marden (1995) present a convenient mathematical result for Spearman’ distance (a brief derivation is given in the Appendix):
Here we replace \(\bar{d}_{Spearman}\) with \(\hat{\alpha}_{Spearman}\), and rewrite Eq. (10) to:
Combining everything together, we can compute R ^{2} (Eq. (8)) efficiently:
where \(h=\frac{m(m+1)(2m+1)}{3}\).
ART’s stopping criterion
In order to prevent overfitting, a stopping criterion is usually used to determine if it is worthwhile to split the current node. A natural stopping criterion can be introduced by adding a regularization parameter θ, e.g., ART will stop growing the current node if:
Another stopping criterion is the minimum number of examples at a leaf node, γ. Setting γ to a small number usually outperform trees using larger γ values when the tree algorithm is used as the base learner for random forests (Breiman 2001). ART uses both of these criterias to limit tree sizes.
ART forests and rank aggregation
We use the random forests framework described in Breiman (2001) for ART Forests induction. Algorithm 2 shows the pseudocode. An ART forest is grown as follows:

1.
The training set is a bootstrap sample from the original training set;

2.
An integer u is set by the user. At each node, u features are selected at random and the node is split on the best feature among the selected u;
For prediction, when a test feature vector x is put down each tree, it is assigned the average ranking of the rankings at the node it stops at. The average ranking of these over all approximate ranking trees in the forest is the predicted ranking for x. Alternative rank aggregation methods, such as other types of Borda count, graph theory models, or binary linear programs, might perform better, at the expense of higher computational cost, and will be investigated in future research.
Experiment setup and results
This experimental comparison investigates two related questions.

The first question concerns the new metafeature generation method: can it consistently improve the performance of known metalearners?

The second question concerns the new algorithm: are ART Forests competitive with current stateoftheart rankers?
In this paper, we focus on the performance of metalearning for algorithm ranking on binary classification datasets only. Previous studies were limited by the small number of available datasets, usually fewer than 100 datasets were used in reported metalearning experiments. To be able to draw statistically significant conclusions, we chose to use as many datasets as possible from various public data sources, including the UCI,^{Footnote 1} StatLib,^{Footnote 2} KDD^{Footnote 3} and WEKA^{Footnote 4} repositories. In total slightly more than 1,000 classification and regression datasets were collected. However, due to the varying quality of the public datasets, not all of them could be used directly. After removing duplications and very small datasets (less than 10 instances), and converting multipleclass classification data—by keeping only the top two majority classes—and regression data—by using the mean as a binary splitting point to transform the numeric target into a binary target, a total of 466 datasets was obtained for experimentation. Also, to speed up our experiments we capped the number of instances in larger datasets to a maximum of 5,000 instances, using stratified random sampling.
EAbased performance estimation
For simplicity and in order to speed up experiments, many previous metalearning experiments have estimated algorithm performance using some prespecified default parameter settings across all baselevel learners. We claim that this approach is bound to be suboptimal in practice because to be able to make useful predictions, most algorithms have to be optimised separately for each specific dataset. Technically, predicting the full combination of algorithm plus optimal parameter settings is not feasible. We therefore propose an intermediate approach, where we assume that a procedure is available for optimising each algorithm for each dataset, and then predict the ranking of the optimised algorithms. Given a new dataset, the recommended ranking is based on the assumption of using optimal parameter settings. These actual optimal parameter settings would have to be computed for each topranked algorithm through the exact same optimisation process that was used for metalearning.
The parameter optimisation procedure used here is based on evolutionary algorithms. Gridsearch would be an alternative, but based on recent research (Reif et al. 2012), EAbased techniques seem more efficient. Specifically, we employ the particle swarm optimisation (PSO) based parameter selection technique described in Escalante et al. (2009); Sun et al. (2012). Although EAbased performance estimation is more appropriate, it is timeconsuming, especially for large datasets. Therefore, as mentioned above, large datasets were downsampled to 5,000 instances. In this paper, metalearning is used to rank 20 supervised machine learning algorithms, all of which are implemented in WEKA (Hall et al. 2009). When generating the metadataset from the 466 datasets, for each of the 20 algorithms, we manually specify parameters and their respective value ranges for PSO to optimise. Taking the support vector machine algorithm as an example, we set PSO to optimise the kernel type, all relevant kernel parameters, and the complexity constant.
The class distributions of the 466 datasets vary a lot, with some of the datasets being very skewed, which can cause high variance on zeroone loss estimation. Consequently, we choose the area under the receiver operating characteristic curve (AUC) metric as the main performance measure, as it is less affected by class skew. We run the 20 algorithms, with PSObased parameter optimisation, on the 466 binary classification datasets and use 10fold crossvalidation based AUC scores for ranking generation. Building up this metadataset is the most expensive part of our metalearning experiment: it took roughly three weeks, or about 6,000 singlecore CPU hours, to complete on two 2.8 GHz 16 G RAM PCs, with 6 threads running on each machine. Figure 5 shows the percentage of improvement of the best AUC score among the 20 PSOoptimised algorithms for each dataset over the best AUC score of the same 20 algorithms using their default parameters. The result demonstrates the benefit of using the performances of optimised algorithms for generating algorithm rankings. We claim this is a more appropriate experimental setting for metalearning than simply using a fixed set of parameters for every dataset.
Metalearners (rankers) in comparison
We test the new pairwise metarule based metafeatures with different metalearners. In total 7 rankers are used for our experiments.
DefaultRanker—The default ranker follows a very simple yet powerful philosophy: if an algorithm has worked well on previous datasets, then it is likely to be used as the first algorithm to try on a new dataset. So, the default ranker used in our experiments is simply using the average rank of each algorithm over all the training data. Thus the ranking predicted by the default ranker is always the same for every test dataset, which is the average ranking of the training examples. This ranker can also be seen as a special case of an ART model with only one node. One distinguishing feature of the algorithm ranking problem is that this averagerankingbased default ranker is relatively strong compared to other common preference learning problems, such as movie or book recommendation, or survey data involving human subjects.
kNN—The kNN ranker, as described in Sect. 2, uses standard Euclidean distance and we set k=15 (we will show later that 15 is a relatively good value for our problem).
LRT—The label ranking trees ranker, we use the WEKALR^{Footnote 5} implementation with default parameters. LRT is based on the Mallows model for rank data.
RPC—The RPC ranker, which uses the ranking by pairwise comparison algorithm proposed in Hüllermeier et al. (2008). We use the default setting of the RPC implementation in WEKALR in which the logistic regression algorithm is used as the base learner.
PCTR—The PCTR ranker, which is the predictive clustering trees for ranking (PCTR) ranker (Todorovski et al. 2002). The minimum number of instances at a leaf node is set to 15.
AdaRank—The AdaRank ranker uses 100 PCTR rankers as its base models and the minimum number of instances at a leaf node is set to 30 (to simulate a relatively weak learner as in Xu and Li (2007); Li (2011)). Please note that boosting algorithms can stop early so the final AdaRank ensemble might not have exactly 100 base models.
ARTForests—The ARTForests ranker, where we use 100 approximate ranking trees for ART Forests, and the minimum number of instances γ at a leaf node is set to 1 for each ART (using small γ such as 1 leads to a good bagging ensemble as suggested in Breiman (2001)); The “pruning” parameter θ is set to 0.95.
Evaluation of ranking accuracy
We assess ranking accuracy by comparing the rankings predicted by a ranker for a given dataset with the corresponding target rankings. Given two sets of mvalue rankings: T=[T _{1},T _{2},…,T _{ m−1},T _{ m }] and P=[P _{1},P _{2},…,P _{ m−1},P _{ m }], which are targets and predictions, respectively, and letting \(d_{i}^{2} = {(T_{i}  P_{i})^{2}}\), the following ranking evaluation metrics and functions are used in our experiments.
Spearman’s Rank Correlation Coefficient (SRCC)—SRCC is defined as:
which assesses how well the relationship between the true and predicted rankings can be described using a monotonic function (Kendall 1970).
Weighted Rank Correlation—WRC is defined as:
The WRC metric puts more weight on the top candidates. It has been used for metalearning in Soares (2004); Pinto da Costa and Soares (2005).
Loose Accuracy—The loose accuracy (LA@X) measurement considers ranking accuracy of the top X candidates only. LA@1 is also called the restricted accuracy metric, returning a count of 1.0 if the top one prediction is correct, and otherwise returning 0. Similarly, LA@3 returns a count of 1.0 if one of the predicted top three candidates matched the true top one, and again returning 0 otherwise. LA has been used for metalearning in Kalousis (2002). We report results for LA@1, LA@3 and LA@5.
Normalized Discounted Cumulative Gain—Discounted cumulative gain (DCG) is a measure of effectiveness of a search engine algorithm or related applications by using a graded relevance scale of items in a result list (Järvelin and Kekäläinen 2002). The gain is accumulated from the top of the list to the bottom with the gain of each result discounted at lower ranks. The DCG@X is defined as: \(\mathrm{DCG@{X}} = \sum_{i=1}^{X} \frac{ 2^{g_{i}}  1 }{ \log_{2}(i+1)}\), where g _{ i } is a grade value. The Normalized DCG at position X is defined as:
here IDCG is the ideal DCG at X. We report results for NDCG@1, NDCG@3 and NDCG@5.
The actual evaluation scores for the above ranking evaluation functions of each ranker were estimated based on multiple runs of train/test split evaluations. We use the average scores obtained from 10 runs of 90 % vs. 10 % train/test evaluation for result visualization. For each run, 419 (90 % of 466) datasets were randomly selected for building a metalearning system using the corresponding ranker, and the remaining 47 datasets were used for testing.
To avoid information leaking, we make sure that when metarule based metafeatures are used by a ranker, the rules are generated using only the corresponding training metadataset of each run.
Experimental results
We present and analyse two sets of experimental results. One is a comparison of metafeature sets based on kNN performance curves; the other one is a comparison of ranking performances of multiple rankers on two metafeature sets.
Figure 6 shows the performances of the kNN ranker using three metafeature sets under different numbers of nearest neighbors (k). The performance of the default ranker is also included. Overall, k values between 10 and 20 usually produce relatively good performance across all eight ranking metrics. Regarding the choice of metafeature sets, we can see that kNN using the SIL+Metarules1 set outperforms the SILonly and the SIL+Metarules2 metafeature sets when appropriate k values are chosen. Taking the LA@1 metric as an example, the best kNN performance is about 0.26, which means that on average, if an optimal k value is used, the probability for the kNN ranker to predict correctly the best algorithm to use from 20 algorithms is 26 %, where the default ranker achieves about 14 %, and random guessing would only result in 1/20=5 %. The LA@3 and LA@5 metrics behave similarly. Regarding the other five metrics, they all usually outperform the baseline, albeit to varying degrees, always clearly depending on a proper choice of the value of k.
With the second set of results, we compare different rankers on two sets of metafeatures: the SILonly set and the SIL+Metarules1 set. For simplicity, we use SIL+Metarules to refer the SIL+Metarules1 set. Figure 7 shows the performances of the seven rankers (with and without using the metarules) on the eight metrics. A “*” besides a ranker’s name means the SIL+Metarules set is significantly better than the SILonly set (using a paired ttest significance level of 0.05). In total, the SIL+Metarules set significantly wins 38 out of 48 (about 79.1 %) comparison tests across all rankers (please note that the default ranker is not counted since it does not use any metafeatures). Overall, we can see that the ART Forests ranker with the SIL+Metarules set consistently produces performance gains for all different metrics, and is placed as the best ranker for 7 out of 8 metrics. Although increasing ART Forests’ ensemble size may further improve its performance, using 100 as a default setting seems to work well for our experiments.
Table 1 shows a summary of the top one ranker for each metric and how much performance gain the respective ranker can achieve. We can see that all the best rankers used the SIL+Metarules set. The greatest performance gain (93 %) over the default ranker is from the kNN ranker on the LA@1 measurement. The “Gain §” values show how much improvement the top rankers using the SIL+Metarules metafeature set can achieve over the default ranker; and the “Gain †” values show how much improvement the MetaRule method (Method 1) introduced in this paper can achieve over the respective best rankers using the SILonly metafeature set.
In terms of runtime, the RIPPERbased pairwise metarules method on average takes 30 seconds to generate rulesets for 190 algorithm pairs. Figure 8 shows the runtime of different rankers. We can see that the ART Forests algorithm (with 100 ART models) is relatively efficient compared with other stateoftheart rankers. When rapid modeling is required for an application, the kNN ranker is a reasonable alternative because it is fast and relatively accurate.
Discussions and future work
In this section, we discuss the limitations of the proposed techniques and future work for extending and improving the system.
For the pairwise metarule method, although the RIPPER algorithm works well, we believe that alternative rule learners are still worth investigating. An important direction for future work is the design of an efficient rule learner for generating small numbers of high quality pairwise metarules, as the number of pairwise rule models required increases quadratically with the number of algorithms to rank. Therefore, the training and the output complexity of the chosen rule learner is critical to the scalability of the proposed method. If z is the average number of rules returned by the rule learner for one pair of algorithms, then the total number of pairwise metarule based metafeatures generated for m algorithms is \(\frac{zm(m1)}{2}\). Consequently, when using the pairwise metarule method, the number of metafeatures grows quadratically with the number of algorithms to rank. Some algorithms, like the novel ART Forests algorithm proposed here, scale well with larger number of features, whereas for some of the alternative rankers both runtime and ranking performance might suffer from a larger number of features, especially if there are strong correlations present among these features. To overcome this problem, in future work, we will investigate: (1) rule pruning techniques for reducing the total number of pairwise metarules; (2) rule set fusion techniques, such as groupwise metarules. Another interesting future direction would be using the predicted rank differences, or at least the actual posterior probabilities of the binary problem, instead of the purely boolean pairwise metafeatures, for improved ranking performance.
Regarding the details of the ART algorithm, in this paper we only considered and evaluated using the median value of a metafeature’s dynamic range as a binary split point. A future work direction would be comparing this to both alternative split point selection and to multiway splitting strategies.
In this paper, we employed the EAbased parameter selection technique, which is a relatively expensive approach for generating metadataset. In the literature, there are some attempts to use metalearning itself for the task of parameter selection. Soares et al. (2004) and Ali and SmithMiles (2006) have shown promising results using metalearning for selecting the kernel type and kernel parameters for support vector machines. There is also previous work that proposed hybrid systems combining metalearning and optimisation techniques (Reif et al. 2012; Gomes et al. 2012; de Miranda et al. 2012). These systems could be used as alternative parameter selection procedures for the metalearning experiment setup proposed in the this paper. In Sect. 5.1, we mentioned that predicting the full combination of algorithm plus optimal parameter settings is not feasible. While this is technically true, Leite et al. (2012a, 2012b) has introduced the active testing based metalearning framework that is able to return good parameter setting together with the recommended algorithm. In their experiments, a set of 292 algorithmparameter combinations was evaluated. The techniques described in the current paper could probably also be applied to metalearning for parameter recommendation of single learning algorithms, but this needs to be verified in a future experimental study.
Conclusions
We have introduced a new approach for generating metafeatures to meta learning. The main difference between our method and stacking is given in Sect. 3. We have introduced a specialised metalearning algorithm, based on the random forests framework (Breiman 2001), for predicting algorithm rankings, and provided both a theoretical and an empirical analysis. Unlike previous work, experiments in this paper were based on a much larger collection of publicly available datasets and a novel, more appropriate metalearning experiment configuration, which systematically explores the space of parameter values of the algorithms to be ranked.
Our experimental results indicate that the pairwise metarule generation method (Method 1) consistently improves the performances of different ranking approaches to metalearning. The new ART Forests ranker is always among the top rankers across all ranking metrics and functions we have tested. The success of the proposed methods and algorithms on the metalearning problem studied in this paper suggests the applicability of the new techniques to a wider range of metalearning problems. Additional materials can be found at http://www.cs.waikato.ac.nz/~qs12/ml/meta/.
References
Ali, S., & SmithMiles, K. A. (2006). A metalearning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1–3), 173–186.
Alvo, M., Cabilio, P., & Feigin, P. D. (1982). Asymptotic theory for measures of concordance with special reference to average Kendall tau. The Annals of Statistics, 10(4), 1269–1276.
Bensusan, H., GiraudCarrier, C., & Kennedy, C. (2000). A higherorder approach to metalearning (Technical report). University of Bristol.
Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research, 13, 1063–1095.
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Topdown induction of clustering trees. In Proceedings of the fifteenth international conference on machine learning. San Mateo: Morgan Kaufmann.
Brazdil, P., Gama, J., & Henery, B. (1994). Characterizing the applicability of classification algorithms using metalevel learning. In Proceedings of the European conference on machine learning.
Brazdil, P., Soares, C., & Da Costa, J. P. (2003). Ranking learning algorithms: using ibl and metalearning on accuracy and time results. Machine Learning, 50(3), 251–277.
Brazdil, P., GiraudCarrier, C., Soares, C., & Vilalta, R. (2009). Metalearning: applications to data mining. Berlin: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Cheng, W., & Hüllermeier, E. (2008). Instancebased label ranking using the mallows model. In Workshop proceedings of preference learning, Antwerp, Belgium.
Cheng, W., Hühn, J., & Hüllermeier, E. (2009). Decision tree and instancebased learning for label ranking. In Proceedings of the 26th international conference on machine learning (ICML09), Montreal, Canada (pp. 161–168).
Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the 12th international conference on machine learning. San Mateo: Morgan Kaufmann.
de Miranda, P., Prudencio, R., Carvalho, A., & Soares, C. (2012). Combining a multiobjective optimization approach with metalearning for svm parameter selection. In 2012 IEEE international conference on systems, man, and cybernetics (pp. 2909–2914).
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Escalante, H. J., Montes, M., & Sucar, L. E. (2009). Particle swarm model selection. Journal of Machine Learning Research, 10, 405–440.
GiraudCarrier, C. (2008). Metalearning—a tutorial. In Proceedings of the 7th international conference on machine learning and applications. San Mateo: Morgan Kaufmann.
Gomes, T. A., Prudêncio, R. B., Soares, C., Rossi, A. L., & Carvalho, A. (2012). Combining metalearning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1), 3–13.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The weka data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16–17), 1897–1916.
Jankowski, N., Duch, W., & Grabczewski, K. (Eds.) (2011). Studies in computational intelligence: Vol. 358. Metalearning in computational intelligence. Berlin: Springer.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gainbased evaluation of ir techniques. ACM Transactions on Information Systems, 20(4), 422–446.
Kalousis, A. (2002). Algorithm selection via metalearning. PhD thesis, Department of Computer Science, University of Geneva.
Kalousis, A., & Hilario, M. (2001). Model selection via metalearning: a comparative study. International Journal on Artificial Intelligence Tools, 10(04), 525–554.
Kendall, M. G. (1970). Rank correlation methods. London: Griffin.
Leite, R., & Brazdil, P. (2005). Predicting relative performance of classifiers from samples. In Proceedings of the 22nd international conference on machine learning.
Leite, R., Brazdil, P., & Vanschoren, J. (2012a). Selecting classification algorithm with active testing on similar datasets. In Proceedings of the 5th international workshop on planning to learn.
Leite, R., Brazdil, P., & Vanschoren, J. (2012b). Selecting classification algorithms with active testing. In P. Perner (Ed.), Lecture notes in computer science: Vol. 7376. Machine learning and data mining in pattern recognition (pp. 117–131). Berlin Heidelberg: Springer.
Li, H. (2011). Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies, 4(1), 1–113.
Marden, J. I. (1995). Analyzing and modeling rank data. London: Chapman & Hall.
Pfahringer, B., Bensusan, H., & GiraudCarrier, C. (2000). Metalearning by landmarking various learning algorithms. In Proceedings of the 17th international conference on machine learning.
Pinto da Costa, J., & Soares, C. (2005). A weighted rank measure of correlation. Australian & New Zealand Journal of Statistics, 47(4), 515–529.
Reif, M., Shafait, F., & Dengel, A. (2012). Metalearning for evolutionary parameter optimization of classifiers. Machine Learning, 87, 357–380.
Schaffer, C. (1994). A conservation law for generalization performance. In Proceedings of the 11th international conference on machine learning (pp. 259–265). San Mateo: Morgan Kaufmann.
Serban, F., Vanschoren, J., Kietz, J.U., & Bernstein, A. (2013). A survey of intelligent assistants for data analysis. ACM Computing Surveys. doi:10.5167/uzh73010.
SmithMiles, K. A. (2009). Crossdisciplinary perspectives on metalearning for algorithm selection. ACM Computing Surveys, 41(1), 6:1–6:25.
Soares, C. (2004). Learning ranking of learning algorithms. PhD thesis, Department of Computer Science, University of Porto.
Soares, C., Brazdil, P. B., & Kuba, P. (2004). A metalearning method to select the kernel width in support vector regression. Machine Learning, 54(3), 195–209.
Sun, Q., Pfahringer, B., & Mayo, M. (2012). Full model selection in the space of data mining operators. In Proceedings of the 14th international conference on genetic and evolutionary computation conference companion.
Todorovski, L., Blockeel, H., & Dzeroski, S. (2002). Ranking with predictive clustering trees. In Proceedings of the 13th European conference on machine learning. Berlin: Springer.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Wolpert, D., & Macready, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.
Xu, J., & Li, H. (2007). Adarank: a boosting algorithm for information retrieval. In Proceedings of the 30th international conference on research and development in information retrieval. New York: ACM.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný.
Appendix: A brief derivation of Eq. (11)
Appendix: A brief derivation of Eq. (11)
Here we give the core steps of the derivation. For a more detailed derivation and intermediate steps, we refer the reader to (Alvo et al. 1982; Marden 1995).
Let Q _{ m } be the set of all m×m permutation matrices, where m is the number of objects (e.g., algorithms) to rank. The sets \(\mathcal{S}_{m}\) (in Eq. (4)) and Q _{ m } are in onetoone correspondence:
where \(e'_{m} = [1,2,3,\ldots,m]\). For example, Q _{ ij }=1 iff y _{ i }=j.
Let d be a Hoeffding distance of the form,
For Spearman distance (which is a form of Hoeffding distance):
where \(f'_{m} = [1,2^{2},3^{2},\ldots,m^{2}]\) and \(1'_{m} = [1,1,1,\ldots,1] \).
The spread in Eq. 11 is calculated as:
where \(\Hat{M}1_{m} = \Hat{M'}1_{m} = 1_{m} \). Since it is well know that the sum of the squares of integers \(1^{2} + 2^{2} + 3^{2} + \cdots + m^{2} = \frac{m(m+1)(2m+1)}{6}\), therefore
Rights and permissions
About this article
Cite this article
Sun, Q., Pfahringer, B. Pairwise metarules for better metalearningbased algorithm ranking. Mach Learn 93, 141–161 (2013). https://doi.org/10.1007/s109940135387y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s109940135387y
Keywords
 Metalearning
 Algorithm ranking
 Ranking trees
 Ensemble learning