The aim of this paper is to determine whether the following hypotheses can be accepted:
- Hyp1: The incorporation of a function that combines accuracy and runtime is useful for the construction of the average ranking, as it leads to better results than just accuracy when carrying out evaluation on loss-time curves.
- Hyp2: The incorporation of a function that combines accuracy and runtime for the active testing method leads to better results than only using accuracy when carrying out evaluation on loss-time curves.
The rest of this section is dedicated to the average ranking method. First, we present a brief overview of the method and show how the average ranking can be constructed on the basis of prior test results. This is followed by the description of the function A3R that combines accuracy and runtime, and how the average ranking method can be upgraded with this function. Furthermore, we empirically evaluate this method by comparing the ranking obtained with the ranking representing the golden standard. Here we also introduce loss-time curves, a novel representation that is useful in comparisons of rankings.
As our A3R function includes a parameter that determines the weight attributed to either accuracy or time, we have studied the effects of varying this parameter on the overall performance. As a result of this study, we identify the range of values that leads to the best results.
Overview of the average ranking method
This section presents a brief review of the average ranking method that is often used in comparative studies in the machine learning literature. This method can be regarded as a variant of Borda’s method (Lin 2010).
For each dataset, the algorithms are ordered according to the performance measure chosen (e.g., predictive accuracy) and ranks are assigned accordingly. Among many popular ranking criteria we find, for instance, success rates, AUC, and significant wins (Brazdil et al. 2003; Demšar 2006; Leite and Brazdil 2010). The best algorithm is assigned rank 1, the runner-up is assigned rank 2, and so on. Should two or more algorithms achieve the same performance, the attribution of ranks is done in two steps. In the first step, the algorithms that are tied are attributed successive ranks (e.g. ranks 3 and 4). Then all tied algorithms are assigned the mean rank of the occupied positions (i.e. 3.5).
Let \(r_i^j\) be the rank of algorithm i on dataset j. In this work we use average ranks, inspired by Friedman’s M statistic (Neave and Worthington 1988). The average rank for each algorithm is obtained using
$$\begin{aligned} r_i = \frac{1}{D} \sum _{j=1}^{D} r_i^j \end{aligned}$$
(2)
where D is the number of datasets. The final ranking is obtained by ordering the average ranks and assigning ranks to the individual algorithms accordingly.
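To make the procedure concrete, the following minimal sketch computes per-dataset ranks with tie averaging and then the average ranks of Eq. (2). The `accuracy` array and its values are purely illustrative, not our actual meta-data.

```python
# Minimal sketch of the average ranking computation (Eq. 2). The array
# `accuracy` is hypothetical: accuracy[j, i] holds the success rate of
# algorithm i on dataset j.
import numpy as np
from scipy.stats import rankdata

accuracy = np.array([
    [0.81, 0.79, 0.81, 0.60],   # dataset 1 (algorithms 0 and 2 are tied)
    [0.92, 0.90, 0.85, 0.70],   # dataset 2
    [0.77, 0.80, 0.79, 0.65],   # dataset 3
])

# Rank 1 = best; tied algorithms receive the mean of the occupied positions,
# which is rankdata's default 'average' method.
per_dataset_ranks = np.vstack([rankdata(-row, method="average") for row in accuracy])

# Average rank of each algorithm over the D datasets (Eq. 2).
average_ranks = per_dataset_ranks.mean(axis=0)

print(average_ranks)             # average ranks: [1.83, 2.0, 2.17, 4.0]
print(np.argsort(average_ranks)) # recommended ranking: algorithm indices by average rank
```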
The average ranking is a useful method for deciding which algorithm should be used. It can also serve as a baseline against which other methods can be compared.
The average ranking would normally be followed on the new dataset: first the algorithm with rank 1 is evaluated, then the one with rank 2 and so on. In this context, the average ranking can be referred to as the recommended ranking.
Evaluation of rankings
The quality of a ranking is typically established through comparison with the golden standard, that is, the ideal ranking on the new (test) dataset(s). This is often done using a leave-one-out cross-validation (CV) strategy (or in general k-fold CV) on all datasets: in each leave-one-out cycle the recommended ranking is compared against the ideal ranking on the left-out dataset, and then the results are averaged for all cycles.
Different evaluation measures can be used to evaluate how close the recommended ranking is to the ideal one. Often, this is a type of correlation coefficient. Here we have opted for Spearman’s rank correlation (Neave and Worthington 1988), but Kendall’s Tau correlation could have been used as well. Obviously, we want to obtain rankings that are highly correlated with the ideal ranking.
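As an illustration, the comparison of a recommended ranking against the ideal ranking with Spearman's rank correlation can be computed as follows; the two rank vectors shown are purely illustrative.

```python
# Illustrative evaluation of a recommended ranking against the ideal ranking
# on a left-out dataset, using Spearman's rank correlation.
from scipy.stats import spearmanr

recommended_ranking = [1, 2, 3, 4, 5]   # hypothetical ranks from the average ranking
ideal_ranking       = [2, 1, 3, 5, 4]   # hypothetical ideal ranks on the left-out dataset

rho, p_value = spearmanr(recommended_ranking, ideal_ranking)
print(f"Spearman correlation: {rho:.2f}")
```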
A disadvantage of this approach is that it does not show directly what the user is gaining or losing when following the ranking. As such, many researchers have adopted a second approach which simulates the sequential evaluation of algorithms on the new dataset (using cross-validation) as we go down the ranking. The measure that is used is the performance loss, defined as the difference in accuracy between \(a_{best}\) and \(a^*\), where \(a_{best}\) represents the best algorithm identified by the system at a particular time and \(a^*\) the truly best algorithm that is known to us (Leite et al. 2012).
As tests proceed following the ranking, the loss either maintains its value or decreases when the newly selected algorithm improves upon the previously selected algorithms, yielding a loss curve. Many typical loss curves used in the literature show how the loss depends on the number of tests carried out. An example of such a curve is shown in Fig. 2a. Evaluation is again carried out in a leave-one-out fashion. In each cycle of the leave-one-out cross-validation (LOO-CV) one loss curve is generated. In order to obtain an overall picture, the individual loss curves are aggregated into a mean loss curve. An alternative to using LOO-CV would be to use k-fold CV (with e.g. k = 10). This issue is briefly discussed in Sect. 6.1.
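The construction of a single loss curve can be sketched as follows, assuming (hypothetically) that the accuracy of each algorithm on the new dataset becomes known once its test completes; the accuracy values are illustrative.

```python
# Sketch of how a loss curve is built while following a ranking: after each
# test, the loss is the accuracy gap between the truly best algorithm and the
# best one found so far. Values are illustrative.
import numpy as np

# accuracies of the algorithms on the new dataset, listed in ranking order
acc_in_ranking_order = np.array([0.84, 0.88, 0.86, 0.91, 0.90])
best_possible = acc_in_ranking_order.max()    # accuracy of a*, the truly best algorithm

best_so_far = np.maximum.accumulate(acc_in_ranking_order)   # a_best after each test
loss_curve = best_possible - best_so_far                    # non-increasing loss values
print(loss_curve)   # [0.07, 0.03, 0.03, 0.00, 0.00]
```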
Loss-time curves
A disadvantage of loss curves is that they only show how the loss depends on the number of tests. However, some algorithms are much slower learners than others, sometimes by several orders of magnitude, and these simple loss curves do not capture this.
This is why, in this article, we follow Brazdil et al. (2003) and van Rijn et al. (2015) and take into account the actual time required to evaluate each algorithm and use this information when generating the loss curve. We refer to this type of curve as a loss versus time curve, or loss-time curve for short. Figure 2b shows an example of a loss-time curve, corresponding to the loss curve in Fig. 2a.
As train/test times include both very small and very large numbers, it is natural to use the logarithm of the time (\(\log_{10}\)) instead of the actual time. This has the effect that equal time intervals appear shorter as we move further along the time axis. As the user would normally not carry out exhaustive testing, but rather focus on the first few items in the ranking, this representation makes the losses at the beginning of the curve more apparent. Figure 2c shows the arrangement of the previous loss-time curve on a log scale.
Each loss-time curve can be characterized by a number representing the mean loss in a given interval, which corresponds to the area under the loss-time curve. The individual loss-time curves can be aggregated into a mean loss-time curve. We want this mean interval loss (MIL) to be as low as possible. This characteristic is similar to AUC, but there is one important difference. For AUC the x-axis spans values between 0 and 1, while our loss-time curves span an interval between some \(T_{min}\) and \(T_{max}\) defined by the user. Typically the user searching for a suitable algorithm would not worry about very short times where the loss could still be rather high. In the experiments here we have set \(T_{min}\) to 10 s. In an on-line setting, however, we might need a much smaller value. The value of \(T_{max}\) also needs to be set. In the experiments here it has been set to \(10^4\) s, corresponding to about 2.78 h. We assume that most users would be willing to wait a few hours, but not days, for the answer. Also, many of our loss curves reach 0, or values very near 0, by this time. Note that this is an arbitrary setting that can be changed, but here it enables us to compare loss-time curves.
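The following sketch shows one way to compute the MIL of a single loss-time curve. The normalization by the width of the \(\log_{10}\) interval is our assumption here, and all breakpoints and loss values are illustrative.

```python
# A sketch of the mean interval loss (MIL) for one loss-time curve, under the
# assumption that MIL is the area under the (step-shaped) curve between T_min
# and T_max on a log10 time axis, divided by the width of that interval.
import numpy as np

def mean_interval_loss(times, losses, t_min=10.0, t_max=1e4):
    """times: increasing breakpoints (in seconds) where the loss changes;
    losses: len(times) + 1 values; losses[k] holds between times[k-1] and
    times[k] (losses[0] before times[0], losses[-1] after times[-1])."""
    log_min, log_max = np.log10(t_min), np.log10(t_max)
    edges = np.concatenate(([log_min],
                            np.clip(np.log10(times), log_min, log_max),
                            [log_max]))
    widths = np.diff(edges)                  # segment widths in log10 time
    return float(np.sum(widths * losses) / (log_max - log_min))

# Example: loss of 7% until 100 s, 3% until 1000 s, and 0% afterwards.
print(mean_interval_loss(np.array([100.0, 1000.0]), np.array([7.0, 3.0, 0.0])))
# ~3.33 (mean loss over the interval [10 s, 10^4 s])
```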
Data used in the experiments
This section describes the meta-dataset used in the experiments reported in this article. The meta-dataset was constructed from evaluation results retrieved from OpenML (Vanschoren et al. 2014), a collaborative science platform for machine learning. It contains the results of 53 parameterized classification algorithms from the Weka workbench (Hall et al. 2009) on 39 classification datasets. More details about the 53 classification algorithms can be found in the Appendix.
Combining accuracy and runtime
In many situations, we have a preference for algorithms that are fast and also achieve high accuracy. However, the question is whether such a preference would lead to better loss-time curves. To investigate this, we have adopted a multi-objective evaluation measure, A3R, described in Abdulrahman and Brazdil (2014), that combines both accuracy and runtime. Here we use a slightly different formulation to describe this measure:
$$\begin{aligned} A3R_{a_{ref},a_j}^{d_i} = \dfrac{SR_{a_j}^{d_i} / SR_{a_{ref}}^{d_i}}{\left(T_{a_j}^{d_i} / T_{a_{ref}}^{d_i}\right)^P} \end{aligned}$$
(3)
Here \(SR_{a_j}^{d_i}\) and \(SR_{a_{ref}}^{d_i}\) represent the success rates (accuracies) of algorithms \(a_j\) and \(a_{ref}\) on dataset \(d_i\), where \(a_{ref}\) represents a given reference algorithm. Instead of accuracy, AUC or another measure can be used as well. Similarly, \(T_{a_j}^{d_i}\) and \(T_{a_{ref}}^{d_i}\) represent the run times of the algorithms, in seconds.
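A direct translation of Eq. (3) into code is shown below; the success rates, runtimes and the choice of reference algorithm are purely illustrative.

```python
# A direct translation of Eq. (3); all values are illustrative.
def a3r(sr_j, t_j, sr_ref, t_ref, p=1/64):
    """A3R of algorithm a_j relative to reference a_ref on one dataset."""
    return (sr_j / sr_ref) / (t_j / t_ref) ** p

# A candidate that is about 2% more accurate but 100 times slower than the
# reference; at P = 1/64 the extra accuracy does not pay off (A3R < 1).
print(a3r(sr_j=0.85, t_j=500.0, sr_ref=0.833, t_ref=5.0))   # ~0.95
```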
To trade off the importance of time, the denominator is raised to the power of P, where P is usually a small number, such as 1/64, representing, in effect, the \(64^{th}\) root. This is motivated by the observation that run times vary much more than accuracies. It is not uncommon that one particular algorithm is three orders of magnitude slower (or faster) than another. Obviously, we do not want the time ratios to completely dominate the equation. If we take the \(N^{th}\) root of the ratios, we obtain a number that approaches 1 in the limit as N approaches infinity (i.e. as P approaches 0).
For instance, if we used \(P = 1/256\), an algorithm that is 1000 times slower would yield a denominator of 1.027. It would thus be equivalent to the faster reference algorithm only if its accuracy was \(2.7\%\) higher than the reference algorithm. Table 1 shows how a ratio of 1000 (one algorithm is 1000 times slower than the reference algorithm) is reduced for decreasing values of P. As P gets lower, the time is given less and less importance.
Table 1 Effect of varying P on time ratio
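The effect summarized in Table 1 can be reproduced by direct computation of \(1000^P\); the grid of P values below is illustrative and may differ from the exact grid used in the table.

```python
# Worked computation behind Table 1: the factor by which a time ratio of 1000
# (one algorithm is 1000 times slower than the reference) is reduced when
# raised to the power P.
for p in [1/4, 1/16, 1/64, 1/128, 1/256, 0]:
    print(f"P = {p:<10.6g}  1000^P = {1000 ** p:.3f}")
# 1000^P: 5.623, 1.540, 1.114, 1.055, 1.027 and 1.000, respectively.
```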
A simplified version of A3R, introduced in van Rijn et al. (2015), assumes that both the success rate of the reference algorithm \(SR_{a_{ref}}^{d_i}\) and the corresponding time \(T_{a_{ref}}^{d_i}\) have a fixed value. Here the values are set to 1. The simplified version, \(A3R'\), which can be shown to yield the same ranking, is defined as follows:
$$\begin{aligned} {{A3R'}}_{a_{j}}^{d_{i}} = \frac{{{SR}}_{a_{j}}^{d_{i}}}{{(T_{a_{j}}^{d_{i}})}^{P}} \end{aligned}$$
(4)
We note that if P is set to 0, the value of the denominator will be 1. So in this case, only accuracy will be taken into account. In the experiments described further on we used A3R (not \(A3R'\)).
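The equivalence of the two formulations can be illustrated as follows: on any single dataset, A3R and \(A3R'\) differ only by a constant factor (determined by the reference algorithm), so they induce the same ordering. All numbers below are made up.

```python
# Illustration that A3R' (Eq. 4) orders algorithms exactly as A3R (Eq. 3) on a
# single dataset: the two differ only by a constant factor per dataset.
import numpy as np

def a3r_prime(sr, t, p=1/64):
    return sr / t ** p

sr = np.array([0.85, 0.83, 0.90, 0.78])    # success rates on one dataset
t  = np.array([500.0, 5.0, 9000.0, 1.0])   # runtimes in seconds
sr_ref, t_ref = sr[1], t[1]                # any algorithm can act as reference

a3r_full   = (sr / sr_ref) / (t / t_ref) ** (1 / 64)
a3r_simple = a3r_prime(sr, t)

print(np.argsort(-a3r_full))    # ranking induced by A3R
print(np.argsort(-a3r_simple))  # identical ranking induced by A3R'
```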
Upgrading the average ranking method using A3R
The performance measure A3R can be used to rank a given set of algorithms on a particular dataset in a similar way as accuracy. Hence, the average rank method described earlier was upgraded to generate a time-aware average ranking, referred to as the A3R-based average ranking.
Obviously, we can expect somewhat different results for each particular choice of the parameter P that determines the relative importance of accuracy and runtime; it is therefore important to determine which value of P leads to the best results in loss-time space. Moreover, we wish to know whether the use of A3R (with the best setting for P) achieves better results when compared to the approach that only uses accuracy. These issues are addressed in the next sections.
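A minimal sketch of how the A3R-based average ranking can be constructed from prior test results is given below. It relies on the simplified measure \(A3R'\); the `accuracy` and `runtime` arrays are hypothetical.

```python
# A minimal sketch of the A3R-based average ranking: on each dataset the
# algorithms are ranked by A3R' instead of accuracy, and the per-dataset ranks
# are averaged as in Eq. (2). `accuracy` and `runtime` are hypothetical
# D x A arrays of prior test results (D datasets, A algorithms).
import numpy as np
from scipy.stats import rankdata

def a3r_average_ranking(accuracy, runtime, p=1/64):
    scores = accuracy / runtime ** p                        # A3R' per dataset/algorithm
    ranks = np.vstack([rankdata(-row) for row in scores])   # rank 1 = highest A3R'
    return ranks.mean(axis=0)                               # average rank per algorithm

accuracy = np.array([[0.81, 0.79, 0.83],
                     [0.92, 0.90, 0.85]])
runtime  = np.array([[3.0, 200.0, 4000.0],
                     [10.0, 35.0, 900.0]])
print(a3r_average_ranking(accuracy, runtime, p=1/64))
```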
Searching for the best parameter setting
Our first aim was to generate different variants of the A3R-based average ranking resulting from different settings of P within A3R and to identify the best setting. We have used a grid search and considered settings of P ranging from P = 1/4 to P = 1/256, as shown in Table 2. The last value shown is P = 0. If this value is used in \((T_{a_j}^{d_i} / T_{a_{ref}}^{d_i})^P\), the result is 1. This last option corresponds to the variant in which only accuracy matters.
All comparisons were made in terms of the mean interval loss (MIL) associated with the mean loss-time curves. As explained earlier, the loss-time curves obtained in the different cycles of the leave-one-out method are aggregated into a single mean loss-time curve, also shown in Fig. 3. For each mean curve we calculated the MIL, resulting in Table 2.
The MIL values in this table thus represent mean values over the different cycles of the leave-one-out procedure; in each cycle the method is applied to one particular dataset.
Table 2 Mean interval loss of AR-A3R associated with the loss-time curves for different values of P
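For concreteness, the grid search can be sketched as follows. This is a toy, self-contained version: the meta-data are randomly generated, the ranking is built with \(A3R'\), and the MIL computation makes the simplifying assumption that the loss before the first test completes equals the loss after it. The real experiments use the OpenML meta-dataset and the settings described above.

```python
# Toy, self-contained sketch of the grid search over P in leave-one-out mode.
# For each left-out dataset an A3R'-based average ranking is built from the
# remaining datasets, the ranking is followed on the left-out dataset to
# produce a loss-time curve, and the MIL values are averaged.
import numpy as np
from scipy.stats import rankdata

def loss_time_curve(order, acc, times):
    """Loss after each test and the cumulative runtime at which it is reached."""
    best_so_far = np.maximum.accumulate(acc[order])
    return acc.max() - best_so_far, np.cumsum(times[order])

def mil(losses, cum_times, t_min=10.0, t_max=1e4):
    """Mean loss over [t_min, t_max] on a log10 time axis (simplifying
    assumption: before the first test completes, the loss equals losses[0])."""
    log_min, log_max = np.log10(t_min), np.log10(t_max)
    edges = np.concatenate(([log_min],
                            np.clip(np.log10(cum_times), log_min, log_max),
                            [log_max]))
    steps = np.concatenate(([losses[0]], losses))
    return np.sum(np.diff(edges) * steps) / (log_max - log_min)

rng = np.random.default_rng(0)
D, A = 6, 5                                  # 6 datasets, 5 algorithms (toy sizes)
acc = rng.uniform(0.6, 0.95, size=(D, A))
run = 10 ** rng.uniform(0, 3, size=(D, A))   # runtimes between 1 s and 1000 s

for p in [1/4, 1/16, 1/64, 1/256, 0]:
    interval_losses = []
    for d in range(D):                       # leave-one-out over datasets
        train = np.delete(np.arange(D), d)
        ranks = np.vstack([rankdata(-(acc[j] / run[j] ** p)) for j in train])
        order = np.argsort(ranks.mean(axis=0))       # recommended ranking
        losses, cum_times = loss_time_curve(order, acc[d], run[d])
        interval_losses.append(mil(losses, cum_times))
    print(f"P = {p:<10.6g}  mean MIL = {np.mean(interval_losses):.4f}")
```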
The results show that the setting P = 1/64 leads to better results than the other values, while the setting P = 1/128 is not too far off. Both settings are better than, for instance, P = 1/4, which attributes a much higher importance to time. They are also better than, for instance, P = 1/256, which attributes much less importance to time, or P = 0, when only accuracy matters.
The boxplots in Fig. 4 show how the MIL values vary for different datasets. The boxplots are in agreement with the values shown in Table 2. The variations are lowest for the settings P = 1/16, P = 1/64 and P = 1/128, although for each one we note various outliers. The variations are much higher for all the other settings. The worst case is P = 0 when only accuracy matters.
For simplicity, the best version identified, that is AR-A3R-1/64, is referred to by the short name AR* in the rest of this article. Similarly, the version AR-A3R-0, corresponding to the case when only accuracy matters, is referred to as AR0.
As AR* produces better results than AR0, this provides evidence in favor of hypothesis Hyp1 presented earlier.
An interesting question is why AR0 performs so poorly. Using the average ranking with a purely accuracy-based ranking leads to disastrous results (MIL = 22.11) and should be avoided. This issue is addressed further in Sect. 5.2.2.
Discussion
The parameter P could be used as a user-defined parameter to express the user's relative interest in accuracy versus time. In other words, this parameter could be used to establish the trade-off between accuracy and runtime, depending on the operating conditions required by the user (e.g. a particular value of \(T_{max}\), determining the time budget).
However, one important result of our work is that there is an optimal setting of P for which the user will obtain the best result in terms of MIL.
The values of \(T_{min}\) and \(T_{max}\) define an interval of interest in which we wish to minimize MIL. It is assumed that all times in this interval are equally important. We assume that the user could interrupt the process at any time T lying in this interval and request the name of \(a_{best}\), the best algorithm identified.