1 Introduction

In a ranking task, a learner is given a set of preferences, often organized pairwise, over instances. For example, in a movie recommendation task, the learner might be told that Alice likes “2001: A Space Odyssey” better than “Interstellar,” and similar facts. The learner must then produce a function that correctly ranks novel instances. In this example, this function would be expected to rank new movies in Alice’s order of preference. Ranking has many applications in areas such as drug design (Agarwal et al. 2010), information retrieval (Nuray and Can 2006) and of course recommendation systems (Freund et al. 2003).

An elegant way to solve the ranking task is through a “boosting” algorithm. In classification, algorithms such as Adaboost  (Freund and Schapire 1995) iteratively learn and aggregate a collection of “base learners” into a solution that is very accurate. Base learners are only required to satisfy a “weak learning” criterion, which means they only need to be slightly better than chance on the learning task. This is attractive for at least two reasons. First, for many domains, finding a classifier that satisfies weak learning can be easier than finding one that is very accurate. Second, boosting algorithms for classification are known to possess desirable theoretical properties (Freund and Schapire 1995). We may hope to extend these characteristics to the ranking problem if a suitable boosting algorithm for ranking can be designed.

Indeed, Adaboost was directly extended to the ranking problem by a framework called Rankboost (Freund et al. 2003), which is our focus in this paper. Rankboost shows how to combine a collection of weak rankers into an effective ranking procedure, as described below. A version of Rankboost, which we call Rb-d in this paper, has been shown to possess good theoretical properties (Freund et al. 2003; Mohri et al. 2012). A different version, which we call Rb-c in this paper, has been used to solve many ranking tasks with good results (Cortes and Mohri 2004; Cao et al. 2007; Zheng et al. 2008; Agarwal et al. 2012; Aslam et al. 2009). These implementations are described in the next section.

While Rankboost is a well-established boosting algorithm for ranking, there is a gap in our understanding of the approach. In particular, the theoretical understanding of Rankboost applies to Rb-d; however, experiments in the original paper show it underperforms Rb-c, which does not have a similar theoretical justification, on the very ranking scenarios for which Rb-d was designed. We verify this behavior in our experiments on real data. Thus the version that has a theoretical justification does not work very well in practice, while the version that works better in practice has limited theoretical support.

In this paper, we address this gap. We propose an approach we call Rankboost \(+\), which is built on the Rankboost framework. We show that Rankboost \(+\) has the good theoretical properties of Rb-d. However, we show that Rankboost \(+\) is also closely related to Rb-c, and our experiments show that it significantly outperforms both Rb-d and Rb-c in a number of real-world ranking tasks. Further, the theory of Rankboost \(+\) also gives insight into why Rb-d underperforms in practice, and explains why Rb-c outperforms Rb-d in the ranking scenario and metric that Rb-d was designed to optimize.

In the following sections, we first describe Rankboost and the two variants, Rb-d and Rb-c. Next we motivate and present the Rankboost \(+\) algorithm. Then we discuss the theoretical properties of Rankboost \(+\) and explain why Rb-c outperforms Rb-d in practice. Finally, we evaluate these approaches empirically on several real-world ranking problems and show that the empirical results are in excellent agreement with what the theory predicts.

2 Rankboost

Let \(\mathscr {X}\) be an instance space, and let \(c: \mathscr {X}\times \mathscr {X}\rightarrow \{-1, 0, 1 \}\) be a target labeling function defined as:

$$\begin{aligned} c(x, x') = \left\{ \begin{array}{ll} +1&{}\quad \text {if }\quad x'\text { is ranked higher than }x,\\ -1&{}\quad \text {if }\quad x\text { is ranked higher than }x',\text { and}\\ 0&{}\quad \text {if}\quad \text {no preference.}\\ \end{array} \right. \end{aligned}$$
(1)

Note that this ranking scenario does not permit us to specify that \(x\) and \(x'\) should be ranked the same. Also note that we do not assume that the order induced by \(c\) is transitive.

Let \(\mathscr {H}\) be the set of ranking functions \(h : \mathscr {X} \rightarrow \mathbb {R}\). If \(h(x) > h(y)\), we say that \(h\) ranks \(x\) higher than \(y\). Let \(D\) be a distribution over \(\mathscr {X}\times \mathscr {X}\). The original Rankboost paper (Freund et al. 2003) defines the generalization error of ranker \(h\) on distribution \(D\) to be the probability that, given a pair of elements, we have a preference about how they are ranked but \(h\) does not correctly rank the pair.

$$\begin{aligned} R_1(h) = \mathop {Pr}_{(x,x')\sim D}[(c(x,x')\ne 0)\wedge (c(x, x')(h(x')-h(x))\le 0)]. \end{aligned}$$
(2)

Rankboost is a supervised learning algorithm that is trained on a subset of data drawn from \(\mathscr {X}\times \mathscr {X}\) using distribution D, and Rankboost is given the correct ranking for this subset. More formally, consider the labeled sample S defined as \(S= \{ (x_1, x'_1, y_1),\ldots , (x_m, x'_m, y_m) \}\) where, for each \(i\in \{1,\ldots , m\}\), \(y_i = c(x_i, x'_i)\), and \((x_i,x'_i)\) is drawn i.i.d. according to distribution D. Any pair \((x_i,x'_i)\) with \(y_i \ne 0\) is called a critical pair. As is standard in the literature, we simplify the presentation by assuming S contains only critical pairs.

Next we define rank loss functions on the sample that are analogous to the generalization error. The original Rankboost paper (Freund et al. 2003) defines two different rank loss functions for ranker h on sample S. We denote these functions as \(\hat{R}_1\) and \(\hat{R}_2\).

$$\begin{aligned} \hat{R}_1(h)&= \frac{1}{m} \sum _{i=1}^m \mathbf {1}_{y_i(h(x'_i)-h(x_i))\le 0} . \end{aligned}$$
(3)
$$\begin{aligned} \hat{R}_2(h)&= \frac{1}{m}\sum _{i=1}^m \left[ \mathbf {1}_{y_i(h(x'_i)-h(x_i))<0} +\frac{1}{2} \mathbf {1}_{y_i(h(x'_i)-h(x_i))=0} \right] . \end{aligned}$$
(4)

\(\hat{R}_1\) treats ties as equally bad as reverse rankings. The justification for \(\hat{R}_1\) is that it is the sample estimate of \(R_1\) on ranker \(h\), and in Freund et al. (2003) the authors design Rankboost to minimize \(\hat{R}_1\). We highlight in this paper some advantages to minimizing \(\hat{R}_1\) as well as some issues that result from this choice. \(\hat{R}_2\), on the other hand, assigns ties half the error of a reverse ranking. The intuition, given in Freund et al. (2003), for why \(\hat{R}_2\) is also a good rank loss function is that if we break tied pairs randomly in our output ranking, we expect half of the tied critical pairs to receive the correct ranking. Therefore, \(\hat{R}_2\) is the sample estimate of \(R_1\) on the ranking produced by \(h\) if we break ties randomly. While Rankboost is designed to minimize \(\hat{R}_1\), Freund et al. (2003) uses \(\hat{R}_2\) to evaluate Rankboost on real-world datasets.
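As a concrete sketch (our own illustration, not code from Freund et al. 2003), the two empirical losses can be computed directly from a labeled sample; here a ranker is any function returning a real score, and each critical pair is a triple \((x_i, x'_i, y_i)\) with \(y_i \in \{-1, +1\}\):

```python
def r1_hat(h, pairs):
    """R-hat_1 of Eq. (3): ties count as full errors."""
    # pairs: list of (x, x_prime, y) critical pairs with y in {-1, +1}
    return sum(1.0 if y * (h(xp) - h(x)) <= 0 else 0.0
               for x, xp, y in pairs) / len(pairs)

def r2_hat(h, pairs):
    """R-hat_2 of Eq. (4): ties count as half errors."""
    losses = []
    for x, xp, y in pairs:
        margin = y * (h(xp) - h(x))
        losses.append(1.0 if margin < 0 else 0.5 if margin == 0 else 0.0)
    return sum(losses) / len(pairs)
```

For a toy sample with one correct pair, one reversed pair, and one tied pair, \(\hat{R}_1\) charges the reversal and the tie fully, while \(\hat{R}_2\) charges the tie only half.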

Both \(\hat{R}_1\) and \(\hat{R}_2\) are non-convex functions, and minimizing either is an NP-complete problem (Cohen et al. 1999). Rankboost is a direct adaptation of the seminal classification boosting algorithm Adaboost (Freund and Schapire 1995), and \(\hat{R}_1\) is a direct analog to the empirical loss function used with Adaboost. Just as with Adaboost, Rankboost does not minimize \(\hat{R}_1\) directly. Instead, it minimizes an upper bound on \(\hat{R}_1\) that is the following exponential loss function.

$$\begin{aligned} \hat{E}_1(h) = \frac{1}{m}\sum _{i=1}^m e^{-y_i(h(x'_i)-h(x_i))}. \end{aligned}$$
(5)
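A direct computation of (5), as a sketch in our own notation (a pair is a triple \((x_i, x'_i, y_i)\) with \(y_i \in \{-1, +1\}\)); since \(e^{-z} \ge \mathbf {1}_{z \le 0}\), this value always upper-bounds \(\hat{R}_1\):

```python
import math

def e1_hat(h, pairs):
    """Exponential loss E-hat_1 of Eq. (5) for ranker h.

    pairs: list of (x, x_prime, y) critical pairs, y in {-1, +1}.
    """
    return sum(math.exp(-y * (h(xp) - h(x))) for x, xp, y in pairs) / len(pairs)
```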

Although (5) is a better-behaved function than \(\hat{R}_1\), it is typically difficult to find the element of \(\mathscr {H}\) that minimizes (5). Instead, we assume that we have access to a routine that can only find weak rankers, elements of \(\mathscr {H}\) so named because the ranking they produce is only weakly correlated with the desired ranking. The goal of Rankboost is to learn a linear combination of these weak rankers that is highly accurate. A common ranking scenario, and the one considered in this paper, is to further restrict the weak rankers used by Rankboost to be binary weak rankers: members of \(\mathscr {H}\) that map each element of \(\mathscr {X}\) to only two values, \(\{0, 1\}\). Formally, let \(f_1, \ldots , f_N\) be the finite set of binary weak rankers of \(\mathscr {H}\) that Rankboost considers during its execution. Let \(\varvec{\eta } = [\eta _1,\ldots ,\eta _N]\) be an element of \(\mathbb {R}^N\). We define the function \(F_1: \mathbb {R}^N \rightarrow \mathbb {R}\) as:

$$\begin{aligned} F_1(\varvec{\eta }) = \hat{E}_1\left( \sum _{s=1}^N \eta _s f_s\right) = \frac{1}{m}\sum _{i=1}^m e^{-\sum _{s=1}^N y_i \eta _s (f_s(x'_i) - f_s(x_i))} . \end{aligned}$$
(6)

As (6) is a convex function, it has a unique minimum. The goal of Rankboost is to iteratively adjust the \(\eta _s\) values in order to quickly converge to \(\varvec{\eta ^*}\), the vector that achieves the minimum of (6).

Rankboost achieves this minimization by iteratively building an ensemble ranker that is a linear combination of binary weak rankers. At iteration \(t\), the ensemble ranker is \(g_{t-1} = \sum _{j=1}^{t-1}\alpha _j h_j\), where \(h_j \in \{f_1, \ldots , f_N\}\) is the weak ranker chosen at iteration \(j\). During the execution, Rankboost maintains a distribution \(D_t(i)\) over the critical pairs \((x_i,x'_i)\) at iteration \(t\). The idea is for critical pairs not correctly ranked by \(g_{t-1}\) to have higher probability in \(D_t\) than those pairs correctly ranked. Rankboost learns a weak ranker \(h_t:\mathscr {X}\rightarrow \{0, 1\}\) with small expected error over \(D_t\). Ranker \(h_t\) is added to the ensemble ranker with weight \(\alpha _t\). Then a new distribution \(D_{t+1}\) is created by multiplying the probability of each pair by a scaling function \(\omega _t\), defined below, that increases the probability of the pairs \(h_t\) misranks and decreases the probability of the pairs \(h_t\) correctly ranks:

$$\begin{aligned} D_{t+1}(i) = \frac{D_t(i)\omega _t(i)}{Z_t}. \end{aligned}$$
(7)

\(Z_t\) is a normalizing factor to ensure that the probabilities sum to one.

Rankboost is an algorithm framework, listed below as Algorithm 1, and it can be instantiated in multiple ways. In this paper, we focus on two proposed versions from prior work (Freund et al. 2003). We denote as Rb-d, for discrete Rankboost, the version of Rankboost where the only weak rankers considered by the algorithm are binary weak rankers. This version uses Eq. (14) on line 7 of Algorithm 1 to calculate the appropriate weight \(\alpha _i\) for weak ranker \(h_i\). Rb-d is the version of Rankboost used in the theoretical analysis of ranking in prior work (Freund et al. 2003). We denote as Rb-c, for continuous Rankboost, the version of Rankboost where the weak rankers map each element of \(\mathscr {X}\) to [0, 1]. This version uses Eq. (19) on line 7 to calculate the appropriate weight \(\alpha _i\) for weak ranker \(h_i\). As mentioned above, this paper is focused on the ranking scenario where we only have binary weak rankers, but even in this scenario Rb-c generally outperforms Rb-d. Rb-c is the version of Rankboost most often used in practice.

[Algorithm 1: The Rankboost framework]
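The framework can be sketched as follows (our own minimal Python rendering of Algorithm 1, not the authors' implementation; the weak-learner step on line 6 is simplified to exhaustive search over a candidate pool, and the hypothetical `alpha_rule` parameter stands in for line 7, e.g. Eq. (14) for Rb-d or Eq. (19) for Rb-c):

```python
import math

def rankboost(pairs, weak_rankers, alpha_rule, T=50):
    """Sketch of the Rankboost framework.

    pairs: list of (x, x_prime, y) critical pairs, y in {-1, +1}.
    weak_rankers: candidate binary weak rankers h: X -> {0, 1}.
    alpha_rule: maps (eps_plus, eps_zero, eps_minus) to a weight alpha_t.
    """
    m = len(pairs)
    D = [1.0 / m] * m                       # distribution over critical pairs
    ensemble = []                           # list of (alpha_t, h_t)
    for _ in range(T):
        # line 6: pick the weak ranker with the largest edge under D_t
        def edge(h):
            return sum(d * y * (h(xp) - h(x)) for d, (x, xp, y) in zip(D, pairs))
        h_t = max(weak_rankers, key=edge)
        # eps^+, eps^0, eps^- of Eq. (8)
        eps = {1: 0.0, 0: 0.0, -1: 0.0}
        for d, (x, xp, y) in zip(D, pairs):
            eps[y * (h_t(xp) - h_t(x))] += d
        alpha_t = alpha_rule(eps[1], eps[0], eps[-1])   # line 7
        ensemble.append((alpha_t, h_t))
        # line 9: reweight pairs as in Eq. (10) and renormalize
        D = [d * math.exp(-alpha_t * y * (h_t(xp) - h_t(x)))
             for d, (x, xp, y) in zip(D, pairs)]
        Z = sum(D)
        D = [d / Z for d in D]
    return lambda x: sum(a * h(x) for a, h in ensemble)
```

The returned ensemble is the linear combination \(g = \sum _t \alpha _t h_t\), evaluated pointwise.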

2.1 The discrete variation of Rankboost

As noted above, Rb-d is the version of Rankboost about which we have theoretical results, and the values assigned in lines 6, 7, and 9 of Rb-d minimize \(\hat{E}_1\left( \sum _{s=1}^t\alpha _s h_s\right) \). We define the values \(\varepsilon ^{+1}_t\), \(\varepsilon ^0_t\), and \(\varepsilon ^{-1}_t\) as

$$\begin{aligned} \varepsilon ^\tau _t = \sum _{i = 1}^m D_t(i) \mathbf {1}_{y_i(h_t(x'_i) - h_t(x_i))= \tau }, \end{aligned}$$
(8)

and we let \(\varepsilon ^+_t\) and \(\varepsilon ^-_t\) stand for \(\varepsilon ^{+1}_t\) and \(\varepsilon ^{-1}_t\), respectively. Line 9 of Rb-d uses Eq. (7) to set \(D_{t+1}\). Just as with Adaboost, the Rb-d scaling function maps the \(i^{th}\) critical pair to its contribution to \(\hat{E}_1(\alpha _t h_t)\).

$$\begin{aligned} \omega _t(i) = \left\{ \begin{array}{ll} e^{-\alpha _t} &{} \text {if pair }i\text { correctly ranked by }h_t, \\ e^{\alpha _t} &{} \text {if pair }i\text { reverse ranked by }h_t, \\ 1 &{} \text {if pair }i\text { tied by }h_t. \end{array} \right. \end{aligned}$$
(9)

Thus, Eq. (7) is equivalent to

$$\begin{aligned} D_{t+1}(i) = \frac{D_{t}(i)\exp {(-\alpha _ty_i(h_t(x'_i) - h_t(x_i)))}}{Z_t} \end{aligned}$$
(10)

where \(Z_t\), the normalization factor, is

$$\begin{aligned} Z_t = \varepsilon _t^0 + \varepsilon _t^+ e^{-\alpha _t} + \varepsilon _t^- e^{\alpha _t}. \end{aligned}$$
(11)

A simple induction argument with (10) gives

$$\begin{aligned} \hat{E}_1\left( \sum _{s=1}^t \alpha _s h_s\right) = \prod _{s=1}^t Z_s. \end{aligned}$$
(12)
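Identity (12) is easy to verify numerically. The following sketch (our own toy example with hypothetical margins, not data from the paper) runs the scaling update for two rounds and compares \(\prod _s Z_s\) with a direct evaluation of \(\hat{E}_1\):

```python
import math

# Toy check of identity (12). margins[t][i] is a hypothetical value of
# y_i * (h_t(x'_i) - h_t(x_i)), and alphas[t] a hypothetical weight.
margins = [[1, 1, -1, 0], [1, 0, 1, -1]]
alphas = [0.3, 0.2]
m = 4

D = [1.0 / m] * m          # D_1 is uniform over the critical pairs
prod_Z = 1.0
for a, ms in zip(alphas, margins):
    D = [d * math.exp(-a * mi) for d, mi in zip(D, ms)]  # scaling of Eq. (10)
    Z = sum(D)                                           # Eq. (11)
    prod_Z *= Z
    D = [d / Z for d in D]                               # normalize D_{t+1}

# E-hat_1 of the ensemble alpha_1*h_1 + alpha_2*h_2, computed directly:
# each pair's margins are summed over rounds before exponentiating
e1 = sum(math.exp(-sum(a * ms[i] for a, ms in zip(alphas, margins)))
         for i in range(m)) / m
# e1 and prod_Z agree to machine precision
```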

Equation (12) is fundamental to Rb-d. It proves that if we can minimize each \(Z_t\), the hypothesis output by Rb-d minimizes \(\hat{E}_1\). Therefore, at line 6 Rb-d chooses the weak ranker that minimizes the expression \(\delta (h_t) = \varepsilon ^-_t - \varepsilon ^+_t\),

$$\begin{aligned} h_t = {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{h \in \mathscr {H}'}}\delta (h) \end{aligned}$$
(13)

where \(\mathscr {H}'\) is the set of weak rankers, and at line 7, Rb-d defines

$$\begin{aligned} \alpha _t = \frac{1}{2} \log \frac{\varepsilon ^+_t}{\varepsilon ^-_t}. \end{aligned}$$
(14)

Such an \(h_t\) and \(\alpha _t\) produce the minimum \(Z_t\) value.
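As a sketch (our own helper, not from the original paper), the per-round quantities of Rb-d can be computed as follows; at the minimizing \(\alpha _t\) of (14), (11) simplifies to \(Z_t = \varepsilon ^0_t + 2\sqrt{\varepsilon ^+_t \varepsilon ^-_t}\):

```python
import math

def rbd_round(D, margins):
    """Rb-d per-round values: eps of Eq. (8), alpha of Eq. (14), Z of Eq. (11).

    D: current distribution over critical pairs.
    margins: margins[i] = y_i * (h_t(x'_i) - h_t(x_i)), each in {-1, 0, +1}.
    Assumes eps^+ and eps^- are both nonzero so that (14) is well-defined.
    """
    ep = sum(d for d, mi in zip(D, margins) if mi == 1)
    em = sum(d for d, mi in zip(D, margins) if mi == -1)
    e0 = sum(d for d, mi in zip(D, margins) if mi == 0)
    alpha = 0.5 * math.log(ep / em)                          # Eq. (14)
    Z = e0 + ep * math.exp(-alpha) + em * math.exp(alpha)    # Eq. (11)
    return alpha, Z
```

Perturbing the returned \(\alpha _t\) in either direction only increases (11), confirming it is the minimizer.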

\(\hat{E}_1\) has a nice mathematical property exploited by Rb-d. The error contribution of pair i is \(\exp [-y_i(h(x'_i) - h(x_i))]\), and

$$\begin{aligned} e^{-y_i((\alpha _1 h_1 + \alpha _2 h_2)(x'_i) - (\alpha _1 h_1 + \alpha _2 h_2)(x_i))} = e^{-y_i \alpha _1 h_1(x'_i)}e^{-y_i \alpha _2 h_2(x'_i)}e^{y_i \alpha _1 h_1(x_i)}e^{y_i\alpha _2 h_2(x_i)}. \end{aligned}$$
(15)

This property allows Rb-d to be simple. For example, Rb-d can reuse the same ranker in different iterations without any adjustments to the algorithm. Also, this property is used in the bipartite ranking scenario where \(\mathscr {X}\) is partitioned into two sets \(\mathscr {X}_0\) and \(\mathscr {X}_1\) where all elements of \(\mathscr {X}_0\) are ranked above all elements of \(\mathscr {X}_1\), and the ranking makes no distinction on pairs otherwise. In this ranking scenario, we exploit (15) to replace the distribution over pairs with distributions over instances, and the result is an efficient algorithm that gives the same result as Rb-d. See Freund et al. (2003) for details. This paper, though, does not consider the bipartite ranking scenario.

We present two important theoretical properties of Rb-d. Observation 1 is from Rudin et al. (2005), and it follows from the fact that \(\hat{E}_1\) is a convex function and thus has a unique global minimum. Theorem 1, found in Mohri et al. (2012), is a direct analog of an equivalent theorem for Adaboost of Freund and Schapire (1995). The theorem states that the ranking loss of Rb-d decreases exponentially with respect to boosting rounds.

Observation 1

(Rudin et al. 2005) Assuming that Rb-d allows negative \(\alpha \) values and assuming that (14) is well-defined in all rounds of the algorithm, Rb-d is a coordinate descent algorithm that converges to the global minimum of \(\hat{E}_1\) on the vector space spanned by the weak rankers of \(\mathscr {H}\).

Theorem 1

(Mohri et al. 2012) Assuming that (14) is well-defined in all rounds of the algorithm, the empirical error of the hypothesis g returned by Rb-d satisfies

$$\begin{aligned} \hat{R}_1(g) \le \exp \left[ -2\sum _{t=1}^T \left( \frac{\delta (h_t)}{2}\right) ^2 \right] = \exp \left[ -2\sum _{t=1}^T \left( \frac{\varepsilon _t^+ - \varepsilon _t^-}{2}\right) ^2 \right] . \end{aligned}$$
(16)

It is useful to note that the assumptions emphasized above are often left unstated in the literature, or in some cases do not hold (for example, p. 6 of Rudin et al. 2005 and p. 218 and Fig. 9.1, line 4 of Mohri et al. 2012 assume nonnegative \(\alpha \) values for Rb-d). If the assumptions do not hold, we can create ranking scenarios that would violate the theoretical guarantees. We provide these counterexamples in Sect. B in the appendix for completeness.

2.2 Opportunities for improvement

We noted above that Freund et al. (2003) identifies two rank loss functions. The fact that Freund et al. (2003) uses \(\hat{R}_2\) to evaluate the performance of Rb-d suggests that there are many ranking scenarios for which \(\hat{R}_2\) may be a better measure of rank loss than \(\hat{R}_1\). In addition, there are issues with using \(\hat{E}_1\) as an upper bound for \(\hat{R}_1\). Note that \(\hat{E}_1\) treats ties differently than \(\hat{R}_1\). Specifically, \(\hat{E}_1\) decreases the weight of ties relative to both reverse ranked and correctly ranked pairs. This causes \(\hat{R}_1\) and \(\hat{E}_1\) to order the weak rankers differently.

The functions \(\hat{R}_1\) and \(\hat{E}_1\) each give a quasiordering on the rankers of \(\mathscr {H}\). Ideally, we would like them to induce the same ordering. It may be unrealistic for an exponential loss function to give the same ordering as its corresponding rank loss function for all rankers in \(\mathscr {H}\). However, since Algorithm 1 is iteratively finding a minimal binary weak ranker to add to the ensemble, we argue that a reasonable property for the exponential loss function is for it to give the same ordering on binary weak rankers as the rank loss function. Because \(\hat{R}_1\) and \(\hat{E}_1\) treat ties differently, there are natural datasets for which \(\hat{R}_1\) and \(\hat{E}_1\) do not induce the same quasiordering on the set of all binary weak rankers. The proof of this proposition is in Sect. A in the appendix.

Proposition 1

There exist datasets and binary weak rankers \(h_1\) and \(h_2\) such that \(\hat{R}_1(h_1) > \hat{R}_1(h_2)\) but \(\hat{E}_1(h_1) < \hat{E}_1(h_2)\), \(\hat{E}_1(\alpha _{h_1} h_1) < \hat{E}_1(\alpha _{h_2} h_2)\) where \(\alpha _{h_1}\) and \(\alpha _{h_2}\) are the weights assigned by Rb-d given distribution \(D_1\), and \(\hat{E}_1(\alpha ^c_{h_1} h_1) < \hat{E}_1(\alpha ^c_{h_2} h_2)\) where \(\alpha ^c_{h_1}\) and \(\alpha ^c_{h_2}\) are the weights assigned by Rb-c given distribution \(D_1\).

2.3 The continuous variation of Rankboost

While Rb-d is the version of Rankboost used in theoretical analysis, Rb-c is the version most often used in practice. In Rb-c, the weak rankers can map elements to the range [0, 1], and the only change in Algorithm 1 is in the computation of \(\alpha _t\) on line 7. While we know of no theorems explicitly about the behavior of Rb-c, Rb-c generally outperforms Rb-d even in the scenario of binary weak rankers for which Rb-d was designed (Freund et al. 2003).

Rb-c still uses \(\hat{E}_1\) as the error approximation to be minimized. However, to find the value for \(\alpha _t\), Rb-c minimizes not \(Z_t\) itself but an upper bound on \(Z_t\). Let

$$\begin{aligned} r_t = \sum _{i=1}^m D_t(i) y_i(h_t(x'_i) - h_t(x_i)). \end{aligned}$$
(17)

The upper bound for \(Z_t\) used by Rb-c is

$$\begin{aligned} Z_t \le \left( \frac{1-r_t}{2}\right) e^{\alpha _t} + \left( \frac{1+r_t}{2}\right) e^{-\alpha _t} , \end{aligned}$$
(18)

and the \(\alpha _t\) that minimizes the right hand side of (18) is

$$\begin{aligned} \alpha _t = \frac{1}{2}\log \frac{1+r_t}{1-r_t} . \end{aligned}$$
(19)

A nice property of Rb-c is that (19) is well defined for all but perfect rankers. However, the theoretical guarantees of Observation 1 and Theorem 1 are not known to hold for Rb-c. Rb-c generally outperforms Rb-d in practice, and in Sect. 3.4 below we give mathematical intuition for why we see this behavior.
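A sketch of the Rb-c weight computation (our own illustration): unlike (14), which diverges whenever \(\varepsilon ^-_t = 0\), (19) stays finite as long as \(|r_t| < 1\):

```python
import math

def rbc_alpha(D, margins):
    """Rb-c weight: r_t of Eq. (17), then alpha_t of Eq. (19).

    D: distribution over critical pairs.
    margins: margins[i] = y_i * (h_t(x'_i) - h_t(x_i)).
    Well-defined whenever |r_t| < 1, i.e. for all but perfect
    (or perfectly reversed) rankers.
    """
    r = sum(d * mi for d, mi in zip(D, margins))
    return 0.5 * math.log((1 + r) / (1 - r))
```

For example, a ranker that correctly orders one pair and ties the rest has \(\varepsilon ^-_t = 0\), so (14) is undefined, yet (19) returns a finite weight.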

3 Improving Rankboost

In this section, we propose Rankboost \(+\), an improvement to Rankboost. The inspiration for Rankboost \(+\) comes from the issues observed above with using \(\hat{E}_1\) as an upper bound for \(\hat{R}_1\) and the observation in Freund et al. (2003) that in many ranking scenarios \(\hat{R}_2\) may be a better loss function for ranking. Below, we first derive \(\hat{E}_2\), an exponential loss function that orders the weak rankers the same as \(\hat{R}_2\). We then follow the same Rankboost framework and identify the weak ranker, the weight to give it, and the distribution update rule that minimizes \(\hat{E}_2\). Doing this gives us the same theoretical guarantees as Rb-d, but without the often unstated assumptions and without the issues of \(\hat{E}_1\). In addition, this work provides mathematical justification for why Rb-c outperforms Rb-d. In the next section we test Rankboost \(+\) on real-world data sets and demonstrate that Rankboost \(+\) outperforms both Rb-d and Rb-c. To keep the presentation in this section clean, the proofs of the propositions and theorems are placed in Sect. A of the appendix.

3.1 Defining \(\hat{E}_2\)

In designing Rankboost \(+\), we keep to the framework of Rankboost. Equation (6) above defines \(F_1\) on the vector \(\varvec{\eta } = [\eta _1,\ldots ,\eta _N]\). We define an analogous loss function \(F_2\) on \(\varvec{\eta }\), and then from \(F_2\) we get an equivalent loss function \(\hat{E}_2\) defined on the ensemble ranker \(\sum _{s=1}^N \eta _s f_s\). For the purpose of defining \(F_2\), we write \(F_1\) in an equivalent but slightly different manner:

$$\begin{aligned} F_1\left( \varvec{\eta }\right) = \frac{1}{m}\sum _{i=1}^m e^{\sum _{s=1}^N \ln \omega ^*_1(i,f_s,\eta _s)} \end{aligned}$$
(20)

where

$$\begin{aligned} \omega ^*_1(i,f_s,\eta _s) = \left\{ \begin{array}{ll} e^{-\eta _s}&{}\text { if pair }i\text { correctly ranked by }f_s,\\ e^{\eta _s}&{}\text { if reverse ranked by }f_s\text {, and}\\ 1&{}\text { if tied by }f_s. \end{array} \right. \end{aligned}$$
(21)

We use (20) as our template to define \(F_2\), and we define \(F_2\) to use the same exponential loss values as \(F_1\) for pairs \(\eta _s f_s\) either correctly ranks or reverse ranks: \(e^{-\eta _s}\) and \(e^{\eta _s}\), respectively. The only difference between \(F_2\) and \(F_1\) is on tied pairs. For ties \(F_2\) follows the behavior of \(\hat{R}_2\) and sets the exponential rank loss to be the average of the correct and reverse rank values: \(\frac{1}{2}\left( e^{-\eta _s} + e^{\eta _s}\right) = \cosh (\eta _s)\). It is straightforward to see that using this average is necessary for \(F_2\), and by extension \(\hat{E}_2\), to induce the same order on the weak rankers as \(\hat{R}_2\). Suppose some weak ranker makes \(a\) reverse rankings and \(b\) ties. Another ranker with \(a-c\) reverse rankings and \(b+2c\) ties, for any \(c\), will have the same \(\hat{R}_2\) error. However, any loss function that does not assign to ties the average of the misranking and correct ranking values will give a different loss to these two rankers. In Proposition 3 below, we prove that using the average is also sufficient.

Using the same form as (21), we have

$$\begin{aligned} F_2\left( \varvec{\eta }\right) = \frac{1}{m}\sum _{i=1}^m e^{\sum _{s=1}^N \ln \omega ^*_2(i,f_s,\eta _s)} \end{aligned}$$
(22)

where function \(\omega ^*_2\) is defined as

$$\begin{aligned} \omega ^*_2(i,f_s,\eta _s) = \left\{ \begin{array}{ll} e^{-\eta _s}&{}\text { if pair }i\text { correctly ranked by }f_s,\\ e^{\eta _s}&{}\text { if reverse ranked by }f_s\text {, and}\\ \cosh (\eta _s)&{}\text { if tied by }f_s. \end{array} \right. \end{aligned}$$
(23)

Note that just like \(F_1\), \(F_2\) is a convex function. Therefore, once we define Rankboost \(+\), we can use convexity and Eq. (29) below, the analogous equation to (12) for Rb-d, to prove that Rankboost \(+\) has theoretical properties analogous to Observation 1 and Theorem 1.

We can now define \(\hat{E}_2\) as a function on the ensemble ranker produced at each iteration of Algorithm 1. Given ensemble \(g = \sum _{s=1}^t \alpha _s h_s\), let \(\eta ' = [\eta '_1,\ldots ,\eta '_N]\) be the element of \(\mathbb {R}^N\) where \(\eta '_i\) is the sum of \(\alpha _j\) for which \(f_i = h_j\). Then,

$$\begin{aligned} \hat{E}_2\left( \sum _{s=1}^t \alpha _s h_s\right) = F_2\left( \varvec{\eta }'\right) . \end{aligned}$$
(24)
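As a sketch, (22) through (24) can be computed directly (our own rendering; `margin_matrix[i][s]` holds the hypothetical value \(y_i(f_s(x'_i) - f_s(x_i))\)):

```python
import math

def omega2(margin, eta):
    """omega*_2 of Eq. (23) for one pair and one weak ranker."""
    if margin == 1:
        return math.exp(-eta)      # correctly ranked
    if margin == -1:
        return math.exp(eta)       # reverse ranked
    return math.cosh(eta)          # tie: average of the two cases

def e2_hat(etas, margin_matrix):
    """E-hat_2 via Eqs. (22) and (24).

    etas: cumulative weight of each weak ranker f_s in the ensemble.
    margin_matrix: margin_matrix[i][s] = y_i * (f_s(x'_i) - f_s(x_i)).
    """
    total = 0.0
    for row in margin_matrix:
        prod = 1.0
        for mi, eta in zip(row, etas):
            prod *= omega2(mi, eta)
        total += prod
    return total / len(margin_matrix)
```

For a single ranker with weight \(\eta \) on a sample with one correct, one reversed, and one tied pair, this evaluates to \(\frac{1}{3}(e^{-\eta } + e^{\eta } + \cosh \eta ) = \cosh \eta \ge \frac{1}{2} = \hat{R}_2\), consistent with Proposition 2.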

For completeness, Proposition 2 below proves that \(\hat{E}_2\) is an upper bound for \(\hat{R}_2\). Proposition 3 then proves that \(\hat{E}_2\) has the nice property that it induces the same quasiordering on the weak rankers as \(\hat{R}_2\).

Proposition 2

Let \(g = \sum _{s=1}^N \eta _s f_s\). Then \(\hat{E}_2(g) \ge \hat{R}_2(g)\).

Proposition 3

For all weak rankers \(h_1\) and \(h_2\), if \(\hat{R}_2(h_1) < \hat{R}_2(h_2)\), then \(\hat{E}_2(h_1) < \hat{E}_2(h_2)\).

Note that Proposition 3 is concerned only with the ordering of weak rankers. Minimizing \(\hat{R}_2\) is NP-complete (Cohen et al. 1999), and we are therefore unlikely to find an exponential loss function that induces the same ordering as \(\hat{R}_2\) on all possible rankers.

3.2 The Rankboost \(+\) algorithm

Rankboost \(+\) has the same framework as Rankboost (Algorithm 1). To describe it, we reuse the notations \(Z_t\), \(\alpha _t\), and \(D_t\) but change their definitions.

Just as with Rankboost, Rankboost \(+\) maintains a distribution \(D_t\) on the critical pairs, and after choosing a weak ranker \(h_t\) and weight \(\alpha _t\) to add to the ensemble ranker, Rankboost \(+\) updates the distribution weight for pair i by multiplying the weight by a scaling function that we denote \(\omega ^+_t\).

Recall that the scaling function for Rb-d, Eq. (9), maps pair i to its contribution to \(\hat{E}_1(\alpha _t h_t)\). In defining the scaling function \(\omega ^{+}_t\) for Rankboost \(+\), we note that when running Algorithm 1 the same weak ranker may be chosen in multiple iterations. Consider the ensemble ranker created after iteration \(t-1\): \(g_{t-1} = \sum _{s=1}^{t-1} \alpha _s h_s\) and the ensemble ranker created after iteration t: \(g_t = \sum _{s=1}^t \alpha _s h_s\). Let \(\eta ' \in \mathbb {R}^N\) be the vector such that \(F_2(\eta ') = \hat{E}_2(g_{t-1})\), and let \(\eta \in \mathbb {R}^N\) be the vector such that \(F_2(\eta ) = \hat{E}_2(g_{t})\). Using this notation, \(\eta '_j\) is the cumulative weight of ranker \(f_j\) in the ensemble at the start of iteration t and \(\eta _j\) is the cumulative weight after iteration t. Suppose that weak ranker \(f_j\) is chosen at iteration t: \(h_t = f_j\), and suppose that \(f_j\) was chosen at earlier iterations. The contribution of \(f_j\) to \(F_2\), and thus to \(\hat{E}_2\), for pair i is \(\omega ^*_2(i, f_j, \eta '_j)\) before iteration t and \(\omega ^*_2(i, f_j, \eta _j)\) after. We define \(\omega ^+_t\), the scaling function for Rankboost \(+\) in iteration t, such that \(\omega ^+_t(i)\cdot \omega ^*_2(i, f_j, \eta '_j) = \omega ^*_2(i, f_j, \eta _j)\). To be consistent with our use of \(\eta \) for a vector of \(\mathbb {R}^N\) and \(\alpha \) for the weights of the ensemble, we define \(\alpha '_t\) to be the total weight assigned to \(f_j\) in the ensemble at the start of iteration t. Thus, \(\alpha '_t = \eta '_j\), and

$$\begin{aligned} \omega ^+_t(i) = \left\{ \begin{array}{ll} e^{-\alpha _t}&{}\text { if pair }i\text { correctly ranked by }h_t,\\ e^{\alpha _t}&{}\text { if reverse ranked by }h_t\text {, and}\\ \frac{\cosh (\alpha _t + \alpha '_t)}{\cosh (\alpha '_t)}&{}\text { if tied by }h_t. \end{array} \right. \end{aligned}$$
(25)

Defining \(\omega ^+_t\) in this manner gives us Eq. (29) below that is analogous to Eq. (12). Recall that (12) is fundamental to Rb-d. In fact, (12) is used to prove Theorem 1, and (12) and the convexity of \(\hat{E}_1\) is needed for Observation 1.

Line 9 of Rankboost \(+\) is very similar to line 9 of Rankboost:

$$\begin{aligned} D_{t+1}(i) = \frac{D_t(i)\omega ^+_t(i)}{Z_t} \end{aligned}$$
(26)

with

$$\begin{aligned} Z_t = \varepsilon ^+_t e^{-\alpha _t} + \varepsilon ^-_t e^{\alpha _t} + \varepsilon ^0_t \frac{\cosh (\alpha _t + \alpha '_t)}{\cosh \alpha '_t} . \end{aligned}$$
(27)

Just as with Rb-d, Rankboost \(+\) chooses the weak ranker \(h_t\) and weight \(\alpha _t\) that minimize \(\hat{E}_2\). At iteration \(t\) we have \(g_{t-1} = \sum _{s=1}^{t-1} \alpha _s h_s\). We calculate the derivative of \(\hat{E}_2(g_{t-1} + \alpha _t h_t)\) with respect to \(\alpha _t\).

Using induction on (26) gives

$$\begin{aligned} D_{t+1}(i)\prod _{s=1}^t Z_s = D_1(i) \prod _{s=1}^t \omega ^+_s(i). \end{aligned}$$
(28)

Summing (28) over the critical pairs and applying (24), the definition of \(\hat{E}_2\), gives

$$\begin{aligned} \hat{E}_2\left( \sum _{s=1}^t\alpha _s h_s\right) = \prod _{s=1}^t Z_s. \end{aligned}$$
(29)

Therefore, we have the derivative

$$\begin{aligned} \frac{d\hat{E}_2(g_{t-1} + \alpha _t h_t)}{d\alpha _t}&= \frac{d Z_t}{d\alpha _t} \prod _{s=1}^{t-1}Z_s \nonumber \\&= \left( -\varepsilon ^+_t e^{-\alpha _t} + \varepsilon ^-_t e^{\alpha _t} + \varepsilon ^0_t \frac{\sinh (\alpha _t + \alpha '_t)}{\cosh \alpha '_t}\right) \prod _{s=1}^{t-1}Z_s \end{aligned}$$
(30)

where \(\sinh (\alpha ) = \frac{1}{2}\left( e^{\alpha } - e^{-\alpha }\right) \).

At iteration t, we want to add the weak ranker \(h_t\) that will cause the greatest decrease in \(\hat{E}_2\). Consider the directional derivative

$$\begin{aligned} \left. \frac{d\hat{E}_2(g_{t-1} + \alpha _t h_t)}{d\alpha _t}\right| _{\alpha _t = 0} = \left[ -\varepsilon _t^{+} + \varepsilon _t^{-} + \varepsilon _t^0 \frac{\sinh \alpha '_t}{\cosh \alpha '_t}\right] \prod _{s=1}^{t-1}Z_s. \end{aligned}$$
(31)

As we are not restricting \(\alpha _t\) to be positive, the \(h_t\) that achieves the greatest decrease in \(\hat{E}_2\) corresponds to the direction in (31) with the slope of greatest magnitude. Therefore, on line 6 of Algorithm 1 Rankboost \(+\) chooses \(h_t\) such that

$$\begin{aligned} h_t = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{h \in \mathscr {H}'}} \left| \delta (h) \right| , \end{aligned}$$
(32)

where \(\mathscr {H}'\) is the set of binary weak rankers and

$$\begin{aligned} \delta (h_t) = \varepsilon _t^{-} - \varepsilon _t^{+} + \varepsilon _t^0 \frac{\sinh \alpha '_t}{\cosh \alpha '_t} . \end{aligned}$$
(33)

Setting \(\frac{d\hat{E}_2(g_{t-1} + \alpha _t h_t)}{d\alpha _t} = 0\) and solving for \(\alpha _t\) gives line 7 of Rankboost \(+\):

$$\begin{aligned} \alpha _t = \frac{1}{2}\log \frac{\varepsilon _t^{+} + \varepsilon ^0_t\frac{\exp (-\alpha '_t)}{2 \cosh \alpha '_t}}{\varepsilon _t^{-} + \varepsilon _t^0\frac{\exp \alpha '_t}{2 \cosh \alpha '_t}}. \end{aligned}$$
(34)
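The two quantities can be sketched as follows (our own helper). Note that when \(\alpha '_t = 0\), both factors \(\frac{e^{\mp \alpha '_t}}{2\cosh \alpha '_t}\) equal \(\frac{1}{2}\), so (34) reduces to \(\frac{1}{2}\log \frac{\varepsilon ^+_t + \varepsilon ^0_t/2}{\varepsilon ^-_t + \varepsilon ^0_t/2}\); for binary weak rankers this coincides with the Rb-c weight (19), since \(r_t = \varepsilon ^+_t - \varepsilon ^-_t\) implies \(\frac{1\pm r_t}{2} = \varepsilon ^\pm _t + \frac{\varepsilon ^0_t}{2}\):

```python
import math

def rbp_round(ep, e0, em, alpha_prev):
    """Rankboost+ selection score delta (Eq. 33) and weight alpha_t (Eq. 34).

    ep, e0, em: eps^+_t, eps^0_t, eps^-_t of h_t under D_t.
    alpha_prev: alpha'_t, the weight already accumulated on h_t in the
    ensemble (0 if h_t has not been chosen before).
    """
    delta = em - ep + e0 * math.tanh(alpha_prev)                      # Eq. (33)
    c_plus = math.exp(-alpha_prev) / (2 * math.cosh(alpha_prev))
    c_minus = math.exp(alpha_prev) / (2 * math.cosh(alpha_prev))
    alpha = 0.5 * math.log((ep + e0 * c_plus) / (em + e0 * c_minus))  # Eq. (34)
    return delta, alpha
```

At the returned \(\alpha _t\), the derivative (30) vanishes, as a quick numerical check confirms.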

Now that we have defined the weights \(\alpha _t\) used by Rankboost \(+\), we give a corollary to Proposition 3 to show that, in contrast to Rb-d and Rb-c (see Proposition 1 above), the weighted weak rankers used by Rankboost \(+\) in the first iteration of the algorithm have the same quasiordering with respect to both \(\hat{R}_2\) and \(\hat{E}_2\).

Corollary 1

For all weak rankers \(h_1\) and \(h_2\), if \(\hat{R}_2(h_1)< \hat{R}_2(h_2) < \frac{1}{2}\), then \(\hat{E}_2(\alpha _{h_1}h_1) < \hat{E}_2(\alpha _{h_2}h_2)\) where \(\alpha _{h_1}\) and \(\alpha _{h_2}\) are the weights assigned by Rankboost \(+\) given distribution \(D_1\).

In Corollary 1, it is sufficient to consider weak rankers \(h\) with \(\hat{R}_2(h) < \frac{1}{2}\). Suppose we have ranker \(h\) with \(\hat{R}_2(h) > \frac{1}{2}\) and the “inverse ranker” \(h'\) that reverses every ranking choice in \(h\). Let \(D\) be any distribution on the critical pairs. Then \(\hat{R}_2(h') = 1-\hat{R}_2(h)\) and \(\hat{E}_2(\alpha ' h') = \hat{E}_2(\alpha h)\) where \(\alpha '\) and \(\alpha \) are the weights selected by Rankboost \(+\) for \(h'\) and \(h\), respectively, given distribution \(D\) [see Eq. (34)].

We close this section by noting some subtle issues that arise from using \(\cosh (\alpha _t)\) for the exponential loss of tied pairs. One change from Rankboost, as indicated by the scaling function (25), is that Rankboost \(+\) needs to keep track of the accumulated weight assigned to each unique weak ranker. For a second difference from Rankboost, let \(\mathscr {H}'\) be a set of binary weak rankers and let \(\mathscr {G}\) be a proper subset of \(\mathscr {H}'\). Then \(\mathscr {G}\) induces a different convex function \(F_2\) than \(\mathscr {H}'\) does, whereas the convex function \(F_1\) differs only when \(\mathrm {span}(\mathscr {G}) \ne \mathrm {span}(\mathscr {H}')\). For example, consider the simple scenario where we have three weak rankers \(f_1\), \(f_2\), and \(f_3\) with \(f_2 = f_3\). As noted above, \(F_2\) is a convex function, and Theorem 3 below proves that Rankboost \(+\) converges to the unique minimum of this convex function. While the rankers \(\eta _1 f_1 + \eta _2 f_2 + \eta _3 f_3\) and \(\eta _1 f_1 + (\eta _2 + \eta _3) f_2\) are identical, the convex function defined by \(F_2\) on \(f_1\), \(f_2\), and \(f_3\) differs from that defined on \(f_1\) and \(f_2\). In particular, the former is

$$\begin{aligned} \begin{aligned} a_1 e^{-\eta _1 - \eta _2 - \eta _3} + a_2 e^{-\eta _1 - \eta _2 + \eta _3} + a_3 e^{-\eta _1 + \eta _2 - \eta _3} + a_4 e^{-\eta _1 + \eta _2 + \eta _3} \\ +\,a_5 e^{\eta _1 - \eta _2 - \eta _3} + a_6 e^{\eta _1 - \eta _2 + \eta _3} + a_7 e^{\eta _1 + \eta _2 + \eta _3} + a_8 e^{\eta _1 + \eta _2 - \eta _3} \end{aligned} \end{aligned}$$

where \(a_1, \ldots , a_8\) are constants defined by the behavior of \(f_1\), \(f_2\), and \(f_3\). If we instead restrict the weak rankers to be just \(f_1\) and \(f_2\), and assign the weight \(\eta _2 + \eta _3\) to weak ranker \(f_2\), then \(F_2\) defines a different convex function

$$\begin{aligned} (a_1 + a_2) e^{-\eta _1 - \eta _2 - \eta _3} + (a_3 + a_4) e^{-\eta _1 + \eta _2 + \eta _3} + (a_5 + a_6) e^{\eta _1 - \eta _2 - \eta _3} + (a_7 + a_8) e^{\eta _1 + \eta _2 + \eta _3} . \end{aligned}$$

This leads to the question: what is the effect on \(F_2\) of having a linearly dependent set of weak rankers? Recall that the only difference between \(F_1\) and \(F_2\) is the treatment of ties. Adding a weak ranker that is a linear combination of other weak rankers has the effect of replacing a \(\cosh (x)\) term in \(F_2\) with a \(\cosh (x_1)\cosh (x_2)\) term where \(x = x_1 + x_2\). Subject to this constraint, the minimum of \(\cosh (x_1)\cosh (x_2)\) occurs at \(x_1 = x_2 = \frac{x}{2}\), and \(\cosh ^2(x/2) < \cosh (x)\) for \(x \ne 0\). As a result, adding a linearly dependent weak ranker has the effect of decreasing the weight of ties. As \(\cosh (x) \ge 1\), \(F_2(\varvec{\eta }) \ge F_1(\varvec{\eta })\) for all \(\varvec{\eta }\), and as \(\lim _{k \rightarrow \infty } \cosh ^k(x/k) = 1\), we see that as we add more linearly dependent weak rankers to our set, the vector that minimizes \(F_2\) converges to the vector that minimizes \(F_1\).
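
The two facts used in this paragraph follow from the half-angle identity for \(\cosh \); we include a short derivation for completeness:

$$\begin{aligned} \cosh ^2(x/2) = \frac{\cosh (x) + 1}{2} < \cosh (x) \quad \text {whenever } \cosh (x) > 1 \text {, i.e. for all } x \ne 0 , \end{aligned}$$

and, since \(\cosh (x/k) = 1 + \frac{x^2}{2k^2} + O(k^{-4})\),

$$\begin{aligned} \lim _{k \rightarrow \infty } \cosh ^k(x/k) = \lim _{k \rightarrow \infty } \left( 1 + \frac{x^2}{2k^2} + O(k^{-4})\right) ^{k} = 1 . \end{aligned}$$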

This analysis suggests that, in ranking scenarios where \(\hat{R}_2\) is a better loss function for ranking than \(\hat{R}_1\), we should get optimal behavior if we define \(F_2\) on a linearly independent set of weak rankers. Doing so achieves two benefits. It maximizes the difference between \(F_1\) and \(F_2\), and thus the difference between the behavior of Rankboost and Rankboost \(+\). It also makes \(F_2\) most closely match the property of \(\hat{R}_2\) that the weight of a tied pair equals the average of the weights of the correctly ranked and incorrectly ranked pairs.

3.3 Theoretical properties of Rankboost \(+\)

The following two theorems show that Rankboost \(+\) has theoretical guarantees similar to those described above for Rb-d: its ranking loss decreases exponentially in the number of boosting rounds, and it converges to the global minimum of \(\hat{E}_2\) on the space of all linear combinations of weak rankers. Please see Sect. A of the appendix for the proofs.

Theorem 2

The empirical error of the hypothesis g returned by Rankboost \(+\) satisfies

$$\begin{aligned} \hat{R}_2(g) \le \exp \left[ -2 \sum _{t=1}^T\left( \frac{\delta (h_t)}{2}\right) ^2\right] = \exp \left[ -2 \sum _{t=1}^T\left( \frac{\varepsilon ^+_t - \varepsilon ^-_t - \varepsilon ^0_t \frac{\sinh (\alpha '_t)}{\cosh (\alpha '_t)}}{2}\right) ^2\right] . \end{aligned}$$
(35)

Furthermore, if there exists \(\gamma \) such that for all \(t\in [1, T], 0< \gamma \le \frac{\delta (h_t)}{2}\), then

$$\begin{aligned} \hat{R}_2(g)\le \exp (-2\gamma ^2 T) \end{aligned}$$
(36)

Theorem 3

Let \(\mathscr {H}'\) be a set of binary weak rankers. Rankboost \(+\) is a coordinate descent algorithm that converges to the global minimum of \({F}_2\) on the vector space spanned by \(\mathscr {H}'\).

3.4 Why does Rb-c outperform Rb-d in practice?

Our analysis of Rankboost \(+\) sheds light on why Rb-c performs better than Rb-d in the binary weak ranker scenario, and why we expect Rankboost \(+\) to perform better than Rb-c.

Recall that Rb-c minimizes (18), and the \(\alpha _t\) value is chosen with Eq. (19). Let

$$\begin{aligned} \begin{array}{lll} r^+_t &{}=&{} \sum \nolimits _{i \in \{ j \mid y_j(h_t(x'_j) - h_t(x_j)) > 0 \}} D_t(i) y_i(h_t(x'_i) - h_t(x_i)), \\ r^-_t &{}=&{} -\sum \nolimits _{i \in \{ j \mid y_j(h_t(x'_j) - h_t(x_j)) < 0 \}} D_t(i) y_i(h_t(x'_i) - h_t(x_i)), \text { and} \\ r^0_t &{}=&{} \sum \nolimits _{i \in \{ j \mid y_j(h_t(x'_j) - h_t(x_j)) = 0 \}} D_t(i). \end{array} \end{aligned}$$
(37)

If we restrict ourselves to the binary weak ranker scenario of Rb-d, where \(h_t(x'_j) - h_t(x_j) \in \{-1,0,1\}\), then \(r^+_t + r^-_t + r^0_t = 1\). Substituting this equation into (18) shows that the value Rb-c minimizes is

$$\begin{aligned} \left( \frac{1-r_t}{2}\right) e^{\alpha _t} + \left( \frac{1+r_t}{2}\right) e^{-\alpha _t} = r^+_t e^{-\alpha _t} + r^-_t e^{\alpha _t} + r^0_t \cosh (\alpha _t), \end{aligned}$$
(38)

and the \(\alpha _t\) value of (19) for Rb-c is

$$\begin{aligned} \frac{1}{2}\log \frac{1+r_t}{1-r_t} = \frac{1}{2}\log \frac{r^+_t + \frac{1}{2}r^0_t}{r^-_t + \frac{1}{2}r^0_t}. \end{aligned}$$
(39)

Note that the right hand side of (38) is equivalent to (27) with \(\alpha '_t = 0\), and (19) for determining Rb-c’s \(\alpha _t\) value is equivalent to (34) with \(\alpha '_t = 0\). Furthermore, Eq. (13) used by Rb-c to choose ranker \(h_t\) is equivalent to (32) with \(\alpha '_t = 0\). Thus in the binary weak ranker scenario, Rb-c is finding the weak ranker and weight that minimize \(\hat{E}_2\) instead of \(\hat{E}_1\). Under our hypothesis that minimizing \(\hat{E}_2\) is a better objective for ranking than minimizing \(\hat{E}_1\), we expect Rb-c to perform better than Rb-d.
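
This equivalence is easy to check numerically. The sketch below is our own illustration (function and argument names are ours); it computes both forms of Eq. (39), using \(r_t = r^+_t - r^-_t\), which follows from the definitions in (37), together with \(r^+_t + r^-_t + r^0_t = 1\):

```python
import math

def rbc_alpha_from_r(r):
    # Left-hand side of Eq. (39): Rb-c's weight from r_t, as in Eq. (19).
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def rbc_alpha_from_parts(r_pos, r_neg, r_tie):
    # Right-hand side of Eq. (39); assumes r_pos + r_neg + r_tie == 1.
    return 0.5 * math.log((r_pos + 0.5 * r_tie) / (r_neg + 0.5 * r_tie))
```

Both functions return the same value, and the second form coincides with Rankboost \(+\)’s Eq. (34) at \(\alpha '_t = 0\).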

However, Rb-c does not achieve the minimum of \(\hat{E}_2\), for two reasons. First, it uses distribution update rule (10) rather than (26), so it does not scale the pairs by their error contribution to \(\hat{E}_2\). Second, when reusing a ranker already in the ensemble, or when using a ranker that is a linear combination of rankers already in the ensemble, Rb-c calculates an \(\alpha \) value that is larger in magnitude than the step implied by the corresponding directional derivative of \(\hat{E}_2\). As a result, in this situation Rb-c may not find the weak ranker and weight that minimize \(\hat{E}_2\), and may even choose a ranker and weight pair that increases \(\hat{E}_2\).

As noted above, to have the maximal separation between the minimizers of \(F_2\) and \(F_1\), and thus of \(\hat{E}_2\) and \(\hat{E}_1\), we need to define \(F_2\) on a maximal linearly independent set of weak rankers. In the limit, adding linearly dependent rankers to the ensemble causes Rankboost \(+\) and Rb-d to converge to the same ensemble ranker. This behavior is unlikely to be observed in practice because it requires a very large number of linearly dependent rankers and rounds of the algorithm. In practice, we expect the behavior of Rankboost \(+\) to converge to the behavior of Rb-c as we include more linearly dependent weak rankers in the ensemble. Any time Rankboost \(+\) adds a new weak ranker to the ensemble that is linearly dependent on rankers already in the ensemble, it has \(\alpha '_t = 0\) in that round, so Rankboost \(+\) is effectively using Rb-c’s (19) to assign the weights to the rankers. Even in the extreme case of \(\alpha '_t = 0\) in every round, the behavior of Rankboost \(+\) is not exactly the same as Rb-c’s, because the two algorithms use different distribution update rules.

To summarize, we develop our approach using a different rank loss function, \(\hat{R}_2\). We derive an exponential loss function \(\hat{E}_2\) that produces an ordering on weak rankers consistent with \(\hat{R}_2\). We then develop the Rankboost \(+\) algorithm that minimizes \(\hat{E}_2\), and we prove that Rankboost \(+\) has good theoretical properties: it exponentially decreases \(\hat{R}_2\) and converges to the global minimum of \(\hat{E}_2\).

4 Implementing Rankboost \(+\)

In this section we describe some differences in how Rankboost \(+\) is implemented relative to Rb-d and Rb-c. While all three algorithms share the same template (Algorithm 1), achieving the best results with Rankboost \(+\) requires that the set of chosen rankers be linearly independent, as discussed at the end of Sect. 3.2.

We start by considering each \(h_j\) to be a vector

$$\begin{aligned} h_j = \begin{pmatrix} y_0(h_j(x_0')-h_j(x_0))\\ y_1(h_j(x_1')-h_j(x_1))\\ \vdots \\ y_m(h_j(x_m')-h_j(x_m)) \end{pmatrix} \end{aligned}$$

At iteration t, we will have a set of linearly independent weak rankers \(S_t\) already chosen by the model, as well as a set R of candidate rankers, which represent all possible directions for the next step of boosting. We can view \(S_t\) as a matrix, with each column being some \(h_j\), allowing us to give weights \(\eta _t\) to each member of \(S_t\), so the overall model prediction on the training data at iteration t is \(S_t\eta _t\). For each candidate ranker h at iteration \(t+1\), we check whether it is in the span of \(S_t\). If it is not, we compute the score defined in Eq. (32) with \(\alpha '_{t+1}=0\). If it is, then \(S_t\beta = h\) for some vector \(\beta \). Then, rather than minimizing \(\hat{E}_2(g_t + \alpha _{t+1}h_{t+1})\) over \(\alpha _{t+1}\), we must minimize \(\hat{E}_2(g_t + \alpha _{t+1} \sum _{i=1}^{|S_t |} (S_t)_i \beta _i)\). If we consider \(\hat{E}_2\) to be a function of the weights and rankers separately, we can write this as \(F_2(\eta _t + \alpha _{t+1} \beta )\), where the underlying set of rankers is \(S_t\). To select the best ranker [Eq. (32)] for a dependent h, we must take the derivative of \(F_2\) at \(\alpha _{t+1}=0\) in the direction specified by \(\beta \): \(\left|\nabla _\beta F_2(\eta _t + \alpha _{t+1} \beta )\right|=\left|\nabla F_2(\eta _t) \cdot \frac{\beta }{||\beta ||_2}\right|\). We can compute \(\nabla \hat{E}_2\) as specified in Eq. (30) without significant computational cost, but determining which rankers are dependent and then computing the \(\beta \) vector has a complexity of \(O(m|S_t |^2)\), as this step requires a least squares solve every iteration.
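
The span test described above can be sketched in pure Python as follows. This is our illustration only; an actual implementation would use a library least-squares or QR routine, which is what the \(O(m|S_t |^2)\) complexity refers to.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _residual(v, basis):
    # Subtract from v its projection onto an orthonormal basis.
    r = list(v)
    for b in basis:
        c = _dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r

def orthonormal_basis(vectors, tol=1e-9):
    # Incremental Gram-Schmidt; silently drops linearly dependent vectors.
    basis = []
    for v in vectors:
        r = _residual(v, basis)
        n = math.sqrt(_dot(r, r))
        if n > tol:
            basis.append([ri / n for ri in r])
    return basis

def in_span(S, h, tol=1e-9):
    # True iff ranker vector h is a linear combination of the vectors in S.
    r = _residual(h, orthonormal_basis(S, tol))
    return math.sqrt(_dot(r, r)) <= tol
```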

Because of this, we provide an alternative, faster version, described in Algorithm 2, that does not significantly affect the convergence speed or generalization. This implementation still maintains the independent set \(S_t\), but does not consider descent over rankers linearly dependent on the set of already used rankers. To do this, we compute the gradients over each direction in \(S_t\), the set of rankers already selected (line 8), and then separately compute gradients over every direction in R, the set of all candidate rankers, under the assumption that no ranker in R is in the span of \(S_t\) (line 10). We select the candidate with the highest gradient magnitude and check whether it is linearly independent of \(S_t\); if it is, we add it to \(S_t\). If instead it lies in the span of \(S_t\), it is now known to be a linear combination of rankers in \(S_t\), so it is removed from R (line 13) by setting its column in R to zero, meaning it cannot be reselected. As the size of \(S_t\) grows, we still face the cost of a least squares solve. However, because the highest weights are assigned to the first several rankers, after several iterations we prune R by selecting an independent set of rankers from it. In this operation, we ensure that the chosen subset contains \(S_t\), as we have determined that the rankers in \(S_t\) account for the greatest decrease in loss. This pruning has a one-time cost of \({O}(m|R |^2)\), which is more efficient than potentially requiring an \({O}(m|S_t |^2)\) computation every iteration. We prune R in the iteration in which the algorithm first chooses a dependent ranker; on larger datasets, this step could instead be performed after a set number of iterations to limit the number of least squares solves. This can be seen in line 14 of Algorithm 2.
The function select-independent-subset chooses a maximal linearly independent subset \(T \subseteq R\) whose elements do not lie in the span of S and which satisfies \(\mathrm {span}(T) \subseteq \mathrm {span}(R) - \mathrm {span}(S)\). We implement this by computing the QR decomposition \((Q', R')\) of the concatenation \([S, \mathrm {permute}(R)]\) and discarding any column whose corresponding diagonal entry of \(R'\) is zero. Here \(\mathrm {permute}(R)\) randomly permutes the columns of R. This operation produces a set that is maximal with high probability.

Algorithm 2

The main added costs of this method over Rb-d or Rb-c are choosing a linearly independent subset of \(R\cup S\) and checking whether a ranker is in the span of \(S_t\). The latter is only needed for the first few iterations, until S is fixed, so it adds no significant computational cost. In Fig. 1, we show a comparison of the speeds of the two implementations of Rankboost \(+\) on synthetic data. At around 30 iterations, the more efficient implementation has one iteration that takes longer, which is the step where a linearly independent subset is chosen, but it then runs at approximately the same speed as Rb-d or Rb-c.

Fig. 1 Comparison of speed for implementations of Rankboost \(+\) on synthetic data. We call Rankboost \(+\) using all possible directions “Optimal Rb+” and Algorithm 2 “Efficient Rb+”

5 Empirical evaluation

In this section, we empirically study the behavior of Rb-d, Rb-c and Rankboost \(+\) on real ranking problems. While our theoretical analysis indicates that Rankboost \(+\) should be the most “well-behaved” of the three, this analysis is on training samples, so we still need to verify that it is reflected in the generalization performance. Our empirical hypothesis is that Rankboost \(+\) will significantly outperform Rb-d on real data. We hypothesize that it will also outperform Rb-c, but possibly with a smaller margin.

The ranking tasks we use are as follows. MovieLens (Harper and Konstan 2016; Weimer et al. 2008; Guiver and Snelson 2009) is a movie recommendation task where users rate movies on a scale of 1 (lowest) to 5 (highest). It contains 943 users, 1682 movies and 100,000 ratings. Each user induces a separate ranking task using the movie ratings from other users as features. To get meaningful results, we only consider users who have rated at least 100 movies, which yields 367 ranking tasks. In each such task, we select features a priori by removing other users (features) that have over \(50\%\) missing ratings on the movies in the task under consideration. The rationale is that such features are likely to carry little information about the target and are unlikely to be selected by the base learner.

LETOR  (Qin and Liu 2013; Qin et al. 2010; Xia et al. 2008; Valizadegan et al. 2009) is a benchmark collection for research on learning to rank, released by Microsoft Research Asia. We use the version of MSLR-WEB10K with 10,000 queries. In this dataset, queries and URLs are represented by IDs, and 136-dimensional continuous feature vectors are extracted from query-URL pairs along with relevance judgment labels. The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values from 0 (irrelevant) to 4 (perfectly relevant). For our experiments, we use 300 queries as separate datasets, selected by taking the 300 queries that produce the most critical pairs while having fewer than 50,000 critical pairs in total.

For the third set of tasks, we use the datasets MQ2007 (about 10,000 critical pairs) and MQ2008 (about 37,500 critical pairs) from LETOR 4.0 (Qin and Liu 2013). These datasets come from the Million Query track of TREC 2007 and TREC 2008. Each has 46 continuous features, as well as a relevance judgment, which is used to extract critical pairs.

The base ranker we use for all tested algorithms is a decision stump, which is a common choice (Freund et al. 2003). For each ranker, the leaf node labels are chosen using weighted majority vote. Missing feature values are given a value lower than all known values of the feature in their column. In the case that there are more than 255 candidate thresholds for a single feature, we randomly subsample them to determine a set of 255, as suggested by Qin et al. (2010).
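
The threshold subsampling step described above can be sketched as follows. This is our own illustration; the function name, parameters, and fixed seed are ours:

```python
import random

def candidate_thresholds(values, max_thresholds=255, seed=0):
    # Distinct feature values define candidate stump thresholds; when more
    # than max_thresholds exist, randomly subsample them, as suggested by
    # Qin et al. (2010).
    distinct = sorted(set(values))
    if len(distinct) <= max_thresholds:
        return distinct
    rng = random.Random(seed)
    return sorted(rng.sample(distinct, max_thresholds))
```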

A final detail concerns the computation of \(\hat{E}_2\) over training data for Rb-d and Rb-c. To measure \(\hat{E}_2\), we track \(F_2(\eta )\) over a linearly independent set of weak rankers accumulated across iterations. A new ranker introduced in a later iteration either lies in the span of the accumulated set or it does not. If it is independent, we add it to the accumulated set and update the \(\eta \) vector accordingly. If it lies in the span of the existing rankers, we find the update to \(\eta \) such that the resulting prediction by the independent ensemble equals the prediction made by the original, linearly dependent ensemble at that iteration. \(F_2\) is calculated with these \(\eta \) values.

Table 1 Ranks of algorithms on real-world domains w.r.t \(\hat{R}_1\), \(\hat{R}_2\) and NDCG
Table 2 Ranks of various metrics (by fold) on LETOR 4.0
Fig. 2 Convergence rates on MovieLens (top: test \(\hat{R}_1\) and \(\hat{R}_2\); middle: train \(\hat{E}_1\) and \(\hat{E}_2\); bottom: test NDCG@5). \(\hat{R}_1\) and \(\hat{E}_1\) are denoted by solid lines, and \(\hat{R}_2\) and \(\hat{E}_2\) by dashed lines

Fig. 3 Critical difference diagram for test \(\hat{R}_2\) on WEB10K

5.1 Results and discussion

We perform 5-fold cross validation on each of the 667 ranking tasks from MovieLens and LETOR (MSLR-WEB10K). For each task, we compute test ranking errors \(\hat{R}_1\) and \(\hat{R}_2\) (averaged over 5 folds) for Rankboost \(+\), Rb-d and Rb-c, as well as Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen 2002) at positions 3, 5, and 7. For each fold of a task, we take three folds for training, one fold for validation, and one for testing. When reporting results, for each metric, we choose the iteration at which that metric was minimized on the validation set, and then for each task, we average these over all 5 folds. To summarize the results across the tasks from each domain, we rank the algorithms in increasing order of test ranking error on each task, i.e. the algorithm with the smallest test error gets rank 1 and the one with the largest test error gets rank 3. We then average these ranks across all tasks to get a final average rank for each algorithm. These are shown in Table 1. For MQ2007 and MQ2008, we use the QueryLevelNorm version with the specified folds, and rank algorithms using NDCG as suggested in Qin and Liu (2013), using Mean Average Precision to select the optimal iteration when NDCG is measured. These results are shown in Table 2. The average performance measures for the different datasets are shown in Appendix C.
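
For reference, NDCG@k can be computed as below. This is our own sketch following the common LETOR convention of gain \(2^{rel}-1\) with a \(\log _2\) position discount; the original experiments may differ in details such as tie handling:

```python
import math

def dcg_at_k(rels, k):
    # Discounted cumulative gain of relevance labels in predicted order.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```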

From Table 1, we observe that, over the two domains, Rankboost \(+\) has the smallest average rank while Rb-d has the largest average rank, with Rb-c in between, but generally closer to Rankboost \(+\) than to Rb-d. This occurs for both \(\hat{R}_1\), optimized by Rb-d, and \(\hat{R}_2\), optimized by Rankboost \(+\) and (imperfectly) by Rb-c. A hypothesis test at a significance level of 0.05 on the results for WEB10K for \(\hat{R}_2\) using a critical difference diagram (Demšar 2006) is shown in Fig. 3. We see that Rankboost \(+\) significantly outperforms both Rb-d and Rb-c on \(\hat{R}_2\), indicated by the absence of connecting bars between algorithms. The same figure is obtained for MovieLens and also if \(\hat{R}_1\) is used, indicating that Rankboost \(+\) outperforms the baselines under all conditions. These results align well with the theoretical analysis above and confirm our empirical hypothesis. Our results also confirm the observation that Rb-c indeed performs better in practice than Rb-d, as suggested by the authors of the Rankboost paper (Freund et al. 2003); however, it still does not outperform Rankboost \(+\) because, as explained above, it only approximately optimizes \(\hat{E}_2\). Table 2 shows that for MQ2007 and MQ2008, again, both Rankboost \(+\) and Rb-c generally outperform Rb-d. In these two datasets, Rankboost \(+\) is generally as good as or better than Rb-c with the exception of NDCG on MQ2008. However, since we used the single specified train/test/validation partitions for these datasets, these differences are not statistically significant.

In Fig. 2, we show the rates of convergence for \(\hat{R}_1\), \(\hat{R}_2\), and NDCG@5 averaged over the test sets, as well as \(\hat{E}_1\) and \(\hat{E}_2\) averaged over the train sets on MovieLens. On the train sets, we observe that Rankboost \(+\) reduces \(\hat{E}_2\) quickly, as the theory suggests. Rb-d in fact increases \(\hat{E}_2\), though it does minimize \(\hat{E}_1\). Rb-c decreases \(\hat{E}_2\) initially; however, because it does not minimize \(\hat{E}_2\) exactly, \(\hat{E}_2\) eventually starts increasing. As expected, on \(\hat{E}_1\), Rb-d and Rb-c are better minimizers than Rankboost \(+\). On the test sets, Rankboost \(+\) produces the smallest \(\hat{R}_1\) and \(\hat{R}_2\), and Rb-d the largest, with Rb-c in between. The figure also shows that when measuring test set \(\hat{R}_2\), Rankboost \(+\) converges at around the same speed as Rb-d and Rb-c, but its average error does not significantly increase after a certain point, which indicates that Rankboost \(+\) overfits less severely than Rb-d or Rb-c. The same effect is observed in the NDCG results: Rankboost \(+\) achieves a high NDCG on the test set and tends not to overfit, while both Rb-d and Rb-c start overfitting after a few iterations.

Finally, one might wonder why Rankboost \(+\) outperforms Rb-d on \(\hat{R}_1\) when it is optimizing an upper bound of \(\hat{R}_2\). From the figure, we observe that this is because \(\hat{R}_2\) and \(\hat{R}_1\) converge over boosting iterations. This happens because they differ only on ties. Recall that the datasets only have critical pairs and as an ensemble grows, we may expect the number of ties to decrease. Thus, a lower \(\hat{R}_2\) also means a lower \(\hat{R}_1\).

Taken together, these results show that the theoretical results for Rankboost \(+\) also have an impact in practice.

6 Conclusion

In this paper, we addressed a gap in the literature for the Rankboost framework. Prior work proposed two variants: Rb-d, which was theoretically well motivated but did not work well in practice, and Rb-c, which outperformed Rb-d in practice but had limited theoretical support. We proposed Rankboost \(+\), which has good theoretical support and outperforms Rb-d and Rb-c in practice. Further, the theory behind the approach helps explain why Rb-d underperforms in practice and why Rb-c is more competitive. We also clarified some assumptions made in previous theoretical results for Rankboost. Finally, we demonstrated empirically that the theoretical conclusions carry over to improvements on real ranking tasks. In future work, we plan to look at the bipartite version of Rankboost, as well as study the generalization properties of Rankboost \(+\) theoretically.