1 Introduction

This paper addresses the problem of estimating the performance of a given ranking function in terms of graded relevance measures such as Discounted Cumulative Gain (DCG) (Järvelin and Kekäläinen 2002) and Expected Reciprocal Rank (ERR) (Chapelle et al. 2009). In information retrieval domains, ranking models often cannot be evaluated on held-out training data. For example, older training data might not represent the distribution of queries the model is currently exposed to, or the ranking model might be procured from a third party that does not provide any training data.

In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We study the problem of estimating ranking performance as accurately as possible on a fixed budget for labeling item relevance, or, equivalently, minimizing labeling costs for a given level of estimation accuracy. We also study the related problem of cost-efficiently comparing the ranking performance of two models; this is required, for instance, to evaluate the result of an index update.

We assume that drawing unlabeled queries \(x\sim p(x)\) from the distribution of queries that the model is exposed to is inexpensive, whereas obtaining relevance labels is costly. The standard approach to estimating ranking performance is to draw a sample of test queries from p(x), obtain relevance labels, and compute the empirical performance. However, recent results on active risk estimation (Sawade et al. 2010) and active comparison (Sawade et al. 2012) indicate that estimation accuracy can be improved by drawing test examples from an appropriately engineered instrumental distribution q(x) rather than p(x), and correcting for the discrepancy between p and q by importance weighting. Motivated by these results, we study active estimation processes for ranking performance. In analogy to active learning, they select specific queries from a pool of unlabeled test queries, obtain relevance labels for all items returned for these queries by the ranking function to be evaluated, and then compute appropriate performance estimates based on the observed item relevance. Test queries are selected according to an instrumental distribution q(x). The actively selected sample is weighted to compensate for the discrepancy between the instrumental and test distributions, which leads to a consistent—that is, asymptotically unbiased—performance estimate.

We derive instrumental distributions that asymptotically minimize the estimation error for either evaluating a single ranking function or comparing two given ranking functions. In ranking problems, the labeling costs for a query depend on the number of items returned by the ranking function as well as on item-specific attributes such as document length; this is in contrast to the active estimation settings discussed previously (Sawade et al. 2010, 2012). We show how these costs can be taken into account when deriving optimal instrumental distributions. Moreover, in the ranking setting a naïve computation of the optimal instrumental distributions requires time exponential in the number of ranked items; a central contribution of this paper is to show how these distributions can be computed in polynomial time using dynamic programming.

An initial version of this paper appeared at the 2012 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Sawade et al. 2012). The current paper extends the earlier conference version with more detailed proofs for the central theorems, additional empirical results, and a more self-contained presentation. We now briefly summarize the extensions made with respect to the earlier conference version.

A central theoretical contribution of the paper is the method for efficient computation of optimal sampling distributions (Theorem 3). There are four different instantiations of this result: estimation of the absolute performance of a single ranking function, and comparison of the relative performance of ranking functions, in combination with the performance measures discounted cumulative gain (DCG) and expected reciprocal rank (ERR). The conference version of the paper only contained the proof for one of the four cases (absolute estimation, ERR). The current paper contains detailed proofs for all four cases.

While the conference paper derived active performance estimation techniques for the ranking performance measures ERR and DCG, only results for the ERR measure were included in the empirical study. In the current paper, we also include comprehensive empirical results for the DCG measure. In the empirical study, we have also added a visualization of the sampling distribution employed by our method as a function of sample cost and intrinsic expectation of ranking performance.

The rest of the paper is organized as follows. Section 2 details the problem setting. Section 3 derives cost-optimal sampling distributions for the estimation of DCG and ERR. Section 4 discusses empirical sampling distributions in a pool-based setting. Section 5 presents empirical results. Section 6 discusses related work; Sect. 7 concludes.

2 Problem setting

Let \(\mathcal{X}\) denote a space of queries, and \(\mathcal{Z}\) denote a finite space of items. We study ranking functions

$$ {\mathbf{r}}: x \mapsto \bigl(r_1(x),\dots,r_{|{\mathbf{r}}(x)|}(x)\bigr) $$

that, given a query \(x\in \mathcal{X}\), return a list of |r(x)| items \(r_{i}(x)\in\mathcal{Z}\) ordered by relevance. The number of items in a ranking r(x) can vary depending on the query and application domain, from thousands (web search) to ten or fewer (mobile applications that have to present results on a small screen). Ranking performance of r is defined in terms of graded relevance labels \(y_{z}\in \mathcal{Y}\) that represent the relevance of an item \(z \in \mathcal{Z}\) for the query x, where \(\mathcal{Y}\subset \mathbb{R}\) is a finite space of relevance labels with minimum zero (irrelevant) and maximum \(y_{max}\) (perfectly relevant). We summarize the graded relevance of all \(z \in \mathcal{Z}\) in a label vector \({\mathbf{y}}\in \mathcal{Y}^{\mathcal{Z}}\) with components \(y_{z}\) for \(z \in \mathcal{Z}\).

In order to evaluate the quality of a ranking r(x) for a single query x, we employ two commonly used ranking performance measures: Discounted Cumulative Gain (DCG), given by

$$ \everymath{\displaystyle} \begin{array}{rll} L_{dcg}\!\bigl ({\mathbf{r}}(x),{\mathbf{y}}\bigr ) &=& \sum_{i=1}^{|{\mathbf{r}}(x)|} \ell_{dcg}(y_{r_i(x)},i) \\[12pt] \ell_{dcg}(y,i) &=& \frac{2^{y}-1}{\log_2(i+1)}, \end{array} $$
(1)

and Expected Reciprocal Rank (ERR), given by

$$ \everymath{\displaystyle} \begin{array}{rll} L_{err}\!\bigl ({\mathbf{r}}(x),{\mathbf{y}}\bigr ) &=& \sum_{i=1}^{|{\mathbf{r}}(x)|}\frac{1}{i} \ell_{err}(y_{r_i(x)}) \prod_{l=1}^{i-1}\bigl(1-\ell_{err}(y_{r_l(x)})\bigr) \\[12pt] \ell_{err}(y) &=& \frac{2^{y}-1}{2^{y_{max}}} \end{array} $$
(2)

as introduced by Järvelin and Kekäläinen (2002) and Chapelle et al. (2009), respectively.

DCG scores a ranking by summing over the relevance of all documents discounted by their position in the ranking. ERR is based on a probabilistic user model: the user scans a list of documents in the order defined by r(x) and chooses the first document that appears sufficiently relevant; the likelihood of choosing a document z is a function of its graded relevance score \(y_z\). If s denotes the position of the chosen document in r(x), then \(L_{err}({\mathbf{r}}(x),{\mathbf{y}})\) is the expectation of the reciprocal rank 1/s under the probabilistic user model. Both DCG and ERR discount relevance with ranking position; ranking quality is thus most strongly influenced by documents that are ranked highly. If r(x) includes many items, \(L_{dcg}\) and \(L_{err}\) are in practice often approximated by labeling items only up to a certain position in the ranking or a certain relevance threshold and ignoring the contribution of lower-ranked items.
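For concreteness, the two measures can be computed for a single labeled query as in the following sketch (NumPy assumed; function names and the relevance scale are illustrative choices mirroring Eqs. (1) and (2)):

```python
import numpy as np

def dcg(labels):
    """L_dcg of Eq. (1); labels[i] is the graded relevance of the item at rank i+1."""
    labels = np.asarray(labels, dtype=float)
    ranks = np.arange(1, len(labels) + 1)
    return float(np.sum((2.0 ** labels - 1.0) / np.log2(ranks + 1)))

def err(labels, y_max=4):
    """L_err of Eq. (2); expected reciprocal rank under the cascade user model."""
    stop = (2.0 ** np.asarray(labels, dtype=float) - 1.0) / 2.0 ** y_max
    value, not_stopped = 0.0, 1.0
    for i, s in enumerate(stop, start=1):
        value += not_stopped * s / i      # user stops at rank i with probability not_stopped * s
        not_stopped *= 1.0 - s
    return value

# Example: relevance grades 3, 0, 2 at ranks 1-3 give DCG = 8.5 and ERR of about 0.47.
print(dcg([3, 0, 2]), err([3, 0, 2]))
```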

Let p(x,y)=p(x)p(y|x) denote the joint distribution over queries \(x \in \mathcal{X}\) and label vectors \({\mathbf{y}}\in \mathcal{Y}^{\mathcal{Z}}\) that the model is exposed to. We assume that the individual relevance labels \(y_{z}\) for items z are drawn independently given a query x:

$$ p({\mathbf{y}}|x) = \prod_{z\in\mathcal{Z}} p(y_z|x,z). $$
(3)

This assumption is common in pointwise ranking approaches, e.g., regression-based ranking models (Cossock and Zhang, 2008; Mohan et al., 2011). The ranking performance of r with respect to p(x,y) is given by

$$ R[{\mathbf{r}}] = \int\!\!\int L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr) p(x,{\mathbf{y}})\mathrm {d}x \mathrm {d}{\mathbf{y}}\!, $$
(4)

where \(L\in\{L_{dcg},L_{err}\}\) denotes the performance measure under study. We use integrals for notational convenience; for discrete spaces, the corresponding integral is replaced by a sum. If the context is clear, we refer to R[r] simply by R.

Since p(x,y) is unknown, ranking performance is typically approximated by an empirical average

$$ \hat {R}_{n}[{\mathbf{r}}] = \frac{1}{n} \sum_{j=1}^n L\bigl({\mathbf{r}}(x_j),{\mathbf{y}}_j\bigr), $$
(5)

where test queries \(x_{1},\dots,x_{n}\) and graded relevance labels \({\mathbf{y}}_{1},\dots,{\mathbf{y}}_{n}\) are drawn iid from p(x,y). The empirical performance \(\hat {R}_{n}\) consistently estimates the true ranking performance; that is, \(\hat {R}_{n}\) converges to R as n→∞.

Test queries \(x_{i}\) need not be drawn according to the input distribution p. When instances are drawn according to an instrumental distribution q, an estimator can be defined as

$$ \hat{R}_{n,q}[{\mathbf{r}}] = \Biggl(\sum_{j=1}^n \frac{p(x_j)}{q(x_j)}\Biggr)^{-1} \sum_{j=1}^n \frac{p(x_j)}{q(x_j)} L\bigl({\mathbf{r}}(x_j),{\mathbf{y}}_j\bigr), $$
(6)

where \((x_{j},{\mathbf{y}}_{j})\) are drawn from q(x)p(y|x) and again \(L\in\{L_{dcg},L_{err}\}\). The weighting factors \(\frac{p(x_{j})}{q(x_{j})}\) compensate for the discrepancy between test and instrumental distributions, and the normalizer is the sum of the weights. Because of the weighting factors, Eq. (6) defines a consistent estimator (see, e.g., Liu, 2001, p. 35). Note that Eq. (5) is a special case of Eq. (6), using the instrumental distribution q=p.
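A minimal sketch of this self-normalized estimator (function and argument names are our own):

```python
import numpy as np

def weighted_estimate(losses, p_x, q_x):
    """Consistent estimate of Eq. (6): losses[j] = L(r(x_j), y_j) for queries sampled
    from the instrumental distribution q; p_x[j] and q_x[j] are p(x_j) and q(x_j)."""
    w = np.asarray(p_x, dtype=float) / np.asarray(q_x, dtype=float)
    return float(np.sum(w * np.asarray(losses, dtype=float)) / np.sum(w))

# With q = p, all weights equal one and Eq. (6) reduces to the empirical average of Eq. (5).
```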

For certain choices of the sampling distribution q, \(\hat{R}_{n,q}\) may be a more label-efficient estimator of the true performance R than \(\hat {R}_{n}\) (Sawade et al. 2010).

A crucial feature of ranking domains is that labeling costs for queries \(x\in \mathcal{X}\) vary with the number of items |r(x)| returned and item-specific features such as the length of a document whose relevance has to be determined. We denote labeling costs for a query x by λ(x), and assume that λ(x) is bounded away from zero by λ(x)≥ϵ>0. Our goal is to minimize the deviation of \(\hat{R}_{n,q}\) from R under the constraint that expected overall labeling costs stay below a budget \({{\varLambda}} \in \mathbb{R}\):

$$ \bigl(q^*,n^*\bigr) = \mathop{\mathrm{arg\,min}}\limits_{q,n} \mathbb{E} \bigl[(\hat{R}_{n,q}-R)^2\bigr],\quad \hbox{s.t.}\ \mathbb{E} \Biggl[\sum_{j=1}^n \lambda(x_j)\Biggr]\leq {\varLambda}. $$
(7)

Note that Eq. (7) represents a trade-off between labeling costs and informativeness of a test query: optimization over n implies that many inexpensive or few expensive queries could be chosen.

To estimate the relative performance of two ranking functions \({\mathbf{r}}_{1}\) and \({\mathbf{r}}_{2}\), Eq. (7) can be replaced by

$$ \bigl(q^*,n^*\bigr) = \mathop{\mathrm{arg\,min}}\limits_{q,n} \mathbb{E} \bigl[(\hat {{\varDelta }}_{n,q}-{\varDelta })^2\bigr],\quad \hbox{s.t.}\ \mathbb{E} \Biggl[ \sum_{j=1}^n \lambda(x_j)\Biggr]\leq {\varLambda}, $$
(8)

where \(\hat {{\varDelta }}_{n,q}= \hat{R}_{n,q}[{\mathbf{r}}_{1}] - \hat{R}_{n,q}[{\mathbf{r}}_{2}]\) and \({\varDelta }=R[{\mathbf{r}}_{1}]-R[{\mathbf{r}}_{2}]\). In the next section, we derive sampling distributions q asymptotically solving Eqs. (7) and (8).

3 Asymptotically optimal sampling

A bias-variance decomposition (Geman et al. 1992) applied to Eq. (7) results in

$$ \everymath{\displaystyle} \begin{array}{rlll} \mathbb{E} \bigl[(\hat{R}_{n,q}-R)^2\bigr] &=& \mathbb{E} \bigl[\hat{R}_{n,q}^2\bigr] - 2\,\mathbb{E} [\hat{R}_{n,q}]R + R^2 + \mathbb{E} [\hat{R}_{n,q}]^2 - \mathbb{E} [\hat{R}_{n,q}]^2 &\quad (9)\\[8pt] &=& \bigl(\mathbb{E} [\hat{R}_{n,q}]-R\bigr)^2 + \mathbb{E} \bigl[\hat{R}_{n,q}^2\bigr] - \mathbb{E} [\hat{R}_{n,q}]^2 &\quad (10)\\[8pt] &=& \bigl(\mathbb{E} [\hat{R}_{n,q}]-R\bigr)^2 + \mathrm{Var}\bigl[\hat{R}_{n,q}\bigr] &\quad (11) \end{array} $$

In Eq. (9), we expand the square, and add and subtract the expected value of \(\hat{R}_{n,q}\). Reordering terms and factorizing yield Eq. (10). Equation (11) expresses the expected squared deviation of the estimation as a sum of a squared bias and a variance term.

Because \(\hat{R}_{n,q}\) is consistent, \(\mathbb{E}[(\hat{R}_{n,q}-R)^{2}]\) vanishes as n→∞. According to Liu (2001), Chap. 2.5.3, the squared bias term is of order \(\frac{1}{n^{2}}\), while the variance is of order \(\frac{1}{n}\). For large n, the expected deviation is thus dominated by the variance, and \(\sigma_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat{R}_{n,q}]\) exists. We can therefore approximate

$$\mathbb{E} \bigl[(\hat{R}_{n,q}-R)^2\bigr] \approx \frac{1}{n} \sigma^2_{q};\qquad \mathbb{E} \bigl[(\hat {{\varDelta }}_{n,q}-{\varDelta })^2\bigr] \approx \frac{1}{n}\tau^2_{q}, $$

where \(\tau_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat {{\varDelta }}_{n,q}]\).

Let \(\delta(x,{\mathbf{y}})=L({\mathbf{r}}_{1}(x),{\mathbf{y}})-L({\mathbf{r}}_{2}(x),{\mathbf{y}})\) denote the performance difference of the two ranking models for a test query (x,y) and \(L\in\{L_{dcg},L_{err}\}\). The following theorems derive sampling distributions minimizing the quantities \(\frac{1}{n}\sigma^{2}_{q}\) and \(\frac{1}{n}\tau^{2}_{q}\), thereby approximately solving Problems (7) and (8).

Theorem 1

(Optimal sampling for evaluation of a ranking function)

Let \(L\in\{L_{dcg},L_{err}\}\) and \(\sigma_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat{R}_{n,q}]\). The optimization problem

$$\bigl(q^*, n^*\bigr) = \mathop{\mathrm{arg\,min}}\limits_{q,n} \frac{1}{n} \sigma_q^2\quad \hbox{\textit{s.t.}}\ \mathbb{E} \Biggl[ \sum_{j=1}^n \lambda(x_j)\Biggr] \leq {{\varLambda}} $$

is solved by

$$ q^*(x)\propto \frac{p(x)}{\sqrt{\lambda(x)}} \sqrt{\int \bigl(L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr)-R\bigr)^2p({\mathbf{y}}| x)\mathrm {d}{\mathbf{y}}},\qquad n^* = \frac{{{\varLambda}}}{\int \lambda(x)q(x)\mathrm {d}x}. $$
(12)

Theorem 2

(Optimal sampling for comparison of ranking functions)

Let \(L\in\{L_{dcg},L_{err}\}\) and \(\tau_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat {{\varDelta }}_{n,q}]\). The optimization problem

$$\bigl(q^*, n^*\bigr) = \mathop{\mathrm{arg\,min}}\limits_{q,n} \frac{1}{n} \tau_q^2\quad \hbox{\textit{s.t.}}\ \mathbb{E} \Biggl[ \sum_{j=1}^n \lambda(x_j)\Biggr]\leq {\varLambda} $$

is solved by

$$ q^*(x)\propto \frac{p(x)}{\sqrt{\lambda(x)}} \sqrt{\int \bigl(\delta(x,{\mathbf{y}})-{\varDelta }\bigr)^2p({\mathbf{y}}| x)\mathrm {d}{\mathbf{y}}},\qquad n^* = \frac{{{\varLambda}}}{\int \lambda(x)q(x)\mathrm {d}x}. $$
(13)
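In a pool-based setting (detailed in Sect. 4), the quantities prescribed by Theorems 1 and 2 can be obtained by normalizing per-query values of the square-root expressions in Eqs. (12) and (13) over a finite pool, with the integral over λ(x)q(x) replaced by a sum; a small sketch, with illustrative names:

```python
import numpy as np

def normalize_and_budget(unnormalized_q, costs, budget):
    """Normalize per-query values of the optimal density (Eqs. (12)/(13)) over a
    finite pool and compute n* = Lambda / integral lambda(x) q*(x) dx."""
    q = np.asarray(unnormalized_q, dtype=float)
    q = q / q.sum()                                        # q*(x), a distribution over the pool
    n_star = budget / float(np.dot(np.asarray(costs, dtype=float), q))
    return q, n_star                                       # expected cost per draw: sum_x lambda(x) q*(x)
```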

Before we prove Theorems 1 and 2, we state the following two lemmata:

Lemma 1

(Asymptotic variance)

Let \(\hat{R}_{n,q}\), \(\hat{{\varDelta}}_{n,q}\), R, and Δ be defined as above. Then

$$ \sigma^2_{q} = \int\!\!\int \frac{p^2(x)}{q^2(x)} \bigl(L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr)-R\bigr)^2 p({\mathbf{y}}|x)q(x)\mathrm {d}x \mathrm {d}{\mathbf{y}} $$
(14)
$$ \tau^2_{q} = \int\!\!\int \frac{p^2(x)}{q^2(x)} \bigl(\delta(x,{\mathbf{y}})-{\varDelta}\bigr)^2 p({\mathbf{y}}|x)q(x)\mathrm {d}x \mathrm {d}{\mathbf{y}} $$
(15)

for \(L\in\{L_{dcg},L_{err}\}\).

The proof of Lemma 1 closely follows results from Sawade et al. (2010, 2012) and is included in the Appendix.

Lemma 2

(Maximizing distribution)

Let \(a: \mathcal{X} \rightarrow \mathbb{R}\) and \(\lambda: \mathcal{X} \rightarrow \mathbb{R}\) denote functions on the query space such that \(\int \sqrt{a(x)} \mathrm {d}x\) exists and λ(x)≥ϵ>0. The functional

$$G[q] = \biggl(\int \frac{a(x)}{q(x)} \mathrm {d}x\biggr) \biggl(\int \lambda(x) q(x) \mathrm {d}x\biggr), $$

where q(x) is a distribution over the query space \(\mathcal{X}\), is minimized over q by setting

$$q(x) \propto \sqrt{\frac{a(x)}{\lambda(x)}}. $$
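Intuitively, Lemma 2 is an instance of the Cauchy–Schwarz inequality; as a brief sketch, note that

$$ \biggl(\int \sqrt{a(x)\lambda(x)}\,\mathrm {d}x\biggr)^{2} = \biggl(\int \sqrt{\frac{a(x)}{q(x)}}\,\sqrt{\lambda(x)q(x)}\,\mathrm {d}x\biggr)^{2} \leq \biggl(\int \frac{a(x)}{q(x)}\,\mathrm {d}x\biggr) \biggl(\int \lambda(x)q(x)\,\mathrm {d}x\biggr) = G[q], $$

where the left-hand side does not depend on q, and equality holds exactly when \(\sqrt{a(x)/q(x)}\propto \sqrt{\lambda(x)q(x)}\), that is, for \(q(x)\propto\sqrt{a(x)/\lambda(x)}\).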

A proof of Lemma 2 is included in the Appendix. We now prove Theorems 1 and 2.

Proof of Theorems 1 and 2

We first study the minimization of \(\frac{1}{n}\sigma^{2}_{q}\) in Theorem 1. Since

$$\mathbb{E} \Biggl[\sum_{j=1}^n \lambda(x_j)\Biggr] = n \int \lambda(x)q(x)\mathrm {d}x, $$

the minimization problem can be reformulated as

$$\min_{q} \min_n \frac{1}{n} \sigma^2_{q}\quad \hbox{s.t.}\ n \leq \frac{{{\varLambda}}}{\int \lambda(x)q(x)\mathrm {d}x}. $$

Clearly \(n^* ={{\varLambda}}/\int \lambda(x)q(x)\mathrm {d}x\) solves the inner optimization. Since Λ is a constant, the remaining minimization over q is

$$q^* = \mathop{\mathrm{arg\,min}}\limits_{q} \sigma_q^2 \int \lambda(x)q(x)\mathrm {d}x. $$

Equation (14) in Lemma 1 implies

$$\sigma^2_{q} = \int\!\!\int \frac{p^2(x)}{q^2(x)} \bigl(L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr)-R\bigr)^2 p({\mathbf{y}}|x)q(x)\mathrm {d}x \mathrm {d}{\mathbf{y}}. $$

Setting \(a(x)=p^{2}(x)\int \bigl(L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr)-R\bigr)^2 p({\mathbf{y}}|x)\mathrm {d}{\mathbf{y}}\) and applying Lemma 2 implies Eq. (12).

For the minimization of \(\frac{1}{n}\tau^{2}_{q}\) in Theorem 2 we analogously derive

$$q^* = \mathop{\mathrm{arg\,min}}\limits_{q} \tau_q^2 \int \lambda(x)q(x)\mathrm {d}x. $$

Equation (15) in Lemma 1 implies

$$\tau^2_{q} = \int\!\!\int \frac{p^2(x)}{q^2(x)} \bigl(\delta(x,{\mathbf{y}})- {\varDelta }\bigr)^2p({\mathbf{y}}|x)q(x) \mathrm {d}x \mathrm {d}{\mathbf{y}}. $$

Setting \(a(x)=p^{2}(x) \int \bigl(\delta(x,{\mathbf{y}})-{\varDelta }\bigr)^{2} p({\mathbf{y}}| x) \mathrm {d}{\mathbf{y}}\) and applying Lemma 2 implies Eq. (13). □

4 Empirical sampling distribution

The sampling distributions prescribed by Theorems 1 and 2 depend on the unknown test distribution p(x). We now turn towards a setting in which a pool D of m unlabeled queries is available. Queries from this pool can be sampled and then labeled at a cost. Drawing queries from the pool replaces generating them under the test distribution; that is, \(p(x) = \frac{1}{m}\) for all \(x\in D\).

The optimal sampling distribution also depends on the true conditional distribution \(p({\mathbf{y}}|x)=\prod_{z\in\mathcal{Z}}p(y_{z}|x,z)\) (Eq. (3)). To implement the method, we approximate \(p(y_{z}|x,z)\) by a model \(p(y_{z}|x,z;\theta)\) of graded relevance. For the large class of pointwise ranking methods—that is, methods that produce a ranking by predicting graded relevance scores for query-document pairs and then sorting documents according to their score—such a model can typically be derived from the graded relevance predictor. Finally, the sampling distributions depend on the true performance R[r] as given by Eq. (4), or \({\varDelta}=R[{\mathbf{r}}_{1}]-R[{\mathbf{r}}_{2}]\). R[r] is replaced by an introspective performance \(R_{\theta}[{\mathbf{r}}]\) calculated from Eq. (4), where the integral over \(\mathcal{X}\) is replaced by a sum over the pool, \(p(x) = \frac{1}{m}\), and \(p({\mathbf{y}}|x) = \prod_{z \in \mathcal{Z}} p(y_{z}|x,z;\theta)\). Note that because the integral over \(\mathcal{X}\) in Eq. (4) is replaced by a sum over a pool of unlabeled data D, the empirical sampling distribution is chosen such that we accurately estimate ranking performance over this pool rather than over the entire instance space \(\mathcal{X}\). However, assuming that a large enough pool of unlabeled data is available, ranking performance over the pool D will closely approximate ranking performance over \(\mathcal{X}\).

The performance difference Δ is approximated by \({\varDelta}_{\theta}=R_{\theta}[{\mathbf{r}}_{1}]-R_{\theta}[{\mathbf{r}}_{2}]\). Note that as long as p(x)>0 implies q(x)>0, the weighting factors ensure that such approximations do not introduce an asymptotic bias in our estimator (Eq. (6)).

With these approximations, we arrive at the following empirical sampling distributions.

Corollary 1

When relevance labels for individual items are independent given the query (Eq. (3)), and \(p(y_{z}|x,z)\) is approximated by a model \(p(y_{z}|x,z;\theta)\) of graded relevance, the sampling distributions minimizing \(\frac{1}{n}\sigma^{2}_{q}\) and \(\frac{1}{n}\tau^{2}_{q}\) in a pool-based setting resolve to

$$ q^*(x)\propto \frac{1}{\sqrt{\lambda(x)}} \sqrt{\mathbb{E} \bigl[ \bigl(L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr)-R_{\theta}\bigr)^2 | x; \theta\bigr]} $$
(16)

and

$$ q^*(x)\propto \frac{1}{\sqrt{\lambda(x)}} \sqrt{\mathbb{E} \bigl[\bigl(\delta(x,{\mathbf{y}})-{\varDelta}_{\theta}\bigr)^2|x;\theta\bigr]}, $$
(17)

respectively. Here, for any function g(x,y) of a query x and label vector y,

$$ \mathbb{E} \bigl[g(x,{\mathbf{y}})|x;\theta\bigr] = \sum_{{\mathbf{y}}\in\mathcal{Y}^{\mathcal{Z}}} g(x,{\mathbf{y}}) \prod_{z\in\mathcal{Z}} p(y_z|x,z;\theta) $$
(18)

denotes the expectation of g(x,y) with respect to label vectors y generated according to \(p(y_z|x,z;\theta)\).

The corollary directly follows from inserting \(p(x) = \frac{1}{m}\) and the approximation \(p({\mathbf{y}}|x) \approx \prod_{z \in \mathcal{Z}} p(y_{z}|x,z;\theta)\) into Eqs. (12) and (13).

We observe that for the evaluation of a single given ranking function r (Eq. (16)), the empirical sampling distribution gives preference to queries x with low costs λ(x) and for which the expected ranking performance deviates strongly from the average expected ranking performance \(R_{\theta}\); the expectation is taken with respect to the available graded relevance model θ. For the comparison of two given ranking functions \({\mathbf{r}}_{1}\) and \({\mathbf{r}}_{2}\) (Eq. (17)), preference is given to queries x with low costs and for which the difference in performance \(L({\mathbf{r}}_{1}(x),{\mathbf{y}})-L({\mathbf{r}}_{2}(x),{\mathbf{y}})\) is expected to deviate strongly from \({\varDelta}_{\theta}\) (note that \({\varDelta}_{\theta}\) will typically be small).
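The following sketch evaluates the right-hand side of Eq. (16) for a single query by sampling label vectors from the graded relevance model; this Monte Carlo stand-in is only for illustration, since the expectation can in fact be computed exactly (Theorem 3 below). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_weight(cost, item_label_probs, loss_fn, R_theta, n_samples=1000):
    """Unnormalized q*(x) of Eq. (16), approximated by Monte Carlo.

    cost             : labeling cost lambda(x) of the query
    item_label_probs : array of shape (|r(x)|, |Y|); row i is p(y | x, r_i(x); theta)
    loss_fn          : per-query performance L, e.g. the dcg() or err() sketch above
    R_theta          : introspective overall performance of the ranking function
    """
    grades = np.arange(item_label_probs.shape[1])            # relevance grades 0..y_max
    sq_dev = np.empty(n_samples)
    for s in range(n_samples):
        y = [rng.choice(grades, p=row) for row in item_label_probs]  # Eq. (3): independent labels
        sq_dev[s] = (loss_fn(y) - R_theta) ** 2
    return np.sqrt(sq_dev.mean()) / np.sqrt(cost)

# Normalizing these weights over all queries in the pool D yields the distribution q*.
```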

Computation of the empirical sampling distributions given by Eqs. (16) and (17) requires the computation of \(\mathbb{E}[g(x,{\mathbf{y}})|x;\theta]\), which is defined in terms of a sum over exponentially many relevance label vectors \({\mathbf{y}}\in \mathcal{Y}^{\mathcal{Z}}\). The following theorem states that the empirical sampling distributions can nevertheless be computed in polynomial time:

Theorem 3

(Polynomial-time computation of sampling distributions)

The sampling distribution given by Eq. (16) can be computed in time

$$\mathcal{O} \Bigl(|\mathcal{Y}||D|\max_x \bigl|{\mathbf{r}}(x)\bigr|\Bigr)\quad \hbox{\textit{for}}\ L\in \{L_{dcg}, L_{err}\}. $$

The sampling distribution given by Eq. (17) can be computed in time

A detailed proof of Theorem 3 is given in the Appendix. The general idea behind the proof is to compute the required quantities using dynamic programming. Specifically, after substituting Eqs. (1) and (2) into Eqs. (16) and (17) and exploiting the independence assumption given by Eq. (3), Eqs. (16) and (17) can be decomposed into cumulative sums and products of expectations over individual item labels \(y\in \mathcal{Y}\). These sums and products can be computed in polynomial time.
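To make the idea concrete for the DCG case of Eq. (16): because Eq. (1) is a sum of per-position terms and labels are independent given the query (Eq. (3)), the mean and variance of \(L_{dcg}\) decompose over positions, so the required expectation can be computed without enumerating label vectors. The following sketch covers only this case (the ERR and comparison cases require the additional recurrences developed in the Appendix):

```python
import numpy as np

def expected_squared_deviation_dcg(item_label_probs, R_theta):
    """E[(L_dcg(r(x), y) - R_theta)^2 | x; theta] for a single query.

    item_label_probs : array of shape (|r(x)|, |Y|); row i is p(y | x, r_i(x); theta)

    Mean and variance of L_dcg decompose over ranking positions, so the quantity is
    computed in O(|r(x)| * |Y|) instead of summing over |Y|^{|r(x)|} label vectors.
    """
    probs = np.asarray(item_label_probs, dtype=float)
    grades = np.arange(probs.shape[1])
    ranks = np.arange(1, probs.shape[0] + 1)
    # per-position discounted gains l_dcg(y, i) for every possible grade y
    gains = (2.0 ** grades - 1.0)[None, :] / np.log2(ranks + 1)[:, None]
    mean_i = np.sum(probs * gains, axis=1)                    # E[l_dcg(y_{r_i(x)}, i)]
    var_i = np.sum(probs * gains ** 2, axis=1) - mean_i ** 2  # Var[l_dcg(y_{r_i(x)}, i)]
    return var_i.sum() + (mean_i.sum() - R_theta) ** 2        # Var[L] + (E[L] - R_theta)^2
```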

Algorithm 1 summarizes the active estimation algorithm. It samples queries \(x_{1},\dots,x_{n}\) with replacement from the pool according to the distribution prescribed by Corollary 1 and obtains relevance labels from a human labeler for all items included in \({\mathbf{r}}(x_{i})\) or \({\mathbf{r}}_{1}(x_{i})\cup{\mathbf{r}}_{2}(x_{i})\) until the labeling budget Λ is exhausted. Note that queries can be drawn more than once; in the special case that the labeling process is deterministic, recurring labels can be looked up rather than queried from the labeling oracle repeatedly. Hence, the actual labeling costs may stay below \(\sum_{j=1}^{n} \lambda(x_{j})\); in this case, the loop continues until the labeling budget Λ is exhausted.

Algorithm 1 Active estimation of ranking performance
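A rough sketch of the main loop for absolute estimation is given below; the caching refinement for a deterministic labeling process is only indicated in a comment, and all names are illustrative rather than part of the algorithm as stated.

```python
import numpy as np

def active_estimate(pool, q_star, costs, budget, label_oracle, loss_fn, seed=0):
    """Sketch of Algorithm 1: sample queries from q*, label them, and return the
    importance-weighted estimate of Eq. (6) with p(x) = 1/m over the pool D.

    label_oracle(x) returns the relevance labels of the items in r(x), in ranking
    order, so that loss_fn can be the dcg or err sketch given earlier.
    """
    rng = np.random.default_rng(seed)
    m = len(pool)
    weights, losses, spent = [], [], 0.0
    while True:
        j = int(rng.choice(m, p=q_star))       # draw a query with replacement according to q*
        if spent + costs[j] > budget:          # labeling budget Lambda exhausted
            break
        spent += costs[j]
        # If the labeling process is deterministic, labels of recurring queries could be
        # looked up in a cache instead, so that actual costs stay below the charged amount.
        y = label_oracle(pool[j])
        weights.append((1.0 / m) / q_star[j])  # importance weight p(x_j) / q(x_j)
        losses.append(loss_fn(y))
    w = np.asarray(weights)
    return float(np.sum(w * np.asarray(losses)) / np.sum(w))
```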

5 Empirical studies

We compare active estimation of ranking performance (Algorithm 1, labeled active) to estimation based on a test sample drawn uniformly from the pool (Eq. (5), labeled passive). Algorithm 1 requires a model \(p(y_{z}|x,z;\theta)\) of graded relevance in order to compute the sampling distribution q from Corollary 1. If no such model is available, a uniform distribution \(p(y_{z} | x,z;\theta) = \frac{1}{|\mathcal{Y}|}\) can be used instead (labeled \(\mathrm{active}_{uniD}\)). To quantify the effect of modeling costs, we also study a variant of Algorithm 1 that assumes uniform costs λ(x)=1 in Eqs. (16) and (17) (labeled \(\mathrm{active}_{uniC}\)). This variant implements active risk estimation (Sawade et al. 2010) and active comparison (Sawade et al. 2012) for ranking; we have shown how the resulting sampling distributions can be computed in polynomial time (Corollary 1 and Theorem 3).

Experiments are performed on the Microsoft Learning to Rank data set MSLR-WEB30k (Microsoft Research 2010). It contains 31,531 queries, and a set of documents for each query whose relevance for the query has been determined by human labelers in the process of developing the Bing search engine. The resulting 3,771,125 query-document pairs are represented by 136 features widely used in the information retrieval community (such as query term statistics, page rank, and click counts). Relevance labels take values from 0 (irrelevant) to 4 (perfectly relevant).

The data are split into five folds. On one fold, we train ranking functions using different graded relevance models (details below). The remaining four folds serve as a pool of unlabeled test queries; we estimate (Sect. 5.1) or compare (Sect. 5.2) the performance of the ranking functions by drawing and labeling queries from this pool according to Algorithm 1 and the baselines discussed above. Test queries are drawn until a labeling budget Λ is exhausted. To quantify the human effort realistically, we model the labeling costs λ(x) for a query x as proportional to a sum of costs incurred for labeling the individual documents \(z\in{\mathbf{r}}(x)\); labeling costs for a single document z are assumed to be logarithmic in the document length.

All evaluation techniques, both active and passive, can approximate \(L_{dcg}\) and \(L_{err}\) for a query x by requesting labels only for the first k documents in the ranking. The number of documents for which the MSLR-WEB30k data set provides labels varies over the queries, with an average of 119 documents per query. In our experiments, we use all documents for which labels are provided, for each query and for all evaluation methods under investigation. Figure 1 (left) shows the distribution of the number of documents per query over the entire data set. The cost unit is chosen such that average labeling costs for a query are one. Figure 1 (right) shows the distribution of labeling costs λ(x) of queries over the entire data set. All results are averaged over the five folds and 5,000 repetitions of the evaluation process. Error bars indicate the standard error.

Fig. 1 Distribution of the number of documents per query (left) and the query labeling costs λ(x) (right) in the MSLR-WEB30k data set
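A minimal sketch of this cost model, assuming per-document costs of log2(1 + document length) and rescaling so that the average query cost is one; the base and offset of the logarithm are our own choices for illustration:

```python
import numpy as np

def query_costs(doc_lengths_per_query):
    """doc_lengths_per_query[x] lists the lengths of the documents in r(x) that have to
    be labeled for query x; returns labeling costs lambda(x) with pool average 1."""
    raw = np.array([np.sum(np.log2(1.0 + np.asarray(lengths, dtype=float)))
                    for lengths in doc_lengths_per_query])
    return raw / raw.mean()
```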

5.1 Estimating ranking performance

Based on the outcome of the 2010 Yahoo ranking challenge (Chapelle and Chang, 2011; Mohan et al., 2011), we choose a pointwise ranking approach and employ Random Forest regression (Breiman 2001) to train graded relevance models on query-document pairs. Specifically, we learn Random Forest regression models for the binary problems \(\{y_z \leq k\}\) versus \(\{y_z > k\}\) for relevance labels k∈{1,…,4}. The regression values can be interpreted as probabilities \(p(y_z \leq k|x,z;\theta)\) (Li et al., 2007, Sect. 4). This allows us to derive the probability estimates \(p(y_z =k|x,z;\theta)=p(y_z \leq k|x,z;\theta)-p(y_z \leq k-1|x,z;\theta)\) required by Algorithm 1. The ranking function is obtained by returning all documents associated with a query sorted according to their predicted graded relevance (Li et al., 2007, Sect. 3.3). We denote this model by \(\mathrm{RF}_{class}\). As an alternative graded relevance model, we also study a MAP version of ordered logistic regression (McCullagh 1980). The Ordered Logit model (denoted by Logit) is a modification of the logistic regression model which accounts for the ordering of the relevance labels; the target variable is \(y_z \leq k\) rather than a particular relevance level \(y_z = k\). This model directly provides probability estimates \(p(y_z|x,z;\theta)\). Half of the available training fold is used for model training, the other half is used as a validation set to tune hyperparameters of the respective ranking model.
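As an illustration of this step, the following sketch turns cumulative estimates into grade probabilities; it assumes estimates \(p(y_z \leq k|x,z;\theta)\) are available for all grades below \(y_{max}\), with \(p(y_z \leq y_{max})=1\), and the clipping and monotonization safeguards are our own additions:

```python
import numpy as np

def grade_probabilities(cumulative_probs):
    """Convert cumulative estimates p(y_z <= k | x, z; theta) into grade probabilities
    p(y_z = k | x, z; theta) = p(y_z <= k) - p(y_z <= k-1), as used by Algorithm 1.

    cumulative_probs : array of shape (n_items, y_max); entry [z, k] = p(y_z <= k);
                       p(y_z <= y_max) = 1 is implicit.
    """
    cum = np.clip(np.asarray(cumulative_probs, dtype=float), 0.0, 1.0)
    cum = np.maximum.accumulate(cum, axis=1)                  # enforce monotone cumulative estimates
    cum = np.concatenate([cum, np.ones((cum.shape[0], 1))], axis=1)
    return np.diff(cum, axis=1, prepend=0.0)                  # one column per grade 0..y_max
```

For the Ordered Logit model, \(p(y_z|x,z;\theta)\) is available directly and no such conversion is needed.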

Figure 2 shows the absolute deviation between true and estimated ranking performance as a function of the labeling budget Λ for the performance measures ERR and DCG. True performance is taken to be the performance over all queries in the pool D. We observe that active estimation is more accurate than passive estimation; to reach the accuracy that passive estimation attains at Λ=300, the labeling budget can be reduced by between 10 % and 20 %, depending on the ranking method and performance measure under study. Note that DCG and ERR have different ranges; \(L_{err}\) takes values between zero and one, whereas \(L_{dcg}\) is potentially unbounded and increases with the number of ranked items |r(x)|. For Ordered Logit models, the true performance is between 26.5 and 33.0 for DCG and between 0.31 and 0.34 for ERR. For \(\mathrm{RF}_{class}\) models, true performance is between 27.2 and 33.4 for DCG and between 0.35 and 0.38 for ERR.

Fig. 2 Estimation error over Λ when evaluating Random Forest (top row) and Ordered Logit (bottom row) with performance measure ERR (left) and DCG (right). Error bars indicate the standard error

5.2 Comparing ranking performance

We additionally train a linear Ranking SVM (Herbrich et al. 2000) (denoted by SVM) and a Random Forest regression model (Li et al. 2007; Mohan et al. 2011) directly estimating relevance labels (denoted by \(\mathrm{RF}_{reg}\)), and compare the resulting ranking functions to those of the \(\mathrm{RF}_{class}\) and Logit models. For the comparison of \(\mathrm{RF}_{class}\) vs. Logit, both models provide us with estimates \(p(y_z|x,z;\theta)\); in this case a mixture model is employed as proposed by Sawade et al. (2012). We measure model selection error, defined as the fraction of experiments in which an evaluation method does not correctly identify the model with higher true performance. Figure 3 shows model selection error as a function of the available labeling budget for different pairwise comparisons and the performance measures ERR and DCG. Active estimation more reliably identifies the model with higher ranking performance, saving between 30 % and 55 % of labeling effort compared to passive estimation at Λ=300. We observe that the gains of active over passive estimation are not only due to differences in query costs: the baseline \(\mathrm{active}_{uniC}\), which does not take query costs into account when computing the sampling distribution, performs almost as well as active.

Fig. 3 Model selection error over Λ when comparing Random Forest regression vs. classification (top row), and Ordered Logit vs. Ranking SVM (middle row) or Random Forest classification (bottom row). The performance measure is ERR (left) or DCG (right). Error bars indicate the standard error

As a further comparative evaluation we simulate an index update. An outdated index with lower coverage is simulated by randomly removing 10 % of all query-document pairs from each result list r(x). We use \(\mathrm{RF}_{class}\) as the ranking model. Active and passive estimation methods are applied to estimate the difference in performance between the models based on the outdated and the current index. Figure 4 shows the absolute deviation of the estimated from the true performance difference over the labeling budget Λ for the performance measures ERR and DCG. We observe that active estimation quantifies the impact of the index update more accurately than passive estimation, saving approximately 75 % of labeling effort for the performance measure ERR and about 30 % of labeling effort for DCG at Λ=300.

Fig. 4 Absolute estimation error over labeling costs Λ for a simulated index update affecting 10 % of items for each query. The performance measure is ERR (left) or DCG (right). Error bars indicate the standard error

We finally simulate the incorporation of novel sources of training data by comparing a Random Forest model trained on 100,000 query-document pairs (\({\mathbf{r}}_{1}\)) to a Random Forest model trained on between 120,000 and 200,000 query-document pairs (\({\mathbf{r}}_{2}\)). The difference in performance between \({\mathbf{r}}_{1}\) and \({\mathbf{r}}_{2}\) is estimated using active and passive methods. Figure 5 (top row) shows the absolute deviation of the estimated from the true performance difference for models trained on 100,000 and 200,000 instances as a function of Λ. Active estimation quantifies the performance gain from additional training data more accurately, reducing labeling costs by approximately 45 % (ERR) and 30 % (DCG). Figure 5 (bottom row) shows estimation error as a function of the number of query-document pairs the model \({\mathbf{r}}_{2}\) is trained on, for Λ=100. Active estimation significantly reduces estimation error compared to passive estimation for all training set sizes.

Fig. 5 Absolute estimation error comparing ranking functions trained on 100,000 vs. 200,000 query-document pairs over Λ (top row), and over the training set size of the second model at Λ=100 (bottom row). The performance measure is ERR (left) or DCG (right). Error bars indicate the standard error

5.3 Influence of query costs on the sampling distribution

The empirical sampling distributions prescribed by Corollary 1 select queries \(x\in D\) based on their cost λ(x) and intrinsic expectations of ranking performance given by \(R_{\theta}\) (Eq. (16)) and \({\varDelta}_{\theta}\) (Eq. (17)), respectively. Figure 6 (left) shows a representative example of the sampling distribution \(q^{*}(x)\) given by Eq. (16), plotted into a two-dimensional space with axes λ(x) and the intrinsic expected ranking performance \(\mathbb{E}[L(x,{\mathbf{y}})|x;\theta]\) of the query x. We observe that low-cost queries are preferred over high-cost queries, and that queries for which the intrinsic ranking performance \(\mathbb{E}[L(x,{\mathbf{y}})|x;\theta]\) is far from the intrinsic overall performance \(R_{\theta}\) are more likely to be chosen (in the example, \(R_{\theta}\approx 26\)).

Fig. 6 Heatmap of the sampling distribution \(q^{*}(x)\) over the pool D when evaluating a Random Forest model in terms of DCG, plotted into a two-dimensional space with axes λ(x) and \(\mathbb{E}[L(x,{\mathbf{y}})|x;\theta]\) (left). Number of queries drawn by passive and active estimation methods for a labeling budget of Λ=100 when evaluating a Random Forest model in terms of DCG (right)

Optimization problems (7) and (8) constitute a trade-off between labeling costs and informativeness of a test query: optimization over n implies that many inexpensive or few expensive queries could be chosen. On average, active estimation prefers to draw more (but cheaper) queries than passive estimation (Eqs. (16) and (17)). Figure 6 (right) shows the number of queries actually drawn by the different evaluation methods for a labeling budget of Λ=100 when estimating absolute DCG for \(\mathrm{RF}_{class}\). We observe that the active estimation methods that take costs into account when computing the optimal sampling distribution (active and \(\mathrm{active}_{uniD}\)) draw more instances than passive and \(\mathrm{active}_{uniC}\).

6 Related work

There has been significant interest in learning ranking functions from data in order to improve the relevance of search results (Burges, 2010; Li et al., 2007; Mohan et al., 2011; Zheng et al., 2007). This has partly been driven by the recent release of large-scale datasets derived from commercial search engines, such as the Microsoft Learning to Rank datasets (Microsoft Research 2010) and the Yahoo Learning to Rank Challenge datasets (Chapelle and Chang 2011). These datasets serve as realistic benchmarks for evaluating and comparing the performance of different ranking algorithms.

Most approaches for learning ranking functions employ the pointwise ranking model (e.g., Li et al., 2007; Mohan et al., 2011). In pointwise ranking, the problem of learning rankings is reduced to the problem of predicting graded relevance scores for individual query-document pairs; a ranking is then obtained by sorting documents according to their relevance score for a given query. The performance measures DCG (see Li et al., 2007, Lemma 1) and ERR (see Mohan et al., 2011, Theorem 1) can be upper-bounded by the classification error; these upper bounds can be seen as one justification for pointwise ranking approaches. However, there has also been work on directly maximizing performance measures defined over entire rankings, such as normalized DCG (Valizadegan et al., 2010; Burges, 2010).

To reduce the amount of training data that needs to be relevance-labeled, several approaches for active learning of ranking functions have been proposed (Long et al., 2010; Radlinski and Joachims, 2007). The active performance estimation problem discussed in this paper can be considered a dual problem of active learning: in active learning, the goal of the selection process is to obtain maximally accurate predictions or model parameters; our goal is to obtain maximally accurate performance estimates for existing ranking models.

There are two possible directions for actively selecting training or evaluation data. One direction is to actively select most informative queries for which documents should be retrieved and labeled (the approach followed in this paper). Another possible direction to reduce labeling effort is to actively select a subset of documents that should be labeled for each query. Long et al. (2010) employ both document sampling and query sampling to minimize prediction variance of ranking models in active learning. Document sampling is also used by Carterette et al. (2006) in order to decide which of two ranking functions achieves higher precision at k, by Aslam et al. (2006) to obtain unbiased estimates of mean average precision and mean R-precision for ranking models, and by Carterette and Smucker (2007) to perform statistical significance testing for pairs of ranking models.

For the estimation of DCG and ERR studied in this paper, document sampling is not directly applicable, since the considered performance measures depend on the relevance labels of all individual documents for a given query (see Eqs. (1) and (2)). For ERR even the discounting factor for a specific ranking position can only be determined if the relevance of all higher-ranked documents is known. Consequently, if arbitrary subsets of documents are sampled, the sum in Eq. (2) could only be computed up to the first factor associated with a document not contained in the subset. Substituting missing relevance labels with some estimate or default value would yield a final performance estimate that can be strongly biased, as a single incorrect relevance label affects the discounting factors for all subsequent positions. This problem could be avoided if we label a prefix of the list of all documents associated with a query, thereby truncating the sums in Eqs. (1) and (2). While truncation would still introduce an estimation bias, the size of this bias could be bounded for a given ranking and traded off against the reduction in variance that can be achieved by labeling additional queries. Studying the bias-variance trade-off associated with such truncated label vectors is an interesting direction for future work.

The active performance estimation approach presented in this paper employs importance weighting to compensate for the bias introduced by the instrumental distribution. Importance weighting is a widely studied technique that has been used in several active learning approaches, for example with exponential family models (Bach 2007) or SVMs (Beygelzimer et al. 2009).

7 Conclusions

We have studied the problem of estimating or comparing the performance of given ranking functions as accurately as possible on a fixed budget for labeling item relevance. Starting with an analysis of the sources of estimation error, we have derived distributions for sampling test queries that asymptotically minimize the error of importance-weighted performance estimators for the ranking measures Discounted Cumulative Gain and Expected Reciprocal Rank (Theorems 1 and 2). To apply the method in practice, we assume that a large pool of unlabeled test queries drawn from the true query distribution is available; Corollary 1 prescribes an empirical distribution for sampling queries from this pool. A naïve computation of this empirical sampling distribution is exponential; however, we have shown that the distribution can be computed in polynomial time using dynamic programming (Theorem 3).

A crucial feature of ranking domains is that labeling costs vary between queries, as a function of the number of documents returned for a query and of document-specific features. This is in contrast to the settings studied in existing approaches for active evaluation (Sawade et al. 2010, 2012). Our cost-sensitive analysis has shown that to minimize the estimation error, sampling weights should be inversely proportional to the square root of the labeling costs of a query (Theorems 1 and 2).

Empirically, we observed that active estimates of ranking performance are more accurate than passive estimates. In different experimental settings—estimation of the performance of a single ranking model, comparison of different types of ranking models, simulated index updates—active estimation saved between 10 % and 75 % of labeling effort. The advantage of an active estimation strategy is particularly strong when evaluating the relative performance of different ranking models; here, the sampling strategy is based on choosing queries that highlight the differences between the models (Eq. (17)). In contrast, the gains for absolute performance estimation of a single ranking model were relatively small in our experiments. Furthermore, we observed that the gains of active estimation are slightly more pronounced when estimating ERR as compared to DCG.