# Active evaluation of ranking functions based on graded relevance


## Abstract

Evaluating the quality of ranking functions is a core task in web search and other information retrieval domains. Because query distributions and item relevance change over time, ranking models often cannot be evaluated accurately on held-out training data. Instead, considerable effort is spent on manually labeling the relevance of query results for test queries in order to track ranking performance. We address the problem of estimating ranking performance as accurately as possible on a fixed labeling budget. Estimates are based on a set of most informative test queries selected by an active sampling distribution. Query labeling costs depend on the number of result items as well as item-specific attributes such as document length. We derive cost-optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain and Expected Reciprocal Rank. Experiments on web search engine data illustrate significant reductions in labeling costs.

## Keywords

Information retrieval, Ranking, Active evaluation

## 1 Introduction

This paper addresses the problem of estimating the performance of a given ranking function in terms of graded relevance measures such as Discounted Cumulative Gain (DCG) (Järvelin and Kekäläinen 2002) and Expected Reciprocal Rank (ERR) (Chapelle et al. 2009). In information retrieval domains, ranking models often cannot be evaluated on held-out training data. For example, older training data might not represent the distribution of queries the model is currently exposed to, or the ranking model might be procured from a third party that does not provide any training data.

In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We study the problem of estimating ranking performance as accurately as possible on a fixed budget for labeling item relevance, or, equivalently, minimizing labeling costs for a given level of estimation accuracy. We also study the related problem of cost-efficiently comparing the ranking performance of two models; this is required, for instance, to evaluate the result of an index update.

We assume that drawing unlabeled data *x*∼*p*(*x*) from the distribution of queries that the model is exposed to is inexpensive, whereas obtaining relevance labels is costly. The standard approach to estimating ranking performance is to draw a sample of test queries from *p*(*x*), obtain relevance labels, and compute the empirical performance. However, recent results on *active risk estimation* (Sawade et al. 2010) and *active comparison* (Sawade et al. 2012) indicate that estimation accuracy can be improved by drawing test examples from an appropriately engineered instrumental distribution *q*(*x*) rather than *p*(*x*), and correcting for the discrepancy between *p* and *q* by importance weighting. Motivated by these results, we study active estimation processes for ranking performance. In analogy to active learning, they select specific queries from a pool of unlabeled test queries, obtain relevance labels for all items returned for these queries by the ranking function to be evaluated, and then compute appropriate performance estimates based on the observed item relevance. Test queries are selected according to an instrumental distribution *q*(*x*). The actively selected sample is weighted appropriately to compensate for the discrepancy between instrumental and test distributions which leads to a consistent—that is, asymptotically unbiased—performance estimate.

We derive instrumental distributions that asymptotically minimize the estimation error for either evaluating a single ranking function or comparing two given ranking functions. In ranking problems, the labeling costs for a query depend on the number of items returned by the ranking function as well as item-specific attributes such as document length; this is in contrast to the active estimation settings discussed previously (Sawade et al. 2010, 2012). We show how these costs can be taken into account when deriving optimal instrumental distributions. Moreover, in the ranking setting a naïve computation of optimal instrumental distributions is exponential; a central contribution of this paper is to show how these distributions can be computed in polynomial time using dynamic programming.

An initial version of this paper appeared at the 2012 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Sawade et al. 2012). The current paper extends the earlier conference version with more detailed proofs of central theorems, additional empirical results, and a more self-contained presentation. We now briefly summarize the extensions made with respect to the earlier conference version.

A central theoretical contribution of the paper is the method for efficient computation of optimal sampling distributions (Theorem 3). There are four different instantiations of this result: estimation of the absolute performance of a single ranking function, and comparison of the relative performance of ranking functions, in combination with the performance measures discounted cumulative gain (DCG) and expected reciprocal rank (ERR). The conference version of the paper only contained the proof for one of the four cases (absolute estimation, ERR). The current paper contains detailed proofs for all four cases.

While the conference paper derived active performance estimation techniques for the ranking performance measures ERR and DCG, only results for the ERR measure were included in the empirical study. In the current paper, we also include comprehensive empirical results for the DCG measure. In the empirical study, we have also added a visualization of the sampling distribution employed by our method as a function of sample cost and intrinsic expectation of ranking performance.

The rest of the paper is organized as follows. Section 2 details the problem setting. Section 3 derives cost-optimal sampling distributions for the estimation of DCG and ERR. Section 4 discusses empirical sampling distributions in a pool-based setting. Section 5 presents empirical results. Section 6 discusses related work, and Sect. 7 concludes.

## 2 Problem setting

Let \(\mathcal{X}\) denote the space of queries and \(\mathcal{Z}\) a finite space of items. A ranking function **r** maps a query \(x\in \mathcal{X}\) to a ranking \({\mathbf{r}}(x)=(r_{1}(x),\dots,r_{|{\mathbf{r}}(x)|}(x))\) of |**r**(*x*)| items \(r_{i}(x)\in\mathcal{Z}\) ordered by relevance. The number of items in a ranking **r**(*x*) can vary depending on the query and application domain from thousands (web search) to ten or fewer (mobile applications that have to present results on a small screen). Ranking performance of **r** is defined in terms of graded relevance labels \(y_{z}\in \mathcal{Y}\) that represent the relevance of an item \(z \in \mathcal{Z}\) for the query *x*, where \(\mathcal{Y}\subset \mathbb{R}\) is a finite space of relevance labels with minimum zero (irrelevant) and maximum *y* _{ max } (perfectly relevant). We summarize the graded relevance of all \(z \in \mathcal{Z}\) in a label vector \({\mathbf{y}}\in \mathcal{Y}^{\mathcal{Z}}\) with components *y* _{ z } for \(z \in \mathcal{Z}\).

To measure the quality of a ranking **r**(*x*) for a single query *x*, we employ two commonly used ranking performance measures: *Discounted Cumulative Gain* (DCG), given by

$$L_{dcg}\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr) = \sum_{i=1}^{|{\mathbf{r}}(x)|} \frac{2^{y_{r_{i}(x)}}-1}{\log_{2}(i+1)}, \qquad (1)$$

and *Expected Reciprocal Rank* (ERR), given by

$$L_{err}\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr) = \sum_{i=1}^{|{\mathbf{r}}(x)|} \frac{1}{i}\, p\bigl(y_{r_{i}(x)}\bigr) \prod_{l=1}^{i-1} \Bigl(1-p\bigl(y_{r_{l}(x)}\bigr)\Bigr), \quad \text{where } p(y)=\frac{2^{y}-1}{2^{y_{max}}}. \qquad (2)$$

DCG scores a ranking by summing over the relevance of all documents discounted by their position in the ranking. ERR is based on a probabilistic user model: the user scans a list of documents in the order defined by **r**(*x*) and chooses the first document that appears sufficiently relevant; the likelihood of choosing a document *z* is a function of its graded relevance score *y* _{ z }. If *s* denotes the position of the chosen document in **r**(*x*), then *L* _{ err }(**r**(*x*),**y**) is the expectation of the reciprocal rank 1/*s* under the probabilistic user model. Both DCG and ERR discount relevance with ranking position; ranking quality is thus most strongly influenced by documents that are ranked highly. If **r**(*x*) includes many items, *L* _{ dcg } and *L* _{ err } are in practice often approximated by labeling only items up to a certain position in the ranking or a certain relevance threshold and ignoring the contribution of lower-ranked items.
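Both measures are straightforward to compute from a list of graded relevance labels in ranking order. The following sketch assumes the common gain function \(2^{y}-1\) for DCG and the stop probability \((2^{y}-1)/2^{y_{max}}\) of the cascade user model of Chapelle et al.; function names are illustrative:

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: gain (2^y - 1) at rank i discounted by log2(i + 1)."""
    return sum((2 ** y - 1) / math.log2(i + 2) for i, y in enumerate(relevances))

def err(relevances, y_max=4):
    """Expected Reciprocal Rank: expectation of 1/s under the cascade user model,
    where s is the position at which the simulated user stops."""
    p_stop = [(2 ** y - 1) / 2 ** y_max for y in relevances]  # relevance -> stop probability
    value, p_continue = 0.0, 1.0
    for i, p in enumerate(p_stop):
        value += p_continue * p / (i + 1)   # reciprocal rank 1/(i+1), weighted by P(stop here)
        p_continue *= 1.0 - p               # user scans past this document
    return value
```

Note how both functions weight early positions most heavily, matching the position discounting described above.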

Let *p*(*x*,**y**)=*p*(*x*)*p*(**y**|*x*) denote the joint distribution over queries \(x \in \mathcal{X}\) and label vectors \({\mathbf{y}}\in \mathcal{Y}^{\mathcal{Z}}\) the model is exposed to. We assume that the individual relevance labels *y* _{ z } for items *z* are drawn independently given a query *x*:

$$p({\mathbf{y}}|x) = \prod_{z\in\mathcal{Z}} p(y_{z}|x,z). \qquad (3)$$

The ranking performance of **r** with respect to *p*(*x*,**y**) is given by

$$R[{\mathbf{r}}] = \int\!\!\int L\bigl({\mathbf{r}}(x),{\mathbf{y}}\bigr)\, p(x,{\mathbf{y}})\, \mathrm{d}{\mathbf{y}}\, \mathrm{d}x, \qquad (4)$$

where *L*∈{*L* _{ dcg },*L* _{ err }} denotes the performance measure under study. We use integrals for notational convenience; for discrete spaces, the corresponding integral is replaced by a sum. If the context is clear, we refer to *R*[**r**] simply by *R*.

Since *p*(*x*,**y**) is unknown, ranking performance is typically approximated by an empirical average

$$\hat{R}_{n} = \frac{1}{n} \sum_{i=1}^{n} L\bigl({\mathbf{r}}(x_{i}),{\mathbf{y}}_{i}\bigr), \qquad (5)$$

where test queries *x* _{1},…,*x* _{ n } and graded relevance labels **y** _{1},…,**y** _{ n } are drawn *iid* from *p*(*x*,**y**). The empirical performance \(\hat {R}_{n}\) consistently estimates the true ranking performance; that is, \(\hat {R}_{n}\) converges to *R* with *n*→∞.

The test queries *x* _{ i } need not necessarily be drawn according to the input distribution *p*. When instances are drawn according to an instrumental distribution *q*, an estimator can be defined as

$$\hat{R}_{n,q} = \Biggl(\sum_{j=1}^{n} \frac{p(x_{j})}{q(x_{j})}\Biggr)^{\!-1} \sum_{j=1}^{n} \frac{p(x_{j})}{q(x_{j})}\, L\bigl({\mathbf{r}}(x_{j}),{\mathbf{y}}_{j}\bigr), \qquad (6)$$

where (*x* _{ j },**y** _{ j }) are drawn from *q*(*x*)*p*(**y**|*x*) and again *L*∈{*L* _{ dcg },*L* _{ err }}. Weighting factors \(\frac{p(x_{j})}{q(x_{j})}\) compensate for the discrepancy between test and instrumental distributions, and the normalizer is the sum of weights. Because of the weighting factors, Eq. (6) defines a consistent estimator (see, e.g., Liu 2001, p. 35). Note that Eq. (5) is a special case of Eq. (6), using the instrumental distribution *q*=*p*.

For certain choices of the sampling distribution *q*, \(\hat{R}_{n,q}\) may be a more label-efficient estimator of the true performance *R* than \(\hat {R}_{n}\) (Sawade et al. 2010).
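The importance-weighted estimator of Eq. (6) is a self-normalized weighted average. A minimal sketch, with `weighted_estimate` as a hypothetical helper name; `p` and `q` evaluate the test and instrumental densities, and `loss(x, y)` stands in for \(L({\mathbf{r}}(x),{\mathbf{y}})\):

```python
def weighted_estimate(queries, loss, p, q):
    """Self-normalized importance-weighted performance estimate (cf. Eq. (6)).

    queries: list of (x, y) pairs with x drawn from q(x) and y from p(y|x);
    loss(x, y) evaluates the performance measure L(r(x), y);
    p and q give the densities of the test and instrumental distributions.
    """
    weights = [p(x) / q(x) for x, _ in queries]         # correction factors p(x)/q(x)
    total = sum(w * loss(x, y) for w, (x, y) in zip(weights, queries))
    return total / sum(weights)                         # normalizer: sum of weights
```

As a sanity check, a sample whose composition exactly matches *q* recovers the expectation under *p*: with *p* uniform on two query types and *q* skewed toward one of them, the weights exactly undo the skew.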

In ranking settings, the labeling costs of a query depend on the number of items |**r**(*x*)| returned and item-specific features such as the length of a document whose relevance has to be determined. We denote labeling costs for a query *x* by *λ*(*x*), and assume that *λ*(*x*) is bounded away from zero by *λ*(*x*)≥*ϵ*>0. Our goal is to minimize the deviation of \(\hat{R}_{n,q}\) from *R* under the constraint that expected overall labeling costs stay below a budget \({{\varLambda}} \in \mathbb{R}\):

$$\bigl(q^{*}, n^{*}\bigr) = \mathop{\mathrm{argmin}}_{q,n}\; \mathbb{E}\Bigl[\bigl(\hat{R}_{n,q}-R\bigr)^{2}\Bigr] \quad \text{s.t.}\quad n \int \lambda(x)\, q(x)\, \mathrm{d}x \leq {\varLambda}. \qquad (7)$$

Note that the optimization over *n* implies that many inexpensive or few expensive queries could be chosen.

If we are interested in the *relative* performance of two ranking functions **r** _{1} and **r** _{2}, Eq. (7) can be replaced by

$$\bigl(q^{*}, n^{*}\bigr) = \mathop{\mathrm{argmin}}_{q,n}\; \mathbb{E}\Bigl[\bigl(\hat{{\varDelta}}_{n,q}-{\varDelta}\bigr)^{2}\Bigr] \quad \text{s.t.}\quad n \int \lambda(x)\, q(x)\, \mathrm{d}x \leq {\varLambda}, \qquad (8)$$

where \(\hat{{\varDelta}}_{n,q} = \hat{R}_{n,q}[{\mathbf{r}}_{1}] - \hat{R}_{n,q}[{\mathbf{r}}_{2}]\) estimates the performance difference *Δ*=*R*[**r** _{1}]−*R*[**r** _{2}]. In the next section, we derive sampling distributions *q* ^{∗} asymptotically solving Eqs. (7) and (8).

## 3 Asymptotically optimal sampling

We derive sampling distributions that minimize the estimation error in the limit *n*→∞. According to Liu (2001), Chap. 2.5.3, the squared bias term of the estimators is of order \(\frac{1}{n^{2}}\), while the variance is of order \(\frac{1}{n}\). For large *n*, the expected deviation is thus dominated by the variance, and \(\sigma_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat{R}_{n,q}]\) exists. For large *n*, we can thus approximate \(\mathbb{E}[(\hat{R}_{n,q}-R)^{2}] \approx \frac{1}{n}\sigma^{2}_{q}\) and, analogously, \(\mathbb{E}[(\hat{{\varDelta}}_{n,q}-{\varDelta})^{2}] \approx \frac{1}{n}\tau^{2}_{q}\) with \(\tau_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat{{\varDelta}}_{n,q}]\).

Let \(\delta(x,{\mathbf{y}})=L({\mathbf{r}}_{1}(x),{\mathbf{y}})-L({\mathbf{r}}_{2}(x),{\mathbf{y}})\) denote the performance difference of the two ranking models for a test query (*x*,**y**) and *L*∈{*L* _{ dcg },*L* _{ err }}. The following theorems derive sampling distributions minimizing the quantities \(\frac{1}{n}\sigma^{2}_{q}\) and \(\frac{1}{n}\tau^{2}_{q}\), thereby approximately solving Problems (7) and (8).

### Theorem 1

(Optimal sampling for evaluation of a ranking function)

*Let* *L*∈{*L* _{ dcg },*L* _{ err }} *and* \(\sigma_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat{R}_{n,q}]\). *The optimization problem*

$$q^{*} = \mathop{\mathrm{argmin}}_{q,n}\; \frac{1}{n}\sigma^{2}_{q} \quad \text{s.t.}\quad n \int \lambda(x)\, q(x)\, \mathrm{d}x \leq {\varLambda}$$

*is solved by*

$$q^{*}(x) \propto \frac{p(x)}{\sqrt{\lambda(x)}} \sqrt{\int \bigl(L({\mathbf{r}}(x),{\mathbf{y}})-R\bigr)^{2}\, p({\mathbf{y}}|x)\, \mathrm{d}{\mathbf{y}}}. \qquad (12)$$

### Theorem 2

(Optimal sampling for comparison of ranking functions)

*Let* *L*∈{*L* _{ dcg },*L* _{ err }} *and* \(\tau_{q}^{2} = \lim_{n\rightarrow \infty} n \mathrm{Var}[\hat {{\varDelta }}_{n,q}]\). *The optimization problem*

$$q^{*} = \mathop{\mathrm{argmin}}_{q,n}\; \frac{1}{n}\tau^{2}_{q} \quad \text{s.t.}\quad n \int \lambda(x)\, q(x)\, \mathrm{d}x \leq {\varLambda}$$

*is solved by*

$$q^{*}(x) \propto \frac{p(x)}{\sqrt{\lambda(x)}} \sqrt{\int \bigl(\delta(x,{\mathbf{y}})-{\varDelta}\bigr)^{2}\, p({\mathbf{y}}|x)\, \mathrm{d}{\mathbf{y}}}. \qquad (13)$$

Before we prove Theorems 1 and 2, we state the following two lemmata:

### Lemma 1

(Asymptotic variance)

*Let* \(\hat{R}_{n,q}\), \(\hat{{\varDelta}}_{n,q}\), *R*, *and* *Δ* *be defined as above*. *Then*

$$\sigma^{2}_{q} = \int \frac{p^{2}(x)}{q(x)} \int \bigl(L({\mathbf{r}}(x),{\mathbf{y}})-R\bigr)^{2}\, p({\mathbf{y}}|x)\, \mathrm{d}{\mathbf{y}}\, \mathrm{d}x$$

*and*

$$\tau^{2}_{q} = \int \frac{p^{2}(x)}{q(x)} \int \bigl(\delta(x,{\mathbf{y}})-{\varDelta}\bigr)^{2}\, p({\mathbf{y}}|x)\, \mathrm{d}{\mathbf{y}}\, \mathrm{d}x$$

*for* *L*∈{*L* _{ dcg },*L* _{ err }}.

The proof of Lemma 1 closely follows results from Sawade et al. (2010, 2012) and is included in the Appendix.

### Lemma 2

(Maximizing distribution)

*Let* \(a: \mathcal{X} \rightarrow \mathbb{R}\) *and* \(\lambda: \mathcal{X} \rightarrow \mathbb{R}\) *denote functions on the query space such that* \(\int \sqrt{a(x)} \mathrm {d}x\) *exists and* *λ*(*x*)≥*ϵ*>0. *The functional*

$$F[q] = \biggl(\int \lambda(x)\, q(x)\, \mathrm{d}x\biggr) \biggl(\int \frac{a(x)}{q(x)}\, \mathrm{d}x\biggr),$$

*where* *q*(*x*) *is a distribution over the query space* \(\mathcal{X}\), *is minimized over* *q* *by setting*

$$q(x) \propto \sqrt{\frac{a(x)}{\lambda(x)}}.$$

A proof of Lemma 2 is included in the Appendix. We now prove Theorems 1 and 2.

### Proof of Theorems 1 and 2

By Lemma 1 and the approximations \(\mathbb{E}[(\hat{R}_{n,q}-R)^{2}] \approx \frac{1}{n}\sigma^{2}_{q}\) and \(\mathbb{E}[(\hat{{\varDelta}}_{n,q}-{\varDelta})^{2}] \approx \frac{1}{n}\tau^{2}_{q}\), the objectives of Problems (7) and (8) decrease in *n*; the cost constraint therefore holds with equality at the optimum, and *n* ^{∗}=*Λ*/∫*λ*(*x*)*q*(*x*)d*x* solves the inner optimization. Since *Λ* is a constant, the remaining minimization over *q* is

$$q^{*} = \mathop{\mathrm{argmin}}_{q}\; \biggl(\int \lambda(x)\, q(x)\, \mathrm{d}x\biggr)\, \sigma^{2}_{q} \quad\text{or}\quad q^{*} = \mathop{\mathrm{argmin}}_{q}\; \biggl(\int \lambda(x)\, q(x)\, \mathrm{d}x\biggr)\, \tau^{2}_{q},$$

respectively. Setting *a*(*x*)=*p* ^{2}(*x*)∫(*L*(**r**(*x*),**y**)−*R*) ^{2} *p*(**y**|*x*)d**y** and applying Lemma 2 implies Eq. (12); setting *a*(*x*)=*p* ^{2}(*x*)∫(*δ*(*x*,**y**)−*Δ*) ^{2} *p*(**y**|*x*)d**y** and applying Lemma 2 implies Eq. (13). □
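Lemma 2 can be checked numerically on a discrete query space. By the Cauchy–Schwarz inequality, the closed-form distribution \(q(x)\propto\sqrt{a(x)/\lambda(x)}\) attains the lower bound \(\bigl(\sum_x \sqrt{\lambda(x)\,a(x)}\bigr)^{2}\) of the functional, and no perturbed distribution does better; the values of `lam` and `a` below are arbitrary illustrations:

```python
import math
import random

def objective(q, lam, a):
    """F[q] = (sum_x lambda(x) q(x)) * (sum_x a(x) / q(x)) on a discrete query space."""
    return sum(l * qi for l, qi in zip(lam, q)) * sum(ai / qi for ai, qi in zip(a, q))

lam = [1.0, 4.0, 0.25]   # labeling costs lambda(x), arbitrary example values
a   = [2.0, 1.0, 0.5]    # variance terms a(x), arbitrary example values

# Closed-form minimizer from Lemma 2: q(x) proportional to sqrt(a(x) / lambda(x))
q_star = [math.sqrt(ai / li) for ai, li in zip(a, lam)]
z = sum(q_star)
q_star = [qi / z for qi in q_star]
best = objective(q_star, lam, a)

# Randomly perturbed distributions never beat the closed-form solution
random.seed(0)
for _ in range(1000):
    q = [random.random() + 1e-6 for _ in lam]
    z = sum(q)
    assert objective([qi / z for qi in q], lam, a) >= best - 1e-9
```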

## 4 Empirical sampling distribution

The sampling distributions prescribed by Theorems 1 and 2 depend on the unknown test distribution *p*(*x*). We now turn towards a setting in which a pool *D* of *m* unlabeled queries is available. Queries from this pool can be sampled and then labeled at a cost. Drawing queries from the pool replaces generating them under the test distribution; that is, \(p(x) = \frac{1}{m}\) for all *x*∈*D*.

The optimal sampling distribution also depends on the true conditional distribution \(p({\mathbf{y}}|x)=\prod_{z\in\mathcal{Z}}p(y_{z}|x,z)\) (Eq. (3)). To implement the method, we approximate *p*(*y* _{ z }|*x*,*z*) by a model *p*(*y* _{ z }|*x*,*z*;*θ*) of graded relevance. For the large class of pointwise ranking methods—that is, methods that produce a ranking by predicting graded relevance scores for query-document pairs and then sorting documents according to their score—such a model can typically be derived from the graded relevance predictor. Finally, the sampling distributions depend on the true performance *R*[**r**] as given by Eq. (4), or *Δ*=*R*[**r** _{1}]−*R*[**r** _{2}]. *R*[**r**] is replaced by an introspective performance *R* _{ θ }[**r**] calculated from Eq. (4), where the integral over \(\mathcal{X}\) is replaced by a sum over the pool, \(p(x) = \frac{1}{m}\), and \(p({\mathbf{y}}|x) = \prod_{z \in \mathcal{Z}} p(y_{z}|x,z;\theta)\). Note that because the integral over \(\mathcal{X}\) in Eq. (4) is replaced by a sum over a pool of unlabeled data *D*, the empirical sampling distribution is chosen such that we accurately estimate ranking performance over this pool rather than over the entire instance space \(\mathcal{X}\). However, assuming that a large enough pool of unlabeled data is available, ranking performance over the entire pool *D* will closely approximate ranking performance over \(\mathcal{X}\).

The performance difference *Δ* is approximated by *Δ* _{ θ }=*R* _{ θ }[**r** _{1}]−*R* _{ θ }[**r** _{2}]. Note that as long as *p*(*x*)>0 implies *q*(*x*)>0, the weighting factors ensure that such approximations do not introduce an asymptotic bias in our estimator (Eq. (6)).

With these approximations, we arrive at the following empirical sampling distributions.

### Corollary 1

*When relevance labels for individual items are independent given the query* (*Eq*. (3)), *and* *p*(*y* _{ z }|*x*,*z*) *is approximated by a model* *p*(*y*|*x*,*z*;*θ*) *of graded relevance*, *the sampling distributions minimizing* \(\frac{1}{n}\sigma^{2}_{q}\) *and* \(\frac{1}{n}\tau^{2}_{q}\) *in a pool*-*based setting resolve to*

$$q^{*}(x) \propto \frac{1}{\sqrt{\lambda(x)}} \sqrt{\mathbb{E}\Bigl[\bigl(L({\mathbf{r}}(x),{\mathbf{y}})-R_{\theta}\bigr)^{2}\,\Big|\,x;\theta\Bigr]} \qquad (16)$$

*and*

$$q^{*}(x) \propto \frac{1}{\sqrt{\lambda(x)}} \sqrt{\mathbb{E}\Bigl[\bigl(\delta(x,{\mathbf{y}})-{\varDelta}_{\theta}\bigr)^{2}\,\Big|\,x;\theta\Bigr]}, \qquad (17)$$

*respectively*. *Here*, *for any function* *g*(*x*,**y**) *of a query* *x* *and label vector* **y**,

$$\mathbb{E}\bigl[g(x,{\mathbf{y}})\,\big|\,x;\theta\bigr] = \sum_{{\mathbf{y}}\in\mathcal{Y}^{\mathcal{Z}}} g(x,{\mathbf{y}}) \prod_{z\in\mathcal{Z}} p(y_{z}|x,z;\theta)$$

*denotes the expectation of* *g*(*x*,**y**) *with respect to label vectors* **y** *generated according to* *p*(*y* _{ z }|*x*,*z*;*θ*).

The corollary directly follows from inserting the approximations \(p(x)\approx \frac{1}{m}\) and \(p({\mathbf{y}}|x) \approx \prod_{z \in \mathcal{Z}} p(y_{z}|x,z;\theta)\) into Eqs. (12) and (13).

We observe that for the evaluation of a single given ranking function **r** (Eq. (16)), the empirical sampling distribution gives preference to queries *x* with low costs *λ*(*x*) and for which the expected ranking performance deviates strongly from the average expected ranking performance *R* _{ θ }; the expectation is taken with respect to the available graded relevance model *θ*. For the comparison of two given ranking functions **r** _{1} and **r** _{2} (Eq. (17)), preference is given to queries *x* with low costs and for which the difference in performance *L*(**r** _{1}(*x*),**y**)−*L*(**r** _{2}(*x*),**y**) is expected to be high (note that *Δ* _{ θ } will typically be small).

Computation of the empirical sampling distributions given by Eqs. (16) and (17) requires the computation of \(\mathbb{E}[g(x,{\mathbf{y}})|x;\theta]\), which is defined in terms of a sum over exponentially many relevance label vectors \({\mathbf{y}}\in \mathcal{Y}^{\mathcal{Z}}\). The following theorem states that the empirical sampling distributions can nevertheless be computed in polynomial time:

### Theorem 3

(Polynomial-time computation of sampling distributions)

*The empirical sampling distributions given by Eqs*. (16) *and* (17) *can*, *for both* *L* _{ dcg } *and* *L* _{ err }, *be computed in time polynomial in the number of ranked items and the number of relevance labels*.

A detailed proof of Theorem 3 is given in the Appendix. The general idea behind the proof is to compute the required quantities using dynamic programming. Specifically, after substituting Eqs. (1) and (2) into Eqs. (16) and (17) and exploiting the independence assumption given by Eq. (3), Eqs. (16) and (17) can be decomposed into cumulative sums and products of expectations over individual item labels \(y\in \mathcal{Y}\). These sums and products can be computed in polynomial time.
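To illustrate the decomposition idea, the following sketch computes the expected ERR of a query under independent per-item label distributions in time linear in the number of ranked items, and checks it against brute-force enumeration over all label vectors. This covers only the expectation of \(L_{err}\), not the full second-moment terms of Eqs. (16) and (17); the function names and the stop-probability model \((2^{y}-1)/2^{y_{max}}\) are assumptions in line with Eq. (2):

```python
import itertools

def stop_prob(y, y_max=4):
    """Stop probability of the ERR cascade model: (2^y - 1) / 2^y_max."""
    return (2 ** y - 1) / 2 ** y_max

def expected_err_dp(label_dists, y_max=4):
    """E[L_err | x; theta] in O(#items * #labels): with independent labels,
    the expectation factorizes into a cumulative product over positions."""
    value, cont = 0.0, 1.0
    for i, dist in enumerate(label_dists):      # dist maps label -> probability
        e_stop = sum(p * stop_prob(y, y_max) for y, p in dist.items())
        value += cont * e_stop / (i + 1)
        cont *= 1.0 - e_stop
    return value

def expected_err_brute(label_dists, y_max=4):
    """Exponential-time reference: enumerate all label vectors y."""
    total = 0.0
    for combo in itertools.product(*[d.items() for d in label_dists]):
        prob = 1.0
        for _, p in combo:
            prob *= p
        value, cont = 0.0, 1.0
        for i, (y, _) in enumerate(combo):
            s = stop_prob(y, y_max)
            value += cont * s / (i + 1)
            cont *= 1.0 - s
        total += prob * value
    return total
```

The factorization is valid precisely because of the independence assumption of Eq. (3): the expectation of each product term splits into a product of per-position expectations.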

Algorithm 1 summarizes the resulting active estimation procedure. It samples queries *x* _{1},…,*x* _{ n } with replacement from the pool according to the distribution prescribed by Corollary 1 and obtains relevance labels from a human labeler for all items included in **r**(*x* _{ i }) (when evaluating a single ranking function) or **r** _{1}(*x* _{ i })∪**r** _{2}(*x* _{ i }) (when comparing two ranking functions) until the labeling budget *Λ* is exhausted. Note that queries can be drawn more than once; in the special case that the labeling process is deterministic, recurring labels can be looked up rather than be queried from the deterministic labeling oracle repeatedly. Hence, the actual labeling costs may stay below \(\sum_{j=1}^{n} \lambda(x_{j})\); in this case, the loop is continued until the labeling budget *Λ* is exhausted.
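The sampling loop can be sketched as follows. This is a simplification of Algorithm 1 under the deterministic-labeling assumption: `max_draws` is a safeguard added here, repeated queries are looked up at no cost, and the weighting of the resulting sample by \(p(x)/q(x)\) as in Eq. (6) is left to the caller:

```python
import random

def active_sample(pool, q, cost, budget, label_oracle, max_draws=10000):
    """Draw queries with replacement from the instrumental distribution q
    (a list of probabilities over the pool) until the labeling budget is spent.
    Labels of repeated queries are cached, not paid for twice."""
    labeled, sample, spent = {}, [], 0.0
    for _ in range(max_draws):
        x = random.choices(pool, weights=q)[0]
        if x not in labeled:
            if spent + cost(x) > budget:
                break                          # next new query is unaffordable: stop
            labeled[x] = label_oracle(x)       # pay lambda(x) once per distinct query
            spent += cost(x)
        sample.append((x, labeled[x]))
    return sample, spent
```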

## 5 Empirical studies

We compare active estimation of ranking performance (Algorithm 1, labeled *active*) to estimation based on a test sample drawn uniformly from the pool (Eq. (5), labeled *passive*). Algorithm 1 requires a model *p*(*y* _{ z }|*x*,*z*;*θ*) of graded relevance in order to compute the sampling distribution *q* ^{∗} from Corollary 1. If no such model is available, a uniform distribution \(p(y_{z} | x,z;\theta) = \frac{1}{|\mathcal{Y}|}\) can be used instead (labeled *active* _{ uniD }). To quantify the effect of modeling costs, we also study a variant of Algorithm 1 that assumes uniform costs *λ*(*x*)=1 in Eqs. (16) and (17) (labeled *active* _{ uniC }). This variant implements active risk estimation (Sawade et al. 2010) and active comparison (Sawade et al. 2012) for ranking; we have shown how the resulting sampling distributions can be computed in polynomial time (Corollary 1 and Theorem 3).

Experiments are performed on the Microsoft Learning to Rank data set MSLR-WEB30k (Microsoft Research 2010). It contains 31,531 queries, and a set of documents for each query whose relevance for the query has been determined by human labelers in the process of developing the Bing search engine. The resulting 3,771,125 query-document pairs are represented by 136 features widely used in the information retrieval community (such as query term statistics, page rank, and click counts). Relevance labels take values from 0 (irrelevant) to 4 (perfectly relevant).

The data are split into five folds. On one fold, we train ranking functions using different graded relevance models (details below). The remaining four folds serve as a pool of unlabeled test queries; we estimate (Sect. 5.1) or compare (Sect. 5.2) the performance of the ranking functions by drawing and labeling queries from this pool according to Algorithm 1 and the baselines discussed above. Test queries are drawn until a labeling budget *Λ* is exhausted. To quantify the human effort realistically, we model the labeling costs *λ*(*x*) for a query *x* as proportional to a sum of costs incurred for labeling individual documents *z*∈**r**(*x*); labeling costs for a single document *z* are assumed to be logarithmic in the document length.
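This cost model is easy to state in code. A sketch, where adding one inside the logarithm (to keep the per-document cost positive for very short documents) is our assumption, not the paper's:

```python
import math

def query_cost(doc_lengths):
    """Labeling cost of one query: logarithmic per-document cost,
    summed over the documents returned for the query."""
    return sum(math.log(1 + n) for n in doc_lengths)

def normalize_costs(raw_costs):
    """Rescale so that the average labeling cost per query is one,
    matching the cost unit used in the experiments."""
    mean = sum(raw_costs) / len(raw_costs)
    return [c / mean for c in raw_costs]
```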

In practice, *L* _{ dcg } and *L* _{ err } can be approximated for a query *x* by requesting labels only for the first *k* documents in the ranking. The number of documents for which the MSLR-WEB30k data set provides labels varies over the queries at an average of 119 documents per query. In our experiments, we use all documents for which labels are provided, for each query and for all evaluation methods under investigation. Figure 1 (left) shows the distribution of the number of documents per query over the entire data set. The cost unit is chosen such that average labeling costs for a query are one. Figure 1 (right) shows the distribution of labeling costs *λ*(*x*) of queries over the entire dataset. All results are averaged over the five folds and 5,000 repetitions of the evaluation process. Error bars indicate the standard error.

### 5.1 Estimating ranking performance

Based on the outcome of the 2010 Yahoo ranking challenge (Chapelle and Chang, 2011; Mohan et al., 2011), we choose a pointwise ranking approach and employ Random Forest regression (Breiman 2001) to train graded relevance models on query-document pairs. Specifically, we learn Random Forest regression models for the binary problems {*y* _{ z }≤*k*} versus {*y* _{ z }>*k*} for relevance labels *k*∈{1,…,4}. The regression values can be interpreted as probabilities *p*(*y* _{ z }≤*k*|*x*,*z*;*θ*) (Li et al., 2007, Sect. 4). This allows us to derive probability estimates *p*(*y* _{ z }=*k*|*x*,*z*;*θ*)=*p*(*y* _{ z }≤*k*|*x*,*z*;*θ*)−*p*(*y* _{ z }≤*k*−1|*x*,*z*;*θ*) required by Algorithm 1. The ranking function is obtained by returning all documents associated with a query sorted according to their predicted graded relevance (Li et al., 2007, Sect. 3.3). We denote this model by *RF* _{ class }. As an alternative graded relevance model, we also study a MAP version of ordered logistic regression (McCullagh 1980). The Ordered Logit model (denoted by *Logit*) is a modification of the logistic regression model which accounts for the ordering of the relevance labels; the target variable is *y* _{ z }≤*k* rather than a particular relevance level *y* _{ z }=*k*. This model directly provides probability estimates *p*(*y* _{ z }|*x*,*z*;*θ*). Half of the available training fold is used for model training, the other half is used as a validation set to tune hyperparameters of the respective ranking model.
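The conversion from cumulative estimates *p*(*y* _{ z }≤*k*|*x*,*z*;*θ*) to class probabilities is a simple differencing step. A sketch with a hypothetical `class_probs` helper; the clipping, monotonicity enforcement, and renormalization are our additions to guard against non-monotone regression outputs, not part of the paper:

```python
def class_probs(cum_probs):
    """Turn estimates p(y <= k), k = 0..y_max-1, into p(y = k), k = 0..y_max,
    via p(y = k) = p(y <= k) - p(y <= k-1)."""
    cum = [min(max(c, 0.0), 1.0) for c in cum_probs] + [1.0]  # p(y <= y_max) = 1
    cum = [max(cum[:i + 1]) for i in range(len(cum))]          # enforce monotonicity
    probs = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
    total = sum(probs)
    return [p / total for p in probs]                          # renormalize
```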

Figure 2 shows the absolute deviation between estimated and true ranking performance as a function of the labeling budget *Λ* for the performance measures ERR and DCG. True performance is taken to be the performance over all queries in the pool *D*. We observe that active estimation is more accurate than passive estimation; the labeling budget can be reduced from *Λ*=300 by between 10 % and 20 % depending on the ranking method and performance measure under study. Note that DCG and ERR have different ranges; *L* _{ err } takes values between zero and one, whereas *L* _{ dcg } is potentially unbounded and increases with the number of ranked items |**r**(*x*)|. For Ordered Logit models, the true performance is between 26.5 and 33.0 for DCG and between 0.31 and 0.34 for ERR. For *RF* _{ class } models, true performance is between 27.2 and 33.4 for DCG and between 0.35 and 0.38 for ERR.

### 5.2 Comparing ranking performance

We additionally train ranking functions using a ranking SVM (denoted *SVM*) and a Random Forests regression model (Li et al. 2007; Mohan et al. 2011) directly estimating relevance labels (denoted *RF* _{ reg }), and compare the resulting ranking functions to those of the *RF* _{ class } and *Logit* models. For the comparison of *RF* _{ class } vs. *Logit*, both models provide us with estimates *p*(*y* _{ z }|*x*,*z*;*θ*); in this case, a mixture model is employed as proposed in Sawade et al. (2012). We measure the *model selection error*, defined as the fraction of experiments in which an evaluation method does not correctly identify the model with higher true performance. Figure 3 shows model selection error as a function of the available labeling budget for different pairwise comparisons and the performance measures ERR and DCG. Active estimation more reliably identifies the model with higher ranking performance, saving between 30 % and 55 % of labeling effort compared to passive estimation at *Λ*=300. We observe that the gains of active versus passive estimation are not only due to differences in query costs: the baseline *active* _{ uniC }, which does not take into account query costs for computing the sampling distribution, performs almost as well as *active*.

In a further experiment, we simulate an index update: rankings are computed from an outdated document index, which restricts the set of documents available to **r**(*x*) for all queries. We use *RF* _{ class } as the ranking model. Active and passive estimation methods are applied to estimate the difference in performance between models based on the outdated and current index. Figure 4 shows the absolute deviation of the estimated from the true performance difference over the labeling budget *Λ* for the performance measures ERR and DCG. We observe that active estimation quantifies the impact of the index update more accurately than passive estimation, saving approximately 75 % of labeling effort for the performance measure ERR and about 30 % of labeling effort for DCG at *Λ*=300.

In a final comparison experiment, we study how accurately the performance gain from additional training data can be quantified. We compare a Random Forest model trained on 100,000 query-document pairs (**r** _{1}) to a Random Forest model trained on between 120,000 and 200,000 query-document pairs (**r** _{2}). The difference in performance between **r** _{1} and **r** _{2} is estimated using active and passive methods. Figure 5 (top row) shows the absolute deviation of the estimated from the true performance difference for models trained on 100,000 and 200,000 instances as a function of *Λ*. Active estimation quantifies the performance gain from additional training data more accurately, reducing labeling costs by approximately 45 % (ERR) and 30 % (DCG). Figure 5 (bottom row) shows estimation error as a function of the number of query-document pairs the model **r** _{2} is trained on, for *Λ*=100. Active estimation significantly reduces estimation error compared to passive estimation for all training set sizes.

### 5.3 Influence of query costs on the sampling distribution

The empirical sampling distributions given by Corollary 1 weight queries *x*∈*D* based on their cost *λ*(*x*) and intrinsic expectations of ranking performance given by *R* _{ θ } (Eq. (16)) and *Δ* _{ θ } (Eq. (17)), respectively. Figure 6 (left) shows a representative example of the sampling distribution *q* ^{∗}(*x*) given by Eq. (16), plotted in a two-dimensional space with axes *λ*(*x*) and intrinsic expected ranking performance \(\mathbb{E}[L(x,{\mathbf{y}})|x;\theta]\) for the query *x*. We observe that low-cost queries are preferred over high-cost queries, and that queries for which the intrinsic ranking performance \(\mathbb{E}[L(x,{\mathbf{y}})|x;\theta]\) is far away from the intrinsic overall performance *R* _{ θ } (in the example, approximately *R* _{ θ }=26) are more likely to be chosen.

Optimization problems (7) and (8) constitute a trade-off between labeling costs and informativeness of a test query: optimization over *n* implies that many inexpensive or few expensive queries could be chosen. On average, active estimation prefers to draw more (but cheaper) queries than passive estimation (Eqs. (16) and (17)). Figure 6 (right) shows the number of queries actually drawn by the different evaluation methods for a labeling budget of *Λ*=100 when estimating absolute DCG for *RF* _{ class }. We observe that the active estimation methods that take into account costs in the computation of the optimal sampling distribution (*active* and *active* _{ uniD }) draw more instances than *passive* and *active* _{ uniC }.

## 6 Related work

There has been significant interest in learning ranking functions from data in order to improve the relevance of search results (Burges, 2010; Li et al., 2007; Mohan et al., 2011; Zheng et al., 2007). This has partly been driven by the recent release of large-scale datasets derived from commercial search engines, such as the Microsoft Learning to Rank datasets (Microsoft Research 2010) and the Yahoo Learning to Rank Challenge datasets (Chapelle and Chang 2011). These datasets serve as realistic benchmarks for evaluating and comparing the performance of different ranking algorithms.

Most approaches for learning ranking functions employ the pointwise ranking model (e.g. Li et al., 2007; Mohan et al., 2011). In pointwise ranking, the problem of learning rankings is reduced to the problem of predicting graded relevance scores for individual query-document pairs; a ranking is then obtained by sorting documents according to their relevance score for a given query. The performance measures DCG (see Li et al., 2007, Lemma 1) and ERR (see Mohan et al., 2011, Theorem 1) can be upper-bounded by the classification error; these upper bounds can be seen as one justification for pointwise ranking approaches. However, there has also been work on directly maximizing performance measures defined over entire rankings, such as normalized DCG (Valizadegan et al., 2010; Burges, 2010).

To reduce the amount of training data that needs to be relevance-labeled, several approaches for active learning of ranking functions have been proposed (Long et al., 2010; Radlinski and Joachims, 2007). The active performance estimation problem discussed in this paper can be considered a dual problem of active learning: in active learning, the goal of the selection process is to obtain maximally accurate predictions or model parameters; our goal is to obtain maximally accurate performance estimates for existing ranking models.

There are two possible directions for actively selecting training or evaluation data. One direction is to actively select most informative queries for which documents should be retrieved and labeled (the approach followed in this paper). Another possible direction to reduce labeling effort is to actively select a subset of documents that should be labeled for each query. Long et al. (2010) employ both document sampling and query sampling to minimize prediction variance of ranking models in active learning. Document sampling is also used by Carterette et al. (2006) in order to decide which of two ranking functions achieves higher *precision at* *k*, by Aslam et al. (2006) to obtain unbiased estimates of mean average precision and mean R-precision for ranking models, and by Carterette and Smucker (2007) to perform statistical significance testing for pairs of ranking models.

For the estimation of DCG and ERR studied in this paper, document sampling is not directly applicable, since the considered performance measures depend on the relevance labels of all individual documents for a given query (see Eqs. (1) and (2)). For ERR even the discounting factor for a specific ranking position can only be determined if the relevance of all higher-ranked documents is known. Consequently, if arbitrary subsets of documents are sampled, the sum in Eq. (2) could only be computed up to the first factor associated with a document not contained in the subset. Substituting missing relevance labels with some estimate or default value would yield a final performance estimate that can be strongly biased, as a single incorrect relevance label affects the discounting factors for all subsequent positions. This problem could be avoided if we label a prefix of the list of all documents associated with a query, thereby truncating the sums in Eqs. (1) and (2). While truncation would still introduce an estimation bias, the size of this bias could be bounded for a given ranking and traded off against the reduction in variance that can be achieved by labeling additional queries. Studying the bias-variance trade-off associated with such truncated label vectors is an interesting direction for future work.

The active performance estimation approach presented in this paper employs importance weighting to compensate for the bias introduced by the instrumental distribution. Importance weighting is a widely studied technique that has been used in several active learning approaches, for example with exponential family models (Bach 2007) or SVMs (Beygelzimer et al. 2009).
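As a minimal illustration of the underlying principle (not the paper's exact estimator), importance weighting corrects for sampling from an instrumental distribution q instead of the target distribution p by reweighting each sample with the ratio p(x)/q(x):

```python
def importance_weighted_mean(samples, f, p, q):
    # Self-normalized importance-weighted estimate of E_p[f(x)] from
    # samples drawn according to the instrumental distribution q.
    weights = [p(x) / q(x) for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)
```

For instance, with p uniform over {0, 1, 2}, q placing mass (0.2, 0.6, 0.2) on those values, and samples whose empirical frequencies match q exactly, the weighted estimate of E_p[x²] recovers the true expectation 5/3 under p, while the unweighted sample mean is the biased 1.4.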

## 7 Conclusions

We have studied the problem of estimating or comparing the performance of given ranking functions as accurately as possible on a fixed budget for labeling item relevance. Starting from an analysis of the sources of estimation error, we have derived distributions for sampling test queries that asymptotically minimize the error of importance-weighted performance estimators for the ranking measures Discounted Cumulative Gain and Expected Reciprocal Rank (Theorems 1 and 2). To apply the method in practice, we assume that a large pool of unlabeled test queries drawn from the true query distribution is available; Corollary 1 prescribes an empirical distribution for sampling queries from this pool. A naïve computation of this empirical sampling distribution requires exponential time; however, we have shown that the distribution can be computed in polynomial time using dynamic programming (Theorem 3).

A crucial feature of ranking domains is that labeling costs vary for different queries as a function of the number of documents returned for a query and document-specific features. This is in contrast to settings studied in existing approaches for active evaluation (Sawade et al. 2010, 2012). Our cost-sensitive analysis has shown that to minimize the estimation error, sampling weights should be inversely proportional to the square root of the labeling costs of a document (Theorems 1 and 2).
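As a hypothetical sketch of this prescription, the snippet below divides a per-query weight by the square root of the per-query labeling cost and normalizes the result into a sampling distribution over the pool; the `informativeness` term is a placeholder standing in for the query-specific quantity derived in Theorems 1 and 2.

```python
import math

def sampling_distribution(informativeness, costs):
    # Weight each pool query by informativeness / sqrt(cost), then
    # normalize so the weights form a valid sampling distribution.
    raw = [a / math.sqrt(c) for a, c in zip(informativeness, costs)]
    z = sum(raw)
    return [w / z for w in raw]
```

Two equally informative queries whose labeling costs differ by a factor of four thus receive sampling probabilities 2/3 and 1/3 rather than 4/5 and 1/5: the cost penalty enters only through its square root.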

Empirically, we observed that active estimates of ranking performance are more accurate than passive estimates. Across different experimental settings (estimation of the performance of a single ranking model, comparison of different types of ranking models, and simulated index updates), active estimation reduced labeling costs by between 10% and 75%. The advantage of an active estimation strategy is particularly strong when evaluating the relative performance of different ranking models; here, the sampling strategy is based on choosing queries that highlight the differences between the models (Eq. (17)). In contrast, the gains for absolute performance estimation for a single ranking model were relatively small in our experiments. Furthermore, we observed that the gains of active estimation are slightly more pronounced when estimating ERR as compared to estimating DCG.

## Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments and suggestions for improving the paper. We gratefully acknowledge that this work was supported by a Google Research Award.

## References

- Aslam, J., Pavlu, V., & Yilmaz, E. (2006). A statistical method for system evaluation using incomplete judgments. In *Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval*.
- Bach, F. (2007). Active learning for misspecified generalized linear models. In *Advances in neural information processing systems*.
- Beygelzimer, A., Dasgupta, S., & Langford, J. (2009). Importance weighted active learning. In *Proceedings of the international conference on machine learning*.
- Burges, C. (2010). *RankNet to LambdaRank to LambdaMART: an overview* (Tech. Rep. MSR-TR-2010-82). Microsoft Research.
- Carterette, B., Allan, J., & Sitaraman, R. (2006). Minimal test collections for retrieval evaluation. In *Proceedings of the 29th SIGIR conference on research and development in information retrieval*.
- Carterette, B., & Smucker, M. (2007). Hypothesis testing with incomplete relevance judgments. In *Proceedings of the 16th ACM conference on information and knowledge management*.
- Chapelle, O., & Chang, Y. (2011). Yahoo! Learning to rank challenge overview. In *JMLR: workshop and conference proceedings* (Vol. 14, pp. 1–24).
- Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In *Proceedings of the conference on information and knowledge management*.
- Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. *IEEE Transactions on Information Theory*, *54*(11), 5140–5154.
- Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. *Neural Computation*, *4*, 1–58.
- Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In *Advances in large margin classifiers* (pp. 115–132).
- Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. *ACM Transactions on Information Systems*, *20*(4), 422–446.
- Li, P., Burges, C., & Wu, Q. (2007). Learning to rank using classification and gradient boosting. In *Advances in neural information processing systems*.
- Liu, J. (2001). *Monte Carlo strategies in scientific computing*. Berlin: Springer.
- Long, B., Chapelle, O., Zhang, Y., Chang, Y., Zheng, Z., & Tseng, B. (2010). Active learning for ranking through expected loss optimization. In *Proceedings of the 33rd SIGIR conference on research and development in information retrieval*.
- McCullagh, P. (1980). Regression models for ordinal data. *Journal of the Royal Statistical Society. Series B (Methodological)*, *42*(2), 109–142.
- Microsoft Research (2010). Microsoft learning to rank datasets. http://research.microsoft.com/en-us/projects/mslr/. Released 16 June 2010.
- Mohan, A., Chen, Z., & Weinberger, K. (2011). Web-search ranking with initialized gradient boosted regression trees. In *JMLR: workshop and conference proceedings* (Vol. 14, pp. 77–89).
- Radlinski, F., & Joachims, T. (2007). Active exploration for learning rankings from clickthrough data. In *Proceedings of the 13th SIGKDD international conference on knowledge discovery and data mining*.
- Sawade, C., Bickel, S., & Scheffer, T. (2010). Active risk estimation. In *Proceedings of the 27th international conference on machine learning*.
- Sawade, C., Landwehr, N., & Scheffer, T. (2012). Active comparison of prediction models. In *Advances in neural information processing systems*.
- Sawade, C., Bickel, S., von Oertzen, T., Scheffer, T., & Landwehr, N. (2012). Active evaluation of ranking functions based on graded relevance. In *Proceedings of the 2012 European conference on machine learning and principles and practice of knowledge discovery in databases*.
- Valizadegan, H., Jin, R., Zhang, R., & Mao, J. (2010). Learning to rank by optimizing NDCG measure. In *Advances in neural information processing systems*.
- Wasserman, L. (2004). *All of statistics: a concise course in statistical inference*. Berlin: Springer.
- Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., & Sun, G. (2007). A general boosting method and its application to learning ranking functions for web search. In *Advances in neural information processing systems*.