1 Introduction

An information retrieval (IR) test collection comprises a document collection, a set of search requests and a set of manually judged relevant documents for each request. Following TREC parlance, hereafter the search requests will be referred to as topics, and the set of relevant documents will be referred to as qrels. The methodology of using a test collection for evaluating and comparing IR techniques was established in the 1960s, through the Cranfield 2 Test by Cleverdon (1967). Since then, laboratory experiments using test collections have always played a central role in the progress of IR techniques, as they are objective, efficient and repeatable (Voorhees 2002).

Now, in the twenty-first century, evaluation using test collections is still a necessity for most IR researchers. However, the 1990s actually saw a major departure from the original Cranfield 2 experiments, with the advent of the TREC, NTCIR and CLEF evaluation efforts for constructing very large-scale test collections. On the surface, the main difference between the Cranfield 2 test collection and the modern test collections is the document collection size: the Cranfield 2 test collection contained only 1,400 documents, whereas the TREC and NTCIR test collections typically contain between half a million and one million documents. However, a more important difference is that while the small-scale Cranfield 2 collection had complete relevance assessments, the modern test collections do not: it is simply not feasible to examine the documents exhaustively for these large-scale collections.

For creating the qrels, TREC, CLEF and NTCIR all adopt a mechanism called pooling, which works as follows:

  1. Participants submit their “runs” (collections of ranked lists for each topic, where each ranked list usually contains up to 1,000 documents) to the organisers;

  2. For each topic, organisers take the top k (typically 100) documents from some of the submitted runs and obtain a list of unique documents, i.e., a document pool;

  3. For each topic, assessors judge the relevance of all documents within the pool.

Hence, although pooling is a very efficient way of collecting relevant documents, qrels formed through pooling are possibly incomplete. That is, there may exist relevant documents within the document collection that none of the participating systems managed to retrieve and that therefore lie outside the qrels (Voorhees 2002).
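
As an illustration, the depth-k pooling step above can be sketched as follows; this is a minimal sketch assuming each run is represented as a dictionary mapping topic IDs to ranked lists of document IDs (a hypothetical data layout, not the actual TREC submission format):

```python
# A minimal sketch of depth-k pooling. Each run is assumed to be a dict mapping
# topic IDs to ranked lists of document IDs (hypothetical layout).

def build_pools(runs, depth=100):
    """Per topic, take the union of the top-`depth` documents of every run."""
    pools = {}
    for run in runs:
        for topic_id, ranked_docs in run.items():
            pools.setdefault(topic_id, set()).update(ranked_docs[:depth])
    return pools   # {topic_id: set of unique pooled document IDs to be judged}
```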

While collection sizes tend to grow monotonically in order to mimic real-world data such as the Web, the available manpower for relevance assessments remains more or less constant, and therefore test collections are destined to become more and more incomplete. Using an incomplete test collection for IR evaluation raises at least the following concerns:

  (a) Is it possible to reliably compare two participating systems and judge which is superior? Were the test collection less incomplete, would this judgement be the same?

  (b) Is it possible to reliably compare a participating system and a new system that never contributed to the pool? How about two new systems? In short, is the incomplete test collection reusable?

For these reasons, IR evaluation using incomplete relevance assessments is receiving more attention than ever.

One obvious approach to tackling these problems is to devise IR effectiveness metrics that are robust to relevance data incompleteness: We say that an IR metric is robust to incompleteness if system comparison results based on an incomplete set of relevance data are similar to those based on a less incomplete one. This article follows this approach, and more specifically, addresses the issues mentioned in (a) above. Below, we discuss three existing studies that are directly related to the present one. We shall discuss other related work in Sect. 2.

Buckley and Voorhees (2004) proposed an IR evaluation metric called bpref (binary preference), which is highly correlated with Average Precision (AP) when full relevance assessments are available and yet is more robust when the relevance assessments are reduced. Recent TREC tracks have used this metric along with AP. Bpref penalises a system if it ranks a judged nonrelevant document above a judged relevant one, and is independent of how the unjudged documents are retrieved.

More recently, Moffat et al. (2007) introduced an IR evaluation metric called Rank-Biased Precision (RBP) which they claimed is suitable for evaluation with incomplete relevance data. RBP assumes that the probability that the user moves from a document at Rank r to Rank (r + 1) is a constant p, regardless of the relevance (level) of the document at Rank r. As it does not have a recall component, adding more relevant documents to the qrels always increases the RBP score.

Sakai (2007a) reported that applying Q-measure (or simply, Q), AP and normalised Discounted Cumulative Gain (nDCG) to a condensed list, i.e., a ranked list of documents obtained by removing all unjudged documents from the original list, is a simpler and a better solution than bpref for handling relevance data incompleteness. The metrics applied to condensed lists will hereafter be referred to as Q′, AP′ and nDCG′, respectively.

This article compares the robustness of Q(′), AP(′), nDCG(′), bpref and RBP to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. We believe that evaluating IR systems using graded relevance is important for the progress of IR because, if one adheres to IR evaluation based on binary relevance, it would be very difficult for him to devise an IR algorithm that can retrieve highly relevant documents on top of partially relevant ones. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power (Sakai 2006b, 2007b), which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall’s rank correlation (Voorhees 2001), which reflects the overall resemblance of two system rankings according to two different metrics or two different qrels. According to these experiments, Q′, nDCG′ and AP′ are superior to bpref and RBP.

This article generalises a recent study by Sakai (2007a), in that (1) While he used the NTCIR-3 and NTCIR-5 Japanese/Chinese data, we use TREC robust track data and NTCIR-6 Japanese/Chinese data to obtain more general and substantial conclusions; (2) We compare RBP with the other metrics, after discussing some properties of the metrics that immediately follow from their formal definitions.

The remainder of this article is organised as follows. Section 2 provides an overview of related studies. Section 3 defines and discusses the characteristics of AP(′), Q(′), nDCG(′), bpref and RBP. Section 4 describes the TREC and NTCIR data we used for comparing the robustness of these metrics to relevance data incompleteness. Section 5 compares the metrics in terms of discriminative power based on statistical significance tests. Section 6 compares the metrics in terms of Kendall’s rank correlation between the entire system rankings. Finally, Sect. 7 concludes this article.

2 Related work

This section provides an overview of previous work related to the present study.

There are at least two approaches to tackling the relevance data incompleteness problem: One is to try to construct a better test collection more efficiently, and another is to devise or choose reliable IR metrics, given a test collection. Along the first line of research, methods for creating judgment pools efficiently were proposed by Cormack et al. (1998) and by Zobel (1998). Soboroff et al. (2001) proposed a method for ranking systems without any relevance assessments, but subsequently Aslam and Savell (2003) pointed out that the method tends to rank them by “popularity” rather than performance. More recently, Carterette et al. (2006) analysed the distribution of AP over all possible assignments of relevance to all unjudged documents and proposed a method to construct a test collection with minimal relevance assessments; Büttcher et al. (2007) proposed to “expand” existing relevance assessments by treating them as training data for machine learning.

This article takes the latter approach, of choosing reliable IR metrics for handling relevance data incompleteness. The proposals of the aforementioned bpref (Buckley and Voorhees 2004), RBP (Moffat et al. 2007) and Q′, nDCG′ and AP′ (Sakai 2007a) fall into this category. Also along this line, Aslam et al. (2006) and Yilmaz and Aslam (2006) proposed Induced AP, Subcollection AP and Inferred AP. Induced AP is exactly what we call AP′. We do not consider Subcollection AP and Inferred AP in our present study, because (1) While the goal of Yilmaz and Aslam was to estimate the true AP values, ours is not: We prefer to explore different metrics, especially those that can handle graded relevance; (2) Both Subcollection AP and Inferred AP require knowledge of the pooled but unjudged documents, which limits their applicability; (3) According to Bompada et al. (2007), Inferred AP is not as robust as the original nDCG for evaluation with incomplete relevance data.

Another metric proposed for handling incomplete relevance data, called RankEff (Grönqvist 2005), has been examined by Büttcher et al. (2007). However, Sakai (2008b) points out that RankEff is in fact equivalent to an existing variant of bpref called bpref_N, also known as “bpref_allnonrel” implemented in trec_eval, the standard IR evaluation software for TREC. Sakai (2007a) showed both analytically and empirically that bpref_N is not a good evaluation metric. See Sect. 3.2 for more discussions.

Büttcher et al. (2007) also used Precision at l judged documents, which relies on condensed lists just like Q′, AP′ and nDCG′. However, Precision is not a satisfactory metric for us because: (1) It ignores the ranks of retrieved relevant documents; (2) It does not average well, especially with a large document cut-off; (3) With a small document cut-off, it gives unreliable results as systems are evaluated based on a small number of observations, i.e., documents near the top of the ranked list (Sakai 2007f).

De Beer and Moens (2006) proposed rpref, a graded-relevance version of bpref. Sakai (2007a) pointed out that it has minor bugs, and proposed rpref_relative2 by fixing them. However, he reported that it does not have any advantage over Q′, AP′ and nDCG′ despite its complexity.

Sakai (2007c) conducted a study similar to the present one, but focussed on the task of finding one highly relevant document. He showed that the application of Reciprocal Rank, O-measure (Sakai 2006c) and P+-measure (Sakai 2006a, 2007e) to condensed lists is an effective way of handling the relevance data incompleteness problem.

3 Formal definitions of the IR metrics

This section formally defines the IR metrics we consider, namely, Q(′), AP(′), nDCG(′), bpref and RBP, and also discusses their properties that immediately follow from the definitions. Among these metrics, only Q(′), nDCG(′) and RBP can handle graded relevance.

3.1 Q, AP and nDCG

Let \({\mathcal{L}}\) denote a relevance level, and let \(gain({\mathcal{L}})\) denote the gain value for retrieving an \({\mathcal{L}}\)-relevant document for a particular topic. Following the NTCIR tradition (Kando 2007), this article assumes that we have S-relevant (i.e., highly relevant), A-relevant (i.e., relevant) and B-relevant (i.e., partially relevant) documents. Other documents, i.e., judged nonrelevant documents and unjudged documents, are considered nonrelevant and therefore do not carry a gain value. We let gain(S) = 3, gain(A) = 2 and gain(B) = 1 hereafter as metrics such as Q and nDCG are robust to the choice of gain values (Sakai 2007f). As for the TREC data, which only have “highly relevant” and “relevant” documents, we treat the former as S-relevant, and the latter as B-relevant. The latter were treated as B-relevant rather than A-relevant because it has been reported that there are many marginally or partially relevant documents in the TREC qrels: Sormunen (2002) reported that about one half of their TREC qrels were only marginally relevant; Sakai and Sparck Jones (2001) reported that only about 56% of the TREC qrels were highly relevant for a subcollection of the early TREC document sets.

Let \(R({\mathcal{L}})\) denote the number of \({\mathcal{L}}\)-relevant documents, and let \(R=\sum_{{\mathcal{L}}}R({\mathcal{L}}).\) Let \(cg(r)=\sum_{1 \leq i \leq r} g(i)\) denote the cumulative gain at Rank r of the system output, where \(g(i)=gain({\mathcal{L}})\) if the document at Rank i is \({\mathcal{L}}\)-relevant and g(i) = 0 otherwise (i.e., if the document at Rank i is either judged nonrelevant or unjudged). Let \(cg_{I}(r)\) denote the cumulative gain of an ideal ranked output, where an ideal ranked output is one that satisfies g(r) > 0 for 1 ≤ r ≤ R and g(r) ≤ g(r − 1) for r > 1. For NTCIR, for example, listing up all S-relevant documents, followed by all A-relevant documents, followed by all B-relevant documents produces an ideal ranked output. Note that whether “nonrelevant” (i.e., either judged nonrelevant or unjudged) documents are retrieved below these relevant documents does not matter, as the nonrelevant documents do not carry gain values. Moreover, note that several “ideal” ranked outputs can exist in general, since documents can be interchanged within each relevance level.

Let isrel(r) be one if the document at Rank r is relevant and zero otherwise, and let \(count(r)=\sum_{1 \leq i \leq r}isrel(i).\) Clearly, precision at Rank r is given by \(P(r)=count(r)/r\).

We first define Q-measure:

$$ Q{\text{-measure}}= \frac{1}{R} \sum_{r}isrel(r)BR(r) $$
(1)
$$ BR(r)= \frac{\beta cg(r)+count(r)}{\beta cg_{I}(r) + r} $$
(2)

where BR(r) is called the blended ratio and β is a persistence parameter. Because BR(r) has an r in the denominator (just like P(r)), Q-measure is guaranteed to become smaller as a relevant document goes down the ranked list. A large β (e.g., β = 100) alleviates this effect, and makes Q-measure more forgiving for relevant documents near the bottom of the ranked list. Conversely, a small β (e.g., β = 1) imposes more penalty. Sakai (2007d) showed empirically that β = 1 and β = 10 are good choices, so we take β = 1 throughout this article. Note also that β = 0 reduces Q-measure to AP:

$$ AP= \frac{1}{R} \sum_{r}isrel(r)\frac{count(r)}{r}=\sum_{r}isrel(r)P(r). $$
(3)
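
For concreteness, the following sketch computes AP and Q-measure directly from Eqs. 1–3; it assumes a ranked list of document IDs and a qrels dictionary mapping judged documents to relevance levels (hypothetical data structures), and is not the implementation used in our experiments:

```python
# A minimal sketch of AP and Q-measure (Eqs. 1-3). `ranked` is a list of document
# IDs; `qrels` maps each judged document to a relevance level ('S', 'A', 'B', or
# anything else for judged nonrelevant).

GAIN = {'S': 3, 'A': 2, 'B': 1}   # gain values used throughout this article

def ap_and_q(ranked, qrels, beta=1.0):
    ideal = sorted((GAIN[lvl] for lvl in qrels.values() if lvl in GAIN), reverse=True)
    R = len(ideal)                                    # number of relevant documents
    ap = q = cg = cg_i = 0.0
    count = 0
    for r, doc in enumerate(ranked, start=1):
        g = GAIN.get(qrels.get(doc), 0)               # unjudged/nonrelevant -> gain 0
        cg += g                                       # cg(r)
        cg_i += ideal[r - 1] if r <= R else 0         # cg_I(r)
        if g > 0:                                     # isrel(r) = 1
            count += 1                                # count(r)
            ap += count / r                           # precision at Rank r
            q += (beta * cg + count) / (beta * cg_i + r)   # blended ratio BR(r)
    return ap / R, q / R
```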

For a given logarithm base a, let the discounted gain at Rank r be \(dg(r)=g(r)/\log_{a}(r)\) for r > a and \(dg(r)=g(r)\) for r ≤ a. Similarly, let \(dg_{I}(r)\) denote the discounted gain for an ideal ranked list. nDCG at document cut-off l is defined as:

$$ nDCG_{l}= \sum_{1 \leq r \leq l}dg(r)/\sum_{1 \leq r \leq l}dg_{I}(r). $$
(4)

Throughout this article, we let l = 1,000 as it has been reported that nDCG with a small document cut-off is unreliable (Sakai 2007f). Moreover, we let a = 2 because it has been reported that nDCG with a large logarithm base is counterintuitive and lacks discriminative power (Sakai 2007d), despite the fact that this parameter was designed to reflect persistence just like RBP’s p and Q-measure’s β. We shall come back to this issue in Sect. 3.4.
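
A corresponding sketch of nDCG_l under these settings (a = 2, l = 1,000), again assuming the same hypothetical ranked-list and qrels structures, is:

```python
# A minimal sketch of nDCG_l (Eq. 4) with logarithm base a = 2, following the
# original definition in which no discount is applied at ranks r <= a.
import math

GAIN = {'S': 3, 'A': 2, 'B': 1}   # same gain values as above

def ndcg(ranked, qrels, l=1000, a=2):
    def dcg(gains):
        return sum(g if r <= a else g / math.log(r, a)
                   for r, g in enumerate(gains, start=1))
    run_gains = [GAIN.get(qrels.get(doc), 0) for doc in ranked[:l]]
    ideal_gains = sorted((GAIN[lvl] for lvl in qrels.values() if lvl in GAIN),
                         reverse=True)[:l]
    return dcg(run_gains) / dcg(ideal_gains)
```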

3.2 Q′, AP′, nDCG′ and bpref

Sakai (2007a) reported that Q′, AP′ and nDCG′ are simpler and better solutions to the problem of evaluating IR systems with incomplete relevance data than bpref (Buckley and Voorhees 2004). Recall that these represent the application of Q, AP and nDCG to condensed lists, respectively.
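
A condensed list is straightforward to construct; the sketch below assumes, as before, a ranked list of document IDs and a qrels dictionary that contains every judged document, relevant or not:

```python
# A minimal sketch of the condensed-list construction: every unjudged document is
# removed from the original ranked list. Q', AP' and nDCG' are then simply Q, AP
# and nDCG computed on this condensed list.

def condense(ranked, qrels):
    return [doc for doc in ranked if doc in qrels]
```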

Let r′ denote the rank of a document in a condensed list, whose original rank was r(≥r′). Let N denote the number of judged nonrelevant documents. Then bpref can be expressed as follows (Sakai 2007a):

$$ bpref = \frac{1}{R}\sum_{r^{\prime}}isrel(r^{\prime})\left(1-\frac{\min(R,r^{\prime}-count(r^{\prime}))}{\min(R,N)}\right) $$
(5)

where r′ − count(r′) is the number of judged nonrelevant documents ranked above the relevant one at Rank r′, or the misplacement penalty with respect to this particular relevant document. Clearly, for any topic such that R ≤ N, bpref reduces to:

$$ bpref\_R = \frac{1}{R}\sum_{r^{\prime}}isrel(r^{\prime}) \left(1-\frac{\min(R,r^{\prime}-count(r^{\prime}))}{R}\right). $$
(6)

In fact, R ≤ N holds for all topics used in our experiments (See also Table 2), so bpref is always bpref_R in our study.

Sakai (2007a) pointed out that the only essential difference between AP′ and bpref is that, while the former uses r′ for scaling each misplacement penalty r′ − count(r′), the latter uses a constant (e.g., R). Compare Eq. 6 with

$$ AP^{\prime} =\frac{1}{R}\sum_{r^{\prime}}isrel(r^{\prime}) \frac{count(r^{\prime})}{r^{\prime}} $$
(7)
$$ =\frac{1}{R}\sum_{r^{\prime}}isrel(r^{\prime}) \left(1-\frac{r^{\prime}-count(r^{\prime})}{r^{\prime}}\right).$$
(8)

Scaling by a constant is generally not good, especially if the constant is large, because this means that the misplacement penalties with respect to the top ranked relevant documents are virtually ignored (Sakai 2007a). For example, suppose that there is a condensed list that has a judged nonrelevant document at Rank 1 and a relevant document at Rank 2. For this document at rank r′ = 2, the misplacement penalty is \(r^{\prime}-count(r^{\prime})=2-1=1\), and \(P(r^{\prime})=1/2\). Thus, the existence of the judged nonrelevant document at Rank 1 weighs heavily in the case of \(\hbox{AP}^{\prime}\). In contrast, this has very little impact on bpref, because the misplacement penalty is divided not by r′ = 2 but by a large number, namely, R or N. In other words, bpref lacks the “top heaviness” of AP′, which is one of the main strengths of the original AP. It is clear that bpref_N (Sakai 2007a), which always uses N for scaling the misplacement penalty, suffers severely from this problem, as N is generally a very large number: See, for example, Table 2 which we shall discuss later. Sakai (2007a) showed experimentally that bpref_N indeed performs very poorly.
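
The following toy computation illustrates this contrast, applying Eqs. 5 and 7 to the two-document condensed list described above, with hypothetical values R = 20 and N = 200:

```python
# A toy check of the argument above: a condensed list with a judged nonrelevant
# document at Rank 1 and a single relevant document at Rank 2, for a hypothetical
# topic with R = 20 relevant and N = 200 judged nonrelevant documents.

def bpref(condensed, relevant, R, N):
    score, count = 0.0, 0
    for r, doc in enumerate(condensed, start=1):
        if doc in relevant:
            count += 1
            score += 1 - min(R, r - count) / min(R, N)   # Eq. 5
    return score / R

def ap_prime(condensed, relevant, R):
    score, count = 0.0, 0
    for r, doc in enumerate(condensed, start=1):
        if doc in relevant:
            count += 1
            score += count / r                           # Eq. 7
    return score / R

condensed, relevant = ['n1', 'd1'], {'d1'}
print(bpref(condensed, relevant, R=20, N=200))  # 0.0475: the misplacement penalty of 1
                                                # is divided by R = 20, so d1 keeps 95%
                                                # of its credit despite n1 ranked above it
print(ap_prime(condensed, relevant, R=20))      # 0.0250: the penalty is divided by
                                                # r' = 2, so d1 earns only half its credit
```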

It should be noted that AP′ is actually implemented in trec_eval. However, it appears that it was never properly examined until Yilmaz and Aslam (2006) and Sakai (2007a) rediscovered it.

3.3 RBP

We now formally define RBP (Moffat et al. 2007; Moffat and Zobel 2008). Let \({\mathcal{H}}\) denote the highest relevance level across all topics. RBP can be expressed as follows:

$$ RBP = \frac{1-p}{gain({\mathcal{H}})}\sum_{r}g(r)p^{r-1} $$
(9)

where p(≤1) is a persistence parameter. A high value of p represents a persistent user; a low value represents an impatient one. As Moffat and Zobel (2008) explored p = 0.5, 0.8, 0.95, we start our own experiments with the same values, denoting each version of RBP by RBP.5, RBP.8 and RBP.95. In all of our experiments, we let \(gain({\mathcal{H}})=gain(S)=3.\) Recall that our NTCIR data have S-, A- and B-relevant documents, but our TREC data have S- and B-relevant documents only.

The assumption behind RBP is that the user, after examining the document at Rank r, will examine the document at Rank (r + 1) with probability p or stop scanning the ranked list with probability 1 − p. Thus the model assumes that the transition probability is independent of the relevance of the document at Rank r. Whether this assumption is realistic or not is debatable, but this does make RBP easy to interpret and to compute. Moreover, Moffat et al. (2007) argue that RBP is suitable for evaluation with incomplete relevance data as it is guaranteed to increase as more relevance judgments are added (since it does not have a recall component) and the error due to unjudged documents can be quantified.
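
A minimal sketch of Eq. 9, assuming the same hypothetical ranked-list and qrels structures as before and \(gain({\mathcal{H}})=gain(S)=3\) as in our experiments, is:

```python
# A minimal sketch of graded-relevance RBP (Eq. 9), with gain(H) = gain('S') = 3.

GAIN = {'S': 3, 'A': 2, 'B': 1}

def rbp(ranked, qrels, p=0.95, gain_h=3):
    total = sum(GAIN.get(qrels.get(doc), 0) * p ** (r - 1)
                for r, doc in enumerate(ranked, start=1))
    return (1 - p) / gain_h * total
```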

However, we can discuss RBP’s possible weaknesses. Firstly, RBP may give a very low score even to an ideal ranked output: indeed, the fact that it does not rely on recall implies that it denies the very existence of an “ideal” ranked output. From Eq. 9, it is clear that the RBP for an ideal ranked list in a binary relevance environment equals \((1-p)\sum_{r=1}^{R}p^{r-1}.\) Table 1 shows the RBP value for an ideal ranked output for p = 0.5, 0.8, 0.95 and R = 1, 10, 100, 1,000. When p = 0.95, for example, an ideal ranked output for a topic with R = 10 receives an RBP of .4013, while one for a topic with R = 100 receives .9941. Whether it is good to average such a measurement across topics is debatable, but it is at least a fact that topics with many relevant documents can have a far larger impact on Mean RBP than those with few relevant ones.

Table 1 Values of RBP for an ideal ranked output

Moreover, Table 1 shows the extreme cases where R = 1: it can be observed that the RBP of an ideal ranked output (i.e., one that has the only relevant document at Rank 1) can range from 0.05 (p = 0.95) to 0.5 (p = 0.5), since RBP in this case equals 1 − p. Thus the user’s persistence, i.e., the probability of moving from the document at Rank r to that at Rank (r + 1), influences the effectiveness value of the same ranked output quite drastically, even though only the document at Rank 1 is being examined. Whether this is a desirable feature for an IR metric is also debatable. In contrast, Q, nDCG and AP are by definition guaranteed to be one whenever the system output is an ideal ranked output regardless of the value of R, since they are based on comparing the system output with the ideal one.
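
Since \((1-p)\sum_{r=1}^{R}p^{r-1}=1-p^{R}\), the ideal-output values quoted above can be checked with a one-line computation:

```python
# Binary-relevance RBP of an ideal ranked output has the closed form 1 - p**R.
for p, R in [(0.95, 10), (0.95, 100), (0.95, 1), (0.5, 1)]:
    print(f"p={p}, R={R}: {1 - p ** R:.4f}")   # 0.4013, 0.9941, 0.0500, 0.5000
```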

We further argue that depending on recall is not necessarily bad. The real user may have some idea of the number of relevant documents, due to his background knowledge, or if not, by looking at the total number of hits shown in the IR interface. Moreover, even if this is not the case, a good IR performance metric is not necessarily one that closely mimics “user satisfaction.” For example, a user may be very satisfied with the ranked output, having found a decent document, but he may have missed ten other documents that are in fact more relevant than the one he has found. That is, the user may be happy, just because he is ignorant. From a conscientious system developer’s point of view, if he knows that there are ten relevant documents that should be retrieved, then he would design a system that can retrieve as many of them as possible rather than a system that makes the user “happy” by showing just one relevant document and hiding the other relevant ones completely. Hence Q and AP depend directly on R, the number of judged relevant documents, and even nDCG depends on it indirectly, as it relies on an ideal ranked output which lists up all relevant documents.

3.4 Top-heaviness of RBP, AP, Q and nDCG

Figure 1 compares the “top-heaviness” of RBP, AP, Q and nDCG, by considering a ranked output that contains exactly one relevant document, and making it move from Rank 1 to Rank 20. The graph at the top shows the situation when R = 10, and the one at the bottom shows the situation when R = 100, both under a binary relevance environment. Note that the three RBP curves are not affected by the value of R. From the figure, it can be observed that RBP.5 is probably too top-heavy: it basically ignores any relevant document retrieved below Rank 10. This makes evaluation very unstable, as we shall see in our experiments in Sect. 5. RBP.8 gives a reasonable “rank bias”, while RBP.95 looks almost like a straight line compared to other metrics such as Q-measure and nDCG.
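
Under this single-relevant-document setup, each metric has a simple closed form that follows from the definitions in Sects. 3.1–3.3 (with β = 1 for Q-measure, a = 2 and l = 1,000 for nDCG, and binary relevance); the following sketch computes the curves of Fig. 1 for a chosen p:

```python
# A minimal sketch reproducing the setup of Fig. 1: a binary-relevance ranked list
# whose only relevant document sits at Rank k, with R relevant documents in total.
import math

def single_relevant_scores(k, R, p=0.8):
    idcg = sum(1.0 if r <= 2 else 1 / math.log2(r) for r in range(1, R + 1))
    return {
        'RBP':  (1 - p) * p ** (k - 1),
        'AP':   1 / (R * k),
        'Q':    2 / (R * (min(k, R) + k)),                # blended ratio with beta = 1
        'nDCG': (1.0 if k <= 2 else 1 / math.log2(k)) / idcg,
    }

for k in range(1, 21):
    s = single_relevant_scores(k, R=10)                   # R = 10: the top graph of Fig. 1
    print(k, {m: round(v, 3) for m, v in s.items()})
```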

Fig. 1 Comparison of “top-heaviness”

In Fig. 1, the top-heaviness curve of AP is almost completely hidden by that of Q-measure, because in a binary relevance environment, Q-measure = AP holds if there is no relevant document below Rank R, while Q-measure > AP holds if there is at least one relevant document below Rank R (Sakai 2006c). Thus the AP curve actually begins to deviate from the Q-measure one at Rank 11 in the graph at the top (where R = 10).

It can also be observed that the top-heaviness curves of nDCG have a minor problem: nDCG with a logarithm base of 2 cannot distinguish between a system that has a relevant document at Rank 1 and one that has a relevant document at Rank 2. This is because, according to the original definition of nDCG (which we stick to), gain discounting is not applied to ranks r ≤ a (= 2). This is precisely why using a large a with nDCG is no good (Sakai 2007d): it makes the top-heaviness curve even flatter. It should also be noted that the top-heaviness curve for nCG (Järvelin and Kekäläinen 2002; Kekäläinen 2005), the undiscounted version of nDCG, is a completely flat line. That is, to nCG, it does not matter at all at which rank the relevant document is found. This explains why nCG performs very poorly, and the same goes for Precision at l (Sakai 2007f).

To sum up, the IR metrics we consider in this study all have a mechanism, each in its own way, of penalising relevant documents found near the bottom of the ranked list. But the graphs suggest that using p = 0.5 for RBP may not be good for reliable evaluation. This we will verify in our experiments described below.

4 Full and reduced data

Table 2 provides some statistics of the TREC and NTCIR data we used for evaluating the robustness of the IR metrics to incomplete relevance assessments. We chose these data sets as we wanted “ad hoc” test collections with graded relevance data. The “TREC03” and “TREC04” data are from the TREC 2003 and 2004 robust track (Voorhees 2004, 2005), and the “NTCIR-6J” and “NTCIR-6C” data are from the NTCIR-6 crosslingual track (Kando 2007). The TREC runs are English monolingual runs, and the NTCIR-6J runs include both monolingual and crosslingual runs for the Japanese document retrieval subtask. Similarly, the NTCIR-6C runs include both monolingual and crosslingual runs for the Chinese document retrieval subtask.

Table 2 TREC and NTCIR data used in our experiments

For conducting our discriminative power experiments described in Sect. 5, we randomly selected one run from each participating team. Thus, with the TREC03 data, for example, we used 16 runs, which yields 16*15/2 = 120 combinations of teams for significance testing. Figure 2 shows the distribution of AP values over the runs thus selected. For computing Kendall’s rank correlation, we wanted more runs, and we also wanted the same number of runs across all four data sets. We therefore randomly sampled 30 runs from each data set, disregarding which team each run comes from.

Fig. 2 AP values of the participating teams

To examine the effect of relevance data incompleteness on the IR metrics, we created reduced relevance data from the full relevance data, following the original methodology by Buckley and Voorhees (2004): First, for each topic, we created a randomised list of judged relevant documents of size R, and a separate randomised list of judged nonrelevant documents of size N. Then, for each reduction rate j ∈ {90, 70, 50, 30, 10}, we created a reduced set of relevance data by taking the first \(R_{j}\) and \(N_{j}\) documents from the two lists, respectively, where \(R_{j}=\max(1, truncate(R*j/100))\) and \(N_{j}=\max(10, truncate(N*j/100))\). The constants 1 and 10 have been copied from Buckley and Voorhees (2004), representing the minimum number of judged relevant and nonrelevant documents required for a topic, respectively. In practice, the constant 10 was seldom used since N was generally very large. This stratified sampling is essentially equivalent to random sampling from the entire set of judged documents (Yilmaz and Aslam 2006).
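
A per-topic sketch of this reduction procedure, assuming the judged relevant and judged nonrelevant document IDs are available as two lists, is:

```python
# A minimal sketch of the Buckley-Voorhees qrels reduction used here, applied per
# topic. `relevant` and `nonrelevant` are lists of judged document IDs; j is the
# reduction rate in {90, 70, 50, 30, 10}.
import random

def reduce_qrels(relevant, nonrelevant, j, rng=random):
    rel, nonrel = list(relevant), list(nonrelevant)
    rng.shuffle(rel)
    rng.shuffle(nonrel)
    R_j = max(1, len(rel) * j // 100)       # truncate(R*j/100), at least 1 relevant
    N_j = max(10, len(nonrel) * j // 100)   # truncate(N*j/100), at least 10 nonrelevant
    return rel[:R_j], nonrel[:N_j]
```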

The above method of random sampling from the original qrels may be criticised: Possibly, a better method of studying the effect of incompleteness would be to use the actual pools for each topic, and vary the pool depth. However, we prefer to be faithful to the methodology by Buckley and Voorhees as one of the main goals of this study is to contrast their claims regarding bpref (Buckley and Voorhees 2004) with our new findings. We will report on our “shallow pool” experiments elsewhere (Sakai 2008a).

Figures 3 and 4 show the effect of relevance data reduction on the absolute overall performances (e.g., Mean AP) averaged across all 30 runs for each data set. The horizontal axis represents the reduction rate j. It is clear that the values of the metrics based on the original ranked lists, i.e., AP, Q, nDCG and RBP, quickly diminish as the relevance data becomes more and more incomplete. This is not necessarily a flaw: RBP has been designed to behave this way. In contrast, the bpref_R (i.e., bpref) curve is relatively flat, which supports what Buckley and Voorhees (2004) reported. However, it is also clear that the Q′, AP′ and \(\hbox{nDCG}^{\prime}\) curves are just as flat as the bpref one.

Fig. 3 Reduction rate (x-axis) versus absolute performance values averaged over 30 runs (y-axis)—TREC

Fig. 4 Reduction rate (x-axis) versus absolute performance values averaged over 30 runs (y-axis)—NTCIR

5 Discriminative power

This section compares the robustness of IR metrics to incomplete relevance assessments in terms of discriminative power using Sakai’s Bootstrap Sensitivity Method (Sakai 2006b, 2007b). The inputs to this method are a test collection, a set of runs, an IR metric, and the significance level α for bootstrap hypothesis tests (Efron and Tibshirani 1993). Using resampled topic sets, the method conducts a paired bootstrap hypothesis test for every system pair, and computes the discriminative power, i.e., for how many system pairs the IR metric was able to detect a significant difference, and the estimated overall performance difference required to achieve that significance. For this purpose, one thousand bootstrap samples of topics were created for each data set, by sampling with replacement from the original topic set. Details of Sakai’s Bootstrap Sensitivity Method can be found elsewhere (Sakai 2006b, 2007b).
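
For readers unfamiliar with the approach, the following is a rough sketch of a paired bootstrap hypothesis test over per-topic score differences; it conveys the flavour of the test but is not the full Bootstrap Sensitivity Method, which additionally estimates the performance difference required for significance (Sakai 2006b, 2007b):

```python
# A rough sketch of a paired bootstrap hypothesis test over per-topic differences.
import random
import statistics

def paired_bootstrap_asl(scores_x, scores_y, B=1000, rng=random):
    diffs = [x - y for x, y in zip(scores_x, scores_y)]   # per-topic score differences
    observed = statistics.mean(diffs)
    centred = [d - observed for d in diffs]               # impose the null hypothesis
    extreme = sum(1 for _ in range(B)
                  if abs(statistics.mean(rng.choices(centred, k=len(centred))))
                  >= abs(observed))
    return extreme / B   # achieved significance level, to be compared against alpha

# Discriminative power at alpha is then the fraction of run pairs whose ASL < alpha.
```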

It should be stressed that we are comparing the discriminative power of different IR metrics while holding α constant in a common experimental environment, where α is the probability of Type I Error, representing the chance of concluding that two systems are different even though the truth is that they are equivalent. In other words, IR metrics with high discriminative power are those that can reliably detect meaningful differences between two systems. The Bootstrap Sensitivity Method is closely related to the swap method proposed by Voorhees and Buckley (2002), which compares two systems using two different topic sets and examines whether the two outcomes are the same. Unlike the Bootstrap Sensitivity Method, the swap method lacks a theoretical foundation and is not directly related to statistical significance tests. However, Sakai (2006b, 2007b) reported that the two methods yield very similar results when used for comparing different IR metrics.

Table 3 compares the discriminative power of Q(′), AP(′), nDCG(′), bpref_R and RBP with the original 100% relevance data. For example, Table 3(a) shows that Q-measure manages to detect a statistical significance for 80 pairs out of 120 (16*15/2) combinations of teams at α = 0.05, and that a difference of around 0.07 is required in order to achieve significance given 50 topics.

Table 3 Discriminative power at α = 0.05 with 100% qrels

We first summarise Table 3 in words:

  • For TREC03 and TREC04, Q(′), AP(′), nDCG(′) and bpref_R are more discriminative than RBP.

  • For NTCIR-6J, Q(′), AP(′) and nDCG(′) are more discriminative than bpref_R and RBP.

  • For NTCIR-6C, Q(′), AP(′), nDCG(′), bpref_R and RBP.95 are more discriminative than RBP.8 and RBP.5.

  • To sum up, the overall winners given 100% relevance data are Q(′), AP(′) and nDCG(′).

It is clear from Table 3 that small values of p for RBP hurt discriminative power. This is probably because a small p makes RBP too top-heavy: as we have seen in Fig. 1, using p = 0.5 implies that IR systems are more or less evaluated based on the top 10 documents only, which makes evaluation very unreliable (Sakai 2007f). For this reason, we drop RBP.5 from our experiments henceforth.

Figures 5 and 6 show the effect of relevance data reduction on discriminative power for AP(′), Q(′), nDCG(′), bpref_R, RBP.95 and RBP.8. The results are similar to those reported by Sakai (2007a), who used four data sets from NTCIR-3 and NTCIR-5. Table 4 is similar to Table 3 but uses the 10% relevance data, thus representing the “tails” of the curves. We summarise Figs. 5, 6 and Table 4 in words:

  • For TREC03 and TREC04, \(\hbox{Q}^{\prime}\), \(\hbox{AP}^{\prime}\) and \(\hbox{nDCG}^{\prime}\) are more robust than other metrics to incomplete relevance assessments. The original nDCG does well for TREC04 but not for TREC03.

  • Similarly, for NTCIR-6J and NTCIR-6C, \(\hbox{Q}^{\prime}\), \(\hbox{AP}^{\prime}\), \(\hbox{nDCG}^{\prime}\) and nDCG are the most robust. (Bpref_R appears to do well for NTCIR-6C, but it has a problem, as we shall discuss later using Table 5.)

  • RBP.95, AP and RBP.8 are at the bottom of the list, exactly in this order for all four data sets.

  • To sum up, the overall winners in terms of robustness to incomplete relevance assessments are \(\hbox{Q}^{\prime}\), \(\hbox{AP}^{\prime}\) and \(\hbox{nDCG}^{\prime}\). AP and RBP clearly lack the robustness. nDCG, Q and bpref_R lie in the middle.

Fig. 5 Reduction rate (x-axis) versus discriminative power at α = 0.05 (y-axis)—TREC

Fig. 6 Reduction rate (x-axis) versus discriminative power at α = 0.05 (y-axis)—NTCIR

Table 4 Discriminative power at α = 0.05 with 10% qrels
Table 5 Number of significant differences detected with 10% qrels but not with 100% qrels: (i) #significant; (ii) #inconsistent; (iii) percentage

The above analysis was based on the number of statistically significant differences detected given incomplete relevance data. The basic assumption here is that the set of significantly different pairs at j% reduction rate is basically a subset of the one obtained with the full relevance data. However, it might be the case that most of these conclusions at j% reduction rate are in fact inconsistent with the original conclusions based on the 100% relevance data. We thus provide an additional analysis in Table 5, which is similar in spirit to the “accuracy” of Bompada et al. (2007). The table compares, for each metric, the set of significantly different pairs at 10% reduction rate with that obtained with the full relevance data. For example, Table 5(a) shows that, for TREC03, AP detected a statistical significance for 13 cases with the 10% relevance data, but two of them (15%) are not among the set of cases detected by AP with the 100% relevance data. Assuming that the conclusions with the 100% relevance data are the ground truth, the numbers presented in the table represent “errors”. As can be seen, the number of errors is generally small, supporting the aforementioned assumption. Bpref_R, however, appears to be quite unreliable from this viewpoint as well: For example, Table 5(d) shows that as many as 10 cases out of the 39 significant differences detected by bpref_R at 10% reduction rate (See also Table 4(d)) are inconsistent with the original bpref_R results. This, again, is not good news for bpref.

6 Rank correlation

The previous section compared the discriminative power of IR metrics, by focussing on the differences between individual pairs of runs. We now compare the entire system rankings according to two different IR metrics given the original relevance data, as well as two different qrels, namely, the original one and a reduced one.

Table 6 shows the Kendall’s rank correlation values (Voorhees 2001) between each pair of metrics given the original relevance data. As mentioned earlier, we randomly sampled 30 runs from each data set for computing the values: With 30 runs, the correlation is statistically significant if it is over 0.34 (Sakai 2006b); values over 0.9 are shown in bold to indicate high correlations. Note that Kendall’s rank correlation is 1 for a pair of identical rankings and −1 if one ranking is a perfect inverse of the other. It can be observed that while the system rankings by AP(′), Q(′), nDCG(′) and bpref_R can be quite similar given the full relevance data, the RBP rankings can be quite different. This alone is not necessarily a flaw: It just means that RBP is measuring something substantially different from the other metrics. Recall that RBP disregards recall.
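
As a sketch, the rank correlation between two system rankings can be computed from the per-run mean scores, for example with scipy’s kendalltau (which computes the tie-aware tau-b; with no tied scores this coincides with the usual definition):

```python
# A minimal sketch of Kendall's rank correlation between two system rankings.
# `scores_a` and `scores_b` are assumed to hold the mean scores of the same 30 runs
# under two metrics (or under the full and a reduced qrels set).
from scipy.stats import kendalltau

def ranking_correlation(scores_a, scores_b):
    tau, _ = kendalltau(scores_a, scores_b)   # 1 for identical rankings, -1 for reversed
    return tau
```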

Table 6 Kendall’s rank correlation between different metrics, given 100% qrels

Figures 7 and 8 show the effect of relevance data reduction on the system ranking for each metric: Thus, the AP ranking at X% reduction rate is compared with the original AP ranking, and so on. Table 7 summarises the figures by sorting the metrics by Kendall’s rank correlation at 10% reduction rate. Figures 7, 8 and Table 7 show that:

  • Q′, AP′ and nDCG′ are consistently among the most robust metrics in terms of system ranking stability. Bpref_R does well for TREC04.

  • As Figs. 7 and 8 show, the system rankings by AP and RBP.8 collapse as relevance data is reduced. RBP.95 is also not very good: at 30% reduction rate, its Kendall’s rank correlation with the original ranking is as low as that of AP for TREC04 and for NTCIR-6J; it performs as poorly as RBP.8 for NTCIR-6C.

  • To sum up, \(\hbox{Q}^{\prime}\), AP′ and nDCG′ are again the overall winners, and the advantage of introducing a new metric like bpref is not clear in terms of system ranking stability either. RBP is not as good as Q′, AP′ and nDCG′ even with p = 0.95. Again, nDCG, Q and bpref_R lie in the middle.

Fig. 7 Reduction rate (x-axis) versus Kendall’s rank correlation with the 100%-qrels ranking (y-axis)—TREC

Fig. 8 Reduction rate (x-axis) versus Kendall’s rank correlation with the 100%-qrels ranking (y-axis)—NTCIR

Table 7 Kendall’s rank correlation: 100% vs 10% qrels for each metric

7 Conclusions

This article compared the robustness of IR metrics to incomplete relevance data, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Our discriminative power experiments and rank correlation experiments agreed that Q′, AP′ and \(\hbox{nDCG}^{\prime}\), the application of Q, AP and nDCG to condensed lists, are more robust than other metrics to relevance data incompleteness; that AP and RBP lack the robustness; and that nDCG, Q and bpref_R lie in the middle. As these results hold across two different evaluation efforts, namely TREC and NTCIR, we believe that these findings are general. It is also interesting that \(\hbox{Q}^{\prime}\), \(\hbox{nDCG}^{\prime}\) and AP′ are comparable to one another in terms of robustness to incomplete relevance data, even though Q and nDCG are clearly superior to AP. In other words, the advantage of using graded relevance seems to disappear when condensed lists are used with very incomplete relevance data.

Our TREC03, TREC04 and NTCIR-6 results, together with the NTCIR-3 and NTCIR-5 results reported by Sakai (2007a), provide ample evidence that Q′, AP′ and nDCG′ are not only simpler than but also superior to bpref, at least in terms of discriminative power and system ranking stability. Although we have no intention of claiming that Q′, AP′ and nDCG′ are the perfect solution to the problem of relevance data incompleteness, we believe that they are more elegant than introducing metrics like bpref and bpref_N (i.e., RankEff) that lack the “top-heaviness” property of AP by definition.

Even though Moffat et al. (2007) claimed that RBP is suitable for evaluation with incomplete relevance data as its error due to unjudged documents can be quantified, we demonstrated that it has weaknesses. While RBP is interesting in that it is independent of recall, because of this very feature, it often does not equal one even for an ideal ranked output. For example, as we have discussed using Table 1, an ideal output for a topic with 10 (regular) relevant documents may receive an RBP of .4013, while an ideal output for a topic with 100 (regular) relevant documents may receive an RBP of .9941. Whether it is good to average such a measurement across topics is debatable. Moreover, our experimental results showed that small values of p make RBP unreliable, and that RBP is not as robust to incomplete relevance data as Q′, AP′ and \(\hbox{nDCG}^{\prime}\) in terms of discriminative power and system ranking stability, even with p = 0.95.

The fact that Q′, AP′ and \(\hbox{nDCG}^{\prime}\) perform clearly and consistently better than the original Q, AP and nDCG in an incomplete relevance environment implies the following: The assumption that all unjudged documents are nonrelevant is not good; It is much better to treat all unjudged documents as if they never existed, in order to let judged relevant and judged nonrelevant documents move up the ranks and hence serve as stronger pieces of evidence for computing system effectiveness.

It should be recalled, however, that we used random sampling from the original qrels in order to artificially create very incomplete test collections. We shall discuss the effect of using shallow pools and that of using fewer participating teams for forming relevance assessments elsewhere (Sakai 2008a, 2008b). Moreover, although we examined the IR metrics in terms of discriminative power and Kendall’s rank correlation, there may be other criteria for choosing “good” metrics. “Simplicity” and “intuitiveness” are but a few examples, although they are difficult to quantify. Establishing a standard set of criteria for metric selection is an important goal of our future research.

As we mentioned in Sect. 2, our present study takes the approach of choosing IR metrics given a test collection with incomplete relevance data. However, the approach of constructing reliable test collections efficiently (e.g., work by Carterette et al. 2006) is equally important, and combining these two approaches is probably even more so. That is, IR metrics should perhaps be designed by taking the process of test collection construction into account. This is another research topic that needs to be explored.