1 Introduction

Many modern tracks and tasks at TREC, NTCIR, CLEF and other information retrieval (IR) and information access (IA) evaluation forums inherit the basic idea of “ideal” test collections, proposed some 40 years ago by Jones and Van Rijsbergen (1975), in the form of pooling for relevance assessments. On the other hand, our modern test collections have somehow deviated substantially from their original plans in terms of the number of topics we prepare (i.e., topic set size). According to Jones and Van Rijsbergen, fewer than 75 topics “are of no real value”; 250 topics “are minimally acceptable”; and more than 1000 topics “are needed for some purposes” because “real collections are large”; “statistically significant results are desirable” and “scaling up must be studied” (Jones and Van Rijsbergen 1975, p. 7). In 1979, in a report that considered the number of relevance assessments required from a statistical viewpoint, Gilbert and Jones remarked: “Since there is some doubt about the feasibility of getting 1000 requests, or the convenience of such a large set for future experiments, we consider 500 requests” (Gilbert 1979, p. C4). This is in sharp contrast to our current practice of having 50–100 topics in an IR test collection. Exceptions include the TREC Million Query track, which constructed over 1800 topics with relevance assessments by employing the minimal test collection and statAP methods (Carterette et al. 2008). However, such studies are indeed exceptions: the traditional pooling approach is still the mainstream in the IR community.

In 2009, Voorhees conducted an experiment where she randomly split 100 TREC topics in half to count discrepancies in statistically significant results, and concluded that “Fifty-topic sets are clearly too small to have confidence in a conclusion when using a measure as unstable as P(10).Footnote 1 Even for stable measures, researchers should remain skeptical of conclusions demonstrated on only a single test collection” (Voorhees 2009, p. 807). Unfortunately, there has been no clear guiding principle for determining the required number of topics for a new test collection.

The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements.Footnote 2 We employ Nagata’s three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals (CIs), respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While Sakai (2014a, e) incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported by Sakai (2014a, e). Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai (2014a, b, e): as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.

The remainder of this paper is organised as follows. Section 2 discusses prior art related to the present study. Section 3 describes the sample size design theory of Nagata (2003) as well as the associated Excel tools that we have made publicly available online,Footnote 3 and methods for estimating the within-system population variance for a particular evaluation measure. Section 4 describes six TREC test collections and runs used in our analyses, and Sect. 5 describes the evaluation measures considered. The topic-by-run matrices for all of the data sets and evaluation measures used in this study are also available onlineFootnote 4; using our Excel tools, score matrices, and the variance estimates reported in this paper, other researchers can easily reproduce our results. Section 6 reports on our topic set size design results, and Sect. 7 concludes this paper and discusses future work.

2 Prior art

2.1 Effect sizes, statistical power, and confidence intervals

In the context of comparative experiments in IR, the p value is the probability of observing a between-system difference at least as extreme as the one actually observed, under a null hypothesis distribution. When it is smaller than a predefined significance criterion \(\alpha\), then we have observed a difference that is extremely rare under the null hypothesis (i.e., the assumption that the systems are equivalent), and therefore conclude that the null hypothesis is probably incorrect. Here, \(\alpha\) is the Type I error probability, i.e., the probability of detecting a difference that is not real. This much is often discussed in the IR community.

Unfortunately, effect sizes and statistical power have not enjoyed the same attention in studies based on test collections, with a small number of exceptions (e.g., Carterette and Smucker 2007; Webber et al. 2008b; Nelson 1998).Footnote 5 A small p value could mean either a large effect size (i.e., how large the actual difference is, measured for example in standard deviation units), or a large sample size (i.e., we simply have a lot of topics) (Ellis 2010; Nagata 2003; Sakai 2014d). For example, suppose we have per-topic performance scores in terms of some evaluation measure M for systems X and Y with n topics \((x_{1},\ldots,x_{n})\) and \((y_{1},\ldots,y_{n})\) and hence per-topic score differences \((d_{1},\ldots,d_{n})=(x_{1}-y_{1},\ldots,x_{n}-y_{n})\). For these score differences, the sample mean is given by \(\bar{d}=\sum _{j=1}^{n}d_{j}/n\) and the sample variance is given byFootnote 6 \(V=\sum _{j=1}^{n}(d_{j}-\bar{d})^2/(n-1)\).

Consider the test statistic \(t_{0}\) for a paired t test:

$$t_{0} = {\frac{\bar{d}}{\sqrt{V/n}}}={\sqrt{n}}{\frac{\bar{d}}{\sqrt{V}}}.$$
(1)

It is clear that if \(t_{0}\) is large and therefore the p value is small, this is either because the sample effect size \(\bar{d}/{\sqrt{V}}\) is large, or just because n is large. Hence, IR researchers should report effect sizes together with p values to isolate the sample size effect. The same arguments apply to other significance tests such as ANOVA (see Sect. 3.2).
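As a concrete illustration, the following minimal Python sketch computes \(t_{0}\), its two-sided p value and the sample effect size \(\bar{d}/{\sqrt{V}}\) from two vectors of per-topic scores. It assumes NumPy and SciPy are available; the function name is ours and is not part of any released tool.

```python
import numpy as np
from scipy.stats import t

def paired_t_report(x, y):
    """Report t0 (Eq. 1), the two-sided p value, and the sample effect size."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)  # per-topic deltas
    n = len(d)
    effect = d.mean() / d.std(ddof=1)   # \bar{d} / sqrt(V), the sample effect size
    t0 = np.sqrt(n) * effect            # Eq. (1)
    p = 2 * t.sf(abs(t0), n - 1)        # two-sided p value
    return t0, p, effect
```

Reporting the effect size alongside \(t_{0}\) and the p value makes it explicit whether a small p value reflects a large difference or merely a large n.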

The (statistical) power of an experiment is the probability of detecting a difference whenever there actually is one, and is denoted by \(1-\beta\), where \(\beta\) is the Type II error probability, i.e., the probability of missing a real difference. For example, when applying a two-sided paired t test, the probability of rejecting the null hypothesis \(H_{0}\) is given byFootnote 7

$${Pr}\{ t_{0} \le -t_{inv}(\phi; \alpha )\} + {Pr}\{ t_{0} \ge t_{inv}(\phi; \alpha )\}$$
(2)

where \(t_{0}\) is the test statistic computed from the observed data, and \(t_{inv}(\phi; \alpha )\) denotes the two-sided critical t value for probability \(\alpha\) with \(\phi\) degrees of freedom. Under \(H_{0}\) (i.e., the hypothesis that two system means are equal), \(t_{0}\) obeys a t distribution with \(\phi =n-1\) degrees of freedom, where n denotes the topic set size, and Eq. (2) is exactly \(\alpha\). Whereas, under the alternative hypothesis \(H_{1}\) (i.e., that two population system means are not equal), \(t_{0}\) obeys a noncentral t distribution (Cousineau and Laurencelle 2011; Nagata 2003), and Eq. (2) is exactly the power (i.e., \(1-\beta\)). By specifying the required \(\alpha\), \(\beta\) and the minimum effect size for which we want to ensure the power of \(1-\beta\), it is possible to derive the required topic set size n. Furthermore, this approach can be extended to the case of one-way ANOVA (analysis of variance), as we shall demonstrate in Sect. 3.
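For readers who prefer to evaluate Eq. (2) directly rather than via the approximations introduced in Sect. 3, SciPy's noncentral t distribution can be used. This is a minimal sketch, assuming the noncentrality parameter \({\sqrt{n}}\Delta _{t}\) with the effect size \(\Delta _{t}\) defined in Eq. (4) below; it is not one of the Excel tools described later.

```python
from math import sqrt
from scipy.stats import t, nct

def paired_t_power(n, alpha, delta_t):
    """Power of the two-sided paired t test, Eq. (2), via the noncentral t."""
    df = n - 1
    crit = t.ppf(1 - alpha / 2, df)   # two-sided critical t value
    nc = sqrt(n) * delta_t            # noncentrality parameter
    return nct.cdf(-crit, df, nc) + nct.sf(crit, df, nc)

print(round(paired_t_power(34, 0.05, 0.5), 2))   # about .81 (cf. Sect. 3.1)
```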

Sakai (2014d) advocates the use of confidence intervals (CIs) along with the practice of reporting effect sizes and test statistics obtained from significance tests. CIs can be used for significance testing, and are more informative than the dichotomous reporting of whether the result is significant or not, as they provide a point estimate together with information on how accurate that estimate might be (Cumming 2012; Ellis 2010). Soboroff (2014) compared the reliability of the classical CI with that of three bootstrap-based CIs, and recommends the classical CI and the simple bootstrap percentile CI. The CI-based approach taken in this paper relies on the classical CI, which, just like the t test, assumes normal distributions.Footnote 8

2.2 Statistical power analysis by Webber/Moffat/Zobel

Among the aforementioned studies that discussed the power of IR experiments, the work of Webber et al. (2008b), which advocated the use of statistical power in IR evaluation, deserves attention here, as the present study can be regarded as an extension of their work in several respects. Below, we highlight the contributions of, and the differences between, these studies:

  • Webber et al. (2008b) were primarily concerned with building a test collection incrementally, by adding topics with relevance assessments one by one while checking to see if the desired power is achieved and reestimating the population variance of the performance score differences. In contrast, the present study aims to provide straight answers to questions such as: “I want to build a new test collection that guarantees certain levels of Type I and Type II error rates (\(\alpha\) and \(\beta\)). What is the number of topics (n) that I will have to prepare?” Researchers can simply input a set of statistical requirements to our Excel tools to obtain the answers.

  • Webber et al. (2008b) considered evaluating a given pair of systems and thus considered the t test only. However, test collections are used to compare m (\(\ge\)2) systems in practice, and it is generally not correct to conduct t tests independently for every system pair, although we shall discuss exceptional situations in Sect. 3.1. If t tests are conducted multiple times, the familywise error rate (i.e., the probability of detecting at least one nonexistent between-system difference) amounts to \(1-(1-\alpha )^{m(m-1)/2}\), assuming that all of these tests are independent of one another (Carterette 2012; Ellis 2010).Footnote 9 (A short numerical sketch of this formula follows this list.) In contrast, the present study computes the required topic set size n by considering both the t test (for \(m=2\)) and one-way ANOVA (for \(m \ge 2\)), and examines the effect of m on the required n for a given set of statistical requirements. Moreover, we also consider the approach of determining n based on the width of CIs, and perform comparisons across these three methods.

  • Webber et al. (2008b) examined a few methods for estimating the population variance of the performance score deltas, which include taking the 95th percentile of the observed score delta variance from past data, and conducting pilot relevance assessments. However, it is known in statistics that the population within-system variance can be estimated directly by the residual variance obtained through ANOVA, and we therefore take this more reliable approach.Footnote 10 Furthermore, we pool multiple variance estimates from similar data sets to enhance the reliability. As for the variance of the performance score deltas, we derive conservative estimates from our pooled variance estimates.

  • Webber et al. (2008b) considered Average Precision (AP) only; we examine a variety of evaluation measures for ad hoc and diversified search, with an emphasis on those that can utilise graded relevance assessments, and demonstrate that some measures require many more topics than others under the same set of statistical requirements.
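To put the familywise error rate formula quoted in the second bullet above into perspective, here is a minimal computation; the independence assumption is the one stated there, and the numbers are purely illustrative.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false detection across m(m-1)/2 independent t tests."""
    return 1 - (1 - alpha) ** (m * (m - 1) / 2)

print(round(familywise_error_rate(0.05, 10), 2))   # about 0.90 for 45 pairwise tests
```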

2.3 Alternatives to classical statistics

The basis of the present study is classical significance testing (the paired t test and one-way ANOVA, to be more specific) as well as CIs. However, there are also alternative avenues for research that might help advance the state of the art in topic set size design. In particular, the generalisability theory (Bodoff and Li 2007; Carterette et al. 2008; Urbano et al. 2013) is somewhat akin to our study in that it also requires variance estimates from past data. Alternatives to classical significance testing include the computer-based bootstrap (Sakai 2006) and randomisation tests (Boytsov et al. 2013; Smucker et al. 2007), Bayesian approaches to hypothesis testing (Carterette 2011; Kass and Raftery 1995), and \(p_{rep}\) (probability that a replication of a study would give a result in the same direction as the original study) as an alternative to p values (Killeen 2005). These approaches are beyond the scope of the present study.

3 Theory

This section describes how our topic set size design methods work theoretically. Sections 3.1, 3.2 and 3.3 explain the t test based, ANOVA-based and CI-based methods, respectively.Footnote 11 These three methods are based on sample size design techniques of Nagata (2003). As these methods require estimates of within-system variances, Sect. 3.4 describes how we obtain them from past data. If the reader is not familiar with statistical power and effect sizes, a good starting point would be the book by Ellis (2010); also, the book by Kelly (2009) discusses these topics as well as ANOVA in the context of interactive IR.

3.1 Topic set size design based on the paired t test

As was mentioned in Sect. 2.2, if the researcher is interested in the differences between every system pair, then conducting t tests multiple times is not the correct approach; an appropriate multiple comparison procedure (Boytsov et al. 2013; Carterette 2012; Nagata 1998) should be applied in order to avoid the aforementioned familywise error rate problem. However, there are also cases where applying the t test multiple times is the correct approach to take even when there are more than two systems (\(m>2\)) (Nagata 1998). For example, if the objective of the experiment is to show that a new system Z is better than both baselines X and Y (rather than to show that Z is either better than X or better than Y), then what we want to ensure is that the probability of incorrectly rejecting both of the null hypotheses is no more than \(\alpha\) (rather than that of incorrectly rejecting at least one of them). In this case, it is correct to apply a t test for systems Z and X, and one for systems Z and Y.

Let t be a random variable that obeys a t distribution with \(\phi\) degrees of freedom; let \(t_{inv}(\phi; \alpha )\) denote the two-sided critical t value for significance criterion \(\alpha\) (i.e., \({Pr}\{|t|\ge t_{inv}(\phi; \alpha )\}=\alpha\)).Footnote 12 Under \(H_{0}\), the test statistic \(t_{0}\) (Eq. 1 in Sect. 2) obeys a t distribution with \(\phi =n-1\) degrees of freedom. Given \(\alpha\), we reject \(H_{0}\) if \(|t_{0}|\ge t_{inv}(\phi; \alpha )\), because that means we have observed something extremely unlikely if \(H_{0}\) is true. (The p value is the probability of observing \(t_{0}\) or something more extreme under \(H_{0}\).) Thus, the probability of Type I error (i.e., “finding” a difference that does not exist) is exactly \(\alpha\) by construction. Whereas, the probability of Type II error (i.e., missing a difference that actually exists) is denoted by \(\beta\), and therefore the statistical power (i.e., the ability to detect a real difference) is given by \(1-\beta\). Put another way, \(\alpha\) is the probability of rejecting \(H_{0}\) when \(H_{0}\) is true, while the power is the probability of rejecting \(H_{0}\) when \(H_{1}\) is true. In either case, the probability of rejecting \(H_{0}\) is given by

$$\begin{aligned}&Pr\{t_{0}\le -t_{inv}(\phi; \alpha )\} + { Pr}\{t_{0}\ge t_{inv}(\phi; \alpha )\} \\&\quad ={Pr}\{t_{0}\le -t_{inv}(\phi; \alpha )\} + 1 - { Pr}\{t_{0}\le t_{inv}(\phi; \alpha )\}. \end{aligned}$$
(3)

Under \(H_{0}\), Eq. (3) amounts to \(\alpha\), where \(t_{0}\) (Eq. 1) obeys a (central) t distribution as mentioned above. Under \(H_{1}\), Eq. (3) represents the power \((1-\beta )\), where \(t_{0}\) obeys a noncentral t distribution with \(\phi =n-1\) degrees of freedom and a noncentrality parameter \(\lambda _{t}={\sqrt{n}}\Delta _{t}\). Here, \(\Delta _{t}\) is a simple form of effect size, given by:

$$\Delta _{t} = {\frac{\mu _{X}-\mu _{Y}}{\sqrt{\sigma _{t}^2}}} = {\frac{\mu _{X}-\mu _{Y}}{\sqrt{\sigma _{X}^2+\sigma _{Y}^2}}}$$
(4)

where \(\sigma _{t}^2=\sigma _{X}^2+\sigma _{Y}^2\) is the population variance of the score differences. Thus, \(\Delta _{t}\) quantifies the difference between X and Y in standard deviation units of any given evaluation measure.

While computations involving a noncentral t distribution can be complex, a normal approximation is available: let \(t^{\prime }\) denote a random variable that obeys the aforementioned noncentral t distribution; let u denote a random variable that obeys \(N(0,1^2)\). ThenFootnote 13:

$${Pr}\{t^{\prime } \le w\} \approx {Pr}\left\{ u \le {\frac{w(1-1/4\phi )-\lambda _{t}}{\sqrt{1+w^2/2\phi }}}\right\}.$$
(5)

Hence, given the topic set size n, the effect size \(\Delta _{t}\) and the significance criterion \(\alpha\), the power can be computed from Eqs. (3) and (5) as (Nagata 2003):

$$\begin{aligned} 1-\beta \, \approx \, & {} {Pr}\left\{ u \le {\frac{(-w)(1-1/4(n-1)) - {\sqrt{n}}\Delta _{t}}{\sqrt{1+(-w)^2/2(n-1)}}}\right\} \\&+ 1- {Pr}\left\{ u \le {\frac{w(1-1/4(n-1)) - {\sqrt{n}}\Delta _{t}}{\sqrt{1+w^2/2(n-1)}}}\right\} \end{aligned}$$
(6)

where \(w=t_{inv}(n-1; \alpha )\). But what we are more interested in is: given \((\alpha, \beta, \Delta _{t})\), what is the required n?

Under \(H_{0}\), we know that \(\Delta _{t}=0\) (see Eq. 4). However, under \(H_{1}\), all we know is that \(\Delta _{t} \ne 0\). In order to require that an experiment has a statistical power of \(1-\beta\), a minimum detectable effect \({min}\Delta _{t}\) must be specified in advance: we correctly reject \(H_{0}\) with \(100(1-\beta )\) % confidence whenever \(|\Delta _{t}|\ge {min}\Delta _{t}\). That is, we should not miss a real difference if its effect size is \({min}\Delta _{t}\) or larger. Cohen calls \({min}\Delta _{t}=0.2\) a small effect, \({min}\Delta _{t}=0.5\) a medium effect, and \({min}\Delta _{t}=0.8\) a large effect (Cohen 1988; Ellis 2010).Footnote 14

Let \(z_{P}\) denote the one-sided critical z value of \(u (\sim N(0,1^2))\) for probability P (i.e., \({Pr}\{u \ge z_{P}\}=P\)). Given \((\alpha, \beta, {min}\Delta _{t})\), it is known that the required topic set size n can be approximated by (Nagata 2003):

$$n \approx \left( {\frac{z_{\alpha /2}-z_{1-\beta }}{{min}\Delta _{t}}}\right) ^2 + {\frac{z_{\alpha /2}^2}{2}}.$$
(7)

For example, if we let \((\alpha, \beta, {min}\Delta _{t})=(.05, .20, .50)\) [i.e., Cohen’s five-eighty convention (Cohen 1988; Ellis 2010) with Cohen’s medium effect],

$$n \approx \left( {\frac{1.960-(-.842)}{.50}}\right) ^2+{\frac{1.960^2}{2}}=33.3.$$
(8)

As this is only an approximation, we need to check that the desired power is actually achieved with an integer n close to 33.3. Suppose we let \(n=33\). Then, by substituting \(w=t_{inv}(33-1; .05)=2.037\) and \(\Delta _{t}={min}\Delta _{t}=.50\) to Eq. (6), we obtain:

$$1-\beta \approx {Pr}\{u \le -4.742\} + 1 - {Pr}\{u \le -.825\}=.795$$
(9)

which means that the desired power of 0.8 is not quite achieved. So we let \(n=34\), and the achieved power can be computed similarly: \(1-\beta =.808\). Therefore \(n=34\) is the topic set size we want.
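The following Python sketch reproduces this procedure: it computes the initial estimate of Eq. (7) and then increments n while checking the achieved power with the normal approximation of Eq. (6). The function names are ours; the samplesizeTTEST Excel tool mentioned below automates the same procedure.

```python
from math import sqrt
from scipy.stats import norm, t

def power_approx(n, alpha, delta_t):
    """Normal approximation to the power of the paired t test, Eq. (6)."""
    w = t.ppf(1 - alpha / 2, n - 1)        # two-sided critical t value t_inv(n-1; alpha)
    lam = sqrt(n) * delta_t                # noncentrality parameter
    denom = sqrt(1 + w * w / (2 * (n - 1)))
    lower = ((-w) * (1 - 1 / (4 * (n - 1))) - lam) / denom
    upper = (w * (1 - 1 / (4 * (n - 1))) - lam) / denom
    return norm.cdf(lower) + 1 - norm.cdf(upper)

def topic_set_size_t(alpha, beta, min_delta_t):
    """Smallest n achieving power 1-beta whenever |Delta_t| >= min_delta_t."""
    z_a = norm.ppf(1 - alpha / 2)          # z_{alpha/2}
    z_b = norm.ppf(beta)                   # z_{1-beta} (negative when beta < .5)
    n = max(2, int(((z_a - z_b) / min_delta_t) ** 2 + z_a ** 2 / 2))  # Eq. (7)
    while power_approx(n, alpha, min_delta_t) < 1 - beta:
        n += 1
    return n

print(topic_set_size_t(0.05, 0.20, 0.50))   # 34, as in Eqs. (8)-(9)
```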

Our Excel tool samplesizeTTEST automates the above procedure for any given combination of \((\alpha, \beta, {min}\Delta _{t})\). Table 1 shows the required topic set sizes for the paired t test for some typical combinations. For example, under Cohen’s five-eighty convention (\(\alpha =.05, \beta =.20\)),Footnote 15 if we want the minimum detectable effect to be \({min}\Delta _{t}=.2\) (i.e., one-fifth of the score-difference standard deviation), we need \(n=199\) topics.

Table 1 Topic set sizes for \((\alpha, \beta, {min}\Delta _{t})\)

The above approach starts by requiring a \({min}\Delta _{t}\), which is independent of the evaluation method (i.e., the measure, pool depth and the measurement depth). However, researchers may want to require a minimum detectable absolute difference \({min}D_{t}\) in terms of a particular evaluation measure instead (e.g., “I want high power guaranteed whenever the true absolute difference in mean AP is 0.05 or larger.”). In this case, instead of setting a minimum (\({min}\Delta _{t}\)) for Eq. (4), we can set a minimum (\({min}D_{t}\)) for the numerator of Eq. (4): we guarantee a power of \(1-\beta\) whenever \(|\mu _{X}-\mu _{Y}|\ge {min}D_{t}\). To do this, we need an estimate \(\hat{\sigma }_{t}^2\) of the variance \(\sigma _{t}^2 (=\sigma _{X}^2+\sigma _{Y}^2)\), so that we can convert \({min}D_{t}\) to \({min}\Delta _{t}\) simply as follows:

$${min}\Delta _{t}= {\frac{{min}D_{t}}{\sqrt{\hat{\sigma }_{t}^2}}}.$$
(10)

After this conversion, the aforementioned procedure starting with Eq. (7) can be applied. Our tool samplesizeTTEST has a separate sheet for computing n from \((\alpha, \beta, {min}D_{t}, \hat{\sigma }_{t}^2)\); how to obtain \(\hat{\sigma }_{t}^2\) from past data is discussed in Sect. 3.4.
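As a usage note, the conversion of Eq. (10) is a one-liner; the snippet below reuses the topic_set_size_t sketch given earlier. The variance value is taken from the pooled adhoc/news estimate \(\hat{\sigma }^2=.0471\) quoted in Sect. 6.1 (with \(\hat{\sigma }_{t}^2=2\hat{\sigma }^2\) as in Sect. 3.4) and is used here purely for illustration.

```python
from math import sqrt

min_d_t = 0.10             # minimum detectable absolute difference in the chosen measure
var_t = 2 * 0.0471         # sigma_t^2 = 2 * (pooled within-system variance estimate)
min_delta_t = min_d_t / sqrt(var_t)              # Eq. (10)
n = topic_set_size_t(0.05, 0.20, min_delta_t)    # reuses the sketch from Sect. 3.1
```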

3.2 Topic set size design based on one-way ANOVA

This section discusses how to set the topic set size n when we assume that there are \(m \ge 2\) systems to be compared using one-way ANOVA. Let \(x_{ij}\) denote the score of the i-th system for topic j in terms of some evaluation measure; we assume that \(\{x_{ij}\}\) are independent and that \(x_{ij} \sim N(\mu _{i}, \sigma ^2)\). That is, \(x_{ij}\) obeys a normal distribution with a population system mean \(\mu _{i}\) and a common system variance \(\sigma ^2\). The assumption that \(\sigma ^2\) is common across systems is known as the homoscedasticity assumptionFootnote 16; note that we did not rely on this assumption when we discussed the paired t test. We define the population grand mean \(\mu\) and the i-th system effect \(a_{i}\) (i.e., how the i-th system differs from \(\mu\)) as follows:

$$\mu = {\frac{1}{m}}\sum _{i=1}^{m}\mu _{i}, \quad a_{i}=\mu _{i}-\mu$$
(11)

where \(\sum _{i=1}^{m}a_{i}=\sum _{i=1}^{m}(\mu _{i}-\mu )=\sum _{i=1}^{m}\mu _{i}-m\mu =0\). The null hypothesis for the ANOVA is \(H_{0}: \mu _{1} = \cdots = \mu _{m}\) (or \(a_{1} = \cdots = a_{m} = 0\)) while the alternative hypothesis \(H_{1}\) is that at least one of the system effects is not zero. That is, while the null hypothesis of the t test is that two systems are equally effective in terms of the population means, that of ANOVA is that all systems are equally effective.

The basic statistics that we compute for the ANOVA are as follows. The sample mean for system i and the sample grand mean are given by:

$$\bar{x}_{i\bullet }={\frac{1}{n}}\sum _{j=1}^{n}x_{ij},\quad \bar{x}={\frac{1}{mn}}\sum _{i=1}^{m}\sum _{j=1}^{n}x_{ij}.$$
(12)

The total variation, which quantifies how each \(x_{ij}\) differs from the sample grand mean, is given by:

$$S_{T}=\sum _{i=1}^{m}\sum _{j=1}^{n}(x_{ij}-\bar{x})^2.$$
(13)

It is easy to show that \(S_{T}\) can be decomposed into between-system and within-system variations \(S_{A}\) and \(S_{E}\) (i.e., \(S_{T}=S_{A}+S_{E}\)), where

$$S_{A} = n\sum _{i=1}^{m}(\bar{x}_{i\bullet }-\bar{x})^2, \quad S_{E} = \sum _{i=1}^{m}\sum _{j=1}^{n}(x_{ij}-\bar{x}_{i\bullet })^2.$$
(14)

The corresponding degrees of freedom are \(\phi _{A}=m-1\), \(\phi _{E}=m(n-1)\). Also, let \(V_{A}=S_{A}/\phi _{A}, V_{E}=S_{E}/\phi _{E}\) for later use.

Let F be a random variable that obeys an F distribution with \((\phi _{A}, \phi _{E})\) degrees of freedom; let \(F_{inv}(\phi _{A}, \phi _{E}; \alpha )\) denote the critical F value for probability \(\alpha\) (i.e., \({Pr}\{F \ge F_{inv}(\phi _{A}, \phi _{E}; \alpha )\}=\alpha\)).Footnote 17 Under \(H_{0}\), the test statistic \(F_{0}\) defined below obeys a (central) F distribution with \((\phi _{A}, \phi _{E})\) degrees of freedom:

$$F_{0}={\frac{V_{A}}{V_{E}}}={\frac{m(n-1)S_{A}}{(m-1)S_{E}}}.$$
(15)

Given a significance criterion \(\alpha\), we reject \(H_{0}\) if \(F_{0} \ge F_{inv}(\phi _{A}, \phi _{E}; \alpha )\). From Eq. (15), it can be observed that \(H_{0}\) is rejected if the between-system variation \(S_{A}\) is large compared to the within-system variation \(S_{E}\), or simply if the sample size n is large. Again, the p value does not tell us which is the case.
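As a concrete sketch, the quantities of Eqs. (12)–(16) can be computed from an \(m \times n\) system-by-topic score matrix as follows. NumPy/SciPy, the matrix layout (rows = systems, columns = topics) and the function name are our own choices for this illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

def one_way_anova(X):
    """One-way ANOVA for an m x n system-by-topic score matrix.

    Returns (F0, p value, V_A, V_E) following Eqs. (12)-(16).
    """
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    grand = X.mean()                                 # sample grand mean
    sys_means = X.mean(axis=1, keepdims=True)        # \bar{x}_{i.}
    S_A = n * ((sys_means - grand) ** 2).sum()       # between-system variation, Eq. (14)
    S_E = ((X - sys_means) ** 2).sum()               # within-system variation, Eq. (14)
    V_A = S_A / (m - 1)
    V_E = S_E / (m * (n - 1))
    F0 = V_A / V_E                                   # Eq. (15)
    return F0, f_dist.sf(F0, m - 1, m * (n - 1)), V_A, V_E
```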

The probability of rejecting \(H_{0}\) is given by

$${Pr}\{F_{0}\ge F_{inv}(\phi _{A},\phi _{E};\alpha )\} = 1- {Pr}\{F_{0}\le F_{inv}(\phi _{A},\phi _{E};\alpha )\}.$$
(16)

Under \(H_{0}\), Eq. (16) amounts to \(\alpha\) by construction, where \(F_{0}\) obeys a (central) F distribution as mentioned above. Under \(H_{1}\), Eq. (16) represents the power \((1-\beta )\), where \(F_{0}\) obeys a noncentral F distribution (Nagata 2003; Patnaik 1949) with \((\phi _{A}, \phi _{E})\) degrees of freedom and a noncentrality parameter \(\lambda\), such that

$$\lambda = n \Delta, \quad \Delta = {\frac{\sum _{i=1}^{m}a_{i}^2}{\sigma ^2}}={\frac{\sum _{i=1}^{m}(\mu _{i}-\mu )^2}{\sigma ^2}}.$$
(17)

Thus \(\Delta\) measures the total system effects in variance units.

While computations involving a noncentral F distribution can be complex, a normal approximation is available: let \(F^{\prime }\) denote a random variable that obeys the aforementioned noncentral F distribution; let \(u \sim N(0, 1^2)\). ThenFootnote 18:

$${Pr}\{F^{\prime } \le w\} \approx {Pr}\left\{ u \le {\frac{\sqrt{\frac{w}{\phi _{E}}}{\sqrt{2\phi _{E}-1}} - {\sqrt{\frac{c_{A}}{\phi _{A}}}}{\sqrt{2\phi _{A}^{*}-1}}}{\sqrt{\frac{c_{A}}{\phi _{A}}+{\frac{w}{\phi _{E}}}}}} \right\}$$
(18)

where

$$c_{A}={\frac{m-1+2n\Delta }{m-1+n\Delta}}, \quad \phi _{A}^{*}={\frac{(m-1+n\Delta )^2}{m-1+2n\Delta}}.$$
(19)

Hence, given \((n, \Delta, \alpha )\), the power \((1-\beta )\) can be computed from Eqs. (16)–(19) as (Nagata 2003):

$$1- {Pr}\left\{ u \le {\frac{\sqrt{\frac{w}{m(n-1)}}{\sqrt{2m(n-1)-1}} - {\sqrt{\frac{c_{A}}{m-1}}}{\sqrt{2\phi _{A}^{*}-1}}}{\sqrt{\frac{c_{A}}{m-1}+{\frac{w}{m(n-1)}}}}} \right\}$$
(20)

where \(w=F_{inv}(m-1, m(n-1); \alpha )\). But what we are more interested in is: given \((\alpha, \beta, \Delta )\), what is the required n?

Under \(H_{0}\), we know that \(\Delta =0\) (see Eq. 17). However, under \(H_{1}\), all we know is that \(\Delta \ne 0\). In order to require that an experiment has a statistical power of \(1-\beta\), a minimum detectable delta \({min}\Delta\) must be specified in advance. Let us require that we correctly reject \(H_{0}\) with \(100(1-\beta )\)% confidence whenever the range of the population means (\(D=\max _{i}{a_{i}}-\min _{i}{a_{i}}\)) is at least as large as a specified value (min D). That is, we want to detect a true difference whenever the difference between the population mean of the best system and that of the worst system is at least minD. Now, let us define \({min}\Delta\) as follows:

$${min}\Delta ={\frac{{min}D^2}{2\sigma ^2}}.$$
(21)

Then, since \(\sum _{i=1}^{m}a_{i}^2\ge {\frac{D^2}{2}}\) holds,Footnote 19 it follows that

$$\Delta ={\frac{\sum _{i=1}^{m}a_{i}^2}{\sigma ^2}} \, \ge \, {\frac{D^2}{2\sigma ^2}} \, \ge \, {\frac{{min}D^2}{2\sigma ^2}}={min}\Delta.$$
(22)

That is, \(\Delta\) is bounded below by \({min}\Delta\). Hence, although specifying min D does not uniquely determine \(\Delta\) (as \(\Delta\) depends on systems other than the best and the worst ones), we can plug in \(\Delta ={min}\Delta\) to Eqs. (19) and (20) to obtain the worst-case estimate of the power.

Unfortunately, no closed formula similar to Eq. (7) is available for ANOVA. However, from Eqs. (17) and (21), note that the worst-case estimate of n can be obtained as follows:

$$n = {\frac{\lambda }{{min}\Delta }} = {\frac{2\sigma ^2\lambda }{{min}D^2}}.$$
(23)

To use Eq. (23), we need \(\lambda\). (How to obtain \(\hat{\sigma }^2\), the estimate of \(\sigma ^2\), is discussed in Sect. 3.4.) Recall that, under \(H_{1}\), Eq. (16) represents the power \((1-\beta )\) where \(F_{0}\) obeys a noncentral F distribution with \((\phi _{A},\phi _{E})\) degrees of freedom and the noncentrality parameter \(\lambda\). By letting \(\phi _{E}=m(n-1) \approx \infty\), the power can be approximated by:

$${Pr}\{ F_{0} \ge F_{inv}(\phi _{A}, \infty; \alpha ) \} = {Pr}\{ \chi ^{\prime 2} \ge \chi ^2_{inv}(\phi _{A}; \alpha ) \}$$
(24)

where \(\chi ^{\prime 2}\) is a random variable that obeys a noncentral \(\chi ^{2}\) distribution with \(\phi _{A}\) degrees of freedom whose noncentrality parameter is \(\lambda\), and \(\chi ^2_{inv}(\phi; P)\) is the critical \(\chi ^2\) value for probability P of a random variable that obeys a (central) \(\chi ^2\) distribution with \(\phi\) degrees of freedom (i.e., \({Pr}\{\chi ^2 \ge \chi ^2_{inv}(\phi; P)\}=P\)). For noncentral \(\chi ^2\) distributions, some linear approximations of \(\lambda\) are available, as shown in Table 2 (Nagata 2003). Hence an initial estimate of n given \((\alpha, \beta, {min}D, \hat{\sigma }^2, m)\) can be obtained as shown below.

Table 2 Linear approximation of \(\lambda\), the noncentrality parameter of a noncentral \(\chi ^2\) distribution (Nagata 2003)

Suppose we let \((\alpha, \beta, {min}D, m)=(.05, .20, .5, 3)\) and that we obtained \(\hat{\sigma }^2=.5^2\) from past data so that \({min}\Delta ={\frac{{min}D^2}{2\sigma ^2}}=.5^2/(2*.5^2)=.5\). Then \(\phi _{A}=m-1=2\) and \(\lambda = 4.860+3.584*{\sqrt{2}}=9.929\) and hence \(n=\lambda /{min}\Delta =19.9\). If we let \(n=19\), then \(\phi _{E}=3(19-1)=54\), \(w=F_{inv}(2,54; .05)=3.168\). From Eq. (19), \(c_{A}=1.826, \phi ^{*}_{A}=6.298\), and from Eq. (20), the achieved power is \(1-{Pr}\{u\le -.809\}=.791\), which does not quite satisfy the desired power of 80 %. On the other hand, if \(n=20\), the achieved power can be computed similarly as .813. Hence \(n=20\) is what we want. Our Excel tool samplesizeANOVA automates the above procedure for given \((\alpha, \beta, {min}D, \hat{\sigma }^2, m)\).Footnote 20
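The following sketch mirrors this procedure. Two caveats: the Table 2 coefficients are shown only for \((\alpha, \beta )=(.05, .20)\), as used in the worked example above, and the power check uses SciPy's exact noncentral F distribution rather than the normal approximation of Eqs. (18)–(20), so the resulting n may differ by a topic or so from the values produced by samplesizeANOVA.

```python
import math
from scipy.stats import f as f_dist, ncf

# Table 2 linear approximation lambda ~ c0 + c1 * sqrt(phi_A); coefficients are
# shown here only for (alpha, beta) = (.05, .20), as in the worked example.
LAMBDA_COEFF = {(0.05, 0.20): (4.860, 3.584)}

def anova_power(n, alpha, min_delta, m):
    """Pr{F0 >= critical F} under H1, using the exact noncentral F."""
    phi_a, phi_e = m - 1, m * (n - 1)
    crit = f_dist.ppf(1 - alpha, phi_a, phi_e)         # F_inv(phi_A, phi_E; alpha)
    return ncf.sf(crit, phi_a, phi_e, n * min_delta)   # noncentrality = n * Delta

def topic_set_size_anova(alpha, beta, min_d, var_hat, m):
    min_delta = min_d ** 2 / (2 * var_hat)             # Eq. (21)
    c0, c1 = LAMBDA_COEFF[(alpha, beta)]
    lam = c0 + c1 * math.sqrt(m - 1)                   # Table 2 approximation
    n = max(2, round(lam / min_delta))                 # Eq. (23) initial estimate
    while anova_power(n, alpha, min_delta, m) < 1 - beta:
        n += 1
    while n > 2 and anova_power(n - 1, alpha, min_delta, m) >= 1 - beta:
        n -= 1
    return n

# Worked example settings: (alpha, beta, minD, m) = (.05, .20, .5, 3), sigma^2 = .5^2;
# the text arrives at n = 20 via the normal approximation.
print(topic_set_size_anova(0.05, 0.20, 0.5, 0.5 ** 2, 3))
```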

Recall that the \(H_{1}\) for ANOVA says: “there is a difference somewhere among the m systems,” which may not be very useful in the context of test-collection-based studies: we usually want to know exactly where the differences are. If the researcher is interested in obtaining a p value for every system pair, then she should conduct a multiple comparison procedure from the outset. Contrary to popular belief, it is generally incorrect to first conduct ANOVA and then conduct a multiple comparison test only if the null hypothesis for the ANOVA is rejected. This practice of sequentially conducting different tests suffers from a problem similar to that of the aforementioned familywise error rate (Nagata 1998).Footnote 21 An example of a proper multiple comparison procedure would be Tukey’s HSD (Honestly Significant Differences) test, its randomised version (Carterette 2012; Sakai 2014d), or the Holm–Bonferroni adjustment of p values (Boytsov et al. 2013); such a test should be applied directly without conducting ANOVA at all. Ideally, we would like to discuss topic set size design based on a multiple comparison procedure, but this is an open problem even in statistics. In fact, the very notion of power has several different interpretations in the context of multiple comparison procedures (Nagata 1998). Nevertheless, since some researchers do use ANOVA for comparing m systems, how the required topic set size n grows with m probably deserves some attention.

3.3 Topic set size design based on CIs

To build a CI for the difference between systems X and Y, we model the performance scores (assumed independent) as follows:

$$x_{i}= \mu _{X} + \gamma _{i} + \varepsilon _{Xi}, \quad \varepsilon _{Xi} \sim N(0, \sigma ^2_{X}),$$
(25)
$$y_{i}= \mu _{Y} + \gamma _{i} + \varepsilon _{Yi}, \quad \varepsilon _{Yi} \sim N(0, \sigma ^2_{Y})$$
(26)

where \(\gamma _{i}\) represents the topic effect and \(\mu _{\bullet }, \sigma ^2_{\bullet }\) represent the population mean and variance for X, Y, respectively (\(i=1,\ldots,n\)). This is in fact just an alternative way of presenting the assumptions behind the paired t test (Sect. 3.1). To cancel out \(\gamma _{i}\), let

$$d_{i}=x_{i}-y_{i}=\mu _{X}-\mu _{Y}+\varepsilon _{Xi}-\varepsilon _{Yi}$$
(27)

so that \(d_{i} \sim N(\mu, \sigma _{t}^2), \mu = \mu _{X}-\mu _{Y}, \sigma _{t}^2=\sigma ^2_{X}+\sigma ^2_{Y}\). It then follows that \(t={\frac{\bar{d}-\mu }{\sqrt{V/n}}}\) obeys a t distribution with \(\phi =n-1\) degrees of freedom, where \(\bar{d}=\sum ^{n}_{i=1}d_{i}/n\) and \(V=\sum ^{n}_{i=1}(d_{i}-\bar{d})^2/(n-1)\) as before. Hence, for a given significance criterion \(\alpha\), the following holds:

$${Pr}\{ -t_{inv}(\phi; \alpha ) \le t \le t_{inv}(\phi; \alpha ) \} = 1-\alpha.$$
(28)

Hence,

$${Pr}\{ \bar{d} - {MOE} \le \mu \le \bar{d} + {MOE} \} = 1-\alpha$$
(29)

where the margin of error (MOE) is given by:

$${MOE}= t_{inv}(\phi; \alpha ){\sqrt{V/n}}.$$
(30)

Thus, Eq. (29) shows that the \(100(1-\alpha )\) % CI for the difference in population means (\(\mu =\mu _{X}-\mu _{Y}\)) is given by \([\bar{d} - {MOE}, \bar{d} + {MOE}]\). This much is very well known.
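For completeness, here is a minimal sketch of the classical CI of Eqs. (29)–(30), computed from paired per-topic scores; it assumes NumPy and SciPy, and the function name is ours.

```python
import numpy as np
from scipy.stats import t

def ci_for_difference(x, y, alpha=0.05):
    """Classical 100(1-alpha)% CI for mu_X - mu_Y from paired per-topic scores."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = len(d)
    moe = t.ppf(1 - alpha / 2, n - 1) * d.std(ddof=1) / np.sqrt(n)   # Eq. (30)
    return d.mean() - moe, d.mean() + moe
```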

Let us consider the approach of determining the topic set size n by requiring that \(2{MOE} \le \delta\): that is, the CI of the difference between X and Y should be no larger than some constant \(\delta\). This ensures that experiments using the test collection will be conclusive wherever possible: for example, note that a wide CI that includes zero implies that we are very unsure as to whether systems X and Y actually differ. Since MOE (Eq. 30) contains a random variable V, we actually impose the above requirement on the expectation of 2MOE:

$$E(2{MOE})=2t_{inv}(\phi;\alpha ){\frac{E({\sqrt{V}})}{\sqrt{n}}}\le \delta.$$
(31)

Now, it is known thatFootnote 22

$$E({\sqrt{V}})={\frac{{\sqrt{2}}\Gamma \left({\frac{n}{2}}\right)}{\sqrt{n-1}\Gamma \left({\frac{n-1}{2}}\right)}}\sigma _{t}$$
(32)

where \(\sigma _{t}={\sqrt{\sigma ^2_{X}+\sigma ^2_{Y}}}\) and \(\Gamma (\bullet )\) is the gamma function.Footnote 23 By substituting Eq. (32) to Eq. (31), the requirement can be rewritten as:

$${\frac{t_{inv}(n-1;\alpha )\Gamma \left({\frac{n}{2}}\right)}{{\sqrt{n(n-1)}}\Gamma \left({\frac{n-1}{2}}\right)}} \le {\frac{\delta }{2{\sqrt{2}}\sigma _{t}}}.$$
(33)

In order to find the smallest n that satisfies Eq. (33), we first consider an “easy” case where the population variance \(\sigma _{t}^2\) is known. In this case, the MOE is given by (cf. Eq. 30):

$${MOE}_{z}=z_{\alpha /2}{\sqrt{\sigma _{t}^2/n}}$$
(34)

where \(z_{P}\) denotes the one-sided critical z value for probability P.Footnote 24 By requiring that \(2{MOE}_{z} \le \delta\), we can obtain a tentative topic set size \(n^{\prime }\):

$$n^{\prime } \ge {\frac{4z^2_{\alpha /2}\sigma _{t}^2}{\delta ^2}}.$$
(35)

The smallest integer that satisfies Eq. (35) is first tested to see whether it also satisfies Eq. (33); \(n^{\prime }\) is then incremented until it does. The resultant \(n=n^{\prime }\) is the topic set size we want.
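A minimal sketch of this search is given below, assuming SciPy; using scipy.special.gammaln for the gamma-function ratio sidesteps the overflow that limits the Excel implementation mentioned in Sect. 6.2. The function name is ours.

```python
import math
from scipy.stats import norm, t
from scipy.special import gammaln

def topic_set_size_ci(alpha, delta, var_t):
    """Smallest n such that E(2*MOE) <= delta for a 100(1-alpha)% CI (Eqs. 31-35)."""
    sigma_t = math.sqrt(var_t)
    z = norm.ppf(1 - alpha / 2)
    n = max(2, math.ceil(4 * z * z * var_t / (delta * delta)))   # tentative n', Eq. (35)
    rhs = delta / (2 * math.sqrt(2) * sigma_t)                   # right-hand side of Eq. (33)
    while True:
        lhs = (t.ppf(1 - alpha / 2, n - 1)
               * math.exp(gammaln(n / 2) - gammaln((n - 1) / 2))
               / math.sqrt(n * (n - 1)))                         # left-hand side of Eq. (33)
        if lhs <= rhs:
            return n
        n += 1
```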

Our Excel tool samplesizeCI automates the above procedure to find the required sample size n, for any given combination of \((\alpha, \delta, \hat{\sigma }_{t}^2)\). How to obtain the variance estimate \(\hat{\sigma }_{t}^2\) from past data is discussed below.

3.4 Estimating population within-system variances

As was explained above, our t-based and CI-based topic set size design methods require an estimate of the population variance of the difference between two systems \(\sigma _{t}^{2}=\sigma _{X}^2+ \sigma _{Y}^2\), and our ANOVA-based method requires an estimate of the population within-system variance \(\sigma ^2\) under the homoscedasticity assumption.

Let C be an existing test collection and \(n_{C}\) be the number of topics in C; let \(m_{C}\) be the number of runs whose performances with C in terms of some evaluation measure are known, so that we have an \(n_{C} \times m_{C}\) topic-by-run matrix \(\{x_{ij}\}\) for that evaluation measure. There are two simple ways to estimate \(\hat{\sigma }^2\) from such data. One is to use the residual variance from one-way ANOVA (see Sect. 3.2):

$$\hat{\sigma }_{C}^2 = V_{E} = {\frac{\sum _{i=1}^{m_{C}}\sum _{j=1}^{n_{C}}(x_{ij}-\bar{x}_{i\bullet })^2}{m_{C}(n_{C}-1)}}$$
(36)

where \(\bar{x}_{i \bullet }={\frac{1}{n_C}}\sum _{j=1}^{n_{C}}x_{ij}\) (sample mean for system i).Footnote 25 The other is to use the residual variance from two-way ANOVA without replication, which utilises the fact that the scores \(x_{\bullet j}\) for topic j correspond to one another:

$$\hat{\sigma }_{C}^2 = V_{E}={\frac{\sum _{i=1}^{m_{C}}\sum _{j=1}^{n_{C}}(x_{ij}-\bar{x}_{i \bullet }-\bar{x}_{\bullet j} + \bar{x})^2}{(m_{C}-1)(n_{C}-1)}}$$
(37)

where \(\bar{x}_{\bullet j}={\frac{1}{m_C}}\sum _{i=1}^{m_{C}}x_{ij}\) (sample mean for topic j). Equation (36) generally yields a larger estimate, because while one-way ANOVA removes only the between-system variation from the total variation (see Eq. 14), two-way ANOVA without replication removes the between-topic variation as well. As we prefer to “err on the side of oversampling” as recommended by Ellis (2010), we use Eq. (36) in this study. Researchers who are interested in tighter estimates are welcome to try our Excel files with their own variance estimates.

As we shall explain in Sect. 4, we have two different topic-by-run matrices (i.e., test collections and runs) for each evaluation measure for every IR task that we consider. To enhance the reliability of our variance estimates, we first obtain a variance estimate \(\hat{\sigma }_{C}^{2}\) from each matrix using Eq. (36), and then pool the two estimates using the following standard formulaFootnote 26:

$$\hat{\sigma }^2 = {\frac{\sum _{C}(n_{C}-1)\hat{\sigma }_{C}^2}{\sum _{C}(n_{C}-1)}}.$$
(38)

As for \(\sigma _{t}^{2}=\sigma _{X}^2+ \sigma _{Y}^2\), we introduce the homoscedasticity assumption here as well and let \(\hat{\sigma }_{t}^{2}=2\hat{\sigma }^{2}\). While this probably overestimates the variances of the score differences, again, we choose to “err on the side of oversampling” (Ellis 2010) in this study.
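A sketch of these estimators is shown below, assuming NumPy and a topic-by-run matrix with rows = topics and columns = runs; the function names are our own.

```python
import numpy as np

def within_system_variance(X, two_way=False):
    """Residual variance from an n_C x m_C topic-by-run score matrix.

    One-way ANOVA residual (Eq. 36) by default; two-way ANOVA without
    replication (Eq. 37) if two_way=True.
    """
    X = np.asarray(X, dtype=float)
    n_c, m_c = X.shape
    run_means = X.mean(axis=0)                     # \bar{x}_{i.} for each run
    if not two_way:
        return ((X - run_means) ** 2).sum() / (m_c * (n_c - 1))
    topic_means = X.mean(axis=1, keepdims=True)    # \bar{x}_{.j} for each topic
    grand_mean = X.mean()
    resid = X - run_means - topic_means + grand_mean
    return (resid ** 2).sum() / ((m_c - 1) * (n_c - 1))

def pooled_variance(estimates_and_topic_counts):
    """Pool per-collection estimates with Eq. (38); items are (estimate, n_C) pairs."""
    num = sum((n_c - 1) * v for v, n_c in estimates_and_topic_counts)
    den = sum(n_c - 1 for v, n_c in estimates_and_topic_counts)
    return num / den

# e.g., pooling two collections as in Eq. (38):
# sigma2 = pooled_variance([(within_system_variance(X1), X1.shape[0]),
#                           (within_system_variance(X2), X2.shape[0])])
```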

4 Data

Table 3 provides some statistics of the past data that we used for obtaining \(\hat{\sigma }^2\)’s. We considered three IR tasks: (a) adhoc news search; (b) adhoc web search; and (c) diversified web search; for each task, we used two data sets to obtain pooled variance estimates.

Table 3 TREC test collections and runs used for estimating \(\sigma ^2\)

The adhoc/news data sets are from the TREC robust tracks, with “new” topics from each year (Voorhees 2004, 2005). The “old” topics from the robust tracks are not good for our experiments for two reasons. First, the relevance assessments for the old topics were constructed based on old TREC adhoc runs, not the new robust track runs. This prevents us from studying the tradeoff between topic set sizes and pool depths (see Sect. 6.3). Second, the relevance assessments for the old topics are binary, which prevents us from studying the benefit of various evaluation measures that can utilise graded relevance assessments.

The web data sets are from the adhoc and diversity tasks of the TREC web tracks (Clarke et al. 2012, 2013). Note that these diversity data sets have per-intent graded relevance assessments, although they were treated as binary in the official evaluations at TREC.

5 Measures

When computing evaluation measures, the usual measurement depth (i.e., document cutoff) for the adhoc news task is \({md}=1000\); we considered \({md}=10\) in addition for consistency with the web track tasks. In contrast, we consider \({md}=10\) only for the web tasks, as we are interested in the first search engine result page.

Table 4 provides some information on the evaluation measures that were used in the present study. For the adhoc/news and adhoc/web tasks, we consider the binary Average Precision (AP), Q-measure (Q) (Sakai 2005), normalised Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen 2002) and normalised Expected Reciprocal Rank (nERR) (Chapelle et al. 2011), all computed using the NTCIREVAL toolkit.Footnote 27 For computing AP and Q, we follow Sakai and Song (2011) and divide by \(\min ({md},R)\) rather than by R in order to properly handle small measurement depths.

Table 4 Evaluation measures used in this study

For the diversity/web task, we consider \(\alpha\)-nDCG (Clarke et al. 2009) and Intent-Aware nERR (nERR-IA) (Chapelle et al. 2011) computed using ndeval,Footnote 28 as well as D-nDCG and \(D\sharp\)-nDCG (Sakai and Song 2011) computed using NTCIREVAL. When using NTCIREVAL, the gain value for each LX-relevant document was set to \(g(r)=2^{x}-1\): for example, the gain for an L3-relevant document is 7, while that for an L1-relevant document is 1. As for ndeval, the default settings were used: this program ignores per-intent graded relevance levels.

We refer the reader to Sakai (2014c) as a single source for mathematical definitions of the above evaluation measures.

6 Results and discussions

6.1 Variance estimates

Table 5 shows the within-system variance estimates \(\hat{\sigma }^2\) that we obtained for each evaluation measure with each topic-by-run matrix. For example, with TREC03new and TREC04new, \(\hat{\sigma }^2=.0479\) and .0462 according to Eq. (36), respectively, and the pooled variance obtained from these two data sets using Eq. (38) is \(\hat{\sigma }^2=.0471\) as shown in bold. Throughout this paper, we use these pooled variances for topic set size design. Note that the variance estimates are similar across the two data sets for each IR task (a1), (a2), (b), and (c); this suggests that, given an existing test collection for a particular IR task, it is not difficult to obtain a good estimate of the within-system variance for a particular evaluation measure when designing the topic set size of a new test collection for the same task. The estimates look reliable especially for tasks (a1) and (a2), i.e., adhoc/news, where we have as many as \(m_{C}=78\) runs.

Table 5 \(\hat{\sigma }^2\) obtained from the topic-by-run score matrices

It is less clear, on the other hand, whether a variance estimate from one task can be regarded as a reliable variance estimate for the topic set size design of a different task. The pooled variance estimate for AP at \({md}=10\) obtained from our adhoc/news data is .0835; this would be a highly accurate estimate if it were used for constructing an adhoc/web test collection, since its actual pooled variance for AP is .0824. However, the variance estimates for Q, nDCG and nERR are not as similar across tasks (a2) and (b). Hence, if a variance estimate from an existing task is to be used for the topic set size design of a new task, it would probably be wise to choose one of the larger variances observed by considering several popular evaluation measures such as AP and nDCG. In particular, note that variances for the diversity measures such as the ones shown in Table 5(c) cannot be obtained from past adhoc data that lack per-intent relevance assessments: in such a case, using a variance estimate of an evaluation measure that is not designed for diversified search is inevitable. For example, if we know from the TREC11w (i.e., TREC 2011) adhoc/web task experience that the variances are in the .0477–.1006 range as shown in Table 5(b), then we could let \(\hat{\sigma }^2=.1006\) (i.e., the estimate for the unstable nERRFootnote 29) for the topic set size design of a new diversity/web task at TREC 2012. As the actual pooled variances for the TREC12wD task are in the .0301–.0798 range, our choice of \(\hat{\sigma }^2\) would have overestimated the required topic set size for TREC12wD, which we regard as far better than underestimating it and thereby not meeting the set of statistical requirements.

6.2 Topic set sizes based on the three methods

In this section, we discuss how the pooled variance estimates shown in Table 5 translate into actual topic set sizes, using the aforementioned three Excel tools. For the t test and ANOVA-based topic set size design methods, we only present results under Cohen’s five-eighty convention [i.e., \((\alpha, \beta )=(.05, .20)\)] throughout this paper; the interested reader can easily obtain results for other settings by using our Excel tools.

The left half of Table 6 shows the t test-based topic set size design results under Cohen’s five-eighty convention for different minimum detectable differences \({minD}_{t}\); similarly, the right half shows the ANOVA-based topic set size design results with \(m=2\) for different minimum detectable ranges minD. Throughout this paper, the smallest topic set size within the same set of statistical requirements is underlined.

Table 6 Topic set size table: t test-based versus ANOVA with \(m=2\) systems [\((\alpha, \beta )=(.05, .20)\)]

Note that, when \(m=2\) (i.e., there are only two systems to compare), minD (i.e., the minimum detectable difference between the best and the worst systems) reduces to \({minD}_{t}\) of the t test. It can be observed that the t test-based and ANOVA-based (\(m=2\)) results are indeed very similar. In fact, since one-way ANOVA for \(m=2\) is equivalent to the unpaired (i.e., two-sample) t test, one would expect the topic set sizes based on the paired t test to be a little smaller than those based on ANOVA for \(m=2\) systems, as the former utilises the fact that the two score vectors are paired. On the contrary, Table 6 shows that the t test-based topic set sizes are slightly larger. This is probably because of the way we obtain \(\hat{\sigma }_{t}^{2}\) for the t test-based design: since we let \(\hat{\sigma }_{t}^2=2\hat{\sigma }^2\) (see Sect. 3.4), if our \(\hat{\sigma }^2\) for the ANOVA-based design is an overestimate, then the error is doubled for \(\hat{\sigma }_{t}^2\). Since the topic set size for a paired t test should really be bounded above by that for the unpaired t test under the same statistical requirements, we recommend that IR researchers use our ANOVA-based tool with \(m=2\) if they want to conduct topic set size design based on the paired t test. While our t test tool can handle arbitrary combinations of \((\alpha, \beta )\) unlike the ANOVA-based counterpart, it is unlikely for researchers to consider cases other than \(\alpha =.01, .05, \beta =.10, .20\) in practice. Our ANOVA-based tool can handle all four combinations of these Type I and Type II error probabilities (see Sect. 3.2).

The ANOVA-based results in Table 6(a1) show that if we want to ensure Cohen’s five-eighty convention for a minimum detectable difference of \({minD}_{t}=0.10\) in AP for an adhoc/news task (\({md}=1000\)), then we would need 73 topics. Similarly, the ANOVA-based results in Table 6(c) show that if we want to ensure Cohen’s five-eighty convention for \({minD}_{t}=0.15\) in nERR-IA for a diversity/web task, then we would need 58 topics. Hence, existing TREC test collections with 50 topics do not satisfy these statistical requirements. We argue that, through this kind of analysis with previous data, the test collection design for the new round of an existing task should be improved. Note, however, that we are aiming to satisfy a set of statistical requirements for any set of systems; our results do not mean that existing TREC collections with 50 topics are useless for comparing a particular set of systems.

Table 7 shows the CI-based topic set size design results at \(\alpha =.05\) (i.e., 95 % CI) for different CI widths \(\delta\); it also shows the ANOVA-based topic set size design results for different minimum detectable ranges minD under \((\alpha, \beta, m)=(.05, .20, 10)\) and \((\alpha, \beta, m)=(.05, .20, 100)\).Footnote 30 Some topic set sizes could not be computed with our CI-based tool due to a computational limitation in the gamma function in Microsoft ExcelFootnote 31; however, we observed that the topic set size required based on the CI-based design with \(\alpha =0.05\) and \(\delta =c\) is almost the same as the topic set size required based on the ANOVA-based design with \((\alpha, \beta, m)=(.05, .20, 10)\) and \({minD}=c\), for any c. Hence, whenever the CI-based tool failed, we used the ANOVA-based tool instead with \(m=10\); these values are indicated in bold. It can indeed be observed in Table 7 that the CI-based topic set sizes and the ANOVA-based (\(m=10\)) results are almost the same. Hence, in practice, researchers who want to conduct topic set size design based on CI-widths can use our ANOVA-based tool instead, by letting \(m=10\).

Table 7 Topic set size table: CI-based versus ANOVA with \(m=10, 100\) systems (\((\alpha, \beta )=(.05, .20)\))

Table 7 also shows that when we increase the number of systems from \(m=10\) to \(m=100\), the required topic set sizes are almost tripled. This suggests that it might be useful for test collection builders to have a rough idea of the number of systems that will be compared at the same time in an experiment.

If we compare across the evaluation measures, we can observe the following from Tables 6 and 7:

  • For the adhoc/news tasks at \({md}=1000\), nDCG requires the smallest number of topics; nERR requires more than twice as many topics as AP, Q and nDCG do;

  • For the adhoc/news tasks at \({md}=10\), Q requires the smallest number of topics; again, nERR requires substantially more topics than AP, Q and nDCG do;

  • For the adhoc/web tasks, Q requires the smallest number of topics; AP and nERR require more than twice as many topics as Q does;

  • For the diversity/web tasks, D-nDCG requires the smallest number of topics; \(\alpha\)-nDCG and nERR-IA require more than twice as many topics as D-nDCG does.

Note that our topic set size design methods thus provide a way to evaluate and compare different measures from a highly practical viewpoint: as the required number of topics is generally proportional to the relevance assessment cost, measures that require fewer topics are clearly more economical. Of course, this is only one aspect of an evaluation measure; whether the measure is actually measuring what we want to measure (e.g., user satisfaction or performance) should be verified separately, but this is beyond the scope of the present study.

Figure 1 visualises the relationships among our topic set size design methods. The vertical axis represents the \(\delta\) for the CI-based method, the \({minD}_{t}\) for the t test-based method, and the minD for the ANOVA-based method; the horizontal axis represents the number of topics n. We used the largest \(\hat{\sigma }^2\) in Table 5, namely, \(\hat{\sigma }^2=.1206\), for this analysis, but other values of \(\hat{\sigma }^2\) would just change the scale of the horizontal axis. As was discussed earlier, it can be observed that the t test-based results and the ANOVA-based results with \(m=2\) are very similar, and that the CI-based results and the ANOVA-based results with \(m=10\) are almost identical. Also, by comparing the three curves for the ANOVA-based method, we can see how n grows with m for a given value of minD.

Fig. 1 Effect of \(\delta, {minD}_{t}\) and minD on topic set sizes

6.3 Trade-off between topic set sizes and pool depths for the adhoc/news task

Our discussions so far covered adhoc/news, adhoc/web and diversity/web tasks, but assumed that the pool depth was a given. In this section, we focus our attention on the adhoc/news task (with \({md}=1000\)), where we have depth-100 and depth-125 pools (see Table 3), which gives us the option of reducing the pool depth. Hence we can discuss the total assessment cost by multiplying n by the average number of documents that need to be judged per topic for a given pool depth pd.

From the original TREC03new and TREC04new relevance assessments, we created depth-pd (\({pd}=100, 90, 70, 50, 30, 10\)) versions of the relevance assessments by filtering out all topic-document pairs that were not contained in the top pd documents of any run. Using each set of the depth-pd relevance assessments, we re-evaluated all runs using AP, Q, nDCG and nERR. Then, using these new topic-by-run matrices, new variance estimates were obtained and pooled as described in Sect. 3.4.

Table 8 shows the pooled variance estimates obtained from the depth-pd versions of the TREC03new and TREC04new relevance assessments. It also shows the average number of documents judged per topic for each pd. For example, while the original depth-125 relevance assessments for TREC03new contain 47,932 topic-document pairs, its depth-100 version has 37,605 pairs across 50 topics; the original TREC04new depth-100 relevance assessments have 34,792 pairs across 49 topics. Hence, on average, \((37,605+34,792)/(50+49)=731\) documents are judged per topic when \({pd}=100\). Similarly, \((4905+4581)/(50+49)=96\) documents are judged per topic when \({pd}=10\).

Table 8 Number of relevance assessments versus pooled \(\hat{\sigma }^2\) for reduced pool depths with adhoc/news (\({md}=1000\))

Based on the t test-based method with \((\alpha, \beta, {minD}_{t})=(.05, .20, .10)\), Fig. 2 plots the required number of topics n against the average number of documents judged per topic for different pool depth settings and different evaluation measures. Recall that the results based on the ANOVA-based method with \((\alpha, \beta, {minD}, m)=(.05, .20, .10, 2)\) would look almost identical to this figure. For each pool depth setting, note that the number of topics multiplied by the number of judged documents per topic gives the estimated total assessment cost. Similarly, based on the ANOVA-based method with \((\alpha, \beta, {minD}, m)=(.05, .20, .10, 10)\), Fig. 3 visualises the assessment costs for different pool depth settings. Recall that the results based on the CI-based method with \(\alpha =.05\) would look identical to this figure.

Fig. 2 Cost analysis with the t test-based topic set size design for the adhoc/news task at \({md}=1000\)

Fig. 3 Cost analysis with the ANOVA-based topic set size design for the adhoc/news task at \({md}=1000\)

In Fig. 2, the total cost for AP when \({pd}=100\) (i.e., the default of TREC adhoc tasks) is 55,556 documents (visualised as the area of a pink rectangle); if we use the \({pd}=10\) setting instead, the cost goes down to 9,696 documents (visualised as the area of a blue rectangle). That is, while maintaining the statistical reliability of the test collection, the assessment cost can be reduced to \(9,696/55,556=17.5\) %. Similarly, Fig. 3 shows that, if \(m=10\) systems are to be compared, the assessment cost can be reduced to \(18,912/107,457=17.6\) % by letting \({pd}=10\) instead of the usual \({pd}=100\). While it is a well-known fact that it is better to have many topics with few judgments per topic than to have few topics with many judgments per topic (e.g., Carterette et al. 2008; Carterette and Smucker 2007; Webber et al. 2008b), our methods visualise this in a straightforward manner.

Figures 2 and 3 also show that because nERR is very unstable, it requires about twice as many topics as the other measures regardless of the choice of pd. Since the required number of topics is basically a constant for nERR, it would be a waste of assessment effort to construct a depth-100 test collection if the test collection builder plans to use nERR as the primary evaluation measure.Footnote 32 Hence, as was discussed earlier, IR test collection builders should probably consider several different evaluation measures at the test collection design phase, take one of the larger variance estimates and plug it into our ANOVA tool, in the hope that the new test collection will meet the set of statistical requirements even for relatively unstable measures. Then, the test collection design (n, pd) can be re-examined and adjusted after each round of the task.

While the five test collection designs shown in Fig. 2 (and Fig. 3) are statistically equivalent, note that IR test collection builders should collect as many relevance assessments as possible in order to maximise reusability, which we define as the ability of a test collection to assess new systems fairly, relative to known systems. That is, if the budget available accommodates B relevance assessments, test collection builders can first decide on a set of statistical requirements such as \((\alpha, \beta, {minD}, m)\), obtain several candidate test collection designs (n, pd) using our ANOVA tool with a large variance estimate \(\hat{\sigma }^2\), and finally choose the design whose total cost is just below B.

7 Conclusions and future work

In this study, we showed three statistically-motivated methods for determining the number of topics for a new test collection to be built, based on sample size design techniques of Nagata (2003). The t test-based method and the ANOVA-based method are based on power analysis; the CI-based method requires a tight CI for the difference between any system pair. We pooled the residual variances of ANOVA to estimate the population within-system variance for each IR task and measure, and compared the topic set size design results across the three methods. We argued that, as different evaluation measures can have vastly different within-system variances and hence require substantially different topic set sizes, IR test collection builders should examine several different evaluation measures at the test collection design phase and focus on a high-variance measure for topic set size design. We also demonstrated that obtaining a reliable variance estimate is not difficult when building a new test collection for an existing task, and argued that the design of a new test collection should be improved based on past data from the same task. As for building a test collection for a new task with new measures, we suggest that a high variance estimate from a similar existing task be used for topic set size design (e.g., use a variance estimate from existing adhoc/web task data for designing a new diversity/web task test collection). Furthermore, we demonstrated how to study the balance between the topic set size n and the pool depth pd, and how to choose the right test collection design (n, pd) based on the available budget. Our approach thus provides a clear guiding principle for test collection design to the IR community. Note that our approach is also applicable to non-IR tasks as long as a few score matrices equivalent to our topic-by-run matrices are available.

Our Excel tools and the topic-by-run matrices are available online; the interested reader can easily reproduce our results using them with the pooled variance estimates shown in Tables 5 and 8. In practice, since our t test based results are very similar to our ANOVA-based results with \(m=2\), while our CI-based results are almost identical to our ANOVA-based results with \(m=10\), we recommend that researchers utilise our ANOVA-based tool regardless of which of our three approaches they want to take.

As for future work, we are currently looking into the use of score standardisation (Webber et al. 2008a) for the purpose of topic set size design after removing the topic hardness effect. This requires a whole new set of experiments that involves leave-one-out tests (Sakai 2014c; Zobel 1998) in order to study how new systems that contributed to neither the pooling nor the setting of per-topic standardisation parameters can be evaluated properly. The results will be reported in a separate study.

While our methods rely on a series of approximations (e.g., Eqs. 5 and 18), these techniques have been compared with exact values and are known to be highly accurate (Nagata 2003). Our view is that the greatest source of error for our topic set size design approach is probably the variance estimation step. Probably the best way to study this effect would be to apply the proposed topic set size design procedure to TREC tracks or NTCIR tasks, update the estimates by pooling the observed variances across the past rounds, and see how the pooled variances fluctuate over time. Our hope is that the variance estimates and the topic set sizes will stabilise after a few rounds, but this has to be verified. We feel optimistic about this, as the actual variances across two rounds of the same track were very similar in our experiments (Table 5).Footnote 33 Similarly, we hope to investigate the practical usefulness of our approach for new tasks with new evaluation measures. Can we do any better than just “learn from a similar existing task” as suggested in the present study?