When evaluating IR run effectiveness using a test collection, a key question is: which search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, all of which feature more than 50 topics, a setting that has not been examined in past work. Our analysis finds that a subset of topics can be found that is as accurate as the full topic set at ranking runs. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than was shown in past work. We also study topic subsets in the context of the power of statistical significance tests. We find a trade-off in using such subsets: significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that result in a low-accuracy test collection, even when the number of topics in the subset is quite large. These negatively correlated subsets suggest that we still lack methodologies that provide stability guarantees on topic selection for new collections. Finally, we examine whether clustering of topics is an appropriate strategy to find and characterize good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
Guiver et al. (2009) use the terminology Best/Average/Worst, and we adopt it in this paper in order to be consistent with past work.
Note that this line of research focuses on an a posteriori, i.e., after-evaluation, setting: the aim is not to predict a good topic subset in advance, but only to determine whether such a subset exists.
Consistent with this line of research (see Footnote 2), we investigate clustering of topics in an a posteriori setting; thus, we study an after-evaluation characterization of Best topic subsets, but do not aim to provide a methodology for finding such subsets in practice.
The effect of statMAP, on which we focus in this paper, is discussed in more detail in Sect. 3.3.
Note that several versions of statMAP exist; we used statAP_MQ_eval_v3.pl: http://trec.nist.gov/data/million.query07.html.
We use the suffix B/A/W to indicate the correlation curve for the best/average/worst topic set.
Note that the overlap we find might be an effect of the heuristic used; we can say no more than that it is possible to build a Best and a Worst topic set with high overlap.
Here, for speed of computation, only a single random topic subset is drawn from the set of all topic subsets of a given cardinality. The histograms for random subsets are consequently more “spiky” than if we averaged over several random subsets; however, the broad signal of the result is still visible in the plots.
We tried with up to 1 million repetitions, but the series are already stable with 1000 repetitions.
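As a rough illustration of this repeated-sampling procedure, the sketch below draws random topic subsets of a fixed cardinality from a toy run-by-topic score matrix and averages, over many repetitions, Kendall's tau between the run ranking induced by each subset and the ranking induced by the full topic set. All data and function names here are hypothetical stand-ins; the actual experiments use TREC runs and the full evaluation pipeline.

```python
import itertools
import random

def mean_scores(scores, topics):
    """Average each run's per-topic scores (e.g., AP) over a topic subset."""
    return [sum(run[t] for t in topics) / len(topics) for run in scores]

def kendall_tau(x, y):
    """Simple O(n^2) Kendall's tau between two score vectors (no tie handling)."""
    pairs = list(itertools.combinations(range(len(x)), 2))
    concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Toy data: 4 runs evaluated on 6 topics (rows: runs, columns: topics).
scores = [
    [0.10, 0.40, 0.30, 0.20, 0.50, 0.60],
    [0.20, 0.30, 0.40, 0.10, 0.40, 0.50],
    [0.05, 0.20, 0.20, 0.30, 0.30, 0.40],
    [0.30, 0.50, 0.50, 0.40, 0.60, 0.70],
]
full_set_scores = mean_scores(scores, range(6))

random.seed(0)
repetitions = 1000
cardinality = 3
taus = [
    kendall_tau(full_set_scores,
                mean_scores(scores, random.sample(range(6), cardinality)))
    for _ in range(repetitions)
]
print(sum(taus) / repetitions)
```

With well-separated runs like these, the average tau stays high even for small cardinalities; the spread of the individual tau values across repetitions is what produces the “spiky” single-draw histograms discussed in the footnotes.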
See the R function “kmeans” in the “stats” package (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html), and “k-means” of “scikit-learn” for Python 3 (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
For an exhaustive list see the R package “proxy” (https://cran.r-project.org/web/packages/proxy/proxy.pdf), and the “Distance computations” section of Python 3 (https://docs.scipy.org/doc/scipy/reference/spatial.distance.html).
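To make the clustering setting concrete, the following is a minimal, self-contained Lloyd's k-means over topic vectors, where each topic is represented by the scores the runs achieved on it. This is an illustrative sketch standing in for the R and scikit-learn implementations cited above: the toy data and the deterministic centroid initialization are assumptions made for reproducibility, not the paper's methodology.

```python
def kmeans(points, k, iters=50):
    """Minimal Lloyd's k-means over equal-length float vectors.

    Deterministic sketch: centroids are initialized with evenly spaced
    input points rather than random seeds.
    """
    centroids = [list(points[i]) for i in range(0, len(points), len(points) // k)][:k]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy data: 6 topics, each represented by the scores of 3 runs on that topic.
topic_vectors = [
    [0.10, 0.20, 0.10],  # "hard" topics: every run scores low
    [0.15, 0.10, 0.20],
    [0.12, 0.18, 0.15],
    [0.80, 0.90, 0.85],  # "easy" topics: every run scores high
    [0.90, 0.80, 0.90],
    [0.85, 0.95, 0.80],
]
labels = kmeans(topic_vectors, k=2)
print(labels)  # the hard and easy topics fall into two separate clusters
```

Representing topics by their per-run score vectors is one natural choice of feature space for this kind of analysis; the distance function is the main free parameter, which is why the proxy and SciPy distance catalogues above are relevant.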
Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million query track 2007 overview. In Proceedings of TREC.
Bartlett, J. E., Kotrlik, J. W., & Higgins, C. C. (2001). Organizational research: Determining appropriate sample size in survey research. Information Technology, Learning, and Performance Journal, 19(1), 43–50.
Berto, A., Mizzaro, S., & Robertson, S. (2013). On using fewer topics in information retrieval evaluations. In Proceedings of the ICTIR, (p. 9).
Bodoff, D., & Li, P. (2007). Test theory for assessing IR test collections. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 367–374). New York: ACM.
Buckley, C., & Voorhees, E. (2000). Evaluating evaluation measure stability. In Proceedings of the 23rd SIGIR, (pp. 33–40).
Carterette, B., Allan, J., & Sitaraman, R. (2006). Minimal test collections for retrieval evaluation. In Proceedings of the 29th SIGIR, (pp 268–275).
Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009a). Million query track 2009 overview. In Proceedings of TREC.
Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009b). If I had a million queries. In Proceedings of the 31st ECIR, ECIR ’09, (pp. 288–300).
Carterette, B., & Smucker, M. D. (2007). Hypothesis testing with incomplete relevance judgments. In Proceedings of the sixteenth ACM conference on conference on information and knowledge management, (pp 643–652). New York: ACM. CIKM ’07. https://doi.org/10.1145/1321440.1321530.
Carterette, B. A. (2012). Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems (TOIS), 30(1), 4.
Cattelan, M., & Mizzaro, S. (2009). IR evaluation without a common set of topics. In Proceedings of the ICTIR, (pp. 342–345).
Feise, R. (2002). Do multiple outcome measures require \(p\)-value adjustment? BMC Medical Research Methodology, 2, 8.
Guiver, J., Mizzaro, S., & Robertson, S. (2009). A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Transactions on Information Systems, 27(4), 21:1–21:26.
Hauff, C., Hiemstra, D., Azzopardi, L., & de Jong, F. (2010). A case for automatic system evaluation. In Proceedings of the ECIR, (pp. 153–165).
Hauff, C., Hiemstra, D., de Jong, F., & Azzopardi, L. (2009). Relying on topic subsets for system ranking estimation. In Proceedings of the 18th CIKM, (pp. 1859–1862).
Hosseini, M., Cox, I. J., Milic-Frayling, N., Shokouhi, M., & Yilmaz, E. (2012). An uncertainty-aware query selection model for evaluation of ir systems. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, (pp. 901–910). New York, NY, USA: ACM. SIGIR ’12. https://doi.org/10.1145/2348283.2348403
Hosseini, M., Cox, I. J., Milic-Frayling, N., Sweeting, T., & Vinay, V. (2011a). Prioritizing relevance judgments to improve the construction of IR test collections. In Proceedings of the 20th CIKM 2011, (pp. 641–646)
Hosseini, M., Cox, I. J., Milic-Frayling, N., Vinay, V., & Sweeting, T. (2011b). Selecting a subset of queries for acquisition of further relevance judgements. In Proceedings of the ICTIR, (pp. 113–124). LNCS 6931.
Kutlu, M., Elsayed, T., & Lease, M. (2018). Intelligent topic selection for low-cost information retrieval evaluation: A new perspective on deep vs. shallow judging. Information Processing and Management, 54(1), 37–59. https://doi.org/10.1016/j.ipm.2017.09.002.
Mehrotra, R., & Yilmaz, E. (2015). Representative & informative query selection for learning to rank using submodular functions. In Proceedings of the of the 38th international ACM SIGIR conference on research and development in information retrieval, (pp. 545–554). New York, NY, USA: ACM, SIGIR ’15. https://doi.org/10.1145/2766462.2767753
Mizzaro, S., & Robertson, S. (2007). HITS hits TREC—Exploring IR evaluation results with network analysis. In Proceedings of the 30th SIGIR, (pp. 479–486).
Moffat, A., Scholer, F., & Thomas, P. (2012). Models and metrics: IR evaluation as a user process. In Proceedings of the Australasian document computing symposium, Dunedin, New Zealand, (pp. 47–54).
Pavlu, V., & Aslam, J. (2007). A practical sampling strategy for efficient retrieval evaluation. Technical report, College of Computer and Information Science, Northeastern University.
Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets (1st ed.). Cambridge: Cambridge University Press.
Robertson, S. (2011). On the contributions of topics to system evaluation. In Proceedings of the ECIR, LNCS 6611, (pp. 129–140).
Roitero, K., Maddalena, E., & Mizzaro, S. (2017). Do easy topics predict effectiveness better than difficult topics? In J. M. Jose, C. Hauff, I. S. Altıngovde, D. Song, D. Albakour, S. Watt, & J. Tait (Eds.), Advances in information retrieval (pp. 605–611). Cham: Springer International Publishing.
Roitero, K., Soprano, M., Brunello, A., & Mizzaro, S. (2018a). Reproduce and improve: An evolutionary approach to select a few good topics for information retrieval evaluation. ACM Journal of Data and Information Quality, 10(3), 12:1–12:21. https://doi.org/10.1145/3239573.
Roitero, K., Soprano, M., & Mizzaro, S. (2018b). Effectiveness evaluation with a subset of topics: A practical approach. In The 41st international ACM SIGIR conference on research and development in information retrieval, (pp. 1145–1148). New York, NY, USA:ACM, SIGIR ’18. https://doi.org/10.1145/3209978.3210108
Rose, D. E., & Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web, (pp. 13–19). New York, NY:ACM Press.
Sakai, T. (2007). Alternatives to bpref. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 71–78). New York, NY: ACM, SIGIR ’07. https://doi.org/10.1145/1277741.1277756
Sakai, T. (2014). Designing test collections for comparing many systems. In Proceedings of the 23rd CIKM 2014, (pp. 61–70).
Sakai, T. (2016a). Statistical significance, power, and sample sizes: A systematic review of SIGIR and TOIS, 2006-2015. In Proceedings of the 39th SIGIR, (pp. 5–14). ACM.
Sakai, T. (2016b). Topic set size design. Information Retrieval Journal, 19(3), 256–283.
Sanderson, M., & Soboroff, I. (2007). Problems with Kendall’s Tau. In Proceedings of the 30th SIGIR, (pp. 839–840).
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th SIGIR, (pp. 162–169).
Sheskin, D. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). Boca Raton: CRC Press.
Urbano, J. (2016). Test collection reliability: A study of bias and robustness to statistical assumptions via stochastic simulation. Information Retrieval Journal, 19(3), 313–350. https://doi.org/10.1007/s10791-015-9274-y.
Urbano, J., Marrero, M., & Martín, D. (2013). On the measurement of test collection reliability. In Proceedings of the 36th SIGIR, (pp. 393–402).
Urbano, J., & Nagler, T. (2018). Stochastic simulation of test collections: Evaluation scores. In The 41st international ACM SIGIR conference on research & development in information retrieval, (pp. 695–704). New York, NY, USA: ACM, SIGIR ’18. https://doi.org/10.1145/3209978.3210043.
Voorhees, E., & Buckley, C. (2002). The effect of topic set size on retrieval experiment error. In Proceedings of the 25th SIGIR, (pp. 316–323).
Webber, W., Moffat, A., & Zobel, J. (2008). Statistical power in retrieval experimentation. In Proceedings of the 17th CIKM, (pp. 571–580).
We thank several colleagues for useful discussions: Jamie Callan, Ben Carterette, Bruce Croft, David Lewis, Ian Soboroff, and Ellen Voorhees. Virgil Pavlu provided useful information on statMAP. The reviewers provided very insightful and detailed remarks that helped to improve the paper. This work was supported by the Australian Research Council’s Discovery Projects Schemes (DP170102231 & DP170102726).
Roitero, K., Culpepper, J.S., Sanderson, M. et al. Fewer topics? A million topics? Both?! On topics subsets in test collections. Inf Retrieval J 23, 49–85 (2020). https://doi.org/10.1007/s10791-019-09357-w
- Retrieval evaluation
- Few topics
- Statistical significance
- Topic clustering