Tackling Biased Baselines in the Risk-Sensitive Evaluation of Retrieval Systems

  • B. Taner Dinçer
  • Iadh Ounis
  • Craig Macdonald
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)


The aim of optimising information retrieval (IR) systems using a risk-sensitive evaluation methodology is to minimise the risk of performing less effectively than a given baseline system on any particular topic. Baseline systems in this context determine the reference effectiveness for topics, relative to which the effectiveness of a given IR system in minimising the risk will be measured. However, the comparative risk-sensitive evaluation of a set of diverse IR systems – as attempted by the TREC 2013 Web track – is challenging, as the different systems under evaluation may be based upon a variety of different (base) retrieval models, such as learning to rank or language models. Hence, a question arises about how to properly measure the risk exhibited by each system. In this paper, we argue that no single model of information retrieval is representative enough in this respect to be a true reference for the models available in the current state-of-the-art, and we demonstrate, using the TREC 2012 Web track data, that as the baseline system changes, the resulting risk-based ranking of the systems changes significantly. Instead of using a particular system’s effectiveness as the reference effectiveness for topics, we propose several remedies, including the use of mean within-topic system effectiveness as a baseline, which is shown to enable unbiased measurements of the risk-sensitive effectiveness of IR systems.
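
To make the role of the baseline concrete, the following is a minimal sketch of how a risk-sensitive score might be computed against two kinds of baseline. The abstract does not state the measure used, so the formula here is an assumption for illustration: a per-topic gain over the baseline in which losses are weighted by (1 + alpha). The function names, the `alpha` parameter, and the per-topic scores are all hypothetical. The second function builds a mean within-topic baseline by averaging, for each topic, the scores of all systems under evaluation.

```python
from statistics import mean


def risk_adjusted_gain(system, baseline, alpha=1.0):
    """Mean per-topic gain over the baseline; losses count (1 + alpha) times.

    `system` and `baseline` map topic ids to effectiveness scores.
    This weighting scheme is an assumed, illustrative risk measure.
    """
    total = 0.0
    for topic, score in system.items():
        delta = score - baseline[topic]
        # Wins are counted as-is; losses are amplified by (1 + alpha).
        total += delta if delta > 0 else (1.0 + alpha) * delta
    return total / len(system)


def mean_within_topic_baseline(runs):
    """For each topic, average the scores of all systems on that topic."""
    topics = next(iter(runs.values())).keys()
    return {t: mean(run[t] for run in runs.values()) for t in topics}


def rank_systems(runs, baseline, alpha=1.0):
    """System names ordered from highest to lowest risk-adjusted gain."""
    scores = {
        name: risk_adjusted_gain(run, baseline, alpha)
        for name, run in runs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-topic effectiveness scores for three systems.
runs = {
    "sysA": {"t1": 0.50, "t2": 0.25, "t3": 0.75},
    "sysB": {"t1": 0.25, "t2": 0.50, "t3": 0.25},
    "sysC": {"t1": 0.75, "t2": 0.75, "t3": 0.50},
}

if __name__ == "__main__":
    mean_baseline = mean_within_topic_baseline(runs)
    print("vs. sysB as baseline:", rank_systems(runs, runs["sysB"]))
    print("vs. mean baseline:   ", rank_systems(runs, mean_baseline))
```

Comparing the output of `rank_systems` for a single-system baseline and for the mean within-topic baseline on real per-topic scores is one way to check whether a risk-based ranking depends on the baseline chosen.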


Keywords (machine-generated, not supplied by the authors): Information Retrieval; Query Expansion; Information Retrieval System; Retrieval Strategy; Baseline System





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • B. Taner Dinçer (1)
  • Iadh Ounis (2)
  • Craig Macdonald (2)

  1. Department of Statistics & Computer Engineering, Muğla University, Muğla, Turkey
  2. School of Computing Science, University of Glasgow, Glasgow, UK
