Practical Online Retrieval Evaluation

  • Filip Radlinski
  • Katja Hofmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7814)


Online evaluation allows the assessment of information retrieval (IR) techniques based on how real users respond to them. Because this technique is directly based on observed user behavior, it is a promising alternative to traditional offline evaluation, which is based on manual relevance assessments. In particular, online evaluation can enable comparisons in settings where reliable assessments are difficult to obtain (e.g., personalized search) or expensive (e.g., for search by trained experts in specialized collections).

Despite its advantages and its successful use in commercial settings, online evaluation is rarely employed outside large commercial search engines because of a perception that it is impractical at small scales. The goal of this tutorial is to show how online evaluations can be conducted in such settings, demonstrate software that facilitates their use, and promote further research in the area. We will also contrast online evaluation with standard offline evaluation and provide an overview of online approaches.


Keywords: Interleaving, Clicks, Search Engine, Online Evaluation





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Filip Radlinski (1)
  • Katja Hofmann (2)

  1. Microsoft, Cambridge, UK
  2. ISLA, University of Amsterdam, Amsterdam, The Netherlands
