Balancing Exploration and Exploitation in Learning to Rank Online

  • Katja Hofmann
  • Shimon Whiteson
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6611)


As retrieval systems become more complex, learning to rank approaches are being developed to automatically tune their parameters. Using online learning to rank approaches, retrieval systems can learn directly from implicit feedback, while they are running. In such an online setting, algorithms need to both explore new solutions to obtain feedback for effective learning, and exploit what has already been learned to produce results that are acceptable to users. We formulate this challenge as an exploration-exploitation dilemma and present the first online learning to rank algorithm that works with implicit feedback and balances exploration and exploitation. We leverage existing learning to rank data sets and recently developed click models to evaluate the proposed algorithm. Our results show that finding a balance between exploration and exploitation can substantially improve online retrieval performance, bringing us one step closer to making online learning to rank work in practice.
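The trade-off described above can be illustrated with a minimal sketch. It assumes a simple scheme in which each position of the result list is filled from an exploratory ranking with probability `k` and from an exploitative ranking otherwise; the function names, the parameter `k`, and the document identifiers are hypothetical and are not taken from the abstract, so this should not be read as the authors' algorithm.

```python
import random


def next_unseen(ranking, shown):
    """Return the highest-ranked document not yet shown, or None."""
    for doc in ranking:
        if doc not in shown:
            return doc
    return None


def mix_result_lists(exploit, explore, k, length, rng=None):
    """Build a result list of up to `length` documents.

    Assumption (illustrative only): each position is drawn from the
    exploratory ranking with probability k, else from the exploitative
    ranking. k = 0 is pure exploitation, k = 1 is pure exploration.
    """
    rng = rng or random.Random()
    result = []
    while len(result) < length:
        if rng.random() < k:
            first, second = explore, exploit
        else:
            first, second = exploit, explore
        doc = next_unseen(first, result)
        if doc is None:
            # Chosen ranking is exhausted; fall back to the other one.
            doc = next_unseen(second, result)
        if doc is None:
            break  # both rankings exhausted
        result.append(doc)
    return result


if __name__ == "__main__":
    exploit = ["d1", "d2", "d3", "d4"]  # ranking from the current best solution
    explore = ["d3", "d1", "d5", "d2"]  # ranking from a candidate solution
    print(mix_result_lists(exploit, explore, k=0.5, length=4, rng=random.Random(0)))
```

Under these assumptions, a small `k` keeps the list shown to users close to what has already been learned, while a larger `k` exposes more of the candidate ranking and so gathers more feedback about it.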


Keywords: Online Learning, Result List, Implicit Feedback, Cumulative Reward, Document List
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Katja Hofmann (1)
  • Shimon Whiteson (1)
  • Maarten de Rijke (1)
  1. ISLA, University of Amsterdam, Netherlands
