Test Collection-Based IR Evaluation Needs Extension toward Sessions – A Case of Extremely Short Queries

  • Heikki Keskustalo
  • Kalervo Järvelin
  • Ari Pirkola
  • Tarun Sharma
  • Marianne Lykke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5839)


There is overwhelming evidence that real users of IR systems often prefer extremely short queries (one or two individual words) but try out several queries if needed. Such behavior is fundamentally different from the process modeled in traditional test collection-based IR evaluation, which assumes more verbose queries and only one query per topic. In the present paper, we propose an extension to test collection-based evaluation: we utilize sequences of short queries based on empirically grounded but idealized session strategies. We employ TREC data and had test persons suggest search words, while simulating sessions based on the idealized strategies for repeatability and control. The experimental results show that, surprisingly, web-like very short queries (including one-word query sequences) typically lead to good enough results even in a TREC-type test collection. This finding explains the observed real user behavior: as a few very simple attempts normally lead to good enough results, there is no need to invest more effort. We conclude by discussing the consequences of our finding for IR evaluation.
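The simulated sessions described above can be sketched in a few lines: issue one-word queries from a candidate list, one at a time, and stop as soon as the top of the ranking contains a relevant document ("good enough") or a query budget is exhausted. This is a minimal illustration only; the function names (`run_query`), the stopping criterion, and the budget are assumptions for the sketch, not the authors' actual implementation.

```python
# Minimal sketch of an idealized one-word-query session strategy.
# Assumes a hypothetical retrieval function run_query(query) -> ranked
# list of document ids, and a set of known relevant documents (e.g.,
# from TREC qrels). All names here are illustrative.

def simulate_session(candidate_words, run_query, relevant_docs,
                     top_k=10, max_queries=5):
    """Issue one-word queries in order until the top-k results contain
    a relevant document, or the query budget runs out.

    Returns (attempt_number, successful_word) on success,
    or (None, None) if no query in the budget was good enough."""
    for attempt, word in enumerate(candidate_words[:max_queries], start=1):
        ranking = run_query(word)
        if any(doc in relevant_docs for doc in ranking[:top_k]):
            return attempt, word  # session succeeds on this attempt
    return None, None  # no good-enough result within the budget
```

A toy `run_query` backed by a dictionary is enough to exercise the stopping logic; in a real experiment it would wrap a retrieval system run against the test collection, and success would be scored with a graded measure rather than a binary top-k check.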







Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Heikki Keskustalo (1)
  • Kalervo Järvelin (1)
  • Ari Pirkola (1)
  • Tarun Sharma (1)
  • Marianne Lykke (2)
  1. University of Tampere, Finland
  2. Royal School of Library and Information Science, Denmark
