If I Had a Million Queries

  • Ben Carterette
  • Virgil Pavlu
  • Evangelos Kanoulas
  • Javed A. Aslam
  • James Allan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5478)


As document collections grow larger, the information needs and relevance judgments in a test collection must be well-chosen within a limited budget to give the most reliable and robust evaluation results. In this work we analyze a sample of queries categorized by length and corpus-appropriateness to determine the right proportion needed to distinguish between systems. We also analyze the appropriate division of labor between developing topics and making relevance judgments, and show that only a small, biased sample of queries with sparse judgments is needed to produce the same results as a much larger sample of queries.


Stability Level; Average Precision; Target Number; Test Collection; Relevance Judgment
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Ben Carterette (1)
  • Virgil Pavlu (2)
  • Evangelos Kanoulas (2)
  • Javed A. Aslam (2)
  • James Allan (3)
  1. Dept. of Computer and Info. Sciences, University of Delaware, Newark, USA
  2. College of Computer and Info. Science, Northeastern University, Boston, USA
  3. Dept. of Computer Science, University of Massachusetts Amherst, Amherst, USA
