Selecting a Subset of Queries for Acquisition of Further Relevance Judgements

  • Mehdi Hosseini
  • Ingemar J. Cox
  • Natasa Milic-Frayling
  • Vishwa Vinay
  • Trevor Sweeting
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6931)

Abstract

Assessing the relative performance of search systems requires a test collection with a pre-defined set of queries and corresponding relevance assessments. The state-of-the-art process for constructing test collections uses a large number of queries and, for each query, selects a set of documents submitted by a group of participating systems to be judged. However, the initial set of judgements may be insufficient to reliably evaluate the performance of future, as-yet-unseen systems. In this paper, we propose a method that expands the set of relevance judgements as new systems are evaluated. We assume a limited budget for acquiring additional relevance judgements. From the documents retrieved by the new systems we create a pool of unjudged documents. Rather than distributing the budget uniformly across all queries, we first select a subset of queries that are effective in evaluating systems and then allocate the budget uniformly across only these queries. Experimental results on the TREC 2004 Robust track test collection demonstrate the superiority of this budget allocation strategy.
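To make the allocation step concrete, the following is a minimal Python sketch of the budget split described above. The query-selection step itself is not specified in the abstract, so the `selected` set, the `pools` structure, and the function name are assumptions for illustration only, not the authors' implementation.

```python
# Hedged sketch of the budget-allocation idea from the abstract:
# spread a fixed judging budget uniformly over a chosen subset of queries,
# drawing documents from each query's pool of unjudged documents.
from typing import Dict, List, Set


def allocate_budget(
    pools: Dict[str, List[str]],   # query id -> unjudged documents retrieved by new systems
    selected: Set[str],            # subset of queries chosen for further judging (method not shown)
    budget: int,                   # total number of additional relevance judgements available
) -> Dict[str, List[str]]:
    """Split the judging budget uniformly across the selected queries only."""
    per_query = budget // max(len(selected), 1)
    allocation = {}
    for qid in selected:
        # Take at most `per_query` unjudged documents from this query's pool;
        # the ordering of documents within a pool is left unspecified here.
        allocation[qid] = pools.get(qid, [])[:per_query]
    return allocation
```

Uniform per-query splitting mirrors the strategy stated in the abstract; any remainder left over from the integer division is simply unspent in this sketch.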

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mehdi Hosseini (1)
  • Ingemar J. Cox (1)
  • Natasa Milic-Frayling (2)
  • Vishwa Vinay (2)
  • Trevor Sweeting (1)
  1. University College London, UK
  2. Microsoft Research Cambridge, UK
