Information Retrieval Journal, Volume 18, Issue 5, pp 445–472

Pooling-based continuous evaluation of information retrieval systems

  • Alberto Tonon
  • Gianluca Demartini
  • Philippe Cudré-Mauroux

Abstract

The dominant approach to evaluating the effectiveness of information retrieval (IR) systems is by means of reusable test collections built following the Cranfield paradigm. In this paper, we propose a new IR evaluation methodology based on pooled test collections and on the continuous use of either crowdsourcing or professional editors to obtain relevance judgements. Instead of building a static collection for a finite set of systems known a priori, we propose an IR evaluation paradigm in which retrieval approaches are evaluated iteratively on the same collection. Each new retrieval technique obtains its own missing relevance judgements and hence contributes to augmenting the overall set of relevance judgements of the collection. We also propose two metrics, the Fairness Score and the opportunistic number of relevant documents, which we then use to define new pooling strategies. The goal of this work is to study the behavior of standard IR metrics, of IR system rankings, and of several pooling techniques in a continuous evaluation context by comparing continuous and non-continuous evaluation results on classic test collections. We use both standard and crowdsourced relevance judgements, and we run an actual continuous evaluation campaign over several existing IR systems.
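
To make the evaluation paradigm concrete, the sketch below outlines the continuous loop implied by the abstract: when a new system is evaluated, only the pooled documents it contributes that still lack judgements are sent to crowd workers or professional editors, and the resulting labels are added to the shared collection for reuse by later systems. This is a minimal Python illustration under our own assumptions; the class, the method names, and the pool depth are hypothetical and not taken from the paper.

# Minimal sketch of the continuous evaluation loop described in the abstract.
# All names (ContinuousCollection, judge, depth) are illustrative assumptions,
# not the authors' implementation.

from typing import Callable, Dict, List, Set, Tuple


class ContinuousCollection:
    """A test collection whose relevance judgements grow with each new system."""

    def __init__(self, topics: List[str]):
        self.topics = topics
        # (topic, doc_id) -> relevance label, accumulated across evaluation rounds
        self.qrels: Dict[Tuple[str, str], int] = {}

    def missing_judgements(self, run: Dict[str, List[str]], depth: int) -> Set[Tuple[str, str]]:
        """Pooled documents of a new run that have not been judged yet."""
        return {
            (topic, doc)
            for topic in self.topics
            for doc in run.get(topic, [])[:depth]
            if (topic, doc) not in self.qrels
        }

    def evaluate_new_system(self, run: Dict[str, List[str]],
                            judge: Callable[[str, str], int],
                            depth: int = 10) -> Dict[Tuple[str, str], int]:
        """Obtain only the judgements the new system is missing, then reuse all of them."""
        for topic, doc in self.missing_judgements(run, depth):
            # `judge` stands in for a crowdsourcing task or a professional editor.
            self.qrels[(topic, doc)] = judge(topic, doc)
        # Effectiveness metrics (e.g. MAP, nDCG) would then be computed against
        # the augmented qrels, from which subsequently evaluated systems also benefit.
        return self.qrels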

Keywords

Information retrieval evaluation · Crowdsourcing · Continuous evaluation · Pooling techniques

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Alberto Tonon (1)
  • Gianluca Demartini (2)
  • Philippe Cudré-Mauroux (1)

  1. University of Fribourg, Fribourg, Switzerland
  2. University of Sheffield, South Yorkshire, UK
