Pooling-based continuous evaluation of information retrieval systems

Abstract

The dominant approach to evaluating the effectiveness of information retrieval (IR) systems is the use of reusable test collections built following the Cranfield paradigm. In this paper, we propose a new IR evaluation methodology based on pooled test collections and on the continuous use of either crowdsourcing or professional editors to obtain relevance judgements. Instead of building a static collection for a finite set of systems known a priori, we propose an IR evaluation paradigm in which retrieval approaches are evaluated iteratively on the same collection. Each newly evaluated retrieval technique is responsible for obtaining its missing relevance judgements and thereby contributes to augmenting the collection's overall set of judgements. We also propose two metrics, the Fairness Score and the opportunistic number of relevant documents, which we use to define new pooling strategies. The goal of this work is to study the behavior of standard IR metrics, of IR system rankings, and of several pooling techniques in a continuous evaluation context by comparing continuous and non-continuous evaluation results on classic test collections. We use both standard and crowdsourced relevance judgements, and we run an actual continuous evaluation campaign over several existing IR systems.
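As an illustration of the paradigm described above, the following minimal sketch simulates one round of a continuous evaluation campaign under simplifying assumptions: a new run is pooled to a fixed depth, only the still-unjudged documents are sent out for (crowdsourced or editorial) judgement, and the resulting labels are merged into the collection's growing qrels. All names (`pool_new_run`, `judge`, `POOL_DEPTH`) are hypothetical and do not correspond to the interface of the tools released by the authors.

```python
# Minimal sketch of one round of continuous evaluation. Names such as
# pool_new_run, judge, and POOL_DEPTH are illustrative only.

POOL_DEPTH = 10  # assumed pooling depth per topic

def pool_new_run(run, qrels, depth=POOL_DEPTH):
    """Return the (topic, doc) pairs in the top-`depth` of `run` that lack judgements."""
    missing = []
    for topic, ranking in run.items():
        for doc in ranking[:depth]:
            if doc not in qrels.get(topic, {}):
                missing.append((topic, doc))
    return missing

def judge(pairs):
    """Stand-in for crowdsourced or editorial judging; labels everything non-relevant."""
    return {pair: 0 for pair in pairs}

def continuous_round(run, qrels):
    """Judge only the new run's unjudged pooled documents and merge them into the qrels."""
    for (topic, doc), rel in judge(pool_new_run(run, qrels)).items():
        qrels.setdefault(topic, {})[doc] = rel
    return qrels

# Example: a collection with one judged topic, and a newly submitted run.
qrels = {"T1": {"d1": 1, "d2": 0}}
new_run = {"T1": ["d3", "d1", "d4"]}
print(continuous_round(new_run, qrels))
# {'T1': {'d1': 1, 'd2': 0, 'd3': 0, 'd4': 0}}
```

In a real campaign, `judge` would dispatch the missing pairs to a crowdsourcing platform or to professional editors; here it returns a constant label only so the example stays self-contained.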


Notes

  1. http://km.aifb.kit.edu/ws/semsearch10/ and http://km.aifb.kit.edu/ws/semsearch11/
  2. http://lemurproject.org/clueweb12/
  3. A similar situation already occurs for classic block evaluation initiatives such as TREC.
  4. For example, given a pre-defined budget of new relevance judgements that can be obtained or crowdsourced.
  5. http://km.aifb.kit.edu/ws/semsearch11/
  6. We have made available the raw data used in our experimental evaluation, our raw crowdsourcing results, and a set of tools for handling continuous evaluation campaigns with TREC-like evaluation collections at http://exascale.info/ceval.
  7. Three judgements per document, made by workers from the U.S. and aggregated by majority vote (a minimal aggregation sketch follows these notes).
  8. We assume that, during a continuous evaluation initiative, each group would have submitted all its runs together, since they most likely come from the same system tuned with different parameter values.
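
Note 7 describes how the crowdsourced labels were aggregated. The sketch below shows one straightforward way to implement such majority voting over an odd number of binary judgements per document; the function and data names are illustrative only and are not taken from the paper or its released tools.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate an odd number of binary relevance labels by majority vote."""
    assert len(labels) % 2 == 1, "use an odd number of judgements to avoid ties"
    return Counter(labels).most_common(1)[0][0]

# Example: three workers judged one (topic, document) pair.
votes = {("T1", "d3"): [1, 0, 1]}
print({pair: majority_vote(v) for pair, v in votes.items()})
# {('T1', 'd3'): 1}
```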


Author information

Corresponding author

Correspondence to Alberto Tonon.

About this article

Cite this article

Tonon, A., Demartini, G. & Cudré-Mauroux, P. Pooling-based continuous evaluation of information retrieval systems. Inf Retrieval J 18, 445–472 (2015). https://doi.org/10.1007/s10791-015-9266-y

Keywords

  • Information retrieval evaluation
  • Crowdsourcing
  • Continuous evaluation
  • Pooling techniques