
Information Retrieval Journal, Volume 19, Issue 4, pp 416–445

The effect of pooling and evaluation depth on IR metrics

  • Xiaolu Lu
  • Alistair Moffat
  • J. Shane Culpepper

Abstract

Batch IR evaluations are usually performed in a framework that consists of a document collection, a set of queries, a set of relevance judgments, and one or more effectiveness metrics. A large number of evaluation metrics have been proposed, with two primary families having emerged: recall-based metrics and utility-based metrics. In both families, the pragmatics of forming judgments mean that it is usual to evaluate the metric to some chosen depth such as \(k=20\) or \(k=100\), without necessarily fully considering the ramifications associated with that choice. Our aim in this paper is to explore the relative risks arising with fixed-depth evaluation in the two families, and to document the complex interplay between metric evaluation depth and judgment pooling depth. Using a range of TREC resources including NewsWire data and the ClueWeb collection, we: (1) examine the implications of finite pooling on the subsequent usefulness of different test collections, including specifying options for truncated evaluation; and (2) determine the extent to which various metrics correlate with themselves when computed to different evaluation depths using those judgments. We demonstrate that the judgment pools constructed for the ClueWeb collections lack resilience, and are suited primarily to the application of top-heavy utility-based metrics rather than recall-based metrics; and that on the majority of the established test collections, and across a range of evaluation depths, recall-based metrics tend to be more volatile in the system rankings they generate than are utility-based metrics. That is, experimentation using utility-based metrics is more robust to choices such as the evaluation depth employed than is experimentation using recall-based metrics. This distinction should be noted by researchers as they plan and execute system-versus-system retrieval experiments.
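To make the depth-truncation issue concrete, the following sketch (not code from the paper) contrasts a top-heavy utility-based metric, rank-biased precision (RBP), with a recall-based metric, average precision (AP), when both are computed only to an evaluation depth k. The run, the judgment pool, and the persistence parameter p = 0.8 are hypothetical; judgments are assumed binary, and unjudged documents are treated as non-relevant, as is conventional when working with pooled qrels.

```python
# Minimal sketch (hypothetical data, standard metric formulations):
# truncating a utility-based metric (RBP) and a recall-based metric (AP)
# at an evaluation depth k for a single query.

def rbp_at_k(ranking, relevant, k, p=0.8):
    """Rank-biased precision truncated at depth k, persistence parameter p."""
    gain = sum(p ** i for i, doc in enumerate(ranking[:k]) if doc in relevant)
    return (1 - p) * gain

def ap_at_k(ranking, relevant, k):
    """Average precision truncated at depth k, normalised by the pool size |relevant|."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical ranked run and binary judgment pool for one query.
run = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d6", "d5", "d0"]
qrels = {"d3", "d9", "d2", "d5"}

for k in (5, 10):
    print(f"k={k}: RBP={rbp_at_k(run, qrels, k):.3f}  AP@{k}={ap_at_k(run, qrels, k):.3f}")
```

With this hypothetical data, AP moves from 0.375 at k = 5 to 0.593 at k = 10, while RBP moves only from 0.302 to 0.388: the recall-based score is considerably more sensitive to the chosen evaluation depth, which is the kind of volatility the paper investigates across systems and test collections.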

Keywords

Evaluation metrics comparison · Pooling and evaluation depth · Experimentation

Acknowledgments

This work was supported by the Australian Research Council’s Discovery Projects Scheme (DP140101587). Shane Culpepper is the recipient of an Australian Research Council DECRA Research Fellowship (DE140100275).


Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. RMIT University, Melbourne, Australia
  2. The University of Melbourne, Melbourne, Australia
