Fixed-Cost Pooling Strategies Based on IR Evaluation Measures

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10193)


Recent studies have reconsidered how the pooling method is operationalised, taking into account the practical limitations that test collection builders often encounter. The biggest constraint is usually the budget available for relevance assessments, and the question is how best to select the documents to be assessed given a fixed budget, where "best" means the lowest pool bias. Here, we evaluate three new pooling strategies introduced in this paper against three existing ones and a baseline. We show that there are significant differences depending on the evaluation measure ultimately used to assess the runs. We conclude that adaptive strategies are always best but, in their absence, for top-heavy evaluation measures we can continue to use the baseline, while for P@100 we should use any of the other non-adaptive strategies.
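To make the fixed-budget setting concrete, the sketch below shows one simple non-adaptive way to build an assessment pool under a budget: a round-robin traversal of the submitted runs that stops once the budget of judgable documents is reached. This is an illustrative baseline only; the function name and the strategy shown are assumptions for exposition, not the specific strategies proposed in the paper.

```python
def fixed_budget_pool(runs, budget):
    """Round-robin fixed-cost pooling (illustrative sketch).

    runs: list of ranked lists of document ids, one per system run.
    budget: maximum number of unique documents to send for assessment.
    Documents are taken one rank depth at a time across all runs,
    skipping duplicates, until the budget is exhausted.
    """
    pool, seen = [], set()
    depth, max_len = 0, max(len(r) for r in runs)
    while len(pool) < budget and depth < max_len:
        for run in runs:
            if depth < len(run) and run[depth] not in seen:
                seen.add(run[depth])
                pool.append(run[depth])
                if len(pool) == budget:
                    break
        depth += 1
    return pool
```

For example, with `runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"]]` and a budget of 3, the pool is `["d1", "d2", "d4"]`: depth 1 contributes d1 and d2, and at depth 2 the duplicate d2 is skipped in favour of d4. Adaptive strategies differ in that they reorder or reprioritise documents based on the relevance judgments collected so far, rather than fixing the traversal order in advance.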



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Institute of Software Technology and Interactive Systems, TU Wien, Vienna, Austria
  2. Faculty of Science and Engineering, Queensland University of Technology, Brisbane, Australia
