Modeling Relevance as a Function of Retrieval Rank

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9994)


Batched evaluations in IR experiments are commonly built using relevance judgments formed over a sampled pool of documents. However, judgment coverage tends to be incomplete relative to the metrics being used to compute effectiveness, since collection size often makes it financially impractical to judge every document. As a result, a considerable body of work has arisen exploring the question of how to fairly compare systems in the face of unjudged documents. Here we consider the same problem from another perspective, and investigate the relationship between relevance likelihood and retrieval rank, seeking to identify plausible methods for estimating document relevance and hence computing an inferred gain. A range of models are fitted against two typical TREC datasets, and evaluated both in terms of their goodness of fit relative to the full set of known relevance judgments, and also in terms of their predictive ability when shallower initial pools are presumed, and extrapolated metric scores are computed based on models developed from those shallow pools.
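As a rough illustration of the idea described above, the sketch below fits a simple decay model of relevance likelihood against retrieval rank and then uses the fitted curve to assign an inferred gain to unjudged documents. Everything specific here is an assumption made for illustration: the power-law form a * rank^(-b), the hypothetical function names fit_power_law and inferred_rbp, and the choice of rank-biased precision as the metric. The abstract does not state which models or metrics are actually used.

```python
import math


def fit_power_law(ranks, rel_rates):
    """Fit rel_rate ~ a * rank**(-b) by least squares in log-log space.

    ranks     -- ranks (1-based) at which relevance rates were observed
    rel_rates -- fraction of judged documents at each rank that were relevant
    Returns (a, b). Ranks with a zero rate are skipped, since log(0) is undefined.
    """
    pts = [(math.log(k), math.log(r)) for k, r in zip(ranks, rel_rates) if r > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def inferred_rbp(judgments, a, b, p=0.8):
    """Rank-biased precision with inferred gain for unjudged documents.

    judgments -- list over ranks 1..n: 1 (relevant), 0 (not relevant),
                 or None (unjudged)
    Unjudged documents receive gain min(1, a * rank**(-b)) from the fitted model.
    """
    score = 0.0
    for i, judged in enumerate(judgments):
        rank = i + 1
        if judged is None:
            gain = min(1.0, a * rank ** (-b))
        else:
            gain = float(judged)
        score += (1 - p) * (p ** i) * gain
    return score


# Toy data (invented for this example, not taken from any TREC collection).
a, b = fit_power_law([1, 2, 4, 8], [0.8, 0.4, 0.2, 0.1])
score = inferred_rbp([1, None, 0], a, b, p=0.5)
```

In practice the model would be fitted on judgments from a shallow pool and the extrapolated scores compared against those computed from the deeper pool, which is the kind of predictive check the abstract describes.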



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Xiaolu Lu (1)
  • Alistair Moffat (2)
  • J. Shane Culpepper (1)

  1. RMIT University, Melbourne, Australia
  2. The University of Melbourne, Melbourne, Australia
