Score Estimation, Incomplete Judgments, and Significance Testing in IR Evaluation

  • Sri Devi Ravana
  • Alistair Moffat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6458)


Comparative evaluations of information retrieval systems are often carried out using standard test corpora, and the sample topics and pre-computed relevance judgments that are associated with them. To keep experimental costs under control, partial relevance judgments are used rather than exhaustive ones, admitting a degree of uncertainty into the per-topic effectiveness scores being compared. Here we explore the design options that must be considered when planning such an experimental evaluation, with emphasis on how effectiveness scores are inferred from partial information.


Retrieval evaluation effectiveness metric pooling 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Aslam, J., Yilmaz, E.: Inferring document relevance from incomplete information. In: Proc. 2007 ACM CIKM Conf. Lisbon, Portugal, pp. 603–610 (November 2007)Google Scholar
  2. 2.
    Aslam, J.A., Pavlu, V., Yilmaz, E.: A statistical method for system evaluation using incomplete judgments. In: Proc. 29th ACM SIGIR Conf. Seattle, WA, pp. 541–548 (August 2006)Google Scholar
  3. 3.
    Bompada, T., Chang, C.C., Chen, J., Kumar, R., Shenoy, R.: On the robustness of relevance measures with incomplete judgments. In: Proc. 30th ACM SIGIR Conf. Amsterdam, pp. 359–366 (July 2007)Google Scholar
  4. 4.
    Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: Proc. 23rd ACM SIGIR Conf. Athens, Greece, pp. 33–40 (July 2000)Google Scholar
  5. 5.
    Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proc. 27th ACM SIGIR Conf. Sheffield, England, pp. 25–32 (July 2004)Google Scholar
  6. 6.
    Büttcher, S., Clarke, C.L.A., Yeung, P.C.K., Soboroff, I.: Reliable information retrieval evaluation with incomplete and biased judgements. In: Proc. 30th ACM SIGIR Conf. pp. 63–70 (July 2007)Google Scholar
  7. 7.
    Carterette, B., Smucker, M.D.: Hypothesis testing with incomplete relevance judgments. In: Proc. 2007 ACM CIKM Conf, Lisbon, Portugal, pp. 643–652 (November 2007)Google Scholar
  8. 8.
    Cormack, G.V., Lynam, T.R.: Validity and power of t-test for comparing MAP and GMAP. In: Proc. 30th ACM SIGIR Conf. pp. 753–754 (July 2007)Google Scholar
  9. 9.
    Hawking, D.: Overview of the TREC-9 Web Track. In: Proc. 9th Text REtrieval Conf. (TREC-9). Gaithersburg, Maryland (November 2000)Google Scholar
  10. 10.
    Huffman, S.B., Hochster, M.: How well does result relevance predict session satisfaction? In: Proc. 30th ACM SIGIR Conf. Amsterdam, pp. 567–574 (July 2007)Google Scholar
  11. 11.
    Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)CrossRefGoogle Scholar
  12. 12.
    Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems 27(1), 1–27 (2008)CrossRefGoogle Scholar
  13. 13.
    Sakai, T.: Evaluating evaluation metrics based on the bootstrap. In: Proc. 29th ACM SIGIR Conf. Seattle, WA, pp. 525–534 (August 2006)Google Scholar
  14. 14.
    Sakai, T.: Alternatives to Bpref. In: Proc. 30th ACM SIGIR Conf, Amsterdam, pp. 71–78 (July 2007)Google Scholar
  15. 15.
    Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval 11(5), 447–470 (2008)CrossRefGoogle Scholar
  16. 16.
    Sanderson, M., Zobel, J.: Information retrieval system evaluation: Effort, sensitivity, and reliability. In: Proc. 28th ACM SIGIR Conf. Salvador, Brazil, pp. 162–169 (August 2005)Google Scholar
  17. 17.
    Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance tests for information retrieval. In: Proc. 2007 ACM CIKM Conf, Lisbon, pp. 623–632 (November 2007)Google Scholar
  18. 18.
    Smucker, M.D., Allan, J., Carterette, B.: Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. In: Proc. 32nd ACM SIGIR Conf. Boston, MA, pp. 630–631 (July 2009)Google Scholar
  19. 19.
    Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: Proc. 29th ACM SIGIR Conf. pp. 11–18 (August 2006)Google Scholar
  20. 20.
    Voorhees, E.M., Harman, D.K.: TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge (2005)Google Scholar
  21. 21.
    Webber, W., Park, L.A.F.: Score adjustment for correction of pooling bias. In: Proc. 32nd ACM SIGIR Conf. Boston, MA, pp. 444–451 (July 2009)Google Scholar
  22. 22.
    Yilmaz, E., Kanoulas, E., Aslam, J.A.: A simple and efficient sampling method for estimating AP and NDCG. In: Proc. 31st ACM SIGIR Conf. Singapore, pp. 603–610 (July 2008)Google Scholar
  23. 23.
    Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proc. 21st ACM SIGIR Conf. Melbourne, Australia, pp. 307–314 (August 1998)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Sri Devi Ravana
    • 1
    • 2
  • Alistair Moffat
    • 1
  1. 1.Department of Computer Science and Software EngineeringThe University of MelbourneAustralia
  2. 2.University of MalayaMalaysia

Personalised recommendations