Evaluation of System Measures for Incomplete Relevance Judgment in IR

  • Shengli Wu
  • Sally McClean
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4027)


Incomplete relevance judgment has become the norm in major information retrieval evaluation events such as TREC, but its effect on some system measures is not well understood. In this paper, we evaluate four system measures under incomplete relevance judgment: mean average precision, R-precision, normalized average precision over all documents, and normalized discounted cumulative gain. Among them, normalized average precision over all documents is newly introduced, and both mean average precision and R-precision are generalized for graded relevance judgment. These four measures share a common characteristic: complete relevance judgment is required to calculate their accurate values. We empirically investigate these measures through extensive experimentation with TREC data, aiming to find the effect of incomplete relevance judgment on them. From these experiments, we conclude that incomplete relevance judgment significantly affects the values of all four measures. When the pooling method is used in TREC, the more incomplete the relevance judgment is, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, that normalized discounted cumulative gain and normalized average precision over all documents are the most reliable but least sensitive measures, and that R-precision lies in between.
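To make the dependence on complete judgments concrete, the following minimal sketch computes the standard binary forms of average precision and R-precision for a single topic, once with a full set of relevant documents and once with one relevant document left unjudged. It does not implement the graded generalizations or the normalized measure described in the paper; the function names, document identifiers, and judgment sets are hypothetical, and unjudged documents are assumed to count as non-relevant.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Binary average precision for one topic.

    Documents absent from `relevant` (including unjudged ones) are
    treated as non-relevant, which is an assumption of this sketch.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Binary R-precision: precision at rank R, where R = number of known relevant documents."""
    if not relevant:
        return 0.0
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


# Hypothetical toy data: one ranked result list and two judgment sets.
ranking = ["d1", "d2", "d3", "d4", "d5"]
complete = {"d1", "d2", "d5"}   # all relevant documents are known
incomplete = {"d1", "d2"}       # d5 was never judged (e.g. it fell outside the pool)

for name, judged in (("complete", complete), ("incomplete", incomplete)):
    print(name, round(average_precision(ranking, judged), 4),
          round(r_precision(ranking, judged), 4))
```

In this toy case both values rise when the low-ranked relevant document is missing from the judgments, because the denominator shrinks. This is a single constructed example and says nothing about how often that happens on real TREC data.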


Relevant Document; Average Precision; Mean Average Precision; Information Retrieval System; Ranking Position
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shengli Wu (1)
  • Sally McClean (1)
  1. School of Computing and Mathematics, University of Ulster, UK
