Evaluation of System Measures for Incomplete Relevance Judgment in IR
Incomplete relevance judgment has become the norm in major information retrieval evaluation events such as TREC, but its effect on system measures is not well understood. In this paper, we evaluate four system measures under incomplete relevance judgment: mean average precision, R-precision, normalized average precision over all documents, and normalized discounted cumulative gain. Among them, normalized average precision over all documents is newly introduced, and both mean average precision and R-precision are generalized for graded relevance judgment. These four measures share a common characteristic: complete relevance judgment is required to calculate their accurate values. We investigate these measures empirically through extensive experiments on TREC data to determine the effect of incomplete relevance judgment on them. From these experiments, we conclude that incomplete relevance judgment significantly affects the values of all four measures: with the pooling method used in TREC, the more incomplete the relevance judgment is, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, normalized discounted cumulative gain and normalized average precision over all documents are the most reliable but least sensitive, and R-precision lies in between.
Keywords: Relevant Document, Average Precision, Mean Average Precision, Information Retrieval System, Ranking Position
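To make the measures concrete, the following is a minimal sketch of how they can be computed for a single topic, assuming binary relevance and the standard TREC convention that unjudged documents are treated as non-relevant. The function names, the sample run, and the log2 rank discount are illustrative assumptions, not the paper's own implementation (in particular, the paper generalizes mean average precision and R-precision to graded relevance, which this binary sketch omits).

```python
# Illustrative sketch of the evaluation measures discussed above, assuming
# binary relevance judgments and the usual TREC convention that unjudged
# documents count as non-relevant. Not the authors' implementation.
import math

def average_precision(ranking, relevant):
    """AP: mean of precision values at the ranks of retrieved relevant docs,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """R-precision: precision at rank R, where R = number of relevant docs."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0

def ndcg(ranking, relevant):
    """nDCG with binary gains and one common discount, 1/log2(rank + 1)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranking, start=1) if doc in relevant)
    # Ideal DCG: all relevant documents ranked first, capped at list length.
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), len(ranking)) + 1))
    return dcg / ideal if ideal else 0.0

if __name__ == "__main__":
    run = ["d3", "d1", "d7", "d2", "d9"]   # one system's ranking for a topic
    qrels = {"d1", "d2", "d4"}             # judged-relevant documents
    print(average_precision(run, qrels))   # 0.333... (d4 never retrieved)
    print(r_precision(run, qrels))         # 0.333...
    print(ndcg(run, qrels))                # ~0.498
```

The sketch also shows why incompleteness matters: any relevant document missing from the judgment pool silently drops out of `qrels`, which shrinks the denominators of AP, R-precision, and the ideal DCG, and thus tends to raise the reported scores, consistent with the trend reported above.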