Abstract
Incomplete relevance judgment has become the norm in major information retrieval evaluation events such as TREC, but its effect on system measures is not well understood. In this paper, we evaluate four system measures under incomplete relevance judgment: mean average precision, R-precision, normalized average precision over all documents, and normalized discounted cumulative gain. Among them, normalized average precision over all documents is newly introduced, and both mean average precision and R-precision are generalized to graded relevance judgment. These four measures share a common characteristic: complete relevance judgment is required to calculate their accurate values. We empirically investigate these measures through extensive experiments on TREC data, aiming to determine the effect of incomplete relevance judgment on them. From these experiments, we conclude that incomplete relevance judgment significantly affects the values of all four measures. When the pooling method of TREC is used, the more incomplete the relevance judgment is, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, normalized discounted cumulative gain and normalized average precision over all documents are the most reliable but least sensitive measures, while R-precision lies in between.
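To make the measures under discussion concrete, the following is a minimal sketch of three of them for a single query: average precision (MAP is its mean over queries), R-precision, and normalized discounted cumulative gain. The function names and the log2 discount are assumptions for illustration, not the paper's exact formulations; in particular, the paper's generalizations to graded relevance and its normalized-average-precision-over-all-documents measure are not reproduced here.

```python
import math

def average_precision(ranking, relevant):
    """Average precision of one ranked list of doc ids against a set of relevant doc ids."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at each rank where a relevant doc appears
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the total number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r

def ndcg(ranking, grades, k=None):
    """Normalized discounted cumulative gain with a log2 rank discount.

    `grades` maps doc id -> graded relevance (unjudged docs default to 0,
    which is exactly the assumption incomplete judgment forces)."""
    k = k or len(ranking)
    dcg = sum(grades.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(ranking[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

All three sketches depend on the full judged set (`relevant` or `grades`); when judgments are incomplete, unjudged documents are silently treated as non-relevant, which is the source of the measurement error the paper studies.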
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Wu, S., McClean, S. (2006). Evaluation of System Measures for Incomplete Relevance Judgment in IR. In: Larsen, H.L., Pasi, G., Ortiz-Arroyo, D., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2006. Lecture Notes in Computer Science(), vol 4027. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11766254_21
Print ISBN: 978-3-540-34638-8
Online ISBN: 978-3-540-34639-5