Information Retrieval Evaluation with Partial Relevance Judgment
Mean average precision has been widely used in information retrieval evaluation events such as TREC, and it is believed to be a good system measure because of its sensitivity and reliability. However, its drawbacks with respect to partial relevance judgment have been largely ignored. In many cases, partial relevance judgment is probably the only reasonable option because of the large document collections involved.
In this paper, we address this issue through analysis and experiment. Our investigation shows that when only partial relevance judgment is available, mean average precision suffers from several drawbacks: inaccurate values, the lack of a clear interpretation, and sensitivity to the evaluation environment. Furthermore, mean average precision is not superior to some other measures, such as precision at a given document cutoff, in sensitivity and reliability, both of which are believed to be its major advantages. Our experiments also suggest that average precision over all documents would be a good measure in such a situation.
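As a point of reference for the measures compared above, the following sketch computes precision at a document cutoff, average precision, and mean average precision for binary relevance judgments. The function names are illustrative, not from the paper; note that unjudged documents are treated as non-relevant, which is precisely the assumption that partial relevance judgment calls into question.

```python
def precision_at(ranking, n):
    """Precision at document cutoff n (P@n).

    ranking: list of 0/1 relevance labels in retrieved order,
    with unjudged documents counted as non-relevant (0).
    """
    return sum(ranking[:n]) / n

def average_precision(ranking, total_relevant):
    """Average precision: mean of P@k over the ranks k that hold
    a relevant document, normalized by the total number of
    relevant documents for the query."""
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / total_relevant

def mean_average_precision(rankings, totals):
    """MAP: average precision averaged over a set of queries."""
    aps = [average_precision(r, t) for r, t in zip(rankings, totals)]
    return sum(aps) / len(aps)
```

For example, a ranking `[1, 0, 1]` with two relevant documents in the collection yields AP = (1/1 + 2/3) / 2 = 5/6, while P@2 = 0.5; under partial judgment, `total_relevant` is itself only an estimate from the judged pool.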
Keywords: Information Retrieval · Relevant Document · Average Precision · Information Retrieval System · Relevance Judgment
- 1. Aslam, J.A., Yilmaz, E., Pavlu, V.: The maximum entropy method for analysing retrieval measures. In: Proceedings of ACM SIGIR 2005, Salvador, Brazil, pp. 27–34 (August 2005)
- 2. Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: Proceedings of ACM SIGIR 2000, Athens, Greece, pp. 33–40 (July 2000)
- 3. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of ACM SIGIR 2004, Sheffield, United Kingdom, pp. 25–32 (July 2004)
- 5. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)
- 9. Sanderson, M., Zobel, J.: Information retrieval system evaluation: Effort, sensitivity, and reliability. In: Proceedings of ACM SIGIR 2005, Salvador, Brazil, pp. 162–169 (August 2005)
- 11. TREC, http://trec.nist.gov/
- 12. van Rijsbergen, C.J.: Information Retrieval. Butterworths (1979)
- 14. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. In: Proceedings of ACM SIGIR 1998, Melbourne, Australia, pp. 315–323 (August 1998)
- 16. Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment error. In: Proceedings of ACM SIGIR 2002, Tampere, Finland, pp. 316–323 (August 2002)
- 17. Wu, S., McClean, S.: Modelling rank-probability of relevance relationship in resultant document list for data fusion (submitted for publication)
- 18. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of ACM SIGIR 1998, Melbourne, Australia, pp. 307–314 (August 1998)