Information Retrieval Evaluation with Partial Relevance Judgment

  • Shengli Wu
  • Sally McClean
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4042)


Mean average precision has been widely used by researchers in information retrieval evaluation events such as TREC, and it is believed to be a good system measure because of its sensitivity and reliability. However, its drawbacks with respect to partial relevance judgment have been largely ignored. In many cases, partial relevance judgment is probably the only feasible option because of the large document collections involved.

In this paper, we address this issue through analysis and experiment. Our investigation shows that when only partial relevance judgment is available, mean average precision suffers from several drawbacks: inaccurate values, a lack of explicit interpretation, and sensitivity to the evaluation environment. Furthermore, mean average precision is not superior to some other measures, such as precision at a given document cutoff, in sensitivity or reliability, both of which are believed to be its major advantages. Our experiments also suggest that average precision over all documents would be a good measure in such a situation.
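The contrast between the two measures can be made concrete with a small sketch (not the authors' code; the convention of treating unjudged documents as non-relevant is the standard one, and the document identifiers are illustrative). It shows how average precision shifts when a relevant document happens to be unjudged, while precision at a fixed cutoff may be unaffected:

```python
def average_precision(ranking, judged_relevant):
    """AP over one ranked list: mean of precision at each relevant document's rank."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in judged_relevant:  # unjudged documents count as non-relevant
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(judged_relevant) if judged_relevant else 0.0

def precision_at(ranking, judged_relevant, k):
    """Precision at document level k."""
    return sum(1 for doc in ranking[:k] if doc in judged_relevant) / k

ranking = ["d1", "d2", "d3", "d4", "d5"]
full = {"d1", "d3", "d5"}    # complete judgments
partial = {"d1", "d3"}       # "d5" relevant but never judged

ap_full = average_precision(ranking, full)        # (1 + 2/3 + 3/5) / 3
ap_partial = average_precision(ranking, partial)  # (1 + 2/3) / 2
p3 = precision_at(ranking, partial, 3)            # 2/3, same as with full judgments
```

Here the missing judgment for `d5` changes the average-precision value, whereas precision at cutoff 3 is identical under full and partial judgments, illustrating why the paper questions mean average precision's advantage in this setting.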


Keywords: Information Retrieval · Relevant Document · Average Precision · Information Retrieval System · Relevance Judgment





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shengli Wu¹
  • Sally McClean¹
  1. School of Computing and Mathematics, University of Ulster, UK
