Seven Numeric Properties of Effectiveness Metrics

  • Alistair Moffat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8281)


Search effectiveness metrics quantify the relevance of the ranked document lists returned by retrieval systems. In this paper we characterize metrics according to seven numeric properties – boundedness, monotonicity, convergence, top-weightedness, localization, completeness, and realizability. We demonstrate that these properties partition the commonly-used evaluation metrics, and hence provide a framework in which the relationships between effectiveness metrics can be better understood, including their relative merits for different applications.


Effectiveness metric precision NDCG discounted cumulative gain rank-biased precision mean average precision reciprocal rank 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Aslam, J.A., Pavlu, V., Yilmaz, E.: A statistical method for system evaluation using incomplete judgments. In: Proc. SIGIR, Seattle, Washington, pp. 541–548 (2006)Google Scholar
  2. 2.
    Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: Proc. SIGIR, Athens, Greece, pp. 33–40 (2000)Google Scholar
  3. 3.
    Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proc. SIGIR, Sheffield, England, pp. 25–32 (2004)Google Scholar
  4. 4.
    Büttcher, S., Clarke, C.L.A., Cormack, G.V.: Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press (2010)Google Scholar
  5. 5.
    Carterette, B.: System effectiveness, user models, and user utility: A conceptual framework for investigation. In: Proc. SIGIR, Beijing, China, pp. 903–912 (2011)Google Scholar
  6. 6.
    Carterette, B., Kanoulas, E., Yilmaz, E.: Simulating simple user behavior for system effectiveness evaluation. In: Proc. CIKM, Glasgow, Scotland, pp. 611–620 (2011)Google Scholar
  7. 7.
    Chapelle, O., Metzler, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for graded relevance. In: Proc. CIKM, Hong Kong, China, pp. 621–630 (2009)Google Scholar
  8. 8.
    Demartini, G., Mizzaro, S.: A classification of IR effectiveness metrics. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 488–491. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  9. 9.
    Dupret, G., Piwowarski, B.: A user browsing model to predict search engine click data from past observations. In: Proc. SIGIR, Singapore, pp. 331–338 (2008)Google Scholar
  10. 10.
    Dupret, G., Piwowarski, B.: A user behavior model for average precision and its generalization to graded judgments. In: Proc. SIGIR, Geneva, Switzerland, pp. 531–538 (2010)Google Scholar
  11. 11.
    Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Sys. 20(4), 422–446 (2002)CrossRefGoogle Scholar
  12. 12.
    Losee, R.M.: Percent perfect performance (PPP). Inf. Proc. Man. 43(4), 1020–1029 (2007)CrossRefGoogle Scholar
  13. 13.
    Mizzaro, S.: The good, the bad, the difficult, and the easy: Something wrong with information retrieval evaluation? In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 642–646. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  14. 14.
    Moffat, A., Thomas, P., Scholer, F.: Users versus models: What observation tells us about effectiveness metrics. In: Proc. CIKM, San Francisco, California (to appear, 2013)Google Scholar
  15. 15.
    Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Sys. 27(1:2), 1–27 (2008)Google Scholar
  16. 16.
    Robertson, S.: On GMAP: and other transformations. In: Proc. CIKM, Arlington, Virginia, pp. 78–83 (2006)Google Scholar
  17. 17.
    Robertson, S.: A new interpretation of average precision. In: Proc. SIGIR, Singapore, pp. 689–690 (2008)Google Scholar
  18. 18.
    Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Inf. Ret. 11(5), 447–470 (2008)CrossRefGoogle Scholar
  19. 19.
    Sakai, T.: Alternatives to BPref. In: Proc. SIGIR, Amsterdam, Netherlands, pp. 71–78 (2007)Google Scholar
  20. 20.
    Sanderson, M., Zobel, J.: Information retrieval system evaluation: Effort, sensitivity, and reliability. In: Proc. SIGIR, Salvador, Brazil, pp. 162–169 (2005)Google Scholar
  21. 21.
    Smucker, M.D., Clarke, C.L.A.: Time-based calibration of effectiveness measures. In: Proc. SIGIR, Portland, Oregon, pp. 95–104 (2012)Google Scholar
  22. 22.
    Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: Proc. SIGIR, Seattle, Washington, pp. 11–18 (2006)Google Scholar
  23. 23.
    Webber, W., Moffat, A., Zobel, J.: Score standardization for inter-collection comparison of retrieval systems. In: Proc. SIGIR, Singapore, pp. 51–58 (2008)Google Scholar
  24. 24.
    Yilmaz, E., Aslam, J.A.: Estimating average precision with incomplete and imperfect judgments. In: Proc. CIKM, Arlington, Virginia, pp. 102–111 (2006)Google Scholar
  25. 25.
    Zobel, J., Moffat, A., Park, L.A.F.: Against recall: Is it persistence, cardinality, density, coverage, or totality? SIGIR Forum 43(1), 3–15 (2009)CrossRefGoogle Scholar
  26. 26.
    Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proc. SIGIR, Melbourne, Australia, pp. 307–314 (1998)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Alistair Moffat
    • 1
  1. 1.Department of Computing and Information SystemsThe University of MelbourneAustralia

Personalised recommendations