
An Intrinsic Framework of Information Retrieval Evaluation Measures

Conference paper in Intelligent Systems and Applications (IntelliSys 2023). Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 822).

Abstract

Information retrieval (IR) evaluation measures are cornerstones for determining the suitability and task performance efficiency of retrieval systems. Their metric and scale properties make it possible to compare one system against another and to establish differences or similarities between them. Based on the representational theory of measurement, this paper determines these properties by exploiting the information contained in a retrieval measure itself. It establishes the intrinsic framework of a retrieval measure, which is the common scenario when the domain set is not explicitly specified. A method to determine the metric and scale properties of any retrieval measure is provided, requiring knowledge of only some of its attained values. The method establishes three main categories of retrieval measures according to their intrinsic properties. Some common user-oriented and system-oriented evaluation measures are classified according to the presented taxonomy.


Notes

  1. Here, the commonly used term “IR evaluation metric” collides with the mathematical term “metric”, which is used later in this paper. To avoid this ambiguity, the rest of the paper refers to “IR evaluation metrics” as “IR evaluation measures”, reserving the term “metric” for its mathematical sense.

  2. Typically, a SERP includes content in a non-homogeneous manner, such as images, query suggestions, knowledge panels, etc. However, here we consider the classical ordered (or unordered) list of documents, since this is the common structure considered when the evaluation of ranking models is studied.

  3. The associated weak order, \(\preceq _f\), may be transformed into a total order by considering the following equivalence relation: \(\mathbf {\hat{r}_1} \sim _{f} \mathbf {\hat{r}_2} \Leftrightarrow f(\mathbf {\hat{r}_1}) = f(\mathbf {\hat{r}_2})\). Let \(\mathbf {R^{*}}\) be the set of equivalence classes, and let \(\mathbf {\hat{r}^{*}_1}\) and \(\mathbf {\hat{r}^{*}_2}\) be two elements of this set containing the individual system output rankings \(\mathbf {\hat{r}_1}\), \(\mathbf {\hat{r}_2} \in \textbf{R}\), respectively. The following ordering can be defined on \(\mathbf {R^{*}}\): \(\mathbf {\hat{r}^{*}_1} \preceq _{f}^{*} \mathbf {\hat{r}^{*}_2} \Leftrightarrow \mathbf {\hat{r}_1} \preceq _{f} \mathbf {\hat{r}_2}\). Then, \((\mathbf {R^{*}}, \preceq _{f}^{*})\) is called the reduction or quotient of \((\textbf{R}, \preceq _{f})\), where \(\preceq _{f}^{*}\) is well-defined and \((\mathbf {R^{*}}, \preceq _{f}^{*})\) is a totally ordered set [72]. (A concrete construction of this quotient is sketched below, after these notes.)

  4. Imagine hypothetical beings living on a two-dimensional surface, ignorant of the surrounding three-dimensional space (but endowed with a sense of Euclidean distance). These beings are local observers, whose view reaches only a two-coordinate Euclidean environment, \(\mathbb {R}^2\). The geometrical elements of this surface that such beings can observe or measure (essentially lengths) constitute what is called the intrinsic geometry of the surface. The intrinsic properties of the surface are those which depend exclusively on the surface itself.

  5. In basic algebra [36, 50, 51], f is an injective function if it maps distinct elements to distinct elements; formally, \(f(\mathbf {\hat{r}_1}) = f(\mathbf {\hat{r}_2})\) implies \(\mathbf {\hat{r}_1} = \mathbf {\hat{r}_2}\), \(\forall \mathbf {\hat{r}_1}\), \(\mathbf {\hat{r}_2} \in \textbf{R}\).

  6. As noted in Sect. 4, the intrinsic properties of a retrieval measure deduced with this framework are based on the RTM.
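To make Notes 2 and 3 concrete, the following minimal Python sketch (an illustration added here, not part of the paper's formalism) represents system output rankings as binary relevance vectors, takes precision at depth K as an example measure f, and builds the equivalence classes of \(\sim _f\) whose set \(\mathbf {R^{*}}\) forms the totally ordered quotient. The depth K and the choice of measure are assumptions of the example.

```python
from itertools import product

K = 3  # ranking depth (an assumption for this toy example)

def precision_at_k(ranking, k=K):
    """Example measure f: fraction of relevant documents among the top k."""
    return sum(ranking[:k]) / k

# R: the domain set, here all binary relevance vectors of length K.
R = list(product([0, 1], repeat=K))

# Group rankings into the equivalence classes of ~_f (Note 3):
# two rankings are equivalent iff f assigns them the same value.
classes = {}
for r in R:
    classes.setdefault(precision_at_k(r), []).append(r)

# The quotient R* is totally ordered by the attained values of f.
for value in sorted(classes):
    print(f"f = {value:.2f}: {classes[value]}")
```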

References

  1. Allan, J., Aslam, J., Belkin, N., Buckley, C., Callan, J., Croft, B., Dumais, S., Fuhr, N., Harman, D., Harper, D.J., et al.: Challenges in information retrieval and language modeling: report of a workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002. In: ACM SIGIR Forum, 1, pp. 31–47. ACM, New York, NY, USA (2003)
  2. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)
  3. Amigó, E., Gonzalo, J., Mizzaro, S.: What is my problem? Identifying formal tasks and metrics in data mining on the basis of measurement theory. IEEE Trans. Knowl. Data Eng. (2021)
  4. Amigó, E., Gonzalo, J., Verdejo, F.: A general evaluation measure for document organization tasks. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 643–652 (2013)
  5. Amigó, E., Mizzaro, S.: On the nature of information access evaluation metrics: a unifying framework. Inf. Retr. J. 23(3), 318–386 (2020)
  6. Azzopardi, L., Thomas, P., Craswell, N.: Measuring the utility of search engine result pages: an information foraging based measure. In: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 605–614 (2018)
  7. Baccianella, S., Esuli, A., Sebastiani, F.: Evaluation measures for ordinal regression. In: 2009 Ninth International Conference on Intelligent Systems Design and Applications, pp. 283–287. IEEE (2009)
  8. Belew, R.K.: Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press (2000)
  9. Blair, D.C.: Information Retrieval, 2nd ed., C.J. van Rijsbergen. London: Butterworths. JASIS 30(6), 374–375 (1979). https://doi.org/10.1002/asi.4630300621
  10. Bollmann, P.: Two axioms for evaluation measures in information retrieval. In: SIGIR, vol. 84, pp. 233–245. Citeseer (1984)
  11. Bollmann, P., Cherniavsky, V.S.: Measurement-theoretical investigation of the MZ-metric. In: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, pp. 256–267. Citeseer (1980)
  12. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 25–32 (2004)
  13. Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: ACM SIGIR Forum, 2, pp. 235–242. ACM, New York, NY, USA (2017)
  14. Busin, L., Mizzaro, S.: Axiometrics: an axiomatic approach to information retrieval effectiveness metrics. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, pp. 22–29 (2013)
  15. Büttcher, S., Clarke, C.L., Yeung, P.C., Soboroff, I.: Reliable information retrieval evaluation with incomplete and biased judgements. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 63–70 (2007)
  16. Carmel, D., Yom-Tov, E.: Estimating the query difficulty for information retrieval. Synth. Lect. Inf. Concepts, Retr., Serv. 2(1), 1–89 (2010)
  17. Carterette, B.: System effectiveness, user models, and user utility: a conceptual framework for investigation. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 903–912 (2011)
  18. Carterette, B.A.: Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Trans. Inf. Syst. (TOIS) 30(1), 1–34 (2012)
  19. Chapelle, O., Metlzer, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for graded relevance. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 621–630 (2009)
  20. Cleverdon, C.W.: The significance of the Cranfield tests on index languages. In: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12 (1991)
  21. Clinchant, S., Gaussier, E.: Is document frequency important for PRF? In: Conference on the Theory of Information Retrieval, pp. 89–100. Springer, Berlin (2011)
  22. Clinchant, S., Gaussier, E.: A theoretical analysis of pseudo-relevance feedback models. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, pp. 6–13 (2013)
  23. Cooper, W.S.: Expected search length: a single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. Am. Doc. 19(1), 30–41 (1968)
  24. Croft, W.B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice, vol. 520. Addison-Wesley Reading (2010)
  25. Do Carmo, M.P.: Differential Geometry of Curves and Surfaces: Revised and Updated, 2nd edn. Courier Dover Publications (2016)
  26. Fang, H.: An axiomatic approach to information retrieval. Technical report (2007)
  27. Fang, H., Tao, T., Zhai, C.: A formal study of information retrieval heuristics. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–56 (2004)
  28. Fang, H., Tao, T., Zhai, C.: Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst. (TOIS) 29(2), 1–42 (2011)
  29. Fang, H., Zhai, C.: An exploration of axiomatic approaches to information retrieval. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 480–487 (2005)
  30. Fang, H., Zhai, C.: Semantic term matching in axiomatic approaches to information retrieval. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–122 (2006)
  31. Ferrante, M., Ferro, N., Fuhr, N.: Towards meaningful statements in IR evaluation: mapping evaluation measures to interval scales. IEEE Access 9, 136182–136216 (2021)
  32. Ferrante, M., Ferro, N., Fuhr, N.: Response to Moffat's comment on "Towards meaningful statements in IR evaluation: mapping evaluation measures to interval scales" (2022). https://doi.org/10.48550/ARXIV.2212.11735
  33. Ferrante, M., Ferro, N., Pontarollo, S.: A general theory of IR evaluation measures. IEEE Trans. Knowl. Data Eng. 31(3), 409–422 (2018)
  34. Ferro, N., Peters, C.: Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, vol. 41. Springer, Berlin (2019)
  35. Flach, P.: Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward. In: Proceedings of the AAAI Conference on Artificial Intelligence, 01, pp. 9808–9814 (2019)
  36. Fraleigh, J.B.: A First Course in Abstract Algebra. Pearson Education India (2003)
  37. Fréchet, M.M.: Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo (1884–1940) 22(1), 1–72 (1906)
  38. Fuhr, N.: Some common mistakes in IR evaluation, and how they can be avoided. In: ACM SIGIR Forum, 3, pp. 32–41. ACM, New York, NY, USA (2018)
  39. Gaudette, L., Japkowicz, N.: Evaluation methods for ordinal classification. In: Canadian Conference on Artificial Intelligence, pp. 207–210. Springer, Berlin (2009)
  40. Gauss, C.F.: Disquisitiones Generales Circa Superficies Curvas, vol. 1. Typis Dieterichianis (1828)
  41. Giner, F.: A comment to "A general theory of IR evaluation measures" (2023). arXiv:2303.16061
  42. Guccione, J.A.: Espacios métricos [Metric spaces]. Texto, Universidad de Buenos Aires (2018)
  43. Han, L., Roitero, K., Maddalena, E., Mizzaro, S., Demartini, G.: On transforming relevance scales. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 39–48 (2019)
  44. Hand, D.J.: Statistics and the theory of measurement. J. R. Stat. Soc. A. Stat. Soc. 159(3), 445–473 (1996)
  45. Harman, D.: Information retrieval evaluation. Synth. Lect. Inf. Concepts, Retr., Serv. 3(2), 1–119 (2011)
  46. Hauff, C., de Jong, F.: Retrieval system evaluation: automatic evaluation versus incomplete judgments. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 863–864 (2010)
  47. Hausdorff, F.: Set Theory, vol. 119. American Mathematical Soc. (2005)
  48. Huibers, T.W.C.: An axiomatic theory for information retrieval. Ph.D. thesis (1996)
  49. Hull, D.: Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 329–338 (1993)
  50. Hungerford, T.W.: Algebra, vol. 73. Springer Science & Business Media (2012)
  51. Jacobson, N.: Basic Algebra I. Courier Corporation (2012)
  52. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. (TOIS) 20(4), 422–446 (2002)
  53. Kando, N.: Information retrieval system evaluation using multi-grade relevance judgments: discussion on averageable single-numbered measures. IPSJ SIG Notes 63, 105–112 (2001)
  54. Karimzadehgan, M., Zhai, C.: Axiomatic analysis of translation language model for information retrieval. In: European Conference on Information Retrieval, pp. 268–280. Springer, Berlin (2012)
  55. Kazai, G.: Report of the INEX 2003 metrics working group. In: Initiative for the Evaluation of XML Retrieval (INEX): INEX 2003 Workshop Proceedings, Dagstuhl, Germany (2004)
  56. Kazai, G., Lalmas, M.: INEX 2005 evaluation measures. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds.) Advances in XML Information Retrieval and Evaluation, pp. 16–29. Springer, Berlin (2006)
  57. Kekäläinen, J., Järvelin, K.: Using graded relevance assessments in IR evaluation. J. Am. Soc. Inform. Sci. Technol. 53(13), 1120–1129 (2002)
  58. Korfhage, R.R.: Information Storage and Retrieval. Wiley, USA (1997)
  59. Krantz, D., Luce, D., Suppes, P., Tversky, A.: Foundations of Measurement, vol. I: Additive and Polynomial Representations (1971)
  60. Krantz, D.H.: Foundations of Measurement, vol. II: Geometrical, Threshold and Probabilistic Representations (1989)
  61. Luce, D., Krantz, D., Suppes, P., Tversky, A.: Foundations of Measurement, vol. III: Representation, Axiomatization, and Invariance (1990)
  62. Maddalena, E., Mizzaro, S.: Axiometrics: axioms of information retrieval effectiveness metrics. In: EVIA@NTCIR (2014)
  63. Michell, J.: Measurement scales and statistics: a clash of paradigms. Psychol. Bull. 100(3), 398 (1986)
  64. Michell, J.: An Introduction to the Logic of Psychological Measurement. Psychology Press (2014)
  65. Moffat, A.: Seven numeric properties of effectiveness metrics. In: Asia Information Retrieval Symposium, pp. 1–12. Springer, Berlin (2013)
  66. Moffat, A.: Batch evaluation metrics in information retrieval: measures, scales, and meaning. IEEE Access 10, 105564–105577 (2022)
  67. Moffat, A., Bailey, P., Scholer, F., Thomas, P.: Incorporating user expectations and behavior into the measurement of search effectiveness. ACM Trans. Inf. Syst. (TOIS) 35(3), 1–38 (2017)
  68. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. (TOIS) 27(1), 1–27 (2008)
  69. Montazeralghaem, A., Zamani, H., Shakery, A.: Axiomatic analysis for improving the log-logistic feedback model. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 765–768 (2016)
  70. Pollock, S.M.: Measures for the comparison of information retrieval systems. Am. Doc. 19(4), 387–397 (1968)
  71. Rahimi, R., Montazeralghaem, A., Shakery, A.: An axiomatic approach to corpus-based cross-language information retrieval. Inf. Retr. J. 23(3), 191–215 (2020)
  72. Roberts, F.S.: Measurement theory. Encycl. Math. Appl. 7 (1985)
  73. Robertson, S.: On GMAP: and other transformations. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 78–83 (2006)
  74. Robertson, S.: On the history of evaluation in IR. J. Inf. Sci. 34(4), 439–456 (2008)
  75. Rocchio, J.: Performance indices for document retrieval systems. In: Information Storage and Retrieval, p. 83 (1964)
  76. Rosset, C., Mitra, B., Xiong, C., Craswell, N., Song, X., Tiwary, S.: An axiomatic approach to regularizing neural ranking models. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 981–984 (2019)
  77. Sagara, Y.: Performance measures for ranked output retrieval systems. J. Jpn. Soc. Inf. Knowl. 12(2), 22–36 (2002)
  78. Sakai, T.: New performance metrics based on multigrade relevance: their application to question answering. In: NTCIR (2004)
  79. Sakai, T.: Metrics, statistics, tests. In: PROMISE Winter School, pp. 116–163. Springer, Berlin (2013)
  80. Sakai, T.: Statistical reform in information retrieval? In: ACM SIGIR Forum, vol. 48, pp. 3–12. ACM, New York, NY, USA (2014)
  81. Sakai, T.: On Fuhr's guideline for IR evaluation. In: ACM SIGIR Forum, vol. 54, pp. 1–8. ACM, New York, NY, USA (2021)
  82. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Inf. Retr. 11(5), 447–470 (2008)
  83. Sakai, T., Oard, D.W., Kando, N.: Evaluating Information Retrieval and Access Tasks: NTCIR's Legacy of Research Impact. Springer Nature (2021)
  84. Salton, G.: Automatic Information Organization and Retrieval (1968)
  85. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill (1983)
  86. Sanderson, M.: Test collection based evaluation of information retrieval systems. Found. Trends Inf. Retr. 4(4), 247–375 (2010)
  87. Savoy, J.: Statistical inference in retrieval effectiveness evaluation. Inf. Process. Manag. 33(4), 495–512 (1997)
  88. Sebastiani, F.: An axiomatically derived measure for the evaluation of classification algorithms. In: Proceedings of the 2015 International Conference on the Theory of Information Retrieval, pp. 11–20 (2015)
  89. Sirotkin, P.: On search engine evaluation metrics (2013). arXiv:1302.2318
  90. Stevens, S.S.: Mathematics, Measurement, and Psychophysics. Wiley, New York (1951)
  91. Stevens, S.S., et al.: On the Theory of Scales of Measurement. Bobbs-Merrill, College Division (1946)
  92. Swets, J.A.: Information retrieval systems. Science 141(3577), 245–250 (1963)
  93. Urbano, J., Lima, H., Hanjalic, A.: Statistical significance testing in information retrieval: an empirical analysis of type I, type II and type III errors. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 505–514 (2019)
  94. Van Rijsbergen, C.J.: Foundation of evaluation. J. Doc. 30(4), 365–373 (1974)
  95. Vanbelle, S., Albert, A.: A note on the linearly weighted kappa coefficient for ordinal scales. Stat. Methodol. 6(2), 157–163 (2009)
  96. Velleman, P.F., Wilkinson, L.: Nominal, ordinal, interval, and ratio typologies are misleading. Am. Stat. 47(1), 65–72 (1993)
  97. Voorhees, E.M.: The TREC 2005 robust track. In: ACM SIGIR Forum, vol. 40, pp. 41–48. ACM, New York, NY, USA (2006)
  98. Voorhees, E.M., Harman, D.K.: TREC: Experiment and Evaluation in Information Retrieval, vol. 63. Citeseer (2005)
  99. Voorhees, E.M., et al.: Overview of the TREC 2003 robust retrieval track. In: TREC, pp. 69–77 (2003)
  100. Wicaksono, A.F., Moffat, A.: Metrics, user models, and satisfaction. In: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 654–662 (2020)
  101. Zhang, F., Liu, Y., Li, X., Zhang, M., Xu, Y., Ma, S.: Evaluating web search with a bejeweled player model. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 425–434 (2017)


Author information

Correspondence to Fernando Giner.

A Appendix


1.1 A.1 Formal Proofs

Proof

(Proposition 1) Symmetry is trivially verified, since \(d_{f}(\mathbf {\hat{r}_1},\mathbf {\hat{r}_2})= \vert f(\mathbf {\hat{r}_1}) - f(\mathbf {\hat{r}_2}) \vert = \vert f(\mathbf {\hat{r}_2}) - f(\mathbf {\hat{r}_1}) \vert = d_{f}(\mathbf {\hat{r}_2},\mathbf {\hat{r}_1})\). The triangle inequality is also trivial, by the triangle inequality on the real numbers: \(\vert f(\mathbf {\hat{r}_1}) - f(\mathbf {\hat{r}_2}) \vert \le \vert f(\mathbf {\hat{r}_1}) - f(\mathbf {\hat{r}_3}) \vert + \vert f(\mathbf {\hat{r}_3}) - f(\mathbf {\hat{r}_2}) \vert \).   \(\square \)
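As a sanity check of Proposition 1, the following Python sketch (an illustration under the same toy setting introduced after the notes; the measure and depth K are assumptions of the example, not the paper's) verifies numerically that the induced distance \(d_f(\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}) = \vert f(\mathbf {\hat{r}_1}) - f(\mathbf {\hat{r}_2}) \vert \) is symmetric and satisfies the triangle inequality over all rankings of the toy domain.

```python
from itertools import product

K = 3
R = list(product([0, 1], repeat=K))  # toy domain: binary relevance vectors
f = lambda r: sum(r[:K]) / K         # example measure: precision at K

# The induced distance d_f of Proposition 1.
d = lambda x, y: abs(f(x) - f(y))

# Symmetry: d_f(x, y) = d_f(y, x) for all pairs.
assert all(d(x, y) == d(y, x) for x in R for y in R)

# Triangle inequality: d_f(x, y) <= d_f(x, z) + d_f(z, y) for all triples.
assert all(d(x, y) <= d(x, z) + d(z, y)
           for x in R for y in R for z in R)

print("d_f is symmetric and satisfies the triangle inequality on R")
```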

Proof

(Proposition 2) An interesting result about metric spaces [42] states the following: “Let \((\mathbf {R_2}, d_2)\) be a metric space and let \(f:\mathbf {R_1} \longrightarrow \mathbf {R_2}\) be an injective (one-to-one) function; then \((\mathbf {R_1}, d_1)\) is a metric space, where \(d_1(\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}) = d_2(f(\mathbf {\hat{r}_1}), f(\mathbf {\hat{r}_2}))\), \(\forall \mathbf {\hat{r}_1}\), \(\mathbf {\hat{r}_2} \in \mathbf {R_1}\)”.

In the retrieval scenario, \((\mathbf {R_2}, d_2) = (\mathbb {R}, \vert \cdot \vert )\), which is the metric space of the real line endowed with the usual norm (the absolute value). Let f be a one-to-one IR evaluation measure; from the previous result, it follows that \((\mathbf {R_1}, d_1) = (\textbf{R}, d_f)\) is a metric space, i.e., \(d_f\) verifies the three postulates of a metric.   \(\square \)
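The role of injectivity in Proposition 2 can be illustrated numerically. The sketch below (an illustration; both measures are toy choices of this example, not the paper's) contrasts precision, which is not one-to-one and hence yields only a pseudometric, with a binary-encoding measure that is injective on binary relevance vectors and therefore satisfies the identity of indiscernibles.

```python
from itertools import product

K = 3
R = list(product([0, 1], repeat=K))  # toy domain: binary relevance vectors

prec = lambda r: sum(r) / K  # precision: NOT injective on R
enc = lambda r: sum(b / 2 ** (i + 1) for i, b in enumerate(r))  # injective

def separates(f):
    """Identity of indiscernibles: d_f(x, y) = 0 only when x = y."""
    return all(abs(f(x) - f(y)) > 0 for x in R for y in R if x != y)

print("precision separates distinct rankings:", separates(prec))  # False
print("binary encoding separates them:", separates(enc))          # True
```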

Proof

(Proposition 3) First, the implication from right to left will be shown. Consider a metric ordinal scale, f, whose attained values are equally spaced.

An interval is called prime if \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}] = \{\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}\}\). It will first be shown that the function \(F(\textbf{x}, \textbf{y}) = \vert f(\textbf{x}) - f(\textbf{y}) \vert \) attains its minimum value on any prime interval.

Let \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_3}] = \{\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}, \mathbf {\hat{r}_3}\}\) be a non-prime interval, where \(\mathbf {\hat{r}_1} \preceq _f \mathbf {\hat{r}_2} \preceq _f \mathbf {\hat{r}_3}\); then \(f(\mathbf {\hat{r}_1}) \le f(\mathbf {\hat{r}_2}) \le f(\mathbf {\hat{r}_3})\), since f is an ordinal scale. Hence \(\vert f(\mathbf {\hat{r}_3}) - f(\mathbf {\hat{r}_1}) \vert = \vert f(\mathbf {\hat{r}_3}) - f(\mathbf {\hat{r}_2}) \vert + \vert f(\mathbf {\hat{r}_2}) - f(\mathbf {\hat{r}_1}) \vert \), i.e., the minimum value of F is not attained at \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_3}]\). In addition, the function F assigns the same value to every prime interval. Given a prime interval, \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}]\), consider one of its consecutive prime intervals, \([\mathbf {\hat{r}_2}, \mathbf {\hat{r}_3}]\), which exists since \(\preceq _f\) is a weak order (every pair of elements is comparable). These two prime intervals verify \(f(\mathbf {\hat{r}_1}) < f(\mathbf {\hat{r}_2}) < f(\mathbf {\hat{r}_3})\), since f is a metric, and the attained values of f are equally spaced. Thus, \(F(\mathbf {\hat{r}_1},\mathbf {\hat{r}_2}) = k \in \mathbb {R}^{+}\) for any prime interval \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}]\).

Now it will be shown that equally spaced intervals (not necessarily prime) are assigned equal differences. Consider any non-prime interval, \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_m}] = \{\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}, \ldots , \mathbf {\hat{r}_m}\}\). As f is a metric, it attains different values for different elements; thus, it can be assumed that \(f(\mathbf {\hat{r}_1}) < f(\mathbf {\hat{r}_2}) < \cdots < f(\mathbf {\hat{r}_{m-1}}) < f(\mathbf {\hat{r}_m})\). Then, every interval \([\mathbf {\hat{r}_i}, \mathbf {\hat{r}_{i+1}}]\), \(i=1, \ldots , m-1\), is a prime interval, and F attains its minimum at these intervals. As \(f(\mathbf {\hat{r}_m}) - f(\mathbf {\hat{r}_1}) = (f(\mathbf {\hat{r}_m}) - f(\mathbf {\hat{r}_{m-1}})) + \cdots + (f(\mathbf {\hat{r}_2}) - f(\mathbf {\hat{r}_1}))\) and \(f(\mathbf {\hat{r}_{i+1}}) - f(\mathbf {\hat{r}_i}) = k\) for \(1 \le i \le m-1\), it follows that \(f(\mathbf {\hat{r}_m}) - f(\mathbf {\hat{r}_1}) = k \cdot (m-1)\), which depends only on the span of the interval, not on the particular elements considered. Therefore, equally spaced intervals are assigned equal differences, i.e., f is an interval scale.

Finally, the other implication will be shown. Consider any prime interval, \([\mathbf {\hat{r}_1}, \mathbf {\hat{r}_2}]\), of \(\textbf{R}\). As f is an interval scale, equally spaced intervals are assigned equal differences, i.e., the value \(\vert f(\mathbf {\hat{r}_2}) - f(\mathbf {\hat{r}_1}) \vert \) is constant for every prime interval of \(\textbf{R}\); moreover, it is strictly positive. To see that the attained values are equally spaced, it suffices to check that distinct elements of \(\textbf{R}\) are assigned distinct values of f, which holds since f is a metric.   \(\square \)
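Proposition 3 can likewise be probed numerically. In the sketch below (an illustration; both measures are toy choices of this example), precision attains the equally spaced values \(0, 1/K, \ldots , 1\), consistent with interval-scale behavior on the quotient, whereas a DCG-style measure with log-discounted gains does not attain equally spaced values.

```python
import math
from itertools import product

K = 3
R = list(product([0, 1], repeat=K))  # toy domain: binary relevance vectors

def equally_spaced(values, tol=1e-9):
    """Check whether the attained values form an equally spaced grid."""
    v = sorted(set(values))
    gaps = [b - a for a, b in zip(v, v[1:])]
    return all(abs(g - gaps[0]) < tol for g in gaps)

prec = lambda r: sum(r) / K  # attains 0, 1/K, ..., 1
dcg = lambda r: sum(b / math.log2(i + 2) for i, b in enumerate(r))

print("precision values equally spaced:", equally_spaced(map(prec, R)))  # True
print("DCG-style values equally spaced:", equally_spaced(map(dcg, R)))   # False
```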


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Giner, F. (2024). An Intrinsic Framework of Information Retrieval Evaluation Measures. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2023. Lecture Notes in Networks and Systems, vol 822. Springer, Cham. https://doi.org/10.1007/978-3-031-47721-8_47
