How do interval scales help us with better understanding IR evaluation measures?

Abstract

Evaluation measures are the basis for quantifying the performance of IR systems, and the way in which their values can be processed in statistical analyses depends on the scales on which these measures are defined. For example, mean and variance should be computed only for measures on an interval scale. In our previous work we defined a theory of IR evaluation measures, based on the representational theory of measurement, which allowed us to determine whether and when IR measures are interval scales. We found that common set-based retrieval measures, namely precision, recall, and F-measure, are always interval scales in the case of binary relevance, while this does not hold in the multi-graded relevance case. In the case of rank-based retrieval measures, namely AP, gRBP, DCG, and ERR, only gRBP is an interval scale, provided that a specific value of the parameter p is chosen and a specific total order among systems is defined, while none of the other measures is an interval scale. In this work, we build on our previous findings and carry out an extensive evaluation, based on standard TREC collections, to study how our theoretical findings impact experimental ones. In particular, we conduct a correlation analysis to study the relationship among the above-mentioned state-of-the-art evaluation measures and their scales. We study how the scales of evaluation measures affect non-parametric and parametric statistical tests for multiple comparisons of IR system performance. Finally, we analyse how incomplete information and pool downsampling affect the different scales and evaluation measures.
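As a concrete illustration of the point about scales and statistical tests, the following is a minimal sketch, not the paper's experimental setup: it contrasts a parametric test, which relies on means and variances and thus presupposes an interval scale, with a non-parametric one that uses only the ordering of the scores. The system names and the synthetic per-topic scores are purely hypothetical.

```python
# Minimal sketch (synthetic data, not the paper's setup): a parametric and a
# non-parametric test on hypothetical per-topic scores of three systems.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-topic scores in [0, 1] for three invented systems over 50 topics.
scores = {
    "sys_A": np.clip(rng.normal(0.30, 0.10, 50), 0.0, 1.0),
    "sys_B": np.clip(rng.normal(0.33, 0.10, 50), 0.0, 1.0),
    "sys_C": np.clip(rng.normal(0.40, 0.10, 50), 0.0, 1.0),
}

# One-way ANOVA relies on means and variances of the scores, so it presupposes
# that the measure is (at least) an interval scale.
f_stat, p_anova = stats.f_oneway(*scores.values())

# Kruskal-Wallis uses only the ranks of the scores, so it remains applicable
# when the measure is merely ordinal.
h_stat, p_kw = stats.kruskal(*scores.values())

print(f"ANOVA:          F = {f_stat:.3f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_kw:.4f}")
```

On a measure that is only ordinal, strictly speaking only the rank-based test is justified.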


Notes

  1.

    We also experimented with the weights \(W_1 = [0, 1, 2]\), i.e. exactly the same weights used in the case of RBTO; this produced very similar experimental results, which are omitted for space reasons, and we therefore report DCG and ERR with their standard weights (see the gain-weight sketch after these notes).

  2.

    Note that averaging Kendall’s \(\tau \) values implicitly assumes them to be on an interval scale, and determining whether Kendall’s \(\tau \) is an interval scale goes beyond the scope of this paper. In the following, we consider the averaged Kendall’s \(\tau \) value mostly as a proxy to check whether all the topic-by-topic values are 1, i.e. whether we have an interval scale, or not (see the Kendall’s \(\tau \) sketch below).
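To make Note 1 concrete, here is a minimal sketch of DCG computed under two gain mappings: a common exponential gain \(2^{rel} - 1\) and the weights \(W_1 = [0, 1, 2]\), i.e. the grade itself. The relevance grades in the example and the choice of the "standard" mapping are illustrative assumptions, not necessarily the exact configuration used in the experiments.

```python
# Minimal sketch (illustrative grades and gain mappings): DCG under a common
# exponential gain 2^rel - 1 versus the weights W1 = [0, 1, 2].
import math

def dcg(gains):
    """Discounted cumulated gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

# Graded relevance of the top-5 retrieved documents (0 = non relevant,
# 1 = relevant, 2 = highly relevant); purely illustrative.
rels = [2, 0, 1, 2, 0]

dcg_exponential = dcg([2 ** r - 1 for r in rels])   # gains 2^rel - 1
W1 = [0, 1, 2]
dcg_w1 = dcg([W1[r] for r in rels])                 # gains taken from W1

print(f"DCG with exponential gains: {dcg_exponential:.3f}")
print(f"DCG with W1 = [0, 1, 2]:    {dcg_w1:.3f}")
```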
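And a minimal sketch of the proxy described in Note 2, on hypothetical data: Kendall's \(\tau \) is computed topic by topic between the system rankings induced by two measures, then averaged and checked against 1.

```python
# Minimal sketch (hypothetical data): topic-by-topic Kendall's tau between the
# system rankings induced by two measures, averaged over topics.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_topics, n_systems = 50, 10

# Hypothetical per-topic scores of each system under two evaluation measures;
# measure_b is a noisy, correlated variant of measure_a.
measure_a = rng.random((n_topics, n_systems))
measure_b = measure_a + rng.normal(0.0, 0.05, size=(n_topics, n_systems))

taus = []
for t in range(n_topics):
    tau, _ = kendalltau(measure_a[t], measure_b[t])
    taus.append(tau)

print(f"averaged Kendall's tau over topics: {np.mean(taus):.3f}")
print(f"all topic-by-topic values equal to 1: {all(np.isclose(t, 1.0) for t in taus)}")
```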

Author information

Corresponding author

Correspondence to Nicola Ferro.

Cite this article

Ferrante, M., Ferro, N. & Losiouk, E. How do interval scales help us with better understanding IR evaluation measures? Inf Retrieval J 23, 289–317 (2020). https://doi.org/10.1007/s10791-019-09362-z

Keywords

  • Experimentation
  • Representational theory of measurement
  • Interval scale
  • IR evaluation measure
  • Formal framework