On the Instability of Diminishing Return IR Measures

  • Conference paper
  • Appears in: Advances in Information Retrieval (ECIR 2021)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12656)
  • Conference series: European Conference on Information Retrieval

Abstract

The diminishing return property of ERR (Expected Reciprocal Rank) is highly intuitive and attractive: its user model says, for example, that after the users have found a highly relevant document at rank r, few of them will continue to examine rank (r+1) and beyond. Recently, another IR evaluation measure based on diminishing return called iRBU (intentwise Rank-Biased Utility) was proposed, and it was reported that nDCG (normalised Discounted Cumulative Gain) and iRBU align surprisingly well with users’ SERP (Search Engine Result Page) preferences. The present study conducts offline evaluations of diminishing return measures including ERR and iRBU, along with other popular measures such as nDCG, using four test collections and the associated runs from recent TREC tracks and NTCIR tasks. Our results show that the diminishing return measures generally underperform other graded relevance measures in terms of system ranking consistency across two disjoint topic sets as well as discriminative power. The results generalise a previous finding on ERR regarding its limited discriminative power, showing that the diminishing return user model hurts the stability of evaluation measures regardless of the utility function part of the measure. Hence, while we do recommend iRBU along with nDCG for evaluating adhoc IR systems from multiple user-oriented angles, iRBU should be used under the awareness that it can be much less statistically stable than nDCG.
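To make the diminishing return contrast concrete, the following is a minimal Python sketch (not the NTCIREVAL setup used in the paper) of ERR as defined by Chapelle et al. [9] and of a standard log2-discounted nDCG [18]; the 0–3 gain scale, the toy rankings, and the cutoff are illustrative assumptions only.

```python
# A minimal sketch contrasting a diminishing return measure (ERR, Chapelle et
# al. [9]) with nDCG [18] on toy data; not the paper's NTCIREVAL experiments.
import math

def err(gains, gmax):
    """Expected Reciprocal Rank: the probability that a user is satisfied at
    rank i is R_i = (2**g_i - 1) / 2**gmax, so a highly relevant document
    sharply reduces the chance that later ranks contribute anything
    (the diminishing return property)."""
    score, p_continue = 0.0, 1.0
    for i, g in enumerate(gains, start=1):
        r_i = (2 ** g - 1) / (2 ** gmax)
        score += p_continue * r_i / i
        p_continue *= 1.0 - r_i
    return score

def ndcg(gains, ideal_gains):
    """nDCG with the common 1/log2(rank+1) discount; every relevant document
    contributes regardless of what was ranked above it."""
    def dcg(gs):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gs, start=1))
    return dcg(gains) / dcg(sorted(ideal_gains, reverse=True))

# Toy example on a 0-3 relevance scale (gmax = 3): run B adds three more
# relevant documents below a perfect one, yet ERR barely rewards it.
ideal = [3, 3, 2, 1, 0]
run_a = [3, 0, 0, 0, 0]
run_b = [3, 3, 2, 1, 0]
for name, run in (("A", run_a), ("B", run_b)):
    print(name, "ERR=%.3f" % err(run, 3), "nDCG=%.3f" % ndcg(run, ideal))
# Output: A ERR=0.875 nDCG=0.474 ; B ERR=0.932 nDCG=1.000
```

This only illustrates the user models behind the measures; the paper's actual experiments use NTCIREVAL with the document cutoffs and gain settings described in the text.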

Notes

  1. Section 2 discusses an alternative framework for defining a family of measures [24].

  2. For example, the TREC 2014 Web Track used 20 as the document cutoff [13]; the NTCIR We Want Web tasks have used 10 [23].

  3. Topic set sizes can also be theoretically determined based on statistical power, given some pilot data for variance estimation [30]; a rough numerical sketch follows these notes.

  4. For example, Amigó et al. [2] refer to the correlation of system rankings across data sets as robustness.

  5. Not all adaptive measures are diminishing return measures. Moffat et al. [24] classify Reciprocal Rank (RR) as adaptive, but RR does not accommodate diminishing return: once a relevant document is found, there is no further return.

  6. http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html (version 200626).

  7. The relevance assessments of the four test collections we use in our experiments are expected to be incomplete: see the “rel. per topic” column in Table 2. Hence, using a large cutoff L probably would not give us reliable results.

  8. https://trec-core.github.io/2018/.

  9. https://lemurproject.org/clueweb12/.

  10. The search results for the first 80 topics (i.e., the reused WWW-2 topics) were copied from a run from the NTCIR-14 WWW-2 task [23], and the other 80 topics (i.e., the new WWW-3 test topics) were processed by a new system.

  11. http://research.nii.ac.jp/ntcir/tools/discpower-en.html (version 160507).

  12. https://waseda.box.com/ECIR2021PACK.
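As a rough illustration of footnote 3 (and not the exact noncentral-t-based method of [30]), the sketch below uses the textbook normal approximation for a paired comparison to turn pilot variance estimates into a topic set size; the sigma and delta values are invented pilot figures.

```python
# Rough sketch of power-based topic set size estimation (cf. footnote 3, [30]).
# Normal approximation for a paired test on per-topic score differences;
# not the exact noncentral-t computation of [30]. Pilot figures are invented.
import math
from scipy.stats import norm

def topic_set_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate number of topics so that a two-sided paired test at
    significance level alpha detects a true mean score difference of delta
    with the given power, when per-topic differences have SD sigma."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # power requirement
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# e.g. pilot SD of per-topic nDCG differences = 0.15, target difference = 0.05:
print(topic_set_size(sigma=0.15, delta=0.05))   # about 71 topics
```

In practice sigma would be estimated from pilot per-topic score differences for the measure of interest.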

References

  1. Al-Maskari, A., Sanderson, M., Clough, P., Airio, E.: The good and the bad system: does the test collection predict users’ effectiveness? In: Proceedings of ACM SIGIR 2008, pp. 59–66 (2008)

  2. Amigó, E., Gonzalo, J., Mizzaro, S., de Albornoz, J.C.: An effectiveness metric for ordinal classification: formal properties and experimental results. In: Proceedings of ACL 2020 (2020)

  3. Amigó, E., Spina, D., de Albornoz, J.C.: An axiomatic analysis of diversity evaluation metrics: introducing the rank-biased utility metric. In: Proceedings of ACM SIGIR 2018, pp. 625–634 (2018)

  4. Anelli, V.W., Di Noia, T., Di Sciascio, E., Pomo, C., Ragone, A.: On the discriminative power of hyper-parameters in cross-validation and how to choose them. In: Proceedings of ACM RecSys 2019, pp. 447–451 (2019)

  5. Ashkan, A., Metzler, D.: Revisiting online personal search metrics with the user in mind. In: Proceedings of ACM SIGIR 2019, pp. 625–634 (2019)

  6. Azzopardi, L., Thomas, P., Craswell, N.: Measuring the utility of search engine result pages. In: Proceedings of ACM SIGIR 2018, pp. 605–614 (2018)

  7. Buckley, C., Voorhees, E.M.: Retrieval system evaluation. In: Voorhees, E.M., Harman, D.K. (eds.) TREC: Experiment and Evaluation in Information Retrieval, pp. 53–75. The MIT Press (2005)

  8. Carterette, B.: Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM TOIS 30(1), 1–34 (2012)

  9. Chapelle, O., Metzler, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for graded relevance. In: Proceedings of ACM CIKM 2009, pp. 621–630 (2009)

  10. Chuklin, A., Serdyukov, P., de Rijke, M.: Click model-based information retrieval metrics. In: Proceedings of ACM SIGIR 2013, pp. 493–502 (2013)

  11. Clarke, C.L., Craswell, N., Soboroff, I., Ashkan, A.: A comparative analysis of cascade measures for novelty and diversity. In: Proceedings of ACM WSDM 2011, pp. 75–84 (2011)

  12. Clarke, C.L., Vtyurina, A., Smucker, M.D.: Offline evaluation without gain. In: Proceedings of ICTIR 2020, pp. 185–192 (2020)

  13. Collins-Thompson, K., Macdonald, C., Bennett, P., Diaz, F., Voorhees, E.M.: TREC 2014 web track overview. In: Proceedings of TREC 2014 (2015)

  14. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.M.: Overview of the TREC 2019 deep learning track. In: Proceedings of TREC 2019 (2020)

  15. Dou, Z., Yang, X., Li, D., Wen, J.R., Sakai, T.: Low-cost, bottom-up measures for evaluating search result diversification. Inform. Retrieval J. 23, 86–113 (2020)

  16. Ferro, N., Kim, Y., Sanderson, M.: Using collection shards to study retrieval performance effect sizes. ACM TOIS 37(3), 1–40 (2019)

  17. Golbus, P.B., Aslam, J.A., Clarke, C.L.: Increasing evaluation sensitivity to diversity. Inform. Retrieval 16, 530–555 (2013)

  18. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inform. Syst. 20(4), 422–446 (2002)

  19. Kanoulas, E., Aslam, J.A.: Empirical justification of the gain and discount function for nDCG. In: Proceedings of ACM CIKM 2009, pp. 611–620 (2009)

  20. Leelanupab, T., Zuccon, G., Jose, J.M.: A comprehensive analysis of parameter settings for novelty-biased cumulative gain. In: Proceedings of ACM CIKM 2012, pp. 1950–1954 (2012)

  21. Lu, X., Moffat, A., Culpepper, J.S.: The effect of pooling and evaluation depth on IR metrics. Inform. Retrieval J. 19(4), 416–445 (2016)

  22. Luo, J., Wing, C., Yang, H., Hearst, M.A.: The water filling model and the cube test: multi-dimensional evaluation for professional search. In: Proceedings of ACM CIKM 2013, pp. 709–714 (2013)

  23. Mao, J., Sakai, T., Luo, C., Xiao, P., Liu, Y., Dou, Z.: Overview of the NTCIR-14 we want web task. In: Proceedings of NTCIR-14, pp. 455–467 (2019). http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings14/pdf/ntcir/01-NTCIR14-OV-WWW-MaoJ.pdf

  24. Moffat, A., Bailey, P., Scholer, F., Thomas, P.: Incorporating user expectations and behavior into the measurement of search effectiveness. ACM TOIS 35(3), 1–38 (2017)

  25. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM TOIS 27(1), 1–27 (2008)

  26. Robertson, S.E., Kanoulas, E., Yilmaz, E.: Extending average precision to graded relevance judgements. In: Proceedings of ACM SIGIR 2010, pp. 603–610 (2010)

  27. Sakai, T.: Ranking the NTCIR systems based on multigrade relevance. In: Myaeng, S.H., Zhou, M., Wong, K.-F., Zhang, H.-J. (eds.) AIRS 2004. LNCS, vol. 3411, pp. 251–262. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31871-2_22

  28. Sakai, T.: Evaluating evaluation metrics based on the bootstrap. In: Proceedings of ACM SIGIR 2006, pp. 525–532 (2006)

  29. Sakai, T.: Metrics, statistics, tests. In: Ferro, N. (ed.) PROMISE 2013. LNCS, vol. 8173, pp. 116–163. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54798-0_6

  30. Sakai, T.: Laboratory Experiments in Information Retrieval. TIRS, vol. 40. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-1199-4

  31. Sakai, T., Dou, Z.: Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In: Proceedings of ACM SIGIR 2013, pp. 473–482 (2013)

  32. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Inform. Retrieval 11(5), 447–470 (2008)

  33. Sakai, T., Robertson, S.: Modelling a user population for designing information retrieval metrics. In: Proceedings of EVIA 2008, pp. 30–41 (2008)

  34. Sakai, T., Song, R.: Diversified search evaluation: lessons from the NTCIR-9 INTENT task. Inform. Retrieval 16(4), 504–529 (2013)

  35. Sakai, T., et al.: Overview of the NTCIR-15 we want web with CENTRE task. In: Proceedings of NTCIR-15, pp. 219–234 (2020)

  36. Sakai, T., Zeng, Z.: Which diversity evaluation measures are “good”? In: Proceedings of ACM SIGIR 2019, pp. 595–604 (2019)

  37. Sakai, T., Zeng, Z.: Good evaluation measures based on document preferences. In: Proceedings of ACM SIGIR 2020, pp. 359–368 (2020)

  38. Sakai, T., Zeng, Z.: Retrieval evaluation measures that agree with users’ SERP preferences: traditional, preference-based, and diversity measures. ACM TOIS 39(2), 1–35 (2020)

  39. Sanderson, M., Paramita, M.L., Clough, P., Kanoulas, E.: Do user preferences and evaluation measures line up? In: Proceedings of ACM SIGIR 2010, pp. 555–562 (2010)

  40. Sanderson, M., Zobel, J.: Information retrieval evaluation: effort, sensitivity, and reliability. In: Proceedings of ACM SIGIR 2005, pp. 162–169 (2005)

  41. Shang, L., et al.: Overview of the NTCIR-13 short text conversation task. In: Proceedings of NTCIR-13, pp. 194–210 (2017)

  42. Smucker, M.D., Clarke, C.L.: Time-based calibration of effectiveness measures. In: Proceedings of ACM SIGIR 2012, pp. 95–104 (2012)

  43. Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: Proceedings of ACM SIGIR 2006, pp. 11–18 (2006)

  44. Urbano, J.: Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation. Inform. Retrieval J. 19(3), 313–350 (2016)

  45. Valcarce, D., Bellogín, A., Parapar, J., Castells, P.: Assessing ranking metrics in top-N recommendation. Inform. Retrieval J. 23(4), 411–448 (2020). https://doi.org/10.1007/s10791-020-09377-x

  46. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. Inform. Process. Manag. 36, 697–716 (2000)

  47. Voorhees, E.M.: Topic set size redux. In: Proceedings of ACM SIGIR 2009, pp. 806–807 (2009)

  48. Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment error. In: Proceedings of ACM SIGIR 2002, pp. 316–323 (2002)

  49. Wang, X., Wen, J.R., Dou, Z., Sakai, T., Zhang, R.: Search result diversity evaluation based on intent hierarchies. IEEE Trans. Knowl. Data Eng. 30(1), 156–169 (2018)

  50. Zhang, F., Liu, Y., Li, X., Zhang, M., Xu, Y., Ma, S.: Evaluating web search with a bejeweled player model. In: Proceedings of ACM SIGIR 2017, pp. 425–434 (2017)

  51. Zhang, F., et al.: Models versus satisfaction: towards a better understanding of evaluation metrics. In: Proceedings of ACM SIGIR 2020, pp. 379–388 (2020)

  52. Zhou, K., Lalmas, M., Sakai, T., Cummins, R., Jose, J.M.: On the reliability and intuitiveness of aggregated search metrics. In: Proceedings of ACM CIKM 2013, pp. 689–698 (2013)

  53. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of ACM SIGIR 1998, pp. 307–314 (1998)

Author information

Correspondence to Tetsuya Sakai.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Sakai, T. (2021). On the Instability of Diminishing Return IR Measures. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12656. Springer, Cham. https://doi.org/10.1007/978-3-030-72113-8_38

  • DOI: https://doi.org/10.1007/978-3-030-72113-8_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72112-1

  • Online ISBN: 978-3-030-72113-8

  • eBook Packages: Computer Science, Computer Science (R0)
