Meta-evaluation of Dynamic Search: How Do Metrics Capture Topical Relevance, Diversity and User Effort?

  • Ameer Albahem
  • Damiano Spina
  • Falk Scholer
  • Lawrence Cavedon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11437)

Abstract

Complex dynamic search tasks typically involve multi-aspect information needs and repeated interactions with an information retrieval system. Various metrics have been proposed to evaluate dynamic search systems, including the Cube Test, Expected Utility, and Session Discounted Cumulative Gain. While these complex metrics attempt to measure overall system “goodness” based on a combination of dimensions – such as topical relevance, novelty, or user effort – it remains an open question how well each of the competing evaluation dimensions is reflected in the final score. To investigate this, we adapt two meta-analysis frameworks: the Intuitiveness Test and Metric Unanimity. This study is the first to apply these frameworks to the analysis of dynamic search metrics and also to study how well these two approaches agree with each other. Our analysis shows that the complex metrics differ markedly in the extent to which they reflect these dimensions, and also demonstrates that the behaviors of the metrics change as a session progresses. Finally, our investigation of the two meta-analysis frameworks demonstrates a high level of agreement between the two approaches. Our findings can help to inform the choice and design of appropriate metrics for the evaluation of dynamic search systems.
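To make the Intuitiveness Test concrete, the following is a minimal sketch in Python of how such a concordance-based comparison can be computed. The function names and data layout are illustrative assumptions, not the authors' implementation: whenever two complex metrics disagree about which of two systems is better, the test checks which metric's preference agrees with a simple "gold" metric that captures a single dimension (e.g., topical relevance).

    def sign(x):
        # Direction of a preference: +1, -1, or 0 for a tie.
        return (x > 0) - (x < 0)

    def intuitiveness(metric_a, metric_b, gold):
        """Intuitiveness of two complex metrics relative to a simple gold metric.

        Each argument is a list of (score_for_system_1, score_for_system_2)
        tuples, one per topic/session, produced by the corresponding metric.
        Returns the fraction of disagreement cases in which each complex
        metric's preference matches the gold metric's preference.
        """
        a_hits = b_hits = disagreements = 0
        for a, b, g in zip(metric_a, metric_b, gold):
            pref_a = sign(a[0] - a[1])
            pref_b = sign(b[0] - b[1])
            pref_g = sign(g[0] - g[1])
            # Only cases where the two complex metrics disagree are informative.
            if pref_a == pref_b:
                continue
            disagreements += 1
            a_hits += pref_a == pref_g
            b_hits += pref_b == pref_g
        if disagreements == 0:
            return 0.0, 0.0
        return a_hits / disagreements, b_hits / disagreements

For example, intuitiveness([(0.3, 0.1)], [(0.2, 0.4)], [(0.5, 0.2)]) returns (1.0, 0.0): the first metric prefers system 1, matching the gold metric's preference, while the second does not. Metric Unanimity follows a similar pairwise logic but instead measures how often a complex metric's preferences agree with the unanimous preferences of a set of single-dimension metrics.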

Keywords

Evaluation · Dynamic search · Intuitiveness Test · Metric Unanimity

Notes

Acknowledgement

This research was partially supported by the Australian Research Council (projects LP130100563 and LP150100252), and Real Thing Entertainment Pty Ltd.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ameer Albahem, RMIT University, Melbourne, Australia
  • Damiano Spina, RMIT University, Melbourne, Australia
  • Falk Scholer, RMIT University, Melbourne, Australia
  • Lawrence Cavedon, RMIT University, Melbourne, Australia