Assessors Agreement: A Case Study Across Assessor Type, Payment Levels, Query Variations and Relevance Dimensions

  • Joao Palotti
  • Guido Zuccon
  • Johannes Bernhardt
  • Allan Hanbury
  • Lorraine Goeuriot
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9822)


Relevance assessments are the cornerstone of Information Retrieval evaluation. Yet, there is only limited understanding of how assessment disagreement influences the reliability of the evaluation in terms of system rankings. In this paper we examine the role of assessor type (expert vs. layperson), payment level (paid vs. unpaid), query variations and relevance dimensions (topicality and understandability), and the influence of each on system evaluation in the presence of disagreements across the assessments obtained in these different settings. The analysis is carried out on the CLEF 2015 eHealth Task 2 collection and shows that disagreements between assessors belonging to the same group have little impact on evaluation. It also shows, however, that assessment disagreement across settings has a major impact on evaluation when topical relevance is considered, while it has no impact when understandability assessments are considered.
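The two quantities underlying this kind of analysis are standard: chance-corrected agreement between assessors (e.g., Cohen's kappa over their relevance labels) and rank correlation between the system orderings induced by each set of judgments (e.g., Kendall's tau). A minimal self-contained sketch, using made-up labels and rankings purely for illustration (the data below are not from the CLEF 2015 collection):

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # expected agreement under independent per-assessor label marginals
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

def kendall_tau(r1, r2):
    """Rank correlation between two orderings of the same systems."""
    pairs = list(combinations(range(len(r1)), 2))
    concordant = sum((r1[i] - r1[j]) * (r2[i] - r2[j]) > 0 for i, j in pairs)
    discordant = sum((r1[i] - r1[j]) * (r2[i] - r2[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)

# Two assessors' binary relevance labels for ten documents (illustrative)
expert    = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
layperson = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(round(cohen_kappa(expert, layperson), 3))  # → 0.583

# Ranks of five systems under the qrels from each assessor group
ranks_expert    = [1, 2, 3, 4, 5]
ranks_layperson = [2, 1, 3, 4, 5]
print(round(kendall_tau(ranks_expert, ranks_layperson), 3))  # → 0.8
```

A high tau despite a moderate kappa is the pattern the abstract describes for within-group disagreement: assessors may disagree on individual documents while the induced system rankings stay largely stable.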


Keywords: Evaluation · Assessments · Assessors agreement



This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 644753 (KConnect), and from the Austrian Science Fund (FWF) projects P25905-N23 (ADmIRE) and I1094-N23 (MUCKE).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Joao Palotti (1)
  • Guido Zuccon (2)
  • Johannes Bernhardt (3)
  • Allan Hanbury (1)
  • Lorraine Goeuriot (4)

  1. Vienna University of Technology, Vienna, Austria
  2. Queensland University of Technology, Brisbane, Australia
  3. Medical University of Graz, Graz, Austria
  4. Université Grenoble Alpes, Saint-Martin-d’Hères, France
