Inter-annotator Agreement

Abstract

This chapter touches upon several issues in the calculation and assessment of inter-annotator agreement. It gives an introduction to the theory behind agreement coefficients and examples of their application to linguistic annotation tasks. Specific examples explore variation in annotator performance due to heterogeneous data, complex labels, item difficulty, and annotator differences, showing how global agreement coefficients may mask these sources of variation, and how detailed agreement studies can give insight into both the annotation process and the nature of the underlying data. The chapter also reviews recent work on using machine learning to exploit the variation among annotators and learn detailed models from which accurate labels can be inferred. I therefore advocate an approach where agreement studies are not used merely as a means to accept or reject a particular annotation scheme, but as a tool for exploring patterns in the data that are being annotated.
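
For readers unfamiliar with chance-corrected agreement coefficients, the sketch below (not part of the chapter itself) illustrates the simplest member of the family, Cohen's kappa, which compares the observed agreement A_o between two annotators to the agreement A_e expected if each annotator assigned labels at random according to their own label distribution: kappa = (A_o - A_e) / (1 - A_e). The Python function and the toy data are invented for illustration only; the chapter discusses a broader family of coefficients, including Krippendorff's alpha, which generalizes to multiple annotators, missing data, and weighted disagreements.

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators (Cohen's kappa)."""
        n = len(labels_a)
        # Observed agreement: proportion of items labelled identically by both annotators.
        a_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected agreement: products of each annotator's individual label proportions.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        a_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (a_o - a_e) / (1 - a_e)

    # Toy example: two annotators, ten items, three categories.
    ann1 = ["yes", "yes", "no", "maybe", "no", "yes", "no", "maybe", "yes", "no"]
    ann2 = ["yes", "no", "no", "maybe", "no", "yes", "yes", "maybe", "yes", "no"]
    print(cohen_kappa(ann1, ann2))  # (0.8 - 0.36) / (1 - 0.36) = 0.6875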

Keywords

Inter-annotator agreement · Kappa · Krippendorff’s alpha · Annotation reliability

Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. Institute for Creative Technologies, University of Southern California, Playa Vista, USA
