This chapter touches upon several issues in the calculation and assessment of inter-annotator agreement. It gives an introduction to the theory behind agreement coefficients and examples of their application to linguistic annotation tasks. Specific examples explore variation in annotator performance due to heterogeneous data, complex labels, item difficulty, and annotator differences, showing how global agreement coefficients may mask these sources of variation, and how detailed agreement studies can give insight into both the annotation process and the nature of the underlying data. The chapter also reviews recent work on using machine learning to exploit the variation among annotators and learn detailed models from which accurate labels can be inferred. I therefore advocate an approach where agreement studies are not used merely as a means to accept or reject a particular annotation scheme, but as a tool for exploring patterns in the data that are being annotated.
Keywords: Inter-annotator agreement · Kappa · Krippendorff's alpha · Annotation reliability
I thank the editors of this volume and two anonymous reviewers for valuable feedback and comments on an earlier draft. Any remaining errors and omissions are my own.
The work depicted here was sponsored by the U.S. Army. Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.
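The coefficients discussed in this chapter share a common shape: they correct the observed agreement between annotators by the agreement expected by chance. As a minimal illustration of that idea, the sketch below computes Cohen's kappa for two annotators over nominal labels; the function name and the toy data are made up for this example and are not from the chapter.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (nominal data)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items on which the two annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement, from each annotator's own label distribution.
    dist_a = Counter(labels_a)
    dist_b = Counter(labels_b)
    expected = sum(dist_a[c] * dist_b[c] for c in dist_a) / n ** 2
    # Kappa: how far observed agreement exceeds chance, scaled to a maximum of 1.
    return (observed - expected) / (1 - expected)

# Toy example: two annotators, ten items, two labels.
a = ["yes", "yes", "yes", "yes", "no", "no", "no", "yes", "no", "yes"]
b = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
print(round(cohen_kappa(a, b), 3))  # observed 0.8, chance 0.52, kappa ~0.583
```

Other coefficients in the kappa family (Scott's pi, Fleiss's multi-rater kappa, Krippendorff's alpha) differ mainly in how the chance-agreement term is estimated, which is one reason global coefficient values alone can obscure what is actually happening in an annotation study.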