Journal on Multimodal User Interfaces

Volume 8, Issue 1, pp 17–28

Inter-rater reliability for emotion annotation in human–computer interaction: comparison and methodological improvements

  • Ingo Siegert
  • Ronald Böck
  • Andreas Wendemuth
Original Paper


To enable naturalistic human–computer interaction, the recognition of emotions and intentions is receiving increased attention, and several modalities are combined to cover all human communication abilities. For this reason, naturalistic material is recorded in which the subjects are guided through an interaction with crucial points but retain the freedom to react individually. This material captures realistic user reactions but lacks clear labels, so a good transcription and annotation of the material is essential. Employing human annotators for this task has become widely accepted. A good measure of the reliability of labelled material is the inter-rater agreement. In this paper we investigate the inter-rater agreement achieved for emotionally annotated interaction corpora, using Krippendorff’s alpha, and present methods to improve the reliability. We show that the reliabilities obtained with different methods do not differ much, so the choice of method can rely on other aspects. Furthermore, a multimodal presentation of the items in their natural order increases the reliability.
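As an illustration of the agreement measure named above, the following is a minimal sketch of Krippendorff’s alpha for nominal labels, computed from a coincidence matrix as alpha = 1 - D_o / D_e (observed over expected disagreement). It is not the authors’ implementation: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per annotated item, with `None` marking a missing rating) and the toy labels are assumptions made for this example only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    raters assigned to one item (None = rater gave no label).
    This layout is an assumption of this sketch.
    """
    # Coincidence matrix: every ordered pair of labels given to the
    # same item by two different raters, weighted by 1 / (m - 1),
    # where m is the number of labels available for that item.
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c per label and overall number of pairable labels n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    # Nominal metric: any two different labels count as full disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy data: two raters labelling ten utterances (hypothetical labels).
rater_1 = ["neutral", "anger", "neutral", "neutral", "neutral",
           "neutral", "neutral", "neutral", "anger", "neutral"]
rater_2 = ["anger", "anger", "anger", "neutral", "neutral",
           "anger", "neutral", "neutral", "neutral", "neutral"]
example_units = [list(pair) for pair in zip(rater_1, rater_2)]
example_alpha = krippendorff_alpha_nominal(example_units)
print(round(example_alpha, 3))
```

For this toy data the raters agree on six of ten items, yet alpha is only about 0.095, because much of that raw agreement is expected by chance when one label dominates. A value of 1 indicates perfect agreement and 0 indicates agreement at chance level.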


Affective state, Annotation, Context influence, Inter-rater agreement, Labelling



This research was supported by the Transregional Collaborative Research Centre SFB/TRR 62 “A Companion-Technology for Cognitive Technical Systems”, funded by the German Research Foundation (DFG). Portions of the research in this article use the Semaine Database, collected for the Semaine project [45].


  1. Altman DG (1991) Practical statistics for medical research. Chapman & Hall, London
  2. Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Comput Linguist 34(4):555–596
  3. Batliner A, Hacker C, Steidl S, Nöth E, Russell M, Wong M (2004) “You stupid tin box”-children interacting with the AIBO robot: a cross-linguistic emotional speech corpus. In: Proceedings of LREC, pp 865–868
  4. Böck R, Siegert I, Vlasenko B, Wendemuth A, Haase M, Lange J (2011) A processing tool for emotionally coloured speech. In: Proceedings of ICME, s.p.
  5. Bradley M, Lang P (1994) Measuring emotion: the self-assessment manikin and the semantic differential. J Behav Ther Exp Psy 25(1):49–59
  6. Burger S, MacLaren V, Yu H (2002) The ISL meeting corpus: the impact of meeting type on speech style. In: Proceedings of the international conference on spoken language processing, pp 301–304
  7. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Proceedings of interspeech, pp 1517–1520
  8. Callejas Z, López-Cózar R (2008) Influence of contextual information in emotion annotation for spoken dialogue systems. Speech Commun 50(5):416–433
  9. Cauldwell RT (2000) Where did the anger go? The role of context in interpreting emotion in speech. In: Proceedings of ITRW on speech and emotion, pp 127–131
  10. Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 24(1):37–46
  11. Cowie R, Cornelius RR (2003) Describing the emotional states that are expressed in speech. Speech Commun 40(1–2):5–32
  12. Crawford JR, Henry JD (2004) The positive and negative affect schedule (PANAS): construct validity, measurement properties and normative data in a large non-clinical sample. Br J Clin Psychol 43(3):245–265
  13. Cronbach L (1951) Coefficient alpha and the internal structure of tests. Psychometrika 16(3):297–334
  14. Devillers L, Vasilescu I (2004) Reliability of lexical and prosodic cues in two real-life spoken dialog corpora. In: Proceedings of LREC, pp 865–868
  15. Devillers L, Vidrascu L, Lamel L (2005) Challenges in real-life emotion annotation and machine learning based detection. Neural Netw 18(4):407–422
  16. Douglas-Cowie E, Cowie R, Schröder M (2000) A new emotion database: considerations, sources and scope. In: Proceedings of ITRW on speech and emotion, pp 39–44
  17. Douglas-Cowie E, Cowie R, Sneddon I, Cox C, Lowry O, McRorie M, Martin JC, Devillers L, Abrilian S, Batliner A, Amir N, Karpouzis K (2007) The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data. In: Proceedings of ACII. Berlin, Heidelberg, pp 488–500
  18. Douglas-Cowie E, Devillers L, Martin JC, Cowie R, Savvidou S, Abrilian S, Cox C (2005) Multimodal databases of everyday emotion: facing up to complexity. In: Proceedings of EUROSPEECH, pp 813–816
  19. Eggink J, Bland D (2012) A large scale experiment for mood-based classification of TV programmes. In: Proceedings of ICME, pp 140–145
  20. Ekman P (1992) Are there basic emotions? Psychol Rev 99(3):550–553
  21. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit 44(3):572–587
  22. Engberg IS, Hansen AV (1996) Documentation of the Danish emotional speech database (DES). Technical report, Center for Person, Kommunikation, Aalborg University, Denmark. Internal AAU report
  23. Feinstein AR, Cicchetti DV (1990) High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 43(6):543–549
  24. Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382
  25. Fleiss JL, Levin B, Paik MC (1991) Statistical methods for rates & proportions, 3rd edn. Wiley, Hoboken
  26. Fragopanagos N, Taylor J (2005) Emotion recognition in human-computer interaction. Neural Netw 18(4):389–405
  27. Frommer J, Michaelis B, Rösner D, Wendemuth A, Friesen R, Haase M, Kunze M, Andrich R, Lange J, Panning A, Siegert I (2012) Towards emotion and affect detection in the multimodal LAST MINUTE corpus. In: Proceedings of LREC, pp 3064–3069
  28. Frommer J, Rösner D, Haase M, Lange J, Friesen R, Otto M (2012) Detection and avoidance of failures in dialogues-Wizard of Oz Experiment Operator’s Manual. Pabst Science Publishers
  29. Gehm T, Scherer K (1988) Factors determining the dimensions of subjective emotional space. In: Scherer K (ed) Facets of emotion: recent research. Erlbaum, Hillsdale, NJ, pp 99–114
  30. Gnjatović M, Rösner D (2008) The NIMITEK corpus of affected behavior in human-machine interaction. In: Proceedings of LREC, pp 5–8
  31. Grandjean D, Sander D, Scherer K (2008) Conscious emotional experience emerges as a function of multilevel, appraisal-driven response synchronization. Conscious Cogn 17(2):484–495
  32. Grimm M, Kroschel K (2005) Evaluation of natural emotions using self assessment manikins. In: IEEE workshop on automatic speech recognition and understanding, pp 381–385
  33. Grimm M, Kroschel K, Narayanan S (2008) The Vera am Mittag German audio-visual emotional speech database. In: Proceedings of ICME, pp 865–868
  34. Gwet KL (2008) Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 61(1):29–48
  35. Gwet KL (2008) Intrarater reliability. In: D’Agostino RB, Sullivan L, Massaro J (eds) Wiley encyclopedia of clinical trials. Wiley, Hoboken, pp 473–485
  36. Hayes AF, Krippendorff K (2007) Answering the call for a standard reliability measure for coding data. Commun Methods Meas 24(1):77–89
  37. Ibáñez J (2011) Showing emotions through movement and symmetry. Comput Hum Behav 27(1):561–567
  38. Izard CE, Libero DZ, Putnam P, Haynes OM (1993) Stability of emotion experiences and their relations to traits of personality. J Pers Soc Psychol 64(5):847–860
  39. Krippendorff K (2007) Computing Krippendorff’s alpha reliability. University of Pennsylvania, Annenberg School for Communication, Technical report
  40. Krippendorff K (2012) Content analysis: an introduction to its methodology, 3rd edn. SAGE Publications, Thousand Oaks
  41. Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
  42. Lang PJ (1980) Behavioral treatment and bio-behavioral assessment: computer applications. In: Sidowski JB, Johnson JH, Williams TA (eds) Technology in mental health care delivery systems. Ablex Pub. Corp., pp 119–137
  43. Lee CM, Narayanan S (2005) Toward detecting emotions in spoken dialogs. IEEE Trans Speech Audio Process 13(2):293–303
  44. McDougall W (1926) An introduction to social psychology, revised edn. John W. Luce & Co, Boston
  45. McKeown G, Valstar M, Cowie R, Pantic M (2010) The semaine corpus of emotionally coloured character interactions. In: Proceedings of ICME, pp 1079–1084
  46. McKeown G, Valstar M, Cowie R, Pantic M, Schroder M (2012) The semaine database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Trans Affect Comput 3(1):5–17
  47. Mehrabian A (1970) A semantic space for nonverbal behavior. J Consult Clin Psychol 35(2):248–257
  48. Morris JD (1995) SAM: the self-assessment manikin an efficient cross-cultural measurement of emotional response. J Advert Res 35(6):63–68
  49. Morris JD, McMullen JS (1994) Measuring multiple emotional responses to a single television commercial. Adv Consum Res 21:175–180
  50. Osgood CE, Miron MS, May WH (1975) Cross-cultural universals of affective meaning. University of Illinois Press, Urbana
  51. Plutchik R (1980) Emotion, a psychoevolutionary synthesis. Harper & Row, New York
  52. Pugmire D (1994) Real emotion. Philos Phenomen Res 54(1):105–122
  53. Rösner D, Friesen R, Otto M, Lange J, Haase M, Frommer J (2011) Intentionality in interacting with companion systems – an empirical approach. In: Human-Computer interaction. Towards mobile and intelligent interaction environments, LNCS, vol 6763. Springer, Berlin, Heidelberg, pp 593–602
  54. Russel J, Mehrabian A (1974) Distinguishing anger and anxiety in terms of emotional response factors. J Consult Clin Psychol 42:79–83
  55. Russel JA (1980) Three dimensions of emotion. J Pers Soc Psychol 39(9):1161–1178
  56. Sacharin V, Schlegel K, Scherer KR (2012) Geneva emotion wheel rating study. Center for Person, Kommunikation, Aalborg University, NCCR Affective Sciences, Technical report
  57. Scherer K (2005) What are emotions? and how can they be measured? Soc Sci Inform 44(4):695–729
  58. Scherer KR (2001) Appraisal considered as a process of multilevel sequential checking, vol 92. Oxford University Press, Oxford, pp 92–120
  59. Schimmack U (1997) The Berlin everyday language mood inventory (BELMI): toward the content valid assessment of moods. Diagnostica 43(2):150–173
  60. Schmitt N (1996) Uses and abuses of coefficient alpha. Psychol Assess 8(4):350–353
  61. Schröder M, Cowie R, Douglas-Cowie E, Savvidou S, McMahon E, Sawey M (2000) Feeltrace: An instrument for recording perceived emotion in real time. In: Proceedings of ITRW on speech and emotion, pp 19–24
  62. Sharp H, Rogers Y, Preece J (2007) Interaction design: beyond human-computer interaction, 2nd edn. Wiley, London
  63. Siegert I, Böck R, Wendemuth A (2013) The influence of context knowledge for multimodal affective annotation. In: Human-computer interaction, Part V, HCII 2013, LNCS, vol 8008. Springer, Berlin, pp 381–390
  64. Siegert I, Böck R, Philippou-Hübner D, Vlasenko B, Wendemuth A (2011) Appropriate emotional labeling of non-acted speech using basic emotions, Geneva emotion wheel and self assessment Manikins. In: Proceedings of ICME, s.p.
  65. Siegert I, Böck R, Wendemuth A (2012) The influence of context knowledge for multimodal annotation on natural material. In: Joint proceedings of the IVA 2012 workshops, pp 25–32
  66. Sijtsma K (2009) On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika 74(1):107–120
  67. Sojka P, Horak A, Kopecek I, Pala K (eds) (2012) Aggression detection in speech using sensor and semantic information, vol 7499. Springer, Berlin
  68. Truong KP, van Leeuwen DA, de Jong FM (2012) Speech-based recognition of self-reported and observed emotion in a dimensional space. Speech Commun 54(9):1049–1063
  69. Truong KP, Neerincx MA, van Leeuwen DA (2008) Assessing agreement of observer- and self-annotations in spontaneous multimodal emotion data. In: Proceedings of interspeech, pp 318–321
  70. Watson D, Clark LA, Tellegen A (1988) Development and validation of brief measures of positive and negative affect: the PANAS scales. J Pers Soc Psychol 54(6):1063–1070
  71. Wendemuth A, Biundo S (2012) A companion technology for cognitive technical systems. In: Cognitive behavioural systems, Lecture Notes in Computer Science, vol 7403, Springer, Berlin, pp 89–103
  72. Wundt W (1922/1863) Vorlesungen über die Menschen- und Tierseele. L. Voss, Leipzig
  73. Yang YH, Lin YC, Su YF, Chen H (2007) Music emotion classification: a regression approach. In: Proceedings of ICME, pp 208–211
  74. Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58

Copyright information

© OpenInterface Association 2013

Authors and Affiliations

  1. IIKT and CBBS, Otto von Guericke University, Magdeburg, Germany
