IEMOCAP: interactive emotional dyadic motion capture database

  • Carlos Busso (email author)
  • Murtaza Bulut
  • Chi-Chun Lee
  • Abe Kazemzadeh
  • Emily Mower
  • Samuel Kim
  • Jeannette N. Chang
  • Sungbok Lee
  • Shrikanth S. Narayanan


Abstract

Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 hours of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
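Working with marker-based recordings like those described above typically involves recovering the rigid head pose of each frame from the head markers, which is classically posed as least-squares fitting of two 3-D point sets solved in closed form with an SVD. The sketch below is illustrative only, not the authors' processing pipeline: it estimates the rotation and translation that best align a reference (e.g. neutral-pose) marker set to an observed frame.

```python
import numpy as np

def rigid_align(ref, obs):
    """Closed-form least-squares rigid transform (R, t) mapping ref onto obs.

    ref, obs: (N, 3) arrays of corresponding 3-D marker positions.
    Minimizes sum_i || R @ ref[i] + t - obs[i] ||^2 (Kabsch-style SVD fit).
    """
    # Center both point sets so rotation can be estimated independently of t.
    ref_c = ref - ref.mean(axis=0)
    obs_c = obs - obs.mean(axis=0)

    # 3x3 cross-covariance matrix between the centered sets.
    H = ref_c.T @ obs_c
    U, _, Vt = np.linalg.svd(H)

    # Correct for a possible reflection so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation follows from aligning the centroids under R.
    t = obs.mean(axis=0) - R @ ref.mean(axis=0)
    return R, t
```

Subtracting the recovered rigid motion from the raw marker trajectories leaves the non-rigid component (e.g. facial deformation), which is the usual reason for performing this alignment on motion-capture data.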


Keywords: Audio-visual database · Dyadic interaction · Emotion · Emotional assessment · Motion capture system



Acknowledgments

This work was supported in part by funds from the National Science Foundation (NSF) (through the Integrated Media Systems Center, an NSF Engineering Research Center, Cooperative Agreement No. EEC-9529152 and a CAREER award), the Department of the Army, and a MURI award from the Office of Naval Research. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. The authors wish to thank Joshua Izumigawa, Gabriel Campa, Zhigang Deng, Eric Furie, Karen Liu, Oliver Mayer, and May-Chen Kuo for their help and support.



Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

All authors: Speech Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA
