IEMOCAP: interactive emotional dyadic motion capture database

Abstract

Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration, and neutral state). The corpus contains approximately 12 hours of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
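Corpora of this kind are typically labeled by several human annotators per utterance, and a consensus label is then derived. As a purely illustrative sketch (not the authors' actual tooling), the snippet below aggregates categorical judgments by majority vote over the five target classes named in the abstract; the utterance votes and the tie-handling policy are invented for the example:

```python
# Hypothetical sketch: majority-vote aggregation of per-utterance emotion
# labels from multiple annotators. The label set comes from the abstract;
# everything else (function name, tie policy, sample votes) is assumed.
from collections import Counter

EMOTIONS = {"happiness", "anger", "sadness", "frustration", "neutral"}

def majority_label(annotations):
    """Return the majority emotion label, or None on a tie
    (tied utterances are often excluded or flagged as ambiguous)."""
    assert all(a in EMOTIONS for a in annotations)
    counts = Counter(annotations)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:  # tie between the leading labels
        return None
    return top

# Invented example votes from three annotators:
print(majority_label(["anger", "anger", "frustration"]))   # anger
print(majority_label(["happiness", "sadness", "neutral"])) # None (tie)
```

Majority voting is only one possible consensus rule; agreement statistics such as Fleiss' kappa are commonly reported alongside it to quantify inter-annotator consistency.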




Acknowledgements

This work was supported in part by funds from the National Science Foundation (NSF) (through the Integrated Media Systems Center, an NSF Engineering Research Center, Cooperative Agreement No. EEC-9529152 and a CAREER award), the Department of the Army, and a MURI award from the Office of Naval Research. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. The authors wish to thank Joshua Izumigawa, Gabriel Campa, Zhigang Deng, Eric Furie, Karen Liu, Oliver Mayer, and May-Chen Kuo for their help and support.

Author information


Correspondence to Carlos Busso.


Cite this article

Busso, C., Bulut, M., Lee, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resources & Evaluation 42, 335 (2008). https://doi.org/10.1007/s10579-008-9076-6


Keywords

  • Audio-visual database
  • Dyadic interaction
  • Emotion
  • Emotional assessment
  • Motion capture system