Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). The database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, providing detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration, and neutral state). The corpus contains approximately 12 hours of data. The detailed motion capture information, the interactive setting used to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
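Corpora with categorical emotion labels such as the ones named above (happiness, anger, sadness, frustration, neutral) are typically annotated by several evaluators per utterance, and a single ground-truth label is then derived by majority vote. The sketch below illustrates that common aggregation step; it is a hypothetical example, not the authors' actual annotation tooling, and the rater data is invented for illustration.

```python
from collections import Counter

# Categorical labels used in the abstract; the aggregation rule itself is
# a generic majority-vote convention, not taken from the paper.
EMOTIONS = {"happiness", "anger", "sadness", "frustration", "neutral"}

def majority_label(ratings):
    """Return the majority emotion for one utterance, or None on a tie."""
    counts = Counter(r for r in ratings if r in EMOTIONS)
    if not counts:
        return None
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None  # no clear majority: utterance remains ambiguous
    return top

# Example: three hypothetical evaluators rate two utterances.
print(majority_label(["anger", "anger", "frustration"]))  # -> anger
print(majority_label(["anger", "sadness", "neutral"]))    # -> None (tie)
```

Utterances with no majority are often kept in the corpus but flagged as ambiguous rather than discarded, since disagreement itself is informative for emotion research.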
This work was supported in part by funds from the National Science Foundation (NSF) (through the Integrated Media Systems Center, an NSF Engineering Research Center, Cooperative Agreement No. EEC-9529152 and a CAREER award), the Department of the Army, and a MURI award from the Office of Naval Research. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. The authors wish to thank Joshua Izumigawa, Gabriel Campa, Zhigang Deng, Eric Furie, Karen Liu, Oliver Mayer, and May-Chen Kuo for their help and support.
Busso, C., Bulut, M., Lee, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resources & Evaluation 42, 335 (2008). https://doi.org/10.1007/s10579-008-9076-6
Keywords: Audio-visual database · Dyadic interaction · Emotional assessment · Motion capture system