CHEAVD: a Chinese natural emotional audio–visual database

  • Ya Li
  • Jianhua Tao
  • Linlin Chao
  • Wei Bao
  • Yazhu Liu
Original Research


This paper presents a recently collected natural, multimodal, richly annotated emotion database, the CASIA Chinese Natural Emotional Audio–Visual Database (CHEAVD), which aims to provide a basic resource for research on multimodal multimedia interaction. The corpus contains 140 min of emotional segments extracted from films, TV plays and talk shows. Its 238 speakers, ranging in age from children to the elderly, provide broad coverage of speaker diversity, which makes this database a valuable addition to existing emotional databases. In total, 26 non-prototypical emotional states, including the basic six, are labeled by four native speakers. In contrast to other existing emotional databases, we provide multi-emotion labels and fake/suppressed emotion labels. To the best of our knowledge, this database is the first large-scale Chinese natural emotion corpus dealing with multimodal and natural emotion, and it is free for research use. Automatic emotion recognition with Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) is performed on this corpus. Experiments show that an average accuracy of 56 % can be achieved on six major emotion states.


Audio–visual database; Natural emotion; Corpus annotation; LSTM; Multimodal emotion recognition



This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (Nos. 61305003, 61425017), the Strategic Priority Research Program of the CAS (Grant XDB02080006), and partly supported by the Major Program for the National Social Science Fund of China (13&ZD189). We thank the data providers for their kind permission to make their data available for non-commercial, scientific use. Due to space limitations, providers’ information is available in The corpus can be freely obtained at ChineseLDC,



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  3. School of Computer and Control Engineering, Graduate University of Chinese Academy of Sciences, Beijing, China
  4. Institute of Linguistic Sciences, Jiangsu Normal University, Jiangsu, China
