Abstract
Speech is one of the most fundamental media of human-to-human interaction and therefore plays a pivotal role in shaping next-generation human-computer interaction (HCI). Building an accurate speech emotion recognition (SER) system for human conversation is a critical yet challenging task. Most state-of-the-art SER research models the vocal attributes of each discrete utterance in isolation, overlooking the transactional cues intrinsic to the broader interactive context. In this paper, we introduce a dual-level framework for speech emotion recognition that leverages the complementary attributes of MFCC features and Mel-spectrograms. We further propose a hierarchical attention mechanism that incorporates contextual information from the dialogue, thereby improving emotion recognition accuracy. Experiments on the widely used IEMOCAP emotional benchmark dataset yield promising results: in four-class emotion recognition, our model achieves a weighted accuracy of 75.0% and an unweighted accuracy of 75.9%, a notable improvement of 5.8% in unweighted accuracy over state-of-the-art methods. This work advances SER by effectively combining multiple audio representations with conversational context, promising more accurate emotion recognition in human-computer interaction and affective computing.
Data Availability Statement
The dataset utilized in this study can be requested for download through the following link: • IEMOCAP Dataset: https://sail.usc.edu/iemocap/
Funding
This work is supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the National Natural Science Foundation of China under Grant 62176106, and the Jiangsu Province Key Research and Development Plan (BE2020036).
Ethics declarations
Conflict of Interest
The authors declare that they have no competing interests.
About this article
Cite this article
Tellai, M., Gao, L., Mao, Q. et al. A novel conversational hierarchical attention network for speech emotion recognition in dyadic conversation. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17803-7