Abstract
Speech is one of the most fundamental media of human-to-human interaction and therefore plays a pivotal role in shaping next-generation human-computer interaction (HCI). Building an accurate speech emotion recognition (SER) system for human conversation is a critical yet challenging task. Most state-of-the-art SER research models the vocal attributes of each discrete utterance in isolation, overlooking the transactional cues intrinsic to the broader interactive context. In this paper, we introduce a dual-level framework for speech emotion recognition that leverages the complementary attributes of MFCC features and Mel-spectrograms. We further propose a hierarchical attention mechanism that incorporates contextual information from the dialogue, thereby improving emotion recognition accuracy. Experiments on the widely used IEMOCAP emotional benchmark dataset yield promising results: in four-class emotion recognition, our model achieves a weighted accuracy of 75.0% and an unweighted accuracy of 75.9%, a notable improvement of 5.8% in unweighted accuracy over state-of-the-art methods. This work advances SER by effectively combining multiple audio representations with conversational context, promising more accurate emotion recognition in human-computer interaction and affective computing.
Data Availability Statement
The dataset utilized in this study can be requested for download through the following link: • IEMOCAP Dataset: https://sail.usc.edu/iemocap/
Funding
This work is supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the National Natural Science Foundation of China under Grant 62176106, and the Jiangsu Province Key Research and Development Plan (BE2020036).
Ethics declarations
Conflict of Interest
The authors declare that they have no competing interests.
About this article
Cite this article
Tellai, M., Gao, L., Mao, Q. et al. A novel conversational hierarchical attention network for speech emotion recognition in dyadic conversation. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17803-7