
A novel conversational hierarchical attention network for speech emotion recognition in dyadic conversation

Published in: Multimedia Tools and Applications


Speech is one of the most fundamental media of human-to-human interaction and therefore plays a pivotal role in next-generation human-computer interaction (HCI). Developing an accurate speech emotion recognition (SER) system for human conversation is a critical yet challenging task. Most state-of-the-art SER research models the vocal attributes of each discrete speech utterance in isolation, overlooking the transactional cues intrinsic to the broader interactive context. In this paper, we introduce a dual-level framework for speech emotion recognition that leverages the complementary attributes of MFCC features and Mel-spectrograms. We further propose a hierarchical attention mechanism that incorporates contextual information from the conversation, improving the accuracy of emotion recognition. Experiments on the widely used IEMOCAP emotional benchmark dataset yield promising results: compared with state-of-the-art methods on four-class emotion recognition, our model achieves a weighted accuracy of 75.0% and an unweighted accuracy of 75.9%, an improvement of 5.8% in unweighted accuracy. This work advances SER by effectively combining multiple audio representations with contextual information, promising more accurate emotion recognition in human-computer interaction and affective computing.
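The two audio representations the abstract pairs, MFCC features and (log-)Mel-spectrograms, are standard signal-processing objects. Since the paper's preprocessing details are not reproduced here, the following is a minimal numpy-only sketch of how both can be derived from the same power spectrum; the parameters (16 kHz audio, 512-point FFT, 26 mel bands, 13 coefficients) are conventional assumptions, not the authors' configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_and_mfcc(y, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    # Frame the waveform, window each frame, take the power spectrum
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2      # (T, n_fft//2 + 1)
    # Log-mel spectrogram: mel filterbank, then log compression
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # MFCCs: unnormalized DCT-II of the log-mel energies, keep the first n_mfcc
    M = n_mels
    dct = np.cos(np.pi / M * (np.arange(M) + 0.5)[None, :]
                 * np.arange(n_mfcc)[:, None])           # (n_mfcc, M)
    mfcc = log_mel @ dct.T                               # (T, n_mfcc)
    return log_mel, mfcc
```

The two outputs are complementary in the sense the abstract exploits: the log-mel spectrogram keeps the full time-frequency detail for a convolutional front end, while the MFCCs give a compact, decorrelated summary of the spectral envelope.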


Data Availability Statement

The dataset utilized in this study can be requested for download through the provided link: •   IEMOCAP Dataset: Access Link.




This work was supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, by the National Natural Science Foundation of China under Grant 62176106, and by the Jiangsu Province Key Research and Development Plan (BE2020036).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Qirong Mao.

Ethics declarations

Conflict of Interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tellai, M., Gao, L., Mao, Q. et al. A novel conversational hierarchical attention network for speech emotion recognition in dyadic conversation. Multimed Tools Appl 83, 59699–59723 (2024).
