
Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP)


Abstract

A multihead attention-based convolutional neural network (CNN) architecture, termed channel-wise global head pooling (CwGHP), is proposed to improve the classification accuracy of speech emotion recognition. A time-frequency kernel is used in the two-dimensional convolutions to emphasize both the time and frequency scales of the mel-frequency cepstral coefficients (MFCCs). Following the CNN encoder, a multihead attention network is optimized to learn salient discriminative characteristics of audio samples on three emotional speech datasets: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus in English, the Berlin Emotional Speech Database (EmoDB) in German, and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) in North American English. These linguistically diverse datasets demonstrate the robustness of the proposed model. A chunk-level classification approach is used for training, with each segment inheriting the label of its source utterance; during evaluation, the chunk-level emotion predictions are aggregated to classify each sample. Classification accuracy on the IEMOCAP dataset improves to 84.89% unweighted accuracy (UA) and 82.87% weighted accuracy (WA), which is state-of-the-art on this corpus using only the audio modality; compared with the previous best of 79.34% WA and 77.54% UA, the proposed method improves UA by more than 7%. The model is further validated on the two other datasets through a series of experiments that yield satisfactory results. Performance is reported in terms of WA and UA, and per-class effectiveness is additionally assessed with precision, recall, and F1-score.
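To make the pipeline described in the abstract concrete, the following is a minimal sketch of a chunk-level model of this kind: a 2-D convolutional encoder over an MFCC map, multihead self-attention with pooling over time, and aggregation of chunk predictions into an utterance-level decision at evaluation time. It assumes PyTorch; the layer sizes, kernel shapes, head count, and aggregation rule are illustrative placeholders and not the paper's exact CwGHP configuration.

```python
import torch
import torch.nn as nn


class ChunkEmotionNet(nn.Module):
    """CNN encoder over MFCCs + multihead self-attention pooling (illustrative sketch only)."""

    def __init__(self, n_mfcc=40, n_classes=4, channels=64, n_heads=8):
        super().__init__()
        # Two 2-D convolution blocks with time-frequency kernels over the MFCC map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.MaxPool2d(2),
        )
        embed_dim = channels * (n_mfcc // 4)            # channel x frequency features per time step
        self.attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, x):                               # x: (batch, 1, n_mfcc, frames)
        h = self.encoder(x)                             # (batch, C, n_mfcc//4, frames//4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one token per time step
        h, _ = self.attention(h, h, h)                  # multihead self-attention over time
        h = h.mean(dim=1)                               # pool attended frames into one vector
        return self.classifier(h)                       # chunk-level emotion logits


def utterance_decision(chunk_logits: torch.Tensor) -> int:
    """Aggregate chunk-level logits (mean of softmax scores) into a single utterance label."""
    probs = torch.softmax(chunk_logits, dim=-1)         # (n_chunks, n_classes)
    return int(probs.mean(dim=0).argmax())
```

For example, `ChunkEmotionNet()(torch.randn(8, 1, 40, 300))` returns one logit vector per chunk, and `utterance_decision` would then combine the chunks of one utterance into a single predicted emotion, mirroring the chunk-level training and utterance-level evaluation scheme described above.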


Data availability

All three datasets used in this study are openly accessible; however, IEMOCAP requires completing an electronic release form and obtaining permission from its authors. There are no such restrictions on downloading RAVDESS and EmoDB.


Acknowledgements

The authors would like to acknowledge the support of the ADSIP Laboratory at MNIT, Jaipur, for providing the facilities to carry out this work.

Funding

The authors received no financial support for their research, writing, or publication of this paper.

Author information

Contributions

KC was involved in the conceptualization, methodology, implementation, investigation, validation, writing—original draft, writing—review and editing. KKS contributed to supervision, project administration, investigation, validation, review and editing. TV assisted in the supervision, project administration, investigation, validation, review and editing.

Corresponding author

Correspondence to Krishna Chauhan.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Human and animal rights

In research involving human participants, all procedures were carried out in compliance with ethical guidelines.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chauhan, K., Sharma, K.K. & Varma, T. Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP). Circuits Syst Signal Process 42, 5500–5522 (2023). https://doi.org/10.1007/s00034-023-02367-6

