
Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP)


Abstract

A multihead attention-based convolutional neural network (CNN) architecture, termed channel-wise global head pooling (CwGHP), is proposed to improve the classification accuracy of speech emotion recognition. A time-frequency kernel is used in the two-dimensional convolutions to emphasize both the time and frequency scales of the mel-frequency cepstral coefficients (MFCCs). Following the CNN encoder, a multihead attention network is optimized to learn salient discriminative characteristics of audio samples on three emotional speech datasets: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus in English, the Berlin Emotional Speech Database (EmoDB) in German, and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) in North American English. These linguistically diverse datasets demonstrate the robustness of the proposed model. A chunk-level classification approach is used for training, with each segment inheriting the label of its source utterance; during evaluation, the chunk-level emotion predictions are aggregated to classify each sample. Classification accuracy on the IEMOCAP dataset improves to 84.89% unweighted accuracy (UA) and 82.87% weighted accuracy (WA), which is state-of-the-art on this corpus using only the audio modality; compared with the previous best of 79.34% WA and 77.54% UA, the proposed method improves UA by more than 7%. The model is further validated on the two other datasets through a series of experiments that yield satisfactory results. Performance is reported in terms of WA and UA, and per-class effectiveness is additionally assessed with precision, recall, and F1-score.
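To make the pipeline described in the abstract concrete, the following is a minimal sketch of a chunk-level model of this kind: a 2-D convolutional encoder over an MFCC map, multihead self-attention with pooling over time, and aggregation of chunk predictions into an utterance-level decision at evaluation time. It assumes PyTorch; the layer sizes, kernel shapes, head count, and aggregation rule are illustrative placeholders and not the paper's exact CwGHP configuration.

```python
import torch
import torch.nn as nn


class ChunkEmotionNet(nn.Module):
    """CNN encoder over MFCCs + multihead self-attention pooling (illustrative sketch only)."""

    def __init__(self, n_mfcc=40, n_classes=4, channels=64, n_heads=8):
        super().__init__()
        # Two 2-D convolution blocks with time-frequency kernels over the MFCC map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.MaxPool2d(2),
        )
        embed_dim = channels * (n_mfcc // 4)            # channel x frequency features per time step
        self.attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, x):                               # x: (batch, 1, n_mfcc, frames)
        h = self.encoder(x)                             # (batch, C, n_mfcc//4, frames//4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one token per time step
        h, _ = self.attention(h, h, h)                  # multihead self-attention over time
        h = h.mean(dim=1)                               # pool attended frames into one vector
        return self.classifier(h)                       # chunk-level emotion logits


def utterance_decision(chunk_logits: torch.Tensor) -> int:
    """Aggregate chunk-level logits (mean of softmax scores) into a single utterance label."""
    probs = torch.softmax(chunk_logits, dim=-1)         # (n_chunks, n_classes)
    return int(probs.mean(dim=0).argmax())
```

For example, `ChunkEmotionNet()(torch.randn(8, 1, 40, 300))` returns one logit vector per chunk, and `utterance_decision` would then combine the chunks of one utterance into a single predicted emotion, mirroring the chunk-level training and utterance-level evaluation scheme described above.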


Data availability

All three datasets used in this study are openly accessible; however, IEMOCAP requires completing an electronic release form and obtaining permission from its authors. There are no such restrictions on downloading RAVDESS and EmoDB.


Acknowledgements

The authors would like to acknowledge the support of the ADSIP Laboratory at MNIT, Jaipur, for providing the facilities to carry out this work.

Funding

The authors received no financial support for their research, writing, or publication of this paper.

Author information

Contributions

KC was involved in the conceptualization, methodology, implementation, investigation, validation, writing—original draft, writing—review and editing. KKS contributed to supervision, project administration, investigation, validation, review and editing. TV assisted in the supervision, project administration, investigation, validation, review and editing.

Corresponding author

Correspondence to Krishna Chauhan.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Human and animal rights

In research involving human participants, all procedures were carried out in compliance with ethical guidelines.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chauhan, K., Sharma, K.K. & Varma, T. Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP). Circuits Syst Signal Process 42, 5500–5522 (2023). https://doi.org/10.1007/s00034-023-02367-6

