Abstract
Due to the complexity of emotional expression, recognizing emotions from speech is a critical and challenging task. In most studies, certain emotions are easily misclassified. In this paper, we propose a new framework that integrates a cascaded attention mechanism and a joint loss for speech emotion recognition (SER), aiming to resolve feature confusion among emotions that are difficult to classify correctly. First, we extract mel frequency cepstrum coefficients (MFCCs) and compute their deltas and delta-deltas to form 3-dimensional (3D) features, effectively reducing the interference of external factors. Second, we employ spatiotemporal attention to selectively locate target emotion regions in the input features, where self-attention with head fusion captures long-range dependencies in the temporal features. Finally, a joint loss function is employed to separate emotional embeddings with high similarity, enhancing overall performance. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database indicate that the method achieves improvements of 2.49% in weighted accuracy (WA) and 1.13% in unweighted accuracy (UA), respectively, over state-of-the-art strategies.
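The 3D feature construction described in the abstract (MFCCs stacked with their deltas and delta-deltas) can be sketched as follows. This is an illustrative assumption, not the authors' exact pipeline: the deltas use the standard regression formula, and a random array stands in for real MFCCs (which would typically be computed with a library such as librosa).

```python
import numpy as np

def deltas(feat, N=2):
    """Regression-based delta features along the time axis.

    feat: (n_coeffs, T) array, e.g. MFCCs.
    N: half-window size (2 is a common default).
    """
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Pad the time axis by repeating edge frames so output length matches input.
    padded = np.pad(feat, ((0, 0), (N, N)), mode="edge")
    T = feat.shape[1]
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[:, N + n : N + n + T] - padded[:, N - n : N - n + T])
    return out / denom

# Stand-in for real MFCCs: 40 coefficients over 100 frames.
mfcc = np.random.randn(40, 100)
d1 = deltas(mfcc)                       # first-order deltas
d2 = deltas(d1)                         # second-order deltas (delta-deltas)
features_3d = np.stack([mfcc, d1, d2])  # shape (3, 40, 100): three "channels"
```

Stacking the three streams as channels yields an image-like tensor that convolutional front-ends, like the one used in this framework, can consume directly.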
References
J. H. Tao, J. Huang, Y. Li, Z. Lian, M. Y. Niu. Correction to: Semi-supervised ladder networks for speech emotion recognition. International Journal of Automation and Computing, vol. 18, no. 4, Article number 680, 2021. DOI: https://doi.org/10.1007/s11633-019-1215-6.
E. M. Schmidt, Y. E. Kim. Learning emotion-based acoustic features with deep belief networks. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, pp. 65–68, 2011. DOI: https://doi.org/10.1109/ASPAA.2011.6082328.
K. Han, D. Yu, I. Tashev. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the 15th Annual Conference of the International Speech Communication Association, Singapore, pp. 223–227, 2014.
Q. Mao, M. Dong, Z. W. Huang, Y. Z. Zhan. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014. DOI: https://doi.org/10.1109/TMM.2014.2360798.
M. Y. Chen, X. J. He, J. Yang, H. Zhang. 3D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018. DOI: https://doi.org/10.1109/LSP.2018.2860246.
Y. Liu, H. Q. Sun, W. B. Guan, Y. Q. Xia, Z. Zhao. Discriminative feature representation based on cascaded attention network with adversarial joint loss for speech emotion recognition. In Proceedings of Interspeech, pp. 4750–4754, 2022.
S. Mirsamadi, E. Barsoum, C. Zhang. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, pp. 2227–2231, 2017.
Q. P. Chen, G. M. Huang. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Engineering Applications of Artificial Intelligence, vol. 102, Article number 104277, 2021. DOI: https://doi.org/10.1016/j.engappai.2021.104277.
Y. Liu, H. Q. Sun, W. B. Guan, Y. Q. Xia, Z. Zhao. Multimodal speech emotion recognition using self-attention mechanism and multi-scale fusion framework. Speech Communication, vol. 130, pp. 1–9, 2022. DOI: https://doi.org/10.1016/j.specom.2022.02.006.
M. K. Xu, F. Zhang, S. U. Khan. Improve accuracy of speech emotion recognition with attention head fusion. In Proceedings of the 10th Annual Computing and Communication Workshop and Conference, IEEE, Las Vegas, USA, pp. 1058–1064, 2020. DOI: https://doi.org/10.1109/CCWC47524.2020.9031207.
C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008. DOI: https://doi.org/10.1007/s10579-008-9076-6.
S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, C. Y. Espy-Wilson. Adversarial auto-encoders for speech based emotion recognition. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 1243–1247, 2017.
D. Y. Dai, Z. Y. Wu, R. N. Li, X. X. Wu, J. Jia, H. Meng. Learning discriminative features from spectrograms using center loss for speech emotion recognition. In Proceedings of ICASSP/IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Brighton, UK, pp. 7405–7409, 2019. DOI: https://doi.org/10.1109/ICASSP.2019.8683765.
Y. Gao, J. X. Liu, L. B. Wang, J. W. Dang. Metric learning based feature representation with gated fusion model for speech emotion recognition. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, pp. 4503–4507, 2021.
L. Tarantino, P. N. Garner, A. Lazaridis. Self-attention for speech emotion recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 2578–2582, 2019.
J. W. Liu, H. X. Wang. A speech emotion recognition framework for better discrimination of confusions. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, pp. 4483–4487, 2021.
A. Satt, S. Rozenberg, R. Hoory. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 1089–1093, 2017.
P. C. Li, Y. Song, I. V. McLoughlin, W. Guo, L. R. Dai. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, pp. 3087–3091, 2018.
Acknowledgements
This work was supported by Natural Science Foundation of Shandong Province, China (No. ZR2020QF007).
Author information
Additional information
Yang Liu received the B. Eng. and M. Eng. degrees in computer science and technology from Tianjin University, China in 2010 and 2012, respectively, and the Ph. D. degree in information science from Japan Advanced Institute of Science and Technology, Japan in 2016. Currently, he is a lecturer with the Department of Information Science and Technology, Qingdao University of Science and Technology, China.
His research interests include speech signal processing, life prediction of mechanical equipment and robotic theory.
Haoqin Sun received the B. Eng. degree in international digital media from Qingdao University, China in 2020. Currently, he is a master student in software engineering at the Department of Software Engineering, Qingdao University of Science and Technology, China.
His research interest is speech emotion recognition.
Wenbo Guan received the B. Eng. degree in computer science and technology from Jiangsu University of Science and Technology, China in 2019. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China.
His research interest is speech separation.
Yuqi Xia received the B. Eng. degree in computer science and technology from Shenyang Normal University, China in 2018. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China.
His research interest is speech emotion recognition.
Zhen Zhao received the Ph. D. degree in systems engineering from Tongji University, China in 2011. Currently, he is an associate professor with the Department of Information Science and Technology, Qingdao University of Science and Technology, China.
His research interests include speech emotion recognition, artificial intelligence and edge computing.
Conflict of interest
The authors declare that they have no conflict of interest with respect to this work.
Colored figures are available in the online version at https://link.springer.com/journal/11633
About this article
Cite this article
Liu, Y., Sun, H., Guan, W. et al. Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions. Mach. Intell. Res. 20, 595–604 (2023). https://doi.org/10.1007/s11633-022-1356-x