Abstract
Due to the complexity of emotional expression, recognizing emotions from speech is a critical and challenging task. In most studies, certain emotions are easily misclassified. In this paper, we propose a new framework that integrates a cascaded attention mechanism and a joint loss for speech emotion recognition (SER), aiming to resolve feature confusion among emotions that are difficult to classify correctly. First, we extract mel frequency cepstrum coefficients (MFCCs) and compute their deltas and delta-deltas to form 3-dimensional (3D) features, effectively reducing the interference of external factors. Second, we employ spatiotemporal attention to selectively locate target emotion regions in the input features, where self-attention with head fusion captures long-range dependencies in the temporal features. Finally, a joint loss function is employed to separate emotional embeddings with high similarity, enhancing overall performance. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database indicate that the method achieves improvements of 2.49% in weighted accuracy (WA) and 1.13% in unweighted accuracy (UA), respectively, over state-of-the-art strategies.
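The 3D feature construction described in the abstract (MFCCs stacked with their deltas and delta-deltas) can be sketched as follows. This is an illustrative assumption, not the authors' exact pipeline: the deltas use the standard regression formula, and a random array stands in for real MFCCs (which would typically be computed with a library such as librosa).

```python
import numpy as np

def deltas(feat, N=2):
    """Regression-based delta features along the time axis.

    feat: (n_coeffs, T) array, e.g. MFCCs.
    N: half-window size (2 is a common default).
    """
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Pad the time axis by repeating edge frames so output length matches input.
    padded = np.pad(feat, ((0, 0), (N, N)), mode="edge")
    T = feat.shape[1]
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[:, N + n : N + n + T] - padded[:, N - n : N - n + T])
    return out / denom

# Stand-in for real MFCCs: 40 coefficients over 100 frames.
mfcc = np.random.randn(40, 100)
d1 = deltas(mfcc)                       # first-order deltas
d2 = deltas(d1)                         # second-order deltas (delta-deltas)
features_3d = np.stack([mfcc, d1, d2])  # shape (3, 40, 100): three "channels"
```

Stacking the three streams as channels yields an image-like tensor that convolutional front-ends, like the one used in this framework, can consume directly.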
References
J. H. Tao, J. Huang, Y. Li, Z. Lian, M. Y. Niu. Correction to: Semi-supervised ladder networks for speech emotion recognition. International Journal of Automation and Computing, vol. 18, no. 4, Article number 680, 2021. DOI: https://doi.org/10.1007/s11633-019-1215-6.
E. M. Schmidt, Y. E. Kim. Learning emotion-based acoustic features with deep belief networks. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, pp. 65–68, 2011. DOI: https://doi.org/10.1109/ASPAA.2011.6082328.
K. Han, D. Yu, I. Tashev. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the 15th Annual Conference of the International Speech Communication Association, Singapore, pp. 223–227, 2014.
Q. Mao, M. Dong, Z. W. Huang, Y. Z. Zhan. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014. DOI: https://doi.org/10.1109/TMM.2014.2360798.
M. Y. Chen, X. J. He, J. Yang, H. Zhang. 3D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018. DOI: https://doi.org/10.1109/LSP.2018.2860246.
Y. Liu, H. Q. Sun, W. B. Guan, Y. Q. Xia, Z. Zhao. Discriminative feature representation based on cascaded attention network with adversarial joint loss for speech emotion recognition. In Proceedings of Interspeech, pp. 4750–4754, 2022.
S. Mirsamadi, E. Barsoum, C. Zhang. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, pp. 2227–2231, 2017.
Q. P. Chen, G. M. Huang. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Engineering Applications of Artificial Intelligence, vol. 102, Article number 104277, 2021. DOI: https://doi.org/10.1016/j.engappai.2021.104277.
Y. Liu, H. Q. Sun, W. B. Guan, Y. Q. Xia, Z. Zhao. Multimodal speech emotion recognition using self-attention mechanism and multi-scale fusion framework. Speech Communication, vol. 130, pp. 1–9, 2022. DOI: https://doi.org/10.1016/j.specom.2022.02.006.
M. K. Xu, F. Zhang, S. U. Khan. Improve accuracy of speech emotion recognition with attention head fusion. In Proceedings of the 10th Annual Computing and Communication Workshop and Conference, IEEE, Las Vegas, USA, pp. 1058–1064, 2020. DOI: https://doi.org/10.1109/CCWC47524.2020.9031207.
C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008. DOI: https://doi.org/10.1007/s10579-008-9076-6.
S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, C. Y. Espy-Wilson. Adversarial auto-encoders for speech based emotion recognition. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 1243–1247, 2017.
D. Y. Dai, Z. Y. Wu, R. N. Li, X. X. Wu, J. Jia, H. Meng. Learning discriminative features from spectrograms using center loss for speech emotion recognition. In Proceedings of ICASSP/IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Brighton, UK, pp. 7405–7409, 2019. DOI: https://doi.org/10.1109/ICASSP.2019.8683765.
Y. Gao, J. X. Liu, L. B. Wang, J. W. Dang. Metric learning based feature representation with gated fusion model for speech emotion recognition. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, pp. 4503–4507, 2021.
L. Tarantino, P. N. Garner, A. Lazaridis. Self-attention for speech emotion recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 2578–2582, 2019.
J. W. Liu, H. X. Wang. A speech emotion recognition framework for better discrimination of confusions. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, pp. 4483–4487, 2021.
A. Satt, S. Rozenberg, R. Hoory. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 1089–1093, 2017.
P. C. Li, Y. Song, I. V. McLoughlin, W. Guo, L. R. Dai. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, pp. 3087–3091, 2018.
Acknowledgements
This work was supported by Natural Science Foundation of Shandong Province, China (No. ZR2020QF007).
Author information
Additional information
Yang Liu received the B. Eng. and M. Eng. degrees in computer science and technology from Tianjin University, China in 2010 and 2012, respectively, and the Ph. D. degree in information science from Japan Advanced Institute of Science and Technology, Japan in 2016. Currently, he is a lecturer with the Department of Information Science and Technology, Qingdao University of Science and Technology, China.
His research interests include speech signal processing, life prediction of mechanical equipment and robotic theory.
Haoqin Sun received the B. Eng. degree in international digital media from Qingdao University, China in 2020. Currently, he is a master student in software engineering at the Department of Software Engineering, Qingdao University of Science and Technology, China.
His research interest is speech emotion recognition.
Wenbo Guan received the B. Eng. degree in computer science and technology from Jiangsu University of Science and Technology, China in 2019. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China.
His research interest is speech separation.
Yuqi Xia received the B. Eng. degree in computer science and technology from Shenyang Normal University, China in 2018. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China.
His research interest is speech emotion recognition.
Zhen Zhao received the Ph. D. degree in systems engineering from Tongji University, China in 2011. Currently, he is an associate professor with the Department of Information Science and Technology, Qingdao University of Science and Technology, China.
His research interests include speech emotion recognition, artificial intelligence and edge computing.
Conflict of interest
The authors declare that they have no conflict of interest with respect to this work.
Colored figures are available in the online version at https://link.springer.com/journal/11633
About this article
Cite this article
Liu, Y., Sun, H., Guan, W. et al. Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions. Mach. Intell. Res. 20, 595–604 (2023). https://doi.org/10.1007/s11633-022-1356-x