Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions

  • Research Article
  • Published in Machine Intelligence Research

Abstract

Due to the complexity of emotional expression, recognizing emotions from speech is a critical and challenging task. In most studies, certain emotions are easily misclassified. In this paper, we propose a new framework that integrates a cascaded attention mechanism and a joint loss for speech emotion recognition (SER), aiming to resolve the feature confusion among emotions that are difficult to classify correctly. First, we extract mel-frequency cepstral coefficients (MFCCs) together with their deltas and delta-deltas to form 3-dimensional (3D) features, effectively reducing the interference of external factors. Second, we employ spatiotemporal attention to selectively discover target emotion regions in the input features, where self-attention with head fusion captures the long-range dependencies of the temporal features. Finally, a joint loss function is employed to separate emotional embeddings with high similarity, enhancing overall performance. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database show that the method improves weighted accuracy (WA) by 2.49% and unweighted accuracy (UA) by 1.13% compared to state-of-the-art strategies.
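For concreteness, the 3D input described above can be assembled as in the following sketch, which uses librosa; the MFCC order, sampling rate, and channel ordering here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: stack MFCCs with their deltas and delta-deltas into 3D features.
# n_mfcc=40 and sr=16000 are assumed values, not taken from the paper.
import librosa
import numpy as np

def make_3d_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # first-order temporal derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivative
    # Result: (3, n_mfcc, n_frames) -- static, delta, and delta-delta channels.
    return np.stack([mfcc, delta, delta2], axis=0)
```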
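The self-attention with head fusion can be read as multi-head self-attention whose per-head attention maps are averaged into a single map before weighting the values. The PyTorch sketch below assumes that reading; the paper's exact fusion rule may differ.

```python
# Sketch: self-attention whose per-head attention maps are averaged ("fused").
import torch
import torch.nn as nn

class HeadFusionSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape                                        # x: (batch, time, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (b, heads, t, t)
        fused = torch.softmax(scores, dim=-1).mean(dim=1)        # average heads: (b, t, t)
        return self.proj(fused @ v)                              # fused map weights the values
```

Averaging the maps lets the heads reinforce or dampen one another's alignments, which is one way such fusion can stabilize the long-range temporal dependencies a single head captures noisily.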
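The abstract does not spell out the joint loss; one common instantiation in SER combines softmax cross-entropy with a center loss that pulls same-emotion embeddings toward a learned class center, which is what the sketch below implements. The weighting factor lam is an assumed hyperparameter.

```python
# Sketch: joint loss = cross-entropy + lam * center loss (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int, lam: float = 0.1):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.lam = lam

    def forward(self, logits, embeddings, labels):
        ce = F.cross_entropy(logits, labels)       # classification term
        diff = embeddings - self.centers[labels]   # distance to own class center
        center = diff.pow(2).sum(dim=1).mean()     # intra-class compactness term
        return ce + self.lam * center
```

Tightening each class around its center shrinks the overlap between highly similar emotional embeddings, which is the discrimination effect the abstract attributes to the joint loss.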



Acknowledgements

This work was supported by Natural Science Foundation of Shandong Province, China (No. ZR2020QF007).

Author information

Corresponding author

Correspondence to Zhen Zhao.

Additional information

Yang Liu received the B. Eng. and M. Eng. degrees in computer science and technology from Tianjin University, China in 2010 and 2012, respectively, and the Ph. D. degree in information science from Japan Advanced Institute of Science and Technology, Japan in 2016. Currently, he is a lecturer with the Department of Information Science and Technology, Qingdao University of Science and Technology, China.

His research interests include speech signal processing, life prediction of mechanical equipment and robotic theory.

Haoqin Sun received the B. Eng. degree in international digital media from Qingdao University, China in 2020. Currently, he is a master student in software engineering at the Department of Software Engineering, Qingdao University of Science and Technology, China.

His research interest is speech emotion recognition.

Wenbo Guan received the B. Eng. degree in computer science and technology from Jiangsu University of Science and Technology, China in 2019. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China.

His research interest is speech separation.

Yuqi Xia received the B. Eng. degree in computer science and technology from Shenyang Normal University, China in 2018. Currently, he is a master student in electronic information at the Department of Electronic Information, Qingdao University of Science and Technology, China.

His research interest is speech emotion recognition.

Zhen Zhao received the Ph. D. degree in systems engineering from Tongji University, China in 2011. Currently, he is an associate professor with the Department of Information Science and Technology, Qingdao University of Science and Technology, China.

His research interests include speech emotion recognition, artificial intelligence and edge computing.

Declaration of conflict of interest

The authors declare that they have no conflicts of interest regarding this work.

Colored figures are available in the online version at https://link.springer.com/journal/11633


About this article


Cite this article

Liu, Y., Sun, H., Guan, W. et al. Speech Emotion Recognition Using Cascaded Attention Network with Joint Loss for Discrimination of Confusions. Mach. Intell. Res. 20, 595–604 (2023). https://doi.org/10.1007/s11633-022-1356-x

