Abstract
Time-sequence-based speech emotion recognition methods struggle to distinguish emotional from non-emotional frames of speech and cannot quantify the amount of emotional information each emotional frame carries. In this paper, we propose a speech emotion recognition method that combines a Local Attention Mechanism (LAM) with Connectionist Temporal Classification (CTC) to address these issues. First, we extract Variational Gammatone Frequency Cepstral Coefficient (VGFCC) emotional features from the speech as input to a shared LAM-CTC encoder. Second, the CTC layer performs automatic hard alignment, which drives the network to produce its largest activation at the emotional key frames of the utterance, while the LAM layer learns to weight the emotional auxiliary frames to different degrees. Finally, a back-propagation (BP) neural network fuses the decoding outputs of the CTC and LAM layers to produce the emotion prediction. Evaluation on IEMOCAP shows that the proposed model outperforms state-of-the-art methods, with a UAR of 68.5% and a WAR of 68.1%.
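The abstract describes attention restricted to the neighborhood of emotionally salient frames, anchored by CTC's hard alignment. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: a NumPy local-attention pooling over frame-level features, centered on a key frame (e.g. a CTC activation peak). The function name `local_attention_pool`, the scoring vector `w`, and the window `radius` are all assumed names for illustration.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

def local_attention_pool(frames, w, center, radius):
    """Pool frame features with attention restricted to a local window.

    frames : (T, D) array of frame-level encoder features
    w      : (D,) scoring vector (learned in a real model)
    center : index of the key frame (e.g. the CTC activation peak)
    radius : half-width of the local attention window
    Returns a (D,) utterance-level feature vector.
    """
    T = frames.shape[0]
    lo, hi = max(0, center - radius), min(T, center + radius + 1)
    window = frames[lo:hi]          # attend only near the key frame
    alpha = softmax(window @ w)     # attention weights over the window
    return alpha @ window           # weighted sum of auxiliary frames

# Usage: pick the key frame as the peak of a (hypothetical) CTC posterior,
# then pool the surrounding frames with local attention.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))     # T=50 frames, D=8 features
ctc_posterior = rng.random(50)            # stand-in for CTC emotion activations
utt_vec = local_attention_pool(frames, rng.standard_normal(8),
                               center=int(np.argmax(ctc_posterior)), radius=4)
```

In a full model, `utt_vec` would be fused (per the paper, by a BP network) with the CTC decoding output before classification; here it simply demonstrates windowed attention around a single anchor frame.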
Supported by The Natural Science Foundation of Shandong Province (No. ZR2020QF007).
Copyright information
© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Cite this paper
Meng, L., Sun, Z., Liu, Y., Zhao, Z., Li, Y. (2021). Improved Speech Emotion Recognition Using LAM and CTC. In: Xiong, J., Wu, S., Peng, C., Tian, Y. (eds) Mobile Multimedia Communications. MobiMedia 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 394. Springer, Cham. https://doi.org/10.1007/978-3-030-89814-4_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89813-7
Online ISBN: 978-3-030-89814-4
eBook Packages: Computer Science, Computer Science (R0)