
Speech Emotion Recognition Model Based on CRNN-CTC

  • Conference paper
2020 International Conference on Applications and Techniques in Cyber Intelligence (ATCI 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1244)

Abstract

The CRNN (Convolutional Recurrent Neural Network) deep learning model is currently a typical approach to speech emotion recognition. A CRNN maps each speech sequence, regardless of its length, to a single emotion label. However, emotional information is generally distributed unevenly across the frames of a speech sample, and this degrades the model's recognition performance. To address this problem, this paper proposes a speech emotion recognition model based on CRNN-CTC (Convolutional Recurrent Neural Network-Connectionist Temporal Classification). Building on the CRNN model, the frames of each speech sample are first divided into emotional and non-emotional frames. The CTC method is then used to make the network focus its learning on the emotional frames, which avoids the performance loss caused by learning from non-emotional frames. Experimental results show that the model achieves a weighted average recall (WAR) of 70.11% and an unweighted average recall (UAR) of 69.53%, a significant improvement in speech emotion recognition performance over the CRNN model.
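The two ideas the abstract relies on can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions that this preview does not confirm: that non-emotional frames play the role of the CTC blank symbol (written "-" here), that frame-level outputs are decoded with the standard CTC collapsing rule (merge repeats, then drop blanks), and that WAR and UAR follow their usual definitions (overall accuracy, and the mean of per-class recall). The function names and the toy labels are hypothetical.

```python
from collections import Counter


def ctc_collapse(frames, blank="-"):
    """Standard CTC collapsing: merge consecutive repeats, then drop blanks.

    Assumption: `blank` stands in for non-emotional frames.
    """
    collapsed = []
    previous = None
    for label in frames:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


def war_uar(y_true, y_pred):
    """Return (WAR, UAR): overall accuracy and mean per-class recall."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    per_class_total = Counter(y_true)
    per_class_correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    war = sum(per_class_correct.values()) / len(y_true)
    uar = sum(per_class_correct[c] / n for c, n in per_class_total.items())
    uar /= len(per_class_total)
    return war, uar


# Toy frame-level output: only some frames carry emotion.
frames = ["-", "-", "angry", "angry", "-", "angry", "-"]
collapsed = ctc_collapse(frames)

# Toy utterance-level evaluation on an imbalanced label set.
y_true = ["angry"] * 6 + ["happy"] * 2 + ["sad"] * 2
y_pred = ["angry"] * 6 + ["happy", "sad"] + ["sad", "angry"]
war, uar = war_uar(y_true, y_pred)
```

In the toy evaluation, WAR is higher than UAR because the majority class is recognized perfectly while the two smaller classes are not. This is why both figures are usually reported together on imbalanced emotion datasets.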



Acknowledgements

This work was supported by the Characteristics innovation project of colleges and universities of Guangdong Province (Natural Science, No. 2019KTSCX235).

Author information


Corresponding author

Correspondence to Weihuang Dai.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, Z., Dai, W., Hu, Y., Wang, J., Li, J. (2021). Speech Emotion Recognition Model Based on CRNN-CTC. In: Abawajy, J., Choo, KK., Xu, Z., Atiquzzaman, M. (eds) 2020 International Conference on Applications and Techniques in Cyber Intelligence. ATCI 2020. Advances in Intelligent Systems and Computing, vol 1244. Springer, Cham. https://doi.org/10.1007/978-3-030-53980-1_113
