
Syllable Level Speech Emotion Recognition Based on Formant Attention

  • Conference paper
Artificial Intelligence (CICAI 2021)

Abstract

The performance of speech emotion recognition (SER) systems can be significantly compromised by the sentence structure of the words being spoken. Since the relation between the affective and lexical content of speech is difficult to determine from a small training sample, temporal-sequence-based pattern recognition methods fail to generalize across different sentences in the wild. In this paper, a method is proposed that recognizes emotion for each syllable separately instead of applying pattern recognition to a whole utterance. The work emphasizes the preprocessing of the received audio samples: the skeleton structure of the Mel-spectrum is extracted using a formant attention method, and utterances are then sliced into syllables based on contextual changes in the formants. The proposed syllable onset detection and feature extraction method is validated on two databases for the accuracy of emotional class prediction. The suggested SER method achieves up to 67% and 55% unweighted accuracy on the IEMOCAP and MSP-Improv datasets, respectively. The effectiveness of the method is demonstrated by the experimental results and by comparison with state-of-the-art SER methods.
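The pipeline outlined in the abstract can be pictured as a front end that first builds a Mel-spectral "skeleton" and then cuts the utterance at abrupt formant changes before any per-syllable classification. The following is a minimal, illustrative sketch of such a front end, assuming librosa is available; the peak-band tracking, the `jump_mels` threshold, and the function name `syllable_slices` are placeholders of my own, not the formant attention method described in the paper.

```python
import numpy as np
import librosa

def syllable_slices(wav_path, sr=16000, n_mels=64, jump_mels=8):
    """Rough stand-in for syllable onset detection on a log-Mel spectrogram (illustrative only)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # shape: (n_mels, n_frames)

    # "Skeleton": strongest Mel band per frame, used here as a crude formant proxy.
    peak_band = log_mel.argmax(axis=0)

    # Mark an onset wherever the dominant band jumps by more than `jump_mels` bands,
    # a simple proxy for "contextual changes in the formants".
    onsets = np.flatnonzero(np.abs(np.diff(peak_band)) > jump_mels) + 1
    bounds = np.concatenate(([0], onsets, [log_mel.shape[1]]))

    # One log-Mel segment per detected "syllable"; a per-syllable emotion classifier
    # (not specified in the abstract, hence omitted) would consume these segments.
    return [log_mel[:, b:e] for b, e in zip(bounds[:-1], bounds[1:]) if e > b]
```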

This work was supported in part by the National Natural Science Foundation of China under Grant 61976197, 61403422 and 61273102, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2018CFB447 and 2015CFA010, in part by the Wuhan Science and Technology Project under Grant 2020010601012175, in part by the 111 Project under Grant B17040, and in part by the Fundamental Research Funds for National University, China University of Geosciences, Wuhan, under Grant 1910491T01.



Author information

Correspondence to Abdul Rehman.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Rehman, A., Liu, ZT., Xu, JM. (2021). Syllable Level Speech Emotion Recognition Based on Formant Attention. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science, vol 13070. Springer, Cham. https://doi.org/10.1007/978-3-030-93049-3_22

  • DOI: https://doi.org/10.1007/978-3-030-93049-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93048-6

  • Online ISBN: 978-3-030-93049-3

  • eBook Packages: Computer Science, Computer Science (R0)
