
Syllable Level Speech Emotion Recognition Based on Formant Attention

  • Conference paper
Artificial Intelligence (CICAI 2021)

Abstract

The performance of speech emotion recognition (SER) systems can be significantly compromised by the sentence structure of the words being spoken. Since the relation between the affective and lexical content of speech is difficult to determine from a small training sample, temporal-sequence-based pattern recognition methods fail to generalize across different sentences in the wild. In this paper, a method is proposed that recognizes emotion for each syllable separately instead of applying pattern recognition to a whole utterance. The work emphasizes the preprocessing of the received audio samples: the skeleton structure of the Mel-spectrum is extracted using a formant attention method, and utterances are then sliced into syllables based on contextual changes in the formants. The proposed syllable onset detection and feature extraction method is validated on two databases for the accuracy of emotional class prediction. The suggested SER method achieves up to 67% and 55% unweighted accuracy on the IEMOCAP and MSP-Improv datasets, respectively. The effectiveness of the method is demonstrated by the experimental results and by comparison with state-of-the-art SER methods.
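The pipeline outlined in the abstract can be pictured as a front end that first builds a Mel-spectral "skeleton" and then cuts the utterance at abrupt formant changes before any per-syllable classification. The following is a minimal, illustrative sketch of such a front end, assuming librosa is available; the peak-band tracking, the `jump_mels` threshold, and the function name `syllable_slices` are placeholders of my own, not the formant attention method described in the paper.

```python
import numpy as np
import librosa

def syllable_slices(wav_path, sr=16000, n_mels=64, jump_mels=8):
    """Rough stand-in for syllable onset detection on a log-Mel spectrogram (illustrative only)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # shape: (n_mels, n_frames)

    # "Skeleton": strongest Mel band per frame, used here as a crude formant proxy.
    peak_band = log_mel.argmax(axis=0)

    # Mark an onset wherever the dominant band jumps by more than `jump_mels` bands,
    # a simple proxy for "contextual changes in the formants".
    onsets = np.flatnonzero(np.abs(np.diff(peak_band)) > jump_mels) + 1
    bounds = np.concatenate(([0], onsets, [log_mel.shape[1]]))

    # One log-Mel segment per detected "syllable"; a per-syllable emotion classifier
    # (not specified in the abstract, hence omitted) would consume these segments.
    return [log_mel[:, b:e] for b, e in zip(bounds[:-1], bounds[1:]) if e > b]
```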

This work was supported in part by the National Natural Science Foundation of China under Grant 61976197, 61403422 and 61273102, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2018CFB447 and 2015CFA010, in part by the Wuhan Science and Technology Project under Grant 2020010601012175, in part by the 111 Project under Grant B17040, and in part by the Fundamental Research Funds for National University, China University of Geosciences, Wuhan, under Grant 1910491T01.



Author information

Correspondence to Abdul Rehman.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Rehman, A., Liu, ZT., Xu, JM. (2021). Syllable Level Speech Emotion Recognition Based on Formant Attention. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science, vol 13070. Springer, Cham. https://doi.org/10.1007/978-3-030-93049-3_22

  • DOI: https://doi.org/10.1007/978-3-030-93049-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93048-6

  • Online ISBN: 978-3-030-93049-3

  • eBook Packages: Computer Science, Computer Science (R0)
