Abstract
The performance of speech emotion recognition (SER) systems can be significantly compromised by the sentence structure of the words being spoken. Because the relation between the affective and lexical content of speech is difficult to learn from a small training sample, pattern recognition methods that operate on whole temporal sequences fail to generalize across different sentences in the wild. This paper proposes a method that recognizes emotion for each syllable separately instead of applying pattern recognition to the whole utterance. The work emphasizes preprocessing of the received audio samples: the skeleton structure of the Mel spectrum is extracted using a formant attention method, and utterances are then sliced into syllables based on contextual changes in the formants. The proposed syllable onset detection and feature extraction method is validated for the accuracy of emotional class prediction on two databases. The proposed SER method achieves up to 67% and 55% unweighted accuracy on the IEMOCAP and MSP-Improv datasets, respectively. The effectiveness of the method is demonstrated by the experimental results and by comparison with state-of-the-art SER methods.
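The abstract does not spell out how the formant-attention skeleton or the syllable slicing is computed, so the following Python sketch is only a rough illustration of the described pipeline under stated assumptions: per-frame Mel-band peaks are used as a crude formant proxy, and a syllable boundary is marked wherever those peak tracks change abruptly. The librosa and SciPy calls are standard, but the peak-picking scheme, the thresholds (jump, min_gap), and the file name utterance.wav are placeholders, not the authors' actual method.

```python
# Illustrative sketch (assumptions noted above), NOT the paper's implementation:
# (1) build a Mel-spectrogram "skeleton" from per-frame spectral peaks,
# (2) segment the utterance where the peak tracks jump between frames.
import numpy as np
import librosa
from scipy.signal import find_peaks

def formant_skeleton(y, sr, n_mels=64, hop=160, n_peaks=3):
    """Per-frame Mel-band indices of the strongest peaks (crude formant proxy)."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=hop, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    skeleton = np.zeros((n_peaks, S_db.shape[1]))
    for t in range(S_db.shape[1]):
        # keep peaks within 60 dB of the global maximum
        peaks, props = find_peaks(S_db[:, t], height=-60)
        order = np.argsort(props["peak_heights"])[::-1][:n_peaks]
        top = np.sort(peaks[order])          # strongest peaks, low band first
        skeleton[:len(top), t] = top
    return skeleton

def syllable_onsets(skeleton, jump=4, min_gap=5):
    """Frames where the peak tracks move by more than `jump` Mel bands in total."""
    change = np.abs(np.diff(skeleton, axis=1)).sum(axis=0)
    onsets, last = [0], 0
    for t, c in enumerate(change, start=1):
        if c > jump and t - last >= min_gap:  # enforce a minimum segment length
            onsets.append(t)
            last = t
    return onsets

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder audio path
onsets = syllable_onsets(formant_skeleton(y, sr))
print("candidate syllable boundaries (frame indices):", onsets)
```

In this sketch, each candidate segment between consecutive onsets would then be classified individually, and the per-syllable predictions aggregated into an utterance-level label; the abstract does not state how the authors perform that aggregation.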
This work was supported in part by the National Natural Science Foundation of China under Grants 61976197, 61403422, and 61273102, in part by the Hubei Provincial Natural Science Foundation of China under Grants 2018CFB447 and 2015CFA010, in part by the Wuhan Science and Technology Project under Grant 2020010601012175, in part by the 111 Project under Grant B17040, and in part by the Fundamental Research Funds for National University, China University of Geosciences, Wuhan, under Grant 1910491T01.
Cite this paper
Rehman, A., Liu, Z.-T., Xu, J.-M. (2021). Syllable Level Speech Emotion Recognition Based on Formant Attention. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science, vol. 13070. Springer, Cham. https://doi.org/10.1007/978-3-030-93049-3_22