Abstract
In this paper, we study the speech emotion feature optimization using stochastic optimization algorithms, and feature compensation using deep neural networks. We also proposed to use accentuation-based fusion for long-time speech emotion recognition. Firstly, the extraction method of emotional features is studied, and a series of speech features are constructed for the recognition of emotion. Secondly, we propose a method of sample adaptation through denoising autoencoder to enhance the versatility of features through the mapping of sample features to improve adaptive ability. Thirdly, GA and SFLA are used to optimize the combination of features to improve the emotion recognition results at the utterance level. Finally, we use transformer model to implement accentuation-based emotion fusion in long-time speech. The continuous long-time speech corpus, as well as the public available EMO-DB, are used for experiments. Results show that the proposed method can effectively improve the performance of long-time speech emotion recognition.
Similar content being viewed by others
Data Availability
The EMO-DB database is the freely available German emotional database. EMO-DB can be downloaded publicly from https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb.
References
A.A. Abdelhamid, E.S.M. El-Kenawy, B. Alotaibi, G.M. Amer, M.Y. Abdelkader, A. Ibrahim, M.M. Eid, Robust speech emotion recognition using CNN+ LSTM based on stochastic fractal search optimization algorithm. IEEE Access 10, 49265–49284 (2022)
S. Akinpelu, S. Viriri, Robust feature selection-based speech emotion classification using deep transfer learning. Appl. Sci. 12(16), 8265 (2022)
F. Albu, D. Hagiescu, L. Vladutu, M.A. Puica, Neural network approaches for children’s emotion recognition in intelligent learning applications, in: International Conference on Education and New Learning Technologies, 3229–3239 (2015)
S.B. Alex, L. Mary, B.P. Babu, Attention and feature selection for automatic speech emotion recognition using utterance and syllable-level prosodic features. Circuits Syst. Signal Process. 39(11), 5681–709 (2020)
T. Anvarjon, S. Kwon, Deep-net: a lightweight cnn-based speech emotion recognition system using deep frequency features. Sensors 20(18), 1–16 (2020)
B.T. Atmaja, A. Sasou, M. Akagi, Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion. Speech Commun. 140, 11–28 (2022)
G. Choudhary, R. Meena, K. Mohbey, Speech emotion based emotion recognition using deep neural networks. J. Phys. Conf. Ser. 2236(1), 012003 (2022)
A. Cowen, D. Keltner, Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proc. Natl. Acad. Sci. 114(38), E7900–E7909 (2017)
M.S. Fahad, A. Deepak, G. Pradhan, J. Yadav, DNN-HMM-based speaker-adaptive emotion recognition using MFCC and epoch-based features. Circuits Syst. Signal Process. 40, 466–89 (2021)
C. Fu, Q. Deng, J. Shen, H. Mahzoon, H. Ishiguro, A preliminary study on realizing human–robot mental comforting dialogue via sharing experience emotionally. Sensors 22(3), 991 (2022)
I. Gat, et al., Speaker normalization for self-supervised speech emotion recognition, in: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7342–7346 (2022)
N. Hajarolasvadi, H. Demirel, 3d cnn-based speech emotion recognition using k-means clustering and spectrograms. Entropy 21(5), 479 (2019)
C. Huang, B. Song, L. Zhao, Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering. Int. J. Speech Technol. 19(4), 805–816 (2016)
C. Huang, Y. Jin, Q. Wang, Speech emotion recognition based on decomposition of feature space and information fusion. J. Signal Process. 26(6), 835–842 (2010)
C. Huang, Y. Jin, Y. Zhao, Y. Yu, L. Zhao, Speech emotion recognition based on re-composition of two-class classifiers, in: The 3rd International conference on affective computing and intelligent interaction and workshops (2009)
Y. Jin, C. Huang, L. Zhao, A semi-supervised learning algorithm based on modified self-training SVM. J. Comput. 6(7), 1438–1443 (2011)
S.R. Kadiri, P. Gangamohan, S.V. Gangashetty, P. Alku, B. Yegnanarayana, Excitation features of speech for emotion recognition using neutral speech as reference. Circuits Syst. Signal Process. 39(9), 4459–81 (2020)
B. Maji, M. Swain, Advanced fusion-based speech emotion recognition system using a dual-attention mechanism with conv-caps and bi-gru features. Electronics 11(9), 1328 (2022)
K. Manohar, E. Logashanmugam, Hybrid deep learning with optimal feature selection for speech emotion recognition using improved meta-heuristic algorithm. Knowl. Based Syst. 246, 108659 (2022)
M. Oaten, R.J. Stevenson, T. Case, Disgust as a disease-avoidance mechanism. Psychol. Bull. 135(2), 303–321 (2009)
T. Özseven, A novel feature selection method for speech emotion recognition. Appl. Acoust. 146, 320–326 (2019)
L. Pandey, R.M. Hegde, Keyword spotting in continuous speech using spectral and prosodic information fusion. Circuits Syst. Signal Process. 38, 2767–91 (2019)
V.M. Praseetha, P.P. Joby, Speech emotion recognition using data augmentation. Int. J. Speech Technol. 25(4), 783–792 (2022)
H. Saad, F. Mahmud, M. Shaheen, M. Hasan, P. Farastu, M. Kabir, Is speech emotion recognition language-independent? Analysis of English and Bangla languages using language-independent vocal features. arXiv preprint, arXiv:2111.10776 (2021)
C. Wu, C. Huang, H. Chen, Text-independent speech emotion recognition using frequency adaptive features. Multimed. Tools Appl. 77(18), 24353–24363 (2018)
X. Xu et al., Graph learning based speaker independent speech emotion recognition. Adv. Electr. Comput. Eng. 14(2), 17–23 (2014)
L. You, H. Jiang, J. Hu, C. H. Chang, L. Chen, X. Cui, M. Zhao, GPU-accelerated faster mean shift with Euclidean distance metrics, in: 2022 IEEE 46th Annual Computers, Software, and Applications Conference, 211–216 (2022)
X. Zhang et al., Recognition of practical speech emotion using improved shuffled frog leaping algorithm. Chin. J. Acoust. 33(4), 441–441 (2014)
C. Zou, C. Huang, D. Han, L. Zhao, Detecting practical speech emotion in a cognitive task, in: 2011 Proceedings of 20th International Conference on Computer Communications and Networks (ICCCN), 1–5 (2011)
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, J., Zhu, J. & Shao, J. Long-Time Speech Emotion Recognition Using Feature Compensation and Accentuation-Based Fusion. Circuits Syst Signal Process 43, 916–940 (2024). https://doi.org/10.1007/s00034-023-02480-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00034-023-02480-6