Long-Time Speech Emotion Recognition Using Feature Compensation and Accentuation-Based Fusion

Circuits, Systems, and Signal Processing

Abstract

In this paper, we study speech emotion feature optimization using stochastic optimization algorithms and feature compensation using deep neural networks. We also propose accentuation-based fusion for long-time speech emotion recognition. First, the extraction of emotional features is studied, and a series of speech features is constructed for emotion recognition. Second, we propose a sample-adaptation method based on a denoising autoencoder, which improves the generality of the features by mapping sample features into a compensated feature space. Third, a genetic algorithm (GA) and the shuffled frog leaping algorithm (SFLA) are used to optimize the feature combination and improve emotion recognition at the utterance level. Finally, we use a Transformer model to implement accentuation-based emotion fusion over long-time speech. A continuous long-time speech corpus, as well as the publicly available EMO-DB, is used in the experiments. The results show that the proposed method effectively improves the performance of long-time speech emotion recognition.
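
For intuition, the feature-compensation step described above can be pictured as a denoising autoencoder trained to map corrupted utterance-level feature vectors back onto clean reference features. The sketch below is a minimal PyTorch illustration under assumed settings (384-dimensional feature vectors, a single hidden layer, Gaussian corruption); it is not the paper's actual architecture or training recipe.

```python
# Minimal denoising-autoencoder sketch for feature compensation.
# Dimensions, corruption model, and optimizer settings are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, feat_dim: int = 384, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def train_step(model, optimizer, clean_feats, noise_std: float = 0.1):
    """One step: corrupt the input features, reconstruct the clean target."""
    noisy = clean_feats + noise_std * torch.randn_like(clean_feats)
    loss = nn.functional.mse_loss(model(noisy), clean_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with hypothetical data: a batch of 32 utterance-level vectors.
model = DenoisingAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(32, 384)
print(train_step(model, opt, feats))
```

The subsequent GA/SFLA stage can then be viewed as a search over binary masks that switch individual dimensions of the compensated feature vector on or off, scored by utterance-level recognition accuracy.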

Data Availability

EMO-DB is a freely available German emotional speech database. It can be downloaded publicly from https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb.
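
As a usage note, the Kaggle copy linked above ships the recordings as individual WAV files whose names encode speaker, text, and emotion following EMO-DB's documented scheme (e.g. 03a01Fa.wav: speaker 03, text a01, emotion letter F, version a). Below is a small Python sketch for indexing the corpus; the local directory path is an assumption.

```python
# Sketch: build an index of (path, speaker, emotion) tuples from an EMO-DB
# download. The emotion letters follow the database's German label codes.
import glob
import os

EMOTION_CODES = {
    "W": "anger",      # Wut
    "L": "boredom",    # Langeweile
    "E": "disgust",    # Ekel
    "A": "fear",       # Angst
    "F": "happiness",  # Freude
    "T": "sadness",    # Trauer
    "N": "neutral",
}

def load_emodb_index(wav_dir: str):
    """Return (path, speaker_id, emotion) for every EMO-DB wav file."""
    index = []
    for path in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
        name = os.path.basename(path)  # e.g. "03a01Fa.wav"
        speaker, emotion_letter = name[:2], name[5]
        index.append((path, speaker, EMOTION_CODES[emotion_letter]))
    return index

if __name__ == "__main__":
    # Hypothetical local path to the extracted wav files.
    for path, speaker, emotion in load_emodb_index("emodb/wav")[:5]:
        print(speaker, emotion, path)
```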

Author information

Corresponding author

Correspondence to Jiu Sun.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, J., Zhu, J. & Shao, J. Long-Time Speech Emotion Recognition Using Feature Compensation and Accentuation-Based Fusion. Circuits Syst Signal Process 43, 916–940 (2024). https://doi.org/10.1007/s00034-023-02480-6
