
Speech emotion recognition using multimodal feature fusion with machine learning approach

Published in Multimedia Tools and Applications.

Abstract

As machine learning advances, speech-based emotional state recognition is becoming increasingly important in artificial intelligence. Proper feature selection is critical for emotion recognition. This work therefore proposes feature fusion as a means of achieving high prediction accuracy, rather than relying on any single extracted feature. Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), Mel spectrogram, Short-Time Fourier Transform (STFT), and Root Mean Square (RMS) features are extracted, and four feature fusion techniques are evaluated with five standard machine learning classifiers: XGBoost, Support Vector Machine (SVM), Random Forest, Decision Tree (D-Tree), and K-Nearest Neighbor (KNN). Applying the feature fusion techniques to the proposed classifiers yields recognition rates of 99.64% on TESS (female-only), 91% on SAVEE (male-only), and 86% on CREMA-D (male and female). The results show that effective feature fusion improves the accuracy and applicability of emotion detection systems.
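As a minimal sketch of the feature-fusion idea (not the authors' implementation), two of the listed features, ZCR and RMS, can be computed per frame and their summary statistics concatenated into one fused vector. MFCC, STFT, and Mel-spectrogram extraction are omitted here, and the frame and hop lengths are illustrative assumptions:

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames (frame sizes are assumed)."""
    n_frames = 1 + max(0, (len(y) - frame_length) // hop_length)
    return np.stack([y[i * hop_length : i * hop_length + frame_length]
                     for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.sign(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rms_energy(frames):
    """Root-mean-square amplitude per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def fuse_features(y):
    """Early fusion: summarize each feature stream (mean, std) and concatenate."""
    frames = frame_signal(y)
    zcr = zero_crossing_rate(frames)
    rms = rms_energy(frames)
    return np.array([zcr.mean(), zcr.std(), rms.mean(), rms.std()])

# Illustration: one second of a 440 Hz sine sampled at 22.05 kHz.
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)
vec = fuse_features(y)  # 4-dimensional fused feature vector
```

In a full pipeline, each fused vector would be one row of the design matrix fed to a classifier such as SVM or XGBoost; other fusion schemes (e.g. concatenating full per-frame streams rather than summary statistics) follow the same pattern.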


[Figures 1–8 appear in the full article.]


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Code availability

Not applicable.


Author information

Authors and Affiliations

Authors

Contributions

The authors’ contributions are summarized below. Sandeep Kumar Panda made substantial contributions to the conception and design of the study and drafted the manuscript. Ajay Kumar Jena and Mohit Ranjan Panda acquired and analyzed the data and conducted its interpretation. Susmita Panda revised the manuscript for critically important intellectual content. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sandeep Kumar Panda.

Ethics declarations

Conflicts of interest/Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Panda, S.K., Jena, A.K., Panda, M.R. et al. Speech emotion recognition using multimodal feature fusion with machine learning approach. Multimed Tools Appl 82, 42763–42781 (2023). https://doi.org/10.1007/s11042-023-15275-3
