
Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion

Published in Multimedia Tools and Applications

Abstract

To address the problem that a single feature cannot adequately represent emotional information, and that such information is difficult to extract in Speech Emotion Recognition (SER), we propose a Feature-Level (FL) fusion method for nine types of acoustic features together with a combined CNN-LSTM network with attention (CNN-A-LSTM). The fused feature vector set, which contains prosodic and spectral features, serves as the input to the CNN-A-LSTM network; High-level Statistical Functions (HSFs) are also added to capture global features. The CNN-A-LSTM feature extraction network reads the time-series input effectively and extracts speech emotion information, and Softmax is applied as the classifier to produce the final emotion labels. Experiments on the SAVEE and CASIA datasets show that the proposed method outperforms state-of-the-art approaches, achieving accuracies of 94.5% on SAVEE and 96.7% on CASIA. These results demonstrate the effectiveness of the proposed algorithm and suggest some generalization ability.
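The abstract gives only a high-level description of the pipeline, not layer-level details. The sketch below is one plausible PyTorch realization of that description: frame-level acoustic features are fused at the feature level with tiled HSF global statistics, then passed through a 1-D CNN, a simple additive attention layer, and an LSTM, with a final softmax classifier. The class name CNNALSTM, the fuse helper, all layer sizes, the number of fused dimensions, and the attention placement are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of the CNN-A-LSTM pipeline described in the abstract.
# Layer sizes, feature dimensions, and attention placement are illustrative
# assumptions; the paper's exact architecture is not specified here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNALSTM(nn.Module):
    def __init__(self, n_feats=120, n_classes=7, hidden=128):
        super().__init__()
        # 1-D CNN over the time axis reads the fused frame-level feature vectors
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # simple additive attention that re-weights CNN time steps
        self.attn = nn.Linear(hidden, 1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                                  # x: (batch, time, n_feats)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, time', hidden)
        w = torch.softmax(self.attn(h), dim=1)             # attention weights over time
        h = h * w                                          # emphasize emotional frames
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])                         # logits; softmax applied at loss

# Feature-level fusion: concatenate frame-level (prosodic + spectral) features,
# then tile utterance-level HSF statistics (mean/std over time) onto every frame.
def fuse(frame_feats):                                     # frame_feats: (batch, time, d)
    hsf = torch.cat([frame_feats.mean(1), frame_feats.std(1)], dim=-1)
    hsf = hsf.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
    return torch.cat([frame_feats, hsf], dim=-1)           # (batch, time, 3d)

x = torch.randn(8, 200, 40)          # e.g., 40 frame-level acoustic features
model = CNNALSTM(n_feats=120)        # 40 frame dims + 80 tiled HSF dims
logits = model(fuse(x))              # (8, 7) class scores
print(F.softmax(logits, dim=-1).shape)
```

Tiling utterance-level HSF statistics onto each frame is one common way to combine global and frame-level information at the feature level; the paper's actual fusion scheme may differ.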


Data availability

The data that support the findings of this study are available from the corresponding author on reasonable request.


Acknowledgements

We would like to thank the providers of the SAVEE and CASIA datasets, and the Institute of Artificial Intelligence Application of Central South University of Forestry and Technology for providing the experimental equipment.

Author information

Corresponding author

Correspondence to Aibin Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, Y., Chen, A., Zhou, G. et al. Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-17829-x
