Abstract
Multimodal emotion recognition uses facial expression and speech information to identify a person's emotional state. Feature fusion enriches the information drawn from each modality and is therefore a key technique for multimodal emotion recognition. However, fusion raises synchronization problems between modalities and, because of the large combined feature dimension, a risk of overfitting. An attention mechanism is therefore introduced so that the network automatically focuses on locally effective information; it is applied both to the audio-video feature-fusion task and to the temporal-modeling task. The main contributions are as follows: 1) a multi-head self-attention mechanism fuses the audio and video features, avoiding the influence of prior information on the fusion result, and 2) a bidirectional gated recurrent unit models the time series of the fused features, with the autocorrelation coefficient along the time dimension additionally computed and used as attention weights for fusion. Experimental results show that the adopted attention mechanisms effectively improve the accuracy of multimodal emotion recognition.
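The two attention uses described above can be sketched in miniature. The snippet below is an illustrative simplification, not the paper's implementation: it shows single-head scaled dot-product self-attention over a two-token sequence (one audio and one video feature vector, with identity query/key/value projections standing in for the learned multi-head projections) and the autocorrelation coefficient of a scalar sequence at a given lag, the quantity the abstract says is computed along the time dimension and used as an attention weight. All function and variable names here are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    query/key/value projections (a stand-in for the learned multi-head
    projections in the actual model)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Attention weights of this token over all tokens.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        # Weighted sum of value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def autocorr(seq, lag):
    """Autocorrelation coefficient of a scalar sequence at the given lag."""
    n = len(seq)
    mean = sum(seq) / n
    var = sum((x - mean) ** 2 for x in seq)
    cov = sum((seq[t] - mean) * (seq[t + lag] - mean) for t in range(n - lag))
    return cov / var if var else 0.0

# Toy fusion: treat one audio and one video feature vector as a
# two-token sequence and let self-attention mix them.
audio = [0.2, 0.9, 0.1, 0.4]
video = [0.7, 0.3, 0.8, 0.5]
fused = self_attention([audio, video])
```

Because each fused vector is a convex combination of the input tokens, every component of `fused[0]` lies between the corresponding audio and video components; in the full model this mixing happens per head on learned projections, and the autocorrelation weights would rescale the BiGRU outputs along time before fusion.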
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62001215 and by the Research Foundation Project of Nanjing Institute of Technology under Grant No. CKJC202001. The authors thank the reviewers for their valuable comments, which significantly improved the quality of the paper.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tang, G., Xie, Y., Li, K. et al. Multimodal emotion recognition from facial expression and speech based on feature fusion. Multimed Tools Appl 82, 16359–16373 (2023). https://doi.org/10.1007/s11042-022-14185-0