A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

Abstract

Continuous emotion recognition plays a crucial role in developing friendly and natural human-computer interaction applications. However, two significant challenges remain unresolved in this field: how to effectively fuse complementary information from multiple modalities, and how to capture long-range contextual dependencies during emotional evolution. In this paper, a novel multimodal continuous emotion recognition framework is proposed to address these challenges. For multimodal fusion, a Multimodal Attention Fusion (MAF) method is proposed to fully exploit the complementarity and redundancy between modalities. To capture temporal context dependencies, a Local Contextual Temporal Convolutional Network (LC-TCN) and a Global Contextual Temporal Convolutional Network (GC-TCN) are presented; these networks progressively integrate multi-scale temporal contextual information from the input streams of the different modalities. Comprehensive experiments are conducted on the RECOLA and SEWA datasets to assess the effectiveness of the proposed framework. The experimental results demonstrate superior recognition performance compared with state-of-the-art approaches, reaching 0.834 and 0.671 on RECOLA and 0.573 and 0.533 on SEWA for arousal and valence, respectively. These findings point to a novel direction for continuous emotion recognition based on exploring temporal multi-scale information.
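The article's code is not reproduced on this page, but the two core ideas named in the abstract, attention-weighted fusion of modality streams and temporal convolutions with growing dilation for multi-scale context, can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration of those generic building blocks, not the authors' MAF, LC-TCN, or GC-TCN implementation; all module names, layer sizes, and tensor shapes are assumptions chosen for demonstration.

```python
# Illustrative sketch only (not the authors' code): a dilated temporal
# convolution block and a simple attention-weighted fusion of per-modality
# feature streams. Dimensions and names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedTCNBlock(nn.Module):
    """One residual TCN block; stacking blocks with growing dilation
    enlarges the temporal receptive field (the multi-scale context idea)."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # keep sequence length unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # residual connection


class AttentionFusion(nn.Module):
    """Weights each modality's features by a learned attention score before
    summing them, a common way to exploit complementary modalities."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):  # feats: list of (batch, time, dim) tensors
        stacked = torch.stack(feats, dim=2)                   # (B, T, M, D)
        weights = torch.softmax(self.score(stacked), dim=2)   # (B, T, M, 1)
        return (weights * stacked).sum(dim=2)                 # (B, T, D)


if __name__ == "__main__":
    audio = torch.randn(4, 100, 64)   # hypothetical audio feature stream
    video = torch.randn(4, 100, 64)   # hypothetical visual feature stream
    fused = AttentionFusion(dim=64)([audio, video])           # (4, 100, 64)
    tcn = nn.Sequential(*[DilatedTCNBlock(64, dilation=2 ** i) for i in range(4)])
    out = tcn(fused.transpose(1, 2))  # Conv1d expects (batch, channels, time)
    print(out.shape)                  # torch.Size([4, 64, 100])
```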

Data Availability

The data that support the findings of this study are available from https://diuf.unifr.ch/main/diva/recola/ and https://db.sewaproject.eu/. Restrictions apply to the availability of these data, which were used under license for the current study and are therefore not publicly available. The data are, however, available from the authors upon reasonable request and with the permission of the corresponding authors of the RECOLA and SEWA datasets.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U2133218), the National Key Research and Development Program of China (No. 2018YFB0204304) and the Fundamental Research Funds for the Central Universities of China (No. FRF-MP-19-007 and No. FRF-TP-20-065A1Z).

Author information

Corresponding author

Correspondence to Baolin Liu.

Ethics declarations

Conflict of Interest

We declare that we have no actual or potential conflicts of interest, including any financial, personal, or other relationships with other people or organizations that could inappropriately influence our work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shi, C., Zhang, Y. & Liu, B. A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos. Appl Intell 54, 3040–3057 (2024). https://doi.org/10.1007/s10489-024-05329-w

