
A multimodal fusion-based deep learning framework combined with keyframe extraction and spatial and channel attention for group emotion recognition from videos

  • Theoretical Advances
  • Published in Pattern Analysis and Applications

Abstract

Video-based group emotion recognition is an important research area in computer vision and is of great significance for the intelligent understanding of videos and for human–computer interaction. Previous studies have adopted a traditional two-stage shallow pipeline that extracts visual or audio features and trains a classifier; however, one or two features are insufficient to represent video information comprehensively. In addition, the sparse expression of emotions in videos has not been addressed effectively. In this study, we therefore propose a novel deep convolutional neural network (CNN) architecture for video-based group emotion recognition that fuses multimodal feature information from the visual, audio, optical-flow, and face modalities. To address the sparsity of emotional expressions in videos, we construct an improved keyframe extraction algorithm for the visual stream that selects keyframes carrying richer emotional features. A subnetwork incorporating spatial and channel attention is designed to automatically concentrate on the regions and channels carrying distinctive information in each keyframe, yielding a more accurate representation of the visual stream's emotional features. Extensive experiments on a video group-affect dataset show that the proposed model outperforms other video-based group emotion recognition methods.
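The two stream-level components named above lend themselves to short illustrations. First, a minimal sketch of keyframe selection by color-histogram difference, a standard baseline for this task; the paper's improved algorithm is not reproduced here, and the threshold value is an assumption chosen for illustration:

```python
# Sketch 1: keyframe selection by color-histogram difference (baseline,
# not the paper's improved algorithm).
import cv2

def extract_keyframes(video_path, threshold=0.4):
    """Keep frames whose Bhattacharyya histogram distance from the
    last kept frame exceeds `threshold` (illustrative value)."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8-bin color histogram, normalized for scale invariance.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(frame)
            prev_hist = hist
    cap.release()
    return keyframes
```

Second, a hypothetical CBAM-style block in PyTorch showing how channel and spatial attention can reweight a keyframe's CNN feature map; the module names, ordering, and reduction ratio are illustrative assumptions, not the authors' architecture:

```python
# Sketch 2: channel attention followed by spatial attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # A shared MLP scores each channel from its pooled statistics.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))      # max-pooled descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w                           # reweight channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool across channels, then convolve to a per-pixel mask that
        # highlights distinctive regions of the keyframe.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

# Usage: refine a batch of keyframe feature maps, e.g. (4, 512, 7, 7).
feats = torch.randn(4, 512, 7, 7)
print(SpatialChannelAttention(512)(feats).shape)  # torch.Size([4, 512, 7, 7])
```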



Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U2133218), the National Key Research and Development Program of China (No. 2018YFB0204304), and the Fundamental Research Funds for the Central Universities of China (No. FRF-MP-19-007 and No. FRF-TP-20-065A1Z).

Author information


Corresponding author

Correspondence to Baolin Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Qi, S., Liu, B. A multimodal fusion-based deep learning framework combined with keyframe extraction and spatial and channel attention for group emotion recognition from videos. Pattern Anal Applic 26, 1493–1503 (2023). https://doi.org/10.1007/s10044-023-01178-4


