A Multimodal Approach for Multiple-Relation Extraction in Videos

Liu, Zihe; Hou, Weiying; Zhang, Jiayi; Cao, Chenyu; Wu, Bin

doi:10.1007/s11042-021-11466-y

1193: Intelligent Processing of Multimedia Signals
Published: 15 September 2021

Volume 81, pages 4909–4934, (2022)

Multimedia Tools and Applications

Zihe Liu¹,
Weiying Hou¹,
Jiayi Zhang¹,
Chenyu Cao¹ &
…
Bin Wu¹

1190 Accesses
4 Citations
1 Altmetric

Abstract

Automatically interpreting social relations such as friendship and kinship from visual scenes has considerable application value in areas such as knowledge graph construction, analysis of personal behavior and emotion, and the entertainment ecosystem. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and categorize videos into predefined types only. Such methods cannot recognize multiple relations in multi-person videos, which is inconsistent with real application scenarios. Moreover, videos are inherently multimodal: subtitles also provide abundant cues for relationship recognition, yet researchers often ignore them. In this paper, we introduce and define a new task named “Multiple-Relation Extraction in Videos (MREV)”. To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework, which jointly models visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the distinctive cues of face, body, and dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module that performs temporal reasoning by adaptively assessing the quality of different frames. We then use a Multi-Conv Attention module to capture inter-modal correlations and map the features of the different modalities into a coordinated feature space. In this way, our VTF framework comprehensively exploits multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on both datasets demonstrate the effectiveness of the proposed framework.
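
The idea of weighting frames by an adaptively assessed quality can be illustrated with a generic attention-style pooling. The following is only a minimal sketch under stated assumptions, not the Temporal Feature Aggregation module itself, whose details the abstract does not give. The function names, the linear scorer `quality_weights`, and the toy feature vectors are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_frames(frame_features, quality_weights):
    """Pool per-frame feature vectors into one clip-level vector.

    frame_features: list of equal-length feature vectors, one per frame.
    quality_weights: hypothetical linear scorer standing in for a learned
        quality estimator (an assumption made for illustration only).
    Returns the pooled vector and the per-frame weights.
    """
    # Score each frame with a dot product, then normalize across frames.
    scores = [sum(w * x for w, x in zip(quality_weights, feat))
              for feat in frame_features]
    alphas = softmax(scores)
    # Weighted sum of frame features, one dimension at a time.
    dim = len(frame_features[0])
    pooled = [sum(a * feat[d] for a, feat in zip(alphas, frame_features))
              for d in range(dim)]
    return pooled, alphas

# Toy input: three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled_uniform, alphas_uniform = aggregate_frames(frames, [0.0, 0.0])
pooled_biased, alphas_biased = aggregate_frames(frames, [2.0, 0.0])
```

With an all-zero scorer every frame receives the same weight, so the result reduces to plain average pooling. A non-zero scorer shifts weight toward the frames it rates higher, which is the behavior that adaptive quality assessment is meant to provide.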

Acknowledgements

This study was supported by the National Natural Science Foundation of China (grant no. 61972047), the National Key Research and Development Program of China (2018YFC0831500), and the NSFC-General Technology Basic Research Joint Funds (grant no. U1936220).

Author information

Authors and Affiliations

Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, No.10, Xitucheng Road, Beijing, 100876, China
Zihe Liu, Weiying Hou, Jiayi Zhang, Chenyu Cao & Bin Wu

Corresponding author

Correspondence to Bin Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, Z., Hou, W., Zhang, J. et al. A Multimodal Approach for Multiple-Relation Extraction in Videos. Multimed Tools Appl 81, 4909–4934 (2022). https://doi.org/10.1007/s11042-021-11466-y

Received: 26 December 2020
Revised: 06 June 2021
Accepted: 19 August 2021
Published: 15 September 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11042-021-11466-y