
A Multimodal Approach for Multiple-Relation Extraction in Videos

  • Part of the collection: 1193: Intelligent Processing of Multimedia Signals
  • Published in: Multimedia Tools and Applications

Abstract

Automatically interpreting social relations such as friendship and kinship from visual scenes has great potential value for applications such as knowledge graph construction, behavior and emotion analysis, and entertainment ecology. Considerable progress has been made in social analysis based on structured data, but existing video-based methods treat social relationship extraction as a general classification task that assigns each video to a single predefined type. Such methods cannot recognize multiple relations in multi-person videos, which is inconsistent with real application scenarios. Moreover, videos are inherently multimodal: subtitles provide abundant cues for relationship recognition that researchers often ignore. In this paper, we introduce and define a new task, Multiple-Relation Extraction in Videos (MREV). To solve it, we propose the Visual-Textual Fusion (VTF) framework, which jointly models visual and textual information. For the spatial representation, we adopt a SlowFast network to learn global action and scene information, and we additionally exploit face, body and dialogue cues between characters. For the temporal domain, we propose a Temporal Feature Aggregation module that performs temporal reasoning by adaptively assessing the quality of different frames. A Multi-Conv Attention module then captures inter-modal correlations and maps the features of the different modalities into a coordinated feature space. In this way, the VTF framework comprehensively exploits abundant multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and on the ViSR dataset, respectively. Extensive experiments on both datasets demonstrate the effectiveness of the proposed framework.
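The two modules named above lend themselves to a compact illustration. The following is a minimal PyTorch sketch, assuming that "adaptively assessing the quality of different frames" amounts to a learned soft weighting over per-frame features, and that the Multi-Conv Attention mixes projected visual and textual tokens with parallel 1-D convolutions of different kernel sizes before attention pooling. All module internals, dimensions (2304 for concatenated SlowFast features, 768 for a subtitle sentence embedding) and the class count are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn


    class TemporalFeatureAggregation(nn.Module):
        """Pools T frame features into one clip feature, weighting each
        frame by a learned quality score instead of uniform averaging."""

        def __init__(self, dim: int):
            super().__init__()
            self.quality = nn.Linear(dim, 1)  # per-frame quality logit

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, T, dim)
            w = torch.softmax(self.quality(frames), dim=1)  # (batch, T, 1)
            return (w * frames).sum(dim=1)                  # (batch, dim)


    class MultiConvAttention(nn.Module):
        """Projects both modalities into a shared space, mixes them with
        parallel 1-D convolutions of different kernel sizes, and pools the
        two modality tokens with learned attention weights."""

        def __init__(self, vis_dim: int, txt_dim: int, dim: int):
            super().__init__()
            self.vis_proj = nn.Linear(vis_dim, dim)
            self.txt_proj = nn.Linear(txt_dim, dim)
            self.convs = nn.ModuleList(
                nn.Conv1d(dim, dim, k, padding=k // 2) for k in (1, 3)
            )
            self.score = nn.Linear(dim, 1)

        def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
            # one "token" per modality, laid out for Conv1d: (batch, dim, 2)
            tokens = torch.stack([self.vis_proj(vis), self.txt_proj(txt)], dim=2)
            mixed = sum(conv(tokens) for conv in self.convs)  # (batch, dim, 2)
            mixed = mixed.transpose(1, 2)                     # (batch, 2, dim)
            attn = torch.softmax(self.score(mixed), dim=1)    # (batch, 2, 1)
            return (attn * mixed).sum(dim=1)                  # (batch, dim)


    # Hypothetical usage: 16 SlowFast frame features per clip, one subtitle
    # embedding per clip, and 8 relation classes (all assumed numbers).
    tfa = TemporalFeatureAggregation(dim=2304)
    fuse = MultiConvAttention(vis_dim=2304, txt_dim=768, dim=512)
    head = nn.Linear(512, 8)

    frames = torch.randn(4, 16, 2304)  # batch of 4 clips
    subtitles = torch.randn(4, 768)
    logits = head(fuse(tfa(frames), subtitles))
    print(logits.shape)  # torch.Size([4, 8])

One property of this weighting scheme: when all quality logits are equal, the softmax reduces to uniform averaging, so plain mean pooling remains a recoverable special case of the adaptive aggregation.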




Acknowledgements

This study was supported by the National Natural Science Foundation of China (grant no. 61972047), the National Key Research and Development Program of China (2018YFC0831500), and the NSFC-General Technology Basic Research Joint Funds (grant no. U1936220).

Author information


Corresponding author

Correspondence to Bin Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liu, Z., Hou, W., Zhang, J. et al. A Multimodal Approach for Multiple-Relation Extraction in Videos. Multimed Tools Appl 81, 4909–4934 (2022). https://doi.org/10.1007/s11042-021-11466-y

