Abstract
Automatically recognizing social relationships in videos gives intelligent systems the potential to better understand human behaviors and emotions. Most existing methods either infer social characters by detecting their interactions or predict each social relationship independently; they cannot learn all social relationships and characters jointly. In this paper, we propose a character and relationship joint learning (CRJL) framework to simultaneously infer all social relationships and character pairs involved in a video. First, the video context and the logical associations among relationships provide important cues for social scene understanding. To incorporate these cues into reasoning over social relationships and characters, we design a novel character and relationship reasoning graph (CRRG). Specifically, we model a relationship message-passing process on the graph to learn the logical constraints among relationships, and we introduce a graph attention mechanism to capture discriminative video semantic information. Second, localizing a social character pair via supervised learning is time-consuming, as it requires annotated video tracks. Instead, we propose a weak-label training strategy that uses only clip-level relationship labels. Experimental results on a public benchmark demonstrate the superiority of our method.
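The two graph operations the abstract names, message passing over relationship nodes and attention-weighted aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the NumPy-only formulation, and all weight shapes below are illustrative assumptions about one propagation round on a small reasoning graph.

```python
import numpy as np

def attention_message_passing(node_feats, adj, w_msg, w_att):
    """One propagation step: each node aggregates its neighbors' transformed
    messages, weighted by an attention score over concatenated feature pairs.

    node_feats: (N, D) relationship-node features
    adj:        (N, N) binary adjacency of the reasoning graph
    w_msg:      (D, D) message transform
    w_att:      (2*D,) attention vector
    """
    n = node_feats.shape[0]
    msgs = node_feats @ w_msg                      # transformed messages
    # Raw attention logit for every (i, j) pair: a^T [h_i ; h_j]
    logits = np.array([[w_att @ np.concatenate([node_feats[i], node_feats[j]])
                        for j in range(n)] for i in range(n)])
    logits = np.where(adj > 0, logits, -1e9)       # mask non-edges
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)     # row-wise softmax
    return np.tanh(att @ msgs)                     # aggregated node update

# Toy example: 3 relationship nodes, feature dimension 4
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
out = attention_message_passing(feats, adj,
                                rng.normal(size=(4, 4)), rng.normal(size=(8,)))
print(out.shape)
```

In practice such a step would be stacked and trained end to end; the sketch only shows how masked softmax attention confines information flow to graph edges.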
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant no. 61972047) and the NSFC-General Technology Basic Research Joint Funds (grant no. U1936220).
Cite this article
Teng, Y., Song, C. & Wu, B. Toward jointly understanding social relationships and characters from videos. Appl Intell 52, 5633–5645 (2022). https://doi.org/10.1007/s10489-021-02738-z