Spatio-Temporal Attention Model Based on Multi-view for Social Relation Understanding

  • Jinna Lv
  • Bin Wu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


Social relation understanding is an increasingly popular research area. Great progress has been achieved in exploiting sentiment or social relations from image data; however, attaining satisfactory performance in social relation analysis from video data remains difficult. In this paper, we propose a novel Spatio-Temporal attention model based on Multi-View (STMV) for understanding social relations from video. First, to obtain a rich representation of social relation traits, we introduce different ConvNets to extract multi-view features, including RGB, optical flow, and face. Second, we exploit the temporal dynamics of these multi-view features through time using Long Short-Term Memory (LSTM). Specifically, we propose multiple attention units in our attention module. In this manner, we generate a feature representation that focuses on multiple aspects of social relation traits in video, so that an effective mapping from low-level video pixels to the high-level social relation space can be built. Third, we introduce a tensor fusion layer, which learns interactions among the multi-view features. Extensive experiments show that our STMV model achieves state-of-the-art performance on the SRIV video dataset for social relation classification.
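To make the two core ideas of the abstract concrete, the sketch below illustrates temporal attention over per-view feature sequences followed by a tensor fusion of the three views (RGB, optical flow, face). This is a minimal NumPy illustration under assumed toy dimensions; the scoring function, feature sizes, and single-vector attention are illustrative simplifications, not the paper's exact STMV formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(seq, w):
    # seq: (T, d) per-view feature sequence (e.g. LSTM outputs);
    # w: (d,) scoring vector for one attention unit.
    scores = softmax(seq @ w)   # (T,) attention weights over time steps
    return scores @ seq         # (d,) attended temporal summary

def tensor_fusion(views):
    # Append a constant 1 to each view so the fused outer-product tensor
    # retains unimodal and bimodal terms alongside the trimodal interaction.
    vs = [np.concatenate([[1.0], v]) for v in views]
    fused = vs[0]
    for v in vs[1:]:
        fused = np.outer(fused, v).ravel()
    return fused

rng = np.random.default_rng(0)
T, d = 8, 4  # toy sizes: 8 time steps, 4-dim features per view
views = [temporal_attention(rng.standard_normal((T, d)),
                            rng.standard_normal(d))
         for _ in range(3)]     # stand-ins for RGB, flow, face summaries
fused = tensor_fusion(views)
print(fused.shape)              # → (125,) i.e. 5 * 5 * 5
```

With the appended constant, the fused vector's size grows multiplicatively with the number of views ((d+1)^3 here), which is why tensor fusion is typically followed by a learned projection down to the class space.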


Keywords: Social relation understanding · Video analysis · Deep learning · Attention mechanism



This research is supported by the National Key R&D Program of China (No. 2018YFC0831500), the National Social Science Foundation of China (No. 16ZDA055), and the Special Fund for the Beijing Common Construction Project.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, China
