
Empowering Relational Network by Self-attention Augmented Conditional Random Fields for Group Activity Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)

Abstract

This paper presents a novel relational network for group activity recognition. The core of our network augments conditional random fields (CRF), which are well suited to learning the inter-dependency of correlated observations, with newly devised temporal and spatial self-attention to learn the temporal evolution and spatial relational contexts of every actor in a video. This combination exploits the global receptive field of self-attention to construct a spatio-temporal graph topology that captures both the temporal dependencies and the non-local relationships among the actors. The network first models the pairwise energy of the CRF with the temporal self-attention together with the spatial self-attention, which considers multiple cliques at different scales of locality to account for the diversity of the actors’ relationships in group activities. To accommodate the distinct characteristics of each video, a new mean-field inference algorithm with dynamic halting is then developed. Finally, a bidirectional universal transformer encoder (UTE), which combines forward and backward temporal context information, aggregates the relational contexts and scene information for group activity recognition. Experiments show that the proposed approach surpasses state-of-the-art methods on the widely used Volleyball and Collective Activity datasets.
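To make the pipeline in the abstract concrete, the following is a minimal NumPy sketch, under illustrative assumptions, of its two central ingredients: scaled dot-product self-attention over actor features supplying the pairwise relation weights of a CRF, and approximate mean-field inference that halts early once the marginals stop changing (a simple stand-in for the paper's dynamic halting mechanism). All shapes, names, and the halting rule are assumptions for illustration, not the authors' exact formulation.

# Sketch: attention-derived pairwise relations + mean-field CRF inference
# with a simple dynamic halting test. Illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_relations(feats, w_q, w_k):
    """Return an N x N actor-affinity matrix via scaled dot-product attention."""
    q, k = feats @ w_q, feats @ w_k                      # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (N, N)
    return softmax(scores, axis=-1)                      # each row sums to 1

def mean_field_with_halting(unary, affinity, compat, max_iters=10, tol=1e-3):
    """Approximate mean-field CRF inference.

    unary    : (N, L) per-actor unary potentials (e.g. classifier logits)
    affinity : (N, N) attention-derived pairwise weights
    compat   : (L, L) label-compatibility matrix
    Halts early once an update changes the marginals by less than `tol`.
    """
    q = softmax(unary, axis=-1)                          # initial marginals
    for _ in range(max_iters):
        msg = affinity @ q                               # message passing over the relation graph
        new_q = softmax(unary - msg @ compat, axis=-1)   # fold in the pairwise term, re-normalise
        if np.abs(new_q - q).max() < tol:                # dynamic halting test
            q = new_q
            break
        q = new_q
    return q

# Toy usage: 12 actors, 64-d features, 9 individual-action labels (hypothetical sizes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 64))
w_q, w_k = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
unary = rng.normal(size=(12, 9))
compat = 1.0 - np.eye(9)                                 # Potts-style compatibility
marginals = mean_field_with_halting(unary, self_attention_relations(feats, w_q, w_k), compat)

In the paper's full model, the refined per-actor representations produced by such inference would then be aggregated, together with scene features, by the bidirectional universal transformer encoder before the group-activity classifier.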

Keywords

Bidirectional universal transformer encoder · Self-attention mechanism · Conditional random field · Graph cliques · Group activity

Notes

Acknowledgement

This work was supported by the Ministry of Science and Technology (MOST) and by ITRI, R.O.C., under contracts MOST 109-2221-E-011-131, MOST 109-2221-E-011-116, and ICL/ITRI B5-10903-HQ-07.

Supplementary material

Supplementary material 1: 500725_1_En_5_MOESM1_ESM.pdf (PDF, 1.4 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. National Taiwan University of Science and Technology, Taipei, Taiwan
