Hunting Group Clues with Transformers for Social Group Activity Recognition

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

This paper presents a novel framework for social group activity recognition. As an extension of group activity recognition, social group activity recognition requires recognizing multiple sub-group activities and identifying the members of each group. Most existing methods tackle both tasks by refining region features and then summarizing them into activity features. Such heuristic feature design renders the effectiveness of the features susceptible to incomplete person localization and disregards the importance of scene context. Furthermore, region features are sub-optimal for identifying group members because they may be dominated by the people inside the regions and may carry different semantics. To overcome these drawbacks, we propose to leverage the attention modules in transformers to generate effective social group features. Our method is designed so that the attention modules identify and then aggregate features relevant to social group activities, generating an effective feature for each social group. Group member information is embedded into these features and can thus be extracted by feed-forward networks. The outputs of the feed-forward networks represent groups so concisely that group members can be identified with simple Hungarian matching between groups and individuals. Experimental results show that our method outperforms state-of-the-art methods on the Volleyball and Collective Activity datasets.
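The abstract outlines a query-based pipeline: learnable group queries attend to scene features through transformer attention, feed-forward heads turn each query into a group prediction, and group members are recovered by Hungarian matching between groups and individuals. Below is a minimal, hypothetical sketch of that pipeline in PyTorch and SciPy; the module names, feature dimensions, matching cost, and one-to-one assignment are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment


class SocialGroupHead(nn.Module):
    """Toy decoder head: learnable group queries -> per-group activity and member embeddings."""

    def __init__(self, d_model=256, num_queries=12, num_activities=8):
        super().__init__()
        # One learnable query per candidate social group.
        self.group_queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        # Feed-forward heads: activity class (+1 "no group" slot) and a member-identification embedding.
        self.activity_head = nn.Linear(d_model, num_activities + 1)
        self.member_head = nn.Linear(d_model, d_model)

    def forward(self, scene_features):
        # scene_features: (B, HW, d_model) flattened backbone features.
        batch = scene_features.size(0)
        queries = self.group_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        group_feats = self.decoder(queries, scene_features)      # (B, Q, d_model)
        return self.activity_head(group_feats), self.member_head(group_feats)


def match_members(group_embed, person_embed):
    # group_embed: (Q, D) per-group member embeddings; person_embed: (N, D) per-person
    # features projected into the same space (an assumption of this sketch).
    cost = torch.cdist(person_embed, group_embed)                # (N, Q) pairwise distances
    person_idx, group_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(person_idx.tolist(), group_idx.tolist()))    # (person, group) pairs


# Example with random tensors:
# head = SocialGroupHead()
# activity_logits, member_embed = head(torch.randn(1, 196, 256))
# pairs = match_members(member_embed[0], torch.randn(10, 256))

Note that linear_sum_assignment yields a one-to-one assignment, so this toy matcher pairs at most one person with each group; the paper's actual per-group outputs and member-assignment scheme are richer than this sketch suggests.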


Acknowledgement

This work used the computational resources of the AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST).

Author information

Corresponding author

Correspondence to Masato Tamura.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2833 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tamura, M., Vishwakarma, R., Vennelakanti, R. (2022). Hunting Group Clues with Transformers for Social Group Activity Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_2

  • DOI: https://doi.org/10.1007/978-3-031-19772-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19771-0

  • Online ISBN: 978-3-031-19772-7

  • eBook Packages: Computer Science, Computer Science (R0)
