Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13663)

Abstract

Transformer-based methods have recently achieved great advances on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers to video data brings heavy computation and memory burdens, due to the greatly increased number of patches and the quadratic complexity of self-attention. How to model the 3D self-attention of video data efficiently and effectively has been a great challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts a portion of the patches along the temporal dimension following a specific mosaic pattern, thus converting a vanilla spatial self-attention operation into a spatiotemporal one at little additional cost. As a result, we can compute 3D self-attention at nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module and can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves performance competitive with state-of-the-art methods on Something-something V1 & V2, Diving-48, and Kinetics400 while being much more efficient in computation and memory cost. The source code of TPS can be found at https://github.com/MartinXM/TPS.
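
The shifting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). The 3x3 offset mosaic, the cyclic wrap-around at the first and last frames, and the names `MOSAIC`, `make_video`, `temporal_patch_shift`, and `attention_pairs` are all assumptions made for illustration; the abstract does not specify the pattern. Patches are represented as `(frame, row, column)` labels so that it is easy to see which frames end up mixed into a single frame before ordinary per-frame (spatial) attention would be applied.

```python
from typing import List, Sequence, Tuple

Token = Tuple[int, int, int]      # (source frame, row, column) label of a patch
Video = List[List[List[Token]]]   # indexed as video[t][row][col]

# Hypothetical 3x3 mosaic of temporal offsets, tiled over the patch grid.
# Assumption for illustration only: the abstract does not give the pattern.
MOSAIC: Sequence[Sequence[int]] = (
    (-1, 0, 1),
    (0, 0, 0),
    (1, 0, -1),
)


def make_video(frames: int, rows: int, cols: int) -> Video:
    """Build a toy video whose patches are labelled with their origin."""
    return [[[(t, r, c) for c in range(cols)] for r in range(rows)]
            for t in range(frames)]


def temporal_patch_shift(video: Video,
                         mosaic: Sequence[Sequence[int]] = MOSAIC) -> Video:
    """Replace each patch by the patch at the same spatial position in a
    neighbouring frame, chosen by the tiled offset mosaic.
    Frames wrap around cyclically (an assumption of this sketch)."""
    frames = len(video)
    ph, pw = len(mosaic), len(mosaic[0])
    return [
        [
            [video[(t + mosaic[r % ph][c % pw]) % frames][r][c]
             for c in range(len(row))]
            for r, row in enumerate(frame)
        ]
        for t, frame in enumerate(video)
    ]


def attention_pairs(frames: int, tokens_per_frame: int, joint: bool) -> int:
    """Count query-key pairs: joint 3D attention over all tokens versus
    independent 2D attention inside each frame."""
    if joint:
        return (frames * tokens_per_frame) ** 2
    return frames * tokens_per_frame ** 2


video = make_video(frames=4, rows=3, cols=3)
shifted = temporal_patch_shift(video)
# Source frames now visible to spatial attention inside frame 1.
frames_seen = sorted({tok[0] for row in shifted[1] for tok in row})
```

After the shift, frame 1 of this toy video holds patches that originated in frames 0, 1, and 2, so attention restricted to that frame already mixes information across time, while the number of tokens per frame is unchanged. `attention_pairs` makes the cost argument concrete: for T frames, joint attention considers T times as many query-key pairs as per-frame attention.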

W. Xiang—Work done during an internship at Alibaba.

Author information

Corresponding author

Correspondence to Lei Zhang.

1 Electronic supplementary material

Supplementary material 1 (pdf 605 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xiang, W., Li, C., Wang, B., Wei, X., Hua, XS., Zhang, L. (2022). Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13663. Springer, Cham. https://doi.org/10.1007/978-3-031-20062-5_36

  • DOI: https://doi.org/10.1007/978-3-031-20062-5_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20061-8

  • Online ISBN: 978-3-031-20062-5

  • eBook Packages: Computer Science, Computer Science (R0)
