Efficient Video Transformers with Spatial-Temporal Token Selection

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Video transformers have achieved impressive results on major video recognition benchmarks, but they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a few informative tokens in both the temporal and spatial dimensions, conditioned on the input video. Specifically, we formulate token selection as a ranking problem: a lightweight scorer network estimates the importance of each token, and only the tokens with top scores are used for downstream computation. In the temporal dimension, we keep the frames that are most relevant to the action categories, while in the spatial dimension we identify the most discriminative region in the feature maps without disrupting the spatial context that most video transformers exploit hierarchically. Since the token selection decision is non-differentiable, we employ a perturbed-maximum based differentiable Top-K operator for end-to-end training. We conduct extensive experiments mainly on Kinetics-400 with MViT, a recently introduced video transformer backbone; our framework achieves results comparable to the original backbone while requiring 20% less computation. We also demonstrate that our approach generalizes across different transformer architectures and video datasets. Code is available at https://github.com/wangjk666/STTS.
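
To make the ranking step concrete, here is a minimal sketch, in PyTorch, of how a lightweight token scorer and a perturbed-maximum differentiable Top-K mask could be combined. This is not the authors' released implementation (see the linked repository for that): the names TokenScorer and PerturbedTopK, the Monte Carlo sample count, and the noise scale sigma are illustrative assumptions, and the operator is a simplified form of the perturbed Top-K described in the abstract.

```python
# Hedged sketch (not the official STTS code): a lightweight scorer that
# rates every token, and a perturbed-maximum differentiable Top-K operator
# so the scorer can be trained end-to-end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbedTopK(torch.autograd.Function):
    """Soft top-k indicator obtained by averaging hard top-k masks over
    Gaussian perturbations of the scores."""

    @staticmethod
    def forward(ctx, scores, k, num_samples, sigma):
        b, n = scores.shape
        noise = torch.randn(num_samples, b, n, device=scores.device)
        perturbed = scores.unsqueeze(0) + sigma * noise        # (S, B, N)
        topk = perturbed.topk(k, dim=-1).indices               # (S, B, k)
        indicators = F.one_hot(topk, n).float().sum(-2)        # (S, B, N)
        ctx.save_for_backward(noise, indicators)
        ctx.sigma, ctx.num_samples = sigma, num_samples
        return indicators.mean(0)                              # expected mask, (B, N)

    @staticmethod
    def backward(ctx, grad_output):
        noise, indicators = ctx.saved_tensors
        # Perturbed-optimizer Jacobian estimator:
        #   d mask_i / d score_j ~ E_z[ indicator_i(s + sigma*z) * z_j ] / sigma
        grad_scores = torch.einsum('bi,sbi,sbj->bj', grad_output, indicators, noise)
        grad_scores = grad_scores / (ctx.num_samples * ctx.sigma)
        return grad_scores, None, None, None

class TokenScorer(nn.Module):
    """Lightweight MLP that predicts one importance score per token."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                 nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens):                  # tokens: (B, N, C)
        return self.net(tokens).squeeze(-1)     # scores: (B, N)

# Usage: during training the soft mask re-weights tokens so gradients reach
# the scorer; at inference a hard top-k gather would realize the savings.
tokens = torch.randn(2, 196, 96)                           # (batch, tokens, channels)
scorer = TokenScorer(96)
mask = PerturbedTopK.apply(scorer(tokens), 49, 100, 0.05)  # (2, 196), values in [0, 1]
weighted = tokens * mask.unsqueeze(-1)                     # differentiable soft selection
```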

J. Wang and X. Yang—Equal contributions.


Notes

  1. Here the notion of “frame” can be either a single frame or multiple frames within a clip in the original video, depending on whether clip-based video models with 3D convolutions are used.

  2. Note that Attention-K cannot be applied if class tokens are not used (e.g., VideoSwin) or self-attention has not yet been computed (e.g., in the 0th block of the model).
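
As context for this footnote, the following is a small, hedged sketch of what an attention-based selection baseline of this kind could look like in PyTorch: tokens are ranked by how much attention the class token pays to them, which is exactly the quantity that is unavailable without class tokens or before self-attention has been computed. The function name attention_topk and the assumed attention-map layout (class token at index 0) are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a class-token-attention selection baseline (illustrative;
# not code from the paper). Assumes tokens of shape (B, 1 + N, C) and an
# attention map of shape (B, heads, 1 + N, 1 + N) with the class token first.
import torch

def attention_topk(tokens, attn, k):
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)               # (B, N): class-to-patch attention
    keep = cls_attn.topk(k, dim=-1).indices                # (B, k) indices of kept patches
    patches = tokens[:, 1:]                                # (B, N, C) patch tokens only
    idx = keep.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    selected = patches.gather(1, idx)                      # (B, k, C)
    return torch.cat([tokens[:, :1], selected], dim=1)     # re-attach the class token
```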


Acknowledgement

Y.-G. Jiang was sponsored in part by the “Shuguang Program” of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission (No. 20SG01). Z. Wu was supported by NSFC under Grant No. 62102092.

Author information

Corresponding author: Zuxuan Wu.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, J., Yang, X., Li, H., Liu, L., Wu, Z., Jiang, YG. (2022). Efficient Video Transformers with Spatial-Temporal Token Selection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19833-5_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19832-8

  • Online ISBN: 978-3-031-19833-5

  • eBook Packages: Computer Science (R0)
