ActionFormer: Localizing Moments of Actions with Transformers

  • Conference paper
  • Published in: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model that identifies actions in time and recognizes their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements upon prior works. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU = 0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at https://github.com/happyharrycn/actionformer_release.

C.-L. Zhang: This work was done while visiting UW Madison.
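
For readers who want a concrete picture of the design summarized in the abstract, the following PyTorch sketch illustrates its three named ingredients: local self-attention over the temporal sequence, a multiscale (pyramid) feature representation built by repeated downsampling, and a lightweight decoder that classifies every moment and regresses its distances to the action start and end. This is a minimal illustration, not the released implementation (see the repository linked above); the feature dimensionality, window size, max-pooling downsampling, and single-layer heads here are assumptions made for brevity.

```python
# Minimal sketch of the architecture described in the abstract (NOT the authors'
# released code). Dimensions, window size, and head design are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttention(nn.Module):
    """Multi-head self-attention restricted to a fixed local temporal window."""

    def __init__(self, dim, num_heads=4, window_size=9):
        super().__init__()
        self.window_size = window_size
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, C)
        T = x.shape[1]
        idx = torch.arange(T, device=x.device)
        # Boolean mask: True = "may not attend"; moments farther apart than
        # half the window cannot attend to each other.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window_size // 2
        h = self.norm(x)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        return x + out  # residual connection


class ActionFormerSketch(nn.Module):
    """Pyramid of local-attention blocks with shared, lightweight conv heads."""

    def __init__(self, in_dim=2048, dim=256, num_classes=20, num_levels=5):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, dim, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(
            [LocalSelfAttention(dim) for _ in range(num_levels)]
        )
        # Decoder heads shared across all pyramid levels.
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # dist. to start/end

    def forward(self, feats):  # feats: (B, in_dim, T) pre-extracted clip features
        x = self.proj(feats).transpose(1, 2)  # (B, T, dim)
        cls_logits, reg_offsets = [], []
        for i, block in enumerate(self.blocks):
            x = block(x)
            h = x.transpose(1, 2)  # (B, dim, T_level)
            cls_logits.append(self.cls_head(h))           # per-moment class scores
            reg_offsets.append(F.relu(self.reg_head(h)))  # non-negative boundary distances
            if i + 1 < len(self.blocks):
                # 2x temporal downsampling to form the next pyramid level.
                x = F.max_pool1d(h, kernel_size=2, stride=2).transpose(1, 2)
        return cls_logits, reg_offsets


if __name__ == "__main__":
    model = ActionFormerSketch()
    feats = torch.randn(2, 2048, 256)  # two videos, 256 time steps of clip features
    cls_out, reg_out = model(feats)
    print([tuple(c.shape) for c in cls_out])  # (2, 20, 256), (2, 20, 128), ...
```

At inference, every time step at every pyramid level yields a class score and a pair of boundary offsets, which are decoded into candidate segments and merged with non-maximum suppression; that post-processing is omitted here for brevity.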


Notes

  1. Without loss of clarity, we drop the index of the pyramid \(\ell\).


Author information

Correspondence to Chen-Lin Zhang or Yin Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 752 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, C.-L., Wu, J., Li, Y. (2022). ActionFormer: Localizing Moments of Actions with Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_29


  • DOI: https://doi.org/10.1007/978-3-031-19772-7_29

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19771-0

  • Online ISBN: 978-3-031-19772-7

  • eBook Packages: Computer Science; Computer Science (R0)
