
Cascading spatio-temporal attention network for real-time action detection

  • Original Paper
  • Published in Machine Vision and Applications (2023)

Abstract

Accurately detecting human actions in video has many applications, such as video surveillance and somatosensory games. In this paper, we propose a spatial-aware attention module (SAM) and a temporal-aware attention module (TAM) for spatio-temporal action detection in videos. SAM first concatenates the feature maps of consecutive frames along the channel dimension and then applies a dilated convolutional layer followed by a sigmoid function to generate a spatial attention map. Because this attention map aggregates spatial information from consecutive frames, it helps the detector focus on salient spatial features and localize action instances more accurately across frames. TAM deploys several fully connected layers to generate a temporal attention map, which models the temporal association of each spatial feature; by capturing the temporal association of action instances, it improves the detector's ability to track actions. To evaluate the effectiveness of SAM and TAM, we build an efficient and strong anchor-free action detector, the cascading spatio-temporal attention network, equipped with a 2D backbone together with the SAM and TAM modules. Extensive experiments on two benchmarks, JHMDB and UCF101-24, demonstrate the favorable performance of SAM and TAM.
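The abstract's description of the two modules maps naturally onto a short PyTorch sketch. The following is a minimal illustration under stated assumptions only: the single-channel attention map, dilation rate, pooling step, hidden width, and the exact way each attention map reweights the features are placeholder choices, since the paper's precise configuration is not given here.

```python
# Minimal sketch of the SAM and TAM modules as described in the abstract.
# Layer sizes, the dilation rate, and how the attention maps are applied
# are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SpatialAwareAttention(nn.Module):
    """SAM: fuses consecutive-frame features into one spatial attention map."""

    def __init__(self, channels: int, num_frames: int, dilation: int = 2):
        super().__init__()
        # Dilated conv over the channel-concatenated frames, then sigmoid,
        # yields a spatial attention map (assumed single-channel here).
        self.conv = nn.Conv2d(
            channels * num_frames, 1, kernel_size=3,
            padding=dilation, dilation=dilation,
        )

    def forward(self, frame_feats: list) -> torch.Tensor:
        # frame_feats: list of (B, C, H, W) maps from consecutive frames.
        stacked = torch.cat(frame_feats, dim=1)        # (B, C*T, H, W)
        attn = torch.sigmoid(self.conv(stacked))       # (B, 1, H, W)
        # Reweight the key frame's features by the spatial attention map.
        return frame_feats[-1] * attn


class TemporalAwareAttention(nn.Module):
    """TAM: fully connected layers score the temporal weight of each frame."""

    def __init__(self, channels: int, num_frames: int, hidden: int = 256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels * num_frames, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_frames),
        )

    def forward(self, frame_feats: list) -> torch.Tensor:
        # Pool each (B, C, H, W) map to (B, C), concatenate over frames
        # (pooling is an assumed step to feed the fully connected layers).
        pooled = torch.cat(
            [f.mean(dim=(2, 3)) for f in frame_feats], dim=1
        )                                              # (B, C*T)
        weights = torch.sigmoid(self.fc(pooled))       # (B, T)
        # Reweight and aggregate frames by their temporal attention scores.
        frames = torch.stack(frame_feats, dim=1)       # (B, T, C, H, W)
        return (frames * weights[:, :, None, None, None]).sum(dim=1)
```

In the full detector the two modules are cascaded on top of a 2D backbone; in a sketch like this, the backbone's per-frame feature maps would feed SAM and TAM in sequence before the anchor-free detection head.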



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (62176072), National Key Research and Development Program of China (No. 2019YFB1310004), and Self-Planned Task No. SKLRS202111B of State Key Laboratory of Robotics and System (HIT).

Author information

Corresponding author

Correspondence to Ruifeng Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, J., Wang, K., Li, R. et al. Cascading spatio-temporal attention network for real-time action detection. Machine Vision and Applications 34, 110 (2023). https://doi.org/10.1007/s00138-023-01457-4

