
Cascading spatio-temporal attention network for real-time action detection

  • Original Paper
  • Published in Machine Vision and Applications (2023)

Abstract

Accurately detecting human actions in video has many applications, such as video surveillance and somatosensory games. In this paper, we propose a spatial-aware attention module (SAM) and a temporal-aware attention module (TAM) for spatio-temporal action detection in videos. SAM first concatenates the feature maps of consecutive frames along the channel dimension and then applies a dilated convolutional layer followed by a sigmoid function to generate a spatial attention map. Because this attention map aggregates spatial information from consecutive frames, it helps the detector focus on salient spatial features and localize action instances more accurately across frames. TAM deploys several fully connected layers to generate a temporal attention map, which models the temporal association of each spatial feature; by capturing the temporal association of action instances, it improves the detector's ability to track actions. To evaluate the effectiveness of SAM and TAM, we build an efficient and strong anchor-free action detector, the cascading spatio-temporal attention network, equipped with a 2D backbone together with the SAM and TAM modules. Extensive experiments on two benchmarks, JHMDB and UCF101-24, demonstrate the favorable performance of SAM and TAM.
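The abstract's description of the two modules maps naturally onto a short PyTorch sketch. The following is a minimal illustration under stated assumptions only: the single-channel attention map, dilation rate, pooling step, hidden width, and the exact way each attention map reweights the features are placeholder choices, since the paper's precise configuration is not given here.

```python
# Minimal sketch of the SAM and TAM modules as described in the abstract.
# Layer sizes, the dilation rate, and how the attention maps are applied
# are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SpatialAwareAttention(nn.Module):
    """SAM: fuses consecutive-frame features into one spatial attention map."""

    def __init__(self, channels: int, num_frames: int, dilation: int = 2):
        super().__init__()
        # Dilated conv over the channel-concatenated frames, then sigmoid,
        # yields a spatial attention map (assumed single-channel here).
        self.conv = nn.Conv2d(
            channels * num_frames, 1, kernel_size=3,
            padding=dilation, dilation=dilation,
        )

    def forward(self, frame_feats: list) -> torch.Tensor:
        # frame_feats: list of (B, C, H, W) maps from consecutive frames.
        stacked = torch.cat(frame_feats, dim=1)        # (B, C*T, H, W)
        attn = torch.sigmoid(self.conv(stacked))       # (B, 1, H, W)
        # Reweight the key frame's features by the spatial attention map.
        return frame_feats[-1] * attn


class TemporalAwareAttention(nn.Module):
    """TAM: fully connected layers score the temporal weight of each frame."""

    def __init__(self, channels: int, num_frames: int, hidden: int = 256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels * num_frames, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_frames),
        )

    def forward(self, frame_feats: list) -> torch.Tensor:
        # Pool each (B, C, H, W) map to (B, C), concatenate over frames
        # (pooling is an assumed step to feed the fully connected layers).
        pooled = torch.cat(
            [f.mean(dim=(2, 3)) for f in frame_feats], dim=1
        )                                              # (B, C*T)
        weights = torch.sigmoid(self.fc(pooled))       # (B, T)
        # Reweight and aggregate frames by their temporal attention scores.
        frames = torch.stack(frame_feats, dim=1)       # (B, T, C, H, W)
        return (frames * weights[:, :, None, None, None]).sum(dim=1)
```

In the full detector the two modules are cascaded on top of a 2D backbone; in a sketch like this, the backbone's per-frame feature maps would feed SAM and TAM in sequence before the anchor-free detection head.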



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (62176072), National Key Research and Development Program of China (No. 2019YFB1310004), and Self-Planned Task No. SKLRS202111B of State Key Laboratory of Robotics and System (HIT).

Author information

Corresponding author

Correspondence to Ruifeng Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, J., Wang, K., Li, R. et al. Cascading spatio-temporal attention network for real-time action detection. Machine Vision and Applications 34, 110 (2023). https://doi.org/10.1007/s00138-023-01457-4

