STAM: a spatio-temporal adaptive module for improving static convolutions in action recognition

  • Original article
  • Published in The Visual Computer (2023)

Abstract

Temporal adaptive convolution has demonstrated superior performance over static convolution in video understanding. However, it still falls short in long-range temporal modeling and multi-scale feature-map adaptation. To address these challenges, we introduce spatio-temporal hybrid adaptive convolution (STHAC), designed to enhance the spatio-temporal modeling capability of convolution. STHAC learns a set of spatio-temporal calibration filters that mitigate the spatial invariance intrinsic to static convolution. Specifically, STHAC learns a linear combination of N adaptive filters through two parallel lightweight attention branches. The resulting linearly mixed filters incorporate spatial multi-scale prior knowledge and long-range temporal dependencies. These spatio-temporal calibration filters modulate each frame’s static convolutional weights, thereby endowing static convolution with spatial multi-scale adaptability and long-range temporal modeling capability. Compared with other dynamic convolution methods, the proposed calibration filters require fewer parameters and incur lower computational complexity. Moreover, we introduce an omni-dimensional aggregation module to further strengthen the spatio-temporal modeling capacity of STHAC. Combined with STHAC, this aggregation module forms the spatio-temporal adaptive module (STAM), which can replace static convolution. We implement a spatio-temporal dynamic network based on STAM to validate our approach. Experimental results show that our model is competitive with state-of-the-art convolutional neural network architectures on action recognition benchmarks such as Kinetics-400 (K400) and Something-Something V2 (SSV2).
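
To make the mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described above: two lightweight branches produce per-frame mixing weights over N calibration filters, and the mixed filter modulates a shared static convolution kernel. It is not the authors' implementation; the class and branch names, the depthwise formulation, the additive fusion of the two branch logits, and the sigmoid modulation are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class STHACSketch(nn.Module):
    """Illustrative (hypothetical) sketch of spatio-temporal hybrid adaptive convolution."""

    def __init__(self, channels, n_filters=4, kernel_size=3):
        super().__init__()
        self.channels, self.n_filters, self.k = channels, n_filters, kernel_size
        # Shared static depthwise kernel: (C, 1, k, k).
        self.static_weight = nn.Parameter(0.02 * torch.randn(channels, 1, kernel_size, kernel_size))
        # Bank of N learnable calibration filters with the same shape: (N, C, 1, k, k).
        self.calib_filters = nn.Parameter(0.02 * torch.randn(n_filters, channels, 1, kernel_size, kernel_size))
        # Temporal branch: a 1D convolution over T models longer-range temporal context.
        self.temporal_branch = nn.Conv1d(channels, n_filters, kernel_size=3, padding=1)
        # Spatial branch: maps a per-frame channel descriptor to mixing logits.
        self.spatial_branch = nn.Linear(channels, n_filters)

    def forward(self, x):                             # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))                     # per-frame descriptor: (B, C, T)
        logits = (self.temporal_branch(desc).permute(0, 2, 1)
                  + self.spatial_branch(desc.permute(0, 2, 1)))        # (B, T, N)
        alpha = torch.softmax(logits, dim=-1)         # per-frame mixture over the N filters
        # Linearly mix the calibration filters for every frame: (B*T, C, 1, k, k).
        mixed = alpha.reshape(b * t, -1) @ self.calib_filters.view(self.n_filters, -1)
        mixed = mixed.view(b * t, c, 1, self.k, self.k)
        # Calibrate (modulate) the shared static kernel with the mixed filter.
        weight = self.static_weight.unsqueeze(0) * torch.sigmoid(mixed)
        # Apply a different kernel to every frame via one grouped 2D convolution.
        x_flat = x.permute(0, 2, 1, 3, 4).reshape(1, b * t * c, h, w)
        weight = weight.reshape(b * t * c, 1, self.k, self.k)
        out = F.conv2d(x_flat, weight, padding=self.k // 2, groups=b * t * c)
        return out.view(b, t, c, h, w).permute(0, 2, 1, 3, 4)


# Usage on a toy clip: 2 videos, 64 channels, 8 frames, 32x32 resolution.
# y = STHACSketch(64)(torch.randn(2, 64, 8, 32, 32))   # -> (2, 64, 8, 32, 32)

The grouped convolution is only one convenient way to apply a distinct kernel to each frame in a batch; the key point is that only small mixing weights, not full kernels, are predicted per frame, which is what keeps the calibration lightweight.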

Data availability

We used three datasets for video classification in this work: Kinetics-400 [2] (https://www.deepmind.com/kinetics), Something-Something V2 [1] (https://www.twentybn.com/datasets/something-something), and HMDB51 [41] (https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#dataset).

References

  1. Goyal, R., Ebrahimi Kahou, S., Michalski, V., et al.: The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850 (2017)

  2. Kay, W., Carreira, J., Simonyan, K., et al.: The kinetics human action video dataset. (2017) arXiv preprint arXiv:1705.06950

  3. Tran, D., Bourdev, L., Fergus, R., et al.: Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)

  4. Yang, Z., An, G., Zhang, R.: STSM: spatio-temporal shift module for efficient action recognition. Mathematics 10(18), 3290 (2022)

  5. Lin, J., Gan, C., Han, S.: Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)

  6. Liu, Z., Wang, L., Wu, W., et al.: Tam: Temporal adaptive module for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13708–13718 (2021)

  7. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541 (2017)

  8. Tran, D., Wang, H., Torresani, L., et al.: A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459 (2018)

  9. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., et al.: Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)

  10. Wang, X., Girshick, R., Gupta, A., et al.: Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)

  11. Zhou, J., Jampani, V., Pi, Z., et al.: Decoupled dynamic filter networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6647–6656 (2021)

  12. Elsayed, G., Ramachandran, P., Shlens, J., et al.: Revisiting spatial invariance with low-rank local connectivity. In International Conference on Machine Learning. PMLR, pp. 2868–2879 (2020)

  13. Huang, Z., Zhang, S., Pan, L., et al.: TAda! temporally-adaptive convolutions for video understanding. (2021) arXiv preprint arXiv:2110.06178

  14. Li, D., Hu, J., Wang, C., et al.: Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12321–12330 (2021)

  15. Dai, J., Qi, H., Xiong, Y., et al.: Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773 (2017)

  16. Su, H., Jampani, V., Sun, D., et al.: Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11166–11175 (2019)

  17. Lin, X., Ma, L., Liu, W., et al.: Context-gated convolution. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pp. 701–718. Springer International Publishing (2020)

  18. Li, C., Zhou, A., Yao, A.: Omni-dimensional dynamic convolution (2022) arXiv preprint arXiv:2209.07947

  19. Chen, Y., Dai, X., Liu, M., et al.: Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039 (2020)

  20. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)

  21. Feichtenhofer, C., Fan, H., Malik, J., et al.: Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)

  22. Feichtenhofer, C.: X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213 (2020)

  23. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 27 (2014)

  24. Wang, L., Xiong, Y., Wang, Z., et al.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision, pp. 20–36. Springer, Cham (2016)

  25. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  26. Park, J., Woo, S., Lee, J.Y., et al.: Bam: Bottleneck attention module (2018) arXiv preprint arXiv:1807.06514

  27. Li, X., Wang, W., Hu, X., et al.: Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 510–519 (2019)

  28. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In Proceedings of Neural Information Processing Systems (2015)

  29. Yang, B., Bender, G., Le, Q.V., et al.: Condconv: conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 32 (2019)

  30. Ma, N., Zhang, X., Huang, J., et al.: Weightnet: revisiting the design space of weight networks. In: European Conference on Computer Vision, pp. 776–792. Springer International Publishing, Cham (2020)

  31. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. PMLR, pp. 448–456 (2015)

  32. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)

  33. Guo, M.H., Lu, C.Z., Liu, Z.N., et al.: Visual attention network. Comput. Vis. Media 9(4), 733–752 (2023)

  34. Tang, C., Zhao, Y., Wang, G., et al.: MLP for image recognition: is self-attention really necessary? In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, issue 2, pp. 2344–2351 (2022)

  35. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  36. Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus) (2016) arXiv preprint arXiv:1606.08415

  37. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization (2016) arXiv preprint arXiv:1607.06450

  38. Hao, Y., Zhang, H., Ngo, C.W., et al.: Group contextualization for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 928–938 (2022)

  39. Chen, J., Kao, S., He, H., et al.: Run, Don’t Walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12021–12031 (2023)

  40. Han, J., Moraga, C.: The influence of the sigmoid function parameters on the speed of backpropagation learning. In: International Workshop on Artificial Neural Networks, pp. 195–201. Springer, Berlin Heidelberg (1995)

  41. Kuehne, H., Jhuang, H., Garrote, E., et al.: HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision. IEEE, pp. 2556–2563 (2011)

  42. Deng, J., Dong, W., Socher, R., et al.: Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255 (2009)

  43. Hinton, G.E., Srivastava, N., Krizhevsky, A., et al.: Improving neural networks by preventing co-adaptation of feature detectors (2012) arXiv preprint arXiv:1207.0580

  44. Li, X., Wang, Y., Zhou, Z., et al.: Smallbignet: integrating core and contextual views for video classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1092–1101 (2020)

  45. Li, K., Li, X., Wang, Y., et al.: CT-net: channel tensorization network for video classification (2021) arXiv preprint arXiv:2106.01603

  46. Xie, Z., Chen, J., Wu, K., et al.: Global temporal difference network for action recognition. IEEE Trans. Multimed. (2022)

  47. Sudhakaran, S., Escalera, S., Lanz, O.: Gate-shift-fuse for video action recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2023)

  48. Geng, T., Zheng, F., Hou, X., et al.: Spatial-temporal pyramid graph reasoning for action recognition. IEEE Trans. Image Process. 31, 5484–5497 (2022)

  49. Selvaraju, R.R., Cogswell, M., Das, A., et al.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)

  50. He, K., Zhang, X., Ren, S., et al.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

  51. Wang, W., Shen, J.: Deep visual attention prediction. IEEE Trans. Image Process. 27(5), 2368–2378 (2017)

  52. Diba, A., Fayyaz, M., Sharma, V., et al.: Spatio-temporal channel correlation networks for action classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 284–299 (2018)

  53. Li, Y., Ji, B., Shi, X., et al.: Tea: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918 (2020)

  54. Hao, Y., Wang, S., Tan, Y., et al.: Spatio-temporal collaborative module for efficient action recognition. IEEE Trans. Image Process. 31, 7279–7291 (2022)

  55. Gong, W., Qian, Y., Fan, Y.: MPCSAN: multi-head parallel channel-spatial attention network for facial expression recognition in the wild. Neural Comput. Appl. 35(9), 6529–6543 (2023)

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62266043 and U1803261, in part by the National Science and Technology Major Project under Grant 95-Y50G34-9001-22/23, and in part by the Autonomous Region Science and Technology Department International Cooperation Project under Grant 2020E01023.

Author information

Contributions

WL contributed to the conception, design analysis, and writing of the article; WG and HT were responsible for writing and reviewing; and YQ provided equipment, guided the writing, and conducted content reviews.

Corresponding author

Correspondence to Yurong Qian.

Ethics declarations

Conflict of interest

The authors state that there are no competing interests relating to the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, W., Gong, W., Qian, Y. et al. STAM: a spatio-temporal adaptive module for improving static convolutions in action recognition. Vis Comput (2023). https://doi.org/10.1007/s00371-023-03165-6
