STAM: a spatio-temporal adaptive module for improving static convolutions in action recognition

  • Original article
  • Published in The Visual Computer (2023)

Abstract

Temporal adaptive convolution has demonstrated superior performance over static convolution in video understanding. However, it still falls short in long-range temporal modeling and multi-scale feature-map adaptation. To address these challenges, we introduce spatio-temporal hybrid adaptive convolution (STHAC), designed to enhance the spatio-temporal modeling capability of convolution. STHAC learns a set of spatio-temporal calibration filters that mitigate the spatial invariance intrinsic to static convolution. Specifically, STHAC learns a linear combination of N adaptive filters through two parallel lightweight attention branches. The resulting linearly mixed filters incorporate spatial multi-scale prior knowledge and long-range temporal dependencies. These spatio-temporal calibration filters modulate each frame’s static convolutional weights, thereby endowing static convolution with spatial multi-scale adaptability and long-range temporal modeling capability. Compared with other dynamic convolution methods, the proposed calibration filters require fewer parameters and incur lower computational complexity. Moreover, we introduce an omni-dimensional aggregation module to further strengthen the spatio-temporal modeling capacity of STHAC. Combined with STHAC, this aggregation module forms the spatio-temporal adaptive module (STAM), which can replace static convolution. We implement a spatio-temporal dynamic network based on STAM to validate our approach. Experimental results show that our model is competitive with state-of-the-art convolutional neural network architectures on action recognition benchmarks such as Kinetics-400 (K400) and Something-Something V2 (SSV2).
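
To make the mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described above: two lightweight branches produce per-frame mixing weights over N calibration filters, and the mixed filter modulates a shared static convolution kernel. It is not the authors' implementation; the class and branch names, the depthwise formulation, the additive fusion of the two branch logits, and the sigmoid modulation are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class STHACSketch(nn.Module):
    """Illustrative (hypothetical) sketch of spatio-temporal hybrid adaptive convolution."""

    def __init__(self, channels, n_filters=4, kernel_size=3):
        super().__init__()
        self.channels, self.n_filters, self.k = channels, n_filters, kernel_size
        # Shared static depthwise kernel: (C, 1, k, k).
        self.static_weight = nn.Parameter(0.02 * torch.randn(channels, 1, kernel_size, kernel_size))
        # Bank of N learnable calibration filters with the same shape: (N, C, 1, k, k).
        self.calib_filters = nn.Parameter(0.02 * torch.randn(n_filters, channels, 1, kernel_size, kernel_size))
        # Temporal branch: a 1D convolution over T models longer-range temporal context.
        self.temporal_branch = nn.Conv1d(channels, n_filters, kernel_size=3, padding=1)
        # Spatial branch: maps a per-frame channel descriptor to mixing logits.
        self.spatial_branch = nn.Linear(channels, n_filters)

    def forward(self, x):                             # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))                     # per-frame descriptor: (B, C, T)
        logits = (self.temporal_branch(desc).permute(0, 2, 1)
                  + self.spatial_branch(desc.permute(0, 2, 1)))        # (B, T, N)
        alpha = torch.softmax(logits, dim=-1)         # per-frame mixture over the N filters
        # Linearly mix the calibration filters for every frame: (B*T, C, 1, k, k).
        mixed = alpha.reshape(b * t, -1) @ self.calib_filters.view(self.n_filters, -1)
        mixed = mixed.view(b * t, c, 1, self.k, self.k)
        # Calibrate (modulate) the shared static kernel with the mixed filter.
        weight = self.static_weight.unsqueeze(0) * torch.sigmoid(mixed)
        # Apply a different kernel to every frame via one grouped 2D convolution.
        x_flat = x.permute(0, 2, 1, 3, 4).reshape(1, b * t * c, h, w)
        weight = weight.reshape(b * t * c, 1, self.k, self.k)
        out = F.conv2d(x_flat, weight, padding=self.k // 2, groups=b * t * c)
        return out.view(b, t, c, h, w).permute(0, 2, 1, 3, 4)


# Usage on a toy clip: 2 videos, 64 channels, 8 frames, 32x32 resolution.
# y = STHACSketch(64)(torch.randn(2, 64, 8, 32, 32))   # -> (2, 64, 8, 32, 32)

The grouped convolution is only one convenient way to apply a distinct kernel to each frame in a batch; the key point is that only small mixing weights, not full kernels, are predicted per frame, which is what keeps the calibration lightweight.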

Data availability

We used three datasets for video classification in this work: Kinetics-400 [2] (https://www.deepmind.com/kinetics), Something-Something V2 [1] (https://www.twentybn.com/datasets/something-something), and HMDB51 [41] (https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#dataset).

References

  1. Goyal, R., Ebrahimi Kahou, S., Michalski, V., et al.: The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850 (2017)

  2. Kay, W., Carreira, J., Simonyan, K., et al.: The kinetics human action video dataset. (2017) arXiv preprint arXiv:1705.06950

  3. Tran, D., Bourdev, L., Fergus, R., et al.: Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)

  4. Yang, Z., An, G., Zhang, R.: STSM: spatio-temporal shift module for efficient action recognition. Mathematics 10(18), 3290 (2022)

  5. Lin, J., Gan, C., Han, S.: Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)

  6. Liu, Z., Wang, L., Wu, W., et al.: Tam: Temporal adaptive module for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13708–13718 (2021)

  7. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541 (2017)

  8. Tran, D., Wang, H., Torresani, L., et al.: A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459 (2018)

  9. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., et al.: Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)

  10. Wang, X., Girshick, R., Gupta, A., et al.: Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)

  11. Zhou, J., Jampani, V., Pi, Z., et al.: Decoupled dynamic filter networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6647–6656 (2021)

  12. Elsayed, G., Ramachandran, P., Shlens, J., et al.: Revisiting spatial invariance with low-rank local connectivity. In International Conference on Machine Learning. PMLR, pp. 2868–2879 (2020)

  13. Huang, Z., Zhang, S., Pan, L., et al.: TAda! temporally-adaptive convolutions for video understanding. (2021) arXiv preprint arXiv:2110.06178

  14. Li, D., Hu, J., Wang, C., et al.: Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12321–12330 (2021)

  15. Dai, J., Qi, H., Xiong, Y., et al.: Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773 (2017)

  16. Su, H., Jampani, V., Sun, D., et al.: Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11166–11175 (2019)

  17. Lin, X., Ma, L., Liu, W., et al.: Context-gated convolution. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pp. 701–718. Springer International Publishing (2020)

  18. Li, C., Zhou, A., Yao, A.: Omni-dimensional dynamic convolution (2022) arXiv preprint arXiv:2209.07947

  19. Chen, Y., Dai, X., Liu, M., et al.: Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039 (2020)

  20. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)

  21. Feichtenhofer, C., Fan, H., Malik, J., et al.: Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)

  22. Feichtenhofer, C.: X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213 (2020)

  23. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 27 (2014)

  24. Wang, L., Xiong, Y., Wang, Z., et al.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision, pp. 20–36. Springer, Cham (2016)

  25. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  26. Park, J., Woo, S., Lee, J.Y., et al.: Bam: Bottleneck attention module (2018) arXiv preprint arXiv:1807.06514

  27. Li, X., Wang, W., Hu, X., et al.: Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 510–519 (2019)

  28. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In Proceedings of Neural Information Processing Systems (2015)

  29. Yang, B., Bender, G., Le, Q.V., et al.: Condconv: conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 32 (2019)

  30. Ma, N., Zhang, X., Huang, J., et al.: Weightnet: revisiting the design space of weight networks. In: European Conference on Computer Vision, pp. 776–792. Springer International Publishing, Cham (2020)

  31. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. PMLR, pp. 448–456 (2015)

  32. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)

  33. Guo, M.H., Lu, C.Z., Liu, Z.N., et al.: Visual attention network. Comput. Vis. Media 9(4), 733–752 (2023)

  34. Tang, C., Zhao, Y., Wang, G., et al.: MLP for image recognition: is self-attention really necessary? In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, issue 2, pp. 2344–2351 (2022)

  35. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  36. Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus) (2016) arXiv preprint arXiv:1606.08415

  37. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization (2016) arXiv preprint arXiv:1607.06450

  38. Hao, Y., Zhang, H., Ngo, C.W., et al.: Group contextualization for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 928–938 (2022)

  39. Chen, J., Kao, S., He, H., et al.: Run, Don’t Walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12021–12031 (2023)

  40. Han, J., Moraga, C.: The influence of the sigmoid function parameters on the speed of backpropagation learning. In: International Workshop on Artificial Neural Networks, pp. 195–201. Springer, Berlin Heidelberg (1995)

  41. Kuehne, H., Jhuang, H., Garrote, E., et al.: HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision. IEEE, pp. 2556–2563 (2011)

  42. Deng, J., Dong, W., Socher, R., et al.: Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255 (2009)

  43. Hinton, G.E., Srivastava, N., Krizhevsky, A., et al.: Improving neural networks by preventing co-adaptation of feature detectors (2012) arXiv preprint arXiv:1207.0580

  44. Li, X., Wang, Y., Zhou, Z., et al.: Smallbignet: integrating core and contextual views for video classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1092–1101 (2020)

  45. Li, K., Li, X., Wang, Y., et al.: CT-net: channel tensorization network for video classification (2021) arXiv preprint arXiv:2106.01603

  46. Xie, Z., Chen, J., Wu, K., et al.: Global temporal difference network for action recognition. IEEE Trans. Multimed. (2022)

  47. Sudhakaran, S., Escalera, S., Lanz, O.: Gate-shift-fuse for video action recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2023)

  48. Geng, T., Zheng, F., Hou, X., et al.: Spatial-temporal pyramid graph reasoning for action recognition. IEEE Trans. Image Process. 31, 5484–5497 (2022)

  49. Selvaraju, R.R., Cogswell, M., Das, A., et al.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)

  50. He, K., Zhang, X., Ren, S., et al.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

  51. Wang, W., Shen, J.: Deep visual attention prediction. IEEE Trans. Image Process. 27(5), 2368–2378 (2017)

  52. Diba, A., Fayyaz, M., Sharma, V., et al.: Spatio-temporal channel correlation networks for action classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 284–299 (2018)

  53. Li, Y., Ji, B., Shi, X., et al.: Tea: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918 (2020)

  54. Hao, Y., Wang, S., Tan, Y., et al.: Spatio-temporal collaborative module for efficient action recognition. IEEE Trans. Image Process. 31, 7279–7291 (2022)

  55. Gong, W., Qian, Y., Fan, Y.: MPCSAN: multi-head parallel channel-spatial attention network for facial expression recognition in the wild. Neural Comput. Appl. 35(9), 6529–6543 (2023)

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62266043 and U1803261, in part by the National Science and Technology Major Project under Grant 95-Y50G34-9001-22/23, and in part by the Autonomous Region Science and Technology Department International Cooperation Project under Grant 2020E01023.

Author information

Contributions

WL contributed to the conception, design analysis, and writing of the article; WG and HT were responsible for writing and reviewing; and YQ provided equipment, guided the writing, and conducted content reviews.

Corresponding author

Correspondence to Yurong Qian.

Ethics declarations

Conflict of interest

The authors state that there are no competing interests relating to the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, W., Gong, W., Qian, Y. et al. STAM: a spatio-temporal adaptive module for improving static convolutions in action recognition. Vis Comput (2023). https://doi.org/10.1007/s00371-023-03165-6
