Abstract
Video action recognition is a hot topic in computer vision with broad application potential, including intelligent surveillance, content recommendation, and virtual reality. Although convolutional neural networks have achieved great success in image classification, action recognition has yet to see its "AlexNet" moment, and temporal modeling remains the key challenge of video action recognition. To address the insufficient exploitation of spatiotemporal features in video action recognition, we propose a spatiotemporal feature enhancement network. First, we design a novel temporal feature aggregation module that enhances short-range temporal features and strengthens action-related features. Second, we present a grouped convolution superposition module that expands the temporal receptive field and improves the network's ability to learn long-range spatiotemporal features. Finally, experiments on the public action datasets UCF-101 and HMDB-51 demonstrate that the proposed method effectively enhances the fusion of spatiotemporal features and achieves high accuracy.
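The two ideas named in the abstract can be illustrated schematically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes frame features of shape (T, C), uses a forward temporal difference as a stand-in for short-range temporal aggregation (a common motion-encoding idea in this literature), and mimics the grouped superposition by splitting channels into groups and accumulating them so that later groups see the aggregate of earlier ones, which is how stacked grouped branches enlarge the effective receptive field. The function names and group count are illustrative assumptions.

```python
import numpy as np

def temporal_aggregation(feats):
    """Toy short-range temporal enhancement: each frame's features are
    enriched with the difference to the next frame (residual motion cue).
    Illustrative only; the paper's actual module is not reproduced here."""
    diff = np.zeros_like(feats)
    diff[:-1] = feats[1:] - feats[:-1]  # forward temporal difference
    return feats + diff                 # residual enhancement

def grouped_superposition(feats, groups=4):
    """Toy grouped superposition: split channels into groups and let each
    group carry the running sum of all earlier groups, so the effective
    receptive field grows group by group (a Res2Net-like hierarchy)."""
    assert feats.shape[1] % groups == 0, "channels must divide evenly"
    chunks = np.split(feats, groups, axis=1)
    out, carry = [], np.zeros_like(chunks[0])
    for g in chunks:
        carry = carry + g       # superpose earlier groups onto this one
        out.append(carry.copy())
    return np.concatenate(out, axis=1)
```

For example, with a linearly increasing feature `[[0], [1], [2]]`, `temporal_aggregation` returns `[[1], [2], [2]]`: every frame is shifted toward its successor, emphasizing change over static appearance.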
Data Availability
Data will be made available on reasonable request.
Acknowledgements
This work is supported by the Key R&D Project of Zhejiang Province of China (No. 2021C03151), the Natural Science Foundation of Zhejiang Province of China (No. LY20F020018), and the Public Projects of Zhejiang Province of China (No. LGG21G010001).
Ethics declarations
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Huang, G., Wang, X., Li, X. et al. Spatiotemporal feature enhancement network for action recognition. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17834-0