Abstract
In this paper, we focus on how to better represent video data for action recognition. We propose a new image feature, the phase spectrum reconstruction map, which extracts contour features from the RGB frames of a video clip that are beneficial for action recognition. We demonstrate the effectiveness of this feature with ablation experiments using a channel-based feature-fusion method and a two-stream method. We also verify that the reconstructed map does contain motion-related features, which convolutional neural networks can learn even when the reconstructed map is the only input. Our method is trained and evaluated on the benchmark datasets HMDB-51 and UCF-101, and on both it shows significant improvements over counterparts that do not add the reconstructed map features.
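The contour-emphasizing property described above can be illustrated with classical phase-only Fourier reconstruction, where an image is rebuilt from its phase spectrum alone (unit magnitude); the result is known to concentrate energy at edges and contours. The sketch below is a minimal, hypothetical illustration of that general idea in NumPy, not the paper's exact reconstruction procedure.

```python
import numpy as np


def phase_only_reconstruction(frame: np.ndarray) -> np.ndarray:
    """Rebuild a grayscale frame from its Fourier phase spectrum only.

    Discarding the magnitude (setting it to 1) and inverting the FFT
    emphasizes edge/contour structure, since phase carries the spatial
    localization of image features. Illustrative only: the paper's
    reconstruction map may differ in detail.
    """
    spectrum = np.fft.fft2(frame.astype(np.float64))
    phase = np.angle(spectrum)                 # keep phase, drop magnitude
    recon = np.fft.ifft2(np.exp(1j * phase))   # unit-magnitude inverse FFT
    return np.abs(recon)                       # contour-like response map


# Toy input: a bright square on a dark background.
frame = np.zeros((64, 64))
frame[16:48, 16:48] = 255.0
contour_map = phase_only_reconstruction(frame)
```

In this toy example, the response map is strongest along the square's boundary and nearly flat in its interior, matching the intuition that the phase spectrum preserves contour information.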
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This research is supported in part by the National Key Research and Development Program of China under Grant No. 2020AAA0140004.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
We declare that we have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wen, H., Lu, ZM., Cui, JL. et al. A novel feature for action recognition. Multimed Tools Appl 83, 41441–41456 (2024). https://doi.org/10.1007/s11042-023-17251-3