A novel feature for action recognition

Multimedia Tools and Applications

Abstract

In this paper, we focus on how to better represent video data for action recognition. We propose a new image feature, the phase spectrum reconstruction map, which extracts contour features from the RGB frames of a video clip that are beneficial for action recognition. We demonstrate the effectiveness of this feature with ablation experiments using a channel-based feature fusion method and the two-stream method. We also verify that the reconstructed map contains motion-related features that convolutional neural networks can learn even when it is used as the sole input feature. Our method is trained and evaluated on the benchmark datasets HMDB-51 and UCF-101, and on both it shows significant improvements over methods that do not add the reconstructed map features.
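The abstract does not define how the phase spectrum reconstruction map is computed, but classic phase-only Fourier reconstruction (keep each frame's phase spectrum, set the magnitude to one, and inverse-transform) is known to emphasize contours in exactly the way described. A minimal NumPy sketch under that assumption; the function name `phase_only_map` and the normalization step are illustrative, not the paper's actual implementation:

```python
import numpy as np

def phase_only_map(frame: np.ndarray) -> np.ndarray:
    """Reconstruct a grayscale frame from its Fourier phase spectrum only.

    Discarding the magnitude (forcing it to 1) keeps mostly edge/contour
    information, the classic phase-only reconstruction result.
    """
    spectrum = np.fft.fft2(frame)
    unit_spectrum = np.exp(1j * np.angle(spectrum))  # magnitude set to 1
    recon = np.real(np.fft.ifft2(unit_spectrum))
    # normalize to [0, 1] so it can be stacked as an extra input channel
    recon = (recon - recon.min()) / (recon.max() - recon.min() + 1e-8)
    return recon

# toy usage: a synthetic frame containing a bright square
frame = np.zeros((64, 64))
frame[16:48, 16:48] = 1.0
m = phase_only_map(frame)
print(m.shape)  # (64, 64)
```

Such a map could then be concatenated with the RGB channels (channel-based fusion) or fed to a separate network stream, matching the two fusion settings the abstract mentions.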


(Figures 1–7 appear in the full article.)


Data Availability Statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.


Acknowledgements

This research is supported in part by the National Key Research and Development Program of China under Grant No. 2020AAA0140004.

Author information

Corresponding author

Correspondence to Zhe-Ming Lu.

Ethics declarations

Conflicts of interest

We declare that we have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wen, H., Lu, ZM., Cui, JL. et al. A novel feature for action recognition. Multimed Tools Appl 83, 41441–41456 (2024). https://doi.org/10.1007/s11042-023-17251-3

