
Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks

  • MATHEMATICAL THEORY OF IMAGES AND SIGNALS REPRESENTING, PROCESSING, ANALYSIS, RECOGNITION, AND UNDERSTANDING
  • Published in: Pattern Recognition and Image Analysis

Abstract

Traditional human action recognition algorithms based on hand-crafted feature extraction are complicated and achieve low recognition accuracy. We present an algorithm for recognizing human actions in videos based on spatio-temporal fusion with 3D convolutional neural networks (3D CNNs). The algorithm contains two subnetworks, which extract deep spatial information and temporal information, respectively, and a bilinear fusion policy is applied to obtain the final fused spatio-temporal information. Spatial information is represented by a gradient feature, and temporal information is represented by optical flow. By constructing a new 3D CNN, the fused spatio-temporal representation captures deep features from multiple perspectives. The proposed algorithm is compared with current mainstream algorithms on the KTH and UCF101 datasets, demonstrating effectiveness and high recognition accuracy.
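The two-stream design described in the abstract can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the class names (`Stream3D`, `TwoStreamBilinear`), layer sizes, and channel counts are assumptions; the spatial stream is assumed to take single-channel gradient clips and the temporal stream two-channel (x, y) optical-flow clips, with the two feature vectors combined by an outer-product (bilinear) fusion.

```python
import torch
import torch.nn as nn


class Stream3D(nn.Module):
    """One 3D-CNN subnetwork; layer widths are illustrative, not the paper's."""

    def __init__(self, in_channels: int, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> (N, feat_dim, 1, 1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).flatten(1)  # (N, feat_dim)


class TwoStreamBilinear(nn.Module):
    """Spatial stream (gradient clips) + temporal stream (optical-flow clips),
    fused by an outer product of the two feature vectors."""

    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.spatial = Stream3D(in_channels=1, feat_dim=feat_dim)   # gradient feature
        self.temporal = Stream3D(in_channels=2, feat_dim=feat_dim)  # optical flow (x, y)
        self.fc = nn.Linear(feat_dim * feat_dim, num_classes)

    def forward(self, grad_clip: torch.Tensor, flow_clip: torch.Tensor) -> torch.Tensor:
        s = self.spatial(grad_clip)    # (N, feat_dim)
        t = self.temporal(flow_clip)   # (N, feat_dim)
        # Bilinear fusion: outer product of the two streams' features.
        fused = torch.einsum("ni,nj->nij", s, t).flatten(1)  # (N, feat_dim^2)
        # Signed square root and l2 normalization, as is common for bilinear pooling.
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
        fused = nn.functional.normalize(fused)
        return self.fc(fused)
```

Inputs are 5D tensors of shape (batch, channels, frames, height, width); for example, a batch of two 8-frame 16x16 clips per stream yields one class-score vector per clip.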




Funding

This work was supported in part by the State Key Development Program (project no. 2018YFC0830103), in part by the National Natural Science Foundation of China (project no. 61876070), in part by Key Projects of Jilin Province Science and Technology Development Plan (project no. 20180201064SF), and in part by Jilin University Student Innovation Experiment Project (project no. S202110183428).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Y. Wang, X. J. Shen, H. P. Chen or J. X. Sun.

Ethics declarations

COMPLIANCE WITH ETHICAL STANDARDS

This article is a completely original work of its authors; it has not been published before and will not be submitted to other publications unless the Editorial Board of Pattern Recognition and Image Analysis decides not to accept it for publication.

CONFLICT OF INTEREST

The process of writing and the content of the article do not give grounds for raising the issue of a conflict of interest.

Additional information

Yu Wang: Associate Professor at the College of Computer Science and Technology, Jilin University. He received his PhD degree from the College of Computer Science and Technology, Jilin University, in 2017. His research interests cover image processing and machine learning. He is a corresponding author of this paper.

Xuanjing Shen: Professor at the College of Computer Science and Technology, Jilin University. He received his PhD degree from Harbin Institute of Technology in 1990. His research interests cover multimedia technology, computer image processing, intelligent measurement systems, and optical-electronic hybrid systems.

Haipeng Chen: Professor at the College of Computer Science and Technology, Jilin University. His research interests cover image processing and pattern recognition.

Jiaxi Sun: Bachelor's student at the College of Software, Jilin University. His research interests cover video processing and pattern recognition.


Cite this article

Wang, Y., Shen, X.J., Chen, H.P. et al. Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks. Pattern Recognit. Image Anal. 31, 580–587 (2021). https://doi.org/10.1134/S105466182103024X
