
Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks

  • MATHEMATICAL THEORY OF IMAGES AND SIGNALS REPRESENTING, PROCESSING, ANALYSIS, RECOGNITION, AND UNDERSTANDING
  • Published in: Pattern Recognition and Image Analysis

Abstract

Traditional human action recognition algorithms based on hand-crafted feature extraction are complicated and achieve low recognition accuracy. We present an algorithm for recognizing human actions in videos based on spatio-temporal fusion with 3D convolutional neural networks (3D CNNs). The algorithm contains two subnetworks, which extract deep spatial information and temporal information, respectively, and a bilinear fusion policy is applied to obtain the final fused spatio-temporal information. Spatial information is represented by a gradient feature, and temporal information is represented by optical flow. By constructing a new 3D CNN, the fused spatio-temporal representation captures deep features from multiple perspectives. The proposed algorithm is compared with current mainstream algorithms on the KTH and UCF101 datasets, demonstrating effectiveness and high recognition accuracy.
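The two-stream design described in the abstract can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the class names (`Stream3D`, `TwoStreamBilinear`), layer sizes, and channel counts are assumptions; the spatial stream is assumed to take single-channel gradient clips and the temporal stream two-channel (x, y) optical-flow clips, with the two feature vectors combined by an outer-product (bilinear) fusion.

```python
import torch
import torch.nn as nn


class Stream3D(nn.Module):
    """One 3D-CNN subnetwork; layer widths are illustrative, not the paper's."""

    def __init__(self, in_channels: int, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> (N, feat_dim, 1, 1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).flatten(1)  # (N, feat_dim)


class TwoStreamBilinear(nn.Module):
    """Spatial stream (gradient clips) + temporal stream (optical-flow clips),
    fused by an outer product of the two feature vectors."""

    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.spatial = Stream3D(in_channels=1, feat_dim=feat_dim)   # gradient feature
        self.temporal = Stream3D(in_channels=2, feat_dim=feat_dim)  # optical flow (x, y)
        self.fc = nn.Linear(feat_dim * feat_dim, num_classes)

    def forward(self, grad_clip: torch.Tensor, flow_clip: torch.Tensor) -> torch.Tensor:
        s = self.spatial(grad_clip)    # (N, feat_dim)
        t = self.temporal(flow_clip)   # (N, feat_dim)
        # Bilinear fusion: outer product of the two streams' features.
        fused = torch.einsum("ni,nj->nij", s, t).flatten(1)  # (N, feat_dim^2)
        # Signed square root and l2 normalization, as is common for bilinear pooling.
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
        fused = nn.functional.normalize(fused)
        return self.fc(fused)
```

Inputs are 5D tensors of shape (batch, channels, frames, height, width); for example, a batch of two 8-frame 16x16 clips per stream yields one class-score vector per clip.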




Funding

This work was supported in part by the State Key Development Program (project no. 2018YFC0830103), in part by the National Natural Science Foundation of China (project no. 61876070), in part by Key Projects of Jilin Province Science and Technology Development Plan (project no. 20180201064SF), and in part by Jilin University Student Innovation Experiment Project (project no. S202110183428).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Y. Wang, X. J. Shen, H. P. Chen or J. X. Sun.

Ethics declarations

COMPLIANCE WITH ETHICAL STANDARDS

This article is a completely original work of its authors; it has not been published before and will not be submitted to other publications unless the Editorial Board of Pattern Recognition and Image Analysis decides not to accept it for publication.

CONFLICT OF INTEREST

The process of writing and the content of the article do not give grounds for raising the issue of a conflict of interest.

Additional information

Yu Wang: Associate Professor at the College of Computer Science and Technology, Jilin University. He received his PhD degree from the College of Computer Science and Technology, Jilin University, in 2017. His research interests cover image processing and machine learning. He is a corresponding author of this paper.

Xuanjing Shen: Professor at the College of Computer Science and Technology, Jilin University. He received his PhD degree from Harbin Institute of Technology in 1990. His research interests cover multimedia technology, computer image processing, intelligent measurement systems, and optical-electronic hybrid systems.

Haipeng Chen: Professor at the College of Computer Science and Technology, Jilin University. His research interests cover image processing and pattern recognition.

Jiaxi Sun: Bachelor's student at the College of Software, Jilin University. His research interests cover video processing and pattern recognition.


Cite this article

Wang, Y., Shen, X.J., Chen, H.P. et al. Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks. Pattern Recognit. Image Anal. 31, 580–587 (2021). https://doi.org/10.1134/S105466182103024X
