Multiple feature fusion in convolutional neural networks for action recognition

Li, Hongyang; Chen, Jun; Hu, Ruimin

doi:10.1007/s11859-017-1219-4

Computer Science
Published: 10 January 2017

Volume 22, pages 73–78, (2017)
Wuhan University Journal of Natural Sciences

Hongyang Li^1, Jun Chen^1,2 & Ruimin Hu^1,2

Abstract

Action recognition is important for understanding human behavior in video, and video representation is the basis of action recognition. This paper proposes a new video representation based on convolutional neural networks (CNNs). To capture human motion information, one CNN takes both optical flow maps and gray images as input and combines multiple convolutional features by max pooling across frames. A second CNN takes a single color frame as input to capture context information. Finally, we take the top fully connected layer vectors as the video representation and train classifiers with a linear support vector machine. The experimental results show that the representation integrating optical flow maps and gray images is more discriminative than representations that depend on only one of them. On the challenging HMDB51 and UCF101 data sets, this video representation achieves competitive performance.
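The pooling and fusion steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: in the paper the max pooling operates on convolutional features inside the network, whereas this toy version pools plain per-frame vectors. The function names (`max_pool_across_frames`, `fuse_streams`), the use of concatenation to combine the two streams, and the toy values are all assumptions made for illustration; the abstract does not state how the two fully connected vectors are combined.

```python
from typing import List, Sequence

Vector = List[float]


def max_pool_across_frames(frame_features: Sequence[Sequence[float]]) -> Vector:
    """Element-wise maximum over per-frame feature vectors of equal length."""
    if not frame_features:
        raise ValueError("need at least one frame")
    length = len(frame_features[0])
    if any(len(f) != length for f in frame_features):
        raise ValueError("all frames must have the same feature length")
    # zip(*...) groups the i-th value of every frame together.
    return [max(values) for values in zip(*frame_features)]


def fuse_streams(motion_vector: Sequence[float], context_vector: Sequence[float]) -> Vector:
    """Assumed fusion: concatenate the motion and context vectors."""
    return list(motion_vector) + list(context_vector)


# Toy per-frame motion features (hypothetical values, 3 frames x 3 dimensions).
motion_frames = [
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 0.3],
    [0.2, 0.5, 0.7],
]
# Toy context feature from a single color frame (hypothetical values).
context = [1.0, 0.0]

pooled_motion = max_pool_across_frames(motion_frames)
video_repr = fuse_streams(pooled_motion, context)
print(video_repr)
```

The resulting vector would then be the input to a linear support vector machine, one video per row. Max pooling keeps the strongest response per dimension regardless of which frame produced it, so the pooled vector has a fixed length for videos of any duration.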

Author information

Authors and Affiliations

National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan 430072, Hubei, China
Hongyang Li, Jun Chen & Ruimin Hu
State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, Hubei, China
Jun Chen & Ruimin Hu

Corresponding author

Correspondence to Hongyang Li.

Additional information

Foundation item: Supported by the National High Technology Research and Development Program of China (863 Program, 2015AA016306), the National Natural Science Foundation of China (61231015), the Internet of Things Development Funding Project of the Ministry of Industry in 2013 (25), the Technology Research Program of the Ministry of Public Security (2016JSYJA12), and the Natural Science Foundation of Hubei Province (2014CFB712).

Biography: LI Hongyang, male, Ph.D. candidate, research direction: multimedia analysis and computer vision.

About this article

Cite this article

Li, H., Chen, J. & Hu, R. Multiple feature fusion in convolutional neural networks for action recognition. Wuhan Univ. J. Nat. Sci. 22, 73–78 (2017). https://doi.org/10.1007/s11859-017-1219-4

Received: 10 May 2016
Published: 10 January 2017
Issue Date: February 2017
DOI: https://doi.org/10.1007/s11859-017-1219-4

Keywords

CLC number

TP 391
