Abstract
In this paper, we propose a deep-learning-based framework that fuses multiple cues (action motion, objects, and scenes) for complex action recognition. Since deep features achieve promising results, we extract three deep representations that capture both the temporal and the contextual information of actions. For the action cue, we first apply a deep detection model to detect persons frame by frame, and then feed the deep representations of those persons into a Gated Recurrent Unit (GRU) model to generate the action features. Unlike existing deep action features, ours is capable of modeling the global dynamics of long human motions. The scene and object cues are likewise represented by deep features, pooled over all frames of a video. Moreover, we introduce an lp-norm multiple kernel learning method that combines the multiple deep representations of a video to learn robust action classifiers, capturing the contextual relationships among action, object, and scene. Extensive experiments on two real-world action datasets (UCF101 and HMDB51) clearly demonstrate the effectiveness of our method.
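The action-cue step described above (per-frame person descriptors summarized by a GRU into one video-level feature) can be sketched as follows. This is an illustrative NumPy stand-in, not the paper's implementation: the dimensions, initialization, and the use of the final hidden state as the action feature are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (illustrative; sizes and init are assumptions)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # Parameters for the update gate z, reset gate r, and candidate state
        self.Wz = rng.normal(scale=s, size=(d_hid, d_in + d_hid))
        self.Wr = rng.normal(scale=s, size=(d_hid, d_in + d_hid))
        self.Wh = rng.normal(scale=s, size=(d_hid, d_in + d_hid))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)          # how much of the state to update
        r = sigmoid(self.Wr @ xh)          # how much of the past to reset
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def action_feature(frames, cell, d_hid=16):
    """Run the GRU over per-frame person descriptors; the final hidden
    state summarizes the global dynamics of the whole motion."""
    h = np.zeros(d_hid)
    for x in frames:
        h = cell.step(x, h)
    return h

# Hypothetical input: 20 frames, each with an 8-d person descriptor
frames = np.random.default_rng(1).normal(size=(20, 8))
cell = GRUCell(d_in=8, d_hid=16)
feat = action_feature(frames, cell)
print(feat.shape)  # (16,)
```

Because the hidden state is a convex combination of its previous value and a tanh candidate, every component of the resulting action feature stays in [-1, 1], regardless of video length.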
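The fusion step can be sketched as an lp-norm combination of one kernel per cue. This is a minimal sketch under stated assumptions: the RBF kernel choice, the gamma value, and the projection of the weights onto the lp-norm unit sphere are illustrative, not the paper's exact formulation.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """RBF Gram matrix from pairwise squared distances."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * d2)

def combine_kernels(kernels, beta, p=2.0):
    """Mix base kernels with nonnegative weights normalized to unit lp-norm,
    as in lp-norm multiple kernel learning."""
    beta = np.maximum(beta, 0.0)
    beta = beta / (np.sum(beta ** p) ** (1.0 / p))
    return sum(b * K for b, K in zip(beta, kernels))

# Hypothetical cue features for 6 videos (10-d each): GRU-pooled action
# features, pooled object-CNN features, and pooled scene-CNN features.
rng = np.random.default_rng(0)
action = rng.normal(size=(6, 10))
obj = rng.normal(size=(6, 10))
scene = rng.normal(size=(6, 10))

Ks = [rbf_kernel(F) for F in (action, obj, scene)]
K = combine_kernels(Ks, beta=np.array([1.0, 1.0, 1.0]), p=2.0)
print(K.shape)  # (6, 6)
```

The combined Gram matrix K would then be handed to a kernel classifier such as an SVM; in practice the weights beta are learned jointly with the classifier rather than fixed as here.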
Acknowledgments
This work was supported by the Natural Science Foundation of China (NSFC) under Grants No. 61673062 and No. 61472038.
Cite this article
Wang, R., Wu, X. Combining multiple deep cues for action recognition. Multimed Tools Appl 78, 9933–9950 (2019). https://doi.org/10.1007/s11042-018-6509-0