Abstract
In this paper, we propose a deep-learning-based framework that fuses multiple cues (action motion, objects, and scenes) for complex action recognition. Since deep features achieve promising results, we extract three deep representations that capture both the temporal and the contextual information of actions. For the action cue, we first apply a deep detection model to detect persons frame by frame, and then feed the deep representations of those persons into a Gated Recurrent Unit (GRU) model to generate the action features. Unlike existing deep action features, ours is capable of modeling the global dynamics of long human motions. The scene and object cues are likewise represented by deep features, pooled over all frames of a video. Moreover, we introduce an lp-norm multiple kernel learning method that combines the multiple deep representations of a video to learn robust action classifiers, capturing the contextual relationships among action, object, and scene. Extensive experiments on two real-world action datasets (UCF101 and HMDB51) clearly demonstrate the effectiveness of our method.
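The action-cue step described above (per-frame person descriptors summarized by a GRU into one video-level feature) can be sketched as follows. This is an illustrative NumPy stand-in, not the paper's implementation: the dimensions, initialization, and the use of the final hidden state as the action feature are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (illustrative; sizes and init are assumptions)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # Parameters for the update gate z, reset gate r, and candidate state
        self.Wz = rng.normal(scale=s, size=(d_hid, d_in + d_hid))
        self.Wr = rng.normal(scale=s, size=(d_hid, d_in + d_hid))
        self.Wh = rng.normal(scale=s, size=(d_hid, d_in + d_hid))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)          # how much of the state to update
        r = sigmoid(self.Wr @ xh)          # how much of the past to reset
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def action_feature(frames, cell, d_hid=16):
    """Run the GRU over per-frame person descriptors; the final hidden
    state summarizes the global dynamics of the whole motion."""
    h = np.zeros(d_hid)
    for x in frames:
        h = cell.step(x, h)
    return h

# Hypothetical input: 20 frames, each with an 8-d person descriptor
frames = np.random.default_rng(1).normal(size=(20, 8))
cell = GRUCell(d_in=8, d_hid=16)
feat = action_feature(frames, cell)
print(feat.shape)  # (16,)
```

Because the hidden state is a convex combination of its previous value and a tanh candidate, every component of the resulting action feature stays in [-1, 1], regardless of video length.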
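The fusion step can be sketched as an lp-norm combination of one kernel per cue. This is a minimal sketch under stated assumptions: the RBF kernel choice, the gamma value, and the projection of the weights onto the lp-norm unit sphere are illustrative, not the paper's exact formulation.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """RBF Gram matrix from pairwise squared distances."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * d2)

def combine_kernels(kernels, beta, p=2.0):
    """Mix base kernels with nonnegative weights normalized to unit lp-norm,
    as in lp-norm multiple kernel learning."""
    beta = np.maximum(beta, 0.0)
    beta = beta / (np.sum(beta ** p) ** (1.0 / p))
    return sum(b * K for b, K in zip(beta, kernels))

# Hypothetical cue features for 6 videos (10-d each): GRU-pooled action
# features, pooled object-CNN features, and pooled scene-CNN features.
rng = np.random.default_rng(0)
action = rng.normal(size=(6, 10))
obj = rng.normal(size=(6, 10))
scene = rng.normal(size=(6, 10))

Ks = [rbf_kernel(F) for F in (action, obj, scene)]
K = combine_kernels(Ks, beta=np.array([1.0, 1.0, 1.0]), p=2.0)
print(K.shape)  # (6, 6)
```

The combined Gram matrix K would then be handed to a kernel classifier such as an SVM; in practice the weights beta are learned jointly with the classifier rather than fixed as here.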
Acknowledgments
This work was supported by the Natural Science Foundation of China (NSFC) under Grants No. 61673062 and No. 61472038.
Cite this article
Wang, R., Wu, X. Combining multiple deep cues for action recognition. Multimed Tools Appl 78, 9933–9950 (2019). https://doi.org/10.1007/s11042-018-6509-0