
Combining multiple deep cues for action recognition

Published in: Multimedia Tools and Applications

Abstract

In this paper, we propose a novel deep learning based framework that fuses multiple cues of action motion, objects, and scenes for complex action recognition. Motivated by the promising results of deep features, we extract three deep representations to capture both the temporal and contextual information of actions. For the action cue, we first apply a deep detection model to detect persons frame by frame, and then feed the deep representations of the detected persons into a Gated Recurrent Unit (GRU) model to generate the action feature. Unlike existing deep action features, our feature models the global dynamics of long human motion. The scene and object cues are likewise represented by deep features, pooled over all frames of a video. Moreover, we introduce an lp-norm multiple kernel learning method that effectively combines the multiple deep representations of a video to learn robust action classifiers by capturing the contextual relationships among action, object, and scene. Extensive experiments on two real-world action datasets (UCF101 and HMDB51) clearly demonstrate the effectiveness of our method.
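The fusion step described above can be sketched with a small NumPy example: each cue (action, object, scene) yields one per-video feature, each feature induces a base kernel, and the kernels are combined with weights constrained to unit lp-norm, as in lp-norm multiple kernel learning. All names, dimensions, and the stubbed SVM dual variables below are illustrative assumptions, not the paper's actual implementation; the weight update shown is one analytic step of the standard lp-norm MKL scheme, not the full alternating optimization.

```python
import numpy as np

# Hypothetical per-video features for the three cues (illustrative only):
# "action" stands in for the GRU output over per-frame person detections,
# "object" and "scene" for deep features pooled over all frames.
rng = np.random.default_rng(0)
n_videos = 6
feats = {
    "action": rng.standard_normal((n_videos, 16)),
    "object": rng.standard_normal((n_videos, 16)),
    "scene":  rng.standard_normal((n_videos, 16)),
}

def rbf_kernel(X, gamma=0.1):
    """Base kernel for one cue: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

kernels = [rbf_kernel(X) for X in feats.values()]

def lp_norm_weights(kernels, alpha, y, p=2.0):
    """One closed-form update of the kernel weights d_m under ||d||_p = 1.

    Uses ||w_m||^2 = (alpha * y)^T K_m (alpha * y) and the standard
    proportionality d_m ∝ ||w_m||^(2/(p+1)); alpha and y would normally
    come from the SVM dual solution and are stubbed below.
    """
    v = alpha * y
    norms = np.array([float(v @ K @ v) for K in kernels])
    d = norms ** (1.0 / (p + 1.0))
    return d / np.linalg.norm(d, ord=p)

# Stub dual variables and labels, for illustration only.
alpha = np.full(n_videos, 0.5)
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

d = lp_norm_weights(kernels, alpha, y, p=2.0)
K_combined = sum(w * K for w, K in zip(d, kernels))
```

In a full MKL solver this weight update alternates with retraining the SVM on `K_combined`; the p parameter controls sparsity over cues (p close to 1 selects few cues, larger p spreads weight across action, object, and scene).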





Acknowledgments

This work was supported by the Natural Science Foundation of China (NSFC) under Grants No. 61673062 and No. 61472038.

Author information


Corresponding author

Correspondence to Xinxiao Wu.



About this article


Cite this article

Wang, R., Wu, X. Combining multiple deep cues for action recognition. Multimed Tools Appl 78, 9933–9950 (2019). https://doi.org/10.1007/s11042-018-6509-0

