
Multimedia Tools and Applications, Volume 76, Issue 13, pp 15065–15081

Exploring hybrid spatio-temporal convolutional networks for human action recognition

  • Hao Wang
  • Yanhua Yang
  • Erkun Yang
  • Cheng Deng

Abstract

Convolutional neural networks have achieved great success in many computer vision tasks. However, action recognition in videos remains challenging due to the intrinsically complicated space-time correlations in videos and the computational difficulty of processing them. Existing methods usually neglect the fusion of long-term spatio-temporal information. In this paper, we propose a novel hybrid spatio-temporal convolutional network for action recognition. Specifically, we integrate three different types of streams into the network: (1) the image stream utilizes still images to learn appearance information; (2) the optical flow stream captures motion information from optical flow frames; (3) the dynamic image stream explores appearance and motion information simultaneously from generated dynamic images. Finally, a weighted fusion strategy at the softmax layer is utilized to make the class decision. With the help of these three streams, we can take full advantage of the spatio-temporal information in videos. Extensive experiments on two popular human action recognition datasets demonstrate the superiority of our proposed method compared with several state-of-the-art approaches.
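
The weighted fusion step can be made concrete with a short sketch. Below is a minimal NumPy illustration of late fusion of per-stream softmax scores, assuming each stream has already produced class probabilities; the stream weights shown are illustrative placeholders, not the values tuned in the paper.

    import numpy as np

    def fuse_streams(p_image, p_flow, p_dynamic, weights=(1.0, 1.5, 1.0)):
        """Weighted late fusion of softmax scores from the three streams.

        Each p_* is an (N, K) array of class probabilities for N videos
        and K action classes. The weights are illustrative placeholders,
        not the values selected in the paper.
        """
        w = np.asarray(weights, dtype=np.float64)
        scores = w[0] * p_image + w[1] * p_flow + w[2] * p_dynamic
        return scores.argmax(axis=1)  # predicted action class per video

Fusing at the softmax layer keeps the three networks independent, so each stream can be trained and fine-tuned separately before the fusion weights are chosen, which is a common motivation for late fusion of this kind.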

Keywords

Human action recognition · Convolutional network · Spatio-temporal information · Approximate rank pooling · Weighted fusion
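
Among the keywords, approximate rank pooling is the procedure that generates the dynamic images consumed by the third stream. As a minimal sketch, assuming the closed-form coefficients of Bilen et al. (CVPR 2016) applied directly to raw pixels (the paper may instead pool intermediate feature maps):

    import numpy as np

    def dynamic_image(frames):
        """Approximate rank pooling of a clip into one dynamic image.

        frames: float array of shape (T, H, W, C) for one video clip.
        Returns an (H, W, C) weighted sum of the frames whose fixed
        coefficients approximate a ranking machine that orders the
        frames in time (closed form from Bilen et al., CVPR 2016).
        """
        T = frames.shape[0]
        # Harmonic numbers H_0 .. H_T, with H_0 = 0.
        harm = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
        t = np.arange(1, T + 1)
        # alpha_t = 2(T - t + 1) - (T + 1) * (H_T - H_{t-1})
        alpha = 2.0 * (T - t + 1) - (T + 1) * (harm[T] - harm[t - 1])
        return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))

The resulting single image summarizes the temporal evolution of the clip, which is what lets an ordinary image CNN serve as the dynamic image stream.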

Acknowledgements

The authors would like to thank the Editor-in-Chief, the handling associate editor, and all anonymous reviewers for their consideration and suggestions. This work was supported by the National Natural Science Foundation of China (61572388).


Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Hao Wang (1)
  • Yanhua Yang (1)
  • Erkun Yang (1)
  • Cheng Deng (1, 2)

  1. Department of Electronic Engineering, Xidian University, Xi'an, China
  2. The State Key Laboratory of Integrated Services Networks (ISN), Xidian University, Xi'an, China
