Cluster Computing, Volume 22, Supplement 1, pp 819–826

More efficient and effective tricks for deep action recognition

  • Zheyuan Liu
  • Xiaoteng Zhang
  • Lei Song (corresponding author)
  • Zhengyan Ding
  • Huixian Duan


Deep convolutional networks have achieved great success in visual recognition of static images, but in action recognition in videos they hold less of an advantage over traditional methods. Although two-stream-style convolutional networks achieve the best performance in human action recognition, obstacles remain, such as choosing among different pre-trained models and hyper-parameters, and high computational cost. In this paper, we propose two efficient and effective methods for action recognition based on the two-stream convolutional network: (1) reducing the computational cost of the temporal stream while achieving the same accuracy, and (2) providing techniques such as the selection of the optical flow algorithm, the pre-training dataset and architecture, and the hyper-parameters for assembly in the action recognition task. Experimental results show that we obtain performance on par with the state of the art on the HMDB51 (70.9%) and UCF101 (95.4%) datasets.


Action recognition · Two-stream convolutional network · Effective tricks · Tiny fusion network



The authors of this paper are members of the Shanghai Engineering Research Center of Intelligent Video Surveillance. Dr. Lei Song is also a visiting researcher with the Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China. Our research was sponsored by the following projects: the National Natural Science Foundation of China (61402116, 61403084); the Program of the Science and Technology Commission of Shanghai Municipality (Nos. 15530701300, 15XD15202000); the 2012 IoT Program of the Ministry of Industry and Information Technology of China; the Key Project of the Ministry of Public Security (No. 2014JSYJA007); the Project of the Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University (ESSCKF 2015-03); the Shanghai Rising-Star Program (17QB1401000); and the Special Fund for Basic R&D Expenses of Central Level Public Welfare Scientific Research Institutions (C17384).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Zheyuan Liu (1)
  • Xiaoteng Zhang (1)
  • Lei Song (1, corresponding author)
  • Zhengyan Ding (1)
  • Huixian Duan (1)

  1. The Third Research Institute of the Ministry of Public Security, Shanghai, China
