Two-Stream Convolutional Network with Multi-level Feature Fusion for Categorization of Human Action from Videos

  • Prateep Bhattacharjee
  • Sukhendu Das
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10597)


This paper presents a two-stream Convolutional Neural Network (2S-CNN) architecture, with a novel feature fusion technique applied at multiple levels, for categorizing events in videos. The two streams combine dense optical flow features with: (a) RGB frames; and (b) salient object regions detected using a fast space-time saliency method. The main contribution is a classifier-moderated method for fusing information from the two streams at multiple stages of the network, which captures the most discriminative and complementary features for localizing the spatio-temporal attention on the action being performed. This mutual exchange of information in local and global contexts produces an optimal combination of appearance and motion dynamics for enhanced discrimination, yielding the best categorization performance. The network is trained end-to-end and subsequently evaluated on two challenging human action recognition benchmark datasets, viz. UCF-101 and HMDB-51, where the proposed 2S-CNN method outperforms current state-of-the-art ConvNets by a significant margin.


Keywords: Action recognition · Two-stream networks · Deep learning



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India
