Action Recognition in Surveillance Video Using ConvNets and Motion History Image

  • Sheng Luo
  • Haojin Yang
  • Cheng Wang
  • Xiaoyin Che
  • Christoph Meinel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9887)


With the significant increase in the number of surveillance cameras, the amount of surveillance video is growing rapidly. How to automatically and efficiently recognize semantic actions and events in surveillance videos has therefore become an important problem. In this paper, we investigate state-of-the-art Deep Learning (DL) approaches to human action recognition and propose an improved two-stream ConvNets architecture for this task. In particular, we propose to use the Motion History Image (MHI) as the motion representation for training the temporal ConvNet, which yields impressive results in both accuracy and recognition speed. In our experiments, we conducted an in-depth study of important network options and compared our approach with the latest deep networks for action recognition. The detailed evaluation results show the superiority of the proposed approach, which achieves state-of-the-art performance in the surveillance video context.
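The Motion History Image mentioned above is a single grayscale image in which pixel intensity encodes how recently motion occurred at that location: recently moving pixels are bright and older motion fades toward black. A minimal sketch of the classic MHI update rule (in the style of Davis and Bobick) is shown below; the function name, decay step, and thresholds are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def update_mhi(mhi, frame_diff, tau=30, threshold=32):
    """Update a Motion History Image.

    Pixels where the inter-frame difference exceeds `threshold`
    are set to the maximum duration `tau`; all other pixels decay
    by 1 toward 0, so older motion gradually fades out.
    """
    moving = frame_diff > threshold
    return np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))

# Toy example: a single bright pixel moving right across three frames.
h, w, tau = 8, 8, 5
frames = [np.zeros((h, w)) for _ in range(3)]
for t, f in enumerate(frames):
    f[3, 2 + t] = 255.0  # "moving" pixel at column 2, 3, 4

mhi = np.zeros((h, w))
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, np.abs(cur - prev), tau=tau)

# The most recent motion is brightest; earlier motion has decayed.
print(mhi[3, 2:5])  # [4. 5. 5.]
```

Compared with optical flow, such an update needs only per-pixel differencing and a decay, which is consistent with the paper's point that MHI is a much cheaper motion representation to feed into the temporal stream.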


Keywords: Convolutional neural network · Surveillance videos · Optical flow · Motion History Image



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Sheng Luo (1)
  • Haojin Yang (1)
  • Cheng Wang (1)
  • Xiaoyin Che (1)
  • Christoph Meinel (1)

  1. Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
