Real-Time Action Recognition in Surveillance Videos Using ConvNets

  • Sheng LuoEmail author
  • Haojin Yang
  • Cheng Wang
  • Xiaoyin Che
  • Christoph Meinel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9949)


The explosive growth of surveillance cameras and its 7 * 24 recording period brings massive surveillance videos data. Therefore how to efficiently retrieve the rare but important event information inside the videos is eager to be solved. Recently deep convolutinal networks shows its outstanding performance in event recognition on general videos. Hence we study the characteristic of surveillance video context and propose a very competitive ConvNets approach for real-time event recognition on surveillance videos. Our approach adopts two-steam ConvNets to respectively recognition spatial and temporal information of one action. In particular, we propose to use fast feature cascades and motion history image as the template of spatial and temporal stream. We conducted our experiments on UCF-ARG and UT-interaction dataset. The experimental results show that our approach acquires superior recognition accuracy and runs in real-time.


Real-time application Event recognition Surveillance videos Motion history image 


  1. 1.
    Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J.G. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  2. 2.
    UCF-ARG Data Set. Accessed 10 Nov 2015
  3. 3.
    Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: Salah, A.A., Lepri, B. (eds.) HBU 2011. LNCS, vol. 7065, pp. 29–39. Springer, Heidelberg (2011)Google Scholar
  4. 4.
    Benenson, R., Mathias, M., Timofte, R., Van Gool, L.: Pedestrian detection at 100 frames per second. In: CVPR (2012)Google Scholar
  5. 5.
    Bilinski, P., Bremond, F.: Statistics of pairwise co-occurring local spatio-temporal features for human action recognition. In: Fusiello, A., Murino, V., Cucchiara, R. (eds.) ECCV 2012 Ws/Demos, Part I. LNCS, vol. 7583, pp. 311–320. Springer, Heidelberg (2012)Google Scholar
  6. 6.
    Bradski, G.: Dr. Dobb’s J. Softw. Tools (2000). Article ID 2236121Google Scholar
  7. 7.
    Cropley, J.: Top video surveillance trends for 2016, February 2016.
  8. 8.
    Davis, J.W., Bobick, A.E.: The representation and recognition of human movement using temporal templates. In: IEEE Computer Society Conference on CVPR, 1997, pp. 928–934. IEEE (1997)Google Scholar
  9. 9.
    Dollár, P., Belongie, S., Perona, P.: The fastest pedestrian detector in the west. In: BMVC, vol. 2, p. 7. Citeseer (2010)Google Scholar
  10. 10.
    Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR, pp. 2625–2634 (2015)Google Scholar
  11. 11.
    Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)CrossRefGoogle Scholar
  12. 12.
    Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of ACMMM, pp. 675–678. ACM (2014)Google Scholar
  13. 13.
    Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: 2014 IEEE Conference on CVPR, pp. 1725–1732. IEEE (2014)Google Scholar
  14. 14.
    Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: 2008 IEEE Conference on CVPR, pp. 1–8. IEEE (2008)Google Scholar
  15. 15.
    Ryoo, M.S., Chen, C.-C., Aggarwal, J.K., Roy-Chowdhury, A.: An overview of contest on semantic description of human activities (SDHA) 2010. In: Ünay, D., Çataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 270–285. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  16. 16.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS, pp. 568–576 (2014)Google Scholar
  17. 17.
    Wang, L., Xiong, Y., Wang, Z., Qiao, Y.: Towards good practices for very deep two-stream convnets. CoRR abs/1507.02159 (2015)Google Scholar
  18. 18.
    Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: Proceedings of the 23rd ACM MM, pp. 461–470. ACM (2015)Google Scholar
  19. 19.
    Ye, H., Wu, Z., Zhao, R.W., Wang, X., Jiang, Y.G., Xue, X.: Evaluating two-stream CNN for video classification. In: ICMR 2015, pp. 435–442. ACM (2015)Google Scholar
  20. 20.
    Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: 2015 IEEE Conference on CVPR, pp. 4694–4702 (2015)Google Scholar

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Sheng Luo
    • 1
    Email author
  • Haojin Yang
    • 1
  • Cheng Wang
    • 1
  • Xiaoyin Che
    • 1
  • Christoph Meinel
    • 1
  1. 1.Hasso Plattner InstituteUniversity of PotsdamPotsdamGermany

Personalised recommendations