Multimedia Tools and Applications, Volume 78, Issue 1, pp 817–838

A comprehensive solution for detecting events in complex surveillance videos

  • Yandong Zhu
  • Kaihui Zhou
  • Menglai Wang
  • Yanyun Zhao
  • Zhicheng Zhao


Event detection has long been a fundamental problem in the computer vision community. Various datasets for recognizing human events and activities, such as UCF101 and HMDB51, have been proposed to help develop better models and methods. These datasets share a common property: either predefined scripts are provided, or the images are largely actor-oriented with little background noise. These properties, however, differ markedly from those of surveillance event detection, which makes solutions that are effective on these datasets unsuitable. Event detection in complex surveillance video is a much more difficult task with several challenges: heavy occlusion between pedestrians, low image resolution, and uncontrolled scene conditions. The TRECVID-SED evaluation, which aims at detecting events in a highly crowded airport, is well known for its difficulty. To deal with event detection in realistic scenes such as TRECVID-SED, we introduce a comprehensive solution framework based on pedestrian detection, deep key-pose detection, and trajectory analysis. Specifically, instead of detecting the whole body of a person, we detect the head-shoulder region of each pedestrian, which addresses the heavy occlusion of pedestrians in complex scenes. We also propose a trajectory-based event detection method to better focus on the key actors of events. For events with discriminative poses, we model event detection as key-pose detection by taking advantage of Faster R-CNN. The presented framework achieves the best result in the TRECVID-SED 2016 evaluation.
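
As a rough illustration of the trajectory-analysis idea, the sketch below links per-frame head-shoulder boxes into trajectories and computes a simple motion cue per trajectory. It is not the method of the paper: the greedy IoU matching, the `min_iou` threshold, and the names `link_detections` and `mean_speed` are assumptions chosen for illustration, and the boxes are assumed to come from some upstream head-shoulder detector.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixels; assumed output of a head-shoulder detector.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    # Each entry is (frame_index, box).
    boxes: List[Tuple[int, Box]] = field(default_factory=list)


def link_detections(frames: List[List[Box]], min_iou: float = 0.3) -> List[Track]:
    """Greedily extend each active track with its best-overlapping box.

    Hypothetical simplification: a track ends as soon as no box in the
    next frame overlaps its last box by at least min_iou.
    """
    tracks: List[Track] = []
    active: List[Track] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        still_active: List[Track] = []
        for track in active:
            last_box = track.boxes[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                track.boxes.append((t, best))
                unmatched.remove(best)
                still_active.append(track)
        # Every box left over starts a new track.
        for box in unmatched:
            new_track = Track([(t, box)])
            tracks.append(new_track)
            still_active.append(new_track)
        active = still_active
    return tracks


def mean_speed(track: Track) -> float:
    """Mean displacement of the box centre, in pixels per frame."""
    if len(track.boxes) < 2:
        return 0.0
    centres = [((b[0] + b[2]) / 2, (b[1] + b[3]) / 2) for _, b in track.boxes]
    dist = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(centres, centres[1:])
    )
    return dist / (track.boxes[-1][0] - track.boxes[0][0])
```

A trajectory-level cue such as `mean_speed` could then be thresholded or fed to a classifier to flag motion-driven events. A global assignment step (for example the Hungarian method) would be a more robust replacement for the greedy matching used here.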


Surveillance video · Pedestrian detection · Pedestrian tracking · Event detection



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
  2. Beijing Key Laboratory of Network System and Network Culture, Beijing University of Posts and Telecommunications, Beijing, China
