Multimedia Tools and Applications, Volume 76, Issue 1, pp 1419–1438

MoWLD: a robust motion image descriptor for violence detection

  • Tao Zhang
  • Wenjing Jia
  • Baoqing Yang
  • Jie Yang
  • Xiangjian He
  • Zhonglong Zheng


Automatic violence detection in video is an active research topic for many video surveillance applications. However, there has been little success in designing an algorithm that detects violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model to local spatiotemporal descriptors. Traditional spatiotemporal features, however, are not discriminative enough, and the BoW model coarsely assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, we propose a novel Motion Weber Local Descriptor (MoWLD), in the spirit of the well-known Weber Local Descriptor (WLD), as a powerful and robust descriptor for motion images. We extend the WLD spatial description by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is applied to the MoWLD descriptor. To obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over state-of-the-art methods.
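As an illustration only, and not the authors' implementation, the following Python sketch shows two of the building blocks named in the abstract: the differential excitation of the original WLD (the arctangent of the summed relative intensity differences in a 3x3 neighbourhood) and max pooling over sparse codes. The function names, the `eps` guard against division by zero, and the edge padding are assumptions made for this example. The `temporal_excitation` function is a hypothetical analogue of a temporal component; the abstract does not state how MoWLD actually defines it.

```python
import numpy as np


def differential_excitation(frame, eps=1e-6):
    """WLD differential excitation on a 2-D grayscale frame.

    For each pixel x_c with 8 neighbours x_i this computes
    arctan(sum_i (x_i - x_c) / x_c). Edge padding and eps are
    assumptions of this sketch.
    """
    f = frame.astype(np.float64)
    h, w = f.shape
    padded = np.pad(f, 1, mode="edge")
    neighbour_sum = np.zeros_like(f)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            if dy == 1 and dx == 1:
                continue  # skip the centre pixel itself
            neighbour_sum += padded[dy:dy + h, dx:dx + w]
    return np.arctan((neighbour_sum - 8.0 * f) / (f + eps))


def temporal_excitation(prev_frame, frame, eps=1e-6):
    """Hypothetical temporal analogue (an assumption, not from the paper):
    relative intensity change between two consecutive frames."""
    a = prev_frame.astype(np.float64)
    b = frame.astype(np.float64)
    return np.arctan((b - a) / (a + eps))


def max_pool(codes):
    """Max pooling over sparse codes of shape (n_descriptors, n_atoms).

    Returns one value per dictionary atom: the largest absolute
    coefficient across all descriptors.
    """
    return np.abs(codes).max(axis=0)
```

Pooling the absolute coefficients yields one fixed-length vector per video clip regardless of how many descriptors were extracted, which is what allows a standard classifier to be trained on top of the pooled features.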


Violence detection; Surveillance systems; Motion Weber local descriptors (MoWLD); Kernel density estimation (KDE); Sparse coding; Max pooling



This research was partly supported by NSFC, China (Nos. 61273258, 61375048, 61170109).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Tao Zhang (1)
  • Wenjing Jia (2)
  • Baoqing Yang (1)
  • Jie Yang (1)
  • Xiangjian He (2)
  • Zhonglong Zheng (3)

  1. Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai, China
  2. Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, Australia
  3. Zhejiang Normal University, Jinhua, China
