Abstract
Automatic violence detection from video is an active research topic with many video surveillance applications. However, there has been little success in designing an algorithm that detects violence in surveillance videos with high accuracy. Existing methods typically apply the Bag-of-Words (BoW) model to local spatiotemporal descriptors. Traditional spatiotemporal features, however, are not discriminative enough, and the BoW model coarsely assigns each feature vector to a single visual word and therefore ignores the spatial relationships among the features. To tackle these problems, we propose a novel Motion Weber Local Descriptor (MoWLD), in the spirit of the well-known WLD, as a powerful and robust descriptor for motion images. We extend the WLD spatial description by adding a temporal component to the appearance descriptor, so that it implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is applied to the MoWLD descriptor. To obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over the state of the art.
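The building blocks named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the MoWLD formulas, so `temporal_excitation_change` is a hypothetical stand-in for the temporal component, the `eps` guard is an assumption, and all function names are illustrative. Only the WLD differential excitation (arctangent of the summed neighbor differences divided by the center intensity), the Gaussian kernel density estimate, and max pooling over absolute code values follow their standard textbook definitions.

```python
import math


def differential_excitation(frame, r, c, eps=1e-6):
    """Standard WLD differential excitation at interior pixel (r, c).

    `frame` is a 2-D list of gray levels. `eps` (an assumption of this
    sketch) avoids division by zero when the center pixel is black.
    """
    center = frame[r][c]
    diff_sum = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the center pixel itself
            diff_sum += frame[r + dr][c + dc] - center
    return math.atan(diff_sum / (center + eps))


def temporal_excitation_change(prev_frame, next_frame, r, c):
    """Hypothetical temporal component: excitation change between two frames.

    This only illustrates the idea of adding motion information to WLD;
    it is NOT the definition used in the paper.
    """
    return (differential_excitation(next_frame, r, c)
            - differential_excitation(prev_frame, r, c))


def gaussian_kde(samples, x, bandwidth):
    """Gaussian kernel density estimate of 1-D `samples`, evaluated at `x`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def max_pool(codes):
    """Element-wise maximum of absolute values over equal-length code vectors."""
    return [max(abs(v[k]) for v in codes) for k in range(len(codes[0]))]


# Toy 3x3 frames: a uniform patch, then the same patch with a bright edge.
flat = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
edge = [[10, 10, 50], [10, 10, 50], [10, 10, 50]]

static_response = differential_excitation(flat, 1, 1)
motion_response = temporal_excitation_change(flat, edge, 1, 1)
pooled = max_pool([[0.1, -0.9], [0.5, 0.2]])
density_at_mode = gaussian_kde([0.0], 0.0, 1.0)
```

In this toy setup the uniform patch produces no excitation, while the appearance of the edge between frames produces a positive temporal response. In a full pipeline, a density estimate such as `gaussian_kde` could be used to score and filter descriptors, and `max_pool` would summarize the sparse codes of the retained descriptors into one vector per video clip; how the paper performs these steps in detail is not stated in the abstract.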
Acknowledgments
This research was partly supported by NSFC, China (No: 61273258, 61375048, 61170109).
Cite this article
Zhang, T., Jia, W., Yang, B. et al. MoWLD: a robust motion image descriptor for violence detection. Multimed Tools Appl 76, 1419–1438 (2017). https://doi.org/10.1007/s11042-015-3133-0
DOI: https://doi.org/10.1007/s11042-015-3133-0