MoWLD: a robust motion image descriptor for violence detection

Zhang, Tao; Jia, Wenjing; Yang, Baoqing; Yang, Jie; He, Xiangjian; Zheng, Zhonglong

doi:10.1007/s11042-015-3133-0

Published: 11 December 2015

Volume 76, pages 1419–1438, (2017)

Multimedia Tools and Applications

Tao Zhang¹, Wenjing Jia², Baoqing Yang¹, Jie Yang¹, Xiangjian He² & Zhonglong Zheng³

Abstract

Automatic violence detection in video is an active research topic with many video surveillance applications. However, there has been little success in designing algorithms that detect violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model to local spatiotemporal descriptors. Traditional spatiotemporal features, however, are not discriminative enough, and the BoW model coarsely assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, we propose a novel Motion Weber Local Descriptor (MoWLD), in the spirit of the well-known Weber Local Descriptor (WLD), as a powerful and robust descriptor for motion images. We extend the WLD spatial description by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is applied to the MoWLD descriptors. To obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over state-of-the-art methods.
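
As a rough illustration of two ingredients named in the abstract, the sketch below computes the standard WLD differential excitation on a single frame and applies max pooling to a set of sparse codes. It is not the authors' implementation: the abstract does not give the MoWLD formulas, so `temporal_excitation` is only an assumed temporal analogue (the relative intensity change between two frames), and all function names, the edge padding, and the `eps` constant are hypothetical choices made for this example.

```python
import numpy as np


def differential_excitation(frame, eps=1e-6):
    """WLD differential excitation on one grayscale frame.

    For each pixel: arctan(sum of (neighbor - center) over the 8 neighbors,
    divided by the center intensity). eps avoids division by zero (assumption).
    """
    f = frame.astype(np.float64)
    h, w = f.shape
    padded = np.pad(f, 1, mode="edge")  # replicate borders (assumption)
    neighbor_sum = np.zeros_like(f)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # skip the center pixel itself
            neighbor_sum += padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return np.arctan((neighbor_sum - 8.0 * f) / (f + eps))


def temporal_excitation(prev_frame, next_frame, eps=1e-6):
    """Assumed temporal analogue, NOT taken from the paper:
    relative intensity change at the same pixel between two frames."""
    a = prev_frame.astype(np.float64)
    b = next_frame.astype(np.float64)
    return np.arctan((b - a) / (a + eps))


def max_pool(codes):
    """Max pooling over sparse codes: one row per descriptor,
    one column per dictionary atom; keeps the largest magnitude per atom."""
    return np.max(np.abs(codes), axis=0)
```

A flat frame has no local contrast, so its differential excitation is zero everywhere, and two identical frames give zero temporal excitation; pooling reduces any number of per-descriptor codes to a single vector whose length equals the dictionary size.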

Acknowledgments

This research was partly supported by NSFC, China (No: 61273258, 61375048, 61170109).

Author information

Authors and Affiliations

Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai, China
Tao Zhang, Baoqing Yang & Jie Yang
Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, Australia
Wenjing Jia & Xiangjian He
Zhejiang Normal University, Jinhua, China
Zhonglong Zheng

Corresponding author

Correspondence to Jie Yang.

About this article

Cite this article

Zhang, T., Jia, W., Yang, B. et al. MoWLD: a robust motion image descriptor for violence detection. Multimed Tools Appl 76, 1419–1438 (2017). https://doi.org/10.1007/s11042-015-3133-0

Received: 24 March 2015
Revised: 12 November 2015
Accepted: 30 November 2015
Published: 11 December 2015
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11042-015-3133-0
