Abstract
Automatic violence detection in video is a meaningful yet challenging task. Violent actions are characterized both by intense changes across sequential frames and by continuous spatial movement, which makes them more complex than other human actions. However, most existing approaches extract general spatiotemporal features with local convolutions and ignore inference over the full temporal extent that these characteristics call for. To this end, we propose a novel full temporal cross fusion network (FTCF Net) as an effective inference mechanism for violence detection. Specifically, each FTCF block contains two essential neural network components: a spatial processor and a temporal processor. The former captures the local structural features of each frame with a 3D CNN using a (3×3×1) filter to infer the continuous spatial movement, while the latter performs cross-frame feature interaction step by step for each channel, using a group of processing units, to infer the intense and wide variation of violence across the full temporal extent. The two branches are fused efficiently at the end of each FTCF block. We conduct extensive experiments on four benchmark datasets: Hockey Fight, Movie Fight, Violent Flow, and Real-life Violence Situations. The experimental results show that FTCF Net outperforms 20 comparison methods in terms of predictive accuracy, reaching 99.5%, 100.0%, 98.0% and 98.5% on the four datasets respectively, which validates the effectiveness of the proposed approach for violence detection. Moreover, the proposed approach achieves relatively steady prediction performance, superior to existing methods, under training sets of different scales. We hope this work will serve as a baseline for violence detection. The full source code and pre-trained weights are publicly available at https://github.com/TAN-OpenLab/FTCF-NET.
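To make the two-branch idea concrete, the following is a minimal, dependency-free sketch of one FTCF-style block acting on a single channel. It is not the released implementation. The function names (`spatial_processor`, `temporal_processor`, `ftcf_block`), the dense T×T mixing matrix standing in for the "group of processing units", and fusion by element-wise addition are all assumptions made for illustration, since the abstract does not specify these details.

```python
# Toy sketch of an FTCF-style block for ONE channel (illustrative only).
# A clip is a nested list indexed as clip[t][y][x]: T frames of H x W pixels.
# Names, the dense T x T temporal mixing, and additive fusion are assumptions.


def spatial_processor(clip, kernel):
    """Apply one 3x3 kernel to every frame independently (zero padding).

    This mirrors a (3x3x1) filter: it spans 3x3 pixels but a single frame,
    so no information crosses frame boundaries in this branch.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < H and 0 <= xx < W:
                            acc += kernel[dy + 1][dx + 1] * clip[t][yy][xx]
                out[t][y][x] = acc
    return out


def temporal_processor(clip, mix):
    """Let every frame interact with every other frame at each pixel.

    mix is a T x T weight matrix: out[t] = sum over s of mix[t][s] * clip[s].
    Covering all T frames at once is what "full temporal" refers to here.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [
        [
            [sum(mix[t][s] * clip[s][y][x] for s in range(T)) for x in range(W)]
            for y in range(H)
        ]
        for t in range(T)
    ]


def ftcf_block(clip, kernel, mix):
    """Run both branches on the same input and fuse them by addition."""
    spatial = spatial_processor(clip, kernel)
    temporal = temporal_processor(clip, mix)
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [
        [[spatial[t][y][x] + temporal[t][y][x] for x in range(W)] for y in range(H)]
        for t in range(T)
    ]
```

In a real network the kernel and mixing weights would be learned per channel and the branches would include nonlinearities; the sketch only shows which axes each branch is allowed to mix before fusion.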
Data Availability
All data generated or analysed during this study are included in this published article (and its supplementary information files).
Acknowledgements
This research was funded by the National Key Research and Development Program of China under Grant No. 2019YFB1405803.
Cite this article
Zhenhua, T., Zhenche, X., Pengfei, W. et al. FTCF: Full temporal cross fusion network for violence detection in videos. Appl Intell 53, 4218–4230 (2023). https://doi.org/10.1007/s10489-022-03708-9
DOI: https://doi.org/10.1007/s10489-022-03708-9