FTCF: Full temporal cross fusion network for violence detection in videos

Zhenhua, Tan; Zhenche, Xia; Pengfei, Wang; Chang, Ding; Weichao, Zhai

doi:10.1007/s10489-022-03708-9

Published: 06 June 2022

Volume 53, pages 4218–4230, (2023)

Applied Intelligence

Tan Zhenhua (ORCID: orcid.org/0000-0002-9870-8925)¹, Xia Zhenche¹, Wang Pengfei¹, Ding Chang¹ & Zhai Weichao¹

Abstract

Automatic violence detection in video is a meaningful yet challenging task. Violent actions are characterized both by intense changes across sequential frames and by continuous spatial movement, which makes them more complex than other human actions. However, most existing approaches extract general spatiotemporal features with local convolution and ignore full temporal inference based on the characteristics of violence. To this end, we propose a novel full temporal cross fusion network (FTCF Net) to investigate an effective way of inference for violence detection. Specifically, we design two essential neural network components in each FTCF block: a spatial processor and a temporal processor. The former captures the local structural features of each frame with a 3D CNN using a (3×3×1) filter to infer continuous spatial movement, while the latter performs cross-frame feature interaction step by step for each channel, using a group of processing units, to infer the intense and wide variation of violence across the full temporal range. The two branches are fused efficiently at the end of each FTCF block in the FTCF Net. We conduct extensive experiments on four benchmark datasets: Hockey Fight, Movie Fight, Violent Flow, and Real-life Violence Situations. The experimental results show that FTCF Net outperforms 20 comparison methods in terms of predictive accuracy. The accuracy reaches 99.5%, 100.0%, 98.0% and 98.5% on the four datasets respectively, validating the effectiveness of our proposed approach for violence detection. Moreover, the proposed approach achieves relatively steady prediction performance, superior to existing methods, across training sets of different scales. We hope this work will serve as a baseline for violence detection. The complete source code and pre-trained weights are publicly available at https://github.com/TAN-OpenLab/FTCF-NET.
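The two-branch structure described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the mean filter standing in for the learned (3×3×1) kernel, the exponential running state standing in for the temporal processing units, the `decay` parameter, the additive fusion, and all function names are assumptions chosen only to show the data flow for a single-channel clip stored as nested lists `[frame][row][column]`.

```python
# Illustrative sketch only. The filter weights, the recurrent update and the
# additive fusion are assumptions, not the published FTCF design.

def spatial_processor(clip):
    """Filter each frame on its own with a 3x3 window (temporal extent 1)."""
    out = []
    for frame in clip:
        h, w = len(frame), len(frame[0])
        new_frame = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                total = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Positions outside the frame count as zero (zero padding).
                        if 0 <= yy < h and 0 <= xx < w:
                            total += frame[yy][xx]
                new_frame[y][x] = total / 9.0
        out.append(new_frame)
    return out


def temporal_processor(clip, decay=0.5):
    """Walk through the frames in order, mixing each pixel with a running
    state that summarises every earlier frame (hypothetical update rule)."""
    h, w = len(clip[0]), len(clip[0][0])
    state = [[0.0] * w for _ in range(h)]
    out = []
    for frame in clip:
        state = [
            [decay * state[y][x] + (1.0 - decay) * frame[y][x] for x in range(w)]
            for y in range(h)
        ]
        out.append(state)
    return out


def ftcf_block_sketch(clip):
    """Run both branches on the same input and fuse them by element-wise sum."""
    spatial = spatial_processor(clip)
    temporal = temporal_processor(clip)
    return [
        [[s + t for s, t in zip(s_row, t_row)] for s_row, t_row in zip(s_frame, t_frame)]
        for s_frame, t_frame in zip(spatial, temporal)
    ]


# Toy clip: 3 frames of 3x3 pixels, all set to 1.0.
clip = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
fused = ftcf_block_sketch(clip)
```

In this toy setting the spatial branch sees only one frame at a time, while the temporal branch output at frame t depends on every frame up to t, which mirrors the split between local spatial structure and full temporal inference that the abstract describes. The actual network and trained weights are in the repository linked above.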

Data Availability

All data generated or analysed during this study are included in this published article (and its supplementary information files).

Notes

https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/keras/optimizer/v2/gradient_descent.py#L30-L189

Acknowledgements

This research was funded by the National Key Research and Development Program of China under Grant No. 2019YFB1405803.

Author information

Authors and Affiliations

Software College, Northeastern University, Heping, Shenyang, 110819, Liaoning, China
Tan Zhenhua, Xia Zhenche, Wang Pengfei, Ding Chang & Zhai Weichao

Corresponding author

Correspondence to Tan Zhenhua.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhenhua, T., Zhenche, X., Pengfei, W. et al. FTCF: Full temporal cross fusion network for violence detection in videos. Appl Intell 53, 4218–4230 (2023). https://doi.org/10.1007/s10489-022-03708-9

Accepted: 29 April 2022
Published: 06 June 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s10489-022-03708-9
