FTCF: Full temporal cross fusion network for violence detection in videos

Published in Applied Intelligence

Abstract

Automatic violence detection in video is a meaningful yet challenging task. Violent actions are characterized both by intense changes across sequential frames and by continuous spatial movement, which makes them more complex than other human actions. However, most existing approaches extract general spatiotemporal features with local convolutions and ignore inference over the full temporal range, which the characteristics of violence call for. To this end, we propose a novel full temporal cross fusion network (FTCF Net) to investigate an effective way of inference for violence detection. Specifically, each FTCF block contains two essential components, both implemented as neural networks: a spatial processor and a temporal processor. The former captures the local structural features of each frame with a 3D CNN using a (3×3×1) filter to infer the continuous spatial movement, while the latter performs cross-frame feature interaction step by step for each channel, using a group of processing units, to infer the intense and wide variation of violence over the full temporal range. The two branches are fused efficiently at the end of each FTCF block in the FTCF Net. We conduct extensive experiments on four benchmark datasets (Hockey Fight, Movie Fight, Violent Flow, and Real-life Violence Situations), and the results show that FTCF Net outperforms 20 comparison methods in predictive accuracy. Accuracy reaches 99.5%, 100.0%, 98.0%, and 98.5% on the four datasets, respectively, validating the effectiveness of the proposed approach for violence detection. Moreover, the proposed approach maintains relatively steady prediction performance, superior to that of existing methods, across training sets of different scales. We hope this work will serve as a baseline for violence detection; the complete source code and pre-trained weights are publicly available at https://github.com/TAN-OpenLab/FTCF-NET.
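
The two-branch idea described in the abstract can be illustrated with a toy example. The following is a minimal sketch in pure Python, not the FTCF Net implementation: the abstract does not give the internals of the temporal processor or the fusion step, so the T×T mixing matrix, the fusion by element-wise addition, and all function names (`spatial_processor`, `temporal_processor`, `fuse`, `toy_block`) are assumptions made for illustration. It operates on a single channel, with a clip stored as a list of T frames of size H×W.

```python
# Illustrative sketch only: a toy two-branch block for one channel.
# The kernel values, the T x T mixing weights, and fusion by addition are
# assumptions for illustration, not the published FTCF Net design.


def spatial_processor(frames, kernel):
    """Apply a 3x3 filter to each frame independently (temporal extent 1).

    frames: list of T frames, each an H x W list of floats.
    Zero padding keeps the output at H x W.
    """
    out = []
    for frame in frames:
        h, w = len(frame), len(frame[0])
        new = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernel[dy + 1][dx + 1] * frame[yy][xx]
                new[y][x] = acc
        out.append(new)
    return out


def temporal_processor(frames, mix):
    """Mix all frames at each pixel position.

    mix is a T x T weight matrix: output frame t is sum_s mix[t][s] * frame s,
    so every output frame depends on the whole clip, not a local window.
    """
    t_len = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [
        [[sum(mix[t][s] * frames[s][y][x] for s in range(t_len))
          for x in range(w)]
         for y in range(h)]
        for t in range(t_len)
    ]


def fuse(a, b):
    """Fuse the two branches by element-wise addition (an assumed choice)."""
    return [[[pa + pb for pa, pb in zip(ra, rb)]
             for ra, rb in zip(fa, fb)]
            for fa, fb in zip(a, b)]


def toy_block(frames, kernel, mix):
    """Run both branches on the same input and fuse their outputs."""
    return fuse(spatial_processor(frames, kernel),
                temporal_processor(frames, mix))


# Usage: a clip with T=2 frames of size 2x2.
clip = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
identity_kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # spatial branch copies input
average_mix = [[0.5, 0.5], [0.5, 0.5]]               # each frame sees both frames
out = toy_block(clip, identity_kernel, average_mix)
print(out)
```

With the identity kernel, the spatial branch returns the clip unchanged, while the averaging matrix makes every output frame of the temporal branch depend on both input frames. In a trained network, both the kernel and the mixing weights would be learned rather than fixed.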

Data Availability

All data generated or analysed during this study are included in this published article (and its supplementary information files).

Notes

  1. https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/keras/optimizer/v2/gradient_descent.py#L30-L189

Acknowledgements

This research was funded by the National Key Research and Development Program of China under Grant No. 2019YFB1405803.

Author information

Corresponding author

Correspondence to Tan Zhenhua.

About this article

Cite this article

Zhenhua, T., Zhenche, X., Pengfei, W. et al. FTCF: Full temporal cross fusion network for violence detection in videos. Appl Intell 53, 4218–4230 (2023). https://doi.org/10.1007/s10489-022-03708-9

  • DOI: https://doi.org/10.1007/s10489-022-03708-9
