SCTF: an efficient neural network based on local spatial compression and full temporal fusion for video violence detection

Published in *Multimedia Tools and Applications*, collection "1230: Sentient Multimedia Systems and Visual Intelligence".

Abstract

Spatiotemporal modeling is key to action recognition in videos. In this paper, we propose a Spatial features Compression and Temporal features Fusion (SCTF) block, consisting of a Local Spatial features Compression (LSC) module and a Full Temporal features Fusion (FTF) module. We call the network equipped with the SCTF block SCTF-Net, a human action recognition network particularly suited to violent video detection. In previous work, spatial extraction and temporal fusion are typically achieved by stacking large numbers of convolution layers or by adding complex recurrent layers. In contrast, the SCTF block extracts the spatial information of video frames with the LSC module and fuses the temporal information of consecutive frames with the FTF module, enabling effective spatiotemporal modeling. Our approach achieves good performance on action recognition benchmarks such as HMDB51 and UCF101 while being more efficient in training and detection. Moreover, experiments on the violence datasets Hockey Fights, Movie Fight and Violent Flows show that the proposed SCTF block is especially well suited to violent action recognition. Our code is available at https://github.com/TAN-OpenLab/SCTF-Net.
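The two-stage pipeline the abstract describes (compress spatial features per frame, then fuse features across all frames of the clip) can be illustrated with a toy sketch. This is a schematic stand-in only, using average pooling for the spatial stage and a mean over time for the temporal stage; it is not the authors' LSC or FTF module, whose actual implementations are in the linked repository.

```python
import numpy as np

def local_spatial_compression(frames, pool=2):
    """Toy stand-in for the LSC stage: compress each frame's
    spatial dimensions by non-overlapping average pooling."""
    t, h, w, c = frames.shape
    x = frames[:, : h - h % pool, : w - w % pool, :]
    x = x.reshape(t, h // pool, pool, w // pool, pool, c)
    return x.mean(axis=(2, 4))  # (T, H//pool, W//pool, C)

def full_temporal_fusion(features):
    """Toy stand-in for the FTF stage: fuse all T frame features
    into a single clip-level descriptor by averaging over time."""
    return features.mean(axis=0)  # (H', W', C)

clip = np.random.rand(16, 112, 112, 3)  # a 16-frame RGB clip
fused = full_temporal_fusion(local_spatial_compression(clip))
print(fused.shape)  # (56, 56, 3)
```

The point of the sketch is the data flow: spatial resolution shrinks frame by frame first, so the temporal fusion that follows operates on far fewer activations per frame, which is where the paper's efficiency claim comes from.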



Acknowledgements

This research was funded by the National Key Research and Development Program of China under Grant No. 2019YFB1405803.

Author information


Corresponding author

Correspondence to Tan Zhenhua.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhenhua, T., Zhenche, X., Pengfei, W. et al. SCTF: an efficient neural network based on local spatial compression and full temporal fusion for video violence detection. Multimed Tools Appl 83, 36899–36919 (2024). https://doi.org/10.1007/s11042-023-16269-x

