Abstract
Spatiotemporal modeling is key to action recognition in videos. In this paper, we propose a Spatial features Compression and Temporal features Fusion (SCTF) block, consisting of a Local Spatial features Compression (LSC) module and a Full Temporal features Fusion (FTF) module. We call the network equipped with the SCTF block SCTF-Net; it is a human action recognition network particularly suited to violent video detection. In previous works, spatial extraction and temporal fusion are typically achieved by stacking large numbers of convolution layers or by adding complex recurrent layers. In contrast, the SCTF block extracts the spatial information of video frames with the LSC module and fuses the temporal sequence information of consecutive frames with the FTF module, enabling effective spatiotemporal modeling. Our approach achieves good performance on action recognition benchmarks such as HMDB51 and UCF101 while being more efficient in training and detection. Moreover, experiments on the violence datasets Hockey Fights, Movie Fight, and Violent Flow show that the proposed SCTF block is well suited to violent action recognition. Our code is available at https://github.com/TAN-OpenLab/SCTF-Net.
Acknowledgements
This research was funded by the National Key Research and Development Program of China under Grant No. 2019YFB1405803.
Ethics declarations
Conflict of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Zhenhua, T., Zhenche, X., Pengfei, W. et al. SCTF: an efficient neural network based on local spatial compression and full temporal fusion for video violence detection. Multimed Tools Appl 83, 36899–36919 (2024). https://doi.org/10.1007/s11042-023-16269-x