Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop

Pratama, Raka Aditya; Yudistira, Novanto; Bachtiar, Fitra Abdurrachman

doi:10.1007/s11042-023-15599-0

1229: Multimedia Data Analysis for Smart City Environment Safety
Published: 05 July 2023

Multimedia Tools and Applications (2023)

Raka Aditya Pratama¹, Novanto Yudistira¹ (ORCID: orcid.org/0000-0001-5330-5930) & Fitra Abdurrachman Bachtiar¹

Abstract

Violence can happen anywhere. One way to monitor violence in a given location is to install closed-circuit television (CCTV), and the recorded footage can be used as evidence in a court of law. Violence video classification is also an active topic in deep learning. The most recent violence video dataset is RWF-2000, which contains 2000 violent and non-violent videos, each 5 seconds long at 30 frames per second. The publication that introduced the dataset reports a best accuracy of 87.25% with its proposed method. In this study, we use a Residual Network, which is known to mitigate the vanishing gradient problem. In addition, we apply transfer learning from models pre-trained on Kinetics and on Kinetics + Moments in Time. We also vary the number of sampled frames and the range of frame positions from which they are drawn. RGB and optical flow inputs are trained separately with different configurations. The best accuracy for the RGB input is 89.25%, obtained with Kinetics + Moments in Time pre-training and frames sampled from positions 49-149. The best accuracy for the optical flow input is 88.5%, obtained with Kinetics pre-training and 74 frames. Summing the outputs of the two streams yields an accuracy of 90.5%.
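The two ideas named in the abstract, drawing frames from a restricted range of positions and summing the outputs of the RGB and optical flow streams, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `sample_frame_indices` and `fuse_by_sum` are hypothetical, and several details are assumptions the abstract does not state: that frames are evenly spaced within the range, that the range 49-149 is an inclusive frame index range within a 150-frame clip (5 seconds at 30 frames per second), and that the summed "outputs" are per-class scores.

```python
from typing import List, Sequence


def sample_frame_indices(start: int, end: int, num_frames: int) -> List[int]:
    """Pick num_frames evenly spaced indices from the inclusive range [start, end].

    Even spacing is an assumption; the abstract only names the range.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if end < start:
        raise ValueError("end must not be smaller than start")
    if num_frames == 1:
        return [(start + end) // 2]
    step = (end - start) / (num_frames - 1)
    return [start + round(i * step) for i in range(num_frames)]


def fuse_by_sum(rgb_scores: Sequence[float], flow_scores: Sequence[float]) -> int:
    """Late fusion: add the per-class scores of both streams, return the top class.

    Treating the stream outputs as per-class scores is an assumption.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same classes")
    summed = [r + f for r, f in zip(rgb_scores, flow_scores)]
    return max(range(len(summed)), key=summed.__getitem__)


# Five frames drawn from positions 49-149 of a clip.
print(sample_frame_indices(49, 149, 5))  # [49, 74, 99, 124, 149]

# Made-up scores for two classes; the class order here is purely illustrative.
print(fuse_by_sum([0.6, 0.4], [0.2, 0.8]))  # 1
```

Summation lets a confident stream outweigh an uncertain one, as in the example where the flow stream overturns the RGB stream's weaker preference. Whether the scores are summed before or after normalization is not stated in the abstract.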

Availability of data and materials

The datasets generated during and/or analysed during the current study are available in the RWF-2000 repository, https://github.com/mchengny/RWF2000-Video-Database-for-Violence-Detection

Author information

Authors and Affiliations

Informatics Department, Faculty of Computer Science, Brawijaya University, Jalan Veteran 8, Malang, 65145, Indonesia
Raka Aditya Pratama, Novanto Yudistira & Fitra Abdurrachman Bachtiar

Corresponding author

Correspondence to Novanto Yudistira.

Ethics declarations

Conflicts of interests/Competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

All authors contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pratama, R.A., Yudistira, N. & Bachtiar, F.A. Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-15599-0

Received: 24 August 2022
Revised: 05 February 2023
Accepted: 21 April 2023
Published: 05 July 2023
DOI: https://doi.org/10.1007/s11042-023-15599-0
