Abstract
Violence can occur anywhere. One way to monitor it in a given location is to install closed-circuit television (CCTV), and the footage that CCTV records can be used as evidence in court. Violence classification in video is also an active topic in deep learning. The latest violence video dataset is RWF-2000, which contains 2000 violent and non-violent videos, each 5 seconds long at 30 frames per second. The publication that introduced the dataset reports a best accuracy of 87.25% with its proposed method. In this study, we use a Residual Network, an architecture known for mitigating the vanishing gradient problem. In addition, we apply transfer learning from models pre-trained on Kinetics and on Kinetics + Moments in Time. We also vary the number of sampled frames and the range of frame positions from which they are drawn. The RGB and optical flow inputs are trained separately with different configurations. The RGB input reaches its best accuracy of 89.25% with Kinetics + Moments in Time pre-training, using frame positions 49-149. The optical flow input reaches its best accuracy of 88.5% with Kinetics pre-training, using 74 frames. Summing the outputs of the two streams yields an accuracy of 90.5%.
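The frame-range sampling and the summing of the two stream outputs described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the inclusive 1-based reading of the 49-149 range, the use of softmax scores before summing, the class order, and the example logits are all assumptions made for the sketch.

```python
import math

FPS = 30            # frames per second, as stated in the abstract
CLIP_SECONDS = 5    # clip duration, as stated in the abstract
TOTAL_FRAMES = FPS * CLIP_SECONDS  # 150 frames per clip


def frame_window(start, end, total=TOTAL_FRAMES):
    """Return frame indices in [start, end], clamped to the clip length.

    Assumption: indices are 1-based and the range is inclusive; the
    abstract does not say which convention the authors used.
    """
    start = max(1, start)
    end = min(total, end)
    return list(range(start, end + 1))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def fuse_by_sum(rgb_scores, flow_scores):
    """Late fusion: add per-class scores of both streams, then take argmax."""
    summed = [r + f for r, f in zip(rgb_scores, flow_scores)]
    label = max(range(len(summed)), key=summed.__getitem__)
    return label, summed


# Hypothetical usage; class order [non-violent, violent] is an assumption.
window = frame_window(49, 149)          # frame positions fed to the RGB stream
rgb = softmax([0.2, 1.1])               # made-up RGB-stream logits
flow = softmax([0.9, 0.4])              # made-up optical-flow-stream logits
label, summed = fuse_by_sum(rgb, flow)  # predicted class index and summed scores
```

Summing scores is one of the simplest late-fusion choices: each stream is trained independently, and the combination needs no extra learned parameters.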
Availability of data and materials
The datasets generated during and/or analysed during the current study are available in the RWF-2000 repository, https://github.com/mchengny/RWF2000-Video-Database-for-Violence-Detection
Ethics declarations
Conflicts of interest/Competing interests
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
All authors contributed equally to this work.
About this article
Cite this article
Pratama, R.A., Yudistira, N. & Bachtiar, F.A. Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-15599-0
DOI: https://doi.org/10.1007/s11042-023-15599-0